Repa performance

I'm trying to decide if Repa is going to be performant
enough to pursue it for some work with images.

The following shows comparisons between Repa and C
for the last example in dons' Repa tutorial:
http://www.haskell.org/haskellwiki/Numeric_Haskell:_A_Repa_Tutorial

I'm using the latest Hackage release of Repa, repa-3.2.3.3.
The machine is an ASUS 1015PN with Intel Atom N550 / 1.5 GHz dual-core,
with 2 GB RAM.  OS is Ubuntu 11.04.  GHC version is 7.6.3.

The C version does not read or write the image. I measured
the image I/O on its own in the Repa version (bypassing the
processing) at about 2 seconds, so assume the same 2 seconds
for the C version and add that to its runtime.

The test image used is:
http://www.fremissant.net/repa-test/test-1024x600.png

-----------------------------------------------
--- The Repa source as per the haskellwiki: ---
-----------------------------------------------

module Main where

import System.Environment
import Data.Word
import Data.Array.Repa hiding ((++))
import Data.Array.Repa.IO.DevIL
import Data.Array.Repa.Repr.ForeignPtr
 
--
-- Read an image, desaturate, write out with new name.
--
main = do
  [f] <- getArgs
  runIL $ do
    (RGB a) <- readImage f
    b <- (computeP $ traverse a id luminosity) :: IL (Array F DIM3 Word8)
    writeImage ("grey-" ++ f) (RGB b)

--
-- (Parallel) desaturation of an image via the luminosity method.
--
luminosity :: (DIM3 -> Word8) -> DIM3 -> Word8
luminosity _ (Z :. _ :. _ :. 3) = 255   -- alpha channel
luminosity f (Z :. i :. j :. _) = ceiling $ 0.21 * r + 0.71 * g + 0.07 * b
    where
        r = fromIntegral $ f (Z :. i :. j :. 0)
        g = fromIntegral $ f (Z :. i :. j :. 1)
        b = fromIntegral $ f (Z :. i :. j :. 2)

------------------------------------------------
--- The equivalent C program (not parallel): ---
------------------------------------------------

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
typedef unsigned char byte;
int main()
{
   int i;
   int c=1024*600;
   int c4=1024*600*4;
   byte *data;
   byte *dataout;
   // since the image IO was shown to be 2 seconds in repa.hs, will
   // just add that to this runtime, and scrap worrying about it.
   data=(byte *)calloc(c4,sizeof(byte));
   dataout=(byte *)calloc(c,sizeof(byte));
   int k=0,kout=0;
   for( i=0;i<c;i++ ){
     byte r = data[k++];
     byte g = data[k++];
     byte b = data[k++];
     byte a = data[k++];
     dataout[kout++] = ceil ( 0.21 * r + 0.71 * g + 0.07 * b );
   }
}

--------------------------------------------------------------------
--- Compilation and run command for the Haskell, as recommended: ---
--------------------------------------------------------------------

echo "Compiling..."
time ghc \
  --make repa.hs \
  -Odph  \
  -rtsopts  \
  -threaded  \
  -fno-liberate-case  \
  -funfolding-use-threshold1000  \
  -funfolding-keeness-factor1000  \
  -fllvm  \
  -optlo-O3

echo "Running..."
time ./repa test-640x480.png +RTS -N2 -H

------------------------------------------------------------
--- Running this script twice: -----------------------------
------------------------------------------------------------

Compiling...
0.588u 0.072s 0:00.81 80.2%     0+0k 3144+0io 21pf+0w
Running...
13.236u 0.784s 0:07.34 190.8%   0+0k 528+1288io 4pf+0w

Compiling...
0.608u 0.048s 0:00.66 96.9%     0+0k 0+0io 0pf+0w
Running...
13.148u 0.964s 0:07.30 193.1%   0+0k 0+1288io 0pf+0w

----------------------------------------------------------------
--- Compilation and running for the C (no optimisation): -------
----------------------------------------------------------------

gcc -O0 repa-c.c -o repa-c -lm

repa-c

------------------------------------------------------------
--- Running this script twice: -----------------------------
------------------------------------------------------------

Compiling...
0.180u 0.044s 0:00.23 95.6%     0+0k 0+32io 0pf+0w
Running...
0.096u 0.008s 0:00.10 90.0%     0+0k 0+0io 0pf+0w

Compiling...
0.176u 0.028s 0:00.21 90.4%     0+0k 0+32io 0pf+0w
Running...
0.100u 0.004s 0:00.10 100.0%    0+0k 0+0io 0pf+0w
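
The C run times above are around 0.10 s, which is close to the resolution at which time reports them, and they include process start-up. A rough sketch of timing only the loop from inside the program, assuming POSIX clock_gettime with CLOCK_MONOTONIC is available (the helper names now_monotonic and elapsed_seconds are made up for illustration):

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* ask for clock_gettime declarations */
#endif
#include <time.h>

/* Hypothetical helper: read the monotonic clock. */
struct timespec now_monotonic(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t;
}

/* Hypothetical helper: seconds elapsed between two readings. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling now_monotonic() immediately before and after the for loop and printing elapsed_seconds of the two readings gives the processing time alone. On older glibc versions clock_gettime may also need -lrt at link time.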

------------------------------------------------------------
--- Sizes of binaries: -------------------------------------
------------------------------------------------------------

Haskell: 1233640 bytes
C:          8714 bytes

----------------------------------------------------

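Remember that the C time needs 2 seconds added to account for
the image I/O. That puts the unoptimised, single-threaded C at
roughly 2.1 seconds, against about 7.3 seconds of wall-clock
time for Repa running on two cores.

I think Repa is not very performant for the given example. :(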