In preparation for some research I'm doing on parallelism, I finally managed to get a working understanding of tasks and multi-programming in APLX.

I came up with the following code and would appreciate any improvements or suggestions. It is fast when there is a significant amount of stenciling to do: I tested it with a ten-pass 3×3 mean filter (blur) on an RGB image, giving each color channel its own background APL task, and saw nearly 100% CPU utilization throughout the computation.

Aaron W. Hsu

∇Z←PBLUR IMG;F;R;G;B;T;TOV;INIT 
 0 0⍴1 ⎕WE 0 
 (R G B)←T←'⎕' ⎕NEW¨3⍴⊂'APL' 
 T.background←1 ⋄ T.wssize←400000000 
 T.∆M←⊂[2 3](3⍴256)⊤IMG ⋄ Z←0 ⋄ F←3⍴0 
 T.∆TOV←⊂0 ⎕OV ⎕BOX 'CONVOLUTE3 CONVOLVEMANY CHILD' 
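 ⍝ A child's first signal means "ready": re-arm the handler and reply.
 ⍝ Its second means "done": set its flag; leave via XT once all three are set.
 R.onSignal←'R.onSignal←"F[1]←1 ⋄ →(∧/F)/XT" ⋄ 0 0⍴R.Signal' 
 G.onSignal←'G.onSignal←"F[2]←1 ⋄ →(∧/F)/XT" ⋄ 0 0⍴G.Signal' 
 B.onSignal←'B.onSignal←"F[3]←1 ⋄ →(∧/F)/XT" ⋄ 0 0⍴B.Signal' 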
 INIT←'Sys←''⎕'' ⎕NEW ''System'' ⋄ 2 ⎕OV Sys.∆TOV ⋄ CHILD' 
 0 0⍴T.Open 
 0 0⍴R.Execute INIT 
 0 0⍴G.Execute INIT 
 0 0⍴B.Execute INIT 
 0 0⍴⎕WE ¯1 
 XT:Z←B.∆M+256×G.∆M+256×R.∆M 
 ∇ 
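PBLUR assumes each pixel is a packed 24-bit integer with red in the high byte and blue in the low byte; that layout is my reading of the `XT:` line, not something stated above. The split into three planes and the recombination are inverses, which is easy to confirm in a clear workspace (⎕IO←1, the APLX default) with a tiny hypothetical test image:

```apl
⍝ Hypothetical 2×2 test image: pure red, pure green, pure blue, mid grey
IMG←2 2⍴16711680 65280 255 8421504
⍝ Split into three 2×2 planes, exactly as PBLUR does for T.∆M
(RC GC BC)←⊂[2 3](3⍴256)⊤IMG
RC
⍝ 255   0
⍝   0 128
⍝ Recombine exactly as PBLUR does at XT
IMG≡BC+256×GC+256×RC
⍝ 1
⍝ The same recombination written with decode (base value)
IMG≡256⊥(3⍴256)⊤IMG
⍝ 1
```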

∇Z←S CONVOLUTE3 IMG;M;W;H;PW;PH;I;SR;SS;R;X 
 ⍝ 
 ⍝ Morten contributed optimizations to the original form. 
 ⍝ 
 ⍝ Determine size and row size of the stencil. 
 SR←↑⍴S ⋄ SS←×/⍴S 
 ⍝ 
 ⍝ Pad the incoming matrix with zeros around the edge. 
 (PH PW)←(2×R←⌊SR÷2)+(H W)←⍴IMG 
 M←(-R)⌽(-R)⊖PH PW↑IMG 
 ⍝ 
 ⍝ Ravel indices into M of the SR×SR window at the top-left corner
 ⍝ (the plus-scan adds one padded row, PW, per stencil row).
 I←(+\PW×(SR⍴0),(SS-SR)⍴1,(¯1+SR)⍴0)+SS⍴⍳SR 
 ⍝ 
 ⍝ Generate all the other region columns. 
 I←I∘.+(X⍴(W⍴1),(2×R)⍴0)/0,⍳¯1+X←(PH×PW)-(2×R×PW)+2×R 
 ⍝ 
 ⍝ Compute the stencil 
 Z←⌊0.5+H W⍴(,S)+.×(,M)[I] 
 ∇ 
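To see what the index arithmetic in CONVOLUTE3 produces, here is the same computation run by hand on a hypothetical 2×3 image with a 3×3 stencil (⎕IO←1 assumed). T and OFF are illustrative names for the two intermediate values the function keeps in I. Note that the row offsets must accumulate with a plus-scan (`+\`); with a plain `+`, only the first cell of each later window row would be shifted.

```apl
⍝ Hypothetical 2×3 image, 3×3 stencil: R=1, so the padded size is 4 5
SR←3 ⋄ SS←9 ⋄ R←1 ⋄ (H W)←2 3 ⋄ (PH PW)←4 5
⍝ Ravel indices into M of the top-left 3×3 window
T←(+\PW×(SR⍴0),(SS-SR)⍴1,(¯1+SR)⍴0)+SS⍴⍳SR
T
⍝ 1 2 3 6 7 8 11 12 13
⍝ Offset of every window's top-left cell; the mask drops the 2×R
⍝ offsets per padded row where a window would run off the right edge
X←(PH×PW)-(2×R×PW)+2×R
OFF←(X⍴(W⍴1),(2×R)⍴0)/0,⍳¯1+X
OFF
⍝ 0 1 2 5 6 7
⍝ One column of 9 indices per output pixel (H×W = 6 of them)
⍴T∘.+OFF
⍝ 9 6
```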

∇Z←S CONVOLVEMANY IMG;I 
 I←0 ⋄ Z←IMG 
 LP:→(10<I←I+1)/0 ⋄ Z←S CONVOLUTE3 Z ⋄ →LP 
 ∇
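A couple of quick sanity checks on CONVOLUTE3 and CONVOLVEMANY, assuming both functions are in the workspace (ID and IMG are hypothetical test values): an identity stencil should leave an image unchanged however many passes are run, and a mean stencil on a constant image exposes the zero padding at the borders.

```apl
⍝ Identity stencil: only the centre weight is non-zero
ID←3 3⍴0 0 0 0 1 0 0 0 0
IMG←3 4⍴⍳12
IMG≡ID CONVOLUTE3 IMG
⍝ 1
⍝ 3×3 mean of a constant image: corners see 4 real cells, edges see 6
(3 3⍴÷9) CONVOLUTE3 3 3⍴9
⍝ 4 6 4
⍝ 6 9 6
⍝ 4 6 4
⍝ Ten identity passes are still the identity
IMG≡ID CONVOLVEMANY IMG
⍝ 1
```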

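∇CHILD;S
 ⍝ Runs inside each child task. S is the 3×3 mean (blur) stencil.
 S←3 3⍴÷9
 ⍝ When the parent replies: filter ∆M, signal completion, and quit.
 Sys.onSignal←'Sys.∆M←S CONVOLVEMANY Sys.∆M ⋄ Sys.Signal ⋄ →XT'
 ⍝ Tell the parent this task is ready, then wait for events.
 Sys.Signal
 0 0⍴⎕WE ¯1
 XT:⍎')OFF'
 ∇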
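For timing comparisons, and for checking the task version, it may help to have a serial reference that does the same work in a single task. SBLUR below is a hypothetical helper, not part of the code above; it reuses CONVOLVEMANY with the same stencil CHILD uses and repacks the planes the same way PBLUR does.

```apl
∇Z←SBLUR IMG;S;RC;GC;BC
 ⍝ Hypothetical serial reference for PBLUR: same channel split and
 ⍝ same ten-pass 3×3 mean, but everything runs in this one task
 S←3 3⍴÷9
 ⍝ Filter the red, green and blue planes one after another
 (RC GC BC)←(⊂S) CONVOLVEMANY¨⊂[2 3](3⍴256)⊤IMG
 ⍝ Repack into 24-bit pixels, as PBLUR does at XT
 Z←BC+256×GC+256×RC
 ∇
```

With a test image in IMG, `(PBLUR IMG)≡SBLUR IMG` should give 1 if each child hands back its filtered plane correctly, and timing the two calls shows how much the three tasks actually buy on a given machine.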