
Improving Performance with DWA

In my 2014 Dyalog presentation I demonstrated the Co-dfns compiler and its ability to target GPU and CPU platforms without altering the source code. Unfortunately, at the time, there was a regression in the system that led to slower-than-interpreter results. I did get that fixed, but found that I was hitting a hard limit of around 20-30% performance gains over the interpreter, no matter how much I played with the code.

Further analysis showed that most of the execution time was being spent on something other than the primary computation. Digging into this, it appeared that the core of this "extra computation" was the cost of transferring data into and out of the interpreter.

The original design of Co-dfns called for support of 64-bit integers and a data representation separate from the one used by the interpreter. The current version of the Dyalog interpreter only supports 32-bit integers (though it does support 64-bit floating-point values) for various reasons, which meant that every call into compiled Co-dfns code had to go through the ⎕NA interface. The data would come in, the Co-dfns runtime would convert it from the Dyalog interpreter's format into a set of Co-dfns Array structures, and pointers to those structures would be handed back to the interpreter. Those pointers would then be passed into the Co-dfns FFI call, and the resulting Array pointer would be converted back into the interpreter's format before being used as the result of the computation.
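To make that overhead concrete, here is a minimal sketch in C of what the per-call conversion looked like in spirit. The struct layout and function names are invented for illustration; this is not the actual Co-dfns runtime:

```c
#include <stddef.h>
#include <stdlib.h>

/* Invented stand-in for the old Co-dfns array representation. */
typedef struct {
    size_t     count;  /* number of elements */
    long long *data;   /* Co-dfns used 64-bit integers */
} Array;

/* Copy-in: widen the interpreter's 32-bit data into a fresh 64-bit Array. */
static Array *import_i32(const int *src, size_t count) {
    Array *a = malloc(sizeof *a);
    a->count = count;
    a->data  = malloc(count * sizeof *a->data);
    for (size_t i = 0; i < count; i++) a->data[i] = src[i];
    return a;
}

/* Copy-out: narrow the 64-bit result back into the interpreter's format. */
static void export_i32(const Array *a, int *dst) {
    for (size_t i = 0; i < a->count; i++) dst[i] = (int)a->data[i];
}
```

Every single call into compiled code paid for both of these full passes over the data, regardless of how little work the actual computation did.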

This original design arose from two intentional design choices. The first was to support 64-bit integers. The second was to ensure that we would later be able to decouple array headers from the memory regions storing the data. While this is, in theory, a better design, my performance analysis revealed that the overhead of copying data into and out of the interpreter overwhelmed the core computation by such a wide margin that even if the computation happened nearly instantaneously, the best I could hope for was a 30% performance improvement, simply because of the limitation of memory bandwidth.
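As a back-of-the-envelope illustration of that ceiling (the 77% figure here is an assumption chosen to match the bound, not a measurement from the post): if copying accounts for roughly 77% of total runtime, then by Amdahl's law even an infinitely fast kernel is capped at about a 1.3× speedup:

```c
#include <stdio.h>

/* Amdahl-style bound: with compute time driven to zero, the speedup
 * over the original runtime is capped at 1 / transfer_fraction. */
int main(void) {
    double transfer_fraction = 0.77;             /* ~77% spent copying  */
    printf("%.2fx\n", 1.0 / transfer_fraction);  /* ~1.30x, i.e. ~30%   */
    return 0;
}
```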

To determine whether this was indeed what was holding me back, I programmed two prototypes (GPU and CPU) that represented the compiler output after basic optimizations were performed. I then measured their performance without the data-copying overhead. The result? I saw 80-90% performance improvements on the CPU and over 200× improvements on the GPU.

This left me in a bit of a sticky mess: keep the cleaner original design, or chase the numbers. In the end, performance won out, and I set about rewriting the compiler's runtime elements to eliminate the now-defunct data representation and instead use the DWA system that Dyalog provides.

DWA stands for Direct Workspace Access, and it is, essentially, an API into the Dyalog memory manager. It allows me to construct arrays directly inside the memory management system of the Dyalog APL interpreter and, in turn, to use those arrays inside the foreign code that the Co-dfns compiler generates. Moreover, the ⎕NA system supports this array format directly, enabling me to avoid any copying overhead when calling into and out of the Co-dfns world, as sketched below.
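Here is a hypothetical sketch of what the compiled entry point looks like under this scheme; the type and function names are invented, and the real declarations come from Dyalog's DWA headers:

```c
/* Opaque handle to an array that lives in the interpreter's workspace;
 * the real definition is supplied by Dyalog's DWA headers. */
typedef struct dwa_array dwa_array;

/* Exported entry point, bound from APL via ⎕NA using pointer arguments.
 * The function receives and returns workspace arrays directly, so there
 * is no conversion step and nothing to copy on either side of the call. */
dwa_array *codfns_f(dwa_array *w) {
    /* ... generated computation operating on workspace data ... */
    return w;
}
```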

I took this opportunity to also alter the runtime to completely inline primitive functions and perform loop fusion for all scalar computations, as illustrated below. This has two specific benefits. First, it simplifies deployment, as no runtime shared object is required to use Co-dfns compiled code (just DWA); second, I am able to more carefully construct the generated code to expose large blocks of vectorizable or parallelizable code directly through the compiler.
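Here is a rough illustration of what fusion means for a scalar expression like 1+x×y on doubles (my own sketch, not actual compiler output):

```c
#include <stddef.h>

/* Unfused: each primitive makes its own pass over the data and needs a
 * temporary buffer, much as the interpreter effectively does. */
void unfused(const double *x, const double *y, double *z,
             double *tmp, size_t n) {
    for (size_t i = 0; i < n; i++) tmp[i] = x[i] * y[i];   /* x×y */
    for (size_t i = 0; i < n; i++) z[i]   = 1.0 + tmp[i];  /* 1+  */
}

/* Fused and inlined: the whole expression becomes a single pass with no
 * temporary, yielding one loop the C compiler can vectorize. */
void fused(const double *x, const double *y, double *z, size_t n) {
    for (size_t i = 0; i < n; i++) z[i] = 1.0 + x[i] * y[i];  /* 1+x×y */
}
```

The fused form touches each element once, allocates nothing, and hands the backend compiler exactly the kind of large, simple loop it knows how to vectorize or parallelize.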

The result of this work is that the Co-dfns compiler no longer runs into a hard 20-30% limit on its code. The CPU backend, which has now been completely rewritten in this form, is delivering performance around 75% above the interpreter, and I suspect that I will be able to reach the 80-90% of the prototype by implementing one or two more optimizations that were omitted from this first development cycle. The GPU backend is not quite complete, as I am currently evaluating a new implementation technology (OpenACC) to determine whether it will make the code both fast and more platform-independent, which is important for a system like Co-dfns. However, based on the performance of the CPU code and how well it matches the results predicted by the prototype, I anticipate that we will still see the 200× performance improvements over the interpreter on Fermi GPU cards. I'll be working with Xeon Phi and Maxwell NVIDIA cards soon to determine what sort of performance users can expect on newer hardware as well.

In short, this is an interesting story of performance analysis, one where data-transfer overheads overwhelmed everything else. By completely eliminating that overhead, I'm able to deliver significant performance benefits now, on a wide variety of platforms, without requiring users to write complicated code. APL truly is a great cross-platform performance juggernaut.