I wrote a few matrix multiplication (MM) applications to benchmark my parallel processing research, and I was blown away by how slowly the C# implementations ran. I tried writing the MM application a few ways, and it turns out the fastest is a jagged array, type[N][N], rather than type[N*N] or type[N,N]. I even tried unsafe C# with pointers, but type[N][N] was still by far the fastest. Here I am going to compare type[N][N] and type[N,N].
Here is the C# code for the type[N,N] version. I am using int because it benchmarks faster, so I spend less time waiting. I have removed the initialization of the arrays, as it only added more code to analyze and isn't needed for this analysis:
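The original listing is not reproduced here, so as a rough reconstruction (the class name, variable names, and the small N are mine, not the author's), the type[N,N] version looks something like this:

```csharp
using System;

class RectMatrixMul
{
    // Rectangular (multidimensional) array version: type[N,N].
    public static int[,] Multiply(int[,] a, int[,] b, int n)
    {
        var c = new int[n, n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i, j] += a[i, k] * b[k, j];
        return c;
    }

    static void Main()
    {
        const int N = 256;  // kept small here; the post benchmarks up to 3000x3000
        var a = new int[N, N];
        var b = new int[N, N];
        // (array initialization omitted, as in the original post)
        Console.WriteLine("Calculate");
        var c = Multiply(a, b, N);
        Console.WriteLine("Done");
    }
}
```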
I have added the console output to identify the IL more easily.
Here is the type[N][N] version:
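Again as a reconstruction rather than the original listing (names and the small N are mine), the jagged version needs a per-row allocation loop up front:

```csharp
using System;

class JaggedMatrixMul
{
    // Jagged-array version: type[N][N], where each row is its own int[].
    public static int[][] NewSquare(int n)
    {
        var m = new int[n][];
        for (int i = 0; i < n; i++)
            m[i] = new int[n];   // extra per-row allocation, unique to the jagged form
        return m;
    }

    public static int[][] Multiply(int[][] a, int[][] b, int n)
    {
        var c = NewSquare(n);
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    c[i][j] += a[i][k] * b[k][j];
        return c;
    }

    static void Main()
    {
        const int N = 256;  // kept small here; the post benchmarks up to 3000x3000
        var a = NewSquare(N);
        var b = NewSquare(N);
        // (array initialization omitted, as in the original post)
        Console.WriteLine("Calculate");
        var c = Multiply(a, b, N);
        Console.WriteLine("Done");
    }
}
```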
At first glance it looks like we have a bit more initialization overhead for the type[N][N]. I compiled everything and reflected out the IL, then stripped all the locations to keep this as clean and simple as possible (and the diff cleaner). There is a difference, though: the type[N][N] version uses ldelem and stelem to access elements, while the type[N,N] version uses call instance void int32[0...,0...]::Set(int32, int32, int32). I think this is where we start to see why the rectangular C# array is slower than the jagged version.
Let's look at the IL for the initialization, which covers everything up to the writing of "Calculate" to the console.
On the type[N,N] side we create a new object on the evaluation stack, then pop it from the stack and store it in a local variable; this is done for each of the three arrays.
On the type[N][N] side we push a new array of int32 onto the stack, and then for each position we use the same call to create a new primitive int32 array and store it at that position on the stack.
Here is the grand conclusion to our IL:
As you can see, the calculations happen 'almost' identically. With the type[N,N] array we make external calls into the runtime to get an address and set values, whereas the type[N][N] simply issues a ldelem. Now, I wish there were a way to see exactly how long the Address and Get methods take, but the evidence is clear that ldelem is a lot faster.
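Short of instrumenting the runtime helpers, one crude check is to time raw element reads of each array shape. This Stopwatch sketch (sizes and structure are my own, not the post's benchmark harness) may show the jagged form ahead, though results vary by runtime and JIT:

```csharp
using System;
using System.Diagnostics;

class AccessCost
{
    // Time n*n sequential reads of a rectangular and a jagged array.
    public static (long rectTicks, long jaggedTicks) Time(int n)
    {
        var rect = new int[n, n];
        var jagged = new int[n][];
        for (int i = 0; i < n; i++) jagged[i] = new int[n];

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += rect[i, j];          // runtime Get on the rectangular array
        long rectTicks = sw.ElapsedTicks;

        sw.Restart();
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                sum += jagged[i][j];        // plain ldelem on the jagged array
        long jaggedTicks = sw.ElapsedTicks;

        // Keep 'sum' observable so the loops aren't eliminated.
        if (sum == long.MinValue) Console.WriteLine(sum);
        return (rectTicks, jaggedTicks);
    }

    static void Main()
    {
        var (r, j) = Time(2000);
        Console.WriteLine($"rect: {r} ticks, jagged: {j} ticks");
    }
}
```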
I ran all of the loop-order permutations for every type of array I could use; the ikj MM was the fastest for all of them. I benchmarked these on two machines (S01, S02) that were as identical as possible, with nothing running except background services. Each point was run three times and the average of the three was plotted. The matrix size (along the bottom) is NxN (square), and the left-hand axis is the number of algorithmic steps per second, where the multiplication, addition, and assignment of an MM inner step together count as one operation.
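For reference, the ikj ordering on jagged arrays looks like the sketch below (the hoisting of a[i][k] into a local is my own touch, not necessarily the author's). The innermost loop walks both c[i] and b[k] sequentially, which is friendlier to the cache than the classic ijk order, where b[k][j] strides down a column:

```csharp
class IkjMatrixMul
{
    // ikj loop order over jagged arrays: inner loop is fully row-wise.
    public static int[][] MultiplyIkj(int[][] a, int[][] b, int n)
    {
        var c = new int[n][];
        for (int i = 0; i < n; i++) c[i] = new int[n];

        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
            {
                int aik = a[i][k];          // invariant across the j loop
                for (int j = 0; j < n; j++)
                    c[i][j] += aik * b[k][j];
            }
        return c;
    }
}
```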
For a 3000×3000 matrix we are making millions of method calls with the type[N,N] matrix that the type[N][N] version never makes. Why is the C# multidimensional array implemented this way? Please let me know if I missed anything or if you have some insight.