When measuring the speedup of a parallelized algorithm, many people calculate only the ratio of serial to parallel running time (Sp = Tseq/Tpar). The effective speedup can be approximated more closely by breaking Tpar down with the following equation:
Tpar = Tcomp + Tcomm + Tsync

where Tcomp is the time spent computing, Tcomm is the time spent communicating, and Tsync is the time spent waiting at synchronization points.
All of these terms are easily quantifiable with the exception of Tsync. In an ideal world Tsync is zero, but in practice it never is. A single slow processor, machine, or data link can make a huge difference in Tsync. So, how can we manage it?
The most straightforward way to minimize Tsync is to apply a fixed chunk load. MPI uses this method. This works fine unless you have a heterogeneous cluster or other applications are running on the processors or machines being used.
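Fixed chunking can be sketched in a few lines. This is an illustrative Python version (the function name `fixed_chunks` is my own, not from the original code):

```python
# Sketch of fixed-chunk scheduling: split n_items work items into
# n_workers contiguous chunks of (near) equal size, the way an
# MPI-style program would distribute a loop up front.
def fixed_chunks(n_items, n_workers):
    base, extra = divmod(n_items, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)  # spread the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks

# A fast worker finishes its chunk and then idles (pure Tsync) until
# the slowest worker is done, which is why this only suits
# homogeneous, dedicated machines.
```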
An extension of fixed chunking is to compute a normalized power value for each of your computing resources and then hand each resource jobs sized in proportion to its ability. For example, say we have three machines whose computing powers are A:1, B:2, C:3 (C is three times as powerful as A). If we have twelve pieces of work to distribute, we can group them into six chunks of two (total work divided by the sum of the powers). In the time C computes three of the chunks, B will have done two and A will have done one. The entire job should wrap up with minimal Tsync. This kind of work load balancing works very well for heterogeneous clusters.
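The weighted distribution above can be sketched directly. This is an illustrative Python version of that arithmetic (the names here are mine, not from the original code):

```python
# Sketch of power-weighted chunking using the example from the text:
# twelve work items, machine powers A:1, B:2, C:3. Chunk size is
# total items divided by the sum of the powers, and each machine
# takes power-many chunks.
def weighted_chunks(n_items, powers):
    total_power = sum(powers.values())
    chunk_size = n_items // total_power       # 12 // 6 == 2 items per chunk
    assignment, next_item = {}, 0
    for name, power in powers.items():
        take = power * chunk_size             # proportional share of the work
        assignment[name] = list(range(next_item, next_item + take))
        next_item += take
    return assignment

work = weighted_chunks(12, {"A": 1, "B": 2, "C": 3})
# A gets 2 items, B gets 4, C gets 6, so all three machines should
# finish at roughly the same time.
```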
Other scheduling methods, factoring and guided self-scheduling among them, have been tried in clusters, but they seem to add too much overhead to Tcomm. There has to be a balance between keeping communication chunky and minimizing Tsync. They may, however, be valid scheduling algorithms for multicore processing, where Tcomm is greatly reduced compared to cluster computing.
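To make the trade-off concrete, here is a minimal sketch of guided self-scheduling (the function name is hypothetical): each time a worker asks for work it receives a fraction of what remains, so early chunks are large (few messages, low Tcomm) and late chunks are small (little idle waiting, low Tsync), at the cost of many more scheduling requests than fixed chunking.

```python
# Sketch of guided self-scheduling: each request hands out
# remaining / n_workers items, so chunk sizes decay geometrically.
def guided_chunk_sizes(n_items, n_workers):
    sizes, remaining = [], n_items
    while remaining > 0:
        size = max(1, remaining // n_workers)
        sizes.append(size)
        remaining -= size
    return sizes

# With 100 items and 4 workers the chunks start at 25 and shrink
# (25, 18, 14, ...), ending in single-item chunks -- many more
# handouts (communication) than the fixed scheme's four.
```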
Working again from Joe Duffy's MSDN article "Reusable Parallel Data Structures and Algorithms", I have reworked the loop tiling code to support scheduling algorithms.
[Code listing: the reworked loop-tiling code with pluggable scheduling algorithms]
With this setup, all you have to do is set the algorithm you would like to use along with the scheduling algorithm. The variables are thread-static so that multiple threads can each have their own algorithms, but since the for loop blocks, you only need one copy per thread.
It will take a little tweaking for your application's algorithm, but these should produce reasonable results. My next goal is to benchmark them under different scenarios.