In my previous post here, I mentioned that:
We also need to allow contexts to be loaded outside of spu_run to implement gang scheduling correctly and efficiently.
I think that may require a little explanation, so here we go:
gang scheduling policies
The idea behind SPE gang scheduling is to allow a set of related SPE contexts to be scheduled together, to allow interactions between contexts to be performed in a timely manner. For example, consider two contexts (A and B) that send mailbox messages to each other. If context A is running while context B is scheduled out, then A will spend its time-slice waiting for a message from B. If they are scheduled as a single entity, neither context will have to spend its timeslice blocked, waiting for the other context to be run.
So, we have to come up with a policy to define the behaviour of the gang scheduler. When does the gang become schedulable? Under which conditions should the gang be descheduled?
I can see four possible approaches:
policy 1: the gang is only schedulable when all contexts are runnable
In this case, the gang is only ever scheduled when all of the gang's
contexts are runnable (ie, they are being run by the spu_run
system
call).
Although the simplest, this approach will never complete—consider the following:
- Context A becomes runnable
- Context B becomes runnable
- The gang is now schedulable, so both contexts are scheduled
- Because it has slightly less work to do, context A finishes before context B
- Because only one of the two contexts is runnable, the gang is no longer schedulable. Context B is never re-scheduled, so cannot complete the rest of its task
So, this policy isn't much use; perhaps we can solve this with a new approach:
policy 2: the gang is scheduled when all contexts are runnable, and descheduled when no contexts are runnable
This will solve the previous non-termination problem, in that context B will be able to terminate - the context isn't immediately descheduled when A finishes.
However, now we have a new, slightly more complex non-termination case:
- Context A becomes runnable
- Context B becomes runnable
- The gang is now schedulable, so both contexts are scheduled
- Because it has slightly less work to do, context A finishes before context B
- At the same time, context B does a PPE-assisted callback, which requires a stop-and-signal (and so leaves spu_run for just a moment)
- Because neither context is currently runnable, the gang is descheduled
- Context B finishes its callback, so re-enters spu_run to be re-scheduled. However, the policy does not allow context B to be re-scheduled, as only one of the two contexts is runnable.
Although this may sound like a rare occurrence, it's not a restriction we can pass on to the programmer. Imagine the following SPE code:
int main(void) { do_work(); printf("work done!\n"); return 0; }
Here we're doing a PPE-assisted callback (the call to printf
is
implemented as a callback) before finishing. If this callback were to occur
when the other context has already completed, we would hit the non-termination
condition above.
This means that the last-running context of a gang can never do a PPE-assisted callback. In fact, to be completely safe against this non-termination, a programmer would have to avoid callbacks after any context has finished, for risk of callbacks on the rest of the gang being synchronised.
So, it looks like we need to be a little more permissive when deciding if the gang is schedulable.
policy 3: the gang is scheduled when any context is runnable, and descheduled when no contexts are runnable
This is another fairly simple approach—the gang is scheduled whenever there is any work to do. We no longer have any non-termination conditions, as 'having work to do' will result in 'doing work'.
The tricky part is that it will require us to change one of the fundamental assumptions about spufs: currently, we don't schedule any context unless it is runnable. Because we schedule the entire gang when one if its context becomes runnable, we have to now schedule a number of non-runnable contexts.
The good news is that I've already done a little experimental work to overcome this general restriction in spufs.
The last approach is a little more complex, but works around this restriction:
policy 4: schedule the runnable contexts of a gang, and reserve SPEs for the non-runnable contexts
This is just like policy 3, but instead of actually scheduling the non-runnable contexts, we reserve a SPE for them.
This way, a non-runnable context does not need to be loaded, but can be quickly scheduled when it becomes runnable. The downside is that we're only half-implementing gang scheduling; there still may be interactions to a non-runnable SPE (eg, accesses to the problem state mapping from a running context in the same gang) that will cause running contexts to become blocked.
So, which policy is best for spufs?
Policies 1 and 2 have significant flaws in their approach. It's quite possible that either will lead to non-termination conditions in fairly simple user programs. I don't think we can 'work around' this with a restriction on the programmer.
Policy 4 will require a mechanism for reserving SPEs for a particular context; I'm not convinced the extra complexity is worth the effort, especially as this doesn't allow us to implement gang scheduling properly.
Currently, Luke Browning and André Detsch have a work-in progress patch series for gang scheduling, based on policy 3.