In part two of this series, we had just ported a fractal renderer to the SPEs on a Cell Broadband Engine machine. Now we're going to start the optimisation...

Our baseline performance is 40.7 seconds to render the sample fractal, using the sample parameters.

The initial SPE-based fractal renderer used only one SPE. However, we have more available:

Number of SPEs available in current CBEA machines

  Machine                   SPEs available
  Sony PlayStation 3        6
  IBM QS21 / QS22 blades    16 (8 per CPU)

So, we should be able to get better performance by distributing the render work between the SPEs. We can do this by dividing the fractal into a set of n strips, where n is the number of SPEs available. Then, each SPE renders its own strip of the fractal.
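In code terms, each SPE's strip works out as a simple row range; this is the same calculation that appears in the SPE changes below (assuming the row count divides evenly between threads):

        rows_per_spe = args.fractal.rows / args.n_threads;
        start_row = rows_per_spe * args.thread_idx;
        /* this SPE renders rows start_row .. start_row + rows_per_spe - 1 */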

The following image shows the standard fractal, as would be rendered by 8 SPEs, with shading to show the division of the work amongst the SPEs.

SPE fractal divisions

In order to split up the work, we first need to tell each SPE which part of the fractal it is to render. We'll add two variables to the spe_args structure (which is passed to the SPE during program setup), to provide the total number of threads (n_threads) and the index of this SPE thread (thread_idx).

struct spe_args {
        struct fractal_params fractal;
        int n_threads;          /* total number of SPE threads */
        int thread_idx;         /* this thread's index, 0 .. n_threads - 1 */
};
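As a reminder from the earlier parts of the series, the SPE fetches this structure at program start by DMA-ing it in from the effective address passed in argp. A rough sketch (assuming the structure is aligned and padded suitably for DMA, as in the earlier code):

        struct spe_args args __attribute__((aligned(16)));

        /* DMA the arguments in from the effective address given in argp,
         * then wait for the transfer to complete */
        mfc_get(&args, argp, sizeof(args), 0, 0, 0);
        mfc_write_tag_mask(1 << 0);
        mfc_read_tag_status_all();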

spe changes

On the SPE side, we use these parameters to alter the invocations of render_fractal. We set up another couple of convenience variables:

        rows_per_spe = args.fractal.rows / args.n_threads;
        start_row = rows_per_spe * args.thread_idx;

And just alter our for-loop to only render rows_per_spe rows, rather than the entire fractal:

        for (row = 0; row < rows_per_spe; row += rows_per_dma) {

                render_fractal(&args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

ppe changes

The changes to the PPE code are fairly simple: instead of creating just one thread, we create n threads.

First though, let's add a '-n' argument to the program to specify the number of threads to start:

        while ((opt = getopt(argc, argv, "p:o:n:")) != -1) {
                switch (opt) {

                /* other options omitted */

                case 'n':
                        n_threads = atoi(optarg);
                        break;

Rather than creating just one SPE thread, we now create n_threads of them. We also have to set the thread-specific arguments for each thread:

        for (i = 0; i < n_threads; i++) {
                /* copy the fractal data into this thread's args */
                memcpy(&threads[i].args.fractal, fractal, sizeof(*fractal));

                /* set thread-specific arguments */
                threads[i].args.n_threads = n_threads;
                threads[i].args.thread_idx = i;

                threads[i].ctx = spe_context_create(0, NULL);

                spe_program_load(threads[i].ctx, &spe_fractal);

                pthread_create(&threads[i].pthread, NULL,
                                spethread_fn, &threads[i]);
        }
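For reference, spethread_fn just needs to run the thread's SPE context, passing that thread's args structure as the SPE program's argument. A sketch (the struct spe_thread name and fields are as implied by the code above):

static void *spethread_fn(void *data)
{
        struct spe_thread *thread = data;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        /* run this thread's SPE program, passing its args structure */
        spe_context_run(thread->ctx, &entry, 0, &thread->args, NULL, NULL);

        return NULL;
}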

Now the SPEs should be running, each generating its own slice of the fractal. We just have to wait for them all to finish:

        /* wait for the threads to finish */
        for (i = 0; i < n_threads; i++)
                pthread_join(threads[i].pthread, NULL);

If you're after the source code for the multi-threaded SPE fractal renderer, it's available in fractal.3.tar.gz.

That's it! Now we have a multi-threaded SPE application to do the fractal rendering. In theory, an n-threaded program should take 1/n of the time of the single-threaded implementation; let's see how that theory fares against reality...

performance

Let's compare invocations of our multi-threaded fractal renderer with different values for the n_threads parameter.

Performance of multi-threaded SPE fractal renderer

  SPE threads   Running time (sec)
  1             40.72
  2             30.14
  4             18.84
  6             12.72
  8             10.89

Not too bad, but we're definitely not seeing linear scalability here; with perfect scaling the 8-SPE case would take around 40.72 / 8 ≈ 5.1 seconds, rather than the 10.89 we measured.

what's slowing us down?

A little investigation into the fractal generator will show that some SPE threads finish long before others. This is due to the variability in calculation time between pixels. To confirm that a point (ie, pixel) in the fractal does not diverge towards infinity (and so gets coloured blue), we need to do the full i_max iterations in render_fractal (i_max is 10,000 in our sample fractal). Other pixels may be shown to diverge much more quickly (possibly in under 10 iterations), and so are orders of magnitude faster to calculate.
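To illustrate, here is a sketch of the standard escape-time test for a single point; this is not the article's render_fractal, but it shows why per-pixel costs vary so widely:

/* Sketch: escape-time iteration count for one point (cx, cy). Points that
 * never escape cost the full i_max iterations; divergent points may exit
 * after only a few. */
static int escape_iterations(double cx, double cy, int i_max)
{
        double x = 0.0, y = 0.0, tmp;
        int i;

        for (i = 0; i < i_max; i++) {
                if (x * x + y * y > 4.0)
                        break;
                tmp = x * x - y * y + cx;
                y = 2.0 * x * y + cy;
                x = tmp;
        }

        return i;
}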

We end up with slices that are very quick to calculate, and others that take longer. Of course, we have to wait for the longest-running SPE thread to complete, so our overall runtime will be that of the slowest thread.

So, let's take another approach to distributing the workload. Rather than dividing the fractal into contiguous slices, we can interleave the SPE work. For example, if we were to use 2 SPE threads, then thread 0 would render all the even chunks, and thread 1 would render all the odd chunks (where a "chunk" is a set of rows that fit into a single DMA). This should even out the work between SPE threads.

Interleaved SPE fractal divisions

This is just a matter of changing the SPE for-loop a little. Rather than the current code, which divides the work into n_threads contiguous chunks:

        for (row = 0; row < rows_per_spe; row += rows_per_dma) {

                render_fractal(&args.fractal, start_row + row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + (start_row + row) * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

We change this to render every n_threads'th chunk, starting from chunk thread_idx, which gives us the interleaved pattern:

        for (row = rows_per_dma * args.thread_idx;
                        row < args.fractal.rows;
                        row += rows_per_dma * args.n_threads) {

                render_fractal(&args.fractal, row,
                                rows_per_dma);

                mfc_put(buf, ppe_buf + row * bytes_per_row,
                                bytes_per_row * rows_per_dma,
                                0, 0, 0);

                /* Wait for the DMA to complete */
                mfc_write_tag_mask(1 << 0);
                mfc_read_tag_status_all();
        }

An updated renderer is available in fractal.4.tar.gz.

Making this small change gives some better performance figures:

Performance of multi-threaded, interleaved SPE fractal renderer

  SPE threads   Running time (sec)
  1             40.72
  2             20.75
  4             10.78
  6              7.44
  8              5.81

We're doing much better now (roughly a 7x speedup with 8 SPEs, close to linear scaling), but we're still nowhere near the theoretical maximum performance of the SPEs. More optimisations in the next article...