In the previous article in this series, we finished with a parellel SPE-based fractal renderer. This article will cover some of the major optimisations we can do to make use of the Cell Broadband Engine.

Our last benchmark gave us the following results:

Performance of multi-threaded, interleaved SPE fractal renderer
140.72
220.75
410.78
67.44
85.81

### SPE vector instructions

The registers on the SPEs are actually 128-bits wide. Rather than using the entire register for one value, most of the SPE instruction set consists of vector instructions.

For example, say we wanted to add 7 to the value in register one, and place the result in register two. To do this, we'd use the `ai` ("add immediate") instruction:

```        ai r2,r1,7
```

This instruction operates on the four 32-bit 'slots' of the 128-bit register (4 x 32 = 128) in parallel: By using these vector instructions, we can do a lot of the fractal rendering work in parallel.

Vector instructions aren't specific to the Cell Broadband Engine - Altivec (on PowerPC processors) and SSE (on Pentium processors) offer similar capabilities for parallel processing of data. In fact, the PPE on the Cell Broadband Engine supports Altivec extensions.

### vectorising the fractal renderer

The hackfest task description contains a little information on how to vectorise code:

The SPEs of the Cell/B.E. have an instruction set almost entirely composed of SIMD instructions - so you can do most computations on four operands at a time.

To take advantage of these SIMD capabilities, you can use "vector" types in your code.

For example, consider a simple function to multiply a number by three:

```float multiply_by_three(float f)
{
return f * 3;
}
```

By vectorising this function, we can do four operations in the same amount of time:

```vector float multiply_by_three(vector float f);
{
return f * (vector float){3, 3, 3, 3};
}
```

The `vector float` type is similar to an array of 4 `float`s, except that we can operate on the vector as a whole.

If we look at the instructions generated for both the scalar and vector versions of the multiply_by_three function, they're exactly the same. The following code shows the compiled assembly of `multiply_by_three()`, both vector and scalar versions. The function argument is in register 3, return value is placed in register 3.

```multiply_by_three:
ilhu   \$2,16448    ; load {3, 3, 3, 3} into register 2
fm     \$3,\$3,\$2    ; register 3 = (register 3) * (register 2)
```

So, if you're using scalar operations, we get one multiply in three instructions. Using the vector equivalent, we get four multiplies in three instructions.

So, we should look at the hot paths of the current fractal renderer, and try to vectorise where possible.

Currently, this loop of our `render_fractal` function is where the SPE is doing most of the work, iterating over each pixel to find the point where it diverges to infinity:

```        for (x = 0; x < params->cols; x++) {
cr = x_min + x * params->delta;

zr = 0;
zi = 0;

for (i = 0; i < params->i_max; i++)  {
/* z = z^2 + c */
tmp = zr*zr - zi*zi + cr;
zi =  2.0 * zr * zi + ci;
zr = tmp;

/* if abs(z) > 2.0 */
if (unlikely(zr*zr + zi*zi > 4.0))
break;
}

colour_map(&params->imgbuf[r * params->cols + x],
i, params->i_max);
}
```

The renderer will execute the inner loop up to `rows` * `columns` * `i` times, where `i` is between 1 and `i_max`. Seeing as this is the most executed path in our renderer code, it is a good place to start our vectorisation efforts.

By vectorising, we can perform the contents of the loop on four pixels at a time (rather than just one), enabling us to reduce the number of iterations to around one quarter.

The new vectorised fractal generator is available in fractal.5.tar.gz. All we have changed here is the SPE-side code (`spe-fractal.c`). The most significant change is to the `render_fractal()` function, where we have replaced the above loop with:

```        for (x = 0; x < params->cols; x += 4) {
escaped_i = (vector unsigned int){0, 0, 0, 0};
cr = x_min + spu_splats(params->delta * x);

zr = (vector float){0, 0, 0, 0};
zi = (vector float){0, 0, 0, 0};

for (i = 0; i < params->i_max; i++)  {

/* z = z^2 + c */
tmp = zr*zr - zi*zi + cr;
zi =  (vector float){2.0f, 2.0f, 2.0f, 2.0f}
* zr * zi + ci;
zr = tmp;

/* escaped = abs(z) > 2 */
escaped = spu_cmpgt(zr * zr + zi * zi, limit);

/* escaped_i = escaped ? escaped_i : i */
escaped_i = spu_sel(spu_splats(i), escaped_i,
escaped);

/* if all four words in escaped are non-zero,
* we're done */
if (!spu_extract(spu_orx(~escaped), 0))
break;
}

colour_map(&params->imgbuf[r * params->cols + x],
escaped_i, params->i_max);
}
```

This new code processes four columns at each iteration, and increments `x` by four each time. We set up a vector of pixels (one per column) at each iteration, and do the divergence tests on a vector at a time.

Most of the actual aritmetic is easy to convert; the standard multiply, add and subtract operators will work just fine with vector types.

However, the `if`-test to see if the number has diverged is more complex - since we're dealing with a vector of four pixels, we can't just break out of the inner loop when one pixel has diverged. Instead, we need to keep processing until all vector elements to reach the divergence point, and remember the value of `i` where this divergence occurs.

Basically, the above code keeps two extra variables: `escaped` and `escaped_i`. `escaped` is a vector of four flags - if the nth word of `escaped` is non-zero, we have detected that the nth pixel in our vector has diverged to infinity. The `escaped_i` is a similar vector of the value of `i` where we first detected this divergence, which is used as the 'value' for each point in the fractal.

We do this with a few SPE intrinsics, which allow us to use some of the lower-level SPE vector functionality:

 `spu_cmpgt(a,b)` compare greater than returns one bits in each word slot in `a` that is greater than the corresponding word slot in `b`. `spu_sel(a,b,c)` select bits For bit each bit in `c`: if the bit is zero, return the corresponding bit from `a`, otherwise return the corresponding bit from `b`. `spu_orx(a)` or across Logical-or each word in `a`, and return the result in the preferred slot (ie, the first word). The remaining three slots are set to zero. `spu_extract(a,n)` extract vector element Return a scalar value from the `n`th element of vector `a` `spu_splats(a)` splat scalar Return a vector consisting of each word equal to `a`

The first three intrinsics will result in just one instruction in the compiled code. Since slot 0 is the preferred slot (and therefore where the compiler would normally store a scalar value in a register), the call to `spu_extract(x, 0)` will not result in any instructions at all. The `spu_splats` instrinsic may require a variable number of instructions, depending on the value to splat.

More detail on the SPE instrinsics is available in the PPU & SPU C/C++ Language Extension Specification and SPU Instruction Set Architecture documents on the IBM developerWorks Cell Broadband Engine Resource Center.

We also need to change the `colour_map` function to take a vector of `i` values (rather than just a single value), and calulate the colours of four pixels at a time. For now, we can just do this one element at a time, using `spu_extract`, but this could be improved later.

### benchmarking

After vectorising the SPE fractal code, we can benchmark to see how the vectorisation affects run time. The following table shows run times for the new renederer, along with the proportion of time against the previous scalar implementation.

Performance of multi-threaded, interleaved, vectorised SPE fractal renderer
SPE threads Running time (sec) Vector / Scalar
110.700.262
25.720.372
43.290.305
62.460.330
82.060.354

Not too bad! We're doing much better than our original PPE-based renderer, which took 55.7 seconds to render the same fractal.

Ideally, we'd see a constant vector / scalar figure of 0.25, meaning that we're doing four times the amount of work in the same run time. However, there are still a lot of operations that we haven't optimised yet.

More optimisations in the next article!