Exercises
We have seen that it could be extremely beneficial to compute \(C = A\times B^T\) by directly implementing a mult_trans operation rather than explicitly forming the transpose of \(B\) and calling mult. In this exercise we are going to explore a few more details about optimization.
mult vs. mult_trans
Your starter repository contains a benchmark driver mmult.cpp that exercises the matrix multiply and matrix multiply transpose operations we wrote in the last assignment. It loops over a set of matrix sizes (8 to 256) and measures the performance of each function at that size. For each size it prints a row of performance numbers in GFlop/s – each column is the performance for one of the functions.
The functions mult_0 through mult_3 and mult_trans_0 through mult_trans_3 are provided for you. Compile and run mmult (the default size used with no command line argument should be fine). There is a rule in the provided CMakeLists.txt that invokes the compiler, so you should just be able to build the program and run it afterwards:
$ ./build/mmult
Please make sure to compile all tests related to measuring performance in Release mode.
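If you are configuring the build yourself, a typical CMake invocation looks something like this (a sketch; the exact build directory and targets in your setup may differ):

$ cmake -B build -DCMAKE_BUILD_TYPE=Release
$ cmake --build build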
If your computer and compiler are more or less like mine, you should see something printed that looks like this:
N mult_0 mult_1 mult_2 mult_3 mul_t_0 mul_t_1 mul_t_2 mul_t_3
8 3.00795 2.90768 6.23075 5.81536 3.6346 3.35502 7.93004 7.26921
16 2.35758 2.4572 6.23075 6.12144 2.50123 6.58343 11.0769 11.8279
32 1.98706 2.01804 6.61387 6.56595 2.29975 12.0813 15.8965 17.425
64 1.70172 1.94222 6.08737 6.62057 2.05673 13.9541 20.156 17.7847
128 1.57708 1.72746 6.05582 6.35951 1.79506 14.7662 21.136 16.457
256 1.28681 1.46371 4.87796 5.92137 1.62958 11.073 19.4263 15.7063
Huh. Although we applied all of the same optimizations to the mult and mult_trans functions, the mult_trans functions are significantly faster – even for the very simplest cases. That is fine when you need to compute \(C = A\times B^T\), but we also need to be able to compute plain \(C = A\times B\) – a much more common operation – and we would like to do it just as fast as \(C = A\times B^T\).
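As a reminder, a basic mult_trans looks something like the sketch below (your version from the last assignment may differ in details); note that both A and B are traversed along their rows:

void mult_trans(const Matrix& A, const Matrix& B, Matrix& C)
{
  // C = A * B^T: element (i, j) sums A(i, k) * B(j, k),
  // so B is read along row j rather than down column j
  for (size_t i = 0; i < C.num_rows(); ++i)
    for (size_t j = 0; j < C.num_cols(); ++j)
      for (size_t k = 0; k < A.num_cols(); ++k)
        C(i, j) += A(i, k) * B(j, k);
}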
- Save the output from a run of mmult into a file results/mmult.log and submit that as part of your homework.
Loop Ordering
The mathematical formula for matrix-matrix product is
\[C_{ij} = \sum_{k=0}^{K-1} A_{ik} B_{kj},\] which we realized using our Matrix class as three nested loops over i, j, and k (in that nesting order). This expression says “for each \(i\) and \(j\), sum each of the \(A_{ik} B_{kj}\) products.”
In the functions we have been presenting, we have been using the “ijk” ordering, where i is the loop variable for the outer loop, j is the loop variable for the middle loop, and k is the loop variable for the inner loop:
void mult(const Matrix& A, const Matrix& B, Matrix& C)
{
  for (size_t i = 0; i < C.num_rows(); ++i)
    for (size_t j = 0; j < C.num_cols(); ++j)
      for (size_t k = 0; k < A.num_cols(); ++k)
        C(i, j) += A(i, k) * B(k, j);
}
But the summation is valid regardless of the order in which we add the k-indexed terms to form each result matrix element \(C_{ij}\). Computationally, if we look at the inner statement C(i, j) = C(i, j) + A(i, k) * B(k, j), there is no dependence of i on j or k (nor of any loop index on any other loop index); we can do the inner summation in any order. That is, we can nest the i, j, k loops in any order, giving us six variants of the matrix-matrix product algorithm. For instance, an “ikj” ordering would have i in the outer loop, k in the middle loop, and j in the inner loop.
void mult_ikj(const Matrix& A, const Matrix& B, Matrix& C)
{
  for (size_t i = 0; i < C.num_rows(); ++i)
    for (size_t k = 0; k < A.num_cols(); ++k)
      for (size_t j = 0; j < C.num_cols(); ++j)
        C(i, j) += A(i, k) * B(k, j);
}
Benchmark Different Loop Orderings
The program mmult_indices.cpp in your work directory is a benchmark driver that calls six different matrix multiply functions: mult_ijk, mult_ikj, mult_jik, mult_jki, mult_kij, and mult_kji. The corresponding implementations are included in your matrix_operations.cpp file. Right now those implementations are completely unoptimized – they are each just three nested loops (in the indicated order, with C(i, j) += A(i, k) * B(k, j) in the inner loop). The program runs each of these six functions over a set of problem sizes (\(N \times N\) matrices) from 8 to 256, in powers of two. The maximum problem size can also be passed in on the command line as an argument. The program prints the GFlop/s that it measures for each of the six functions, for each of the sizes it runs.
Compile and run mmult_indices (the default arguments should be fine). There is a CMake rule that builds this executable. Please make sure to build in Release mode (as usual for performance measurements):
$ ./build/mmult_indices
Look carefully at the performance numbers that are printed. You should see two of the columns with not so good performance, two columns with okay performance, and two columns with the best performance. The columns within each pair have something in common – what is it?
Depending on your CPU and your compiler, some of the differences will be more or less pronounced, but they should be visible as the problem size increases.
Here is a discussion question for you on Discord. Explain what you are seeing in terms of data accesses at each inner loop iteration.
Save the output from a run of mmult_indices with “_0” in the results folder (i.e. as results/mmult_indices_0.log) to indicate it is a run before any optimizations have been introduced.
Answer the following questions in results/answers.md:
- There are three pairs of index orderings that give similar performance. What do those orderings have in common?
- Save the output from a run of mmult_indices into a file results/mmult_indices_0.log and submit that as part of your homework.
What is the Best Loop Ordering?
I claim that certain loop orderings give good performance because the memory access pattern is favorable. Is that consistent with what you are seeing in these (unoptimized) functions?
To guide your discussion, consider the following diagram of matrices A, B, and C.
Here, each box is a matrix entry. Recall that for row-ordered matrices (such as we are using in this course), the elements in each row are stored next to each other in memory.
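Concretely, row-major storage maps a two-dimensional index to a one-dimensional offset. Here is a minimal sketch (assuming a flat std::vector<double> backing store; the details of our actual Matrix class may differ):

#include <cstddef>
#include <vector>

// Row-major storage: element (i, j) of an nrows x ncols matrix
// lives at offset i * ncols + j. Walking along a row (increasing j)
// touches adjacent memory; walking down a column (increasing i)
// jumps ncols elements at a time.
struct RowMajorMatrix {
  size_t nrows, ncols;
  std::vector<double> data;

  RowMajorMatrix(size_t m, size_t n) : nrows(m), ncols(n), data(m * n) {}

  double& operator()(size_t i, size_t j) { return data[i * ncols + j]; }
};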
Now, let’s think about what happens when we are executing matrix multiply with the “ijk” ordering. From one iteration of the inner loop to the next (from one step of the algorithm to the next), what are the next data elements to be accessed? For “ijk”, here is what happens next:
What happens in the case of each of the six orderings for mult – what are the next elements to be accessed in each case? Does this help explain the similarities between certain orderings – as well as why there are good, bad, and ugly (er, medium) pairs? Refer again to how data are stored for the three matrices A, B, and C.
Answer the following question in results/answers.md:
- Referring to the figures about how data are stored in memory, what is it about the best performing pair of loops that is so advantageous?
Optimizing mult_trans
If we look at the output from mmult, the biggest jump in performance comes between mult_trans_0 and mult_trans_1. The difference between the code in the two functions is quite small, though – we just “hoisted” the reference to C(i, j) out of the inner loop.
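The hoisted version looks roughly like this (a sketch; the actual mult_trans_1 in your repository may differ, and the temporary’s name here is made up):

void mult_trans_hoisted(const Matrix& A, const Matrix& B, Matrix& C)
{
  for (size_t i = 0; i < C.num_rows(); ++i)
    for (size_t j = 0; j < C.num_cols(); ++j) {
      double t = C(i, j);          // hoist the accumulator out of the inner loop
      for (size_t k = 0; k < A.num_cols(); ++k)
        t += A(i, k) * B(j, k);    // B(j, k): B is accessed as its transpose
      C(i, j) = t;                 // one write back per (i, j)
    }
}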
What will the data access pattern be when we are executing mult_trans in “ijk” order? What data are accessed in each of the matrices at step (i, j, k), and what data are accessed at step (i, j, k+1)? Are these accesses advantageous in any way?
Answer the following question in results/answers.md:
- Referring again to how data are stored in memory, explain why hoisting C(i, j) out of the inner loop is so beneficial in mult_trans with the “ijk” loop ordering.
Using mult_trans to Optimize mult (Take 1)
Modify mult_ijk so that it achieves performance on par with mult_trans_1 (or mult_trans_2 or mult_trans_3). (Hint: Be lazy: mult_trans_1 et al. are already written. Don’t repeat yourself. Don’t even copy and paste.)
Compile and run mmult_indices with your modified version of mult_ijk. Save the output to a file results/mmult_indices_1.log.
Sometimes when you are developing code, you might want to keep older code around to refer to, or to go back to if an exploration you are making doesn’t pan out. One common way of saving code like this is to use the preprocessor to “#if 0” out the code you don’t want compiled:
#if 0
  Stuff I want to save but not compile
#endif
You may (but are not required to) “#if 0” the initial version of mult_ijk.
Answer the following questions in results/answers.md:
- Save the output from a run of your optimized mmult_indices into a file results/mmult_indices_1.log and submit that as part of your homework.
Using mult_trans to Optimize mult (Take 2)
Modify mult_3 so that it achieves performance on par with mult_trans_1 (or mult_trans_2 or mult_trans_3). (This may not be quite as straightforward as for mult_ijk.)
Compile and run mmult_indices with your modified version of mult_3 with problem sizes up to 1024 (say). Save the output to a file results/mmult_indices_2.log.
Answer the following questions in results/answers.md:
- What optimization is applied in going from mult_2 to mult_3?
- How does your maximum achieved performance for mmult_indices (any version) compare to what bandwidth and roofline predicted? Show your analysis.
- Save the output from a run of your optimized mmult_indices into a file results/mmult_indices_2.log and submit that as part of your homework.
Array of Structs vs. Struct of Arrays
To store the “color” of a particular pixel, a color image stores the values of the three additive primary colors (red, green, and blue) necessary to reconstruct that color. A color image is thus a three-dimensional structure: 3 color planes, each of which has the same height and width as the image.
A question then arises: what is the best way to store and access that data in memory? We could, for instance, store each color plane separately, so that the image is a “structure of arrays”:
Or, we could store the three color values of each pixel together, so that the image is an “array of structures”:
Or, we could store just a single large array, treating the image as a “tensor” and use a lookup formula to determine how the data are stored.
The struct of arrays data structure could be implemented like this:
#include <array>
#include <cstddef>
#include <vector>

class SOA_Image {
public:
  SOA_Image(size_t M, size_t N)
    : nrows(M)
    , ncols(N)
    , storage{std::vector<double>(M * N),
              std::vector<double>(M * N),
              std::vector<double>(M * N)}  // one M x N plane per color
  {}

  size_t num_rows() const { return nrows; }
  size_t num_cols() const { return ncols; }

  const double& operator()(size_t i, size_t j, size_t k) const {
    return storage[k][i * ncols + j];
  }
  double& operator()(size_t i, size_t j, size_t k) {
    return storage[k][i * ncols + j];
  }

private:
  size_t nrows;
  size_t ncols;
  std::array<std::vector<double>, 3> storage;
};
Again, the terminology “struct of arrays” is generic, referring to how data are organized in memory. Our implementation (in storage) of the “struct of arrays” is as an array of vectors. Note also that the accessor operator() accesses pixels with (i, j, k), where i and j are pixel coordinates and k is the color plane.
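For example, here is a hypothetical use of SOA_Image (the red = 0, green = 1, blue = 2 plane ordering is an assumption):

SOA_Image img(480, 640);   // 480 rows by 640 columns
img(2, 5, 1) = 0.5;        // write the green (k == 1) value of pixel (2, 5)
double g = img(2, 5, 1);   // read it back from the green plane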
Similarly, the array of structs data structure could be implemented this way:
class AOS_Image {
public:
  AOS_Image(size_t M, size_t N)
    : nrows(M)
    , ncols(N)
    , storage(M * N)  // one 3-element struct per pixel
  {}

  const double& operator()(size_t i, size_t j, size_t k) const {
    return storage[i * ncols + j][k];
  }
  double& operator()(size_t i, size_t j, size_t k) {
    return storage[i * ncols + j][k];
  }

  size_t num_rows() const { return nrows; }
  size_t num_cols() const { return ncols; }

private:
  size_t nrows;
  size_t ncols;
  std::vector<std::array<double, 3>> storage;
};
Again, the terminology “array of structs” is generic. Our implementation (in storage) of the “array of structs” is as a vector of arrays. Note also that the accessor operator() accesses pixels with (i, j, k), where i and j are pixel coordinates and k is the color plane.
Finally, a “tensor” based storage could be implemented as:
class Tensor_Image {
public:
  Tensor_Image(size_t M, size_t N)
    : nrows(M)
    , ncols(N)
    , storage(3 * M * N)  // one flat allocation for all three planes
  {}

  const double& operator()(size_t i, size_t j, size_t k) const {
    return storage[k * nrows * ncols + i * ncols + j];
  }
  double& operator()(size_t i, size_t j, size_t k) {
    return storage[k * nrows * ncols + i * ncols + j];
  }

  size_t num_rows() const { return nrows; }
  size_t num_cols() const { return ncols; }

private:
  size_t nrows;
  size_t ncols;
  std::vector<double> storage;
};
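With the tensor formula, j varies fastest in memory, then i, then k; incrementing k jumps a full nrows * ncols plane. A quick sanity check of that layout (a hypothetical test program with made-up sizes, relying on std::vector being contiguous):

#include <cassert>

int main() {
  Tensor_Image t(480, 640);
  double* base = &t(0, 0, 0);
  assert(&t(0, 1, 0) - base == 1);          // next column: stride 1
  assert(&t(1, 0, 0) - base == 640);        // next row: stride ncols
  assert(&t(0, 0, 1) - base == 480 * 640);  // next plane: stride nrows * ncols
  return 0;
}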
Blurring
Although blurring an image sounds like something you would not want to do, small amounts of blur can be useful in image processing for, say, removing high-frequency noise from an image. One simple blurring function is a box blur, which computes each output pixel as the average of nine neighboring input pixels.
So, for instance, we could blur an image with the following:
// Iterate through the image
for (size_t i = 1; i < img.num_rows() - 1; ++i) {
  for (size_t j = 1; j < img.num_cols() - 1; ++j) {
    // Iterate through the color planes
    for (size_t k = 0; k < 3; ++k) {
      // Apply blurring kernel
      blurred(i, j, k) = (
        img(i + 0, j + 0, k) +
        img(i + 0, j + 1, k) +
        img(i + 0, j - 1, k) +
        img(i + 1, j + 0, k) +
        img(i + 1, j + 1, k) +
        img(i + 1, j - 1, k) +
        img(i - 1, j + 0, k) +
        img(i - 1, j + 1, k) +
        img(i - 1, j - 1, k)) / 9.0;
    }
  }
}
Note that since we have defined the accessor generically, the inner blurring kernel will be the same regardless of which representation we use for the image. We do have a choice, however, in how we access the data: do we iterate over the color planes in the outer loop or in the inner loop?
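For comparison, here is a sketch of the plane-outer variant (same nine-point kernel and the same hypothetical img and blurred objects as above, with the k loop moved outside):

// Iterate over the color planes in the outer loop
for (size_t k = 0; k < 3; ++k) {
  for (size_t i = 1; i < img.num_rows() - 1; ++i) {
    for (size_t j = 1; j < img.num_cols() - 1; ++j) {
      // Same nine-point box-blur kernel as above
      blurred(i, j, k) = (
        img(i + 0, j + 0, k) + img(i + 0, j + 1, k) + img(i + 0, j - 1, k) +
        img(i + 1, j + 0, k) + img(i + 1, j + 1, k) + img(i + 1, j - 1, k) +
        img(i - 1, j + 0, k) + img(i - 1, j + 1, k) + img(i - 1, j - 1, k)) / 9.0;
    }
  }
}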
Loop Ordering
In your source code for this assignment is the file soa_vs_aos.cpp. From it you can generate the executable soa_vs_aos. When you run that executable on an image, it will apply six different blurring operations: one for each of the three formats (aos, soa, tensor), with the image planes iterated on the outer loop and on the inner loop. Read the source code to familiarize yourself with which blurring kernel does which. Note, as mentioned above, that the inner kernel of each function is the same – and that the functions doing the color plane on the outer loop are otherwise the same (similarly for those doing it on the inner loop). (If it seems like we are violating DRY, we are; we will see how to fix this in a few more assignments.)
Your source code has one julia image: results/julia.bmp. You can also generate your own images based on your previous assignment. Run the program on them. The program will print out execution times for the six implementation variants of the blurring function, as well as save the blurred output for each variant; if you don’t want these output images (or to spend the time computing them), you can comment out the lines in the code that save them:
$ cd results
$ ../build/soa_vs_aos julia.bmp
Answer the following questions in results/answers.md:
- Which variant of the blurring function, struct of arrays or array of structs, gives the better performance? Explain what about the data layout and access pattern would result in better performance.
- Which variant of the blurring function has the best performance overall? Explain, taking into account not only the data layout and access pattern but also the accessor function.
Next up: Buddhabrot Fractal