Measuring Performance
Timing
As discussed in lecture – if we are going to achieve “high performance computing” we need to be able to measure performance. Performance is the ratio of how much work is done in how much time – and so we need to measure (or calculate) both work and time to quantitatively characterize performance.
To measure time, in lecture we also introduced the class timer (also available as timer.hpp in the starter code repository):
#pragma once

#include <chrono>

class timer {
private:
  using time_t = std::chrono::time_point<std::chrono::system_clock>;
  time_t start_time, stop_time;

public:
  timer() = default;
  time_t start() { return (start_time = std::chrono::system_clock::now()); }
  time_t stop() { return (stop_time = std::chrono::system_clock::now()); }
  double elapsed() {
    auto diff = stop_time - start_time;
    return std::chrono::duration_cast<std::chrono::milliseconds>(diff).count();
  }
};
To use this timer, you just need to include it (with the appropriate preprocessor directive) in any file where you want to use a timer object. To start timing, invoke the start() member function; to stop the timer, invoke the stop() member function. The elapsed time (in milliseconds) between the most recent calls to start() and stop() is reported by elapsed().
Include guards
In C++ we are often including many different header files, and those in turn include other header files. But what happens if header one.hpp includes two.hpp and two.hpp includes one.hpp? Or even if main.hpp includes two.hpp, which already includes one.hpp? It’s very important that the contents of each header only be read once, even though the source code being compiled may have multiple #include statements for a given file. The standard technique for doing this is to add a #pragma once at the beginning of the include file. In timer.hpp you will see its include guard statement:

#pragma once
// ... file contents

This technique uses the preprocessor to make sure that the file is included exactly once. You can find a bit more information on the preprocessor in The Pre-Processor.
To practice using the timer class, compile the program timer.cpp:
#include <print>
#include <string>

#include "timer.hpp"

int main(int argc, char* argv[]) {
  if (argc != 2) {
    std::println("Usage: {} N", argv[0]);
    return 1;
  }
  size_t loops = std::stol(argv[1]);

  timer t;
  t.start();
  for (size_t i = 0; i < loops; ++i)
    ;
  t.stop();

  std::println("{} loops took {} milliseconds", loops, t.elapsed());

  return 0;
}
First, to get a baseline, compile it with no optimization at all (using Debug mode). How many loops do you need to run to get an elapsed time of (about) 1 second? How fast is each loop iteration being executed? Note that the empty statement “;” in the loop body just means “do nothing.”
Second, let’s look at how much optimizing this program will help. Compile the same program as before, but this time in Release mode. How much did your program speed up? If it executes too quickly, again, try increasing the loop count. How much time per iteration is spent now? Try enough cases to see if your answer makes sense.
For instructions on how to switch between Debug and Release modes, see Compilation Modes.
If you are unsure about the answer you are getting here, start a discussion on Discord. Try to have this discussion sooner rather than later, as you will need some of the information gained for later in this assignment.
Abstraction Penalty and Efficiency
One question that arises as we continue to optimize our algorithms is: how much performance is available? The performance gains we saw in class were impressive, but are we doing well in an absolute sense? To flip the question around, and perhaps make it more specific: we are using a fairly deep set of abstractions to give ourselves notational convenience. That is, rather than computing linear offsets from a pointer directly to access memory, we are invoking a member function of a class (recall that operator() is just a function). Then from that function we are invoking another function in the underlying std::vector<double> class. And there may even be more levels of indirection underneath that function – we don’t know what the vector’s operator[] function does under the hood.

Calling a function requires a number of operations to be performed by the CPU: saving return addresses on the stack, saving parameters on the stack, jumping to a new program location – and then unwinding all of that when the function call has completed. When we were analyzing algorithms in lecture, we assumed that the inner loop involved just a small number of memory accesses and floating point operations. We didn’t consider the cost we might pay for all of those function calls – calls we could be making at every iteration of the multiply function. If we were making those calls – or doing anything else extraneous – we would also be measuring them when timing the multiply function. And, obviously, we would be giving up performance. The performance loss due to the use of programming abstractions is called the abstraction penalty.
One can measure the difference between achieved performance and maximum possible performance as a ratio – as efficiency. Efficiency is simply:

\[\text{Efficiency} = \frac{\text{Achieved performance}}{\text{Maximum performance}}\]

Measuring Maximum Performance
Let’s write a short program to measure maximum performance – or at least measure performance that has the basic operations in our algorithm, but without abstractions in the way, for instance:
size_t loops = std::stol(argv[1]);
double a = 3.14, b = 3.14159, c = 0.0;

timer t;
t.start();
for (size_t i = 0; i < loops; ++i) {
  c += a * b;
}
t.stop();
To save space, I am just including the timing portion of the program. In this loop we are multiplying two doubles and adding the result to a third double.
Time this loop with and without optimization (Debug and Release modes). What happens? How can this be fixed? By which I mean: we would like to have a number printed out that helps us estimate the performance of the computer for the loop in question when optimization is enabled. Again, this is a question I would like the class as a whole to discuss on Discord and arrive at a solution.
Bonus
Consider the program above, but now with double replaced by long. What happens when you time this loop with and without optimization (in Release and Debug modes)? What happens when you apply the same fix as for the double example above? Again, discuss as a class.
size_t loops = std::stol(argv[1]);
long a = 3, b = 14, c = 0;

timer t;
t.start();
for (size_t i = 0; i < loops; ++i) {
  c += a * b;
}
t.stop();
Exercises
Floats and Doubles
Most microprocessors today support (at least) two kinds of floating point numbers: single precision (float) and double precision (double). In the (distant) past, arithmetic operations using float were more efficient than those using double. Moreover, the amount of chip space that had to be devoted to computing with float was smaller than with double. Today these differences in computation time are less pronounced (or non-existent); an arithmetic operation with a float takes the same amount of time as an arithmetic operation with a double.
The IEEE 754 floating point standard specifies how many bits are used for representing floats and doubles, how they are each represented, what the semantics are for various operations, and what to do when exceptions occur (divide by zero, overflow, denormalization, etc.). This standard, and the hardware that supports it (notably the original 8087 chip), is one of the most significant achievements in numerical computing. (William Kahan won the Turing Award for his work on floating point numbers.)
Since a double has so much better precision than a float, and since the computation costs are a wash, why would one ever use a float these days? Interestingly, 16-bit floating point numbers are starting to become available on many architectures in order to support machine learning. Again, if the computational cost is the same, why would one ever give up that much precision?
In this exercise we are going to investigate some operations with doubles and floats and see what may or may not still be true (and hypothesize why).
The program float_vs_double.cpp has two sections: one for single precision and one for double precision. The functionality of each section is the same; only the data types are different. Each section times two things: the time it takes to allocate and initialize three arrays, and the time it takes to multiply two of them and set the third to the result. The section for double looks like this (with the version for float being very similar):
{
  size_t dim = std::stol(argv[1]);

  timer t;
  t.start();
  std::vector<double> x(dim, 3.14);
  std::vector<double> y(dim, 27.0);
  std::vector<double> z(dim, 0.0);
  t.stop();
  std::println("Construction time: {} [ms]", t.elapsed());

  t.start();
  for (size_t i = 0; i < dim; ++i)
    z[i] = x[i] * y[i];
  t.stop();
  std::println("Multiplication time: {} [ms]", t.elapsed());
}
For this exercise, you are asked to generate four sets of numbers by running float_vs_double in Debug and Release mode: single precision with optimization off, double precision with optimization off, single precision with optimization on, and double precision with optimization on. Run the programs with a variety of problem sizes that give you times between 100 ms and 1 s (more or less). Explain your results.
Efficiency
For this problem we want to investigate how to properly measure a loop with just a multiply and an add in the inner loop. In particular, we want to see if that can tell us the maximum FLOP/s count you can get on one core. We would like to relate that to the clock rate of the CPU on your computer and estimate how many floating point operations are being done on each clock cycle. See the section marked “Measuring Maximum Performance” above.
Once you have a meaningful answer for the maximum floating point rate of your computer, we can use this result as the denominator in computing efficiency (fraction of peak).
Modify the program in float_vs_double.cpp so that the timing gives you a more sensible answer. (By “sensible” I mean something useful for measuring the performance of your computer running a long series of arithmetic operations.) Note that since we are interested in maximum efficiency, we are interested in the rate possible with fully optimized code. Give some thought to the numbers that you generate and make sure they make sense to you. Use double precision for your experiments.
Questions
Answer the following questions in results/answers.md.
- What is the difference (ratio) in execution times between single and double precision for construction with no optimization? Explain.
- What is the difference (ratio) in execution times between single and double precision for multiplication with no optimization? Explain.
- What is the difference (ratio) in execution times between single and double precision for construction with optimization? Explain.
- What is the difference (ratio) in execution times between single and double precision for multiplication with optimization? Explain.
- What is the difference (ratio) in execution times for double precision multiplication with and without optimization? Explain.
Some utility functions
To run our test programs for this problem set we need to be able to compare two matrices. More specifically, we need to be able to compare the result of doing two equivalent computations – multiplying two matrices with two different algorithms. In the ideal mathematical world of real numbers, two equivalent ways of computing a given result will give exactly the same answer (that is what is meant by equivalent) – meaning, the difference between the two results is exactly zero.
Unfortunately, in the world of finite precision, we don’t necessarily have that nice property. Two answers computed in two different ways should be close, but it is very difficult to ensure exact equality. (That is why, in general, you should never test even individual floating point numbers against each other for equality, much less entire matrices or vectors.)
To check whether two computations produce the same result for matrix computations, instead of looking for exact equality, we are going to look for something “small”. In particular, we will assume two non-zero matrices are “close enough” if

\[|| A - B || < \epsilon \times || A ||\]

where \(\epsilon\) is small (1e-12, say).
In your repository is a program test_matrix_mult.cpp which compiles to ./build/test_matrix_mult. This program runs the four different versions of matrix-matrix product contained in the file matrix_operators.cpp and tests to make sure they all compute the “same” answer, using an expression like:

REQUIRE(f_norm(A - B) < 1.e-12 * f_norm(A));

Notice the nice natural syntax in this expression for computing the difference between two matrix objects.
As was discussed in lecture, this functionality is provided by overloading a function with a particular name, namely operator-(), which takes two matrix objects (by const reference) as arguments and returns a new matrix object.
matrix operator+(matrix const& A, matrix const& B);
matrix operator-(matrix const& A, matrix const& B);
Question for discussion on Discord: We saw that it is a “good thing” to pass large objects by reference (or by const reference), since then we don’t pay the overhead of copying. Should the same apply to returning objects? Note that the return type of operator-() is a matrix, not a matrix& or a matrix const&. Why are we not returning the matrix C by reference?
The sum of two matrices is defined as the element-wise sum of their elements. That means that the matrix returned from operator+() should be created by adding the elements of the two matrices passed in as the arguments. Similarly, the difference of two matrices is defined as the element-wise difference of their elements. The f_norm of a matrix (also called the ‘Frobenius norm’ or ‘Euclidean norm’) is defined as the square root of the sum of the squares of all matrix elements.
The version of test_matrix_mult.cpp that you have is incomplete. In particular, the bodies of operator-() and operator+() are not filled in, and neither is the body of f_norm(). Complete the definitions of f_norm(), operator+(), and operator-().
Once your definitions are complete (they should go into the file matrix_operators.cpp – this file contains some TODO comments), the test program ./build/test_matrix_operators should pass, as should the test program ./build/test_matrix_mult. We are testing these functions because we will be using them later in testing other functions.
matrix-matrix product
Your starter repository has a simple matrix-matrix product driver program in matrix_mult_time.cpp. The program multiplies two matrices together and reports the time. You pass in the matrix size on the command line. As given to you, the program calls the function mult_0, found in matrix_operators.cpp:
void mult_0(matrix const& A, matrix const& B, matrix& C) {
  for (size_t i = 0; i < C.num_rows(); ++i) {
    for (size_t j = 0; j < C.num_cols(); ++j) {
      for (size_t k = 0; k < A.num_cols(); ++k) {
        C(i, j) += A(i, k) * B(k, j);
      }
    }
  }
}
This is just our basic matrix-matrix product. Run ./build/matrix_mult_time for a few problem sizes, say powers of two from 128 to 1024.
Replace the call to mult_0 with a call to mult_3 and repeat for the same problem sizes. You may also need to try some larger problem sizes to get useful times.
Questions
Answer the following questions in results/answers.md.
- If you double the problem size for matrix-matrix product, how many more operations will matrix-matrix product perform?
- If the time to perform a given operation is constant, when you double the problem size for matrix-matrix product, how much more time will be required to compute the answer?
- What ratio did you see when doubling the problem size when ./build/matrix_mult_time called mult_0? (Hint: it may be larger than what the pure operation count would predict.) Explain.
- What ratio did you see when doubling the problem size when ./build/matrix_mult_time called mult_3? Was this the same as for mult_0? Referring to the function in matrix_operators.cpp, what optimizations are implemented and what kinds of performance benefits might they provide?
- (Extra credit.) Try also with mult_1 and mult_2. Explain.
Next up: Monte Carlo Methods.