Friday, July 27, 2012

A generic loop unroller based on template meta-programming

Loop unrolling (or unwinding) is a code transformation used by compilers to improve the utilization of the functional units present in modern superscalar CPUs. Processors have a pipelined architecture consisting of multiple stages (at least 5). While the CPU is executing an instruction in one of the stages, it can simultaneously load and decode the next operation pointed to by the program counter. In the presence of branch instructions, however, the CPU needs to wait for the decode stage to know whether the branch has been taken, so that it can adjust the program counter and load the correct next assembly instruction. Over the years several architectural optimizations have been introduced to reduce the problem (e.g. branch prediction units), but in specific situations the CPU can still lose up to 20 cycles because of a branch instruction.

For this reason it is very important to reduce the number of branches in the input code. This is the job of the compiler, since it is the software agent closest to the actual hardware and it can produce code which better fits the underlying CPU. However, compilers are quite complex and they often fail to apply even elementary optimizations. Loop unrolling is one example. Because the compiler often fails to perform this transformation, developers, especially in High Performance Computing (HPC), tend to tune their code by manually unrolling loops. This is (in my opinion) a bad practice, since the tuned code is no longer portable (the optimal unroll factor for one machine can be bad for another). A second problem is that manually unrolling a loop replicates its body many times, which is never a good thing when a bug shows up.

However, we don't necessarily have to give up performance if we want code readability and structure. With C++11 we can have them both. :) The idea is to have a meta-function, which we call unroller, that takes care of unrolling N invocations of a generic functor, like below:
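
The original listing is not reproduced here, so what follows is only a minimal sketch of what such a meta-function could look like. The member name `call`, the index argument passed to the functor and the `unroller<0>` termination case are assumptions made for illustration.

```cpp
#include <cstddef>

// Sketch (assumed interface): invokes f(0), f(1), ..., f(N-1).
// The recursion is expanded by the compiler, so no run-time loop is left.
template <std::size_t N>
struct unroller {
    template <class Functor>
    static inline void call(const Functor& f) {
        unroller<N - 1>::call(f);  // expand the first N-1 invocations
        f(N - 1);                  // then perform the N-th one
    }
};

// Termination case of the recursion: nothing left to invoke.
template <>
struct unroller<0> {
    template <class Functor>
    static inline void call(const Functor&) {}
};
```

With this sketch, `unroller<4>::call(f)` expands to `f(0); f(1); f(2); f(3);` once the compiler has inlined the nested calls.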

This simple function takes care of unrolling N function calls. Because we use inline, and the recursion is resolved at compile time, an optimizing compiler will typically generate code that contains no function calls at all. The next code snippet shows how we can use the unroller:
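
As a hedged usage sketch: the kernel `scale`, the default value of `UnrollFactor` and the remainder loop are all hypothetical, and the unroller is repeated in an assumed form so that the example is self-contained.

```cpp
#include <cstddef>
#include <vector>

#ifndef UnrollFactor
#define UnrollFactor 4  // assumed default, can be overridden at compile time
#endif

// Assumed form of the unroller: invokes f(0), ..., f(N-1).
template <std::size_t N>
struct unroller {
    template <class Functor>
    static inline void call(const Functor& f) {
        unroller<N - 1>::call(f);
        f(N - 1);
    }
};
template <>
struct unroller<0> {
    template <class Functor>
    static inline void call(const Functor&) {}
};

// Hypothetical kernel: multiplies every element of v by k.
inline void scale(std::vector<double>& v, double k) {
    const std::size_t n = v.size();
    std::size_t i = 0;
    // Main loop: one branch every UnrollFactor elements.
    for (; i + UnrollFactor <= n; i += UnrollFactor) {
        unroller<UnrollFactor>::call([&](std::size_t j) { v[i + j] *= k; });
    }
    // Remainder loop for the elements that do not fill a whole block.
    for (; i < n; ++i) {
        v[i] *= k;
    }
}
```

The loop body is written once, inside the lambda, so a bug fix only has to be applied in a single place regardless of the unrolling factor.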

And that's it. You can now control the unrolling factor through the preprocessor macro UnrollFactor, which means you can either define it in your code or provide a value through a Makefile (e.g. with a -D compiler flag) in order to fit the target architecture.

The next question comes naturally: how much slower is this loop going to be compared with the C-like version, which uses no lambda and can be unrolled by the underlying compiler? This is a legitimate question, and that's why we are going to do some good old benchmarking right now! :)

The loop our unroller will fight against is the following:

The unroller turned out to be pretty awesome. As expected we have a 2.5x speedup. The amount of computation in the loop body is large enough that we start seeing improvements from an unrolling factor of 2 or 5. However, depending on the number of iterations of the loop and the amount of computation, the best unrolling factor may change.

C++ <3

Thursday, July 26, 2012

decltype Insanity... a.k.a. when the return type depends on the function itself?

In a previous post I introduced the basic usage of the auto and decltype keywords. In most cases auto and decltype are a shortcut that saves us from writing down the type of an object by hand. However, these two new keywords also enable new capabilities of the C++ language.

This week I stumbled on one of those cases. I don't want to overload you with the details of the specific problem I was solving, so I will use an equivalent example just for the sake of presentation. What I want to achieve is the following: I want to write a meta-function which takes a variable number of types as input and returns a type structured as a (degenerate) binary tree. An example follows:
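
The original example is not reproduced here. As an illustration only, assuming the inputs are the four types int, double, char and float (a hypothetical choice), the desired result nests std::pair to the right:

```cpp
#include <type_traits>
#include <utility>

// Desired type for the inputs (int, double, char, float):
// every node holds one element on the left and the rest of the tree on the right.
typedef std::pair<int, std::pair<double, std::pair<char, float> > > Tree4;

// The right child of the root is itself a tree built from the remaining types.
static_assert(std::is_same<Tree4::second_type,
                           std::pair<double, std::pair<char, float> > >::value,
              "the right child is the tree of the remaining types");
```
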
The way we want to write this in C++ is by defining a function which makes use of variadic templates (a feature introduced in C++11). In this way we can have a typed function accepting an unbounded number of arguments. However, as soon as you start declaring the function signature, you realize that you cannot write the return type of this function:
How do we get around this? First of all, we have to realize that the return type depends on the type and number of the function's parameters. Fortunately, C++11 introduced so-called trailing return types which, as the name suggests, allow the return type of a function to be specified after the input parameters (so that the actual function parameters can be used in the return type). Second, because meta-programming is heavily based on recursion, we must define a termination case for the recursion. In this case it boils down to a function accepting 2 arguments and returning the two bundled into a pair object.
Now the only thing remaining is to write down the return type of this function. What we want to say is that the return type is a pair whose first element has type Arg1, while the second element has the return type of the recursive invocation of the same function on the rest of the provided arguments. Thanks to decltype, we can write it down as:
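
The original listing is not reproduced here. The sketch below is a reconstruction based on the two signatures that appear in the compiler diagnostics quoted later in this post; the function bodies are assumptions.

```cpp
#include <type_traits>
#include <utility>

// Termination case: two arguments are bundled into a pair.
template <class LhsTy, class RhsTy>
std::pair<LhsTy, RhsTy> makeTree(const LhsTy& lhs, const RhsTy& rhs) {
    return std::pair<LhsTy, RhsTy>(lhs, rhs);
}

// Recursive case: the trailing return type can refer to the parameters,
// so decltype can ask for the type of the recursive call.
template <class Arg1, class Arg2, class Arg3, class... Args>
auto makeTree(const Arg1& arg1, const Arg2& arg2, const Arg3& arg3,
              const Args&... args)
    -> std::pair<Arg1, decltype(makeTree(arg2, arg3, args...))> {
    return std::make_pair(arg1, makeTree(arg2, arg3, args...));
}
```

Under these assumptions, `makeTree(1, 2.5, 'c')` has type `std::pair<int, std::pair<double, char>>`. The three-argument call only needs the two-argument overload inside decltype, and that overload is declared earlier.
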
Now we can test whether everything is working by using the following simple main:
I have to admit I was quite impressed for a couple of seconds when I saw the first two tests working. At this point I should say which C++ compiler I am using: my machine currently has GCC 4.7.1 installed. Considering that many C++ compilers are still struggling to support variadic templates, the fact that GCC can already digest code like this is quite cool. As my research activity mostly revolves around compilers, I was wondering how GCC could internally determine the type of the makeTree function, considering that the type depends on itself. However, I soon found out that GCC is not that smart: if we try to invoke the function with more than 3 arguments, the compiler generates the following error:
test.cpp: In function ‘int main(int, char**)’:
test.cpp:26:22: error: no matching function for call to ‘makeTree(int, int, int, int)’
test.cpp:26:22: note: candidates are:
test.cpp:8:24: note: template<class LhsTy, class RhsTy> std::pair<_T1, _T2> makeTree(const LhsTy&, const RhsTy&)
test.cpp:8:24: note:   template argument deduction/substitution failed:
test.cpp:26:22: note:   candidate expects 2 arguments, 4 provided
test.cpp:13:6: note: template<class Arg1, class Arg2, class Arg3, class ... Args> std::pair<Arg1, decltype (makeTree(arg2, arg3, makeTree::args ...))> makeTree(const Arg1&, const Arg2&, const Arg3&, const Args& ...)
test.cpp:13:6: note:   template argument deduction/substitution failed:
test.cpp: In substitution of ‘template<class Arg1, class Arg2, class Arg3, class ... Args> std::pair<Arg1, decltype (makeTree(arg2, arg3, args ...))> makeTree(const Arg1&, const Arg2&, const Arg3&, const Args& ...) [with Arg1 = int; Arg2 = int; Arg3 = int; Args = {int}]’:
test.cpp:26:22:   required from here
test.cpp:13:6: error: no matching function for call to ‘makeTree(const int&, const int&, const int&)’
test.cpp:13:6: note: candidate is:
test.cpp:8:24: note: template<class LhsTy, class RhsTy> std::pair<_T1, _T2> makeTree(const LhsTy&, const RhsTy&)
test.cpp:8:24: note:   template argument deduction/substitution failed:
test.cpp:13:6: note:   candidate expects 2 arguments, 3 provided
Which made me very sad. Now I wonder whether the C++ code I wrote fails because of a faulty implementation in GCC or because what I wrote is not valid C++11 code. Comments are welcome.
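
One likely explanation is that this is a language rule rather than a compiler bug: a function's name is not yet in scope inside its own trailing return type, so the makeTree call inside decltype can only see overloads declared earlier (here, the two-argument one). Argument-dependent lookup does not help either, because built-in types such as int have no associated namespace. A possible workaround, sketched below, computes the return type with a helper class template instead of decltype. The name `TreeType` is hypothetical.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical helper: computes the nested pair type by recursing over
// class template specializations, which may refer to themselves.
template <class... Ts>
struct TreeType;

template <class L, class R>
struct TreeType<L, R> {
    typedef std::pair<L, R> type;
};

template <class Head, class N1, class N2, class... Rest>
struct TreeType<Head, N1, N2, Rest...> {
    typedef std::pair<Head, typename TreeType<N1, N2, Rest...>::type> type;
};

// Termination case: two arguments.
template <class LhsTy, class RhsTy>
std::pair<LhsTy, RhsTy> makeTree(const LhsTy& lhs, const RhsTy& rhs) {
    return std::pair<LhsTy, RhsTy>(lhs, rhs);
}

// Recursive case: inside the body the function's own name is visible,
// so the recursive call resolves for any number of arguments.
template <class Arg1, class Arg2, class Arg3, class... Args>
typename TreeType<Arg1, Arg2, Arg3, Args...>::type
makeTree(const Arg1& arg1, const Arg2& arg2, const Arg3& arg3,
         const Args&... args) {
    typedef typename TreeType<Arg1, Arg2, Arg3, Args...>::type Result;
    return Result(arg1, makeTree(arg2, arg3, args...));
}
```

With this version `makeTree(1, 2, 3, 4)` should compile and yield a `std::pair<int, std::pair<int, std::pair<int, int>>>`.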

C++ <3

Wednesday, July 25, 2012

Printing tuples

One of the power tools of the C++11 standard library is the tuple, a.k.a. std::tuple<Args...>. It was not possible to have such a utility in its fully general form in C++98 because of the lack of variadic templates. Indeed, std::tuple relies heavily on this feature, which was introduced with the C++11 standard.

Tuples are fixed-size collections of heterogeneous objects. A tuple can be considered a generalization of a struct's member variables. Its use is very similar to the std::pair class, which has been available since C++98. However, while a pair can only contain 2 elements, a tuple can hold any number of elements, fixed at compile time.

The standard way of using tuples in C++11 is the following:
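
The original listing is not reproduced here. The sketch below is reconstructed from the description of t1, t2 and t3 in this post; the element values, the element types and the function name `tupleBasics` are assumptions.

```cpp
#include <cassert>
#include <string>
#include <tuple>

void tupleBasics() {
    // t1: the element types (int, double, std::string) are deduced.
    auto t1 = std::make_tuple(1, 2.5, std::string("one"));

    // t2: built with the tuple constructor and a braced initializer list.
    // Its first element is a const reference to the first element of t1.
    std::tuple<const int&, char> t2{std::get<0>(t1), 'x'};

    // t3: std::tie creates a tuple of lvalue references, here
    // std::tuple<int&, double&> referring to elements of t1.
    auto t3 = std::tie(std::get<0>(t1), std::get<1>(t1));

    // Writing through t3 changes t1, and t2 observes the change.
    std::get<0>(t3) = 42;
    assert(std::get<0>(t1) == 42 && std::get<0>(t2) == 42);
}
```
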
Tuple t1 is constructed using std::make_tuple, a utility function that eases the construction of tuples: we don't have to spell out the types of the individual elements, which are instead deduced through the template mechanism. Another way to build a tuple is shown for t2, for which we use the tuple class constructor. It is worth noting that we use an initializer list ( { } ), which is again one of the new features of the C++11 standard (and which we will cover one day). When building this tuple, the type of the first element is a const reference to the first element of tuple t1. Finally, tuple t3 is constructed using the std::tie function, which creates a tuple of lvalue references. Therefore, by writing to the first element of t3 we propagate the value to both t1 and t2, and the assert in the last line of the code snippet is satisfied.


When working with tuples, it is sometimes useful, for debugging purposes for example, to print their values to an output stream. This can be done by overloading the << operator. However, because access to the tuple's elements is strongly typed, we need some metaprogramming magic in order to print each element. The problem is the following: accessing an element of the tuple is only possible via the std::get function template, which takes a constant template parameter representing the index of the value we want to access. Because this value needs to be a constant expression (otherwise the compiler would not be able to determine the return type of the function), we cannot simply iterate through the elements of the tuple using a loop variable. What we need to do instead is use recursion:
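
The original listing is not reproduced here, so the following is only a sketch of the recursive approach. The helper name `TuplePrinter` and the "(a, b, c)" output format are assumptions.

```cpp
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <tuple>

// Hypothetical helper: prints the elements with index in [I, N).
// The index is a template parameter, so std::get<I> is always valid.
template <std::size_t I, std::size_t N>
struct TuplePrinter {
    template <class Tuple>
    static void print(std::ostream& out, const Tuple& t) {
        if (I != 0) out << ", ";                 // separator between elements
        out << std::get<I>(t);                   // print the I-th element
        TuplePrinter<I + 1, N>::print(out, t);   // recurse on the next index
    }
};

// Termination case: all N elements have been printed.
template <std::size_t N>
struct TuplePrinter<N, N> {
    template <class Tuple>
    static void print(std::ostream&, const Tuple&) {}
};

// Stream operator for any tuple whose elements are themselves printable.
template <class... Args>
std::ostream& operator<<(std::ostream& out, const std::tuple<Args...>& t) {
    out << "(";
    TuplePrinter<0, sizeof...(Args)>::print(out, t);
    return out << ")";
}
```

With this sketch, streaming `std::make_tuple(1, 2.5, std::string("abc"))` writes `(1, 2.5, abc)`.
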
This method is generic in the sense that it can be used to print any tuple. The output obtained by printing the tuples t1, t2 and t3 at the exit of the program is:
 
Easy, isn't it? :)


C++ <3

Tuesday, July 24, 2012

Once you go 'decltype'...

Since C++11 became available, several things have changed in C++, including the way you solve problems. It is indeed an interesting time for C++ developers: it's time to forget old practices and embrace the new features, because C++11 is truly awesome.

I want to start this blog by introducing decltype, a new keyword of the language which, given an expression, returns its type.

For example, in C++11 you can write this:
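
A minimal sketch of such a declaration (the initial value is an arbitrary choice):

```cpp
#include <type_traits>

int a = 10;
decltype(a) b = a;  // b is declared with the type of a

// Compile-time check that b really is an int.
static_assert(std::is_same<decltype(b), int>::value, "b is an int");
```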

The type of b will be derived from a, therefore it will be int. decltype accepts any expression, therefore it is also possible to write:
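
For instance, with an arbitrary arithmetic expression (the expression itself is a hypothetical choice):

```cpp
#include <type_traits>

int a = 10;
decltype(a * 2.5) b = a * 2.5;  // int * double yields double, so b is a double

// Note that the expression has to be written twice.
static_assert(std::is_same<decltype(b), double>::value, "b is a double");
```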

In this way we don't need to work out the type of the expression ourselves; we let the compiler do the job, since it already has this knowledge. Although useful, the syntax is quite cumbersome because we have to write the expression twice, which is probably not handy in real code. A better way to obtain the same result is the auto keyword, which is also part of the new C++11 standard. With auto, declaring b is as easy as follows:
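
A minimal sketch, using an arbitrary expression as the initializer:

```cpp
#include <type_traits>

int a = 10;
auto b = a * 2.5;  // the type of b is deduced from the initializer: double

static_assert(std::is_same<decltype(b), double>::value, "b is a double");
```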


The combination of auto and decltype allows the developer to take advantage of the type information inferred by the compiler. It also allows you to do more with much less code. An example I found myself using lately is the declaration of sets/maps that require an explicitly provided comparator. Before C++11 there were basically two ways of doing this. The first is to provide a function which implements the key comparator, e.g.:
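
A sketch of the function-based variant. The comparator `descending` and the typedef `DescSet` are hypothetical names.

```cpp
#include <set>

// Hypothetical key comparator: orders the integers from largest to smallest.
bool descending(int a, int b) { return a > b; }

// The comparator type has to be spelled out as a function-pointer type,
// and the function itself is passed to the constructor.
typedef std::set<int, bool (*)(int, int)> DescSet;

DescSet makeDescSet() { return DescSet(descending); }
```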

Or by using a functor object as follows:
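
A sketch of the functor-based variant. The class `Order` and its `descending` flag are hypothetical; the flag plays the role of the comparator's state.

```cpp
#include <set>

// Hypothetical functor: the ordering direction is stored as state.
struct Order {
    bool descending;
    explicit Order(bool d) : descending(d) {}
    bool operator()(int a, int b) const {
        return descending ? a > b : a < b;
    }
};

// The functor type is the comparator type; an instance carries the state.
typedef std::set<int, Order> OrderedSet;

OrderedSet makeOrderedSet(bool descending) {
    return OrderedSet(Order(descending));
}
```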

The advantage of using a functor object is that the comparator can have state. Similar behavior can be obtained using a function if static or global variables are being used.  

In C++11, the same thing can be coded with less effort as follows:
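
A sketch of the C++11 variant. The names `descending` and `DescSet` are hypothetical.

```cpp
#include <set>

// The lambda's closure type has no name we could write down,
// but decltype lets us refer to it.
auto descending = [](int a, int b) { return a > b; };

typedef std::set<int, decltype(descending)> DescSet;

// In C++11 a closure type cannot be default-constructed,
// so the lambda object is passed to the set's constructor.
DescSet makeDescSet() { return DescSet(descending); }
```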


We take advantage of the auto initializer to define a lambda function (which I will cover in the future), and using the decltype keyword we get its type (which we could not write down ourselves, since every lambda has its own unnamed type generated by the compiler).


This is just a basic introduction to the wonders of decltype. There is much more to unveil; however, as we say in Italian... "you cannot start running if you didn't learn to walk in the first place". Indeed, decltype might seem pretty lame and not very useful on its own, but combined with other features of C++11 it becomes a very powerful tool. Stay tuned for more C++ love.


C++ <3