Sunday, March 30, 2014

Generating Flex and Bison rules with X-Macros

Lately I found myself working on a pet project for which I needed a parser for an assembly-like programming language for a real architecture. Since I wanted to use this opportunity to learn something new, I decided to use a combination of tools, i.e. Flex (for lexing) and Bison (for parsing)... obviously in C++ :)

However, I found myself facing a pretty interesting problem for which I could not find an existing solution, so I set myself the goal of filling that void with this post.

The problem is easily explained: in the language I am parsing (let's call it X for brevity), there is a large number of opcodes, e.g. add, sub, bra. These operations have different properties (e.g. they accept a different number and type of parameters) and I wanted to have a centralized specification for those properties which I could pull out as needed throughout my project. This is a typical case where X-macros should be used. However, since Flex and Bison specification files are not C-like source files, I cannot use the preprocessor's capabilities there. Before we jump to the problem I want to recap what X-macros are and why they are awesome.

Let us consider the following example:

By properly defining the two macros in the opcodes.def file, thus taking advantage of the C preprocessor, we can reshape the information in different ways. As an example, consider a function which determines whether an argument is valid for a given operation:
As you would expect, the output of the code above is:
Is 'sub(_,_,_,reg)' valid? false
Is 'add(_,reg,_)'   valid? true
Is 'add(_,imm,_)'   valid? false
Is 'sub(_,_,imm)'   valid? true
Is 'sub(_,_,lab)'   valid? false

Cool, eh? And think how easy it is going to be to add a new opcode or change its semantics. This is quite standard practice in compiler-related projects; beware that working with macros can get messy... but what's not to love here? If you start getting weird errors from your compiler, it is sometimes instructive to run the code through the preprocessor alone and have a look at the generated output. This can be done in GCC using the -E option (check out an example for the code above here).
In my experience many programmers consider these meta-programming techniques (i.e., macros and templates) harmful and therefore to be avoided like the plague. Of course everyone is entitled to their opinion, but in a problem like this one, X-macros can save you a huge amount of time. The generated code is also a lot faster and more memory efficient than any implementation based on lookup tables.

Anyhow, I am digressing... I hope all the macro haters have had enough by now and have left this post, so that we can take things to the next level. Without all those 'C is better than C++' voices holding us back, we can move on to solving the actual problem which inspired this post. The idea is to use the properties of X-macros inside the lexer and parser specifications for a language parser. Since we are working in C++, the natural choice of tools is Flex (for lexing) and Bison (for parsing) [another choice could be Boost.Spirit, but I never felt that tool was mature enough to be practically useful, feel free to comment on this]. For those who are not aware of parser generator tools, these are (meta-)programs which take a language specification (usually in an EBNF form) and produce code to determine whether an input stream matches that specification.

This is not going to be a tutorial on how to use Flex and Bison; there are other tutorials/blogs you can check out for that (Google is your friend). I start by setting up a basic Flex and Bison specification for parsing our language X. X allows 1 instruction per line, where each instruction is composed of an operator and a list of comma-separated arguments. Labels are allowed as well. For example, the following is a valid X program:

add r1, r2, r3 
sub r1, r1, #10
bra r1, exit

Here follow the Flex and Bison specification files for the X language:
In order to compile this code we need a Makefile which triggers the generation of the lexer and parser. We are going to base our setup on CMake; check out the CMakeLists.txt file for details. You can test the code so far by checking out the following github repository (refer to the no_autogen branch).

You may start to notice the problem now. Our opcodes, which we defined in the opcodes.def file, need to be defined (again and again) in both the Flex and Bison specifications. Since these are not C/C++ files, we cannot rely on the preprocessor to generate these rules automatically. This makes adding new opcodes to the X language very cumbersome, since many files need to be modified. Here is my solution to the problem.

The idea is to generate temporary C/C++ files which reshape the information contained in the opcodes.def file into a form suitable for Flex and Bison. After that, we splice the content generated by the X-macros into the language specification. The workflow is not particularly difficult; the main effort is in automating the entire process so that the specification files are refreshed when needed (i.e., when we run make).

Let's start with the Flex specification file (lexer.ll). My solution to the problem is the following. We define a placeholder, i.e. @@{{OPCODES}}@@ (you can choose any name for the placeholder, but make sure it is unique within your Flex specification), which we are going to replace with content automatically generated by the preprocessor. We also rename the file, which from now on we are going to call the template file. The next thing we need to do is add to CMake the actions necessary to generate the content and replace the placeholder with it.
A similar thing can be done for the Bison specification file. Note that here things are a little bit more complex, since we need to declare opcodes both in the token section (line 28 of the parser.y file) and later in the grammar itself, within the parsing rules (line 53 of the parser.y file). The full CMakeLists.txt file is available here.

To perform the replacement of placeholders with the preprocessor output I wrote a simple python script. The script removes the empty lines and comments produced by the preprocessor and collapses lines where needed. I actually spent a lot of time looking for a way of doing this without the need for an additional script. I found a command line which does the same work, but unfortunately the way CMake treats the command line arguments in the COMMAND section made it impossible to use (damn you CMake). Anyway, in case anyone is curious:
awk -vf2="$(while read x; do echo $x; done)" '/$1/{print;print f2;next}1' $2
This line reads content from stdin into the variable 'f2', then matches the input file $2 against the regex provided in $1 and appends the content of f2 at that location. The patched file is returned on the standard output.

And that's really it. We can now add new operators to our opcodes.def, run make... and our parser will accept the extended language.

C++ <3

Saturday, May 18, 2013

My take on serialization (Part III: deserialize)

After seeing serialization, one fundamental thing is missing, i.e. deserialization!

There is not much to say here: this is the inverse of the operation we performed during serialization... therefore similar patterns apply.

And we are done! 

In the std::tuple<T...> deserializer I would have liked to use the commented line. With that line I could remove 20+ lines of code used by the deserialize_tuple method. However, that way the object is deserialized in the inverse order. The type is correct, but since it seems that the arguments of the make_tuple function are evaluated right-to-left, the resulting elements of the tuple are inverted. Therefore a serialized tuple (1,2,3) is deserialized back as (3,2,1) :(. This is caused by the fact that the apply function has side effects, and in C++ we cannot rely on the evaluation order of a function's arguments; therefore that code is not safe and it is better to take the safe solution (even though it might work on some C++ compilers).

Just to see if everything is working fine we write our usual test cases using google-test: 

Last thing to do is running a complete benchmark where we serialize and deserialize an object and compare it with boost::serialization. This time I compiled everything with optimizations enabled (-O3) and I am using gcc 4.8.0 20130502 (pre-release).

The code is similar to the one we saw in the previous post; this time I add a call to deserialize and a dumb if to make sure the compiler is not doing any dead-code elimination. The code for boost::serialization is similar, just trust me (I know I am Italian... it might be difficult... but come on, as my supervisor says... "give me a break").

The result is, well... quite impressive. I didn't do this exercise with performance in mind; rather, my goal was to eliminate a dependency on the boost libraries. Now I realize that boost is definitely doing something really wrong in the serialization library. The added storage of typing info does not justify the huge performance penalty. My solution is 20x faster! Since the messages I produce are half the size (thanks to the missing typing info), I would have expected boost to be only twice as slow.

I am frankly quite pleased by the performance gains I saw within the libWater project after replacing boost::serialization with this solution. We had a 10% overall performance improvement, which in HPC is quite welcome.

The full code, plus the test cases are available on github (under the BSD license):
(contributions are welcome)

Read: PART I: get_size(...)

C++ <3

Friday, May 17, 2013

My take on serialization (Part II: serialize)

After discussing the get_size(...) functor, which given an object returns its serialized size in bytes, we can go on and write the serialize function.

We can follow the same pattern as get_size, but this time we have to store the content into a stream.

As before, we have a specialization of the serialize_helper template for tuples, vectors, strings and POD datatypes. In order to be efficient we presize the vector in lines 66-67 (because, thanks to the get_size predicate, we know how many bytes we need to store the object).

Let's write some unit tests (because they are very important):

So it's time to run some benchmarks. YAY! Let's compare this serializer with boost and check whether the time spent re-implementing this was worth it. Since I don't have much time, we only run 1 experiment... if you are interested you can run more :)

So, as far as the size of the serialized stream is concerned, boost uses 63 bytes while ours uses only 26 bytes (less than half). Performance-wise, boost::serialization needs 4.410 seconds to run the test, while our serialization solution takes 0.975 seconds: more than 4 times faster. Of course we are not saying that boost::serialization sucks... as a matter of fact it solves a different problem, since it also stores the typing information, which allows anyone (even a Java program) to reload the stream. Our serialization strategy (as I explained in the first post) is based on the fact that the type is known on the receiver side (therefore we can avoid storing typing).

Let's show some graphs comparing our serialization strategy to boost::serialization:

Thursday, May 16, 2013

My take on C++ serialization (Part I: get_size)

Long time, no see!

Lately I have been busy writing... and writing... and we know that: "All writing and no coding makes Simone a dull boy" :) so I needed to go back to C++ writing... which means new blog entries!

Lately, in a project (libwater, soon to be released) I have been using boost::serialization in order to transfer C++ objects from one node to another via the network. Boost serialization is pretty neat, especially when you are under a deadline, but releasing the code with a dependency on boost is in most cases overkill. Especially when the boost library you are using is not a "header-only" library, which means the user needs it installed on his system. Since the project focuses on large clusters, this is always a big problem, because these advanced libraries are in the majority of cases missing (or a very old version is installed), meaning that the user has to do a lot of installation work just to try out our library.

Besides that, boost::serialization was doing more than we actually needed; in fact I am in the particular case where I know exactly the type of the object I am receiving on the remote node, therefore I don't need to store the type information in the serialization stream like boost does. This is typical when, for example, you are implementing a network protocol where the structure of messages is known. In such a case we can assign a tag to each message type and easily associate a C++ type with it. Since the list of messages is known statically, we can use some meta-programming just to have some fun.

Actually, I could literally halve the size of each transferred message by storing only the object content. Therefore I started to work on a "header-only" serialization interface... of course it had to involve template meta-programming! :) I am going to publish the various pieces of my take on object serialization in multiple posts (otherwise it may get too heavy to read).

The first thing (which I will introduce today) was defining a functor which, given an object, returns the size (in bytes) of its serialized representation; I called it size_t get_size(const T& obj). This is some easy code which needs to be written, so let's get to it.

Given that the types I deal with are (for now) restricted to std::string, integral types (uint8_t, uint16_t, uint32_t, uint64_t), std::vector<T> and std::tuple<T...>, this code should do the trick. While for tuples and integral types we directly store the value, for vectors and strings we prepend the number of elements which will be stored (since this information is not explicit in the type).

For tuples I wish I had a better solution using template pack expansion. This would be possible if, for example, get_size were more like sizeof..., which means it could be computed based on the object's type alone. That would allow me to write something like: std::sum(sizeof(T)...).

However, since we may have strings and vectors as elements of the tuple, this is not possible, because we would need to unpack the elements of the tuple as well (not only the types). I would appreciate it if anyone could suggest a better solution for this.

A simple test:

C++ <3

Monday, September 17, 2012

The Wonderful World of C++11's Smart Pointers

While I was reading the C++11 standard my attention was caught by the std::weak_ptr<T> class. Let's rewind a little bit and explain what smart pointers are. C++03 introduced the std::auto_ptr<T> class, which was meant to be a basic implementation of a smart pointer. A smart pointer is an object which behaves like a pointer but doesn't need to be manually cleaned up, because the pointed-to area of memory is automatically freed when the pointer object is destroyed. It is quite handy, especially in those situations where you want to throw an exception and you don't want to leak any memory.

Since its introduction, the std::auto_ptr<T> class has had several problems (a brief history of std::auto_ptr<T> that I found interesting is on Scott Meyers' (C++-god-level) blog). One of the main problems is that auto_ptr is not reference counted, which means that there is only one smart pointer owning a given memory location. Whenever an object a (of type std::auto_ptr<T>) is copied into b, the ownership of the pointer is transferred.

That means that after the assignment the object a will be an invalid pointer (and therefore will not perform the cleanup of the memory), while the pointer b now owns the reference to the memory and should take care of performing the delete operation. This makes std::auto_ptr<T> impossible, for example, to use with containers. Indeed, every attempt to access element x of a std::vector<std::auto_ptr<int>> will transfer the ownership of the object in the array to the left side of the assignment, leading to disasters.

C++11 finally solved this problem. First of all, C++11 introduces several smart pointers, exactly three: std::unique_ptr<T>, std::shared_ptr<T> and std::weak_ptr<T>.
  • std::unique_ptr<T> is very similar to the std::auto_ptr<T> of C++03. However, because of the introduction of move semantics in C++11 (about which I promised to post something and still haven't), the implementation is safer. For example, it is now possible to have containers of unique_ptrs. In case you want more details on this, have a look at this blog post
  • std::shared_ptr<T> is an implementation of a smart pointer with reference counting. This means that, unlike std::unique_ptr<T>, a std::shared_ptr<T> can be copied and assigned. Every time the pointer is copied, the object keeps track of how many smart pointer objects are pointing to the same memory location. When a smart pointer goes out of scope and its destructor gets invoked, the object checks whether it is the last smart pointer owning a reference to that memory; if this is the case, the delete operation is invoked.
And finally we arrive at std::weak_ptr<T>, which is the main reason why I started this post in the first place. :) When I first read the definition of a weak_ptr in the C++11 standard I said to myself... nah, you will never use this... what are the odds of you needing a weak_ptr?? And indeed I was wrong: a couple of days ago I stumbled upon a problem that really required a weak_ptr, true story.

It all started with a simple problem. I wanted to have an instance manager handling a number of objects. These objects need to refer to other instances held by the instance manager. I am not a fan of using raw pointers, therefore I started with a simple structure which turned out to be a disaster :)

Can you see it? Yep, it always gets me. The main problem is that vectors expand during execution. When an array expands, its content is relocated in memory. The array begins its life with a certain capacity. Although the size is zero, the capacity is usually larger than that. Every time the push_back method is invoked, the size of the array increases. When the size reaches the capacity, the vector needs to be expanded so that more elements can be appended. The content of the array is therefore copied over to the newly allocated memory and the original location is deleted. In my implementation I keep a pointer to an element of the array, but surprise surprise... that pointer is going to be invalid if too many elements are pushed into the array.

A workaround is to keep the instances in a std::vector<Instance*>. However, in the era of C++11 goodies we don't want to take care of pointers ourselves, therefore the use of smart pointers is screaming out. My second try was using shared_ptrs. Here is the implementation which uses shared_ptrs; where is the problem now?

I tweaked the code a little bit to make it clear what can go wrong with this implementation. When the instance manager goes out of scope it is freed. This means that the repos array gets deallocated. However, here something nasty happens: a circular dependency can arise among Instance objects. Indeed, we end up in a situation where the first two elements of the vector refer to each other; the situation in memory is the following:
When the first element of the array is removed, the pointer has a reference count of 2, therefore the count is decreased and the object is not removed. We move on to the second element and the same repeats. We therefore end up losing any reference to the two instances and leak memory. Valgrind should be quite angry if you run this code :). What do we do now?

That's it. That's the reason why C++11 also has std::weak_ptr<T>. Its purpose is to be used in such situations to break cycles. This kind of smart pointer holds a non-owning ("weak") reference to an object that is managed by a std::shared_ptr<T>. It must be converted to a std::shared_ptr<T> in order to access the referenced object.
The weak_ptr is quite easy to use. You can construct a weak_ptr from a shared_ptr. Every time you want to access the pointed-to object you need to obtain a shared_ptr. This is done through the lock() method, which returns an empty shared_ptr if the object is no longer valid. The validity of a weak_ptr can also be tested using the expired() method. The use of weak_ptr makes Valgrind happy and concludes this post.

C++ <3

Saturday, August 4, 2012

Speeding up quicksort with template meta-programming

As you may have realized by now, I am a little bit of a fanatic of C++ templates. I indeed have a really hard time listening to programmers saying: "I do program in C++ but I am not using templates". I swear I have heard such statements several times in my life, and it is such a pity that a number of C++ programmers avoid templates like they are a disease. Templates are awesome and template meta-programming is way beyond awesomeness (at least for me). From my point of view (being involved in compiler design), C++ meta-programming gives me a way to interact with the compiler much more closely... and this allows you to test out new compiler transformations before implementing them in the compiler itself.
Anyhow, this week I spent some spare time trying to make a point (mostly just for the sake of this post), which is the following: "if you want speed... you shall go crazy with meta-programming". Forget your beloved C (I know several people will not agree with this statement, but seriously guys, C is just a subset of C++). Therefore, I was thinking of an algorithm I could speed up with template meta-programming, and I thought the most relevant thing would be some kind of sorting algorithm. Indeed, there is a lot of research around sorting algorithms, mostly related to their computational complexity. However, most of the time people forget that their algorithm is running on top of a super-scalar architecture which, as we saw in the last post, is not that fond of conditional jumps.

Among the myriad of sorting algorithms I chose quicksort, because its average complexity is O(n log n) and, in practice, because the sorting is done in-place and therefore plays well with the cache, it performs better than other sorting algorithms with the same computational complexity. As proof of its relevance, note that the C library provides the qsort function, and the C++ STL std::sort also uses a combination of quicksort and introsort. Here follows the classic implementation of quicksort:
I tested the execution time of the three sorting routines using the following code. I generate a vector of N elements and initialize each value randomly. Afterwards I sort the same array for 1 million times and sum the total execution time. At the end of the loop I divide by the number of iterations in order to obtain the execution time of a single invocation of the sorting routine. The skeleton of the testing code is the following (ITER = 1000000):
Compared to std::sort and std::qsort our method is not that bad. 

The major difference from std::qsort (which is a C implementation) is that the C version uses a memcpy operation to perform the swap. This is very inefficient, especially compared to the C++ std::swap, which uses move semantics (another wonder introduced with C++11), eliminating the need to create a temporary copy (we will cover move semantics and R-value references in a later post).

My implementation of quicksort is however not faster than std::sort for larger arrays; this might be caused by std::sort switching to the introsort algorithm for such arrays.

Nevertheless, this implementation is a good starting point to start playing with templates (YAY). Indeed, the compiler is not able to inline recursive functions; however, if we know the size of the array statically, we can entirely inline the computation of the quicksort, thereby optimizing the sorting for pipeline usage. It is worth noting that not all the control flow statements can be eliminated, since the comparisons between array elements (whose values are not known at compile time) still need to be performed at runtime.

A way of writing the template version of the quicksort algorithm is the following:
Compiling this version of the quicksort is more demanding than the previous one. Indeed, the recursive invocation is unrolled at compile time, which means the compiler needs to generate an awful amount of code. For example, compiling the quicksort for arrays of 50 elements requires 7 minutes and GCC uses around 1.3GB of main memory. The produced binary is 4 MB, compared to the 27KB executable obtained when the previous sorting function is used.

However if we run the template-based quicksort (template_sort), we obtain the following performance results:

Clearly there is an advantage for arrays of up to 25 elements. The template quicksort is 2 times faster than the standard std::sort. Note that the same algorithm is utilized; the only difference is that we optimized the pipeline usage by eliminating recursive function calls, completely unwinding the call stack at compile time. However, there is a drawback: because of the size of the generated code, the program doesn't fit into the cache anymore and therefore we under-utilize the L1 instruction cache. Up to 25 elements, though, the generated code is small enough to fit into the instruction cache and the benefits of unrolling are clearly visible.

The nice thing is that we doubled the performance of the quicksort using a high-level language, without messing with any assembly instructions. Also, the template code is not that different from the trivial implementation of the quicksort.

C++ <3

Friday, July 27, 2012

A generic loop unroller based on template meta-programming

Loop unrolling (or unwinding) is a code transformation used by compilers to improve the utilization of the functional units present in modern super-scalar CPUs. Indeed, processors have a pipelined architecture consisting of multiple stages (at minimum 5). While the CPU is executing the instruction in one of the stages, it can simultaneously load and decode the next operation pointed to by the program counter. However, in the presence of branch instructions, the CPU needs to wait for the decode stage in order to know whether the branch has been taken or not, so as to adjust the program counter and correctly load the next assembly instruction. Over the years several architectural optimizations have been introduced to reduce the problem (e.g. branch prediction units); however, in specific situations the CPU can lose up to 20 cycles because of a branch instruction.

For this reason it is very important to reduce the number of branches in the input code. This is the job of the compiler, since it is the software agent closest to the actual hardware and it can produce code which better fits the underlying CPU. However, compilers are quite complex and often fail to apply even elementary optimizations. An example is loop unrolling. Because the compiler often fails to perform this transformation, developers, especially in High Performance Computing (HPC), tend to tune their code by manually unrolling loops. This is (in my opinion) a bad practice, since the tuned code is not portable anymore (the optimal unroll factor for one machine can be bad for another). A second problem is that by manually unrolling a loop the body is replicated many times, which is never a good thing if a bug shows up.

However, we don't necessarily have to give up performance if we want code readability and structure. With C++11 we can have them both. :) The idea is to have a meta-function which we call unroller; this class takes care of unrolling N invocations of a generic functor, like below:

This simple function takes care of unrolling N function calls. Because we use inline, the generated code will actually not contain function calls. The next code snippet shows how we can use the unroller.

And that's it. You can now control the unrolling factor using the preprocessor macro UnrollFactor, which means you can either define it in your code or provide a value through a Makefile in order to fit the target architecture.

The next question comes naturally: how much slower is this loop going to be compared with the C-like version, which uses no lambdas and which can be unrolled by the underlying compiler? This is a legitimate question and that's why we are going to do some good old benchmarking right now! :)

The loop our unroller will fight against is the following:  
The unroller turned out to be pretty awesome. As expected, we get a 2.5x speedup. The amount of computation in the loop body is enough that we start seeing improvements already with an unrolling factor of 2 or 5. However, depending on the number of iterations of the loop and the amount of computation, the best unrolling factor may change.

C++ <3