Saturday, July 4, 2015

Move semantics with containers and algorithms

Move semantics is one of the coolest new features of (the-now-so-not-that-new) C++11. The main idea is the ability to steal the implementation (the guts) of temporary objects and therefore be able to avoid creating object copies. This of course is very important for performance; while it is true that many mainstream C++ compilers have been able to optimize away such copies (e.g. return value elimination), move-semantics gives us developers an explicit way of controlling it.

Another important aspect of move-semantics is that it allows definition of move-only data types, a.k.a. objects that cannot be copied but only moved. An example for all, move-semantics makes possible the definition of the std::unique_ptr<T> data type.

Let us consider a simple ADT, called Object, with properties very similar to the ones of an std::unique_ptr. As you can see it is a simply wrapper for a string literal allocated on the heap that only offers move semantics (no copying is allowed). It additionally contains a type that will be useful later.
We can use it populate a container as shown in the main function. emplace_back allows for the object to be built in place so no copy constructor is going to be invoked.

Real problems start when we want to use this container with standard algorithms. If the algorithm uses move-semantics then there are no issues, for example trying to sort the array can be achieved with the following code:
If we print the addresses at which each message is allocated we see that it is indeed the same before and after sorting. This happens because std::sort internally uses std::swap which uses move-semantics.
// before sorting
obj(1st message) addr(str): 0x12a9030
obj(2nd message) addr(str): 0x12a9080
obj(3rd message) addr(str): 0x12a9010
obj(4th message) addr(str): 0x12a90f0
// after sorting
obj(1st message) addr(str): 0x12a9030
obj(4th message) addr(str): 0x12a90f0
obj(2nd message) addr(str): 0x12a9080
obj(3rd message) addr(str): 0x12a9010

So far so good! However, when trying to use algorithms which require copy, i.e. std::copy, std::copy_if & friends, the compiler is going to give us a nasty error message (which may change depending on your compiler, I am using GCC 4.9.2):

In instantiation of ‘_OIter std::copy_if(_IIter, _IIter, _OIter, _Predicate) [with _IIter = __gnu_cxx::__normal_iterator >; _OIter = __gnu_cxx::__normal_iterator >; _Predicate = main(int, char**)::]’:
container_move.cpp:67:63:   required from here
/usr/include/c++/4.9/bits/stl_algo.h:748:16: error: use of deleted function ‘Object& Object::operator=(const Object&)’
      *__result = *__first;
container_move.cpp:24:10: note: declared here
  Object& operator=(const Object&) = delete;

Which basically means the algorithm is trying to copy an instance of the Object class, but that is not allowed since we deleted the copy constructor. How to solve it? Well, you could write your implementation of std::copy_if and explicitly use std::move to promote the operation to move semantics, e.g.:
However there is an easier and cleaner way of achieving the same behaviour, behold the std::move_iterator. It is an adaptor for iterators which yields an R-value reference when the dereference operator (i.e. *) is used. That means using std::copy_if on a container with only moveable objects is as easy as:
Printing the dst_objs array yields the following:

obj(1st message) addr(str): 0x12a9030
obj(4th message) addr(str): 0x12a90f0

No object have been copied, mission accomplished! :)

It has to be added that

C++ <3

Monday, December 8, 2014

A type-safe definition for OpenCL's enqueue_kernel function

I want to share with you something that I initially thought it wouldn't work... but it does. No reason behind it, just to prove (once again) that C++11 is indeed fantastic and it can handle (almost) whatever you throw at it.

Since 1 year the OpenCL 2.0 standard was ratified. The thing which is most exciting for me is device-side enqueue. This is a functionality which allows a kernel to submit new work directly on the device without the need for host intervention.

However there is something fishy with the way the function is defined and I am going to explain why. The new enqueue_kernel function (defined in the OpenCL 2.0 language specification) has several overloads:
Ok, we like it. But wait for the next two ones:
Emh.. wait a second.. what? How many variadic arguments are in there?

The issue here are the two sets of "...". How many arguments should we accept? It is funny since the specs are not saying much about these additional arguments. However the only way to read this in a way that makes sense (in my understanding) is the following:
"If a closure function (or block) is passed which accepts N OpenCL "local" pointers, then their size is defined by an equal number of unsigned values (i.e., size0,..., sizeN-1). It is responsibility of the runtime support to allocate memory before the nested kernel is executed."
All of this to say that the length of the two variadic argument lists (the lambda's and the one internal to enqueue_kernel) must match. This means that it is responsibility of the compiler to perform this additional check. 

I can see many people being happy with this...  but couldn't we use the type system to enforce that? Can our beloved meta-programming fix this? Let's assume we were in C++? Would the API designer able to to express this concept (number of arguments in the closure equal number of arguments passed) just with the means of the type system? You will be glad to ear that with C++11, YES WE CAN! ...and I am going to show you how to do that.

For this example we use the sizes as input to the lambda (not for allocating device local memory as the actual implementation of enqueue_kernel is supposed to do). This is just a proof of concept, we are not interested in the actual implementation of OpenCL's device-side enqueue. Execution of the program will produce the following expected output:

> 10
> 20
> 30
> Calling closure
> Computed value: 60

This highlight the power of the ... expansion operator of C++11 variadic templates. For example if we try to call this function using an invalid number of sizes a compiler error will be generated:

@ThinkPad-X1-Carbon:~$ g++ -std=c++11 test.cpp 
test.cpp: In function ‘int main(int, char**)’:
test.cpp:25:10: error: too few arguments to function ‘int enqueue_kernel(std::function<void(Args ...)>, typename to_int<Args>::type ...) [with Args = {int, int, int}]’
    10, 20);
test.cpp:10:5: note: declared here
 int enqueue_kernel(std::function<void (Args... )> block, typename to_int<Args>::type... sizes)
And there you have it. A type-safe definition of OpenCL's enqueue_kernel using C++11. Just because in C++11 we can! Hate on that C lovers! :)

C++ <3

Tuesday, October 14, 2014

Serious programming on ChromeOS

In my previous post I explained how to get a working OpenCL for ARM's Mali GPUs on the Samsung Chromebook 2. That's great I know and you are very welcome. :) 

However, do you seriously want to write OpenCL goodness on that small 11" screen? And don't get me started on the keyboard... I mean the Samsung Chromebook 2's keyboard is not that bad considering this a 250$ laptop we are talking about... however I am currently typing on a Lenovo X1 carbon chiclet style keyboard and the experience is the closest thing to a nerdgasm (nerd-orgasm... yeah I just made it up, deal with it! ...ah no actually it already exists).  

First thing... we need to enable ssh server. Lucky enough, ChromeOS comes by default with an ssh daemon (/usr/sbin/sshd) , however it is not enabled by default. The way to enable it is described in [1,2]. For short there are the steps:
  1. Remove rootfs verification (!!please backup your stuff!!):
  2. $ sudo /usr/share/vboot/bin/ --remove_rootfs_verification --partitions 4
    $ reboot
  3. Mount the rootfs in rw mode (remember you will need to do this every time you reboot the device and want to write in the root partition):
    $ sudo mount -o remount,rw /
  4. Generate SSH keys:
  5. $ sudo mkdir /mnt/stateful_partition//etc/ssh
    $ ssh-keygen -t dsa -f /mnt/stateful_partition//etc/ssh/ssh_host_dsa_key
    $ ssh-keygen -t rsa -f mnt/stateful_partition/etc/ssh/ssh_host_rsa_key
  6. Allow incoming traffic on PORT 22:
  7. $ sudo /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT
  8. We can now create a new user used to remotely login (alternatively you can set the chronos user password). To create a new user you need to follow these steps:
    $ sudo useradd -G wheel -s /bin/bash mali_compute
    $ sudo passwd mali_compute
    $ sudo mkdir /home/mali_compute
    $ sudo chown /home/mali_compute mali_compute
    Now we have a user, however the user cannot run sudo and we already know we need to be able to run sudo in order to enter the Arch linux chroot. To solve this we need to make user belonging to the wheel group part of the sudoers. This is done using the visudo command.
    $ sudo su
    $ visudo
    Search and uncomment one of following lines:
    ## Uncomment to allow members of the group wheel to execute any command
    # %wheel ALL=(ALL) ALL
    ## Same thing without a password
    # %wheel ALL=(ALL) NOPASSWD: ALL
...and voila' you are done.

Now get your ip address using the ifconfig command

chronos@localhost / $ ifconfig 
lo: flags=73  mtu 65536

mlan0: flags=4163  mtu 1500
        inet  netmask  broadcast
        ether xxx:xxx:xxx:xxx  txqueuelen 1000  (Ethernet)
        RX packets 1908  bytes 802994 (784.1 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1524  bytes 415774 (406.0 KiB)
        TX errors 4  dropped 0 overruns 0  carrier 0  collisions 0

Now from your desktop machine simply type:
$ ssh mali_compute@
motonacciu@ThinkPad-X1-Carbon:~$ ssh chronos@
The authenticity of host ' (' can't be established.
RSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '' (RSA) to the list of known hosts.
mali_compute@localhost ~ $ uname -a
Linux localhost 3.8.11 #1 SMP Tue Sep 30 23:28:35 PDT 2014 armv7l SAMSUNG EXYNOS5 (Flattened Device Tree) GNU/Linux

Super cool right? ...well.. not really! Unfortunately you are going to loose the configuration once you reboot your chromebook. Or better, you will need to rerun the iptables and sshd commands manually if you wish to enable the daemon again. No worry, we can automatically start the SSH daemon by adding a script under the /etc/init directory, e.g., sshd.conf; it should contain the following lines (remember to remount the rootfs in rw mode in the case you rebooted the device in the meantime).

start on started system-services
     /sbin/iptables -A INPUT -p tcp --dport 22 -j ACCEPT
end script 

You should be able to login again from your next reboot.

Great, we can login into the super underpowered ChromOS shell... what's the deal? Right, but we are not done yet. The 'prestige' hasn't come yet :). Well, the idea of the last step is to let the user we just created jump into the Arch Linux chroot at login and therefore bypass the ChromeOS shell. This is rather simple to do as well (assuming you have created the chroot, otherwise go here):

$ su mali_compute
$ echo "sudo enter-chroot" > ~/.bash_profile

That's it! Wonder what's going to happen next time you log into your chromebook?

motonacciu@ThinkPad-X1-Carbon:~$ ssh mali_compute@
Last login: Mon Oct 13 22:17:40 BST 2014 from on pts/1
Entering /mnt/stateful_partition/crouton/chroots/arch...
mali_compute@arch ~ $ uname -a
Linux arch 3.8.11 #1 SMP Tue Sep 30 23:28:35 PDT 2014 armv7l GNU/Linux

Limited chrosh... gone!! Welcome to fully fledged Linux environment! :)


C++ <3

Saturday, October 11, 2014

Run OpenCL on the new Samsung Chromebook 2 in 5(-ish) simple steps

Recently a colleague and friend of mine posted a great tutorial on how to run OpenCL on Samsung's Chromebook in 30 minutes. He has tested this tutorial on the older (Series 3) Chromebook.

I bought myself the newer version, the Samsung Chromebook 2 (11" version). The main difference between these two laptops is that the former chromebook hosts a Mali-T604 GPU while the latest model uses a bifier Mali T628-MP6 chip. The Mali-T604 GPU has 4 cores vs the 6 in the latest chip. The latter is definitely an interesting chip for OpenCL's folks since the 6 cores are going to be split into two physical devices with 4 and 2 cores respectively.

In this blog I will present a slightly different way of setting up a working OpenCL environment for the Chromebook 2 (which should be also working on the former chromebook).


  • A Samsung Chromebook 2 (or any other ARM-based device running ChromeOS)
  • Some free space on your drive (2/4 GBs)
  • Knowledge of Arch Linux package manager (pacman)
  • Optional: I will install the development system in the internal SSD. If you think you are going to need more space for development, then you can use the microSD)

Step 1: Enable Developer mode

Enter Recovery Mode by holding the ESC and REFRESH (↻ or F3) buttons, and pressing the POWER button. In Recovery Mode, press Ctrl+D and ENTER to confirm and enable Developer Mode.

Step 2: Install chroarg

chroarg is a fork of the cruton process. It is based on the chroot command available in ChromeOS which allows to spawn lightweight virtual OSs, a more technical explanation follows:
What's a chroot?
Like virtualization, chroots provide the guest OS with their own, segregated file system to run in, allowing applications to run in a different binary environment from the host OS. Unlike virtualization, you are not booting a second OS; instead, the guest OS is running using the Chromium OS system. The benefit to this is that there is zero speed penalty since everything is run natively, and you aren't wasting RAM to boot two OSes at the same time. [...] 
While cruton will install by default Ubuntu, chroarg is based on Arch Linux. I personally prefer Arch Linux, but if you feel more confident with Ubuntu feel free the use cruton. Follows the creation of a chroot (more options are available from the project github page)

  1. Launch a crosh shell (Ctrl+Alt+T, you can paste in the console using Ctrl+Shift+V), then enter the command shell.
  2. Download and extract chroagh:
    $ cd ~/Downloads
    $ wget -O chroagh.tar.gz
    $ tar xvf chroagh.tar.gz
    $ cd drinkcat-chroagh-*
  3. Create the rootfs:
    $ sudo sh -e installer/ -r arch -t cli-extra 
The tool will install a minimal Arch, at some point it will ask to give the user name and password for the main user. If everything went fine (it often does), then you are ready to start your Arch installation within ChromeOS.

NOTE: If you want to install the chroot to a different location (e.g., an SD/microSD card or USB) then use the -p option to specify a destination folder.

Step 3: Enter-chroot and environment setup

After chroarg finishes installing a base Arch Linux installation we can enter this virtual environment using the command (from any crosh shell):

$ sudo enter-chroot

You should see the following output:
chronos@localhost ~/Downloads/drinkcat-chroagh-380f361 $ sudo enter-chroot
Entering /mnt/stateful_partition/crouton/chroots/arch...
[motonacciu@localhost ~]$ 

And magic magic, we are not in Arch linux. At this point you can install a bounce of packages which are going to be useful for OpenCL development.

$ sudo pacman -S gcc vim cmake base-devel git opencl-headers

Next (and final) step is downloading the Mali userspace drivers from. They are available from and continuously updated. This is the moment where you should be aware of the Mali device you have installed in your chromebook. In my case the driver marked as MALI-T62x will do the trick. For the older chromebook the MALI-T604 driver should be used instead. Since we are using the command line we can download the fbdev version of the drivers:

$ wget
$ tar -xf mali-t62x_r4p0-02rel0_linux_1+fbdev.tar.gz
$ ls fbdev

Next we can either edit the ~/.bashrc file to include this folder among the LD_LIBRARY_PATH, alternatively you can copy the libraries in your /usr/lib folder (but you will need to add the user among the sudoers). Or simply manually specify its path to GCC.

Step 4: Compile and Run your CL program

When you compile your program make sure the linker can find the The library is just a wrapper, the only library needed for running CL program is the

Compile your program as follows:
$ g++ -std=c++11 main.cpp -Iinclude -L/home/compute/fbdev -lmali -o clInfo

This is a simple CL program which prints the list of devices. We run using the following command (you can avoid specifying the LD_LIBRARY_PATH if you placed the in a default location):

$ LD_LIBRARY_PATH=/home/compute/fbdev/:$LD_LIBRARY_PATH ./clInfo
Total number of CL devices: 2

Device 0 info
 - Vendor:            'ARM'
 - Name:              'Mali-T628'
 - Type:              'CL_DEVICE_TYPE_GPU'
 - Max frequency:     '533'
 - Max compute units: '2'
 - Global mem size:   '2097192960'
 - Local mem size:    '32768'
 - Profile:           'FULL_PROFILE'
 - Driver version:    '1.1'
 - Extensions:        'cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf'

Device 1 info
 - Vendor:            'ARM'
 - Name:              'Mali-T628'
 - Type:              'CL_DEVICE_TYPE_GPU'
 - Max frequency:     '533'
 - Max compute units: '4'
 - Global mem size:   '2097192960'
 - Local mem size:    '32768'
 - Profile:           'FULL_PROFILE'
 - Driver version:    '1.1'
 - Extensions:        'cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf'

Step 5: Done

Yes, that's all. You can now start writing you multi-device OpenCL codes for ARM's Mali GPU. Let's verify that everything is in order... shall we?

OpenMP 'matmul_1024x1024' on ARM-CPU   [cores:8] => 16632 msecs
OpenCL 'matmul_1024x1024' on Mali-T628 [cores:2] => 616 msecs
OpenCL 'matmul_1024x1024' on Mali-T628 [cores:4] => 319 msecs

Stay tuned for more experiments!

C++ <3