Cilkplus on the T3

Last year I ported and tested the Cilkplus Intel/MIT parallel processing extensions for the C programming language on the Raspberry Pi 2B. I teach a course on numerical methods, and the scalability of parallel algorithms gets more and more interesting as the number of cores increases. However, 8-core Xeon processors are expensive, and the big.LITTLE design of many 8-core ARM devices mixes two different kinds of cores, which means that only 4 cores can be used for scaling analysis. I finally got time to make some preliminary tests with the T3.

The T3 I received had Debian preloaded on the eMMC and booted right up. After disabling the autologin and graphical interface, I inserted an SD card to use for home directories and downloaded the latest version of gcc with the intention of compiling it along with the modifications needed to enable Cilkplus on ARM. Unfortunately, compiling gcc takes 2GB of RAM and swap was disabled in the default kernel. Fortunately, the binary I had compiled for the Raspberry Pi worked after simply copying it over.

For my first test I chose the parallel merge sort, which I had run earlier on the Raspberry Pi 2B. The combined results for both machines are:

                      Serial       Parallel      Speedup
Raspberry Pi 2B       4.269e-01    1.248e-01      3.423
FriendlyArm T3        2.093e-01    3.015e-02      6.941

The times above are given in seconds for sorting 1048576 random integers. Note that the single-core speed of the T3 is about twice that of the 2B, while its multi-core speed is about 4 times higher. Moreover, the algorithm scales to the 8 cores of the T3 with similar efficiency as it scales to the 4 cores of the 2B: dividing the speedup by the number of cores gives about 0.87 for the T3 and about 0.86 for the 2B. While efficiency often decreases as more cores are added, the fact that it doesn't in this particular case is interesting. I'll post more details later, as well as links to the binary executables that I've been testing.
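
For readers who have not seen Cilkplus code, here is a rough sketch of how a parallel merge sort can be written with cilk_spawn and cilk_sync. It is only an illustration and not the program that produced the timings above: the names parallel_merge_sort, msort and merge, as well as the CUTOFF value, are made up for the example. It assumes a Cilkplus-enabled gcc (which defines __cilk when -fcilkplus is given); with any other compiler the keywords are defined away and the same code simply runs serially.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumption: a Cilkplus-enabled compiler defines __cilk. Without it,
   the keywords expand to nothing and the sort runs on one core. */
#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_spawn
#define cilk_sync
#endif

/* Hypothetical threshold: ranges shorter than this are sorted without
   spawning, to keep the spawn overhead small. */
#define CUTOFF 2048

/* Merge the sorted runs a[lo..mid) and a[mid..hi) through the scratch
   buffer tmp, then copy the merged run back into a. */
static void merge(int *a, int *tmp, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Sort a[lo..hi). Large ranges sort their left half in a spawned call
   while the current worker continues with the right half. */
static void msort(int *a, int *tmp, size_t lo, size_t hi)
{
    if (hi - lo < 2) return;
    size_t mid = lo + (hi - lo) / 2;
    if (hi - lo >= CUTOFF) {
        cilk_spawn msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);
        cilk_sync;               /* both halves must be done before merging */
    } else {
        msort(a, tmp, lo, mid);
        msort(a, tmp, mid, hi);
    }
    merge(a, tmp, lo, mid, hi);
}

/* Sort n ints in place. Returns 0 on success, or -1 if the scratch
   buffer cannot be allocated. */
int parallel_merge_sort(int *a, size_t n)
{
    if (n < 2) return 0;
    assert(a != NULL);
    int *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return -1;
    msort(a, tmp, 0, n);
    free(tmp);
    return 0;
}
```

The two recursive calls work on disjoint halves of both the array and the scratch buffer, so the spawned call needs no locking. Calling parallel_merge_sort(data, 1048576) on an array of that length would sort an input of the size used above.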

I ran a program on the NanoPi T3 that measures memory speed by adding 32-bit numbers read in a random pattern from a buffer of a specified size. The original program tests single-core speed, while simple modifications turn it into a Cilk program that tests multi-core speed. The single-core results are

[Image: single-core results]

while the Cilk multi-core results are

[Image: Cilk multi-core results, with the single-core results faded in the background]

Note that the single-core results are faded in the background of the multi-core graph for comparison. In the multi-core runs, 1/8 of the total buffer size was allocated to each worker thread. For both Xeon and ARM architectures, the maximum number of iterations per second does not significantly increase when running with multiple cores. Thus, the performance of this test is constrained by memory bandwidth.
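
To make the structure of such a test concrete, here is a minimal sketch of the kind of loop described above, with a multi-core variant that gives each of 8 workers its own 1/8 slice of the buffer. It is not the benchmark that produced the graphs: the function names, the xorshift32 generator used for the random pattern, and the requirement that sizes be powers of two are all assumptions made for the example. As before, cilk_for falls back to a plain for loop when the compiler does not define __cilk.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: a Cilkplus-enabled compiler defines __cilk. Without it,
   cilk_for becomes an ordinary serial for loop. */
#ifdef __cilk
#include <cilk/cilk.h>
#else
#define cilk_for for
#endif

#define NWORKERS 8   /* one slice of the buffer per worker */

/* One step of xorshift32, a cheap pseudo-random generator
   (a hypothetical choice, used here only to pick positions). */
static uint32_t xorshift32(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}

/* Add up iters 32-bit words read from pseudo-random positions in
   buf[0..words). words must be a power of two so that masking the
   random value always yields a valid index. */
uint32_t random_sum(const uint32_t *buf, size_t words,
                    uint64_t iters, uint32_t seed)
{
    assert(words > 0 && (words & (words - 1)) == 0);
    uint32_t x = seed ? seed : 1u;   /* xorshift state must be nonzero */
    uint32_t sum = 0;
    for (uint64_t i = 0; i < iters; i++) {
        x = xorshift32(x);
        sum += buf[x & (words - 1)];
    }
    return sum;
}

/* Multi-core variant: each worker runs the same loop on its own
   slice of total_words / NWORKERS words. Returns the combined sum. */
uint32_t random_sum_parallel(const uint32_t *buf, size_t total_words,
                             uint64_t iters_per_worker)
{
    size_t slice = total_words / NWORKERS;
    uint32_t partial[NWORKERS];
    cilk_for (int w = 0; w < NWORKERS; w++)
        partial[w] = random_sum(buf + (size_t)w * slice, slice,
                                iters_per_worker, (uint32_t)w + 1u);
    uint32_t total = 0;
    for (int w = 0; w < NWORKERS; w++)
        total += partial[w];
    return total;
}
```

Timing a call for a range of buffer sizes and dividing the number of iterations by the elapsed time would give iterations per second as a function of buffer size. Returning the sum keeps the compiler from optimizing the reads away.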

For buffer sizes less than 512K the NanoPi T3 performs faster than the Xeon. This may, in part, be due to the fact that the Xeon was running with hyper-threading enabled. In particular, the system was configured with dual 6-core CPUs and a total of 24 hardware threads. As Cilk allows the operating system to move worker threads to less busy CPUs, such migration may have had negative effects on the cache on the Xeon hardware.
