Bad GPU performance

Hello everyone!
I have a NanoPC-T4 running Ubuntu Bionic.
We are trying to run the TVM OpenCL-accelerated framework (link), but we have spent a couple of weeks trying to find out why the NanoPC-T4 is so slow with TVM (and with PlaidML as well). It is actually 2x to 3x slower than the Firefly-RK3399 for computations on the GPU.
We finally found the clpeak tool, which benchmarks OpenCL hardware and reports GFLOPS and memory bandwidth. Here is the result:

Code: Select all

Platform: ARM Platform
  Device: Mali-T860
    Driver version  : 1.2 (Linux ARM64)
    Compute units   : 4
    Clock frequency : 800 MHz

    Global memory bandwidth (GBPS)
      float   : 3.76 | 3.73
      float2  : 6.15 | 5.99
      float4  : 7.26 | 6.98
      float8  : 6.00 | 5.82
      float16 : 5.30 | 5.14

    Single-precision compute (GFLOPS)
      float   : 23.98 | 24.63
      float2  : 45.76 | 46.73
      float4  : 45.23 | 46.33
      float8  : 40.22 | 41.17
      float16 : 46.41 | 46.45

    half-precision compute (GFLOPS)
      half   : 23.09 | 23.12
      half2  : 48.87 | 49.25
      half4  : 95.32 | 95.45
      half8  : 93.11 | 93.32
      half16 : 87.80 | 89.06

    Double-precision compute (GFLOPS)
      double   : 11.59 | 11.62
      double2  : 3.27  | 3.50
      double4  : 20.35 | 20.71
      double8  : 20.01 | 20.40
      double16 : 19.77 | 19.95

    Integer compute (GIOPS)
      int   : 22.66 | 18.44
      int2  : 47.67 | 31.38
      int4  : 46.97 | 30.97
      int8  : 34.30 | 23.05
      int16 : 47.66 | 30.71

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 1.04 | 0.76
      enqueueReadBuffer          : 1.03 | 0.85
      enqueueMapBuffer(for read) : 4.70 | 3.96
        memcpy from mapped ptr   : 1.50 | 1.68
      enqueueUnmap(after write)  : 4.75 | 3.73
        memcpy to mapped ptr     : 1.68 | 1.73

    Kernel launch latency : 96.29 us | 110.73 us



We compared the T4 benchmark with the Firefly-RK3399. It has the same GFLOPS across the different types, but enqueueMapBuffer and enqueueUnmap are a lot faster than on the T4. What is really important is that this RK3399 was locked to 200 MHz.
Why is the T4 so bad? What can I do about it? Here is the Firefly-RK3399 result:

Code: Select all

Platform: ARM Platform
  Device: Mali-T860
    Driver version  : 1.2 (Linux ARM)
    Compute units   : 4
    Clock frequency : 200 MHz

    Global memory bandwidth (GBPS)
      float   : 3.17
      float2  : 6.07
      float4  : 7.88
      float8  : 6.55
      float16 : 6.26

    Single-precision compute (GFLOPS)
      float   : 25.09
      float2  : 45.51
      float4  : 46.22
      float8  : 41.67
      float16 : 46.40

    half-precision compute (GFLOPS)
      half   : 23.11
      half2  : 50.19
      half4  : 98.30
      half8  : 93.48
      half16 : 93.94

    Double-precision compute (GFLOPS)
      double   : 3.59
      double2  : 3.30
      double4  : 20.97
      double8  : 20.65
      double16 : 20.39

    Integer compute (GIOPS)
      int   : 20.15
      int2  : 49.64
      int4  : 47.12
      int8  : 49.17
      int16 : 41.47

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 4.61
      enqueueReadBuffer          : 2.60
      enqueueMapBuffer(for read) : 475.11
        memcpy from mapped ptr   : 2.50
      enqueueUnmap(after write)  : 2790.39
        memcpy to mapped ptr     : 1.92

    Kernel launch latency : 190.64 us

iamlion12 wrote:
We compared the T4 benchmark with the Firefly-RK3399. It has the same GFLOPS across the different types, but enqueueMapBuffer and enqueueUnmap are a lot faster than on the T4.

- What kernel does the Firefly use? (See https://forum.armbian.com/topic/8097-na ... on-review/ for a performance table where the Firefly is typically faster than other RK3399 boards, like the T4/M4 and rock64.)

Also, in that article you can find:
To reveal the full potential of the board, I'm posting some visualized sbc-bench results taken from mainline 4.19-rc1 kernel here. This is because there might be some DRAM performance issues on RK3399 with 4.4 kernel.


iamlion12 wrote:
What is really important is that this RK3399 was locked to 200 MHz.

Are you sure about 200 MHz? Maybe it's just a wrongly reported value. Did you change the GPU speed and re-run the tests?
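
One way to check what the GPU is really doing is to read the devfreq node while clpeak is running. This is only a sketch: gpu_clock_report is a made-up helper name, and the default path is the ff9a0000.gpu node used elsewhere in this thread, which may sit somewhere else on a different kernel. It assumes the node exposes the usual devfreq governor and cur_freq files.

```shell
# Print the devfreq governor and current clock (in MHz) for a GPU node.
# gpu_clock_report is a hypothetical helper; the default path is an assumption
# taken from this thread and may differ between kernels.
gpu_clock_report() {
    dir="${1:-/sys/class/devfreq/ff9a0000.gpu}"
    if [ ! -r "$dir/cur_freq" ]; then
        echo "no devfreq node at $dir" >&2
        return 1
    fi
    echo "governor: $(cat "$dir/governor")"
    # devfreq reports cur_freq in Hz, so divide to get MHz
    echo "cur_freq: $(( $(cat "$dir/cur_freq") / 1000000 )) MHz"
}

# Usage: gpu_clock_report                      (default node)
#        gpu_clock_report /path/to/devfreq/node
```

If the value printed during a benchmark run differs from the "Clock frequency" line in the clpeak header, the header is probably not the number to trust.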

If you don't know how to change the CPU/GPU speed, see the code below. The default values for the M4/T4 are 1400/1800/800 in the FriendlyElec distro, while in Armbian they are 1600/2000/800. I set 1200/1200/600 for easy performance evaluation on my board (and to avoid overheating), but you may find other values more suitable for your test:

Code: Select all

# "sudo echo ... > file" opens the file as the calling user, so write through "sudo tee" instead.
echo "CPU-A53 - set to 1200 MHz"
echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy0/scaling_governor > /dev/null
echo 1200000 | sudo tee /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq > /dev/null
echo "CPU-A72 - set to 1200 MHz"
echo performance | sudo tee /sys/devices/system/cpu/cpufreq/policy4/scaling_governor > /dev/null
echo 1200000 | sudo tee /sys/devices/system/cpu/cpufreq/policy4/scaling_max_freq > /dev/null
echo "GPU-Mali - set to 600 MHz"
echo performance | sudo tee /sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/governor > /dev/null
echo 600000000 | sudo tee /sys/class/misc/mali0/device/devfreq/ff9a0000.gpu/max_freq > /dev/null

For a Midgard GPU the theoretical maximum is about 0.034 * MHz * cores (see GFlops per core), so for a T860 with 4 cores at 800 MHz it's about 108.8 GFLOPS, while at 200 MHz it's only 27.2 GFLOPS.

Since you get float4 : 46.22 on the Firefly, it can't be running at 200 MHz even theoretically. Based on your values from the T4, it looks like both boards run the GPU at 800 MHz.
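
The arithmetic behind that estimate can be reproduced in a few lines of shell. This is just a sketch of the 0.034 * MHz * cores rule of thumb quoted above; mali_peak_gflops and exceeds_peak are made-up helper names, and 46.22 is the Firefly float4 figure from the clpeak output earlier in the thread.

```shell
# Rule of thumb from this thread: peak GFLOPS ~= 0.034 * MHz * cores (Midgard).
# mali_peak_gflops and exceeds_peak are hypothetical helpers, not part of any tool.
mali_peak_gflops() {
    # $1 = number of shader cores, $2 = GPU clock in MHz
    awk -v cores="$1" -v mhz="$2" 'BEGIN { printf "%.1f\n", 0.034 * mhz * cores }'
}

# Exit status 0 if a measured figure is above the estimated peak for that clock.
exceeds_peak() {
    # $1 = measured GFLOPS, $2 = cores, $3 = MHz
    awk -v m="$1" -v cores="$2" -v mhz="$3" 'BEGIN { exit !(m > 0.034 * mhz * cores) }'
}

mali_peak_gflops 4 800   # prints 108.8
mali_peak_gflops 4 200   # prints 27.2
exceeds_peak 46.22 4 200 && echo "46.22 GFLOPS is not possible at 200 MHz"
```

A measured value above the estimated peak for a given clock means the GPU cannot have been running at that clock.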
Try:

Code: Select all

echo performance > /sys/class/devfreq/ff9a0000.gpu/governor

It worked, thanks for the post.

Alright, let's go through the easy stuff to look at, then.

Have you checked temperatures on the GPU and the CPU?

Check the clock speeds as well.
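
For the temperature check, a rough sketch that walks the kernel's thermal zones is below. thermal_report is a made-up helper name; it assumes the usual /sys/class/thermal layout, where each thermal_zone* directory has a type file and a temp file in millidegrees Celsius. Zone names differ between kernels, so check what your board actually reports.

```shell
# Print each thermal zone's type and temperature in degrees Celsius.
# thermal_report is a hypothetical helper; the base directory can be overridden.
thermal_report() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        # Skip zones without a readable temp file (also covers an unmatched glob)
        [ -r "$zone/temp" ] || continue
        # temp is reported in millidegrees Celsius
        awk -v name="$(cat "$zone/type" 2>/dev/null)" \
            '{ printf "%s: %.1f C\n", name, $1 / 1000 }' "$zone/temp"
    done
}

thermal_report
```

Running it before and during a clpeak run shows whether the board heats up enough for throttling to be a plausible cause.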
