Gentoo Forums
64 Bit Raspberry Pi 4B Benchmarks
roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Mon Sep 09, 2019 4:04 pm    Post subject: Multithreading Benchmarks

Multithreading Benchmarks

Many of these benchmarks run using 1, 2, 4 and 8 threads, with others executing programs on all available cores via OpenMP.


MP-Whetstone Benchmark

Multiple threads each run the eight test functions at the same time, each with some dedicated variables. Measured speed is based on the last thread to finish. Mutex functions avoid update conflicts by allowing only one thread at a time to access common data. Performance is generally proportional to the number of cores used. On particular tests there can be significant differences from the single CPU Whetstone benchmark results, due to a different compiler being used. None of the test functions is suitable for SIMD operation, so the simpler scalar instructions are used. Overall seconds indicates MP efficiency.
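
A minimal sketch of the threading pattern described above, assuming POSIX threads. It is not the benchmark's source: the names run_workers, worker and shared_total are hypothetical, and a simple counting loop stands in for the eight test functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 8

/* Shared state: one total updated by every thread, guarded by a mutex. */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;
static double shared_total;

typedef struct {
    int passes;      /* work units for this thread */
    double local;    /* thread-private accumulator (the "dedicated variables") */
} worker_arg;

static void *worker(void *p)
{
    worker_arg *arg = p;
    arg->local = 0.0;
    for (int i = 0; i < arg->passes; i++)
        arg->local += 1.0;            /* stand-in for the timed test functions */

    pthread_mutex_lock(&total_lock);  /* one thread at a time updates common data */
    shared_total += arg->local;
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

/* Run `threads` workers and return the combined total. All threads are
   joined before returning, so the caller's elapsed time is set by the
   last thread to finish. */
double run_workers(int threads, int passes)
{
    pthread_t tid[MAX_THREADS];
    worker_arg args[MAX_THREADS];

    assert(threads >= 1 && threads <= MAX_THREADS);
    shared_total = 0.0;
    for (int t = 0; t < threads; t++) {
        args[t].passes = passes;
        int rc = pthread_create(&tid[t], NULL, worker, &args[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < threads; t++)
        pthread_join(tid[t], NULL);
    return shared_total;
}
```

Timing a call such as run_workers(4, passes) from start to return gives an elapsed time governed by the slowest thread, which is the basis of the measured speed described above.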

As with the single core version, the average Pi 4B performance gain over the Pi 3B+ was just over 2 times. Pi 4B speeds at 64 bits and 32 bits were closer, this time with the 32 bit version somewhat faster on some floating point calculations.
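
The claim that performance is proportional to the number of cores can be checked from the MWIPS column of the results. The helper names below are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>

/* Speedup of an N-thread run relative to the single-thread run. */
double speedup(double rate_n_threads, double rate_one_thread)
{
    assert(rate_one_thread > 0.0);
    return rate_n_threads / rate_one_thread;
}

/* Fraction of ideal scaling achieved on `cores` cores (1.0 = perfect). */
double efficiency(double rate_n_threads, double rate_one_thread, int cores)
{
    assert(cores > 0);
    return speedup(rate_n_threads, rate_one_thread) / cores;
}
```

For the Pi 4B at 64 bits, speedup(9476, 2395) is about 3.96 with four threads, or roughly 99% of ideal on four cores. Eight threads add little, as there are only four cores to run them.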

Code:
           MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Threads              1      2      3  MOPS   MOPS   MOPS   MOPS   MOPS

                            Gentoo Pi 3B+ 64 Bits

    1       1152    383    383    328   23.2   13.0   N/A    2721   1365
    2       2312    767    767    657   46.5   26.0   N/A    5461   2738
    4       4580   1506   1526   1304   92.0   51.6   N/A   10777   5449
    8       4788   1815   1961   1382   95.0   53.3   N/A   13827   5811

            Overall Seconds   4.96 1T,   4.95 2T,   5.05 4T,  10.07 8T
   
                            Gentoo Pi 4B 64 Bits

    1       2395    536    538    397   60.8   39.0   N/A    4483    997
    2       4784   1062   1079    794  121.2   77.9   N/A    8932   1990
    4       9476   2125   2080   1568  240.8  155.3   N/A   17718   3962
    8       9834   2631   2744   1630  243.6  160.1   N/A   22265   4053

            Overall Seconds   4.99 1T,   5.01 2T,   5.12 4T,  10.17 8T

                             Pi 4B/3B+ 64 Bits
 
    1       2.08   1.40   1.41   1.21   2.62   3.00   N/A    1.65   0.73
    2       2.07   1.39   1.41   1.21   2.61   3.00   N/A    1.64   0.73
    4       2.07   1.41   1.36   1.20   2.62   3.01   N/A    1.64   0.73
    8       2.05   1.45   1.40   1.18   2.56   3.00   N/A    1.61   0.70

                           Raspbian Pi 4B 32 Bits

    1       2059    673    680    311   55.6   33.1   7462   2245    995
    2       4117   1342   1391    624  110.7   65.9  14887   4467   1986
    4       7910   2652   2722   1180  208.5  132.6  29291   8952   3832
    8       8652   3057   2971   1268  233.2  149.6  38368  11923   3942

            Overall Seconds   4.99 1T,   5.01 2T,   5.29 4T,  10.71 8T

                            Pi 4B 64 bits/32 bits

    1       1.16   0.80   0.79   1.28   1.09   1.18   N/A    2.00   1.00
    2       1.16   0.79   0.78   1.27   1.09   1.18   N/A    2.00   1.00
    4       1.20   0.80   0.76   1.33   1.15   1.17   N/A    1.98   1.03
    8       1.14   0.86   0.92   1.28   1.04   1.07   N/A    1.87   1.03



MP Dhrystone Benchmark

This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance with not much gain using multiple cores.

The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those for the Pi 3B+. The single thread Pi 4B 64 bit/32 bit speed ratio was also similar to that during the single core tests.
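
The VAX MIPS ratings are conventionally obtained by dividing Dhrystones per second by 1757, the score of the VAX 11/780 reference machine. The sketch below assumes this benchmark uses that divisor, which agrees with the logged Pi 3B+ figures (7391586 Dhrystones per second against 4207 VAX MIPS). The function name is hypothetical.

```c
#include <assert.h>

/* Conventional reference: a VAX 11/780 is rated at 1757 Dhrystones per second. */
#define VAX_DHRYSTONES_PER_SECOND 1757.0

/* Convert a Dhrystones per second score to a VAX MIPS rating. */
double vax_mips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND;
}
```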

Code:
 MP-Dhrystone Benchmark armv8 64 Bit Fri Aug 23 00:44:05 2019

                  Using 1, 2, 4 and 8 Threads

Threads                       1        2        4        8

Seconds                     0.54     0.67     1.23     2.46
Dhrystones per Second    7391586 11954301 11300304 13028539
VAX MIPS Pi 3B+ 64 bits     4207     6804     7401     7415
 
VAX MIPS Pi 4B  64 bits     8880     7828     8303     8314

Pi 4B/3B+       64 bits     2.11     1.15     1.12     1.12

VAX MIPS Pi 4B  32 bits     5539     5739     6735     7232

Pi 4B   64 bits/32 bits     1.60     1.36     1.23     1.15



MP Linpack Benchmark (Single Precision NEON)

This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, MP performance was much slower than running as a single thread. The main reasons appear to be updating data in RAM to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.
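
To illustrate where the start/stop overhead comes from, here is a sketch of a y = y + a*x loop divided across POSIX threads that are created and joined on every call. It is an assumption about the general shape of such code, not the benchmark's source, and the names axpy_threaded and axpy_slice are hypothetical. With N = 100 each call does very little arithmetic, so thread creation dominates the running time.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_THREADS 8

typedef struct {
    int start, end;      /* half-open index range [start, end) */
    float a;
    const float *x;
    float *y;
} axpy_slice;

static void *axpy_worker(void *p)
{
    axpy_slice *s = p;
    for (int i = s->start; i < s->end; i++)
        s->y[i] += s->a * s->x[i];
    return NULL;
}

/* y = y + a*x, split into `threads` contiguous slices. Threads are
   created and joined on every call, which is the overhead in question. */
void axpy_threaded(int n, float a, const float *x, float *y, int threads)
{
    pthread_t tid[MAX_THREADS];
    axpy_slice slice[MAX_THREADS];

    assert(n >= 0 && threads >= 1 && threads <= MAX_THREADS);
    for (int t = 0; t < threads; t++) {
        slice[t].start = (int)((long long)n * t / threads);
        slice[t].end   = (int)((long long)n * (t + 1) / threads);
        slice[t].a = a;
        slice[t].x = x;
        slice[t].y = y;
        int rc = pthread_create(&tid[t], NULL, axpy_worker, &slice[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < threads; t++)
        pthread_join(tid[t], NULL);
}
```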

This benchmark uses the same NEON Intrinsic Functions as the single core program, with similar speeds at N = 100, without the threading overheads, but decreasing with larger data sizes, involving RAM accesses.

The full logged output is shown for the first entry, to demonstrate error checking facilities. The sumchecks were identical from the Pi 3B+ and Pi 4B at Gentoo 64 bits, but those from the Raspbian 32 bit test were different, as shown below. Ignoring the slow threaded results, performance ratios of CPU speed limited tests were similar to the single core version.

Code:
           Gentoo Pi 3B+ 64 Bits

 Linpack Single Precision MultiThreaded Benchmark
  64 Bit NEON Intrinsics, Fri Aug 23 00:45:54 2019

   MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads     None      1       2       4

N  100    642.56   66.69   66.05   65.54
N  500    479.48  274.36  274.85  269.07
N 1000    363.77  316.17  310.37  316.71

 NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1

 N              100             500            1000

 NR            1.97            5.40           13.51
 RE  4.69621336e-05  6.44138840e-04  3.22485110e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -1.31130219e-05  5.79357147e-05 -3.08930874e-04
 XN -1.30534172e-05  3.51667404e-05  1.90019608e-04

Thread
 0 - 4 Same Results    Same Results    Same Results


           Gentoo Pi 4B 64 Bits

N  100   2252.70   97.25   97.43   97.41
N  500   1628.24  665.21  646.63  674.38
N 1000    399.87  406.80  405.84  399.54


           Pi 4B/3B+ 64 Bits

N  100      3.51    1.46    1.48    1.49
N  500      3.40    2.42    2.35    2.51
N 1000      1.10    1.29    1.31    1.26


           Raspbian Pi 4B 32 Bits

N  100   1921.53  108.66  101.88  102.46
N  500   1548.81  530.23  714.37  733.09
N 1000    399.94  378.11  364.78  398.21

           Pi 4B 64 bits/32 bits

N  100      1.17    0.89    0.96    0.95
N  500      1.05    1.25    0.91    0.92
N 1000      1.00    1.08    1.11    1.00

           32 bit numeric results

 N              100             500            1000

 NR            2.17            5.42            9.50
 RE  5.16722466e-05  6.46698638e-04  2.26586126e-03
 MA  1.19209290e-07  1.19209290e-07  1.19209290e-07
 X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
 XN -5.06639481e-06 -4.70876694e-06  1.41978264e-04


MP BusSpeed (read only) Benchmark

With this version, each thread accesses all of the data in separate sections, covering caches and RAM, starting at different points. See the single processor BusSpeed details regarding burst reading, which can account for significant differences in speed.
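
As a rough sketch of the burst reading effect, the function below ANDs together every inc-th word of a section. It assumes the Inc32 to Inc2 columns of the results correspond to address increments of 32 to 2 words and RdAll to an increment of 1. The function name is hypothetical and the real benchmark unrolls its loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AND together every `inc`-th word in [start, end). inc = 1 reads all
   data; larger increments use only part of each burst or cache line. */
uint32_t and_stride(const uint32_t *array, size_t start, size_t end, size_t inc)
{
    uint32_t andsum = 0xFFFFFFFFu;

    assert(inc > 0);
    for (size_t i = start; i < end; i += inc)
        andsum &= array[i];
    return andsum;
}
```

With a large increment, each word used still costs a full burst transfer from RAM, which is consistent with the low MB/second figures in the Inc32 column at 12288 KB.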

Comparisons are provided for RdAll, at 1, 2, 4 and 8 threads. Pi 4B/3B+ performance ratios were similar to those for the single core tests. There was an exception with two threads, on the Pi 4B, using RAM at 64 bits, probably due to caching effects and not seen on subsequent repeated tests.

Note particularly that performance was significantly better using the 32 bit Raspbian compiler. Below are examples of disassembly, showing that the 64 bit Pi 4B code employed scalar operations, using 32 bit w registers, with the 32 bit version benefiting from 128 bit q registers, for Single Instruction Multiple Data (SIMD) operation. Compile options are included below. Alternatives were also tried for the 64 bit Pi 4B compile, but failed to implement SIMD operation.

Code:
                  Gentoo Pi 3B+ 64 Bits

 MP-BusSpd armv8 64 Bit Fri Aug 23 00:47:43 2019

   MB/Second Reading Data, 1, 2, 4 and 8 Threads
   Staggered starting addresses to avoid caching

KB Threads  Inc32   Inc16   Inc8    Inc4    Inc2    RdAll

 12.3 1T     3138    2822    3044    2383    1708    1737
      2T     5354    4865    5647    4519    3303    3362
      4T     7922    7504    9717    6794    6216    6597
      8T     5125    4159    6987    6696    5350    5195
122.9 1T      640     666    1191    1864    1627    1712
      2T     1008    1018    1926    3496    3268    3387
      4T      962    1042    2157    4259    6427    4372
      8T     1031    1047    2147    3952    6317    6514
12288 1T      124     114     260     527    1016    1363
      2T      137     138     275     487     946    2182
      4T      105     118     240     409     975    2158
      8T      108     117     236     504    1077    2051

                                                          RdAll
                      Gentoo Pi 4B 64 Bits              Pi 4B/3B+

 12.3 1T    4864    4879    5378    4379    4115    4221    2.43
      2T    8159    6924    9179    8006    7689    7837    2.33
      4T   12677   11531   14850   12554   13807   14794    2.24
      8T    7398    6927   10881   11675   11497   13075    2.52
122.9 1T     665     926    1869    2714    3557    4152    2.43
      2T     610     696    1549    4898    7188    8184    2.42
      4T     476     865    1885    4107    8058   14617    3.34
      8T     474     883    1848    3919    7939   13633    2.09
12288 1T     202     210     514    1044    2033    3616    2.65
      2T     258     425     853    1551    3693    6228    2.85
      4T     217     346     497    1024    2181    3789    1.76
      8T     220     275     540    1030    1937    3577    1.74


                      Raspbian Pi 4B 32 Bits              64b/32b

 12.3 1T    5263    5637    5809    5894    5936   13445    0.31
      2T    9412   10020   10567   11454   11604   24980    0.31
      4T   16282   15577   16418   21222   20000   45530    0.32
      8T   11600   13285   16070   18579   20593   36837    0.35
122.9 1T     739     956    1888    3153    5008    9527    0.44
      2T     629    1158    1568    5058    9509   16489    0.50
      4T     600    1093    2134    4527    8732   16816    0.87
      8T     593    1104    2121    4382    8629   17158    0.79
12288 1T     238     258     518    1005    2001    4029    0.90
      2T     278     228     453    1690    1826    3628    1.72
      4T     269     257     740    1019    1790    4145    0.91
      8T     233     292     532     926    2186    3581    1.00


MP BusSpeed Disassembly

Code:
        Source Code 64 AND instructions in main loop
 
   for (i=start; i<end; i=i+64)
   {
       andsum1[t] = andsum1[t]
           & array[i   ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
           & array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
    To
           & array[i+56] & array[i+57] & array[i+58] & array[i+59]
           & array[i+60] & array[i+61] & array[i+62] & array[i+63];
   }


Pi 32 Bit Raspbian Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -mcpu=cortex-a7
           -mfloat-abi=hard -mfpu=neon-vfpv4 -o MP-BusSpd2PiA7


Pi 64 Bit Gentoo Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -march=armv8-a -o MP-BusSpd2Pi64

Parameters also tried
-march=armv8-a+crc -mtune=cortex-a72 -ftree-vectorize -O2 -pipe
-fomit-frame-pointer




Pi 32 Bit Disassembly          Pi 64 Bit Disassembly

vld1.32 {q6}, [lr]              ldp     w30, w17, [x0, 52]
vld1.32 {q7}, [r6]              and     w18, w18, w30
vand    q10, q10, q6            and     w1, w1, w18
vld1.32 {q6}, [r0]              ldp     w18, w30, [x0, 60]
vand    q9, q9, q7              and     w17, w17, w18
vand    q12, q12, q6            and     w1, w1, w17
vld1.32 {q7}, [ip]              ldp     w17, w18, [x0, 68]
vld1.32 {q6}, [r7]              and     w30, w30, w17
add     r1, r3, #96             and     w1, w1, w30
add     r6, r3, #144            ldp     w30, w17, [x0, 76]
vand    q11, q11, q7            and     w18, w18, w30
vand    q14, q14, q6            and     w1, w1, w18
vld1.32 {q7}, [r1]              ldp     w18, w30, [x0, 84]
vld1.32 {q6}, [r6]              and     w17, w17, w18


MP RandMem Benchmark
This benchmark potentially reads and writes all data, in sections covering caches and RAM, with each thread starting at a different address. Random access can select any address after that. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.

The Pi 4B provided variable gains over the Pi 3B+ at 64 bits, with smaller gains on the Pi 4B from 64 bits over 32 bits.
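
The difference between the serial and random tests can be sketched as follows. This is an illustration under assumptions, not the benchmark's code: the function names are hypothetical, and a small linear congruential generator stands in for whatever address selection the real program uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fill idx[0..n) with pseudo-random positions in [0, n) using a small
   LCG. The constants are common textbook ones, chosen for illustration. */
void fill_random_index(uint32_t *idx, size_t n, uint32_t seed)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        seed = seed * 1664525u + 1013904223u;
        idx[i] = (uint32_t)(seed % n);
    }
}

/* Serial read: consecutive addresses, which benefits from burst transfers. */
uint32_t serial_read(const uint32_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Random read: each access goes through the index array, so a whole
   burst may be fetched in order to use a single word. */
uint32_t random_read(const uint32_t *data, const uint32_t *idx, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[idx[i]];
    return sum;
}

/* Random read/write: read a word, modify it and store it back. */
void random_read_write(uint32_t *data, const uint32_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[idx[i]] += 1u;
}
```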

Code:
 MP-RandMem armv8 64 Bit Aug 2019  Using 1, 2, 4 and 8 Threads

         Serial Serial Random Random Serial Serial Random Random
KB+Thread  Read   RdWr   Read   RdWr     Read   RdWr   Read   RdWr

Gentoo Pi 4B 64 Bits

 12.3 1T    5922   7871   5892   7857
      2T   11856   7882  11902   7923
      4T   22964   7821  22276   7832
      8T   23225   7751  22082   7717
122.9 1T    5827   7276   2052   1921
      2T   10965   7258   1754   1924
      4T   10969   7232   1848   1929
      8T   10896   7158   1834   1909
12288 1T    3879   1052    188    170
      2T    4848    935    218    168
      4T    4684    943    332    170
      8T    3982   1049    340    171

Gentoo Pi 3B+ 64 Bits                    Raspbian Pi 4B 32 Bits

 12.3 1T    4901   3587   4912   3585     5860   7905   5927   7657
      2T    8749   3564   8719   3556    11747   7908  11182   7746
      4T   17108   3504  17160   3505    21416   7626  17382   7731
      8T   16885   3475  16650   3485    20649   7528  20431   7378
122.9 1T    3921   3339   1010    974     5479   7269   1826   1923
      2T    7360   3350   1814    972    10355   6964   1667   1920
      4T   12199   3313   2281    969     9808   7177   1715   1908
      8T   12089   3313   2279    968    11677   7058   1697   1919
12288 1T    2024    828     83     67     3438   1271    179    152
      2T    2169    820    142     67     4176   1204    213    167
      4T    2178    818    154     67     4227   1117    337    161
      8T    2219    821    161     67     3479   1093    287    168

                         4 Thread  Comparisons
            Pi 4B/3B+ 64 Bits             Pi 4B 64 bits/32 bits

 12.3 4T    1.34   2.23   1.30   2.23     1.07   1.03   1.28   1.01
122.9 4T    0.90   2.18   0.81   1.99     1.12   1.01   1.08   1.01
12288 4T    2.15   1.15   2.16   2.54     1.11   0.84   0.99   1.06
 


MP-MFLOPS Benchmarks


MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f -- more. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. Versions are available using single precision and double precision data, plus one with NEON intrinsic functions. The numeric results are converted into a simple sumcheck, that should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.
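
A sketch of the two kinds of calculation, assuming single precision data. The function names and constants are hypothetical. The second function implements only the eight operations quoted above; the 32 operation test continues the same pattern with further terms, which are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* 2 operations per word: one add and one multiply. */
void ops2_per_word(float *x, size_t n, float a, float b)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;
}

/* 8 operations per word: three adds, three multiplies, one subtract
   and one further add, as in the quoted part of the expression. */
void ops8_per_word(float *x, size_t n,
                   float a, float b, float c, float d, float e, float f)
{
    assert(x != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

In a threaded version, each thread would call the same function on its own segment of the array, for example x + t * (n / threads) for thread t.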

There can be wide variations in speeds, affected by the short running times and by such factors as cached data variations. To help in interpreting results, comparisons are provided using one and four threads. These indicate that, with cache based data, the Pi 4B was more than 3.5 times faster than the Pi 3B+ at two operations per word, but less so at 32 operations.

The 64 bit and 32 bit comparisons were, no doubt, influenced by the particular compiler version used, and this is reflected in the main disassembled code shown below, for 32 operations per word. The 32 bit compile included -mfpu=neon-vfpv4, but NEON was not implemented, resulting in scalar operation, using single word s registers. I have another version, compiled with -funsafe-math-optimizations, that produces NEON instructions, with similar performance to the 64 bit version, but more sumcheck differences.

The benchmark compiled to use NEON Intrinsic Functions does not include any that specify fused multiply and add operations, reducing maximum possible speed. The 64 bit compiler converts the functions to include fused instructions, providing the fastest speeds.

The main compiler independent feature that provides a clear advantage to 64 bit operation is that the CPU, at 32 bits, does not support double precision SIMD (NEON) operation, with single word d registers being compiled. On the other hand, the performance gain does not appear to meet the potential. This suggests that there are other limiting factors - see disassembly below.

Code:
                              Single Precision

          MP-MFLOPS armv8 64Bit Thu Aug 22 19:50:10 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

    2 Ops/Word         32 Ops/Word        2 Ops/Word         32 Ops/Word
KB  12.8  128  12800   12.8    128 12800  12.8  128  12800   12.8   128  12800

    Gentoo Pi 4B 64 Bits MFLOPS

1T  2908  2854   459   5778   5734  5405
2T  5700  5311   457  10935  11212  7968
4T 10375  5588   490  18181  21842  7637
8T  9675  8460   511  20128  20567  8568

    Gentoo Pi 3B+ 64 Bits MFLOPS           Raspbian Pi 4B 32 Bits MFLOPS

1T   792   806   373   1780   1783  1724   987   993   606   2816   2794  2804
2T  1482  1596   382   3542   3509  3380  1823  1837   567   5610   5541  5497
4T  2861  2742   429   5849   7013  5465  2119  3349   647   9884  10702  9081
8T  2770  2877   429   6434   6700  6101  3136  3783   609  10230  10504  9240

                     4 Thread  Comparisons
    Pi 4B/3B+ 64 Bits                     Pi 4B 64 bits/32 bits

1T  3.67  3.54  1.23   3.25   3.22  3.14  2.95  2.87  0.76   2.05   2.05  1.93
4T  3.63  2.04  1.14   3.11   3.11  1.40  4.90  1.67  0.76   1.84   2.04  0.84

                                   Double Precision

       MP-MFLOPS armv8 64Bit Double Precision Thu Aug 22 19:51:42 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

    2 Ops/Word         32 Ops/Word        2 Ops/Word         32 Ops/Word
KB  12.8  128  12800   12.8   128  12800  12.8  128  12800   12.8   128  12800

    Gentoo Pi 4B 64 Bits MFLOPS

1T  1464  1386   225   3398   3386  3182
2T  2837  2792   228   6720   6741  4547
4T  5172  3414   251  10405  12762  4763
8T  4774  4353   275  11506  12118  4865

    Gentoo Pi 3B+ 64 Bits MFLOPS           Raspbian Pi 4B 32 Bits MFLOPS

1T   415   386   206   1400   1403  1333  1187  1220   309   2682   2714  2701
2T   820   813   209   2804   2767  2597  2420  2416   282   5379   5415  4780
4T  1328  1323   212   5433   5340  2465  4665  2381   317  10256  10336  5242
8T  1343  1308   214   5090   5006  3280  4385  3114   310   9721  10340  5131

                     4 Thread  Comparisons
    Pi 4B/3B+ 64 Bits                     Pi 4B 64 bits/32 bits


1T  3.53  3.59  1.09   2.43   2.41  2.39  1.23  1.14  0.73   1.27   1.25  1.18
4T  3.89  2.58  1.18   1.92   2.39  1.93  1.11  1.43  0.79   1.01   1.23  0.91



                        NEON Single Precision

     MP-MFLOPS NEON Intrinsics 64 Bit Thu Aug 22 19:52:48 2019

          FPU Add & Multiply using 1, 2, 4 and 8 Threads

    2 Ops/Word         32 Ops/Word        2 Ops/Word         32 Ops/Word
KB  12.8  128  12800   12.8   128  12800  12.8  128  12800   12.8   128  12800

    Gentoo Pi 4B 64 Bits MFLOPS

1T  3311  3192   535   6442   6548  6198
2T  4607  6186   552  13030  13012  8468
4T  6279  5725   562  23798  24128  9374
8T  7815 12044   486  22725  21712  9395

    Gentoo Pi 3B+ 64 Bits MFLOPS           Raspbian Pi 4B 32 Bits MFLOPS

1T   830   823   406   2989   2986  2792  2491  2399   615   4325   4285  4261
2T  1575  1498   414   5981   5872  5445  5629  5520   591   8602   8463  8308
4T  2217  2650   431  11661  11644  6061 10580  5594   553  16991  16493  9124
8T  2733  3197   437  10505  10637  6708  7047 10785   513  14325  16219  8867

                      4 Thread  Comparisons
    Pi 4B/3B+ 64 Bits                     Pi 4B 64 bits/32 bits

1T  3.99  3.88  1.32   2.16   2.19  2.22  1.33  1.33  0.87   1.49   1.53  1.45
4T  2.83  2.16  1.30   2.04   2.07  1.55  0.59  1.02  1.02   1.40   1.46  1.03


MP-MFLOPS Disassembly

On the Pi 4B, with single precision floating point and SIMD, four word registers were used (see 4s below). With this, four results of calculations might be expected per clock cycle, or 6 GFLOPS per core at 1.5 GHz, and up to 24 GFLOPS using all four cores. Instructions such as fused multiply and add could then double the speed, to 12 GFLOPS per core. For the mix of instructions below, expectations might be 70% of this, or 8.4 GFLOPS. Using double precision, with two words in the 128 bit registers, expectations might be half that, at 4.2 GFLOPS per core, with this code.
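
The arithmetic behind these expectations can be written out as below. This is only a sketch: it assumes one SIMD instruction completes per clock cycle, as the estimate above does, and the function name is hypothetical.

```c
#include <assert.h>

/* Theoretical peak in GFLOPS for one core: clock (GHz) x SIMD lanes x
   operations per lane per cycle (1 for add or multiply, 2 for fused). */
double peak_gflops(double clock_ghz, int lanes, int ops_per_lane)
{
    assert(clock_ghz > 0.0 && lanes > 0 && ops_per_lane > 0);
    return clock_ghz * lanes * ops_per_lane;
}
```

With a 1.5 GHz clock, peak_gflops(1.5, 4, 1) gives 6 and peak_gflops(1.5, 4, 2) gives 12 for single precision. Taking 70% of the fused figure gives the 8.4 GFLOPS estimate, and the same calculation with two double precision lanes gives 4.2.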

Code:
SP NEON 24.1 GFLOPS 6.55 1 core          DP 12.7 GFLOPS - 3.39 1 core

.L41:                                   .L84:
ldr     q1, [x1]                        ldr     q16, [x2, x0]
ldr     q0, [sp, 64]                    add     w3, w3, 1
fadd    v18.4s, v20.4s, v1.4s           cmp     w3, w6
fadd    v17.4s, v22.4s, v1.4s           fadd    v15.2d, v16.2d, v14.2d
fadd    v0.4s, v0.4s, v1.4s             fadd    v17.2d, v16.2d, v12.2d
fadd    v16.4s, v24.4s, v1.4s           fmul    v15.2d, v15.2d, v13.2d
fadd    v7.4s, v26.4s, v1.4s            fmls    v15.2d, v17.2d, v11.2d
fadd    v6.4s, v28.4s, v1.4s            fadd    v17.2d, v16.2d, v10.2d
fadd    v5.4s, v30.4s, v1.4s            fmla    v15.2d, v17.2d, v9.2d
fmul    v0.4s, v0.4s, v19.4s            fadd    v17.2d, v16.2d, v8.2d
fadd    v4.4s, v10.4s, v1.4s            fmls    v15.2d, v17.2d, v31.2d
fadd    v3.4s, v12.4s, v1.4s            fadd    v17.2d, v16.2d, v30.2d
fadd    v2.4s, v14.4s, v1.4s            fmla    v15.2d, v17.2d, v29.2d
fadd    v1.4s, v8.4s, v1.4s             fadd    v17.2d, v16.2d, v28.2d
fmls    v0.4s, v21.4s, v18.4s           fmls    v15.2d, v17.2d, v0.2d
fmla    v0.4s, v23.4s, v17.4s           fadd    v17.2d, v16.2d, v27.2d
fmls    v0.4s, v25.4s, v16.4s           fmla    v15.2d, v17.2d, v26.2d
fmla    v0.4s, v27.4s, v7.4s            fadd    v17.2d, v16.2d, v25.2d
fmls    v0.4s, v29.4s, v6.4s            fmls    v15.2d, v17.2d, v24.2d
fmla    v0.4s, v31.4s, v5.4s            fadd    v17.2d, v16.2d, v23.2d
fmls    v0.4s, v9.4s, v1.4s             fmla    v15.2d, v17.2d, v22.2d
fmla    v0.4s, v4.4s, v11.4s            fadd    v17.2d, v16.2d, v21.2d
fmls    v0.4s, v3.4s, v13.4s            fadd    v16.2d, v16.2d, v19.2d
fmla    v0.4s, v2.4s, v15.4s            fmls    v15.2d, v17.2d, v20.2d
str     q0, [x1], 16                    fmla    v15.2d, v16.2d, v18.2d
cmp     x1, x0                          str     q15, [x2, x0]
bne     .L41                            add     x0, x0, 16
                                        bcc     .L84


                     32 bit    64 bit    32 bit     64 bit   32 bit    64 bit
                         SP        SP        DP        DP   NEON SP   NEON SP

Maximum GFLOPS         10.7      21.8      10.3      12.7      17.0      24.1

Instructions
Total                    27        39        26        27        67        27
Floating point           22        32        22        32        32        22

FP operations
Total                    32       128        32        64       128       128
Add or subtract          11        44        11        22        21        44
Multiply                  1         4         1         2        11         4
Fused                    20        80        20        40         0        80

Add example           fadds      fadd     faddd      fadd  vadd.f32      fadd
                        s16,   v15.4s,      d25,   v15.2d,       q9,    v1.4s,
                        s23,   v16.4s,      d17,   v16.2d,       q8,    v8.4s,
                        s2     v15.4s       d15    v14.2d        q14    v1.4s

Multiply example     fnmuls      fmul     fmuld      fmul  vmul.f32      fmul
                        s16,   v15.4s,      d16,   v15.2d,       q9,    v0.4s,
                         s3,   v15.4s,      d16,   v15.2d,       q9,    v0.4s,
                        s16    v17.4s       d5     v13.2d       q12    v19.4s

Fused example      vfma.f32      fmla  vfma.f64      fmla       N/A      fmla
                        s16,   v15.4s,      d16,   v15.2d,              v0.4s,
                        s29,   v17.4s,      d22,   v17.2d,              v4.4s,
                         s9     v0.4s       d28    v22.2d              v11.4s

FP registers used        32         4        32        25        16        32


MP-MFLOPS Sumchecks

Different instructions, such as SP versus DP, may not produce identical numeric results. Variations also depend on the number of passes, with the results here moving closer to 1.0 as data size increased. The only anomaly is the value marked -X below.

Code:

              2 Ops/Word              32 Ops/Word
  KB          12.8    128    12800    12.8     128   12800 

SP
4B/64    1T    76406   97075   99969   66015   95363   99951
3B/64    1T    76406   97075   99969   66015   95363   99951
4B/32    1T    76406   97075   99969   66015   95363   99951

DP      
4B/64    1T    76384   97072   99969   66065   95370   99951   
3B/64    1T    76384   97072   99969   66065   95370   99951   
4B/32    1T    76384   97072   99969   66065   95370   99951   

NEON SP
4B/64    1T    76406   97075   99969   66015   95363   99951   
3B/64    1T    76406   97075   99969   66015   95363   99951   
4B/32    1T    76406   97075   99969   66014-X 95363   99951   


OpenMP-MFLOPS Benchmark

This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks but, in addition, calculations with eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.

Following is an example of full output. The strange test names were carried forward from a 2014 CUDA benchmark, via Windows and Linux Intel CPU versions. Details are in the following GigaFLOPS Benchmarks report, covering MP-MFLOPS, QPAR and OpenMP. This showed nearly 100 GFLOPS from a Core i7 CPU and 400 GFLOPS from a GeForce GTX 650 graphics card, via CUDA.

https://www.webarchive.org.uk/wayback/archive/20151031003049/http://www.roylongbottom.org.uk/GigaFLOPS%20Benchmarks.htm

The detail is followed by MFLOPS results on Pi 3B+ and Pi 4B. The direct conversions of the code from large systems lead to excessive memory demands for Raspberry Pi systems, with too many tests dependent on RAM speed, and low MP performance gains. There were glimpses of the usual performance gains and a maximum of over 20 SP GFLOPS on a 64 bit Pi 4B.
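
The MFLOPS column in the log appears to follow from the other columns as words x operations per word x repeat passes, divided by seconds. That is an inference from the logged figures rather than something taken from the source code, and the function name is hypothetical.

```c
#include <assert.h>

/* MFLOPS from the logged columns: total floating point operations
   divided by elapsed seconds, in millions. */
double mflops(double words, double ops_per_word, double passes, double seconds)
{
    assert(seconds > 0.0);
    return words * ops_per_word * passes / seconds / 1.0e6;
}
```

For the first line of the log, mflops(100000, 2, 2500, 0.092836) comes to about 5386, matching the reported value.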

Code:
            OpenMP MFLOPS64 Thu Aug 22 19:54:59 2019

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.092836     5386    0.929538   Yes
 Data in & out    1000000     2      250   0.887743      563    0.992550   Yes
 Data in & out   10000000     2       25   0.917173      545    0.999250   Yes

 Data in & out     100000     8     2500   0.129858    15401    0.957117   Yes
 Data in & out    1000000     8      250   0.899561     2223    0.995518   Yes
 Data in & out   10000000     8       25   0.847036     2361    0.999549   Yes

 Data in & out     100000    32     2500   0.391602    20429    0.890215   Yes
 Data in & out    1000000    32      250   0.989877     8082    0.988088   Yes
 Data in & out   10000000    32       25   0.944493     8470    0.998796   Yes

                End of test Thu Aug 22 19:55:05 2019*


         --------------- MFLOPS -------------- -------- Compare --------
Mbytes/  Pi 3B+       Pi 4B        Pi 4B                    Pi 4B
Threads    64b          64b          32b        4b/3b       64/32b
           All   1CP    All    1CP   All   1CP   All   1CP    All    1CP
 
  0.4/2   2674   755   5386   2780  4716  2850  2.01  3.68   1.14   0.98
    4/2    411   404    563    557   556   429  1.37  1.38   1.01   1.30
   40/2    419   408    545    588   544   632  1.30  1.44   1.00   0.93

  0.4/8   7029  1886  15401   5555  7981  5191  2.19  2.95   1.93   1.07
    4/8   1656  1495   2223   2116  2389  2082  1.34  1.42   0.93   1.02
   40/8   1725  1507   2361   2310  2199  2003  1.37  1.53   1.07   1.15

 0.4/32   6648  1699  20429   5647  8147  5449  3.07  3.32   2.51   1.04
   4/32   5977  1616   8082   5445  7951  5385  1.35  3.37   1.02   1.01
  40/32   6027  1616   8470   5479  8030  5379  1.41  3.39   1.05   1.02


Next: More Multithreading Benchmarks
_________________
Regards

Roy
roylongbottom

Posted: Mon Sep 09, 2019 10:16 pm    Post subject: More Multithreading Benchmarks

More Multithreading Benchmarks

OpenMP-MemSpeed

This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with the example single core results also shown after the detailed measurements. Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP, with effects on Pi 3B+ and Pi 4B not the same. Detailed comparisons of these results are rather meaningless.
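
A sketch of how one of the test loops might look with an OpenMP directive, using the x[m]=x[m]+s*y[m] calculation from the first column group of the results. The function name is hypothetical and the real benchmark code may differ. Compiled without -fopenmp, the pragma is ignored and the loop runs on one core, which corresponds to the NotOpenMP build.

```c
#include <assert.h>
#include <stddef.h>

/* x[m] = x[m] + s*y[m] over n double precision words. With -fopenmp
   the iterations are divided among the available cores. */
void scaled_add(double *x, const double *y, double s, long n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (long m = 0; m < n; m++)
        x[m] = x[m] + s * y[m];
}
```

Each iteration does very little work per word read, so once the data no longer fits in cache the loop is limited by memory speed, and extra threads cannot help much.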

Code:
                         Gentoo Pi 3B+ 64 Bits

      Memory Reading Speed Test OpenMP 64 Bit by Roy Longbottom

               Start of test Fri Sep  6 12:44:14 2019

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    6302   3584   1853   9425   5287   2000  12537   4762   2141
       8    5122   3699   1897   8911   5620   2012  13017   5610   2111
      16    7283   3717   1873  10812   5659   1974  13006   6428   2080
      32    6953   3675   1762  10058   5515   1974  11998   6101   1997
      64    6967   3683   1836  10052   5531   1966  12142   6169   2054
     128    7021   3694   1848  10049   5544   2035   9932   6269   2091
     256    7048   3680   1908  10196   5593   1976   8831   6323   2067
     512    4986   1606   1722   6324   4189   1304   4509   4135   1647
    1024    2284   2692   1397   3932   2385   1321   1277   1268    942
    2048     981   2650   1398   1749   3360   1471    758    838   1043
    4096     874   1578   1355   3757   3398    909    760    852    756
    8192    1038   2585   1092   3805   1646   1243    857    751    870
   16384     917   2359   1734   1184   3151   1179    880    776    814
   32768    2983   1229   1916   1519   2880   1293    808    847   1373
   65536    3214   1259   1298   3018   1319   1286    894    857   1147
  131072     839    673    779    918    883    765   1263   1286    512

 Not OMP                                                               
       8    4694   2913   4841   6213   3944   4844   5402   4337   4337
     256    3791   2572   3921   4428   3223   3922   4941   4065   4070
   65536    1064   1070   1106   1075   1086   1028    763    849    847


                         Gentoo Pi 4B 64 Bits

     Memory Reading Speed Test OpenMP 64 Bit by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7854   9082   3171   7660   9140   2232  30162  15534   2692
       8    8238   8906   3150   8308   9253   2339  29033  15749   2673
      16    8217   8964   3136   8408   9044   2350  31531  15867   2472
      32    8598   8192   3085   8547   8094   2387  17252  14505   2377
      64    9084   8654   3130   8902   8606   2410  18959  14678   2393
     128   11338  11686   3091  11811  11261   2858  14852  15240   2361
     256   16320  17582   3236  17404  16671   2652  13683  14741   2411
     512   17581  18033   3089  16204  18086   2758  12921  10441   2331
    1024   14527  13629   2891  15196  13782   2682   4323   6169   2272
    2048    5018   7240   3120   7328   7241   2512   3370   3428   2215
    4096    4054   7200   3135   7330   5612   2916   2775   2703   2196
    8192    2130   2261   3867   7731   7527   3823   2701   2615   2184
   16384    3795   4552   3364   2106   7417   3397   1793   2709   2100
   32768    2065   6760   3327   7215   7144   3797   2108   2376   2242
   65536    2462   2245   2390   7160   3945   2742   2746   2386   2259
  131072    3276   3526   2324   8110   1927   2882   2584   2719   1965

 Not OMP                                                               
       8   15527  13976  15533  15504  14021  15537  11563   9311   7794
     256   12236  11434  12096  12084  11740  12156   7883   8044   7818
   65536    2047   2046   2037   2034   2054   2071   2567   2554   2547




                        Raspbian Pi 4B 32 Bits

     Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7650   6427   1247   7389   6401   1587  39558  19538    883
       8    7777   6511   1263   7655   6534   1586  39076  19920    890
      16    8180   7840   1275   8412   7870   1566  38490  20039    846
      32    9355   9127   1295   9685   9266   1612  36718  19878    862
      64    8949   8763   1223   8971   8727   1566  14949  14827    844
     128   12241  11610   1247  12328  12303   1742  13945  15134    876
     256   17543  14765   1300  18010  17894   1748  12710  13167    839
     512   18252  15466   1265  18030  16934   1651  12814  12407    874
    1024    9044  12367   1432  12278  12201   1641   6907   9438    846
    2048    6975   6620   1521   7031   6999   1676   3073   3365    797
    4096    3539   7303   1440   7267   7247   1730   2348   3165    831
    8192    7547   7759   1369   7608   7659   1762   2622   3133    904
   16384    3877   7559   1329   7987   5744   1506   2514   3136    850
   32768    7391   3974   1317   7290   6655   1763   2586   3102    921
   65536    8209   7779   1341   7856   7290   1805   2445   2834    851
  131072    5086   7344   1280   3475   5222   1688   2358   2968    830

 Not OMP                                                               
       8    8603  11757  13383   8607  11754  13384   7827   7796   7796
     256    8312   9879   9991   8355   9988   9993   7530   7803   7805
   65536    2098   2073   2081   2087   2077   2068   2590    961    965
 


Stress Testing Programs Benchmarking Mode

My latest stress testing programs have parameters that specify running time, data size, number of threads, log file number and, in two cases, processing density. When run without parameters, the full range of options is used, providing a useful benchmark. Log file results from Pi 4B tests, and comparisons, are provided below. The programs are available in:

https://www.researchgate.net/profile/Roy_Longbottom/project/Performance-of-Raspberry-Pi-and-Android-Devices/attachment/5cc08ba43843b01b9b9c4f64/AS:751246503858192@1556122532665/download/Raspberry-Pi-Stress-2019.tar.gz?context=ProjectUpdatesLog

Integer Stress Test-Benchmark

The integer program test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Disassembly shows that the test loop, in fact, used 68 instructions, most of the additional ones being register loads. That gives 68/32 instructions per 4 byte word. At the maximum of 1943M words per second (7771 MB/second), using a single core, the resulting execution speed was 4129 MIPS, with nearly four times that using all cores.
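
As a rough check of the MIPS arithmetic above, here is a minimal Python sketch. It assumes the 7771 MB/second single thread, 16 KB result from the table below, 4 byte words and the 68/32 instruction ratio from the disassembly. The function name is illustrative only and is not part of the benchmark.

```python
def mips_from_mb_per_sec(mb_per_sec, instructions=68, words=32, bytes_per_word=4):
    """Convert a measured MB/second figure into approximate MIPS."""
    # Millions of 4 byte words processed per second
    m_words_per_sec = mb_per_sec / bytes_per_word
    # Each group of 32 words costs 68 instructions in the disassembled loop
    return m_words_per_sec * instructions / words


single = mips_from_mb_per_sec(7771)    # one thread, 16 KB data
four = mips_from_mb_per_sec(28715)     # four threads, 16 KB data
print(f"{single:.0f} MIPS single core, {four / single:.2f} times that on four threads")
```

The single core value lands within 1 MIPS of the quoted 4129, the small difference coming from rounding 1942.75M words per second up to 1943M.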

The tables below show speeds on the systems considered, along with average Pi 4B 64 bit performance gains, which are somewhat limited in this case.

Code:
                 Gentoo Pi 4B 64 Bits

  MP-Integer-Test 64 Bit v1.0 Fri Sep  6 16:33:36 2019

      Benchmark 1, 2, 4, 8, 16 and 32 Threads

                   MB/second
                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   4.3    1   7771  7352  3895  00000000    Yes
   3.3    2  15467 14218  3714  FFFFFFFF    Yes
   3.0    4  28715 26652  3345  5A5A5A5A    Yes
   3.0    8  30292 26310  3334  AAAAAAAA    Yes
   3.0   16  29466 28503  3337  CCCCCCCC    Yes
   3.0   32  29351 30358  3390  0F0F0F0F    Yes


              Pi 4B 32 bit MB/sec        Pi 3B+ 64 bit MB/sec

               KB      KB      MB         KB      KB      MB
               16     160      16         16     160      16
   Threads
        1    5964    5756    3931       4823    3884    1209
        2   11787   11430    3748       9613    7709    1908
        4   23214   22060    3456      17737   15137    1779
        8   22197   22171    3472      17651   18692    1767
       16   22671   23299    3256      18255   18793    1757
       32   21379   21881    3346      18246   18674    1748

             Pi 4B 64b/32b              64b Pi 4B/3B+
Average
Gain         1.31    1.25    0.99       1.63    1.67    2.13
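
The average gains can be reproduced from the per-thread figures. The sketch below assumes each gain is the mean of the per-thread-count speed ratios, an assumption that matches the 16 KB column above. The names are illustrative, not taken from the benchmark source.

```python
def average_gain(new, old):
    """Mean of the per-row speed ratios new/old."""
    ratios = [n / o for n, o in zip(new, old)]
    return sum(ratios) / len(ratios)


# 16 KB column, MB/second, one entry per thread count
pi4_64bit = [7771, 15467, 28715, 30292, 29466, 29351]
pi4_32bit = [5964, 11787, 23214, 22197, 22671, 21379]
pi3_64bit = [4823, 9613, 17737, 17651, 18255, 18246]

gain_64_over_32 = average_gain(pi4_64bit, pi4_32bit)
gain_4b_over_3b = average_gain(pi4_64bit, pi3_64bit)
print(f"{gain_64_over_32:.2f} {gain_4b_over_3b:.2f}")
```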


Single Precision Floating Point Stress Test-Benchmark

This and the double precision program carry out the same calculations as MP-MFLOPS, but are slightly faster by including a loop that repeats the tests within the calculate functions. Maximum speeds were 6.75 GFLOPS, using one core, and 26.7 GFLOPS with four cores.

These programs were compiled with a later compiler than the one used for MP-MFLOPS, resulting, at least, in similar speeds between the 32 bit and 64 bit versions. The Pi 4B/3B+ performance improvements indicated were typical.

Code:
  MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep  6 16:30:12 2019

             Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   1.7    T1   2  2819  2874   504   40392  76406  99700
   3.2    T2   2  5592  5702   511   40392  76406  99700
   4.6    T4   2  9223  7520   519   40392  76406  99700
   6.0    T8   2  9520 10471   545   40392  76406  99700
   8.2    T1   8  5381  5595  2050   54764  85092  99820
   9.8    T2   8 11039 10883  2173   54764  85092  99820
  11.3    T4   8 19087 21040  2044   54764  85092  99820
  12.9    T8   8 19747 21107  2016   54764  85092  99820
  17.5    T1  32  6693  6753  6377   35206  66015  99520
  20.2    T2  32 13491 13464  8710   35206  66015  99520
  22.2    T4  32 25732 26704  9160   35206  66015  99520
  24.1    T8  32 25708 25770  8927   35206  66015  99520

            End of test Fri Sep  6 16:30:37 2019


              Pi 4B 32 bit               Pi 3B+ 64 bit
Threads       KB      KB      MB         KB      KB      MB
Ops/wd       12.8     128    12.8       12.8     128    12.8

T1   2       2641    2607     646        838     826     373
T2   2       5089    5116     618       1659    1650     380
T4   2       8282    8522     683       2584    3296     384
T8   2       8756    9847     686       3013    3056     391
T1   8       5543    5428    2597       1981    1972    1354
T2   8      10754   10603    2711       3936    3923    1518
T4   8      18716   20823    2844       7482    7396    1531
T8   8      19859   21684    2555       7399    7705    1534
T1  32       5309    5274    5265       2820    2809    2462
T2  32      10557   10509    9991       5636    5583    4754
T4  32      20416   20919   11340      10640   10882    6020
T8  32      20072   19787    9330      10641   10926    6159

              Average Pi 4B Performance Gains

  Ops/Word      Pi 4B 64b/32b              64b Pi 4B/3B+

        2    1.09    1.04    0.79       3.37    3.16    1.36
        8    1.00    1.01    0.77       2.69    2.80    1.40
       32    1.27    1.29    0.96       2.40    2.41    1.85


Double Precision Floating Point Stress Test-Benchmark

Maximum measured DP speeds were 3.39 GFLOPS, using one core, and 13.2 GFLOPS with four cores. Some of the 64/32 bit and 4B/3B+ performance ratios were similar to those from MP-MFLOPS.

Code:
  MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep  6 16:31:24 2019

    Double Precision Benchmark 1, 2, 4 and 8 Threads

                        MFLOPS          Numeric Results
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   3.2    T1   2  1398  1462   285   40395  76384  99700
   6.2    T2   2  2799  2807   256   40395  76384  99700
   8.9    T4   2  5024  4589   257   40395  76384  99700
  11.5    T8   2  5089  5545   280   40395  76384  99700
  15.7    T1   8  2668  2790  1103   54805  85108  99820
  18.8    T2   8  5670  5545  1158   54805  85108  99820
  21.7    T4   8 10259 10011  1068   54805  85108  99820
  24.7    T8   8 10239 10824  1036   54805  85108  99820
  34.1    T1  32  3317  3390  3195   35159  66065  99521
  39.2    T2  32  6791  6754  4753   35159  66065  99521
  43.1    T4  32 12940 13200  4497   35159  66065  99521
  46.9    T8  32 13200 13049  4557   35159  66065  99521

            End of test Fri Sep  6 16:32:11 2019

              Pi 4B 32 bit               Pi 3B+ 64 bit
Threads       KB      KB      MB         KB      KB      MB
Ops/wd       12.8     128    12.8       12.8     128    12.8

T1   2        993     998     329        412     411     193
T2   2       1971    1995     309        828     824     194
T4   2       3633    3937     340       1543    1514     197
T8   2       3635    3796     339       1525    1551     196
T1   8       2378    2445    1288        980     978     696
T2   8       4770    4860    1282       1975    1964     782
T4   8       9281    9556    1210       3688    3688     781
T8   8       9119    9448    1245       3726    3689     787
T1  32       2697    2726    2708       1402    1403    1231
T2  32       5397    5446    5163       2808    2808    2399
T4  32      10689   10806    5146       5379    5413    3195
T8  32      10716   10494    4497       5450    5485    3150

              Average Pi 4B Performance Gains

  Ops/Word   Pi 4B 64b/32b              64b Pi 4B/3B+

        2    1.40    1.37    0.82       3.34    3.39    1.38
        8    1.13    1.12    0.87       2.78    2.83    1.44
       32    1.23    1.24    1.00       2.40    2.41    1.86


High Performance Linpack Benchmark

Earlier, the High Performance Linpack Benchmark was run on Raspberry Pi 3 models and, later, on the Raspberry Pi 4 system, both via the 32 bit Raspbian Operating System. Details and results can be found in the following reports.

https://www.researchgate.net/publication/331983549_Raspberry_Pi_3B_and_3B+_High_Performance_Linpack_and_Error_Tests

https://www.researchgate.net/publication/334561068_Raspberry_Pi_4B_Stress_Tests_Including_High_Performance_Linpack

Initially, two versions of HPL tests were run, one accessing precompiled Basic Linear Algebra Subprograms and the other with ATLAS alternatives, that had to be built. The whole benchmark suite was produced according to instructions in the following.

https://computenodes.net/2018/06/28/building-hpl-an-atlas-for-the-raspberry-pi/

The ATLAS version was installed, as the older benchmark would not run on the Pi 4. One issue is the time required for the build, apparently due to the numerous tuning tests. Time taken was 14 hours using a Pi 3B+, then 8 hours on a Pi 4. Later, 64 bit ATLAS was built on the Pi 3B+, via Gentoo, taking 26 hours, which included extended periods swapping data with the rather slow main drive.


The procedure specified in the above was used, successfully leading to a working package. Only one change was required, to Make.rpi line 95, changing the LAdir path from /home/pi/atlas-build to /home/demouser/atlas-build:

Code:
LAdir  = /home/demouser/atlas-build


Following the introduction of 64 bit Gentoo for the Pi 4B, ATLAS was again created, taking more than 10 hours. As indicated in the above links, the HPL benchmark can be a useful stress test, due to the long running time with heavy processing. It can lead to CPU MHz being throttled on the Pi 4B, producing slow GFLOPS speeds. The tests reported here were run using a Pi 4B with a cooling fan, with CPU MHz monitored to help to indicate that the processor was running at full speed.

Results and comparisons are provided below, followed by the main report for the best Pi 4B Gentoo result. Particularly important, maximum performance is dependent on the amount of RAM available. As with the original single CPU Linpack benchmark, where N is the matrix problem size, minimum memory used is N x N x 8 Bytes (double precision), or 512 MB for N = 8000 and 3.2 GB for N = 20000. The end of the detailed output indicates a further problem, where the first run at maximum size might be slow, with extra time spent swapping data out of RAM to create space for the HPL data.
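
The memory sizes quoted above, and the GFLOPS figures in the table below, can be checked with a short Python sketch. The GFLOPS part assumes the usual LU operation count of 2/3 N^3 + 3/2 N^2, which reproduces the reported Pi 4B 64 bit values. The function names are illustrative only.

```python
def hpl_matrix_bytes(n):
    """Minimum memory for an N x N double precision matrix."""
    return n * n * 8


def hpl_gflops(n, seconds):
    """GFLOPS from problem size and run time, assuming 2/3 N^3 + 3/2 N^2 operations."""
    operations = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return operations / seconds / 1e9


for n, secs in [(4000, 5.51), (8000, 38.22), (20000, 513.67)]:
    mbytes = hpl_matrix_bytes(n) / 1e6
    print(f"N={n:6d}  {mbytes:7.0f} MB  {hpl_gflops(n, secs):6.2f} GFLOPS")
```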

Next, the benchmark produces a sumcheck but, in the case of the ATLAS implementation, these are not consistent for the same problem size, although all those shown here were indicated as PASSED (within specified tolerances). The differences could be produced by different CPU models or alternative compilations, but the least understandable case is identified at the end of the detailed output, where the sumcheck is shown to vary on repeating the program on the same system.

Comparing Pi 4B 32 bit and 64 bit GFLOPS maximum speeds, the 32 bit version appears to be slightly faster (or the same within reasonable tolerances). It is not clear (to me) whether the compiled code completely embraces the difference in technology, or whether external compile options should be included for the different packages involved.

Code:
        ------ Time ------   ----- GFLOPS -----  ----------- Sumcheck ----------

          4B     4B    3B+     4B     4B    3B+         4B         4B        3B+
    N    64b    32b    64b    64b    32b    64b        64b        32b        64b

 4000   5.51   5.20  14.53   7.75   8.20   2.94  0.0022808  0.0023975  0.0025857
 8000  38.22  36.70 101.59   8.93   9.30   3.36  0.0017216  0.0016746  0.0017518
16000 269.26 263.00         10.14  10.40         0.0012577  0.0011258
20000 513.67 494.30         10.38  10.80         0.0009637  0.0010188

              GFLOPS Comparisons

                4B           64b
    N        64b/32b       4B/3B+

 4000          0.95          2.64
 8000          0.96          2.66
16000          0.98
20000          0.96


--------------------------------------------------------------------------------

The following parameter values will be used:

N      :    1000
NB     :     128
PMAP   : Row-major process mapping
P      :       2
Q      :       2
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             513.67              1.038e+01
HPL_pdgesv() start time Fri Aug 23 10:57:30 2019

HPL_pdgesv() end time   Fri Aug 23 11:06:04 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009637 ...... PASSED
================================================================================

WR11C2R4       20000   128     2     2             516.71              1.032e+01

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008697 ...... PASSED
================================================================================
First Run

WR11C2R4       20000   128     2     2             656.89              8.120e+00

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009470 ...... PASSED
================================================================================
 

_________________
Regards

Roy
roylongbottom
n00b
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Wed Sep 25, 2019 8:05 pm    Post subject: 64 Bit Raspberry Pi 4B Java and I/O Benchmarks

64 Bit Raspberry Pi 4B Java and I/O Benchmarks

The benchmark results included below enable comparisons between the Pi 4B and 3B+, using the same benchmark and system software, with a sample of Pi 4B 32 bit speeds included in some cases.
Three operational differences were observed compared with the 32 bit Raspbian benchmarking exercise. The first was with dual monitors. When displaying a window twice the pixel width of one monitor, the 32 bit program spread it across both monitors, but no mirroring appeared to be available. The 64 bit benchmark provided mirroring for smaller windows, but switching off mirroring squashed the image into the half width display of a single monitor.

The standard version of my DriveSpeed benchmark uses Direct I/O, to avoid caching, on writing and reading large files, and works as expected running the 32 bit version. Running a 64 bit version, this led to failures to write or read. Variations were produced to enable performance measurements.

My broadband hub has dual 2.4 and 5 GHz capabilities. Significant variations in performance can be produced in this mode, on different devices, but appear to be more significant using the 64 bit benchmark. I changed the hub settings to provide different 2.4 and 5 GHz ports but, unlike the 32 bit benchmark, the 64 bit version would not connect to the network, using the same Pi 4B system.

Note that these differences could be due to program, software and/or hub incompatibility.


OpenGL GLUT Benchmark

The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines, followed by one with applied textures.

Pi 4B average performance gains are included below. The best, at 2.1 times, were with textured objects, and the worst, at around 1.5 times, were with the slow kitchen displays.

Dual Monitors - The benchmark was also run with two 1920x1080 monitors connected. It showed two identical images when the mirror option was selected. Without this, the normal display, from where the program is executed, appeared on one monitor, and the OpenGL images on the other. This was fine when the usual display dimensions, as shown below, were specified. With no parameters, the full screen image was assumed to be 3840x1080 and this was displayed horizontally squashed into 1920 pixels. FPS measurements for the latter are shown below.
On running the 32 bit version via Raspbian, the default display was 3840x1080, across both monitors, but only on one monitor, when 1920x1080 parameters or less were specified. There was no mirror option.

Code:
############################# Pi 3B+ #############################

 GLUT OpenGL Benchmark 64 Bit Version 1, Fri Sep 20 11:15:47 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    389.6    227.2    122.6     75.3     30.0     21.5
   320   240    328.1    201.7    113.8     73.3     30.2     21.3
   640   480    203.3    144.7     87.8     62.0     30.2     21.0
  1024   768    107.1     94.5     60.3     51.1     28.9     20.0
  1920  1080     45.3     47.5     36.9     33.1     28.7     20.0


############################## Pi 4B #############################

  GLUT OpenGL Benchmark 64 Bit Version 1, Thu Sep 12 20:48:21 2019

          Running Time Approximately 5 Seconds Each Test

 Window Size  Coloured Objects  Textured Objects  WireFrm  Texture
    Pixels        Few      All      Few      All  Kitchen  Kitchen
  Wide  High      FPS      FPS      FPS      FPS      FPS      FPS

   160   120    767.4    420.3    258.3    154.3     45.7     31.7
   320   240    682.9    388.8    245.0    148.3     45.1     30.8
   640   480    367.1    262.6    217.9    140.1     46.2     30.9
  1024   768    150.8    148.8    128.6    117.3     45.3     30.4
  1920  1080     71.9     73.9     64.0     61.6     43.3     27.9

  Pi 4B Gains    1.77     1.74     2.12     2.10     1.52     1.46


  Dual Monitor - mirrored displays
  1920  1080     65.0     66.3     61.6     58.2     42.7     27.5

  Dual Monitor - not mirrored squashed image on one monitor
  3840  1080     60.9     59.6     57.2     54.8     40.8     26.8

  32 Bit
  1920  1080     81.4     79.4     74.6     68.3     30.8     20.0



JavaDraw Benchmark

The benchmark draws simple objects, from a small to a rather excessive number, to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.

Pi 4B performance gains, shown below, were between 2.10 and 3.42 times.

Code:
############################# Pi 3B+ #############################

   Java Drawing Benchmark, Sep 20 2019, 11:08:33
            Produced by javac 1.7.0_02

  Test                              Frames      FPS

  Display PNG Bitmap Twice Pass 1      335    33.46
  Display PNG Bitmap Twice Pass 2      546    54.53
  Plus 2 SweepGradient Circles         502    50.08
  Plus 200 Random Small Circles        366    36.59
  Plus 320 Long Lines                  134    13.30
  Plus 4000 Random Small Circles        46     4.59

         Total Elapsed Time  60.2 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


############################# Pi 4B ##############################

   Java Drawing Benchmark, Sep 12 2019, 20:18:28
            Produced by javac 1.7.0_02

  Test                              Frames      FPS  Gains

  Display PNG Bitmap Twice Pass 1     1146   114.52   3.42
  Display PNG Bitmap Twice Pass 2     1318   131.79   2.42
  Plus 2 SweepGradient Circles        1237   123.66   2.47
  Plus 200 Random Small Circles        972    97.13   2.65
  Plus 320 Long Lines                  415    41.48   3.12
  Plus 4000 Random Small Circles        97     9.65   2.10

         Total Elapsed Time  60.1 seconds

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


 32 bit Pi 4B speeds were between 8.25 and 104.47 FPS



Java Whetstone Benchmark

The benchmark measures performance of various floating point and integer calculations, with an overall rating in Million Whetstone Instructions Per Second (MWIPS).

Code:
############################# Pi 3B+ #############################

    Whetstone Benchmark Java Version, Sep 20 2019, 11:06:12

                                                       1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs

  N1 floating point  -1.124750137    310.88             0.0618
  N2 floating point  -1.131330490    289.41             0.4644
  N3 if then else     1.000000000             241.15    0.4292
  N4 fixed point     12.000000000             706.28    0.4460
  N5 sin,cos etc.     0.499110132              23.31    3.5700
  N6 floating point   0.999999821    130.04             4.1480
  N7 assignments      3.000000000              89.19    2.0720
  N8 exp,sqrt etc.    0.825148463              21.92    1.6970

  MWIPS                              775.89            12.8884

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222


############################# Pi 4B ##############################
 
    Whetstone Benchmark Java Version, Sep 12 2019, 20:15:35

                                                      1 Pass
  Test                  Result       MFLOPS     MOPS  millisecs Gains

  N1 floating point  -1.124750137    488.80             0.0393   1.57
  N2 floating point  -1.131330490    475.92             0.2824   1.64
  N3 if then else     1.000000000             344.31    0.3006   1.43
  N4 fixed point     12.000000000            1571.86    0.2004   2.23
  N5 sin,cos etc.     0.499110132              43.55    1.9104   1.87
  N6 floating point   0.999999821    264.15             2.0420   2.03
  N7 assignments      3.000000000             264.00    0.7000   2.96
  N8 exp,sqrt etc.    0.825148463              25.80    1.4420   1.18

  MWIPS                             1445.70             6.9171   1.86

  Operating System    Linux, Arch. aarch64, Version 4.19.67
  Java Vendor         IcedTea, Version  1.8.0_222



DriveSpeed Benchmark

This benchmark has the format shown below, measuring writing and reading speeds of large files, cached files, random access and numerous small files. Run time parameters are available to specify large file size and the file path.

Code:
########################## Pi 4B USB 3 ###########################

   DriveSpeed RasPi 64 Bit 2.0 Fri Sep 13 22:25:40 2019
 
 Selected File Path:
 /run/media/demouser/PATRIOT//
 Total MB  120832, Free MB  119778, Used MB    1054

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

 512    30.72    31.11    34.01   287.24   295.04   311.90
1024    34.66    36.11    35.45   298.87   302.38   300.26
 Cached
   8    42.03    39.58    38.85  1167.71  1029.35  1061.56

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.007    0.310     9.65    10.42     9.71

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      0.03     0.07     0.13   268.10   427.95   657.48
 ms/file   122.73   122.28   122.22     0.02     0.02     0.02    2.557


For non-cached tests, in the standard version of this benchmark, the file open call includes the O_DIRECT flag, specifying Direct I/O (no caching). The latest minor variant of this appears to work, as expected, in the 32 bit Raspbian version, on both main and USB drives. The 64 bit compilation indicated a failure to write to the main SD drive and a failure to read from USB flash drives. Omitting O_DIRECT for reading appeared to correct the latter (see above). To check this, and to enable main drive measurements, separate large file write only and read only programs, free of Direct I/O, were produced to follow write/reboot/read procedures. These were also necessary to indicate throughput when simultaneously writing or reading two USB 3 drives.
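
The idea of trying Direct I/O first and falling back to ordinary cached reads can be sketched as below. This is a Python illustration under stated assumptions, not the DriveSpeed source: the function names and the 1 MiB block size are hypothetical, and a short read is assumed to mean end of file. O_DIRECT needs an aligned buffer, so an anonymous mmap (page aligned) is used.

```python
import mmap
import os

BLOCK = 1 << 20  # 1 MiB buffer (assumed size), a multiple of common sector sizes


def _read_fd(fd, buf):
    """Read the whole file behind fd in BLOCK sized chunks and return bytes read."""
    total = 0
    while True:
        n = os.readv(fd, [buf])
        total += n
        if n < len(buf):  # short read (or 0) is treated as end of file
            return total


def read_large_file(path, direct=True):
    """Return (bytes_read, used_direct_io), retrying without O_DIRECT on failure."""
    o_direct = getattr(os, "O_DIRECT", 0)  # 0 on platforms without the flag
    attempts = [True, False] if (direct and o_direct) else [False]
    buf = mmap.mmap(-1, BLOCK)  # anonymous mapping, page aligned
    try:
        for use_direct in attempts:
            flags = os.O_RDONLY | (o_direct if use_direct else 0)
            try:
                fd = os.open(path, flags)
            except OSError:
                if not use_direct:
                    raise
                continue  # the open was refused with O_DIRECT, try cached I/O
            try:
                return _read_fd(fd, buf), use_direct
            except OSError:
                if not use_direct:
                    raise  # cached read failed as well, report it
            finally:
                os.close(fd)
    finally:
        buf.close()
```

Calling read_large_file on a test file returns the byte count and a flag showing whether Direct I/O was actually used, so a timing wrapper could log which mode produced each MB/second figure.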


USB Flash Drives

Two FAT 32 formatted USB 3 sticks were used: P at 128 GB, with 32 KB sectors and reading speed rated as up to 400 MB/second, and R, an 8.8 GB partition, with 8 KB sectors and reading speed rated as up to 190 MB/second (but it appears to do better sometimes).

Following is a summary of results, indicating USB 3 large file reading speed improvements of between 6.7 and 8.1 times, but disappointing writing performance. The slower P speeds might be affected by the mysteries of updating file allocation tables, which also influence random access and dealing with lots of small files, including file delete times. USB 3 provided little or no performance gain for the latter. Cached reading reflects RAM speed.

Code:
MB/second 16 MB USB 2, 1024 MB USB 3

 System   Drive  Write1  Write2  Write3   Read1   Read2   Read3

Pi 3B+  USB 2 P     11.5    11.4    11.5    36.6    37.7    37.3
Pi 3B+  USB 2 R     15.9    16.4    13.9    37.1    40.1    39.8
Pi 4B   USB 2 P     12.6    12.6    12.6    37.0    37.3    37.2
Pi 4B   USB 2 R     22.6    22.9    22.9    36.5    36.3    36.5
Pi 4B   USB 3 P     34.7    36.1    35.5   298.9   302.4   300.3
Pi 4B   USB 3 R     48.9    44.6    53.4   249.4   248.8   246.2
Compare MB/second
Pi 4B   P USB 3/2   2.75    2.88    2.81    8.07    8.11    8.07
Pi 4B   R USB 3/2   2.17    1.94    2.33    6.83    6.85    6.74

Cached MB/second Write1  Write2  Write3   Read1   Read2   Read3
Pi 3B+  USB 2 P     13.6    14.2    14.4   633.4   544.0   464.3
Pi 3B+  USB 2 R     13.7    14.4    19.4   623.5   661.4   557.6
Pi 4B   USB 2 P     15.0    14.7    14.8  1204.0  1047.3  1066.3
Pi 4B   USB 2 R     20.8    21.2    13.9   930.2   933.6  1230.3
Pi 4B   USB 3 P     42.0    39.6    38.9  1167.7  1029.4  1061.6
Pi 4B   USB 3 R     21.1    15.9    36.2  1103.6   944.9   981.0
Compare
Pi 4B   P USB 3/2   2.80    2.70    2.63    0.97    0.98    1.00
Pi 4B   R USB 3/2   1.01    0.75    2.60    1.19    1.01    0.80

Random milliseconds
                   Read                   Write
Pi 3B+  USB 2 P    0.013   0.013   0.254   11.76   10.18    9.80
Pi 3B+  USB 2 R    0.017   0.008   0.032    1.09    1.39   11.72
Pi 4B   USB 2 P    0.006   0.007   0.215    9.56    8.54    8.75
Pi 4B   USB 2 R    0.009   0.005   0.016    1.35    2.12    1.34
Pi 4B   USB 3 P    0.004   0.007   0.310    9.65   10.42    9.71
Pi 4B   USB 3 R    0.004   0.004   0.008    1.75    0.85    0.92
Compare
Pi 4B   P USB 3/2   1.50    1.00    0.69    0.99    0.82    0.90
Pi 4B   R USB 3/2   2.25    1.25    2.00    0.77    2.49    1.46

200 Small Files  milliseconds
                  Write                    Read                  Delete
Pi 3B+  USB 2 P    134.2   128.6   129.6    0.08    0.12    0.07    3.36
Pi 3B+  USB 2 R    105.5   104.7   107.6    0.05    0.05    0.07    0.26
Pi 4B   USB 2 P    125.8   125.5   125.8    0.02    0.02    0.02    3.12
Pi 4B   USB 2 R    104.1   104.0   104.0    0.02    0.02    0.03    0.14
Pi 4B   USB 3 P    122.7   122.3   122.2    0.02    0.02    0.02    2.56
Pi 4B   USB 3 R    105.4   104.0   104.3    0.02    0.02    0.03    0.15
Compare
Pi 4B   P USB 3/2   1.03    1.03    1.03    1.00    1.00    1.00    1.22
Pi 4B   R USB 3/2   0.99    1.00    1.00    1.00    1.00    1.00    0.95



Drive Write/Reboot/Read Tests

The write test also reads the data back for verification, but these reads are normally served from the RAM cache, hence the high data transfer speeds shown. VMSTAT results are provided, covering reading speeds.


Main SD Drive

This card is rated at up to 98 MB/second reading speed but only achieves around 46 MB/second. VMSTAT results confirm the data transfer speed and show the three files eventually occupying around 3 GB of the cache, with low CPU utilisation of 2% (x4) and 23% (x4) waiting for I/O.

Code:
 Current Directory Path: /home/demouser/RPi3-64-Bit-Benchmarks/IOtests/writeread
 Total MB   28225, Free MB   18761, Used MB    9464
 
                1024 MB   MBytes/Second
       Write1   Write2   Write3    Read1    Read2    Read3

Write   18.99    19.34    19.47  1337.09  1164.91  1325.96
Read     N/A      N/A      N/A     45.80    45.88    45.89


procs  -----------memory---------- ---swap-- -----io---- -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in   cs us sy id wa st

0  1     0 673848  60668 2792716    0    0 45056     0  767 1181  0  2 75 23  0
0  1     0 630228  60668 2835544    0    0 44544     0  789 1199  0  2 74 23  0
0  1     0 585204  60668 2880268    0    0 45056     0  691 1041  0  3 75 23  0
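The VMSTAT rows above can be related to the benchmark's MB/second figures with a few lines of Python. This is only a sketch: it assumes the `bi` column counts 1024-byte blocks per second (as described in the vmstat manual page) and that a MB is 1,000,000 bytes, which may not be the definition used inside the benchmark. The helper names are hypothetical.

```python
VMSTAT_COLUMNS = ("r b swpd free buff cache si so bi bo in cs "
                  "us sy id wa st").split()


def parse_vmstat_row(line):
    """Map one vmstat data row onto its column names."""
    return dict(zip(VMSTAT_COLUMNS, map(int, line.split())))


def read_rate_mb_per_s(row):
    """Convert bi (assumed 1024-byte blocks per second) to decimal MB/second."""
    return row["bi"] * 1024 / 1_000_000


row = parse_vmstat_row(
    "0  1     0 673848  60668 2792716    0    0 45056     0  767 1181  0  2 75 23  0")
print(f"{read_rate_mb_per_s(row):.1f} MB/second, {row['wa']}% waiting for I/O")
```

Under those assumptions the first sample works out at about 46 MB/second, in line with the 45.8 MB/second reported for Read1.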



USB 3 Drive P

Read only speed was similar to that from the earlier detailed test. Note the high average CPU utilisation of 17%, equivalent to 68% of one core.

Code:
 Selected File Path:
 /run/media/demouser/PATRIOT/
 Total MB  120832, Free MB  119752, Used MB    1080

                 1024 MB   MBytes/Second
       Write1   Write2   Write3    Read1    Read2    Read3

Write   58.45    23.10    22.91  1368.04  1190.71  1354.84
Read     N/A      N/A      N/A    306.18   294.93   302.91

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

1  0   256 811672  20920 2696504    0    0 305664     0 3898 6182  1 15 73 11  0
0  1   256 510852  20920 2996188    0    0 303616     0 4304 5936  1 16 72 12  0
1  0   256 239400  20920 3267636    0    0 307184     0 4512 6177  1 17 71 11  0
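The "68% of one core" figure follows from VMSTAT reporting percentages averaged over all four cores. A small Python sketch of that arithmetic, assuming the load could in principle be concentrated on a single core (the function name is hypothetical):

```python
def one_core_equivalent(busy_percent, cores=4):
    """Scale an all-core utilisation percentage to a single-core percentage."""
    return busy_percent * cores


# us and sy columns from the three vmstat samples above
samples = [(1, 15), (1, 16), (1, 17)]
average_busy = sum(us + sy for us, sy in samples) / len(samples)

print(f"average us+sy {average_busy:.0f}% of four cores, "
      f"{one_core_equivalent(average_busy):.0f}% of one core")
```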



USB 3 Drive R

This time data transfer speed was slower than the earlier example.

Code:
 Selected File Path:
 /run/media/demouser/REMIX_OS/
 Total MB    9017, Free MB    7485, Used MB    1532

                 1024 MB   MBytes/Second                 
       Write1   Write2   Write3    Read1    Read2    Read3

Write   46.43    28.81    36.57  1265.07  1103.23  1236.02
Read     N/A      N/A      N/A    172.71   172.14   176.49

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

0  1   256 111512    912 3417624    0    0 175189     0 4315 5929  1 12 71 17  0
0  1   256 169756    992 3358840    0    0 169043     0 4064 5515  1 11 71 17  0
0  1   256 177444   1068 3351176    0    0 155724     0 4088 6023  1 12 70 16  0



USB 3 Drives R and P Together

File sizes were reduced to 512 MB for these tests, to ensure that there would be sufficient RAM to contain six copies, as indicated by VMSTAT cache occupancy. This makes it trickier to measure total throughput, but the following appears to be a best case example, with a maximum of 386 MB/second and CPU utilisation near 100% of one core.

A bad example follows it, where one drive appears to be running at USB 2 speed.

Code:
Write/Read Thu Sep 19 16:07:48 2019  /run/media/demouser/REMIX_OS/
Write/Read Thu Sep 19 16:07:46 2019  /run/media/demouser/PATRIOT/

                   512 MB MBytes/Second
       Write1   Write2   Write3    Read1    Read2    Read3

  R     28.72    33.89    44.69  1302.19  1131.65  1374.24
  P     11.93     8.86     6.21  1232.47  1072.38  1213.36

Sep 23 17:11:21 2019 /run/media/demouser/PATRIOT/
Sep 23 17:11:20 2019 /run/media/demouser/REMIX_OS/

                  512 MB MBytes/Second
       Write1   Write2   Write3    Read1    Read2    Read3   Seconds

 P       N/A      N/A      N/A    159.78   187.44   294.23   7.7
 R       N/A      N/A      N/A    221.83   232.10   230.94   6.7+2 delayed start

procs -----------memory--------- ---swap-- -----io----  -system-- ------cpu----
r  b  swpd   free   buff   cache   si   so    bi    bo   in    cs us sy id wa st

0  0     0 3160720  74616  296092   0    0     0     0  2031 3601  4  2 94  0  0
0  1     0 3112052  74616  342188   0    0  45552    0  1512 2257  1  3 93  4  0
0  1     0 2908004  74616  547600   0    0 206336    0  4684 7169  4 14 67 15  0
2  0     0 2531960  74616  919400   0    0 369136    0  5495 8033  4 24 47 25  0
2  0     0 2149064  74616 1303288   0    0 382960    0  5168 7007  1 21 52 26  0
1  1     0 1771492  74616 1681348   0    0 385024    0  5969 8255  1 23 49 26  0
1  1     0 1383524  74616 2068788   0    0 386016    0  5621 7926  1 21 49 29  0
0  2     0  999100  74616 2453280   0    0 383488    0  4602 6895  1 19 54 26  0
0  1     0  628988  74616 2824188   0    0 368640    0  5405 8153  2 20 56 22  0
1  0     0  310748  74624 3142732   0    0 317424   20  4622 6551  1 17 72 10  0
1  0     0  223052  73680 3231812   0    0 268288    0  2815 5012  1 18 72 10  0
0  0     0  223824  73680 3231280   0    0  32768    0  1044 2009  1  3 95  1  0
0  0     0  223824  73680 3231280   0    0      0    0   393  619  0  0 99  0  0

 ===============================================================================

 Bad Example

       Write1   Write2   Write3    Read1    Read2    Read3

 P       N/A      N/A      N/A     36.37    37.72    37.48
 R       N/A      N/A      N/A    248.18   248.22   223.53



LAN and WiFi Benchmarks

The Raspberry Pi LanSpeed64 version uses the same programming code as for the DriveSpeed benchmark, except O_DIRECT is not used on creating files. The measurements were made between the Pi 4B and a Windows 7 based PC, where the data transfer speed was confirmed via Task Manager Network information and sysstat sar -n DEV on the Raspberry Pi 4. SAMBA was also installed to connect a remote PC and enable an Intel Windows version, LanSpdx86Win.exe, to be run.

An example of a LanSpeed64 log file is provided below, preceded by examples of the required mount and run commands.

Code:
Commands

sudo mount -t cifs -o dir_mode=0777,file_mode=0777 //192.168.1.68/d /media/public

./LanSpeed64 FilePath /media/public/test

Log File

   LanSpeed RasPi 64 Bit 1.0 Thu Sep 12 22:06:06 2019
 
 Selected File Path:
 /media/public/test/
 Total MB  266240, Free MB   70991, Used MB  195249

                        MBytes/Second
  MB   Write1   Write2   Write3    Read1    Read2    Read3

   8    66.13    92.09    92.76    96.36    96.85    97.30
  16    80.79    93.59    94.61   103.99   104.34   104.57

 Random         Read                       Write
 From MB        4        8       16        4        8       16
 msecs      0.004    0.009    0.435     0.95     0.92     0.93

 200 Files      Write                      Read                  Delete
 File KB        4        8       16        4        8       16     secs
 MB/sec      1.37     2.45     4.77     1.37     2.49     4.92
 ms/file     2.99     3.35     3.43     2.98     3.29     3.33    0.467



LAN and WiFi Benchmark Results

Below are results from programs run on the Pi 3B+ and 4B, plus others from running on a PC.

Dealing with large files, PC to Pi 4B and Pi 4B to PC LAN speeds demonstrated gigabit class performance (over 100 MB/second), around three times faster than on the Pi 3B+. My BT Hub has dual 2.4 and 5 GHz WiFi capabilities, leading to the erratic WiFi performance shown below, where (I think) greater than 10 MB/second is indicative of 5 GHz and around 4 MB/second of 2.4 GHz, the former usually only on writing. In this case, the hub was inches away from the Pi.

I changed the hub settings to provide separate 2.4 and 5 GHz hub address selections, with 72 and 180 Mbits/second being indicated, respectively. These sorts of numbers were confirmed on my Smartphone, but were variable. The 64 bit version would not connect to the network at 5 GHz, unlike the 32 bit program which, for example, obtained 15 MB/second writing and 8 MB/second reading. These differences could be, I suppose, due to program, software and/or hub incompatibility.
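For comparing the indicated link rates with the measured MB/second figures, the conversion is simply a division by 8. The Python sketch below ignores all protocol overheads, so the results are upper limits rather than expected speeds; the gigabit line is added for comparison with the LAN results and the function name is hypothetical.

```python
def mbits_to_mbytes(mbits_per_second):
    """Convert a link rate in Mbits/second to MBytes/second (8 bits per byte)."""
    return mbits_per_second / 8


for label, rate in (("2.4 GHz WiFi", 72), ("5 GHz WiFi", 180), ("Gigabit LAN", 1000)):
    print(f"{label:13s} {rate:5d} Mbits/second = "
          f"{mbits_to_mbytes(rate):6.1f} MB/second ceiling")
```

On this basis a 72 Mbits/second link cannot exceed 9 MB/second, which fits the suggestion that speeds above 10 MB/second indicate a 5 GHz connection.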

Random access times appeared to be quite similar on all WiFi tests, with faster but variable comparative times via LAN. There were similar relationships on dealing with numerous small files.

Code:
Large Files MB/second

System   MB   Write1   Write2   Write3    Read1    Read2   Read3

PC WiFi  16     4.08     4.16     4.11     2.34     1.68     1.30
PC LAN   16   106.11   106.11   105.89    50.67    33.86    25.47
LAN 3B+  16    28.63    29.03    28.96    22.18    32.28    32.61
3B+ WiFi 16    11.15    11.00    10.76     4.01     3.89     3.09
4B WiFi1 16     6.43     6.39     6.47     4.33     4.13     4.86
4B WiFi2 16    13.26    13.34    13.25     3.69     4.22     4.00
4B LAN   16    80.79    93.59    94.61   103.99   104.34   104.57
4B LAN  128    96.58    96.67    95.74   106.41   107.24   107.82


Random milliseconds

System          Read                     Write

PC WiFi        1.711    1.972    2.015     2.26     2.28     2.25
PC LAN         0.606    0.590    0.532     0.47     0.48     0.47
LAN 3B+        0.030    0.816    0.484     1.19     1.16     1.16
3B+ WiFi       3.052    3.167    3.475     3.60     3.39     3.45
4B WiFi1       3.286    3.549    3.627     4.02     3.45     3.72
4B WiFi2       2.786    2.822    2.944     3.20     2.94     2.92
4B LAN         0.004    0.009    0.435     0.95     0.92     0.93


200 Small Files  milliseconds per file

System      Write                       Read                     Delete

PC WiFi     10.09    12.42    13.81     5.50     6.11     8.06    1.507
PC LAN       4.05     4.59     4.53     2.38     2.23     2.64    0.661
LAN 3B+      3.72     4.36     4.45     3.33     3.40     3.60    0.378
3B+ WiFi    12.61    13.53    14.97    13.17    14.06    15.88    2.534
4B WiFi1    15.08    16.53    22.83    12.96    14.23    17.29    2.509
4B WiFi2    11.38    12.85    12.82    10.64    11.83    14.15    2.083
4B LAN       2.99     3.35     3.43     2.98     3.29     3.33    0.467

_________________
Regards

Roy
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 7713
Location: almost Mile High in the USA

Posted: Thu Sep 26, 2019 6:22 am

Curious, what model of Atom 1666MHz is being used in the benchmark comparisons? Is it a Bonnell or Silvermont?
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
Gavinmc42
n00b

Joined: 23 Sep 2019
Posts: 21
Location: Brisbane

Posted: Thu Sep 26, 2019 9:03 am

Any difference between A53 and A72 versions?
I barely understand "out of order execution" and where would it make a difference?
_________________
Don't get Pi's if you are scared of learning.
roylongbottom
n00b

Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Thu Sep 26, 2019 9:29 am

eccerr0r wrote:
Curious, what model of Atom 1666MHz is being used in the benchmark comparisons? Is it a Bonnell or Silvermont?


Which report are you referring to? I can't recall including a direct reference in this topic. It would be a few years old anyway.
_________________
Regards

Roy
roylongbottom
n00b

Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Thu Sep 26, 2019 9:43 am

Gavinmc42 wrote:
Any difference between A53 and A72 versions?
I barely understand "out of order execution" and where would it make a difference?


See comparisons of Pi 4B and Pi 3B+, the latter having the A53, where there are lots of Pi 4 A72 improvements above and beyond clock speed ratios.

I take "out of order execution" to mean that later instructions in a sequence can be executed ahead of earlier ones, provided they do not depend on results that are still being calculated. This can improve performance.
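As a rough illustration of that idea, here is a toy Python model. It is not how the A72 or A53 actually work: it issues at most one instruction per cycle, the instructions and latencies are made up, and each instruction writes a different register so register reuse hazards do not arise. All names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read
    latency: int            # cycles until the result is available


def cycles_to_finish(program, out_of_order):
    """Issue at most one instruction per cycle and return the cycle at
    which the last result becomes available."""
    never = float("inf")
    # Results of instructions that have not issued yet are not available.
    ready_at = {ins.dest: never for ins in program}
    pending = list(program)
    cycle = 0
    while pending:
        # In order: only the oldest pending instruction may issue.
        # Out of order: the oldest pending instruction with ready inputs may.
        candidates = pending if out_of_order else pending[:1]
        for ins in candidates:
            # Registers never written by the program count as already ready.
            if all(ready_at.get(src, 0) <= cycle for src in ins.srcs):
                ready_at[ins.dest] = cycle + ins.latency
                pending.remove(ins)
                break
        cycle += 1
    return max(ready_at.values())


program = [
    Instr("r1", ("addr",), 4),   # slow load
    Instr("r2", ("r1",), 1),     # needs the load result
    Instr("r3", ("a", "b"), 1),  # independent of the load
    Instr("r4", ("r3",), 1),     # depends only on r3
]
print("in order    :", cycles_to_finish(program, out_of_order=False))
print("out of order:", cycles_to_finish(program, out_of_order=True))
```

In the in-order run the two independent instructions wait behind the one that needs the slow load, whereas the out-of-order run fits them in while the load is still in progress.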
_________________
Regards

Roy
eccerr0r
Watchman

Joined: 01 Jul 2004
Posts: 7713
Location: almost Mile High in the USA

Posted: Thu Sep 26, 2019 2:11 pm

roylongbottom wrote:
eccerr0r wrote:
Curious, what model of Atom 1666MHz is being used in the benchmark comparisons? Is it a Bonnell or Silvermont?


Which report are you referring to? I can't recall including a direct reference in this topic. It would be a few years old anyway.

Oh sorry yeah taken out of context, I think I was looking at your website and not from this thread hence the out of the blue question...
Silvermont is many years old now, and Bonnell is even older.
_________________
Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching?
roylongbottom
n00b

Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Sat Sep 28, 2019 5:00 pm    Post subject: GCC 9 Compiled Benchmarks

GCC 9 Compiled Benchmarks

I am in the process of recompiling my benchmarks using gcc 9, to see if they produce faster performance than the existing gcc 6 compilations. Results of the single core CPU benchmarks and comparisons are provided below for a Pi 3B+ and Pi 4B, using the same Gentoo Operating System SD card and benchmark programs. For details of the latter and earlier results see:

https://www.raspberrypi.org/forums/viewtopic.php?f=31&t=44080&start=125#p1484388
and down the page
https://www.raspberrypi.org/forums/viewtopic.php?f=31&t=44080&start=125#p1485285

In due course, the gcc 9 benchmarks and source codes will be available to download.

The new compiler produces no real performance improvements for the first few benchmarks but provides some significant gains for data streaming integer and single precision floating point calculations in the memory tests, where vector SIMD instructions are likely to be generated.

Another reason for new versions of the benchmarks is that the CPUID information included in the earlier programs provided too much unnecessary information, with nine identical lines for each CPU core. The new versions provide the following data, obtained using the lscpu command (except for the Linux lines at the end). Here, the CPU model name and normal operating frequency are provided.

Pi 4B Cortex A72
Code:
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               3
Model name:          Cortex-A72
Stepping:            r0p3
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
BogoMIPS:            108.00
Flags:               fp asimd evtstrm crc32 cpuid
Linux pi64 4.19.67-v8-174fcab91765-p4-bis+ #2 SMP PREEMPT
Tue Aug 27 13:58:09 GMT 2019 aarch64 GNU/Linux


Pi 3B+ Cortex A53
Code:
Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1400.0000
CPU min MHz:         600.0000
BogoMIPS:            38.40
Flags:               fp asimd evtstrm crc32 cpuid
Linux pi64 4.19.67-v8-174fcab91765-bis+ #2 SMP PREEMPT
Tue Aug 27 13:29:20 GMT 2019 aarch64 GNU/Linux
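Output in this "Key: value" layout is easy to pick apart if only a few fields are wanted. The following Python sketch is not the method used inside the benchmarks (which is not shown here); it simply parses a shortened copy of the Pi 4B listing above, and the function name is hypothetical.

```python
def parse_lscpu(text):
    """Turn 'Key: value' lines from lscpu output into a dictionary."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:                      # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


sample = """\
Architecture:        aarch64
CPU(s):              4
Model name:          Cortex-A72
CPU max MHz:         1500.0000
CPU min MHz:         600.0000
"""

cpu = parse_lscpu(sample)
print(cpu["Model name"], float(cpu["CPU max MHz"]), "MHz")
```

On a live system the text could instead come from `subprocess.run(["lscpu"], capture_output=True, text=True).stdout`.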


Whetstone Benchmark

Performance of the gcc 9 compilations for the Pi 4B was effectively the same as the earlier versions. The Pi 3B+ results indicated improvements, but this was due to the EXP type function calculations. The new compilation included a minor tweak for the IF tests, to avoid overoptimisation.

Code:
System       MHz  MWIPS ------ MFLOPS------  -----  ------ -MOPS-------- ------
                           1      2      3      COS    EXP  FIXPT     IF  EQUAL
gcc 9
Pi 3B+      1400   1482    384    404    329   27.4   28.2   1712   2042   1362
Pi 4B       1500   2330    522    533    398   60.4   40.3   2493   2984    997
Pi4/3B+     1.07   1.57   1.36   1.32   1.21   2.21   1.43   1.46   1.46   0.73

gcc 9/6
Pi 4B       1.00   1.03   1.00   1.00   1.00   1.10   1.01   1.00    N/A   1.00


Dhrystone Benchmark

The gcc 9 compilations led to no real difference in performance.

Code:
                 Compiled  DMIPS
System       MHz   DMIPS    /MHz

gcc 9
Pi 3B+      1400    3896    2.78
Pi 4B       1500    8190    5.46
Pi4/3B+     1.07    2.10

gcc 9/6
Pi 4B       1.00    1.00
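The DMIPS/MHz and Pi4/3B+ figures in the table above are straightforward ratios, reproduced in this short Python sketch (the names are chosen here for illustration):

```python
def dmips_per_mhz(dmips, mhz):
    """Dhrystone rating normalised by CPU clock speed."""
    return round(dmips / mhz, 2)


# (MHz, DMIPS) from the gcc 9 rows of the table above
results = {"Pi 3B+": (1400, 3896), "Pi 4B": (1500, 8190)}

for system, (mhz, dmips) in results.items():
    print(f"{system:7s} {dmips_per_mhz(dmips, mhz):.2f} DMIPS/MHz")

gain = results["Pi 4B"][1] / results["Pi 3B+"][1]
clock_ratio = results["Pi 4B"][0] / results["Pi 3B+"][0]
print(f"Pi4/3B+ DMIPS {gain:.2f}, clock {clock_ratio:.2f}, "
      f"per MHz {gain / clock_ratio:.2f}")
```

Dividing the DMIPS gain by the clock ratio shows how much of the improvement comes from the core design rather than the higher clock speed.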


Linpack Benchmarks

The new gcc 9 compilations produced the same performance as the older versions, within the variations normally seen on this benchmark.

Code:
                         MFLOPS
 
System      MHz      DP      SP   SP NEON

gcc 9
Pi 3B+      1400   396.2   571.3   566.7
Pi 4B       1500  1110.6  2052.4  1887.5
Pi4/3B+     1.07    2.80    3.59    3.33

gcc 9/6
Pi 4B       1.00    1.05    1.04    0.96


Livermore Loops Benchmark

There were some performance differences in gcc 9 results but average speeds were quite similar.

Code:
                           MFLOPS

System      MHz Maximum Average Geomean Harmean Minimum

gcc 9
Pi 3B+     1400  1000.7   347.8   308.0   275.2   117.3
Pi 4B      1500  2744.5   962.5   768.2   596.2   132.1
Pi4/3B+    1.07    2.74    2.77    2.49    2.17    1.13

gcc 9/6
Pi 4B      1.00    1.10    1.08    1.05    0.99    0.62


MFLOPS Of 24 Kernels

gcc 9
Pi 3B+     565   320   319   535   227   207  1001   581   541   234   171   248
           121   160   293   280   456   547   337   287   367   190   386   209

Pi 4B     2146   989   970   965   390   785  2386  2479  1879   632   500   973
           134   423   814   670   726  1177   450   397  1675   561   818   283

Pi 4B/    3.80  3.09  3.04  1.80  1.72  3.80  2.38  4.27  3.48  2.70  2.93  3.93
Pi 3B+    1.10  2.65  2.78  2.39  1.59  2.15  1.33  1.39  4.56  2.95  2.12  1.35
          Min   1.10  Max   4.56

gcc 9/6
Pi 4B     1.06  0.99  0.98  1.02  1.05  1.06  1.17  1.00  0.95  0.83  1.01  1.11
          0.61  1.05  1.00  0.94  0.96  1.05  1.01  1.00  1.58  1.35  1.00  1.00
          Min   0.61  Max   1.58
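For anyone unfamiliar with the three kinds of average in the summary table, this Python sketch calculates them for the 24 Pi 4B kernel figures listed above. The results will not exactly match the summary row, whose maximum (2744.5) and minimum (132.1) do not appear among these 24 values, so it is presumably calculated over a wider set of measurements.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Pi 4B gcc 9 MFLOPS for the 24 kernels, from the table above
pi4b_kernels = [
    2146, 989, 970, 965, 390, 785, 2386, 2479, 1879, 632, 500, 973,
    134, 423, 814, 670, 726, 1177, 450, 397, 1675, 561, 818, 283,
]

print(f"Maximum {max(pi4b_kernels)}")
print(f"Average {mean(pi4b_kernels):.1f}")
print(f"Geomean {geometric_mean(pi4b_kernels):.1f}")
print(f"Harmean {harmonic_mean(pi4b_kernels):.1f}")
print(f"Minimum {min(pi4b_kernels)}")
```

The harmonic mean is dominated by the slowest kernels and the arithmetic mean by the fastest, which is why the three values are reported together.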


MemSpeed Benchmark

Many Pi 4B/3B+ comparisons were similar, but the gcc 9 compilation gave rise to a number of changes, compared with the older version. The latter was slightly faster using some double precision calculations, but gcc 9 produced speed increases between 1.3 and 2.6 times with integers and single precision, the latter providing a maximum of 5.5 GFLOPS compared with 3.5.

Code:
                       Gentoo 64b Pi 3B+  gcc 9

    Memory Reading Speed Test 64 Bit gcc 9 by Roy Longbottom

               Start of test Thu Sep 26 12:43:02 2019

     Memory x[m]=x[m]+s*y[m] Int+  x[m]=x[m]+y[m]       x[m]=y[m]
     KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
       Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

          8   4565   5140   7847   5439   5827   7928   6161   4288   4334
         16   4445   5145   7942   5362   5829   7941   6207   4358   4310
         32   4094   4853   7251   4750   5396   7250   6139   4312   4303
         64   3767   4748   7008   4320   5309   6954   5461   4097   4100
        128   3912   4799   7319   4442   5486   7325   5328   4133   4134
        256   3838   4824   6934   4400   5426   7247   5354   3844   4010
        512   2570   3661   3826   2773   3975   4912   3302   2532   3017
       1024    878   2120   2228    938   2182   2239   1098   1215   1361
       2048    848   1961   2046   1016   2008   2033    758    805    814
       4096    856   1961   2040   1007   1984   2036    839    863    856
       8192    885   1940   1956   1013   1921   1957    844    865    868
Max MFLOPS     571   1286
                              Gentoo 64b Pi 4B

          8  13385  21854  24413  13416  23402  24404  11630   9316   9315
         16  13527  22116  24712  13551  23675  24722  11800   9447   9446
         32  12170  19681  21716  12164  21047  21740  11403   9511   9514
         64  11402  19074  20086  11613  20057  20101   9317   8651   8663
        128  11770  20334  21119  12124  21389  21087   8003   8136   8136
        256  11740  20281  21115  12029  21384  21111   8098   8184   8015
        512  11671  20255  20873  12058  21561  21072   7721   6684   6929
       1024   2818   7728   5968   3957   7839   7831   4691   3610   3832
       2048   1884   3436   3743   1880   3578   3281   2597   2717   2696
       4096   1284   2399   2555   1446   3802   3625   2420   2630   2632
       8192   1913   3759   3459   1937   3798   3772   2468   2482   2482
Max MFLOPS    1691   5529
                            Comparison 64b Pi4/3B+

          8   2.93   4.25   3.11   2.47   4.02   3.08   1.89   2.17   2.15
         16   3.04   4.30   3.11   2.53   4.06   3.11   1.90   2.17   2.19

        256   3.06   4.20   3.05   2.73   3.94   2.91   1.51   2.13   2.00
        512   4.54   5.53   5.46   4.35   5.42   4.29   2.34   2.64   2.30
       1024   3.21   3.65   2.68   4.22   3.59   3.50   4.27   2.97   2.82

       4096   1.50   1.22   1.25   1.44   1.92   1.78   2.88   3.05   3.07
       8192   2.16   1.94   1.77   1.91   1.98   1.93   2.92   2.87   2.86

                            Comparison Pi4B gcc 9/6

          8   0.86   1.56   1.95   0.86   1.67   1.57   1.02   1.00   1.19
         16   0.86   1.57   1.94   0.86   1.67   1.58   1.00   1.00   1.20

        256   0.96   1.78   1.97   0.99   1.81   1.75   1.00   1.00   1.02
        512   1.04   1.89   2.05   1.10   1.93   1.82   0.96   1.06   1.06
       1024   0.83   2.93   1.82   1.17   2.42   2.63   1.25   0.91   0.95

       4096   0.91   1.30   1.37   0.78   2.28   1.97   0.97   1.06   1.09
       8192   1.00   1.96   1.80   1.27   2.00   1.99   0.99   1.11   1.00
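The Max MFLOPS lines in the listings above are consistent with the best MB/second figure for x[m]=x[m]+s*y[m], counted as two floating point operations (multiply and add) for each pair of values read. The Python sketch below reflects that reading of the numbers; it is an inference from the table, not taken from the benchmark source, and the function name is hypothetical.

```python
def mflops_from_mb_per_s(mb_per_s, bytes_per_value):
    """MFLOPS for x[m] = x[m] + s*y[m], given the data reading speed.

    Assumes 2 floating point operations per (x, y) pair of values read.
    """
    flops_per_pair = 2
    bytes_per_pair = 2 * bytes_per_value
    return mb_per_s * flops_per_pair / bytes_per_pair


# Best Dble and Sngl MB/second from the first test column of each system
print("Pi 3B+ DP", round(mflops_from_mb_per_s(4565, 8)))
print("Pi 3B+ SP", round(mflops_from_mb_per_s(5145, 4)))
print("Pi 4B  DP", round(mflops_from_mb_per_s(13527, 8)))
print("Pi 4B  SP", round(mflops_from_mb_per_s(22116, 4)))
```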


NeonSpeed Benchmark

With the gcc 9 compilation, the Pi 4B continued to be significantly faster than the 3B+. Comparing Pi 4B gcc 9 and gcc 6 results, performance was essentially the same when NEON Intrinsic Functions were used but, as with MemSpeed, the normal compilations were faster with gcc 9, by around 80% on average in this case.

Code:
               Gentoo 64b Pi 3B+ gcc 9

  NEON Speed Test 64 Bit gcc 9 Thu Sep 26 12:45:07 2019

       Vector Reading Speed in MBytes/Second
  Memory  Float v=v+s*v  Int v=v+v+s   Neon v=v+v
   KBytes   Norm   Neon   Norm   Neon  Float    Int

        16   5118   5461   6218   5298   6024   6011
        32   4894   4980   5886   4855   5431   5445
        64   4713   4557   5669   4452   4868   4867
       128   4824   4703   5814   4598   4995   4946
       256   4857   4750   5815   4643   5028   4964
       512   3694   2652   4265   2675   3003   3007
      1024   2085   1135   2204   1132   1128   1077
      4096   2008   1021   2070   1033   1056   1036
     16384   1912   1061   2042    958   1065   1047
     65536   1783   1062   1873    769   1080   1081

                 Gentoo 64b Pi 4B

        16  21046  14555  16698  13502  14565  16970
        32  17797  12061  14509  10785  12282  13112
        64  19517  10860  15252   9981  10793  11419
       128  19839  10936  15468  10120  11001  11579
       256  20094  10838  15603  10229  10885  11566
       512  20076  10846  15469  10185  10943  11667
      1024   7016   3040   6826   3211   3417   3548
      4096   3945   1940   3599   1950   1768   1937
     16384   3394   2017   3386   1963   1848   2014
     65536   3484   2043   3839   1765   2060   2049

                 Comparison 64b Pi4/3B+

        16   4.11   2.67   2.69   2.55   2.42   2.82
        32   3.64   2.42   2.47   2.22   2.26   2.41
        64   4.14   2.38   2.69   2.24   2.22   2.35
       128   4.11   2.33   2.66   2.20   2.20   2.34
       256   4.14   2.28   2.68   2.20   2.16   2.33
       512   5.43   4.09   3.63   3.81   3.64   3.88
      1024   3.36   2.68   3.10   2.84   3.03   3.29
      4096   1.96   1.90   1.74   1.89   1.67   1.87
     16384   1.78   1.90   1.66   2.05   1.74   1.92
     65536   1.95   1.92   2.05   2.30   1.91   1.90

                 Comparison Pi4B gcc 9/6

        16   1.51   0.89   1.34   0.89   0.91   0.99
        32   1.86   1.12   1.62   1.12   1.12   1.19
        64   1.83   0.92   1.48   0.93   0.89   0.94
       128   1.86   0.92   1.50   0.95   0.92   0.97
       256   1.88   0.91   1.51   0.95   0.91   0.96
       512   1.98   0.95   1.59   1.00   0.97   1.01
      1024   2.37   0.94   2.37   1.00   1.04   1.21
      4096   2.28   1.13   2.08   1.10   1.11   1.12
     16384   2.13   1.05   1.86   1.02   0.96   1.21
     65536   1.77   1.18   1.92   1.01   1.09   1.01
 
  Average    1.95   1.00   1.73   1.00   0.99   1.06


BusSpeed Benchmark

Results from the gcc 9 compilations were virtually the same as those from gcc 6.

Code:
           Gentoo 64b Pi 3B+  gcc 9

   BusSpeed 64 Bit gcc 9 Thu Sep 26 12:51:15 2019

    Reading Speed 4 Byte Words in MBytes/Second
Memory     Inc32  Inc16  Inc8   Inc4   Inc2   Read
KBytes     Words  Words  Words  Words  Words  All

         16   3860   4283   4677   4901   5022   3591
         32   2228   2433   2989   4740   4912   3629
         64    700    697   1299   2200   3310   3348
        128    637    636   1208   2064   3151   3396
        256    597    600   1161   1945   3105   3377
        512    232    194    500    884   1629   2350
       1024    118    131    159    440    692   1682
       4096     91     99    197    463    923   1878
      16384    119    117    200    392    775   1606
      65536    101    105    238    464    873   1876
       
                       Gentoo 64b Pi 4B              Rd All  Rd All
                                                     4B/3B+ gcc 9/6

         16   4815   5060   5573   5808   5741   8935   2.49   1.09
         32   1534   1828   2967   4254   4930   7825   2.16   1.04
         64    792   1007   1988   3269   4844   8062   2.41   1.02
        128    730    950   1881   3133   5007   8162   2.40   1.04
        256    733    955   1901   3128   5071   8236   2.44   1.04
        512    737    952   1885   3139   5058   8237   3.51   1.07
       1024    374    539   1047   1884   3177   5537   3.29   0.97
       4096    235    255    497    990   1975   3386   1.80   0.82
      16384    239    263    501    913   1984   3973   2.47   0.97
      65536    239    237    502    995   1984   3971   2.12   0.98


Fast Fourier Transforms Benchmarks

The Pi 4B/3B+ performance gains were similar using both gcc 9 and gcc 6 compiled programs, but the gcc 9 compilation produced some faster FFT1 speeds, as shown in the Pi 4B gcc 9/6 comparisons.

Code:
           Gentoo Pi 3B+ gcc 9             Gentoo Pi 4B gcc 9

    Size    FFT1            FFT3            FFT1            FFT3
       K      SP      DP      SP      DP      SP      DP      SP      DP
 
       1    0.15    0.16    0.15    0.14    0.04    0.04    0.04    0.04
       2    0.34    0.39    0.31    0.31    0.08    0.13    0.08    0.09
       4    0.89    1.00    0.82    0.79    0.19    0.33    0.19    0.21
       8    2.19    2.70    1.66    1.89    0.71    0.74    0.46    0.46
      16    4.32    5.94    4.88    5.32    1.63    2.06    1.17    1.09
      32   12.47   24.05    9.59   14.82    3.73    4.03    2.44    3.09
      64   66.46  116.11   26.53   36.64    7.92   27.12    5.46    9.06
     128  169.06  268.02   63.65   84.00   43.28  100.75   16.09   22.00
     256  401.86  600.72  141.83  195.69  192.57  254.20   37.08   49.76
     512  853.48 1266.96  329.26  435.23  590.20  651.24   82.54  110.23
    1024 1966.69 2808.07  721.36  981.82 1463.15 1749.37  202.20  251.71
 
            Pi 4B/3B+                       Pi 4B gcc 9/6

       1    3.53    3.77    3.63    3.78    0.97    0.98    1.02    1.18
       2    4.39    3.05    3.97    3.64    1.00    1.06    1.46    1.08
       4    4.75    3.03    4.23    3.81    1.34    1.16    0.98    1.06
       8    3.06    3.62    3.62    4.10    1.10    1.76    1.00    1.09
      16    2.65    2.89    4.16    4.89    1.32    1.41    0.98    1.00
      32    3.34    5.97    3.93    4.79    1.53    1.68    1.02    1.03
      64    8.39    4.28    4.85    4.04    1.92    1.88    0.99    1.03
     128    3.91    2.66    3.96    3.82    1.93    1.51    1.01    1.12
     256    2.09    2.36    3.82    3.93    1.20    1.43    1.06    1.15
     512    1.45    1.95    3.99    3.95    0.95    1.17    1.09    1.21
    1024    1.34    1.61    3.57    3.90    0.85    1.07    1.06    1.21


Multithreaded Benchmarks Next
_________________
Regards

Roy
roylongbottom
n00b

Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Sun Oct 20, 2019 10:16 am    Post subject: GCC 9 Multithreaded Benchmarks

GCC 9 Multithreaded Benchmarks

Compiling with gcc 9 did not provide across the board performance gains. Mainly considering 4 thread results, those within 10% of the gcc 6 speeds were measured on MP-Whetstone, MP-Dhrystone and MP-Linpack. Some gains and some losses applied to MP-RandMem, MP-MFLOPS NEON, OpenMP MFLOPS and OpenMP MemSpeeds. Real gains were demonstrated by MP-BusSpeed, MP-MFLOPS SP and MP-MFLOPS DP.

MP-Whetstone Benchmark

Most of the important Pi 4B results were virtually the same as those from the earlier gcc 6 compilations but the 3B+ COS and EXP speeds were somewhat faster using gcc 9.

Code:

           MWIPS MFLOPS MFLOPS MFLOPS    Cos    Exp  Fixpt     If  Equal
  Threads             1      2      3   MOPS   MOPS   MOPS   MOPS   MOPS

                              Gentoo 64b Pi 3B+ gcc 9

    1       1500    381    384    328   27.2   28.1   5098   2049   1368
    2       3001    766    762    656   54.5   56.5  10130   4102   2737
    4       5940   1488   1528   1304  107.8  111.5  19741   7665   5423
    8       5987   1528   1666   1267  107.4  117.9  25862   9518   5666

            Overall Seconds   4.98 1T,   4.98 2T,   5.16 4T,  10.30 8T

                              Gentoo 64b Pi 4B gcc 9

    1       2364    530    532    395   60.6   40.0   7426   2242    996
    2       4724   1060   1052    789  121.0   80.4  14853   4476   1994
    4       9413   2103   2112   1579  241.0  159.5  29161   8638   3968
    8       9848   2671   2453   1644  247.0  168.1  37385  11636   4108

            Overall Seconds   5.00 1T,   5.01 2T,   5.07 4T,  10.20 8T

                              Comparison 64b Pi4/3B+

    1       1.58   1.39   1.38   1.20   2.23   1.42   1.46   1.09   0.73
    2       1.57   1.38   1.38   1.20   2.22   1.42   1.47   1.09   0.73
    4       1.58   1.41   1.38   1.21   2.24   1.43   1.48   1.13   0.73
    8       1.64   1.75   1.47   1.30   2.30   1.43   1.45   1.22   0.72

                              Comparison Pi4B gcc 9/6

    1       0.99   0.99   0.99   1.00   1.00   1.03   N/A    0.50   1.00
    2       0.99   1.00   0.97   0.99   1.00   1.03   N/A    0.50   1.00
    4       0.99   0.99   1.02   1.01   1.00   1.03   N/A    0.49   1.00
    8       1.00   1.02   0.89   1.01   1.01   1.05   N/A    0.52   1.01
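How well the MWIPS ratings scale with thread count can be checked with a few lines of Python, using the Pi 4B gcc 9 figures from the table above (the function name is hypothetical):

```python
def speedups(rating_by_threads):
    """Speed of each thread count relative to the single thread rating."""
    base = rating_by_threads[1]
    return {threads: rating / base
            for threads, rating in rating_by_threads.items()}


# Pi 4B gcc 9 MWIPS by number of threads, from the table above
pi4b_mwips = {1: 2364, 2: 4724, 4: 9413, 8: 9848}

for threads, gain in speedups(pi4b_mwips).items():
    print(f"{threads} threads: {gain:.2f} x single thread")
```

With four cores the gain levels off at around four times, and the overall seconds for eight threads are roughly double those for four, as would be expected when two threads share each core.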


MP-Dhrystone Benchmark

As indicated for the earlier gcc 6 results, this benchmark produces inconsistent performance and does not provide a good example of multithreading but, in this case, gcc 6 and gcc 9 results were similar, with a reasonably high Pi 4B/3B+ performance gain.

Code:

Threads                        1       2       4       8

Seconds                     0.54    0.67    1.23    2.46

VAX MIPS rating Pi 3B+ 6    4207    6804    7401    7415
VAX MIPS rating Pi 4B 64    8880    7828    8303    8314
Pi 4B/3B+ 64 bits           2.11    1.15    1.12    1.12


VAX MIPS rating Pi 4B 32    5539    5739    6735    7232
Pi 4B 64 bits/32 bits       1.60    1.36    1.23    1.15

                        Gentoo gcc 9

VAX MIPS rating Pi 3B+ 6    4062    6504    8242    8343
VAX MIPS rating Pi 4B 64    8298    7683    7870    7978
Pi 4B/3B+ 64 bits           2.04    1.18    0.95    0.96

Pi 4B gcc 9/6               0.93    0.98    0.95    0.96


MP Linpack Benchmark (Single Precision NEON)

This benchmark is even less suitable for demonstrating multithreading performance, which was the intention, as the overheads of frequently starting threads are too high. Hence, tests with no threading are included. Results from the old and new compilations were again similar, confirming the high Pi 4B/3B+ performance gains with no threading.

Code:

MFLOPS 0 to 4 Threads, N 100, 500, 1000

Threads    None        1       2       4

        Gentoo 64b Pi 3B+ gcc 9

N  100     641.6    63.0    62.3    61.9
N  500     326.6   229.3   222.6   227.0
N 1000     320.1   275.0   274.3   275.2

        Gentoo 64b Pi 4B gcc 9

N  100    2076.2    98.6    96.6    96.2
N  500    1327.1   631.9   632.5   639.2
N 1000     394.6   375.3   382.3   375.7

        Comparison 64b Pi4/3B+

N  100      3.24    1.57    1.55    1.55
N  500      4.06    2.76    2.84    2.82
N 1000      1.23    1.36    1.39    1.37

        Comparison Pi4B gcc 9/6

N  100      0.92    1.01    0.99    0.99
N  500      0.82    0.95    0.98    0.95
N 1000      0.99    0.92    0.94    0.94


MP BusSpeed (read only) Benchmark

Other than identifying the likely effects of burst reading, such as on random access, the main area for comparison is reading all data (RdAll). The Pi 4B behaved as expected, with this speed being near twice that of data flow with word address increments of 2 (Inc2). The Pi 3B+ did not follow that normal pattern, so the 4B/3B+ comparisons are suspect.

Code:

   MB/Second Reading Data, 1, 2, 4 and 8 Threads

  KB        Inc32   Inc16    Inc8    Inc4    Inc2   RdAll

                          Gentoo 64b Pi 3B+ gcc 9

 12.3 1T     3453    4178    4428    3543    3584    2335
      2T     5594    7732    8086    6856    6924    4654
      4T     9065   12522   13157   12942   13415    9209
      8T     6661   10770   13266   11955   12573    8478
122.9 1T      640     646    1197    1970    2909    2272
      2T     1030    1012    2006    3671    5784    4528
      4T     1001    1041    2145    4266    8337    6729
      8T     1043    1061    2123    4005    8133    8572
12288 1T      114     104     241     444     932    1352
      2T      126     122     253     370    1005    1997
      4T      104     138     197     471    1133    1745
      8T      102      96     231     466     796    1893

                          Gentoo 64b Pi 4B gcc 9            RdAll   Pi 4B
                                                           4B/3B+ gcc 9/6

 12.3 1T     5573    5750    5057    5646    5800    9129    3.91    2.16
      2T     7191    9038   10035   11020   11125   17757    3.82    2.27
      4T     7023   12144   14591   17681   20490   29184    3.17    1.97
      8T     7553   11837   12565   15640   18546   30517    3.60    2.33
122.9 1T      672     922    1864    3092    4744    7741    3.41    1.86
      2T      577     947    2100    3051    8780   14975    3.31    1.83
      4T      519     983    1884    3980    8701   18139    2.70    1.24
      8T      515     951    1913    4181    8797   16899    1.97    1.24
12288 1T      230     261     499    1016    1678    3873    2.86    1.07
      2T      276     225     418     925    1929    5629    2.82    0.90
      4T      258     267     579     802    1749    5758    3.30    1.52
      8T      214     213     538    1069    2145    4680    2.47    1.31
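
The RdAll versus Inc2 relationship mentioned above can be checked with a few lines of Python. This is only a sketch: the numbers are copied from the 12.3 KB rows of the gcc 9 tables, and the function name is made up for illustration.

```python
# MB/second from the 12.3 KB rows above, for 1, 2, 4 and 8 threads.
PI_4B = {"Inc2": [5800, 11125, 20490, 18546],
         "RdAll": [9129, 17757, 29184, 30517]}
PI_3B_PLUS = {"Inc2": [3584, 6924, 13415, 12573],
              "RdAll": [2335, 4654, 9209, 8478]}

def rdall_over_inc2(results):
    """RdAll speed divided by Inc2 speed, one ratio per thread count."""
    return [round(a / b, 2) for a, b in zip(results["RdAll"], results["Inc2"])]

print("Pi 4B ", rdall_over_inc2(PI_4B))
print("Pi 3B+", rdall_over_inc2(PI_3B_PLUS))
```

For these rows the Pi 4B ratios are all well above 1, while every Pi 3B+ ratio is below 1, which is the anomaly that makes the 4B/3B+ RdAll comparisons suspect.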



MP-RandMem Benchmark

Some moderate Pi4/3B+ performance gains were produced but the older version was, possibly, a little faster than the gcc 9 compilation.

Code:

             MB/Second Using 1, 2, 4 and 8 Threads

         Serial Serial Random Random Serial Serial Random Random
KB+Thread  Read   RdWr   Read   RdWr   Read   RdWr   Read   RdWr

            Gentoo 64b Pi 3B+ gcc 9    Gentoo 64b Pi 4B gcc 9

 12.3 1T    4886   3581   4878   3590   5737   6884   5763   7537
      2T    8723   3550   8724   3550  11536   7592  10238   6898
      4T   16836   3498  17531   3509  21084   7575  15160   7390
      8T   15777   3459  16783   3466  20089   7339  15311   7200
122.9 1T    3913   3346    987    972   5739   7231   2006   1906
      2T    7285   3339   1753    964  10662   7217   1742   1896
      4T   12354   3344   2350    972  10376   6741   1815   1812
      8T   11841   3333   2300    962  10298   6937   1823   1848
12288 1T    1795    761     69     60   3477    905    181    162
      2T    1915    735    118     60   3750    794    215    164
      4T    2452    730    128     59   4669    968    259    162
      8T    1805    755    137     60   3419    981    301    157

                  4 Thread                    4 Thread
            Comparison 64b Pi4/3B+      Comparison Pi4B gcc 9/6

 12.3 4T    1.25   2.17   0.86   2.11   0.92   0.97   0.68   0.94
122.9 4T    0.84   2.02   0.77   1.86   0.95   0.93   0.98   0.94
12288 4T    1.90   1.33   2.02   2.75   1.00   1.03   0.78   0.95


MP-MFLOPS Benchmarks

There are three versions, single precision, double precision and single precision using NEON intrinsic functions. The single precision ones obtain up to 25 GFLOPS and half that for double precision.

On the Pi 4, the whole of the tests, in each program, can be completed in less than two seconds, probably not long enough for accurate comparisons.

Approximate performance gains, using gcc 9, indicate that the Pi 4B was between 3.5 and 4.5 times faster than the Pi 3B+, using cache based data, and around 30% faster when performance became RAM speed dependent. On the Pi 4B, gcc 9 results indicated some improvements in speed, compared with those from the earlier gcc 6 compilation, mainly on running the single precision version.

MP-MFLOPS SP

Code:

              MP-MFLOPS 64 Bit gcc 9 Thu Sep 26 12:36:54 2019

              FPU Add & Multiply using 1, 2, 4 and 8 Threads

           Gentoo 64b Pi 3B+  gcc 9               Gentoo 64b Pi 4B  gcc 9

      2 Ops/Word         32 Ops/Word      2 Ops/Word        32 Ops/Word
KB    12.8   128 12800  12.8   128 12800  12.8   128 12800  12.8   128 12800
MFLOPS
1T     827   805   371  3232  3157  2802  3162  3072   468  6754  6714  6340
2T    1608  1567   360  6420  6423  5286  6498  6029   496 13329 12397  7623
4T    1764  3142   400 11240 12355  6029 11709  6141   529 24825 25055  8723
8T    2548  2575   381 10813 11755  5827 10828  8158   493 19452 22190  8426

      ........... 64b Pi4/3B+ ..........  .......... Pi4B gcc 9/6 ..........

1T    3.82  3.82  1.26  2.09  2.13  2.26  1.09  1.08  1.02  1.17  1.17  1.17
2T    4.04  3.85  1.38  2.08  1.93  1.44  1.14  1.14  1.09  1.22  1.11  0.96
4T    6.64  1.95  1.32  2.21  2.03  1.45  1.13  1.10  1.08  1.37  1.15  1.14



MP-MFLOPS DP

Code:

    MP-MFLOPS 64 Bit gcc 9 Double Precision Thu Sep 26 22:05:10 2019

        FPU Add & Multiply using 1, 2, 4 and 8 Threads

           Gentoo 64b Pi 3B+  gcc 9               Gentoo 64b Pi 4B  gcc 9

    2 Ops/Word         32 Ops/Word        2 Ops/Word        32 Ops/Word
KB    12.8   128 12800  12.8   128 12800  12.8   128 12800  12.8   128 12800
MFLOPS
1T     384   350   127  1582  1546  1372   657   663   183  3283  3358  3169
2T     753   753   184  3109  3157  2645  3203  2690   223  6573  6353  4535
4T    1346  1330   194  4228  6099  3067  5799  3866   292 12432 12665  4906
8T    1234  1340   201  4888  5748  3190  5322  4583   269 10738  8891  4521

      ........... 64b Pi4/3B+ ..........  .......... Pi4B gcc 9/6 ..........

1T    1.71  1.89  1.44  2.08  2.17  2.31  0.45  0.48  0.81  0.97  0.99  1.00
2T    4.25  3.57  1.21  2.11  2.01  1.71  1.13  0.96  0.98  0.98  0.94  1.00
4T    4.31  2.91  1.51  2.94  2.08  1.60  1.12  1.13  1.16  1.19  0.99  1.03


NEON MP MFLOPS SP

Code:

 MP-MFLOPS NEON Intrinsics 64 Bit gcc 9 Thu Sep 26 22:02:00 2019

    FPU Add & Multiply using 1, 2, 4 and 8 Threads

      Gentoo 64b Pi 3B+  gcc 9            Gentoo 64b Pi 4B  gcc 9

      2 Ops/Word         32 Ops/Word      2 Ops/Word        32 Ops/Word
KB    12.8   128 12800  12.8   128 12800  12.8   128 12800  12.8   128 12800

1T     769   765   354  3009  2967  2638  1233  1313   507  6451  6428  6224
2T    1315  1324   293  5863  5990  5097  6307  4824   389 12559 12784  7612
4T    1750  2647   380 10081 11250  5748  8101  5186   531 24762 24708  7902
8T    2180  2664   392  9719 11010  6368  6782  8444   504 22598 24113  7979

      ........... 64b Pi4/3B+ ..........  .......... Pi4B gcc 9/6 ..........

1T    1.60  1.72  1.43  2.14  2.17  2.36  0.37  0.41  0.95  1.00  0.98  1.00
2T    4.80  3.64  1.33  2.14  2.13  1.49  1.37  0.78  0.70  0.96  0.98  0.90
4T    4.63  1.96  1.40  2.46  2.20  1.37  1.29  0.91  0.94  1.04  1.02  0.84


OpenMP MFLOPS Benchmark

As expected, this program uses all four CPU cores. A second compilation, the notOpenMP MFLOPS Benchmark, is produced without OpenMP directives, to use just one core.

The benchmark carries out the same calculations as MP-MFLOPS, with an additional section using 8 operations per data word read. It was a quick conversion from a benchmark that measures CUDA floating point performance, hence the meaningless titles included in the following example log file. Data sizes of 400 KB to 40 MB cover L2 cache and RAM.

Code:

            OpenMP MFLOPS64g9 Thu Sep 26 16:52:54 2019

  Test             4 Byte  Ops/   Repeat    Seconds   MFLOPS       First   All
                    Words  Word   Passes                         Results  Same

 Data in & out     100000     2     2500   0.124228     4025    0.929538   Yes
 Data in & out    1000000     2      250   0.842066      594    0.992550   Yes
 Data in & out   10000000     2       25   0.873622      572    0.999250   Yes

 Data in & out     100000     8     2500   0.147889    13524    0.957117   Yes
 Data in & out    1000000     8      250   0.904478     2211    0.995518   Yes
 Data in & out   10000000     8       25   0.951405     2102    0.999549   Yes

 Data in & out     100000    32     2500   0.324246    24673    0.890215   Yes
 Data in & out    1000000    32      250   1.097993     7286    0.988088   Yes
 Data in & out   10000000    32       25   1.045087     7655    0.998796   Yes

                End of test Thu Sep 26 16:53:00 2019
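
The MFLOPS column in the log appears to follow from the other columns as words x operations per word x repeat passes / seconds. That relationship is inferred from the figures, not taken from the benchmark source, so treat the sketch below as an assumption that happens to reproduce the reported values.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one log line."""
    return words * ops_per_word * passes / seconds / 1e6

# (words, ops/word, passes, seconds, reported MFLOPS) copied from the log above
LOG_ROWS = [
    (100000, 2, 2500, 0.124228, 4025),
    (1000000, 2, 250, 0.842066, 594),
    (100000, 8, 2500, 0.147889, 13524),
    (100000, 32, 2500, 0.324246, 24673),
]

for words, ops, passes, secs, reported in LOG_ROWS:
    print(f"calculated {mflops(words, ops, passes, secs):8.0f}  reported {reported}")
```

Note that each line carries out the same total number of word reads (words x passes), so the running time differences between lines reflect data location and operations per word.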


Following are results from gcc 9 compiled runs on the Pi 3B+ and Pi 4B, using all threads and using the single thread, one core version. Maximum speed was near the 25 GFLOPS obtained using MP-MFLOPS.

Pi 4B/Pi 3B+ performance improvements were mainly more than a factor of two, using L2 cache or when the more CPU speed dependent 32 operations per word tests were used. The gcc 9/6 performance ratios indicate no real advantage for either compilation.

Code:

           gcc 9
Mbytes/    Pi 3B+       Pi 4B        gcc 9       Pi 4B
Ops/Word   64b          64b          4B/3B+      gcc 9/6
           All    1T    All     1T   All    1T   All    1T

0.4/2     2341   795   4025   2236  1.72  2.81  0.75  0.80
4/2        381   362    594    403  1.56  1.11  1.06  0.72
40/2       401   387    572    493  1.43  1.27  1.05  0.84

0.4/8     6051  1906  13524   5373  2.24  2.82  0.88  0.97
4/8       1491  1352   2211   1948  1.48  1.44  0.99  0.92
40/8      1598  1418   2102   2308  1.32  1.63  0.89  1.00

0.4/32   12002  3185  24673   6786  2.06  2.13  1.21  1.20
4/32      5641  2809   7286   6385  1.29  2.27  0.90  1.17
40/32     6142  2809   7655   6415  1.25  2.28  0.90  1.17


MemSpeed OpenMP Benchmark

As indicated earlier, this benchmark is not really suitable for demonstrating multithreading performance, as reflected in an example of the full results below.

Code:
 
     Memory Reading Speed Test OpenMP 64 Bit gcc 9 by Roy Longbottom

               Start of test Thu Sep 26 15:12:19 2019

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       4    7616   8480   8749   7548   8520   8530  35856  18594  18601
       8    8195   8660   8876   8147   5740   8365  37153  18878  18864
      16    7992   7684   8189   8064   8139   8023  35774  18896  18898
      32    8975   8535   8024   9048   8536   8512  37465  18392  19024
      64    8622   7997   8057   8511   7953   7994  19618  16857  16701
     128   11940  11637  11554  12101  11659  11498  13815  13417  13964
     256   17008  17339  16359  17104  17396  17038  11877  12344  12376
     512   17740  15986  18607  17522  18547  15612  12575  13616  13495
    1024    7011  10208  10016  11310   5287  11413   7060   6279  10045
    2048    7024   4201   7006   7017   6943   3225   2822   3386   3391
    4096    3854   7002   7126   6912   7074   3985   2199   3127   3132
    8192    2632   6950   7151   5291   2796   6813   2546   3091   2403
   16384    7350   7073   3537   7583   5327   3200   2609   3053   1907
   32768    7514   7616   7725   7807   2344   2936   2702   2559   3042
   65536    7065   2937   7571   4306   7086   2975   2127   3017   2677
  131072    1772   1779   2562   8092   2583   2800   2035   1866   2869

                End of test Thu Sep 26 15:12:48 2019


As for MP-MFLOPS, there is a non-MP version to demonstrate performance when using a single CPU core. A summary of results and comparisons, in key areas with data from L1 cache, L2 cache and RAM, is shown below, using gcc 9 compilations on the Pi 3B+ and Pi 4B, plus earlier gcc 6 details.

The Pi 4B versus Pi 3B+ comparisons, for single core tests, indicate across the board 4B improvements that are not necessarily carried forward to the multithreaded tests, the highest gains being for single precision floating point calculations.

The single core single precision and integer test functions indicate faster speed from gcc 9 compilations, where calculations are involved, but not always affecting multithreading in the same way.

Typical MP versus non-MP performance ratios, for each group, were between 0.2 and 3.5. In other words, at the low end, one core could be five times faster than four cores.

Code:

    Memory   x[m]=x[m]+s*y[m] Int+  x[m]=x[m]+y[m]        x[m]=y[m]
    KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

                         Gentoo Pi 4B gcc 6

          8   8238   8906   3150   8308   9253   2339  29033  15749   2673
        256  16320  17582   3236  17404  16671   2652  13683  14741   2411
      65536   2462   2245   2390   7160   3945   2742   2746   2386   2259
   Not OMP
          8  15527  13976  15533  15504  14021  15537  11563   9311   7794
        256  12236  11434  12096  12084  11740  12156   7883   8044   7818
      65536   2047   2046   2037   2034   2054   2071   2567   2554   2547

                         Gentoo 64b Pi 4B gcc 9

          8   8195   8660   8876   8147   5740   8365  37153  18878  18864
        256  17008  17339  16359  17104  17396  17038  11877  12344  12376
      65536   7065   2937   7571   4306   7086   2975   2127   3017   2677
   Not OMP
          8  13380  21857  24416  13414  23420  24400  11630   9313   9312
        256  11705  20247  21090  12041  21382  21013   8081   8182   5919
      65536   2030   3034   2135   2047   3035   2394   2550   2535   2546

                         Gentoo 64b Pi 3B+ gcc 9

          8   3908   3630   9548  10512   5230   9273  13649   6850   9599
        256   6730   3456   6358   9313   5346   9166   9375   5612    858
      65536   2413   1137   2957   3982   1163   3052    808    904    897
   Not OMP
          8   4274   5139   7932   5442   5827   7934   6162   4334   4339
        256   3703   4670   7152   4322   5378   7166   5452   4092   4094
      65536   1035   1582   1649   1098   1616   1494    652    794    790

                         Pi 4B / Pi3B+ gcc 9

          8   2.10   2.39   0.93   0.78   1.10   0.90   2.72   2.76   1.97
        256   2.53   5.02   2.57   1.84   3.25   1.86   1.27   2.20  14.42
      65536   2.93   2.58   2.56   1.08   6.09   0.97   2.63   3.34   2.98
   Not OMP
          8   3.13   4.25   3.08   2.46   4.02   3.08   1.89   2.15   2.15
        256   3.16   4.34   2.95   2.79   3.98   2.93   1.48   2.00   1.45
      65536   1.96   1.92   1.29   1.86   1.88   1.60   3.91   3.19   3.22

                         gcc 9 / gcc 6

          8   0.99   0.97   2.82   0.98   0.62   3.58   1.28   1.20   7.06
        256   1.04   0.99   5.06   0.98   1.04   6.42   0.87   0.84   5.13
      65536   2.87   1.31   3.17   0.60   1.80   1.08   0.77   1.26   1.19
   Not OMP
          8   0.86   1.56   1.57   0.87   1.67   1.57   1.01   1.00   1.19
        256   0.96   1.77   1.74   1.00   1.82   1.73   1.03   1.02   0.76
      65536   0.99   1.48   1.05   1.01   1.48   1.16   0.99   0.99   1.00
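
As a worked example of the MP versus non-MP ratios, here is a small Python sketch using two 8 KB entries copied from the Pi 4B tables above. The labels and function name are only for illustration.

```python
# MB/second at 8 KB on the Pi 4B: (OpenMP, not OpenMP)
SAMPLES = {
    "gcc 6 x[m]=x[m]+s*y[m] Int32": (3150, 15533),
    "gcc 9 x[m]=y[m] Dble": (37153, 11630),
}

def mp_ratio(omp, not_omp):
    """Speed of the OpenMP run relative to the single core run."""
    return omp / not_omp

for label, (omp, not_omp) in SAMPLES.items():
    ratio = mp_ratio(omp, not_omp)
    print(f"{label}: MP/non-MP {ratio:.2f}, non-MP/MP {1 / ratio:.1f}")
```

The first entry gives a ratio near 0.2, where the single core version is close to five times faster than the four core one. The second is above 3, where OpenMP provides a real gain.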


_________________
Regards

Roy
Sakaki
Guru


Joined: 21 May 2014
Posts: 409

PostPosted: Sun Oct 20, 2019 10:36 am    Post subject: Reply with quote

roylongbottom,

very interesting results as always!

Have you tried compiling any of your benchmarks with clang/llvm? This compiler is also included on the gentoo-on-rpi-64bit image, and produced some interesting differences wrt gcc 9 on other benchmarks (see e.g. this post ff).
_________________
Regards,

sakaki
roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

PostPosted: Mon Oct 21, 2019 10:10 am    Post subject: Reply with quote

Sakaki wrote:
roylongbottom,

very interesting results as always!

Have you tried compiling any of your benchmarks with clang/llvm? This compiler is also included on the gentoo-on-rpi-64bit image, and produced some interesting differences wrt gcc 9 on other benchmarks (see e.g. this post ff).


I have compiled a number of my benchmarks via clang and run them. In general floating point was a bit slower. Following are results for MP MFLOPS DP, both compiled using the same parameters:

Code:

clang mpmflopsdp.c -lm -lrt -O3 -lpthread -march=armv8-a
gcc   mpmflopsdp.c -lm -lrt -O3 -lpthread -march=armv8-a


There might be better options to use (for both) but current choices can be simply overwhelming. Anyway, the clang results were slower at 32 Ops/Word.

Code:
     MP-MFLOPS 64 Bit clang Double Precision

   FPU Add & Multiply using 1, 2, 4 and 8 Threads

        2 Ops/Word              32 Ops/Word
 KB     12.8     128   12800    12.8     128   12800
 MFLOPS
 1T     1643    1603     270    2447    2450    2418
 2T     2424    3152     265    4798    4907    4274
 4T     5884    3087     264    9252    9337    3889
 8T     5240    4633     253    7960    8897    4322

        MP-MFLOPS 64 Bit gcc 9 Double Precision

 1T     1656    1567     259    3361    3360    3142
 2T     2770    2951     291    6595    6592    4454
 4T     5800    2766     302   12687   12910    4936
 8T     2286    3149     299   11144   11815    4904


Following are disassembled details of the main inner loops. Both use SIMD instructions operating on pairs of 64 bit values (.2d), but clang did not implement fused multiply and add or subtract (fmla or fmls) instructions, which gcc used to reduce the arithmetic instruction count from 32 to 22.

Code:

                clang                                    gcc 9
.LBB4_4                                  .L48:
      ldr     q20, [x11], #16                  ldr     q5, [x2]
      subs    x10, x10, #2                     fadd    v1.2d,  v15.2d, v5.2d
      fadd    v21.2d, v20.2d, v30.2d           fadd    v7.2d,  v13.2d, v5.2d
      fadd    v22.2d, v20.2d, v8.2d            fadd    v17.2d, v11.2d, v5.2d
      fmul    v21.2d, v21.2d, v31.2d           fadd    v16.2d, v5.2d,  v9.2d
      fmul    v22.2d, v22.2d, v9.2d            fmul    v1.2d,  v1.2d,  v14.2d
      fsub    v21.2d, v21.2d, v22.2d           fmls    v1.2d,  v12.2d, v7.2d
      fadd    v22.2d, v20.2d, v10.2d           fadd    v7.2d,  v5.2d,  v31.2d
      fmul    v22.2d, v22.2d, v11.2d           fmla    v1.2d,  v10.2d, v17.2d
      fadd    v21.2d, v22.2d, v21.2d           fadd    v17.2d, v5.2d,  v29.2d
      fadd    v22.2d, v20.2d, v12.2d           fmls    v1.2d,  v16.2d, v8.2d
      fmul    v22.2d, v22.2d, v13.2d           fadd    v16.2d, v5.2d,  v27.2d
      fsub    v21.2d, v21.2d, v22.2d           fmla    v1.2d,  v7.2d,  v30.2d
      fadd    v22.2d, v20.2d, v14.2d           fadd    v7.2d,  v5.2d,  v25.2d
      fmul    v22.2d, v22.2d, v15.2d           fmls    v1.2d,  v17.2d, v28.2d
      fadd    v21.2d, v22.2d, v21.2d           fadd    v17.2d, v5.2d,  v23.2d
      fadd    v22.2d, v20.2d, v16.2d           fmla    v1.2d,  v16.2d, v26.2d
      fmul    v22.2d, v22.2d, v7.2d            fadd    v16.2d, v5.2d,  v21.2d
      fsub    v21.2d, v21.2d, v22.2d           fadd    v5.2d,  v5.2d,  v19.2d
      fadd    v22.2d, v20.2d, v6.2d            fmls    v1.2d,  v7.2d,  v24.2d
      fmul    v22.2d, v22.2d, v5.2d            fmla    v1.2d,  v17.2d, v22.2d
      fadd    v21.2d, v22.2d, v21.2d           fmls    v1.2d,  v16.2d, v20.2d
      fadd    v22.2d, v20.2d, v4.2d            fmla    v1.2d,  v5.2d,  v18.2d
      fmul    v22.2d, v22.2d, v3.2d            str     q1, [x2], 16
      fsub    v21.2d, v21.2d, v22.2d           cmp     x2, x3
      fadd    v22.2d, v20.2d, v2.2d            bne     .L48
      fmul    v22.2d, v22.2d, v1.2d
      fadd    v21.2d, v22.2d, v21.2d
      fadd    v22.2d, v20.2d, v17.2d
      fadd    v20.2d, v20.2d, v19.2d
      fmul    v22.2d, v22.2d, v18.2d
      fmul    v20.2d, v20.2d, v0.2d
      fsub    v21.2d, v21.2d, v22.2d
      fadd    v20.2d, v20.2d, v21.2d
      str     q20, [x12]
      mov     x12, x11
      b.ne    .LBB4_4
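
The 32 versus 22 instruction counts can be confirmed with a short Python sketch. The per mnemonic counts were tallied by hand from the two listings above (loads, stores, compares and branches excluded), and a fused instruction is counted as two floating point operations per element.

```python
# Arithmetic instructions per loop pass, tallied by hand from the listings.
CLANG = {"fadd": 16, "fmul": 11, "fsub": 5}
GCC9 = {"fadd": 11, "fmul": 1, "fmla": 5, "fmls": 5}

# Floating point operations carried out per data element by each instruction.
OPS = {"fadd": 1, "fsub": 1, "fmul": 1, "fmla": 2, "fmls": 2}

def totals(counts):
    """Return (instruction count, operations per element) for one loop."""
    instructions = sum(counts.values())
    operations = sum(OPS[name] * n for name, n in counts.items())
    return instructions, operations

print("clang", totals(CLANG))
print("gcc 9", totals(GCC9))
```

Both loops carry out the same 32 operations per word, but gcc 9 does so with 22 instructions. The measured one thread 12.8 KB speeds at 32 Ops/Word, 2447 MFLOPS for clang against 3361 for gcc 9, differ by a similar proportion.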

_________________
Regards

Roy
roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

PostPosted: Mon Nov 11, 2019 5:06 pm    Post subject: Raspberry Pi 4B 64 Bit Stress Tests Reply with quote

Raspberry Pi 4B 64 Bit Stress Tests

The first stress tests used cover the central processor, for which an extra program was produced to measure the environment whilst running. Variable parameters are:

Passes and sampling seconds to determine running time. If the stress test also has sampling periods, it is normally not possible to synchronise them but approximate periods can be matched.

CPU MHz - This can vary faster than any sampling time based on seconds, but the general trend can be useful. Tests that measure speed over sampling periods provide a better indication.

Core Voltage - This appears to vary a little, reason unknown.

CPU Temperature - assuming that it is correct, as it changes slowly, this is the most useful measurement.

PMIC temperature - No issue so far with Power Management Integrated Circuit temperatures.

Code:
  ###################################################

 Parameters - upper or lower case

 ./RPiHeatMHzVolts2 passes 33 secs 20 log 12
 or
./RPiHeatMHzVolts2 P 33 S 20 L 12

 For 33 samples at 20 second intervals, log file RPiHeatMHz12.txt
 
 To cover 10 minute test
 
###################################################

 Temperature and CPU MHz Measurement

 Start at Mon Oct 28 20:49:52 2019

 Using 33 samples at 20 second intervals

 Seconds
    0.0   ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
   20.0   ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
   40.3   ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=66.5'C
   60.5   ARM MHz=1500, core volt=0.8437V, CPU temp=79.0'C, pmic temp=69.4'C
   80.7   ARM MHz=1500, core volt=0.8437V, CPU temp=80.0'C, pmic temp=70.3'C
  101.0   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C
  121.2   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  141.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  161.7   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  181.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
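
For anyone wanting to process these logs, the following Python sketch extracts the fields from sample lines and reports where the clock dropped below 1500 MHz. The regular expression is based only on the line layout shown above, and the function and variable names are hypothetical, not part of RPiHeatMHzVolts2.

```python
import re

# Matches lines such as:
#   141.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
LINE = re.compile(
    r"^\s*(?P<secs>[\d.]+)\s+ARM MHz=\s*(?P<mhz>\d+), "
    r"core volt=(?P<volt>[\d.]+)V, "
    r"CPU temp=(?P<cpu>[\d.]+)'C, pmic temp=(?P<pmic>[\d.]+)'C"
)

def parse(log_text):
    """Return one dict of float values per sample line, skipping other lines."""
    samples = []
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:
            samples.append({k: float(v) for k, v in match.groupdict().items()})
    return samples

SAMPLE_LOG = """
  121.2   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  141.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  161.7   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
  181.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
"""

samples = parse(SAMPLE_LOG)
throttled = [s["secs"] for s in samples if s["mhz"] < 1500]
print("throttled at seconds:", throttled)
print("average MHz:", sum(s["mhz"] for s in samples) / len(samples))
```

The pattern allows a space after "MHz=", as the log pads frequencies below 1000 MHz.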


Next are results for the High Performance Linpack, which runs for a long time, significantly increasing CPU temperatures and, without a cooling fan in place, slowing down. These results can be compared with those for the 32 bit version, available in the report

https://www.researchgate.net/publication/334561068_Raspberry_Pi_4B_Stress_Tests_Including_High_Performance_Linpack

This shows that the same sort of performance levels as the 64 bit version are obtained, with and without a cooling fan.

Following the HPL results here are some for my integer and floating point stress tests. Although further comparative tests are needed to be conclusive, it does seem that the 64 bit floating point versions are faster than the 32 bit varieties and subject to lower temperature increases.


High Performance Linpack Stress Test

The earlier HPL benchmark results quoted speeds of 8.1 GFLOPS on a cold start and 10.8 GFLOPS later, with a cooling fan in operation for both. The first results below were run without a fan, at a room temperature of around 21°C, producing 7.6 GFLOPS on a cold start. Average CPU frequency then came out at 1056 MHz, with an average temperature of 80.3°C.

The second results followed a warm reboot to use a different version of Gentoo with HPL installed, obtaining 5.54 GFLOPS, with severe CPU frequency throttling, down to 600 MHz, and temperatures up to 86°C. Averages were 790 MHz and 84.1°C, as shown in the log below.

Shortly afterwards, with the fan in place, the Pi ran at 1500 MHz continuously, achieving 10.4 GFLOPS, with a maximum temperature of 64°C.

Code:
================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             702.81              7.589e+00
HPL_pdgesv() start time Sat Aug 24 10:42:58 2019

HPL_pdgesv() end time   Sat Aug 24 10:54:41 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0008453 ...... PASSED
================================================================================

                   Example 2 - Note different sumchecks again

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       20000   128     2     2             963.16              5.538e+00
HPL_pdgesv() start time Tue Oct 29 11:51:10 2019

HPL_pdgesv() end time   Tue Oct 29 12:07:13 2019

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0009005 ...... PASSED
================================================================================

 Temperature and CPU MHz Measurement

 Start at Tue Oct 29 11:50:27 2019

 Using 40 samples at 30 second intervals

 Seconds
    0.0   ARM MHz=1500, core volt=0.8542V, CPU temp=63.0'C, pmic temp=58.0'C
   30.0   ARM MHz=1500, core volt=0.8542V, CPU temp=79.0'C, pmic temp=69.4'C
   60.3   ARM MHz=1000, core volt=0.8542V, CPU temp=83.0'C, pmic temp=72.2'C
   91.6   ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
  122.2   ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=74.1'C
  152.7   ARM MHz= 750, core volt=0.8490V, CPU temp=83.0'C, pmic temp=74.1'C
  183.2   ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
  213.8   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  244.3   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  274.7   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  305.2   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  335.6   ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  366.1   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  396.6   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  427.2   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  457.5   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  488.0   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  518.6   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
  549.0   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  579.6   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
  610.1   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  640.6   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  671.1   ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
  701.6   ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
  732.0   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  762.4   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  792.9   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
  823.4   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
  853.9   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  884.4   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  914.9   ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  945.3   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
  975.8   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
 1006.3   ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
 1036.7   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
 1067.0   ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
 Averages          790                              84.1              75.5
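
The Gflops figures in the HPL output can be related to N and the running time. The sketch below assumes the usual HPL operation count of 2/3 N^3 + 3/2 N^2. That formula is my understanding of what HPL reports and is not shown in the output above, but it does reproduce both results.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for an N x N solve, assuming 2/3*N^3 + 3/2*N^2 operations."""
    operations = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return operations / seconds / 1e9

# (N, time in seconds, reported Gflops) from the two results above
RUNS = [
    (20000, 702.81, 7.589),
    (20000, 963.16, 5.538),
]

for n, seconds, reported in RUNS:
    print(f"N={n} time={seconds:7.2f}  calculated {hpl_gflops(n, seconds):.3f}"
          f"  reported {reported:.3f}")
```

As the operation count is fixed for a given N, the drop from 7.589 to 5.538 GFLOPS is purely the longer running time, 963.16 against 702.81 seconds, caused by the throttling shown in the log.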


Integer Stress Test - MP-IntStress64, MP-IntStress

As for my other CPU stress tests, the 4 and 8 thread results are shown, from running in benchmarking mode. Run time parameters are also provided, the commands used for the particular tests being included.

In this case, summaries of separate tests for L1 cache, L2 cache and RAM are given. During the 10 minute sessions, the cache tests were mainly running at 1000 MHz, with those using RAM at the full speed of 1500 MHz. No temperatures above 84°C were recorded.

Examining the full detail of the first test indicated that average CPU MHz and measured MB/second were around 75% of the maximum.

Code:


                KB    KB    MB            Same All
   Secs Thrds   16   160    16  Sumcheck   Tests

   3.0    4  28715 26652  3345  5A5A5A5A    Yes
   3.0    8  30292 26310  3334  AAAAAAAA    Yes


./RPiHeatMHzVolts2 passes 66 secs 10 log 34 - used for all 10 minute stress tests

==== Stress Test Parameters - upper or lower case, only first letter counts ====

 Threads 1, 2, 4, 8, 16, 32  KB between 12 and 15624  Log < 100  Minutes any > 0

  ./MP-IntStress64 KB 16 Threads 8 Mins 10 Log 34

Seconds                                                                   MB/sec
  0.0  ARM MHz=1500, core volt=0.8455V, CPU temp=62.0'C, pmic temp=57.1'C
 10.0  ARM MHz=1500, core volt=0.8455V, CPU temp=69.0'C, pmic temp=62.8'C  28695
 20.2  ARM MHz=1500, core volt=0.8402V, CPU temp=73.0'C, pmic temp=64.6'C  28729
152.5  ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=72.2'C  21523
305.5  ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C  20026
448.2  ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C  19611
601.1  ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C  19199
                                                                 %Min/Max   66.9

  ./MP-IntStress64 KB 160 Threads 8 Mins 10 Log 34

Seconds                                                                   MB/sec
  0.0  ARM MHz=1500, core volt=0.8402V, CPU temp=64.0'C, pmic temp=57.1'C
 10.0  ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C  26323
 20.2  ARM MHz=1500, core volt=0.8402V, CPU temp=75.0'C, pmic temp=66.5'C  26140
152.9  ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=74.1'C  18016
306.5  ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C  17306
449.8  ARM MHz=1000, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C  17248
603.3  ARM MHz= 750, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C  16832
                                                                 %Min/Max   63.9

  ./MP-IntStress64 KB 16000 Threads 8 Mins 10 Log 34

Seconds                                                                   MB/sec
  0.0  ARM MHz=1500, core volt=0.8402V, CPU temp=66.0'C, pmic temp=60.9'C
 10.0  ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C   3372
 20.3  ARM MHz=1500, core volt=0.8402V, CPU temp=72.0'C, pmic temp=62.8'C   3369
155.2  ARM MHz=1500, core volt=0.8402V, CPU temp=76.0'C, pmic temp=68.4'C   3365
309.8  ARM MHz=1500, core volt=0.8402V, CPU temp=79.0'C, pmic temp=69.4'C   3367
454.4  ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C   3367
599.7  ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C   3368
                                                                 %Min/Max   99.8
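
As a cross-check on the logs above, here is a minimal Python sketch that recomputes the %Min/Max figure from the MB/second samples shown. It assumes that %Min/Max means the slowest sample as a percentage of the fastest, which is an inference from the numbers rather than something stated by the program. The dictionary keys and the function name are only illustrative.

```python
# MB/second samples copied from the three log excerpts above. The excerpts are
# only a subset of the full 10 minute logs, so small differences are possible.
samples = {
    "KB 16 (L1 cache)": [28695, 28729, 21523, 20026, 19611, 19199],
    "KB 160 (L2 cache)": [26323, 26140, 18016, 17306, 17248, 16832],
    "KB 16000 (RAM)": [3372, 3369, 3365, 3367, 3367, 3368],
}


def percent_min_max(values):
    """Slowest sample as a percentage of the fastest (assumed meaning of %Min/Max)."""
    return 100.0 * min(values) / max(values)


results = {name: percent_min_max(values) for name, values in samples.items()}
for name, pct in results.items():
    print(f"{name:18s} %Min/Max = {pct:.1f}")
```

With the samples shown, this gives 63.9 for the L2 test and 99.8 for the RAM test, matching the logs. The L1 excerpt gives 66.8 against the logged 66.9, presumably because the full log contains samples that are not reproduced here.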


Single Precision Floating Point Stress Test - MP-FPUStress64, MP-FPUStress

Two sets of result summaries are provided below, both using 1280 KB memory space and 8 threads. With four cores, this results in data being in L2 cache (4 x 160 KB) to run at full speed, with additional overhead of moving data to/from RAM. One test uses 8 operations per word, with 32 in the other. With hot starts, neither reached a CPU temperature of 84°C, and both had similar performance degradation at the highest temperatures.

After writing the above, the 32 bit stress test was repeated, with results shown below. Although a single run is not conclusive, the results indicate that the impact was more severe than in the 64 bit run, with a CPU speed sample reducing to 600 MHz, higher temperatures and a larger performance degradation.

Code:
 
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   4.6    T4   2  9223  7520   519   40392  76406  99700
   6.0    T8   2  9520 10471   545   40392  76406  99700
  11.3    T4   8 19087 21040  2044   54764  85092  99820
  12.9    T8   8 19747 21107  2016   54764  85092  99820
  22.2    T4  32 25732 26704  9160   35206  66015  99520
  24.1    T8  32 25708 25770  8927   35206  66015  99520

 ==== Stress Test Parameters - upper or lower case, only first letter counts ====

Threads 1,2,4,8,16,32,64  KB 12 to 15624  Ops/Wordd 2,8,32  Log<100  Minutes any>0


./MP-FPUStress64 KB 1280 T 8 Ops 8 Mins 10 Log 33

 Seconds                                                                  MFLOPS
  0.0  ARM MHz=1500, core volt=0.8437V, CPU temp=64.0'C, pmic temp=59.0'C
 10.0  ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C  17309
 20.2  ARM MHz=1500, core volt=0.8437V, CPU temp=75.0'C, pmic temp=66.5'C  18018
101.9  ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C  14224
204.2  ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C  12806
306.8  ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=73.1'C  12447
409.4  ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C  11870
501.6  ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C  12191
604.1  ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C  12169
                                                                 %Min/Max   65.9   

  ./MP-FPUStress64 KB 1280 T 8 Ops 32 Mins 10 Log 33

Seconds                                                                   MFLOPS
  0.0  ARM MHz=1500, core volt=0.8437V, CPU temp=65.0'C, pmic temp=59.0'C
 10.0  ARM MHz=1500, core volt=0.8437V, CPU temp=72.0'C, pmic temp=65.6'C  22634
 20.2  ARM MHz=1500, core volt=0.8437V, CPU temp=76.0'C, pmic temp=67.5'C  22992
101.9  ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  18629
204.0  ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C  16674
306.3  ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  16448
408.6  ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C  16158
500.7  ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C  16081
603.0  ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C  15553
                                                                 %Min/Max   67.6

 ================================================================================

          32 Bit Version   ./MP-FPUStress KB 1280 T 8 Ops 32 Mins 10 Log 73

Seconds                                                                   MFLOPS
  0.0  ARM MHz=1500, core volt=0.8560V, CPU temp=56.0'C, pmic temp=50.5'C
 10.0  ARM MHz=1500, core volt=0.8560V, CPU temp=70.0'C, pmic temp=60.9'C  20233
 20.7  ARM MHz=1500, core volt=0.8560V, CPU temp=74.0'C, pmic temp=64.6'C  20221
106.4  ARM MHz=1000, core volt=0.8560V, CPU temp=83.0'C, pmic temp=70.3'C  14173
204.3  ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=73.1'C  13115
302.2  ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  12650
400.2  ARM MHz= 750, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  11957
508.8  ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  11485
585.1  ARM MHz= 600, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C  11454
606.9  ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C  11242
                                                                 %Min/Max   55.6
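
For anyone wanting to summarise these monitor logs automatically, the following is a rough Python sketch of a parser for the line format shown above. The regular expression, the function names and the dictionary keys are my own assumptions based on the excerpts, not part of the benchmark programs. The sample text is the 32 bit log from above.

```python
import re

# One monitor line, e.g.
# " 10.0  ARM MHz=1500, core volt=0.8560V, CPU temp=70.0'C, pmic temp=60.9'C  20233"
# The trailing rate (MFLOPS or MB/sec) is missing on the first sample.
LOG_LINE = re.compile(
    r"^\s*(?P<secs>\d+\.\d+)\s+ARM MHz=\s*(?P<mhz>\d+), "
    r"core volt=(?P<volt>\d+\.\d+)V, "
    r"CPU temp=(?P<cpu>\d+\.\d+)'C, "
    r"pmic temp=(?P<pmic>\d+\.\d+)'C"
    r"(?:\s+(?P<rate>\d+))?\s*$"
)

LOG_32BIT = """\
  0.0  ARM MHz=1500, core volt=0.8560V, CPU temp=56.0'C, pmic temp=50.5'C
 10.0  ARM MHz=1500, core volt=0.8560V, CPU temp=70.0'C, pmic temp=60.9'C  20233
 20.7  ARM MHz=1500, core volt=0.8560V, CPU temp=74.0'C, pmic temp=64.6'C  20221
106.4  ARM MHz=1000, core volt=0.8560V, CPU temp=83.0'C, pmic temp=70.3'C  14173
204.3  ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=73.1'C  13115
302.2  ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  12650
400.2  ARM MHz= 750, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  11957
508.8  ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C  11485
585.1  ARM MHz= 600, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C  11454
606.9  ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C  11242
"""


def parse_log(text):
    """Return one dict per recognised monitor line; other lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = LOG_LINE.match(line)
        if m is None:
            continue
        rows.append({
            "secs": float(m["secs"]),
            "mhz": int(m["mhz"]),
            "cpu_temp": float(m["cpu"]),
            "pmic_temp": float(m["pmic"]),
            "rate": int(m["rate"]) if m["rate"] else None,
        })
    return rows


def summarise(rows):
    """Lowest clock, highest CPU temperature and min/max rate percentage."""
    rates = [r["rate"] for r in rows if r["rate"] is not None]
    return {
        "min_mhz": min(r["mhz"] for r in rows),
        "max_cpu_temp": max(r["cpu_temp"] for r in rows),
        "pct_min_max": 100.0 * min(rates) / max(rates),
    }


rows = parse_log(LOG_32BIT)
summary = summarise(rows)
print(summary)
```

For the 32 bit excerpt this reports a lowest clock sample of 600 MHz, a highest CPU temperature of 85.0°C and a min/max percentage that rounds to the logged 55.6.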


Double Precision Floating Point Stress Test - MP-FPUStress64DP, MP-FPUStressDP

Below are full results for a 10 minute test using the double precision floating point stress test, with data in L2 cache and four cores in use. Although the measured MFLOPS was greater than that obtained by HPL Linpack, the same range of high temperatures and performance degradation was not generated.

The 32 bit version was also rerun, producing results similar to those at 64 bits.

Code:
             Ops/   KB    KB    MB      KB     KB     MB
  Secs  Thrd Word 12.8   128  12.8    12.8    128   12.8

   8.9    T4   2  5024  4589   257   40395  76384  99700
  11.5    T8   2  5089  5545   280   40395  76384  99700
  21.7    T4   8 10259 10011  1068   54805  85108  99820
  24.7    T8   8 10239 10824  1036   54805  85108  99820
  43.1    T4  32 12940 13200  4497   35159  66065  99521
  46.9    T8  32 13200 13049  4557   35159  66065  99521

 ==== Stress Test Parameters - upper or lower case, only first letter counts ====

Threads 1,2,4,8,16,32,64  KB 12 to 15624  Ops/Wordd 2,8,32  Log<100  Minutes any>0

 ./MP-FPUStress64DP KB 1280 T 8 Ops 32 Mins 10 Log 31

Seconds                                                                      MFLOPS
    0.0   ARM MHz=1500, core volt=0.8437V, CPU temp=63.0'C, pmic temp=57.1'C
   10.0   ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C  12718
   20.2   ARM MHz=1500, core volt=0.8437V, CPU temp=74.0'C, pmic temp=66.5'C  12755
   30.5   ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=68.4'C  12750
   40.7   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C  12755
   50.9   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C  12183
   61.2   ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  11358
   71.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C  10922
   81.6   ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C  10333
   91.8   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9948
  102.0   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9692
  112.3   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   9466
  122.6   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   9217
  132.8   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C   9181
  143.0   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   9145
  153.2   ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C   9043
  163.4   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8921
  173.6   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   9838
  183.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8755
  194.1   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8737
  204.4   ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C   8721
  214.7   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8721
  224.9   ARM MHz=1500, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C   8670
  235.1   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C   8619
  245.4   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8592
  255.6   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C   8592
  265.9   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8540
  276.2   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C   8488
  286.4   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8547
  296.7   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8510
  307.0   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8473
  317.2   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8507
  327.5   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8541
  337.7   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8544
  347.9   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8464
  358.2   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8531
  368.4   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8495
  378.7   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8460
  388.9   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8514
  399.2   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8484
  409.4   ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C   8454
  419.6   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8459
  429.8   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8489
  440.1   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8472
  450.3   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8428
  460.6   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8384
  470.9   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8384
  481.2   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8387
  491.4   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8391
  501.7   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8244
  511.9   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8346
  522.1   ARM MHz= 750, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8272
  532.4   ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C   8272
  542.6   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8329
  552.8   ARM MHz= 750, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8239
  563.1   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8183
  573.3   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8129
  583.6   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8343
  593.9   ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C   8266
  604.1   ARM MHz=1000, core volt=0.8437V, CPU temp=85.0'C, pmic temp=74.1'C   8190


OpenGL + 3 x Livermore Loops - liverloopsPi64Rg9, liverloopsPi64, liverloopsPiA7R

In order to make it easier to run these stress tests, lxterminal was installed and the script shown below used to open four terminal windows and run the environmental monitor program plus three copies of a modified Loops benchmark that allows different log files to be specified. This executes 72 loops for a minimum time of 12 seconds each. The second script file is provided to run the kitchen display tests for 16 minutes in full screen mode. A further terminal was opened to run the vmstat resource monitor.

The tests were run twice, without and with a cooling fan in place. Results are shown below. In this case, the no fan tests were not that much slower, obtaining averages of 77 to 80% of the fan cooled speeds on OpenGL FPS, CPU MHz and total Loops MFLOPS.

These results were produced with all programs compiled by gcc 9 and not run on a hot day. Compared with performance using 32 bit versions, detailed in this 32 Bit Report, the 64 bit results were far better, but the former were produced by an older compiler and run on a hot day. The tests were repeated, using 32 bit programs produced by the later gcc 8 compiler.

As before, the 64 bit gcc 9 Livermore Loops and OpenGL single core benchmarks were faster than the new 32 bit versions, in this case by 14% for the former and 40% for the latter. On running the stress test, both had similar average CPU MHz, CPU temperature and PMIC temperature, with 64 bit FPS and MFLOPS maintaining a performance advantage, at ratios similar to those obtained from the single core tests.

Code:
run.sh
lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 21 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 22 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 23

runogl.sh
export vblank_mode=0
./videogl64g9 Test 6 Mins 16 Log 20

            No Fan                          With Fan
 Seconds     MHz   CPU C  PMIC C     FPS     MHz   CPU C  PMIC C     FPS

       0    1500      57      51            1500      37      32
      30    1500      75      63      27    1500      53      44      27
      60    1500      76      68      29    1500      53      44      28
      90    1500      81      72      25    1500      58      50      27
     120    1500      81      70      23    1500      55      48      26
     150    1000      82      74      23    1500      57      49      29
     180    1000      80      72      22    1500      54      47      27
     210    1000      81      72      24    1500      55      46      29
     240    1500      80      72      26    1500      54      44      28
     270    1500      81      72      27    1500      55      47      28
     300    1000      82      72      22    1500      56      48      29
     330    1500      82      72      24    1500      56      50      29
     360    1000      82      72      24    1500      56      49      28
     390    1000      82      72      22    1500      58      50      26
     420    1000      83      72      22    1500      57      50      26
     450    1000      82      74      19    1500      56      50      30
     480    1000      82      74      21    1500      56      48      28
     510    1000      82      72      22    1500      54      46      29
     540    1000      81      72      22    1500      55      47      30
     570    1500      81      72      24    1500      55      47      30
     600    1000      82      74      24    1500      57      49      30
     630    1500      81      72      23    1500      58      51      29
     660    1000      82      72      23    1500      57      50      29
     690    1000      83      73      22    1500      59      51      28
     720    1000      83      72      21    1500      57      51      28
     750    1000      82      74      21    1500      57      50      29
     780    1000      84      74      19    1500      54      47      29
     810    1000      82      72      19    1500      56      48      29
     840    1000      82      72      20    1500      54      46      29
     870    1000      82      72      20    1500      53      46      30
     900    1000      82      72      23    1500      49      42      31

Average     1161      81      71      23    1500      55      47      29
Minimum     1000      57      51      19    1500      37      32      26
Maximum     1500      84      74      29    1500      59      51      31

% Hot/Cold
Average       77      68      66      80
Minimum       67      65      61      73
Maximum      100      70      69      94

 MFLOPS  Average Geomean Harmean         Average Geomean Harmean
       1     684     562     453             898     732     590
       2     716     574     451             887     712     571
       3     716     566     438             895     724     582

Total %Hot/Cold
MFLOPS        79      78      77
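
The hot/cold percentages above can be checked with a few lines of Python. This is only a sketch using the rounded figures from the table, so results can differ from the published ones by about one percentage point. The variable names are illustrative.

```python
# No-fan CPU MHz samples, one per 30 second row of the table above (0 to 900 s).
no_fan_mhz = (
    [1500] * 5 + [1000] * 3 + [1500] * 2 + [1000] + [1500] + [1000] * 7
    + [1500] + [1000] + [1500] + [1000] * 9
)
avg_mhz = sum(no_fan_mhz) / len(no_fan_mhz)
mhz_pct = 100.0 * avg_mhz / 1500  # the fan-cooled run stayed at 1500 MHz

# MFLOPS for the three Livermore Loops copies: no fan (hot) and with fan (cold).
hot = {"Average": [684, 716, 716], "Geomean": [562, 574, 566], "Harmean": [453, 451, 438]}
cold = {"Average": [898, 887, 895], "Geomean": [732, 712, 724], "Harmean": [590, 571, 582]}

# Total of the three copies, hot as a percentage of cold.
total_pct = {key: 100.0 * sum(hot[key]) / sum(cold[key]) for key in hot}

print(f"Average MHz {avg_mhz:.0f} = {mhz_pct:.0f}% of 1500")
for key, pct in total_pct.items():
    print(f"{key} total %Hot/Cold = {pct:.1f}")
```

The average clock comes out at 1161 MHz, 77% of the fan-cooled 1500 MHz, and the total MFLOPS ratios fall between 77% and 79%, in line with the table.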


Input/Output Stress Test - burnindrive264g9, burnindrive2

This is essentially the same as the program I used during hundreds of UK Government and University computer acceptance trials during the 1970s and 1980s, with some significant achievements. Burnindrive writes four files, using 164 blocks of 64 KB, repeated 16 times (164.0 MB), with each block containing a unique data pattern. The files are then read for two minutes, in a sort of random sequence, with data and file ID checked for correct values. Then each block (unique pattern) is read numerous times, over one second, again with checking for correct values. Total time is normally about 5 minutes for all tests, with default parameters. The data patterns are shown below, followed by run time parameters, then examples of results provided, including added calculations of speed.
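
To illustrate the write-then-verify idea, here is a much simplified Python sketch. It is not the burnindrive source: the block layout (a 32 bit pattern repeated to fill 64 KB), the function names and the file name are assumptions made for the example, and it omits the file ID check, the random file sequence, timing and O_DIRECT handling. The five patterns are taken from the list of 164 given below.

```python
import os
import struct
import tempfile

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as in the description
# A handful of the 164 patterns (numbers 2, 48, 130, 147 and 164 in the list).
PATTERNS = [0x00000001, 0x55555555, 0xAAAAAAAA, 0xF0F0F0F0, 0xFFFF0000]


def make_block(pattern):
    """Fill one block with a repeated 32 bit pattern (assumed layout)."""
    return struct.pack("<I", pattern) * (BLOCK_SIZE // 4)


def write_file(path, patterns, repeats):
    """Write every pattern block in turn, repeated the given number of times."""
    with open(path, "wb") as f:
        for _ in range(repeats):
            for pattern in patterns:
                f.write(make_block(pattern))


def verify_file(path, patterns, repeats):
    """Read the file back and count blocks that differ from what was written."""
    errors = 0
    with open(path, "rb") as f:
        for _ in range(repeats):
            for pattern in patterns:
                if f.read(BLOCK_SIZE) != make_block(pattern):
                    errors += 1
    return errors


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "burnin_sketch.dat")
    write_file(path, PATTERNS, repeats=2)
    size = os.path.getsize(path)
    errors = verify_file(path, PATTERNS, repeats=2)

print(f"{size} bytes written, {errors} block errors")
```

Because the sketch uses ordinary buffered file access, reads are likely to be served from the operating system cache, which is the reason the real program opens files with O_DIRECT unless the CacheData parameter is given.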

Code:
Patterns

 No.    Hex No.     Hex No.     Hex No.     Hex  No.     Hex No.      Hex No.      Hex

  1       0 25   800000 49        3 73       FF  97 FFFFDFFF 121 FFFFEAAA 145 FFFFF0F0
  2       1 26  1000000 50       33 74   FF00FF  98 FFFFBFFF 122 FFFFAAAA 146 FFF0F0F0
  3       2 27  2000000 51      333 75      1FF  99 FFFF7FFF 123 FFFEAAAA 147 F0F0F0F0
  4       4 28  4000000 52     3333 76      3FF 100 FFFEFFFF 124 FFFAAAAA 148 FFFFFFE0
  5       8 29  8000000 53    33333 77      7FF 101 FFFDFFFF 125 FFEAAAAA 149 FFFF83E0
  6      10 30 10000000 54   333333 78      FFF 102 FFFBFFFF 126 FFAAAAAA 150 FE0F83E0
  7      20 31 20000000 55  3333333 79     1FFF 103 FFF7FFFF 127 FEAAAAAA 151 FFFFFFC0
  8      40 32 40000000 56 33333333 80     3FFF 104 FFEFFFFF 128 FAAAAAAA 152 FFFC0FC0
  9      80 33        1 57        7 81     7FFF 105 FFDFFFFF 129 EAAAAAAA 153 FFFFFF80
 10     100 34        5 58      1C7 82     FFFF 106 FFBFFFFF 130 AAAAAAAA 154 FFE03F80
 11     200 35       15 59     71C7 83 FFFFFFFF 107 FF7FFFFF 131 FFFFFFFC 155 FFFFFF00
 12     400 36       55 60   1C71C7 84 FFFFFFFE 108 FEFFFFFF 132 FFFFFFCC 156 FF00FF00
 13     800 37      155 61  71C71C7 85 FFFFFFFD 109 FDFFFFFF 133 FFFFFCCC 157 FFFFFE00
 14    1000 38      555 62        F 86 FFFFFFFB 110 FBFFFFFF 134 FFFFCCCC 158 FFFFFC00
 15    2000 39     1555 63      F0F 87 FFFFFFF7 111 F7FFFFFF 135 FFFCCCCC 159 FFFFF800
 16    4000 40     5555 64    F0F0F 88 FFFFFFEF 112 EFFFFFFF 136 FFCCCCCC 160 FFFFF000
 17    8000 41    15555 65  F0F0F0F 89 FFFFFFDF 113 DFFFFFFF 137 FCCCCCCC 161 FFFFE000
 18   10000 42    55555 66       1F 90 FFFFFFBF 114 BFFFFFFF 138 CCCCCCCC 162 FFFFC000
 19   20000 43   155555 67     7C1F 91 FFFFFF7F 115 FFFFFFFE 139 FFFFFFF8 163 FFFF8000
 20   40000 44   555555 68  1F07C1F 92 FFFFFEFF 116 FFFFFFFA 140 FFFFFE38 164 FFFF0000
 21   80000 45  1555555 69       3F 93 FFFFFDFF 117 FFFFFFEA 141 FFFF8E38
 22  100000 46  5555555 70    3F03F 94 FFFFFBFF 118 FFFFFFAA 142 FFE38E38
 23  200000 47 15555555 71       7F 95 FFFFF7FF 119 FFFFFEAA 143 F8E38E38
 24  400000 48 55555555 72   1FC07F 96 FFFFEFFF 120 FFFFFAAA 144 FFFFFFF0

 Sequences - First 16

 No.   File         No.   File          No.   File          No.   File

  1    0  1  2  3    5    0  2  1  3     9    0  3  1  2    13    0  1  2  3
  2    1  2  3  0    6    1  3  2  0    10    1  0  3  2    14    1  2  3  0
  3    2  3  0  1    7    2  0  1  3    11    2  1  0  3    15    2  3  0  1
  4    3  0  2  1    8    3  1  2  0    12    3  2  1  0    16    3  0  2  1

 ###########################################################################

Run Time Parameters - Upper or Lower Case
                                                                      Default
R or Repeats             Data size, multiplier of 10.25 MB, more or less     16
P or Patterns            Number of patterns for smaller files < 164         164
M or Minutes             Large file reading time                              2
L or Log                 Log file name extension 0 to 99                      0
S or Seconds             Time to read each block, last section                1
F or FilePath            For other than SD card or SD card directory
C or CacheData           Omit O_DIRECT on opening files to allow caching      No 
O or OutputPatterns      Log patterns and file sequences used as above        No
D or DontRunReadTests    Or only run write tests                              No   

  Format ./burnindrive2 Repeats 16, Minutes 2, Log 0, Seconds 1
     or  ./burnindrive2 R 16, M 2, L 0, S 1

 ###########################################################################

Examples of Results Main SD Card Default Parameters

   File 1  164.00 MB written in   14.66 seconds                - 11.2 MB/second
To File 4  164.00 MB written in   12.15 seconds                - 13.5 MB/second

   Read passes     1 x 4 Files x  164.00 MB in  0.33 minutes   - 33.1 MB/second
To Read passes     7 x 4 Files x  164.00 MB in  2.28 minutes   - 33.6 MB/second

   Passes in 1 second(s) for each of 164 blocks of 64KB:       - 164 measurements

    580    580    580    580    580    580    580    580    580    580    580
    580    580    580    580    580    580    580    580    580    580    580

    95120 read passes of 64KB blocks in  2.76 minutes          - 36.8 MB/second
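
The file size and the added speed calculations above can be reproduced as follows. This is a sketch of the arithmetic only. Note that the block reading speed of 36.8 MB/second appears to count 64 KB as 0.064 MB, whereas the 164.00 MB file size uses 1024 KB per MB. That is an inference from the numbers, not something stated in the results.

```python
BLOCK_KB = 64   # block size
BLOCKS = 164    # one block per data pattern
REPEATS = 16    # default Repeats parameter

mb_per_repeat = BLOCKS * BLOCK_KB / 1024   # 10.25 MB
file_mb = mb_per_repeat * REPEATS          # 164.0 MB per file

# Writing: File 1 took 14.66 seconds.
write_mb_s = file_mb / 14.66

# Large file reading: 7 passes x 4 files in 2.28 minutes.
read_mb_s = 7 * 4 * file_mb / (2.28 * 60)

# Block reading: 580 passes for each of the 164 blocks in 2.76 minutes.
block_passes = 580 * BLOCKS
block_mb_s = block_passes * BLOCK_KB / 1000 / (2.76 * 60)  # assumes 1000 KB per MB

print(f"File size   {file_mb:.2f} MB")
print(f"Write       {write_mb_s:.1f} MB/second")
print(f"File read   {read_mb_s:.1f} MB/second")
print(f"Block read  {block_mb_s:.1f} MB/second from {block_passes} passes")
```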



CPU + Main SD + USB + LAN Test

A system test was run using the following script file, comprising commands to run programs to monitor the environment, and others to exercise the main SD card, two USB 3 drives, 1 Gbps Ethernet and CPU floating point with two threads. The programs were run via the script file so that they all started at the same time, as indicated in the summaries below. They also all ran for between 12 and 13 minutes. The performance levels obtained when each program was run by itself (BI) are also shown, often not indicating much improvement. Performance is not as high as shown by other benchmarks, probably because data transfers are based on 64 KB block sizes and all data in each block is checked for correctness.

A snapshot of vmstat system performance is also provided. The bo and bi KB/second writing and reading speeds are essentially the same as the sum of those reported by the programs handling the main and USB drives. LAN speeds are not included in vmstat. Total CPU utilisation (us + sy) is shown to be nearly 90% at the start of writing and closer to 75% on reading. These figures are averages across all four cores, so 75% is equivalent to three cores at 100%. The results that follow show variations in performance with time.

Code:
 ############################### Script File ###############################

  lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
  lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 Log 21 &
  lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1
                 FilePath /run/media/demouser/PATRIOT Log 22 &
  lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1
                 FilePath /run/media/demouser/REMIXOSSYS Log 23 &
  lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1
                 FilePath /media/public/test Log 24 &
  lxterminal -e ./MP-FPUStress64 KB 256 T 2 Ops 32 Mins 12 Log 33
  vmstat 10 96 > vmstat.txt

############################################################################

Main SD Drive Tue Nov  5 15:47:03 2019
  End of test Tue Nov  5 16:00:06 2019

Write 164 MB x files 4       53.6 seconds = 12.2 MB/second (BI 12.7)
Read  164 MB x files 3 x 4   67.2 seconds = 29.3 MB/second (BI 33.6)
Read  329480 x 64 KB        659.4 seconds = 32.0 MB/second (BI 36.8)
============================================================

USB 3 Drive 1 Tue Nov  5 15:47:03 2019
  End of test Tue Nov  5 15:59:31 2019

Write 164 MB x files 4       17.5 seconds = 37.5 MB/second (BI 68.3)
Read  164 MB x files 6 x 4   72.0 seconds = 54.7 MB/second (BI 75.0)
Read  735800 x 64 KB        657.6 seconds = 71.6 MB/second (BI 66.5)
============================================================

USB 3 Drive 2  Tue Nov  5 15:47:03 2019
End of test    Tue Nov  5 15:59:57 2019

Write 164 MB x files 4       37.4 seconds = 17.5 MB/second (BI 23.8)
Read  164 MB x files 3 x 4   75.6 seconds = 26.0 MB/second (BI 28.5)
Read  282740 x 64 KB        660.0 seconds = 27.4 MB/second (BI 29.8)
============================================================

1 Gbps LAN     Tue Nov  5 15:47:03 2019
End of test    Tue Nov  5 15:59:35 2019

Write 164 MB x files 4       18.1 seconds = 36.2 MB/second (BI 55.7)
Read  164 MB x files 3 x 4   74.4 seconds = 26.4 MB/second (BI 34.0)
Read  303920 x 64 KB        659.4 seconds = 29.5 MB/second (BI 45.3)       
============================================================

MP-Threaded-MFLOPS 64 Bit v1.1 Tue Nov  5 15:47:03 2019
                   End of test Tue Nov  5 15:59:13 2019

   2 core GFLOPS 10.9 to 7.4 with CPU throttling.
   See RPiHeatMHzVolts2 results where detail is included
============================================================

                      From vmstat 10 second sampling 

Secs procs  ---------memory---------- ---swap--  -----io---- --system--  ------cpu-----
      r  b  swpd   free   buff  cache   si   so     bi    bo    in    cs us sy id wa st

  10  5  3     0 3059800  94956 346060   0    0     14 63204 17819 19051 51 38  2  9  0
  20  3  2     0 3058696  95248 346704   0    0  14265 60713 17613 18789 51 33  4 12  0

  60  4  2     0 3061196  95668 343572   0    0  93479  7577 24239 24987 54 19  4 23  0
  70  4  3     0 3050632  95684 353600   0    0 112115    24 24496 25316 54 20 12 14  0
   
 710  3  3     0 3058696  96532 349460   0    0 132992    16 18936 22387 53 22  3 22  0
 720  5  1     0 3058728  96548 349452   0    0 134400    13 20635 23842 54 23  1 23  0
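
The comparison between vmstat and the drive programs can be sketched in Python as below. The figures are copied from the summaries and the vmstat snapshot above. Converting vmstat KB/second to MB/second by dividing by 1000 is an assumption made for simplicity, and the three programs do not finish each phase at the same moment, so only rough agreement should be expected.

```python
# MB/second reported by the three drive programs (LAN traffic is not seen by vmstat).
drive_mb_s = {
    "write": {"Main SD": 12.2, "USB 1": 37.5, "USB 2": 17.5},
    "file read": {"Main SD": 29.3, "USB 1": 54.7, "USB 2": 26.0},
    "block read": {"Main SD": 32.0, "USB 1": 71.6, "USB 2": 27.4},
}

# vmstat samples from the snapshot: (seconds, bi, bo, us, sy)
vmstat = [
    (10, 14, 63204, 51, 38),
    (20, 14265, 60713, 51, 33),
    (60, 93479, 7577, 54, 19),
    (70, 112115, 24, 54, 20),
    (710, 132992, 16, 53, 22),
    (720, 134400, 13, 54, 23),
]

# Sum of the drive program speeds for each phase.
phase_totals = {phase: sum(speeds.values()) for phase, speeds in drive_mb_s.items()}

# Total CPU utilisation and its equivalent in fully used cores (4 cores in total).
busy_pct = [us + sy for _, _, _, us, sy in vmstat]
cores_equivalent = [4 * pct / 100 for pct in busy_pct]

for phase, total in phase_totals.items():
    print(f"{phase:10s} drive programs total {total:6.1f} MB/second")
for (secs, bi, bo, _, _), pct, cores in zip(vmstat, busy_pct, cores_equivalent):
    print(f"{secs:4d} s  bi {bi / 1000:6.1f}  bo {bo / 1000:5.1f} MB/second  "
          f"CPU {pct}% = {cores:.2f} cores")
```

The drive programs total about 67 MB/second when writing and 131 MB/second on block reading, against vmstat values of 61 to 63 and 133 to 134 MB/second respectively.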
 


Speeds and Temperature - These tests were run without an active cooling fan, resulting in some CPU throttling, with clock speed down to 1000 MHz some of the time, when the temperature reached 80°C. The MP-Threaded-MFLOPS dual core performance measurements have been added to the environmental details, mainly indicating the effects of throttling.

The last burnindrive results record the number of read passes in 4 seconds, in a table comprising 14 lines of 11 recordings and one with 10, over approximately 11 minutes. The average burnindrive results for each line are provided below, not exactly synchronised, but giving an indication of changes in throughput with time. Total passes and percentage degradation are also shown, the latter not being as severe as CPU speed reductions.


Code:
 Temperature and CPU MHz Measurement + MP-FPUStress64 2 Core MFLOPS

 Start at Tue Nov  5 15:47:03 2019

 Using 25 samples at 30 second intervals

 Seconds                                                                     MFLOPS
    0.0   ARM MHz=1500, core volt=0.8560V, CPU temp=66.0'C, pmic temp=59.0'C
   30.0   ARM MHz=1500, core volt=0.8560V, CPU temp=75.0'C, pmic temp=65.6'C  10890
   60.2   ARM MHz=1500, core volt=0.8560V, CPU temp=78.0'C, pmic temp=68.4'C  10551
   90.4   ARM MHz=1500, core volt=0.8560V, CPU temp=80.0'C, pmic temp=70.3'C  10549
  120.6   ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C  10452
  150.8   ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C   9862
  181.1   ARM MHz=1000, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C   9482
  211.4   ARM MHz=1500, core volt=0.8560V, CPU temp=82.0'C, pmic temp=72.2'C   9137
  241.6   ARM MHz=1500, core volt=0.8507V, CPU temp=81.0'C, pmic temp=72.2'C   9132
  271.9   ARM MHz=1000, core volt=0.8507V, CPU temp=82.0'C, pmic temp=70.3'C   9122
  302.2   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   9389
  332.4   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8550
  362.7   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   9043
  392.9   ARM MHz=1500, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8045
  423.3   ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8174
  453.6   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8444
  483.9   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8335
  514.3   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7951
  544.6   ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   8125
  574.8   ARM MHz=1500, core volt=0.8455V, CPU temp=83.0'C, pmic temp=72.2'C   8078
  605.1   ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C   8280
  635.4   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7845
  665.7   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7761
  696.0   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=73.1'C   8341
  726.2   ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C   7407


  Passes in 4 seconds for each of 164 blocks of 64KB 

 Seconds Main SD   USB 1   USB 2    LAN    Total  %First

      44    2013    4522    1884    1915   10333     100
      88    2007    4533    1838    1911   10289     100
     132    2016    4496    1760    1809   10082      98
     176    2011    4536    1785    1845   10178      99
     220    2002    4493    1729    1913   10136      98
     264    1971    4262    1751    1904    9887      96
     308    1980    4540    1747    1911   10178      99
     352    2002    4464    1660    1845    9971      96
     396    1987    4442    1629    1844    9902      96
     440    1964    4453    1585    1771    9773      95
     484    1995    4504    1635    1731    9864      95
     528    1989    4229    1696    1762    9676      94
     572    1947    4616    1684    1833   10080      98
     616    2013    4476    1660    1798    9947      96
     660    2262    4758    1826    2022   10868     105
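
The %First column and the comparison with CPU throttling can be reproduced with the short Python sketch below. It uses the rounded totals from the table, so individual percentages can differ from the published ones by one point. The comparison with MFLOPS and clock speed uses the first and last MFLOPS samples and the 1500 to 1000 MHz throttling step from the environment log above. The variable names are illustrative.

```python
# Total read passes per table line (Main SD + USB 1 + USB 2 + LAN).
totals = [10333, 10289, 10082, 10178, 10136, 9887, 10178, 9971,
          9902, 9773, 9864, 9676, 10080, 9947, 10868]
table_pct_first = [100, 100, 98, 99, 98, 96, 99, 96, 96, 95, 95, 94, 98, 96, 105]

# Each line as a percentage of the first line.
pct_first = [100.0 * total / totals[0] for total in totals]
worst_io_pct = min(pct_first)

# CPU side: first and last MP-FPUStress64 samples, and the throttled clock.
mflops_pct = 100.0 * 7407 / 10890
clock_pct = 100.0 * 1000 / 1500

print(f"Worst I/O line      {worst_io_pct:.1f}% of first")
print(f"Last/first MFLOPS   {mflops_pct:.1f}%")
print(f"Throttled/full MHz  {clock_pct:.1f}%")
```

The slowest I/O line is still around 94% of the first, while MFLOPS fell to about 68% and the throttled clock is 67% of full speed, which is the point made in the text about I/O degrading less than CPU speed.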


########################################################################

New Files At ResearchGate

The detailed Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests.pdf report has now been uploaded to ResearchGate:

https://www.researchgate.net/publication/337165767_Raspberry_Pi_4B_64_Bit_Benchmarks_and_Stress_Tests

along with the archive file containing the benchmarks and source codes:

https://www.researchgate.net/profile/Roy_Longbottom/project/Performance-of-Raspberry-Pi-and-Android-Devices/attachment/5d108baa3843b0b982580793/AS:773236761579522@1561365418445/download/Raspberry-Pi-4-Benchmarks.tar.gz?context=ProjectUpdatesLog

########################################################################
_________________
Regards

Roy

roylongbottom

Posted: Thu Jan 09, 2020 3:52 pm    Post subject: Video Player Benchmarks

Video Player Benchmarks

I recently ran some benchmarks, with I/O content, to indicate relative performance at the full 1500 CPU MHz, compared with the 600 MHz that can be obtained at the extreme of throttling - see:

https://www.researchgate.net/publication/338230582_Raspberry_Pi_4_CPU_MHz_Throttling_Performance_Effects

This includes replaying a programme using BBC iPlayer, running under Raspbian. In this case, playback continued at 600 MHz, without buffering being indicated, but displaying at a lower quality of pixel density.

On trying to run the video test via 64 bit Gentoo 1.5.1, the Chromium browser would not run the BBC iPlayer, reporting that Flash player was needed (noting now that Flash will no longer be supported, from sometime in 2020). After installing Gentoo 1.5.3, I found that the iPlayer was supported.

Then there was Sakaki’s post on video decoding:

https://forums.gentoo.org/viewtopic.php?p=8404846#8404846

indicating that the Firefox browser currently does not exploit the RPi's built-in h/w video codecs, but the one used in Raspbian might.

Following are results from playing the same BBC iPlayer programme used previously, via Raspbian and Gentoo. The iPlayer accessed only has quality settings for low, medium and highest. It seems (via Google) that the maximum standard HD pixel setting is 1280 x 720, but this can be reduced automatically to suit response conditions. The Pi 4 had no cooling fan attachment and was connected via LAN, where Speed Checker indicated 60 Mbps.

The results are for sample periods of 10 to 15 minutes. Quality details were obtained by right clicking on the screen, with others via the usual monitoring tools. There was no sign of data buffering or noticeable differences in quality (with casual viewing), but the CPU became much hotter under Gentoo, with occasional indications of throttling down to 1000 MHz, and CPU utilisation equivalent to nearly three cores in continuous use (software vs hardware decoding?).

Code:

BBC iPlayer Lions Documentary
                                            Av of 4         Throttled
                      Wide    High    kbps    %CPU   Max °C  To 1000   Setting

Raspbian Chromium      840     540    1709      21      65       0     Highest
Raspbian Chromium     1280     720    5166      34      68       0     Highest
Gentoo   Firefox      1280     720    5166      68      83       1     Highest
Gentoo   Firefox      1280     720    5166      65      82       2     Highest
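
As a rough check on the "nearly three cores" remark, the %CPU figures can be converted into busy-core equivalents. This is a minimal Python sketch, assuming the "Av of 4" heading means the mean utilisation across the four cores of the Pi 4 (the function name is illustrative only).

```python
CORES = 4  # the Raspberry Pi 4B has four CPU cores


def equivalent_cores(avg_percent, cores=CORES):
    """Convert a mean %CPU figure (averaged over all cores) to busy-core equivalents."""
    return avg_percent * cores / 100.0


# %CPU values copied from the BBC iPlayer table above
runs = [
    ("Raspbian Chromium  840 x 540", 21),
    ("Raspbian Chromium 1280 x 720", 34),
    ("Gentoo   Firefox  1280 x 720", 68),
    ("Gentoo   Firefox  1280 x 720", 65),
]

for label, pct in runs:
    print(f"{label}  {pct:3d}%  about {equivalent_cores(pct):.2f} cores")
```

Under that assumption, the 68% Gentoo Firefox run corresponds to about 2.7 cores in continuous use.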


The next tests were via YouTube, playing the same HD widescreen movie (https://youtu.be/rWVXLy_fJGk). The first tests accessed the player via browsers, where quality settings are available for auto and a range of HD and normal options, and run time details are shown via right click, selecting Stats for nerds. Here, Gentoo tests again indicated higher CPU utilisation and temperatures.

The other test was via SMPlayer, where the lower CPU utilisation could not exactly be confirmed, as the quality settings I found had no effect.

Code:

YouTube HD Movie
                                             Av of 4         Throttled
                      Wide    High    kbps    %CPU   Max °C  To 1000   Setting

Raspbian Chromium     1920     816   16000      25      68       0       1080p
Gentoo   Firefox       856     362   18000      30      73       0 Auto (480p)
Gentoo   Firefox      1920     816   18000      60      81       0       1080p

Gentoo   SMPlayer      640     272     300      10      65       0      1080p?


I also tried Prime Video, but this failed, indicating that no decryption add-on could be found. I would like to use a Pi 4 to access this and other players on older TVs that I have in various rooms.
_________________
Regards

Roy
Sakaki
Guru

Joined: 21 May 2014
Posts: 409

Posted: Thu Jan 09, 2020 9:15 pm    Post subject: Re: Video Player Benchmarks

roylongbottom wrote:
Video Player Benchmarks
The other test was via SMPlayer, where lower CPU utilisation could not exactly be confirmed, as the quality settings I found had no effect

You should be able to set the stream quality via the Preferences -> Network dialog in SMPlayer, per this screenshot. Turn adaptive streams OFF if you want to force a particular resolution in YouTube. You can also change the codec route as shown there.

You can have SMPlayer display the resolution and framerate in the window also (as in the above).
_________________
Regards,

sakaki
roylongbottom
n00b

Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Fri Jan 10, 2020 12:36 pm    Post subject: Re: Video Player Benchmarks

Sakaki wrote:
roylongbottom wrote:
Video Player Benchmarks
The other test was via SMPlayer, where lower CPU utilisation could not exactly be confirmed, as the quality settings I found had no effect

You should be able to set the stream quality via the Preferences -> Network dialog in SMPlayer, per this screenshot. Turn adaptive streams OFF if you want to force a particular resolution in YouTube. You can also change the codec route as shown there.

You can have SMPlayer display the resolution and framerate in the window also (as in the above).


I had tried using those properties, without success. It needed adaptive streams turned on before the video was loaded. I was running full screen, where resolution and framerate were not displayed, but could be seen via right click, View, Information and Properties, Information. As with yours, 25 FPS (nearly) was selected.

Results are included below, now similar to Raspbian on CPU utilisation and temperature. Dual mirrored monitor results are also shown, where kbps was rather strange but display quality looked fine (without detailed scrutiny).

Code:
YouTube HD Movie
                                             Av of 4         Throttled
                      Wide    High    kbps    %CPU   Max °C  To 1000   Setting

Raspbian Chromium     1920     816   16000      25      68       0       1080p
Gentoo   Firefox       856     362   18000      30      73       0 Auto (480p)
Gentoo   Firefox      1920     816   18000      60      81       0       1080p

Gentoo   SMPlayer      640     272     300      10      65       0     ~~1080p
Gentoo   SMPlayer     1920     816    4428      28      73       0       1080p
Gentoo   SMPlayer    1920x2   816x2 588-2028    32      76       0       1080p


~~1080p - use adaptive streams box not ticked


_________________
Regards

Roy
Sakaki
Guru

Joined: 21 May 2014
Posts: 409

Posted: Fri Jan 10, 2020 5:15 pm    Post subject:

Strange. I've just tried this video myself - changing the "Playback quality" under "Options for YouTube" dialog does seem to work fine (720p, 1080p etc) with adaptive streams turned off. But, I found the player needs to be pointed at a different URL first sometimes, as it seems to remember the res for each target.

You can press Shift-I (eye) to show useful live overlaid info even when full screen on SMPlayer.
_________________
Regards,

sakaki