roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Mon Sep 09, 2019 4:04 pm Post subject: Multithreading Benchmarks |
Multithreading Benchmarks
Many of these benchmarks run using 1, 2, 4 and 8 threads, with others executing programs on all available cores via OpenMP.
MP-Whetstone Benchmark
Multiple threads each run the eight test functions at the same time, each with some dedicated variables. Measured speed is based on the last thread to finish, and mutex functions avoid update conflicts by allowing only one thread at a time to access common data. Performance is generally proportional to the number of cores used. There can be some significant differences from the single CPU Whetstone benchmark results on particular tests, due to a different compiler being used. None of the test functions are suitable for SIMD operation, so simpler scalar instructions are used. Overall seconds indicates MP efficiency: with four cores, run time stays near 5 seconds up to 4 threads and doubles at 8 threads.
As with the single core version, the average Pi 4 performance gain over the Pi 3B+ was just over 2 times. Pi 4B speeds at 64 bits and 32 bits were closer to each other, this time with the 32 bit version somewhat faster on some floating point calculations.
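The timing scheme described above can be sketched roughly as follows. This is not the benchmark source: the worker loop is a hypothetical stand-in for the eight Whetstone test functions, and names such as run_threads, worker and result_lock are invented for illustration. It only shows per-thread private data, a mutex around the shared result, and elapsed time taken from the last thread to finish.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime */
#endif
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define MAX_THREADS 8

/* Shared data: only touched while holding the mutex. */
static pthread_mutex_t result_lock = PTHREAD_MUTEX_INITIALIZER;
static double last_finish = 0.0;

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical stand-in for the eight Whetstone test functions. */
static void *worker(void *arg)
{
    volatile double x = 1.0;           /* dedicated, per-thread variable */
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        x = x * 0.999999 + 0.000001;

    double finished = now_seconds();
    pthread_mutex_lock(&result_lock);  /* one thread at a time */
    if (finished > last_finish)
        last_finish = finished;        /* keep the latest finish time */
    pthread_mutex_unlock(&result_lock);
    return NULL;
}

/* Runs n_threads workers; returns seconds until the last one finished. */
double run_threads(int n_threads)
{
    pthread_t tid[MAX_THREADS];
    assert(n_threads >= 1 && n_threads <= MAX_THREADS);

    double start = now_seconds();
    last_finish = start;
    for (int t = 0; t < n_threads; t++) {
        int rc = pthread_create(&tid[t], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < n_threads; t++)
        pthread_join(tid[t], NULL);
    return last_finish - start;
}
```

With gcc this would typically be built with -pthread. On a four core system, run_threads(8) should take about twice as long as run_threads(4), which is the pattern seen in the Overall Seconds lines.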
Code: |
           MWIPS  MFLOPS  MFLOPS  MFLOPS     Cos     Exp   Fixpt      If   Equal
Threads                1       2       3    MOPS    MOPS    MOPS    MOPS    MOPS
Gentoo Pi 3B+ 64 Bits
1           1152     383     383     328    23.2    13.0     N/A    2721    1365
2           2312     767     767     657    46.5    26.0     N/A    5461    2738
4           4580    1506    1526    1304    92.0    51.6     N/A   10777    5449
8           4788    1815    1961    1382    95.0    53.3     N/A   13827    5811
Overall Seconds 4.96 1T, 4.95 2T, 5.05 4T, 10.07 8T
Gentoo Pi 4B 64 Bits
1           2395     536     538     397    60.8    39.0     N/A    4483     997
2           4784    1062    1079     794   121.2    77.9     N/A    8932    1990
4           9476    2125    2080    1568   240.8   155.3     N/A   17718    3962
8           9834    2631    2744    1630   243.6   160.1     N/A   22265    4053
Overall Seconds 4.99 1T, 5.01 2T, 5.12 4T, 10.17 8T
Pi 4B/3B+ 64 Bits
1           2.08    1.40    1.41    1.21    2.62    3.00     N/A    1.65    0.73
2           2.07    1.39    1.41    1.21    2.61    3.00     N/A    1.64    0.73
4           2.07    1.41    1.36    1.20    2.62    3.01     N/A    1.64    0.73
8           2.05    1.45    1.40    1.18    2.56    3.00     N/A    1.61    0.70
Raspbian Pi 4B 32 Bits
1           2059     673     680     311    55.6    33.1    7462    2245     995
2           4117    1342    1391     624   110.7    65.9   14887    4467    1986
4           7910    2652    2722    1180   208.5   132.6   29291    8952    3832
8           8652    3057    2971    1268   233.2   149.6   38368   11923    3942
Overall Seconds 4.99 1T, 5.01 2T, 5.29 4T, 10.71 8T
Pi 4B 64 bits/32 bits
1           1.16    0.80    0.79    1.28    1.09    1.18     N/A    2.00    1.00
2           1.16    0.79    0.78    1.27    1.09    1.18     N/A    2.00    1.00
4           1.20    0.80    0.76    1.33    1.15    1.17     N/A    1.98    1.03
8           1.14    0.86    0.92    1.28    1.04    1.07     N/A    1.87    1.03
|
MP Dhrystone Benchmark
This executes multiple copies of the same program, but with some shared data, leading to inconsistent multithreading performance with not much gain using multiple cores.
The single thread speeds were similar to the earlier Dhrystone results, with RPi 4B ratings around twice as fast as those for the Pi 3B+. The single thread Pi 4B 64 bit/32 bit speed ratio was also similar to that during the single core tests.
Code: |
MP-Dhrystone Benchmark armv8 64 Bit Fri Aug 23 00:44:05 2019
Using 1, 2, 4 and 8 Threads
Threads                            1         2         4         8
Seconds                         0.54      0.67      1.23      2.46
Dhrystones per Second        7391586  11954301  11300304  13028539
VAX MIPS Pi 3B+ 64 bits         4207      6804      7401      7415
VAX MIPS Pi 4B 64 bits          8880      7828      8303      8314
Pi 4B/3B+ 64 bits               2.11      1.15      1.12      1.12
VAX MIPS Pi 4B 32 bits          5539      5739      6735      7232
Pi 4B 64 bits/32 bits           1.60      1.36      1.23      1.15
|
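The post does not state how VAX MIPS is derived. The sketch below assumes the conventional reference of 1757 Dhrystones per second for the VAX 11/780 (rated at 1 MIPS), which is consistent with the 1 thread and 8 thread Pi 3B+ figures in the table above. The function name vax_mips is made up for this example.

```c
#include <assert.h>

/* Assumed reference: 1757 Dhrystones per second = 1 VAX MIPS. */
#define VAX_DHRYSTONES_PER_SECOND 1757.0

/* Converts a Dhrystones per second score into a VAX MIPS rating. */
double vax_mips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND;
}
```

For example, 7391586 Dhrystones per second gives about 4207 VAX MIPS, matching the first Pi 3B+ entry.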
MP Linpack Benchmark (Single Precision NEON)
This executes a single copy of the benchmark, at three data sizes, with the critical daxpy code multithreaded. This code was also modified to allow a higher level of parallelism, without changing any calculations. Even so, threaded performance was much slower than running unthreaded. The main reasons appear to be updating data in RAM to maintain integrity, with performance reflecting memory speeds, and exceptionally high thread start/stop overheads.
This benchmark uses the same NEON Intrinsic Functions as the single core program. Unthreaded speeds at N = 100 were similar to that version, but decreased with larger data sizes, which involve RAM accesses.
The full logged output is shown for the first entry, to demonstrate error checking facilities. The sumchecks were identical from the Pi 3B+ and Pi 4B at Gentoo 64 bits, but those from the Raspbian 32 bit test were different, as shown below. Ignoring the slow threaded results, performance ratios of CPU speed limited tests were similar to the single core version.
Code: | Gentoo Pi 3B+ 64 Bits
Linpack Single Precision MultiThreaded Benchmark
64 Bit NEON Intrinsics, Fri Aug 23 00:45:54 2019
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
N 100 642.56 66.69 66.05 65.54
N 500 479.48 274.36 274.85 269.07
N 1000 363.77 316.17 310.37 316.71
NR=norm resid RE=resid MA=machep X0=x[0]-1 XN=x[n-1]-1
N 100 500 1000
NR 1.97 5.40 13.51
RE 4.69621336e-05 6.44138840e-04 3.22485110e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -1.31130219e-05 5.79357147e-05 -3.08930874e-04
XN -1.30534172e-05 3.51667404e-05 1.90019608e-04
Thread
0 - 4 Same Results Same Results Same Results
Gentoo Pi 4B 64 Bits
N 100 2252.70 97.25 97.43 97.41
N 500 1628.24 665.21 646.63 674.38
N 1000 399.87 406.80 405.84 399.54
Pi 4B/3B+ 64 Bits
N 100 3.51 1.46 1.48 1.49
N 500 3.40 2.42 2.35 2.51
N 1000 1.10 1.29 1.31 1.26
Raspbian Pi 4B 32 Bits
N 100 1921.53 108.66 101.88 102.46
N 500 1548.81 530.23 714.37 733.09
N 1000 399.94 378.11 364.78 398.21
Pi 4B 64 bits/32 bits
N 100 1.17 0.89 0.96 0.95
N 500 1.05 1.25 0.91 0.92
N 1000 1.00 1.08 1.11 1.00
32 bit numeric results
N 100 500 1000
NR 2.17 5.42 9.50
RE 5.16722466e-05 6.46698638e-04 2.26586126e-03
MA 1.19209290e-07 1.19209290e-07 1.19209290e-07
X0 -2.38418579e-07 -5.54323196e-05 -1.26898289e-04
XN -5.06639481e-06 -4.70876694e-06 1.41978264e-04
|
MP BusSpeed (read only) Benchmark
In this version, each thread accesses all of the data, in separate sections covering caches and RAM, with each thread starting at a different point. See single processor BusSpeed details regarding burst reading, which can produce significant differences.
Comparisons are provided for RdAll, at 1, 2 and 4 threads. Pi 4B/3B+ performance ratios were similar to those for the single core tests. There was an exception with two threads on the Pi 4, using RAM at 64 bits, probably due to caching effects and not seen on subsequent repeated tests.
Particularly note that performance was significantly better using the 32 bit Raspbian compiler. Below are examples of disassembly, showing that the 64 bit Pi 4 code employed scalar operation, using 32 bit w registers, while the 32 bit code benefited from using 128 bit q registers, for Single Instruction Multiple Data (SIMD) operation. Compile options are included below; alternatives were also tried for the 64 bit Pi 4B compilation, but failed to implement SIMD operation.
Code: | Gentoo Pi 3B+ 64 Bits
MP-BusSpd armv8 64 Bit Fri Aug 23 00:47:43 2019
MB/Second Reading Data, 1, 2, 4 and 8 Threads
Staggered starting addresses to avoid caching
KB Threads Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
12.3 1T 3138 2822 3044 2383 1708 1737
2T 5354 4865 5647 4519 3303 3362
4T 7922 7504 9717 6794 6216 6597
8T 5125 4159 6987 6696 5350 5195
122.9 1T 640 666 1191 1864 1627 1712
2T 1008 1018 1926 3496 3268 3387
4T 962 1042 2157 4259 6427 4372
8T 1031 1047 2147 3952 6317 6514
12288 1T 124 114 260 527 1016 1363
2T 137 138 275 487 946 2182
4T 105 118 240 409 975 2158
8T 108 117 236 504 1077 2051
RdAll
Gentoo Pi 4B 64 Bits Pi 4B/3B+
12.3 1T 4864 4879 5378 4379 4115 4221 2.43
2T 8159 6924 9179 8006 7689 7837 2.33
4T 12677 11531 14850 12554 13807 14794 2.24
8T 7398 6927 10881 11675 11497 13075 2.52
122.9 1T 665 926 1869 2714 3557 4152 2.43
2T 610 696 1549 4898 7188 8184 2.42
4T 476 865 1885 4107 8058 14617 3.34
8T 474 883 1848 3919 7939 13633 2.09
12288 1T 202 210 514 1044 2033 3616 2.65
2T 258 425 853 1551 3693 6228 2.85
4T 217 346 497 1024 2181 3789 1.76
8T 220 275 540 1030 1937 3577 1.74
Raspbian Pi 4B 32 Bits 64b/32b
12.3 1T 5263 5637 5809 5894 5936 13445 0.31
2T 9412 10020 10567 11454 11604 24980 0.31
4T 16282 15577 16418 21222 20000 45530 0.32
8T 11600 13285 16070 18579 20593 36837 0.35
122.9 1T 739 956 1888 3153 5008 9527 0.44
2T 629 1158 1568 5058 9509 16489 0.50
4T 600 1093 2134 4527 8732 16816 0.87
8T 593 1104 2121 4382 8629 17158 0.79
12288 1T 238 258 518 1005 2001 4029 0.90
2T 278 228 453 1690 1826 3628 1.72
4T 269 257 740 1019 1790 4145 0.91
8T 233 292 532 926 2186 3581 1.00
|
MP BusSpeed Disassembly
Code: | Source Code 64 AND instructions in main loop
for (i=start; i<end; i=i+64)
{
andsum1[t] = andsum1[t]
& array[i ] & array[i+1 ] & array[i+2 ] & array[i+3 ]
& array[i+4 ] & array[i+5 ] & array[i+6 ] & array[i+7 ]
To
& array[i+56] & array[i+57] & array[i+58] & array[i+59]
& array[i+60] & array[i+61] & array[i+62] & array[i+63];
}
Pi 32 Bit Raspbian Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -mcpu=cortex-a7
-mfloat-abi=hard -mfpu=neon-vfpv4 -o MP-BusSpd2PiA7
Pi 64 Bit Gentoo Compile
gcc mpbusspd2.c cpuidc.c -lpthread -lm -lrt -O3 -march=armv8-a -o MP-BusSpd2Pi64
Parameters also tried
-march=armv8-a+crc -mtune=cortex-a72 -ftree-vectorize -O2 -pipe
-fomit-frame-pointer
Pi 32 Bit Disassembly Pi 64 Bit Disassembly
vld1.32 {q6}, [lr] ldp w30, w17, [x0, 52]
vld1.32 {q7}, [r6] and w18, w18, w30
vand q10, q10, q6 and w1, w1, w18
vld1.32 {q6}, [r0] ldp w18, w30, [x0, 60]
vand q9, q9, q7 and w17, w17, w18
vand q12, q12, q6 and w1, w1, w17
vld1.32 {q7}, [ip] ldp w17, w18, [x0, 68]
vld1.32 {q6}, [r7] and w30, w30, w17
add r1, r3, #96 and w1, w1, w30
add r6, r3, #144 ldp w30, w17, [x0, 76]
vand q11, q11, q7 and w18, w18, w30
vand q14, q14, q6 and w1, w1, w18
vld1.32 {q7}, [r1] ldp w18, w30, [x0, 84]
vld1.32 {q6}, [r6] and w17, w17, w18
|
MP RandMem Benchmark
This benchmark potentially reads and writes all data, in sections covering caches and RAM, each thread starting at different addresses. Random access can select any address after that. Writing tends to involve updating the appropriate memory area, providing constant speeds. Random access is significantly affected by burst reading and writing.
The Pi 4B provided variable gains over the Pi 3B+ at 64 bits, but smaller gains from 64 bits over 32 bits on the Pi 4B.
Code: | MP-RandMem armv8 64 Bit Aug 2019 Using 1, 2, 4 and 8 Threads
Serial Serial Random Random Serial Serial Random Random
KB+Thread Read RdWr Read RdWr Read RdWr Read RdWr
Gentoo Pi 4B 64 Bits
12.3 1T 5922 7871 5892 7857
2T 11856 7882 11902 7923
4T 22964 7821 22276 7832
8T 23225 7751 22082 7717
122.9 1T 5827 7276 2052 1921
2T 10965 7258 1754 1924
4T 10969 7232 1848 1929
8T 10896 7158 1834 1909
12288 1T 3879 1052 188 170
2T 4848 935 218 168
4T 4684 943 332 170
8T 3982 1049 340 171
Gentoo Pi 3B+ 64 Bits Raspbian Pi 4B 32 Bits
12.3 1T 4901 3587 4912 3585 5860 7905 5927 7657
2T 8749 3564 8719 3556 11747 7908 11182 7746
4T 17108 3504 17160 3505 21416 7626 17382 7731
8T 16885 3475 16650 3485 20649 7528 20431 7378
122.9 1T 3921 3339 1010 974 5479 7269 1826 1923
2T 7360 3350 1814 972 10355 6964 1667 1920
4T 12199 3313 2281 969 9808 7177 1715 1908
8T 12089 3313 2279 968 11677 7058 1697 1919
12288 1T 2024 828 83 67 3438 1271 179 152
2T 2169 820 142 67 4176 1204 213 167
4T 2178 818 154 67 4227 1117 337 161
8T 2219 821 161 67 3479 1093 287 168
4 Thread Comparisons
Pi 4B/3B+ 64 Bits Pi 4B 64 bits/32 bits
12.3 4T 1.34 2.23 1.30 2.23 1.07 1.03 1.28 1.01
122.9 4T 0.90 2.18 0.81 1.99 1.12 1.01 1.08 1.01
12288 4T 2.15 1.15 2.16 2.54 1.11 0.84 0.99 1.06
|
MP-MFLOPS Benchmarks
MP-MFLOPS measures floating point speed on data from caches and RAM. The first calculations are as used in Memory Speed Benchmark, with a multiply and an add per data word read. The second uses 32 operations per input data word, of the form x[i] = (x[i]+a)*b-(x[i]+c)*d+(x[i]+e)*f and so on. Tests cover 1, 2, 4 and 8 threads, each carrying out the same calculations but accessing different segments of the data. Versions are available using single precision and double precision data, plus one with NEON intrinsic functions. The numeric results are converted into a simple sumcheck, which should be constant, irrespective of the number of threads used. Correct values are included at the end of the results below. Note the differences using NEON functions and double or single precision floating point instructions.
There can be wide variations in speeds, affected by the short running times and by factors such as variations in cached data. To help in interpreting results, comparisons are provided of results using one and four threads. These indicate that, with cache based data, the Pi 4B was more than 3.5 times faster than the Pi 3B+ at two operations per word, but less so at 32 operations.
The 64 bit and 32 bit comparisons were, no doubt, influenced by the particular compiler version used, and this is reflected in the main disassembled code shown below, for 32 operations per word. The 32 bit version compile included -mfpu=neon-vfpv4, but NEON was not implemented, resulting in scalar operation, using single word s registers. I have another version, compiled with -funsafe-math-optimizations, that produces NEON instructions, with similar performance to the 64 bit version, but more sumcheck differences.
The benchmark compiled to use NEON Intrinsic Functions does not include any that specify fused multiply and add operations, reducing maximum possible speed. The 64 bit compiler converts the functions to include fused instructions, providing the fastest speeds.
The main compiler independent feature that provides a clear advantage to 64 bit operation is that the CPU, at 32 bits, does not support double precision SIMD (NEON) operation, so instructions using single word d registers are compiled. On the other hand, the performance gain does not appear to meet the potential. This suggests that there are other limiting factors - see disassembly below.
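As a rough illustration of the calculations described above, the following sketch shows a two operations per word loop, the first eight operations of the longer expression, and one possible way of giving each thread its own segment. It is not taken from the benchmark source: the exact form of the two operation test, the function names and the segment splitting rule are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* 2 operations per word: one add and one multiply (assumed form). */
void calc2(float *x, size_t n, float a, float b)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b;
}

/* First 8 operations of the expression quoted in the text;
   the 32 operation test continues with more terms of the same kind. */
void calc8(float *x, size_t n, float a, float b, float c,
           float d, float e, float f)
{
    for (size_t i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}

/* Half-open segment [*start, *end) of n words for thread t of n_threads.
   The last thread also takes any remainder. */
void segment_bounds(size_t n, int n_threads, int t,
                    size_t *start, size_t *end)
{
    assert(n_threads > 0 && t >= 0 && t < n_threads);
    size_t chunk = n / (size_t)n_threads;
    *start = (size_t)t * chunk;
    *end = (t == n_threads - 1) ? n : *start + chunk;
}
```

Each thread would call one of the calc functions on x + start for end - start words, so every thread carries out the same arithmetic on different data.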
Code: | Single Precision
MP-MFLOPS armv8 64Bit Thu Aug 22 19:50:10 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
Gentoo Pi 4B 64 Bits MFLOPS
1T 2908 2854 459 5778 5734 5405
2T 5700 5311 457 10935 11212 7968
4T 10375 5588 490 18181 21842 7637
8T 9675 8460 511 20128 20567 8568
Gentoo Pi 3B+ 64 Bits MFLOPS Raspbian Pi 4B 32 Bits MFLOPS
1T 792 806 373 1780 1783 1724 987 993 606 2816 2794 2804
2T 1482 1596 382 3542 3509 3380 1823 1837 567 5610 5541 5497
4T 2861 2742 429 5849 7013 5465 2119 3349 647 9884 10702 9081
8T 2770 2877 429 6434 6700 6101 3136 3783 609 10230 10504 9240
4 Thread Comparisons
Pi 4B/3B+ 64 Bits Pi 4B 64 bits/32 bits
1T 3.67 3.54 1.23 3.25 3.22 3.14 2.95 2.87 0.76 2.05 2.05 1.93
4T 3.63 2.04 1.14 3.11 3.11 1.40 4.90 1.67 0.76 1.84 2.04 0.84
Double Precision
MP-MFLOPS armv8 64Bit Double Precision Thu Aug 22 19:51:42 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
Gentoo Pi 4B 64 Bits MFLOPS
1T 1464 1386 225 3398 3386 3182
2T 2837 2792 228 6720 6741 4547
4T 5172 3414 251 10405 12762 4763
8T 4774 4353 275 11506 12118 4865
Gentoo Pi 3B+ 64 Bits MFLOPS Raspbian Pi 4B 32 Bits MFLOPS
1T 415 386 206 1400 1403 1333 1187 1220 309 2682 2714 2701
2T 820 813 209 2804 2767 2597 2420 2416 282 5379 5415 4780
4T 1328 1323 212 5433 5340 2465 4665 2381 317 10256 10336 5242
8T 1343 1308 214 5090 5006 3280 4385 3114 310 9721 10340 5131
4 Thread Comparisons
Pi 4B/3B+ 64 Bits Pi 4B 64 bits/32 bits
1T 3.53 3.59 1.09 2.43 2.41 2.39 1.23 1.14 0.73 1.27 1.25 1.18
4T 3.89 2.58 1.18 1.92 2.39 1.93 1.11 1.43 0.79 1.01 1.23 0.91
NEON Single Precision
MP-MFLOPS NEON Intrinsics 64 Bit Thu Aug 22 19:52:48 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
Gentoo Pi 4B 64 Bits MFLOPS
1T 3311 3192 535 6442 6548 6198
2T 4607 6186 552 13030 13012 8468
4T 6279 5725 562 23798 24128 9374
8T 7815 12044 486 22725 21712 9395
Gentoo Pi 3B+ 64 Bits MFLOPS Raspbian Pi 4B 32 Bits MFLOPS
1T 830 823 406 2989 2986 2792 2491 2399 615 4325 4285 4261
2T 1575 1498 414 5981 5872 5445 5629 5520 591 8602 8463 8308
4T 2217 2650 431 11661 11644 6061 10580 5594 553 16991 16493 9124
8T 2733 3197 437 10505 10637 6708 7047 10785 513 14325 16219 8867
4 Thread Comparisons
Pi 4B/3B+ 64 Bits Pi 4B 64 bits/32 bits
1T 3.99 3.88 1.32 2.16 2.19 2.22 1.33 1.33 0.87 1.49 1.53 1.45
4T 2.83 2.16 1.30 2.04 2.07 1.55 0.59 1.02 1.02 1.40 1.46 1.03
|
MP-MFLOPS Disassembly
On the Pi 4B, with single precision floating point and SIMD, four word registers were used (see 4s below). With this, four results of calculations might be expected per clock cycle, or 6 GFLOPS per core at 1.5 GHz, and up to 24 GFLOPS using all four cores. Instructions such as fused multiply and add could then double the speed, to 12 GFLOPS per core. For the mix of instructions below, expectations might be 70% of this, or 8.4 GFLOPS. Using double precision, with two words in the 128 bit registers, expectations might be half that, at 4.2 GFLOPS per core, with this code.
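The arithmetic behind these expectations can be written out as a small sketch. It assumes a 1.5 GHz clock and one SIMD floating point instruction issued per cycle, as the paragraph above implies; the function names are invented here. The instruction mix used in the usage note (11 adds, 1 multiply, 10 fused) is counted from the SP NEON loop listed below.

```c
#include <assert.h>

/* Peak GFLOPS per core = clock (GHz) x SIMD lanes x FP ops per instruction. */
double peak_gflops(double clock_ghz, int simd_lanes, double ops_per_instruction)
{
    assert(clock_ghz > 0.0 && simd_lanes > 0 && ops_per_instruction > 0.0);
    return clock_ghz * (double)simd_lanes * ops_per_instruction;
}

/* Average FP operations per FP instruction, per lane, for a given mix.
   A fused multiply and add (or subtract) counts as two operations. */
double mix_ops_per_instruction(int adds, int multiplies, int fused)
{
    int instructions = adds + multiplies + fused;
    assert(instructions > 0);
    return (double)(adds + multiplies + 2 * fused) / (double)instructions;
}
```

peak_gflops(1.5, 4, 1.0) gives 6 and peak_gflops(1.5, 4, 2.0) gives 12. For the listed mix, mix_ops_per_instruction(11, 1, 10) is 32/22, or about 1.45, giving roughly 8.7 GFLOPS, which is about 73% of 12 and in line with the 70% estimate.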
Code: | SP NEON 24.1 GFLOPS 6.55 1 core DP 12.7 GFLOPS - 3.39 1 core
.L41: .L84:
ldr q1, [x1] ldr q16, [x2, x0]
ldr q0, [sp, 64] add w3, w3, 1
fadd v18.4s, v20.4s, v1.4s cmp w3, w6
fadd v17.4s, v22.4s, v1.4s fadd v15.2d, v16.2d, v14.2d
fadd v0.4s, v0.4s, v1.4s fadd v17.2d, v16.2d, v12.2d
fadd v16.4s, v24.4s, v1.4s fmul v15.2d, v15.2d, v13.2d
fadd v7.4s, v26.4s, v1.4s fmls v15.2d, v17.2d, v11.2d
fadd v6.4s, v28.4s, v1.4s fadd v17.2d, v16.2d, v10.2d
fadd v5.4s, v30.4s, v1.4s fmla v15.2d, v17.2d, v9.2d
fmul v0.4s, v0.4s, v19.4s fadd v17.2d, v16.2d, v8.2d
fadd v4.4s, v10.4s, v1.4s fmls v15.2d, v17.2d, v31.2d
fadd v3.4s, v12.4s, v1.4s fadd v17.2d, v16.2d, v30.2d
fadd v2.4s, v14.4s, v1.4s fmla v15.2d, v17.2d, v29.2d
fadd v1.4s, v8.4s, v1.4s fadd v17.2d, v16.2d, v28.2d
fmls v0.4s, v21.4s, v18.4s fmls v15.2d, v17.2d, v0.2d
fmla v0.4s, v23.4s, v17.4s fadd v17.2d, v16.2d, v27.2d
fmls v0.4s, v25.4s, v16.4s fmla v15.2d, v17.2d, v26.2d
fmla v0.4s, v27.4s, v7.4s fadd v17.2d, v16.2d, v25.2d
fmls v0.4s, v29.4s, v6.4s fmls v15.2d, v17.2d, v24.2d
fmla v0.4s, v31.4s, v5.4s fadd v17.2d, v16.2d, v23.2d
fmls v0.4s, v9.4s, v1.4s fmla v15.2d, v17.2d, v22.2d
fmla v0.4s, v4.4s, v11.4s fadd v17.2d, v16.2d, v21.2d
fmls v0.4s, v3.4s, v13.4s fadd v16.2d, v16.2d, v19.2d
fmla v0.4s, v2.4s, v15.4s fmls v15.2d, v17.2d, v20.2d
str q0, [x1], 16 fmla v15.2d, v16.2d, v18.2d
cmp x1, x0 str q15, [x2, x0]
bne .L41 add x0, x0, 16
bcc .L84
                    32 bit    64 bit    32 bit    64 bit    32 bit    64 bit
                    SP        SP        DP        DP        NEON SP   NEON SP
Maximum GFLOPS      10.7      21.8      10.3      12.7      17.0      24.1
Instructions
  Total             27        39        26        27        67        27
  Floating point    22        32        22        32        32        22
FP operations
  Total             32        128       32        64        128       128
  Add or subtract   11        44        11        22        21        44
  Multiply          1         4         1         2         11        4
  Fused             20        80        20        40        0         80
Add example         fadds     fadd      faddd     fadd      vadd.f32  fadd
                    s16,      v15.4s,   d25,      v15.2d,   q9,       v1.4s,
                    s23,      v16.4s,   d17,      v16.2d,   q8,       v8.4s,
                    s2        v15.4s    d15       v14.2d    q14       v1.4s
Multiply example    fnmuls    fmul      fmuld     fmul      vmul.f32  fmul
                    s16,      v15.4s,   d16,      v15.2d,   q9,       v0.4s,
                    s3,       v15.4s,   d16,      v15.2d,   q9,       v0.4s,
                    s16       v17.4s    d5        v13.2d    q12       v19.4s
Fused example       vfma.f32  fmla      vfma.f64  fmla      N/A       fmla
                    s16,      v15.4s,   d16,      v15.2d,             v0.4s,
                    s29,      v17.4s,   d22,      v17.2d,             v4.4s,
                    s9        v0.4s     d28       v22.2d              v11.4s
FP registers used   32        4         32        25        16        32
|
MP-MFLOPS Sumchecks
Different instructions, such as SP versus DP, may not produce identical numeric results. Variations also depend on the number of passes; here, results were close to 1.0 as data size increased. The only anomaly is the value marked -X below.
Code: |
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
SP
4B/64 1T 76406 97075 99969 66015 95363 99951
3B/64 1T 76406 97075 99969 66015 95363 99951
4B/32 1T 76406 97075 99969 66015 95363 99951
DP
4B/64 1T 76384 97072 99969 66065 95370 99951
3B/64 1T 76384 97072 99969 66065 95370 99951
4B/32 1T 76384 97072 99969 66065 95370 99951
NEON Bit SP
4B/64 1T 76406 97075 99969 66015 95363 99951
3B/64 1T 76406 97075 99969 66015 95363 99951
4B/32 1T 76406 97075 99969 66014-X 95363 99951
|
OpenMP-MFLOPS Benchmark
This benchmark carries out the same calculations as the MP-MFLOPS Benchmarks, with additional calculations using eight operations per data word. There is also a notOpenMP-MFLOPS single core version, compiled from the same code and carrying out identical numbers of floating point calculations, but without an OpenMP compile directive.
Following is an example of full output. The strange test names were carried forward from a 2014 CUDA benchmark, via Windows and Linux Intel CPU versions. Details are in the following GigaFLOPS Benchmarks report, covering MP-MFLOPS, QPAR and OpenMP. This showed nearly 100 GFLOPS from a Core i7 CPU and 400 GFLOPS from a GeForce GTX 650 graphics card, via CUDA.
https://www.webarchive.org.uk/wayback/archive/20151031003049/http://www.roylongbottom.org.uk/GigaFLOPS%20Benchmarks.htm
The detail is followed by MFLOPS results on Pi 3B+ and Pi 4B. The direct conversions of the code from large systems lead to excessive memory demands for Raspberry Pi systems, with too many tests dependent on RAM speed, and low MP performance gains. There were glimpses of the usual performance gains and a maximum of over 20 SP GFLOPS on a 64 bit Pi 4B.
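For readers unfamiliar with OpenMP, a minimal sketch of the approach is shown here. It is not the benchmark code: the function name calc8_omp is invented, and the eight operation expression is assumed to have the same form as the MP-MFLOPS calculation. The point is that one source file can serve both as the OpenMP and the single core version.

```c
#include <assert.h>

/* Eight operations per word. With OpenMP enabled, loop iterations are
   shared among the available cores; otherwise the pragma is ignored
   and the loop runs on one core. */
void calc8_omp(float *x, int n, float a, float b, float c,
               float d, float e, float f)
{
    assert(x != 0 && n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] = (x[i] + a) * b - (x[i] + c) * d + (x[i] + e) * f;
}
```

With gcc, adding -fopenmp enables the directive and omitting it produces a single threaded build from the same code. The options actually used for the benchmarks are not given in this post, so treat that as an assumption.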
Code: | OpenMP MFLOPS64 Thu Aug 22 19:54:59 2019
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.092836 5386 0.929538 Yes
Data in & out 1000000 2 250 0.887743 563 0.992550 Yes
Data in & out 10000000 2 25 0.917173 545 0.999250 Yes
Data in & out 100000 8 2500 0.129858 15401 0.957117 Yes
Data in & out 1000000 8 250 0.899561 2223 0.995518 Yes
Data in & out 10000000 8 25 0.847036 2361 0.999549 Yes
Data in & out 100000 32 2500 0.391602 20429 0.890215 Yes
Data in & out 1000000 32 250 0.989877 8082 0.988088 Yes
Data in & out 10000000 32 25 0.944493 8470 0.998796 Yes
End of test Thu Aug 22 19:55:05 2019*
--------------- MFLOPS -------------- -------- Compare --------
Mbytes/ Pi 3B+ Pi 4B Pi 4B Pi 4B
Threads 64b 64b 32b 4b/3b 64/32b
All 1CP All 1CP All 1CP All 1CP All 1CP
0.4/2 2674 755 5386 2780 4716 2850 2.01 3.68 1.14 0.98
4/2 411 404 563 557 556 429 1.37 1.38 1.01 1.30
40/2 419 408 545 588 544 632 1.30 1.44 1.00 0.93
0.4/8 7029 1886 15401 5555 7981 5191 2.19 2.95 1.93 1.07
4/8 1656 1495 2223 2116 2389 2082 1.34 1.42 0.93 1.02
40/8 1725 1507 2361 2310 2199 2003 1.37 1.53 1.07 1.15
0.4/32 6648 1699 20429 5647 8147 5449 3.07 3.32 2.51 1.04
4/32 5977 1616 8082 5445 7951 5385 1.35 3.37 1.02 1.01
40/32 6027 1616 8470 5479 8030 5379 1.41 3.39 1.05 1.02
|
Next: More Multithreading Benchmarks
_________________
Regards
Roy
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Mon Sep 09, 2019 10:16 pm Post subject: More Multithreading Benchmarks |
More Multithreading Benchmarks
OpenMP-MemSpeed
This is the same program as the single core MemSpeed benchmark, but with increased memory sizes and compiled using OpenMP directives. The same program was also compiled without these directives (NotOpenMP-MemSpeed2), with the example single core results also shown after the detailed measurements. Although the source code appears to be suitable for speed up by parallelisation, many of the test functions are slower using OpenMP, and the effects on the Pi 3B+ and Pi 4B are not the same. Detailed comparisons of these results are rather meaningless.
Code: | Gentoo Pi 3B+ 64 Bits
Memory Reading Speed Test OpenMP 64 Bit by Roy Longbottom
Start of test Fri Sep 6 12:44:14 2019
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 6302 3584 1853 9425 5287 2000 12537 4762 2141
8 5122 3699 1897 8911 5620 2012 13017 5610 2111
16 7283 3717 1873 10812 5659 1974 13006 6428 2080
32 6953 3675 1762 10058 5515 1974 11998 6101 1997
64 6967 3683 1836 10052 5531 1966 12142 6169 2054
128 7021 3694 1848 10049 5544 2035 9932 6269 2091
256 7048 3680 1908 10196 5593 1976 8831 6323 2067
512 4986 1606 1722 6324 4189 1304 4509 4135 1647
1024 2284 2692 1397 3932 2385 1321 1277 1268 942
2048 981 2650 1398 1749 3360 1471 758 838 1043
4096 874 1578 1355 3757 3398 909 760 852 756
8192 1038 2585 1092 3805 1646 1243 857 751 870
16384 917 2359 1734 1184 3151 1179 880 776 814
32768 2983 1229 1916 1519 2880 1293 808 847 1373
65536 3214 1259 1298 3018 1319 1286 894 857 1147
131072 839 673 779 918 883 765 1263 1286 512
Not OMP
8 4694 2913 4841 6213 3944 4844 5402 4337 4337
256 3791 2572 3921 4428 3223 3922 4941 4065 4070
65536 1064 1070 1106 1075 1086 1028 763 849 847
Gentoo Pi 4B 64 Bits
Memory Reading Speed Test OpenMP 64 Bit by Roy Longbottom
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 7854 9082 3171 7660 9140 2232 30162 15534 2692
8 8238 8906 3150 8308 9253 2339 29033 15749 2673
16 8217 8964 3136 8408 9044 2350 31531 15867 2472
32 8598 8192 3085 8547 8094 2387 17252 14505 2377
64 9084 8654 3130 8902 8606 2410 18959 14678 2393
128 11338 11686 3091 11811 11261 2858 14852 15240 2361
256 16320 17582 3236 17404 16671 2652 13683 14741 2411
512 17581 18033 3089 16204 18086 2758 12921 10441 2331
1024 14527 13629 2891 15196 13782 2682 4323 6169 2272
2048 5018 7240 3120 7328 7241 2512 3370 3428 2215
4096 4054 7200 3135 7330 5612 2916 2775 2703 2196
8192 2130 2261 3867 7731 7527 3823 2701 2615 2184
16384 3795 4552 3364 2106 7417 3397 1793 2709 2100
32768 2065 6760 3327 7215 7144 3797 2108 2376 2242
65536 2462 2245 2390 7160 3945 2742 2746 2386 2259
131072 3276 3526 2324 8110 1927 2882 2584 2719 1965
Not OMP
8 15527 13976 15533 15504 14021 15537 11563 9311 7794
256 12236 11434 12096 12084 11740 12156 7883 8044 7818
65536 2047 2046 2037 2034 2054 2071 2567 2554 2547
Raspbian Pi 4B 32 Bits
Memory Reading Speed Test OpenMP Version 2 by Roy Longbottom
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 7650 6427 1247 7389 6401 1587 39558 19538 883
8 7777 6511 1263 7655 6534 1586 39076 19920 890
16 8180 7840 1275 8412 7870 1566 38490 20039 846
32 9355 9127 1295 9685 9266 1612 36718 19878 862
64 8949 8763 1223 8971 8727 1566 14949 14827 844
128 12241 11610 1247 12328 12303 1742 13945 15134 876
256 17543 14765 1300 18010 17894 1748 12710 13167 839
512 18252 15466 1265 18030 16934 1651 12814 12407 874
1024 9044 12367 1432 12278 12201 1641 6907 9438 846
2048 6975 6620 1521 7031 6999 1676 3073 3365 797
4096 3539 7303 1440 7267 7247 1730 2348 3165 831
8192 7547 7759 1369 7608 7659 1762 2622 3133 904
16384 3877 7559 1329 7987 5744 1506 2514 3136 850
32768 7391 3974 1317 7290 6655 1763 2586 3102 921
65536 8209 7779 1341 7856 7290 1805 2445 2834 851
131072 5086 7344 1280 3475 5222 1688 2358 2968 830
Not OMP
8 8603 11757 13383 8607 11754 13384 7827 7796 7796
256 8312 9879 9991 8355 9988 9993 7530 7803 7805
65536 2098 2073 2081 2087 2077 2068 2590 961 965
|
Stress Testing Programs Benchmarking Mode
My latest stress testing programs have parameters that specify running time, data size, number of threads, log file number and, in two cases, processing density. When run without parameters, the full range of options is used, providing a useful benchmark. Log file results from Pi 4B tests, and comparisons, are provided below. The programs are available in:
https://www.researchgate.net/profile/Roy_Longbottom/project/Performance-of-Raspberry-Pi-and-Android-Devices/attachment/5cc08ba43843b01b9b9c4f64/AS:751246503858192@1556122532665/download/Raspberry-Pi-Stress-2019.tar.gz?context=ProjectUpdatesLog
Integer Stress Test-Benchmark
The integer program test loop comprises 32 add or subtract instructions, operating on hexadecimal data patterns, with sequences of 8 subtracts then 8 adds to restore the original pattern. Disassembly shows that the test loop, in fact, used 68 instructions, most of the additional ones being register loads. The result is 68/32 instructions per 4 byte word. At the maximum of 1943M words per second, using a single core, the resultant execution speed was 4129 MIPS, with nearly four times more using all cores.
The tables below show speeds on the systems considered, with the average performance gains of the Pi 4B at 64 bits, which were somewhat limited in this case.
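The MIPS figure quoted above follows from the measured MB/second, as this short sketch shows. The function name is made up, and the 32 words per loop pass is inferred from the 68/32 ratio given in the text. The 7771 MB/second input in the usage note is the 1 thread, 16 KB result from the Pi 4B 64 bit table below.

```c
#include <assert.h>

/* Converts a data rate into an instruction rate (MIPS), given the
   word size and the instructions and words handled per loop pass. */
double mips_from_mb_per_second(double mb_per_second, int bytes_per_word,
                               int loop_instructions, int loop_words)
{
    assert(bytes_per_word > 0 && loop_words > 0);
    double mwords_per_second = mb_per_second / (double)bytes_per_word;
    return mwords_per_second * (double)loop_instructions / (double)loop_words;
}
```

mips_from_mb_per_second(7771.0, 4, 68, 32) gives about 1943M words per second times 68/32, or roughly 4128 MIPS, which agrees with the quoted 4129 MIPS allowing for rounding.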
Code: | Gentoo Pi 4B 64 Bits
MP-Integer-Test 64 Bit v1.0 Fri Sep 6 16:33:36 2019
Benchmark 1, 2, 4, 8, 16 and 32 Threads
MB/second
KB KB MB Same All
Secs Thrds 16 160 16 Sumcheck Tests
4.3 1 7771 7352 3895 00000000 Yes
3.3 2 15467 14218 3714 FFFFFFFF Yes
3.0 4 28715 26652 3345 5A5A5A5A Yes
3.0 8 30292 26310 3334 AAAAAAAA Yes
3.0 16 29466 28503 3337 CCCCCCCC Yes
3.0 32 29351 30358 3390 0F0F0F0F Yes
Pi 4B 32 bit MB/sec Pi 3B+ 64 bit MB/sec
KB KB MB KB KB MB
16 160 16 16 160 16
Threads
1 5964 5756 3931 4823 3884 1209
2 11787 11430 3748 9613 7709 1908
4 23214 22060 3456 17737 15137 1779
8 22197 22171 3472 17651 18692 1767
16 22671 23299 3256 18255 18793 1757
32 21379 21881 3346 18246 18674 1748
Pi 4B 64b/32b 64b Pi 4B/3B+
Average
Gain 1.31 1.25 0.99 1.63 1.67 2.13
|
Single Precision Floating Point Stress Test-Benchmark
This and the double precision program carry out the same calculations as MP-MFLOPS, but are slightly faster by including a loop that repeats the tests within the calculate functions. Maximum speeds were 6.75 GFLOPS, using one core, and 26.7 GFLOPS with four cores.
These programs were produced using a later compiler than that used for MP-MFLOPS, at least resulting in similar speeds between 32 bit and 64 bit versions. Typical Pi 4B/3B+ performance improvements were indicated.
Code: | MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep 6 16:30:12 2019
Benchmark 1, 2, 4 and 8 Threads
MFLOPS Numeric Results
Ops/ KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8
1.7 T1 2 2819 2874 504 40392 76406 99700
3.2 T2 2 5592 5702 511 40392 76406 99700
4.6 T4 2 9223 7520 519 40392 76406 99700
6.0 T8 2 9520 10471 545 40392 76406 99700
8.2 T1 8 5381 5595 2050 54764 85092 99820
9.8 T2 8 11039 10883 2173 54764 85092 99820
11.3 T4 8 19087 21040 2044 54764 85092 99820
12.9 T8 8 19747 21107 2016 54764 85092 99820
17.5 T1 32 6693 6753 6377 35206 66015 99520
20.2 T2 32 13491 13464 8710 35206 66015 99520
22.2 T4 32 25732 26704 9160 35206 66015 99520
24.1 T8 32 25708 25770 8927 35206 66015 99520
End of test Fri Sep 6 16:30:37 2019
Pi 4B 32 bit Pi 3B+ 64 bit
Threads KB KB MB KB KB MB
Ops/wd 12.8 128 12.8 12.8 128 12.8
T1 2 2641 2607 646 838 826 373
T2 2 5089 5116 618 1659 1650 380
T4 2 8282 8522 683 2584 3296 384
T8 2 8756 9847 686 3013 3056 391
T1 8 5543 5428 2597 1981 1972 1354
T2 8 10754 10603 2711 3936 3923 1518
T4 8 18716 20823 2844 7482 7396 1531
T8 8 19859 21684 2555 7399 7705 1534
T1 32 5309 5274 5265 2820 2809 2462
T2 32 10557 10509 9991 5636 5583 4754
T4 32 20416 20919 11340 10640 10882 6020
T8 32 20072 19787 9330 10641 10926 6159
Average Pi 4B Performance Gains
Ops/Word Pi 4B 64b/32b 64b Pi 4B/3B+
2 1.09 1.04 0.79 3.37 3.16 1.36
8 1.00 1.01 0.77 2.69 2.80 1.40
32 1.27 1.29 0.96 2.40 2.41 1.85
|
Double Precision Floating Point Stress Test-Benchmark
Maximum measured DP speeds were 3.39 GFLOPS, using one core, and 13.2 GFLOPS with four cores. Some of the 64/32 bit and 4B/3B+ performance ratios were similar to those from MP-MFLOPS.
Code: | MP-Threaded-MFLOPS 64 Bit v1.0 Fri Sep 6 16:31:24 2019
Double Precision Benchmark 1, 2, 4 and 8 Threads
MFLOPS Numeric Results
Ops/ KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8
3.2 T1 2 1398 1462 285 40395 76384 99700
6.2 T2 2 2799 2807 256 40395 76384 99700
8.9 T4 2 5024 4589 257 40395 76384 99700
11.5 T8 2 5089 5545 280 40395 76384 99700
15.7 T1 8 2668 2790 1103 54805 85108 99820
18.8 T2 8 5670 5545 1158 54805 85108 99820
21.7 T4 8 10259 10011 1068 54805 85108 99820
24.7 T8 8 10239 10824 1036 54805 85108 99820
34.1 T1 32 3317 3390 3195 35159 66065 99521
39.2 T2 32 6791 6754 4753 35159 66065 99521
43.1 T4 32 12940 13200 4497 35159 66065 99521
46.9 T8 32 13200 13049 4557 35159 66065 99521
End of test Fri Sep 6 16:32:11 2019
Pi 4B 32 bit Pi 3B+ 64 bit
Threads KB KB MB KB KB MB
Ops/wd 12.8 128 12.8 12.8 128 12.8
T1 2 993 998 329 412 411 193
T2 2 1971 1995 309 828 824 194
T4 2 3633 3937 340 1543 1514 197
T8 2 3635 3796 339 1525 1551 196
T1 8 2378 2445 1288 980 978 696
T2 8 4770 4860 1282 1975 1964 782
T4 8 9281 9556 1210 3688 3688 781
T8 8 9119 9448 1245 3726 3689 787
T1 32 2697 2726 2708 1402 1403 1231
T2 32 5397 5446 5163 2808 2808 2399
T4 32 10689 10806 5146 5379 5413 3195
T8 32 10716 10494 4497 5450 5485 3150
Average Pi 4B Performance Gains
Ops/Word Pi 4B 64b/32b 64b Pi 4B/3B+
2 1.40 1.37 0.82 3.34 3.39 1.38
8 1.13 1.12 0.87 2.78 2.83 1.44
32 1.23 1.24 1.00 2.40 2.41 1.86
|
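As a rough check on how well the double precision test scales across cores, speedup and parallel efficiency can be worked out from the MFLOPS columns. The following is only an illustrative Python sketch, not part of the benchmark. The function names are made up for this example, and the figures are the 128 KB, 32 operations per word results from the Pi 4B 64 bit log above.

```python
def speedup(multi_thread_mflops, single_thread_mflops):
    """Ratio of multi-thread to single-thread throughput."""
    return multi_thread_mflops / single_thread_mflops


def efficiency(multi_thread_mflops, single_thread_mflops, cores):
    """Speedup divided by the number of cores (1.0 would be perfect scaling)."""
    return speedup(multi_thread_mflops, single_thread_mflops) / cores


# Pi 4B 64 bit, double precision, 128 KB, 32 ops/word: T1 = 3390, T4 = 13200
s = speedup(13200, 3390)
e = efficiency(13200, 3390, cores=4)
print(f"speedup {s:.2f}x, efficiency {e:.0%}")
```

The same two functions can be applied to any other pair of rows, for example the 12.8 MB columns, where scaling is limited by RAM speed rather than by the cores.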
High Performance Linpack Benchmark
Earlier, the High Performance Linpack Benchmark was run on Raspberry Pi 3 models and, later, on the Raspberry Pi 4 system, both via the 32 bit Raspbian Operating System. Details and results can be found in the following reports.
https://www.researchgate.net/publication/331983549_Raspberry_Pi_3B_and_3B+_High_Performance_Linpack_and_Error_Tests
https://www.researchgate.net/publication/334561068_Raspberry_Pi_4B_Stress_Tests_Including_High_Performance_Linpack
Initially, two versions of HPL tests were run, one accessing precompiled Basic Linear Algebra Subprograms and the other with ATLAS alternatives, that had to be built. The whole benchmark suite was produced according to instructions in the following.
https://computenodes.net/2018/06/28/building-hpl-an-atlas-for-the-raspberry-pi/
The ATLAS version was installed, as the older benchmark would not run on the Pi 4. One issue is the time required for the build, apparently due to the numerous tuning tests. Time taken was 14 hours using a Pi 3B+, then 8 hours on a Pi 4. Later, 64 bit ATLAS was built on the Pi 3B+, via Gentoo, taking 26 hours, that included extended periods swapping data with the rather slow main drive.
The procedure specified in the above was followed and led to a working package. Only one change was required, to Make.rpi line 95, changing the LAdir path from the first value below to the second:
Code: | LAdir = /home/pi/atlas-build to = /home/demouser/atlas-build
|
Following the introduction of 64 bit Gentoo for the Pi 4B, ATLAS was again created, taking more than 10 hours. As indicated in the above links, the HPL benchmark can be a useful stress test, due to the long running time with heavy processing. It can lead to CPU MHz being throttled on the Pi 4B, producing slow GFLOPS speeds. The tests reported here were run using a Pi 4B with a cooling fan, with CPU MHz monitored to help to indicate that the processor was running at full speed.
Results and comparisons are provided below, followed by the main report for the best Pi 4B Gentoo result. Particularly important, maximum performance is dependent on the amount of RAM available. As with the original single CPU Linpack benchmark, where N is the matrix problem size, minimum memory used is N x N x 8 Bytes (double precision), that is 512 MB for N = 8000 or 3.2 GB for N = 20000. The end of the detailed output indicates a further problem, where the first run at maximum size might be slow, with extra time spent swapping data out of RAM to create space for the HPL data.
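The N x N x 8 Bytes rule is easy to tabulate for the problem sizes used in the results. This is a small Python sketch for illustration only; the function name hpl_matrix_bytes is hypothetical, and the figure covers the matrix alone, not workspace or the operating system.

```python
def hpl_matrix_bytes(n, bytes_per_word=8):
    """Minimum bytes needed to hold an n x n matrix of double precision words."""
    return n * n * bytes_per_word


for n in (4000, 8000, 16000, 20000):
    mb = hpl_matrix_bytes(n) / 1e6  # decimal megabytes, as used in the text
    print(f"N = {n:5d}: {mb:7.0f} MB")
```

On this basis N = 16000 already needs about 2 GB for the matrix, so the larger sizes are only practical on a Pi 4B with enough RAM.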
Next, the benchmark produces a sumcheck but, in the case of the ATLAS implementation, the values are not consistent for the same problem size, although all those shown here were indicated as PASSED (within specified tolerances). The anomaly could be produced by different CPU models or alternative compilations, but the least understandable case is identified at the end of the detailed output, where the sumcheck is shown to vary on repeating the program on the same system.
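For anyone wondering what the sumcheck number represents, the detailed output prints the formula for the scaled residual and the pass threshold of 16.0. The sketch below applies that formula to a small random system in plain Python. It is only an illustration under stated assumptions: the solver is a simple Gaussian elimination written for this example, not HPL or ATLAS code, and the matrix size and random seed are arbitrary.

```python
import random

EPS = 2.0 ** -53  # 1.110223e-16, the value quoted in the HPL output


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def scaled_residual(a, x, b):
    """||Ax-b||_oo / (eps * (||x||_oo * ||A||_oo + ||b||_oo) * N)."""
    n = len(b)
    ax_minus_b = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    norm_r = max(abs(v) for v in ax_minus_b)
    norm_x = max(abs(v) for v in x)
    norm_b = max(abs(v) for v in b)
    norm_a = max(sum(abs(v) for v in row) for row in a)  # max absolute row sum
    return norm_r / (EPS * (norm_x * norm_a + norm_b) * n)


random.seed(1)  # arbitrary seed, for a repeatable illustration
n = 60
a = [[random.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b = [random.uniform(-0.5, 0.5) for _ in range(n)]
x = solve(a, b)
res = scaled_residual(a, x, b)
print(f"scaled residual {res:.7f}", "PASSED" if res < 16.0 else "FAILED")
```

Because the residual depends on the accumulated rounding errors, any change in the order of floating point operations can alter the last few digits, which is at least consistent with a value that moves around while still passing.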
Comparing Pi 4B 32 bit and 64 bit GFLOPS maximum speeds, the 32 bit version appears to be slightly faster (or the same within reasonable tolerances). Then it is not clear (to me), whether the compiled code completely embraces the difference in technology or whether external compile options should be included for the different packages involved.
Code: | ------ Time ------ ----- GFLOPS ----- ----------- Sumcheck ----------
4B 4B 3B+ 4B 4B 3B+ 4B 4B 3B+
N 64b 32b 64b 64b 32b 64b 64b 32b 64b
4000 5.51 5.20 14.53 7.75 8.20 2.94 0.0022808 0.0023975 0.0025857
8000 38.22 36.70 101.59 8.93 9.30 3.36 0.0017216 0.0016746 0.0017518
16000 269.26 263.00 10.14 10.40 0.0012577 0.0011258
20000 513.67 494.30 10.38 10.80 0.0009637 0.0010188
GFLOPS Comparisons
4B 64b
N 64b/32b 4B/3B+
4000 0.95 2.64
8000 0.96 2.66
16000 0.98
20000 0.96
--------------------------------------------------------------------------------
The following parameter values will be used:
N : 1000
NB : 128
PMAP : Row-major process mapping
P : 2
Q : 2
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 20000 128 2 2 513.67 1.038e+01
HPL_pdgesv() start time Fri Aug 23 10:57:30 2019
HPL_pdgesv() end time Fri Aug 23 11:06:04 2019
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0009637 ...... PASSED
================================================================================
WR11C2R4 20000 128 2 2 516.71 1.032e+01
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0008697 ...... PASSED
================================================================================
First Run
WR11C2R4 20000 128 2 2 656.89 8.120e+00
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0009470 ...... PASSED
================================================================================
|
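The GFLOPS figures in the HPL output can be cross-checked from N and the run time. The sketch below assumes the conventional LU operation count of 2/3 N^3 + 3/2 N^2, which reproduces the logged values here, but the formula is an assumption on my part rather than something stated in the log. The function name hpl_gflops is made up for the example.

```python
def hpl_gflops(n, seconds):
    """GFLOPS from the conventional LU operation count 2/3 N^3 + 3/2 N^2."""
    operations = (2.0 / 3.0) * n ** 3 + (3.0 / 2.0) * n ** 2
    return operations / seconds / 1e9


# (N, seconds) pairs from the Pi 4B 64 bit column of the results table
for n, secs in [(4000, 5.51), (8000, 38.22), (16000, 269.26), (20000, 513.67)]:
    print(f"N = {n:5d}  {secs:7.2f} s  {hpl_gflops(n, secs):6.2f} GFLOPS")
```

Applying the same function to the slow first run (656.89 seconds at N = 20000) gives the lower figure reported for it.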
_________________ Regards
Roy |
|
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Wed Sep 25, 2019 8:05 pm Post subject: 64 Bit Raspberry Pi 4B Java and I/O Benchmarks |
|
|
64 Bit Raspberry Pi 4B Java and I/O Benchmarks
The benchmark results included below enable comparisons between the Pi 4B and 3B+, using the same benchmark and system software, but a sample of Pi 4 32 bit speeds is included, in some cases.
Three operational differences were observed from the 32 bit Raspbian benchmarking exercise. The first was with dual monitors: on displaying a window twice the monitor pixel width, the 32 bit program spread this across both monitors, but no mirroring appeared to be available. The 64 bit benchmark provided mirroring for smaller windows but, with mirroring switched off, squashed the double width image into a single monitor display.
The standard version of my DriveSpeed benchmark uses Direct I/O, to avoid caching, on writing and reading large files and works as expected running the 32 bit version. Running a 64 bit version, this led to failures to write or read. Variations were produced to enable performance measurements.
My broadband hub has dual 2.4 and 5 GHz capabilities. Significant variations in performance can be produced in this mode, on different devices, but appear to be more significant using the 64 bit benchmark. I changed the hub settings to provide different 2.4 and 5 GHz ports but, unlike the 32 bit benchmark, the 64 bit version would not connect to the network, using the same Pi 4B system.
Note that these differences could be due to program, software and/or hub incompatibility.
OpenGL GLUT Benchmark
The benchmark measures graphics speed in terms of Frames Per Second (FPS) via six simple and more complex tests. The first four tests portray moving up and down a tunnel including various independently moving objects, with and without texturing. The last two tests represent a real application for designing kitchens. The first is in wireframe format, drawn with 23,000 straight lines, followed by one with applied textures.
Pi 4B average performance gains are included below, with textured objects the best, at 2.1 times, and worst, at around 1.5 times, with the slow kitchen displays.
Dual Monitors - The benchmark was also run with two 1920x1080 monitors connected. It displayed two identical displays when the mirror option was selected. Without this, the normal display, from where the program is executed, appeared on one display, and the OpenGL images on the other. This was fine when the usual display dimensions, as shown below, were specified. With no parameters, full screen image was assumed to be 3840x1080 and this was displayed horizontally squashed into 1920 pixels. FPS measurements for the latter are shown below.
On running the 32 bit version via Raspbian, the default display was 3840x1080, across both monitors, but only on one monitor, when 1920x1080 parameters or less were specified. There was no mirror option.
Code: | ############################# Pi 3B+ #############################
GLUT OpenGL Benchmark 64 Bit Version 1, Fri Sep 20 11:15:47 2019
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
160 120 389.6 227.2 122.6 75.3 30.0 21.5
320 240 328.1 201.7 113.8 73.3 30.2 21.3
640 480 203.3 144.7 87.8 62.0 30.2 21.0
1024 768 107.1 94.5 60.3 51.1 28.9 20.0
1920 1080 45.3 47.5 36.9 33.1 28.7 20.0
############################## Pi 4B #############################
GLUT OpenGL Benchmark 64 Bit Version 1, Thu Sep 12 20:48:21 2019
Running Time Approximately 5 Seconds Each Test
Window Size Coloured Objects Textured Objects WireFrm Texture
Pixels Few All Few All Kitchen Kitchen
Wide High FPS FPS FPS FPS FPS FPS
160 120 767.4 420.3 258.3 154.3 45.7 31.7
320 240 682.9 388.8 245.0 148.3 45.1 30.8
640 480 367.1 262.6 217.9 140.1 46.2 30.9
1024 768 150.8 148.8 128.6 117.3 45.3 30.4
1920 1080 71.9 73.9 64.0 61.6 43.3 27.9
Pi 4B Gains 1.77 1.74 2.12 2.10 1.52 1.46
Dual Monitor- mirrored displays
1920 1080 65.0 66.3 61.6 58.2 42.7 27.5
Dual Monitor - not mirrored squashed image on one monitor
3840 1080 60.9 59.6 57.2 54.8 40.8 26.8
32 Bit
1920 1080 81.4 79.4 74.6 68.3 30.8 20.0
|
JavaDraw Benchmark
The benchmark uses small to rather excessive simple objects to measure drawing performance in Frames Per Second (FPS). Five tests draw on a background of continuously changing colour shades, each test adding to the load.
Pi 4B performance gains shown below were indicated between 2.1 and 3.42 times.
Code: | ############################# Pi 3B+ #############################
Java Drawing Benchmark, Sep 20 2019, 11:08:33
Produced by javac 1.7.0_02
Test Frames FPS
Display PNG Bitmap Twice Pass 1 335 33.46
Display PNG Bitmap Twice Pass 2 546 54.53
Plus 2 SweepGradient Circles 502 50.08
Plus 200 Random Small Circles 366 36.59
Plus 320 Long Lines 134 13.30
Plus 4000 Random Small Circles 46 4.59
Total Elapsed Time 60.2 seconds
Operating System Linux, Arch. aarch64, Version 4.19.67
Java Vendor IcedTea, Version 1.8.0_222
############################# Pi 4B ##############################
Java Drawing Benchmark, Sep 12 2019, 20:18:28
Produced by javac 1.7.0_02
Test Frames FPS Gains
Display PNG Bitmap Twice Pass 1 1146 114.52 3.42
Display PNG Bitmap Twice Pass 2 1318 131.79 2.42
Plus 2 SweepGradient Circles 1237 123.66 2.47
Plus 200 Random Small Circles 972 97.13 2.65
Plus 320 Long Lines 415 41.48 3.12
Plus 4000 Random Small Circles 97 9.65 2.10
Total Elapsed Time 60.1 seconds
Operating System Linux, Arch. aarch64, Version 4.19.67
Java Vendor IcedTea, Version 1.8.0_222
32 bit Pi 4B speeds were between 8.25 and 104.47 FPS
|
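The Gains column is simply the Pi 4B FPS divided by the Pi 3B+ FPS for the same test. A few lines of Python, given here purely as an illustration with the values copied from the two logs above, reproduce the column and the quoted range.

```python
# FPS values copied from the two JavaDraw logs above, in test order
pi3_fps = [33.46, 54.53, 50.08, 36.59, 13.30, 4.59]
pi4_fps = [114.52, 131.79, 123.66, 97.13, 41.48, 9.65]

gains = [round(new / old, 2) for old, new in zip(pi3_fps, pi4_fps)]
print(gains)
print(f"min {min(gains):.2f}, max {max(gains):.2f}")
```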
Java Whetstone Benchmark
The benchmark measures performance of various floating point and integer calculations, with an overall rating in Million Whetstone Instructions Per Second (MWIPS).
Code: | ############################# Pi 3B+ #############################
Whetstone Benchmark Java Version, Sep 20 2019, 11:06:12
1 Pass
Test Result MFLOPS MOPS millisecs
N1 floating point -1.124750137 310.88 0.0618
N2 floating point -1.131330490 289.41 0.4644
N3 if then else 1.000000000 241.15 0.4292
N4 fixed point 12.000000000 706.28 0.4460
N5 sin,cos etc. 0.499110132 23.31 3.5700
N6 floating point 0.999999821 130.04 4.1480
N7 assignments 3.000000000 89.19 2.0720
N8 exp,sqrt etc. 0.825148463 21.92 1.6970
MWIPS 775.89 12.8884
Operating System Linux, Arch. aarch64, Version 4.19.67
Java Vendor IcedTea, Version 1.8.0_222
############################# Pi 4B ##############################
Whetstone Benchmark Java Version, Sep 12 2019, 20:15:35
1 Pass
Test Result MFLOPS MOPS millisecs Gains
N1 floating point -1.124750137 488.80 0.0393 1.57
N2 floating point -1.131330490 475.92 0.2824 1.64
N3 if then else 1.000000000 344.31 0.3006 1.43
N4 fixed point 12.000000000 1571.86 0.2004 2.23
N5 sin,cos etc. 0.499110132 43.55 1.9104 1.87
N6 floating point 0.999999821 264.15 2.0420 2.03
N7 assignments 3.000000000 264.00 0.7000 2.96
N8 exp,sqrt etc. 0.825148463 25.80 1.4420 1.18
MWIPS 1445.70 6.9171 1.86
Operating System Linux, Arch. aarch64, Version 4.19.67
Java Vendor IcedTea, Version 1.8.0_222
|
DriveSpeed Benchmark
This benchmark has the format shown below, measuring writing and reading speeds of large files, cached files, random access and numerous small files. Run time parameters are available to specify large file size and the file path.
Code: | ########################## Pi 4B USB 3 ###########################
DriveSpeed RasPi 64 Bit 2.0 Fri Sep 13 22:25:40 2019
Selected File Path:
/run/media/demouser/PATRIOT//
Total MB 120832, Free MB 119778, Used MB 1054
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
512 30.72 31.11 34.01 287.24 295.04 311.90
1024 34.66 36.11 35.45 298.87 302.38 300.26
Cached
8 42.03 39.58 38.85 1167.71 1029.35 1061.56
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.004 0.007 0.310 9.65 10.42 9.71
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 0.03 0.07 0.13 268.10 427.95 657.48
ms/file 122.73 122.28 122.22 0.02 0.02 0.02 2.557
|
For non-cached tests, in the standard version of this benchmark, the file opening handle includes the O_DIRECT option, specifying Direct I/O (no caching). The latest minor variety of this appears to work, as expected, on the 32 bit Raspbian version, on both main and USB drives. The 64 bit compilation of this indicated a failure to write to the main SD drive and a failure to read from USB flash drives. Omitting O_DIRECT, for reading, appeared to correct the latter (see above). To check this and enable main drive measurements, separate direct I/O free large file write and read only programs were produced, to follow write/reboot/read procedures. These were also necessary to indicate throughput simultaneously writing or reading two USB 3 drives.
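To illustrate the idea of trying Direct I/O and dropping back to ordinary buffered reads when the system refuses it, here is a minimal Python sketch. It is not the DriveSpeed code, which is a C program. The names timed_read and CHUNK are hypothetical, and the sketch assumes a Unix-like system where os.readv is available and that an EINVAL error is how a refusal of O_DIRECT shows up.

```python
import errno
import mmap
import os
import time

CHUNK = 1024 * 1024  # 1 MiB per read, a multiple of common block sizes


def _read_all(fd):
    """Read to end of file through a page-aligned buffer; return byte count."""
    buf = mmap.mmap(-1, CHUNK)  # anonymous mmap memory is page aligned
    total = 0
    try:
        while True:
            n = os.readv(fd, [buf])
            if n == 0:
                return total
            total += n
    finally:
        buf.close()


def timed_read(path, direct=True):
    """Return (bytes_read, seconds, used_direct), retrying without O_DIRECT."""
    o_direct = getattr(os, "O_DIRECT", 0)  # 0 where the platform lacks the flag
    attempts = [True, False] if (direct and o_direct) else [False]
    for use_direct in attempts:
        flags = os.O_RDONLY | (o_direct if use_direct else 0)
        try:
            fd = os.open(path, flags)
        except OSError as exc:
            if use_direct and exc.errno == errno.EINVAL:
                continue  # filesystem refused O_DIRECT at open time
            raise
        try:
            start = time.perf_counter()
            total = _read_all(fd)
            return total, time.perf_counter() - start, use_direct
        except OSError as exc:
            if not (use_direct and exc.errno == errno.EINVAL):
                raise  # anything else is a genuine error
        finally:
            os.close(fd)
    raise RuntimeError("unreachable: the buffered attempt returns or raises")
```

Calling timed_read on a large file and dividing the byte count by the elapsed seconds gives a MB/second figure. The third returned value shows whether Direct I/O was really in use, which matters because a buffered read of a recently written file mostly measures RAM speed.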
USB Flash Drives
Two FAT 32 formatted USB 3 sticks were used, P at 128 GB, with 32 KB sectors, reading speed rated as up to 400 MB/second, and R 8.8 GB partition, with 8 KB sectors, reading speed rated as up to 190 MB/second (but appears to do better sometimes).
Following is a summary of results, indicating USB 3 large file reading speed improvements of between 6.7 and 8.1 times, but disappointing writing performance. The slower P speeds might be affected by the mysteries of updating file allocation tables, which also influence random access and dealing with lots of small files, including file delete times. USB 3 use provided little or no performance gains for the latter. Cached reading reflects RAM speed.
Code: | MB/second 16 MB USB 2, 1024 MB USB 3
System Drive Write1 Write2 Write3 Read1 Read2 Read3
Pi 3B+ USB 2 P 11.5 11.4 11.5 36.6 37.7 37.3
Pi 3B+ USB 2 R 15.9 16.4 13.9 37.1 40.1 39.8
Pi 4B USB 2 P 12.6 12.6 12.6 37.0 37.3 37.2
Pi 4B USB 2 R 22.6 22.9 22.9 36.5 36.3 36.5
Pi 4B USB 3 P 34.7 36.1 35.5 298.9 302.4 300.3
Pi 4B USB 3 R 48.9 44.6 53.4 249.4 248.8 246.2
Compare MB/second
Pi 4B P USB 3/2 2.75 2.88 2.81 8.07 8.11 8.07
Pi 4B R USB 3/2 2.17 1.94 2.33 6.83 6.85 6.74
Cached MB/second Write1 Write2 Write3 Read1 Read2 Read3
Pi 3B+ USB 2 P 13.6 14.2 14.4 633.4 544.0 464.3
Pi 3B+ USB 2 R 13.7 14.4 19.4 623.5 661.4 557.6
Pi 4B USB 2 P 15.0 14.7 14.8 1204.0 1047.3 1066.3
Pi 4B USB 2 R 20.8 21.2 13.9 930.2 933.6 1230.3
Pi 4B USB 3 P 42.0 39.6 38.9 1167.7 1029.4 1061.6
Pi 4B USB 3 R 21.1 15.9 36.2 1103.6 944.9 981.0
Compare
Pi 4B P USB 3/2 2.80 2.70 2.63 0.97 0.98 1.00
Pi 4B R USB 3/2 1.01 0.75 2.60 1.19 1.01 0.80
Random milliseconds
Read Write
Pi 3B+ USB 2 P 0.013 0.013 0.254 11.76 10.18 9.80
Pi 3B+ USB 2 R 0.017 0.008 0.032 1.09 1.39 11.72
Pi 4B USB 2 P 0.006 0.007 0.215 9.56 8.54 8.75
Pi 4B USB 2 R 0.009 0.005 0.016 1.35 2.12 1.34
Pi 4B USB 3 P 0.004 0.007 0.310 9.65 10.42 9.71
Pi 4B USB 3 R 0.004 0.004 0.008 1.75 0.85 0.92
Compare
Pi 4B P USB 3/2 1.50 1.00 0.69 0.99 0.82 0.90
Pi 4B R USB 3/2 2.25 1.25 2.00 0.77 2.49 1.46
200 Small Files milliseconds
Write Read Delete
Pi 3B+ USB 2 P 134.2 128.6 129.6 0.08 0.12 0.07 3.36
Pi 3B+ USB 2 R 105.5 104.7 107.6 0.05 0.05 0.07 0.26
Pi 4B USB 2 P 125.8 125.5 125.8 0.02 0.02 0.02 3.12
Pi 4B USB 2 R 104.1 104.0 104.0 0.02 0.02 0.03 0.14
Pi 4B USB 3 P 122.7 122.3 122.2 0.02 0.02 0.02 2.56
Pi 4B USB 3 R 105.4 104.0 104.3 0.02 0.02 0.03 0.15
Compare
Pi 4B P USB 3/2 1.03 1.03 1.03 1.00 1.00 1.00 1.22
Pi 4B R USB 3/2 0.99 1.00 1.00 1.00 1.00 1.00 0.95
|
Drive Write/Reboot/Read Tests
The write test also reads the data for verification, but this will normally be cached in RAM, with high data transfer speeds. VMSTAT results are provided, covering reading speeds.
Main SD Drive
This is rated at up to 98 MB/second reading speed but only achieves near 46 MB/second. VMSTAT results confirm data transfer speed and three files eventually occupying around 3 GB of the cache, with the low 2% (x4) CPU utilisation and 23% (x4) waiting for I/O.
Code: | Current Directory Path: /home/demouser/RPi3-64-Bit-Benchmarks/IOtests/writeread
Total MB 28225, Free MB 18761, Used MB 9464
1024 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3
Write 18.99 19.34 19.47 1337.09 1164.91 1325.96
Read N/A N/A N/A 45.80 45.88 45.89
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 1 0 673848 60668 2792716 0 0 45056 0 767 1181 0 2 75 23 0
0 1 0 630228 60668 2835544 0 0 44544 0 789 1199 0 2 74 23 0
0 1 0 585204 60668 2880268 0 0 45056 0 691 1041 0 3 75 23 0
|
USB 3 Drive P
Read only speed was similar to that from the earlier detailed test. Note high CPU utilisation average of 17%, equivalent to 68% of one core.
Code: | Selected File Path:
/run/media/demouser/PATRIOT/
Total MB 120832, Free MB 119752, Used MB 1080
1024 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3
Write 58.45 23.10 22.91 1368.04 1190.71 1354.84
Read N/A N/A N/A 306.18 294.93 302.91
procs -----------memory--------- ---swap-- -----io---- -system-- ------cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 256 811672 20920 2696504 0 0 305664 0 3898 6182 1 15 73 11 0
0 1 256 510852 20920 2996188 0 0 303616 0 4304 5936 1 16 72 12 0
1 0 256 239400 20920 3267636 0 0 307184 0 4512 6177 1 17 71 11 0
|
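The VMSTAT rows can be turned into the MB/second and one core equivalent figures quoted in the text. The Python sketch below is an illustration only. It assumes vmstat's default reporting, where the bi column is in blocks of 1024 bytes per second, and it treats user plus system time as CPU utilisation. The function names are made up for the example.

```python
VMSTAT_FIELDS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()
BLOCK_BYTES = 1024  # assumption: vmstat's default block unit on Linux


def parse_vmstat_row(line):
    """Map one vmstat data row onto its column names as integers."""
    return dict(zip(VMSTAT_FIELDS, (int(v) for v in line.split())))


def read_mb_per_second(row):
    """Blocks read per second converted to decimal MB/second."""
    return row["bi"] * BLOCK_BYTES / 1e6


def one_core_percent(row, cores=4):
    """User plus system CPU %, scaled from an all-core average to one core."""
    return (row["us"] + row["sy"]) * cores


# Second data row from the USB 3 Drive P log above
row = parse_vmstat_row(
    "0 1 256 510852 20920 2996188 0 0 303616 0 4304 5936 1 16 72 12 0")
print(f"{read_mb_per_second(row):.1f} MB/second, "
      f"{one_core_percent(row)}% of one core")
```

The converted rate comes out a little above the benchmark's own read figures, which is plausible given that the two are measured over different intervals.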
USB 3 Drive R
This time data transfer speed was slower than the earlier example.
Code: | Selected File Path:
/run/media/demouser/REMIX_OS/
Total MB 9017, Free MB 7485, Used MB 1532
1024 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3
Write 46.43 28.81 36.57 1265.07 1103.23 1236.02
Read N/A N/A N/A 172.71 172.14 176.49
procs -----------memory--------- ---swap-- -----io---- -system-- ------cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 1 256 111512 912 3417624 0 0 175189 0 4315 5929 1 12 71 17 0
0 1 256 169756 992 3358840 0 0 169043 0 4064 5515 1 11 71 17 0
0 1 256 177444 1068 3351176 0 0 155724 0 4088 6023 1 12 70 16 0
|
USB 3 Drives R and P Together
File sizes were reduced to 512 MB for these tests, in order to ensure that there would be sufficient RAM to contain six copies, as indicated in VMSTAT cache occupancy. This makes it more tricky to measure total throughput, but the following appears to provide a best case example, with a maximum of up to 386 MB/second, with CPU utilisation near 100% of one core.
Later is a bad example, where one drive appears to be running at USB 2 speed.
Code: | Write/Read Thu Sep 19 16:07:48 2019 /run/media/demouser/REMIX_OS/
Write/Read Thu Sep 19 16:07:46 2019 /run/media/demouser/PATRIOT/
512 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3
R 28.72 33.89 44.69 1302.19 1131.65 1374.24
P 11.93 8.86 6.21 1232.47 1072.38 1213.36
Sep 23 17:11:21 2019 /run/media/demouser/PATRIOT/
Sep 23 17:11:20 2019 /run/media/demouser/REMIX_OS/
512 MB MBytes/Second
Write1 Write2 Write3 Read1 Read2 Read3 Seconds
P N/A N/A N/A 159.78 187.44 294.23 7.7
R N/A N/A N/A 221.83 232.10 230.94 6.7+2 delayed start
procs -----------memory--------- ---swap-- -----io---- -system-- ------cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 3160720 74616 296092 0 0 0 0 2031 3601 4 2 94 0 0
0 1 0 3112052 74616 342188 0 0 45552 0 1512 2257 1 3 93 4 0
0 1 0 2908004 74616 547600 0 0 206336 0 4684 7169 4 14 67 15 0
2 0 0 2531960 74616 919400 0 0 369136 0 5495 8033 4 24 47 25 0
2 0 0 2149064 74616 1303288 0 0 382960 0 5168 7007 1 21 52 26 0
1 1 0 1771492 74616 1681348 0 0 385024 0 5969 8255 1 23 49 26 0
1 1 0 1383524 74616 2068788 0 0 386016 0 5621 7926 1 21 49 29 0
0 2 0 999100 74616 2453280 0 0 383488 0 4602 6895 1 19 54 26 0
0 1 0 628988 74616 2824188 0 0 368640 0 5405 8153 2 20 56 22 0
1 0 0 310748 74624 3142732 0 0 317424 20 4622 6551 1 17 72 10 0
1 0 0 223052 73680 3231812 0 0 268288 0 2815 5012 1 18 72 10 0
0 0 0 223824 73680 3231280 0 0 32768 0 1044 2009 1 3 95 1 0
0 0 0 223824 73680 3231280 0 0 0 0 393 619 0 0 99 0 0
===============================================================================
Bad Example
Write1 Write2 Write3 Read1 Read2 Read3
P N/A N/A N/A 36.37 37.72 37.48
R N/A N/A N/A 248.18 248.22 223.53
|
LAN and WiFi Benchmarks
The Raspberry Pi LanSpeed64 version uses the same programming code as for the DriveSpeed benchmark, except O_DIRECT is not used on creating files. The measurements were made between the Pi 4B and a Windows 7 based PC, where the data transfer speed was confirmed via Task Manager Network information and sysstat sar -n DEV on the Raspberry Pi 4. SAMBA was also installed to connect a remote PC and enable an Intel Windows version, LanSpdx86Win.exe, to be run.
An example of a LanSpeed64 log file is provided below, preceded by examples of the required mount and run commands.
Code: | Commands
sudo mount -t cifs -o dir_mode=0777,file_mode=0777 //192.168.1.68/d /media/public
./LanSpeed64 FilePath /media/public/test
Log File
LanSpeed RasPi 64 Bit 1.0 Thu Sep 12 22:06:06 2019
Selected File Path:
/media/public/test/
Total MB 266240, Free MB 70991, Used MB 195249
MBytes/Second
MB Write1 Write2 Write3 Read1 Read2 Read3
8 66.13 92.09 92.76 96.36 96.85 97.30
16 80.79 93.59 94.61 103.99 104.34 104.57
Random Read Write
From MB 4 8 16 4 8 16
msecs 0.004 0.009 0.435 0.95 0.92 0.93
200 Files Write Read Delete
File KB 4 8 16 4 8 16 secs
MB/sec 1.37 2.45 4.77 1.37 2.49 4.92
ms/file 2.99 3.35 3.43 2.98 3.29 3.33 0.467
|
LAN and WiFi Benchmark Results
Below are results from programs run on the Pi 3B+ and 4B, plus others from running on a PC.
Dealing with large files, PC to Pi 4B and Pi 4B to PC LAN speeds demonstrated some gigabit performance examples (over 100 MB/second), around three times faster than on the Pi 3B+. My BT Hub has dual 2.4 and 5 GHz WiFi capabilities, leading to the following erratic WiFi performance, where (I think) greater than 10 MB/second is indicative of 5 GHz and around 4 MB/second for 2.4 GHz, the former usually only on writing. In this case, the hub was inches away from the Pi.
I changed the hub settings to provide separate 2.4 and 5 GHz hub address selections, with 72 and 180 Mbits/second being indicated, respectively. These sorts of numbers were confirmed on my Smartphone, but were variable. The 64 bit version would not connect to the network at 5 GHz, unlike the 32 bit program which, for example, obtained 15 MB/second writing and 8 MB/second reading. These differences could be, I suppose, due to program, software and/or hub incompatibility.
Random access times appeared to be quite similar on all WiFi tests, with faster but variable comparative times via LAN. There were similar relationships on dealing with numerous small files.
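For comparing the indicated link rates with the measured MB/second, it helps to convert Mbits/second to MB/second at 8 bits per byte. This small Python sketch is illustrative only, with a made-up function name. The results are upper limits before any protocol overhead, so measured speeds should be somewhat lower.

```python
def mbits_to_mbytes(mbits_per_second):
    """Convert a link rate in Mbits/second to MB/second (8 bits per byte)."""
    return mbits_per_second / 8.0


for label, rate in [("2.4 GHz WiFi", 72), ("5 GHz WiFi", 180), ("Gigabit LAN", 1000)]:
    print(f"{label:13s} {rate:5d} Mbits/s -> at most {mbits_to_mbytes(rate):6.1f} MB/s")
```

On these limits, around 4 MB/second fits a 2.4 GHz connection, anything over 9 MB/second must be 5 GHz, and LAN speeds over 100 MB/second are close to the gigabit ceiling of 125 MB/second.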
Code: | Large Files MB/second
System MB Write1 Write2 Write3 Read1 Read2 Read3
PC WiFi 16 4.08 4.16 4.11 2.34 1.68 1.30
PC LAN 16 106.11 106.11 105.89 50.67 33.86 25.47
LAN 3B+ 16 28.63 29.03 28.96 22.18 32.28 32.61
3B+ WiFi 16 11.15 11.00 10.76 4.01 3.89 3.09
4B WiFi1 16 6.43 6.39 6.47 4.33 4.13 4.86
4B WiFi2 16 13.26 13.34 13.25 3.69 4.22 4.00
4B LAN 16 80.79 93.59 94.61 103.99 104.34 104.57
4B LAN 128 96.58 96.67 95.74 106.41 107.24 107.82
Random milliseconds
System Read Write
PC WiFi 1.711 1.972 2.015 2.26 2.28 2.25
PC LAN 0.606 0.590 0.532 0.47 0.48 0.47
LAN 3B+ 0.030 0.816 0.484 1.19 1.16 1.16
3B+ WiFi 3.052 3.167 3.475 3.60 3.39 3.45
4B WiFi1 3.286 3.549 3.627 4.02 3.45 3.72
4B WiFi2 2.786 2.822 2.944 3.20 2.94 2.92
4B LAN 0.004 0.009 0.435 0.95 0.92 0.93
200 Small Files milliseconds per file
System Write Read Delete
PC WiFi 10.09 12.42 13.81 5.50 6.11 8.06 1.507
PC LAN 4.05 4.59 4.53 2.38 2.23 2.64 0.661
LAN 3B+ 3.72 4.36 4.45 3.33 3.40 3.60 0.378
3B+ WiFi 12.61 13.53 14.97 13.17 14.06 15.88 2.534
4B WiFi1 15.08 16.53 22.83 12.96 14.23 17.29 2.509
4B WiFi2 11.38 12.85 12.82 10.64 11.83 14.15 2.083
4B LAN 2.99 3.35 3.43 2.98 3.29 3.33 0.467
|
_________________ Regards
Roy |
|
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 7713 Location: almost Mile High in the USA
|
Posted: Thu Sep 26, 2019 6:22 am Post subject: |
|
|
Curious, what model of Atom 1666MHz is being used in the benchmark comparisons? Is it a Bonnell or Silvermont? _________________ Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching? |
|
Gavinmc42 n00b

Joined: 23 Sep 2019 Posts: 21 Location: Brisbane
|
Posted: Thu Sep 26, 2019 9:03 am Post subject: |
|
|
Any difference between A53 and A72 versions?
I barely understand "out of order execution" and where would it make a difference? _________________ Don't get Pi's if you are scared of learning. |
|
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Thu Sep 26, 2019 9:29 am Post subject: |
|
|
eccerr0r wrote: | Curious, what model of Atom 1666MHz is being used in the benchmark comparisons? Is it a Bonnell or Silvermont? |
Which report are you referring to? I can't recall including a direct reference in this topic. It would be a few years old anyway. _________________ Regards
Roy |
|
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Thu Sep 26, 2019 9:43 am Post subject: |
|
|
Gavinmc42 wrote: | Any difference between A53 and A72 versions?
I barely understand "out of order execution" and where would it make a difference? |
See comparisons of Pi 4B and Pi 3B+, the latter having the A53, where there are lots of Pi 4 A72 improvements above and beyond clock speed ratios.
I take "out of order execution" to mean that later instructions in a sequence can be executed if they have no impact on current calculations. This can improve performance. _________________ Regards
Roy |
|
eccerr0r Watchman

Joined: 01 Jul 2004 Posts: 7713 Location: almost Mile High in the USA
|
Posted: Thu Sep 26, 2019 2:11 pm Post subject: |
|
|
roylongbottom wrote: | eccerr0r wrote: | Curious, what model of Atom 1666MHz is being used in the benchmark comparisons? Is it a Bonnell or Silvermont? |
Which report are you referring to? I can't recall including a direct reference in this topic. It would be a few years old anyway. |
Oh sorry yeah taken out of context, I think I was looking at your website and not from this thread hence the out of the blue question...
Silvermont is many years old now, and Bonnell is even older. _________________ Intel Core i7 2700K@ 4.1GHz/HD3000 graphics/8GB DDR3/180GB SSD
What am I supposed watching? |
|
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Sat Sep 28, 2019 5:00 pm Post subject: GCC 9 Compiled Benchmarks |
|
|
GCC 9 Compiled Benchmarks
I am in the process of recompiling my benchmarks using gcc 9, to see if they produce faster performance than the existing gcc 6 compilations. Results of the single core CPU benchmarks and comparisons are provided below for a Pi 3B+ and Pi 4B, using the same Gentoo Operating System SD card and benchmark programs. For details of the latter and earlier results see:
https://www.raspberrypi.org/forums/viewtopic.php?f=31&t=44080&start=125#p1484388
and down the page
https://www.raspberrypi.org/forums/viewtopic.php?f=31&t=44080&start=125#p1485285
In due course, the gcc 9 benchmarks and source codes will be available to download.
The new compiler produces no real performance improvements for the first few benchmarks but provides some significant gains for data streaming integer and single precision floating point calculations in the memory tests, where vector SIMD instructions are likely to be generated.
Another reason for new versions of the benchmarks is that the CPUID information included in the earlier programs provided too much unnecessary information, with nine identical lines for each CPU core. The new one provides the following data (except for Linux), using the lscpu command. Here, the CPU model name and normal operating frequency are provided.
Pi 4B Cortex A72
Code: | Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 3
Model name: Cortex-A72
Stepping: r0p3
CPU max MHz: 1500.0000
CPU min MHz: 600.0000
BogoMIPS: 108.00
Flags: fp asimd evtstrm crc32 cpuid
Linux pi64 4.19.67-v8-174fcab91765-p4-bis+ #2 SMP PREEMPT
Tue Aug 27 13:58:09 GMT 2019 aarch64 GNU/Linux
|
Pi 3B+ Cortex A53
Code: | Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1400.0000
CPU min MHz: 600.0000
BogoMIPS: 38.40
Flags: fp asimd evtstrm crc32 cpuid
Linux pi64 4.19.67-v8-174fcab91765-bis+ #2 SMP PREEMPT
Tue Aug 27 13:29:20 GMT 2019 aarch64 GNU/Linux
|
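For anyone wanting to pick individual items out of the lscpu listing, the "Key: value" layout is simple to parse. The following Python sketch is an illustration only and is not how the benchmarks (which are C programs) collect the data. The function name parse_lscpu is hypothetical and the sample text is a few lines copied from the Pi 4B listing above.

```python
def parse_lscpu(text):
    """Turn 'Key: value' lines, as printed by lscpu, into a dictionary."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


# A few lines copied from the Pi 4B listing above
sample = """\
Architecture:        aarch64
CPU(s):              4
Model name:          Cortex-A72
CPU max MHz:         1500.0000
"""

cpu = parse_lscpu(sample)
print(cpu["Model name"], float(cpu["CPU max MHz"]), "MHz")
```

The Linux version line contains colons in its time stamp, so it is best kept out of the text passed to a parser like this.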
Whetstone Benchmark
Performance of the gcc 9 compilations for the Pi 4B was effectively the same as the earlier versions. The Pi 3B+ results indicated improvements, but this was due to the EXP type function calculations. The new compilation included a minor tweak for the IF tests, to avoid overoptimisation.
Code: | System MHz MWIPS ------ MFLOPS------ ----- ------ -MOPS-------- ------
1 2 3 COS EXP FIXPT IF EQUAL
gcc 9
Pi 3B+ 1400 1482 384 404 329 27.4 28.2 1712 2042 1362
Pi 4B 1500 2330 522 533 398 60.4 40.3 2493 2984 997
Pi4/3B+ 1.07 1.57 1.36 1.32 1.21 2.21 1.43 1.46 1.46 0.73
gcc 9/6
Pi 4B 1.00 1.03 1.00 1.00 1.00 1.10 1.01 1.00 N/A 1.00
|
Dhrystone Benchmark
The gcc 9 compilations led to no real difference in performance.
Code: | Compiled DMIPS
System MHz DMIPS /MHz
gcc 9
Pi 3B+ 1400 3896 2.78
Pi 4B 1500 8190 5.46
Pi4/3B+ 1.07 2.10
gcc 9/6
Pi 4B 1.00 1.00
|
Linpack Benchmarks
The new gcc 9 compilations produced the same performance as the older versions, within the variations normally seen on this benchmark.
Code: | MFLOPS
System MHz DP SP SP NEON
gcc 9
Pi 3B+ 1400 396.2 571.3 566.7
Pi 4B 1500 1110.6 2052.4 1887.5
Pi4/3B+ 1.07 2.80 3.59 3.33
gcc 9/6
Pi 4B 1.00 1.05 1.04 0.96
|
Livermore Loops Benchmark
There were some performance differences in gcc 9 results but average speeds were quite similar.
Code: | MFLOPS
System MHz Maximum Average Geomean Harmean Minimum
gcc 9
Pi 3B+ 1400 1000.7 347.8 308.0 275.2 117.3
Pi 4B 1500 2744.5 962.5 768.2 596.2 132.1
Pi4/3B+ 1.07 2.74 2.77 2.49 2.17 1.13
gcc 9/6
Pi 4B 1.00 1.10 1.08 1.05 0.99 0.62
MFLOPS Of 24 Kernels
gcc9
Pi 3B+ 565 320 319 535 227 207 1001 581 541 234 171 248
121 160 293 280 456 547 337 287 367 190 386 209
Pi 4B 2146 989 970 965 390 785 2386 2479 1879 632 500 973
134 423 814 670 726 1177 450 397 1675 561 818 283
Pi 4B/ 3.80 3.09 3.04 1.80 1.72 3.80 2.38 4.27 3.48 2.70 2.93 3.93
Pi 3B+ 1.10 2.65 2.78 2.39 1.59 2.15 1.33 1.39 4.56 2.95 2.12 1.35
Min 1.10 Max 4.56
gcc 9/6
Pi 4B 1.06 0.99 0.98 1.02 1.05 1.06 1.17 1.00 0.95 0.83 1.01 1.11
0.61 1.05 1.00 0.94 0.96 1.05 1.01 1.00 1.58 1.35 1.00 1.00
Min 0.61 Max 1.58
|
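For readers unfamiliar with the three kinds of mean in the summary, the following sketch calculates them for the 24 Pi 3B+ kernel results listed above. It will not reproduce the summary row exactly, presumably because that row covers more measurements than the 24 rounded values shown; that explanation is an assumption.

```python
from statistics import mean, geometric_mean, harmonic_mean

# MFLOPS of the 24 kernels, Pi 3B+ gcc 9, copied from the table above
PI3_KERNELS = [565, 320, 319, 535, 227, 207, 1001, 581, 541, 234, 171, 248,
               121, 160, 293, 280, 456, 547, 337, 287, 367, 190, 386, 209]

average = mean(PI3_KERNELS)            # dominated by the fastest kernels
geomean = geometric_mean(PI3_KERNELS)  # treats ratios evenly
harmean = harmonic_mean(PI3_KERNELS)   # dominated by the slowest kernels

print(f"Average {average:.1f}  Geomean {geomean:.1f}  Harmean {harmean:.1f}")
```

The harmonic mean is never higher than the geometric mean, which is never higher than the average, matching the ordering in the summary table.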
MemSpeed Benchmark
Many Pi 4B/3B+ comparisons were similar, but the gcc 9 compilation gave rise to a number of changes, compared with the older version. The latter was slightly faster using some double precision calculations, but gcc 9 produced speed increases between 1.3 and 2.6 times with integers and single precision, the latter providing a maximum of 5.5 GFLOPS compared with 3.5.
Code: | Gentoo 64b Pi 3B+ gcc 9
Memory Reading Speed Test 64 Bit gcc 9 by Roy Longbottom
Start of test Thu Sep 26 12:43:02 2019
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
8 4565 5140 7847 5439 5827 7928 6161 4288 4334
16 4445 5145 7942 5362 5829 7941 6207 4358 4310
32 4094 4853 7251 4750 5396 7250 6139 4312 4303
64 3767 4748 7008 4320 5309 6954 5461 4097 4100
128 3912 4799 7319 4442 5486 7325 5328 4133 4134
256 3838 4824 6934 4400 5426 7247 5354 3844 4010
512 2570 3661 3826 2773 3975 4912 3302 2532 3017
1024 878 2120 2228 938 2182 2239 1098 1215 1361
2048 848 1961 2046 1016 2008 2033 758 805 814
4096 856 1961 2040 1007 1984 2036 839 863 856
8192 885 1940 1956 1013 1921 1957 844 865 868
Max MFLOPS 571 1286
Gentoo 64b Pi 4B
8 13385 21854 24413 13416 23402 24404 11630 9316 9315
16 13527 22116 24712 13551 23675 24722 11800 9447 9446
32 12170 19681 21716 12164 21047 21740 11403 9511 9514
64 11402 19074 20086 11613 20057 20101 9317 8651 8663
128 11770 20334 21119 12124 21389 21087 8003 8136 8136
256 11740 20281 21115 12029 21384 21111 8098 8184 8015
512 11671 20255 20873 12058 21561 21072 7721 6684 6929
1024 2818 7728 5968 3957 7839 7831 4691 3610 3832
2048 1884 3436 3743 1880 3578 3281 2597 2717 2696
4096 1284 2399 2555 1446 3802 3625 2420 2630 2632
8192 1913 3759 3459 1937 3798 3772 2468 2482 2482
Max MFLOPS 1691 5529
Comparison 64b Pi4/3B+
8 2.93 4.25 3.11 2.47 4.02 3.08 1.89 2.17 2.15
16 3.04 4.30 3.11 2.53 4.06 3.11 1.90 2.17 2.19
256 3.06 4.20 3.05 2.73 3.94 2.91 1.51 2.13 2.00
512 4.54 5.53 5.46 4.35 5.42 4.29 2.34 2.64 2.30
1024 3.21 3.65 2.68 4.22 3.59 3.50 4.27 2.97 2.82
4096 1.50 1.22 1.25 1.44 1.92 1.78 2.88 3.05 3.07
8192 2.16 1.94 1.77 1.91 1.98 1.93 2.92 2.87 2.86
Comparison Pi4B gcc 9/6
8 0.86 1.56 1.95 0.86 1.67 1.57 1.02 1.00 1.19
16 0.86 1.57 1.94 0.86 1.67 1.58 1.00 1.00 1.20
256 0.96 1.78 1.97 0.99 1.81 1.75 1.00 1.00 1.02
512 1.04 1.89 2.05 1.10 1.93 1.82 0.96 1.06 1.06
1024 0.83 2.93 1.82 1.17 2.42 2.63 1.25 0.91 0.95
4096 0.91 1.30 1.37 0.78 2.28 1.97 0.97 1.06 1.09
8192 1.00 1.96 1.80 1.27 2.00 1.99 0.99 1.11 1.00
|
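The Max MFLOPS lines appear to follow directly from the best MB/second figures of the x[m]=x[m]+s*y[m] test. The sketch below assumes two words are read (x and y) for two floating point operations (multiply and add), so that MFLOPS equals millions of words read per second; the function name is hypothetical.

```python
def max_mflops(mb_per_second, bytes_per_word):
    """MFLOPS for x[m] = x[m] + s*y[m].

    Two words are read per element and two operations are executed,
    so MFLOPS is the same as millions of words read per second.
    """
    return mb_per_second / bytes_per_word


# Best Dble and Sngl MB/S from the first two columns of the tables above
results = {
    "Pi 3B+ DP": max_mflops(4565, 8),
    "Pi 3B+ SP": max_mflops(5145, 4),
    "Pi 4B DP": max_mflops(13527, 8),
    "Pi 4B SP": max_mflops(22116, 4),
}
for name, mflops in results.items():
    print(f"{name}: {mflops:.0f} MFLOPS")
```

These agree with the 571, 1286, 1691 and 5529 MFLOPS reported in the logs.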
NeonSpeed Benchmark
With the gcc 9 compilation, the Pi 4B continued to be significantly faster than the 3B+. Comparing Pi 4B gcc 9 and gcc 6 results, performance was essentially the same when NEON Intrinsic Functions were used but, as with MemSpeed, normal compilations were faster, averaging around 80% faster in this case.
Code: | Gentoo 64b Pi 3B+ gcc 9
NEON Speed Test 64 Bit gcc 9 Thu Sep 26 12:45:07 2019
Vector Reading Speed in MBytes/Second
Memory Float v=v+s*v Int v=v+v+s Neon v=v+v
KBytes Norm Neon Norm Neon Float Int
16 5118 5461 6218 5298 6024 6011
32 4894 4980 5886 4855 5431 5445
64 4713 4557 5669 4452 4868 4867
128 4824 4703 5814 4598 4995 4946
256 4857 4750 5815 4643 5028 4964
512 3694 2652 4265 2675 3003 3007
1024 2085 1135 2204 1132 1128 1077
4096 2008 1021 2070 1033 1056 1036
16384 1912 1061 2042 958 1065 1047
65536 1783 1062 1873 769 1080 1081
Gentoo 64b Pi 4B
16 21046 14555 16698 13502 14565 16970
32 17797 12061 14509 10785 12282 13112
64 19517 10860 15252 9981 10793 11419
128 19839 10936 15468 10120 11001 11579
256 20094 10838 15603 10229 10885 11566
512 20076 10846 15469 10185 10943 11667
1024 7016 3040 6826 3211 3417 3548
4096 3945 1940 3599 1950 1768 1937
16384 3394 2017 3386 1963 1848 2014
65536 3484 2043 3839 1765 2060 2049
Comparison 64b Pi4/3B+
16 4.11 2.67 2.69 2.55 2.42 2.82
32 3.64 2.42 2.47 2.22 2.26 2.41
64 4.14 2.38 2.69 2.24 2.22 2.35
128 4.11 2.33 2.66 2.20 2.20 2.34
256 4.14 2.28 2.68 2.20 2.16 2.33
512 5.43 4.09 3.63 3.81 3.64 3.88
1024 3.36 2.68 3.10 2.84 3.03 3.29
4096 1.96 1.90 1.74 1.89 1.67 1.87
16384 1.78 1.90 1.66 2.05 1.74 1.92
65536 1.95 1.92 2.05 2.30 1.91 1.90
Comparison Pi4B gcc 9/6
16 1.51 0.89 1.34 0.89 0.91 0.99
32 1.86 1.12 1.62 1.12 1.12 1.19
64 1.83 0.92 1.48 0.93 0.89 0.94
128 1.86 0.92 1.50 0.95 0.92 0.97
256 1.88 0.91 1.51 0.95 0.91 0.96
512 1.98 0.95 1.59 1.00 0.97 1.01
1024 2.37 0.94 2.37 1.00 1.04 1.21
4096 2.28 1.13 2.08 1.10 1.11 1.12
16384 2.13 1.05 1.86 1.02 0.96 1.21
65536 1.77 1.18 1.92 1.01 1.09 1.01
Average 1.95 1.00 1.73 1.00 0.99 1.06
|
BusSpeed Benchmark
Results from the gcc 9 compilations were virtually the same as those from gcc 6.
Code: | Gentoo 64b Pi 3B+ gcc 9
BusSpeed 64 Bit gcc 9 Thu Sep 26 12:51:15 2019
Reading Speed 4 Byte Words in MBytes/Second
Memory Inc32 Inc16 Inc8 Inc4 Inc2 Read
KBytes Words Words Words Words Words All
16 3860 4283 4677 4901 5022 3591
32 2228 2433 2989 4740 4912 3629
64 700 697 1299 2200 3310 3348
128 637 636 1208 2064 3151 3396
256 597 600 1161 1945 3105 3377
512 232 194 500 884 1629 2350
1024 118 131 159 440 692 1682
4096 91 99 197 463 923 1878
16384 119 117 200 392 775 1606
65536 101 105 238 464 873 1876
Gentoo 64b Pi 4B Rd All Rd All
4B/3B+ gcc 9/6
16 4815 5060 5573 5808 5741 8935 2.49 1.09
32 1534 1828 2967 4254 4930 7825 2.16 1.04
64 792 1007 1988 3269 4844 8062 2.41 1.02
128 730 950 1881 3133 5007 8162 2.40 1.04
256 733 955 1901 3128 5071 8236 2.44 1.04
512 737 952 1885 3139 5058 8237 3.51 1.07
1024 374 539 1047 1884 3177 5537 3.29 0.97
4096 235 255 497 990 1975 3386 1.80 0.82
16384 239 263 501 913 1984 3973 2.47 0.97
65536 239 237 502 995 1984 3971 2.12 0.98
|
Fast Fourier Transforms Benchmarks
The Pi 4B/3B+ performance gains were similar using both gcc 9 and gcc 6 compiled programs, but the gcc 9 compilation produced some faster FFT1 speeds, as shown in the Pi 4B gcc 9/6 comparisons.
Code: | Gentoo Pi 3B+ gcc 9 Gentoo Pi 4B gcc 9
Size FFT1 FFT3 FFT1 FFT3
K SP DP SP DP SP DP SP DP
1 0.15 0.16 0.15 0.14 0.04 0.04 0.04 0.04
2 0.34 0.39 0.31 0.31 0.08 0.13 0.08 0.09
4 0.89 1.00 0.82 0.79 0.19 0.33 0.19 0.21
8 2.19 2.70 1.66 1.89 0.71 0.74 0.46 0.46
16 4.32 5.94 4.88 5.32 1.63 2.06 1.17 1.09
32 12.47 24.05 9.59 14.82 3.73 4.03 2.44 3.09
64 66.46 116.11 26.53 36.64 7.92 27.12 5.46 9.06
128 169.06 268.02 63.65 84.00 43.28 100.75 16.09 22.00
256 401.86 600.72 141.83 195.69 192.57 254.20 37.08 49.76
512 853.48 1266.96 329.26 435.23 590.20 651.24 82.54 110.23
1024 1966.69 2808.07 721.36 981.82 1463.15 1749.37 202.20 251.71
Pi 4B/3B+ Pi 4B gcc 9/6
1 3.53 3.77 3.63 3.78 0.97 0.98 1.02 1.18
2 4.39 3.05 3.97 3.64 1.00 1.06 1.46 1.08
4 4.75 3.03 4.23 3.81 1.34 1.16 0.98 1.06
8 3.06 3.62 3.62 4.10 1.10 1.76 1.00 1.09
16 2.65 2.89 4.16 4.89 1.32 1.41 0.98 1.00
32 3.34 5.97 3.93 4.79 1.53 1.68 1.02 1.03
64 8.39 4.28 4.85 4.04 1.92 1.88 0.99 1.03
128 3.91 2.66 3.96 3.82 1.93 1.51 1.01 1.12
256 2.09 2.36 3.82 3.93 1.20 1.43 1.06 1.15
512 1.45 1.95 3.99 3.95 0.95 1.17 1.09 1.21
1024 1.34 1.61 3.57 3.90 0.85 1.07 1.06 1.21
|
Multithreaded Benchmarks Next _________________ Regards
Roy |
|
|
 |
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Sun Oct 20, 2019 10:16 am Post subject: GCC 9 Multithreaded Benchmarks |
|
|
GCC 9 Multithreaded Benchmarks
Compiling with gcc 9 did not provide across the board performance gains. Mainly considering 4 thread results, speeds within 10% of the gcc 6 figures were measured on MP-Whetstone, MP-Dhrystone and MP-Linpack. Some gains and some losses applied to MP-RandMem, MP-MFLOPS NEON, OpenMP MFLOPS and OpenMP MemSpeed. Real gains were demonstrated by MP-BusSpeed, MP-MFLOPS SP and MP-MFLOPS DP.
MP-Whetstone Benchmark
Most of the important Pi 4B results were virtually the same as those from the earlier gcc 6 compilations, but the 3B+ COS and EXP speeds were somewhat faster using gcc 9, as in the single core Whetstone results.
Code: |
MWIPS MFLOPS MFLOPS MFLOPS Cos Exp Fixpt If Equal
Threads 1 2 3 MOPS MOPS MOPS MOPS MOPS
Gentoo 64b Pi 3B+ gcc 9
1 1500 381 384 328 27.2 28.1 5098 2049 1368
2 3001 766 762 656 54.5 56.5 10130 4102 2737
4 5940 1488 1528 1304 107.8 111.5 19741 7665 5423
8 5987 1528 1666 1267 107.4 117.9 25862 9518 5666
Overall Seconds 4.98 1T, 4.98 2T, 5.16 4T, 10.30 8T
Gentoo 64b Pi 4B gcc 9
1 2364 530 532 395 60.6 40.0 7426 2242 996
2 4724 1060 1052 789 121.0 80.4 14853 4476 1994
4 9413 2103 2112 1579 241.0 159.5 29161 8638 3968
8 9848 2671 2453 1644 247.0 168.1 37385 11636 4108
Overall Seconds 5.00 1T, 5.01 2T, 5.07 4T, 10.20 8T
Comparison 64b Pi4/3B+
1 1.58 1.39 1.38 1.20 2.23 1.42 1.46 1.09 0.73
2 1.57 1.38 1.38 1.20 2.22 1.42 1.47 1.09 0.73
4 1.58 1.41 1.38 1.21 2.24 1.43 1.48 1.13 0.73
8 1.64 1.75 1.47 1.30 2.30 1.43 1.45 1.22 0.72
Comparison Pi4B gcc 9/6
1 0.99 0.99 0.99 1.00 1.00 1.03 N/A 0.50 1.00
2 0.99 1.00 0.97 0.99 1.00 1.03 N/A 0.50 1.00
4 0.99 0.99 1.02 1.01 1.00 1.03 N/A 0.49 1.00
8 1.00 1.02 0.89 1.01 1.01 1.05 N/A 0.52 1.01
|
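One way to judge MP efficiency from these tables is to divide each MWIPS figure by the single thread result. The sketch below does this for the Pi 4B gcc 9 numbers; the helper name `speedups` is hypothetical.

```python
def speedups(mwips_by_threads):
    """Return the speed of each thread count relative to one thread."""
    base = mwips_by_threads[1]
    return {threads: mwips / base for threads, mwips in mwips_by_threads.items()}


# MWIPS for 1, 2, 4 and 8 threads, Pi 4B gcc 9, from the table above
PI4_MWIPS = {1: 2364, 2: 4724, 4: 9413, 8: 9848}
CORES = 4

gain = speedups(PI4_MWIPS)
for threads, ratio in gain.items():
    # efficiency is measured against the cores that can actually be used
    efficiency = ratio / min(threads, CORES)
    print(f"{threads}T speedup {ratio:.2f} efficiency {efficiency:.0%}")
```

With four cores, eight threads cannot go much beyond the four thread speed, which is why overall seconds double at 8T.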
MP-Dhrystone Benchmark
As indicated for the earlier gcc 6 results, this benchmark produces inconsistent performance and does not provide a good example of multithreading but, in this case, gcc 6 and gcc 9 results were similar, with a reasonably high Pi 4B/3B+ performance gain.
Code: |
Threads 1 2 4 8
Seconds 0.54 0.67 1.23 2.46
VAX MIPS rating Pi 3B+ 6 4207 6804 7401 7415
VAX MIPS rating Pi 4B 64 8880 7828 8303 8314
Pi 4B/3B+ 64 bits 2.11 1.15 1.12 1.12
VAX MIPS rating Pi 4B 32 5539 5739 6735 7232
Pi 4B 64 bits/32 bits 1.60 1.36 1.23 1.15
Gentoo gcc 9
VAX MIPS rating Pi 3B+ 6 4062 6504 8242 8343
VAX MIPS rating Pi 4B 64 8298 7683 7870 7978
Pi 4B/3B+ 64 bits 2.04 1.18 0.95 0.96
Pi 4B gcc 9/6 0.93 0.98 0.95 0.96
|
MP Linpack Benchmark (Single Precision NEON)
This benchmark is even less suitable for demonstrating multithreading performance, and that was the intention, as the overheads of frequently starting threads are too high. Hence, tests with no threading are included. Results from the old and new compilations were again similar, confirming the high Pi 4B/3B+ performance gains with no threading.
Code: |
MFLOPS 0 to 4 Threads, N 100, 500, 1000
Threads None 1 2 4
Gentoo 64b Pi 3B+ gcc 9
N 100 641.6 63.0 62.3 61.9
N 500 326.6 229.3 222.6 227.0
N 1000 320.1 275.0 274.3 275.2
Gentoo 64b Pi 4B gcc 9
N 100 2076.2 98.6 96.6 96.2
N 500 1327.1 631.9 632.5 639.2
N 1000 394.6 375.3 382.3 375.7
Comparison 64b Pi4/3B+
N 100 3.24 1.57 1.55 1.55
N 500 4.06 2.76 2.84 2.82
N 1000 1.23 1.36 1.39 1.37
Comparison Pi4B gcc 9/6
N 100 0.92 1.01 0.99 0.99
N 500 0.82 0.95 0.98 0.95
N 1000 0.99 0.92 0.94 0.94
|
MP BusSpeed (read only) Benchmark
Other than identifying the likely effects of burst reading on, for example, random access, the main area for comparison is reading all data. The Pi 4B behaved as expected, with the RdAll speed near twice that of data flow with word address increments of 2 (Inc2). The Pi 3B+ did not follow that normal pattern, so the 4B/3B+ comparisons are suspect.
Code: |
MB/Second Reading Data, 1, 2, 4 and 8 Threads
KB Inc32 Inc16 Inc8 Inc4 Inc2 RdAll
Gentoo 64b Pi 3B+ gcc 9
12.3 1T 3453 4178 4428 3543 3584 2335
2T 5594 7732 8086 6856 6924 4654
4T 9065 12522 13157 12942 13415 9209
8T 6661 10770 13266 11955 12573 8478
122.9 1T 640 646 1197 1970 2909 2272
2T 1030 1012 2006 3671 5784 4528
4T 1001 1041 2145 4266 8337 6729
8T 1043 1061 2123 4005 8133 8572
12288 1T 114 104 241 444 932 1352
2T 126 122 253 370 1005 1997
4T 104 138 197 471 1133 1745
8T 102 96 231 466 796 1893
Gentoo 64b Pi 4B gcc 9 RdAll Pi 4B
4B/3B+ gcc 9/6
12.3 1T 5573 5750 5057 5646 5800 9129 3.91 2.16
2T 7191 9038 10035 11020 11125 17757 3.82 2.27
4T 7023 12144 14591 17681 20490 29184 3.17 1.97
8T 7553 11837 12565 15640 18546 30517 3.60 2.33
122.9 1T 672 922 1864 3092 4744 7741 3.41 1.86
2T 577 947 2100 3051 8780 14975 3.31 1.83
4T 519 983 1884 3980 8701 18139 2.70 1.24
8T 515 951 1913 4181 8797 16899 1.97 1.24
12288 1T 230 261 499 1016 1678 3873 2.86 1.07
2T 276 225 418 925 1929 5629 2.82 0.90
4T 258 267 579 802 1749 5758 3.30 1.52
8T 214 213 538 1069 2145 4680 2.47 1.31
|
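The RdAll versus Inc2 relationship can be checked with a few divisions. This sketch uses the 122.9 KB rows from the tables above; the variable names are hypothetical.

```python
# (Inc2, RdAll) MB/second at 122.9 KB for 1, 2, 4 and 8 threads, from the tables above
PI4B = {1: (4744, 7741), 2: (8780, 14975), 4: (8701, 18139), 8: (8797, 16899)}
PI3B = {1: (2909, 2272), 2: (5784, 4528), 4: (8337, 6729), 8: (8133, 8572)}


def rdall_over_inc2(rows):
    """Ratio of read-all speed to the speed when reading every second word."""
    return {threads: rdall / inc2 for threads, (inc2, rdall) in rows.items()}


pi4_ratios = rdall_over_inc2(PI4B)
pi3_ratios = rdall_over_inc2(PI3B)
for threads in PI4B:
    print(f"{threads}T  Pi 4B {pi4_ratios[threads]:.2f}  Pi 3B+ {pi3_ratios[threads]:.2f}")
```

The Pi 4B ratios fall between about 1.6 and 2.1, whereas the Pi 3B+ ratios are mostly below 1.0.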
MP-RandMem Benchmark
Some moderate Pi4/3B+ performance gains were produced but the older version was, possibly, a little faster than the gcc 9 compilation.
Code: |
MB/Second Using 1, 2, 4 and 8 Threads
Serial Serial Random Random Serial Serial Random Random
KB+Thread Read RdWr Read RdWr Read RdWr Read RdWr
Gentoo 64b Pi 3B+ gcc 9 Gentoo 64b Pi 4B gcc 9
12.3 1T 4886 3581 4878 3590 5737 6884 5763 7537
2T 8723 3550 8724 3550 11536 7592 10238 6898
4T 16836 3498 17531 3509 21084 7575 15160 7390
8T 15777 3459 16783 3466 20089 7339 15311 7200
122.9 1T 3913 3346 987 972 5739 7231 2006 1906
2T 7285 3339 1753 964 10662 7217 1742 1896
4T 12354 3344 2350 972 10376 6741 1815 1812
8T 11841 3333 2300 962 10298 6937 1823 1848
12288 1T 1795 761 69 60 3477 905 181 162
2T 1915 735 118 60 3750 794 215 164
4T 2452 730 128 59 4669 968 259 162
8T 1805 755 137 60 3419 981 301 157
4 Thread 4 Thread
Comparison 64b Pi4/3B+ Comparison Pi4B gcc 9/6
12.3 4T 1.25 2.17 0.86 2.11 0.92 0.97 0.68 0.94
122.9 4T 0.84 2.02 0.77 1.86 0.95 0.93 0.98 0.94
12288 4T 1.90 1.33 2.02 2.75 1.00 1.03 0.78 0.95
|
MP-MFLOPS Benchmarks
There are three versions, single precision, double precision and single precision using NEON intrinsic functions. The single precision ones obtain up to 25 GFLOPS and half that for double precision.
On the Pi 4, the whole of the tests, in each program, can be completed in less than two seconds, probably not long enough for accurate comparisons.
Approximate performance gains, using gcc 9, indicate that the Pi 4B was between 3.5 and 4.5 times faster than the Pi 3B+, using cache based data, and around 30% faster when performance became RAM speed dependent. On the Pi 4B, gcc 9 results indicated some improvements in speed, compared to those from the earlier gcc 6 compilation, mainly on running the single precision version.
MP-MFLOPS SP
Code: |
MP-MFLOPS 64 Bit gcc 9 Thu Sep 26 12:36:54 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
Gentoo 64b Pi 3B+ gcc 9 Gentoo 64b Pi 4B gcc 9
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
MFLOPS
1T 827 805 371 3232 3157 2802 3162 3072 468 6754 6714 6340
2T 1608 1567 360 6420 6423 5286 6498 6029 496 13329 12397 7623
4T 1764 3142 400 11240 12355 6029 11709 6141 529 24825 25055 8723
8T 2548 2575 381 10813 11755 5827 10828 8158 493 19452 22190 8426
........... 64b Pi4/3B+ .......... .......... Pi4B gcc 9/6 ..........
1T 3.82 3.82 1.26 2.09 2.13 2.26 1.09 1.08 1.02 1.17 1.17 1.17
2T 4.04 3.85 1.38 2.08 1.93 1.44 1.14 1.14 1.09 1.22 1.11 0.96
4T 6.64 1.95 1.32 2.21 2.03 1.45 1.13 1.10 1.08 1.37 1.15 1.14
|
MP-MFLOPS DP
Code: |
MP-MFLOPS 64 Bit gcc 9 Double Precision Thu Sep 26 22:05:10 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
Gentoo 64b Pi 3B+ gcc 9 Gentoo 64b Pi 4B gcc 9
2 Ops/Word 32 Ops/Word 2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
MFLOPS
1T 384 350 127 1582 1546 1372 657 663 183 3283 3358 3169
2T 753 753 184 3109 3157 2645 3203 2690 223 6573 6353 4535
4T 1346 1330 194 4228 6099 3067 5799 3866 292 12432 12665 4906
8T 1234 1340 201 4888 5748 3190 5322 4583 269 10738 8891 4521
........... 64b Pi4/3B+ .......... .......... Pi4B gcc 9/6 ..........
1T 1.71 1.89 1.44 2.08 2.17 2.31 0.45 0.48 0.81 0.97 0.99 1.00
2T 4.25 3.57 1.21 2.11 2.01 1.71 1.13 0.96 0.98 0.98 0.94 1.00
4T 4.31 2.91 1.51 2.94 2.08 1.60 1.12 1.13 1.16 1.19 0.99 1.03
|
NEON MP MFLOPS SP
Code: |
MP-MFLOPS NEON Intrinsics 64 Bit gcc 9 Thu Sep 26 22:02:00 2019
FPU Add & Multiply using 1, 2, 4 and 8 Threads
Gentoo 64b Pi 3B+ gcc 9 Gentoo 64b Pi 4B gcc 9
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800 12.8 128 12800 12.8 128 12800
1T 769 765 354 3009 2967 2638 1233 1313 507 6451 6428 6224
2T 1315 1324 293 5863 5990 5097 6307 4824 389 12559 12784 7612
4T 1750 2647 380 10081 11250 5748 8101 5186 531 24762 24708 7902
8T 2180 2664 392 9719 11010 6368 6782 8444 504 22598 24113 7979
........... 64b Pi4/3B+ .......... .......... Pi4B gcc 9/6 ..........
1T 1.60 1.72 1.43 2.14 2.17 2.36 0.37 0.41 0.95 1.00 0.98 1.00
2T 4.80 3.64 1.33 2.14 2.13 1.49 1.37 0.78 0.70 0.96 0.98 0.90
4T 4.63 1.96 1.40 2.46 2.20 1.37 1.29 0.91 0.94 1.04 1.02 0.84
|
OpenMP MFLOPS Benchmark
As expected, this program uses all four CPU cores. A second compilation, the notOpenMP MFLOPS Benchmark, omits the OpenMP directives in order to use just one core.
The benchmark carries out the same calculations as MP-MFLOPS, with an additional section using 8 operations per data word read. It was a quick conversion from a benchmark that measures CUDA floating point performance. Hence, the meaningless titles included in the following example log file. Data sizes of 400 KB to 40 MB cover L2 cache and RAM.
Code: |
OpenMP MFLOPS64g9 Thu Sep 26 16:52:54 2019
Test 4 Byte Ops/ Repeat Seconds MFLOPS First All
Words Word Passes Results Same
Data in & out 100000 2 2500 0.124228 4025 0.929538 Yes
Data in & out 1000000 2 250 0.842066 594 0.992550 Yes
Data in & out 10000000 2 25 0.873622 572 0.999250 Yes
Data in & out 100000 8 2500 0.147889 13524 0.957117 Yes
Data in & out 1000000 8 250 0.904478 2211 0.995518 Yes
Data in & out 10000000 8 25 0.951405 2102 0.999549 Yes
Data in & out 100000 32 2500 0.324246 24673 0.890215 Yes
Data in & out 1000000 32 250 1.097993 7286 0.988088 Yes
Data in & out 10000000 32 25 1.045087 7655 0.998796 Yes
End of test Thu Sep 26 16:53:00 2019
|
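The MFLOPS column in this log is consistent with words x operations per word x repeat passes, divided by seconds. The following sketch shows the arithmetic using three lines of the log; the function name is hypothetical.

```python
def mflops(words, ops_per_word, passes, seconds):
    """Millions of floating point operations per second for one test line."""
    total_operations = words * ops_per_word * passes
    return total_operations / seconds / 1e6


# (words, ops/word, repeat passes, seconds) copied from the log above
small_2 = mflops(100000, 2, 2500, 0.124228)
medium_2 = mflops(1000000, 2, 250, 0.842066)
small_32 = mflops(100000, 32, 2500, 0.324246)

print(f"{small_2:.0f} {medium_2:.0f} {small_32:.0f}")
```

The passes are scaled so that each line executes the same number of operations for a given ops/word setting, whatever the data size.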
Following are results from gcc 9 compiled runs on the Pi 3B+ and Pi 4B for all threads and using the single thread one core version. Maximum speed was near the 25 GFLOPS obtained using MP-MFLOPS.
Pi 4B/Pi 3B+ performance improvements were mainly more than twice, using L2 cache or when the more CPU speed dependent 32 operations per word tests were used. The gcc 9/6 performance ratios indicate no real advantage of either compilation.
Code: |
gcc 9
Mbytes/ Pi 3B+ Pi 4B gcc 9 Pi 4B
Ops/Word 64b 64b 4B/3B+ gcc 9/6
All 1T All 1T All 1T All 1T
0.4/2 2341 795 4025 2236 1.72 2.81 0.75 0.80
4/2 381 362 594 403 1.56 1.11 1.06 0.72
40/2 401 387 572 493 1.43 1.27 1.05 0.84
0.4/8 6051 1906 13524 5373 2.24 2.82 0.88 0.97
4/8 1491 1352 2211 1948 1.48 1.44 0.99 0.92
40/8 1598 1418 2102 2308 1.32 1.63 0.89 1.00
0.4/32 12002 3185 24673 6786 2.06 2.13 1.21 1.20
4/32 5641 2809 7286 6385 1.29 2.27 0.90 1.17
40/32 6142 2809 7655 6415 1.25 2.28 0.90 1.17
|
MemSpeed OpenMP Benchmark
As indicated for the earlier version, this benchmark is not really suitable to demonstrate multithreading performance, as reflected in an example of the full results below.
Code: |
Memory Reading Speed Test OpenMP 64 Bit gcc 9 by Roy Longbottom
Start of test Thu Sep 26 15:12:19 2019
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
4 7616 8480 8749 7548 8520 8530 35856 18594 18601
8 8195 8660 8876 8147 5740 8365 37153 18878 18864
16 7992 7684 8189 8064 8139 8023 35774 18896 18898
32 8975 8535 8024 9048 8536 8512 37465 18392 19024
64 8622 7997 8057 8511 7953 7994 19618 16857 16701
128 11940 11637 11554 12101 11659 11498 13815 13417 13964
256 17008 17339 16359 17104 17396 17038 11877 12344 12376
512 17740 15986 18607 17522 18547 15612 12575 13616 13495
1024 7011 10208 10016 11310 5287 11413 7060 6279 10045
2048 7024 4201 7006 7017 6943 3225 2822 3386 3391
4096 3854 7002 7126 6912 7074 3985 2199 3127 3132
8192 2632 6950 7151 5291 2796 6813 2546 3091 2403
16384 7350 7073 3537 7583 5327 3200 2609 3053 1907
32768 7514 7616 7725 7807 2344 2936 2702 2559 3042
65536 7065 2937 7571 4306 7086 2975 2127 3017 2677
131072 1772 1779 2562 8092 2583 2800 2035 1866 2869
End of test Thu Sep 26 15:12:48 2019
|
As for MP-MFLOPS, there is a non-MP version to demonstrate performance when using a single CPU core. A summary of results and comparisons, in key areas with data from L1 cache, L2 cache and RAM, is shown below, using gcc 9 compilations on the Pi 3B+ and Pi 4B, plus earlier gcc 6 details.
The Pi 4B versus Pi 3B+ comparisons, for single core tests, indicate across the board 4B improvements that are not necessarily carried forward to the multithreaded tests, the highest gains being for single precision floating point calculations.
The single core single precision and integer test functions indicate faster speed from gcc 9 compilations, where calculations are involved, but not always affecting multithreading in the same way.
Typical MP versus non-MP performance ratios, for each group, were between 0.2 and 3.5; that is, one core could be up to five times faster than four cores.
Code: |
Memory x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m] x[m]=y[m]
KBytes Dble Sngl Int32 Dble Sngl Int32 Dble Sngl Int32
Used MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S MB/S
Gentoo Pi 4B gcc 6
8 8238 8906 3150 8308 9253 2339 29033 15749 2673
256 16320 17582 3236 17404 16671 2652 13683 14741 2411
65536 2462 2245 2390 7160 3945 2742 2746 2386 2259
Not OMP
8 15527 13976 15533 15504 14021 15537 11563 9311 7794
256 12236 11434 12096 12084 11740 12156 7883 8044 7818
65536 2047 2046 2037 2034 2054 2071 2567 2554 2547
Gentoo 64b Pi 4B gcc 9
8 8195 8660 8876 8147 5740 8365 37153 18878 18864
256 17008 17339 16359 17104 17396 17038 11877 12344 12376
65536 7065 2937 7571 4306 7086 2975 2127 3017 2677
Not OMP
8 13380 21857 24416 13414 23420 24400 11630 9313 9312
256 11705 20247 21090 12041 21382 21013 8081 8182 5919
65536 2030 3034 2135 2047 3035 2394 2550 2535 2546
Gentoo 64b Pi 3B+ gcc 9
8 3908 3630 9548 10512 5230 9273 13649 6850 9599
256 6730 3456 6358 9313 5346 9166 9375 5612 858
65536 2413 1137 2957 3982 1163 3052 808 904 897
Not OMP
8 4274 5139 7932 5442 5827 7934 6162 4334 4339
256 3703 4670 7152 4322 5378 7166 5452 4092 4094
65536 1035 1582 1649 1098 1616 1494 652 794 790
Pi 4B / Pi3B+ gcc 9
8 2.10 2.39 0.93 0.78 1.10 0.90 2.72 2.76 1.97
256 2.53 5.02 2.57 1.84 3.25 1.86 1.27 2.20 14.42
65536 2.93 2.58 2.56 1.08 6.09 0.97 2.63 3.34 2.98
Not OMP
8 3.13 4.25 3.08 2.46 4.02 3.08 1.89 2.15 2.15
256 3.16 4.34 2.95 2.79 3.98 2.93 1.48 2.00 1.45
65536 1.96 1.92 1.29 1.86 1.88 1.60 3.91 3.19 3.22
gcc 9 / gcc 6
8 0.99 0.97 2.82 0.98 0.62 3.58 1.28 1.20 7.06
256 1.04 0.99 5.06 0.98 1.04 6.42 0.87 0.84 5.13
65536 2.87 1.31 3.17 0.60 1.80 1.08 0.77 1.26 1.19
Not OMP
8 0.86 1.56 1.57 0.87 1.67 1.57 1.01 1.00 1.19
256 0.96 1.77 1.74 1.00 1.82 1.73 1.03 1.02 0.76
65536 0.99 1.48 1.05 1.01 1.48 1.16 0.99 0.99 1.00
|
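The MP versus non-MP ratios can be derived by dividing each OpenMP figure by the matching single core figure. This sketch does so for the Pi 4B gcc 9 results at 8 KB; the variable names are hypothetical.

```python
# MB/second at 8 KB, Pi 4B gcc 9, nine test columns, from the table above
OMP = [8195, 8660, 8876, 8147, 5740, 8365, 37153, 18878, 18864]
NOT_OMP = [13380, 21857, 24416, 13414, 23420, 24400, 11630, 9313, 9312]

# ratio above 1.0 means four cores beat one, below 1.0 means one core was faster
ratios = [mp / single for mp, single in zip(OMP, NOT_OMP)]
lowest, highest = min(ratios), max(ratios)

print(" ".join(f"{r:.2f}" for r in ratios))
print(f"lowest {lowest:.2f}, highest {highest:.2f}")
```

Only the x[m]=y[m] copy tests gain from OpenMP at this data size, with the calculation tests running slower than the single core version.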
_________________ Regards
Roy |
|
|
 |
Sakaki Guru


Joined: 21 May 2014 Posts: 409
|
Posted: Sun Oct 20, 2019 10:36 am Post subject: |
|
|
roylongbottom,
very interesting results as always!
Have you tried compiling any of your benchmarks with clang/llvm? This compiler is also included on the gentoo-on-rpi-64bit image, and produced some interesting differences wrt gcc 9 on other benchmarks (see e.g. this post ff). _________________ Regards,
sakaki |
|
|
 |
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Mon Oct 21, 2019 10:10 am Post subject: |
|
|
Sakaki wrote: | roylongbottom,
very interesting results as always!
Have you tried compiling any of your benchmarks with clang/llvm? This compiler is also included on the gentoo-on-rpi-64bit image, and produced some interesting differences wrt gcc 9 on other benchmarks (see e.g. this post ff). |
I have compiled a number of my benchmarks via clang and run them. In general floating point was a bit slower. Following are results for MP MFLOPS DP, both compiled using the same parameters:
Code: |
clang or gcc mpmflopsdp.c -lm -lrt -O3 -lpthread -march=armv8-a
|
There might be better options to use (for both) but current choices can be simply overwhelming. Anyway, the clang results were slower at 32 Ops/Word.
Code: | MP-MFLOPS 64 Bit clang Double Precision
FPU Add & Multiply using 1, 2, 4 and 8 Threads
2 Ops/Word 32 Ops/Word
KB 12.8 128 12800 12.8 128 12800
MFLOPS
1T 1643 1603 270 2447 2450 2418
2T 2424 3152 265 4798 4907 4274
4T 5884 3087 264 9252 9337 3889
8T 5240 4633 253 7960 8897 4322
MP-MFLOPS 64 Bit gcc 9 Double Precision
1T 1656 1567 259 3361 3360 3142
2T 2770 2951 291 6595 6592 4454
4T 5800 2766 302 12687 12910 4936
8T 2286 3149 299 11144 11815 4904
|
Following are disassembled details of the main inner loops. Both use SIMD instructions operating on pairs of 64 bit values (the .2d suffix), but clang fell short by not using fused multiply and add or subtract (fmla or fmls) instructions, which gcc 9 uses to reduce the arithmetic instruction count from 32 to 22.
Code: |
clang gcc 9
.LBB4_4 .L48:
ldr q20, [x11], #16 ldr q5, [x2]
subs x10, x10, #2 fadd v1.2d, v15.2d, v5.2d
fadd v21.2d, v20.2d, v30.2d fadd v7.2d, v13.2d, v5.2d
fadd v22.2d, v20.2d, v8.2d fadd v17.2d, v11.2d, v5.2d
fmul v21.2d, v21.2d, v31.2d fadd v16.2d, v5.2d, v9.2d
fmul v22.2d, v22.2d, v9.2d fmul v1.2d, v1.2d, v14.2d
fsub v21.2d, v21.2d, v22.2d fmls v1.2d, v12.2d, v7.2d
fadd v22.2d, v20.2d, v10.2d fadd v7.2d, v5.2d, v31.2d
fmul v22.2d, v22.2d, v11.2d fmla v1.2d, v10.2d, v17.2d
fadd v21.2d, v22.2d, v21.2d fadd v17.2d, v5.2d, v29.2d
fadd v22.2d, v20.2d, v12.2d fmls v1.2d, v16.2d, v8.2d
fmul v22.2d, v22.2d, v13.2d fadd v16.2d, v5.2d, v27.2d
fsub v21.2d, v21.2d, v22.2d fmla v1.2d, v7.2d, v30.2d
fadd v22.2d, v20.2d, v14.2d fadd v7.2d, v5.2d, v25.2d
fmul v22.2d, v22.2d, v15.2d fmls v1.2d, v17.2d, v28.2d
fadd v21.2d, v22.2d, v21.2d fadd v17.2d, v5.2d, v23.2d
fadd v22.2d, v20.2d, v16.2d fmla v1.2d, v16.2d, v26.2d
fmul v22.2d, v22.2d, v7.2d fadd v16.2d, v5.2d, v21.2d
fsub v21.2d, v21.2d, v22.2d fadd v5.2d, v5.2d, v19.2d
fadd v22.2d, v20.2d, v6.2d fmls v1.2d, v7.2d, v24.2d
fmul v22.2d, v22.2d, v5.2d fmla v1.2d, v17.2d, v22.2d
fadd v21.2d, v22.2d, v21.2d fmls v1.2d, v16.2d, v20.2d
fadd v22.2d, v20.2d, v4.2d fmla v1.2d, v5.2d, v18.2d
fmul v22.2d, v22.2d, v3.2d str q1, [x2], 16
fsub v21.2d, v21.2d, v22.2d cmp x2, x3
fadd v22.2d, v20.2d, v2.2d bne .L48
fmul v22.2d, v22.2d, v1.2d
fadd v21.2d, v22.2d, v21.2d
fadd v22.2d, v20.2d, v17.2d
fadd v20.2d, v20.2d, v19.2d
fmul v22.2d, v22.2d, v18.2d
fmul v20.2d, v20.2d, v0.2d
fsub v21.2d, v21.2d, v22.2d
fadd v20.2d, v20.2d, v21.2d
str q20, [x12]
mov x12, x11
b.ne .LBB4_4
|
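The 32 versus 22 instruction counts can be accounted for as follows. The sketch assumes, from the listings above, that the calculation has the form (x+a)*b - (x+c)*d + (x+e)*f and so on, where each term needs an add and a multiply, and each term after the first needs one more add or subtract to combine it. The function name is hypothetical.

```python
def instruction_counts(ops_per_word):
    """Return (unfused, fused) arithmetic instruction counts per loop pass.

    Assumes the operation count has the form 3n - 1 for n terms:
    n adds, n multiplies and n - 1 combining adds or subtracts.
    """
    terms = (ops_per_word + 1) // 3
    assert 3 * terms - 1 == ops_per_word, "unexpected operation count"
    adds = terms
    multiplies = terms
    combines = terms - 1
    unfused = adds + multiplies + combines
    # With fmla/fmls, each combining step absorbs one multiply;
    # only the first term keeps a separate fmul.
    fused = adds + 1 + combines
    return unfused, fused


unfused_32, fused_32 = instruction_counts(32)
print(f"32 ops/word: {unfused_32} unfused, {fused_32} with fmla/fmls")
```

For 32 operations per word this gives 11 terms, matching the 11 fadd, 1 fmul and 10 fmla/fmls instructions in the gcc 9 listing.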
_________________ Regards
Roy |
|
|
 |
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Mon Nov 11, 2019 5:06 pm Post subject: Raspberry Pi 4B 64 Bit Stress Tests |
|
|
Raspberry Pi 4B 64 Bit Stress Tests
The first stress tests used cover the central processor, for which an extra program was produced to measure the environment whilst running. Variable parameters are:
Passes and sampling seconds to determine running time. If the stress test also has sampling periods, it is normally not possible to synchronise them but approximate periods can be matched.
CPU MHz - This can vary faster than any sampling time based on seconds, but the general trend can be useful. Tests that measure speed over sampling periods provide a better indication.
Core Voltage - This appears to vary a little, reason unknown.
CPU Temperature - assuming that it is correct, as it changes slowly, this is the most useful measurement.
PMIC temperature - No issue so far with Power Management Integrated Circuit temperatures.
Code: | ###################################################
Parameters - upper or lower case
./RPiHeatMHzVolts2 passes 33 secs 20 log 12
or
./RPiHeatMHzVolts2 P 33 S 20 L 12
For 33 samples at 20 second intervals, log file RPiHeatMHz12.txt
To cover 10 minute test
###################################################
Temperature and CPU MHz Measurement
Start at Mon Oct 28 20:49:52 2019
Using 33 samples at 20 second intervals
Seconds
0.0 ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
20.0 ARM MHz=1500, core volt=0.8437V, CPU temp=73.0'C, pmic temp=62.8'C
40.3 ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=66.5'C
60.5 ARM MHz=1500, core volt=0.8437V, CPU temp=79.0'C, pmic temp=69.4'C
80.7 ARM MHz=1500, core volt=0.8437V, CPU temp=80.0'C, pmic temp=70.3'C
101.0 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C
121.2 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
141.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
161.7 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
181.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
|
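Log lines of this kind are easy to summarise afterwards. The sketch below is a hypothetical helper, not part of RPiHeatMHzVolts2, that assumes the line layout shown above and reports average MHz, maximum CPU temperature and the number of throttled samples.

```python
import re

# Matches: seconds, ARM MHz, core volts, CPU temperature, PMIC temperature
LINE = re.compile(
    r"([\d.]+)\s+ARM MHz=\s*(\d+), core volt=([\d.]+)V, "
    r"CPU temp=([\d.]+)'C, pmic temp=([\d.]+)'C"
)

# Four samples copied from the log above
LOG = """\
0.0 ARM MHz=1500, core volt=0.8490V, CPU temp=61.0'C, pmic temp=55.2'C
121.2 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
141.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C
181.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C
"""


def summarise(log_text, full_speed_mhz=1500):
    """Return (average MHz, maximum CPU temperature, throttled sample count)."""
    samples = [LINE.search(line) for line in log_text.splitlines()]
    samples = [m for m in samples if m]  # ignore headings and blank lines
    mhz = [int(m.group(2)) for m in samples]
    temps = [float(m.group(4)) for m in samples]
    throttled = sum(1 for value in mhz if value < full_speed_mhz)
    return sum(mhz) / len(mhz), max(temps), throttled


average_mhz, max_temp, throttled = summarise(LOG)
print(f"Average {average_mhz:.0f} MHz, max {max_temp:.1f} C, {throttled} throttled samples")
```

As noted in the parameter list, MHz can change faster than the sampling interval, so an average calculated this way only indicates the general trend.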
Next are results for the High Performance Linpack that runs for a long time, significantly increasing CPU temperatures and slowing down, without a cooling fan being in place. These results can be compared with those for the 32 bit version, available in the report
https://www.researchgate.net/publication/334561068_Raspberry_Pi_4B_Stress_Tests_Including_High_Performance_Linpack
This shows that the same sort of performance levels as the 64 bit version are obtained, with and without a cooling fan.
Following the HPL results here are some for my integer and floating point stress tests. Although further comparative tests are needed to be conclusive, it does seem that the 64 bit floating point versions are faster than the 32 bit varieties and subject to lower temperature increases.
High Performance Linpack Stress Test
The earlier HPL benchmark results quoted obtained speeds of 8.1 GFLOPS on a cold start and 10.8 GFLOPS later, with a cooling fan in operation for both. The first results below were run without a fan, with a room temperature around 21°C, producing 7.6 GFLOPS on a cold start. Then average CPU frequency came out at 1056 MHz, with an average temperature of 80.3°C.
The second results followed a warm reboot to use a different version of Gentoo with HPL installed, obtaining 5.54 GFLOPS, with severe CPU frequency throttling, down to 600 MHz, with temperatures up to 86°C in the log below. Averages were 790 MHz and 80.3°C.
Shortly afterwards, with the fan in place, the Pi ran at 1500 MHz continuously, achieving 10.4 GFLOPS, with a maximum temperature of 64°C.
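The GFLOPS figures quoted by HPL follow from the problem size N and the running time. The sketch below assumes the standard HPL operation count of 2/3 N^3 + 3/2 N^2 and uses N = 20000 with the two run times reported for the tests without a fan; the function name is hypothetical.

```python
def hpl_gflops(n, seconds):
    """GFLOPS for solving an n x n system, using the standard HPL operation count."""
    operations = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return operations / seconds / 1e9


cold_start = hpl_gflops(20000, 702.81)   # first run, no fan
warm_reboot = hpl_gflops(20000, 963.16)  # second run, no fan

print(f"{cold_start:.3f} GFLOPS, {warm_reboot:.3f} GFLOPS")
```

The same operation count spread over a longer time is what turns the throttling to 600 MHz into the drop from 7.6 to 5.54 GFLOPS.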
Code: | ================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 20000 128 2 2 702.81 7.589e+00
HPL_pdgesv() start time Sat Aug 24 10:42:58 2019
HPL_pdgesv() end time Sat Aug 24 10:54:41 2019
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0008453 ...... PASSED
================================================================================
Example 2 - Note different sumchecks again
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 20000 128 2 2 963.16 5.538e+00
HPL_pdgesv() start time Tue Oct 29 11:51:10 2019
HPL_pdgesv() end time Tue Oct 29 12:07:13 2019
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0009005 ...... PASSED
================================================================================
Temperature and CPU MHz Measurement
Start at Tue Oct 29 11:50:27 2019
Using 40 samples at 30 second intervals
Seconds
0.0 ARM MHz=1500, core volt=0.8542V, CPU temp=63.0'C, pmic temp=58.0'C
30.0 ARM MHz=1500, core volt=0.8542V, CPU temp=79.0'C, pmic temp=69.4'C
60.3 ARM MHz=1000, core volt=0.8542V, CPU temp=83.0'C, pmic temp=72.2'C
91.6 ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
122.2 ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=74.1'C
152.7 ARM MHz= 750, core volt=0.8490V, CPU temp=83.0'C, pmic temp=74.1'C
183.2 ARM MHz=1000, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
213.8 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
244.3 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
274.7 ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
305.2 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
335.6 ARM MHz=1000, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
366.1 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
396.6 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
427.2 ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
457.5 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
488.0 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
518.6 ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
549.0 ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
579.6 ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
610.1 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
640.6 ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
671.1 ARM MHz= 750, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.9'C
701.6 ARM MHz= 600, core volt=0.8490V, CPU temp=86.0'C, pmic temp=76.0'C
732.0 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
762.4 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
792.9 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
823.4 ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.9'C
853.9 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
884.4 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
914.9 ARM MHz= 600, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
945.3 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.9'C
975.8 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
1006.3 ARM MHz= 750, core volt=0.8490V, CPU temp=84.0'C, pmic temp=76.0'C
1036.7 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=76.0'C
1067.0 ARM MHz= 750, core volt=0.8490V, CPU temp=85.0'C, pmic temp=74.1'C
Averages 790 84.1 75.5
|
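The GFLOPS figures in the two HPL reports can be cross-checked from N and the run time. The following is a minimal sketch, assuming the usual HPL operation count of 2/3 N^3 + 3/2 N^2 for solving an N x N system; the function name hpl_gflops is illustrative and not part of HPL.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """GFLOPS for an N x N solve, assuming the 2/3 N^3 + 3/2 N^2 operation count."""
    flops = (2.0 / 3.0) * n**3 + (3.0 / 2.0) * n**2
    return flops / seconds / 1e9


# N and Time values taken from the two result blocks above.
no_fan_cold = hpl_gflops(20000, 702.81)
no_fan_warm = hpl_gflops(20000, 963.16)

print(f"cold start: {no_fan_cold:.3f} GFLOPS")
print(f"warm start: {no_fan_warm:.3f} GFLOPS")
print(f"warm/cold:  {100 * no_fan_warm / no_fan_cold:.0f}%")
```

With the same N, the ratio of the two speeds is simply the inverse ratio of the run times, which gives a quick measure of how much the throttling cost.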
Integer Stress Test - MP-IntStress64, MP-IntStress
As for my other CPU stress tests, the 4 and 8 thread results are shown, from running in benchmarking mode. Run time parameters are also provided, including the commands used for the particular tests.
In this case, summaries of separate tests for L1 cache, L2 cache and RAM are given. During the 10 minute sessions, the cache tests were mainly running at 1000 MHz, with those using RAM at the full speed of 1500 MHz. No temperatures above 84°C were recorded.
Examining the full detail of the first test indicated that average CPU MHz and measured MB/second were around 75% of the maximum.
Code: |
KB KB MB Same All
Secs Thrds 16 160 16 Sumcheck Tests
3.0 4 28715 26652 3345 5A5A5A5A Yes
3.0 8 30292 26310 3334 AAAAAAAA Yes
./RPiHeatMHzVolts2 passes 66 secs 10 log 34 - used for all 10 minute stress tests
==== Stress Test Parameters - upper or lower case, only first letter counts ====
Threads 1, 2, 4, 8, 16, 32 KB between 12 and 15624 Log < 100 Minutes any > 0
./MP-IntStress64 KB 16 Threads 8 Mins 10 Log 34
Seconds MB/sec
0.0 ARM MHz=1500, core volt=0.8455V, CPU temp=62.0'C, pmic temp=57.1'C
10.0 ARM MHz=1500, core volt=0.8455V, CPU temp=69.0'C, pmic temp=62.8'C 28695
20.2 ARM MHz=1500, core volt=0.8402V, CPU temp=73.0'C, pmic temp=64.6'C 28729
152.5 ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=72.2'C 21523
305.5 ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C 20026
448.2 ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C 19611
601.1 ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C 19199
%Min/Max 66.9
./MP-IntStress64 KB 160 Threads 8 Mins 10 Log 34
Seconds MB/sec
0.0 ARM MHz=1500, core volt=0.8402V, CPU temp=64.0'C, pmic temp=57.1'C
10.0 ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C 26323
20.2 ARM MHz=1500, core volt=0.8402V, CPU temp=75.0'C, pmic temp=66.5'C 26140
152.9 ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=74.1'C 18016
306.5 ARM MHz=1000, core volt=0.8402V, CPU temp=83.0'C, pmic temp=74.1'C 17306
449.8 ARM MHz=1000, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C 17248
603.3 ARM MHz= 750, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C 16832
%Min/Max 63.9
./MP-IntStress64 KB 16000 Threads 8 Mins 10 Log 34
Seconds MB/sec
0.0 ARM MHz=1500, core volt=0.8402V, CPU temp=66.0'C, pmic temp=60.9'C
10.0 ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C 3372
20.3 ARM MHz=1500, core volt=0.8402V, CPU temp=72.0'C, pmic temp=62.8'C 3369
155.2 ARM MHz=1500, core volt=0.8402V, CPU temp=76.0'C, pmic temp=68.4'C 3365
309.8 ARM MHz=1500, core volt=0.8402V, CPU temp=79.0'C, pmic temp=69.4'C 3367
454.4 ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C 3367
599.7 ARM MHz=1500, core volt=0.8402V, CPU temp=78.0'C, pmic temp=70.3'C 3368
%Min/Max 99.8
|
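The %Min/Max figures can be approximated from the log lines themselves. This is a rough sketch, assuming the log format shown above, with the speed as the last field after the pmic temperature; min_max_percent is a hypothetical helper, not part of the stress test programs. As only a few lines of each log are quoted here, the result can differ slightly from the value calculated over the full log.

```python
import re

# Speed (MB/sec or MFLOPS) is the number that follows the pmic temperature.
RATE = re.compile(r"pmic temp=[\d.]+'C\s+(\d+)\s*$")


def min_max_percent(log_lines):
    """Return minimum speed as a percentage of maximum speed."""
    rates = []
    for line in log_lines:
        match = RATE.search(line)
        if match:  # the 0.0 second line has no speed and is skipped
            rates.append(int(match.group(1)))
    return 100.0 * min(rates) / max(rates)


# Lines quoted above for the KB 160 (L2 cache) test.
sample = [
    "0.0 ARM MHz=1500, core volt=0.8402V, CPU temp=64.0'C, pmic temp=57.1'C",
    "10.0 ARM MHz=1500, core volt=0.8402V, CPU temp=71.0'C, pmic temp=62.8'C 26323",
    "20.2 ARM MHz=1500, core volt=0.8402V, CPU temp=75.0'C, pmic temp=66.5'C 26140",
    "152.9 ARM MHz=1000, core volt=0.8402V, CPU temp=82.0'C, pmic temp=74.1'C 18016",
    "603.3 ARM MHz= 750, core volt=0.8402V, CPU temp=84.0'C, pmic temp=74.1'C 16832",
]
print(f"%Min/Max {min_max_percent(sample):.1f}")
```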
Single Precision Floating Point Stress Test - MP-FPUStress64, MP-FPUStress
Two sets of result summaries are provided below, both using 1280 KB memory space and 8 threads. With four cores, this results in data being in L2 cache (4 x 160 KB) to run at full speed, with additional overhead of moving data to/from RAM. One test uses 8 operations per word, with 32 in the other. With hot starts, neither reached a CPU temperature of 84°C and had similar performance degradation at the highest temperatures.
After writing the above, the 32 bit stress test was repeated, with results shown below. Although not conclusive from a single run, they indicate that the impact was more severe than in the 64 bit run, with CPU speed samples reducing to 600 MHz, higher temperatures and a larger performance degradation.
Code: |
Ops/ KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8
4.6 T4 2 9223 7520 519 40392 76406 99700
6.0 T8 2 9520 10471 545 40392 76406 99700
11.3 T4 8 19087 21040 2044 54764 85092 99820
12.9 T8 8 19747 21107 2016 54764 85092 99820
22.2 T4 32 25732 26704 9160 35206 66015 99520
24.1 T8 32 25708 25770 8927 35206 66015 99520
==== Stress Test Parameters - upper or lower case, only first letter counts ====
Threads 1,2,4,8,16,32,64 KB 12 to 15624 Ops/Wordd 2,8,32 Log<100 Minutes any>0
./MP-FPUStress64 KB 1280 T 8 Ops 8 Mins 10 Log 33
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8437V, CPU temp=64.0'C, pmic temp=59.0'C
10.0 ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C 17309
20.2 ARM MHz=1500, core volt=0.8437V, CPU temp=75.0'C, pmic temp=66.5'C 18018
101.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 14224
204.2 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 12806
306.8 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=73.1'C 12447
409.4 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C 11870
501.6 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 12191
604.1 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 12169
%Min/Max 65.9
./MP-FPUStress64 KB 1280 T 8 Ops 32 Mins 10 Log 33
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8437V, CPU temp=65.0'C, pmic temp=59.0'C
10.0 ARM MHz=1500, core volt=0.8437V, CPU temp=72.0'C, pmic temp=65.6'C 22634
20.2 ARM MHz=1500, core volt=0.8437V, CPU temp=76.0'C, pmic temp=67.5'C 22992
101.9 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 18629
204.0 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C 16674
306.3 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 16448
408.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 16158
500.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 16081
603.0 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 15553
%Min/Max 67.6
================================================================================
32 Bit Version ./MP-FPUStress KB 1280 T 8 Ops 32 Mins 10 Log 73
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8560V, CPU temp=56.0'C, pmic temp=50.5'C
10.0 ARM MHz=1500, core volt=0.8560V, CPU temp=70.0'C, pmic temp=60.9'C 20233
20.7 ARM MHz=1500, core volt=0.8560V, CPU temp=74.0'C, pmic temp=64.6'C 20221
106.4 ARM MHz=1000, core volt=0.8560V, CPU temp=83.0'C, pmic temp=70.3'C 14173
204.3 ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=73.1'C 13115
302.2 ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C 12650
400.2 ARM MHz= 750, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C 11957
508.8 ARM MHz=1000, core volt=0.8455V, CPU temp=85.0'C, pmic temp=74.1'C 11485
585.1 ARM MHz= 600, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C 11454
606.9 ARM MHz=1000, core volt=0.8455V, CPU temp=84.0'C, pmic temp=74.1'C 11242
%Min/Max 55.6
|
Double Precision Floating Point Stress Test - MP-FPUStress64DP, MP-FPUStressDP
Below are full results for a 10 minute test using the double precision floating point stress test, with data in L2 cache with four cores in use. Although the measured MFLOPS was greater than that obtained by HPL Linpack, the same range of high temperatures and performance degradation was not generated.
The 32 bit version was also rerun, producing similar results to those at 64 bits.
Code: | Ops/ KB KB MB KB KB MB
Secs Thrd Word 12.8 128 12.8 12.8 128 12.8
8.9 T4 2 5024 4589 257 40395 76384 99700
11.5 T8 2 5089 5545 280 40395 76384 99700
21.7 T4 8 10259 10011 1068 54805 85108 99820
24.7 T8 8 10239 10824 1036 54805 85108 99820
43.1 T4 32 12940 13200 4497 35159 66065 99521
46.9 T8 32 13200 13049 4557 35159 66065 99521
==== Stress Test Parameters - upper or lower case, only first letter counts ====
Threads 1,2,4,8,16,32,64 KB 12 to 15624 Ops/Wordd 2,8,32 Log<100 Minutes any>0
./MP-FPUStress64DP KB 1280 T 8 Ops 32 Mins 10 Log 31
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8437V, CPU temp=63.0'C, pmic temp=57.1'C
10.0 ARM MHz=1500, core volt=0.8437V, CPU temp=71.0'C, pmic temp=62.8'C 12718
20.2 ARM MHz=1500, core volt=0.8437V, CPU temp=74.0'C, pmic temp=66.5'C 12755
30.5 ARM MHz=1500, core volt=0.8437V, CPU temp=77.0'C, pmic temp=68.4'C 12750
40.7 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C 12755
50.9 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=70.3'C 12183
61.2 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 11358
71.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 10922
81.6 ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C 10333
91.8 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 9948
102.0 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 9692
112.3 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 9466
122.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 9217
132.8 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=74.1'C 9181
143.0 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 9145
153.2 ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C 9043
163.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 8921
173.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 9838
183.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 8755
194.1 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8737
204.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 8721
214.7 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 8721
224.9 ARM MHz=1500, core volt=0.8437V, CPU temp=83.0'C, pmic temp=73.1'C 8670
235.1 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C 8619
245.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8592
255.6 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=72.2'C 8592
265.9 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8540
276.2 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=73.1'C 8488
286.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8547
296.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8510
307.0 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8473
317.2 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8507
327.5 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8541
337.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8544
347.9 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8464
358.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8531
368.4 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8495
378.7 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8460
388.9 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8514
399.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8484
409.4 ARM MHz=1000, core volt=0.8437V, CPU temp=82.0'C, pmic temp=74.1'C 8454
419.6 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8459
429.8 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8489
440.1 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8472
450.3 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8428
460.6 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8384
470.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8384
481.2 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8387
491.4 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8391
501.7 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8244
511.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8346
522.1 ARM MHz= 750, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8272
532.4 ARM MHz=1000, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8272
542.6 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8329
552.8 ARM MHz= 750, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8239
563.1 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8183
573.3 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8129
583.6 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8343
593.9 ARM MHz=1000, core volt=0.8437V, CPU temp=84.0'C, pmic temp=74.1'C 8266
604.1 ARM MHz=1000, core volt=0.8437V, CPU temp=85.0'C, pmic temp=74.1'C 8190
|
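For a full log like the one above, it can be useful to see what share of the samples were taken at each clock speed. The following is a sketch only, assuming the same log line format; clock_residency is an illustrative name. Note that each reading is an instantaneous sample, so the shares are only an indication of time spent at each speed.

```python
import re
from collections import Counter

# "ARM MHz= 750" has a space after the equals sign, hence the optional whitespace.
MHZ = re.compile(r"ARM MHz=\s*(\d+)")


def clock_residency(log_lines):
    """Return ({MHz: percent of samples}, average MHz) for a monitor log."""
    clocks = []
    for line in log_lines:
        match = MHZ.search(line)
        if match:
            clocks.append(int(match.group(1)))
    counts = Counter(clocks)
    shares = {mhz: 100.0 * n / len(clocks) for mhz, n in sorted(counts.items())}
    return shares, sum(clocks) / len(clocks)


# Four lines taken from the log above.
sample = [
    "61.2 ARM MHz=1500, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 11358",
    "71.4 ARM MHz=1000, core volt=0.8437V, CPU temp=81.0'C, pmic temp=72.2'C 10922",
    "81.6 ARM MHz=1000, core volt=0.8437V, CPU temp=80.0'C, pmic temp=72.2'C 10333",
    "522.1 ARM MHz= 750, core volt=0.8437V, CPU temp=83.0'C, pmic temp=74.1'C 8272",
]
shares, average = clock_residency(sample)
print(shares, average)
```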
OpenGL + 3 x Livermore Loops - liverloopsPi64Rg9, liverloopsPi64, liverloopsPiA7R
In order to make it easier to run these stress tests, lxterminal was installed and the script shown below used to open four terminal windows and run the environmental monitor program plus three copies of a modified Loops benchmark that allows different log files to be specified. This executes 72 loops for a minimum time of 12 seconds each. The second script file is provided to run the kitchen display tests for 16 minutes in full screen mode. A further terminal was opened to run the vmstat resource monitor.
The tests were run twice, without and with a cooling fan in place. Results are shown below. In this case, the no fan tests were not that much slower, obtaining averages of 77 to 80% of the fan cooled speeds on OpenGL FPS, CPU MHz and total Loops MFLOPS.
These results were produced with all programs compiled by gcc 9 and not run on a hot day. Compared with performance using 32 bit versions, detailed in this 32 Bit Report, the 64 bit results were far better, but the former were produced by an older compiler and run on a hot day. The tests were repeated, using 32 bit programs produced by the later gcc 8 compiler.
As before, the 64 bit gcc 9 Livermore Loops and OpenGL single core benchmarks were faster than the new 32 bit versions, in this case by 14% for the former and 40% for the latter. On running the stress test, both had similar average CPU MHz, CPU temperature and PMIC temperature, with 64 bit FPS and MFLOPS maintaining a performance advantage, with similar ratios to those obtained from single core tests.
Code: | run.sh
lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 21 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 22 &
lxterminal -e ./liverloopsPi64Rg9 Seconds 12 Log 23
runogl.sh
export vblank_mode=0 &
./videogl64g9 Test 6 Mins 16 Log 20
No Fan With Fan
Seconds MHz CPU C PMIC C FPS MHz CPU C PMIC C FPS
0 1500 57 51 1500 37 32
30 1500 75 63 27 1500 53 44 27
60 1500 76 68 29 1500 53 44 28
90 1500 81 72 25 1500 58 50 27
120 1500 81 70 23 1500 55 48 26
150 1000 82 74 23 1500 57 49 29
180 1000 80 72 22 1500 54 47 27
210 1000 81 72 24 1500 55 46 29
240 1500 80 72 26 1500 54 44 28
270 1500 81 72 27 1500 55 47 28
300 1000 82 72 22 1500 56 48 29
330 1500 82 72 24 1500 56 50 29
360 1000 82 72 24 1500 56 49 28
390 1000 82 72 22 1500 58 50 26
420 1000 83 72 22 1500 57 50 26
450 1000 82 74 19 1500 56 50 30
480 1000 82 74 21 1500 56 48 28
510 1000 82 72 22 1500 54 46 29
540 1000 81 72 22 1500 55 47 30
570 1500 81 72 24 1500 55 47 30
600 1000 82 74 24 1500 57 49 30
630 1500 81 72 23 1500 58 51 29
660 1000 82 72 23 1500 57 50 29
690 1000 83 73 22 1500 59 51 28
720 1000 83 72 21 1500 57 51 28
750 1000 82 74 21 1500 57 50 29
780 1000 84 74 19 1500 54 47 29
810 1000 82 72 19 1500 56 48 29
840 1000 82 72 20 1500 54 46 29
870 1000 82 72 20 1500 53 46 30
900 1000 82 72 23 1500 49 42 31
Average 1161 81 71 23 1500 55 47 29
Minimum 1000 57 51 19 1500 37 32 26
Maximum 1500 84 74 29 1500 59 51 31
% Hot/Cold
Average 77 68 66 80
Minimum 67 65 61 73
Maximum 100 70 69 94
MFLOPS Average Geomean Harmean Average Geomean Harmean
1 684 562 453 898 732 590
2 716 574 451 887 712 571
3 716 566 438 895 724 582
Total %Hot/Cold
MFLOPS 79 78 77
|
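The % Hot/Cold figures for speed are the no fan results as a percentage of those with the fan. A minimal sketch of the calculation follows, using the rounded values from the table above; hot_cold_percent is a hypothetical helper. As the inputs are already rounded, some of the other published percentages may differ by one from what this produces.

```python
def hot_cold_percent(no_fan, with_fan):
    """No fan total as a rounded percentage of the fan cooled total."""
    return round(100.0 * sum(no_fan) / sum(with_fan))


# Average MFLOPS of the three Livermore Loops copies, from the table above.
mflops_no_fan = [684, 716, 716]
mflops_with_fan = [898, 887, 895]

print("Total MFLOPS %Hot/Cold:", hot_cold_percent(mflops_no_fan, mflops_with_fan))
print("Average MHz  %Hot/Cold:", hot_cold_percent([1161], [1500]))
```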
Input/Output Stress Test - burnindrive264g9, burnindrive2
This is essentially the same as my program used during hundreds of UK Government and University computer acceptance trials during the 1970s and 1980s, with some significant achievements. Burnindrive writes four files, using 164 blocks of 64 KB, repeated 16 times (164.0 MB), with each block containing a unique data pattern. The files are then read for two minutes, in a sort of random sequence, with data and file ID checked for correct values. Then each block (unique pattern) is read numerous times, over one second, again with checking for correct values. Total time is normally about 5 minutes for all tests, with default parameters. The data patterns are shown below, followed by run time parameters, then examples of the results provided, including added calculations of speed.
Code: | Patterns
No. Hex No. Hex No. Hex No. Hex No. Hex No. Hex No. Hex
1 0 25 800000 49 3 73 FF 97 FFFFDFFF 121 FFFFEAAA 145 FFFFF0F0
2 1 26 1000000 50 33 74 FF00FF 98 FFFFBFFF 122 FFFFAAAA 146 FFF0F0F0
3 2 27 2000000 51 333 75 1FF 99 FFFF7FFF 123 FFFEAAAA 147 F0F0F0F0
4 4 28 4000000 52 3333 76 3FF 100 FFFEFFFF 124 FFFAAAAA 148 FFFFFFE0
5 8 29 8000000 53 33333 77 7FF 101 FFFDFFFF 125 FFEAAAAA 149 FFFF83E0
6 10 30 10000000 54 333333 78 FFF 102 FFFBFFFF 126 FFAAAAAA 150 FE0F83E0
7 20 31 20000000 55 3333333 79 1FFF 103 FFF7FFFF 127 FEAAAAAA 151 FFFFFFC0
8 40 32 40000000 56 33333333 80 3FFF 104 FFEFFFFF 128 FAAAAAAA 152 FFFC0FC0
9 80 33 1 57 7 81 7FFF 105 FFDFFFFF 129 EAAAAAAA 153 FFFFFF80
10 100 34 5 58 1C7 82 FFFF 106 FFBFFFFF 130 AAAAAAAA 154 FFE03F80
11 200 35 15 59 71C7 83 FFFFFFFF 107 FF7FFFFF 131 FFFFFFFC 155 FFFFFF00
12 400 36 55 60 1C71C7 84 FFFFFFFE 108 FEFFFFFF 132 FFFFFFCC 156 FF00FF00
13 800 37 155 61 71C71C7 85 FFFFFFFD 109 FDFFFFFF 133 FFFFFCCC 157 FFFFFE00
14 1000 38 555 62 F 86 FFFFFFFB 110 FBFFFFFF 134 FFFFCCCC 158 FFFFFC00
15 2000 39 1555 63 F0F 87 FFFFFFF7 111 F7FFFFFF 135 FFFCCCCC 159 FFFFF800
16 4000 40 5555 64 F0F0F 88 FFFFFFEF 112 EFFFFFFF 136 FFCCCCCC 160 FFFFF000
17 8000 41 15555 65 F0F0F0F 89 FFFFFFDF 113 DFFFFFFF 137 FCCCCCCC 161 FFFFE000
18 10000 42 55555 66 1F 90 FFFFFFBF 114 BFFFFFFF 138 CCCCCCCC 162 FFFFC000
19 20000 43 155555 67 7C1F 91 FFFFFF7F 115 FFFFFFFE 139 FFFFFFF8 163 FFFF8000
20 40000 44 555555 68 1F07C1F 92 FFFFFEFF 116 FFFFFFFA 140 FFFFFE38 164 FFFF0000
21 80000 45 1555555 69 3F 93 FFFFFDFF 117 FFFFFFEA 141 FFFF8E38
22 100000 46 5555555 70 3F03F 94 FFFFFBFF 118 FFFFFFAA 142 FFE38E38
23 200000 47 15555555 71 7F 95 FFFFF7FF 119 FFFFFEAA 143 F8E38E38
24 400000 48 55555555 72 1FC07F 96 FFFFEFFF 120 FFFFFAAA 144 FFFFFFF0
Sequences - First 16
No. File No. File No. File No. File
1 0 1 2 3 5 0 2 1 3 9 0 3 1 2 13 0 1 2 3
2 1 2 3 0 6 1 3 2 0 10 1 0 3 2 14 1 2 3 0
3 2 3 0 1 7 2 0 1 3 11 2 1 0 3 15 2 3 0 1
4 3 0 2 1 8 3 1 2 0 12 3 2 1 0 16 3 0 2 1
###########################################################################
Run Time Parameters - Upper or Lower Case
Default
R or Repeats Data size, multiplier of 10.25 MB, more or less 16
P or Patterns Number of patterns for smaller files < 164 164
M or Minutes Large file reading time 2
L or Log Log file name extension 0 to 99 0
S or Seconds Time to read each block, last section 1
F or FilePath For other than SD card or SD card directory
C or CacheData Omit O_DIRECT on opening files to allow caching No
O or OutputPatterns Log patterns and file sequences used as above No
D or DontRunReadTests Or only run write tests No
Format ./burnindrive2 Repeats 16, Minutes 2, Log 0, Seconds 1
or ./burnindrive2 R 16, M 2, L 0, S 1
###########################################################################
Examples of Results Main SD Card Default Parameters
File 1 164.00 MB written in 14.66 seconds - 11.2 MB/second
To File 4 164.00 MB written in 12.15 seconds - 13.5 MB/second
Read passes 1 x 4 Files x 164.00 MB in 0.33 minutes - 33.1 MB/second
To Read passes 7 x 4 Files x 164.00 MB in 2.28 minutes - 33.6 MB/second
Passes in 1 second(s) for each of 164 blocks of 64KB: - 164 measurements
580 580 580 580 580 580 580 580 580 580 580
580 580 580 580 580 580 580 580 580 580 580
95120 read passes of 64KB blocks in 2.76 minutes - 36.8 MB/second
|
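The file size and speed calculations above can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions, not the program's own code: the function names are illustrative, and the 0.064 MB per block factor used for the final section is inferred from the quoted results rather than taken from the source.

```python
BLOCK_KB = 64          # size of each block
BLOCKS_PER_PASS = 164  # one block per data pattern


def file_size_mb(repeats: int = 16) -> float:
    """File size for a given Repeats parameter (10.25 MB per repeat)."""
    return BLOCKS_PER_PASS * BLOCK_KB * repeats / 1024


def mb_per_second(megabytes: float, seconds: float) -> float:
    return megabytes / seconds


def block_read_mb_per_second(passes: int, minutes: float,
                             mb_per_block: float = 0.064) -> float:
    """Speed for the last section; mb_per_block is an inferred assumption."""
    return passes * mb_per_block / (minutes * 60)


size = file_size_mb()
print(f"file size    {size:.2f} MB")
print(f"write file 1 {mb_per_second(size, 14.66):.1f} MB/second")
print(f"read 7 x 4   {mb_per_second(7 * 4 * size, 2.28 * 60):.1f} MB/second")
print(f"block reads  {block_read_mb_per_second(95120, 2.76):.1f} MB/second")
```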
CPU + Main SD + USB + LAN Test
A system test was run using the following script file, comprising commands to run programs to monitor the environment, and others to exercise the main SD card, two USB 3 drives, 1 Gbps Ethernet and CPU floating point with two threads. The programs were run via the script file so that they all started at the same time, as indicated in the summaries below. They also all ran for between 12 and 13 minutes. Results from running each program by itself (BI) are also shown, often not indicating much improvement. Performance is not as high as shown by other benchmarks, probably because data transfers are based on 64 KB block sizes and all data in each block is checked for correctness.
A snapshot of vmstat system performance is also provided. The bo and bi KB/second writing and reading speeds are essentially the same as the sum of those reported by the programs handling the main and USB drives. LAN speeds are not included in vmstat. Total CPU utilisation (us + sy) is shown to be nearly 90% at the start of writing and closer to 75% on reading, representing average utilisation per core, or the equivalent of at least three cores at 100%. The following results show variations in performance with time.
Code: | ############################### Script File ###############################
lxterminal -e ./RPiHeatMHzVolts2 Passes 35 Seconds 30 Log 20 &
lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 Log 21 &
lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 \
FilePath /run/media/demouser/PATRIOT Log 22 &
lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 \
FilePath /run/media/demouser/REMIXOSSYS Log 23 &
lxterminal -e ./burnindrive264g9 Seconds 4 Minutes 1 \
FilePath /media/public/test Log 24 &
lxterminal -e ./MP-FPUStress64 KB 256 T 2 Ops 32 Mins 12 Log 33
vmstat 10 96 > vmstat.txt
############################################################################
Main SD Drive Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 16:00:06 2019
Write 164 MB x files 4 53.6 seconds = 12.2 MB/second (BI 12.7)
Read 164 MB x files 3 x 4 67.2 seconds = 29.3 MB/second (BI 33.6)
Read 329480 x 64 KB 659.4 seconds = 32.0 MB/second (BI 36.8)
============================================================
USB 3 Drive 1 Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 15:59:31 2019
Write 164 MB x files 4 17.5 seconds = 37.5 MB/second (BI 68.3)
Read 164 MB x files 6 x 4 72.0 seconds = 54.7 MB/second (BI 75.0)
Read 735800 x 64 KB 657.6 seconds = 71.6 MB/second (BI 66.5)
============================================================
USB 3 Drive 2 Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 15:59:57 2019
Write 164 MB x files 4 37.4 seconds = 17.5 MB/second (BI 23.8)
Read 164 MB x files 3 x 4 75.6 seconds = 26.0 MB/second (BI 28.5)
Read 282740 x 64 KB 660.0 seconds = 27.4 MB/second (BI 29.8)
============================================================
1 Gbps LAN Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 15:59:35 2019
Write 164 MB x files 4 18.1 seconds = 36.2 MB/second (BI 55.7)
Read 164 MB x files 3 x 4 74.4 seconds = 26.4 MB/second (BI 34.0)
Read 303920 x 64 KB 659.4 seconds = 29.5 MB/second (BI 45.3)
============================================================
MP-Threaded-MFLOPS 64 Bit v1.1 Tue Nov 5 15:47:03 2019
End of test Tue Nov 5 15:59:13 2019
2 core GFLOPS 10.9 to 7.4 with CPU throttling.
See RPiHeatMHzVolts2 results where detail is included
============================================================
From vmstat 10 second sampling
Secs procs ---------memory---------- ---swap-- -----io---- --system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
10 5 3 0 3059800 94956 346060 0 0 14 63204 17819 19051 51 38 2 9 0
20 3 2 0 3058696 95248 346704 0 0 14265 60713 17613 18789 51 33 4 12 0
60 4 2 0 3061196 95668 343572 0 0 93479 7577 24239 24987 54 19 4 23 0
70 4 3 0 3050632 95684 353600 0 0 112115 24 24496 25316 54 20 12 14 0
710 3 3 0 3058696 96532 349460 0 0 132992 16 18936 22387 53 22 3 22 0
720 5 1 0 3058728 96548 349452 0 0 134400 13 20635 23842 54 23 1 23 0
|
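The CPU utilisation and I/O comments can be checked from the vmstat rows. Below is a minimal sketch, assuming the row layout shown above, that is, standard vmstat columns with an added leading Secs column; cpu_load and io_kb_per_second are illustrative names.

```python
def cpu_load(row: str, cores: int = 4):
    """Return (us + sy percent, equivalent number of cores at 100%)."""
    fields = row.split()
    user, system = int(fields[-5]), int(fields[-4])  # last five are us sy id wa st
    busy = user + system
    return busy, cores * busy / 100.0


def io_kb_per_second(row: str):
    """Return (bi, bo); indices allow for the added Secs column."""
    fields = row.split()
    return int(fields[9]), int(fields[10])


writing = "10 5 3 0 3059800 94956 346060 0 0 14 63204 17819 19051 51 38 2 9 0"
reading = "720 5 1 0 3058728 96548 349452 0 0 134400 13 20635 23842 54 23 1 23 0"

print(cpu_load(writing), io_kb_per_second(writing))
print(cpu_load(reading), io_kb_per_second(reading))
```

At 720 seconds, bi at 134400 KB/second is about 131 MB/second, close to the 32.0 + 71.6 + 27.4 = 131.0 MB/second reported for the final read tests on the three drives.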
Speeds and Temperature - These tests were run without an active cooling fan, resulting in some CPU throttling, with clock speed down to 1000 MHz some of the time, when the temperature reached 80°C. The MP-Threaded-MFLOPS dual core performance measurements have been added to the environmental details, mainly indicating the effects of throttling.
The last burnindrive results record the number of read passes in 4 seconds, in a table comprising 14 lines of 11 recordings and one with 10, over approximately 11 minutes. The average burnindrive results for each line are provided below, not exactly synchronised, but giving an indication of changes in throughput with time. Total passes and percentage degradation are also shown, the latter not being as severe as CPU speed reductions.
Code: | Temperature and CPU MHz Measurement + MP-FPUStress64 2 Core MFLOPS
Start at Tue Nov 5 15:47:03 2019
Using 25 samples at 30 second intervals
Seconds MFLOPS
0.0 ARM MHz=1500, core volt=0.8560V, CPU temp=66.0'C, pmic temp=59.0'C
30.0 ARM MHz=1500, core volt=0.8560V, CPU temp=75.0'C, pmic temp=65.6'C 10890
60.2 ARM MHz=1500, core volt=0.8560V, CPU temp=78.0'C, pmic temp=68.4'C 10551
90.4 ARM MHz=1500, core volt=0.8560V, CPU temp=80.0'C, pmic temp=70.3'C 10549
120.6 ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C 10452
150.8 ARM MHz=1500, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C 9862
181.1 ARM MHz=1000, core volt=0.8560V, CPU temp=81.0'C, pmic temp=70.3'C 9482
211.4 ARM MHz=1500, core volt=0.8560V, CPU temp=82.0'C, pmic temp=72.2'C 9137
241.6 ARM MHz=1500, core volt=0.8507V, CPU temp=81.0'C, pmic temp=72.2'C 9132
271.9 ARM MHz=1000, core volt=0.8507V, CPU temp=82.0'C, pmic temp=70.3'C 9122
302.2 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 9389
332.4 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 8550
362.7 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 9043
392.9 ARM MHz=1500, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C 8045
423.3 ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C 8174
453.6 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 8444
483.9 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 8335
514.3 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 7951
544.6 ARM MHz=1500, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 8125
574.8 ARM MHz=1500, core volt=0.8455V, CPU temp=83.0'C, pmic temp=72.2'C 8078
605.1 ARM MHz=1000, core volt=0.8455V, CPU temp=81.0'C, pmic temp=72.2'C 8280
635.4 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 7845
665.7 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 7761
696.0 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=73.1'C 8341
726.2 ARM MHz=1000, core volt=0.8455V, CPU temp=82.0'C, pmic temp=72.2'C 7407
Passes in 4 seconds for each of 164 blocks of 64KB
Seconds Main SD USB 1 USB 2 LAN Total %First
44 2013 4522 1884 1915 10333 100
88 2007 4533 1838 1911 10289 100
132 2016 4496 1760 1809 10082 98
176 2011 4536 1785 1845 10178 99
220 2002 4493 1729 1913 10136 98
264 1971 4262 1751 1904 9887 96
308 1980 4540 1747 1911 10178 99
352 2002 4464 1660 1845 9971 96
396 1987 4442 1629 1844 9902 96
440 1964 4453 1585 1771 9773 95
484 1995 4504 1635 1731 9864 95
528 1989 4229 1696 1762 9676 94
572 1947 4616 1684 1833 10080 98
616 2013 4476 1660 1798 9947 96
660 2262 4758 1826 2022 10868 105
|
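The %First column and the comparison with CPU speed reductions can be sketched as follows, using a few of the published totals. The helper name percent_of_first is hypothetical, and as the totals are themselves rounded averages, an occasional result can differ by one from the table.

```python
def percent_of_first(values):
    """Each value as a rounded percentage of the first one."""
    first = values[0]
    return [round(100.0 * v / first) for v in values]


# Total passes at 44, 528 and 660 seconds, from the table above.
io_totals = [10333, 9676, 10868]
# 2 core MFLOPS at 30 and 726 seconds, from the environmental log above.
cpu_mflops = [10890, 7407]

print("I/O passes %First:", percent_of_first(io_totals))
print("CPU MFLOPS %First:", percent_of_first(cpu_mflops))
```

On these figures, the worst I/O line kept well over 90% of its initial throughput, while the CPU test finished at about two thirds of its starting speed.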
########################################################################
New Files At ResearchGate
The detailed Raspberry Pi 4B 64 Bit Benchmarks and Stress Tests.pdf report has now been uploaded to ResearchGate :
https://www.researchgate.net/publication/337165767_Raspberry_Pi_4B_64_Bit_Benchmarks_and_Stress_Tests
along with the archive file containing the benchmarks and source codes:
https://www.researchgate.net/profile/Roy_Longbottom/project/Performance-of-Raspberry-Pi-and-Android-Devices/attachment/5d108baa3843b0b982580793/AS:773236761579522@1561365418445/download/Raspberry-Pi-4-Benchmarks.tar.gz?context=ProjectUpdatesLog
######################################################################## _________________ Regards
Roy |
|
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Thu Jan 09, 2020 3:52 pm Post subject: Video Player Benchmarks |
|
|
Video Player Benchmarks
I recently ran some benchmarks, with I/O content, to indicate relative performance at the full CPU speed of 1500 MHz, compared with the 600 MHz that can be reached at the extreme of throttling - see:
https://www.researchgate.net/publication/338230582_Raspberry_Pi_4_CPU_MHz_Throttling_Performance_Effects
This includes replaying a programme using BBC iPlayer, running under Raspbian. In this case, playback continued at 600 MHz, without buffering being indicated, but displaying at a lower quality, in terms of pixel density.
On trying to run the video test via 64 bit Gentoo 1.5.1, the Chromium browser would not run the BBC iPlayer, reporting that Flash player was needed (noting now that Flash will no longer be supported, from sometime in 2020). After installing Gentoo 1.5.3, I found that the iPlayer was supported.
Then there was Sakaki's post on video decoding:
https://forums.gentoo.org/viewtopic.php?p=8404846#8404846
indicating that the Firefox browser currently does not exploit the RPi's built-in hardware video codecs, but the one used in Raspbian might.
Following are results from playing the same BBC iPlayer programme used previously, via Raspbian and Gentoo. The iPlayer accessed only has quality settings for low, medium and highest. According to a Google search, the maximum standard HD resolution is 1280 x 720, but this can be reduced automatically to suit response conditions. The Pi 4 had no cooling fan attached and was connected via LAN, where Speed Checker indicated 60 Mbps.
The results are for sample periods of 10 to 15 minutes. Quality details were obtained by right clicking on the screen, with others via the usual monitoring tools. There was no sign of data buffering or noticeable differences in quality (with casual viewing), but the CPU became much hotter under Gentoo, with occasional indications of throttling down to 1000 MHz, and CPU utilisation equivalent to nearly three cores in continuous use (software vs hardware decoding?).
Code: |
BBC iPlayer Lions Documentary
                                      Av of 4         Throttled
                    Wide  High   kbps    %CPU  Max °C   To 1000  Setting
Raspbian Chromium    840   540   1709      21      65         0  Highest
Raspbian Chromium   1280   720   5166      34      68         0  Highest
Gentoo Firefox      1280   720   5166      68      83         1  Highest
Gentoo Firefox      1280   720   5166      65      82         2  Highest
|
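The "nearly three cores" remark can be checked with simple arithmetic, assuming the "Av of 4 %CPU" column is the average utilisation across the Pi 4's four cores. The following is only a minimal sketch of that conversion; the function name core_equivalents is made up for illustration and is not part of any monitoring tool.

```python
def core_equivalents(avg_percent, cores=4):
    """Convert average %CPU across all cores into busy-core equivalents.

    Assumption: avg_percent is the mean utilisation over `cores` cores,
    which is how the "Av of 4 %CPU" column is interpreted here.
    """
    return avg_percent * cores / 100.0


# %CPU values taken from the BBC iPlayer table above
for label, pct in [
    ("Raspbian Chromium 1280 x 720", 34),
    ("Gentoo Firefox 1280 x 720", 68),
    ("Gentoo Firefox 1280 x 720", 65),
]:
    print(f"{label}: {core_equivalents(pct):.2f} cores")
```

On that reading, the Gentoo Firefox runs at 65% to 68% correspond to roughly 2.6 to 2.7 cores in continuous use, against about 1.4 cores for Raspbian Chromium at the same resolution.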
The next tests were via YouTube, playing the same HD widescreen movie (https://youtu.be/rWVXLy_fJGk). The first tests accessed the player via browsers, where quality settings are available for auto and a range of HD and normal options, and run time details are shown via right click, selecting Stats for nerds. Here, Gentoo tests again indicated higher CPU utilisation and temperatures.
The other test was via SMPlayer, where lower CPU utilisation could not exactly be confirmed, as the quality settings I found had no effect.
Code: |
YouTube HD Movie
                                      Av of 4         Throttled
                    Wide  High   kbps    %CPU  Max °C   To 1000  Setting
Raspbian Chromium   1920   816  16000      25      68         0  1080p
Gentoo Firefox       856   362  18000      30      73         0  Auto (480p)
Gentoo Firefox      1920   816  18000      60      81         0  1080p
Gentoo SMPlayer      640   272    300      10      65         0  1080p?
|
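For anyone wanting to log the temperature and throttling columns themselves, the following is a rough Python sketch, not the monitoring tools actually used for the tables above. It assumes the standard Linux sysfs files for temperature (millidegrees C) and current CPU clock (kHz) are present, and that the clock otherwise stays at full speed under load. All function names are hypothetical.

```python
TEMP_PATH = "/sys/class/thermal/thermal_zone0/temp"                    # millidegrees C
FREQ_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq"   # kHz


def parse_temp_c(text):
    """Convert a sysfs temperature reading (millidegrees C) to degrees C."""
    return int(text.strip()) / 1000.0


def parse_freq_mhz(text):
    """Convert a sysfs frequency reading (kHz) to MHz."""
    return int(text.strip()) / 1000.0


def read_sample():
    """Return (temp_c, freq_mhz), or None if the sysfs files are unavailable."""
    try:
        with open(TEMP_PATH) as t, open(FREQ_PATH) as f:
            return parse_temp_c(t.read()), parse_freq_mhz(f.read())
    except (OSError, ValueError):
        return None


def count_throttle_events(freqs_mhz, limit_mhz=1000):
    """Count how many times the clock drops to limit_mhz or below.

    Consecutive low samples count as one event. Assumption: any reading
    at or below the limit is throttling, not normal idle scaling.
    """
    events = 0
    throttled = False
    for freq in freqs_mhz:
        low = freq <= limit_mhz
        if low and not throttled:
            events += 1
        throttled = low
    return events
```

Calling read_sample() once a second during playback, then passing the collected MHz values to count_throttle_events() and taking the maximum temperature, would give figures comparable to the "Max °C" and "Throttled To 1000" columns.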
I also tried Prime Video, but this failed, indicating that no decryption add on could be found. I would like to use a Pi 4 to access this and other players on older TVs that I have in various rooms. _________________ Regards
Roy |
|
|
 |
Sakaki Guru


Joined: 21 May 2014 Posts: 409
|
Posted: Thu Jan 09, 2020 9:15 pm Post subject: Re: Video Player Benchmarks |
|
|
roylongbottom wrote: | Video Player Benchmarks
The other test was via SMPlayer, where lower CPU utilisation could not exactly be confirmed, as the quality settings I found had no effect
|
You should be able to set the stream quality via the Preferences -> Network dialog in SMPlayer, per this screenshot. Turn adaptive streams OFF if you want to force a particular resolution in YouTube. You can also change the codec route as shown there.
You can have SMPlayer display the resolution and framerate in the window also (as in the above). _________________ Regards,
sakaki |
|
|
 |
roylongbottom n00b

Joined: 13 Feb 2017 Posts: 64 Location: Essex, UK
|
Posted: Fri Jan 10, 2020 12:36 pm Post subject: Re: Video Player Benchmarks |
|
|
Sakaki wrote: | roylongbottom wrote: | Video Player Benchmarks
The other test was via SMPlayer, where lower CPU utilisation could not exactly be confirmed, as the quality settings I found had no effect
|
You should be able to set the stream quality via the Preferences -> Network dialog in SMPlayer, per this screenshot. Turn adaptive streams OFF if you want to force a particular resolution in YouTube. You can also change the codec route as shown there.
You can have SMPlayer display the resolution and framerate in the window also (as in the above). |
I had tried using those properties, without success. It needed adaptive streams turned on before the video was loaded. I was running full screen, where resolution and framerate were not displayed, but they could be seen via right click, View, Information and Properties, Information. As with yours, nearly 25 FPS was selected.
Results are included below, now similar to Raspbian on CPU utilisation and temperature. Dual mirrored monitor results are also shown, where the kbps figures were rather strange but display quality looked fine (without detailed scrutiny).
Code: | YouTube HD Movie
                                           Av of 4         Throttled
                     Wide   High      kbps    %CPU  Max °C   To 1000  Setting
Raspbian Chromium    1920    816     16000      25      68         0  1080p
Gentoo Firefox        856    362     18000      30      73         0  Auto (480p)
Gentoo Firefox       1920    816     18000      60      81         0  1080p
Gentoo SMPlayer       640    272       300      10      65         0  ~~1080p
Gentoo SMPlayer      1920    816      4428      28      73         0  1080p
Gentoo SMPlayer    1920x2  816x2  588-2028      32      76         0  1080p
~~1080p - "use adaptive streams" box not ticked
|
_________________ Regards
Roy |
|
|
 |
Sakaki Guru


Joined: 21 May 2014 Posts: 409
|
Posted: Fri Jan 10, 2020 5:15 pm Post subject: |
|
|
Strange. I've just tried this video myself - changing the "Playback quality" under the "Options for YouTube" dialog does seem to work fine (720p, 1080p etc) with adaptive streams turned off. But I found the player sometimes needs to be pointed at a different URL first, as it seems to remember the res for each target.
You can press Shift-I (eye) to show useful live overlaid info even when full screen on SMPlayer. _________________ Regards,
sakaki |
|
|
 |
|
|
|
|