Gentoo Forums
64 Bit Raspberry Pi 4B Benchmarks

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Wed Aug 21, 2019 11:30 am    Post subject: 64 Bit Raspberry Pi 4B Benchmarks

64 Bit Raspberry Pi 4B Benchmarks

Previously, I have run my 32 bit and 64 bit benchmarks on the appropriate range of Raspberry Pi computers, up to model 3B+. Details of the benchmarks, results and download links are available from ResearchGate at:

https://www.researchgate.net/publication/327467963_Raspberry_Pi_3B_32_bit_and_64_bit_Benchmarks_and_Stress_Tests

I have also run the 32 bit versions on the Raspberry Pi 4, with results in

https://www.researchgate.net/publication/333973011_Raspberry_Pi_4B_32_Bit_Benchmarks (Raspberry-Pi-4-Benchmarks.pdf)

This report contains brief reminders of the benchmarks, with 64 bit results on the Raspberry Pi 4 using the Gentoo operating system. Existing benchmarks were used, to provide comparisons with the old 3B+ model and with the Pi 4B system using 32 bit Raspbian. The first part covers my original single core programs.


Whetstone Benchmark

This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, nowadays mostly on those identified as COS and EXP. The last three tests can be over optimised (shown as N/A), but their run time does not affect the overall rating much.

For this simple code, at 64 bits, the overall Pi 4 performance gain over the Pi 3B+ was 2.12 times (MWIPS), but only around 1.3 times for straightforward floating point calculations. As should be expected, the Pi 4B 32 bit version was not much slower than the 64 bit one.

Code:

System       MHz  MWIPS   ----- MFLOPS------   ------------ MOPS--------------
                            1      2      3     COS    EXP  FIXPT    IF   EQUAL

Pi 3B+ 64b  1400   1071    383    403    328   20.9   12.4   1704   N/A    1357
Pi 4B  64b  1500   2269    522    534    398   54.8   39.8   2487   N/A     997
Pi4/3B+     1.07   2.12   1.36   1.32   1.21   2.63   3.21   1.46   N/A    0.73

Pi 4B 32b   1500   1884    516    478    310   54.7   27.1   2498   2247    999
64b/32b     1.00   1.20   1.01   1.12   1.28   1.00   1.47   1.00   N/A    1.00



Dhrystone Benchmark

This appears to be the most popular ARM benchmark and is often subject to over optimisation, so results from different compilers cannot safely be compared. Ignoring this, results in VAX MIPS (aka DMIPS) and comparisons follow. This benchmark has no significant data arrays suitable for vectorisation.

Using the same 64 bit program, the Pi 4 was more than twice as fast as the Pi 3B+, and 52% faster than the 32 bit compilation on the Pi 4.
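
The DMIPS/MHz column and the two ratio rows in the results are simple quotients. As a quick check, here is a minimal Python sketch that reproduces them from the measured figures. The function names are illustrative only and are not part of the benchmark.

```python
# Reproduce the derived Dhrystone figures from the measured DMIPS values.
# Function names are illustrative; the numbers are copied from the results table.

def dmips_per_mhz(dmips: float, mhz: float) -> float:
    """VAX MIPS per MHz of CPU clock."""
    return dmips / mhz

def speed_ratio(new: float, old: float) -> float:
    """How many times faster `new` is than `old` (higher scores are better)."""
    return new / old

# system -> (CPU MHz, DMIPS)
results = {
    "Pi 3B+ 64b": (1400, 4028),
    "Pi 4B 64b": (1500, 8176),
    "Pi 4B 32b": (1500, 5366),
}

for system, (mhz, dmips) in results.items():
    print(f"{system:11s} {dmips_per_mhz(dmips, mhz):.2f} DMIPS/MHz")

pi4_vs_pi3 = speed_ratio(results["Pi 4B 64b"][1], results["Pi 3B+ 64b"][1])
b64_vs_b32 = speed_ratio(results["Pi 4B 64b"][1], results["Pi 4B 32b"][1])
print(f"Pi4/3B+ {pi4_vs_pi3:.2f}, 64b/32b {b64_vs_b32:.2f}")
```

The 64b/32b ratio of 1.52 is where the "52% faster" figure comes from.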

Code:

                             DMIPS
System      MHz     DMIPS     /MHz

Pi 3B+ 64b  1400     4028     2.88
Pi 4B  64b  1500     8176     5.45
Pi4/3B+     1.07     2.03

Pi 4B 32b   1500     5366     3.58
64b/32b     1.00     1.52


Linpack Benchmark

The original Linpack benchmark specified the use of double precision (DP) floating point arithmetic, and the code used here is identical to that initially approved for use on old PCs. For the benefit of early ARM computers, the code is also run using single precision (SP) numbers. A version was also produced, replacing the key Daxpy code with NEON Intrinsic Functions, using vector operations, also with single precision calculations.
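
For readers unfamiliar with Daxpy, it is the inner loop that dominates Linpack run time: y = a*x + y over two vectors. The Python sketch below only illustrates the operation and its operation count. It is not the benchmark code, which is C (with NEON intrinsics in the third version), and the helper names are made up for this example.

```python
# Sketch of the daxpy operation (y = a*x + y) at the heart of Linpack.
# Illustrative only: the benchmark itself is C code, optionally using
# NEON intrinsics for this loop.

def daxpy(a: float, x: list[float], y: list[float]) -> None:
    """Update y in place: one multiply and one add per element."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(y)):
        y[i] += a * x[i]

def daxpy_flops(n: int) -> int:
    """Floating point operations performed by one daxpy call of length n."""
    return 2 * n

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
daxpy(0.5, x, y)
print(y)               # [10.5, 21.0, 31.5, 42.0]
print(daxpy_flops(4))  # 8
```

A 128 bit NEON register holds four single precision values, so a vectorised version can process four elements of this loop per instruction, which is why the single precision results benefit most.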

The Pi 3B+ 32 bit results are also provided for clarification. My results were highlighted in the MagPi magazine, on announcement of the Pi 4, particularly the 2 GFLOPS 32 bit NEON speed. See:

https://www.raspberrypi.org/magpi/raspberry-pi-4-specs-benchmarks/

At 64 bits, Pi 4/3B+ performance ratios were generally higher than those for the earlier benchmarks. As could be expected, performance using NEON Intrinsic Functions, which is virtually compiler independent, was similar at 32 bits and 64 bits. The main 64 bit gain was with the compiled single precision version, which obtained the same performance as that via NEON Intrinsics.

Code:

System      MHz     ------- MFLOPS --------
                      DP      SP    SP NEON

Pi 3B+ 64b  1400    396.6    562.1    604.2
Pi 4B  64b  1500   1059.9   1977.8   1968.6
Pi4/3B+     1.07     2.67     3.52     3.26

Pi 4B 32b   1500    760.2    921.6   2010.5
64b/32b     1.00     1.39     2.15     0.98

Pi 3B+ 32b  1400    210.5    225.2    562.5
Pi4/3B+     1.07     3.61     4.09     3.57


Livermore Loops Benchmark

This original main benchmark for supercomputers was first introduced in 1970, initially comprising 14 kernels of numerical applications, written in Fortran. This was increased to 24 kernels in the 1980s. Following are overall MFLOPS ratings, geometric mean being the official average performance, followed by details from the 24 kernels. Note that these are for double precision calculations.

All the ratings indicate reasonably significant performance gains of Pi 4 over Pi 3B+ and 64 bits over 32 bits. Results from the 24 kernels indicate some higher gains. Also note the maximum speed of 2.49 GFLOPS (Double Precision).

The speed of the original Raspberry Pi could be rated as 4.5 times faster than the Cray 1 supercomputer (Geomean 11.9) - see my quote on

https://www.webarchive.org.uk/wayback/archive/20131218132751/http://www.roylongbottom.org.uk/Raspberry%20Pi%20Benchmarks.htm#anchor7a

Now, one core of the Raspberry Pi 4B, at 64 bits, produces performance equivalent to 61 Cray 1 supercomputers.
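
The summary columns in the results are different kinds of mean over the kernel speeds, and the Cray 1 comparison is a ratio of Geomean figures. The Python sketch below shows both calculations. The three sample speeds are made up for illustration, and the benchmark's published averages are not claimed to be computed exactly this way from the single row of 24 kernel figures shown.

```python
import statistics

# The summary columns (Average, Geomean, Harmean) are three different means
# of per-kernel MFLOPS. The sample values below are made up for illustration.
sample_mflops = [100.0, 400.0, 1600.0]

average = statistics.fmean(sample_mflops)            # arithmetic mean
geomean = statistics.geometric_mean(sample_mflops)   # the official average
harmean = statistics.harmonic_mean(sample_mflops)    # pulled down by slow kernels

print(f"Average {average:.1f}  Geomean {geomean:.1f}  Harmean {harmean:.1f}")

# Cray 1 equivalence quoted in the text, using the Geomean figures.
CRAY1_GEOMEAN = 11.9      # MFLOPS, from the post
PI4_64B_GEOMEAN = 730.3   # MFLOPS, from the results table
cray_equivalent = PI4_64B_GEOMEAN / CRAY1_GEOMEAN
print(f"Pi 4B 64b = {cray_equivalent:.1f} x Cray 1")
```

For any set of positive speeds the arithmetic mean is at least the geometric mean, which is at least the harmonic mean, matching the ordering of the Average, Geomean and Harmean columns.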

Code:

System      MHz Maximum Average Geomean Harmean Minimum

Pi 3B+ 64b 1400   737.7   319.4   284.7   250.6    91.6
Pi 4B  64b 1500  2490.5     892   730.3   603.3   212.4
Pi4/3B+    1.07    3.38    2.79    2.57    2.41    2.32

Pi 4B 32b  1500  1800.2   635.1     519   416.1   155.3
64b/32b    1.00    1.38    1.40    1.41    1.45    1.37


MFLOPS Of 24 Kernels

Pi 3B+    540   296   539   527   226   175   738   428   484   251   169   245
64b       127   161   291   258   440   520   333   280   310    93   362   209

Pi 4B    2026   997   987   948   372   739  2033  2491  1980   758   495   875
64b       220   404   811   710   753  1124   444   397  1061   414   822   283

Pi4/3B+  3.75  3.37  1.83  1.80  1.65  4.23  2.76  5.83  4.09  3.02  2.92  3.57
         1.73  2.51  2.79  2.75  1.71  2.16  1.33  1.42  3.43  4.48  2.27  1.36
         Min   1.33  Max   5.83

Pi 4B 32  746   964   988   943   212   538  1169  1800  1032   469   214   186
32b       159   335   778   623   732  1034   320   350   489   360   749   187

64b/32b  2.72  1.03  1.00  1.00  1.76  1.37  1.74  1.38  1.92  1.62  2.31  4.70
         1.38  1.20  1.04  1.14  1.03  1.09  1.39  1.13  2.17  1.15  1.10  1.51
         Min   1.00  Max   4.70



Next are single core benchmarks that use data in caches and RAM.
_________________
Regards

Roy


Last edited by roylongbottom on Sun Aug 25, 2019 11:41 am; edited 1 time in total

Sakaki
Guru


Joined: 21 May 2014
Posts: 409

Posted: Wed Aug 21, 2019 11:50 am

roylongbottom,

very interesting analysis as always, thanks for all your continued hard work on this!

Will you be posting these results to your Raspberry Pi Benchmarking thread on the RPi forums in due course?
_________________
Regards,

sakaki

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Wed Aug 21, 2019 12:45 pm

Sakaki wrote:
roylongbottom,

very interesting analysis as always, thanks for all your continued hard work on this!

Will you be posting these results to your Raspberry Pi Benchmarking thread on the RPi forums in due course?


Yes, nearly identical, including link to new Gentoo
_________________
Regards

Roy

NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 47163
Location: 56N 3W

Posted: Wed Aug 21, 2019 4:13 pm

roylongbottom,

Did you use the same binaries on the Pi3 and Pi4, or rebuild the code to take advantage of the out of order execution available on the Pi4?
Here, I'm being lazy and using Pi3 64 bit code everywhere.

Thank you for your Pi benchmark work.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Wed Aug 21, 2019 7:09 pm

NeddySeagoon wrote:
roylongbottom,

Did you use the same binaries on the Pi3 and Pi4, or rebuild the code to take advantage of the out of order execution available on the Pi4?
Here, I'm being lazy and using Pi3 64 bit code everywhere.

Thank you for your Pi benchmark work.



The benchmarks were those compiled for the Pi 3. As for 32 bit benchmarks, I intend to recompile some on the Pi 4. Are there any special compile options?
_________________
Regards

Roy

NeddySeagoon
Administrator


Joined: 05 Jul 2003
Posts: 47163
Location: 56N 3W

Posted: Wed Aug 21, 2019 7:28 pm

roylongbottom,

On the Pi3, for 64 bit code, I use
Code:
CFLAGS="-march=armv8-a+crc -mtune=cortex-a53 -ftree-vectorize -O2 -pipe -fomit-frame-pointer"

The A53 does not support out of order execution.

The Pi4 has an A72 CPU, which does provide for out of order instruction execution.

If out of order instruction execution requires preparation in the code stream,
Code:
CFLAGS="-march=armv8-a+crc -mtune=cortex-a72 -ftree-vectorize -O2 -pipe -fomit-frame-pointer"

should produce code that is better matched to the Pi4.

I don't know if the A72 takes the A53 in order code stream and does what it can with instruction reordering.

I have not done any 32 bit work on either platform beyond booting 32 bit Raspbian and noting that it works.
That's a great confidence booster when you can't even get a serial console.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Sakaki
Guru


Joined: 21 May 2014
Posts: 409

Posted: Wed Aug 21, 2019 8:00 pm

Interesting read about this topic here:

https://community.arm.com/developer/tools-software/tools/b/tools-software-ides-blog/posts/compiler-flags-across-architectures-march-mtune-and-mcpu
_________________
Regards,

sakaki

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Thu Aug 22, 2019 11:33 am

sakaki

I became distracted from reporting some benchmark results after building ATLAS Linear Algebra Subprograms overnight (13 hours), in order to run the High Performance Linpack Benchmark. All went well until the final stage compiling the HPL program, where mpicc could not be found. It was there on the 3B Gentoo, where I successfully installed and ran HPL on a Pi 3B+.


Is mpicc available for downloading for Pi 4 Gentoo?
_________________
Regards

Roy

Sakaki
Guru


Joined: 21 May 2014
Posts: 409

Posted: Thu Aug 22, 2019 3:41 pm

roylongbottom wrote:
All went well until the final stage compiling the HPL program, where mpicc could not be found. It was there on the 3B Gentoo, where I successfully installed and ran HPL on a Pi 3B+.


Is mpicc available for downloading for Pi 4 Gentoo?
Yes, if you do:
Code:

demouser@pi64 ~ $ sudo emaint sync --repo genpi64
demouser@pi64 ~ $ sudo emerge -v sys-cluster/mpich

you should get mpicc installed. This is built as a binary package on the binhost, so installation shouldn't take long. Please let me know if there are any problems.
_________________
Regards,

sakaki

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Thu Aug 22, 2019 4:32 pm

Sakaki

Thanks

Nearly there, mpicc is used but now error is mpif77: Command not found
_________________
Regards

Roy

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Thu Aug 22, 2019 6:05 pm

Memory Benchmarks

This batch of programs measures speeds dependent on data from caches and RAM.


MemSpeed Benchmark

The MemSpeed benchmark measures data reading speeds in megabytes per second, carrying out calculations on arrays of cache and RAM data, normally sized 2 x 4 KB to 2 x 4 MB. Calculations are as shown in the result headings. For the first two double precision tests, speed in MFLOPS can be calculated by dividing MB/second by 8 and 16 respectively. For single precision, divide by 4 and 8.
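
The MB/second to MFLOPS conversion can be written out as a small lookup. The Python sketch below uses the divisors given above. The dictionary keys and function name are illustrative, and the sample readings are the fastest first-test speeds for the Pi 4B at 64 bits, copied from the results.

```python
# Convert MemSpeed MB/second readings to MFLOPS using the divisors from the text.
# Keys and function name are illustrative, not part of the benchmark.

BYTES_PER_FLOP = {
    ("double", "x[m]=x[m]+s*y[m]"): 8,
    ("double", "x[m]=x[m]+y[m]"): 16,
    ("single", "x[m]=x[m]+s*y[m]"): 4,
    ("single", "x[m]=x[m]+y[m]"): 8,
}

def mflops(mb_per_second: float, precision: str, test: str) -> float:
    """MFLOPS implied by a measured reading speed for one of the FP tests."""
    return mb_per_second / BYTES_PER_FLOP[(precision, test)]

# Fastest first-test readings for the Pi 4B at 64 bits, from the results table.
print(mflops(15719, "double", "x[m]=x[m]+s*y[m]"))  # 1964.875
print(mflops(14042, "single", "x[m]=x[m]+s*y[m]"))  # 3510.5
```

Rounded, these agree with the Max MFLOPS line reported for that system (1965 and 3511). The divisors appear to follow from the first test doing a multiply and an add for each pair of elements read, and the second test a single add.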

Results are provided below for the Gentoo 64 bit version on the Pi 3B+ and Pi 4B, and the Raspbian 32 bit variety on the Pi 4B, then a sample of relative performance, covering data from L1 cache, L2 cache and RAM.

Gains, greater than the 7% CPU MHz difference, were recorded all round by the Pi 4B over the Pi 3B+. The most impressive were on using L2 cache based data and the more intensive floating point calculations.
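
The large gains at some sizes line up with the different cache sizes of the two boards. The Python sketch below classifies a working set by where it should fit. The cache sizes are the published per-core L1 data and shared L2 figures for the two SoCs and are an assumption here, since they are not stated in this post. The function name is illustrative.

```python
# Rough classification of a MemSpeed working set by where the data should live.
# Cache sizes are the published per-core L1 data and shared L2 sizes for each
# SoC; they are assumptions here, as the post does not state them.

CACHES_KB = {
    "Pi 3B+": {"L1": 32, "L2": 512},
    "Pi 4B": {"L1": 32, "L2": 1024},
}

def memory_level(system: str, kbytes_used: int) -> str:
    """Return 'L1', 'L2' or 'RAM'. A working set that exactly fills a cache
    is treated as spilling, since other data competes for the space."""
    caches = CACHES_KB[system]
    if kbytes_used < caches["L1"]:
        return "L1"
    if kbytes_used < caches["L2"]:
        return "L2"
    return "RAM"

for kb in (16, 256, 512, 4096):
    print(kb, memory_level("Pi 3B+", kb), memory_level("Pi 4B", kb))
```

At 512 KB the Pi 3B+ is already spilling to RAM while the Pi 4B is still working from L2, which would explain why the comparison ratios peak at that size.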

On the Pi 4B, speeds of 64 bit and 32 bit compilations were similar using RAM based data and executing some integer tests, but the 64 bit version was significantly faster on cache based floating point calculations.

Code:

                              Gentoo 64b Pi 3B+

         Memory Reading Speed Test armv8 64 Bit by Roy Longbottom

               Start of test Fri Aug 16 12:48:51 2019

    Memory  x[m]=x[m]+s*y[m] Int+ x[m]=x[m]+y[m]       x[m]=y[m]
    KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
      Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

          8   4813   2897   4350   6180   3954   4831   5378   4324   4324
         16   4540   2900   4356   6213   3961   4838   5401   4344   4333
         32   4184   2780   4047   5540   3721   4483   5421   4285   4316
         64   3784   2678   3803   4776   3547   4171   4925   4087   4051
        128   3613   2694   3842   4731   3562   4188   4967   4087   4103
        256   3133   2652   3800   4626   3493   4027   4967   4093   4096
        512    670    882   1630   2913   2422   2718   3101   3141   2780
       1024    587    774   1017   1310   1287   1184   1105   1526   1543
       2048    555    746    917   1143   1131   1043   1071   1007   1128
       4096    545    691   1130   1039   1015   1140   1045   1087    892
       8192    537    795   1139    980   1133   1148    887    854    922
 Max MFLOPS    602    725

                              Gentoo 64b Pi 4B

          8  15530  13973  12509  15570  14025  15534  11417   9308   7798
         16  15719  14042  12750  15745  14200  15660  11753   9447   7890
         32  14062  12228  11435  14052  12699  12855  11864   9459   7937
         64  12195  11344  10698  12211  11705  12025   8872   8752   7904
        128  12172  11360  10755  12166  11862  11975   8569   8460   7913
        256  12228  11369  10697  12123  11790  12082   8073   8222   7896
        512  11269  10738  10206  10985  11164  11590   8017   6280   6557
       1024   3407   2635   3281   3396   3242   2979   3765   3947   4029
       2048   1525   1832   1838   1851   1607   1838   2819   2790   2770
       4096   1407   1851   1859   1861   1666   1840   2485   2487   2410
       8192   1913   1914   1922   1528   1895   1891   2496   2234   2489
 Max MFLOPS   1965   3511

                              Comparison 64b Pi4/3B+

          8   3.23   4.82   2.88   2.52   3.55   3.22   2.12   2.15   1.80
         16   3.46   4.84   2.93   2.53   3.58   3.24   2.18   2.17   1.82
 
        256   3.90   4.29   2.82   2.62   3.38   3.00   1.63   2.01   1.93
        512  16.82  12.17   6.26   3.77   4.61   4.26   2.59   2.00   2.36
       1024   5.80   3.40   3.23   2.59   2.52   2.52   3.41   2.59   2.61

       4096   2.58   2.68   1.65   1.79   1.64   1.61   2.38   2.29   2.70
       8192   3.56   2.41   1.69   1.56   1.67   1.65   2.81   2.62   2.70

                              Raspbian 32b Pi 4B

          8   8459   4766  13344   8303   4768  15553   7806   9926   9927
         16   7142   3918   8649   7103   4094   9309   7899  10086  10056
         32   7969   4490  10339   7941   4532  11627   7758  10070  10048
         64   8126   4602   9909   8114   4617  11069   7425   8021   8070
        128   8302   4651   9623   8311   4657  10836   7374   8049   7934
        256   8319   4663   9627   8360   4666  10768   7530   7922   7925
        512   8088   4629   9453   8239   4650  10696   5023   7904   7949
       1024   3581   3113   3618   3577   3150   3675   5358   2431   1560
       2048   1338   1808   1780   1811   1832   1773   2131    950    956
       4096   1881   1880   1852   1879   1664   1336   1988    984   1054
       8192   1890   1901   1884   1729   1319   1367   2252   1018   1021
 Max MFLOPS   1057   1192

                              Comparison Pi 4B 64b/32b

          8   1.84   2.93   0.94   1.88   2.94   1.00   1.46   0.94   0.79
         16   2.20   3.58   1.47   2.22   3.47   1.68   1.49   0.94   0.78

        256   1.47   2.44   1.11   1.45   2.53   1.12   1.07   1.04   1.00
        512   1.39   2.32   1.08   1.33   2.40   1.08   1.60   0.79   0.82
       1024   0.95   0.85   0.91   0.95   1.03   0.81   0.70   1.62   2.58
 
       4096   0.75   0.98   1.00   0.99   1.00   1.38   1.25   2.53   2.29
       8192   1.01   1.01   1.02   0.88   1.44   1.38   1.11   2.19   2.44


NeonSpeed Benchmark

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm functions were as generated by the compiler, using NEON directives, and Neon functions were produced using Intrinsic Functions.

Unlike running the same programs on the Pi 3B+, on the Pi 4 the compiled code was no longer slower than that produced via Intrinsic Functions. This led to performance gains of more than five times in some cases.

Except using L1 cache based data, performance was essentially the same using 32 bit and 64 bit benchmarks.

Code:

                     Gentoo 64b Pi 3B+

   NEON Speed Test armv8 64 Bit V 1.0 Fri Aug 16 2019

       Vector Reading Speed in MBytes/Second
     Memory  Float v=v+s*v  Int v=v+v+s  Neon v=v+v
     KBytes   Norm   Neon   Norm   Neon  Float    Int

         16   2715   5110   3945   4826   5426   5598
         32   2528   4326   3569   4191   4596   4661
         64   2491   4153   3494   4068   4407   4429
        128   2537   4228   3583   4120   4461   4473
        256   2526   4265   3614   4140   4480   4514
        512   1917   2830   2545   2579   2896   2964
       1024   1166   1299   1152   1257   1205   1229
       4096   1022   1135   1132   1122   1130   1100
      16384   1080   1026   1131   1016   1064   1094
      65536    996   1120   1061    831   1110   1069

                     Gentoo 64b Pi 4B

         16  13982  16424  12505  15239  16065  17193
         32   9554  10753   8981   9657  10970  11025
         64  10658  11833  10274  10722  12110  12134
        128  10657  11887  10337  10680  11994  11973
        256  10709  11970  10360  10774  12003  12083
        512  10147  11441   9733  10209  11264  11532
       1024   2964   3222   2876   3216   3270   2942
       4096   1734   1712   1729   1772   1586   1728
      16384   1592   1922   1818   1923   1926   1667
      65536   1970   1736   1997   1747   1884   2021

                   Comparison 64b Pi4/3B+

         16   5.15   3.21   3.17   3.16   2.96   3.07

        256   4.24   2.81   2.87   2.60   2.68   2.68
        512   5.29   4.04   3.82   3.96   3.89   3.89

      65536   1.98   1.55   1.88   2.10   1.70   1.89
                             
                      Raspbian 32b Pi 4B

         16   9677  10072   8905   9358   9776  10473
         32  10149  10330   9364   9539   9988  10543
         64  10948  11708  10466  10568  11318  11994
        128  10484  11232  10410  10104  11200  11792
        256  10509  11369  10428  10264  11273  11842
        512  10406  11066  10134  10054  11075  11467
       1024   3069   3202   3159   3166   3204   3203
       4096   1721   1910   1908   1882   1903   1900
      16384   2023   2009   2008   1965   2032   2013
      65536   2073   2074   2074   2073   2068   2064

                   Comparison Pi 4B 64b/32b

         16   1.44   1.63   1.40   1.63   1.64   1.64

        256   1.02   1.05   0.99   1.05   1.06   1.02
        512   0.98   1.03   0.96   1.02   1.02   1.01

      65536   0.95   0.84   0.96   0.84   0.91   0.98



BusSpeed Benchmark

This is a read only benchmark with data from caches and RAM. The program reads one word with 32 word address increments, followed by decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds. The two comparison columns are for two word and one word increments.
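
To see why the increments reveal burst reading, it helps to count how many words and how many cache lines each pass touches. This is a Python sketch of the access pattern only, not the benchmark's C code. The 64 byte line size is an assumption, and the function names are made up for illustration.

```python
# Sketch of the BusSpeed access pattern: visit every `inc`-th 4 byte word.
# The 64 byte line size is an assumption used only for illustration.

WORD_BYTES = 4

def words_touched(total_words: int, inc: int) -> int:
    """Number of words actually read with a given address increment."""
    return len(range(0, total_words, inc))

def lines_touched(total_words: int, inc: int, line_bytes: int = 64) -> int:
    """Number of distinct cache lines (bursts) those reads fall into."""
    words_per_line = line_bytes // WORD_BYTES
    return len({i // words_per_line for i in range(0, total_words, inc)})

total = 4096  # 16 KB of 4 byte words
for inc in (32, 16, 8, 4, 2, 1):
    print(inc, words_touched(total, inc), lines_touched(total, inc))
```

Under that assumption, every increment of 16 words or fewer pulls in all 256 lines of the 16 KB block, so the larger increments fetch the same number of lines while using fewer of the words in each.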

Most data transfers were 2.0 to 2.5 times faster on the Pi 4, including from RAM, and somewhat higher with L2 cache based data.

The 64 bit version still deals with 32 bit words but transferred data somewhat quicker than the 32 bit program, as shown by the Pi 4 results.

Code:

                       Gentoo 64b Pi 3B+

    BusSpeed armv8 64 Bit Fri Aug 16 12:53:43 2019

     Reading Speed 4 Byte Words in MBytes/Second
     Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read   Inc2   Read
     KBytes  Words  Words  Words  Words  Words    All  Words    All

         16   3819   4253   4622   5041   5089   3870
         32   1234   1328   2067   3158   4082   3674
         64    681    704   1325   2208   3350   3602
        128    638    646   1214   2070   3238   3625
        256    592    617   1165   1991   3164   3622
        512    295    309    640    985   2085   2790
       1024    108    120    271    525   1070   1636
       4096     98    123    249    486    881   1840
      16384    121    114    246    480    977   1642
      65536    121    124    248    409    989   1864

                       Gentoo 64b Pi 4B
                                                          Pi4/3B+

         16   4999   5042   5665   5885   5891   8217   1.16   2.12
         32   1578   2105   3283   4339   5154   7507   1.26   2.04
         64    585    911   1855   3085   5163   7918   1.54   2.20
        128    590    932   1888   3110   5161   7874   1.59   2.17
        256    598    934   1908   3056   5265   7883   1.66   2.18
        512    603    939   1822   3019   5124   7716   2.46   2.77
       1024    319    482   1060   1885   3283   5721   3.07   3.50
       4096    209    253    503   1006   2009   4111   2.28   2.23
      16384    209    261    520   1041   2071   4115   2.12   2.51
      65536    203    263    489   1011   2023   4036   2.05   2.17

                       Raspbian 32b Pi 4B
                                                          64b/32b

         16   3836   4049   4467   5885   4641   5858   1.14   1.14
         32    761   1473   2594   3216   3960   4780   1.01   1.01
         64    409    801   1684   2422   3745   3940   0.95   0.95
        128    406    803   1202   1914   3037   5377   1.32   1.32
        256    415    700   1165   2481   4789   5137   1.27   1.27
        512    392    760   1243   2455   3764   4264   1.38   1.38
       1024    230    256    623   1061   2455   3501   1.59   1.59
       4096    197    214    454    938   1852   3195   1.80   1.80
      16384    138    215    445    897   1724   3210   1.91   1.91
      65536    174    215    398    744   1655   3130   1.61   1.61



Fast Fourier Transforms Benchmark

This is a real application provided by my collaborator at the CompuServe Forum. There are two versions. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements are made at each size, using both single and double precision data, calculating FFT sizes between 1K and 1024K. Results are in milliseconds, with those shown here being the average of the three measurements.

There were gains all round on the Pi 4, compared with the 3B+, mainly between 3 and 4 times on the optimised version, less so using FFT1, with more data transfer speed dependency.
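
Because the FFT results are running times rather than speeds, the comparison columns are the older (or 32 bit) time divided by the newer (or 64 bit) time. The Python sketch below checks this for the 1024K rows, with times copied from the results. The function name is illustrative.

```python
# FFT results are running times, so a speed-up is old time / new time.
# Times in milliseconds for the 1024K FFT, copied from the results tables.

def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new run is, given two running times."""
    return old_ms / new_ms

pi3_64b = {"FFT1 SP": 1746.23, "FFT1 DP": 2943.08, "FFT3 SP": 641.50, "FFT3 DP": 863.82}
pi4_64b = {"FFT1 SP": 1250.51, "FFT1 DP": 1878.44, "FFT3 SP": 213.35, "FFT3 DP": 303.39}

for test in pi3_64b:
    print(f"{test}: {speedup(pi3_64b[test], pi4_64b[test]):.2f}")
```

This gives 1.40, 1.57, 3.01 and 2.85, matching the Pi4/3B+ columns for that row.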

On the Pi 4, performance from the 32 bit compilation was often similar to that at 64 bits. This is probably due to much of the data being read on a skipped sequential basis, not good for vectorisation.

Code:


                 Gentoo 64b Pi 3B+

       Size    FFT1           FFT3
          K      SP      DP     SP     DP

          1    0.13    0.15   0.15   0.17
          2    0.29    0.39   0.32   0.38
          4    0.76    1.13   0.79   0.85
          8    1.93    2.66   1.77   1.94
         16    4.02    5.51   4.69   5.14
         32    9.50   25.11   9.51  13.67
         64   42.53  110.21  25.30  32.25
        128  151.08  257.41  57.68  76.71
        256  355.88  589.07 129.47 174.85
        512  819.91 1324.89 297.80 390.74
       1024 1746.23 2943.08 641.50 863.82

       
               Gentoo 64b Pi 4B             Pi4/3B+

       Size    FFT1           FFT3          FFT1          FFT3
          K      SP      DP     SP     DP     SP     DP     SP     DP

          1    0.04    0.04   0.04   0.04   3.30   3.62   3.60   4.13
          2    0.08    0.14   0.11   0.09   3.81   2.88   2.82   4.03
          4    0.25    0.38   0.19   0.22   3.05   2.93   4.13   3.86
          8    0.79    1.31   0.46   0.50   2.45   2.04   3.87   3.87
         16    2.15    2.91   1.15   1.09   1.87   1.89   4.07   4.71
         32    5.71    6.76   2.48   3.18   1.66   3.71   3.83   4.30
         64   15.22   51.00   5.43   9.29   2.79   2.16   4.66   3.47
        128   83.47  151.95  16.28  24.75   1.81   1.69   3.54   3.10
        256  231.24  362.64  39.13  57.28   1.54   1.62   3.31   3.05
        512  561.16  765.18  90.20 133.21   1.46   1.73   3.30   2.93
       1024 1250.51 1878.44 213.35 303.39   1.40   1.57   3.01   2.85


              Raspbian 32b Pi 4B            64b/32b

       Size    FFT1           FFT3          FFT1          FFT3
          K      SP      DP     SP     DP     SP     DP     SP     DP

          1    0.04    0.04   0.06   0.05   0.99   0.96   1.44   1.18
          2    0.08    0.12   0.13   0.11   1.04   0.89   1.14   1.18
          4    0.32    0.37   0.27   0.24   1.28   0.96   1.42   1.09
          8    0.77    0.97   0.58   0.55   0.98   0.74   1.26   1.09
         16    1.69    2.01   1.49   1.35   0.78   0.69   1.29   1.24
         32    4.37    4.89   2.96   3.63   0.77   0.72   1.19   1.14
         64    9.12   26.55   7.46  10.75   0.60   0.52   1.37   1.16
        128   55.52  160.11  17.93  26.03   0.67   1.05   1.10   1.05
        256  305.92  423.06  41.16  55.06   1.32   1.17   1.05   0.96
        512  833.10  854.88  86.93 120.53   1.48   1.12   0.96   0.90
       1024 1617.49 1875.52 190.28 266.60   1.29   1.00   0.89   0.88



Next: Multithreading Benchmarks
_________________
Regards

Roy

Sakaki
Guru


Joined: 21 May 2014
Posts: 409

Posted: Thu Aug 22, 2019 7:10 pm

roylongbottom wrote:
Sakaki

Thanks

Nearly there, mpicc is used but now error is mpif77: Command not found
Looks like mpich needs recompilation with the fortran USE flag enabled. I'll do that tonight and post again here when done.
_________________
Regards,

sakaki

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Thu Aug 22, 2019 7:44 pm

Sakaki wrote:
roylongbottom wrote:
Sakaki

Thanks

Nearly there, mpicc is used but now error is mpif77: Command not found
Looks like mpich needs recompilation with the fortran USE flag enabled. I'll do that tonight and post again here when done.


Thanks, but maybe I should try something before putting you to the trouble.

The Pi 3 Gentoo appears to include mpicc, that worked when installing HPL on my Pi 3B+, using the following

https://computenodes.net/2018/06/28/building-hpl-an-atlas-for-the-raspberry-pi/

This installs mpich-3.2. I will try to recompile that.
_________________
Regards

Roy

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Thu Aug 22, 2019 9:37 pm

Sakaki

My recompile worked, so I now have a working Gentoo Pi 4 HPL Benchmark, but the speed is disappointing, the same as the 32 bit version, with a maximum of just over 10 GFLOPS (with 4 GB RAM). It might need some compile parameters changing for HPL (or ATLAS), and I wonder if I could find anyone to advise how and where.

At least it is three times faster than using the Gentoo Pi 3B+ version.
_________________
Regards

Roy

roylongbottom
n00b


Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Fri Aug 23, 2019 10:23 am

sakaki

For an up to date comparison, I have been running that HPL benchmark and my other MP tests on my Pi 3B+, using the new Gentoo. All ran without any problems, but there were two things to report.

The first was that the TV display started at 1024 x 768. Settings did not provide an option anywhere near 1920 x 1080.

The second point was that WiFi connected, without any intervention, using the originally entered password. Back on the Pi 4, it still did not connect.
_________________
Regards

Roy

Sakaki
Guru


Joined: 21 May 2014
Posts: 409

Posted: Fri Aug 23, 2019 12:23 pm

roylongbottom wrote:
sakaki

For an up to date comparison, I have been running that HPL benchmark and my other MP tests on my Pi 3B+, using the new Gentoo. All ran without any problems, but there were two things to report.

The first was that the TV display started at 1024 x 768. Settings did not provide an option anywhere near 1920 x 1080.

The second point was that WiFi connected, without any intervention, using the originally entered password. Back on the Pi 4, it still did not connect.


Thanks for the feedback. Not sure what the issue with WiFi is, I have had no issue connecting on the 4B locally. If you could run:
Code:
demouser@pi64 ~ $ dmesg > kernel.log

and email me the results (removing anything sensitive first if you wish), I may be able to pinpoint what is happening.

Incidentally, I have pushed a version of mpich-3.3 (with the fortran USE flag set) to the binhost. To get it use:
Code:
demouser@pi64 ~ $ sudo emaint sync --repo genpi64
demouser@pi64 ~ $ sudo emerge -v --oneshot sys-cluster/mpich


As to the monitor settings, I think I introduced a regression in 1.5.0 by uncommenting the line "hdmi_drive=2" in /boot/config.txt.

Could you try reverting this and see if your monitor compatibility improves? You can do so by simply running the Applications -> Settings -> RPi Config Tool app, unchecking the "Force audio output in DMT modes" box, then clicking "Save and Exit" and rebooting when prompted. Be sure to confirm your settings when the system comes back up (you'll be prompted about this).

Also, per Neddy's points and the ARM paper on compiler settings linked above, it'd be interesting to see the effect on your benchmarks of e.g. compiling them with:
Code:
gcc -march=armv8-a+crc -mtune=cortex-a72 -O2 -pipe a_benchmark.c
options under GCC (and possibly -O3 instead of -O2, if you don't already use that). Code with the above settings will be optimized for the Pi4, but will also run on the Pi3.

The equivalent for optimized code on the Pi3 (which will also run on the Pi4):
Code:
gcc -march=armv8-a+crc -mtune=cortex-a53 -O2 -pipe a_benchmark.c
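
If it helps, the flag combinations above could be enumerated with a small script rather than typed by hand. The following is only a sketch in Python: the helper name build_commands and the output file names are made up for illustration, and it merely prints the gcc command lines (it does not run the compiler or the benchmark).

```python
import itertools
import shlex


def build_commands(source, march, mtunes, opt_levels):
    """Return one gcc argument list per (mtune, optimisation level) pair."""
    commands = []
    pairs = itertools.product(mtunes, opt_levels)
    for index, (mtune, opt) in enumerate(pairs):
        output = f"a_benchmark_{index}"  # hypothetical output file name
        commands.append(
            ["gcc", f"-march={march}", f"-mtune={mtune}", opt, "-pipe",
             source, "-o", output]
        )
    return commands


# Flags taken from the two command lines above, plus -O3 as the alternative level.
commands = build_commands(
    "a_benchmark.c",
    "armv8-a+crc",
    ["cortex-a72", "cortex-a53"],
    ["-O2", "-O3"],
)

for command in commands:
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Each printed line can then be run and timed separately, so the A72 and A53 tunings are compared on the same source.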

_________________
Regards,

sakaki

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 47163
Location: 56N 3W

Posted: Fri Aug 23, 2019 1:02 pm

roylongbottom,

See Pi 4 Wifi.
_________________
Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 47163
Location: 56N 3W

Posted: Fri Aug 23, 2019 1:09 pm

Team,

Sakaki wrote:
Also, per Neddy's points and the ARM paper on compiler settings linked above, it'd be interesting to see the effect on your benchmarks of e.g. compiling them with:
Code:
gcc -march=armv8-a+crc -mtune=cortex-a72 -O2 -pipe a_benchmark.c
options under GCC (and possibly -O3 instead of -O2, if you don't already use that). Code with the above settings will be optimized for the Pi4, but will also run on the Pi3.

The equivalent for optimized code on the Pi3 (which will also run on the Pi4):
Code:
gcc -march=armv8-a+crc -mtune=cortex-a53 -O2 -pipe a_benchmark.c


It would be interesting to see what the least worst settings are for code that has to run on both platforms.
My binhost is cortex-a53 but I would rebuild it with cortex-a72 if that produced the best results across both systems.
I'm tempted to do that anyway.
_________________
Regards,

NeddySeagoon

Sakaki
Guru

Joined: 21 May 2014
Posts: 409

Posted: Fri Aug 23, 2019 1:12 pm

NeddySeagoon wrote:
roylongbottom,

See Pi 4 Wifi.

The 1.5.0 image ought to have the issue pointed out above fixed (commit).
_________________
Regards,

sakaki

Sakaki
Guru

Joined: 21 May 2014
Posts: 409

Posted: Fri Aug 23, 2019 1:24 pm

NeddySeagoon wrote:
Team,
It would be interesting to see what was the least worst settings for code to run on both platforms.
My binhost is cortex-a53 but I would rebuild it with cortex-a72 if that produced the best results across both systems.
I'm tempted to do that anyway.

Also, ARM has a big.LITTLE architecture that allows work to be transferred on the fly between (e.g.) A72 and A53 cores depending on system load, and because of that there's a "cortex-a72.cortex-a53" mtune variant available for gcc also...

For the 1.5.0 release (and binhost --emptytree @world rebuild), I decided in the end that most people would end up shifting to the Pi4 in time, so migrated to straight-up a72 tuning.
_________________
Regards,

sakaki

NeddySeagoon
Administrator

Joined: 05 Jul 2003
Posts: 47163
Location: 56N 3W

Posted: Fri Aug 23, 2019 2:01 pm

Sakaki,

That's my thinking too but I've not done it yet.

My Acer R13 Chromebook is a big.LITTLE device but for now, it just runs my Pi3 Gentoo.
_________________
Regards,

NeddySeagoon

roylongbottom
n00b

Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Fri Aug 23, 2019 10:06 pm

Optimisers or Misers

I have been trying the suggested compiling parameters on various benchmarks, via Gentoo on a Pi 4B, but have not found one where they made a great deal of difference - unlike hardware architecture. No doubt there are some.

Below are results for the Livermore Loops, comprising 24 program kernels that were the most critical at Lawrence Livermore Laboratory when selecting a new supercomputer. The tables show the compile parameters used. The first table gives the measured MFLOPS for each kernel, and the second gives ratios relative to the first, original 64 bit version.

Code:

MFLOPS 
                                                                         
 original gcc  lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a
            -o liverloopsPi64

    1943  999  951  924  372  681 2067 2538 2041  674  495  862
     224  445  812  711  753 1164  443  397  915  408  822  283

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a
            -o liverloopsPi64
    1982  986  961  964  384  753 2316 2743 1907  871  500  965
     148  411  814  668  725 1167  449  397 1680  557  817  283

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a
            -mtune=cortex-a72 -o liverloopsPi64
    1965  962  996  965  388  512 2021 1900 1956  875  483  974
     173  400  815  633  748 1184  450  397 1577  560  823  312

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a+crc
        -mtune=cortex-a72 -o liverloopsPi64
    1926  960  962  965  382  683 2043 2374 1441  624  500  969
     175  413  815  637  748 1172  450  397 1488  553  824  312

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a+crc
            -mtune=cortex-a72 -pipe -o liverloopsPi64
    2153  961  992  964  388  668 2056 2399 2088  793  500  973
     169  417  814  621  748 1152  449  397 1677  551  822  312

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O2 -march=armv8-a+crc
            -mtune=cortex-a72 -pipe -o liverloopsPi64
    2206 1218  995  965  206  766 2284 1739 2090  667  500  741
     222  365  813  652  746 1116  449  393  639  560  602  125

gcc 9 gcc  lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a -pipe
         -o liverloopsPi64
    2130  986  989  965  389  681 2336 1692 1976  678  500  969
     177  408  814  668  726 1199  449  397 1651  559  816  283

gcc 9 gcc  lloops2.c cpuidc.c -lm -lrt -O3  -o liverloopsPi64
    2154  988  977  965  389  731 2328 2841 2078  703  500  977
     177  414  815  668  727 1188  450  397 1640  562  820  283


Following are the comparisons with the speeds of the first, original 64 bit version. Using the same parameters with gcc 9 produced a slight average improvement. Performance went downhill when the suggested parameters were included; the worst result was with -O2 and the best with no parameters other than -O3.

Perhaps this is the result of compiling on the target computer.

Code:

Ratios

 original gcc  lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a  Average

    1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00   1.00
    1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a 
            -o liverloopsPi64
    1.02 0.99 1.01 1.04 1.03 1.10 1.12 1.08 0.93 1.29 1.01 1.12   1.06
    0.66 0.93 1.00 0.94 0.96 1.00 1.01 1.00 1.84 1.37 0.99 1.00

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a
            -mtune=cortex-a72 -o liverloopsPi64
    1.01 0.96 1.05 1.04 1.04 0.75 0.98 0.75 0.96 1.30 0.97 1.13   1.03
    0.77 0.90 1.00 0.89 0.99 1.02 1.02 1.00 1.72 1.37 1.00 1.11

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a+crc
            -mtune=cortex-a72 -o liverloopsPi64
    0.99 0.96 1.01 1.04 1.03 1.00 0.99 0.94 0.71 0.92 1.01 1.12   1.02
    0.78 0.93 1.00 0.89 0.99 1.01 1.02 1.00 1.63 1.36 1.00 1.11

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a+crc
            -mtune=cortex-a72 -pipe -o liverloopsPi64
    1.11 0.96 1.04 1.04 1.04 0.98 0.99 0.95 1.02 1.18 1.01 1.13   1.05
    0.76 0.94 1.00 0.87 0.99 0.99 1.01 1.00 1.83 1.35 1.00 1.10

gcc 9 - gcc lloops2.c cpuidc.c -lm -lrt -O2 -march=armv8-a+crc
            -mtune=cortex-a72 -pipe -o liverloopsPi64
    1.14 1.22 1.05 1.04 0.55 1.13 1.11 0.68 1.02 0.99 1.01 0.86   0.95
    0.99 0.82 1.00 0.92 0.99 0.96 1.01 0.99 0.70 1.37 0.73 0.44

gcc 9 gcc  lloops2.c cpuidc.c -lm -lrt -O3 -march=armv8-a -pipe
           -o liverloopsPi64
    1.10 0.99 1.04 1.04 1.05 1.00 1.13 0.67 0.97 1.01 1.01 1.12   1.04
    0.79 0.92 1.00 0.94 0.96 1.03 1.01 1.00 1.81 1.37 0.99 1.00

gcc 9 gcc  lloops2.c cpuidc.c -lm -lrt -O3  -o liverloopsPi64
    1.11 0.99 1.03 1.05 1.04 1.07 1.13 1.12 1.02 1.04 1.01 1.13   1.07
    0.79 0.93 1.00 0.94 0.97 1.02 1.01 1.00 1.79 1.38 1.00 1.00

_________________
Regards

Roy

roylongbottom
n00b

Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Sun Aug 25, 2019 9:48 am

Sakaki wrote:


As to the monitor settings, I think I introduced a regression in 1.5.0 by uncommenting the line "hdmi_drive=2" in /boot/config.txt.

Could you try reverting this and see if your monitor compatibility improves? You can do so by simply running the Applications -> Settings -> RPi Config Tool app, and unchecking the "Force audio output in DMT modes" box, then click "Save and Exit" rebooting when prompted. Be sure to confirm your settings when the system comes back up (you'll be prompted about this).


I reverted that line in config.txt. Then, the particular monitor worked perfectly with full screen displays on both input sockets. The other monitor is better, displaying the coloured square and booting text but then goes off line.
_________________
Regards

Roy

roylongbottom
n00b

Joined: 13 Feb 2017
Posts: 64
Location: Essex, UK

Posted: Mon Sep 02, 2019 7:47 am

Sakaki

My WiFi is now working using v1.5.1 bugfix release, on my two Pi 4s and a Pi 3B+. I have also found that two monitors and a TV display at the correct resolution.
_________________
Regards

Roy

Sakaki
Guru

Joined: 21 May 2014
Posts: 409

Posted: Mon Sep 02, 2019 12:02 pm

roylongbottom wrote:
Sakaki

My WiFi is now working using v1.5.1 bugfix release, on my two Pi 4s and a Pi 3B+. I have also found that two monitors and a TV display at the correct resolution.
Thanks for the feedback, happy to hear these features are now working for you!

PS it appears that the Python interpreter (at least, v3.7.3) runs programs significantly more slowly, on average, in 64bit (both Gentoo and Debian) than in 32bit. I copy my original post to the RPi forums below, as it may be of interest (some further discussion may be found here, ff):

sakaki wrote:
Hello,

apologies for the slight OT, but the Python 64-bit performance discrepancy mentioned above caught my attention (and I observed it also, trying out selfgrams.py), so I decided to try some more detailed benchmarking, the results of which are reported below.

For the tests, I set up a Raspbian Buster system with a 64-bit kernel, but otherwise stock, on a 4GiB Pi4, 1.5GHz, performance CPU governor, Pimoroni fan shim (so no thermal throttling). I then ran the pyperformance benchmark suite in a clean virtualenv. Python v3.7.3 was used. This was the baseline.

I then ran the same test suite in:

  • a 32-bit armhf Debian Buster chroot (same 64-bit kernel, Raspbian host OS, physical machine), Python v3.7.3;
  • a 64-bit arm64 Debian Buster chroot (ditto);
  • a Gentoo64 v1.5.0 system booted under the same kernel, Python v3.7.3 (built under the stock (-O2 no-pgo) settings);
  • ditto, but with Python v3.7.3 built using -O3 and profile guided optimization (since Debian appears to use this now);
  • ditto (but with Python v3.7.4 built using -O3 and profile guided optimization).


I then normalized the reported runtime statistics for each sub-benchmark in the suite, so that 1.00 = the time taken by the 32-bit baseline [1], and then took the median [2] of the full suite's relative performance for each platform as an overall performance measure (lower is better).
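
As a rough illustration of that normalization step, here is a minimal Python sketch. The timings are invented placeholder numbers, not measurements, and the benchmark and platform names are hypothetical; only the method (divide by the baseline's time, then take the per-platform median) follows the description above.

```python
from statistics import median

# Hypothetical runtimes in seconds. These are NOT measured values.
timings = {
    "bench_a": {"raspbian32": 2.0, "debian64": 2.4, "gentoo64": 2.8},
    "bench_b": {"raspbian32": 1.0, "debian64": 1.1, "gentoo64": 1.3},
    "bench_c": {"raspbian32": 4.0, "debian64": 5.0, "gentoo64": 5.2},
}


def normalise(timings, baseline):
    """Divide every runtime by the baseline platform's runtime for that benchmark."""
    return {
        bench: {platform: t / per_platform[baseline]
                for platform, t in per_platform.items()}
        for bench, per_platform in timings.items()
    }


def median_by_platform(relative):
    """Median relative runtime per platform across all benchmarks (lower is faster)."""
    platforms = next(iter(relative.values())).keys()
    return {p: median(row[p] for row in relative.values()) for p in platforms}


relative = normalise(timings, baseline="raspbian32")
medians = median_by_platform(relative)

for platform, value in medians.items():
    print(f"{platform:12s} {value:.2f}")
```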

The results are tabulated below. Very rough, and with the caveats that apply to any benchmarks, but in summary:

  • The Python interpreter (at least, v3.7.3) seems to run programs faster on average in 32-bit than 64-bit (whether Gentoo or Debian), by a significant margin.
  • Debian armhf (32-bit) is marginally faster than Raspbian 32-bit, on a median basis.
  • The stock (-O2, no pgo) Gentoo 64-bit v3.7.3 is significantly slower than Debian's 64-bit arm64 version... however
  • Once I turned on -O3 and pgo (which appear to be Debian's default build settings, and are also now mine for Python from the forthcoming v1.5.1 release onwards ^-^) Gentoo 64 marginally outperformed Debian 64 at v3.7.3 (although it was still slower than both 32-bit variants tested).
  • The v3.7.4 Gentoo 64-bit Python loses some ground against v3.7.3, but still keeps up with Debian64 v3.7.3 (there are some apparent performance regressions in there, such as unpickle_list, which account for most of this).


Results:
Code:

pyperformance benchmark, Pi4, common 64-bit kernel, fan shim, 1.5GHz performance governor

                         Raspbian     Debian     Debian     Gentoo     Gentoo     Gentoo
                           32-bit     32-bit     64-bit     64-bit     64-bit     64-bit
                            stock      armhf      stock      stock    -O3 pgo    -O3 pgo
             Benchmark     v3.7.3     v3.7.3     v3.7.3     v3.7.3     v3.7.3     v3.7.4
-----------------------------------------------------------------------------------------
 Median (lower=faster)       1.00       0.98       1.19       1.37       1.16       1.19
-----------------------------------------------------------------------------------------
                  2to3       1.00       1.00       1.23       1.34       1.18       1.20
             chameleon       1.00       0.93       1.12       1.32       1.04       1.05
                 chaos       1.00       0.94       1.26       1.45       1.21       1.22
          crypto_pyaes       1.00       0.97       1.23       1.41       1.13       1.20
             deltablue       1.00       1.03       1.26       1.45       1.25       1.26
       django_template       1.00       0.94       1.28       1.45       1.21       1.24
           dulwich_log       1.00       0.93       1.15       1.33       1.13       1.13
              fannkuch       1.00       1.08       1.17       1.29       1.06       1.07
                 float       1.00       1.02       1.16       1.45       1.20       1.26
           genshi_text       1.00       1.05       1.27       1.39       1.16       1.20
            genshi_xml       1.00       0.99       1.23       1.34       1.11       1.16
                    go       1.00       1.00       1.19       1.35       1.17       1.18
                hexiom       1.00       1.07       1.23       1.47       1.20       1.25
              html5lib       1.00       0.98       1.22       1.36       1.19       1.19
            json_dumps       1.00       0.89       1.14       1.26       1.02       1.04
            json_loads       1.00       0.94       1.09       1.24       0.92       0.93
        logging_format       1.00       0.92       1.19       1.35       1.14       1.14
        logging_silent       1.00       1.10       1.38       1.47       1.16       1.26
        logging_simple       1.00       0.93       1.19       1.35       1.14       1.12
                  mako       1.00       1.03       1.26       1.46       1.15       1.19
        meteor_contest       1.00       1.02       1.18       1.28       1.09       1.13
                 nbody       1.00       1.03       1.03       1.23       1.04       1.07
               nqueens       1.00       1.00       1.22       1.53       1.22       1.24
               pathlib       1.00       0.93       1.23       1.44       1.21       1.19
                pickle       1.00       0.86       1.08       1.22       0.95       0.97
           pickle_dict       1.00       1.03       1.09       1.46       1.09       1.10
           pickle_list       1.00       1.03       1.20       1.59       1.10       1.07
    pickle_pure_python       1.00       1.05       1.27       1.50       1.19       1.22
              pidigits       1.00       0.96       0.57       0.60       0.56       0.56
        python_startup       1.00       0.89       1.13       1.27       1.18       1.16
python_startup_no_site       1.00       0.88       1.09       1.25       1.18       1.14
              raytrace       1.00       0.98       1.21       1.47       1.24       1.26
         regex_compile       1.00       0.99       1.19       1.40       1.16       1.19
             regex_dna       1.00       1.12       1.00       0.99       0.92       0.84
          regex_effbot       1.00       1.22       1.03       1.04       0.94       0.96
              regex_v8       1.00       1.36       1.18       1.28       1.07       1.06
              richards       1.00       1.06       1.21       1.38       1.22       1.21
           scimark_fft       1.00       0.96       1.07       1.22       1.08       1.14
            scimark_lu       1.00       1.03       1.31       1.58       1.28       1.34
   scimark_monte_carlo       1.00       1.01       1.27       1.51       1.25       1.41
           scimark_sor       1.00       1.06       1.12       1.37       1.22       1.24
scimark_sparse_mat_mul       1.00       0.98       0.91       1.18       1.02       1.04
         spectral_norm       1.00       0.98       1.11       1.27       1.10       1.09
sqlalchemy_declarative       1.00       0.94       1.20       1.34       1.23       1.33
 sqlalchemy_imperative       1.00       0.90       1.20       1.37       1.24       1.54
          sqlite_synth       1.00       0.84       1.12       1.29       1.08       1.12
          sympy_expand       1.00       0.99       1.18       1.41       1.19       1.53
       sympy_integrate       1.00       0.96       1.23       1.45       1.21       1.36
             sympy_str       1.00       0.96       1.21       1.45       1.20       1.44
             sympy_sum       1.00       0.93       1.24       1.46       1.23       1.39
                 telco       1.00       0.85       1.28       1.40       0.98       1.06
          tornado_http       1.00       0.92       1.17       1.30       1.16       1.22
       unpack_sequence       1.00       1.05       0.96       0.99       0.98       0.95
              unpickle       1.00       0.97       1.12       1.32       1.04       1.52
         unpickle_list       1.00       1.15       1.28       1.38       1.17       2.25
  unpickle_pure_python       1.00       1.10       1.38       1.57       1.28       1.33
    xml_etree_generate       1.00       0.96       1.40       1.69       1.31       1.29
   xml_etree_iterparse       1.00       1.04       1.16       1.36       1.13       1.07
       xml_etree_parse       1.00       0.88       1.00       1.16       0.97       0.99
     xml_etree_process       1.00       0.96       1.39       1.58       1.29       1.28



Hope that is of some interest! On the basis of the above, I'd expect the performance gap for the Python programs in the chart below (which I copy again for ease of reference) to narrow significantly under the forthcoming v1.5.1 Gentoo 64 release, but still underperform the 32-bit Raspbian tests (in contrast to the Rust and most of the C/C++ tests, which do better under 64-bit). NB for avoidance of doubt this chart has not been updated using the -O3/pgo v3.7.3 or v3.7.4 64-bit Python builds yet.

http://fractal.math.unr.edu/~ejolson/pi/anagram/fame64.png

Best,

sakaki

[1] Oops, just noticed this is reversed to the "divide by 64-bit runtime" metric used in the chart. Apologies for any confusion!
[2] I guess taking logs might have been an idea first >< ... but don't think this will affect things too much.
Edit: just confirmed this: working with log relative performance gives the same comparative ranking:

  • Debian 32-bit armhf v3.7.3 (fastest)
  • Raspbian 32-bit v3.7.3
  • Gentoo 64-bit v3.7.3 -O3 pgo
  • Debian 64-bit v3.7.3 / Gentoo 64-bit v3.7.4 -O3 pgo (dead heat)
  • Gentoo 64-bit v3.7.3 -O2 no-pgo (slowest)
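
One way to work with log relative performance is to average the logs of the relative times (equivalent to a geometric mean) and rank on that. The post does not say exactly how the logs were used, so treat the following Python sketch as an assumption. It uses only five rows copied from the table above (2to3, chaos, float, go, nbody), so it illustrates the method rather than reproducing the full-suite result.

```python
import math

platforms = [
    "Raspbian 32-bit",
    "Debian 32-bit armhf",
    "Debian 64-bit",
    "Gentoo 64-bit stock",
    "Gentoo 64-bit -O3 pgo v3.7.3",
    "Gentoo 64-bit -O3 pgo v3.7.4",
]

# Relative runtimes for a small subset of the table (columns in the order above).
rows = {
    "2to3":  [1.00, 1.00, 1.23, 1.34, 1.18, 1.20],
    "chaos": [1.00, 0.94, 1.26, 1.45, 1.21, 1.22],
    "float": [1.00, 1.02, 1.16, 1.45, 1.20, 1.26],
    "go":    [1.00, 1.00, 1.19, 1.35, 1.17, 1.18],
    "nbody": [1.00, 1.03, 1.03, 1.23, 1.04, 1.07],
}


def mean_log(values):
    """Mean of the natural logs; exp() of this is the geometric mean."""
    return sum(math.log(v) for v in values) / len(values)


scores = {
    platform: mean_log([row[i] for row in rows.values()])
    for i, platform in enumerate(platforms)
}

# Lower score means faster, so sort ascending.
ranking = sorted(platforms, key=scores.get)

for platform in ranking:
    print(f"{platform:30s} geometric mean {math.exp(scores[platform]):.3f}")
```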

_________________
Regards,

sakaki