AMD Compilation Flags in gcc.
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Hi All,

I've tried recompiling the enigma.c code several different ways in 64-bit mode. The quickest time I've gotten using the benchmark test on my overclocked Phenom (2.93 GHz) is about 3m 11s with the following flags:

CHOST='x86_64-pc-linux-gnu'
CFLAGS='-O3 -ffast-math -funroll-all-loops -fpeel-loops -march=amdfam10'

Adding -pipe and -msse3 slowed me down by about 3-4 secs. -O3 beats -O2 by about 1 second. I also tried profile optimization, and got an extra 3 seconds of run time for my trouble. :-( Unfortunately, the Pentium III 32-bit executable TJM came up with beats my best compilation so far by about 1 second. :-(

Does anyone have other flags I should try that have increased the speed of 64-bit AMD compilations? I was thinking of trying some of the ACML libraries, but I am out of my element trying to work with C (the last FORTRAN compilation I did was back in the early 1990s, then I found MS Excel :-) ). With the Phenom, less seems to be more with the compilation flags. I'll play with a few more tonight, but I have a feeling there will be no dramatic increase in speed.

One other thing I noticed with the GNOME System Monitor is that the benchmark seems to change the core it's running on mid-way through the test. I saw several instances of the task jumping from core 3 to core 2, and vice versa. I'm wondering if this is a heat-dissipation or heat-distribution scheme by AMD so you don't thermally stress the processor? I would imagine this does not happen when all 4 cores are working at full capacity, but has anyone else noticed this? I have the 995Z 125W version of the processor.

Mike Doerner

PS I'm using gcc 4.3.1 from openSUSE 11.0.
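Roughly how those flags get applied, for anyone who wants to compare: a minimal sketch, assuming the project's Makefile honors CFLAGS at compile and link time; the benchmark invocation is just a placeholder.

```
# flag set from the post above; everything else here is illustrative
export CHOST='x86_64-pc-linux-gnu'
export CFLAGS='-O3 -ffast-math -funroll-all-loops -fpeel-loops -march=amdfam10'
make clean && make
time ./enigma <benchmark arguments>   # placeholder command line for the benchmark test
```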
TJM (Project administrator, Project developer, Project scientist) | Joined: 25 Aug 07 | Posts: 843 | Credit: 267,994,998 | RAC: 0
To build this one I used gcc 3.2, cflags: -march=pentium3 -O3

M4 Project homepage | M4 Project wiki
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
No -ffast-math flag? Hmmmm. Too bad gcc 3.2 only has the -m32 and -m64 flags. I might try that, but I'm not sure how the optimizations in 4.3.1 (-march=amdfam10) compare with 3.2 (-m64) in the compiled code, and other than installing 3.2 on my box, I'm not sure how else to evaluate the two.

I wonder how -march=pentium3 would do with -ffast-math, -funroll-all-loops and -fpeel-loops added. I'd think the AMD optimizations would do as well if not better on the Intel platform. I'll try it this weekend if I have time.

Mike Doerner

PS Also, the compile command for the executable is called out in the Makefile. If I want to play with Intel's compiler or gcc 3.2, how do I get the Makefile to recognize an alternate compiler? Compilation seems to default to gcc, or at least to openSUSE's default.
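(I suppose overriding CC on the make command line might work, if the Makefile uses the conventional CC/CFLAGS variables; something like this, with hypothetical paths:)

```
# GNU make lets command-line variable assignments override the Makefile's own
make clean
make CC=/opt/gcc-3.2/bin/gcc      # hypothetical install path for an older gcc
make CC=icc CFLAGS='-O3'          # or Intel's compiler, if it's on the PATH
```

If the Makefile hard-codes gcc in its build rules instead of using $(CC), the rule itself would need editing.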
TJM (Project administrator, Project developer, Project scientist) | Joined: 25 Aug 07 | Posts: 843 | Credit: 267,994,998 | RAC: 0
You could try setting up a virtual machine and installing an older Linux distribution on it; the gcc I used was Mandrake 9's default version.
> I'd think the AMD optimizations would do as well if not better on the Intel platform. I'll try it this weekend if I have time.

I think an executable compiled with Intel's C compiler will outperform anything built with gcc on Intel's processors. Take a look at this host: http://www.enigmaathome.net/results.php?hostid=1369 . It's a 996 MHz PIII with PC-133 SDRAM running the Intel-optimized executable.

M4 Project homepage | M4 Project wiki
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
I think I've got an app faster than the PIII 32-bit app for the Phenom in 64-bit mode... :-) TJM, if you can verify, I'd appreciate it. This app seems to run 3-4 seconds faster than the PIII executable with the benchmark you provided. I found newer AMD-recommended flags for gcc 4.3.0...

1st compile....
Ran the start command twice... 2nd compile.....
The 1st compile seemed to equal the speed of the PIII app, but after re-compiling with -fprofile-use it seems to have been optimized further. I didn't see any *.da files like the gcc 4.2.0 documentation said would be there, but I figured, what the heck, I'll give it a shot anyway.

Also, if you'd like to download my executable, here it is: http://members.cox.net/mdoerner1/enigma_Phenom64 (I haven't zipped it or anything, it's just the raw executable). Let me know if my time savings are correct.

The ACML libraries are a bit out of my league, and not having the time to analyze the source code, I'm not sure if the BLAS functions would save any time in computation.

Mike Doerner

PS Here are the latest AMD compilation flags for gcc: http://developer.amd.com/assets/AMD_GCC_Quick_Reference_Guide080509.pdf

PPS Strangely enough, the -march code was 1 second slower than -mtune. Go figure.
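For context, a two-pass profile-guided build with gcc 4.3 generally looks like the sketch below; the base flag set, the Makefile behavior, and the benchmark command here are illustrative rather than the exact commands used.

```
# pass 1: instrumented build (assumes the Makefile also passes CFLAGS at link time,
# so libgcov gets linked in)
export BASE='-O3 -ffast-math -funroll-all-loops -march=amdfam10'
make clean && make CFLAGS="$BASE -fprofile-generate"
./enigma <benchmark arguments>        # run the workload; gcc 4.x writes *.gcda profile files
# pass 2: rebuild using the collected profile
make clean && make CFLAGS="$BASE -fprofile-use"
```

Note that gcc 4.x writes .gcda files rather than the .da files of older releases, which would explain why no *.da files showed up.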
TJM (Project administrator, Project developer, Project scientist) | Joined: 25 Aug 07 | Posts: 843 | Credit: 267,994,998 | RAC: 0
> I think I've got an app faster than the PIII 32-bit app for the Phenom in 64-bit mode... :-)

I can't try your executable, because the only Phenom I have (the BOINC server) is running a 32-bit OS (Debian 4). I built a 32-bit version with the same cflags; it seems to run a bit faster than the previous one I used.

Btw, if the benchmark results between two apps are similar, try running a real workunit and check its runtime. The benchmark is specific: it runs enigma on a very short ciphertext (shorter than hceyz72). Sometimes a very small difference (like 1-3% in benchmark time) can turn into a huge speedup on longer workunits.

M4 Project homepage | M4 Project wiki
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Glad to see those flags work for you, then. I've also overclocked my processor since I optimized my 64-bit app, so it's harder to compare apples to apples, so to speak. However, I'll see if I can come up with some sort of comparison.

You're right, though: a 4-second saving on a 3.25-minute benchmark run equates to around a minute saved on each awgly100_0 that is run (the workunits take on the order of 2,400-2,800 seconds, a dozen or so benchmark lengths). Considering the number of workunits left, that will help get us to the end quicker.

My only apprehension is that you'll use those flags for an Intel-optimized build too, making us AMD users even less competitive... :-( Oh well. Whatever works best.

Mike Doerner
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
These times are at 2.6 GHz....
And the Intel....
My Phenom is currently clocked at 2.93 GHz, and I'm getting 2,487.11 secs of CPU time on awgly100_0_2153304_r0 with my new optimized app. 2.6 / 2.93 = 0.887, so the old optimized app's 2,765.39 secs should scale to about 2,453.93 secs if I had overclocked earlier. Right in the ballpark, but it's hard to say whether I've actually saved a minute of computing time, as I see completion times vary between the different awgly100s. I'll just have to watch it, I guess... :-)

Mike Doerner
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Here's some of the low times since I compiled the app....

awgly100_0_2380425_r0  2,465.53 secs (these results may have been while I was briefly running at 3.0 GHz)
awgly100_0_2032261_r0  2,494.51 secs
awgly100_0_2027623_r0  2,484.84 secs
awgly100_0_2360478_r0  2,475.29 secs
awgly100_0_2354671_r0  2,458.55 secs
awgly100_0_2352802_r0  2,481.73 secs
awgly100_0_2347369_r0  2,454.21 secs

....kinda all over the place, but compared to before 10/19/08....

awgly100_0_2246402_r0  2,493.26 secs
awgly100_0_2239341_r0  2,438.31 secs

....really all over the place. Who knows....

Mike Doerner
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
Try -ftree-vectorize to enable vectorization, since all x86-64 systems must support at least SSE2.

HTH
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
> Try -ftree-vectorize to enable vectorization, since all x86-64 systems must support at least SSE2.

Howdie. What is vectorization, and how will it help the Enigma code run faster? :-)

Mike Doerner
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
Vectorization is another term for SIMD (single instruction, multiple data); in other words, a single operation, say a multiply, applied to several data elements at the same time.

SSE2 has 128-bit registers, each of which can hold 2 double-precision or 4 single-precision floating-point numbers (and similarly for integer types). Through vectorization it's possible to operate on two such registers full of data, so a single instruction delivers 2 (double-precision) or 4 (single-precision) results: a 2- or 4-fold improvement in peak performance, with 1.25- to 2.5-fold being more realistic in actual programs.

GCC is capable of placing multiple data elements in the SSE registers and using SIMD instructions to operate on them via the -ftree-vectorize option.

HTH
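To see whether it buys anything for the Enigma sources, gcc's vectorizer report will list which loops it managed to vectorize. A sketch along those lines, assuming enigma.c compiles standalone (otherwise the flags just go into CFLAGS in the Makefile):

```
# gcc 4.3-era report flag: notes go to stderr, with a "LOOP VECTORIZED"
# line for each loop that now uses SSE packed instructions
gcc -O3 -ffast-math -march=amdfam10 -ftree-vectorize \
    -ftree-vectorizer-verbose=2 -c enigma.c
```

If nothing is reported as vectorized, the flag won't change the runtime at all.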
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Thanks for explaining that. I'll try recompiling it with that flag enabled....

Mike Doerner
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
DOH! Looking at the gcc documentation, -ftree-vectorize is turned on by the -O3 flag. Looks like it's already on.

Mike Doerner
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
> DOH! Looking at the gcc documentation, -ftree-vectorize is turned on by the -O3 flag. Looks like it's already on.

Yes, but only in GCC 4.3 or later, not in earlier versions. And the vectorizer can only emit SSE code when the target supports it: always on x86-64 processors, but on x86 only when -msse2 is specified, and the resulting binary will then fail to run on processors which don't support SSE2.

HTH
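In concrete terms, the difference between the 64-bit and 32-bit cases might look like this (a sketch; only the flags matter, the compile commands are illustrative):

```
gcc -m64 -O3 -c enigma.c                  # x86-64: SSE2 is the baseline, and -O3 implies -ftree-vectorize (gcc 4.3)
gcc -m32 -O3 -msse2 -c enigma.c           # 32-bit: SSE2 must be requested, and SSE2-less CPUs can't run the result
gcc -m32 -O3 -march=pentium3 -c enigma.c  # PIII implies only SSE1, so no SSE2 vectorization here
```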
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
OK, now I have a better understanding of how the gcc flags interact. I have recompiled again with your suggestion, and it looks like I'm 6 to 7 seconds faster than TJM's 32-bit P3 app. I am using gcc 4.3.1 with openSUSE 11.0. Here are the flags I used for the last compile....

1st compile.

Profiling didn't seem to give much of a speed improvement compared to a regular compile without profiling enabled. Got any other tricks up your sleeve? ;-) -ftree-parallelize-loops=4 just seems to run faster than 2 or 8, the other values I tried. Fortunately, the task did NOT thread to the other cores on my Phenom when I tried this.

PS I've also tried

export CFLAGS='-m64 -O3 -funroll-all-loops -ffast-math -march=amdfam10 -msse4a -mabm -fprefetch-loop-arrays -combine -fwhole-program'

but it seems to run at the same speed as the previously listed flags, FWIW.

Mike Doerner
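For the record, the correctly spelled autopar flag and roughly how it would go into the build; this is a sketch (the flag set mirrors the ones above, everything else is assumed), and whether anything actually gets parallelized depends entirely on what gcc finds in the loops.

```
# -ftree-parallelize-loops is new in gcc 4.3; the generated threads use the
# libgomp runtime, so the flag should reach the link step as well (here via CFLAGS)
export CFLAGS='-m64 -O3 -ffast-math -funroll-all-loops -march=amdfam10 -ftree-parallelize-loops=4'
make clean && make
# if gcc really split a loop across threads, a single running task keeps several
# cores busy in top; one busy core means nothing was parallelized
```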
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
Naturally, as the Phenom has 4 cores. But I wonder why you didn't see the other cores busy. The threads don't show up in top as separate processes populating the cores, but you would notice that the cores are busy. Or perhaps the compiler couldn't parallelize the code.

> export CFLAGS='-m64 -O3 -funroll-all-loops -ffast-math -march=amdfam10 -msse4a -mabm -fprefetch-loop-arrays -combine -fwhole-program' but it seems to run at the same speed as the previously listed flags, FWIW.

Not surprising, as neither SSE4A nor ABM adds critical instructions to typical programs; most of them are for bit manipulation. Also, when building for x86-64, the options -m64 and -msse2 are the default, so there's no need to specify them, though they don't hurt either.

Now, an interesting option would be -msse3. SSE3 adds helpful instructions for matrix multiplication and operations on complex numbers, so if Enigma relies on algorithms using such operations, it might benefit too. Unfortunately, according to my tentative analysis here, systems supporting SSE3 are not as common as one might expect. So if you want the application to run well on most systems, it's better not to use it, and to drop -march=amdfam10 (AKA -march=barcelona) as well.

For the x86 application, you might want to try the option -mpc64, which reduces the precision of floating-point calculations from 80 to 64 bits, making them faster and giving results similar to those produced by x86-64.

HTH
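Putting those portability suggestions together, a conservative 32-bit flag set might look like this; it is only an illustration of the advice above, not a tested configuration:

```
# runs on any x86 host: no SSE2/SSE3 requirement, no Barcelona-specific -march,
# and -mpc64 drops x87 precision from 80 to 64 bits for a bit more FP speed
export CFLAGS='-m32 -O3 -ffast-math -funroll-all-loops -mtune=generic -mpc64'
make clean && make
```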
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Supposedly AMDs are faster at floating-point math than Intels, but looking at the results so far, the Intel is about 25% faster at integer math than my AMD processor. The completion times suggest little floating-point math is used to generate the results, making my AMD look like an abacus compared to the Intel. :-)

Since enigma was never designed as a multi-threaded application (and BOINC might choke if all 4 tasks became multi-threaded), that may explain why turning on -ftree-parallelize-loops didn't do anything.

The only other trick I can think of to speed this up is to try incorporating the ACML libraries from AMD into the enigma.c code (and maybe other source files) and calling some of the BLAS routines to speed up the integer operations. Being a Mechanical Engineer, computer programming is not my strength. ;-) I don't know if including the ACML headers is enough, or if ACML has specific syntax for pulling its functionality into the code.

Mike Doerner
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
> Supposedly AMDs are faster at floating-point math than Intels, but looking at the results so far, the Intel is about 25% faster at integer math than my AMD processor.

AMD does pretty well in FP, but with Intel's Core 2 the gap is smaller, if not gone, while Core 2 has zoomed past AMD in integer. It's truly a very good architecture.

> The only other trick I can think of to speed this up is to try incorporating the ACML libraries from AMD into the enigma.c code (and maybe other source files) and calling some of the BLAS routines to speed up the integer operations.

Perhaps, but I've never used ACML before. It does have many routines rewritten in assembler to extract every bit of performance from AMD processors, but not all of them were. So it may be a matter of making sure that the routines important to Enigma are indeed among the optimized ones.

HTH