AMD Compilation Flags in gcc.
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Hi All,

I've tried recompiling the enigma.c code several different ways in 64-bit mode. The quickest time I've gotten using the benchmark test on my overclocked Phenom (2.93 GHz) is about 3m 11s with the following flags:

CHOST='x86_64-pc-linux-gnu'
CFLAGS='-O3 -ffast-math -funroll-all-loops -fpeel-loops -march=amdfam10'

Adding -pipe and -msse3 slowed me down by about 3-4 secs. -O3 beats -O2 by about 1 second. I also tried profile optimization, and got an extra 3 seconds of run time for my trouble. :-( Unfortunately, the Pentium III 32-bit executable TJM came up with beats my best compilation so far by about 1 second. :-(

Does anyone have other flags I should try that have increased the speed of 64-bit AMD compilations? I was thinking of trying some of the ACML libraries, but I am out of my element trying to work with C (the last FORTRAN compilation I did was back in the early 1990s, then I found MS Excel :-) ). With the Phenom, less seems to be more with the compilation flags. I'll play with a few more tonight, but I have a feeling there will be no dramatic increase in speed.

One other thing I noticed with the GNOME System Monitor is that the benchmark seems to change the core it's running on mid-way through the test. I saw several instances of the task jumping from core 3 to core 2, and vice versa. I'm wondering if this is a heat-dissipation or heat-distribution scheme by AMD so you don't thermally stress the processor? I would imagine this does not happen when all 4 cores are working at full capacity, but has anyone else noticed this? I have the 995Z 125W version of the processor.

Mike Doerner

PS I'm using gcc 4.3.1 from openSUSE 11.0.
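Roughly how those flags get applied, for anyone who wants to compare: a minimal sketch, assuming the project's Makefile honors CFLAGS at compile and link time; the benchmark invocation is just a placeholder.

```
# flag set from the post above; everything else here is illustrative
export CHOST='x86_64-pc-linux-gnu'
export CFLAGS='-O3 -ffast-math -funroll-all-loops -fpeel-loops -march=amdfam10'
make clean && make
time ./enigma <benchmark arguments>   # placeholder command line for the benchmark test
```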
TJM (Project administrator, Project developer, Project scientist) | Joined: 25 Aug 07 | Posts: 843 | Credit: 267,994,998 | RAC: 0
To build this one I used gcc 3.2, cflags: -march=pentium3 -O3

M4 Project homepage | M4 Project wiki
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
No -ffast-math flag? Hmmmm. Too bad gcc 3.2 only has the -m32 and -m64 flags. I might try that, but I'm not sure how the optimizations in 4.3.1 (-march=amdfam10) compare with 3.2 (-m64) in the compiled code, and other than installing 3.2 on my box, I'm not sure how else to evaluate the two.

I wonder how -march=pentium3 would do with -ffast-math, -funroll-all-loops and -fpeel-loops added. I'd think the AMD optimizations would do as well if not better on the Intel platform. I'll try it this weekend if I have time.

Mike Doerner

PS Also, the compile command for the executable is called out in the Makefile. If I want to play with Intel's compiler or gcc 3.2, how do I get the Makefile to recognize an alternate compiler? Compilation seems to default to gcc, or at least to openSUSE's default.
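(I suppose overriding CC on the make command line might work, if the Makefile uses the conventional CC/CFLAGS variables; something like this, with hypothetical paths:)

```
# GNU make lets command-line variable assignments override the Makefile's own
make clean
make CC=/opt/gcc-3.2/bin/gcc      # hypothetical install path for an older gcc
make CC=icc CFLAGS='-O3'          # or Intel's compiler, if it's on the PATH
```

If the Makefile hard-codes gcc in its build rules instead of using $(CC), the rule itself would need editing.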
TJM (Project administrator, Project developer, Project scientist) | Joined: 25 Aug 07 | Posts: 843 | Credit: 267,994,998 | RAC: 0
You could try setting up a virtual machine and installing an older Linux distribution on it; the gcc I used was Mandrake 9's default version.
> I'd think the AMD optimizations would do as well if not better on the Intel platform. I'll try it this weekend if I have time.

I think an executable compiled with Intel's C compiler will outperform anything built with gcc on Intel's processors. Take a look at this host: http://www.enigmaathome.net/results.php?hostid=1369 . It's a 996 MHz PIII with PC-133 SDRAM running the Intel-optimized executable.

M4 Project homepage | M4 Project wiki
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
I think I've got an app faster than the PIII 32-bit app for the Phenom in 64-bit mode... :-) TJM, if you can verify, I'd appreciate it. This app seems to run 3-4 seconds faster than the PIII executable with the benchmark you provided. I found newer AMD-recommended flags for gcc 4.3.0...

1st compile....
Ran the start command twice... 2nd compile.....
The 1st compile seemed to equal the speed of the PIII app, but after re-compiling with -fprofile-use it seems to have been optimized further. I didn't see any *.da files like the gcc 4.2.0 documentation said would be there, but I figured, what the heck, I'll give it a shot anyway.

Also, if you'd like to download my executable, here it is: http://members.cox.net/mdoerner1/enigma_Phenom64 (I haven't zipped it or anything, it's just the raw executable). Let me know if my time savings are correct.

The ACML libraries are a bit out of my league, and not having the time to analyze the source code, I'm not sure if the BLAS functions would save any time in computation.

Mike Doerner

PS Here are the latest AMD compilation flags for gcc: http://developer.amd.com/assets/AMD_GCC_Quick_Reference_Guide080509.pdf

PPS Strangely enough, the -march code was 1 second slower than -mtune. Go figure.
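For context, a two-pass profile-guided build with gcc 4.3 generally looks like the sketch below; the base flag set, the Makefile behavior, and the benchmark command here are illustrative rather than the exact commands used.

```
# pass 1: instrumented build (assumes the Makefile also passes CFLAGS at link time,
# so libgcov gets linked in)
export BASE='-O3 -ffast-math -funroll-all-loops -march=amdfam10'
make clean && make CFLAGS="$BASE -fprofile-generate"
./enigma <benchmark arguments>        # run the workload; gcc 4.x writes *.gcda profile files
# pass 2: rebuild using the collected profile
make clean && make CFLAGS="$BASE -fprofile-use"
```

Note that gcc 4.x writes .gcda files rather than the .da files of older releases, which would explain why no *.da files showed up.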
TJM (Project administrator, Project developer, Project scientist) | Joined: 25 Aug 07 | Posts: 843 | Credit: 267,994,998 | RAC: 0
> I think I've got an app faster than the PIII 32-bit app for the Phenom in 64-bit mode... :-)

I can't try your executable, because the only Phenom I have (the BOINC server) is running a 32-bit OS (Debian 4). I built a 32-bit version with the same cflags; it seems to run a bit faster than the previous one I used.

Btw, if the benchmark results between two apps are similar, try running a real workunit and check its runtime. The benchmark is specific: it runs enigma on a very short ciphertext (shorter than hceyz72). Sometimes a very small difference (like 1-3% in benchmark time) can turn into a huge speedup on longer workunits.

M4 Project homepage | M4 Project wiki
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Glad to see those flags work for you, then. I've also overclocked my processor since I optimized my 64-bit app, so it's harder to compare apples to apples, so to speak. However, I'll see if I can come up with some sort of comparison.

You're right, though: a 4-second saving on a 3.25-minute benchmark run equates to around a minute saved on each awgly100_0 that is run (the workunits take on the order of 2,400-2,800 seconds, a dozen or so benchmark lengths). Considering the number of workunits left, that will help get us to the end quicker.

My only apprehension is that you'll use those flags for an Intel-optimized build too, making us AMD users even less competitive... :-( Oh well. Whatever works best.

Mike Doerner
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
These times are at 2.6 GHz....
And the Intel....
My Phenom is currently clocked at 2.93 GHz, and I'm getting 2,487.11 secs of CPU time on awgly100_0_2153304_r0 with my new optimized app. 2.6 / 2.93 = 0.887, so the old optimized app's 2,765.39 secs should scale to about 2,453.93 secs if I had overclocked earlier. Right in the ballpark, but it's hard to say whether I've actually saved a minute of computing time, as I see completion times vary between the different awgly100s. I'll just have to watch it, I guess... :-)

Mike Doerner
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Here's some of the low times since I compiled the app....

awgly100_0_2380425_r0  2,465.53 secs (these results may have been while I was briefly running at 3.0 GHz)
awgly100_0_2032261_r0  2,494.51 secs
awgly100_0_2027623_r0  2,484.84 secs
awgly100_0_2360478_r0  2,475.29 secs
awgly100_0_2354671_r0  2,458.55 secs
awgly100_0_2352802_r0  2,481.73 secs
awgly100_0_2347369_r0  2,454.21 secs

....kinda all over the place, but compared to before 10/19/08....

awgly100_0_2246402_r0  2,493.26 secs
awgly100_0_2239341_r0  2,438.31 secs

....really all over the place. Who knows....

Mike Doerner
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
Try -ftree-vectorize to enable vectorization, since all x86-64 systems must support at least SSE2.

HTH
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
> Try -ftree-vectorize to enable vectorization, since all x86-64 systems must support at least SSE2.

Howdie. What is vectorization, and how will it help the Enigma code run faster? :-)

Mike Doerner
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
Vectorization is another term for SIMD (single instruction, multiple data); in other words, a single operation, say a multiply, applied to several data elements at the same time.

SSE2 has 128-bit registers, each of which can hold 2 double-precision or 4 single-precision floating-point numbers (and similarly for integer types). Through vectorization it's possible to operate on two such registers full of data, so a single instruction delivers 2 (double-precision) or 4 (single-precision) results: a 2- or 4-fold improvement in peak performance, with 1.25- to 2.5-fold being more realistic in actual programs.

GCC is capable of placing multiple data elements in the SSE registers and using SIMD instructions to operate on them via the -ftree-vectorize option.

HTH
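To see whether it buys anything for the Enigma sources, gcc's vectorizer report will list which loops it managed to vectorize. A sketch along those lines, assuming enigma.c compiles standalone (otherwise the flags just go into CFLAGS in the Makefile):

```
# gcc 4.3-era report flag: notes go to stderr, with a "LOOP VECTORIZED"
# line for each loop that now uses SSE packed instructions
gcc -O3 -ffast-math -march=amdfam10 -ftree-vectorize \
    -ftree-vectorizer-verbose=2 -c enigma.c
```

If nothing is reported as vectorized, the flag won't change the runtime at all.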
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Thanks for explaining that. I'll try recompiling it with that flag enabled....

Mike Doerner
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
DOH! Looking at the gcc documentation, -ftree-vectorize is turned on by the -O3 flag. Looks like it's already on.

Mike Doerner
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
> DOH! Looking at the gcc documentation, -ftree-vectorize is turned on by the -O3 flag. Looks like it's already on.

Yes, but only in GCC 4.3 or later, not in earlier versions. And the vectorizer can only emit SSE code when the target supports it: always on x86-64 processors, but on x86 only when -msse2 is specified, and the resulting binary will then fail to run on processors which don't support SSE2.

HTH
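In concrete terms, the difference between the 64-bit and 32-bit cases might look like this (a sketch; only the flags matter, the compile commands are illustrative):

```
gcc -m64 -O3 -c enigma.c                  # x86-64: SSE2 is the baseline, and -O3 implies -ftree-vectorize (gcc 4.3)
gcc -m32 -O3 -msse2 -c enigma.c           # 32-bit: SSE2 must be requested, and SSE2-less CPUs can't run the result
gcc -m32 -O3 -march=pentium3 -c enigma.c  # PIII implies only SSE1, so no SSE2 vectorization here
```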
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
OK, now I have a better understanding of how the gcc flags interact. I have recompiled again with your suggestion, and it looks like I'm 6 to 7 seconds faster than TJM's 32-bit P3 app. I am using gcc 4.3.1 with openSUSE 11.0. Here are the flags I used for the last compile....

1st compile.

Profiling didn't seem to give much of a speed improvement compared to a regular compile without profiling enabled. Got any other tricks up your sleeve? ;-) -ftree-parallelize-loops=4 just seems to run faster than 2 or 8, the other values I tried. Fortunately, the task did NOT thread to the other cores on my Phenom when I tried this.

PS I've also tried

export CFLAGS='-m64 -O3 -funroll-all-loops -ffast-math -march=amdfam10 -msse4a -mabm -fprefetch-loop-arrays -combine -fwhole-program'

but it seems to run at the same speed as the previously listed flags, FWIW.

Mike Doerner
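For the record, the correctly spelled autopar flag and roughly how it would go into the build; this is a sketch (the flag set mirrors the ones above, everything else is assumed), and whether anything actually gets parallelized depends entirely on what gcc finds in the loops.

```
# -ftree-parallelize-loops is new in gcc 4.3; the generated threads use the
# libgomp runtime, so the flag should reach the link step as well (here via CFLAGS)
export CFLAGS='-m64 -O3 -ffast-math -funroll-all-loops -march=amdfam10 -ftree-parallelize-loops=4'
make clean && make
# if gcc really split a loop across threads, a single running task keeps several
# cores busy in top; one busy core means nothing was parallelized
```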
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
Naturally, as the Phenom has 4 cores. But I wonder why you didn't see the other cores busy. The threads don't show up in top as separate processes populating the cores, but you would notice that the cores are busy. Or perhaps the compiler couldn't parallelize the code.

> export CFLAGS='-m64 -O3 -funroll-all-loops -ffast-math -march=amdfam10 -msse4a -mabm -fprefetch-loop-arrays -combine -fwhole-program' but it seems to run at the same speed as the previously listed flags, FWIW.

Not surprising, as neither SSE4A nor ABM adds critical instructions to typical programs; most of them are for bit manipulation. Also, when building for x86-64, the options -m64 and -msse2 are the default, so there's no need to specify them, though they don't hurt either.

Now, an interesting option would be -msse3. SSE3 adds helpful instructions for matrix multiplication and operations on complex numbers, so if Enigma relies on algorithms using such operations, it might benefit too. Unfortunately, according to my tentative analysis here, systems supporting SSE3 are not as common as one might expect. So if you want the application to run well on most systems, it's better not to use it, and to drop -march=amdfam10 (AKA -march=barcelona) as well.

For the x86 application, you might want to try the option -mpc64, which reduces the precision of floating-point calculations from 80 to 64 bits, making them faster and giving results similar to those produced by x86-64.

HTH
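Putting those portability suggestions together, a conservative 32-bit flag set might look like this; it is only an illustration of the advice above, not a tested configuration:

```
# runs on any x86 host: no SSE2/SSE3 requirement, no Barcelona-specific -march,
# and -mpc64 drops x87 precision from 80 to 64 bits for a bit more FP speed
export CFLAGS='-m32 -O3 -ffast-math -funroll-all-loops -mtune=generic -mpc64'
make clean && make
```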
mdoerner (Volunteer developer, Volunteer tester) | Joined: 30 Jul 08 | Posts: 202 | Credit: 6,998,388 | RAC: 0
Supposedly AMDs are faster at floating-point math than Intels, but looking at the results so far, the Intel is about 25% faster at integer math than my AMD processor. The completion times suggest little floating-point math is used to generate the results, making my AMD look like an abacus compared to the Intel. :-)

Since enigma was never designed as a multi-threaded application (and BOINC might choke if all 4 tasks became multi-threaded), that may explain why turning on -ftree-parallelize-loops didn't do anything.

The only other trick I can think of to speed this up is to try incorporating the ACML libraries from AMD into the enigma.c code (and maybe other source files) and calling some of the BLAS routines to speed up the integer operations. Being a Mechanical Engineer, computer programming is not my strength. ;-) I don't know if including the ACML headers is enough, or if ACML has specific syntax for pulling its functionality into the code.

Mike Doerner
ebahapo | Joined: 11 Sep 07 | Posts: 7 | Credit: 306,962 | RAC: 0
> Supposedly AMDs are faster at floating-point math than Intels, but looking at the results so far, the Intel is about 25% faster at integer math than my AMD processor.

AMD does pretty well in FP, but with Intel's Core 2 the gap is smaller, if not gone, while Core 2 has zoomed past AMD in integer. It's truly a very good architecture.

> The only other trick I can think of to speed this up is to try incorporating the ACML libraries from AMD into the enigma.c code (and maybe other source files) and calling some of the BLAS routines to speed up the integer operations.

Perhaps, but I've never used ACML before. It does have many routines rewritten in assembler to extract every bit of performance from AMD processors, but not all of them were. So it may be a matter of making sure that the routines important to Enigma are indeed among the optimized ones.

HTH