New Optimized apps for 64-bit Linux |
Author | Message |
---|---|
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
Hi All,

With AMD's release of the Open64 compiler for Linux, I thought I would try my hand at recompiling the enigma application. I have always been interested in speeding up these applications because I have felt that a 64-bit application should perform the same function faster than a 32-bit application, especially one as mathematically driven as enigma. While I have compiled my own application for my AMD Phenom in the past, the improvement in 64-bit Linux was only a few seconds vs. the 32-bit applications TJM posted on the site last year.

This has changed dramatically with AMD's recent release of Open64 v4.2.2. I have seen a significant decrease in computation time for these new 64-bit applications compared to the 32-bit optimized ones released last year. Here is a list of times I measured with the benchmark evaluation program.

My system: AMD Phenom 9950, overclocked to 3.05 GHz, running 64-bit Linux (openSUSE 11.1, KDE 4.2.2)

Default Application 32-bit – 3m 48.202s
TJM's Pentium 3, 32-bit application compiled on gcc 3.2 – 3m 15.155s (fastest of TJM's optimized apps)
My 64-bit Phenom gcc 4.3 app – 3m 8s (approx.)

AMD has also helped the gcc folks with more optimizations in v4.4:

My 64-bit Phenom gcc 4.4 app – 2m 54s (getting better...more on this in the conclusions)

I have compiled both 32-bit and 64-bit applications under Open64. The 32-bit applications showed either no improvement or, in one case, marginal improvement.

Open64 32-bit Athlon – 3m 14.443s (basically even with TJM's P3 app)
Open64 32-bit Opteron – 3m 27.920s
Open64 32-bit Athlon64 – 3m 31.448s
Open64 32-bit Athlon64fx – 3m 34.46s
Open64 32-bit Phenom (Barcelona) – 3m 31.595s

Now for the good part. The 64-bit Linux optimized applications REALLY PUT THE HAMMER DOWN! I do not know what voodoo the engineers at AMD exercised when they optimized this compiler, but the run time drops by more than 20% compared to TJM's P3 optimized app.
Here are the numbers....

Open64 64-bit Opteron – 2m 29.500s
Open64 64-bit Athlon64 – 2m 28.851s
Open64 64-bit Athlon64fx – 2m 28.427s
Open64 64-bit Phenom (Barcelona) – 2m 28.410s

So the AMD optimizations give roughly the same improvement across the board: about 24% less run time than TJM's P3 app, and about 35% less than the default app.

For more info, I also ran some Intel optimizations...

Open64 64-bit Core – 2m 29.632s
Open64 64-bit Wolfdale – Could not compile (some B.S. about Phenoms not having SSE2 instructions)
Open64 64-bit Xeon – 2m 31.182s
Open64 64-bit EM64T – 2m 28.998s
Open64 64-bit Pentium4 – 2m 29.622s

Since there is such a substantial performance improvement, I have taken the Open64 apps I compiled and placed them in a file for downloading here...... Mike D's Linux Apps ...if you would like to use one of them (32-bit or 64-bit, AMD or Intel optimized), be my guest. I would like someone to compile a 64-bit Wolfdale app so I can see whether an Open64 64-bit Wolfdale build run on an Intel Core 2 Duo or Quad would post even quicker times than the ones I've shown here.

Conclusions:

1.) The 32-bit Open64 apps are just as fast as, but not faster than, TJM's Intel P3 optimized app from last year.

2.) The 64-bit Open64 applications I compiled are significantly faster than any 32-bit application, whether gcc or Open64 compiled. (And they are available for anyone to use, just click on the link above to download. I have included both 32-bit and 64-bit apps just in case someone else would like to test them.)

3.) Windows users may not be left out here. The gcc 4.4 version supposedly includes more optimizations for AMD processors (as shown with my gcc 4.4 result above). I would think someone running WinXP 64-bit, Vista 64-bit, or Win7 64-bit could download gcc 4.4 64-bit for their version of Windows and compile these apps to find out whether they get any improvement compared to TJM's P3 app.
Also, I would imagine Open64 may be ported to Windows some day (I can't see AMD letting this work only on Linux).

Thanks to TJM for letting the code out so us unemployed mechanical engineers can have something meaningful and rewarding to do besides changing dirty diapers during the day....;-)

If you have any questions, the best way to get hold of me is by email at mdoerner1 (at) cox (dot) net

Mike Doerner

PS The default flags in conf-cc are changed from

gcc -Wall -W -O3

to

opencc -Wall -W -Ofast -m64

The only change I made to the conf-ld file is that I changed the compiler from gcc to opencc and added the -ipa -m64 flags, as they were needed by the compiler to successfully compile the code. The flags I used for gcc-4.4 are

gcc-4.4 -Wall -W -O3 -m64 -march=amdfam10 -combine -funroll-all-loops |
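The percentages quoted in the post above can be checked directly from the posted benchmark times. This is only a sanity-check sketch: the helper names are made up for illustration, and "percent" here means reduction in run time relative to the older app, which is an assumption about how the figures were derived.

```python
def to_seconds(minutes, seconds):
    """Convert a benchmark time written as 'Xm Y.YYYs' into seconds."""
    return minutes * 60 + seconds


def percent_less_time(old, new):
    """Reduction in run time of `new` relative to `old`, in percent."""
    return (old - new) / old * 100


# Times taken from the post (Phenom 9950 at 3.05 GHz).
default_app = to_seconds(3, 48.202)    # default 32-bit application
tjm_p3 = to_seconds(3, 15.155)         # TJM's P3 app, gcc 3.2
open64_phenom = to_seconds(2, 28.410)  # Open64 64-bit Phenom (Barcelona)

vs_p3 = percent_less_time(tjm_p3, open64_phenom)
vs_default = percent_less_time(default_app, open64_phenom)

print(f"vs TJM's P3 app: {vs_p3:.1f}% less run time")
print(f"vs default app:  {vs_default:.1f}% less run time")
```

Note that a 24% cut in run time is not the same thing as a 24% increase in throughput; measured as work per second, the gain over the P3 app would come out somewhat higher.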
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
I forgot to mention: for the Open64 32-bit and 64-bit apps I did add -march=barcelona, or -march=athlon64, or -march=athlon64fx, etc. to the flags so that each build targets the correct architecture. Mike Doerner |
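For anyone building several variants, the flag combinations described above could be generated rather than typed by hand. The sketch below is hypothetical: the function names are invented for illustration, only -march values that actually appear in this thread are used, and it assumes the build reads just the first line of conf-cc (as the `head -1 conf-cc` in the make output suggests) and that any remaining lines should be kept unchanged.

```python
def conf_cc_line(arch, bits=64, compiler="opencc"):
    """Build the compiler line for conf-cc using the flags quoted in the thread."""
    return f"{compiler} -Wall -W -Ofast -m{bits} -march={arch}"


def with_first_line(text, new_first_line):
    """Return `text` with its first line replaced, keeping everything after it."""
    lines = text.split("\n")
    lines[0] = new_first_line
    return "\n".join(lines)


# -march values mentioned in this thread; other spellings are not confirmed here.
for arch in ["barcelona", "athlon64", "athlon64fx", "wolfdale", "anyx86"]:
    print(conf_cc_line(arch))

# Example: swap the compiler line in a conf-cc style file held in memory.
# The second line of this sample file is made up.
original = "gcc -Wall -W -O3\n(rest of the file)\n"
print(with_first_line(original, conf_cc_line("barcelona")))
```

Writing the result back to conf-cc is left out on purpose, so the sketch cannot clobber a working build configuration.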
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
Howdy All,

Just to eliminate confusion: you must use the app_test_522.tgz file from TJM's Optimized App Thread along with the executables shown above. You must still follow TJM's optimized app procedure as shown in this thread....

v5.22 Experimental optimized app

1st, grab this file....app_test_522.zip ...the difference is that instead of using the files from the 2nd link (test.tgz), you would use one of the executable files from my link above.

If you have any questions, please email or send me a private message and we'll get you straightened out as best we can. ;-)

Mike Doerner |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
I've added a 32-bit Intel Core architecture optimized app compiled with Open64 to the file, for those who refuse to realize 64-bit is the only way to go....;-) Mike D |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
DOH! Do not use the app_test_522.zip link in the above post; it is for the WINDOWS platform. The link you want is this one.....for Linux..... app_test_522.tgz Mike D |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
Here is the performance boost I'm seeing (so far) on my Phenom. The only minor tweak I've made on my system is that I've gone from 3.05GHz to 3.10GHz, but this only shaves a few seconds off the small tasks, maybe a minute or 2 off the large tasks. I've gone from 1775 RAC to 2330 RAC, maybe a bit more. We'll see here shortly. |
TJM Project administrator Project developer Project scientist Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |
I've tried to build a Wolfdale app, but the compiler says that the target processor does not support SSE2. That's quite weird, because I've also tried on a machine with a Wolfdale processor and got the same result. Buninek from BOINC@Poland also tried with no luck, but here is his other app (xeon x64): http://dl.getdropbox.com/u/349831/enigma.xeon.x86_64.tar.bz2

So far the fastest app runs at a speed similar to one of my older 64-bit executables built with the Intel compiler, which was slightly slower than the old PIII app. Tested on Q6600, E7200 and E5200.

Now I'm trying to build something faster for Athlon 64/64 X2. It seems unfair that these processors run enigma slower than a Pentium III at less than half their clock speed (a 996MHz PIII runs faster than a 2.4GHz Athlon 64).

M4 Project homepage M4 Project wiki |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
I can let the guys at Open64 know that -march=wolfdale chokes on both Intel and AMD processors, giving the same bogus "No SSE2" error.

So the 32-bit gcc compiled P3 app runs faster than either Open64 or Intel compiled 64-bit code? That's weird, because when I ran and tested the P3 gcc 32-bit app in the benchmark it was slower than the Open64 64-bit code. Maybe we're finally running into a processor architecture situation here?

Just so I understand, are you saying a PIII at 1.0GHz beats a 2.4GHz Athlon64 in raw time?!?!?! Or are you saying clock-for-clock the PIII is more efficient? (i.e. let's say the P3 completes a task in 1000 seconds, are you saying the Athlon64 completes it in 2400 seconds, 1200 secs, or 990 secs?)

I'm glad to hear the Intel Compiler code is as efficient as the Open64 code. If the Gentoo guys make a distro with Open64 instead of GCC, I just may have to switch distros from openSUSE to Gentoo..... :)

Mike Doerner

PS I don't want to open another can of worms here, but I've heard that C code isn't as efficient mathematically as Fortran (which is why Fortran won't die). Not that a code re-write is possible this late in the game, but architecture optimization will only get us so far..... |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
When you change the flag from -Ofast to -O0 thru -O3 you get this error....

mdoerner@Linux-QuadZilla:~/Xfers/enigma-suite-0.76> make clean
rm -f *.o tools/*.o
mdoerner@Linux-QuadZilla:~/Xfers/enigma-suite-0.76> make
( cat warn-auto.sh; \
echo exec "`head -1 conf-cc`" '-c ${1+"$@"}' \
) > compile
chmod 755 compile
./compile enigma.c
Error loading wolfdale.so: wolfdale.so: cannot open shared object file: No such file or directory
make: *** [enigma.o] Error 2
mdoerner@Linux-QuadZilla:~/Xfers/enigma-suite-0.76>

Looks like they forgot to package something in the distro. I've heard the bug update will be pushed out the door by AMD in early June; maybe they can fix this by then....

Mike D |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
"Buninek from BOINC@Poland also tried with no luck, but here is his other app (xeon x64):"

I tried running that app on my box and it choked.....

mdoerner@Linux-QuadZilla:~/Xfers/enigma_benchmark> ./start
./start: line 4: ./enigma: No such file or directory
real 0m0.000s
user 0m0.000s
sys 0m0.000s
mdoerner@Linux-QuadZilla:~/Xfers/enigma_benchmark>

Ran many other apps just fine. Does it only work on Xeon?

Mike D |
TJM Project administrator Project developer Project scientist Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |
Yep, that's exactly what I'm saying. A 1GHz PIII Coppermine beats a 2.4GHz Athlon64 by around 3-4 minutes on the shortest workunits. But... I think that's the only situation where an app built with the Intel C Compiler is much faster than anything else. I forgot the exact numbers and I don't have a PIII box here anymore, but the speedup gained by replacing the PIII-gcc app with the PIII-Intel one was just insane: more than twice the speed of the PIII-gcc app and around 3 times the speed of the base app. If I remember correctly, the 1GHz PIII needed around 40 minutes to complete hceyz72/0, and the fastest result I've seen from an Athlon 64/2.4GHz is around 44 minutes.

As for the xeon x64 app: I guess it's not a statically linked app and some of your libraries are the wrong version. |
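To put the "raw time vs. clock-for-clock" question in numbers, here is a rough back-of-the-envelope sketch using the approximate figures TJM quotes from memory (about 40 minutes on a 1GHz PIII and about 44 minutes on a 2.4GHz Athlon 64 for hceyz72/0). The function name is made up, and the two machines were running different builds of the app, so this compares app plus CPU rather than the processors alone.

```python
def clock_cycles(clock_ghz, minutes):
    """Approximate clock cycles spent on a workunit: frequency times run time."""
    return clock_ghz * 1e9 * minutes * 60


# Approximate figures quoted in the post above.
piii_minutes, piii_ghz = 40, 1.0
athlon_minutes, athlon_ghz = 44, 2.4

piii_cycles = clock_cycles(piii_ghz, piii_minutes)
athlon_cycles = clock_cycles(athlon_ghz, athlon_minutes)

# How many times more cycles the Athlon 64 build spends on the same workunit.
cycle_ratio = athlon_cycles / piii_cycles

# Scaled to the "1000 second" example asked about earlier in the thread.
athlon_equivalent_seconds = 1000 * athlon_minutes / piii_minutes

print(f"Athlon 64 uses about {cycle_ratio:.2f}x the clock cycles of the PIII")
print(f"A 1000 s PIII task would take about {athlon_equivalent_seconds:.0f} s")
```

On these rough numbers the answer to the earlier question is "both": the PIII wins in raw time, and the clock-for-clock gap is larger still.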
thinking_goose Joined: 12 Nov 07 Posts: 119 Credit: 2,750,621 RAC: 0 |
I always thought the p3's were good. Is there an app for the Turion, or is it compatible with the Athlon64 app? |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
"I always thought the p3's were good. Is there an app for the Turion, or is it compatible with the Athlon64 app?"

I would just use the Athlon64 code for now. There is no Turion flag in Open64.

Mike Doerner |
TJM Project administrator Project developer Project scientist Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |
I've just tested the 32-bit Athlon64 app from your pack on my Athlon64/2.4GHz.

Old app (the best app I could build with gcc) benchmark: 3m 27s (*)
Your app: 2m 26s

Now that's a serious speedup :-D

(*) - these runtimes cannot be compared with runtimes from other processors/apps, because I used a modified, shorter benchmark file. |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
"I've just tested 32bit Athlon64 app from your pack on my Athlon64/2.4GHz."

Glad to see someone got some benefit out of it. :-) At least the Athlon64's won't get beat by the P3's anymore....;-) It's hard to tell how well an application optimized for an earlier processor will run when you're running a later generation processor....

Mike Doerner

PS I just looked at your computers on the project. Your AMD Athlon box seems to be having a lot of errors right now. I hope that isn't the fault of the app I made...... |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
AMD has released Open64 4.2.2.1, so I was able to compile a 64-bit Wolfdale app for the Intel guys. The 32-bit Wolfdale app will have to wait a bit, as Open64 4.2.2.1 has introduced a bug with the -m32 flag...:-(. As soon as they get it fixed I'll add it to the archive. Mike D |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
"AMD has released Open64 4.2.2.1, so I was able to compile a 64-bit Wolfdale app for the Intel guys. The 32-bit Wolfdale app will have to wait a bit, as Open64 4.2.2.1 has introduced a bug with the -m32 flag...:-(. As soon as they get it fixed I'll add it to the archive."

OK, the 32-bit wolfdale app has been added. For some reason, when compiling with the -m32 flag, you gotta use gcc 4.3.3. AMD says the preferred version is gcc 4.1.2 or 4.2, but that only works with the -m64 flag. Oh well.

Also, I have added 2 apps (32-bit and 64-bit) to the file, with -march=anyx86 enabled. I'm gonna try it on an old, old, old 233MHz Pentium MMX portable I've had sitting around doing nothing for a while. I put DSL Linux on it (a Debian derivative) with a 2.4.31 version of the kernel. Unfortunately, it picked up a hceyz72_1_ task and it's crunching on the default app right now, so it might not be until tomorrow before it's done crunching on it....:-( How computers have come along in a decade.....

Odd thing is the anyx86 and barcelona flags seem to compute in the same time on my Phenom box, so I'll give it a whirl and see how the times come out on regular tasks.

Mike D |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
Tried running the anyx86 code on the Pentium mobile, and it choked. Not sure if it's because the Pentium is on the 2.4.31 kernel or if I need to add more flags to make the libraries static or what..... Mike Doerner |
[AF>Libristes] Dudumomo Joined: 15 Feb 09 Posts: 20 Credit: 196,260 RAC: 0 |
I've tried the wolfdale and anyX86_64 apps; both give me a computation error on my Gentoo 64-bit box (2.6.28-gentoo-r5).

Edit: the error is:

wrapper: running ../../projects/www.enigmaathome.net/enigma_0.76_i686-pc-linux-gnu (-R -o results.txt 00trigr.cur 00bigr.cur 00ciphertext)
../../projects/www.enigmaathome.net/enigma_0.76_i686-pc-linux-gnu: error while loading shared libraries: libmv.so.1: cannot open shared object file: No such file or directory
app |
mdoerner Volunteer developer Volunteer tester Joined: 30 Jul 08 Posts: 202 Credit: 6,998,388 RAC: 0 |
Hmmm.... I might have to statically link the libraries for you (though you should consider installing the required libraries in the future). Maybe it's time to come up with a separate archive since those files will be so much bigger.... PS Here's the new link. Right now there's only a 32-bit AnyX86 statically linked file. Let me know if you have any further problems. Mike D's Linux App Statically Linked - AnyX86 Mike Doerner |
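Before swapping in one of the dynamically linked executables, it may be worth checking whether the machine has every shared library the binary wants. A minimal sketch, assuming the standard `ldd` tool is installed: the function names are made up for illustration, and the sample output below is invented to mirror the libmv.so.1 error reported above, not captured from a real run.

```python
import subprocess


def missing_libraries(ldd_output):
    """Return the library names that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        if "=> not found" in line:
            # Lines look like: "<tab>libmv.so.1 => not found"
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path):
    """Run ldd on an executable and list its unresolved shared libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True)
    return missing_libraries(result.stdout)


# Invented sample output in the usual ldd layout, for illustration only.
sample = (
    "\tlibmv.so.1 => not found\n"
    "\tlibm.so.6 => /lib/libm.so.6 (0xf7e00000)\n"
    "\tlibc.so.6 => /lib/libc.so.6 (0xf7c00000)\n"
)
print(missing_libraries(sample))
```

For a statically linked build, `ldd` normally reports that the file is not a dynamic executable, so the list comes back empty, which is the point of the separate static archive.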