optimized applications |
Message boards : Number crunching : optimized applications
Author | Message |
---|---|
Paul D Harris Joined: 15 Feb 12 Posts: 4 Credit: 72,585 RAC: 0 |
Are there any optimized applications for this project? At SETI I use Lunatics, but I do not see any used here. I believe one would speed up my work units; it currently takes about 3 hours for a work unit to complete. |
Ageless Volunteer moderator Volunteer tester Joined: 11 Sep 07 Posts: 104 Credit: 155,932 RAC: 0 |
|
Peciak Joined: 27 Aug 09 Posts: 9 Credit: 117,918,807 RAC: 0 |
|
Paul D Harris Joined: 15 Feb 12 Posts: 4 Credit: 72,585 RAC: 0 |
"WINDOWS INTEL"
Thanks, got it installed; just waiting on a WU now. |
Paul D Harris Joined: 15 Feb 12 Posts: 4 Credit: 72,585 RAC: 0 |
"WINDOWS INTEL"
I am now doing a WU in 1.5 hrs; that took off about 1 hr. |
Peciak Joined: 27 Aug 09 Posts: 9 Credit: 117,918,807 RAC: 0 |
New optimized applications by AGBAR:
forum topic http://www.boincatpoland.org/smf/enigmahome-78/enigma-optima/
apps https://github.com/Agbar/enigma-optima/releases/
apps http://chomikuj.pl/rakowskipw/boinc/enigma/opty+enigma |
Agbar Joined: 10 Sep 09 Posts: 28 Credit: 690,568 RAC: 0 |
Great that you've written about my project here, but please don't redistribute Enigma-Optima. I would like to know how many people are using it, and the GitHub stats will be distorted by downloads from chomikuj. |
[AF>Amis des Lapins] Oncle Bob Joined: 24 Feb 13 Posts: 18 Credit: 55,194,685 RAC: 0 |
Thank you, I will test this app (x64 beta 3). Is there a chance of getting an ARM-optimized app? |
Agbar Joined: 10 Sep 09 Posts: 28 Credit: 690,568 RAC: 0 |
Thanks for your interest. I would like to concentrate on x86 (preferably 64-bit). I don't know much about the ARM architecture, so preparing an ARM-optimized app would be a huge challenge and potentially time-consuming. There are a few other factors too. Summarizing (without going into details): no ARM version in the foreseeable future. |
[AF>Amis des Lapins] Oncle Bob Joined: 24 Feb 13 Posts: 18 Credit: 55,194,685 RAC: 0 |
Well, I compared this optimized app and the previous one on my i7 2600K running at 4.2 GHz, using two series of 100 WUs. It seems that Optima is almost 25% faster than the previous optimized app. Thank you for your work; I have spread the word on the Alliance Francophone forum. |
Agbar Joined: 10 Sep 09 Posts: 28 Credit: 690,568 RAC: 0 |
"It seems that Optima is almost 25% faster than the previous optimized app."
The speed gain depends on the CPU model you have. My teammates reported the biggest wins on AVX2-enabled CPUs (e.g. Intel Core i7-4770K). If you encounter any problems (like trashing work units), please reach me here, on the BOINC@Poland forum (we have an English-language board) or, even better, on GitHub. |
[AF>Amis des Lapins] Oncle Bob Joined: 24 Feb 13 Posts: 18 Credit: 55,194,685 RAC: 0 |
Excellent, you included the latest features of modern CPUs (in this case AVX2). I suppose this app runs AVX on Sandy Bridge (or similar CPUs from the early 2010s). What does it exploit on older CPUs? SSE4? Is there an improvement on these old CPUs which don't have AVX? By the way, you may want to warn users that this app will increase the power consumption and temperature of the CPU because it exploits AVX. |
Agbar Joined: 10 Sep 09 Posts: 28 Credit: 690,568 RAC: 0 |
There are several upgrades to the original code.
1. (no SSE) The first thing I realized was that in the hot path (decode + score) there are many multiplications. They are used mainly to calculate addresses in multidimensional arrays. At the cost of a few percent of extra space (I don't remember the exact number, but it is trivial to calculate), I changed multiplication by 26 to multiplication by 32. Multiplication by 32 can be done with a single "shift left" instruction. On most modern (>= Pentium ;) processors a shift takes 1 cycle, while a multiplication takes at least 3 (up to Nehalem; Sandy Bridge has a faster multiplier, but that doesn't matter because it has AVX). Using a well-optimizing compiler such as Intel's, I had the fastest optimized Enigma app at the time (around 3 years ago). This version is included for processors that don't support any of the other code versions.
2. SSSE3 (note the triple S, for Supplemental SSE3). I owned first-generation Core i7 and i5 machines, so I started tinkering to get something faster than the "basic" code. It wasn't that easy because I needed to use intrinsics (which are much easier to write than real assembly). However, it turned out that GCC emits really poor code for __builtin_shuffle(16x8bit_vector,32x8bit_permutation) (which turned out to be the most important function to make it work), so I was forced to write it on my own. This faster shuffle is crucial for getting any considerable speed improvement over the "basic" code. In simple terms: while "basic" decodes one character after another, the SSSE3 code decodes a group of up to 16 characters at a time. AVX does not include the required operations, so Sandy/Ivy Bridge processors execute the same (SSSE3) code.
3. AVX2: Haswell adds operations on 256-bit integer vectors (including PSHUFB). The code is a fairly simple extrapolation of the SSSE3 code to wider registers.
There is still room for improvement. Currently I am testing a "basic" version that is almost as fast as SSSE3 on my computer: less than 2% slower!
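A minimal sketch of the 26 → 32 padding trick from point 1. It is written in Python only to make the index arithmetic easy to check; the speed benefit applies to compiled code, not to Python. The table names and the two-dimensional shape are assumptions for illustration, not Enigma-Optima's real data layout.

```python
# Sketch: pad a 26-letter dimension to 32 so the row offset becomes a shift.
# Names and the 2D shape are hypothetical, not taken from Enigma-Optima.
ALPHABET = 26
STRIDE_LOG2 = 5            # 32 == 1 << 5
STRIDE = 1 << STRIDE_LOG2  # padded row length


def index_dense(a: int, b: int) -> int:
    """Row-major offset in a 26 x 26 table: needs a multiply by 26."""
    return a * ALPHABET + b


def index_padded(a: int, b: int) -> int:
    """Row-major offset in a 26 x 32 table: the multiply becomes a left shift."""
    return (a << STRIDE_LOG2) + b


# Fill both layouts with the same arbitrary scores; lookups must agree.
dense = [0] * (ALPHABET * ALPHABET)
padded = [0] * (ALPHABET * STRIDE)  # the last 6 slots of every row stay unused
for a in range(ALPHABET):
    for b in range(ALPHABET):
        score = 100 * a + b
        dense[index_dense(a, b)] = score
        padded[index_padded(a, b)] = score

print(len(dense), len(padded))  # 676 832
```

The padded table holds 26 × 32 instead of 26 × 26 entries, so each padded dimension grows by a factor of 32/26. In C, an optimizing compiler may already replace a multiply by the constant 26 with cheaper shift/add sequences, so the actual gain is worth measuring on the target CPU.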
Seriously, GCC does a better job here; all I had to do was fight one "optimization". I believe it can be even faster. The SSSE3 code isn't optimized very well, and I am sure AVX2 is even worse (I don't own anything THAT modern yet; I tested correctness on Intel SDE). It is fast enough to be faster than the basic code ;) AVX code might be marginally faster than SSSE3, but it's probably too much work compared to the results. It is possible to "extrapolate" the ideas of this code to support AVX512, but it is too early now, as it would be dead code for the next year: Purley and Cannonlake are expected to be released in 2017.
"By the way, you may want to warn users that this app will increase the power consumption and temperature of the CPU because it exploits AVX."
You are probably right. But note that any modern CPU has Turbo Boost or an equivalent: processors automatically adjust to stay within their power limits, so there should be no real change. |
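The SSSE3 idea in point 2 (translate up to 16 characters with one shuffle) can be modelled in portable Python. The sketch below emulates the semantics of the PSHUFB instruction (`_mm_shuffle_epi8` in C) and, as an assumption about how a 26-letter alphabet padded to 32 entries could be handled, combines two 16-byte shuffles with a per-lane select. The permutation and sample text are made up; this is not Agbar's implementation.

```python
# Scalar model of SSSE3 PSHUFB plus a 32-entry lookup built from two
# 16-byte shuffles. Illustration only, not Enigma-Optima's code.
LANES = 16


def pshufb(table: bytes, mask: bytes) -> bytes:
    """Emulate PSHUFB on 16 bytes: a mask byte with its high bit set yields 0,
    otherwise its low 4 bits select a byte of `table`."""
    assert len(table) == LANES and len(mask) == LANES
    return bytes(0 if m & 0x80 else table[m & 0x0F] for m in mask)


def lookup32(table32: bytes, idx: bytes) -> bytes:
    """Translate 16 indices (0..31) through a 32-byte table in one step."""
    lo = pshufb(table32[:16], idx)  # correct for indices 0..15
    hi = pshufb(table32[16:], idx)  # correct for indices 16..31 (uses low 4 bits)
    # Per-lane select on bit 4 of the index (a blend in real SIMD code).
    return bytes(h if i & 0x10 else l for i, l, h in zip(idx, lo, hi))


# Hypothetical permutation of A..Z (7 is coprime to 26), padded to 32 entries.
perm = [(i * 7 + 3) % 26 for i in range(26)] + [0] * 6
text = b"HELLOENIGMAWORLD"                 # exactly 16 letters
idx = bytes(c - ord("A") for c in text)
out = lookup32(bytes(perm), idx)
print(bytes(o + ord("A") for o in out).decode())
```

In C, the two shuffles map to `_mm_shuffle_epi8` (SSSE3, `<tmmintrin.h>`), and the select can be built from `_mm_cmpeq_epi8`, `_mm_and_si128`, `_mm_andnot_si128` and `_mm_or_si128` (SSE2), or from `_mm_blendv_epi8` on SSE4.1. The AVX2 form, `_mm256_shuffle_epi8`, shuffles each 128-bit half independently, so one natural way to widen this sketch to 32 characters is to duplicate each 16-byte table in both halves.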
jj666 Joined: 10 Mar 14 Posts: 8 Credit: 68,796,111 RAC: 0 |
Great work here! Works very well on my old Xeon Proliant servers. Cheers, -jj- |
Agbar Joined: 10 Sep 09 Posts: 28 Credit: 690,568 RAC: 0 |
Thanks! I was about to ask everybody: does it work fine? Are there any issues with this app? If not, I will publish this version as v1.0.0 (stable). As I said before, I have a faster basic (no SSSE3) version that would work better on older processors. I had the Pentium 3 in mind while writing it, but I noticed recently that some older Opterons don't have SSSE3 either (example and capabilities). I must write some tests first, as this app isn't a simple recompilation like most others of this kind, but, you know, I have a life besides programming ;) I want to say thank you to all of you who use it, especially the guys on the top list. I didn't expect a deployment of this size at this stage! |
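The app has to pick the best code path the CPU supports and fall back to the basic code otherwise (the stderr log later in this thread prints a "Best ISA" line). A minimal sketch of such a selection, assuming the CPU feature flags are already available as strings; the path names and ordering are hypothetical, not Enigma-Optima's actual dispatch code. On Linux x86 the flags can be read from the `flags` line of /proc/cpuinfo.

```python
from typing import Iterable, Optional, Set

# Hypothetical code paths, most preferred first, with the CPU flags each needs.
# A CPU with AVX but no AVX2 (Sandy/Ivy Bridge) falls through to "ssse3";
# "basic" requires nothing, so it always matches as the last resort.
CODE_PATHS = [
    ("avx2", {"avx2"}),
    ("ssse3", {"ssse3"}),
    ("basic", set()),
]


def pick_code_path(cpu_flags: Iterable[str]) -> str:
    """Return the first code path whose required flags are all present."""
    flags = set(cpu_flags)
    return next(name for name, required in CODE_PATHS if required <= flags)


def linux_cpu_flags(path: str = "/proc/cpuinfo") -> Optional[Set[str]]:
    """Read the x86 'flags' line on Linux; return None if it is unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
    except OSError:
        pass
    return None


print(pick_code_path({"sse2", "ssse3", "avx"}))  # ssse3 (Sandy Bridge-like)
print(pick_code_path({"sse", "sse2"}))           # basic (no SSSE3)
flags = linux_cpu_flags()
if flags is not None:
    print("this machine:", pick_code_path(flags))
```

Keeping the table ordered from most to least specific means adding a new path (for example an AVX512 one) is a single new entry at the top.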
jj666 Joined: 10 Mar 14 Posts: 8 Credit: 68,796,111 RAC: 0 |
Some benchmarks, from the previous optimised binaries to the days since I've been using this one:
Proliant G6: Xeon X5670 (x2) @ 2.93 GHz -> points per day from around 31,500 to 41,500.
Proliant G7: Xeon X5690 (x2) @ 3.47 GHz -> points per day from around 35,000 to 47,000.
No broken WUs seen in the days I have been using it. Given that it's one app, and not a bunch of different apps compiled differently, it works much better for deployment (I also have an AMD Bulldozer and an i5 machine I crunch on infrequently). Thanks again! Cheers, -jj- |
stiwi Joined: 20 May 12 Posts: 19 Credit: 109,893,954 RAC: 0 |
Thanks! I had 4 errors, but I don't know if they were caused by the optimized app or if something else is wrong:
<core_client_version>7.4.8</core_client_version>
<![CDATA[
<stderr_txt>
Wrapper v5.26 build 8: starting
03:11:14 (4644): wrapper: running enigma_0.76.exe (-R -o results.txt 00trigr.cur 00bigr.cur 00ciphertext)
Enigma Optima v1.0.0-beta.3 Windows64
Best ISA: AVX
Seed set to: 1466298699.
2016-06-19 03:11:39 enigma: working on range ...
2016-06-19 03:54:13 enigma: finished range
03:54:14 (4644): called boinc_finish
</stderr_txt>
<message>
finish file present too long
</message>
]]>
All other tasks work fine and fast :) |
Ben Joined: 1 Dec 15 Posts: 1 Credit: 399,478 RAC: 0 |
I've been doing a few work units with my i5 and Agbar's Intel app. The 30-point WUs used to take around 15 minutes; I can now do them in approximately 10 minutes. The 50- and 60-point WUs I can crunch in just over 20 minutes. Amazing work. I have done over 100 WUs without any issues. |
LOVIT Joined: 23 Apr 10 Posts: 2 Credit: 60,954 RAC: 0 |
Ho ho, great work! On my old (haha, main computer) C2Q 9300 @ 3 GHz, a 52-point WU took approx. 2500 s (41 min 40 s) with the standard app and now takes approx. 1550 s (25 min 50 s). That's an amazing ~40% cut in run time! Thanks, mister |
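These run times can be checked with a few lines of arithmetic. The same improvement can be stated as the share of run time saved per WU or as a throughput speedup, and the two percentages differ; the hypothetical helper below computes both from the times quoted in the post.

```python
from typing import Tuple


def compare_runtimes(old_s: float, new_s: float) -> Tuple[float, float]:
    """Return (fraction of run time saved, throughput speedup factor)."""
    time_saved = (old_s - new_s) / old_s
    speedup = old_s / new_s
    return time_saved, speedup


saved, speedup = compare_runtimes(2500, 1550)
print(f"time saved: {saved:.0%}, throughput: {speedup:.2f}x")  # time saved: 38%, throughput: 1.61x
print(divmod(2500, 60), divmod(1550, 60))                      # (41, 40) (25, 50)
```

So the optimized app needs about 38% less time per WU, which works out to roughly 1.6 times as many WUs per day on the same machine.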
Agbar Joined: 10 Sep 09 Posts: 28 Credit: 690,568 RAC: 0 |
"I had 4 errors, but I don't know if they were caused by the optimized app or if something else is wrong"
By any chance, do you have the task IDs? In 202892545 there is an error: "enigma: error: resume file is not in the right format". I don't remember touching anything in this part of the code. Maybe a disk error or overclocking (OC)? Without the resume file I can't tell what went wrong, and unfortunately that file is deleted when the task fails. I bet it is a disk/filesystem problem. |