Long (never ending?) work unit |
Message boards : Number crunching : Long (never ending?) work unit
Author | Message |
---|---|
ChertseyAl Joined: 23 Sep 07 Posts: 16 Credit: 1,038,886 RAC: 0 |
This WU has been running now for 63 hours: http://www.enigmaathome.net/result.php?resultid=15555105 http://www.enigmaathome.net/workunit.php?wuid=14533239 Shows 65 hours remaining, progress indicator resets to zero, but I am aware that this is a known bug :) Is it worth allowing this to run? I need to switch off the host within the next day or so to relocate it, and am concerned that the WU will start from zero again as it isn't checkpointing. Thanks. |
ChertseyAl Joined: 23 Sep 07 Posts: 16 Credit: 1,038,886 RAC: 0 |
"Is it worth allowing this to run?" Anyone? |
thinking_goose Joined: 12 Nov 07 Posts: 119 Credit: 2,750,621 RAC: 0 |
I'd let it run. If it still hasn't finished before the time you need to move the machine, so be it. I've given up looking at the estimated times these units take to complete; I get a fairly accurate prediction by looking at the time it has taken to complete similar work units and just take it from there. So far they're only a few minutes out either way. |
ChertseyAl Joined: 23 Sep 07 Posts: 16 Credit: 1,038,886 RAC: 0 |
The outcome: The WU was still running at around 100 hours when I had to power down. On rebooting it retained its runtime, and an outstanding Spinhenge unit started to run. Pretty soon that was running far longer than expected. Rebooted again, Spinhenge unit ran normally and finished (although in 3 times the usual time). Enigma WU got stuck again and at around 104 hours reset to zero hours. Then I remembered that BOINC had blown up about a month ago on this host (all tasks went to computation error, started accepting tasks from random projects etc). BOINC version was 5.10.28! Upgraded to my favourite version, 5.10.45, and all seemed well. Enigma WU completed in an apparent 4 hours (more than double the usual time, and of course it had been crunching for 104 hours before that!). Still getting stuck tasks on Enigma and other projects. Uninstalled BOINC, then installed 6.10.56. Somehow it migrated all of the outstanding WUs (I guess they don't get deleted on uninstall?). Took hours to install for some reason, kept freezing. Seemed to be OK, but got bitten by the stupid Max CPU Time bug, sorry 'feature', and other settings weren't correctly migrated. Straightened the settings out, all apparently OK. Except now I've got another hung Enigma WU. Conclusion: I think the host is broken in strange and mysterious ways :/ |
noderaser Joined: 24 Dec 08 Posts: 88 Credit: 1,496,863 RAC: 0 |
Is something else eating up all the resources, preventing BOINC from getting any to complete its tasks? |
ChertseyAl Joined: 23 Sep 07 Posts: 16 Credit: 1,038,886 RAC: 0 |
"Is something else eating up all the resources, preventing BOINC from getting any to complete its tasks?" Nope. CPU usage with BOINC suspended is no more than a few percent, and that's being consumed by the VNC connection (like most of my hosts, I only connect via VNC). With BOINC running it's a solid 100%. Memory usage is minimal when running Enigma. It's a single core machine with 1GB of RAM. The only projects it struggles with would be CosNo and ViP. Been running EDGeS solidly for 24 hours now with no problems, and that uses a lot of memory. It's almost as if the 'lite' projects like Enigma and Spinhenge are the problematic ones. Putting things in perspective, this host cost me 20 or 30 UKP a few years ago and no longer really does anything 'useful' on my network. All that it does now is provide a small (manual) cache of some hefty files from my NAS, and I was using the front panel USB ports as charger sockets for my mp3 player and sat nav :) I'll play around with it for a while, if nothing else as a testbed for different BOINC versions (I'm not liking the latest stuff at all!). |
Cartoonman Joined: 9 May 09 Posts: 1 Credit: 2,027,871 RAC: 0 |
This sounds like the WU application is somehow corrupted and thus isn't working properly (a simple re-installation won't get rid of your project data). As far as anyone else has seen, and for me, the WUs are performing fine (the only problem is that the progress indicator is highly inaccurate, and time left to finish is easily determined from CPU time). According to your computer stats, you're running XP, so your WU files and stats should be in your Application Data folder. You can find it in its default location at "C:/Documents and Settings/[your user acc with BOINC*]/Application Data/BOINC/projects/(the Enigma@home folder)" (or run a search for "BOINC"; the folder shown inside an Application Data folder is the one). Once you're in the projects folder, the Enigma folder is easily discernible. Delete the entire folder, but make sure that BOINC isn't running. After you've deleted the folder, restart BOINC, let it re-download all of the necessary files and applications for Enigma, and see how the WUs run after that. *If you allowed all users to use BOINC, it would be in the All Users folder. |
ChertseyAl Joined: 23 Sep 07 Posts: 16 Credit: 1,038,886 RAC: 0 |
Latest gripping news ... ;-) Other projects are running intermittently slowly, and occasionally the whole of Windows just runs very slowly. Checked all of the obvious things and ran a few diagnostics. The only obvious 'problem' was a very high CPU temperature, close to the point that the processor throttles internally. So I left ThrottleWatch running, but that didn't show anything. Heatsink was clean, fan blowing plenty of air, so took the heatsink off, cleaned it up and refitted with fresh silver heatsink compound. Seemed to run even slower. Removed, cleaned, refitted with ceramic heatsink compound and things seemed a little better. For a while. Now it's hanging again. I'm not convinced that the junction temperature is as high as reported, because the heatsink is barely warm, and cooling it down with freezer spray has very little effect on the processor temperature. So, that might be a red herring. Next plan is to change the memory. I'll have a gigabyte of working memory becoming free in the next few days when I upgrade a different machine. This machine is probably going to be scrapped shortly (I only keep the 10 best machines running BOINC, and this is one of the slowest), so I'm not that bothered. But I'd like to find out what's happening just to satisfy my own curiosity :) |
TJM Project administrator Project developer Project scientist Joined: 25 Aug 07 Posts: 843 Credit: 267,994,998 RAC: 0 |
Just for info: all the Enigma workunits have fixed lengths, so if a WU suddenly takes more than a couple of hours of CPU time, it's probably broken, unless it runs on really old hardware. I think that even a Pentium III can go through most of the workunit types in less than 12 hours. I think that the host has some kind of hardware problem, perhaps memory errors. I've already seen a similar problem on a machine with broken RAM: the O/S itself was stable, but most of the results returned completely random data, usually from completely different WU ranges. Probably the data got corrupted in memory and the app was randomly jumping from one setting to another, which would also explain runtimes varying from normal to tens of hours. |
ChertseyAl Joined: 23 Sep 07 Posts: 16 Credit: 1,038,886 RAC: 0 |
"all the enigma workunits have fixed lengths, so if suddenly WU takes more than a couple of hours of CPU time, it's probably broken" Thanks for that. I'm pretty certain it's a memory problem now. I don't remember when or where I got the memory that's in there at the moment. I'll fit some different memory when I get around to it. |
ChertseyAl Joined: 23 Sep 07 Posts: 16 Credit: 1,038,886 RAC: 0 |
"I'm pretty certain it's a memory problem now." And indeed it was. At some stage I'd fitted some spare PC3200 based on the Crucial scanner tool, which recommended PC2700 or PC3200. Different PC3200 DIMMs showed the same problem. A friend suggested that although PC3200 was 'better', the mobo might not work well with it. So I replaced the memory with 1GB of PC2700. Every project I've run now works properly. Not tried Enigma yet though as I'm mopping up milestones on other projects ;) |
elgordodude Joined: 3 Jun 10 Posts: 9 Credit: 1,289,107 RAC: 0 |
"Just for info: all the enigma workunits have fixed lengths, so if suddenly WU takes more than a couple of hours of CPU time, it's probably broken - unless it runs on really old hardware. I think that even a Pentium III can go through most of the workunit types in less than 12 hours." Just checked in on my PIII in the corner and found an m4-pldrv64 WU that's been running for 41 hours. The host has been reliable; it even did okay with those 210s a month or two ago. For the moment it has started running some regular pldrv's in high priority and those look okay. I don't think I've seen this type of WU before. Is it supposed to take this long, or is this the beginning of that box's swan song? Here's the task: http://www.enigmaathome.net/result.php?resultid=20309640 Here's the host: http://www.enigmaathome.net/show_host_detail.php?hostid=34839 |
TenthReality Joined: 6 Sep 09 Posts: 6 Credit: 550,574 RAC: 0 |
The pldrv64 series are taking nearly 6 hours on a host where the average time is 20 minutes. So given what you've linked for the P3 host, I don't think 41 hours is that unheard of. Is the % complete going up at all? Before too long this unit will return on the linked host for comparison: http://www.enigmaathome.net/result.php?resultid=20678922 http://www.enigmaathome.net/show_host_detail.php?hostid=42242 Fairly long WUs for Enigma, some of the longest I've seen to date. |
elgordodude Joined: 3 Jun 10 Posts: 9 Credit: 1,289,107 RAC: 0 |
Unfortunately it's a Linux box, so the progress bar swings wildly regardless of elapsed time on all tasks. Which reminds me, any news on the new Linux wrapper? Generally I don't care, but it would be useful today. Currently, it's at 73.344, but as I said that's meaningless. The average times on that box are around 200 minutes (she is an old girl), but if you're saying 20 minutes extrapolates to 360 on your machine, then I guess I'm looking at a runtime around 3,600 minutes, or 60 hours. So it should be around 66% done. I'll let it run and see what happens. As long as it keeps taking regular work units at high priority when they get close, worst case is it will time out on the 14th. Generally though, does anyone know what's up with these super units? Like these and the pldrv210, is the code really complicated, or is there a ton of it, or both? Should have looked closer: your task is listed as pldrv64, and those haven't been a problem. This one is really weird, because it's listed as m4-pldrv. I just downloaded some new tasks on another box that were labeled m3-pldrv as a download, but then showed pldrv as a task. Is it possible this task is corrupted given the m4 designation? |
TenthReality Joined: 6 Sep 09 Posts: 6 Credit: 550,574 RAC: 0 |
I did not even notice the naming difference between the two. I'm not 100% positive, but aren't the m3/m4 prefixed ones imported from the M4 project, whereas those without a prefix are workunits that come from our side of things? Are we just looking at 2 different things trying to crack the same long message, which would explain the similar timing? Also, right around New Year's there was a prefix renaming on new units that was a "Happy New Year" type thing that has since gone away. I'm wondering if something occurred during the name/rename situation. Perhaps our awesome admin can chime in here to try to figure this one out. |
elgordodude Joined: 3 Jun 10 Posts: 9 Credit: 1,289,107 RAC: 0 |
Well, all is good: it completed in a hair under 57 hours for a whopping 817 credits! However, it gets weirder. I know you're on Windows and may not be familiar with the bug, but until now I've never seen a progress bar work on Linux. Once I started to keep an eye on it I noticed the progress bar moving at a rather consistent .010 every few minutes, so I wrote down some timestamps. At 41 1/2 hours: the stated 73.344; at 42 1/2: 74.775; at 43 1/2: 76.263; at 57: 100. This gives a shockingly accurate estimate right around 1.75% per hour. Has anyone else seen behavior like this on a Linux machine? Especially with these work units. The app hasn't changed; I'm crunching a 3pldrv210 and a pldrv59 that are both appropriately jumping erratically. So why did this one work? Was it intentional, is there something different about the code that caused it to happen to work, or is this a case of a thousand monkeys on a thousand typewriters? |
TenthReality Joined: 6 Sep 09 Posts: 6 Credit: 550,574 RAC: 0 |
The pldrv210 series appear to be taking the same length of time as the 64s, should you run into one. |
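
The two back-of-the-envelope estimates used in this thread (scaling a typical runtime by the slowdown seen on another host, and reading a rate off the progress timestamps) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything taken from the Enigma@home application or BOINC: the function names are made up for this sketch, the slowdown is assumed to carry over between hosts, and progress is assumed to be roughly linear in time.

```python
def scaled_runtime(ref_normal_min, ref_long_min, own_normal_min):
    """Scale this host's usual runtime by the slowdown seen on a reference host.

    All arguments are in minutes. Assumes (hypothetically) that the long
    work unit is slower by the same factor on both hosts.
    """
    slowdown = ref_long_min / ref_normal_min
    return own_normal_min * slowdown


def rates_per_hour(samples):
    """Percent-per-hour between consecutive (hours, percent) samples."""
    return [
        (p2 - p1) / (h2 - h1)
        for (h1, p1), (h2, p2) in zip(samples, samples[1:])
    ]


def hours_remaining(samples):
    """Estimate hours left from the first and last sample.

    Assumes progress stays linear at the average rate between them.
    """
    (h_first, p_first), (h_last, p_last) = samples[0], samples[-1]
    rate = (p_last - p_first) / (h_last - h_first)
    return (100.0 - p_last) / rate


if __name__ == "__main__":
    # 20 minutes normally vs. about 6 hours (360 minutes) on the reference
    # host, and about 200 minutes normally on the Pentium III.
    total_min = scaled_runtime(20, 360, 200)
    print(f"Scaled estimate: {total_min:.0f} minutes ({total_min / 60:.0f} hours)")

    # Timestamps written down in the thread: (elapsed hours, percent done).
    observed = [(41.5, 73.344), (42.5, 74.775), (43.5, 76.263)]
    print("Rates (%/h):", [round(r, 3) for r in rates_per_hour(observed)])
    print(f"Hours remaining at 43.5 h: {hours_remaining(observed):.1f}")
```

With the numbers from the thread, the scaled estimate comes out at 3,600 minutes (60 hours), and the first three timestamps alone project a finish at roughly 60 hours, so both rough methods land within a few hours of the 57 hours the task actually took. The overall average of 100% over 57 hours is where the "about 1.75% per hour" figure comes from.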