Message boards : Number crunching : Auto-ban faulty hosts
Author | Message |
---|---|
Tex1954 Joined: 4 Feb 17 Posts: 4 Credit: 24,049,346 RAC: 0 |
Ummm, I am just getting an older system going and it crapped out a couple of times, and I think it may be or may get banned. No tasks are showing in BOINC on my end. I did a BOINC update; your system shows 16 errors and 16 in progress... I think it got glitched somehow... Anyway, the host ID is 4878. Can you reset it, correct the "In Progress" data, and unban it if it is banned? It is set up with 2 GTX 580 cards (yes, old, I know) and an i7-950 on an X58 mobo. Right now both GTX 580s are running PrimeGrid tasks flawlessly... seems like the hardware and drivers are correct now. Thanks! 8-) |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 534 Credit: 72,451,573 RAC: 0 |
Host 4878 is not banned. "Error" status is just a client error, it doesn't count. Only "Invalid" status counts. |
Tex1954 Joined: 4 Feb 17 Posts: 4 Credit: 24,049,346 RAC: 0 |
Ohh, okay! Thanks! 8-) |
Salt Joined: 12 Dec 17 Posts: 2 Credit: 222,545,447 RAC: 0 |
Would be grateful if you could remove my MacBook Pro, host #20220... It has insufficient GPU RAM to run Amicable Numbers, so until the base memory requirement goes down it is just wasting space in your statistics. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 534 Credit: 72,451,573 RAC: 0 |
It can still run CPU tasks. But if you consider only GPUs for Amicable Numbers, this host will be auto-removed after 3 months of inactivity. |
Salt Joined: 12 Dec 17 Posts: 2 Credit: 222,545,447 RAC: 0 |
Thanks, Sergei - I found out how to remove it. I only do GPU Amicable tasks. Yoyo@home takes up all my CPUs. |
steverocky Joined: 14 Mar 17 Posts: 1 Credit: 21,111,379 RAC: 0 |
I did 1300 good work units and 4 invalid and you banned my computer!!!!!!! BYE!!! |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 534 Credit: 72,451,573 RAC: 0 |
(Quoting steverocky: I did 1300 good work units and 4 invalid and you banned my computer!!!!!!! BYE!!!) Do you realise this ban was automatic? Unless you fix this PC, it will stay banned. |
mikey Joined: 20 Feb 17 Posts: 20 Credit: 1,547,501,986 RAC: 654,863 |
(Quoting steverocky: I did 1300 good work units and 4 invalid and you banned my computer!!!!!!! BYE!!!) You guys do realize that he's using the latest driver and the problem was: 1 error detected in the compilation of "C:\Users\steve\AppData\Local\Temp\OCL2252T1.cl". Frontend phase failed compilation. OpenCL.cpp, line 397: Trying to disable 'goto' and build again 01:16:06 (2252): called boinc_finish(0). Please tell me how this is a USER CAUSED ERROR? Because I too could be banned if this is a problem caused by the users!! |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 534 Credit: 72,451,573 RAC: 0 |
(Quoting mikey: 1 error detected in the compilation of "C:\Users\steve\AppData\Local\Temp\OCL2252T1.cl".) That PC has this message in all tasks' logs, including a few hundred valid ones. It's normal: in this case it failed to compile because of a known issue with AMD drivers and a workaround was applied, so it compiled and ran fine after that. But an incorrect result was returned for a few tasks out of a few hundred valid tasks. This is a direct indication of unstable hardware. Many similar configurations run with the same message in their logs, but without problems (validation failures) on this project. Only validation failures are counted: a validation failure is when a client didn't report any errors on exit but returned an incorrect result. This can jeopardize the whole project's target and can't be tolerated. Only the "Completed, marked as invalid" status is counted for banning. (Quoting mikey: Because I too could be banned if this is a problem caused by the users!!) No, your computers have 0 "Completed, marked as invalid" tasks. P.S. There are over 5000 hosts that run Amicable Numbers every day. Only 3 hosts are currently banned, so please don't exaggerate this problem. |
mikey Joined: 20 Feb 17 Posts: 20 Credit: 1,547,501,986 RAC: 654,863 |
(Quoted: 1 error detected in the compilation of "C:\Users\steve\AppData\Local\Temp\OCL2252T1.cl".) I'm not trying to... the point is the host did Valid (240) workunits and ONLY Invalid (2) ones, and yet it was banned. My point is that perhaps your criteria for banning are a bit too sensitive, and then to say 'well, it's automatic' isn't being genuine, because SOMEONE had to set the criteria for the 'automatic' procedure to be implemented. It's not built into the BOINC software to do that, or several other projects would have long ago banned hundreds of graphics cards, but they haven't. This is like the accounting department telling me 'the computer made a mistake'. NO, computers don't do that; they only do what they are told to do. That's why we all love them: they can do the same thing over and over and over again until either a part fails or the programming fails, but they do not randomly ban PCs or make mistakes. I don't want to get into an argument here; it's your project and you can run it the way you like, but to start banning PCs with just 2 errors out of 242 workunits seems a bit harsh. What's the procedure for the guy to bring his PC back? In the banning process he had more workunits on his PC that he finished crunching but was NOT allowed to return them to even figure out if the problem was fixed or not; he had to abort them all and move to another project, where the graphics card is working just fine. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 534 Credit: 72,451,573 RAC: 0 |
(Quoting mikey: to start banning PCs with just 2 errors out of 242 workunits seems a bit harsh. What's the procedure for the guy to bring his PC back? In the banning process he had more workunits on his PC that he finished crunching but was NOT allowed to return them to even figure out if the problem was fixed or not; he had to abort them all and move to another project, where the graphics card is working just fine.) It's not harsh, because extremely few computers actually get banned here. He needs to check his GPU, run stress tests, and make sure it doesn't crash, give errors, or overheat, and I'll unban it then. "Working just fine" depends on the type of workload. It may be just on the edge for some projects and run fine, but cross the edge and fail sometimes on other projects. |
mikey Joined: 20 Feb 17 Posts: 20 Credit: 1,547,501,986 RAC: 654,863 |
(Quoted: to start banning PCs with just 2 errors out of 242 workunits seems a bit harsh. What's the procedure for the guy to bring his PC back? In the banning process he had more workunits on his PC that he finished crunching but was NOT allowed to return them to even figure out if the problem was fixed or not; he had to abort them all and move to another project, where the graphics card is working just fine.) He's not a rookie cruncher and has been crunching for a long time, so he understands all that, but it's okay: 'lots of fish in the sea', as they say, and he's now happily crunching just fine at another project. |
SoNic1967 Joined: 8 Sep 18 Posts: 13 Credit: 23,954,022 RAC: 0 |
I am not receiving any units for this host: https://sech.me/boinc/Amicable/show_host_detail.php?hostid=52502 I had three NVIDIA errors because I was testing the kernel size, and at 23 it crashed. Later edit: All is good now. |
SoNic1967 Joined: 8 Sep 18 Posts: 13 Credit: 23,954,022 RAC: 0 |
Canceled 7 nVidia GPU tasks because I removed that GPU from the system. Too slow. |
Beyond Joined: 12 Apr 17 Posts: 13 Credit: 2,369,891,829 RAC: 0 |
(Quoting Sergei Chernykh: It's not harsh, because extremely few computers actually get banned here.) And that's a problem for the project and the rest of us that are still processing WUs. There's a mechanism in BOINC that will lower the number of WUs a bad host gets per day until it only receives 1/day. Please institute this mechanism. If the host starts producing valid WUs, its allocation will increase. Right now the situation is ridiculous, with some hosts spewing out hundreds of errors. This causes valid users to receive erroneous "Completed, can't validate" messages, all because of good WUs that get flagged as "Too many errors (may have bug)" when they are fine. Example 1: https://sech.me/boinc/Amicable/workunit.php?wuid=9905159 My machine is listed as "Completed, can't validate" because of "Too many errors (may have bug)". In reality EVERY other machine is one that produces almost all errors. Example 2: https://sech.me/boinc/Amicable/workunit.php?wuid=9880120 Once again, my machine is listed as "Completed, can't validate" because of "Too many errors (may have bug)". In reality the 6 other machines are ones that produce almost all errors. This is just crazy. Please get these bad machines under control. |
Tigers_Dave Joined: 26 Oct 22 Posts: 46 Credit: 8,027,297,795 RAC: 25,531,091 |
Just want to add my $0.02 on this topic. Over the past few days, five of my twelve crunchers were auto-banned as a result of too many invalid results (4 out of 200-800 tasks). I emailed Sergei and explained that it appeared to be a more or less anomalous situation that was not due to overclocking or some other change in my configuration. Sergei graciously accepted my explanation and removed the ban within a few hours of my request. So, I am not going to complain about the auto-ban process, because Sergei is willing to accept reasonable explanations and un-ban a host in a timely manner. I don't have the time to pay much attention to BOINC on my machines, as they are workstations for me and my graduate students. Thus, I value plug-and-play distributed computing solutions for my machines. With the demise of SETI and Collatz, Amicable Numbers and Rosetta continue to provide that to me, and I am delighted to contribute my GPU and CPU cycles to these projects. War Eagle! Tigers_Dave |
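
The two policies debated in this thread can be put side by side in a short Python sketch: the ban rule Sergei describes (only "Completed, marked as invalid" results count, client errors do not) and the quota backoff Beyond asks for (fewer tasks per day after bad results, down to 1/day, more again after good ones). Everything named here is an assumption for illustration only: the function names, the `max_invalid` threshold, the `VALID` label, and the halve/double quota steps are hypothetical, because the thread states neither the project's real threshold nor BOINC's exact adjustment rule.

```python
from collections import Counter

# Status labels. INVALID and ERROR are quoted from the thread;
# VALID is a placeholder label assumed for this sketch.
INVALID = "Completed, marked as invalid"
ERROR = "Error"
VALID = "Valid"


def should_ban(outcomes, max_invalid):
    """Return True once the number of invalid results reaches max_invalid.

    Client errors are ignored, as stated in the thread. max_invalid is a
    hypothetical parameter; the project's real threshold is not given.
    """
    counts = Counter(outcomes)
    return counts[INVALID] >= max_invalid


def adjust_daily_quota(quota, result_ok, max_quota):
    """Hypothetical backoff: halve the daily quota after a bad result
    (never below 1) and double it after a good one (never above max_quota).
    This is not claimed to be BOINC's actual rule.
    """
    if result_ok:
        return min(max_quota, quota * 2)
    return max(1, quota // 2)
```

With an assumed threshold of 4, a host with 1300 valid, 16 error and 4 invalid results would be flagged by `should_ban`, while the 16 client errors on their own would not count. Under the assumed backoff, a host that keeps returning bad results drops to one task per day after a handful of failures and climbs back only by returning good ones.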
©2024 Sergei Chernykh