Auto-ban faulty hosts

Message boards : Number crunching : Auto-ban faulty hosts


Tex1954

Joined: 4 Feb 17
Posts: 4
Credit: 24,049,346
RAC: 0
   
Message 501 - Posted: 26 Jun 2017, 16:07:07 UTC
Last modified: 26 Jun 2017, 16:16:25 UTC

Ummm, I am just getting an older system going and it crapped out a couple times and I think it may be or get banned.

No tasks are showing in BOINC on my end. I did a BOINC update, and your system shows 16 errors and 16 in progress... I think it got glitched somehow...

Anyway, the host ID is 4878

Can you reset it, correct the "In Progress" data and unban it if it is banned? It is set up with 2 GTX 580 cards (yes, old, I know) and an i7-950 on an X58 mobo.

Right now both GTX580's are running PrimeGrid tasks flawlessly... seems like the hardware and drivers are correct now.

Thanks!

8-)
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 506
Credit: 72,451,573
RAC: 0
   
Message 502 - Posted: 26 Jun 2017, 18:10:13 UTC
Last modified: 26 Jun 2017, 18:10:23 UTC

Host 4878 is not banned. "Error" status is just a client error, it doesn't count. Only "Invalid" status counts.
Tex1954

Joined: 4 Feb 17
Posts: 4
Credit: 24,049,346
RAC: 0
   
Message 503 - Posted: 26 Jun 2017, 20:49:37 UTC - in response to Message 502.  

Ohh, okay! Thanks!

8-)
Profile Salt
Joined: 12 Dec 17
Posts: 2
Credit: 222,545,447
RAC: 0
   
Message 690 - Posted: 23 Dec 2017, 11:32:15 UTC

Would be grateful if you could remove my Macbook Pro host # 20220 ... It has insufficient GPU RAM to run Amicable Numbers so until the base memory requirement goes down it is just wasting space in your statistics.
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 506
Credit: 72,451,573
RAC: 0
   
Message 691 - Posted: 23 Dec 2017, 13:07:03 UTC - in response to Message 690.  

It can still run CPU tasks. But if you consider only GPUs for Amicable Numbers, this host will be auto-removed after 3 months of inactivity.
Profile Salt
Joined: 12 Dec 17
Posts: 2
Credit: 222,545,447
RAC: 0
   
Message 708 - Posted: 19 Jan 2018, 14:32:29 UTC - in response to Message 691.  

Thanks Sergei - I found out how to remove it. I only do GPU Amicable tasks. Yoyo@home takes up all my CPUs
steverocky

Joined: 14 Mar 17
Posts: 1
Credit: 21,111,379
RAC: 0
   
Message 773 - Posted: 24 Mar 2018, 17:27:23 UTC

I did 1300 good work units and 4 invalid and you banned my computer!!!!!!! BYE!!!
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 506
Credit: 72,451,573
RAC: 0
   
Message 774 - Posted: 25 Mar 2018, 7:18:14 UTC - in response to Message 773.  

I did 1300 good work units and 4 invalid and you banned my computer!!!!!!! BYE!!!

Do you realise this ban was automatic? Unless you fix this PC, it will stay banned.
mikey
Joined: 20 Feb 17
Posts: 19
Credit: 503,444,318
RAC: 65,653
    
Message 775 - Posted: 26 Mar 2018, 23:44:24 UTC - in response to Message 774.  

I did 1300 good work units and 4 invalid and you banned my computer!!!!!!! BYE!!!


Do you realise this ban was automatic? Unless you fix this PC, it will stay banned.


You guys do realize that he's using the latest driver and the problem was:

1 error detected in the compilation of "C:\Users\steve\AppData\Local\Temp\OCL2252T1.cl".
Frontend phase failed compilation.

OpenCL.cpp, line 397: Trying to disable 'goto' and build again
01:16:06 (2252): called boinc_finish(0)

please tell me how this is a USER CAUSED ERROR? Because I too could be banned if this is a problem caused by the users!!
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 506
Credit: 72,451,573
RAC: 0
   
Message 776 - Posted: 27 Mar 2018, 6:39:55 UTC - in response to Message 775.  
Last modified: 27 Mar 2018, 7:06:24 UTC

1 error detected in the compilation of "C:\Users\steve\AppData\Local\Temp\OCL2252T1.cl".
Frontend phase failed compilation.

OpenCL.cpp, line 397: Trying to disable 'goto' and build again
01:16:06 (2252): called boinc_finish(0)

please tell me how this is a USER CAUSED ERROR? Because I too could be banned if this is a problem caused by the users!!

That PC has this message in the logs of all its tasks, including a few hundred valid ones. It's normal: in this case the kernel failed to compile because of a known issue with AMD drivers and a workaround was applied, so it compiled and ran fine after that.

But an incorrect result was returned for a few tasks out of a few hundred valid tasks. This is a direct indication of unstable hardware. Many similar configurations run with the same message in their logs, but without problems (validation failures) on this project.

Only validation failures are counted: that is when a client didn't report any errors on exit, but returned an incorrect result. This can jeopardize the whole project's target and can't be tolerated.

Only "Completed, marked as invalid" status is counted for banning.

Because I too could be banned if this is a problem caused by the users!!

No, your computers have 0 "Completed, marked as invalid" tasks.

P.S. There are over 5000 hosts that run Amicable Numbers every day. Only 3 hosts are currently banned, so please don't exaggerate this problem.
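To make the rule above concrete, here is a minimal sketch of a ban check that ignores client errors and counts only invalid results. It is an illustration, not the project's actual code: the function names, the "Valid" label and the `max_invalid` threshold are all hypothetical, and the real criterion is not stated in this thread.

```python
from collections import Counter

# Status labels; ERROR and INVALID follow the wording used in this thread,
# VALID is an assumed label for a result that passed validation.
VALID = "Valid"
INVALID = "Completed, marked as invalid"
ERROR = "Error"


def should_ban(outcomes, max_invalid):
    """Return True once the number of invalid results reaches max_invalid.

    Client errors are deliberately not counted. max_invalid is a
    hypothetical threshold chosen by the caller.
    """
    counts = Counter(outcomes)
    return counts[INVALID] >= max_invalid


def invalid_rate(outcomes):
    """Fraction of validated-or-invalid results that were invalid."""
    counts = Counter(outcomes)
    checked = counts[VALID] + counts[INVALID]
    return counts[INVALID] / checked if checked else 0.0
```

With these definitions, a host that has 16 "Error" tasks and nothing else is never banned, whatever the threshold, while a host with a few invalid results among many valid ones can be banned even though its invalid rate is a fraction of a percent.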
mikey
Joined: 20 Feb 17
Posts: 19
Credit: 503,444,318
RAC: 65,653
    
Message 777 - Posted: 27 Mar 2018, 13:16:39 UTC - in response to Message 776.  

1 error detected in the compilation of "C:\Users\steve\AppData\Local\Temp\OCL2252T1.cl".
Frontend phase failed compilation.

OpenCL.cpp, line 397: Trying to disable 'goto' and build again
01:16:06 (2252): called boinc_finish(0)

please tell me how this is a USER CAUSED ERROR? Because I too could be banned if this is a problem caused by the users!!

That PC has this message in the logs of all its tasks, including a few hundred valid ones. It's normal: in this case the kernel failed to compile because of a known issue with AMD drivers and a workaround was applied, so it compiled and ran fine after that.

But an incorrect result was returned for a few tasks out of a few hundred valid tasks. This is a direct indication of unstable hardware. Many similar configurations run with the same message in their logs, but without problems (validation failures) on this project.

Only validation failures are counted: that is when a client didn't report any errors on exit, but returned an incorrect result. This can jeopardize the whole project's target and can't be tolerated.

Only "Completed, marked as invalid" status is counted for banning.

Because I too could be banned if this is a problem caused by the users!!


No, your computers have 0 "Completed, marked as invalid" tasks.

P.S. There are over 5000 hosts that run Amicable Numbers every day. Only 3 hosts are currently banned, so please don't exaggerate this problem.


I'm not trying to... the point is the host did Valid (240) workunits and ONLY Invalid (2) ones, and yet it was banned. My point is that perhaps your criteria for banning are a bit too sensitive, and to then say 'well, it's automatic' isn't being genuine, because SOMEONE had to set the criteria for the 'automatic' procedure. It's not built into the BOINC software to do that, or several other projects would long ago have banned hundreds of graphics cards, but they haven't. This is like the accounting department telling me 'the computer made a mistake'. NO, computers don't do that; they only do what they are told to do, and that's why we all love them. They can do the same thing over and over again until either a part fails or the programming fails, but they do not randomly ban PCs or make mistakes.

I don't want to get into an argument here; it's your project and you can run it the way you like, but to start banning PCs with just 2 errors out of 242 workunits seems a bit harsh. What's the procedure for the guy to bring his PC back? In the banning process he had more workunits on his PC that he finished crunching, but he was NOT allowed to return them to even figure out if the problem was fixed or not; he had to abort them all and move to another project, where the graphics card is working just fine.
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 506
Credit: 72,451,573
RAC: 0
   
Message 778 - Posted: 27 Mar 2018, 14:24:20 UTC - in response to Message 777.  

to start banning PCs with just 2 errors out of 242 workunits seems a bit harsh. What's the procedure for the guy to bring his PC back? In the banning process he had more workunits on his PC that he finished crunching, but he was NOT allowed to return them to even figure out if the problem was fixed or not; he had to abort them all and move to another project, where the graphics card is working just fine.

It's not harsh, because extremely few computers actually get banned here. He needs to check his GPU, run stress tests and make sure it doesn't crash or give errors, doesn't overheat, and I'll unban it then. "Working just fine" depends on the type of workload. It may be just on the edge for some projects and run fine, but cross the edge and fail sometimes on other projects.
mikey
Joined: 20 Feb 17
Posts: 19
Credit: 503,444,318
RAC: 65,653
    
Message 783 - Posted: 28 Mar 2018, 3:31:35 UTC - in response to Message 778.  

to start banning PCs with just 2 errors out of 242 workunits seems a bit harsh. What's the procedure for the guy to bring his PC back? In the banning process he had more workunits on his PC that he finished crunching, but he was NOT allowed to return them to even figure out if the problem was fixed or not; he had to abort them all and move to another project, where the graphics card is working just fine.


It's not harsh, because extremely few computers actually get banned here. He needs to check his GPU, run stress tests and make sure it doesn't crash or give errors, doesn't overheat, and I'll unban it then. "Working just fine" depends on the type of workload. It may be just on the edge for some projects and run fine, but cross the edge and fail sometimes on other projects.


He's not a rookie cruncher and has been crunching for a long time, so he understands all that. But it's okay, 'lots of fish in the sea' as they say, and he's now happily crunching at another project.
SoNic1967

Joined: 8 Sep 18
Posts: 13
Credit: 23,954,022
RAC: 0
  
Message 920 - Posted: 18 Sep 2018, 17:42:42 UTC
Last modified: 18 Sep 2018, 18:35:43 UTC

I am not receiving any units for: https://sech.me/boinc/Amicable/show_host_detail.php?hostid=52502
I had three nVidia errors because I was testing the Kernel size and at 23 it crashed.
LE: All is good now.
SoNic1967

Joined: 8 Sep 18
Posts: 13
Credit: 23,954,022
RAC: 0
  
Message 922 - Posted: 19 Sep 2018, 19:15:05 UTC

Canceled 7 nVidia GPU tasks because I removed that GPU from the system. Too slow.
Profile Beyond
Joined: 12 Apr 17
Posts: 13
Credit: 2,366,316,500
RAC: 495,795
    
Message 1076 - Posted: 29 Jan 2019, 20:00:02 UTC - in response to Message 778.  
Last modified: 29 Jan 2019, 20:02:03 UTC

It's not harsh, because extremely few computers actually get banned here.

And that's a problem for the project and the rest of us that are still processing WUs. There's a mechanism in BOINC that will lower the number of WUs a bad host gets per day until it only receives 1/day. Please institute this mechanism. If the host starts producing valid WUs their allocation will increase. Right now the situation is ridiculous, with some hosts spewing out hundreds of errors. This causes valid users to receive erroneous "Completed, can't validate" messages. All because of good WUs that get flagged as "Too many errors (may have bug)" when they are fine.
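As a rough illustration of the kind of adaptive quota described above, here is a minimal sketch in which a host's daily allowance shrinks with each bad result and grows back with each valid one. The update rule (subtract one on failure, double on success), the function names and the parameters are assumptions made for this example; it is not BOINC's actual scheduler code.

```python
def update_daily_quota(quota, result_ok, max_quota):
    """Return the new per-day task quota for a host after one result.

    Assumed rule for illustration: a bad result lowers the quota by one,
    never below 1; a valid result doubles it, capped at max_quota.
    """
    if result_ok:
        return min(quota * 2, max_quota)
    return max(quota - 1, 1)


def simulate(outcomes, start_quota, max_quota):
    """Apply update_daily_quota over a sequence of True/False outcomes."""
    quota = start_quota
    for ok in outcomes:
        quota = update_daily_quota(quota, ok, max_quota)
    return quota
```

Under this rule a host that keeps failing drifts down to one task per day and stays there, while a host that gets repaired climbs back to the full quota after a handful of valid results.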

Example 1:

https://sech.me/boinc/Amicable/workunit.php?wuid=9905159

My machine is listed as "Completed, can't validate" because of "Too many errors (may have bug)". In reality EVERY other machine is one that produces almost all errors.


Example 2:

https://sech.me/boinc/Amicable/workunit.php?wuid=9880120

Once again, my machine is listed as "Completed, can't validate" because of "Too many errors (may have bug)". In reality the 6 other machines are ones that produce almost all errors.
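The two examples follow the same pattern, which can be sketched as a per-workunit error limit: once more than a set number of results come back as errors, the workunit is abandoned and any good result already returned can no longer be validated. The sketch below is a simplification under stated assumptions; the function name, the `"success"`/`"error"` strings and the numeric limits are hypothetical, chosen only to mirror the situation described.

```python
def workunit_status(results, min_quorum, max_error_results):
    """Classify one workunit from the outcomes of its results.

    results is a list of "success" or "error" strings, one per host.
    The error limit is checked first, so enough bad hosts can sink a
    workunit even when a good result has already been returned.
    """
    errors = sum(1 for r in results if r == "error")
    successes = sum(1 for r in results if r == "success")
    if errors > max_error_results:
        return "Too many errors (may have bug)"
    if successes >= min_quorum:
        return "ready to validate"
    return "waiting for more results"
```

For instance, with an assumed quorum of 2 and an assumed limit of 5 errors, one good result plus six error results is classified as "Too many errors (may have bug)", even though the one good result may be perfectly fine.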

This is just crazy. Please get these bad machines under control.
Profile Tigers_Dave

Joined: 26 Oct 22
Posts: 39
Credit: 2,578,482,339
RAC: 9,534,390
    
Message 1848 - Posted: 12 Mar 2024, 0:29:11 UTC
Last modified: 12 Mar 2024, 0:39:51 UTC

Just want to add my $0.02 on this topic. Over the past few days, five of my twelve crunchers were auto banned as a result of too many invalid results (4 out of 200-800 tasks). I emailed Sergei and explained that it appeared to be a more or less anomalous situation that was not due to overclocking or some other change in my configuration. Sergei graciously accepted my explanation and removed the ban within a few hours of my request. So, I am not going to complain about the auto-ban process, because Sergei is willing to accept reasonable explanations and un-ban a host in a timely manner.

I don't have the time to pay much attention to BOINC on my machines, as they are workstations for me and my graduate students. Thus, I value plug-and-play distributed computing solutions for my machines. With the demise of SETI and Collatz, Amicable Numbers and Rosetta continue to provide that to me, and I am delighted to contribute my GPU and CPU cycles to these projects.

War Eagle!

Tigers_Dave


©2024 Sergei Chernykh