Auto-ban faulty hosts

Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 225 - Posted: 26 Feb 2017, 10:11:47 UTC
Last modified: 26 Feb 2017, 18:16:59 UTC

I've set up a script that bans hosts which have:

1) More than 3 invalid results in the last week ("Completed, marked as invalid")
or
2) More than 10 inconclusive results in the last 2 days

Normally, hosts don't produce invalid/inconclusive results at all. If you see one of your hosts as banned, it means that you need to fix it: reduce overclocking, update GPU drivers etc. and contact me as soon as you've fixed it.
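A minimal sketch of the two ban conditions listed above. This is not the actual script used on the server: the `Result` record, its field names, and the status strings are hypothetical, and only the two thresholds and time windows come from the post.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds and windows as stated in the post; everything else is assumed.
MAX_INVALID_PER_WEEK = 3
MAX_INCONCLUSIVE_PER_2_DAYS = 10


@dataclass
class Result:
    host_id: int
    status: str         # hypothetical labels: "invalid", "inconclusive", "valid"
    reported: datetime  # when the result was returned


def should_ban(results, host_id, now):
    """Return True if the host exceeds either threshold within its window."""
    week_ago = now - timedelta(days=7)
    two_days_ago = now - timedelta(days=2)
    mine = [r for r in results if r.host_id == host_id]
    invalid = sum(
        1 for r in mine if r.status == "invalid" and r.reported >= week_ago
    )
    inconclusive = sum(
        1 for r in mine
        if r.status == "inconclusive" and r.reported >= two_days_ago
    )
    # "More than" in the post means strictly greater than the threshold.
    return (invalid > MAX_INVALID_PER_WEEK
            or inconclusive > MAX_INCONCLUSIVE_PER_2_DAYS)
```

Under these assumptions, a host with exactly 3 invalid results in the last week is not banned; the fourth one triggers the ban.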
ID: 225 · Rating: 0
Tern

Joined: 17 Feb 17
Posts: 27
Credit: 69,545,002
RAC: 52
   
Message 230 - Posted: 26 Feb 2017, 18:02:15 UTC - in response to Message 225.  
Last modified: 26 Feb 2017, 18:05:05 UTC

Will a notice be sent to banned hosts? Otherwise I can see a lot of hours spent scratching heads going "why am I not getting work?!?!?"...

I assume this is in addition to the 'normal' BOINC rollback mechanism; a host returning garbage should be limited to 1 task/day pretty quickly even without the ban.

lhcathome-dev had problems because of VirtualBox and their redundant work scheme: BOINC didn't know whether a task was good or not, so tasks kept being sent and no notices went out. They had to resort to manually sending private messages, and then emails when nobody read those. But they are a special (self-inflicted) case.
ID: 230 · Rating: 0
Bryan

Joined: 23 Jan 17
Posts: 17
Credit: 278,854,007
RAC: 0
   
Message 231 - Posted: 26 Feb 2017, 18:11:59 UTC

You need to be careful with banning for invalids. I have quite a few showing up where I completed the WU but a wingman aborted or timed out and then it was issued to a 3rd person. Apparently the server then CANCELLED the WU because the 3rd person is showing "not needed". The WU winds up showing on my account as "completed, can't validate."
ID: 231 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 232 - Posted: 26 Feb 2017, 18:12:51 UTC - in response to Message 230.  
Last modified: 26 Feb 2017, 18:15:48 UTC

Banned hosts will get a "Not accepting requests from this host" error message. Pretty obvious, I think. They're also marked with a red "BANNED" sign on the hosts page.

P.S. Only "Completed, marked as invalid" tasks count for banning.
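For anyone wondering why a host stopped getting work, the error text quoted above can be searched for in saved client messages. A minimal sketch follows; the function name and the sample lines are made up for illustration and are not real client output. Only the message string itself comes from this post.

```python
# The exact error text quoted in the post.
BAN_MESSAGE = "Not accepting requests from this host"


def find_ban_notices(log_lines):
    """Return the lines that contain the ban message, without trailing newlines."""
    return [line.rstrip("\n") for line in log_lines if BAN_MESSAGE in line]


# Hypothetical sample lines, not copied from a real BOINC client log.
sample_log = [
    "[example project] Requesting new tasks\n",
    "[example project] Not accepting requests from this host\n",
]

for notice in find_ban_notices(sample_log):
    print(notice)
```

The same function works on a file opened for reading, since iterating over a text file yields its lines.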
ID: 232 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 233 - Posted: 26 Feb 2017, 18:14:58 UTC - in response to Message 232.  
Last modified: 26 Feb 2017, 18:16:15 UTC

And only 2 hosts have been banned so far, and neither was a mistake. No false positives so far.
ID: 233 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 234 - Posted: 26 Feb 2017, 18:34:51 UTC - in response to Message 230.  

I assume this is in addition to the 'normal' BOINC rollback mechanism; a host returning garbage should be limited to 1 task/day pretty quickly even without the ban.

This mechanism was deprecated. It's not in the latest BOINC code which I use.
ID: 234 · Rating: 0
fractal

Joined: 1 Feb 17
Posts: 8
Credit: 100,048,293
RAC: 0
   
Message 237 - Posted: 27 Feb 2017, 3:57:51 UTC - in response to Message 233.  

And only 2 hosts have been banned so far, and neither was a mistake. No false positives so far.

Both of them are mine, both are arm64 systems running an older client. I need to recompile the new source.

I only see one "invalid" on one of them and two on the other but the history is pretty short.
ID: 237 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 238 - Posted: 27 Feb 2017, 5:46:38 UTC - in response to Message 237.  

Both of them are mine, both are arm64 systems running an older client. I need to recompile the new source.

I only see one "invalid" on one of them and two on the other but the history is pretty short.

Unbanned them.
ID: 238 · Rating: 0
BlackObelisk

Joined: 14 Mar 17
Posts: 1
Credit: 100,303,424
RAC: 0
   
Message 344 - Posted: 16 Mar 2017, 3:09:13 UTC - in response to Message 225.  

Normally, hosts don't produce invalid/inconclusive results at all. If you see one of your hosts as banned, it means that you need to fix it: reduce overclocking, update GPU drivers etc. and contact me as soon as you've fixed it.


https://sech.me/boinc/Amicable/show_host_detail.php?hostid=2181

This is a mobile platform with an integrated screen and graphics card that doesn't support overclocking, so I guess the problem could be in the drivers, which are hard to update due to the administration's "we love Windows XP", "no update/no upgrade" policy. So the question is: what exactly is the "etc." that could be played around with, and is there some sort of logging, or maybe offline testing software? The same GPU seems to produce valid results for other BOINC projects, but they have either vague or distant goals to accomplish, so I switched to something more amicable. I was planning to use more computers with a similar configuration to crunch this project, but if I can't fix it, I will leave them crunching something less amicable.
ID: 344 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 345 - Posted: 16 Mar 2017, 6:03:31 UTC - in response to Message 344.  
Last modified: 16 Mar 2017, 6:10:39 UTC

The most common problem with notebook GPUs is overheating, and this project puts quite a load on the GPU. But most likely it's something wrong with the drivers. Only two hosts have been banned so far, and both used AMD GPUs.
ID: 345 · Rating: 0
mikey

Joined: 20 Feb 17
Posts: 20
Credit: 1,547,570,348
RAC: 479,227
    
Message 354 - Posted: 20 Mar 2017, 19:41:30 UTC - in response to Message 237.  

And only 2 hosts have been banned so far, and neither was a mistake. No false positives so far.

Both of them are mine, both are arm64 systems running an older client. I need to recompile the new source.

I only see one "invalid" on one of them and two on the other but the history is pretty short.


I have a "banned" host too. I think it's because the GPU is too old to crunch here: it tried, but I got 4 bad units, so I aborted the remaining 13 and moved it to PrimeGrid, where it works just fine. It's an AMD 6670, so not new by any stretch. I don't care if it can't crunch here; it's way too slow anyway.
ID: 354 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 355 - Posted: 20 Mar 2017, 20:42:18 UTC - in response to Message 354.  

I have a "banned" host too. I think it's because the GPU is too old to crunch here: it tried, but I got 4 bad units, so I aborted the remaining 13 and moved it to PrimeGrid, where it works just fine. It's an AMD 6670, so not new by any stretch. I don't care if it can't crunch here; it's way too slow anyway.

Unbanned it.
ID: 355 · Rating: 0
forretrio

Joined: 7 Feb 17
Posts: 4
Credit: 4,771,946
RAC: 0
   
Message 360 - Posted: 22 Mar 2017, 12:56:03 UTC

https://sech.me/boinc/Amicable/workunit.php?wuid=421350
https://sech.me/boinc/Amicable/workunit.php?wuid=421791

Got two "can't validate" results, probably due to the wingman. They won't count as my faulty results, right?

I just wonder why extra tasks are not sent from the same workunit to validate the correctly returned result. It may not be very fair to someone who returned a correct result but got no credit for it.
ID: 360 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 361 - Posted: 22 Mar 2017, 13:07:03 UTC - in response to Message 360.  
Last modified: 22 Mar 2017, 13:11:47 UTC

https://sech.me/boinc/Amicable/workunit.php?wuid=421350
https://sech.me/boinc/Amicable/workunit.php?wuid=421791

Got two "can't validate" results, probably due to the wingman. They won't count as my faulty results, right?

I just wonder why extra tasks are not sent from the same workunit to validate the correctly returned result. It may not be very fair to someone who returned a correct result but got no credit for it.

They don't count as invalid. These work units were cancelled because of a bug in the GPU version: https://sech.me/boinc/Amicable/forum_thread.php?id=50

P.S. I'll give out credits manually to those who got a "can't validate" error.
ID: 361 · Rating: 0
JugNut

Joined: 23 Feb 17
Posts: 3
Credit: 1,295,275,000
RAC: 0
   
Message 362 - Posted: 23 Mar 2017, 6:39:18 UTC - in response to Message 361.  

https://sech.me/boinc/Amicable/workunit.php?wuid=421350
https://sech.me/boinc/Amicable/workunit.php?wuid=421791

Got two "can't validate" results, probably due to the wingman. They won't count as my faulty results, right?

I just wonder why extra tasks are not sent from the same workunit to validate the correctly returned result. It may not be very fair to someone who returned a correct result but got no credit for it.

They don't count as invalid. These work units were cancelled because of a bug in the GPU version: https://sech.me/boinc/Amicable/forum_thread.php?id=50

P.S. I'll give out credits manually to those who got a "can't validate" error.




Ahh, that's good. I was starting to get a tad worried; I have 40+ that "can't validate".
Thanks Sergei :)
ID: 362 · Rating: 0
mmonnin

Joined: 19 Mar 17
Posts: 11
Credit: 626,817,994
RAC: 605,343
    
Message 363 - Posted: 24 Mar 2017, 0:49:16 UTC - in response to Message 362.  

I was worried too. I woke up with no tasks but an update downloaded more. I later found 33 invalids and 25 aborted. Now I see it was from 1.12.
ID: 363 · Rating: 0
sorcrosc

Joined: 18 Feb 17
Posts: 5
Credit: 2,236,002
RAC: 96
   
Message 483 - Posted: 18 Jun 2017, 18:16:36 UTC

My host here has been banned:
https://sech.me/boinc/Amicable/show_host_detail.php?hostid=4574

I was testing my GPU on Linux with Mesa drivers. It was interesting because this is the first project I found that works with them. By the time I realized some workunits were failing, because I saw the "Not accepting requests from this host" message, it was too late :)

Can you please explain the need for this banning? Isn't the validation mechanism enough for Amicable?


Thanks
ID: 483 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 484 - Posted: 18 Jun 2017, 18:56:49 UTC - in response to Message 483.  
Last modified: 18 Jun 2017, 18:57:34 UTC

Faulty hosts tend to fault in the same way (not finding amicable numbers at all), so if two faulty hosts meet on the same work unit, some numbers can be missed. Therefore I need to detect and disable such hosts as early as possible.

P.S. And then re-run all work units which were sent to such hosts.
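A small sketch of the reasoning above, under stated assumptions. It models a result as the set of amicable pairs a host reported and a quorum of two as plain equality, which is a simplification and not the project's real validator. The function names and the assignment table are hypothetical; (220, 284) is the well-known smallest amicable pair.

```python
def quorum_agrees(result_a, result_b):
    """Simplified two-host quorum: results validate if they are identical."""
    return result_a == result_b


def workunits_to_rerun(assignments, banned_hosts):
    """Return ids of work units that were sent to at least one banned host.

    assignments maps a work unit id to the host ids that received it.
    """
    banned = set(banned_hosts)
    return sorted(wu for wu, hosts in assignments.items() if banned & set(hosts))


correct = frozenset({(220, 284)})  # what a healthy host would report
faulty = frozenset()               # a faulty host finds nothing at all

# A faulty host paired with a healthy one disagrees, so the error is caught.
print(quorum_agrees(correct, faulty))
# Two faulty hosts agree with each other, so the missing pair goes unnoticed.
print(quorum_agrees(faulty, faulty))

# Hypothetical assignment table: work unit id -> hosts it was sent to.
assignments = {101: [1, 2], 102: [2, 3], 103: [4, 5]}
print(workunits_to_rerun(assignments, banned_hosts=[3]))
```

This is why agreement between two hosts is not enough on its own here: the re-run list covers every work unit a banned host touched, including ones that appeared to validate.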
ID: 484 · Rating: 0
sorcrosc

Joined: 18 Feb 17
Posts: 5
Credit: 2,236,002
RAC: 96
   
Message 485 - Posted: 18 Jun 2017, 20:14:12 UTC - in response to Message 484.  

Faulty hosts tend to fault in the same way (not finding amicable numbers at all), so if two faulty hosts meet on the same work unit, some numbers can be missed. Therefore I need to detect and disable such hosts as early as possible.


Thanks. I suggest adding this to the top post.

Do you think you could show on the workunit result page whether any amicable numbers were found?

I have disabled OpenCL on this computer (Mesa package removed). The remaining OpenCL workunits were aborted, but one of them is ready to report and can't be aborted.
Please unban it when possible. Thanks again.
ID: 485 · Rating: 0
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 486 - Posted: 18 Jun 2017, 20:45:59 UTC - in response to Message 485.  

Unbanned it.
ID: 486 · Rating: 0

©2024 Sergei Chernykh