Message boards : Number crunching : Auto-ban faulty hosts
Author | Message |
---|---|
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
I've set up a script that bans hosts which have: 1) more than 3 invalid results in the last week ("Completed, marked as invalid"), or 2) more than 10 inconclusive results in the last 2 days. Normally, hosts don't produce invalid/inconclusive results at all. If you see one of your hosts as banned, it means that you need to fix it: reduce overclocking, update GPU drivers etc. and contact me as soon as you've fixed it. |
Tern Joined: 17 Feb 17 Posts: 27 Credit: 69,545,002 RAC: 9,911 |
Will a notice be sent to banned hosts? Otherwise I can see a lot of hours spent scratching heads going "why am I not getting work?!?!?"... I assume this is in addition to the 'normal' BOINC rollback mechanism; a host returning garbage should be limited to 1 task/day pretty quickly even without the ban. lhcathome-dev had problems because of VirtualBox and their redundant work scheme: BOINC didn't know if a task was good or not, so tasks kept being sent and notices were not sent out. They had to resort to manually sending private messages, and then emails when nobody read those. But they are a special (self-inflicted) case. |
Bryan Joined: 23 Jan 17 Posts: 17 Credit: 278,854,007 RAC: 0 |
You need to be careful with banning for invalids. I have quite a few showing up where I completed the WU but a wingman aborted or timed out and then it was issued to a 3rd person. Apparently the server then CANCELLED the WU because the 3rd person is showing "not needed". The WU winds up showing on my account as "completed, can't validate." |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
Banned hosts will get a "Not accepting requests from this host" error message. Pretty obvious, I think. They're also marked with a red "BANNED" sign on the hosts page. P.S. Only "Completed, marked as invalid" tasks count for banning. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
And only 2 hosts have been banned so far, and neither was a mistake. No false positives so far. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
"I assume this is in addition to the 'normal' BOINC rollback mechanism; a host returning garbage should be limited to 1 task/day pretty quickly even without the ban." This mechanism was deprecated. It's not in the latest BOINC code, which I use. |
fractal Joined: 1 Feb 17 Posts: 8 Credit: 100,048,293 RAC: 0 |
"And only 2 hosts have been banned so far, and neither was a mistake. No false positives so far." Both of them are mine; both are arm64 systems running an older client. I need to recompile the new source. I only see one "invalid" on one of them and two on the other, but the history is pretty short. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
"Both of them are mine; both are arm64 systems running an older client. I need to recompile the new source." Unbanned them. |
BlackObelisk Joined: 14 Mar 17 Posts: 1 Credit: 100,303,424 RAC: 0 |
"Normally, hosts don't produce invalid/inconclusive results at all. If you see one of your hosts as banned, it means that you need to fix it: reduce overclocking, update GPU drivers etc. and contact me as soon as you've fixed it." https://sech.me/boinc/Amicable/show_host_detail.php?hostid=2181 This is a mobile platform with an integrated screen and a graphics card that doesn't support overclocking, so I guess the problem could be in the drivers, which are hard to update due to administrative restrictions. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
The most common problem with notebook GPUs is overheating, and this project puts quite a load on the GPU. But most likely it's something wrong with the drivers. Only two hosts have been banned so far, and both used AMD GPUs. |
mikey Joined: 20 Feb 17 Posts: 20 Credit: 1,466,110,265 RAC: 1,685,747 |
"And only 2 hosts have been banned so far, and neither was a mistake. No false positives so far." I have a "banned" host too. I think it's because the GPU is too old to crunch here; even though it tried, I have 4 bad units, so I aborted the rest of them (13 of them) and moved it to PrimeGrid, where it works just fine. It's an AMD 6670, so not new by any stretch. I don't care if it can't crunch here; it's way too slow anyway. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
"I have a 'banned' host too. I think it's because the GPU is too old to crunch here; even though it tried, I have 4 bad units, so I aborted the rest of them (13 of them) and moved it to PrimeGrid, where it works just fine. It's an AMD 6670, so not new by any stretch. I don't care if it can't crunch here; it's way too slow anyway." Unbanned it. |
forretrio Joined: 7 Feb 17 Posts: 4 Credit: 4,771,946 RAC: 0 |
https://sech.me/boinc/Amicable/workunit.php?wuid=421350 https://sech.me/boinc/Amicable/workunit.php?wuid=421791 Got two "can't validate" results, probably due to the wingman. They won't count as my faulty results, right? I just wonder why extra tasks are not sent from the same workunit to validate the correctly returned result. It may not be very fair to someone who returned a correct result but got no credit for it. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
https://sech.me/boinc/Amicable/workunit.php?wuid=421350 They don't count as invalid. These work units were cancelled because of a bug in the GPU version: https://sech.me/boinc/Amicable/forum_thread.php?id=50 P.S. I'll give out credits manually to those who got the "can't validate" error. |
JugNut Joined: 23 Feb 17 Posts: 3 Credit: 1,295,275,000 RAC: 0 |
https://sech.me/boinc/Amicable/workunit.php?wuid=421350 Ahh, that's good. I was starting to get a tad worried; I have 40+ that "can't validate". Thanks Sergei :) |
mmonnin Joined: 19 Mar 17 Posts: 11 Credit: 523,707,686 RAC: 366,856 |
I was worried too. I woke up with no tasks but an update downloaded more. I later found 33 invalids and 25 aborted. Now I see it was from 1.12. |
sorcrosc Joined: 18 Feb 17 Posts: 5 Credit: 2,208,657 RAC: 0 |
My host here has been banned: https://sech.me/boinc/Amicable/show_host_detail.php?hostid=4574 I was testing my GPU in Linux with Mesa drivers. It was interesting because this is the first project I have found that works with them. By the time I realized there were some failing workunits, because I saw the "Not accepting requests from this host" message, it was too late :) Can you please explain the need for this banning? Isn't the validation mechanism enough for Amicable? Thanks |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
Faulty hosts tend to fault in the same way (not finding amicable numbers at all), so if two faulty hosts meet on the same work unit, some numbers can be missed. Therefore I need to detect and disable such hosts as early as possible. P.S. And then re-run all work units which were sent to such hosts. |
sorcrosc Joined: 18 Feb 17 Posts: 5 Credit: 2,208,657 RAC: 0 |
"Faulty hosts tend to fault in the same way (not finding amicable numbers at all), so if two faulty hosts meet on the same work unit, some numbers can be missed. Therefore I need to detect and disable such hosts as early as possible." Thanks. I suggest writing this in the top post. Do you think you can show on the workunit result page whether any amicable numbers were found? I have disabled OpenCL on this computer (Mesa package removed). The remaining OpenCL workunits were aborted, but one of them is ready to report and can't be aborted. Please unban the host when possible. Thanks again. |
Sergei Chernykh Project administrator Project developer Joined: 5 Jan 17 Posts: 518 Credit: 72,451,573 RAC: 0 |
Unbanned it. |
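
For readers who want to see the ban rule from the first post spelled out, here is a minimal sketch. It is not the administrator's actual script, which is not shown in this thread. The `Result` record, the status labels, and the function names are all hypothetical. Only the thresholds (more than 3 invalid results in a week, more than 10 inconclusive results in 2 days) and the rule that "can't validate" results are not counted come from the posts above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Thresholds quoted in the first post of the thread.
MAX_INVALID = 3                          # "more than 3 invalid results"
INVALID_WINDOW = timedelta(days=7)       # "in the last week"
MAX_INCONCLUSIVE = 10                    # "more than 10 inconclusive results"
INCONCLUSIVE_WINDOW = timedelta(days=2)  # "in the last 2 days"

# Hypothetical status labels, invented for this sketch.
INVALID = "invalid"              # "Completed, marked as invalid"
INCONCLUSIVE = "inconclusive"
CANT_VALIDATE = "cant_validate"  # per the thread, never counted toward a ban
VALID = "valid"


@dataclass(frozen=True)
class Result:
    """One result returned by a host (hypothetical record layout)."""
    status: str
    reported_at: datetime


def count_recent(results, status, window, now):
    """Count results with the given status reported within `window` of `now`."""
    cutoff = now - window
    return sum(1 for r in results if r.status == status and r.reported_at >= cutoff)


def should_ban(results, now):
    """Return True if either threshold from the first post is exceeded."""
    invalid = count_recent(results, INVALID, INVALID_WINDOW, now)
    inconclusive = count_recent(results, INCONCLUSIVE, INCONCLUSIVE_WINDOW, now)
    return invalid > MAX_INVALID or inconclusive > MAX_INCONCLUSIVE
```

Because only the two listed statuses are counted, a host with dozens of "can't validate" results (as several posters reported) would not be banned under this sketch.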
©2024 Sergei Chernykh