Work unit errors

Message boards : Number crunching : Work unit errors

AT Hiker

Joined: 21 Sep 18
Posts: 20
Credit: 66,803,284
RAC: 0
   
Message 940 - Posted: 3 Oct 2018, 21:43:12 UTC

Every day I have to suspend BOINC work units in order to switch users.

Sometimes, but not always, when the work is started again the work unit(s) immediately leave BOINC and new work is downloaded.

It makes me wonder whether there is some time limit within which a work unit has to be making actual progress and finish before it goes into an error state.

Does anyone have an idea what causes this and how I can stop it?

Thanks.
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 461
Credit: 72,451,573
RAC: 0
   
Message 941 - Posted: 4 Oct 2018, 7:13:35 UTC

I see only 3 errors out of 925 work units for your PC. The time limit for a work unit is 3 days from the time it was sent to the client.
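
As a rough illustration of that rule, the deadline is just the send time plus 3 days. The sketch below uses a made-up send time, and the helper names `report_deadline` and `is_expired` are hypothetical, not part of BOINC.

```python
from datetime import datetime, timedelta, timezone

# 3-day limit stated above, counted from when the server sent the task.
TIME_LIMIT = timedelta(days=3)

def report_deadline(sent_at: datetime) -> datetime:
    """Return the latest time a result can be reported (hypothetical helper)."""
    return sent_at + TIME_LIMIT

def is_expired(sent_at: datetime, now: datetime) -> bool:
    """True if the task is past its deadline at time `now`."""
    return now > report_deadline(sent_at)

# Hypothetical send time, chosen only for illustration.
sent = datetime(2018, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
print(report_deadline(sent).isoformat())  # 2018-10-04T12:00:00+00:00
```

Note that the clock runs from the send time, so suspending a task for a few hours does not by itself push it past the limit.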
AT Hiker

Joined: 21 Sep 18
Posts: 20
Credit: 66,803,284
RAC: 0
   
Message 942 - Posted: 4 Oct 2018, 12:10:35 UTC - in response to Message 941.  

That means that it is not a timing issue.
Sergei Chernykh
Project administrator
Project developer

Joined: 5 Jan 17
Posts: 461
Credit: 72,451,573
RAC: 0
   
Message 943 - Posted: 4 Oct 2018, 19:44:40 UTC
Last modified: 4 Oct 2018, 19:46:41 UTC

It can happen if:
- this is an old work unit which expired for someone else
- then it was sent to you
- then this "someone else" finally sends it to the server
- the server validates it and cancels all remaining "in progress" tasks (including yours)

P.S. But your 3 errors are just computing errors; they didn't happen immediately after start.
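
The cancellation sequence in the list above can be sketched as a toy model. This is only an illustration of the described behaviour, not the real BOINC server code; the `WorkUnit` class and the host names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class WorkUnit:
    """Toy model of a work unit and the hosts holding a copy of it."""
    name: str
    in_progress: set = field(default_factory=set)  # hosts still computing
    cancelled: set = field(default_factory=set)    # hosts whose copy was cancelled
    validated: bool = False

    def send_to(self, host: str) -> None:
        """The server hands a copy of the work unit to a host."""
        self.in_progress.add(host)

    def report_result(self, host: str) -> None:
        """A host returns a result; the server validates it and cancels the rest."""
        self.in_progress.discard(host)
        self.validated = True
        self.cancelled |= self.in_progress
        self.in_progress.clear()

wu = WorkUnit("example_wu")
wu.send_to("someone_else")        # original recipient, misses the deadline
wu.send_to("you")                 # the expired unit is sent out again
wu.report_result("someone_else")  # the late result arrives and validates
print(sorted(wu.cancelled))       # ['you']
```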
AT Hiker

Joined: 21 Sep 18
Posts: 20
Credit: 66,803,284
RAC: 0
   
Message 944 - Posted: 5 Oct 2018, 0:22:03 UTC - in response to Message 943.  

Yes, they are listed as computation errors, but it is a little suspicious because:

I have literally run thousands of PrimeGrid work units without a computational error, and those work units tend to stress the GPUs more than the ones that failed here. If I am wrong about the stress part, please correct me.

All of the errors occurred immediately after restarting the work.

Something happened which might never be explained.

Thanks for the reply.
vseven

Joined: 15 Mar 18
Posts: 12
Credit: 587,338,410
RAC: 0
  
Message 947 - Posted: 10 Oct 2018, 12:15:36 UTC

I've seen the same issue: a WU failing upon startup. Not just in this project but in others too. But with 3 failed out of 900+, I don't know if it's worth trying to figure out.
BobMALCS

Joined: 27 May 18
Posts: 2
Credit: 18,232,128
RAC: 0
  
Message 951 - Posted: 24 Oct 2018, 21:44:49 UTC

Running Windows 10.

I have now had the same problem. A work unit failed immediately upon being restarted, at 00:00:00. Obviously it is a rare occurrence, but it is still a waste of time.
Looking at the stderr output, I noticed one thing.

=======

Stderr output

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1 (0xffffffff)</message>
<stderr_txt>
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<max_jobs>0</max_jobs>
<max_cpus>3</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>23</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 23
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<max_jobs>0</max_jobs>
<max_cpus>3</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>23</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 23
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 1130: clGetEventInfo returned error -58
00:00:00 (5288): called boinc_finish(-1)

</stderr_txt>
]]>
=======
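
For anyone who wants to pull the failing call out of a log like this one, here is a minimal sketch. It assumes the message format seen above ("line N: clXxx returned error CODE"); the function names are made up for the example. In the Khronos `cl.h` header, -58 is `CL_INVALID_EVENT`, and the exit code -1 is the same value as the 0xffffffff shown in the message when read as an unsigned 32-bit number.

```python
import re

# Small subset of OpenCL status codes from the Khronos cl.h header.
OPENCL_ERRORS = {
    -5: "CL_OUT_OF_RESOURCES",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -58: "CL_INVALID_EVENT",
}

# Matches text such as "line 1130: clGetEventInfo returned error -58".
PATTERN = re.compile(r"line (\d+): (cl\w+) returned error (-?\d+)")

def find_opencl_errors(stderr_text: str):
    """Return (source_line, api_call, code, name) for each OpenCL failure found."""
    results = []
    for match in PATTERN.finditer(stderr_text):
        line_no = int(match.group(1))
        call = match.group(2)
        code = int(match.group(3))
        results.append((line_no, call, code, OPENCL_ERRORS.get(code, "unknown")))
    return results

def as_unsigned32(exit_code: int) -> str:
    """Show a signed exit code as a 32-bit hex value."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"

sample = r"c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 1130: clGetEventInfo returned error -58"
print(find_opencl_errors(sample))  # [(1130, 'clGetEventInfo', -58, 'CL_INVALID_EVENT')]
print(as_unsigned32(-1))           # 0xffffffff
```

The lookup only names the error code; it says nothing about why the event was invalid at that point.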

I assume that "c:\temp\" is the real name of the folder and not some indirect reference to somewhere else. If it is an indirect reference, then my following statements are not relevant.

I do not have a folder named "c:\temp\". My temp folders are on another disk. If you are going to use the system temp folder, then look up where it actually is rather than assuming where it should be.
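
To illustrate what "go look for it" means in practice, here is a minimal sketch in Python (the application itself is C++, so this is only an illustration of the idea). The `scratch_dir` helper and the folder name are hypothetical.

```python
import os
import tempfile

def scratch_dir(app_name: str) -> str:
    """Create and return a per-application folder under the system temp folder."""
    # gettempdir() checks the TMPDIR, TEMP and TMP environment variables
    # before falling back to platform defaults, so a temp folder that has
    # been moved to another disk is honoured.
    base = tempfile.gettempdir()
    path = os.path.join(base, app_name)
    os.makedirs(path, exist_ok=True)
    return path

print(scratch_dir("example_app"))  # location depends on the machine
```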

It's not a good idea to use the temp folder when the task is inactive for a long period of time (12 hours in my case). You have no idea what may happen to the folder or its data in that time. Part of my system maintenance is to delete unused or not recently active files in the temp folder.

In any case, should you not be using the "..\BOINC\Data\slots\" folder for a task's temporary work files?

At a guess, though, the error message makes it seem unlikely that this is what caused the error.


However, BOINC is set to leave the tasks in memory while they are idle. I likely restarted the PC while the task was suspended. This could well have caused some corruption at the end of the file.



©2023 Sergei Chernykh