Computation error

Message boards : Number crunching : Computation error


AuthorMessage
flallnatural

Send message
Joined: 29 Sep 18
Posts: 2
Credit: 67,507,412
RAC: 0
   
Message 989 - Posted: 22 Nov 2018, 3:22:42 UTC

I follow a schedule on BOINC so my computer only computes when I'm not using it. The problem I'm running into is that when BOINC suspends projects because of the schedule or computer use and later returns to crunching, the last Amicable Numbers task it was working on is lost, regardless of the percentage it had reached. I have other projects running on GPU and CPU and they are able to resume just fine. What's going on with Amicable Numbers?

I would really appreciate some help. Thanks.
ID: 989 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Kellen

Send message
Joined: 14 Nov 17
Posts: 70
Credit: 1,000,005,236
RAC: 0
   
Message 990 - Posted: 23 Nov 2018, 3:34:25 UTC - in response to Message 989.  

Hi flallnatural,

When we started the new large prime search this also started happening to my computers. I have not found any way around it other than to make sure that any given task completes before I suspend BOINC work. When I want to use my computer I select "No New Tasks" for Amicable Numbers on the Projects tab in BOINC and just wait for the ones I have downloaded to finish. I run BOINC with zero buffer, so this is, at most, two tasks. As your buffer is somewhat larger, you can do the same thing, selecting No New Tasks on the Projects tab, then suspend all of the Amicable Numbers tasks that are not currently running, and wait out the one that is running. Your computer seems to be taking approximately 900 seconds to complete each task, so the most you would have to wait is 15 minutes. The perfect amount of time to make a nice cup of tea and a slice or two of toast :)
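
For anyone who prefers the command line, the workaround above could be sketched roughly like this. It is only a sketch under some assumptions: boinccmd is installed and on the PATH, the project URL is guessed from the links in this thread, and the task names are made up for illustration (the real names are shown on the Tasks tab).

```python
import subprocess

# Assumption: project URL inferred from links in this thread; check the
# Projects tab in BOINC Manager for the exact URL on your machine.
PROJECT_URL = "https://sech.me/boinc/Amicable/"


def drain_commands(project_url, waiting_tasks):
    """Build boinccmd calls: stop fetching work, then suspend tasks not yet started."""
    commands = [["boinccmd", "--project", project_url, "nomorework"]]
    for name in waiting_tasks:
        commands.append(["boinccmd", "--task", project_url, name, "suspend"])
    return commands


def run_all(commands):
    """Execute the commands; needs a running BOINC client and boinccmd on PATH."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical task names, for illustration only.
    for cmd in drain_commands(PROJECT_URL, ["example_task_1", "example_task_2"]):
        print(" ".join(cmd))
```

Running the script only prints the commands; call run_all() on the list once the task names are real. The task that is already running is left alone so it can finish before BOINC is suspended.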

I know this isn't the solution you are looking for, but I hope it helps anyway.

Regards,
Kellen
ID: 990 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
flallnatural

Send message
Joined: 29 Sep 18
Posts: 2
Credit: 67,507,412
RAC: 0
   
Message 991 - Posted: 23 Nov 2018, 4:02:50 UTC - in response to Message 990.  

Thank you for your response. I'll give that a shot!
ID: 991 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
AT Hiker

Send message
Joined: 21 Sep 18
Posts: 20
Credit: 66,803,284
RAC: 0
   
Message 992 - Posted: 23 Nov 2018, 14:26:38 UTC

Obviously there is a problem in the coding of Amicable Numbers.

The "fix" suggested works but you waste computing time if you run more than 1 work unit at a time, which is what I do.
ID: 992 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Quantum Mechanic

Send message
Joined: 23 Feb 19
Posts: 1
Credit: 27,345
RAC: 0
Message 1098 - Posted: 26 Feb 2019, 0:02:18 UTC

ID: 1098 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sergei Chernykh
Project administrator
Project developer

Send message
Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 1099 - Posted: 26 Feb 2019, 8:02:29 UTC - in response to Message 1098.  
Last modified: 26 Feb 2019, 8:03:00 UTC

clEnqueueWriteBuffer returned error -5

This is the CL_OUT_OF_RESOURCES error. Try reducing the kernel size in the computing preferences: https://sech.me/boinc/Amicable/prefs.php?subset=project
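
As a quick reference for the numeric codes that show up in task logs, here is a small lookup sketch. The values are the standard ones from the Khronos cl.h header; the table is deliberately partial and the helper name opencl_error_name is just an illustration, not something from the project's code.

```python
# Partial table of OpenCL status codes, with values as defined in CL/cl.h.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -58: "CL_INVALID_EVENT",
}


def opencl_error_name(code):
    """Return the symbolic name for a status code, or a placeholder if not listed."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL error ({code})")


if __name__ == "__main__":
    print(opencl_error_name(-5))
```

Codes missing from the table fall through to the placeholder string, so the full list in cl.h is still the place to look for anything else.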
ID: 1099 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
candido

Send message
Joined: 13 Feb 17
Posts: 1
Credit: 2,520,331
RAC: 0
  
Message 1100 - Posted: 28 Feb 2019, 16:13:09 UTC - in response to Message 1099.  

Hi! I have a few errors (https://sech.me/boinc/Amicable/results.php?userid=2313&offset=0&show_names=0&state=6&appid=) with this one:

clGetEventInfo returned error -58

Any idea on what is causing the errors?

Thanks!

Eg:
<core_client_version>7.8.3</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1 (0xffffffff)</message>
<stderr_txt>
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<allow_non_selected_apps>1</allow_non_selected_apps>
<max_jobs>0</max_jobs>
<max_cpus>0</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>21</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 21
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<allow_non_selected_apps>1</allow_non_selected_apps>
<max_jobs>0</max_jobs>
<max_cpus>0</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>21</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 21
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 1130: clGetEventInfo returned error -58
21:40:53 (13408): called boinc_finish(-1)

</stderr_txt>
]]>
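
When comparing several failed tasks, it can help to pull the failing OpenCL call out of the stderr text automatically. The following is a rough sketch that assumes the line layout seen in the log above ("file, line N: clSomething returned error CODE"); the function name find_opencl_errors is made up for this example, and other app versions may format their messages differently.

```python
import re

# Matches lines such as:
#   ...\opencl.cpp, line 1130: clGetEventInfo returned error -58
ERROR_LINE = re.compile(
    r"^(?P<file>.+), line (?P<line>\d+): "
    r"(?P<call>cl\w+) returned error (?P<code>-?\d+)\s*$"
)


def find_opencl_errors(stderr_text):
    """Return a list of (call, error_code, source_line) tuples found in the text."""
    hits = []
    for raw in stderr_text.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:
            hits.append(
                (match.group("call"), int(match.group("code")), int(match.group("line")))
            )
    return hits
```

Lines that do not follow this layout, such as the preferences dump or the boinc_finish line, are simply skipped.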
ID: 1100 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile JohnMD
Avatar

Send message
Joined: 8 Jan 18
Posts: 11
Credit: 25,123,011
RAC: 0
   
Message 1102 - Posted: 13 Mar 2019, 1:56:16 UTC - in response to Message 1100.  
Last modified: 13 Mar 2019, 2:00:10 UTC

I get the same "-58" in 2 situations with Nvidia 930M (I also have Intel 520 for display)
1. When I close BOINC and restart. The GPU app CAN'T restart, even though it has created a checkpoint file.
2. When I switch to another user, the GPU app gets suspended. When I switch back it CAN'T resume, even though suspended tasks are 'kept in storage'.
Sounds to me like something's been forgotten and the program not properly tested.
ID: 1102 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
BobMALCS

Send message
Joined: 27 May 18
Posts: 2
Credit: 18,232,128
RAC: 0
  
Message 1105 - Posted: 30 Mar 2019, 12:17:54 UTC

This error is irritating. Especially as there seems to be no attempt to fix it.

I'm not going to waste my time and money on it.

Bye.
ID: 1105 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
marmot

Send message
Joined: 14 Mar 19
Posts: 9
Credit: 26,298,837
RAC: 0
   
Message 1106 - Posted: 30 Mar 2019, 16:41:06 UTC
Last modified: 30 Mar 2019, 16:51:15 UTC

I'm receiving error -58 on my 1060 3GB.

Sometimes work units complete, other times they get errors.

Is it related to how much available CPU is open to the WU?
If CPU projects take up too much CPU does this error occur?

Have run CUDA or OpenCL WUs for Einstein, GPUGrid, Moo!, Milkyway and Asteroids on this 1060 3GB successfully over the last week as part of a baseline testing at default values (no overclocking, just an aggressive cooling profile to assure the GPU's are under 60C and best BIOS based clocking decisions are made).

Logfile:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1 (0xffffffff)</message>
<stderr_txt>
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<max_jobs>0</max_jobs>
<max_cpus>2</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>21</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 21
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<max_jobs>0</max_jobs>
<max_cpus>2</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>21</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 21
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 1130: clGetEventInfo returned error -58
20:33:57 (1356): called boinc_finish(-1)

</stderr_txt>
]]>
ID: 1106 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
marmot

Send message
Joined: 14 Mar 19
Posts: 9
Credit: 26,298,837
RAC: 0
   
Message 1107 - Posted: 31 Mar 2019, 20:07:56 UTC
Last modified: 31 Mar 2019, 20:14:37 UTC

It's most every WU now.
Kernel size is a meager 21 for a 3GB GTX, which should be easy (https://sech.me/boinc/Amicable/forum_thread.php?id=128&postid=795#795 ).
2 CPU's selected.

Peak working set size on every failed WU doesn't exceed 586 MB, and peak swap size doesn't exceed 985 MB.

The valid WUs show maximum:
Peak working set size 608 MB
Peak swap size 1,005 MB

Not attempting multiple WU in app_config and this is just default, non-overclocked WU attempts.
GPU is running cool at 41C.
Maybe it's the driver version 19.3.2? (took hours to get dual ATI/nVidia setup to work... not wanting to change drivers).
ID: 1107 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sergei Chernykh
Project administrator
Project developer

Send message
Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 1110 - Posted: 1 Apr 2019, 11:10:34 UTC - in response to Message 1107.  

I don't really know what causes error -58 (CL_INVALID_EVENT). It's triggered at this line: https://github.com/SChernykh/Amicable/blob/boinc-opencl-version-128-bit/Amicable/OpenCL.cpp#L1017 - but it's always set properly in the preceding call to clEnqueueNDRangeKernel on the last iteration of "for" loop. My guess is that OpenCL driver runs out of resources occasionally.
ID: 1110 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
marmot

Send message
Joined: 14 Mar 19
Posts: 9
Credit: 26,298,837
RAC: 0
   
Message 1111 - Posted: 1 Apr 2019, 21:00:15 UTC - in response to Message 1110.  
Last modified: 1 Apr 2019, 21:09:11 UTC

I don't really know what causes error -58 (CL_INVALID_EVENT). It's triggered at this line: https://github.com/SChernykh/Amicable/blob/boinc-opencl-version-128-bit/Amicable/OpenCL.cpp#L1017 - but it's always set properly in the preceding call to clEnqueueNDRangeKernel on the last iteration of "for" loop.

It started happening on the RX 550 GPU as well (the machine has one RX 550 and one GTX 1060), so that should rule out video drivers or GPU hardware. I'm guessing it's OS components or the non-GPU hardware configuration.

The machine has 8GB RAM and 4GB swap file space. At one point, there was reported 2.1GB free RAM but the swapfile was nearly full with commits.

If I decrease swap file space and get 100% error -58 failures, then increase swap space and the errors are gone, that would point to not enough real/virtual memory.
My guess is that OpenCL driver runs out of resources occasionally.

Won't get to test it till later in the week. The machine has moved onto other data gathering.
ID: 1111 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
marmot

Send message
Joined: 14 Mar 19
Posts: 9
Credit: 26,298,837
RAC: 0
   
Message 1113 - Posted: 3 Apr 2019, 22:59:55 UTC - in response to Message 1111.  

The machine is ready to test Amicable again.

New job load of BOINC and other running apps on barebones (shut down most all possible services) Windows 10 (Oct 2018) configuration:
Total memory commits are 8.07GB
4.3GB free of 8GB RAM.

Each WU has 1 free CPU available.
1 WU of sieve 23 on the 1060 3GB
1 WU of sieve 23 on the RX 550 4GB

Will let you know the results.
ID: 1113 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
marmot

Send message
Joined: 14 Mar 19
Posts: 9
Credit: 26,298,837
RAC: 0
   
Message 1114 - Posted: 4 Apr 2019, 10:06:35 UTC - in response to Message 1113.  
Last modified: 4 Apr 2019, 10:09:42 UTC

There are two WUs pending validation and four WUs with error -58:
task 23580804
task 23580777
task 23580802
task 23588243


I restarted the computer in order to shut down more Windows 10 services and get it to bare bones system (every Windows store app support service).
Gained another 350MB and so the system uses ~750MB on clean boot with no user apps running.

Each of the -58 computation errors occurred after the restart, when BOINC attempted to resume an Amicable Numbers WU from its save point. There wasn't a shortage of swap space or free RAM, so the original hypothesis is likely refuted.

Will leave the computer alone for 2 days and see if any more error -58 occur spontaneously.
ID: 1114 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Profile Dingo
Avatar

Send message
Joined: 30 Jan 17
Posts: 11
Credit: 71,598,438
RAC: 1,146
   
Message 1115 - Posted: 4 Apr 2019, 14:57:37 UTC

My tasks have all been ending in error today since the last lot of new work. Each one processes up to the last second, then aborts with an error.

This is an example. https://sech.me/boinc/Amicable/result.php?resultid=23592821

I have aborted all my tasks till this is fixed.

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1 (0xffffffff)</message>
<stderr_txt>
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<max_jobs>0</max_jobs>
<max_cpus>0</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>21</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 21
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<max_jobs>0</max_jobs>
<max_cpus>0</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>21</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 21
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 1130: clGetEventInfo returned error -58
01:52:13 (1232): called boinc_finish(-1)

</stderr_txt>
]]>

ID: 1115 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
marmot

Send message
Joined: 14 Mar 19
Posts: 9
Credit: 26,298,837
RAC: 0
   
Message 1116 - Posted: 5 Apr 2019, 20:35:35 UTC - in response to Message 1114.  


I restarted the computer in order to shut down more Windows 10 services and get it to bare bones system (every Windows store app support service).
Gained another 350MB and so the system uses ~750MB on clean boot with no user apps running.

Will leave the computer alone for 2 days and see if any more error -58 occur spontaneously.


So the barebones Windows 10 has no screensaver, no Windows Defender/firewall, no antivirus, no defrag, no background tasks infrastructure, no task host, no Windows Update, no Windows Store services, and no user apps running other than MSI Afterburner and BOINC Manager.
Just a basic OS running BOINC and GPU fan cooling.

No error -58's in 2 days.

Going to stress test Amicable Number WU's by suspending in rapid succession, shutting down and restarting BOINC client multiple times, starting up 3 VM's to fill up RAM and see if I can cause some -58's.

It seems to be a resume error so all the dedicated BOINC machines that never suspend the WU's are not seeing an issue.
ID: 1116 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
marmot

Send message
Joined: 14 Mar 19
Posts: 9
Credit: 26,298,837
RAC: 0
   
Message 1118 - Posted: 8 Apr 2019, 3:25:59 UTC - in response to Message 1116.  
Last modified: 8 Apr 2019, 3:26:35 UTC



It seems to be a resume error so all the dedicated BOINC machines that never suspend the WU's are not seeing an issue.


Suspending WU's repeatedly (removed from RAM) caused no issues on resume as long as boinc.exe remained in RAM.

(Note: this machine just finished 2 days of WU's on both GPU's without errors)
Error -58 comes from a failed WU restart after boinc.exe shuts down and restarts.
I let a WU run on each GPU, with plenty of free RAM and swap space and a free CPU for each.
Both WUs ran successfully for 30 minutes, then I performed a graceful BOINC shutdown.

Upon restart, both WUs showed they were restarting from 29:xx minutes, and within 10 seconds both failed with computation error -58, independent of driver or GPU model.

1st attempt:
23667006
23666961

2nd attempt:
23661437
23666900

Rest of the visible errors were from testing the maximum number of WU per GPU possible.
(nCores=Max WU, but ran out of virtual memory at 6, running 2/GPU test overnight).
ID: 1118 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote
Sergei Chernykh
Project administrator
Project developer

Send message
Joined: 5 Jan 17
Posts: 534
Credit: 72,451,573
RAC: 0
   
Message 1119 - Posted: 8 Apr 2019, 8:20:32 UTC
Last modified: 8 Apr 2019, 8:21:01 UTC

It looks like I fixed this error: https://github.com/SChernykh/Amicable/commit/806085804dc51e48aef2527cd18861ad3a986bc0

I'll test it some more and update GPU versions today.
ID: 1119 · Rating: 0 · rate: Rate + / Rate - Report as offensive    Reply Quote


©2024 Sergei Chernykh