Posts by marmot

1) Message boards : Number crunching : Computation error (Message 1118)
Posted 8 Apr 2019 by marmot
Post:


It seems to be a resume error so all the dedicated BOINC machines that never suspend the WU's are not seeing an issue.


Suspending WU's repeatedly (removed from RAM) caused no issues on resume as long as boinc.exe remained in RAM.

(Note: this machine just finished 2 days of WU's on both GPU's without errors)
Error -58 comes from a failed WU restart after boinc.exe shuts down and restarts.
Let a WU run on both GPU's, plenty of free RAM and swap space, and a free CPU for each.
Both WUs ran successfully for 30 minutes, then I performed a graceful BOINC shutdown.

Upon restart, both WUs showed they were resuming from 29:xx minutes, and within 10 seconds both failed with computation error -58, independent of driver or GPU model.

1st attempt:
23667006
23666961

2nd attempt:
23661437
23666900

Rest of the visible errors were from testing the maximum number of WU per GPU possible.
(nCores=Max WU, but ran out of virtual memory at 6, running 2/GPU test overnight).
2) Message boards : Number crunching : Computation error (Message 1116)
Posted 5 Apr 2019 by marmot
Post:

I restarted the computer in order to shut down more Windows 10 services and get it to bare bones system (every Windows store app support service).
Gained another 350MB and so the system uses ~750MB on clean boot with no user apps running.

Will leave the computer alone for 2 days and see if any more error -58 occur spontaneously.


So the barebones Windows 10 has no screensaver, no Windows Defender/firewall, no antivirus, no defrag, no background tasks infrastructure, no task host, no Windows Update, no Windows Store services and no user apps running except MSI Afterburner and BOINC Manager.
Just a basic OS running BOINC and GPU fan cooling.

No error -58's in 2 days.

Going to stress test Amicable Number WU's by suspending in rapid succession, shutting down and restarting BOINC client multiple times, starting up 3 VM's to fill up RAM and see if I can cause some -58's.

It seems to be a resume error so all the dedicated BOINC machines that never suspend the WU's are not seeing an issue.
3) Message boards : Number crunching : Computation error (Message 1114)
Posted 4 Apr 2019 by marmot
Post:
There are two WUs pending validation and four WUs with error -58:
task 23580804
task 23580777
task 23580802
task 23588243


I restarted the computer in order to shut down more Windows 10 services and get it to bare bones system (every Windows store app support service).
Gained another 350MB and so the system uses ~750MB on clean boot with no user apps running.

Each of the -58 computation errors occurred after the restart, when BOINC attempted to resume the Amicable Numbers WU from its save point. There wasn't a shortage of swap space or free RAM, so the original hypothesis is likely refuted.

Will leave the computer alone for 2 days and see if any more error -58 occur spontaneously.
4) Message boards : Number crunching : Computation error (Message 1113)
Posted 3 Apr 2019 by marmot
Post:
The machine is ready to test Amicable again.

New job load of BOINC and other running apps on barebones (shut down most all possible services) Windows 10 (Oct 2018) configuration:
Total memory commits are 8.07GB
4.3GB free of 8GB RAM.

Each WU has 1 free CPU available.
1 WU of sieve 23 on the 1060 3GB
1 WU of sieve 23 on the RX 550 4GB

Will let you know the results.
5) Message boards : Number crunching : Computation error (Message 1111)
Posted 1 Apr 2019 by marmot
Post:
I don't really know what causes error -58 (CL_INVALID_EVENT). It's triggered at this line: https://github.com/SChernykh/Amicable/blob/boinc-opencl-version-128-bit/Amicable/OpenCL.cpp#L1017 - but it's always set properly in the preceding call to clEnqueueNDRangeKernel on the last iteration of "for" loop.
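For readers matching the number to a name, here is a minimal lookup sketch. The numeric values are the standard ones from the Khronos OpenCL header (CL/cl.h); the table is only a small subset, and the helper name describe_cl_error is hypothetical, not something from the Amicable source.

```python
# Subset of OpenCL status codes, with values as defined in the Khronos CL/cl.h header.
# This is an illustrative table, not the full list of OpenCL errors.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -30: "CL_INVALID_VALUE",
    -34: "CL_INVALID_CONTEXT",
    -36: "CL_INVALID_COMMAND_QUEUE",
    -58: "CL_INVALID_EVENT",
}


def describe_cl_error(code: int) -> str:
    """Return the symbolic name for an OpenCL status code (hypothetical helper)."""
    return OPENCL_ERRORS.get(code, f"unknown OpenCL status {code}")


if __name__ == "__main__":
    print(describe_cl_error(-58))
```

Note that -58 is an "invalid object" class of error rather than one of the out-of-memory codes (-4, -5, -6), which may be worth keeping in mind when weighing the memory hypothesis.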

It started happening on the RX 550 GPU (the machine has 1 RX 550 and 1 GTX 1060), so that should eliminate the video drivers and GPU hardware and point to the OS or the main system. I'm guessing it's OS components or non-GPU hardware configuration.

The machine has 8GB RAM and 4GB swap file space. At one point, there was reported 2.1GB free RAM but the swapfile was nearly full with commits.

If I decrease swap file space and get 100% error -58 failures, then increase swap space and the errors are gone, the cause would be insufficient real/virtual memory.
My guess is that OpenCL driver runs out of resources occasionally.
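A rough way to reason about the memory hypothesis is to compare committed memory against the commit limit. The sketch below assumes the Windows commit limit is approximately physical RAM plus page file size; the function name and the committed figure passed in are illustrative, not measurements from a failing run.

```python
def commit_headroom_gb(ram_gb: float, swap_gb: float, committed_gb: float) -> float:
    """Approximate remaining commit space, assuming limit = RAM + page file."""
    commit_limit_gb = ram_gb + swap_gb
    return commit_limit_gb - committed_gb


if __name__ == "__main__":
    # 8 GB RAM and a 4 GB swap file as described above; 11.5 GB committed is a
    # made-up value standing in for "swap file nearly full with commits".
    headroom = commit_headroom_gb(8.0, 4.0, 11.5)
    print(f"headroom: {headroom:.2f} GB")
```

If the headroom stays comfortably positive while error -58 still occurs, that would count against the out-of-memory explanation.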

Won't get to test it till later in the week. The machine has moved onto other data gathering.
6) Message boards : Getting started : A note to the Project Dev/Lead (Message 1109)
Posted 31 Mar 2019 by marmot
Post:
Also, you will need 1 CPU per WU running on the GPU(s).
This is not typical of most GPU projects.
7) Message boards : Number crunching : GPU Utilization (Message 1108)
Posted 31 Mar 2019 by marmot
Post:
The problem is you then need more CPU cores. I'm running graphics cards that can do 16 WU at a time at a 22 kernel but I don't have the CPU cores to support it. So I have it set to 23 and can run 11 on a 12 CPU core machine.



What are the error results of WU's when you attempt to run more WU than the CPU cores can handle?

Is it error -58? (see: https://sech.me/boinc/Amicable/forum_thread.php?id=156&postid=1106#1106)
8) Message boards : Number crunching : Computation error (Message 1107)
Posted 31 Mar 2019 by marmot
Post:
It's most every WU now.
Kernel size is a meager 21 for a 3GB GTX, that should be easy (https://sech.me/boinc/Amicable/forum_thread.php?id=128&postid=795#795).
2 CPU's selected.

Peak working set on every failed WU doesn't exceed 586 MB, and peak swap doesn't exceed 985 MB.

The valid WUs show maximum:
Peak working set size 608 MB
Peak swap size 1,005 MB

Not attempting multiple WU in app_config and this is just default, non-overclocked WU attempts.
GPU is running cool at 41C.
Maybe it's the driver version 19.3.2? (took hours to get dual ATI/nVidia setup to work... not wanting to change drivers).
9) Message boards : Number crunching : Computation error (Message 1106)
Posted 30 Mar 2019 by marmot
Post:
I'm receiving error -58 on my 1060 3gb.

Sometimes work units complete, other times they get errors.

Is it related to how much available CPU is open to the WU?
If CPU projects take up too much CPU does this error occur?

Have run CUDA or OpenCL WUs for Einstein, GPUGrid, Moo!, Milkyway and Asteroids on this 1060 3GB successfully over the last week as part of a baseline testing at default values (no overclocking, just an aggressive cooling profile to assure the GPU's are under 60C and best BIOS based clocking decisions are made).

Logfile:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
(unknown error) - exit code -1 (0xffffffff)</message>
<stderr_txt>
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<max_jobs>0</max_jobs>
<max_cpus>2</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>21</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 21
Initializing prime tables...done
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 294: Preferences:
<project_preferences>


<max_jobs>0</max_jobs>
<max_cpus>2</max_cpus>
<kernel_size_amd>21</kernel_size_amd>
<kernel_size_nvidia>21</kernel_size_nvidia>
</project_preferences>

c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 307: Kernel size for NVIDIA GPU has been set to 21
c:\temp\amicable-boinc-opencl-version-128-bit\amicable\opencl.cpp, line 1130: clGetEventInfo returned error -58
20:33:57 (1356): called boinc_finish(-1)

</stderr_txt>
]]>
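The stderr above can be summarized with a short script. This is a sketch under two assumptions: that each "Initializing prime tables" line marks one start of the app for that task, and that OpenCL failures are reported in the "returned error N" form seen in this log. The function name summarize_stderr is hypothetical.

```python
import re


def summarize_stderr(stderr_text: str) -> dict:
    """Count app starts and collect OpenCL error codes from a task's stderr."""
    # Assumption: one "Initializing prime tables" line per app start.
    starts = stderr_text.count("Initializing prime tables")
    # Matches lines such as "clGetEventInfo returned error -58".
    errors = [int(code) for code in re.findall(r"returned error (-?\d+)", stderr_text)]
    return {"app_starts": starts, "resumed": starts > 1, "cl_errors": errors}


if __name__ == "__main__":
    sample = (
        "Initializing prime tables...done\n"
        "Initializing prime tables...done\n"
        "opencl.cpp, line 1130: clGetEventInfo returned error -58\n"
    )
    print(summarize_stderr(sample))
```

Under those assumptions, the log above shows two starts before the -58, meaning the task was restarted at least once before it failed.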


