[color=#000000]Hi,[/color]
[color=#000000]The "[i]Maximum number of clients reachedMaximum number of clients reachedslurmstepd: error: *** JOB 1391804 ON node1127 CANCELLED AT 2021-05-25T11:44:26 ***[/i]" error message seems to indicate that your SLURM cluster's job scheduler cancelled your jobs because you exceeded the quota of simultaneous jobs allowed on the cluster. I would suggest:[/color]
[color=#000000]a) check with your cluster administrator (or do a bit of trial-and-error testing) to find the maximum number of simultaneous jobs you are allowed to run on your cluster (e.g. individual users may be allowed up to 50 simultaneous jobs)[/color]
[color=#000000]b) when running your processing/analysis steps in CONN, set the number of jobs to a value slightly below that maximum (e.g. 40)[/color]
[color=#000000]c) make a quick estimate of how long each job will need to finish (e.g. with 1000 subjects divided among 40 jobs, each job processes 25 subjects; if you are running preprocessing and expect it to take around 20 minutes per subject, then each job will need roughly 8 hours to finish) and make sure the wall-time allocated to your jobs is sufficiently high (e.g. add "-t 12:00:00" to the 'in-line additional submit settings' option of your Slurm profile in CONN to request that each job be allowed to run for up to 12 hours)[/color]
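The back-of-the-envelope calculation in (c) can be sketched as follows (a minimal illustration using the example figures above; the 20-minutes-per-subject timing is just the rough estimate from this thread, not a measurement):

```python
import math

# Example figures from the suggestions above (adjust to your own study)
n_subjects = 1000          # total subjects in the study
n_jobs = 40                # simultaneous jobs submitted to SLURM (suggestion b)
minutes_per_subject = 20   # rough preprocessing time per subject (an assumption)

# Each job handles a roughly equal share of the subjects
subjects_per_job = math.ceil(n_subjects / n_jobs)            # 1000/40 -> 25

# Expected runtime per job, in hours
hours_per_job = subjects_per_job * minutes_per_subject / 60  # 25*20/60 ~ 8.3 h

print(f"{subjects_per_job} subjects/job, ~{hours_per_job:.1f} h/job")
```

With these numbers the estimate comes out near 8.3 hours per job, so requesting "-t 12:00:00" as in (c) leaves a few hours of headroom for slow nodes or stragglers; it is generally safer to over-request wall-time somewhat than to have jobs killed just short of completion.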
Hope this helps
Alfonso
[i]Originally posted by sat2020:[/i][quote]Good morning,
I am running Conn on my university's HPC, which uses SLURM.
I had posted about 6 months ago with some issues we were running into with ART.
Because it seems Matlab-related, the standalone version (18b) was installed,
and that seemed to work on all the tests I've run. With very small test
samples, I've been able to process and analyze test data from start to finish
using the built-in HPC function in the GUI, with no issues.
However, my actual sample is quite large (~1000), and now
that I am running the real data, the jobs are getting stuck about 2/3 of the
way through--showing as running still, but not finishing. I have only gotten to
preprocessing so far. I can see from some earlier posts that at least one other
person had that issue, and they updated to the later Conn version and that
helped. However, because we had the issue with the ART step/Matlab, I'm not sure
that would be an option here, since we are using the most recent standalone
version. I've already consulted with our IT and tried modifying how much time
I'm requesting for the jobs, but that hasn't fixed the problem.
1) Here is the text from some of the errors I'm getting:
-When starting Conn from the terminal window:
Fontconfig warning:
"/users/USERNAME/.config/fontconfig/fonts.conf", line 82: unknown
element "blank"
-Text from an stderr file:
Fontconfig warning:
"/users/USERNAME/.config/fontconfig/fonts.conf", line 82: unknown
element "blank"
Maximum number of clients reachedMaximum number of clients
reachedslurmstepd: error: *** JOB 1391804 ON node1127 CANCELLED AT
2021-05-25T11:44:26 ***
2) When the jobs seem to get stuck running and I cancel
them, I delete all the files generated during that step, including within the
anat and func folders. I also delete the project folders that were created. Does
Conn modify the original structural and functional files? The modified date on
those files changes to the date I ran Conn. I don't have another set of
original files to replace them with each time I need to cancel the process
since there are 4000+ files (I could re-download them), but want to make sure
that's not contributing to the problem.
Any insight you could provide would be greatly appreciated.
Thank you![/quote]