fasterq-dump getting stuck in loops -- inconsistent and rectified when running fastq-dump #161
Fasterq-dump uses the current directory for temporary files by default. If you are running other jobs in parallel in the same directory, they compete for space. You could try to use a different directory (command-line option -t|--temp) for each job. Speed can also be improved if the temporary directory is on a different file system such as an SSD or a RAM disk. I do not know the details of your cluster management, but I know from others that they terminate jobs if a job exceeds its limits. Fasterq-dump needs more temporary space than fastq-dump, both on disk and in memory. That is where the speed improvement comes from, not only from the use of multiple threads. Maybe you have to adjust your job settings.
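For illustration, a minimal sketch of giving each parallel job its own temp directory via `-t|--temp`; the scratch path, accessions, and output directory here are hypothetical placeholders:

```bash
# Hypothetical fast local scratch location (SSD or RAM disk);
# adjust to whatever your cluster actually provides.
SCRATCH=/local/scratch/$USER

for ACC in SRR000001 SRR000002; do        # example accessions
    TMP_JOB="$SCRATCH/$ACC.tmp"
    mkdir -p "$TMP_JOB"
    # Give every fasterq-dump run its own temp directory so that
    # parallel jobs do not compete for the same space.
    fasterq-dump "$ACC" -e 8 --temp "$TMP_JOB" -O fastq/ &
done
wait
```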
Thank you for the quick reply. Nextflow manages jobs such that each job runs in a separate temp directory, so I don't imagine that's the issue. We have over a petabyte of space, so unfortunately I don't think that's the issue either. Our cluster has 64 nodes with 128 cores each. I have been running these jobs requesting 8 cores with a memory limit of up to 200 GB (I had suspected it might be a memory issue). I could increase this further, but given that I can run fastq-dump with 15-20 GB without issue, I would probably just revert to that instead. I have run mpstat on the jobs, and they never seem to get close to exceeding the memory limit. However, this isn't an issue of speed; when it runs, it is extremely efficient. The issue arises when I try to run multiple fasterq-dump jobs at once: only about 1/3-1/2 of them actually run, while the others get stuck (some even start and then get stuck) and loop until they hit the time wall I set for them. To add to this, I never get an exit code indicating lack of space or a failed memory allocation. The jobs never actually terminate.
I am very interested to find out the reason for the getting stuck. So you are saying that you can process SRR accessions with fasterq-dump without problems as long as it is just one at a time, but when you try multiples of them at the same time, some of them get stuck. Can you please tell me what getting stuck means? Are you running with the progress option (-p|--progress) and the progress stops? As I read your comment, you are running multiples of them on different machines, but they influence each other somehow? How many of them are you running at once?
Correct -- and it's never the same SRR that gets "stuck". By stuck I mean one of two things:
I do not get any error codes in either of these scenarios. Sometimes these jobs are split across different nodes and sometimes they are on the same node, but there does not seem to be a pattern as to which get stuck (e.g. if there are two running on the same node, sometimes both will run, sometimes only one, sometimes neither). The node allocation just depends on the job load on our cluster. I'm typically trying to run 8 or fewer jobs at once. I've tried requesting fewer threads (between 4 and 8), too, and this does not seem to resolve the issue. Switching to fastq-dump has resolved it, so it seems to be something with the multi-threading, although admittedly I am assuming that is the only difference between the two. I'm just having trouble getting at what could be causing it. What's odd is that even if it were a memory-allocation issue, once the other jobs that did start running complete successfully, the "stuck" jobs do not pick up and begin running. Thanks!
Did you prefetch the accessions, or are you 'downloading' them with fasterq-dump/fastq-dump?
I have a suspicion that our servers are limiting the number of connections from a certain IP address.
I have been using prefetch to get the SRAs first. I don't want to store that many fastq files on our server (for obvious reasons), but it's nice to have the SRAs in case we want to re-process anything. Presumably, though, there is still some information stored on the server when I'm using fasterq-dump? I'm actually not aware whether all the information needed for the fastq is stored directly in the SRA file.
If you're using prefetch, no information will be left on NCBI servers.
Prefetch puts the SRAs into a special location where they can be found by the tools: /home/username/ncbi/public/sra
If the SRA is aligned against a reference, it also needs access to the reference accessions. Prefetch downloads them into /home/username/ncbi/public/refseq. Does your cluster node have access to that too?
I redirected the output of the SRAs to a /scratch directory. I did this by creating a .ncbi/ directory in my home directory and making a user-settings.mkfg file with the following line:

/repository/user/main/public/root = "/scratch/<my defined sra output directory>/"

I'm then running fastq-dump/fasterq-dump on SRAs in this repository.
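For reference, the setup described above might look roughly like this; the scratch path in the example is a placeholder, not the directory actually used:

```bash
# Create the per-user sra-tools configuration directory, as described above.
mkdir -p "$HOME/.ncbi"

# Point the public repository root at a scratch file system.
# "/scratch/example_sra_dir/" is a hypothetical placeholder path.
echo '/repository/user/main/public/root = "/scratch/example_sra_dir/"' \
    >> "$HOME/.ncbi/user-settings.mkfg"
```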
@magruca - are you still experiencing issues?
Yes -- I ultimately chose to switch back to using the parallel-fastq-dump wrapper in the interim, as I'm not experiencing any issues with that.
@magruca, do you still need help?
We are seeing this in 2.11.0. When it hangs, it is not even possible to do Ctrl+C. We are investigating this from our side and will let you know what we come up with. This is running on bare metal in a Singularity container with a mounted BeeGFS file system; so far we do not see problems when changing to the host's /tmp as the cwd.
We have seen issues before with BeeGFS. As an aside, fasterq-dump's temporary directory can also be redirected with the FASTERQ_DUMP_TEMP environment variable.
Interesting; we're also on BeeGFS and experiencing this. We're trying to mitigate this problem in different ways. All the ideas we have now involve writing wrappers that export FASTERQ_DUMP_TEMP=/fast/big/local/scratch/tempfolder globally, to lower the risk of running into these problems. This would probably also benefit the end users, since they'd be working against a much faster local disk rather than a shared global parallel file system.
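A rough sketch of what such a wrapper could look like; the location of the real fasterq-dump binary is an assumption, and whether FASTERQ_DUMP_TEMP is honored should be verified for the sra-tools version in use:

```bash
#!/usr/bin/env bash
# Hypothetical site-wide wrapper around fasterq-dump: point its temp
# space at a fast node-local disk instead of the shared BeeGFS volume.
export FASTERQ_DUMP_TEMP=/fast/big/local/scratch/tempfolder
mkdir -p "$FASTERQ_DUMP_TEMP"

# Hand all arguments over to the real binary (path is an assumption).
exec /opt/sra-tools/bin/fasterq-dump "$@"
```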
Just learned that this issue has been addressed in BeeGFS 7.3.2 (2022-10-14). From the release notes:
Is this still an issue? |
I cannot tell anymore, because we're on BeeGFS 7.2.15 and this was fixed in BeeGFS 7.3.2 (see the above comment). I suspect it's a problem on any system running a BeeGFS version prior to that. As already mentioned above, in such cases the problem is catastrophic.
Hi all--
I am working on a cluster that has a SLURM job handler and plan to download a few thousand SRAs for a project. I have developed a Nextflow pipeline to handle this load and was excited when you released the new version of sra-tools that facilitates multi-threading. However, I am sporadically crashing nodes when trying to use fasterq-dump. I noticed there were existing issues regarding fasterq-dump getting caught in loops, but I believe this issue differs from the others I have read.
For example, I was trying to run fasterq-dump on Andrysik2017 data (SRR4090[098-109]) with basic commands: `fasterq-dump ${SAMPLE.sra} -e 8`. Each time I ran fasterq-dump on these examples, different SRRs would get stuck in an endless loop and ultimately crash an entire node. This hasn't been consistent: sometimes entire projects similar to this go through without a hitch, but I would say failures have been more common than not over the past two weeks I have been running it. It is also significantly more common when I try to run multiple jobs in parallel, but again, I could cancel all of those jobs and resubmit the same batch, and the SRRs that proceed and those that get stuck would differ from those in the first submission.
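For context, the workload boils down to something like the following; the loop is only an illustrative stand-in, since the real jobs are dispatched by Nextflow under SLURM, and the filenames are examples:

```bash
# Prefetched .sra files for part of the accession range above (examples).
for SRA in SRR4090098.sra SRR4090099.sra SRR4090100.sra; do
    # Run each conversion with 8 threads, several jobs in parallel.
    fasterq-dump "$SRA" -e 8 &
done
wait
```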
I originally suspected this might be a problem with our cluster or Nextflow; however, when I switched back to using fastq-dump, the issue was rectified, which leads me to believe it's not on our end.
Thank you in advance for your time in troubleshooting and resolving this issue.