
.bam files beginning with re-alignment, sorting, etc.

Complicated question: Has anyone investigated the optimal thread and memory settings to sort hundreds to thousands of .bam files using samtools sort on a cluster?

By 'optimal', I mean sorting as many samples as possible in the shortest time frame. For a single sample, throwing large amounts of memory + threads at the job is trivial, but that isn't scalable for thousands of samples. I'm mostly interested in optimizing samtools sort because the other steps (e.g., alignment) are much more obvious to fine-tune. There is also likely a diminishing return at a certain point, even for a single sample.

Ideally, I'd like to see plots identifying the 'sweet spot' for threads + memory for a single sample (e.g., x = threads, y = memory, z = time), and then plots maximizing throughput for more samples. I would expect the data to follow a -log function, where we could identify the optimal settings at the knee in the curve. There are obviously many other factors (e.g., cluster size, administrative scheduler settings, etc.), but I suspect findings on a given cluster will generalize to other clusters.

Ok, for what I expect will be my final update for this answer, I compared the following:

samtools 1.15 + libdeflate (this is a different version, but it shouldn't have a major effect. They released a new version just before I asked the cluster admins to compile against libdeflate, and I didn't notice until later.)

Note: This time I used the following settings:

Max Nextflow queue size of 30 to avoid too many threads reading from the same two files.
I tested CPU requests from 1 to 7 (step size 1) and then from 9 to 17 (step size 2).
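
To make the "knee in the curve" idea concrete, here is a minimal Python sketch. It builds the CPU request grid described in this post (1 to 7 in steps of 1, then 9 to 17 in steps of 2) and picks the knee as the point farthest from the straight line joining the first and last points, after scaling both axes to [0, 1]. That heuristic is my assumption, not something samtools or Nextflow provides, and the function names (`cpu_grid`, `knee_index`, `synthetic_time`) are hypothetical. The timing curve is a made-up -log shape used only to exercise the code; it is not measured data.

```python
import math


def cpu_grid():
    """CPU requests from the post: 1 to 7 (step 1), then 9 to 17 (step 2)."""
    return list(range(1, 8)) + list(range(9, 18, 2))


def knee_index(xs, ys):
    """Index of the point farthest from the chord joining the first and last points.

    Both axes are rescaled to [0, 1] first so the result does not depend on
    units. This is a simple geometric heuristic (an assumption of this sketch).
    """
    if len(xs) != len(ys) or len(xs) < 3:
        raise ValueError("need at least three (x, y) points")
    x_span = xs[-1] - xs[0]
    y_lo, y_hi = min(ys), max(ys)
    y_span = y_hi - y_lo
    if x_span == 0 or y_span == 0:
        raise ValueError("curve is degenerate (no spread on one axis)")

    xn = [(x - xs[0]) / x_span for x in xs]
    yn = [(y - y_lo) / y_span for y in ys]

    # Perpendicular distance of each point from the first-to-last chord.
    dx, dy = xn[-1] - xn[0], yn[-1] - yn[0]
    norm = math.hypot(dx, dy)
    dists = [abs(dy * (x - xn[0]) - dx * (y - yn[0])) / norm
             for x, y in zip(xn, yn)]
    return max(range(len(dists)), key=dists.__getitem__)


def synthetic_time(cpus):
    """Placeholder -log curve; NOT real timings, replace with measurements."""
    return 100.0 - 30.0 * math.log(cpus)


if __name__ == "__main__":
    cpus = cpu_grid()
    times = [synthetic_time(c) for c in cpus]
    i = knee_index(cpus, times)
    print(f"knee at {cpus[i]} CPUs on the synthetic curve")
```

With real data, `times` would be the measured wall-clock times of samtools sort at each CPU request. The same function could be applied along the memory axis, but a single knee says nothing on its own about throughput across many concurrent samples, which also depends on how many jobs the scheduler can pack onto each node.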
