
-M, --multimap            Consider not unique mappings (not recommended) (default: off)
-@, --samtools-threads N  The number of threads used by samtools to sort the bam file
-u, --umi-extension TYPE  In case the UMI is too short to guarantee uniqueness (without information from the mapping) set this parameter to chr, Gene or [N]bp. If set to chr the mapping position (binned to 10Gb intervals) will be appended to UB (ideal for InDrops+dropEst). If set to Gene then the GX tag will be appended to the UB tag. If set to [N]bp the first N bases of the sequence will be used to extend UB (ideal for STRT). (Default: no)
-U, --without-umi         If this flag is used the data is assumed UMI-less and reads are counted instead of molecules (default: off)

-l, --logic LOGIC         The logic to use for the filtering (default: Default)
-c, --onefilepercell      If this flag is used every bamfile passed is interpreted as an independent cell, otherwise multiple files are interpreted as a batch of different cells to be analyzed together. Important: distributing one cell's reads over multiple bamfiles is not supported!! (default: off)
-m, --mask FILE           .gtf file containing intervals to mask
-s, --metadatatable FILE  Table containing metadata of the various samples (csv formatted, rows are samples and cols are entries)
-e, --sampleid ID         The sample name that will be used to retrieve information from the metadatatable
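As a sketch, several of the velocyto run options described here can be combined in a single invocation. All file names below (barcodes.tsv, repeat_mask.gtf, possorted.bam, genes.gtf) are placeholders for illustration, not files shipped with velocyto.

```shell
# Hypothetical velocyto run invocation; paths are placeholders.
# -b keeps only the listed cell barcodes (matched against the CB tag),
# -m masks intervals (e.g. repeats), -u chr extends short UMIs with the
# binned mapping position, -@ sets the threads for the samtools sort step.
velocyto run \
    -b barcodes.tsv \
    -o velocyto_out \
    -m repeat_mask.gtf \
    -u chr \
    -@ 8 \
    possorted.bam genes.gtf
```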

-b, --bcfile FILE         Valid barcodes file, to filter the bam. If --bcfile is not specified all the cell barcodes will be included. Cell barcodes should be specified in the bcfile as the CB tag of each read
-o, --outputfolder PATH   Output folder, if it does not exist it will be created.

Usage: velocyto run [OPTIONS] BAMFILE GTFFILE

#Samtools threads

I suggest allowing the number of CPUs used by samtools while reading the data (and producing pre-sorted chunks) to be specified separately. Until the mapper is finished samtools could for instance use a single thread for reading and chunking, and then use the full number of threads afterwards (when the mapper has finished). This would simplify the specification of the number of threads used by both programs, and the CPU usage could be better limited (in shared environments you need to specify the number of cores, and sometimes admins really check).

Also I don't think it's true to say that samtools sort only uses one CPU until the mapper has finished. It uses one thread until it has read enough data, and then it uses multiple threads to sort and write that temporary data to disk, repeatedly. On finishing (no more stdin) it then has a separate merge stage. Ideally it would be using asynchronous I/O too.

However this particular problem is perhaps one of expectation. If your mapper is the slow part, then yes samtools will likely be stuck at under 100% CPU, but that's not really a samtools issue I think. Over-specifying the number of threads is not a catastrophically bad thing to do, and you can use cgroups or hwloc-bind to govern how many cores the entire process can take up too.

Note there is more or less a way to handle what you want already (untested, but I think it's equivalent), e.g.:

mapper | samtools sort -l 0 -O bam | samtools view -O bam -o out.bam

The second merge stage only starts when the mapper has finished, and it will be I/O bound and won't be threading on output as there are no lengthy bgzf compression steps. The samtools view command will only start consuming CPU after the mapper has finished, so both mapper and view can be given the same cores to work on. Finally, maybe you'll get more luck using mapper | mbuffer | samtools with some systems and/or aligners. This can avoid issues with small pipe sizes.

Actually (in my case the mapper is hisat2) CPU usage is most of the time approx. 100% and then spikes for a short time to approx. x*100%, where x is the number of threads given to samtools. Efficiency depends a bit on how sort merges the temporary files. If it is done in a tree-like fashion, then it would start to write output at the top level of the merge tree. But if all temporary files are merged at once, then it would start writing output immediately (which would start view earlier). The result should be equivalent. For the suggested solution the latter would be better - I guess.
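The suggested sort | view pipeline, spelled out with explicit thread counts. hisat2 is used as the mapper since it is the one mentioned in the discussion; the index and read file names are placeholders, and the thread and memory values are illustrative.

```shell
# sort emits uncompressed BAM (-l 0), so no CPU is spent compressing data
# that view immediately re-reads; view does the final bgzf compression with
# its own threads, but only consumes CPU once sort's merge stage feeds it.
hisat2 -p 8 -x genome_idx -1 reads_1.fq -2 reads_2.fq \
  | samtools sort -l 0 -O bam -@ 4 -m 1G -T tmp_sort - \
  | samtools view -O bam -@ 8 -o out.bam -
```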
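The mbuffer variant mentioned above could look like the following; the buffer size is an illustrative value (mbuffer's -m option sets the in-memory buffer), and the point is simply to decouple mapper and sort with something larger than the default pipe size.

```shell
# A larger buffer between mapper and sort can smooth out the bursty
# producer/consumer pattern that small OS pipe sizes cause.
hisat2 -p 8 -x genome_idx -1 reads_1.fq -2 reads_2.fq \
  | mbuffer -m 2G \
  | samtools sort -l 0 -O bam -@ 4 -o sorted.bam -
```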
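One way to cap the cores the whole pipeline can take, in the spirit of the cgroups/hwloc-bind remark above, is to pin the entire pipeline to a fixed CPU set. This sketch uses taskset (util-linux) rather than hwloc-bind; child processes inherit the affinity, so mapper and samtools can over-specify threads without the pipeline exceeding the allotted cores.

```shell
# Restrict the whole pipeline to CPUs 0-7; thread counts inside can then
# be generous without exceeding 8 cores in total.
taskset -c 0-7 sh -c '
  hisat2 -p 8 -x genome_idx -1 reads_1.fq -2 reads_2.fq \
    | samtools sort -l 0 -O bam -@ 4 - \
    | samtools view -O bam -@ 8 -o out.bam -
'
```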
