Why is my dedup command taking so long?
The time taken to resolve each position depends on how many unique UMIs there are and how closely related they are, since these factors determine how large the network of connected UMIs is. Some of the factors which can affect run time are:
- UMI length (shorter => fewer possible UMIs => increased connectivity between UMIs => larger networks => longer run time)
- Sequencing error rate (higher => more error UMIs => larger networks => longer run time)
- Sequencing depth (higher => greater proportion of possible UMIs observed => larger networks => longer run time)
- umi_tools dedup --output-stats requires a considerable amount of time and memory to generate the null distributions. If you want these stats, consider obtaining stats for just a single contig, e.g.
- If you are doing single-cell RNA-seq and you have reads from more than one cell in your BAM file, make sure you are running with the
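To get a feel for the UMI length and depth effects listed above, here is a back-of-envelope sketch. It assumes UMIs are drawn uniformly and independently from all 4^L possible sequences, which real libraries only approximate. The function names are illustrative and are not part of umi_tools.

```python
def possible_umis(length: int) -> int:
    """Number of distinct UMI sequences of a given length (4 bases per position)."""
    return 4 ** length


def expected_fraction_observed(length: int, n_molecules: int) -> float:
    """Expected fraction of all possible UMIs seen at one position.

    Assumes each of n_molecules picks its UMI uniformly and independently.
    """
    total = possible_umis(length)
    return 1.0 - (1.0 - 1.0 / total) ** n_molecules


if __name__ == "__main__":
    for length in (6, 8, 10):
        frac = expected_fraction_observed(length, n_molecules=1000)
        print(f"UMI length {length}: {possible_umis(length)} possible, "
              f"{frac:.1%} expected to be observed with 1000 molecules")
```

The denser the observed UMI space, the more likely two UMIs sit within one mismatch of each other, which is what makes the networks larger.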
Why is my dedup command taking so much memory?
There are a few reasons why your command could require an excessive amount of memory:
- Most of the above factors increasing run time can also increase memory
- --chimeric-pairs=use (default) can take a large amount of memory. This is because dedup will wait until a whole contig has been deduplicated before re-parsing the reads from the contig and writing out the read2s which match the retained read1s. For chimeric read pairs, the read2s will not be found on the same contig and will be kept in a buffer of “orphan” read2s, which may take up a lot of memory. Consider using --chimeric-pairs=discard instead to discard chimeric read pairs. Similarly, use of --paired can also increase memory requirements.
Can I run umi_tools with parallel threads?
What’s the correct regex to use for technique X?
Here is a table of the techniques we have come across that are not easily processed with the basic barcode pattern syntax, and the correct regexes to use with them:
If you know of others, please drop us a PR with them!
Can I use umi_tools to determine consensus sequences?
Right now, you can use umi_tools group to identify the duplicated read groups. From this, you can then derive consensus sequences as you wish. We have discussed adding consensus sequence calling as a separate umi_tools command (see #203). If you’d like to help us out, get in touch!
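As one possible starting point, the sketch below takes reads that have already been assigned to groups and calls a simple per-position majority consensus. It is only an illustration: the `(group_id, sequence)` input is a hypothetical in-memory format rather than the actual output of umi_tools group, all reads in a group are assumed to be the same length and already aligned to each other, and base qualities are ignored.

```python
from collections import Counter, defaultdict


def consensus(sequences):
    """Majority base at each position; sequences must share one length."""
    if len({len(s) for s in sequences}) != 1:
        raise ValueError("all sequences in a group must have the same length")
    # zip(*sequences) yields one column of bases per position
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))


def consensus_per_group(records):
    """records: iterable of (group_id, sequence) pairs -> {group_id: consensus}."""
    groups = defaultdict(list)
    for group_id, seq in records:
        groups[group_id].append(seq)
    return {gid: consensus(seqs) for gid, seqs in groups.items()}


# Hypothetical toy input: group ids as a grouping step might assign them
records = [
    (0, "ACGTAC"),
    (0, "ACGTAC"),
    (0, "ACCTAC"),  # one read differs from the others at the third base
    (1, "TTGACA"),
]
print(consensus_per_group(records))
```

Ties are broken by whichever base was seen first in the group, so a real implementation would probably want to weigh base qualities or emit an ambiguity code instead.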
What do the --per-contig options do in
These options are designed to handle samples from sequence library protocols where fragmentation occurs post amplification, e.g. CEL-Seq for single cell RNA-Seq. For such samples, mapping coordinates of duplicate reads will no longer be identical and umi_tools can instead use the gene to which the read is mapped. This behaviour is switched on with
umi_tools then needs to be told if the reads are directly mapped to a transcriptome (--per-contig), or mapped to a genome and the transcript/gene assignment is contained in a read tag (--gene-tag=[TAG]). If you have assigned reads to transcripts but want to use the gene IDs to determine UMI groups, there is a further option to provide a file mapping transcript IDs to gene IDs (--gene-transcript-map). Finally, for single cell RNA-Seq, you must specify
count --help for details of further related options and the UMI-tools single cell RNA-Seq guide.
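To illustrate why grouping by gene matters when fragmentation happens after amplification, here is a toy sketch. The read records and the `count_unique_umis` helper are hypothetical, and the sketch matches UMIs exactly rather than using the network-based methods in umi_tools.

```python
from collections import defaultdict


def count_unique_umis(reads, key):
    """Count distinct UMIs per group, where key(read) picks the grouping."""
    umis = defaultdict(set)
    for read in reads:
        umis[key(read)].add(read["umi"])
    return {group: len(s) for group, s in umis.items()}


# Hypothetical reads: one amplified molecule (UMI AACG) fragmented after
# amplification, so its copies map to different positions in GENE1.
reads = [
    {"gene": "GENE1", "pos": 100, "umi": "AACG"},
    {"gene": "GENE1", "pos": 180, "umi": "AACG"},
    {"gene": "GENE1", "pos": 250, "umi": "AACG"},
    {"gene": "GENE1", "pos": 120, "umi": "TTGC"},
]

by_position = count_unique_umis(reads, key=lambda r: (r["gene"], r["pos"]))
by_gene = count_unique_umis(reads, key=lambda r: r["gene"])

print(sum(by_position.values()))  # molecules inferred when grouping by position
print(by_gene["GENE1"])           # molecules inferred when grouping by gene
```

Grouping by position treats each fragment of the AACG molecule as a separate molecule, whereas grouping by gene collapses them into one.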
Should I use --per-gene with my sequencing from method X?
It can be difficult to work this out sometimes! So far we have come across the following techniques that require the use of --per-gene: CEL-seq2, SCRB-seq, 10x Chromium, inDrop, Drop-seq and SPLiT-seq. Let us know if you know of more.
Has the whitelist command been peer-reviewed and compared to alternatives?
No. At the time of the UMI-tools publication on 18 Jan ‘17, the only tools available were
whitelist commands were added later. With count, the deduplication part is identical to dedup, so it’s reasonable to say the underlying algorithm has been peer-reviewed. However, whitelist uses an entirely different approach (see here), which has not been rigorously tested, compared to alternative algorithms, or peer-reviewed. We recommend that users also explore other options for whitelisting.
Can I use whitelist without UMIs?
Strictly speaking, yes, but only with --extract-method string. If you use --extract-method regex and don’t provide both a UMI and cell barcode position in the regex, you’ll get an error.
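If you want to sanity-check a pattern before running whitelist, something like the sketch below may help. It assumes the named-group convention of `cell_N` and `umi_N` groups for barcode regexes; please confirm this against `umi_tools whitelist --help` for your version. The `has_umi_and_cell` helper and the example barcode layout are hypothetical, and the sketch uses Python's standard `re` module purely for illustration.

```python
import re


def has_umi_and_cell(pattern: str) -> bool:
    """True if the regex defines at least one umi_* and one cell_* named group."""
    names = re.compile(pattern).groupindex.keys()
    has_umi = any(n.startswith("umi_") for n in names)
    has_cell = any(n.startswith("cell_") for n in names)
    return has_umi and has_cell


# Hypothetical layout: 8-base cell barcode followed by a 6-base UMI
both = r"^(?P<cell_1>.{8})(?P<umi_1>.{6})"
umi_only = r"^(?P<umi_1>.{6})"

print(has_umi_and_cell(both))      # True
print(has_umi_and_cell(umi_only))  # False

# Check what the pattern captures on a made-up read sequence
match = re.match(both, "AAAACCCCGGGTTTACGT")
print(match.group("cell_1"), match.group("umi_1"))
```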