
Tools for dealing with Unique Molecular Identifiers¶
Note
Important update: We now recommend the use of alevin for 10x Chromium and Drop-seq droplet-based scRNA-Seq. alevin is an accurate, fast and convenient end-to-end tool to go from fastq -> count matrix and extends the UMI error correction in UMI-tools within a framework that also enables quantification of droplet scRNA-Seq without discarding multi-mapped reads. See alevin documentation and alevin pre-print for more information
Welcome to the UMI-tools documentation. UMI-tools contains tools for dealing with Unique Molecular Identifiers (UMIs)/Random Molecular Tags (RMTs) and single cell RNA-Seq cell barcodes.
UMI-tools was published in Genome Research on 18 Jan ‘17 (open access)
Contents:
Tools¶
Currently there are 6 commands. The extract
and whitelist
commands are used to prepare a fastq containg UMIs +/- cell barcodes
for alignment.
- whitelist:
- Builds a whitelist of the ‘real’ cell barcodes
- This is useful for droplet-based single cell RNA-Seq where the identity of the true cell barcodes is unknown. The whitelist can then be used to filter cell barcodes with extract (see below)
- extract:
- Flexible removal of UMI sequences from fastq reads.
- UMIs are removed and appended to the read name. Any other barcode, for example a library barcode, is left on the read. Can also filter reads by quality or against a whitelist (see above)
The remaining commands, group
, dedup
and
count
/count_tab
, are used to identify PCR duplicates using the
UMIs and perform different levels of analysis depending on the needs
of the user. A number of different UMI deduplication schemes are
enabled - The recommended method is directional. For more deails about the
deduplication schemes see The network-based deduplication methods
- dedup:
- Groups PCR duplicates and deduplicates reads to yield one read per group
- Use this when you want to remove the PCR duplicates prior to any downstream analysis
- group:
- Groups PCR duplicates using the same methods available through `dedup`.
- This is useful when you want to manually interrogate the PCR duplicates or perform bespoke downstream processing such as generating consensus sequences
- count:
- Groups and deduplicates PCR duplicates and counts the unique molecules per gene
- Use this when you want to obtain a matrix with unique molecules per gene, per cell, for scRNA-Seq
- count_tab:
- As per count except input is a flatfile
Each tool has a set of Common options for input/output, profiling and debugging.
See Quick start guide for a quick tutorial on the most common usage pattern.
If you want to use UMI-tools in single-cell RNA-Seq data processing, see Single cell tutorial
The dedup
, group
, and count
/ count_tab
commands make use of network-based methods to resolve similar UMIs with the same alignment coordinates. For a background regarding these methods see:
Installation¶
If you’re using Conda, you can use:
$ conda install -c bioconda -c conda-forge umi_tools
Or pip:
$ pip install umi_tools
Or if you’d like to work directly from the git repository:
$ git clone https://github.com/CGATOxford/UMI-tools.git
Enter repository and run:
$ python setup.py install
For more detail see Installation Guide
To get detailed help on umi_tools run
$ umi_tools --help
To get help on a specific [COMMAND] run
$ umi_tools [COMMAND] --help
Dependencies¶
umi_tools is dependent on numpy, pandas, scipy, cython, pysam, future, regex and matplotlib