Each tool has a set of Common options for input/output, profiling and debugging.
Group - Group reads based on their UMI and mapping coordinates
Identify groups of reads based on their genomic coordinate and UMI
The group command can be used to create two types of outfile: a tagged BAM or a flatfile describing the read groups.
To generate the tagged-BAM file, use the option --output-bam and
provide a filename with the -S option. Alternatively,
if you do not provide a filename, the BAM file will be written to
stdout. If you have provided the -L option to send
the logging output elsewhere, you can pipe the output from the group
command directly to e.g. samtools view like so:
umi_tools group -I inf.bam --group-out=grouped.tsv --output-bam --log=group.log --paired | samtools view - | less
The tagged-BAM file will have two tags per read:
- UG Unique_id. 0-indexed unique id number for each group of reads with the same genomic position and UMI or UMIs inferred to be from the same true UMI + errors
- BX Final UMI. The inferred true UMI for the group
To generate the flatfile describing the read groups, include the
--group-out=<filename> option. The columns of the read groups file are
below. The first six columns relate to the read. The final three columns
relate to the group.
- read identifier
- alignment contig
- Alignment position. Note that this position is not the start position of the read in the BAM file but the start of the read taking into account the read strand and cigar
- The gene assignment for the read. Note, this will be NA unless the --per-gene option is specified
- The read UMI
- The number of times this UMI is observed for reads at the same position
- The inferred true UMI for the group
- The total number of reads within the group
- The unique id for the group
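As an illustration of how the read groups file might be consumed, the sketch below reads tab-separated rows and collects the read identifiers belonging to each group. The labels in `COLUMNS` are hypothetical names chosen to follow the column order listed above, and the example rows are made up; check both against the header and contents of your own file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column labels following the order listed above; check them
# against the header of your own file.
COLUMNS = ["read_id", "contig", "position", "gene", "umi",
           "umi_count", "final_umi", "final_umi_count", "unique_id"]


def reads_per_group(handle, has_header=True):
    """Map each group's unique id to the list of read ids assigned to it."""
    rows = csv.reader(handle, delimiter="\t")
    if has_header:
        next(rows, None)  # skip the header line
    groups = defaultdict(list)
    for row in rows:
        record = dict(zip(COLUMNS, row))
        groups[record["unique_id"]].append(record["read_id"])
    return dict(groups)


# Made-up rows for illustration only
example = io.StringIO(
    "\t".join(COLUMNS) + "\n"
    "r1\tchr1\t498\tNA\tAATT\t2\tAATT\t3\t0\n"
    "r2\tchr1\t498\tNA\tAATT\t2\tAATT\t3\t0\n"
    "r3\tchr1\t498\tNA\tAATA\t1\tAATT\t3\t0\n"
)
print(reads_per_group(example))
```

Here all three reads fall into group 0, since the AATA read is assigned the inferred true UMI AATT.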
Outfile name for file mapping read id to read group
Output a bam file with read groups tagged using the UG tag
BAM tag for the error corrected UMI selected for the group. Default=BX
It is assumed that the FASTQ files were processed with umi_tools extract before mapping and thus the UMI is the last word of the read name. e.g:
where AATT is the UMI sequence.
If you have used an alternative method which does not separate the
read id and UMI with a “_”, such as bcl2fastq which uses “:”, you can
specify the separator with the option
replacing <sep> with e.g “:”.
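The read-name convention described above can be sketched in a few lines of Python. The function name and the read names below are hypothetical and only illustrate the idea that the UMI is the last word of the read name after the separator; this is not taken from the tool's source.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the UMI, taken as the last separator-delimited word of the name."""
    return read_name.rsplit(separator, 1)[-1]


# Hypothetical read names
print(umi_from_read_name("read1_AATT"))        # AATT
print(umi_from_read_name("read1:AATT", ":"))   # AATT
```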
Alternatively, if your UMIs are encoded in a tag, you can specify this
by setting the option --extract-umi-method=tag and setting the tag name
with the --umi-tag option. For example, if your UMIs are encoded in
the ‘UM’ tag, provide the following options:
--extract-umi-method=tag --umi-tag=UM
Finally, if you have used umis to extract the UMI +/- cell barcode,
you can specify --extract-umi-method=umis.
The start position of a read is considered to be the start of its alignment minus any soft clipped bases. A read aligned at position 500 with cigar 2S98M will be assumed to start at position 498.
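The soft-clipping rule above can be checked with a small sketch. It parses a CIGAR string and, for a forward-strand read, subtracts any leading soft-clipped bases from the alignment position. The reverse-strand branch (alignment end plus trailing soft clip) is an assumption added for illustration, as are the function name and the choice to ignore hard clips.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def read_start(pos, cigar, is_reverse=False):
    """Start position of a read, adjusted for soft clipping."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not is_reverse:
        # Forward strand: subtract leading soft-clipped bases
        clip = ops[0][0] if ops[0][1] == "S" else 0
        return pos - clip
    # Reverse strand (assumption): end of alignment plus trailing soft clip
    end = pos + sum(n for n, op in ops if op in REF_CONSUMING)
    clip = ops[-1][0] if ops[-1][1] == "S" else 0
    return end + clip


print(read_start(500, "2S98M"))  # 498, matching the example above
```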
How are the barcodes encoded in the read? Options are:
- read_id (default)
- Barcodes are contained at the end of the read name, separated from the read id by the UMI separator
- tag
- Barcodes are contained in one or more tags, see the tag options below
- umis
- Barcodes were extracted using umis (https://github.com/vals/umis)
Separator between read id and UMI. See --extract-umi-method above
Tag which contains UMI. See --extract-umi-method above
Separate the UMI in tag by SPLIT and take the first element
Separate the UMI in tag by DELIMITER and concatenate the elements
Tag which contains cell barcode. See --extract-umi-method above
Separate the cell barcode in tag by SPLIT and take the first element
Separate the cell barcode in tag by DELIMITER and concatenate the elements
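The difference between the SPLIT and DELIMITER behaviours described above is easiest to see on a concrete value. The sketch below is a plain-Python illustration; the function names and tag values are hypothetical and not part of the tool.

```python
def first_element(value, split):
    """SPLIT behaviour: split the tag value and keep only the first element."""
    return value.split(split)[0]


def concatenate_elements(value, delimiter):
    """DELIMITER behaviour: split the tag value and join all elements."""
    return "".join(value.split(delimiter))


# Hypothetical tag values
print(first_element("AATT-1", "-"))             # AATT
print(concatenate_elements("AATT_CCGG", "_"))   # AATTCCGG
```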
UMI grouping options
What method to use to identify group of reads with the same (or similar) UMI(s)?
All methods start by identifying the reads with the same mapping position.
The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene.
- unique
- Reads in a group share the exact same UMI
- percentile
- Reads in a group share the exact same UMI. UMIs with counts < 1% of the median counts for UMIs at the same position are ignored.
- cluster
- Identify clusters of connected UMIs (based on hamming distance threshold). Each network is a read group
- adjacency
- Cluster UMIs as above. For each cluster, select the node (UMI) with the highest counts. Visit all nodes one edge away. If all nodes have been visited, stop. Otherwise, repeat with remaining nodes until all nodes have been visited. Each step defines a read group.
- directional (default)
- Identify clusters of connected UMIs in which UMIs A and B are connected only if they are within the hamming distance threshold and UMI A counts >= (2 * UMI B counts) - 1. Each network is a read group.
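To make the directional rule concrete, here is a minimal sketch for the UMIs observed at a single position. It builds a directed edge from UMI A to UMI B when the two are within the hamming distance threshold and A counts >= (2 * B counts) - 1, then forms groups by a breadth-first walk seeded from the most abundant UMI. The function names, the tie-breaking order and the example counts are assumptions for illustration; this is not the tool's own implementation.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts, threshold=1):
    """Group UMIs observed at one position; `counts` maps UMI -> read count."""
    umis = sorted(counts, key=lambda u: (-counts[u], u))
    # Directed edge A -> B: close in sequence and A sufficiently more abundant
    edges = {
        a: [b for b in umis
            if a != b
            and hamming(a, b) <= threshold
            and counts[a] >= 2 * counts[b] - 1]
        for a in umis
    }
    groups, seen = [], set()
    for umi in umis:  # most abundant UMIs seed groups first
        if umi in seen:
            continue
        seen.add(umi)
        group, queue = [], deque([umi])
        while queue:
            node = queue.popleft()
            group.append(node)
            for nxt in edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups


# Made-up counts: ACGA is absorbed by ACGT, TTTT stands alone
print(directional_groups({"ACGT": 10, "ACGA": 3, "TTTT": 5}))
```

With counts of 10 and 8 for two UMIs one mismatch apart, no edge is drawn (10 < 2 * 8 - 1), so they remain separate groups.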
For the adjacency and cluster methods the threshold for the edit distance to connect two UMIs in the network can be increased. The default value of 1 works best unless the UMI is very long (>14bp).
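The effect of raising the threshold can be sketched with the cluster method, where each connected component of the UMI network is one group. The code below is an illustrative sketch with hypothetical function names and made-up UMIs, not the tool's implementation.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def cluster_groups(umis, threshold=1):
    """Connected components of the UMI network (the cluster method)."""
    remaining = set(umis)
    groups = []
    while remaining:
        seed = min(remaining)  # deterministic choice of starting node
        remaining.remove(seed)
        stack, component = [seed], [seed]
        while stack:
            node = stack.pop()
            linked = {u for u in remaining if hamming(node, u) <= threshold}
            remaining -= linked
            stack.extend(sorted(linked))
            component.extend(sorted(linked))
        groups.append(sorted(component))
    return sorted(groups)


umis = ["AAAA", "AAAT", "AATT", "CCCC", "CCGG"]
print(cluster_groups(umis, threshold=1))
print(cluster_groups(umis, threshold=2))
```

With a threshold of 1, CCCC and CCGG (two mismatches apart) are separate groups; with a threshold of 2 they merge, which shows how a larger threshold can collapse UMIs that are genuinely distinct when the UMI is short.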
Causes two reads that start in the same position on the same strand and having the same UMI to be considered unique if one is spliced and the other is not. (Uses the ‘N’ cigar operation to test for splicing).
Mappers that soft clip will sometimes do so rather than mapping a spliced read if there is only a small overhang over the exon junction. By setting this option, you can treat reads with at least this many bases soft-clipped at the 3’ end as spliced. Default=4.
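A sketch of the splicing test described above: a read counts as spliced if its CIGAR contains an 'N' operation, or if at least the threshold number of bases are soft-clipped at its 3’ end. The function name is hypothetical, the `>=` comparison follows the "at least this many" wording above (the tool's own comparison may differ), and treating the first CIGAR operation as the 3’ end for reverse-strand reads is an assumption.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def is_spliced(cigar, is_reverse=False, soft_clip_threshold=4):
    """Return True if the read looks spliced under the rules described above."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if any(op == "N" for _, op in ops):
        return True
    # Assumption: the 3' end is the last operation on the forward strand
    # and the first operation on the reverse strand
    length, op = ops[0] if is_reverse else ops[-1]
    return op == "S" and length >= soft_clip_threshold


print(is_spliced("50M100N50M"))  # True: explicit splice
print(is_spliced("96M4S"))       # True: 4 bases soft-clipped at the 3' end
print(is_spliced("97M3S"))       # False: below the threshold
```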
If the SAM/BAM contains tags to identify multimapping reads, you can specify which tag to use when selecting the best read at a given locus. Supported tags are “NH”, “X0” and “XT”. If not specified, the read with the highest mapping quality will be selected.
Use the read length as a criterion when deduplicating, e.g. for sRNA-Seq.
Single-cell RNA-Seq options
Reads will be grouped together if they have the same gene. This is useful if your library prep generates PCR duplicates with non-identical alignment positions, such as CEL-Seq. Note this option is hardcoded to be on with the count command, i.e. counting is always performed per-gene. Must be combined with either
Deduplicate per gene. The gene information is encoded in the bam read tag specified
BAM tag which describes whether a read is assigned to a gene. Defaults to the same value as given for
Deduplicate per contig (field 3 in BAM; RNAME). All reads with the same contig will be considered to have the same alignment position. This is useful if you have aligned to a reference transcriptome with one transcript per gene. If you have aligned to a transcriptome with more than one transcript per gene, you can supply a map between transcripts and gene using the
File mapping genes to transcripts (tab separated), e.g.:
gene1   transcript1
gene1   transcript2
gene2   transcript3
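A file in this layout can be loaded into a transcript-to-gene lookup with a few lines of Python. This is an illustrative sketch with a hypothetical function name; it assumes gene in the first column and transcript in the second, as in the example above.

```python
import csv
import io


def load_transcript_to_gene(handle):
    """Read a tab-separated gene/transcript file into a transcript -> gene dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene, transcript = row[0], row[1]
        mapping[transcript] = gene
    return mapping


example = io.StringIO("gene1\ttranscript1\ngene1\ttranscript2\ngene2\ttranscript3\n")
print(load_transcript_to_gene(example))
```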
Reads will only be grouped together if they have the same cell barcode. Can be combined with
Minimum mapping quality (MAPQ) for a read to be retained. Default is 0.
How should unmapped reads be handled? Options are:
- discard (default)
- Discard all unmapped reads
- use
- If read2 is unmapped, deduplicate using read1 only. Requires --paired
- output
- Output unmapped reads/read pairs without UMI grouping/deduplication. Only available in umi_tools group
How should chimeric read pairs be handled? Options are:
- discard
- Discard all chimeric read pairs
- use (default)
- Deduplicate using read1 only
- output
- Output chimeric read pairs without UMI grouping/deduplication. Only available in umi_tools group
How should unpaired reads be handled? Options are:
- discard
- Discard all unpaired reads
- use (default)
- Deduplicate using read1 only
- output
- Output unpaired reads without UMI grouping/deduplication. Only available in umi_tools group
Ignore the UMI and group reads using mapping coordinates only
Only consider a fraction of the reads, chosen at random. This is useful for doing saturation analyses.
Only consider a single chromosome. This is useful for debugging/testing purposes
By default, inputs are assumed to be in BAM format and outputs are written in BAM format. Use these options to specify the use of SAM format for input or output.
BAM is paired end - output both read pairs. This will also force the use of the template length to determine reads with the same mapping coordinates.
By default, output is sorted. This involves the use of a temporary unsorted file, since reads are considered in the order of their start position, which may not be the same as their alignment coordinate due to soft-clipping and reverse alignments. The temp file will be saved (in --temp-dir) and deleted when it has been sorted to the outfile. Use this option to turn off sorting.
Forces dedup to parse an entire contig before yielding any reads for deduplication. This is the only way to absolutely guarantee that all reads with the same start position are grouped together for deduplication, since dedup uses the start position of the read, not the alignment coordinate on which the reads are sorted. However, by default, dedup reads ahead a further 1000bp before outputting read groups, which avoids any reads being missed with short read sequencing (<1000bp).
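The default buffering behaviour can be pictured with the sketch below. Reads arrive sorted by alignment coordinate but are grouped by start position, so a group is only released once the current alignment coordinate has moved more than a buffer distance past it. The function name, the tuple layout and the example reads are assumptions for illustration, not the tool's implementation.

```python
from collections import defaultdict


def stream_groups(reads, buffer_bp=1000):
    """Yield (start, names) groups from reads sorted by alignment coordinate.

    `reads` is an iterable of (alignment_pos, start_pos, name) tuples
    from a single contig.
    """
    pending = defaultdict(list)
    for alignment_pos, start_pos, name in reads:
        # Starts more than buffer_bp behind the current coordinate are
        # assumed complete and can be released
        done = sorted(s for s in pending if s < alignment_pos - buffer_bp)
        for start in done:
            yield start, pending.pop(start)
        pending[start_pos].append(name)
    # End of contig: release everything still buffered
    for start in sorted(pending):
        yield start, pending[start]


# Made-up reads: "b" aligns at 102 but starts at 100 after soft-clip adjustment
reads = [(100, 100, "a"), (102, 100, "b"), (1500, 1500, "c")]
print(list(stream_groups(reads)))
```

Parsing a whole contig before yielding corresponds to an unbounded buffer in this picture, which trades memory for the guarantee that no read with a matching start position is missed.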