Each tool has a set of Common options for input/output, profiling and debugging.
dedup - Deduplicate reads using UMI and mapping coordinates
Deduplicate reads based on the mapping coordinate and the UMI attached to the read.
The identification of duplicate reads is performed in an error-aware
manner by building networks of related UMIs (see
--method). dedup can also handle cell barcoded input (see
--per-cell).
Usage:
umi_tools dedup --stdin=INFILE --log=LOGFILE [OPTIONS] > OUTFILE
Selecting the representative read
For every group of duplicate reads, a single representative read is retained. The following criteria are applied, in order, to select the read that will be retained from a group of duplicate reads:
1. The read with the lowest number of mapping coordinates (see
--multimapping-detection-method option)
2. The read with the highest mapping quality. Note that this is not the read sequencing quality and that if two reads have the same mapping quality then one will be picked at random regardless of the read quality.
Otherwise a read is chosen at random.
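The selection rules above can be sketched as a two-part comparison followed by a random tie-break. This is an illustrative sketch only, not the UMI-tools implementation; the Read record and its fields (n_alignments, mapq) are hypothetical names introduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Read:
    # Hypothetical record for illustration; not a UMI-tools or pysam type.
    name: str
    n_alignments: int  # number of mapping coordinates (e.g. from an NH tag)
    mapq: int          # mapping quality (MAPQ), not base/sequencing quality


def pick_representative(reads, rng=random):
    """Fewest mapping coordinates first, then highest MAPQ, then random."""
    best_key = min((r.n_alignments, -r.mapq) for r in reads)
    candidates = [r for r in reads if (r.n_alignments, -r.mapq) == best_key]
    return rng.choice(candidates)


group = [
    Read("r1", n_alignments=2, mapq=60),
    Read("r2", n_alignments=1, mapq=20),
    Read("r3", n_alignments=1, mapq=30),
]
print(pick_representative(group).name)  # r3
```

Here r1 loses on the number of mapping coordinates despite its higher MAPQ, and r3 beats r2 on MAPQ.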
Dedup-specific options
--output-stats=[PREFIX]
One can use the edit distance between UMIs at the same position as a quality control for the deduplication process by comparing with a null expectation of random sampling. For the random sampling, the observed frequency of UMIs is used to more reasonably model the null expectation.
Use this option to generate a stats outfile called:
- [PREFIX]_edit_distance.tsv
Reports the (binned) average edit distance between the UMIs at each position. Positions with a single UMI are reported separately. The edit distances are reported pre- and post-deduplication alongside the null expectation from random sampling of UMIs from the UMIs observed across all positions. Note that separate null distributions are reported since the null depends on the observed frequency of each UMI, which is different pre- and post-deduplication. The post-deduplication values should be closer to their respective null than the pre-deduplication values are to theirs.
In addition, this option will trigger reporting of further summary statistics for the UMIs which may be informative for selecting the optimal deduplication method or debugging.
Each unique UMI sequence may be observed [0-many] times at multiple positions in the BAM. The following files report the distribution for the frequencies of each UMI.
- [PREFIX]_per_umi_per_position.tsv
The _per_umi_per_position.tsv file simply tabulates the counts for unique combinations of UMI and position. E.g if prior to deduplication, we have two positions in the BAM (POSa, POSb), at POSa we have observed 2*UMIa, 1*UMIb and at POSb: 1*UMIc, 3*UMId, then the stats file is populated thus:
counts    instances_pre
1         2
2         1
3         1
If post deduplication, UMIb is grouped with UMIa such that POSa: 3*UMIa, then the instances_post column is populated thus:
counts    instances_pre    instances_post
1         2                1
2         1                0
3         1                2
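The tabulation in the example above can be reproduced with a short sketch. It assumes the per-position UMI counts are already available as dictionaries keyed by (position, UMI); the function name tabulate is hypothetical and this is not how UMI-tools itself builds the file.

```python
from collections import Counter

# Read counts for each (position, UMI) combination, from the example above.
pre = {("POSa", "UMIa"): 2, ("POSa", "UMIb"): 1,
       ("POSb", "UMIc"): 1, ("POSb", "UMId"): 3}
# After deduplication UMIb has been merged into UMIa at POSa.
post = {("POSa", "UMIa"): 3, ("POSb", "UMIc"): 1, ("POSb", "UMId"): 3}


def tabulate(pre_counts, post_counts):
    """Return rows of (counts, instances_pre, instances_post)."""
    instances_pre = Counter(pre_counts.values())
    instances_post = Counter(post_counts.values())
    all_counts = sorted(set(instances_pre) | set(instances_post))
    # A Counter returns 0 for a count value that was never seen.
    return [(c, instances_pre[c], instances_post[c]) for c in all_counts]


for row in tabulate(pre, post):
    print(*row, sep="\t")
```

Each row answers "how many UMI/position combinations were seen exactly this many times", before and after deduplication.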
- [PREFIX]_per_umi.tsv
The _per_umi.tsv table provides UMI-level summary statistics. Keeping in mind that each unique UMI sequence can be observed [0-many] times across multiple positions in the BAM, the following columns are reported:
- times_observed:
How many positions the UMI was observed at
- total_counts:
The total number of times the UMI was observed across all positions
- median_counts:
The median for the distribution of how often the UMI was observed at each position (excluding zeros)
Hence, whenever times_observed=1, total_counts==median_counts.
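As a rough illustration of how the three columns relate, the sketch below computes them from a hypothetical mapping of position to read count for a single UMI. The input layout and the function name per_umi_stats are assumptions made for this example.

```python
from statistics import median

# Hypothetical input: for each UMI, the read count at each position.
counts_by_umi = {
    "UMIa": {"POSa": 2, "POSc": 4},
    "UMIb": {"POSa": 1},
}


def per_umi_stats(position_counts):
    """Summarise one UMI; positions with zero counts are excluded."""
    observed = [c for c in position_counts.values() if c > 0]
    return {
        "times_observed": len(observed),
        "total_counts": sum(observed),
        "median_counts": median(observed),
    }


for umi, position_counts in counts_by_umi.items():
    print(umi, per_umi_stats(position_counts))
```

UMIb is seen at a single position, so its total and median counts coincide, as noted above.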
Extracting barcodes
It is assumed that the FASTQ files were processed with umi_tools extract before mapping and thus the UMI is the last word of the read name. e.g:
@HISEQ:87:00000000_AATT
where AATT is the UMI sequence.
If you have used an alternative method which does not separate the
read id and UMI with a “_”, such as bcl2fastq which uses “:”, you can
specify the separator with the option --umi-separator=<sep>,
replacing <sep> with e.g “:”.
Alternatively, if your UMIs are encoded in a tag, you can specify this
by setting the option --extract-umi-method=tag and setting the tag name
with the --umi-tag option. For example, if your UMIs are encoded in
the ‘UM’ tag, provide the following options:
--extract-umi-method=tag --umi-tag=UM
Finally, if you have used umis to extract the UMI +/- cell barcode,
you can specify --extract-umi-method=umis
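For the default read-name encoding, taking the UMI as "the last word of the read name" amounts to splitting on the last occurrence of the separator. A minimal sketch follows; umi_from_read_name is a hypothetical helper, not a UMI-tools function.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the UMI, taken as the text after the last separator."""
    _prefix, found, umi = read_name.rpartition(separator)
    if not found:
        raise ValueError(f"separator {separator!r} not in {read_name!r}")
    return umi


print(umi_from_read_name("HISEQ:87:00000000_AATT"))       # AATT
print(umi_from_read_name("HISEQ:87:00000000:AATT", ":"))  # AATT
```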
The start position of a read is considered to be the start of its alignment minus any soft clipped bases. A read aligned at position 500 with cigar 2S98M will be assumed to start at position 498.
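The soft-clip adjustment can be worked through in a few lines. This sketch covers only the forward-strand case described above and parses the CIGAR string directly; dedup_start is a hypothetical name, and reverse-strand reads are deliberately not handled here.

```python
import re

# Leading soft clip, optionally preceded by a hard clip (e.g. "5H2S93M").
LEADING_SOFT_CLIP = re.compile(r"^(?:\d+H)?(\d+)S")


def dedup_start(alignment_pos, cigar):
    """Alignment start minus any leading soft-clipped bases (forward reads)."""
    match = LEADING_SOFT_CLIP.match(cigar)
    clipped = int(match.group(1)) if match else 0
    return alignment_pos - clipped


print(dedup_start(500, "2S98M"))  # 498
print(dedup_start(500, "100M"))   # 500
```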
--extract-umi-method
How are the barcodes encoded in the read?
Options are:
- read_id (default)
Barcodes are contained at the end of the read separated as specified with
--umi-separator option
- tag
Barcodes are contained in one or more tags, see
--umi-tag/--cell-tag options
- umis
Barcodes were extracted using umis (https://github.com/vals/umis)
--umi-separator=[SEPARATOR]
Separator between read id and UMI. See
--extract-umi-method above. Default=_
--umi-tag=[TAG]
Tag which contains UMI. See
--extract-umi-method above
--umi-tag-split=[SPLIT]
Separate the UMI in tag by SPLIT and take the first element
--umi-tag-delimiter=[DELIMITER]
Separate the UMI in tag by DELIMITER and concatenate the elements
--cell-tag=[TAG]
Tag which contains cell barcode. See --extract-umi-method above
--cell-tag-split=[SPLIT]
Separate the cell barcode in tag by SPLIT and take the first element
--cell-tag-delimiter=[DELIMITER]
Separate the cell barcode in tag by DELIMITER and concatenate the elements
UMI grouping options
--method
What method to use to identify groups of reads with the same (or similar) UMI(s)?
All methods start by identifying the reads with the same mapping position.
The simplest methods, unique and percentile, group reads with the exact same UMI. The network-based methods, cluster, adjacency and directional, build networks where nodes are UMIs and edges connect UMIs with an edit distance <= threshold (usually 1). The groups of reads are then defined from the network in a method-specific manner. For all the network-based methods, each read group is equivalent to one read count for the gene.
- unique
Reads in a group share the exact same UMI
- percentile
Reads in a group share the exact same UMI. UMIs with counts < 1% of the median counts for UMIs at the same position are ignored.
- cluster
Identify clusters of connected UMIs (based on hamming distance threshold). Each network is a read group
- adjacency
Cluster UMIs as above. For each cluster, select the node (UMI) with the highest counts. Visit all nodes one edge away. If all nodes have been visited, stop. Otherwise, repeat with remaining nodes until all nodes have been visited. Each step defines a read group.
- directional (default)
Identify clusters of connected UMIs, where UMI A is connected to UMI B if they are within the Hamming distance threshold and UMI A counts >= (2 * UMI B counts) - 1. Each network is a read group.
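To make the directional rule concrete, here is a minimal sketch for the UMIs observed at one position. It is not the UMI-tools implementation: it assumes equal-length UMIs, uses Hamming distance, and walks the directed network breadth-first starting from the most abundant UMI, a traversal order chosen for this example. The function names are hypothetical.

```python
from collections import deque


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_groups(counts, threshold=1):
    """Group UMIs at one position; counts maps UMI -> read count."""
    umis = sorted(counts, key=lambda u: (-counts[u], u))
    # Directed edge A -> B when the UMIs are within the threshold and
    # count(A) >= 2 * count(B) - 1.
    edges = {
        a: [b for b in umis
            if a != b
            and hamming(a, b) <= threshold
            and counts[a] >= 2 * counts[b] - 1]
        for a in umis
    }
    groups, seen = [], set()
    for root in umis:  # most abundant UMIs first
        if root in seen:
            continue
        seen.add(root)
        group, queue = [root], deque([root])
        while queue:
            for nxt in edges[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    group.append(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups


example = {"ACGT": 10, "ACGA": 3, "ACTA": 1, "TTTT": 8, "TTTA": 7}
print(directional_groups(example))
# [['ACGT', 'ACGA', 'ACTA'], ['TTTT'], ['TTTA']]
```

TTTT and TTTA differ by one base but have similar counts (8 is less than 2 * 7 - 1), so they stay in separate groups, whereas the low-count ACGA and ACTA are absorbed into the ACGT group.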
--edit-distance-threshold
For the adjacency and cluster methods the threshold for the edit distance to connect two UMIs in the network can be increased. The default value of 1 works best unless the UMI is very long (>14bp).
--spliced-is-unique
Causes two reads that start in the same position on the same strand and having the same UMI to be considered unique if one is spliced and the other is not. (Uses the ‘N’ cigar operation to test for splicing).
--soft-clip-threshold
Mappers that soft clip will sometimes do so rather than mapping a spliced read if there is only a small overhang over the exon junction. By setting this option, you can treat reads with at least this many bases soft-clipped at the 3’ end as spliced. Default=4.
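The combined effect of the 'N' test and the soft-clip threshold can be sketched as below. This is an assumption-laden illustration rather than the tool's own code: treated_as_spliced is a hypothetical name, "at least this many bases" is read as >=, and the 3' end is taken to be the last CIGAR operation for forward reads and the first for reverse reads (CIGAR strings are written in reference orientation).

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def treated_as_spliced(cigar, soft_clip_threshold=4, reverse=False):
    """True if the CIGAR has an 'N' or enough 3' soft-clipped bases."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if any(op == "N" for _, op in ops):
        return True
    # Ignore hard clips so the soft clip, if any, is the outermost operation.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return False
    length, op = ops[0] if reverse else ops[-1]
    return op == "S" and length >= soft_clip_threshold


print(treated_as_spliced("50M1000N50M"))  # True
print(treated_as_spliced("96M4S"))        # True
print(treated_as_spliced("97M3S"))        # False
```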
--multimapping-detection-method=[NH/X0/XT]
If the SAM/BAM contains tags to identify multimapping reads, you can specify which tag to use when selecting the best read at a given locus. Supported tags are “NH”, “X0” and “XT”. If not specified, the read with the highest mapping quality will be selected.
--read-length
Use the read length as a criterion when deduplicating, e.g. for sRNA-Seq.
Single-cell RNA-Seq options
--per-gene
Reads will be grouped together if they have the same gene. This is useful if your library prep generates PCR duplicates with non identical alignment positions such as CEL-Seq. Note this option is hardcoded to be on with the count command. I.e counting is always performed per-gene. Must be combined with either
--gene-tag or --per-contig option.
--gene-tag
Deduplicate per gene. The gene information is encoded in the bam read tag specified
--assigned-status-tag
BAM tag which describes whether a read is assigned to a gene. Defaults to the same value as given for
--gene-tag
--per-contig
Deduplicate per contig (field 3 in BAM; RNAME). All reads with the same contig will be considered to have the same alignment position. This is useful if you have aligned to a reference transcriptome with one transcript per gene. If you have aligned to a transcriptome with more than one transcript per gene, you can supply a map between transcripts and gene using the
--gene-transcript-map option
--gene-transcript-map
File mapping genes to transcripts (tab separated), e.g:
gene1    transcript1
gene1    transcript2
gene2    transcript3
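A tab-separated map of this kind, with the gene in the first column and the transcript in the second, can be read into a transcript-to-gene lookup. The sketch below is illustrative only; read_gene_transcript_map is a hypothetical helper and the in-memory file stands in for a real map file.

```python
import csv
import io


def read_gene_transcript_map(handle):
    """Return {transcript: gene} from 'gene<TAB>transcript' lines."""
    transcript_to_gene = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        gene, transcript = row[0], row[1]
        transcript_to_gene[transcript] = gene
    return transcript_to_gene


example = io.StringIO(
    "gene1\ttranscript1\ngene1\ttranscript2\ngene2\ttranscript3\n"
)
print(read_gene_transcript_map(example))
# {'transcript1': 'gene1', 'transcript2': 'gene1', 'transcript3': 'gene2'}
```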
--per-cell
Reads will only be grouped together if they have the same cell barcode. Can be combined with
--per-gene.
SAM/BAM Options
--mapping-quality
Minimum mapping quality (MAPQ) for a read to be retained. Default is 0.
--unmapped-reads
How should unmapped reads be handled? Options are:
- discard (default)
Discard all unmapped reads
- use
If read2 is unmapped, deduplicate using read1 and output read1 only. Note that if read1 is unmapped, read2 will always be discarded irrespective of whether it is mapped. WARNING: May lead to unpaired reads in output. Requires
--paired
- output
Output unmapped reads/read pairs without UMI grouping/deduplication. Only available in umi_tools group
--chimeric-pairs
How should chimeric read pairs be handled? Options are:
- discard
Discard all chimeric read pairs
- use (default)
Deduplicate using read1 information only. Both read1 and read2 should still be output, as long as read2 is actually found. Can lead to unpaired reads in output if read1 is marked as having a mapped mate, but read2 is never found.
- output
Output chimeric read pairs without UMI grouping/deduplication. Only available in umi_tools group
--unpaired-reads
How should unpaired reads be handled? Options are:
- discard
Discard all unpaired reads. Note: Can still lead to unpaired reads in the output if a read1 is marked as having a mapped mate, but the mate is never found.
- use (default)
Deduplicate unpaired reads using read1 only. Note, unpaired read2s will still be discarded.
- output
Output unpaired reads without UMI grouping/deduplication. Only available in umi_tools group
--ignore-umi
Ignore the UMI and group reads using mapping coordinates only
--subset
Only consider a fraction of the reads, chosen at random. This is useful for doing saturation analyses.
--chrom
Only consider a single chromosome. This is useful for debugging/testing purposes
--paired
BAM is paired end - output both read pairs. This will also force the use of the template length to determine reads with the same mapping coordinates.
Group/Dedup options
--no-sort-output
By default, output is sorted. This involves the use of a temporary unsorted file since reads are considered in the order of their start position which may not be the same as their alignment coordinate due to soft-clipping and reverse alignments. The temp file will be saved (in
--temp-dir) and deleted when it has been sorted to the outfile. Use this option to turn off sorting.
--buffer-whole-contig
Forces dedup to parse an entire contig before yielding any reads for deduplication. This is the only way to absolutely guarantee that all reads with the same start position are grouped together for deduplication, since dedup uses the start position of the read, not the alignment coordinate on which the reads are sorted. However, by default, dedup continues reading for another 1000bp before outputting read groups, which will avoid any reads being missed with short read sequencing (<1000bp).
Input/Output Format Options
The following options deal with input and output format, and are useful for
outputting CRAM format. In general UMI-tools will attempt to guess the input
and output formats from the file names, but this can be overridden using
the --out-format and --in-format options. The location of CRAM
reference files will be taken from either an input CRAM file
(if present) or from the --reference-filename option. Otherwise
the reference will be embedded in the file.
--in-format=IN_FORMAT
File format of the input file. Format is usually implied from the extension of the filename, but may be overridden with this option. Default=bam
--input-options=INPUT_OPTIONS
Format string provided to htslib for reading. Mostly useful for CRAM formatted files. See samtools documentation
--in-sam
[DEPRECATED] Use
--in-format. By default, inputs are assumed to be in BAM format. Use this option to specify the use of SAM format for input.
--reference-filename=REFERENCE_FILENAME
File path or URL to the genome reference to be used when reading or writing CRAM files. By default, when reading a CRAM file, the reference recorded in the input file will be used unless this is specified. URL references cannot be read from input files, however. When writing, specifying a reference location is required unless specified in input.
--out-format=OUT_FORMAT
File format of the output file. Format is usually implied from the extension of the filename, but may be overridden with this option. Default=bam
--output-options=OUTPUT_OPTIONS
Format string provided to htslib for writing. Mostly useful for CRAM formatted files. See samtools documentation