Each tool has a set of Common options for input/output, profiling and debugging.

extract - Extract UMI from fastq

Extract the UMI barcode from a read and add it to the read name, leaving any sample barcode in place.

Can deal with paired end reads and UMIs split across the paired ends. Can also optionally extract cell barcodes and append these to the read name. See the section below for an explanation of how to encode the barcode pattern(s) to specify the position of the UMI +/- cell barcode.

Usage:

For single ended reads, the following reads from stdin and outputs to stdout:

umi_tools extract --extract-method=string
--bc-pattern=[PATTERN] -L extract.log [OPTIONS]

For paired end reads, the following reads end one from stdin and end two from FASTQIN and outputs end one to stdout and end two to FASTQOUT:

umi_tools extract --extract-method=string
--bc-pattern=[PATTERN] --bc-pattern2=[PATTERN]
--read2-in=[FASTQIN] --read2-out=[FASTQOUT] -L extract.log [OPTIONS]

Using regex and filtering against a whitelist of cell barcodes:

umi_tools extract --extract-method=regex
--bc-pattern=[REGEX] --whitelist=[WHITELIST_TSV]
-L extract.log [OPTIONS]

Filtering and correcting cell barcodes

umi_tools extract can optionally filter cell barcodes against a user-supplied whitelist (--whitelist). If a whitelist is not available for your data, e.g. if you have performed droplet-based scRNA-Seq, you can use the whitelist tool to generate one.

Cell barcodes which do not match the whitelist (user-generated or automatically generated) can also be optionally corrected using the --error-correct-cell option.

--error-correct-cell

Error correct cell barcodes to the whitelist (see --whitelist)

--whitelist

Whitelist of accepted cell barcodes. The whitelist should be in the following format (tab-separated):

AAAAAA    AGAAAA
AAAATC
AAACAT
AAACTA    AAACTN,GAACTA
AAATAC
AAATCA    GAATCA
AAATGT    AAAGGT,CAATGT

Where column 1 is the whitelisted cell barcodes and column 2 is the list (comma-separated) of other cell barcodes which should be corrected to the barcode in column 1. If the --error-correct-cell option is not used, this column will be ignored. Any additional columns in the whitelist input, such as the counts columns from the output of umi_tools whitelist, will be ignored.
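The whitelist format described above can be read with a few lines of Python. The following is a minimal sketch of the idea and not the UMI-tools implementation; the function names `load_whitelist` and `resolve_barcode` are hypothetical and exist only for this illustration.

```python
import io


def load_whitelist(handle, error_correct=False):
    """Parse a tab-separated whitelist into (accepted barcodes, correction map)."""
    accepted = set()
    corrections = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if not fields[0]:
            continue  # skip blank lines
        barcode = fields[0]
        accepted.add(barcode)
        # Column 2 is only consulted when error correction is requested;
        # any further columns (e.g. counts) are ignored.
        if error_correct and len(fields) > 1 and fields[1]:
            for variant in fields[1].split(","):
                corrections[variant] = barcode
    return accepted, corrections


def resolve_barcode(barcode, accepted, corrections):
    """Return the whitelisted barcode for a read, or None if it has no match."""
    if barcode in accepted:
        return barcode
    return corrections.get(barcode)


# Three rows taken from the example whitelist above.
example = io.StringIO(
    "AAAAAA\tAGAAAA\n"
    "AAAATC\n"
    "AAACTA\tAAACTN,GAACTA\n"
)
accepted, corrections = load_whitelist(example, error_correct=True)
print(resolve_barcode("AAAATC", accepted, corrections))  # exact match
print(resolve_barcode("GAACTA", accepted, corrections))  # corrected to AAACTA
print(resolve_barcode("TTTTTT", accepted, corrections))  # no match
```

With `error_correct=False` the correction map stays empty, which mirrors the statement that column 2 is ignored unless --error-correct-cell is used.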

--blacklist

Blacklist of cell barcodes to discard

--subset-reads=[N]

Only parse the first N reads

--quality-filter-threshold

Remove reads where any UMI base quality score falls below this threshold

--quality-filter-mask

If a UMI base has a quality below this threshold, replace the base with ‘N’

--quality-encoding

Quality score encoding. Choose from:
  • ‘phred33’ [33-77]
  • ‘phred64’ [64-106]
  • ‘solexa’ [59-106]
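To make the two quality options concrete, here is a minimal Python sketch of filtering and masking UMI bases by quality. It is an illustration under stated assumptions, not the UMI-tools code: the helper names are hypothetical, only the two Phred encodings are handled (an ASCII offset of 33 or 64), and the ‘solexa’ encoding is left out because it uses a different score scale.

```python
# ASCII offsets for the two Phred encodings; 'solexa' is deliberately omitted.
PHRED_OFFSETS = {"phred33": 33, "phred64": 64}


def umi_quality_scores(umi_quals, encoding="phred33"):
    """Convert the quality characters of the UMI bases to integer scores."""
    offset = PHRED_OFFSETS[encoding]
    return [ord(ch) - offset for ch in umi_quals]


def passes_threshold(umi_quals, threshold, encoding="phred33"):
    """True if no UMI base quality falls below the threshold."""
    return all(q >= threshold for q in umi_quality_scores(umi_quals, encoding))


def mask_umi(umi, umi_quals, threshold, encoding="phred33"):
    """Replace every UMI base whose quality is below the threshold with 'N'."""
    scores = umi_quality_scores(umi_quals, encoding)
    return "".join(base if q >= threshold else "N" for base, q in zip(umi, scores))


# UMI 'AAGG' with qualities 'DA1A' decodes to 35, 32, 16, 32 under phred33.
print(umi_quality_scores("DA1A"))
print(passes_threshold("DA1A", threshold=20))
print(mask_umi("AAGG", "DA1A", threshold=20))
```

The third base has a score of 16, so with a threshold of 20 the read would be removed by the filter option, or its UMI would become AANG under the mask option.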

--reconcile-pairs

Allow the read 2 infile to contain reads not in the read 1 infile. This enables support for upstream protocols where read 1 contains cell barcodes, and the read pairs have been filtered and corrected without regard to read 2.

Experimental options

Note

These options have not been extensively tested to ensure behaviour is as expected. If you have some suitable input files which we can use for testing, please contact us.

If you have a library preparation method where the UMI may be in either read, you can use the following options to search for the UMI in either read:

--either-read --extract-method --bc-pattern=[PATTERN1] --bc-pattern2=[PATTERN2]

Where both patterns match, the default behaviour is to discard both reads. If you want to select the read with the UMI with highest sequence quality, provide --either-read-resolve=quality.

Barcode extraction

--bc-pattern

Pattern for barcode(s) on read 1. See --extract-method

--bc-pattern2

Pattern for barcode(s) on read 2. See --extract-method

--extract-method

There are two methods enabled to extract the UMI barcode (+/- cell barcode). For both methods, the patterns should be provided using the --bc-pattern and --bc-pattern2 options.
  • string

    This should be used where the barcodes are always in the same place in the read.

    • N = UMI position (required)
    • C = cell barcode position (optional)
    • X = sample position (optional)

    Bases with Ns and Cs will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read. Bases with an X will be reattached to the read.

    E.g. if the pattern is NNNNCC, then the read:

    @HISEQ:87:00000000 read1
    AAGGTTGCTGATTGGATGGGCTAG
    +
    DA1AEBFGGCG01DFH00B1FF0B
    

    will become:

    @HISEQ:87:00000000_TT_AAGG read1
    GCTGATTGGATGGGCTAG
    +
    FGGCG01DFH00B1FF0B
    

    where ‘TT’ is the cell barcode and ‘AAGG’ is the UMI.

  • regex

    This method allows for more flexible barcode extraction and should be used where the cell barcodes are variable in length. Alternatively, the regex option can also be used to filter out reads which do not contain an expected adapter sequence. UMI-tools uses the regex module rather than the more standard re module since the former also enables fuzzy matching.

    The regex must contain groups to define how the barcodes are encoded in the read. The expected groups in the regex are:

    • umi_n = UMI positions, where n can be any value (required)
    • cell_n = cell barcode positions, where n can be any value (optional)
    • discard_n = positions to discard, where n can be any value (optional)

    UMI positions and cell barcode positions will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read.

    Discard bases and the corresponding quality scores will be removed from the read. All bases matched by other groups or components of the regex will be reattached to the read sequence.

    For example, the following regex can be used to extract reads from the Klein et al inDrop data:

    (?P<cell_1>.{8,12})(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)(?P<cell_2>.{8})(?P<umi_1>.{6})T{3}.*
    

    Where only reads with a 3’ T-tail and GAGTGATTGCTTGTGACGCCTT in the correct position to yield two cell barcodes of 8-12 and 8bp respectively, and a 6bp UMI will be retained.

    You can also specify fuzzy matching to allow errors. For example, specifying the discard group as below allows matches with up to 2 substitutions in the discard_1 group.

    (?P<discard_1>GAGTGATTGCTTGTGACGCCTT){s<=2}
    

    Note that all UMIs must be the same length for downstream processing with the dedup, group or count commands.
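The two extraction methods above can be sketched in Python as follows. This is a simplified illustration, not the UMI-tools implementation: the function names are hypothetical, only 5’ string patterns are handled, the standard re module is used in place of regex (so fuzzy matching such as {s<=2} is not available), and multi-part barcodes are assumed to be joined in read order. The inDrop-style read in the example is synthetic and made up for this sketch.

```python
import re


def extract_string(name, seq, qual, pattern):
    """Apply an N/C/X pattern to the 5' end of one read."""
    n = len(pattern)
    if len(seq) < n:
        raise ValueError("read is shorter than the barcode pattern")
    umi, cell, kept_seq, kept_qual = [], [], [], []
    for symbol, base, q in zip(pattern, seq, qual):
        if symbol == "N":
            umi.append(base)
        elif symbol == "C":
            cell.append(base)
        elif symbol == "X":
            # X bases are reattached, so keep both the base and its quality.
            kept_seq.append(base)
            kept_qual.append(q)
        else:
            raise ValueError(f"unexpected pattern character: {symbol!r}")
    return _rebuild(name, "".join(cell), "".join(umi),
                    "".join(kept_seq) + seq[n:], "".join(kept_qual) + qual[n:])


def extract_regex(name, seq, qual, pattern):
    """Apply a compiled regex with umi_n / cell_n / discard_n groups to one read."""
    match = pattern.match(seq)
    if match is None:
        return None  # the read does not match the pattern
    umi, cell, removed = [], [], set()
    # Assumption: multi-part barcodes are joined in the order they occur in the read.
    names = sorted((g for g in pattern.groupindex if match.start(g) != -1),
                   key=match.start)
    for group in names:
        start, end = match.span(group)
        if group.startswith("umi_"):
            umi.append(seq[start:end])
        elif group.startswith("cell_"):
            cell.append(seq[start:end])
        elif not group.startswith("discard_"):
            continue  # any other named group stays in the read
        removed.update(range(start, end))
    keep = [i for i in range(len(seq)) if i not in removed]
    return _rebuild(name, "".join(cell), "".join(umi),
                    "".join(seq[i] for i in keep), "".join(qual[i] for i in keep))


def _rebuild(name, cell, umi, seq, qual):
    """Append '_CELL_UMI' (or '_UMI' if there is no cell barcode) to the read identifier."""
    identifier, _, description = name.partition(" ")
    tag = "_".join(part for part in (cell, umi) if part)
    new_name = f"{identifier}_{tag}" + (f" {description}" if description else "")
    return new_name, seq, qual


# String method, using the read from the example above.
string_result = extract_string("@HISEQ:87:00000000 read1",
                               "AAGGTTGCTGATTGGATGGGCTAG",
                               "DA1AEBFGGCG01DFH00B1FF0B",
                               "NNNNCC")
print(string_result)

# Regex method, using the inDrop pattern from above on a synthetic read.
INDROP = re.compile(r"(?P<cell_1>.{8,12})(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)"
                    r"(?P<cell_2>.{8})(?P<umi_1>.{6})T{3}.*")
seq = "ACGTACGTAC" + "GAGTGATTGCTTGTGACGCCTT" + "TTGGCCAA" + "ACGTTG" + "TTT" + "GGGCCC"
qual = "A" * 10 + "B" * 22 + "C" * 8 + "D" * 6 + "E" * 9
regex_result = extract_regex("@read1", seq, qual, INDROP)
print(regex_result)
```

In both cases the returned sequence and quality strings have the same length, since every base removed from the read takes its quality score with it.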

--3prime

By default the barcode is assumed to be on the 5’ end of the read, but use this option to specify that it is on the 3’ end instead. This option only works with --extract-method=string since 3’ encoding can be specified explicitly with a regex, e.g. .*(?P<umi_1>.{5})$
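As a small illustration of the regex given above, the following Python sketch pulls a 5-base UMI from the 3’ end of a read. It uses the standard re module rather than regex, and the read and quality strings are made up for the example.

```python
import re

# Regex from the text above: the UMI is the last five bases of the read.
THREE_PRIME = re.compile(r".*(?P<umi_1>.{5})$")

# Made-up read for illustration.
seq = "GCTGATTGGATGGGCTAGAAGGT"
qual = "FGGCG01DFH00B1FF0BDA1AE"

match = THREE_PRIME.match(seq)
umi = match.group("umi_1")
cut = match.start("umi_1")

# Everything before the UMI is kept, together with the matching qualities.
trimmed_seq, trimmed_qual = seq[:cut], qual[:cut]
print(umi, trimmed_seq, trimmed_qual)
```

Because the leading .* is greedy and the pattern is anchored with $, the umi_1 group always captures the final five bases.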

--read2-in

Filename for read pairs

--filtered-out

Write out reads not matching regex pattern or cell barcode whitelist to this file

--filtered-out2

Write out read pairs not matching regex pattern or cell barcode whitelist to this file

--ignore-read-pair-suffixes

Ignore ‘/1’ and ‘/2’ read name suffixes. Note that this option is required if the suffixes are not whitespace-separated from the rest of the read name.