Each tool has a set of Common options for input/output, profiling and debugging.

whitelist - Identify the likely true cell barcodes

Extract cell barcodes and identify the most likely true cell barcodes

Usage:

For single-end reads, the following reads from stdin and outputs to stdout:

umi_tools whitelist --bc-pattern=[PATTERN] -L extract.log [OPTIONS]

For paired-end reads where the cell barcode is split across the read pairs, the following reads end one from stdin and end two from FASTQIN and outputs to stdout:

umi_tools whitelist --bc-pattern=[PATTERN] --bc-pattern2=[PATTERN] --read2-in=[FASTQIN] -L extract.log [OPTIONS]

Output:

The whitelist is output as 4 tab-separated columns:

  1. Whitelisted cell barcode
  2. Other cell barcode(s) (comma-separated) to correct to the whitelisted barcode
  3. Count for whitelisted cell barcodes
  4. Count(s) for the other cell barcode(s) (comma-separated)

Example output:

AAAAAA      AGAAAA          146     1
AAAATC                      22
AAACAT                      21
AAACTA      AAACTN,GAACTA   27      1,1
AAATAC                      72
AAATCA      GAATCA          37      3
AAATGT      AAAGGT,CAATGT   41      1,1
AAATTG      CAATTG          36      1
AACAAT                      18
AACATA                      24

If --error-correct-threshold is set to 0, columns 2 and 4 will be empty.
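As an illustration of the four-column layout, the following minimal Python sketch parses whitelist lines and builds a lookup from every listed barcode to its whitelisted barcode. The function names parse_whitelist_line and build_correction_map are hypothetical helpers for this example and are not part of UMI-tools.

```python
def parse_whitelist_line(line):
    """Split one whitelist line into its four tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    # Columns 2 and 4 may be empty; pad so that four fields are always present.
    fields += [""] * (4 - len(fields))
    barcode, others, count, other_counts = fields[:4]
    other_barcodes = others.split(",") if others else []
    other_counts = [int(c) for c in other_counts.split(",")] if other_counts else []
    return barcode, other_barcodes, int(count), other_counts


def build_correction_map(lines):
    """Map every barcode in columns 1 and 2 to its whitelisted barcode."""
    correction = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        barcode, other_barcodes, _, _ = parse_whitelist_line(line)
        correction[barcode] = barcode
        for other in other_barcodes:
            correction[other] = barcode
    return correction


example = ["AAACTA\tAAACTN,GAACTA\t27\t1,1\n", "AAAATC\t\t22\t\n"]
print(build_correction_map(example))
```

A lookup of this kind is one way to apply the corrections in column 2 when processing reads downstream.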

Identifying the true cell barcodes

In the absence of the --set-cell-number option, whitelist finds the ‘knee’ in the curve of cumulative read counts per cell barcode (CB) or unique UMIs per CB (--method=[reads|umis]). Previously, this point was identified using the distribution of read counts per CB or unique UMIs per CB. The old behaviour can be activated using --knee-method=density.

See this blog post for a more detailed exploration of the previous method:

https://cgatoxford.wordpress.com/2017/05/18/estimating-the-number-of-true-cell-barcodes-in-single-cell-rna-seq/

Counting per cell barcode can be performed using either read counts or unique UMI counts. Use --method=[reads|umis] to set the counting method.
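The difference between the two counting methods can be sketched in a few lines of Python. This is an illustration only: the function count_per_cell and the (cell barcode, UMI) tuple input are assumptions made for the example, not the UMI-tools internals.

```python
from collections import Counter, defaultdict


def count_per_cell(records, method="reads"):
    """Count reads or unique UMIs per cell barcode.

    records: iterable of (cell_barcode, umi) tuples, one per read.
    """
    if method == "reads":
        # Every read contributes one count to its cell barcode.
        return Counter(cb for cb, _ in records)
    if method == "umis":
        # Each distinct UMI is counted once per cell barcode.
        umis = defaultdict(set)
        for cb, umi in records:
            umis[cb].add(umi)
        return Counter({cb: len(seen) for cb, seen in umis.items()})
    raise ValueError("method must be 'reads' or 'umis'")


reads = [("AAAA", "GT"), ("AAAA", "GT"), ("AAAA", "CC"), ("TTTT", "GT")]
print(count_per_cell(reads, "reads"))
print(count_per_cell(reads, "umis"))
```

With duplicated reads, the two methods give different counts for the same cell barcode, which can shift where the knee falls.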

The process of selecting the “best” local minimum with --knee-method=density is not completely foolproof. We recommend users always run whitelist with the --plot-prefix option to visualise the set of thresholds considered for defining cell barcodes. This option will also generate a table containing the thresholds which were rejected, in case you want to manually adjust the threshold. If you expect that a local minimum will not be found, you can use the --allow-threshold-error option to allow whitelist to proceed past this stage. In addition, if you have some prior expectation of the maximum number of cells which may have been sequenced, you can provide this using the option --expect-cells (see below).

If you don’t mind whitelist --knee-method=density failing to identify a suitable threshold because you intend to inspect the plots and identify the threshold manually, provide the following options: --allow-threshold-error, --plot-prefix=[PLOT_PREFIX]
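To make the idea of candidate thresholds at local minima more concrete, here is a rough Python sketch of a density-based search. It is not the UMI-tools implementation: the log10 transform, the fixed bandwidth, the evaluation grid and the function name density_local_minima are all assumptions chosen for illustration.

```python
import math


def density_local_minima(counts, bandwidth=0.3, grid_size=200):
    """Return candidate count thresholds at local minima of a Gaussian
    kernel density estimate over log10(counts). All parameters are
    illustrative assumptions."""
    logs = [math.log10(c) for c in counts if c > 0]
    if not logs:
        return []
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    # Unnormalised Gaussian kernel density evaluated at each grid point.
    density = [
        sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in logs)
        for x in grid
    ]
    # A local minimum is a grid point lower than both of its neighbours.
    return [
        10 ** grid[i]
        for i in range(1, grid_size - 1)
        if density[i] < density[i - 1] and density[i] < density[i + 1]
    ]


# Toy data: 5 barcodes with ~1000 counts and 20 with ~10 counts.
print(density_local_minima([1000] * 5 + [10] * 20))
```

When the counts are not clearly bimodal, a search like this can return several minima or none at all, which is why inspecting the plots is recommended.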

We expect that the default distance-based knee method should be more robust than the density-based method. However, we haven’t extensively tested this method. If you have a dataset where you believe the density-based method is better, please share this information with us: https://github.com/CGATOxford/UMI-tools/issues

Finally, in some datasets there may be a risk that CBs above the selected threshold are actually errors from another CB. We can detect potential instances of this by looking for CBs within one error (substitution, insertion or deletion) of another CB with higher counts. One can then either take a conservative approach (remove the CB with lower counts), or a more relaxed approach (correct the CB with lower counts to the CB with higher counts). Note that correction is only possible for substitutions, since insertions and deletions may also affect the UMI, so these are always discarded. See --ed-above-threshold=[discard|correct] below. Of course, the risk with the relaxed approach is that this may erroneously merge two truly different CBs together and create an in-silico “doublet”. The end of the log file (--log) will detail the number of reads from CBs above the threshold which may be errors. In most cases, we expect the number of reads to be a very small fraction of the total reads and therefore recommend taking the conservative approach. See https://cgatoxford.wordpress.com/2017/05/23/estimating-the-number-of-true-cell-barcodes-in-single-cell-rna-seq-part-2/ for an analysis of errors in barcodes above the knee threshold.

whitelist-specific options

--method

“reads” or “umis”. Count either reads or unique UMIs per cell barcode.

--knee-method

“distance” or “density”. Two methods are available to detect the ‘knee’ in the cell barcode count distributions. “distance” identifies the maximum distance between the cumulative distribution curve and a straight line between the first and last points on the cumulative distribution curve. “density” transforms the counts per UMI into a gaussian density and then finds the local minimum which separates “real” from “error” cell barcodes. The “density” method was the only method available prior to UMI-tools v1.0.0. “distance” is now the default method.
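The “distance” idea can be sketched as follows. This is a minimal illustration, not the UMI-tools code: the function name knee_by_distance, the normalisation of both axes to the range 0 to 1, and the use of perpendicular distance are assumptions.

```python
import math


def knee_by_distance(counts):
    """Return the number of barcodes up to and including the knee."""
    ordered = sorted(counts, reverse=True)
    n = len(ordered)
    total = float(sum(ordered))
    if n < 2 or total == 0:
        return n
    # Cumulative fraction of counts (y) against fraction of barcodes (x).
    ys, running = [], 0
    for c in ordered:
        running += c
        ys.append(running / total)
    xs = [i / (n - 1) for i in range(n)]
    x0, y0, x1, y1 = xs[0], ys[0], xs[-1], ys[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    best_index, best_distance = 0, -1.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        # Perpendicular distance from (x, y) to the first-to-last line.
        d = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0) / length
        if d > best_distance:
            best_index, best_distance = i, d
    return best_index + 1


# Toy data: 5 barcodes with 100 counts each and 95 with a single count.
print(knee_by_distance([100] * 5 + [1] * 95))
```

Because the maximum distance is taken over a single curve, this approach yields exactly one knee for any input.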

--set-cell-number

Use this option to explicitly set the number of cell barcodes which should be accepted. Note that the exact number of cell barcodes in the output whitelist may be slightly less than this if there are multiple cells observed with the same frequency at the threshold between accepted and rejected cell barcodes.

--expect-cells

An upper limit estimate for the number of input cells. With this option set, the knee method selects the first threshold (in ascending order) which results in the number of cell barcodes accepted being <= EXPECTED_CELLS and > EXPECTED_CELLS * 0.1. Note: This is not compatible with the default --knee-method=distance since there is always a single solution using this method.
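The selection rule can be written out as a short Python sketch. The function name pick_threshold, the list of candidate thresholds, and the choice to accept a barcode when its count is strictly greater than the threshold are assumptions made for illustration.

```python
def pick_threshold(counts, candidate_thresholds, expect_cells):
    """Return the first candidate threshold (ascending) that accepts
    <= expect_cells and > 0.1 * expect_cells barcodes, else None."""
    for threshold in sorted(candidate_thresholds):
        # Assumption: a barcode is accepted if its count exceeds the threshold.
        accepted = sum(1 for c in counts if c > threshold)
        if expect_cells * 0.1 < accepted <= expect_cells:
            return threshold
    return None


counts = [1000] * 50 + [100] * 200 + [5] * 2000
print(pick_threshold(counts, [10, 300], expect_cells=100))
print(pick_threshold(counts, [10, 300], expect_cells=1000))
```

In this toy input, a threshold of 10 accepts 250 barcodes and a threshold of 300 accepts 50, so the chosen threshold depends on the expected number of cells.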

--allow-threshold-error

This is useful if you want the command to exit with just a warning if a suitable threshold cannot be selected.

--error-correct-threshold

Hamming distance for correction of barcodes to whitelist barcodes. This value will also be used for error detection above the knee if required (--ed-above-threshold)

--plot-prefix

Use this option to indicate the prefix for the plots and table describing the set of thresholds considered for defining cell barcodes

--ed-above-threshold=[discard|correct]

Detect CBs above the threshold which may be sequence errors:

  • “discard”
    Discard all putative error CBs.
  • “correct”
    Correct putative substitution errors in CBs above the threshold. Discard putative insertions/deletions. Note that correction is only possible when the CB contains only substitutions, since insertions and deletions may cause errors in the UMI sequence too.

Where a CB could be corrected to two other CBs, correction is not possible. In these cases, the CB will be discarded regardless of which option is used.
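The discard and correct behaviours can be sketched for the substitution case as follows. This is an illustrative simplification: insertions and deletions are not handled, and the function resolve_errors and its return values are hypothetical, not the UMI-tools internals.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def resolve_errors(counts, mode="discard", max_distance=1):
    """counts: dict of CB -> count for CBs above the threshold.

    Returns (accepted, corrections, discarded). Substitutions only.
    """
    corrections, discarded = {}, set()
    for cb, n in counts.items():
        # Higher-count CBs within max_distance substitutions of this CB.
        parents = [
            other for other, m in counts.items()
            if m > n and len(other) == len(cb)
            and 0 < hamming(cb, other) <= max_distance
        ]
        if not parents:
            continue
        if mode == "correct" and len(parents) == 1:
            corrections[cb] = parents[0]
        else:
            # Ambiguous CBs are discarded whichever mode is used.
            discarded.add(cb)
    accepted = set(counts) - discarded - set(corrections)
    return accepted, corrections, discarded


above = {"AAAA": 100, "AACC": 90, "AACA": 3, "TTTT": 50, "TTTG": 2}
print(resolve_errors(above, "discard"))
print(resolve_errors(above, "correct"))
```

Here AACA is one substitution away from both AAAA and AACC, so it is discarded in either mode, whereas TTTG has a single candidate and can be corrected to TTTT.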

--subset-reads

Use the first N reads to automatically identify the true cell barcodes. If N is greater than the number of reads, all reads will be used. Default is 100000000 (100 million).
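Taking the first N reads from a FASTQ stream can be sketched with itertools.islice. The helper names fastq_records and first_n_reads are hypothetical, and the sketch assumes well-formed four-line FASTQ records.

```python
import io
from itertools import islice


def fastq_records(handle):
    """Yield (name, sequence, separator, quality) tuples from a FASTQ handle."""
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:
            return  # end of file
        yield tuple(record)


def first_n_reads(handle, n):
    """Return at most n records; fewer if the input is shorter."""
    return list(islice(fastq_records(handle), n))


fastq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTGG\n+\nIIII\n@r3\nCCAA\n+\nIIII\n")
print(len(first_n_reads(fastq, 2)))
```

If N exceeds the number of records, islice simply stops at the end of the input, so all reads are used.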

Barcode extraction

--bc-pattern

Pattern for barcode(s) on read 1. See --extract-method

--bc-pattern2

Pattern for barcode(s) on read 2. See --extract-method

--extract-method

There are two methods enabled to extract the UMI barcode (+/- cell barcode). For both methods, the patterns should be provided using the --bc-pattern and --bc-pattern2 options.
  • string

    This should be used where the barcodes are always in the same place in the read.

    • N = UMI position (required)
    • C = cell barcode position (optional)
    • X = sample position (optional)

    Bases with Ns and Cs will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read. Bases with an X will be reattached to the read.

    E.g. if the pattern is NNNNCC, then the read:

    @HISEQ:87:00000000 read1
    AAGGTTGCTGATTGGATGGGCTAG
    +
    DA1AEBFGGCG01DFH00B1FF0B
    

    will become:

    @HISEQ:87:00000000_TT_AAGG read1
    GCTGATTGGATGGGCTAG
    +
    FGGCG01DFH00B1FF0B
    

    where ‘TT’ is the cell barcode and ‘AAGG’ is the UMI.

  • regex

    This method allows for more flexible barcode extraction and should be used where the cell barcodes are variable in length. The regex option can also be used to filter out reads which do not contain an expected adapter sequence. UMI-tools uses the regex module rather than the more standard re module since the former also enables fuzzy matching.

    The regex must contain groups to define how the barcodes are encoded in the read. The expected groups in the regex are:

    • umi_n = UMI positions, where n can be any value (required)
    • cell_n = cell barcode positions, where n can be any value (optional)
    • discard_n = positions to discard, where n can be any value (optional)

    UMI positions and cell barcode positions will be extracted and added to the read name. The corresponding sequence qualities will be removed from the read.

    Discard bases and the corresponding quality scores will be removed from the read. All bases matched by other groups or components of the regex will be reattached to the read sequence

    For example, the following regex can be used to extract reads from the Klein et al inDrop data:

    (?P<cell_1>.{8,12})(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)(?P<cell_2>.{8})(?P<umi_1>.{6})T{3}.*
    

    With this regex, only reads with a 3’ T-tail and GAGTGATTGCTTGTGACGCCTT in the correct position to yield two cell barcodes of 8-12bp and 8bp respectively, and a 6bp UMI, will be retained.

    You can also specify fuzzy matching to allow errors. For example, if the discard group above was specified as below, this would enable matches with up to 2 substitution errors in the discard_1 group.

    (?P<discard_1>GAGTGATTGCTTGTGACGCCTT){s<=2}
    

    Note that all UMIs must be the same length for downstream processing with dedup, group or count commands
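The two extraction methods can be illustrated with the Python sketch below. It is a simplified approximation rather than the UMI-tools implementation: the function names extract_string and extract_regex are hypothetical, the read naming follows the string example above, the concatenation of multiple cell groups in sorted group-name order is an assumption, and the standard re module is used, so fuzzy matching such as {s<=2} is not supported here.

```python
import re


def extract_string(name, seq, qual, pattern):
    """Apply a 5' string pattern (N = UMI, C = cell barcode, X = keep)."""
    if len(seq) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    umi = cell = kept_seq = kept_qual = ""
    for symbol, base, q in zip(pattern, seq, qual):
        if symbol == "N":
            umi += base
        elif symbol == "C":
            cell += base
        elif symbol == "X":
            kept_seq += base
            kept_qual += q
        else:
            raise ValueError("unexpected pattern character: " + symbol)
    read_id, sep, comment = name.partition(" ")
    # Assumed naming scheme, following the example above: id_CELL_UMI.
    tag = ("_" + cell if cell else "") + "_" + umi
    n = len(pattern)
    return read_id + tag + sep + comment, kept_seq + seq[n:], kept_qual + qual[n:]


def extract_regex(seq, qual, pattern):
    """Return (cell, umi, seq, qual), or None if the read does not match."""
    match = pattern.match(seq)
    if match is None:
        return None  # the read would be filtered out
    umi = cell = ""
    removed = set()
    for group in sorted(match.groupdict()):
        if match.group(group) is None:
            continue
        if group.startswith("umi_"):
            umi += match.group(group)
        elif group.startswith("cell_"):
            cell += match.group(group)
        elif not group.startswith("discard_"):
            continue  # bases in any other group stay in the read
        removed.update(range(*match.span(group)))
    keep = [i for i in range(len(seq)) if i not in removed]
    return cell, umi, "".join(seq[i] for i in keep), "".join(qual[i] for i in keep)


INDROP = re.compile(
    r"(?P<cell_1>.{8,12})(?P<discard_1>GAGTGATTGCTTGTGACGCCTT)"
    r"(?P<cell_2>.{8})(?P<umi_1>.{6})T{3}.*"
)

print(extract_string("@HISEQ:87:00000000 read1", "AAGGTTGCTGATTGGATGGGCTAG",
                     "DA1AEBFGGCG01DFH00B1FF0B", "NNNNCC"))
read = "ACGTACGT" + "GAGTGATTGCTTGTGACGCCTT" + "TTGGCCAA" + "ACGTAC" + "TTTGGGG"
print(extract_regex(read, "I" * len(read), INDROP))
```

In both cases the sequence and quality strings are trimmed at the same positions, so they stay the same length after extraction.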

--3prime

By default the barcode is assumed to be on the 5’ end of the read, but use this option to specify that it is on the 3’ end instead. This option only works with --extract-method=string since 3’ encoding can be specified explicitly with a regex, e.g. .*(?P<umi_1>.{5})$
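The behaviour of the example regex can be checked with a short Python sketch. It uses the standard re module rather than the regex module, and the function name extract_3prime_umi is a hypothetical helper for illustration.

```python
import re

# The 3' regex from the text: the last five bases of the read are the UMI.
THREE_PRIME_UMI = re.compile(r".*(?P<umi_1>.{5})$")


def extract_3prime_umi(seq, qual):
    """Return (umi, trimmed_seq, trimmed_qual), or None if there is no match."""
    match = THREE_PRIME_UMI.match(seq)
    if match is None:
        return None
    start = match.start("umi_1")
    # Everything before the UMI group stays in the read.
    return match.group("umi_1"), seq[:start], qual[:start]


print(extract_3prime_umi("ACGTACGTAAGGC", "IIIIIIIIFFFFF"))
```

Reads shorter than the UMI length do not match the pattern and would therefore be filtered out.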

--read2-in

Filename for the second read of each pair (read 2)

--filtered-out

Write out reads not matching regex pattern or cell barcode whitelist to this file

--filtered-out2

Write out read pairs not matching regex pattern or cell barcode whitelist to this file

--ignore-read-pair-suffixes

Ignore /1 and /2 read name suffixes. Note that this option is required if the suffixes are not whitespace separated from the rest of the read name.