Each tool has a set of Common options for input/output, profiling and debugging.

prepare_for_em - make the output from dedup compatible with EM tools

prepare_for_em - make output from dedup or group compatible with RSEM or Salmon

Usage:

umi_tools prepare_for_em [OPTIONS] [--stdin=IN_BAM] [--stdout=OUT_BAM]

   note: If ``--stdout`` is omitted, output is written to standard out.
         To generate a valid BAM/SAM/CRAM file on standard out, please
         redirect the log with ``--log=LOGFILE`` or ``--log2stderr``

The SAM format specification states that the mnext and mpos fields should point to the primary alignment of a read’s mate. However, not all aligners adhere to this standard.

In general (except in a few edge cases), UMI-tools outputs only the read2 that corresponds to the mnext and mpos fields of a selected read1, and outputs this read only once, even if multiple read1s point to it.

This makes UMI-tools output incompatible with some downstream tools, notably RSEM and Salmon (although we recommend using Alevin if you want to quantify single cell RNA-seq from protocols that Alevin supports). This script takes the output from dedup or group and ensures that each read1 has exactly one read2 (and vice versa), that read2 always appears directly after read1, and that pairs point to each other (note this is technically not valid SAM format). It also copies any specified tags from read1 to read2 if they are present (by default, UG and BX, the unique group and corrected UMI tags added by group).
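The pairing rule described above can be illustrated with a toy model. This is a minimal sketch, not the UMI-tools implementation: the `Read` class and `pair_reads` function are hypothetical names, records are plain Python objects rather than real alignments, and skipping a read1 whose mate is missing is an assumption made only to keep the example short.

```python
from dataclasses import dataclass, field, replace
from itertools import groupby


@dataclass
class Read:
    """Toy stand-in for an alignment record (hypothetical, not a pysam object)."""
    name: str
    is_read1: bool
    pos: int        # start position of this alignment
    mate_pos: int   # position this record claims for its mate
    tags: dict = field(default_factory=dict)


def pair_reads(name_sorted_reads, tags_to_copy=("UG", "BX")):
    """Yield read1, read2, read1, read2, ... from name-sorted input."""
    for _, group in groupby(name_sorted_reads, key=lambda r: r.name):
        group = list(group)
        read2_by_pos = {r.pos: r for r in group if not r.is_read1}
        for read1 in (r for r in group if r.is_read1):
            read2 = read2_by_pos.get(read1.mate_pos)
            if read2 is None:
                continue  # assumption of this sketch: drop read1s with no mate
            copied = {t: read1.tags[t] for t in tags_to_copy if t in read1.tags}
            yield read1
            # Emit a fresh copy of read2 that points back at this read1
            # and carries the tags copied from it.
            yield replace(read2, mate_pos=read1.pos, tags={**read2.tags, **copied})


# Two read1 alignments point at the same read2, which appears only once.
reads = [
    Read("q1", True, 100, 300, {"UG": 7, "BX": "ACGT"}),
    Read("q1", True, 150, 300, {"UG": 7, "BX": "ACGT"}),
    Read("q1", False, 300, 100),
]
paired = list(pair_reads(reads))
```

In this example the single read2 is emitted twice, once after each read1, so every read1 is immediately followed by a mate that points back to it.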

In order for this to work correctly, your input file must be sorted by read name. Generally the protocol would be:

  1. Align reads to the transcriptome with your favourite aligner.

  2. Position sort the resulting BAM file.

  3. Run dedup on the position-sorted file.

  4. Use samtools sort -n or samtools collate to sort by read name.

  5. Use prepare_for_em to create a file in which each read has exactly one mate and pairs are adjacent in the output.

  6. Run your downstream tools (RSEM, Salmon or kallisto) on the output.
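Steps 2 to 5 of the protocol above might be scripted roughly as follows. This is a sketch under stated assumptions: `build_pipeline` and all file names are hypothetical, the `--stdin`/`--stdout` form is taken from the usage line above, and protocol-specific dedup options are deliberately left out. The function only builds the command lines; nothing is executed.

```python
import shlex


def build_pipeline(aligned_bam, prefix, paired=True):
    """Return the command lines for protocol steps 2-5, in order.

    File names derived from `prefix` are illustrative only.
    """
    pos_sorted = f"{prefix}.pos_sorted.bam"
    deduped = f"{prefix}.dedup.bam"
    name_sorted = f"{prefix}.dedup.name_sorted.bam"
    prepared = f"{prefix}.prepared.bam"

    dedup = [
        "umi_tools", "dedup",
        f"--stdin={pos_sorted}",
        f"--stdout={deduped}",
        f"--log={prefix}.dedup.log",
    ]
    if paired:
        dedup.append("--paired")

    return [
        # Step 2: position sort, then index so dedup can read the file.
        ["samtools", "sort", "-o", pos_sorted, aligned_bam],
        ["samtools", "index", pos_sorted],
        # Step 3: deduplicate the position-sorted file.
        dedup,
        # Step 4: re-sort the deduplicated file by read name.
        ["samtools", "sort", "-n", "-o", name_sorted, deduped],
        # Step 5: make mates adjacent and mutually consistent.
        [
            "umi_tools", "prepare_for_em",
            f"--stdin={name_sorted}",
            f"--stdout={prepared}",
            f"--log={prefix}.prepare.log",
        ],
    ]


commands = build_pipeline("aligned.bam", "sample")
for cmd in commands:
    print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)` so that a failure in one step stops the rest.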

prepare_for_em specific options

--tags=TAG[,TAG...]

List of SAM tags that are transferred from read1 to read2. The default is UG and BX, which are the numeric UMI group and the inferred true UMI respectively.

Input/Output Format Options

The following options deal with input and output format, and are useful for outputting CRAM format. In general UMI-tools will attempt to guess the input and output formats from the file names, but this can be overridden using the --out-format and --in-format options. The location of CRAM reference files will be taken either from an input CRAM file (if present) or from the --reference-filename option. Otherwise the reference will be embedded in the file.

--in-format=IN_FORMAT

File format of the input file. Format is usually implied from the extension of the filename, but may be overridden with this option. Default=bam

--input-options=INPUT_OPTIONS

Format string provided to htslib for reading. Mostly useful for CRAM formatted files. See samtools documentation

--in-sam

[DEPRECATED] Use --in-format instead. By default, inputs are assumed to be in BAM format. Use this option to specify the use of SAM format for input.

--reference-filename=REFERENCE_FILENAME

File path or URL to the genome reference to be used when reading or writing CRAM files. By default, when reading a CRAM file, the reference recorded in the input file will be used unless this option is specified. URL references cannot be read from input files, however. When writing, specifying a reference location is required unless one is specified in the input.

--out-format=OUT_FORMAT

File format of the output file. Format is usually implied from the extension of the filename, but may be overridden with this option. Default=bam

--output-options=OUTPUT_OPTIONS

Format string provided to htslib for writing. Mostly useful for CRAM formatted files. See samtools documentation
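The interaction between file-name guessing and the explicit format options can be illustrated as below. This is a simplified sketch, not the actual UMI-tools logic: `guess_format` and `output_args` are hypothetical helpers, and the lowercase values `sam`, `bam` and `cram` are assumed to be what the format options accept.

```python
import os

# Assumed set of format names; check the tool's own help for the real list.
KNOWN_FORMATS = ("sam", "bam", "cram")


def guess_format(filename, override=None, default="bam"):
    """Pick a format: an explicit override wins, then the extension, then the default."""
    if override is not None:
        fmt = override.lower()
        if fmt not in KNOWN_FORMATS:
            raise ValueError(f"unknown format: {override}")
        return fmt
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return ext if ext in KNOWN_FORMATS else default


def output_args(out_path, out_format=None, reference=None):
    """Build the output-related options for a umi_tools command line."""
    args = [
        f"--stdout={out_path}",
        f"--out-format={guess_format(out_path, out_format)}",
    ]
    if reference is not None:
        # Needed for CRAM output when the input does not carry a reference.
        args.append(f"--reference-filename={reference}")
    return args


cram_args = output_args("prepared.cram", reference="transcriptome.fa")
```

Passing the format explicitly is mainly useful when the output name gives no hint, for example when writing to standard out.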