Quick start guide ============================= This quick start guide uses an iCLIP dataset as an example. If you want to apply UMI-tools in a single cell RNA-Seq analysis, please see the [Single cell tutorial](Single_cell_tutorial.md) - [UMI-Tools quick start guide](#umi-tools-quick-start-guide) * [Step 1: Install `UMI-Tools`](#step-1--install--umi-tools-) * [Step 2: Download the test data](#step-2--download-the-test-data) * [Step 3: Extract the UMIs](#step-3--extract-the-umis) * [Step 4: Mapping](#step-4--mapping) * [Step 5: Deduplication](#step-5--deduplication) * [Common variations](#common-variations) + [Paired-end sequencing](#paired-end-sequencing) + [Read grouping](#read-grouping) + [Other options](#other-options) The following steps will guide you through a short example of how to use the `UMI-tools` package to process data with UMIs added to them. The data used comes from one of the control replicates from [Mueller-Mcnicoll et al, *Genes Dev* (2016) 30: 553](http://genesdev.cshlp.org/content/30/5/553). We have adaptor trimmed and filtered the data to reduce its size. The general pipeline is: extract UMI from raw reads -> map reads -> deduplicate reads based on UMIs The most computationally intensive part of this is the middle part - mapping the reads. It is also the least interesting for us here. To aid the ability of folks to follow along without having to worry if they have the correct indexs install etc, we have provided a BAM file of the mapped reads from this example. You can download it [here](https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.bam>). It will need indexing with `samtools index` before use. You can then skip straight to Step 5 below. Step 1: Install `UMI-Tools` ---------------------------- The easiest way to install `UMI-Tools` is using your favorite python package manager, either `pip` or `conda`. If you don't know which you have installed, we recommend trying both, starting with `conda`: $ conda install -c bioconda umi_tools Alternatively, `pip`: $ pip install umi_tools If neither of these work, see our installation guide. Step 2: Download the test data -------------------------------- The test data we are going to use is a control sample from a recent iCLIP experiment. We have processed this data to remove the various adaptors that you find in iCLIP data. You can download the trimmed file here: $ wget https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.fastq.gz If you're using macOS use: $ curl -L "https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.fastq.gz" -o "example.fastq.gz" The file is about 100Mb, and takes a couple of minutes to download on our system. Step 3: Extract the UMIs ------------------------- UMIs are strings of random nucleotides attached to the start of reads. Before the reads are mapped the random nucleotides must be removed from the reads, but the sequence must be kept. The `extract` command of `UMI-Tools` moves the UMI from the read to the read name. Several techniques that use UMIs mix the UMI sequence in with a library barcode. In this case we want to remove the random part of the barcodes, but leave the library part so that the reads can be de-multiplexed. We specify this using the `--bc-pattern` parameter to `extract`. Ns represent the random part of the barcode and Xs the fixed part. For example, in a standard iCLIP experiment, the barcode is made of 3 random bases, followed by a 4 base library barcode, followed by 2 more random bases. Thus the `--bc-pattern` would be "NNNXXXXNN". Read: TAGCCGGCTTTGCCCAATTGCCAAATTTTGGGGCCCCTATGAGCTAG Barcode: NNNXXXXNN | v TAGCCGGCT | V random-> TAG CCGG CT <- random ^ | library Processed read: CCGGTTGCCCAATTGCCAAATTTTGGGGCCCCTATGAGCTAG ^^^^ The processed reads could then be passed to a demultiplexing tool to deal with the library part of the barcode. Since the file we have downloaded contains only one library, here we will treat the whole barcode as a UMI, and so the pattern will contain only Ns. $ umi_tools extract --stdin=example.fastq.gz --bc-pattern=NNNNNNNNN --log=processed.log --stdout processed.fastq.gz Note that `extract` can output to a gzipped or uncompressed file depending on the file extension. It can also output to `stdout` if no output is specified. A log file is saved containing the parameters with which `extract` was run and the frequency of each UMI encountered. This can be redirected with `--log` or suppressed with `--supress-stats` (run parameters are still output). Step 4: Mapping --------------- The next step is to map the reads (in real life, you might also want to demultiplex, trim and quality filter the reads). Below we will use `bowtie` to map the reads to the mouse genome and `samtools` to create a BAM file from the results. If you don't wish to spend the time doing this, or don't have access to `bowtie` or `samtools` (or suitable alternatives), we provide a premapped BAM file (see command at the end of this step). First map the reads with your favorite read mapper, here `bowtie`, using parameters from the paper which we stole the sample from. This assumes the mm9 `bowtie` index and fasta are in your current directory. $ bowtie --threads 4 -v 2 -m 10 -a mm9 <( gunzip < processed.fastq.gz ) --sam > mapped.sam Next we need to convert the SAM file to BAM (actually `dedup` will use SAM files for single ended analysis, but it's much slower). $ samtools import mm9.fa mapped.sam mapped.bam The BAM now needs to be sorted and indexed: $ samtools sort mapped.bam -o example.bam $ samtools index example.bam If you want to skip the mapping, you can get the file here. It will still need indexing (see "samtools index" command above): $ wget https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.bam Again for macOS use the following to download: $ curl -L "https://github.com/CGATOxford/UMI-tools/releases/download/v0.2.3/example.bam" -o "example.bam" Step 5: Deduplication ---------------------- Now that we have a mapped, sorted, indexed file, we can proceed to run the deduplication procedure on it: $ umi_tools dedup -I example.bam --output-stats=deduplicated -S deduplicated.bam The `--output-stats` option is optional, but selecting it will provide a range of statistics about the run. One of the most interesting is the distribution of edit distances (here named deduplicated_edit_distance.tsv). The content of this file after running the above will look something like: |directional-adjacency|directional-adjacency_null|edit_distance|unique|unique_null| |---------------------|--------------------------|-------------|------|-----------| |10465 |10465 |Single_UMI |8976 |8976 | |0 |1 |0 |0 |2 | |2 |8 |1 |1167 |33 | |164 |73 |2 |496 |183 | |211 |258 |3 |281 |566 | |700 |916 |4 |730 |1663 | |395 |320 |5 |317 |617 | |89 |4 |6 |59 |5 | |21 |2 |7 |21 |2 | |0 |0 |8 |0 |0 | The first two columns show the distribution of average edit distances between UMIs found at a single base in the genome after deduplication with the directional-adjacency method (the default). Thus in the third line we see that there are 2 bases in the genome where the average edit distance between the UMIs found at that base is 1. The second column is what we would expect to see if UMIs were randomly distributed between mapping locations (taking into account any biases in the overall usage of particular UMI sequences). The last two columns the same, but for the naive `unique` deduplication method where every UMI is assumed to represent an independent molecule in the biological sample. Looking at the third row, we see that there are 1167 positions where the average edit distance between UMIs is 1, whereas in the random null (in the final column) we would only expect to see 33 such bases. The statistics options signficantly reduce the speed at which deduplication is performed and increase the memory usage. If time or memory usage is an issue, try running without the `--output-stats` option. Common variations ------------------ ### Paired-end sequencing ### If paired-end sequencing has been performed, it is necessary to make sure that the UMI sequence is added to both reads. When processing, provide the second read like so: $ umi_tools extract -I pair.1.fastq.gz --bc-pattern=NNNXXXXNN \ --read2-in=pair.2.fastq.gz --stdout=processed.1.fastq.gz \ --read2-out=processed.2.fastq.gz This assumes the UMI is on the 5' end of read1. For other possibilities (such as a UMI on both reads) see `umi_tools extract --help`. After paired-end mapping, paired end deduplication can be achieved by adding the `--paired` option to the call to `dedup`: $ umi_tools dedup -I mapped.bam --paired -S deduplicated.bam Paired deduplicating is signficantly slower and more memory intensive than single-ended. ### Read grouping ### For some applications it may be neccessary to mark the duplicates but retain all reads, for example, where the PCR duplicates are used to correct sequence errors by generating a consensus sequence. In these cases, the _group_ command can be used to mark each read with its read group. Optionally a flatfile detailing the read groups and read identifiers can also be output using the `--group-out` option: $ umi_tools group -I mapped.bam --paired --group-out=groups.tsv --output-bam -S mapped_grouped.bam The output bam will contain two tags: UG = read group id, BX = read group UMI. The tag containing the read group UMI can be modified with the `--umi-group-tag` option. The groups flatfile contains the following columns: - read_id - contig - position - umi = raw umi - umi_count = how many times was this umi observed at the same alignment coordinates - final_umi = the error corrected umi - final_umi_count = how many times was the umi observed at the same alignment coordinates, inc. error correction - unique_id = the unique identifier for this group #### Example UMI extraction: #### In the case above the UMIs are extracted with the pattern --bc-pattern=NNXXXXNN. Below is an example of how the fastq should be formatted following extraction: UMI is bases 3-7, bases 1-2 and 7-8 are the sample barcode and need to be removed @HISEQ:87:00000000 read1 AAGGTTGCTGATTGGATGGGCTAG + DA1AEBFGGCG01DFH00B1FF0B will become: @HISEQ:87:00000000_GGTT read1 TGATTGGATGGGCTAG + 1AFGGCG01DFH00B1 ### Other options ### See `umi_tools extract --help` and `umi_tools dedup --help` for details of futher possibilities.