Single cell tutorial ==================== **Important update**: We now recommend the use of `alevin` for droplet-based scRNA-Seq (e.g 10X, inDrop etc). `alevin` extends the directional method used in `UMI-tools` to correct UMI errors with droplet scRNA-Seq within a framework that also enables quantification using multi-mapped reads. `alevin` is an accurate, fast and convenient end-to-end tool to go from fastq -> count matrix. See [alevin documentation](https://salmon.readthedocs.io/en/latest/alevin.html) and [alevin pre-print](https://www.biorxiv.org/content/10.1101/335000v2) for more information This tutorial will go through an end to end analysis for single cell analysis using UMI-tools. Before you start, you will need: * An installed copy of UMI-tools (see the installation guide) * [The STAR aligner](https://github.com/alexdobin/STAR) * A STAR index for the human genome * A transcriptome annotation. * The [Subread package](http://subread.sourceforge.net/), version 1.5.3 or greater. This tutorial will use an example dataset generated by 10X Genomics on their Chromium platform. At the end of the tutorial we cover variations on this workflow. We will continue to expand this section and cover other other data types here as well. With 4 cores at your disposal, the whole thing should be do-able in less than an hour for this dataset. The below assumes you are working in a linux environment. Some changes might be necessary if you are working in OSX, notably `wget` should be replace with `curl -O` Outline of the process: | Step | Inputs | Output | | -------- | -------- | -------- | | Find Cell barcode whitelist | Read 1 fastq | whitelist.txt | | Extract CB/UMIs and filter CBs | R1 + R2 fastqs + whitelist.txt | extracted fastqs | | Map reads | extracted fastqs | BAM | | Assign reads to genes | BAM + transcriptome GTF | BAM | Count unique reads per genes per cell | BAM | Counts.txt TL;DR ----- ```bash #! /bin/env bash # Step 1: get data wget http://cf.10xgenomics.com/samples/cell-exp/1.3.0/hgmm_100/hgmm_100_fastqs.tar; tar -xf hgmm_100_fastqs.tar; cat fastqs/hgmm_100_S1_L00?_R1_001.fastq.gz > hgmm_100_R1.fastq.gz; cat fastqs/hgmm_100_S1_L00?_R2_001.fastq.gz > hgmm_100_R2.fastq.gz; # Step 2: Identify correct cell barcodes umi_tools whitelist --stdin hgmm_100_R1.fastq.gz \ --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \ --set-cell-number=100 \ --log2stderr > whitelist.txt; # Step 3: Extract barcdoes and UMIs and add to read names umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \ --stdin hgmm_100_R1.fastq.gz \ --stdout hgmm_100_R1_extracted.fastq.gz \ --read2-in hgmm_100_R2.fastq.gz \ --read2-out=hgmm_100_R2_extracted.fastq.gz \ --whitelist=whitelist.txt; # Step 4: Map reads STAR --runThreadN 4 \ --genomeDir hg38_noalt_junc85_99.dir \ --readFilesIn hgmm_100_R2_extracted.fastq.gz \ --readFilesCommand zcat \ --outFilterMultimapNmax 1 \ --outSAMtype BAM SortedByCoordinate; # Step 5: Assign reads to genes featureCounts -a geneset.gtf \ -o gene_assigned \ -R BAM Aligned.sortedByCoord.out.bam \ -T 4; samtools sort Aligned.sortedByCoord.out.bam.featureCounts.bam -o assigned_sorted.bam; samtools index assigned_sorted.bam; # Step 6: Count UMIs per gene per cell umi_tools count --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell -I assigned_sorted.bam -S counts.tsv.gz ``` Step 1: Obtaining the data -------------------------- The dataset was produced from around 100 cells from a mixture of human and mouse cells. The download is 760MB, so make sure you've got space. It can be downloaded thus: ``` $ wget http://cf.10xgenomics.com/samples/cell-exp/1.3.0/hgmm_100/hgmm_100_fastqs.tar $ tar -x hgmm_100_fastqs.tar ``` out of the archive you will get a directory called `fastqs` that will contain data from 8 sequencing lanes, with three files for each lane: a read 1 file, a read 2 file and file containing the Sample Barcodes. We are not interested in the sample barcodes files, but for the read 1 and read2 files, we need to combine the eight lanes into one. By the magic of gzip, we can do with `cat`: $ cat fastqs/hgmm_100_S1_L00?_R1_001.fastq.gz > hgmm_100_R1.fastq.gz $ cat fastqs/hgmm_100_S1_L00?_R2_001.fastq.gz > hgmm_100_R2.fastq.gz If we look at the content of of the files we will see that the first read in the pair contains 26nt, which correspond to a 16nt Cell barcode (CB) and 10nt the Unique Molecular Identifier (UMI): $ zcat hgmm_100_R1.fastq.gz | head -n2 @ST-K00126:308:HFLYFBBXX:1:1101:25834:1173 1:N:0:NACCACCA NGGGTCAGTCTAGTGTGGCGATTCAC + #AAFFJJJJJJJJJJJJJJJJJJJJJ -------CB-------|---UMI--- We will need to remember this for the next two steps Step 2: Identifying the real cells --------------------------------- Cell barcodes are short nucleotide sequences, very much like UMIs, except instead of identifying independent molecules, they identify independent cells. We generally observe more of them in an experiment than there were cells. This could be for several reasons, including sequencing or PCR errors and the sequencing of empty droplets or those containing contaminants. Thus we must identify which cell barcodes we wish to use downstream. UMI-Tools `whitelist` command is used to produce a list of CB to use downstream. `whitelist` currently allows the common method of taking the top X most abundant barcodes. X can be estimated automatically from the data using the `knee` method (for more detail see this [blog post](https://cgatoxford.wordpress.com/2017/05/18/estimating-the-number-of-true-cell-barcodes-in-single-cell-rna-seq/)). However, it is just an estimate and for this data we've been told that there were 100 cells, so we can just supply that number (see variations section for performing the estimation for data sets where cell number is unknown). umi_tools whitelist --stdin hgmm_100_R1.fastq.gz \ --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \ --set-cell-number=100 \ --log2stderr > whitelist.txt A couple of things about this command: ### Input files Firstly, note that FASTQ file that contains the barcodes is passed to `--stdin`, this tells `whitelist` to read the data from this file rather than from stdin. This of course means you could pass data using standard pipes. ### Plots The `--plot-prefix` option tells `whitelist` to output summary plots for the frequency of each CB. We always recommend running with this option in order to visualise whether the number of cells accepted seems reasonable. See [Variations](Automatic estimation of cell number) for a more complete explanation of what these plots show. ### Specifying barcode locations Second, the `--bc-pattern`. This tells `whitelist` where to find the CB and UMI in the read sequence. By default we assume the barcodes are at the 5' end of the read (this can be changed with `--3prime`). We then use `C` characters to show where CB bases are and `N` characters to show were UMI bases are. Thus, in the above we have 16 `C`s followed by 10 `N`s to denote that the first 16 bases of the read are CB bases and the second 16 are UMI bases. Alternatively, you can also define the barcode pattern using a regex instead (--extract-method=regex) in which there are named groups to define the positions for the cell barcode and UMI. For example we would change the above command to: umi_tools whitelist --stdin hgmm_100_R1.fastq.gz \ --bc-pattern='(?P.{16})(?P.{10})' \ --extract-method=regex \ --set-cell-number=100 \ --log2stderr > whitelist.txt This interface is very powerful and flexible and allows for the specification of all sorts of interesting things, like variable length CBs (UMIs do have to be a fixed length) and tolerant linker sequences (see the inDrop example in [Variations](#Barcode-extraction-for-inDrop) and `umi_tools whitelist --help`). ### Specifying outputs By default UMI-tools outputs everything, final output, logging info and progress report, to the standard out pipe. Downstream sections will take this output just fine, but you might want to put the log somewhere else, like a separate file, or on the terminal. This can be achieved in one of several ways: * You could redirect the final output using the `--stdout` or `-S` options. The final output will be directed to a file and the log will continue to come on stdout. * You could redirect the log to a file with `--stdlog` or `-L`. The log and progress will be saved to a file and the final output sent to the standard out. * You can redirect the log and progress to the standard err using `--log2stderr` as above. * You can switch off the log/progress with `-v 0` This applies to all the UMI-Tools commands. ### Using UMI counts rather than read counts in umi_tools whitelist Many published protocols rank CBs by the number of reads the CBs appear in. However you could also use the number of unique UMIs a CB is associated with. Note that this is still an approximation to the number of transcripts captured because the same UMI could be associated with two different transcripts and be counted as independent. Activate this with `--method=umis`. ### Contents of `whitelist.txt` The output of the whitelist command is a table containing the accepted CBs. It has four columns: 1. The accepted CB 2. Comma separated list of other CBs within an edit distance of the CB in columns 1 and >1 edit away from any other accepted CB. 3. The abundance (read or UMI count) of the accepted. 4. Comma separated list of abundances for the CBs in column 2 e.g: ``` $ head whitelist.txt AAAGATGAGAAACGAG AAAAATGAGAAACGAG,AAACATGAGAAACGAG,... 53122 4,6,... AAAGCAAGTACCTACA AAAACAAGTACCTACA,AAACCAAGTACCTACA,... 36255 2,3,... AACACGTCAGCGTAAG AAAACGTCAGCGTAAG,AACAAGTCAGCGTAAG,... 53133 4,11,... ``` (I have truncated columns 2 and 4 for easy viewing) As we asked for 100 cells, this file should contain 100 lines: ``` $ wc -l whitelist.txt 100 whitelist.txt ``` Step 3: Extract the barcodes and filter the reads ------------------------------------------------- The next step is to extract the CB and UMI from Read 1 and add it to the Read 2 read name. We will also filter out reads that do not match one of the accepted cell barcode. The most basic form of this is executed with: ``` umi_tools extract --bc-pattern=CCCCCCCCCCCCCCCCNNNNNNNNNN \ --stdin hgmm_100_R1.fastq.gz \ --stdout hgmm_100_R1_extracted.fastq.gz \ --read2-in hgmm_100_R2.fastq.gz \ --read2-out=hgmm_100_R2_extracted.fastq.gz \ --whitelist=whitelist.txt ``` The `--bc-pattern` and `--stdin` options are as before. Note that we send the standard out (which contains the extract Read 1s) to the a file (add `.gz` to the end of the name will trigger automatic compression). `--read2-in` and `--read2-out` specify the output files for read 2. Finally `--whitelist` passes the list of accepted CBs generated in the previous step and tells `extract` to only output those reads that contain accepted CBs. As the end of the log makes clear, the command above took about 20 minutes ``` 2017-08-10 16:29:45,449 INFO Parsed 7100000 reads 2017-08-10 16:30:01,896 INFO Input Reads: 7197662 2017-08-10 16:30:01,896 INFO Reads output: 5963891 2017-08-10 16:30:01,896 INFO Filtered cell barcode: 1233771 # job finished in 1175 seconds at Thu Aug 10 16:30:01 2017 -- 1153.42 24.13 0.00 0.00 -- f5227f66-4f36-4862-9c7a-87caf3f81870 ``` There should now be two output files in your directory: `hgmm_100_R1_extracted.fastq.gz` and `hgmm_100_R2_extracted.fastq.gz`. Looking at the first read we see: ``` $ zcat hgmm_100_R1_extracted.fastq.gz | head -n4 @ST-K00126:308:HFLYFBBXX:1:1101:31345:1261_AACTCTTGTTCTGAAC_CGGTTGGGAT 1:N:0:NACCACCA + ``` The reads are now empty because the CBs and UMIs have now been moved from the read sequence to the read name, and there is nothing left. We will not need this file again. If we look at the Read 2: ``` $ zcat hgmm_100_R2_extracted.fastq.gz | head -n4 @ST-K00126:308:HFLYFBBXX:1:1101:31345:1261_AACTCTTGTTCTGAAC_CGGTTGGGAT 2:N:0:NACCACCA CCTTTTTGGAACCAACAATAGCAGCTCCATTTCTGGAGTCTGGGTCTTCCGAGGCCAGGAGCTCGCCTTTCCGCCGAGCCCAGATTGGCAGGTGGACT + A< whitelist.txt We can check how many CBs we accepted: ``` $ wc -l whitelist.txt 119 whitelist.txt ``` Based on our prior knowledge that there were ~100 true CBs, we've probably slightly over-estimated the number of CBs here. We can visualise this by looking at the following plots which present the same data in different ways to help assess the threshold: ![Counts distribution](./cell_barcode_count_density.png) Figure 1. "expect_whitelist_cell_barcode_counts_density.png" Distribution of read counts per CB. The least abundant CBs are removed from this plot --- ![Cumulative counts](./cell_barcode_knee.png) Figure 2. "expect_whitelist_cell_barcode_knee.png" Cumulative counts as number of accepted barcodes is increased --- ![Counts per barcode](./cell_barcode_counts.png) Figure 3. "expect_whitelist_cell_barcode_counts.png" Logs counts per barcode (ranked) --- For all plots, the 'Selected' local minima (or threshold) is shown by the dashed line. If we had other local minima which were 'Rejected' these would also be shown. The thresholds are also tabulated in the file `expect_whitelist_cell_thresholds.tsv`. In Figure 1, we can see two clear peaks, with the right peak representing the true cell barcodes. In this case, the selected local minima appears to be in the correct place. This is confirmed in Figures 2 & 3, where the threshold appears to be at the 'knee' in the cumulative counts, and counts per barcode. So it looks like in this case, the automatic detection of the 'knee' has worked well. However, in some cases, the automatic selection may not work so well. In these case we can re-run `whitelist` and set the number of cells with the `--set-cell-number` option as described above. In most cases, there will be many local minima and sometimes the wrong one may be selected. If so, you can insepect the `[PLOT_PREFIX]_cell_thresholds.tsv` file and re-run the `whitelist` command using the `--set-cell-number` option with the threshold which you have deemed to be most appropriate. We've not yet come across a case where no local minima identified near the 'knee', but if so, you can inspect the counts per barcode and cumulative counts figures and estimate a number to use with the `--set-cell-number` option. ### Correction of cell barcodes In [Step 2](#Step-2:-Identifying-the-real-cells) above, we generated a whitelist of accepted ("true") CBs. By default, we also identified the CBs that were not accepted but were a single edit from a just one accepted barcode (column 2) in `whitelist.txt`. These are likely to be errors and can be corrected. By default, `extract` will only retain reads where the CB matches the whitelist. However, we can also accept these "error" CBs and correct the sequence to the accepted CB using the `--error-correct-cell` option. ### Quality filtering and masking barcodes In some cases, the UMI bases may have very low base call quality scores. There are three options to deal with these reads: 1. Retain 2. Discard 3. Mask By defualt, `extract` takes option 1 and does not considered quality scores. If you want to take option 2 and discard these reads you can do so with `--quality-filter-threshold=[FILTER]`, whereby all reads where the UMI contains a base with a Phred scrore below `FILTER` are discarded. However, if you are concerned about low quality bases, our recommendation is that you take option 3 and use `--quality-filter-mask=[FILTER]` to replace all UMI bases with a Phred score below `FILTER` with an "N". UMIs with an "N" will be 1-edit distance away from their cognate error-free UMI sequence and will therefore be handled sensibly by the downstream UMI-tools commands. ### Going back a step in the group>dedup>count proceedure In this guide we have discussed the the UMI-tools `whitelist`, `extract` and `count` commands. There are two further UMI-tools commands which deal with BAM files, `group` and `dedup`. `group`, `dedup` and `count` can be thought of as performing sequential levels of analysis in the process of obtaining read counts per gene: First the UMIs are "grouped" according to the gene to which they are assigned, then the UMIs are "deduplicated" to leave a single UMI per group, then the UMIs per gene are "counted". In some cases, you may wish to stop the proceedure prior to counting to inspect the grouping or deduplication process. This can be easily achieved by simply replacing the `count` command with `dedup` or `group`, with a very few considerations to command specific options: * `--wide-format-cell-counts` is a `count`-specific option * When running `dedup`, you may want to include the `--output-stats` option * To output a BAM with `group`, you need to supply the `--output-bam` option **warning**: Running the `group` command with the `--per-gene` option in order to assess the groups used in the process of counting is very memory intensive, especially when the reads have been aligned to a genome, since all reads for a given contig in the BAM must be kept in memory as python objects until the whole contig has been parsed. This is not the case for `count` since we only retain the tally of reads. We recommend if you want to `group` together with `--per-gene` you used a transcriptome mapping (see below) ### Mapping to the transcriptome rather than genome Many single cell protocols recommend mapping to the transcriptome rather than then genome. This makes downstream processing easier. In paritcular it makes it easier to know when you've seen all the reads from a gene, so deduplication can happen. This leads to a reduced memory footprint, which is particularly obvious in `dedup` and `group`. However, there are problems. 1. When you map to the transcriptome almost all reads are multimapping as most sequence in a gene is part of more than one transcript. This makes it difficult to distinguish between genuninely multimapping reads and reads that just come from multiple transcripts of the same gene. 2. If we include these reads we also face the problem that reads will be double counted, and each time the deduplication is done, it will be done using a different set of other reads, thus a read might filtered as a duplicate for one transcript but not another. 3. Reads that should map elsewhere might be forced to map to the transcriptome. It turns out we can solve either problems 1 + 2 or problem 3. This is why we recommend mapping to the genome and annotating with `featureCounts` above. **Method 1** We solve problems 1 and 2 by using a collapsed transcriptome where all transcripts for a gene are merged into a single model. This is similar to the superTranscript concept proposed by [Davidison *et al*](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-017-1284-1). You can download our merged transcriptome annotations for human GENCODE 26 and mouse GENCODE 15 from ==here and here==, or see below for instructions on generating your own. You should generate FASTA from these, and then genome indexes for your favorite spliced aligner. After alignment the contigs of the aligned BAM file will contain the gene_id. You can then run `count`, `dedup` or `group` with: ``` $ umi_tools [COMMAND] --per-contig -I aligned.bam ``` You'll find that after the effort of generating the genome and its indexes, the `umi_tools` step is quicker and uses less memory, and the `featureCounts` step is unecessary. **Method 2** The above doesn't solve the third problem above however. We can get transcriptome aligned reads without the problem three using `STAR`. STAR has an output mode `--quantMode TranscriptomeSAM` where reads are mapped to the genome, but then their mapping coordinates are translated to the transcriptome and output in that form. For this you would pass `STAR` a normal transcriptome (i.e. not one of the collasped ones from above) using `--sjdbGTFfile` option. You could then sort the `BAM` file and run `umi_tools` with: ``` $ umi_tools [COMMAND] --per-contig --per-gene --gene-transcript-map=[MAPFILE] ``` where `MAPFILE` is a tab-sepearted text file mapping transcript_ids (in column 1) to gene_ids (in column 2). However, since we might encounter the same read twice (mapped to different transcripts of the same gene), UMI counts might be inaccurate. Ideally we'd like to combine these two methods: map to the genome and then have STAR translate to merged transcriptome coordinates. Unfortunately `STAR` will only transfer reads that fit the intron/exon structure of the collapsed gene, which looses many splice junctions, and therefore about 10% of reads are lost. ### Generating merge transcriptome annotations We generated the above linked collapsed transcriptomes using the [`cgat` pacakge](https://github.com/CGATOxford/cgat) and files from gencode. This involves repeated use of the `gtf2gtf` tool, which first merges the transcripts, and then sets the `transcript_id` of every exon to be equal to the `gene_id`. For example, for the human transcriptome: ``` $ wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_26/gencode.v26.annotation.gtf.gz $ cgat gtf2gtf --method=merge-exons -I gencode.v26.annotation.gtf.gz \ | cgat gtf2gtf --method=set-transcript-to-gene -S hg38_genocode26_merged.gtf.gz ``` [1]: http://www.sciencedirect.com/science/article/pii/S0092867415005498 [2]: https://www.ncbi.nlm.nih.gov/pubmed/26000488