Release notes¶
1.1.2¶
07 May 2021
Bugfix
- whitelist --filtered-out with single-end (SE) reads threw an unassigned error. Thanks @yech1990 for rectifying this (#453)
- Also includes a very minor update of syntax (#455)
1.1.0¶
4 Nov 2020
Additional functionality
- Write out reads failing regex matching with extract/whitelist (see options --filtered-out, --filtered-out2). See #328 for motivation
- Ignore template length with paired-end dedup/group (see option --ignore-tlen). See #357 for motivation. Thanks @skitcattCRUKMI
- Ignore read pair suffixes with extract/whitelist, e.g. /1 or /2 (see option --ignore-read-pair-suffixes). See #325, #391, #418 and PierreBSC/Viral-Track issue 9 for motivation
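To illustrate what ignoring read pair suffixes means, here is a minimal Python sketch. The function name `strip_read_pair_suffix` is hypothetical and this is not the UMI-tools implementation; it only shows the idea of treating `name/1` and `name/2` as the same read identifier.

```python
def strip_read_pair_suffix(read_name, suffixes=("/1", "/2")):
    """Return read_name without a trailing /1 or /2, if one is present.

    Hypothetical helper for illustration only; not part of UMI-tools.
    """
    for suffix in suffixes:
        if read_name.endswith(suffix):
            return read_name[: -len(suffix)]
    return read_name


# Both mates of a pair map to the same identifier once the suffix is removed.
print(strip_read_pair_suffix("SRR001.7/1"))
print(strip_read_pair_suffix("SRR001.7/2"))
print(strip_read_pair_suffix("SRR001.7"))
```

Only a suffix at the very end of the name is removed, so a "/1" elsewhere in the identifier is left alone.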
Performance
- Sped up error correction mapping for cell barcodes in whitelist by using BKTree. Thanks @redst4r. Note that this adds a new python dependency (pybktree) which is available via pip and conda-forge.
- Very slight reduction in memory usage for dedup/group via a bugfix that reduces the number of reads retained in the buffer. Thanks to @mitrinh1 for spotting this (#428). The bug was equivalent to hardcoding the option --buffer-whole-contig on, which ensures all reads with the same start position are grouped together for deduplication, but at the cost of not yielding reads until the end of each contig, thus increasing memory usage. As such, the bug was not detrimental to the results output.
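For readers unfamiliar with BK-trees, the following is a small pure-Python sketch of the data structure. It is not the pybktree API and not the whitelist code; the class and function names are illustrative, and Hamming distance is assumed for simplicity. The point is that the triangle inequality lets a lookup skip most of the tree instead of comparing the query against every barcode.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    """Minimal BK-tree: each node is a tuple (item, {distance: child_node})."""

    def __init__(self, distance, items=()):
        self.distance = distance
        self.root = None
        for item in items:
            self.add(item)

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node, children = self.root
        while True:
            d = self.distance(item, node)
            if d == 0:
                return  # item already stored
            if d not in children:
                children[d] = (item, {})
                return
            node, children = children[d]

    def find(self, query, max_dist):
        """Return sorted (distance, item) pairs within max_dist of query."""
        if self.root is None:
            return []
        hits, stack = [], [self.root]
        while stack:
            node, children = stack.pop()
            d = self.distance(query, node)
            if d <= max_dist:
                hits.append((d, node))
            # Triangle inequality: only subtrees keyed within
            # [d - max_dist, d + max_dist] can contain matches.
            for child_d, child in children.items():
                if d - max_dist <= child_d <= d + max_dist:
                    stack.append(child)
        return sorted(hits)


# Toy whitelist of "true" cell barcodes (made-up sequences).
whitelist = ["AAAA", "CCCC", "GGGG", "TTTT"]
tree = BKTree(hamming, whitelist)
print(tree.find("AAAT", 1))
```

In this toy case the query is compared against the root only, because every other barcode sits in a subtree whose distance key falls outside the allowed window.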
Bugfixes
- Unmapped mates were not properly discarded with dedup and group. Thanks @Daniel-Liu-c0deb0t for rectifying this.
1.0.1¶
6 Dec 2019
Debug for KeyError when some reads are missing a cell barcode tag and stats output is required from umi_tools dedup. See comments from @ZHUwj0 in #281
1.0.0¶
14 Feb 2019
This release is intended to be a stable release with no plans for significant updates to UMI-tools functionality in the near future. As part of this release, much of the code base has been refactored. It is possible this may have introduced bugs which have not been picked up by the regression testing. If so, please raise an issue and we’ll try and rectify with a minor release update ASAP.
Documentation
UMI-tools documentation is now available online: https://umi-tools.readthedocs.io/en/latest/index.html
Along with the previous documentation, the readthedocs pages also include new pages:
- FAQ
- Making use of our Algorithms: The API
New knee method for whitelist
- The method to detect the “knee” in whitelist has been updated (#317). This method should always identify a threshold and is now set as the default method. Note that this knee method appears to be slightly more conservative (fewer cells above threshold) but having identified the knee, one can always re-run whitelist and use --set-cell-number to expand the whitelist if desired
- The old method is still available via --knee-method=density
- In addition, to run the old knee method but allow whitelist to exit without error even if a suitable knee point isn’t identified, use the new --allow-threshold-error option (#249)
- Putative errors in CBs above the knee can be detected using --ed-above-threshold (#309)
Explicit options for handling chimeric & improper read pairs (#312)
The behaviour for chimeric read pairs, improper read pairs and unmapped reads can now be explicitly set with the --chimeric-pairs, --unpaired-reads and --unmapped-reads options.
New options
Misc
0.5.5¶
16 Nov 2018
Mainly minor debugs and improved detection of incorrect command line options. Minor updates to documentation.
This involves the addition of the --assigned-status-tag option
- Testing for OSX has been dropped due to unresolved issues with travis. We hope to resurrect this in the future!
- In line with major python packages (e.g https://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html), support for python 2 will be dropped from January 1st 2019.
0.5.2¶
21 Dec 2017
Adds options to specify a delimiter to remove from a cell barcode or UMI whose parts should be concatenated, plus options to specify a string that splits the cell barcode or UMI into multiple parts, of which only the first will be used. Note that these options only work if the barcodes are contained in a BAM tag. If they were appended to the read name using umi_tools extract, there is no need for these options. See #217 for motivation:
- --umi-tag-delimiter=[STRING] - remove the delimiter STRING from the UMI. Defaults to None
- --umi-tag-split=[STRING] - split UMI by STRING and take only the first portion. Defaults to None
- --cell-tag-delimiter=[STRING] - remove the delimiter STRING from the cell barcode. Defaults to None
- --cell-tag-split=[STRING] - split cell barcode by STRING and take only the first portion. Defaults to - to deal with 10X GEMs
- Reduced memory requirements for count --wide-format-cell-counts (#222)
- Updates documentation (#204, #210, #211). Thanks @kohlkopf, @hy09 & @cbrueffer.
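The difference between the delimiter and split options can be sketched in a few lines of Python. The helper names below are hypothetical and the barcode strings are made-up examples; this mirrors the behaviour described above rather than the UMI-tools source.

```python
def remove_delimiter(barcode, delimiter=None):
    """Concatenate the parts of a barcode by deleting the delimiter string."""
    if delimiter is None:
        return barcode
    return barcode.replace(delimiter, "")


def take_first_part(barcode, split=None):
    """Split a barcode on a string and keep only the first portion."""
    if split is None:
        return barcode
    return barcode.split(split)[0]


# Delimiter: both halves of the UMI are kept and joined.
print(remove_delimiter("ACGT_TTGA", "_"))

# Split: everything after the first "-" is discarded, as with a
# 10X-style cell barcode carrying a "-1" suffix.
print(take_first_part("AAACCTGAGAAACCAT-1", "-"))
```

With the default of None, both helpers return the barcode unchanged.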
0.5.1¶
16 Oct 2017
Minor update. Improves detection of duplicate reads with paired end reads, reduces run time with dedup --output-stats and a few simple debugs.
- Improved identification of duplicate reads from paired end reads - will now use the position of the FIRST splice junction in the read (in reference coords) (#187)
- Speeds up dedup when running with --output-stats (#184)
- Fixes bugs:
  - whitelist --set-cell-number --plot-prefix -> unwanted error
  - dedup gave non-informative error when input contains zero valid reads/read pairs. Now raises a warning but exits with status 0 (#190, #195)
  - count errored if gene identifier contained a “:” (#198)
- Renames the --whole-contig option to --buffer-whole-contig to avoid confusion with the --per-contig option. The --whole-contig option will still work but will not be visible in documentation (#196)
0.5.0¶
18 Aug 2017
Version 0.5.0 introduces new commands to support single-cell RNA-Seq and reduces run-time. The underlying methods have not changed hence the minor release number uptick.
UMI-tools goes single cell
New commands for single cell RNA-Seq (scRNA-Seq):
- whitelist: Extract cell barcodes (CB) from droplet-based scRNA-Seq fastqs and estimate the number of “true” CBs. Outputs a flatfile listing the true cell barcodes and ‘error’ barcodes within a set distance. See #97 for a motivating example. Thanks to @Hoohm for input and patience in testing. Thanks to @k3yavi for input in discussions about implementing a ‘knee’ method.
In the process of creating these commands, the options for dealing with UMIs on a “per-gene” basis have been re-jigged to make their purpose clearer. See e.g. #127 for a motivating example.
To perform group, dedup or count on a per-gene basis, the --per-gene option should be provided. This must be combined with either --gene-tag if the BAM contains gene assignments in a tag, or --per-contig if the reads have been aligned to a transcriptome. In the latter case, if the reads have been aligned to a transcriptome where each contig is a transcript, the option --gene-transcript-map can be used to operate at the gene level. These options are standardised across all tools such that one can easily change e.g. a count command into a dedup command.
Additional updates
- extract can now accept regex patterns to describe UMI +/- CB encoding in read(s). See the --extract-method=regex option.
- We have written a guide for how to use UMI-tools for scRNA-Seq analysis including estimation of the number of true CBs, flexible extraction of cell barcodes and UMIs and --per-cell read-counting as well as common workflow variations.
- Reduced run-time (#156)
  - Introduced a hashing step to limit the scope of the edit-distance comparisons required to build the networks. Big thanks to @mparker2 for this!
- Simplified installation (#145)
  - Previously extensions were cythonized and compiled on the fly using pyximport, requiring users to have access to the install directory the first time the extension was required. Now the cythonized extension is provided, and is compiled at install-time.
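One way a hashing step can limit edit-distance comparisons is sketched below in Python. This is an illustration under stated assumptions, not the code from #156: it assumes equal-length UMIs and Hamming distance, and all function names are hypothetical. The idea is a pigeonhole argument: if each UMI is cut into threshold + 1 slices, two UMIs that differ at no more than threshold positions must be identical in at least one slice, so only UMIs sharing a slice need to be compared.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def slice_bounds(length, n_slices):
    """Split range(length) into n_slices contiguous, near-equal slices."""
    base, extra = divmod(length, n_slices)
    bounds, start = [], 0
    for i in range(n_slices):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def candidate_pairs(umis, threshold):
    """Yield each unordered UMI pair sharing at least one identical slice."""
    umis = sorted(set(umis))
    if not umis:
        return
    seen = set()
    for start, end in slice_bounds(len(umis[0]), threshold + 1):
        # Hash every UMI by the substring it carries in this slice.
        buckets = defaultdict(list)
        for umi in umis:
            buckets[umi[start:end]].append(umi)
        for members in buckets.values():
            for pair in combinations(members, 2):
                if pair not in seen:
                    seen.add(pair)
                    yield pair


def neighbours(umis, threshold=1):
    """Pairs within threshold, computed only over the candidate pairs."""
    return sorted(p for p in candidate_pairs(umis, threshold)
                  if hamming(*p) <= threshold)


umis = ["AAAA", "AAAT", "AATT", "CCCC", "CCCG"]
print(neighbours(umis, threshold=1))
```

In this toy input, no pair mixing an A-rich and a C-rich UMI is ever compared, because those UMIs never share a slice.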
0.4.4¶
8 May 2017
- Tweaks the way group handles paired end BAMs. To simplify the process and ensure all reads are written out, the paired end read (read 2) is now output without a group or UMI tag (#115).
- Introduces the --skip-tags-regex option to enable users to skip descriptive gene tags, such as “Unassigned”, when using the --gene-tag option. See #108.
- Bugfixes:
  - If the --transcript-gene-map included transcripts not observed in the BAM, this caused an error when trying to retrieve reads aligned to the transcript. This has been resolved. See #109
  - Allow output to zipped file with extract using python 3 (#104)
- Improved test coverage (--chrom and --gene-tag options). Thanks @MarinusVL for kindly sharing a BAM with gene tags.
0.4.2¶
22 Mar 2017
- When using the directional method with the group command, the ‘top’ UMI within each group was not always the most abundant (see comments in #96). This has now been resolved
0.4.1¶
16 Mar 2017
- Due to a bug in pysam.fetch(), paired end files with a large number of contigs could take a long time to process (see #93). This has now been resolved. Thanks to @gpratt for spotting and resolving this.
0.4.0¶
9 Mar 2017
Added functionality:
Deduplicating on gene ids (see #44, https://github.com/CGATOxford/UMI-tools/issues/44, for motivation)
- The user can now group/dedup according to the gene which the read aligns to. This is useful for single cell RNA-Seq methods such as CEL-Seq, where the position of the read on a transcript may be different for reads generated from the same initial molecule. The following options may be used to define the gene_id for each read:
  - --per-gene
  - --gene-transcript-map
  - --gene-tag
UMIs can now be extracted from the BAM tags and group will add a tag to each read describing the read group and UMI. See the following options for controlling this behaviour:
  - --extract-umi-method
  - --umi-tag
  - --umi-group-tag
Output unmapped reads (#78)
- The group command will now output unmapped reads if the --output-unmapped option is supplied. These reads will not be assigned to any group.
0.3.6¶
1 Feb 2017
0.3.5¶
27 Jan 2017
- The code has been tweaked to improve run-time. See #69 for a discussion about the changes implemented.
0.3.4¶
23 Jan 2017
- Corrects the edit distance comparison used to generate the network for the directional method.
  - This will only affect results generated using the directional method and --edit-distance-threshold > 1.
  - Previously, using the directional method with the option --edit-distance-threshold set to > 1 did not return the expected set of de-duplicated reads. If you have used the directional method with a threshold > 1, we recommend updating UMI-tools and re-running dedup.
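As background on why the threshold matters for the directional network, the sketch below builds directed edges between UMIs in Python. It assumes the rule commonly described for the directional method (an edge from A to B when the two UMIs are within the edit distance threshold and count(A) >= 2 * count(B) - 1), uses Hamming distance on equal-length UMIs, and uses made-up counts. The function names are hypothetical and this is not the dedup source.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def directional_edges(counts, threshold=1):
    """Return sorted directed edges (a, b) under the assumed rule above."""
    edges = []
    for a in counts:
        for b in counts:
            if a == b:
                continue
            close_enough = hamming(a, b) <= threshold
            abundant_enough = counts[a] >= 2 * counts[b] - 1
            if close_enough and abundant_enough:
                edges.append((a, b))
    return sorted(edges)


# Made-up UMI counts at a single mapping position.
counts = {"ACGT": 100, "ACGA": 10, "ACTA": 2}
print(directional_edges(counts, threshold=1))
print(directional_edges(counts, threshold=2))
```

Raising the threshold from 1 to 2 adds a direct edge from the most abundant UMI to the one two mismatches away, which is the kind of comparison the fix concerns.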
0.3.3¶
19 Jan 2017
- Debugs python 3 compatibility issues
- Adds python 3 tests
0.3.0¶
1 Dec 2016
- Adds the new group command to group PCR duplicates and return the groups in a tagged BAM file and/or flat file format. This was motivated by multiple requests to group PCR duplicated reads for downstream processes, e.g. #45, #54. Special thanks to Nils Koelling (@koelling) for testing the group command.
- Adds the --umi-separator option for dedup and group for workflows where umi_tools extract is not used to extract the UMI. This was motivated by #58
0.2.0¶
31 May 2016
0.0.10¶
- Improved memory performance for UMI extraction from paired end reads