07 May 2021
whitelist --filtered-out with SE reads threw an unassigned error. Thanks @yech1990 for rectifying this (#453)
Also includes a very minor update of syntax (#455)
4 Nov 2020
- Write out reads failing regex matching with extract/whitelist (see options --filtered-out, --filtered-out2). See #328 for motivation
- Ignore template length with paired-end dedup/group (see option --ignore-tlen). See #357 for motivation. Thanks @skitcattCRUKMI
- Ignore read pair suffixes, e.g. /1 or /2, with extract/whitelist (see option --ignore-read-pair-suffixes). See #325, #391, #418 and PierreBSC/Viral-Track issue 9 for motivation
- Sped up error correction mapping for cell barcodes in whitelist by using BKTree. Thanks @redst4r. Note that this adds a new python dependency (pybktree) which is available via pip and conda-forge.
- Very slight reduction in memory usage for dedup/group via a bugfix that reduces the number of reads retained in the buffer. Thanks to @mitrinh1 for spotting this (#428). The bug was equivalent to hardcoding the option --buffer-whole-contig on, which ensures all reads with the same start position are grouped together for deduplication, but at the cost of not yielding reads until the end of each contig, thus increasing memory usage. As such, the bug was not detrimental to the results output.
- Unmapped mates were not properly discarded with dedup and group. Thanks @Daniel-Liu-c0deb0t for rectifying this.
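The BK-tree speed-up for cell barcode error correction noted in this release can be illustrated with a minimal, self-contained sketch. It does not use pybktree and is not the UMI-tools implementation: the `BKTree` class, the `hamming` helper and the example barcodes are hypothetical, and Hamming distance is assumed for simplicity. The point is only to show why a BK-tree avoids comparing a query barcode against every whitelisted barcode.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class BKTree:
    """Minimal BK-tree: children are keyed by their distance to the parent."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None  # each node is (item, {distance: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            d = self.distance(item, node[0])
            if d == 0:
                return  # item already present
            if d not in node[1]:
                node[1][d] = (item, {})
                return
            node = node[1][d]

    def find(self, query, threshold):
        """Return sorted (distance, item) pairs within `threshold` of `query`."""
        if self.root is None:
            return []
        hits, stack = [], [self.root]
        while stack:
            item, children = stack.pop()
            d = self.distance(query, item)
            if d <= threshold:
                hits.append((d, item))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - threshold, d + threshold] can contain matches.
            for edge, child in children.items():
                if d - threshold <= edge <= d + threshold:
                    stack.append(child)
        return sorted(hits)


# Hypothetical whitelist of cell barcodes
tree = BKTree(hamming)
for barcode in ["AAAA", "AAAT", "CCCC", "GGGG"]:
    tree.add(barcode)

# Whitelisted barcodes within one mismatch of an observed barcode
matches = tree.find("AAAC", 1)
```

In this toy example the subtrees holding "CCCC" and "GGGG" are never visited for the query "AAAC", which is where the saving over an all-against-all comparison comes from.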
6 Dec 2019
Debug for KeyError raised when some reads are missing a cell barcode tag and stats output is required from umi_tools dedup. See comments from @ZHUwj0 in #281
14 Feb 2019
This release is intended to be a stable release with no plans for significant updates to UMI-tools functionality in the near future. As part of this release, much of the code base has been refactored. It is possible this may have introduced bugs which have not been picked up by the regression testing. If so, please raise an issue and we’ll try and rectify with a minor release update ASAP.
UMI-tools documentation is now available online: https://umi-tools.readthedocs.io/en/latest/index.html
Along with the previous documentation, the readthedocs pages also include new pages:
- Making use of our Algorithms: The API
New knee method for whitelist
- The method to detect the “knee” in whitelist has been updated (#317). This method should always identify a threshold and is now set as the default method. Note that this knee method appears to be slightly more conservative (fewer cells above threshold) but, having identified the knee, one can always re-run whitelist and use --set-cell-number to expand the whitelist if desired
- The old method is still available via
- In addition, to run the old knee method but allow whitelist to exit without error even if a suitable knee point isn’t identified, use the new
- Putative errors in CBs above the knee can be detected using
Explicit options for handling chimeric & improper read pairs (#312)
The behaviour for chimeric read pairs, improper read pairs and unmapped reads can now be explicitly set with the
16 Nov 2018
Mainly minor debugs and improved detection of incorrect command line options. Minor updates to documentation.
This involves the addition of the
- Testing for OSX has been dropped due to unresolved issues with travis. We hope to resurrect this in the future!
- In line with major python packages (e.g https://www.numpy.org/neps/nep-0014-dropping-python2.7-proposal.html), support for python 2 will be dropped from January 1st 2019.
16 Jul 2018
21 Dec 2017
Adds options to specify a delimiter for a cell barcode or UMI whose parts should be concatenated, plus options to specify a string which splits the cell barcode or UMI into multiple parts, of which only the first will be used. Note that these options will only work if the barcodes are contained in a BAM tag. If they were appended to the read name using umi_tools extract, there is no need for these options. See #217 for motivation:
- remove the delimiter STRING from the UMI. Defaults to None
- split UMI by STRING and take only the first portion. Defaults to None
- remove the delimiter STRING from the cell barcode. Defaults to None
- split cell barcode by STRING and take only the first portion. Defaults to
-to deal with 10X GEMs
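As a rough illustration of the two operations described above (removing a delimiter, and splitting and keeping only the first portion), here is a minimal sketch. The function `clean_barcode`, its parameter names and the example tag values are hypothetical and are not part of UMI-tools; the order in which the two steps are applied when both are given is also an assumption.

```python
def clean_barcode(value, delimiter=None, split=None):
    """Apply the two clean-up steps to a barcode taken from a BAM tag.

    delimiter: string removed from the value (the parts are concatenated).
    split: string to split on; only the first portion is kept.
    Both default to None, meaning the value is left unchanged.
    """
    if split is not None:
        value = value.split(split)[0]
    if delimiter is not None:
        value = value.replace(delimiter, "")
    return value


# A UMI stored in two parts joined by "_" (hypothetical tag value)
umi = clean_barcode("ACGT_TTGA", delimiter="_")

# A cell barcode carrying a "-1" suffix (hypothetical tag value)
cell = clean_barcode("AAACCTGAGAAACCAT-1", split="-")
```

Here `umi` becomes "ACGTTTGA" and `cell` becomes "AAACCTGAGAAACCAT".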
Reduced memory requirements for
16 Oct 2017
- Minor update. Improves detection of duplicate reads with paired end reads, reduces run time with dedup --output-stats and a few simple debugs.
- Improved identification of duplicate reads from paired end reads - will now use the position of the FIRST splice junction in the read (in reference coords) (#187)
- Speeds up dedup when running with
- Fixes bugs:
- whitelist --set-cell-number --plot-prefix -> unwanted error
- dedup gave non-informative error when input contains zero valid reads/read pairs. Now raises a warning but exits with status 0 (#190, #195)
- count errored if gene identifier contained a “:” (#198)
--buffer-whole-contig to avoid confusion with the --per-contig option. The --whole-contig option will still work but will not be visible in documentation (#196)
18 Aug 2017
Version 0.5.0 introduces new commands to support single-cell RNA-Seq and reduces run-time. The underlying methods have not changed hence the minor release number uptick.
UMI-tools goes single cell
New commands for single cell RNA-Seq (scRNA-Seq):
Extract cell barcodes (CB) from droplet-based scRNA-Seq fastqs and estimate the number of “true” CBs. Outputs a flatfile listing the true cell barcodes and ‘error’ barcodes within a set distance. See #97 for a motivating example. Thanks to @Hoohm for input and patience in testing. Thanks to @k3yavi for input in discussions about implementing a ‘knee’ method.
In the process of creating these commands, the options for dealing with UMIs on a “per-gene” basis have been re-jigged to make their purpose clearer. See e.g. #127 for a motivating example.
To perform group, dedup or count on a per-gene basis, the --per-gene option should be provided. This must be combined with either --gene-tag if the BAM contains gene assignments in a tag, or --per-contig if the reads have been aligned to a transcriptome. In the latter case, if the reads have been aligned to a transcriptome where each contig is a transcript, the option --gene-transcript-map can be used to operate at the gene level. These options are standardised across all tools such that one can easily change e.g. a count command into a
extract can now accept regex patterns to describe UMI +/- CB encoding in read(s). See
- We have written a guide for how to use UMI-tools for scRNA-Seq analysis including estimation of the number of true CBs, flexible extraction of cell barcodes and UMIs and --per-cell read-counting as well as common workflow variations.
- Reduced run-time (#156)
- Introduced a hashing step to limit the scope of the edit-distance comparisons required to build the networks. Big thanks to @mparker2 for this!
- Simplified installation (#145)
- Previously extensions were cythonized and compiled on the fly using pyximport, requiring users to have access to the install directory the first time the extension was required. Now the cythonized extension is provided, and is compiled at install-time.
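The hashing step mentioned under "Reduced run-time" can be pictured with the following sketch. It shows one common way to limit edit-distance comparisons, a substring index based on the pigeonhole principle, and should be read as an assumption about the general idea rather than a copy of the UMI-tools code. All function names and the example UMIs are hypothetical, and equal-length UMIs compared by Hamming distance are assumed.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def chunk_bounds(length, n_chunks):
    """Split range(length) into n_chunks contiguous (start, end) slices."""
    base, extra = divmod(length, n_chunks)
    bounds, start = [], 0
    for i in range(n_chunks):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def candidate_pairs(umis, threshold):
    """Yield each pair of UMIs sharing at least one identical chunk.

    Pigeonhole principle: with threshold + 1 chunks, two UMIs within
    `threshold` mismatches must agree exactly on at least one chunk.
    """
    seen = set()
    for start, end in chunk_bounds(len(umis[0]), threshold + 1):
        buckets = defaultdict(list)
        for umi in umis:
            buckets[umi[start:end]].append(umi)
        for bucket in buckets.values():
            for a, b in combinations(bucket, 2):
                pair = (a, b) if a < b else (b, a)
                if pair not in seen:
                    seen.add(pair)
                    yield pair


def neighbours(umis, threshold):
    """Pairs within `threshold` mismatches, checked only among candidates."""
    return sorted(p for p in candidate_pairs(umis, threshold)
                  if hamming(*p) <= threshold)


# Hypothetical UMIs observed at one position
umis = ["AAAAAA", "AAAAAT", "TTTTTT", "TTTTTA", "ACGTAC"]
close = neighbours(umis, 1)
```

Only UMIs that land in the same bucket are ever compared, so the number of distance calculations shrinks from all pairs to the pairs that could plausibly be neighbours.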
8 May 2017
- Tweaks the way group handles paired end BAMs. To simplify the process and ensure all reads are written out, the paired end read (read 2) is now output without a group or UMI tag (#115).
- Introduces the --skip-tags-regex option to enable users to skip descriptive gene tags, such as “Unassigned”, when using the --gene-tag option. See #108.
- If the --transcript-gene-map included transcripts not observed in the BAM, this caused an error when trying to retrieve reads aligned to the transcript. This has been resolved. See #109
- Allow output to zipped file with extract using python 3 #104
- Improved test coverage (--gene-tag options). Thanks @MarinusVL for kindly sharing a BAM with gene tags.
- If the
28 Mar 2017
22 Mar 2017
- When using the directional method with the group command, the ‘top’ UMI within each group was not always the most abundant (see comments in #96). This has now been resolved
16 Mar 2017
- Due to a bug in pysam.fetch(), paired end files with a large number of contigs could take a long time to process (see #93). This has now been resolved. Thanks to @gpratt for spotting and resolving this.
9 Mar 2017
Deduplicating on gene ids (see #44, https://github.com/CGATOxford/UMI-tools/issues/44, for motivation)
- The user can now group/dedup according to the gene which the read aligns to. This is useful for single cell RNA-Seq methods such as CEL-Seq, where the position of the read on a transcript may be different for reads generated from the same initial molecule. The following options may be used to define the gene_id for each read:
UMIs can now be extracted from the BAM tags and group will add a tag to each read describing the read group and UMI. See following options for controlling this behaviour:
Output unmapped reads (#78)
The group command will now output unmapped reads if the --output-unmapped option is supplied. These reads will not be assigned to any group.
1 Feb 2017
27 Jan 2017
- The code has been tweaked to improve run-time. See #69 for a discussion about the changes implemented.
23 Jan 2017
- Corrects the edit distance comparison used to generate the network for the
- This will only affect results generated using the directional method and
- Previously, using the directional method with the option --edit-distance-threshold set to > 1 did not return the expected set of de-duplicated reads. If you have used the directional method with a threshold > 1, we recommend updating UMI-tools and re-running dedup.
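To make the role of the edit distance threshold in the directional method concrete, here is a minimal sketch of how the directed network edges can be built. It follows the published description of the method (an edge from UMI a to UMI b when they are within the threshold and count(a) >= 2 * count(b) - 1), but it is not the UMI-tools code: the function names and the example counts are hypothetical, and Hamming distance is assumed.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def directional_edges(counts, threshold=1):
    """Return sorted directed edges (parent, child) between UMIs.

    counts: dict mapping UMI sequence -> read count.
    An edge a -> b is added when the UMIs are within `threshold`
    mismatches and count(a) >= 2 * count(b) - 1.
    """
    edges = []
    for a in counts:
        for b in counts:
            if a == b:
                continue
            if hamming(a, b) <= threshold and counts[a] >= 2 * counts[b] - 1:
                edges.append((a, b))
    return sorted(edges)


# Hypothetical UMI counts at a single mapping position
counts = {"ACGT": 100, "ACGA": 3, "ACAA": 1, "TTTT": 50}

edges_t1 = directional_edges(counts, threshold=1)
edges_t2 = directional_edges(counts, threshold=2)
```

With a threshold of 2, "ACGT" gains a direct edge to "ACAA" (two mismatches) in addition to the edges found with a threshold of 1, which is the kind of comparison the fix above concerns.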
19 Jan 2017
python 3 compatibility issues
17 Jan 2017
- Minor bump:
- Resolves setuptools-based installation issue
1 Dec 2016
Version bump to allow pypi update. No code changes
1 Dec 2016
- Adds the new group command to group PCR duplicates and return the groups in a tagged BAM file and/or flat file format. This was motivated by multiple requests to group PCR duplicated reads for downstream processes, e.g. #45, #54. Special thanks to Nils Koelling (@koelling) for testing the group command.
- Adds the --umi-separator option for dedup and group for workflows where umi_tools extract is not used to extract the UMI. This was motivated by #58
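The idea behind a configurable UMI separator can be sketched as follows, assuming the UMI is the last field of the read name. The helper `umi_from_read_name` and the read names are hypothetical; the "_" default reflects how umi_tools extract appends the UMI to the read name, and the ":" case stands in for a workflow that used some other tool.

```python
def umi_from_read_name(read_name, separator="_"):
    """Return the text after the last `separator` in a read name."""
    return read_name.rsplit(separator, 1)[-1]


# Read name as written by umi_tools extract (UMI appended after "_")
umi_default = umi_from_read_name("NS500:1:H:1:11101:1000:2000_ACGTAC")

# Read name from a hypothetical workflow that appends the UMI after ":"
umi_colon = umi_from_read_name("read42:ACGTAC", separator=":")
```

Both calls return "ACGTAC"; only the separator differs.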
8 Nov 2016
- directional-adjacency method is renamed directional
2 Nov 2016
- Debugs writing out paired end
- Debugs installation
7 Jun 2016
- Debugs pip installation
31 May 2016
23 May 2016
- Debugs read extraction from 3’ end
- Improved memory performance for UMI extraction from paired end reads