This repository was archived by the owner on Nov 11, 2024. It is now read-only.
Releases: clintval/cvbio
3.0.0
Changelog
- Breaking: Renamed UpdateDataContigNames to UpdateContigNames
Developers
- Correct IDEA configuration files are now created by Mill so we can run test coverage in IntelliJ (#83)
- Checksum-calculating input streams (#75) and implicits (#80) for creating checksum hashes as a side effect of consuming byte streams
- Add a command line tool API (#76) along with a Conda abstraction (#82). Although this API is public, it may undergo breaking changes more readily than other APIs given its experimental design.
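The checksum-calculating input streams mentioned above follow a common pattern: wrap a byte stream and update a hash every time bytes are read, so the digest is ready once the stream has been consumed. The following is a minimal Python sketch of that pattern only. The class name `ChecksumReader` and its methods are hypothetical and are not cvbio's API, which is written in Scala.

```python
import hashlib
import io


class ChecksumReader:
    """Hypothetical wrapper: hash bytes as a side effect of reading them."""

    def __init__(self, stream, algorithm="md5"):
        self._stream = stream
        self._digest = hashlib.new(algorithm)

    def read(self, size=-1):
        # Delegate the read, then feed whatever was returned to the hash.
        data = self._stream.read(size)
        self._digest.update(data)
        return data

    def hexdigest(self):
        # Only covers the bytes that have been read so far.
        return self._digest.hexdigest()


def consume(reader, chunk_size=4096):
    """Read a stream to exhaustion in fixed-size chunks."""
    chunks = []
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)


payload = b"chr1\t100\t200\n" * 1000
reader = ChecksumReader(io.BytesIO(payload), algorithm="sha256")
consume(reader)
print(reader.hexdigest())
```

The caller never hashes the data in a second pass, which matters when the stream is large or cannot be re-read.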
2.1.0
2.0.0
Changelog
- Breaking: Renamed RelabelReferenceNames to UpdateDataContigNames
UpdateDataContigNames
Update contig names in delimited data using a name mapping table.
A collection of mapping tables is maintained at the following location:
Features
- Optionally drop rows which have chromosome names not in the mapping file
- Replace multiple fields in a row at once using the same mapping file
- Directly write out rows that start with arbitrary strings (default of #)
- Parses any delimited data using any single character delimiter
Command Line Usage
Relabel the contig names in an Ensembl human gene annotation file.
❯ git clone https://github.com/dpryan79/ChromosomeMappings.git
❯ wget ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz
❯ cvbio UpdateDataContigNames \
-i Homo_sapiens.GRCh38.96.gtf.gz \
-o Homo_sapiens.GRCh38.96.ucsc-named.gtf.gz \
-m ChromosomeMappings/GRCh38_ensembl2UCSC.txt \
--comment-chars '#' \
--columns 0 \
--skip-missing false
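To make the features above concrete, here is a small Python sketch of the general technique: pass comment lines through untouched, rewrite the selected columns through a name mapping, and optionally drop rows whose contig is not in the mapping. The function `update_contig_names` and its parameters are hypothetical and only mirror the command line options by name. In particular, raising an error when `skip_missing` is false is an assumption, not documented cvbio behavior.

```python
def update_contig_names(lines, mapping, columns=(0,), delimiter="\t",
                        comment_chars=("#",), skip_missing=False):
    """Yield lines with the chosen zero-based columns renamed via `mapping`."""
    prefixes = tuple(comment_chars)
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(prefixes):
            # Comment and header lines are written out unchanged.
            yield line
            continue
        fields = line.split(delimiter)
        try:
            for index in columns:
                fields[index] = mapping[fields[index]]
        except KeyError:
            if skip_missing:
                continue  # Drop rows whose contig has no mapping.
            raise  # Assumption: fail loudly when rows may not be dropped.
        yield delimiter.join(fields)


# A tiny in-memory mapping in the spirit of an Ensembl-to-UCSC table.
ensembl_to_ucsc = {"1": "chr1", "MT": "chrM"}
gtf_lines = [
    "#!genome-build GRCh38.p12",
    "1\thavana\tgene\t11869\t14409",
    "MT\tinsdc\tgene\t3307\t4262",
]
for updated in update_contig_names(gtf_lines, ensembl_to_ucsc):
    print(updated)
```

Passing several indexes in `columns` rewrites multiple fields of a row with the same mapping, as described in the feature list.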
1.4.1
1.4.0
Changelog
- Tool to relabel reference sequence names in delimited data using a chromosome name mapping table
RelabelReferenceNames
Relabel reference sequence names in delimited data using a chromosome name mapping table.
A collection of mapping tables is maintained at the following location:
Features
- Optionally drop rows which have chromosome names not in the mapping file
- Replace multiple fields in a row at once using the same mapping file
- Directly write out rows that start with arbitrary strings (default of #)
- Parses any delimited data using any single character delimiter
Command Line Usage
Relabel the chromosome names in a human gene annotation file.
❯ git clone https://github.com/dpryan79/ChromosomeMappings.git
❯ wget -qO- ftp://ftp.ensembl.org/pub/release-96/gtf/homo_sapiens/Homo_sapiens.GRCh38.96.gtf.gz \
| gzip -dc > Homo_sapiens.GRCh38.96.gtf
❯ cvbio RelabelReferenceNames \
-i Homo_sapiens.GRCh38.96.gtf \
-o Homo_sapiens.GRCh38.96.ucsc-named.gtf \
-m ChromosomeMappings/GRCh38_ensembl2UCSC.txt \
--skip-prefixes '#' \
--columns 0 \
--drop false
1.3.0
Changelog
- New preference settings for IgvBoss, including:
  - Setting the downsampling status
  - Setting minimum and maximum base quality thresholds for shading
❯ cvbio IgvBoss -h 2>&1 | tail -n4
--downsample[[=true|false]] Downsample reads. [Optional].
--base-quality-minimum=Int Minimum base quality to shade. [Optional].
--base-quality-maximum=Int Maximum base quality to shade. [Optional].
1.2.0 IgvBoss
Features
- Will start IGV for you if it's not already running
- Quick syntax to navigate IGV from the command line only
- Easily re-load new files, travel to loci, and swap genomes.
- Shut IGV down with a single command
cvbio IgvBoss -x
Command Line Usage
❯ cvbio IgvBoss -g mm10.fa -i infile.bam targets.bed -l $(cut -f4 < targets.bed | head -n2)
Long Tool Description
IgvBoss
------------------------------------------------------------------------------------------------------------------------
Take control of your IGV session from end-to-end.
IGV Startup
-----------
There are three supported ways to initialize IGV:
* Let this tool connect to an already-running IGV session
* Supply an IGV JAR file path and let this tool run it
* Let this tool find an 'igv' executable on the system PATH and run it
This tool will always attempt to connect to a running IGV application before attempting to start a new instance of IGV.
Provide a path to an IGV JAR file if no IGV applications are currently running. If no IGV JAR file path is set, and
there are no running instances of IGV, then this tool will attempt to find 'igv' on the system PATH and execute the
application.
You can shut down IGV on exit with the '--close-on-exit' option. This will work regardless of how this tool initially
connected to IGV and is handy for tearing down the application after your investigation is concluded.
Controlling IGV
---------------
If no inputs are provided, then no new sessions will be created. Loci, for now, will result in a split-window view.
References and Prior Art
------------------------
* https://software.broadinstitute.org/software/igv/PortCommands
* https://github.com/stevekm/IGV-snapshot-automator
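For readers unfamiliar with the IGV port commands referenced above, the sketch below shows the general idea in Python: build a list of batch commands (`genome`, `load`, `goto`, `exit`) and send them one line at a time to a running IGV, which listens on port 60151 by default and replies with one line per command. The function names and parameters are hypothetical, and this is not how IgvBoss itself is implemented.

```python
import socket


def igv_commands(genome=None, files=(), loci=(), close_on_exit=False):
    """Translate inputs into IGV batch commands, in the order they are sent."""
    commands = []
    if genome:
        commands.append(f"genome {genome}")
    for path in files:
        commands.append(f"load {path}")
    if loci:
        # Several loci in one 'goto' give a split-window view in IGV.
        commands.append("goto " + " ".join(loci))
    if close_on_exit:
        commands.append("exit")
    return commands


def send_to_igv(commands, host="127.0.0.1", port=60151):
    """Send each command on its own line and collect IGV's one-line replies."""
    replies = []
    with socket.create_connection((host, port)) as connection:
        stream = connection.makefile("rw", newline="\n")
        for command in commands:
            stream.write(command + "\n")
            stream.flush()
            replies.append(stream.readline().strip())
    return replies


batch = igv_commands("mm10.fa", ["infile.bam", "targets.bed"], ["Actb", "Gapdh"])
print("\n".join(batch))
```

`send_to_igv(batch)` is left uncalled here because it needs a running IGV with the port enabled.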
1.1.0 Featured Template Disambiguation
Features
- Accepts SAM/BAM sources of any sort order.
- Will disambiguate an arbitrary number of BAMs, all aligned to different references
- Writes the ambiguous alignments to an ambiguous-alignment specific directory
Command Line Usage
❯ java -jar cvbio.jar Disambiguate -i infile1.bam infile2.bam -p insilico/disambiguated
Long Tool Description
Disambiguate
------------------------------------------------------------------------------------------------------------------------
Disambiguate reads that were mapped to multiple references.
Disambiguation of aligned reads is performed per-template and all information across primary, secondary, and
supplementary alignments is used as evidence. Alignment disambiguation is commonly used when analyzing sequencing data
from transduction, transfection, transgenic, or xenographic (including patient derived xenograft) experiments. This
tool works by comparing various alignment scores between a template that has been aligned to many references in order
to determine which reference is the most likely source.
All templates which are positively assigned to a single source reference are written to a reference-specific output BAM
file. Any templates with ambiguous reference assignment are written to an ambiguous input-specific output BAM file.
Only BAMs produced from the Burrows-Wheeler Aligner (bwa) or STAR are currently supported.
Input BAMs of arbitrary sort order are accepted; however, an internal sort to queryname will be performed unless the
BAM is already in queryname sort order. All output BAM files will be written in the same sort order as the input BAM
files. Although paired-end reads will give the most discriminatory power for disambiguation of short-read sequencing
data, this tool accepts paired, single-end (fragment), and mixed pairing input data.
Example
-------
To disambiguate templates that are aligned to human (A) and mouse (B):
❯ java -jar cvbio.jar Disambiguate -i sample.A.bam sample.B.bam -p sample/sample -n hg38 mm10
❯ tree sample/
sample/
├── ambiguous-alignments/
│ ├── sample.A.ambiguous.bai
│ ├── sample.A.ambiguous.bam
│ ├── sample.B.ambiguous.bai
│ └── sample.B.ambiguous.bam
├── sample.hg38.bai
├── sample.hg38.bam
├── sample.mm10.bai
└── sample.mm10.bam
Glossary
--------
* MAPQ: A metric that tells you how confident you can be that a read comes from a reported mapping position.
* AS: A metric that tells you how similar the read is to the reference sequence.
* NM: A metric that measures the number of mismatches to the reference sequence (Hamming distance).
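As an illustration of how the metrics in the glossary can be compared per template, here is a simplified Python sketch. It pools all alignments of one template per reference, ranks references by best AS, then fewest NM, then best MAPQ, and reports a tie as ambiguous. The ranking order, the `Alignment` type, and the function names are assumptions made for this example and do not describe cvbio's actual scoring.

```python
from typing import Dict, List, NamedTuple, Optional


class Alignment(NamedTuple):
    mapq: int
    alignment_score: int  # SAM tag AS
    mismatches: int  # SAM tag NM


def evidence(alignments: List[Alignment]):
    """Summarize one template's alignments to one reference as a sortable key."""
    return (
        max(a.alignment_score for a in alignments),  # higher is better
        -min(a.mismatches for a in alignments),  # fewer is better
        max(a.mapq for a in alignments),  # higher is better
    )


def disambiguate(template: Dict[str, List[Alignment]]) -> Optional[str]:
    """Return the winning reference name, or None if the template is ambiguous."""
    scored = {name: evidence(alns) for name, alns in template.items() if alns}
    if not scored:
        return None
    best = max(scored.values())
    winners = [name for name, key in scored.items() if key == best]
    return winners[0] if len(winners) == 1 else None


template = {
    "hg38": [Alignment(60, 150, 0), Alignment(0, 90, 4)],
    "mm10": [Alignment(20, 120, 3)],
}
print(disambiguate(template))
```

Templates that return `None` correspond to the ones the tool writes to the ambiguous-alignments directory.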
Prior Art
---------
* Disambiguate (https://github.com/AstraZeneca-NGS/disambiguate) from AstraZeneca's NGS team
v1.0.0 Template Disambiguation
Command Line Usage
❯ java -jar cvbio.jar Disambiguate -i infile1.bam infile2.bam -p insilico/disambiguated
Benchmarks
Performance benchmarks, albeit crude, can be found in this repository's documentation.
Long Tool Description
Disambiguate
------------------------------------------------------------------------------------------------------------------------
Disambiguate reads that were mapped to multiple references.
Disambiguation of mapped reads is performed per-template and all information across primary, secondary, and
supplementary alignments is used as evidence. Alignment disambiguation is useful when analyzing sequencing data from
transduction, transfection, xenographic (including patient derived xenografts), and transgenic experiments. This tool
works by comparing various alignment scores between a template that has been mapped to many references in order to
determine which reference is the most likely source.
All templates which are positively assigned to a single source reference are written to a reference-specific output BAM
file. Any templates with ambiguous reference assignment are currently dropped.
Caveats
-------
* No ambiguous BAM is currently written to the output prefix.
* All input BAMs must have an Assembly Name defined in the first sequence of the sequence dictionary.
* All input BAM files must be queryname grouped and synchronized on the read name.
* Only BAMs produced from the Burrows-Wheeler Aligner (bwa) and STAR are currently supported.
* Only BAMs produced from the same aligner are currently supported.
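The requirement that inputs be queryname grouped and synchronized on the read name means every input must present the same templates in the same order, so they can be walked together. A minimal Python sketch of that lockstep walk follows. Records are modeled as hypothetical `(read_name, payload)` tuples rather than real BAM records, and the function name is illustrative only.

```python
from itertools import groupby, zip_longest


def templates_in_lockstep(*sources):
    """Yield (read_name, [records per source]) for synchronized inputs."""
    grouped = [groupby(source, key=lambda record: record[0]) for source in sources]
    # A (None, ()) filler marks a source that ran out before the others.
    for groups in zip_longest(*grouped, fillvalue=(None, ())):
        names = {name for name, _ in groups}
        if len(names) != 1:
            found = sorted(map(str, names))
            raise ValueError(f"Inputs are not synchronized on read name: {found}")
        yield names.pop(), [list(records) for _, records in groups]


aligned_to_a = [("q1", "A:chr1"), ("q1", "A:chr1"), ("q2", "A:chr7")]
aligned_to_b = [("q1", "B:chr4"), ("q2", "B:chr5"), ("q2", "B:chr5")]
for name, per_source in templates_in_lockstep(aligned_to_a, aligned_to_b):
    print(name, [len(records) for records in per_source])
```

Because each input is only grouped, not compared across a sort order, the walk fails fast as soon as two inputs disagree on the next read name.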
Glossary
--------
* MAPQ: A metric that tells you how confident you can be that a read comes from a reported mapping position.
* AS: A metric that tells you how similar the read is to the reference sequence.
* NM: A metric that measures the number of mismatches to the reference sequence (Hamming distance).
Features for a Future Release
-----------------------------
* Override the assembly names (output BAM prefixes)
* Support 'tophat' or 'hisat2' alignments.
* Check whether mixed aligners have been used and raise exception.
Prior Art
---------
* Disambiguate (https://github.com/AstraZeneca-NGS/disambiguate) from AstraZeneca's NGS team