compendium-expression-matrix

Build a matrix of samples vs genes out of individual rsem_genes.results files generated by RSEM.

usage: build_compendium_matrix.py [-h] --name COMPENDIUM_NAME --input INPUTFILE (--tpm | --expected-count) [--output OUTPUTDIR]
                                  [--mapfile ENSEMBL_HUGO_MAPPING_FILE]

Usage:

Gather a list of paths to the rsem_genes.results files for your samples.
Format it as a headerless tab-separated file (e.g. samples.txt) with the sample name in the first column and the path in the second. For example:


MySample1	/home/treehouse/samples/MySample1/expression/RSEM/rsem_genes.results
MySample2	/home/treehouse/samples/MySample2/expression/RSEM/rsem_genes.results
Downloaded_XYZ	/shared/downloads/expression/RSEM_XYZ/rsem_genes.results
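
Before running the script, it can help to sanity-check the sample list. The following is a minimal sketch, not part of this repository: the helper names parse_sample_list and missing_files are hypothetical, and the checks (exactly two columns, unique sample names) are assumptions about what a well-formed list looks like rather than rules the script is known to enforce.

```python
import csv
import os


def parse_sample_list(lines):
    """Parse headerless, tab-separated (sample name, path) rows.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Returns a dict mapping sample name -> path to rsem_genes.results.
    """
    samples = {}
    for line_no, row in enumerate(csv.reader(lines, delimiter="\t"), start=1):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated columns, got {len(row)}"
            )
        name, results_path = row
        if name in samples:
            raise ValueError(f"line {line_no}: duplicate sample name {name!r}")
        samples[name] = results_path
    return samples


def missing_files(samples):
    """Return the sample names whose results file does not exist on disk."""
    return [name for name, path in samples.items() if not os.path.isfile(path)]
```

Typical use would be `with open("samples.txt", newline="") as fh: samples = parse_sample_list(fh)`, followed by `missing_files(samples)` to list any paths that need fixing.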

Choose your output format: Hugo gene names with the log2(TPM+1) metric (--tpm), or Ensembl gene IDs with the Expected Count metric (--expected-count). Exactly one of the two is required.
Optionally, specify the output path with --output.
If the Ensembl-to-Hugo mapping file (EnsGeneID_Hugo_Observed_Conversions.txt, provided) is not in the current directory, specify its path with --mapfile.
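
To illustrate the two metrics, here is a rough sketch of reading one column from an rsem_genes.results file and applying the log2(TPM+1) transform. It assumes the standard RSEM gene-level header (gene_id, transcript_id(s), length, effective_length, expected_count, TPM, FPKM); the function names are hypothetical and this is not the code used by build_compendium_matrix.py.

```python
import csv
import math


def read_rsem_column(lines, column):
    """Read one numeric column ("TPM" or "expected_count") keyed by gene_id.

    `lines` is any iterable of text lines from an rsem_genes.results file,
    including its header row.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    return {row["gene_id"]: float(row[column]) for row in reader}


def log2_tpm_plus_one(tpm_by_gene):
    """Apply log2(TPM + 1) to every gene; a TPM of 0 maps to 0."""
    return {gene: math.log2(tpm + 1.0) for gene, tpm in tpm_by_gene.items()}
```

Adding 1 before taking the logarithm keeps genes with zero TPM defined, which is why the metric is written as log2(TPM+1).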

Example:

./compendium-expression-matrix/build_compendium_matrix.py \
--name MyCompendium \
--input samples.txt \
--tpm \
--mapfile compendium-expression-matrix/EnsGeneID_Hugo_Observed_Conversions.txt

Output

Running this script generates a gzipped TSV named in the format NAME_METRIC_DATE.tsv.gz, where DATE is today's date; for example, MyCompendium_hugo_log2tpm_2025-03-07.tsv.gz. It also generates an intermediate JSON file, samples_vs_rgr_files_NAME_DATE.json, which can be deleted.

The first column of the TSV is named Gene and contains the gene names (either Hugo or Ensembl) in alphabetical order. Each subsequent column is named after the corresponding sample and contains that sample's expression value for each gene.
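
The layout described above can be loaded with the Python standard library alone. This is a minimal sketch assuming every expression value parses as a float; read_matrix is a hypothetical helper, not something shipped in this repository.

```python
import csv
import gzip


def read_matrix(path):
    """Load a gzipped genes-by-samples TSV.

    Returns (samples, matrix), where `samples` is the list of sample names
    from the header and `matrix[gene][sample]` is the expression value.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)   # first cell is the "Gene" column label
        samples = header[1:]
        matrix = {
            row[0]: dict(zip(samples, (float(v) for v in row[1:])))
            for row in reader
        }
    return samples, matrix
```

For example, `samples, matrix = read_matrix("MyCompendium_hugo_log2tpm_2025-03-07.tsv.gz")` would give access to a single value as `matrix[gene][sample]`.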

Mapping note

Input rsem_genes.results files use Ensembl gene IDs. For output files using Hugo gene names, the IDs are converted using the EnsGeneID_Hugo_Observed_Conversions.txt mapping file. In some cases, multiple Ensembl IDs map to the same Hugo name. To merge these rows within a sample, the values of all matching Ensembl IDs are summed to produce the value associated with the Hugo name. In addition, Ensembl IDs that map to NA are dropped from the Hugo output file.
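
The collapsing rule can be sketched for a single sample as follows. This is an illustration under stated assumptions, not the repository's implementation: collapse_to_hugo is a hypothetical name, the mapping is passed in as a plain dict (the layout of the mapping file is not described here), NA is assumed to appear as the literal string "NA", and IDs absent from the mapping are assumed to be dropped as well. Whether the script sums before or after the log2 transform is not stated above, so the sketch simply sums whatever values it is given.

```python
from collections import defaultdict


def collapse_to_hugo(values_by_ensembl, ensembl_to_hugo):
    """Sum one sample's values over Ensembl IDs that share a Hugo name.

    IDs mapped to "NA" (or, by assumption, missing from the mapping)
    are dropped. Returns a dict sorted alphabetically by Hugo name.
    """
    totals = defaultdict(float)
    for ensembl_id, value in values_by_ensembl.items():
        hugo = ensembl_to_hugo.get(ensembl_id, "NA")
        if hugo == "NA":
            continue  # no Hugo name: excluded from the Hugo output
        totals[hugo] += value
    return dict(sorted(totals.items()))
```

With two hypothetical IDs that both map to the same Hugo name and carry values 1.0 and 2.5, the output holds a single row for that name with value 3.5.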

