UCSC-Treehouse/compendium-expression-matrix

compendium-expression-matrix

Build a matrix of samples vs genes out of individual rsem_genes.results files generated by RSEM.

usage: build_compendium_matrix.py [-h] --name COMPENDIUM_NAME --input INPUTFILE (--tpm | --expected-count) [--output OUTPUTDIR]
                                  [--mapfile ENSEMBL_HUGO_MAPPING_FILE]

Usage:

  • Gather a list of paths to the rsem_genes.results files for your samples.
  • Format them as a headerless tab-separated file (e.g. samples.txt) with the sample name in the first column and the path in the second. For example:
MySample1 /home/treehouse/samples/MySample1/expression/RSEM/rsem_genes.results
MySample2 /home/treehouse/samples/MySample2/expression/RSEM/rsem_genes.results
Downloaded_XYZ /shared/downloads/expression/RSEM_XYZ/rsem_genes.results
  • Choose your output format: Hugo gene names with the log2(TPM+1) metric (--tpm), or Ensembl gene IDs with the Expected Count metric (--expected-count).
  • Optionally, specify the output path with --output.
  • If the Ensembl-to-Hugo mapping file (EnsGeneID_Hugo_Observed_Conversions.txt, provided) is not in the current directory, specify its path with --mapfile.
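
The samples file described in the steps above can also be written programmatically. The following is a minimal sketch, not part of this repository: the helper name write_samples_file is hypothetical, and it assumes you already have (sample name, path) pairs in hand.

```python
import csv


def write_samples_file(samples, out_path="samples.txt"):
    """Write (sample_name, path) pairs as a headerless tab-separated file.

    Hypothetical helper for illustration; it is not provided by
    build_compendium_matrix.py.
    """
    with open(out_path, "w", newline="") as handle:
        # Tab delimiter and no header row, matching the format described above.
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for name, path in samples:
            writer.writerow([name, str(path)])
    return out_path
```

The resulting file can then be passed to the script with --input.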

Example:

./compendium-expression-matrix/build_compendium_matrix.py \
--name MyCompendium \
--input samples.txt \
--tpm \
--mapfile compendium-expression-matrix/EnsGeneID_Hugo_Observed_Conversions.txt

Output

Running this script generates a gzipped TSV named in the format NAME_METRIC_DATE.tsv.gz, where DATE is today's date. For example, MyCompendium_hugo_log2tpm_2025-03-07.tsv.gz. It also generates an intermediate JSON file, samples_vs_rgr_files_NAME_DATE.json, which can be deleted.
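
To make the naming pattern concrete, here is a small sketch that rebuilds the example filename. The function output_filename is hypothetical and is not taken from the script. The metric label hugo_log2tpm comes from the example above, and the label used for --expected-count output is not documented here, so the label is left as a parameter.

```python
from datetime import date


def output_filename(name, metric_label, today=None):
    """Assemble NAME_METRIC_DATE.tsv.gz, with the date in YYYY-MM-DD form.

    Hypothetical helper; the ISO date format is inferred from the
    example filename in this README.
    """
    today = today or date.today()
    return f"{name}_{metric_label}_{today.isoformat()}.tsv.gz"
```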

The first column of the TSV is named Gene and contains the gene names (either Hugo or Ensembl) in alphabetical order. Each subsequent column is named after the corresponding sample and contains that sample's expression value for each gene.
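
Given that layout, the matrix can be loaded with the Python standard library alone. This is a sketch under stated assumptions: read_matrix is a hypothetical helper, and it assumes every cell after the Gene column parses as a number, with no missing values.

```python
import csv
import gzip


def read_matrix(path):
    """Return (sample_names, {gene: [values...]}) from a gzipped TSV matrix.

    Hypothetical reader; assumes the first column is the gene name and
    all remaining cells are numeric.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]  # header[0] is the "Gene" column label
        values = {row[0]: [float(v) for v in row[1:]] for row in reader}
    return samples, values
```

Each list of values is in the same order as the returned sample names.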

Mapping note

Input rsem_genes.results files use Ensembl gene IDs. For output files using Hugo gene names, the IDs are converted using EnsGeneID_Hugo_Observed_Conversions.txt. In some cases, multiple Ensembl IDs map to the same Hugo name. To combine these rows within a sample, the values of all matching Ensembl IDs are summed to produce the value associated with the Hugo name. In addition, Ensembl IDs that map to NA are dropped from the Hugo output file.
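
The collapsing rule can be illustrated for a single sample as follows. This is a sketch of the described behavior, not the script's actual code. collapse_to_hugo is a hypothetical name, and the mapping is assumed to be already loaded into a dict. It also assumes that IDs absent from the mapping are skipped, which the text above does not specify. The sketch sums whatever values it is given and does not address where the log2(TPM+1) transform is applied relative to the sum.

```python
from collections import defaultdict


def collapse_to_hugo(ensembl_values, ensembl_to_hugo):
    """Sum per-sample values of Ensembl IDs that share a Hugo name.

    Hypothetical helper. IDs mapped to "NA" are dropped; IDs missing
    from the mapping are also dropped (an assumption of this sketch).
    """
    totals = defaultdict(float)
    for ens_id, value in ensembl_values.items():
        hugo = ensembl_to_hugo.get(ens_id)
        if hugo is None or hugo == "NA":
            continue
        totals[hugo] += value
    # Alphabetical order by gene name, as in the output matrix.
    return dict(sorted(totals.items()))
```

For example, with made-up IDs, if ENSG_A and ENSG_B both map to GENE1 with values 1.0 and 2.0, GENE1 receives 3.0, while an ID that maps to NA contributes nothing.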
