Build a matrix of samples vs genes out of individual rsem_genes.results
files generated by RSEM.
usage: build_compendium_matrix.py [-h] --name COMPENDIUM_NAME --input INPUTFILE (--tpm | --expected-count) [--output OUTPUTDIR]
[--mapfile ENSEMBL_HUGO_MAPPING_FILE]
- Gather a list of paths to the rsem_genes.results files for your samples.
- Format them in a headerless tab-separated file (e.g. samples.txt) with the sample name as the first column and the path as the second. For example:
MySample1 | /home/treehouse/samples/MySample1/expression/RSEM/rsem_genes.results |
MySample2 | /home/treehouse/samples/MySample2/expression/RSEM/rsem_genes.results |
Downloaded_XYZ | /shared/downloads/expression/RSEM_XYZ/rsem_genes.results |
- Choose your output format: Hugo gene names and the log2(TPM+1) metric (--tpm) or Ensembl gene IDs and the Expected Count metric (--expected-count).
- Optionally, specify the output path with --output.
- If the Ensembl-to-Hugo mapping file (EnsGeneID_Hugo_Observed_Conversions.txt, provided) is not in the current directory, specify its path with --mapfile.
Example:
./compendium-expression-matrix/build_compendium_matrix.py \
--name MyCompendium \
--input samples.txt \
--tpm \
--mapfile compendium-expression-matrix/EnsGeneID_Hugo_Observed_Conversions.txt
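If your samples all sit under one directory, the samples.txt input can be generated rather than typed by hand. The following is a minimal sketch, not part of the provided script: the function name write_sample_list is hypothetical, and it assumes every sample follows the layout SAMPLE/expression/RSEM/rsem_genes.results seen in the first two example rows. Files stored elsewhere (such as the Downloaded_XYZ row) would still need to be added manually.

```python
import csv
from pathlib import Path


def write_sample_list(samples_root, out_path="samples.txt"):
    """Write a headerless TSV of (sample name, path to rsem_genes.results).

    Assumed layout (an illustration, not a requirement of the script):
    <samples_root>/<sample>/expression/RSEM/rsem_genes.results
    """
    root = Path(samples_root)
    rows = [
        # parents[2] is the sample directory three levels above the file.
        (path.parents[2].name, str(path))
        for path in sorted(root.glob("*/expression/RSEM/rsem_genes.results"))
    ]
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(rows)
    return rows
```

Calling write_sample_list("/home/treehouse/samples") would then produce a file suitable for --input, with one row per sample directory found.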
Running this script will generate a gzipped TSV named in the format NAME_METRIC_DATE.tsv.gz, where DATE is today's date. For example, MyCompendium_hugo_log2tpm_2025-03-07.tsv.gz.
It will also generate an intermediate JSON file, samples_vs_rgr_files_NAME_DATE.json. This can be deleted.
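To predict the output filename from another script (for instance, to locate the matrix after a run), the naming pattern can be sketched as below. The helper output_filename is hypothetical, and the hugo_log2tpm label is taken from the example filename; the label used for --expected-count output is not stated here, so check an actual run before relying on it.

```python
from datetime import date


def output_filename(name, metric_label, day=None):
    """Build NAME_METRIC_DATE.tsv.gz with an ISO-formatted date."""
    day = day or date.today()
    return f"{name}_{metric_label}_{day.isoformat()}.tsv.gz"


print(output_filename("MyCompendium", "hugo_log2tpm", date(2025, 3, 7)))
# MyCompendium_hugo_log2tpm_2025-03-07.tsv.gz
```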
The first column of the TSV is named Gene and consists of the gene names (either Hugo or Ensembl) in alphabetical order.
Each subsequent column is named after the corresponding sample and consists of that sample's expression values, one per gene.
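Given that layout, the matrix can be loaded with the Python standard library alone. This is a minimal sketch under the assumption that every value after the Gene column parses as a number; read_matrix is a hypothetical helper, not something the script provides.

```python
import csv
import gzip


def read_matrix(path):
    """Return (sample_names, {gene: [value per sample]}) from a gzipped TSV."""
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]  # header[0] is the "Gene" column
        values = {row[0]: [float(v) for v in row[1:]] for row in reader}
    return samples, values
```

For a large compendium, a dataframe library may be more convenient, but the structure to expect is the same: one row per gene, one column per sample.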
Input rsem_genes.results files use Ensembl gene IDs. For output files using Hugo gene names, the IDs are converted using the EnsGeneID_Hugo_Observed_Conversions.txt mapping file. In some cases, multiple Ensembl IDs map to the same Hugo name. To merge these rows within a sample, the values of all matching Ensembl IDs are summed together to produce the value associated with the Hugo name.
In addition, the Ensembl IDs which map to NA are dropped from the Hugo output file.
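The summing and dropping rules can be illustrated for a single sample as follows. This is a sketch of the described behavior, not the script's actual code: collapse_to_hugo, the Ensembl IDs, and the gene names are all made up for illustration, and treating an ID absent from the mapping the same as NA is an assumption. Whether the script sums before or after the log2(TPM+1) transform is not stated here, so the sketch simply sums whatever values it is given.

```python
from collections import defaultdict


def collapse_to_hugo(ensembl_values, ensembl_to_hugo):
    """Sum one sample's values over Ensembl IDs that share a Hugo name."""
    totals = defaultdict(float)
    for ensembl_id, value in ensembl_values.items():
        # Assumption: IDs missing from the mapping are treated like "NA".
        hugo = ensembl_to_hugo.get(ensembl_id, "NA")
        if hugo == "NA":
            continue  # dropped from the Hugo output
        totals[hugo] += value
    # Sort by gene name to mirror the alphabetical order of the output.
    return dict(sorted(totals.items()))


# Hypothetical IDs and names, for illustration only.
mapping = {"ENSG_X1": "GENE_A", "ENSG_X2": "GENE_A",
           "ENSG_X3": "GENE_B", "ENSG_X4": "NA"}
sample = {"ENSG_X1": 1.5, "ENSG_X2": 2.0, "ENSG_X3": 4.0, "ENSG_X4": 9.0}
print(collapse_to_hugo(sample, mapping))
# {'GENE_A': 3.5, 'GENE_B': 4.0}
```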