Please click the follwoing link to download the Supplementary Data Tables for
A toxicogenomic platform for system-level understanding and prediction of EDC-induced toxicity
Article by Amirhossein Sakhteman, Mario Failli, Jenni Kublbeck, Anna-Liisa Levonen and Vittorio Fortino* (*Corresponding author)
Part I: Development of toxicogenomics-driven predictive model to identify and prioritize compounds with ED activity
R Script | MIEs_from_CTD.R |
---|---|
Input | http://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz tested for the release of june 2020 |
http://ctdbase.org/reports/CTD_chemicals.csv.gz tested for the release of june 2020 | |
Output | A list of chemical-gene associations |
Dependencies | data.table, FactoMineR, factoextra |
Summary | A binary data matrix indicating molecular initiating events (MIEs) of compounds annotated in CTD will be generated. Multiple correspondence analysis on the resulting matrix will be performed uisng FactoMineR and factoextra to determine the most distant gene-compound association types. Five types of interactions: reaction, binding, activity, expression and metabolic processing will be used according to the result of MCA. Compounds with too many MIEs will be excluded as outliars. |
R Script | Pathways_Download.R |
---|---|
Input | https://aopwiki.org/downloads/aop-wiki-xml-2019-01-01.gz |
https://aopwiki.org/downloads/aop_ke_ec.tsv | |
https://reactome.org/download/current/ReactomePathways.txt | |
http://ctdbase.org/reports/CTD_genes_pathways.csv.gz | |
https://reactome.org/download/current/miRBase2Reactome_PE_All_Levels.txt | |
Output | A list of pathways extracted from GO, KEGG, wiki and Reactome |
Dependencies | XML, GO.db, org.Hs.eg.db, GSA, msigdbr, rWikiPathways, magrittr, rjson, data.table |
Summary | Pathways related to KEGG, REACTOME, MSIGDB, GO and WIKI with less than 200 genes will be retrieved. GO terms will be systeamtically linked to known Wiki-AOPs. |
R Script | TOXCAST_nuclear_receptors_coregulators.R |
---|---|
Input | hitc_Matrix_190226.csv from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data |
Assay_Summary_190226.csv' from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data | |
DSSTox_Identifiers_and_CASRN.xlsx from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data | |
Result of the script MIEs_from_CTD.R | |
http://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz | |
Output | A binary matrix of chemicals and their corresponding hitcalls for the assay endpoitnts related to nuclear receptors and their co-regulators |
Dependencies | tidyr, dplyr,org.Hs.eg.db, readxl, data.table |
Summary | The genes related to nuclear receptors and their co-regulators from domain experts and NURSA database will be merged. Then, ToxCast assays targeting these genes will be selected. |
R Script | EDC_Decoy_selection.R |
---|---|
Input | https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv |
http://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz | |
Result of the script MIEs_from_CTD.R | |
Result of the script TOXCAST_nuclear_receptors_coregulators.R | |
Output | A list of EDCs and decoys, and their corresponding MIEs |
Dependencies | ggplot2, ggrepel, magrittr, data.table, dplyr, reshape, cluster |
Summary | An initial list of EDCs will be retrieved from DEDuCT. Then, ToxCast assays targeting ED-reated nuclear receptor and co-regulators will be used to refine the initial selection of EDCs. The most significat in vitro assay endpoints for identifying compounds with ED activity will be selected based on a statistical test aiming at identifying the most relevant directions in ED-gene interactions (e.g. up/down regulation). Moreover, EDCs that are inactive for all the significant assay endpoints will be systemaatically removed from the inital list of EDCs. After selecting a reliable set of EDCs, pairwise jaccard distance between the MIEs of the identified EDCs and other compounds annotated in CTD will be calculated.The compounds with the maximum Jaccard distance towards EDCs will be seleceted as negative controls (decoys). |
R Script | ToxCast_dictionaries.R |
---|---|
Input | hitc_Matrix_190226.csv from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data |
DSSTox_Identifiers_and_CASRN.xlsx from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data | |
Assay_Summary_190226.csv' from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data | |
Output | Hitcall matrix retried from ToxCast |
Dependencies | tidyr, dplyr, readxl |
Summary | A dictionary for ToxCast target genes and their corresponding endpoints will be prepared from ToxCast. Conversion of ToxCast DSSTox_Identifiers to CAS registry identifiers and preparation of the final Hitcall matrix with all ToxCast endpoints will be conducted. |
R Script | Drug_matrix_wTO.R |
---|---|
Input | LFCs and annottations from https://www.ebi.ac.uk/biostudies/studies/S-DIXA-AN-009?query=S-DIXA-AN-009 |
journal.pcbi.1004847.s026.XLS from https://pubmed.ncbi.nlm.nih.gov/27028627/ | |
Output | A set of four gene networks derived from DrugMatrix |
Dependencies | xlsx, doParallel, wTO |
Summary | The control samples from the preprocessed and normalized LFC values of Drug Matrix for rat in vitro hepatocytes and rat in vivo will be removed. Three exposure time points 1,3 and 5 days for in vivo and 1 day for in vitro will be selected as separate data frames. The genes expressed in liver will be selected and orthology mapping of the probe IDs to entrez gene values will be performed. Four gene co-expression networks using wTO package with bootstrap resampling method will be compiled. |
R Script | TG_Gates_wTO.R |
---|---|
Input | LFCs and annottations from https://www.ebi.ac.uk/biostudies/studies/S-DIXA-AN-005?query=S-DIXA-AN-005 |
LFCs and annottations from https://www.ebi.ac.uk/biostudies/studies/S-DIXA-AN-004?query=S-DIXA-AN-004 | |
LFCs and annottations from https://www.ebi.ac.uk/biostudies/studies/S-DIXA-AN-007?query=S-DIXA-AN-007 | |
journal.pcbi.1004847.s026.XLS from https://pubmed.ncbi.nlm.nih.gov/27028627/ | |
Output | A set of eight gene networks derived from TG-GATEs |
Dependencies | XLSX, doParallel, wTO |
Summary | Control samples from the preprocesses and normalized LFC values related to TG-Gates for rat in vitro, human invitro and rat in vivo will be removed. Three dose levels (high, middle and low) and three time points (8, 15 and 29 days) from the LFC values related to TG-Gates rat in vivo.(6 data frames)will be extracted. 2 separate data frames for 1 day treatment related to human and rat in vitro LFC values will be extracted. Selection of the genes expressed in the liver from each data frame and orthology mapping of probe IDs to gene entrez IDs will be conducted. 8 gene co-expression networks from the resulting data frames using wTO package with bootstrap resampling method will be compiled. |
R Script | LINCS_wTO.R |
---|---|
Input | LFCs and annottations for level 5 studies from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742 (phase I) |
LFCs and annottations for level 5 studies from from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742 (phase II) | |
journal.pcbi.1004847.s026.XLS from https://pubmed.ncbi.nlm.nih.gov/27028627/ | |
Output | 1 consensus gene network for LINCS from phase 1 and phase 2 studies |
Dependencies | cmapR, doParallel, wTO |
Summary | Normalized and preproessed LFC values from the level 5 of phase 1 and phase 2 LINCS data source will be used. Selection of cell line HEPG2 with 24 hours treatment from phase1 and phase 2 gene expression data in LINCS will be done. Selection of the gene IDS which are expressed in the liver will be performed. 2 gene networks from phase 1 and phase 2 using wTO package with bootstrapping resampling method will be compiled.one consensus gene co-expression netowork from the overlapping genes of the two networks using wTO package will be generated. |
R Script | Consensus_Rat_in_vitro_wTO.R |
---|---|
Input | Results of the scripts Drug_matrix_wTO.R and TG_Gates_wTO.R |
Output | 1 consensus gene network for Drug matrix and TG-Gates hepatocytes after 1 day treatment |
Dependencies | wTO |
Summary | A consensus gene co-expression network from the overlapping genes of the networks related to in vitro drug matrix and TG-GATEs will be compiled. |
R Script | PPI_network.R |
---|---|
Input | No input is needed |
Output | PPI network with new combined score |
Dependencies | STRINGdb, igraph, org.Hs.eg.db |
Summary | Protein-protein interaction network from stringDB will be downloaded. The gene co-expression score will be exluded from the combined score and a new combined score will be compiled. Mapping of nodes to entrez gene IDs will be performed. |
R Script | Drug_matrix_tuner.R |
---|---|
Input | result of the scripts Drug_matrix_wTO.R, EDC_Decoy_selection.R |
Output | A matrix of silhouette scores for different combination of Edges percentiles and sorted genes for Drug Matrix networks |
Dependencies | dnet,igraph,cluster |
Summary | The top %2, %3, %5 and %10 edge portions will be extracted from each network. Each network will be subjected to random walk with restart uisng the MIEs of EDCs and decoys (benchmark set) as seeds. The 200,500,700 and 1000 top most visited genes will be extracted after the random walk. A binary vector will be genrated for each compound (EDCs and decoys) with 1 representing the gene in the list of top visited gene.The pairwise jaccard distance between the binary vector of each EDCs and decoy will be calculated. Using the jaccard dismilarity matrix and a vector representing the class of each componud as EDC or Decoy, average silhouette score will be calculated for EDCs. |
R Script | TG_GATEs_tuner.R |
---|---|
Input | result of the scripts TG_Gates_wTO.R, EDC_Decoy_selection.R |
Output | A matrix of silhouette scores for different combination of Edges percentiles and sorted genes for TG-GATEs networks |
Dependencies | dnet,igraph,cluster |
Summary | As described for the script Drug_matrix_tuner.R |
R Script | Consensus_tuner.R |
---|---|
Input | result of the scripts Consensus_Rat_in_vitro_wTO.R, LINCS_wTO.R, EDC_Decoy_selection.R |
Output | A matrix of silhouette scores for different combination of Edges percentiles and sorted genes for consensus networks |
Dependencies | dnet,igraph,cluster |
Summary | As described for the script Drug_matrix_tuner.R |
R Script | PPI_tuner.R |
---|---|
Input | result of the script EDC_Decoy_selection.R |
Output | A matrix of silhouette scores for different combination of combined scores and sorted genes for PPI network |
Dependencies | dnet,igraph,cluster,STRINGdb,org.Hs.eg.db |
Summary | New ppi networks were compiled using the 0.6,0.65,0.7,0.75,0.8,0.85 values as the cut-off levels for the new combined score. All networks will be subjected to random walk with restart uisng the MIEs of EDCs and decoys (benchmark set) as seeds. The 200,500,700 and 1000 top most visited genes will be extracted after random walk. A binary vector will be genrated for each compound (EDCs and decoys) with 1 representing the gene in the list of top visited genes. The pairwise jaccard distance beween the binary vector of each EDCs and decoy will be calculated. Using the jaccard dismilarity matrix and a vector representing the class of each componud as EDC or Decoy average silhouette score will be calculated for EDCs. |
R Script | pareto_solution_on_tuning_results.R |
---|---|
Input | Results of the scripts PPI_tuner.R, Consensus_tuner.R, TG_GATEs_tuner.R , TG_GATEs_tuner.R, Drug_matrix_tuner.R |
Output | The optimized genes and edges solutions for all networks based on pareto solution |
Dependencies | rPref,knitr |
Summary | Pareto solution will be used to select the subnetworks that (1) maiximize the distance between EDCs and decoys,(2) minimize the selected edges or maximize the combined score. |
R Script | RWR_FGSEA_for_edc_decoys.R |
---|---|
Input | The results of the scripts Drug_matrix_wTO.R, TG_Gates_wTO.R, LINCS_wTO.R, Consensus_Rat_in_vitro_wTO.R, PPI_network.R, Pathways_Download.R, EDC_Decoy_selection.R,pareto_solution_on_tuning_results.R |
Output | A matrix of pathway scores for EDCs and decoys for each network |
Dependencies | fgsea,dnet,igraph,BiocParallel |
Summary | Random walk with restart will be performed in order to extend the initial set of EDC-MIEs retrieved from CTD. This operation is repeatd for each compound and network. Then Fast Gene Set Enrichment Analysis will be performed to identify the most responsive pathways of the genes selected with the RWR algorithm. |
R Script | RWR_FGSEA_for_all_compounds_in_CTD.R |
---|---|
Input | The results of the scripts Drug_matrix_wTO.R, TG_Gates_wTO.R, LINCS_wTO.R, Consensus_Rat_in_vitro_wTO.R, PPI_network.R, Pathways_Download.R,MIEs_from_CTD.R ,pareto_solution_on_tuning_results.R |
Output | A matrix of pathway scores for 12k chemical in CTD for each network |
Dependencies | fgsea,dnet,igraph,BiocParallel |
Summary | Random walk with restart will be performed in order to extend the initial set of EDC-MIEs retrieved from CTD. This operation is repeatd for each compound and network. Then Fast Gene Set Enrichment Analysis will be performed to identify the most responsive pathways of the genes selected with the RWR algorithm. |
R Script | manual_curation_of_pathways_as_features.R |
---|---|
Input | The result of the script Pathways_Download.R |
Output | A list of pathways to be used in machine learning |
Dependencies | No dependencies |
Summary | A more restricted set of pathways will be prepared, which excludes pathways related to viral, bacterial and radiation, duplicated pathways or pathways with no genes expressed in liver. |
R Script | Preparation_of_training_datasets.R |
---|---|
Input | The results of the scripts manual_curation_of_pathways_as_features.R,RWR_FGSEA_for_all_compounds_in_CTD.R, RWR_FGSEA_for_edc_decoys.R,EDC_Decoy_selection.R |
Output | A list containing training set for each network. A list containing test set for each network |
Dependencies | No dependencies |
Summary | Preparation of training dataset for each toxicogenomics network will be done. Each training dataset consists of NES scores or simply patwahy activation scores, which were obtained by using the RWR-FGSEA pipeline; and a vector labels indicateing EDCs and decoys. |
R Script | glm_modeling.R |
---|---|
Input | The output of the scripts Preparation_of_training_datasets.R, MIEs_from_CTD.R, EDC_Decoy_selection.R |
Output | List objects with GLM coefficients and models for each network and MIEs level |
Dependencies | caret, doParallel |
Summary | Training of elastic-net-based classifiers on the computed training dataset will be done using k-fold-cross-fold validation. |
R Script | k_fold_cross_validation.R |
---|---|
Input | The output of the scripts Preparation_of_training_datasets.R, MIEs_from_CTD.R, EDC_Decoy_selection.R |
Output | List of objects including Accuracy, F1-scores, confusion matrix, specificity and sensitivity for each trained classifier |
Dependencies | caret, doParallel |
Summary | Repeated 5-fold cross-validation will be performed on all 15 GLM-models. (Pathway level). Repeated 5-fold cross validation will be performed on a binary data matrix of gene-compounds.(Gene level)) |
R Script | comparing_cross_validation_across_all_layers_ANOVA.R |
---|---|
Input | The output of the script k_fold_cross_validation.R |
Output | list of ANOVA results between F1-scores and the boxplot of F1-scores |
Dependencies | ggplot2 |
Summary | The F1 scores for the k-fold-cross validation will be compared using ANOVA. Boxplot will be used to represent the obtained F1 scors across all GLM models. |
6. Selection of the most informative pathways by using the coeffient estimates and ROC analysis based on individual pathways.
R Script | Integration_of_coefficients_stabilties_NES_scores.R |
---|---|
Input | The outputs of the scripts k_fold_cross_validation.R, glm_modeling.R, Preparation_of_training_datasets.R, manual_curation_of_pathways_as_features.R |
Output | CSV and excel files with GLM coefs, ROC-AUCS, mean NES scores for each pathway and each network |
Dependencies | PRROC |
Summary | ROC analysis will be used to verify whether single pathways, which are selected by the elastic-net process, can distinguish alone EDCs from decoys. The NES scores, elastic-net coefficients, ROC-AUCs, average of NES scores will be integrated across all data layers (suppl. data). |
R Script | NES_bubble_plot_MOA.R |
---|---|
Input | The output of the script Integration_of_coefficients_stabilties_NES_scores.R |
Output | graphical represntation of the results of Integration_of_coefficients_stabilties_NES_scores.R as bubble plot |
Dependencies | ggplot2, RColorBrewer |
Summary | The map of pathway activation scores and GLM coefficient for each pathway and network will be represented as bubble plot.The generated plot can be used to indicate the putative pathways as the mode of action for the EDCs. |
R Script | prediction_all_compounds_class_probilities.R |
---|---|
Input | Output of the scripts glm_modeling.R, RWR_FGSEA_for_all_compounds_in_CTD.R |
Output | Class probability of a compound to be EDC for each data layer |
Dependencies | caret |
Summary | GLM coefs for each network will be used to predict class probability of all compounds in CTD using their NES scores across different pathways. |
R Script | ROC_TOXCAST_vs_class_probabilities.R |
---|---|
Input | Output of the scripts TOXCAST_nuclear_receptors_coregulators.R, prediction_all_compounds_class_probilities.R |
Output | A matrix of area under the curve for ROC between class probability of each data layer and Hitcall result of ToxCast endpoints |
Dependencies | PRROC |
Summary | ROC analysis will be used between the class probabilty of each network and the binary in vitro experimental hitcall ToxCast assays related to nuclear receptors for each network. The most informative networks will be selected based on the results of ROC curve analysis. |
R Script | Developing_Harmonic_and_average_EDC_scores.R |
---|---|
Input | Output of the scripts EDC_Decoy_selection.R, ROC_TOXCAST_vs_class_probabilities.R, prediction_all_compounds_class_probilities.R |
Output | vectors of harmonic scores and average scores for 12K compounds in CTD |
Dependencies | ggplot2, ggrepel |
Summary | Average EDC score will be defined as the average of class probabilties across selected networks for each compound. Harmonic EDC score will be defined as the harmonic sum class probabilties across selected networks for each compound. |
R Script | edc_score_new_comps.R |
---|---|
Input | A list of compounds and their corresponding MIEs |
Output | Class probability of a compound to be EDC for each data layer |
Dependencies | caret, fgsea, dnet, igraph |
Summary | The MIEs of the compounds will be used as seeds to start random walk with restart and enrichment analysis on different networks. The resulting pathway scores will be used to calculate the EDC scores and the class probabilities for the compounds. |
R Script | ROC_analysis_EDC_scores_vs_TOXCAST_endpoints.R |
---|---|
Input | Output of the scripts Output of the scripts TOXCAST_nuclear_receptors_coregulators.R, Developing_Harmonic_and_average_EDC_scores.R, ROC_TOXCAST_vs_class_probabilities.R |
Output | Matrix of Area under the curve of ROC between harmonic and average EDC scores with ToxCast Hitcall, P_values of ROC proportion test |
Dependencies | PRROC, gplots |
Summary | ROC analysis between EDC scores and ToxCast hitcall data for each assay endpoint will be calculated. |
R Script | Comparison_of_VAM_on_DeDuCt.R |
---|---|
Input | https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv, Output of the scripts EDC_Decoy_selection.R, prediction_all_compounds_class_probilities.R |
Output | Excel table with harmonic scores and the compounds in DeDUCT |
Dependencies | xlsx |
Summary | The scores for the compounds in DeDUCT will be calculated |
R Script | Comparison_VAM_scores_vs_TOXPI_scores.R |
---|---|
Input | Suppl. file of http://dx.doi.org/10.1016/j.coph.2014.09.021, https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv, Output of the script prediction_all_compounds_class_probilities.R |
Output | plot of the Toxpi scores VS EDC scores, excel files of toxpi and edc scores |
Dependencies | ggplot2, ggrepel |
Summary | The predicted EDC scores will be evaluated with ToxPi scores (The scores developed from ToxCast assay endpoints) |
R Script | Comparison_pathway_scores_vs_TOXDB_pathway_scores.R |
---|---|
Input | ToxDB pathway scores for pathways aryl hydrocarbon receptor, Breaset cancer and estrogen receptor from http://toxdb.molgen.mpg.de/ , Output of the script RWR_FGSEA_for_all_compounds_in_CTD.R, |
Output | CSV file with pathway scores of ToxDB and the pathway scores resutling from RWR-FGSEA |
Dependencies | webchem, ggplot2, ggrepel |
Summary | CHEMBL ids will be mapped to CAS ids using webchem. Comparison between pathway scores from RWR-FGSEA and ToxDB for aryl hydrocarbon receptor, Breaset cancer and estrogen receptor will be conducted. |
R Script | Validation_VAM_scores_Eurion_External_set_compounds.R |
---|---|
Input | https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv, output of the script prediction_all_compounds_class_probilities.R |
Output | Excel file and heatmap plot of EDC scores for external validation set |
Dependencies | reshape2,RColorBrewer,ggplot,magrittr |
Summary | Evaluation of EDC scores for known EDCs in our expert domain list and DeDuCT will be done. Calculation of accuaracy based on the scores of validation list will be performed. The validation set of compounds will be depicted as a heatmap plot |
R Script | Validation_with_ToxCast_mies.R |
---|---|
Input | Output of the script ToxCast_dictionaries.R, MIEs_from_CTD.R, |
Output | Plot of predicted EDC scores from ToxCast MIES VS EDC scores predicted from CTD MIES |
Dependencies | ggplot2, ggrepel, knitr |
Summary | The target genes for the compounds with positive assay endpoint in ToxCast wil be considered as the MIEs. Random walk with resetart and fast gene set enrichment analysis pipeline will be repeated starting with the MIEs from ToxCast. The obtained NES scores will be subjected to GLM models and the resulting class probabilities will be compared with the class probabilities of the MIEs annotated in CTD. |
R Script | dose_response_TG_GATEs_single.R |
---|---|
Input | Output of the scripts Developing_Harmonic_and_average_EDC_scores.R, RWR_FGSEA_for_all_compounds_in_CTD.R,Integration_of_coefficients_stabilties_NES_scores.R |
Output | List of the compounds based on the patterns of increase and decrease, List of ANOVA results for pathways in different dose response scenarios, plot depictions |
Dependencies | dplyr,ggplot2 |
Summary | The class probability profile of the compounds in TG-Gates will be categorized as 4 different groups (based on increase or decrease pattern at different doses Low, Middle and High. The significant pathways with regard to increases or decrease of pathway activation scores will be determined using ANOVA followed by post tukey test using bootstrap sampling method. |
R Script | time_exposure_response_TG_GATEs_repeated.R |
---|---|
Input | Output of the scripts Developing_Harmonic_and_average_EDC_scores.R, RWR_FGSEA_for_all_compounds_in_CTD.R,Integration_of_coefficients_stabilties_NES_scores.R |
Output | List of the compounds based on the patterns of increase and decrease, List of ANOVA results for pathways in different time of exposure scenarios, Plot depictions |
Dependencies | dplyr,ggplot2 |
Summary | The class probability profile of the compounds in TG-Gates will be categorized as 4 different groups (based on increase or decrease pattern at different Time of exposure 8 days, 15 days and 29 days). The significant pathways with regard to increases or decrease of pathway activation scores will be determined using ANOVA followed by post tukey test using bootstrap sampling method |
R Script | textmining_16_april_2020_endocrine_disruption.R |
---|---|
Input | The abstract results of pubmed search for endocrine disruption |
Output | A data frame with entrez gene names and their frequencies across different papers |
Dependencies | pubmed.mineR, magrittr |
Summary | After search in pubmed for the term "endocrine disruption" the genes related to ED will be determined using text mining with pubmed.mineR package. |
R Script | time_dose_plots_heatplot.R |
---|---|
Input | Output of the scripts time_exposure_response_TG_GATEs_repeated.R, dose_response_TG_GATEs_single.R,Pathways_Download.R |
Output | Plots of time of exposure and dose for sensitivity analysis, heatplot eseential genes and pathways |
Dependencies | ggpubr, RColorBrewer, ggplot2 |
Summary | The pathways versus genes will be represented as two heatplots (dose and time point plots). |
R Script | categorization_enrichment_result.R |
---|---|
Input | Output of the script Integration_of_coefficients_stabilties_NES_scores.R, |
Output | plots of categorization for specific pathways related to KEGG, REACTOME and wiki AOP |
Dependencies | UpSetR, ggplot2, RColorBrewer |
Summary | For each data layer (network) the number of selected KEGG pathways by elastic GLM will be categorized and represented in plots.For each data layer (network) the number of selected REACTOME pathways by elastic GLM will be categorized and represented in plots. |
R Script | preparation_data_set_disease_biomarker.R |
---|---|
Input | Output of the scripts Pathways_Download.R, EDC_Decoy_selection.R,RWR_FGSEA_for_all_compounds_in_CTD.R |
Output | A list of pathway scores for each data layer with specefic pathways |
Dependencies | No dependencies |
Summary | The NES scores for pathways with max length of 50 genes will be selected for classification task. |
R Script | chem2disease_CTD.R |
---|---|
Input | http://ctdbase.org/reports/CTD_chemicals_diseases.csv.gz, output of the script EDC_Decoy_selection.R, |
Output | A binary matrix of chemicals and diseases with 1 representing the association of a disease and the chemical |
Dependencies | xlsx |
Summary | A binary matrix from chem-disease associations of the chemicals annotated in CTD will be generated. |
R Script | preparation_two_class_training_set_by_disease_name.R |
---|---|
Input | Output of the scripts preparation_data_set_disease_biomarker.R |
Output | A list object with training x data and y vector for each data layer |
Dependencies | No dependencies |
Summary | Training sets (NES scores and class labels) for atherosclerosis, metabolic syndrome and diabetes type 2 will be prepared. |
2. Training elastic-net GLM classifier for the metabolic syndrome, diabetes type2 and atherosclerosis
R Script | metabolic_syndrome_two_class_traning_models_glm.R |
---|---|
Input | Output of the script preparation_two_class_training_set_by_disease_name.R |
Output | A list classification models and coeffecients for each data layer |
Dependencies | caret, doParallel |
Summary | Training GLM model classifiers for metabolic syndrome across all 15 data layers will be conducted. |
R Script | artherosclerosis_two_class_training_models_glm.R |
---|---|
Input | Output of the script preparation_two_class_training_set_by_disease_name.R |
Output | A list classification models and coeffecients for each data layer |
Dependencies | caret, doParallel |
Summary | Training GLM model classifiers for atherosclerosis across all 15 data layers will be conducted. |
R Script | diabetes_2_two_class_traning_models_glm.R |
---|---|
Input | Output of the script preparation_two_class_training_set_by_disease_name.R |
Output | A list classification models and coeffecients for each data layer |
Dependencies | caret, doParallel |
Summary | Training GLM model classifiers for diabetes T2 across all 15 data layers will be conducted. |
R Script | kfold_CV_two_class_metabolic_syndrome_glm.R |
---|---|
Input | Output of the script preparation_two_class_training_set_by_disease_name.R |
Output | List of accuaracy metrics for the classification task for each data layer |
Dependencies | caret, doParallel |
Summary | Performing K-fold-Cross validation on models of metabolic syndrome for all 15 data layers |
R Script | kfold_CV_two_class_artherosclerosis_glm.R |
---|---|
Input | Output of the script preparation_two_class_training_set_by_disease_name.R |
Output | List of accuaracy metrics for the classification task for each data layer |
Dependencies | caret, doParallel |
Summary | Performing K-fold-Cross validation on models of atherosclerosis for all 15 data layers |
R Script | kfold_CV_two_class_diabetes_2_glm.R |
---|---|
Input | Output of the script preparation_two_class_training_set_by_disease_name.R |
Output | List of accuaracy metrics for the classification task for each data layer |
Dependencies | caret, doParallel |
Summary | Performing K-fold-Cross validation on models of diabetes type 2 for all 15 data layers |
R Script | comparison_CV_F1_scores_all_layers_ANOVA_metabolic_syndrome.R |
---|---|
Input | Output of the script kfold_CV_two_class_metabolic_syndrome_glm.R |
Output | A list of ANOVA results between different data layers and boxplot representation |
Dependencies | plotly, dplyr, ggplot2 |
Summary | The F1 scores of the k-fold-cross validation will be compared using ANOVA across all 15 data layer for metabolic syndrome.Boxplot will be used to represent the obtained F1 scors across all GLM models for metabolic syndrome. |
R Script | comparison_CV_F1_scores_all_layers_ANOVA_atherosclerosis.R |
---|---|
Input | Output of the script kfold_CV_two_class_artherosclerosis_glm.R |
Output | A list of ANOVA results between different data layers and boxplot representation |
Dependencies | plotly, dplyr, ggplot2 |
Summary | The F1 scores of the k-fold-cross validation will be compared using ANOVA across all 15 data layer for atherosclerosis. Boxplot will be used to represent the obtained F1 scors across all GLM models for atherosclerosis. |
R Script | comparison_CV_F1_scores_all_layers_ANOVA_diabetes.R |
---|---|
Input | Output of the script kfold_CV_two_class_diabetes_2_glm.R |
Output | A list of ANOVA results between different data layers and boxplot representation |
Dependencies | plotly, dplyr, ggplot2 |
Summary | The F1 scores of the k-fold-cross validation will be compared using ANOVA across all 15 data layer for diabetes type 2. Boxplot will be used to represent the obtained F1 scors across all GLM models for diyabetes type 2. |
R Script | Integration_of_glm_coefs_stabilities_NES_ROC_metabolic_syndrome.R |
---|---|
Input | Output of the scripts preparation_two_class_training_set_by_disease_name.R, kfold_CV_two_class_metabolic_syndrome_glm.R, metabolic_syndrome_two_class_traning_models_glm.R |
Output | A data frame with stability index, AUC-ROC and GLM coef for each pathway across all data layers |
Dependencies | magrittr, PRROC |
Summary | For each network, ROC analysis will be used as a univariate method to evaluate the NES scores for each pathway for EDCs leading to metabolic synndrome vs edcs not leading to metabolic syndrome. The NES scores, glm coefficients ROC-AUCs, average of NES scores will be integrated across all data layers (suppl. data). |
R Script | Integration_of_glm_coefs_stabilities_NES_ROC_atherosclerosis.R |
---|---|
Input | Output of the scripts preparation_two_class_training_set_by_disease_name.R, kfold_CV_two_class_artherosclerosis_glm.R, artherosclerosis_two_class_training_models_glm.R |
Output | A data frame with stability index, AUC-ROC and GLM coef for each pathway across all data layers |
Dependencies | magrittr, PRROC |
Summary | For each network, ROC analysis will be used as a univariate method to evaluate the NES scores for each pathway for EDCs leading to atherosclerosis vs edcs not leading to atherosclerosis. The NES scores, glm coefficients ROC-AUCs, average of NES scores will be integrated across all data layers (suppl. data). |
R Script | Integration_of_glm_coefs_stabilities_NES_ROC_diabetes_2.R |
---|---|
Input | Output of the scripts preparation_two_class_training_set_by_disease_name.R, kfold_CV_two_class_diabetes_2_glm.R, diabetes_2_two_class_traning_models_glm.R |
Output | A data frame with stability index, AUC-ROC and GLM coef for each pathway across all data layers |
Dependencies | magrittr, PRROC |
Summary | For each network, ROC analysis will be used as a univariate method to evaluate the NES scores for each pathway for EDCs leading to diabetes type 2 vs edcs not leading to diabetes type 2.The NES scores, glm coefficients ROC-AUCs, average of NES scores will be integrated across all data layers (suppl. data). |
R Script | bubble_plots_pathways.R |
---|---|
Input | The output of the scripts Integration_of_glm_coefs_stabilities_NES_ROC_diabetes_2.R, Integration_of_glm_coefs_stabilities_NES_ROC_atherosclerosis.R, Integration_of_glm_coefs_stabilities_NES_ROC_metabolic_syndrome.R |
Output | Bubble plot for each disease across all data layers representing the pathways and their coeffeicients |
Dependencies | ggplot2, ggarrange |
Summary | The map of pathway activation scores and GLM coefficient for each pathway and network will be represented as bubble plots for the three diseases metabolic syndrome, atherosclerosis and diabetes type 2. The generated plots can be used to indicate the putative pathways linking EDCs to adverse outcome. |
4. Compiling and validation of disease scores for atherosclerosis, metabolic syndrome and diabetes type 2
R Script | disease_scores.R |
---|---|
Input | Output of the scripts preparation_data_set_disease_biomarker.R, chem2disease_CTD.R, metabolic_syndrome_two_class_traning_models_glm.R, artherosclerosis_two_class_training_models_glm.R, diabetes_2_two_class_traning_models_glm.R |
Output | CSV files with the disease scores related to chemical in CTD for each data layer, ROC aucs of the predicted scores with disease vectors in CTD |
Dependencies | PRROC |
Summary | For each disease (metabolic syndrome, atherosclerosis and diabetes type 2), harmonic sum and average score will be compiled across all 15 networks for 12k compounds in CTD. A ROC analysis will be performed between the predicted scores and the results in CTD |
R Script | disease_scores_pie_chart.R |
---|---|
Input | Output of the script disease_scores.R |
Output | Piechart of disease scores |
Dependencies | ggplot2 |
Summary | For each disease (metabolic syndrome, atherosclerosis and diabetes type 2) a pie chart will be depicted to reveal distribution of disease scores across 12k compounds in CTD. |