EDTOX

Please click the follwoing link to download the Supplementary Data Tables for

A toxicogenomic platform for system-level understanding and prediction of EDC-induced toxicity

Article by Amirhossein Sakhteman, Mario Failli, Jenni Kublbeck, Anna-Liisa Levonen and Vittorio Fortino* (*Corresponding author)

Description of the R scripts used in the pipeline

Part I: Development of toxicogenomics-driven predictive model to identify and prioritize compounds with ED activity

1. Preparation of MIES, pathways and training datasets

R Script	MIEs_from_CTD.R
Input	http://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz tested for the release of june 2020
	http://ctdbase.org/reports/CTD_chemicals.csv.gz tested for the release of june 2020
Output	A list of chemical-gene associations
Dependencies	data.table, FactoMineR, factoextra
Summary	A binary data matrix indicating molecular initiating events (MIEs) of compounds annotated in CTD will be generated. Multiple correspondence analysis on the resulting matrix will be performed uisng FactoMineR and factoextra to determine the most distant gene-compound association types. Five types of interactions: reaction, binding, activity, expression and metabolic processing will be used according to the result of MCA. Compounds with too many MIEs will be excluded as outliars.

R Script	Pathways_Download.R
Input	https://aopwiki.org/downloads/aop-wiki-xml-2019-01-01.gz
	https://aopwiki.org/downloads/aop_ke_ec.tsv
	https://reactome.org/download/current/ReactomePathways.txt
	http://ctdbase.org/reports/CTD_genes_pathways.csv.gz
	https://reactome.org/download/current/miRBase2Reactome_PE_All_Levels.txt
Output	A list of pathways extracted from GO, KEGG, wiki and Reactome
Dependencies	XML, GO.db, org.Hs.eg.db, GSA, msigdbr, rWikiPathways, magrittr, rjson, data.table
Summary	Pathways related to KEGG, REACTOME, MSIGDB, GO and WIKI with less than 200 genes will be retrieved. GO terms will be systeamtically linked to known Wiki-AOPs.

R Script	TOXCAST_nuclear_receptors_coregulators.R
Input	hitc_Matrix_190226.csv from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data
	Assay_Summary_190226.csv' from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data
	DSSTox_Identifiers_and_CASRN.xlsx from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data
	Result of the script MIEs_from_CTD.R
	http://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz
Output	A binary matrix of chemicals and their corresponding hitcalls for the assay endpoitnts related to nuclear receptors and their co-regulators
Dependencies	tidyr, dplyr,org.Hs.eg.db, readxl, data.table
Summary	The genes related to nuclear receptors and their co-regulators from domain experts and NURSA database will be merged. Then, ToxCast assays targeting these genes will be selected.

R Script	EDC_Decoy_selection.R
Input	https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv
	http://ctdbase.org/reports/CTD_chem_gene_ixns.csv.gz
	Result of the script MIEs_from_CTD.R
	Result of the script TOXCAST_nuclear_receptors_coregulators.R
Output	A list of EDCs and decoys, and their corresponding MIEs
Dependencies	ggplot2, ggrepel, magrittr, data.table, dplyr, reshape, cluster
Summary	An initial list of EDCs will be retrieved from DEDuCT. Then, ToxCast assays targeting ED-reated nuclear receptor and co-regulators will be used to refine the initial selection of EDCs. The most significat in vitro assay endpoints for identifying compounds with ED activity will be selected based on a statistical test aiming at identifying the most relevant directions in ED-gene interactions (e.g. up/down regulation). Moreover, EDCs that are inactive for all the significant assay endpoints will be systemaatically removed from the inital list of EDCs. After selecting a reliable set of EDCs, pairwise jaccard distance between the MIEs of the identified EDCs and other compounds annotated in CTD will be calculated.The compounds with the maximum Jaccard distance towards EDCs will be seleceted as negative controls (decoys).

R Script	ToxCast_dictionaries.R
Input	hitc_Matrix_190226.csv from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data
	DSSTox_Identifiers_and_CASRN.xlsx from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data
	Assay_Summary_190226.csv' from ToxCast 3.1 https://www.epa.gov/chemical-research/exploring-toxcast-data-downloadable-data
Output	Hitcall matrix retried from ToxCast
Dependencies	tidyr, dplyr, readxl
Summary	A dictionary for ToxCast target genes and their corresponding endpoints will be prepared from ToxCast. Conversion of ToxCast DSSTox_Identifiers to CAS registry identifiers and preparation of the final Hitcall matrix with all ToxCast endpoints will be conducted.

2. Compiling gene co-expression networks

R Script	Drug_matrix_wTO.R
Input	LFCs and annottations from https://www.ebi.ac.uk/biostudies/studies/S-DIXA-AN-009?query=S-DIXA-AN-009
	journal.pcbi.1004847.s026.XLS from https://pubmed.ncbi.nlm.nih.gov/27028627/
Output	A set of four gene networks derived from DrugMatrix
Dependencies	xlsx, doParallel, wTO
Summary	The control samples from the preprocessed and normalized LFC values of Drug Matrix for rat in vitro hepatocytes and rat in vivo will be removed. Three exposure time points 1,3 and 5 days for in vivo and 1 day for in vitro will be selected as separate data frames. The genes expressed in liver will be selected and orthology mapping of the probe IDs to entrez gene values will be performed. Four gene co-expression networks using wTO package with bootstrap resampling method will be compiled.

R Script	TG_Gates_wTO.R
Input	LFCs and annottations from https://www.ebi.ac.uk/biostudies/studies/S-DIXA-AN-005?query=S-DIXA-AN-005
	LFCs and annottations from https://www.ebi.ac.uk/biostudies/studies/S-DIXA-AN-004?query=S-DIXA-AN-004
	LFCs and annottations from https://www.ebi.ac.uk/biostudies/studies/S-DIXA-AN-007?query=S-DIXA-AN-007
	journal.pcbi.1004847.s026.XLS from https://pubmed.ncbi.nlm.nih.gov/27028627/
Output	A set of eight gene networks derived from TG-GATEs
Dependencies	XLSX, doParallel, wTO
Summary	Control samples from the preprocesses and normalized LFC values related to TG-Gates for rat in vitro, human invitro and rat in vivo will be removed. Three dose levels (high, middle and low) and three time points (8, 15 and 29 days) from the LFC values related to TG-Gates rat in vivo.(6 data frames)will be extracted. 2 separate data frames for 1 day treatment related to human and rat in vitro LFC values will be extracted. Selection of the genes expressed in the liver from each data frame and orthology mapping of probe IDs to gene entrez IDs will be conducted. 8 gene co-expression networks from the resulting data frames using wTO package with bootstrap resampling method will be compiled.

R Script	LINCS_wTO.R
Input	LFCs and annottations for level 5 studies from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742 (phase I)
	LFCs and annottations for level 5 studies from from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92742 (phase II)
	journal.pcbi.1004847.s026.XLS from https://pubmed.ncbi.nlm.nih.gov/27028627/
Output	1 consensus gene network for LINCS from phase 1 and phase 2 studies
Dependencies	cmapR, doParallel, wTO
Summary	Normalized and preproessed LFC values from the level 5 of phase 1 and phase 2 LINCS data source will be used. Selection of cell line HEPG2 with 24 hours treatment from phase1 and phase 2 gene expression data in LINCS will be done. Selection of the gene IDS which are expressed in the liver will be performed. 2 gene networks from phase 1 and phase 2 using wTO package with bootstrapping resampling method will be compiled.one consensus gene co-expression netowork from the overlapping genes of the two networks using wTO package will be generated.

R Script	Consensus_Rat_in_vitro_wTO.R
Input	Results of the scripts Drug_matrix_wTO.R and TG_Gates_wTO.R
Output	1 consensus gene network for Drug matrix and TG-Gates hepatocytes after 1 day treatment
Dependencies	wTO
Summary	A consensus gene co-expression network from the overlapping genes of the networks related to in vitro drug matrix and TG-GATEs will be compiled.

R Script	PPI_network.R
Input	No input is needed
Output	PPI network with new combined score
Dependencies	STRINGdb, igraph, org.Hs.eg.db
Summary	Protein-protein interaction network from stringDB will be downloaded. The gene co-expression score will be exluded from the combined score and a new combined score will be compiled. Mapping of nodes to entrez gene IDs will be performed.

3. Optimization of toxicogenomics-driven gene networks

R Script	Drug_matrix_tuner.R
Input	result of the scripts Drug_matrix_wTO.R, EDC_Decoy_selection.R
Output	A matrix of silhouette scores for different combination of Edges percentiles and sorted genes for Drug Matrix networks
Dependencies	dnet,igraph,cluster
Summary	The top %2, %3, %5 and %10 edge portions will be extracted from each network. Each network will be subjected to random walk with restart uisng the MIEs of EDCs and decoys (benchmark set) as seeds. The 200,500,700 and 1000 top most visited genes will be extracted after the random walk. A binary vector will be genrated for each compound (EDCs and decoys) with 1 representing the gene in the list of top visited gene.The pairwise jaccard distance between the binary vector of each EDCs and decoy will be calculated. Using the jaccard dismilarity matrix and a vector representing the class of each componud as EDC or Decoy, average silhouette score will be calculated for EDCs.

R Script	TG_GATEs_tuner.R
Input	result of the scripts TG_Gates_wTO.R, EDC_Decoy_selection.R
Output	A matrix of silhouette scores for different combination of Edges percentiles and sorted genes for TG-GATEs networks
Dependencies	dnet,igraph,cluster
Summary	As described for the script Drug_matrix_tuner.R

R Script	Consensus_tuner.R
Input	result of the scripts Consensus_Rat_in_vitro_wTO.R, LINCS_wTO.R, EDC_Decoy_selection.R
Output	A matrix of silhouette scores for different combination of Edges percentiles and sorted genes for consensus networks
Dependencies	dnet,igraph,cluster
Summary	As described for the script Drug_matrix_tuner.R

R Script	PPI_tuner.R
Input	result of the script EDC_Decoy_selection.R
Output	A matrix of silhouette scores for different combination of combined scores and sorted genes for PPI network
Dependencies	dnet,igraph,cluster,STRINGdb,org.Hs.eg.db
Summary	New ppi networks were compiled using the 0.6,0.65,0.7,0.75,0.8,0.85 values as the cut-off levels for the new combined score. All networks will be subjected to random walk with restart uisng the MIEs of EDCs and decoys (benchmark set) as seeds. The 200,500,700 and 1000 top most visited genes will be extracted after random walk. A binary vector will be genrated for each compound (EDCs and decoys) with 1 representing the gene in the list of top visited genes. The pairwise jaccard distance beween the binary vector of each EDCs and decoy will be calculated. Using the jaccard dismilarity matrix and a vector representing the class of each componud as EDC or Decoy average silhouette score will be calculated for EDCs.

R Script	pareto_solution_on_tuning_results.R
Input	Results of the scripts PPI_tuner.R, Consensus_tuner.R, TG_GATEs_tuner.R , TG_GATEs_tuner.R, Drug_matrix_tuner.R
Output	The optimized genes and edges solutions for all networks based on pareto solution
Dependencies	rPref,knitr
Summary	Pareto solution will be used to select the subnetworks that (1) maiximize the distance between EDCs and decoys,(2) minimize the selected edges or maximize the combined score.

4. Random walk with restart on gene networks and fast gene set enrichment analysis (RWR-FGSEA)

R Script	RWR_FGSEA_for_edc_decoys.R
Input	The results of the scripts Drug_matrix_wTO.R, TG_Gates_wTO.R, LINCS_wTO.R, Consensus_Rat_in_vitro_wTO.R, PPI_network.R, Pathways_Download.R, EDC_Decoy_selection.R,pareto_solution_on_tuning_results.R
Output	A matrix of pathway scores for EDCs and decoys for each network
Dependencies	fgsea,dnet,igraph,BiocParallel
Summary	Random walk with restart will be performed in order to extend the initial set of EDC-MIEs retrieved from CTD. This operation is repeatd for each compound and network. Then Fast Gene Set Enrichment Analysis will be performed to identify the most responsive pathways of the genes selected with the RWR algorithm.

R Script	RWR_FGSEA_for_all_compounds_in_CTD.R
Input	The results of the scripts Drug_matrix_wTO.R, TG_Gates_wTO.R, LINCS_wTO.R, Consensus_Rat_in_vitro_wTO.R, PPI_network.R, Pathways_Download.R,MIEs_from_CTD.R ,pareto_solution_on_tuning_results.R
Output	A matrix of pathway scores for 12k chemical in CTD for each network
Dependencies	fgsea,dnet,igraph,BiocParallel
Summary	Random walk with restart will be performed in order to extend the initial set of EDC-MIEs retrieved from CTD. This operation is repeatd for each compound and network. Then Fast Gene Set Enrichment Analysis will be performed to identify the most responsive pathways of the genes selected with the RWR algorithm.

5. Training of elastic-net-based classifiers for the claassification of EDCs and decoys

R Script	manual_curation_of_pathways_as_features.R
Input	The result of the script Pathways_Download.R
Output	A list of pathways to be used in machine learning
Dependencies	No dependencies
Summary	A more restricted set of pathways will be prepared, which excludes pathways related to viral, bacterial and radiation, duplicated pathways or pathways with no genes expressed in liver.

R Script	Preparation_of_training_datasets.R
Input	The results of the scripts manual_curation_of_pathways_as_features.R,RWR_FGSEA_for_all_compounds_in_CTD.R, RWR_FGSEA_for_edc_decoys.R,EDC_Decoy_selection.R
Output	A list containing training set for each network. A list containing test set for each network
Dependencies	No dependencies
Summary	Preparation of training dataset for each toxicogenomics network will be done. Each training dataset consists of NES scores or simply patwahy activation scores, which were obtained by using the RWR-FGSEA pipeline; and a vector labels indicateing EDCs and decoys.

R Script	glm_modeling.R
Input	The output of the scripts Preparation_of_training_datasets.R, MIEs_from_CTD.R, EDC_Decoy_selection.R
Output	List objects with GLM coefficients and models for each network and MIEs level
Dependencies	caret, doParallel
Summary	Training of elastic-net-based classifiers on the computed training dataset will be done using k-fold-cross-fold validation.

R Script	k_fold_cross_validation.R
Input	The output of the scripts Preparation_of_training_datasets.R, MIEs_from_CTD.R, EDC_Decoy_selection.R
Output	List of objects including Accuracy, F1-scores, confusion matrix, specificity and sensitivity for each trained classifier
Dependencies	caret, doParallel
Summary	Repeated 5-fold cross-validation will be performed on all 15 GLM-models. (Pathway level). Repeated 5-fold cross validation will be performed on a binary data matrix of gene-compounds.(Gene level))

R Script	comparing_cross_validation_across_all_layers_ANOVA.R
Input	The output of the script k_fold_cross_validation.R
Output	list of ANOVA results between F1-scores and the boxplot of F1-scores
Dependencies	ggplot2
Summary	The F1 scores for the k-fold-cross validation will be compared using ANOVA. Boxplot will be used to represent the obtained F1 scors across all GLM models.

6. Selection of the most informative pathways by using the coeffient estimates and ROC analysis based on individual pathways.

R Script	Integration_of_coefficients_stabilties_NES_scores.R
Input	The outputs of the scripts k_fold_cross_validation.R, glm_modeling.R, Preparation_of_training_datasets.R, manual_curation_of_pathways_as_features.R
Output	CSV and excel files with GLM coefs, ROC-AUCS, mean NES scores for each pathway and each network
Dependencies	PRROC
Summary	ROC analysis will be used to verify whether single pathways, which are selected by the elastic-net process, can distinguish alone EDCs from decoys. The NES scores, elastic-net coefficients, ROC-AUCs, average of NES scores will be integrated across all data layers (suppl. data).

R Script	NES_bubble_plot_MOA.R
Input	The output of the script Integration_of_coefficients_stabilties_NES_scores.R
Output	graphical represntation of the results of Integration_of_coefficients_stabilties_NES_scores.R as bubble plot
Dependencies	ggplot2, RColorBrewer
Summary	The map of pathway activation scores and GLM coefficient for each pathway and network will be represented as bubble plot.The generated plot can be used to indicate the putative pathways as the mode of action for the EDCs.

7. Compiling EDC class probilities for 12k compounds annotated in CTD

R Script	prediction_all_compounds_class_probilities.R
Input	Output of the scripts glm_modeling.R, RWR_FGSEA_for_all_compounds_in_CTD.R
Output	Class probability of a compound to be EDC for each data layer
Dependencies	caret
Summary	GLM coefs for each network will be used to predict class probability of all compounds in CTD using their NES scores across different pathways.

R Script	ROC_TOXCAST_vs_class_probabilities.R
Input	Output of the scripts TOXCAST_nuclear_receptors_coregulators.R, prediction_all_compounds_class_probilities.R
Output	A matrix of area under the curve for ROC between class probability of each data layer and Hitcall result of ToxCast endpoints
Dependencies	PRROC
Summary	ROC analysis will be used between the class probabilty of each network and the binary in vitro experimental hitcall ToxCast assays related to nuclear receptors for each network. The most informative networks will be selected based on the results of ROC curve analysis.

R Script	Developing_Harmonic_and_average_EDC_scores.R
Input	Output of the scripts EDC_Decoy_selection.R, ROC_TOXCAST_vs_class_probabilities.R, prediction_all_compounds_class_probilities.R
Output	vectors of harmonic scores and average scores for 12K compounds in CTD
Dependencies	ggplot2, ggrepel
Summary	Average EDC score will be defined as the average of class probabilties across selected networks for each compound. Harmonic EDC score will be defined as the harmonic sum class probabilties across selected networks for each compound.

8. Prediction of EDC class probilities and EDC scores for a new compound using its MIEs

R Script	edc_score_new_comps.R
Input	A list of compounds and their corresponding MIEs
Output	Class probability of a compound to be EDC for each data layer
Dependencies	caret, fgsea, dnet, igraph
Summary	The MIEs of the compounds will be used as seeds to start random walk with restart and enrichment analysis on different networks. The resulting pathway scores will be used to calculate the EDC scores and the class probabilities for the compounds.

Part II: Evaluation and validation of EDC scores

R Script	ROC_analysis_EDC_scores_vs_TOXCAST_endpoints.R
Input	Output of the scripts Output of the scripts TOXCAST_nuclear_receptors_coregulators.R, Developing_Harmonic_and_average_EDC_scores.R, ROC_TOXCAST_vs_class_probabilities.R
Output	Matrix of Area under the curve of ROC between harmonic and average EDC scores with ToxCast Hitcall, P_values of ROC proportion test
Dependencies	PRROC, gplots
Summary	ROC analysis between EDC scores and ToxCast hitcall data for each assay endpoint will be calculated.

R Script	Comparison_of_VAM_on_DeDuCt.R
Input	https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv, Output of the scripts EDC_Decoy_selection.R, prediction_all_compounds_class_probilities.R
Output	Excel table with harmonic scores and the compounds in DeDUCT
Dependencies	xlsx
Summary	The scores for the compounds in DeDUCT will be calculated

R Script	Comparison_VAM_scores_vs_TOXPI_scores.R
Input	Suppl. file of http://dx.doi.org/10.1016/j.coph.2014.09.021, https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv, Output of the script prediction_all_compounds_class_probilities.R
Output	plot of the Toxpi scores VS EDC scores, excel files of toxpi and edc scores
Dependencies	ggplot2, ggrepel
Summary	The predicted EDC scores will be evaluated with ToxPi scores (The scores developed from ToxCast assay endpoints)

R Script	Comparison_pathway_scores_vs_TOXDB_pathway_scores.R
Input	ToxDB pathway scores for pathways aryl hydrocarbon receptor, Breaset cancer and estrogen receptor from http://toxdb.molgen.mpg.de/ , Output of the script RWR_FGSEA_for_all_compounds_in_CTD.R,
Output	CSV file with pathway scores of ToxDB and the pathway scores resutling from RWR-FGSEA
Dependencies	webchem, ggplot2, ggrepel
Summary	CHEMBL ids will be mapped to CAS ids using webchem. Comparison between pathway scores from RWR-FGSEA and ToxDB for aryl hydrocarbon receptor, Breaset cancer and estrogen receptor will be conducted.

R Script	Validation_VAM_scores_Eurion_External_set_compounds.R
Input	https://cb.imsc.res.in/deduct/images/Batch_Download/DEDuCT_ChemicalBasicInformation.csv, output of the script prediction_all_compounds_class_probilities.R
Output	Excel file and heatmap plot of EDC scores for external validation set
Dependencies	reshape2,RColorBrewer,ggplot,magrittr
Summary	Evaluation of EDC scores for known EDCs in our expert domain list and DeDuCT will be done. Calculation of accuaracy based on the scores of validation list will be performed. The validation set of compounds will be depicted as a heatmap plot

R Script	Validation_with_ToxCast_mies.R
Input	Output of the script ToxCast_dictionaries.R, MIEs_from_CTD.R,
Output	Plot of predicted EDC scores from ToxCast MIES VS EDC scores predicted from CTD MIES
Dependencies	ggplot2, ggrepel, knitr
Summary	The target genes for the compounds with positive assay endpoint in ToxCast wil be considered as the MIEs. Random walk with resetart and fast gene set enrichment analysis pipeline will be repeated starting with the MIEs from ToxCast. The obtained NES scores will be subjected to GLM models and the resulting class probabilities will be compared with the class probabilities of the MIEs annotated in CTD.

Part III: Sensitivity and enrichment analysis

R Script	dose_response_TG_GATEs_single.R
Input	Output of the scripts Developing_Harmonic_and_average_EDC_scores.R, RWR_FGSEA_for_all_compounds_in_CTD.R,Integration_of_coefficients_stabilties_NES_scores.R
Output	List of the compounds based on the patterns of increase and decrease, List of ANOVA results for pathways in different dose response scenarios, plot depictions
Dependencies	dplyr,ggplot2
Summary	The class probability profile of the compounds in TG-Gates will be categorized as 4 different groups (based on increase or decrease pattern at different doses Low, Middle and High. The significant pathways with regard to increases or decrease of pathway activation scores will be determined using ANOVA followed by post tukey test using bootstrap sampling method.

R Script	time_exposure_response_TG_GATEs_repeated.R
Input	Output of the scripts Developing_Harmonic_and_average_EDC_scores.R, RWR_FGSEA_for_all_compounds_in_CTD.R,Integration_of_coefficients_stabilties_NES_scores.R
Output	List of the compounds based on the patterns of increase and decrease, List of ANOVA results for pathways in different time of exposure scenarios, Plot depictions
Dependencies	dplyr,ggplot2
Summary	The class probability profile of the compounds in TG-Gates will be categorized as 4 different groups (based on increase or decrease pattern at different Time of exposure 8 days, 15 days and 29 days). The significant pathways with regard to increases or decrease of pathway activation scores will be determined using ANOVA followed by post tukey test using bootstrap sampling method

R Script	textmining_16_april_2020_endocrine_disruption.R
Input	The abstract results of pubmed search for endocrine disruption
Output	A data frame with entrez gene names and their frequencies across different papers
Dependencies	pubmed.mineR, magrittr
Summary	After search in pubmed for the term "endocrine disruption" the genes related to ED will be determined using text mining with pubmed.mineR package.

R Script	time_dose_plots_heatplot.R
Input	Output of the scripts time_exposure_response_TG_GATEs_repeated.R, dose_response_TG_GATEs_single.R,Pathways_Download.R
Output	Plots of time of exposure and dose for sensitivity analysis, heatplot eseential genes and pathways
Dependencies	ggpubr, RColorBrewer, ggplot2
Summary	The pathways versus genes will be represented as two heatplots (dose and time point plots).

R Script	categorization_enrichment_result.R
Input	Output of the script Integration_of_coefficients_stabilties_NES_scores.R,
Output	plots of categorization for specific pathways related to KEGG, REACTOME and wiki AOP
Dependencies	UpSetR, ggplot2, RColorBrewer
Summary	For each data layer (network) the number of selected KEGG pathways by elastic GLM will be categorized and represented in plots.For each data layer (network) the number of selected REACTOME pathways by elastic GLM will be categorized and represented in plots.

Part IV: Linking MIEs to Adverse outcomes

1. Preparation of the training benchmark set

R Script	preparation_data_set_disease_biomarker.R
Input	Output of the scripts Pathways_Download.R, EDC_Decoy_selection.R,RWR_FGSEA_for_all_compounds_in_CTD.R
Output	A list of pathway scores for each data layer with specefic pathways
Dependencies	No dependencies
Summary	The NES scores for pathways with max length of 50 genes will be selected for classification task.

R Script	chem2disease_CTD.R
Input	http://ctdbase.org/reports/CTD_chemicals_diseases.csv.gz, output of the script EDC_Decoy_selection.R,
Output	A binary matrix of chemicals and diseases with 1 representing the association of a disease and the chemical
Dependencies	xlsx
Summary	A binary matrix from chem-disease associations of the chemicals annotated in CTD will be generated.

R Script	preparation_two_class_training_set_by_disease_name.R
Input	Output of the scripts preparation_data_set_disease_biomarker.R
Output	A list object with training x data and y vector for each data layer
Dependencies	No dependencies
Summary	Training sets (NES scores and class labels) for atherosclerosis, metabolic syndrome and diabetes type 2 will be prepared.

2. Training elastic-net GLM classifier for the metabolic syndrome, diabetes type2 and atherosclerosis

R Script	metabolic_syndrome_two_class_traning_models_glm.R
Input	Output of the script preparation_two_class_training_set_by_disease_name.R
Output	A list classification models and coeffecients for each data layer
Dependencies	caret, doParallel
Summary	Training GLM model classifiers for metabolic syndrome across all 15 data layers will be conducted.

R Script	artherosclerosis_two_class_training_models_glm.R
Input	Output of the script preparation_two_class_training_set_by_disease_name.R
Output	A list classification models and coeffecients for each data layer
Dependencies	caret, doParallel
Summary	Training GLM model classifiers for atherosclerosis across all 15 data layers will be conducted.

R Script	diabetes_2_two_class_traning_models_glm.R
Input	Output of the script preparation_two_class_training_set_by_disease_name.R
Output	A list classification models and coeffecients for each data layer
Dependencies	caret, doParallel
Summary	Training GLM model classifiers for diabetes T2 across all 15 data layers will be conducted.

R Script	kfold_CV_two_class_metabolic_syndrome_glm.R
Input	Output of the script preparation_two_class_training_set_by_disease_name.R
Output	List of accuaracy metrics for the classification task for each data layer
Dependencies	caret, doParallel
Summary	Performing K-fold-Cross validation on models of metabolic syndrome for all 15 data layers

R Script	kfold_CV_two_class_artherosclerosis_glm.R
Input	Output of the script preparation_two_class_training_set_by_disease_name.R
Output	List of accuaracy metrics for the classification task for each data layer
Dependencies	caret, doParallel
Summary	Performing K-fold-Cross validation on models of atherosclerosis for all 15 data layers

R Script	kfold_CV_two_class_diabetes_2_glm.R
Input	Output of the script preparation_two_class_training_set_by_disease_name.R
Output	List of accuaracy metrics for the classification task for each data layer
Dependencies	caret, doParallel
Summary	Performing K-fold-Cross validation on models of diabetes type 2 for all 15 data layers

R Script	comparison_CV_F1_scores_all_layers_ANOVA_metabolic_syndrome.R
Input	Output of the script kfold_CV_two_class_metabolic_syndrome_glm.R
Output	A list of ANOVA results between different data layers and boxplot representation
Dependencies	plotly, dplyr, ggplot2
Summary	The F1 scores of the k-fold-cross validation will be compared using ANOVA across all 15 data layer for metabolic syndrome.Boxplot will be used to represent the obtained F1 scors across all GLM models for metabolic syndrome.

R Script	comparison_CV_F1_scores_all_layers_ANOVA_atherosclerosis.R
Input	Output of the script kfold_CV_two_class_artherosclerosis_glm.R
Output	A list of ANOVA results between different data layers and boxplot representation
Dependencies	plotly, dplyr, ggplot2
Summary	The F1 scores of the k-fold-cross validation will be compared using ANOVA across all 15 data layer for atherosclerosis. Boxplot will be used to represent the obtained F1 scors across all GLM models for atherosclerosis.

R Script	comparison_CV_F1_scores_all_layers_ANOVA_diabetes.R
Input	Output of the script kfold_CV_two_class_diabetes_2_glm.R
Output	A list of ANOVA results between different data layers and boxplot representation
Dependencies	plotly, dplyr, ggplot2
Summary	The F1 scores of the k-fold-cross validation will be compared using ANOVA across all 15 data layer for diabetes type 2. Boxplot will be used to represent the obtained F1 scors across all GLM models for diyabetes type 2.

3. Univariate analysis and integration with GLM resutls

R Script	Integration_of_glm_coefs_stabilities_NES_ROC_metabolic_syndrome.R
Input	Output of the scripts preparation_two_class_training_set_by_disease_name.R, kfold_CV_two_class_metabolic_syndrome_glm.R, metabolic_syndrome_two_class_traning_models_glm.R
Output	A data frame with stability index, AUC-ROC and GLM coef for each pathway across all data layers
Dependencies	magrittr, PRROC
Summary	For each network, ROC analysis will be used as a univariate method to evaluate the NES scores for each pathway for EDCs leading to metabolic synndrome vs edcs not leading to metabolic syndrome. The NES scores, glm coefficients ROC-AUCs, average of NES scores will be integrated across all data layers (suppl. data).

R Script	Integration_of_glm_coefs_stabilities_NES_ROC_atherosclerosis.R
Input	Output of the scripts preparation_two_class_training_set_by_disease_name.R, kfold_CV_two_class_artherosclerosis_glm.R, artherosclerosis_two_class_training_models_glm.R
Output	A data frame with stability index, AUC-ROC and GLM coef for each pathway across all data layers
Dependencies	magrittr, PRROC
Summary	For each network, ROC analysis will be used as a univariate method to evaluate the NES scores for each pathway for EDCs leading to atherosclerosis vs edcs not leading to atherosclerosis. The NES scores, glm coefficients ROC-AUCs, average of NES scores will be integrated across all data layers (suppl. data).

R Script	Integration_of_glm_coefs_stabilities_NES_ROC_diabetes_2.R
Input	Output of the scripts preparation_two_class_training_set_by_disease_name.R, kfold_CV_two_class_diabetes_2_glm.R, diabetes_2_two_class_traning_models_glm.R
Output	A data frame with stability index, AUC-ROC and GLM coef for each pathway across all data layers
Dependencies	magrittr, PRROC
Summary	For each network, ROC analysis will be used as a univariate method to evaluate the NES scores for each pathway for EDCs leading to diabetes type 2 vs edcs not leading to diabetes type 2.The NES scores, glm coefficients ROC-AUCs, average of NES scores will be integrated across all data layers (suppl. data).

R Script	bubble_plots_pathways.R
Input	The output of the scripts Integration_of_glm_coefs_stabilities_NES_ROC_diabetes_2.R, Integration_of_glm_coefs_stabilities_NES_ROC_atherosclerosis.R, Integration_of_glm_coefs_stabilities_NES_ROC_metabolic_syndrome.R
Output	Bubble plot for each disease across all data layers representing the pathways and their coeffeicients
Dependencies	ggplot2, ggarrange
Summary	The map of pathway activation scores and GLM coefficient for each pathway and network will be represented as bubble plots for the three diseases metabolic syndrome, atherosclerosis and diabetes type 2. The generated plots can be used to indicate the putative pathways linking EDCs to adverse outcome.

4. Compiling and validation of disease scores for atherosclerosis, metabolic syndrome and diabetes type 2

R Script	disease_scores.R
Input	Output of the scripts preparation_data_set_disease_biomarker.R, chem2disease_CTD.R, metabolic_syndrome_two_class_traning_models_glm.R, artherosclerosis_two_class_training_models_glm.R, diabetes_2_two_class_traning_models_glm.R
Output	CSV files with the disease scores related to chemical in CTD for each data layer, ROC aucs of the predicted scores with disease vectors in CTD
Dependencies	PRROC
Summary	For each disease (metabolic syndrome, atherosclerosis and diabetes type 2), harmonic sum and average score will be compiled across all 15 networks for 12k compounds in CTD. A ROC analysis will be performed between the predicted scores and the results in CTD

R Script	disease_scores_pie_chart.R
Input	Output of the script disease_scores.R
Output	Piechart of disease scores
Dependencies	ggplot2
Summary	For each disease (metabolic syndrome, atherosclerosis and diabetes type 2) a pie chart will be depicted to reveal distribution of disease scores across 12k compounds in CTD.

Name		Name	Last commit message	Last commit date
Latest commit History 356 Commits
functions		functions
inputData		inputData
scripts		scripts
Appendix_Supplementary_Data_Tables_Revised.xlsx		Appendix_Supplementary_Data_Tables_Revised.xlsx
README.md		README.md
all_codes_descriptions.R		all_codes_descriptions.R

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EDTOX

Description of the R scripts used in the pipeline

Part I: Development of toxicogenomics-driven predictive model to identify and prioritize compounds with ED activity

1. Preparation of MIES, pathways and training datasets

2. Compiling gene co-expression networks

3. Optimization of toxicogenomics-driven gene networks

4. Random walk with restart on gene networks and fast gene set enrichment analysis (RWR-FGSEA)

5. Training of elastic-net-based classifiers for the claassification of EDCs and decoys

6. Selection of the most informative pathways by using the coeffient estimates and ROC analysis based on individual pathways.

7. Compiling EDC class probilities for 12k compounds annotated in CTD

8. Prediction of EDC class probilities and EDC scores for a new compound using its MIEs

Part II: Evaluation and validation of EDC scores

Part III: Sensitivity and enrichment analysis

Part IV: Linking MIEs to Adverse outcomes

1. Preparation of the training benchmark set

2. Training elastic-net GLM classifier for the metabolic syndrome, diabetes type2 and atherosclerosis

3. Univariate analysis and integration with GLM resutls

4. Compiling and validation of disease scores for atherosclerosis, metabolic syndrome and diabetes type 2

About

Releases

Packages

Contributors 3

Languages

vittoriofortino84/EDTOX

Folders and files

Latest commit

History

Repository files navigation

EDTOX

Description of the R scripts used in the pipeline

Part I: Development of toxicogenomics-driven predictive model to identify and prioritize compounds with ED activity

1. Preparation of MIES, pathways and training datasets

2. Compiling gene co-expression networks

3. Optimization of toxicogenomics-driven gene networks

4. Random walk with restart on gene networks and fast gene set enrichment analysis (RWR-FGSEA)

5. Training of elastic-net-based classifiers for the claassification of EDCs and decoys

6. Selection of the most informative pathways by using the coeffient estimates and ROC analysis based on individual pathways.

7. Compiling EDC class probilities for 12k compounds annotated in CTD

8. Prediction of EDC class probilities and EDC scores for a new compound using its MIEs

Part II: Evaluation and validation of EDC scores

Part III: Sensitivity and enrichment analysis

Part IV: Linking MIEs to Adverse outcomes

1. Preparation of the training benchmark set

2. Training elastic-net GLM classifier for the metabolic syndrome, diabetes type2 and atherosclerosis

3. Univariate analysis and integration with GLM resutls

4. Compiling and validation of disease scores for atherosclerosis, metabolic syndrome and diabetes type 2

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages