The code in this repository reflects the steps used to pre-process data for machine learning and to train models that predict PTSD and create a methylation risk score, as published in BMC Medical Genomics. Also provided are the final weights and features for each of the three published risk scores.
Key Instruction: The weights and features for each of the three published risk scores are located in the Data folder. The files are named as follows:
- eMRS_Model1.xlsx
- MoRS_Model2.xlsx
- MoRSAE_Model3.xlsx
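
As a rough illustration of how a set of published weights and features might be applied to a new sample, the sketch below computes a score as a weighted sum of feature values. This is an assumption about the general form of such scores, not a description of the files above: the CpG names, the weights, the sample values, and the `risk_score` helper are all hypothetical, and the column layout of the spreadsheets is not assumed.

```python
# Hypothetical weights: feature name -> model coefficient.
# Real values would come from one of the spreadsheets in the Data folder;
# their column layout is not documented here, so none is assumed.
weights = {"cg00000001": 0.8, "cg00000002": -0.5, "cg00000003": 0.25}


def risk_score(sample, weights, intercept=0.0):
    """Return intercept + sum(weight * value) over the model's features."""
    missing = [name for name in weights if name not in sample]
    if missing:
        # Fail loudly instead of silently scoring with absent features.
        raise KeyError(f"sample is missing features: {missing}")
    return intercept + sum(w * sample[name] for name, w in weights.items())


# Hypothetical methylation values for one individual.
sample = {"cg00000001": 0.5, "cg00000002": 0.2, "cg00000003": 0.4}
score = risk_score(sample, weights)
print(f"risk score: {score:.3f}")
```

Raising an error on missing features is a deliberate choice in this sketch; a real pipeline might impute them instead.
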
- install_needed_packages.R: install required packages.
- DNHS_more_pheno.R: get the required variables for DNHS.
- Smoking_Scores_PGC_cohorts.R: estimates smoking scores for each individual discovery cohort.
- MRS_Preprocess.R: pre-process Marine Resilience Study cohort to include in the training.
- Armystarrs_and_PRISMO_preprocess.R: to pre-process Army STARRS and PRISMO cohorts pre-post deployment samples to test risk scores.
- Check_after_updating_pheno.R and Check_after_updating_pheno.html: to check the updated phenotype file with the old file.
- cpgassoc2.R: helper function to perform association analysis between each CpG and PTSD.
- Covariate_adjustment_1.R: example code to show covariate adjustment. paper as we thought to make it an Epic data paper.
- Compare_Effect_Sizes.Rmd and Compare_Effect_Sizes.html: code to compare the effect sizes of discovery and Boston VA cohort for model1.
- Demographics.R: code to get demographic information for the manuscript.
- Cohort_Information.Rmd and Cohort_Information.html: code to get summary information from different cohorts, e.g., variables in each cohort to check data availability.
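
One common way to adjust a measurement for a covariate is to regress it on the covariate and keep the residuals. The sketch below shows that idea for a single covariate with ordinary least squares. It is only an illustration of the general technique: the method used in Covariate_adjustment_1.R is not described here, and the `residualize` helper and the example values are hypothetical.

```python
from statistics import mean


def residualize(values, covariate):
    """Regress values on one covariate (OLS) and return the residuals."""
    if len(values) != len(covariate):
        raise ValueError("values and covariate must have the same length")
    x_bar, y_bar = mean(covariate), mean(values)
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    if sxx == 0:
        raise ValueError("covariate is constant; nothing to adjust for")
    slope = sum(
        (x - x_bar) * (y - y_bar) for x, y in zip(covariate, values)
    ) / sxx
    intercept = y_bar - slope * x_bar
    # Residual = observed value minus the part explained by the covariate.
    return [y - (intercept + slope * x) for x, y in zip(covariate, values)]


# Hypothetical data: methylation at one CpG and age for five individuals.
age = [25, 32, 41, 50, 58]
methylation = [0.40, 0.43, 0.47, 0.52, 0.55]
adjusted = residualize(methylation, age)
```

With several covariates the same idea applies, but the fit becomes a multiple regression, which is usually done with a statistics library.
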
- makedirectory.py: makes a directory to store the output files from each run.
- Settings.ipynb: contains settings for packages and plots.

1. Preprocess_data_updated_1.ipynb: pre-processes all cohorts individually for machine learning.
2. pre_post_trauma_processing_v1.ipynb: pre-processes the cohorts with pre/post samples and chooses post-trauma samples for machine learning.
3. Imputation_Covariate_adjustment_2.1.ipyn: performs imputation and covariate adjustment.
4. Imputation_Covariate_adjustment_including_Expo_vaiables_2.1.ipynb: performs imputation and covariate adjustment, including exposure variables.
5. Feature_Selection_and_training_on_ptsdpm_3.3.ipynb: feature selection using the covariate-adjusted data (output of step 3).
6. Feature_Selection_and_training_on_ptsdpm_wd_exp_vars_adjustment_3.3.ipynb: feature selection using the covariate-adjusted data for exposure variables (input is step 4 output).
7. model_performance_5.5.ipynb: runs the model and evaluates its performance (input is step 5 output).
8. model_performance_wd_exp_vaars_adjustment_5.5.ipynb: runs the model and evaluates its performance with adjusted exposure variables (input is step 6 output).
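
For readers setting up a similar workflow, a per-run output directory can be created along the following lines. This is a minimal sketch and not the contents of makedirectory.py: the `make_run_dir` helper, the `results` base folder, and the timestamped naming scheme are all assumptions made for illustration.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(base="results", prefix="run"):
    """Create and return a timestamped directory for one run's output files."""
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    run_dir = Path(base) / f"{prefix}_{stamp}"
    # parents=True creates the base folder if needed; exist_ok=True avoids an
    # error when two calls land on the same second.
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir
```

Calling `make_run_dir()` at the start of a run and writing every output file under the returned path keeps the results of different runs separate.
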
- downstream_analysis_v5.qmd: estimates risk scores for models 1 and 2 and tests the risk scores using the test set in the discovery cohorts. downstream_analysis_v5.html is the generated report. In steps 2 and 3, we test various data sets, such as the test set, civilians, military, and males and females, to look at various scenarios.
- downstream_analysis_adj_for_Exp_Vars_v5.qmd: estimates and tests risk scores using model 3 on the test data set. downstream_analysis_adj_for_Exp_Vars_v5.html is the generated report.
- Test_RiskScores_with&without_exp_vars_wd_logit_6.Rmd: a clean version of estimating and testing risk scores. It uses the point-biserial correlation between binary and continuous variables, and a logit model to predict PTSD from the risk scores. Test_RiskScores_with&without_exp_vars_wd_logit_6.html is the generated report. This file was used to generate density, distribution and correlation plots for the discovery cohorts.
- Pre_Post_Deployment_eMRS.qmd and Pre_Post_Deployment_eMRS.html: test risk scores pre- and post-deployment.
- Enrichment_analysis_1.qmd: performs enrichment analysis of top CpGs from models 1, 2, and 3. Models 1 and 2 have the same set of CpGs.
- CpGs_in_previous_studies&ML.R: code to find overlap between identified significant CpGs and previous studies.
- Overlap_between_MRS_CpGs_metaanalysis_CpGs_Freeze3.R: checks overlap between identified significant CpGs and the PGC EWAS meta-analysis and Freeze3 genes.
- mQTL.qmd and mQTL.html: compare significant CpGs with BIOS QTL browser CpGs.
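
The point-biserial correlation mentioned above measures the association between a binary variable (such as PTSD status) and a continuous one (such as a risk score). The sketch below implements the textbook formula in plain Python to show what the statistic computes. It is not taken from the reports in this repository, and the `point_biserial` helper and the example values are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(labels, scores):
    """Point-biserial correlation between 0/1 labels and continuous scores."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    if any(label not in (0, 1) for label in labels):
        raise ValueError("labels must be coded as 0 or 1")
    group1 = [s for label, s in zip(labels, scores) if label == 1]
    group0 = [s for label, s in zip(labels, scores) if label == 0]
    n1, n0, n = len(group1), len(group0), len(scores)
    if n1 == 0 or n0 == 0:
        raise ValueError("both groups need at least one observation")
    sd = pstdev(scores)  # population standard deviation of all scores
    if sd == 0:
        raise ValueError("scores are constant; correlation is undefined")
    # r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n^2)
    return (mean(group1) - mean(group0)) / sd * sqrt(n1 * n0 / n**2)


# Hypothetical data: PTSD status (1 = case) and a risk score per individual.
ptsd = [0, 0, 0, 1, 1, 1]
risk = [0.1, 0.2, 0.3, 0.4, 0.6, 0.8]
r = point_biserial(ptsd, risk)
print(f"point-biserial r = {r:.3f}")
```

This statistic is numerically identical to the Pearson correlation computed with the binary variable coded as 0 and 1, which is a convenient way to cross-check a result.
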
- Create_sample_data.R and Create_sample_data without exp vars.R: code to create sample data with and without exposure variables as an example for external cohorts.
- Covariate_Adj_RiskScores_1.R and Covariate_Adj_RiskScores_without_exp_vars_1.R: code to estimate risk scores with and without exposure variables, respectively.
- Test_RiskScores_with&without_exp_vars_wd_logit_2.Rmd: code to test risk scores and generate plots.