easyEWAS.Rmd
easyEWAS is a user-friendly R package designed for conducting epigenome-wide association studies (EWAS). With easyEWAS, we provide a battery of statistical methods to support differential methylation position analysis across various scenarios, as well as differential methylation region analysis based on the DMRcate method. To facilitate result interpretation, we provide comprehensive functional annotation and result visualization functionalities. Additionally, a bootstrap-based internal validation was incorporated into easyEWAS to ensure the robustness of EWAS results. easyEWAS is compatible with currently most cited Illumina high-density microarray platforms for genome-wide methylation profiling, including 27K, 450K, EPIC V1, EPIC V2, and MSA arrays.
Users can install the easyEWAS package by running the following code:
devtools::install_github("ytwangZero/easyEWAS")
easyEWAS supports only .csv
and .xlsx
file
formats, so ensure your data is in one of these formats. The package
also includes internal sample and methylation datasets for users to
explore and test its functions.
Sample data: Users need to prepare a sample file to store sample information, such as sample ID, exposure variables, covariates, etc. The sample ID must be the first column, and each ID should be unique and consistent across all files.
data("sampledata", package = "easyEWAS")
head(sampledata)
#> SampleName var cov1 cov2 batch
#> 1 ZE10000001 76 2 20.96810 1
#> 2 ZE10000002 66 2 26.78373 1
#> 3 ZE10000003 71 2 28.26081 1
#> 4 ZE10000004 81 1 22.54459 1
#> 5 ZE10000005 70 1 21.66550 1
#> 6 ZE10000006 68 2 16.94033 1
Methylation data: Users also need to prepare a methylation file, requiring each row to represent a CpG site and each column to represent a sample. The data format is as follows:
data("methydata", package = "easyEWAS")
head(methydata[,1:6])
#> probe ZE10000001 ZE10000002 ZE10000003 ZE10000004 ZE10000005
#> 1 cg01444397_BC11 0.04768038 0.02783589 0.02852848 0.06711713 0.04478172
#> 2 cg12110931_TC11 0.43900591 0.43166163 0.51290795 0.45075799 0.59972834
#> 3 cg12258811_TC21 0.41950433 0.47744150 0.52750836 0.53600388 0.64258103
#> 4 cg08764927_TC21 0.82080027 0.82292072 0.88725030 0.87222326 0.91267532
#> 5 cg25514503_TC21 0.94965507 0.95133812 0.95787150 0.96158814 0.96021148
#> 6 cg24454741_TC21 0.88100024 0.89587378 0.90661704 0.88505037 0.86547789
library(easyEWAS)
Here, take the easyEWAS internal dataset as an example. After loading
easyEWAS, the user needs to use the initEWAS()
function to
initialize an analysis module. This will generate a folder named
EWASresult under the working path specified by the user to
store the EWAS analysis results. If default, the results will be stored
in the current working path. For detailed information on these
parameters, please refer to initEWAS
documentation.
res <- initEWAS(outpath = "default")
#> Complete initializing the EWAS module! Your EWAS results will be stored in D:/R/Rwork/Rpackage/easyEWAS/vignettes/EWASresult
#> 2025-01-03 14:34:07.105257
#> 0.14 sec elapsed
When specifying a custom storage path, make sure that the path actually exists.
res <- initEWAS(outpath = "D:/test")
Note: The function returns an R6 object integrating all
information, so the user needs to assign the result of the function to a
new name, such as res
.
Users can upload sample data and methylation data in two ways.
One is to assign the data set in the R environment directly to the
parameters ExpoData
and MethyData
. If the user
chooses to read data from external sources, the file paths need to be
specified in the parameters ExpoPath
and
MethyPath
.Detailed information can be found at loadEWAS
documentation.
# Here we read data from the environment.
# Or we can source by specifying file paths:
# ExpoPath = 'D:/DATA/dat1.csv', MethyPath = 'D:/DATA/dat2.csv'
res <- loadEWAS(input = res,
ExpoData = sampledata,
MethyData = methydata)
#> Complete loading All files!
#> 2025-01-03 14:34:07.176157
#> 0 sec elapsed
Note the input
parameter, which needs to be
assigned the res generated in the previous step!
This is an optional step where the user can transform certain variables in the sample dataset to a desired type. The function supports both factor and numeric types. Detailed information can be found at transEWAS documentation.
res <- transEWAS(input = res, Vars = "cov1",TypeTo = "factor")
#> Variable types have been transformed successfully!
#> 0.01 sec elapsed
This step is optional. Batch effect correction is based on the
ComBat function from the sva package. The sample data provided by the
user must include known batch information. Before correcting for batch
effects, the user can specify the name of the batch in the batch
parameter, as well as the names of the variates of interest and other
covariates besides batch in the adjustVarparameter to avoid
overcorrection. The methylation data stored in
res$Data$Methy
will ultimately be replaced with a set of
corrected DNA methylation values, with batch effects effectively
removed. Detailed information can be found at batchEWAS
documentation.
res <- batchEWAS(input = res, adjustVar = "var", batch = "batch")
#> Start to adjust batch effects, please be patient ...
#> Batch effect adjustment has been completed !
#> You can find the adjusted methylation data in input$Data$Methy.
#> 2025-01-03 14:34:08.907541
#> 1.64 sec elapsed
If plot = TRUE
, prior plots will be given with black as
a kernel estimate of the empirical batch effect density and red as the
parametric.
This is the core step of the analysis. The user can select
models based on the analysis requirements, including linear regression
models, linear mixed effects models, and Cox proportional risk models.
The results can be exported to a file in working directory. For a
detailed explanation of each parameter, please refer to the package
manual. The startEWAS
function supports parallel
operations, and the number of logical cores required for operations can
be specified in the core
parameter. Detailed information
can be found at startEWAS
documentation.
parallel::detectCores()-1 # Maximum recommended number of cores
#> [1] 17
res <- startEWAS(input = res,
chipType = "EPICV2", # "EPICV1", "450K", "27K", "MSA"
model = "lm", # "lm", "lmer", "cox"
expo = "var", # exposure variable
cov = "cov1,cov2", # covariates
adjustP = TRUE, # adjust p-values (FDR and Bonferroni)
random = NULL, # set only for lmer
time = NULL, # set only for Cox
status = NULL, # set only for Cox
core = 10 # set cores for parallel operations
)
#> It will take some time, please be patient...
#> Start the EWAS analysis...
#> Start multiple testing correction ...
#> Start CpG sites annotation ...
#> EWAS analysis has been completed!
#> You can find results in D:/R/Rwork/Rpackage/easyEWAS/vignettes/EWASresult.
#> 2025-01-03 14:34:34.437661
#> 25.42 sec elapsed
The results can be viewed in res$result
.For linear
regression and linear mixed-effects models, when the exposure variable
is continuous, the output includes the coefficient (per unit, per IQR,
per SD), standard error, significance P-value, adjusted P-value, and
annotation for each site. For categorical variables, coefficients are
provided for each class relative to the reference group. In Cox
proportional hazards models, hazard ratios (HR) and confidence intervals
are also included.
head(res$result)
#> probe BETA BETA_perSD BETA_perIQR SE
#> 1 cg01444397_BC11 5.724133e-04 5.345447e-03 0.0087293025 1.594601e-04
#> 2 cg12110931_TC11 2.012009e-03 1.878903e-02 0.0306831439 6.058078e-04
#> 3 cg12258811_TC21 1.917825e-04 1.790949e-03 0.0029246826 5.173171e-04
#> 4 cg08764927_TC21 -3.726331e-05 -3.479812e-04 -0.0005682655 4.552721e-04
#> 5 cg25514503_TC21 6.708728e-06 6.264905e-05 0.0001023081 8.123277e-05
#> 6 cg24454741_TC21 -6.095697e-05 -5.692430e-04 -0.0009295938 1.282312e-04
#> SE_perSD SE_perIQR PVAL FDR Bonfferoni chr pos
#> 1 0.0014891086 0.002431767 0.0005237102 0.1059711 0.645211 1 7784390
#> 2 0.0056572995 0.009238568 0.0012686133 0.1214434 1.000000 1 7784577
#> 3 0.0048309345 0.007889085 0.7116590121 0.9001683 1.000000 1 7784675
#> 4 0.0042515312 0.006942899 0.9349376448 0.9720196 1.000000 1 7784835
#> 5 0.0007585874 0.001238800 0.9343523148 0.9720196 1.000000 1 7785010
#> 6 0.0011974795 0.001955526 0.6356047918 0.8722169 1.000000 1 7785177
#> relation_to_island
#> 1 Island
#> 2 Island
#> 3 Island
#> 4 Island
#> 5 Island
#> 6 Shore
#> gene
#> 1 PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3
#> 2
#> 3 PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3
#> 4 PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3;PER3
#> 5
#> 6
#> location
#> 1 5UTR;exon_1;5UTR;exon_1;5UTR;exon_1;5UTR;exon_1;TSS200
#> 2
#> 3 5UTR;exon_2;5UTR;exon_2;5UTR;exon_2;5UTR;exon_2;5UTR;exon_2;5UTR;exon_2;5UTR;exon_2
#> 4 5UTR;exon_2;5UTR;exon_2;5UTR;exon_2;5UTR;exon_2;5UTR;exon_2;5UTR;exon_2;5UTR;exon_2
#> 5
#> 6
Visualize EWAS results based on the CMplot package,
including Manhattan plots, QQ plots, etc. The QQ plot is used to test
for genomic inflation, while the Manhattan plot is used to display EWAS
results. The user decides which to use for plotting based on the
parameter p
: “PVAL”, “FDR”, or “Bonferroni”. Additionally,
the threshold
parameter is used to plot a threshold line on
the Manhattan plot. Detailed information can be found at plotEWAS
documentation.
res <- plotEWAS(input = res,
file = "pdf", # "jpg", "png", "tif", "pdf"
p = "PVAL", # "PVAL", "FDR", "Bonfferoni"
threshold = 0.05)
#> EWAS Visualization has completed!
#> You can find results in D:/R/Rwork/Rpackage/easyEWAS/vignettes/EWASresult.
#> 2025-01-03 14:34:34.696527
#> 0.17 sec elapsed
Here are the example plots of the EWAS results.
Bootstrap is a statistical resampling technique used for internal
validation and estimation of uncertainty in statistical models. The
parameterbootCI
refers to the type of confidence interval
calculation methods, including “norm”, “basic”, “stud”, “perc” (the
default), and “bca”.Users can select CpG sites to validate in two ways.
One is to set the P-value column (PVAL, FDR, Bonfferoni) to be filtered
and its filtering cutoff value (default is 0.05).
res <- bootEWAS(input = res,
filterP = "PVAL", # FDR, Bonferroni
cutoff = 0.001,
bootCI = "perc",
times = 100)
#> Bootstrap analysis for cg01444397_BC11...
#> Bootstrap analysis for cg26269881_BC21...
#> Bootstrap analysis for cg08693699_BC21...
#> Bootstrap analysis for cg14151354_BC21...
#> Bootstrap analysis for cg13515269_TC21...
#> Bootstrap analysis for cg16047471_BC21...
#> Bootstrap analysis for cg00200653_TC11...
#> Bootstrap analysis for cg04162685_BC21...
#> Bootstrap analysis for cg24867447_BC21...
#> Bootstrap analysis for cg25069388_BC21...
#> Bootstrap for internal validation has been completed !
#> You can find results in D:/R/Rwork/Rpackage/easyEWAS/vignettes/EWASresult.
#> 2025-01-03 14:34:35.688672
#> 0.89 sec elapsed
The function ultimately provides the original statistical values for
each CpG site alongside their bootstrap-derived confidence intervals.
The result is stored in res$bootres
.
head(res$bootres)
#> # A tibble: 6 × 4
#> probe original lower_CI upper_CI
#> <chr> <dbl> <dbl> <dbl>
#> 1 cg01444397_BC11 0.000572 0.000130 0.000977
#> 2 cg26269881_BC21 -0.00232 -0.00380 -0.000909
#> 3 cg08693699_BC21 -0.00113 -0.00168 -0.000397
#> 4 cg14151354_BC21 -0.000746 -0.00121 -0.000313
#> 5 cg13515269_TC21 -0.00135 -0.00224 -0.000679
#> 6 cg16047471_BC21 -0.00183 -0.00279 -0.000805
Another way is to specify the names of the sites to be validated
directly in the CpGs
parameter. Here is an example (Not
run). Detailed information can be found at bootEWAS
documentation.
res <- bootEWAS(input = res,
CpGs = "cg01444397_BC11,cg08693699_BC21,cg16047471_BC21",
bootCI = "perc",
times = 100)
easyEWAS provides two enrichment analysis methods (GO and KEGG) based
on the clusterProfiler package. Users can filter the
sites and their corresponding genes for enrichment analysis using the
filterP
and cutoff
parameters.The enrichment
methods include Gene Ontology (GO) and Kyoto Encyclopedia of Genes and
Genomes (KEGG). Within GO, there are three categories: Biological
Process (BP), Molecular Function (MF), and Cellular Component (CC).
Users can choose one or all categories for enrichment analysis, with BP
set as the default. Detailed information can be found at enrichEWAS
documentation.
res <- enrichEWAS(input = res,
method = "GO",
filterP = "PVAL", # "PVAL", "FDR", "Bonfferoni"
cutoff = 0.01,
ont = "BP", # "BP", "MF", "CC", "ALL"
plot = TRUE,
plotType = "dot", # "bar", "dot"
plotcolor = "pvalue", # "pvalue" or "p.adjust" or "qvalue"
showCategory = 10)
#> It will take some time, please be patient...
#> Start enrichment analysis ...
#> Start result visualization ...
#> Saving 7.29 x 4.51 in image
#> Enrichment analysis has been completed !
#> You can find results in D:/R/Rwork/Rpackage/easyEWAS/vignettes/EWASresult.
#> 2025-01-03 14:34:55.441067
#> 19.61 sec elapsed
The analysis results are stored in res$enrichresult
.
head(res$enrichres)
#> ID
#> GO:0048511 GO:0048511
#> GO:0032922 GO:0032922
#> GO:0018107 GO:0018107
#> GO:0018210 GO:0018210
#> GO:2000058 GO:2000058
#> GO:0007623 GO:0007623
#> Description
#> GO:0048511 rhythmic process
#> GO:0032922 circadian regulation of gene expression
#> GO:0018107 peptidyl-threonine phosphorylation
#> GO:0018210 peptidyl-threonine modification
#> GO:2000058 regulation of ubiquitin-dependent protein catabolic process
#> GO:0007623 circadian rhythm
#> GeneRatio BgRatio RichFactor FoldEnrichment zScore
#> GO:0048511 3/5 297/18888 0.010101010 38.15758 10.502803
#> GO:0032922 2/5 70/18888 0.028571429 107.93143 14.584739
#> GO:0018107 2/5 96/18888 0.020833333 78.70000 12.419415
#> GO:0018210 2/5 106/18888 0.018867925 71.27547 11.806381
#> GO:2000058 2/5 166/18888 0.012048193 45.51325 9.373424
#> GO:0007623 2/5 205/18888 0.009756098 36.85463 8.399045
#> pvalue p.adjust qvalue geneID Count
#> GO:0048511 3.759965e-05 0.009099115 0.003285022 PER3/BHLHE41/CSNK2A1 3
#> GO:0032922 1.344212e-04 0.016264968 0.005872085 PER3/BHLHE41 2
#> GO:0018107 2.531146e-04 0.018668599 0.006739860 PRKAG2/CSNK2A1 2
#> GO:0018210 3.085719e-04 0.018668599 0.006739860 PRKAG2/CSNK2A1 2
#> GO:2000058 7.545427e-04 0.036519868 0.013184641 CSNK1A1/CSNK2A1 2
#> GO:0007623 1.147290e-03 0.037627614 0.013584567 PER3/BHLHE41 2
Additionally, the function can visualize the enrichment analysis
results, including both dotplot and barplot. The colors in the plot can
be based on the values of pvalue, p.adjust, or qvalue, with pvalue being
the default. Here, we can obtain a bubble plot of GO enrichment
analysis.
The easyEWAS package integrates the DMRcate method
to support DMR analysis. Similar to DMP analysis, users need to specify
the exposure variable and covariates for DMR analysis, as well as the
chip type and methylation value type (β-values or M-values). It is
recommended to use the default settings for parameters such as
lambda
, C
, fdrCPG
,
pcutoff
, and min.cpgs
for DMR selection. Here,
since there are too few significant DMRs, the pcutoff
is
set to 0.05 (Default to “fdr”). For detailed information on these
parameters, please refer to dmrEWAS
documentation.
res <- dmrEWAS(input = res,
chipType = "EPICV2",
what = "Beta",
expo = "var",
cov = "cov1,cov2",
genome = "hg38",
lambda=1000,
C = 2,
pcutoff = 0.05
)
#> Filtering out position replicates from an EPICv2 beta- or M-matrix. Please be patient...
#> Starting the differentially methylated region analysis. Please be patient...
#> Differentially Methylated Region analysis has been completed!
#> You can find results in D:/R/Rwork/Rpackage/easyEWAS/vignettes/EWASresult.
#> 2025-01-03 14:35:52.345173
#> 56.8 sec elapsed
The result is stored in res$dmrres
.
head(res$dmrres)
#> seqnames start end width strand no.cpgs min_smoothed_fdr
#> 1 chr12 26121435 26121868 434 * 3 0.0006926403
#> 2 chr1 7826573 7827921 1349 * 10 0.0053198587
#> 3 chr1 7784390 7784835 446 * 4 0.0053198587
#> 4 chr3 4981365 4981709 345 * 3 0.0054574766
#> 5 chr2 100924866 100925130 265 * 5 0.0081198870
#> 6 chr15 60591026 60591370 345 * 2 0.0081198870
#> Stouffer HMFDR Fisher maxdiff meandiff
#> 1 2.827015e-04 0.002443061 0.0001300913 -0.001834445 -0.0011539964
#> 2 6.166338e-06 0.039505450 0.0001300913 -0.002527727 -0.0012950516
#> 3 4.259913e-03 0.003212979 0.0007271626 0.002012009 0.0006847355
#> 4 3.654183e-04 0.003212979 0.0007015946 -0.002322104 -0.0015218217
#> 5 3.580945e-04 0.016418020 0.0007015946 -0.002072670 -0.0013957993
#> 6 1.749110e-03 0.003092856 0.0013320402 -0.001368543 -0.0006471749
#> overlapping.genes
#> 1 BHLHE41
#> 2 PER3, Z98884.1
#> 3 PER3
#> 4 BHLHE40
#> 5 NPAS2
#> 6 RORA-AS1, RORA
1. Hall P (1992). The Bootstrap and Edgeworth
Expansion. Springer, New York. ISBN 9781461243847.
2. https://CRAN.R-project.org/package=CMplot
3. Yu, G., Wang, L. G., Han, Y., & He, Q. Y.
(2012). clusterProfiler: an R package for comparing biological themes
among gene clusters. Omics : a journal of integrative biology, 16(5),
284–287. https://doi.org/10.1089/omi.2011.0118
4. Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe,
A. E., & Storey, J. D. (2012). The sva package for removing batch
effects and other unwanted variation in high-throughput experiments.
Bioinformatics (Oxford, England), 28(6), 882–883. https://doi.org/10.1093/bioinformatics/bts034
5. Peters, T. J., Buckley, M. J., Statham, A. L.,
Pidsley, R., Samaras, K., V Lord, R., Clark, S. J., & Molloy, P. L.
(2015). De novo identification of differentially methylated regions in
the human genome. Epigenetics & chromatin, 8, 6. https://doi.org/10.1186/1756-8935-8-6