Mendelian Randomization Analysis

Mendelian Randomization (MR) Analysis: Principles, Advantages, Limitations and Practical R Implementation

Mendelian Randomization (MR) analysis is a powerful epidemiological method that uses genetic variants (single nucleotide polymorphisms, SNPs) as instrumental variables (IVs) to infer the causal relationship between an exposure and an outcome. It mitigates the confounding and reverse causality that limit traditional observational studies, and it has been widely used to study causal associations between complex traits (e.g., allergic diseases) and diseases (e.g., cancer), as demonstrated in research on allergic rhinitis (AR) and cancer risk. This article covers the core principles, advantages, and practical limitations of MR analysis, and provides a detailed, reproducible R code implementation based on a two-sample MR study of AR and cancers.

1. Core Principles of MR Analysis

MR analysis is based on Mendel's laws of segregation and independent assortment: alleles are randomly allocated from parents to offspring at conception, so their distribution in the population is not affected by environmental factors, lifestyle, or disease status (the outcome). This randomness makes genetic variants suitable instrumental variables to link exposure and outcome. Causal inference in MR analysis must satisfy three basic assumptions (the “golden rule” of MR), which are also emphasized in the AR-cancer association study:
 
  1. Relevance assumption: IVs (genetic variants) are significantly associated with the exposure factor (e.g., SNPs are closely linked to the genetic susceptibility of AR).
  2. Independence assumption: IVs are independent of any confounding factors that affect both the exposure and the outcome (e.g., SNPs for AR are not associated with smoking, age, etc., which affect cancer risk).
  3. Exclusion restriction (exclusivity) assumption: IVs affect the outcome only through the exposure, with no other independent causal pathway (e.g., AR-related SNPs affect cancer risk only through AR, not through other biological mechanisms).
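As a rough illustration of why these assumptions matter, the base R sketch below simulates one variant G, an unmeasured confounder U, an exposure X and an outcome Y, then compares a naive regression with the Wald ratio (the simplest IV estimator). All variable names, effect sizes and the sample size are hypothetical and are not taken from the AR-cancer study.

```r
# Toy simulation: one valid instrument G and an unmeasured confounder U.
# All numbers are illustrative assumptions.
set.seed(2024)
n <- 50000
true_effect <- 0.3                      # causal effect of X on Y

G <- rbinom(n, size = 2, prob = 0.3)    # allele count, independent of U
U <- rnorm(n)                           # unmeasured confounder
X <- 0.5 * G + U + rnorm(n)             # relevance: G is associated with X
Y <- true_effect * X + U + rnorm(n)     # exclusion: G reaches Y only via X

# Naive observational estimate: biased because U drives both X and Y
ols_estimate <- unname(coef(lm(Y ~ X))["X"])

# Wald ratio: (G-Y association) divided by (G-X association)
beta_gx <- unname(coef(lm(X ~ G))["G"])
beta_gy <- unname(coef(lm(Y ~ G))["G"])
wald_ratio <- beta_gy / beta_gx

cat("Naive OLS estimate:", round(ols_estimate, 3), "\n")
cat("Wald ratio (IV):   ", round(wald_ratio, 3), "\n")
```

In this setup the naive estimate is pulled away from 0.3 by the confounder, while the Wald ratio stays close to it because G is independent of U.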
 

Two-sample MR: The Most Commonly Used MR Design

 
The study of AR and cancers adopts the two-sample MR design, which is the most widely used form of MR analysis in current research. Its core is to use summary data of genome-wide association studies (GWAS) from two independent populations for exposure and outcome respectively:
 
  • The exposure GWAS dataset: Contains genetic variation and exposure phenotype data (e.g., AR GWAS data from 112,583 European participants).
  • The outcome GWAS dataset: Contains genetic variation and outcome phenotype data (e.g., breast cancer, lung cancer GWAS data from different European cohorts).
 
When the exposure and outcome samples do not overlap, this design avoids the bias caused by cohort overlap and improves the reliability of results. It can also make full use of public GWAS databases (e.g., IEU OpenGWAS, GWAS Catalog), which reduces research costs and improves research efficiency.
[Figure: MR assumptions]

2. Key Advantages of MR Analysis

Compared with traditional observational studies (e.g., cohort study, case-control study), MR analysis has unique methodological advantages, which are fully reflected in the practical application of the AR-cancer study:
 

2.1 Avoid confounding bias fundamentally

 
Genetic variants are randomly inherited at conception, and their distribution is not affected by postnatal confounding factors (e.g., lifestyle, environmental exposure, disease progression). Provided the MR assumptions hold, this largely removes the confounding bias that affects traditional observational studies and makes causal inference about exposure and outcome more reliable. For example, in the study of AR and cancer, factors such as smoking and age, which affect both allergic diseases and cancer risk, should not distort the genetic association between AR and cancer.
 

2.2 Resolve the problem of reverse causality

 
In traditional observational studies, it is often difficult to distinguish the causal direction (e.g., whether AR causes cancer, or cancer leads to immune abnormalities and then AR). MR analysis uses genetic variants that are determined at birth as IVs, and the exposure phenotype is usually expressed after birth, which naturally avoids reverse causality. In addition, the study also uses the Steiger Test to further verify the causal direction of SNPs and phenotypes, and exclude the possibility of reverse causality.
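The idea behind Steiger filtering can be sketched in a few lines of base R: a SNP that acts on the exposure first should explain more variance in the exposure than in the outcome. The sketch below is a simplification that assumes continuous traits and uses hypothetical summary statistics and sample sizes. The full test also compares the two correlations formally, and binary traits need a different conversion to variance explained. In TwoSampleMR the corresponding functions are `steiger_filtering()` and `directionality_test()`.

```r
# Convert summary statistics to variance explained (r^2) for a continuous
# trait, using t = beta / se and r^2 = t^2 / (t^2 + n - 2).
r2_from_summary <- function(beta, se, n) {
  t2 <- (beta / se)^2
  t2 / (t2 + n - 2)
}

# Hypothetical summary statistics for three SNPs
snps <- data.frame(
  SNP      = c("rs_a", "rs_b", "rs_c"),
  beta_exp = c(0.050, 0.040, 0.010),
  se_exp   = c(0.008, 0.007, 0.006),
  beta_out = c(0.010, 0.008, 0.045),
  se_out   = c(0.009, 0.009, 0.008)
)
n_exp <- 100000   # assumed exposure GWAS sample size
n_out <- 80000    # assumed outcome GWAS sample size

snps$r2_exp <- r2_from_summary(snps$beta_exp, snps$se_exp, n_exp)
snps$r2_out <- r2_from_summary(snps$beta_out, snps$se_out, n_out)

# Keep SNPs that explain more variance in the exposure than in the outcome
snps$direction_ok <- snps$r2_exp > snps$r2_out
print(snps[, c("SNP", "r2_exp", "r2_out", "direction_ok")])
```

Here `rs_c` would be flagged, because it explains more variance in the outcome than in the exposure.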
 

2.3 Make full use of public GWAS resources

 
With the development of genome-wide association research, a large amount of high-quality GWAS summary data has been made public (e.g., IEU OpenGWAS, GWAS Catalog, FinnGen). Two-sample MR can use these data directly without collecting individual-level data, which greatly reduces the difficulty of sample collection and shortens the research cycle. The AR-cancer study uses only public GWAS data, covering more than 100,000 participants for the exposure and thousands to hundreds of thousands for the outcomes.
 

2.4 Support multi-method verification and improve result reliability

 
MR analysis has a variety of analytical methods and sensitivity analysis strategies. It can use the main method for primary analysis and multiple robust methods for verification, and eliminate the impact of abnormal IVs through sensitivity analysis. For example, the AR-cancer study uses not only traditional methods (IVW, MR-Egger, Weighted median) but also four novel methods (cML-MA, ConMix, MR-RAPS, dIVW) to cross-verify the results, ensuring the robustness of causal inference.

3. Limitations of MR Analysis (Methodological + Practical Application)

Although MR analysis has obvious advantages, it is not a “perfect” method, and there are both inherent methodological limitations and practical application limitations in actual research. The AR-cancer study also clearly points out these limitations and their impacts on the research results.
 

3.1 Inherent methodological limitations

 
(1) Risk of weak instrumental variables
 
IVs that are only weakly associated with the exposure are called weak IVs, and they lead to biased causal effect estimates. The study uses the F-statistic to evaluate the strength of IVs; by convention, an F-statistic < 10 indicates a weak IV. All SNPs selected in the AR-cancer study have an F-statistic > 10, which indicates that the IVs are strong. This check is essential and easily overlooked in practical research.
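A quick way to screen for weak IVs from summary statistics alone is the widely used per-SNP approximation F ≈ (β/SE)². This is not necessarily the formula used in the AR-cancer study, and the SNP names and values below are hypothetical.

```r
# Per-SNP approximate F-statistic from SNP-exposure summary statistics
iv <- data.frame(
  SNP  = c("rs_1", "rs_2", "rs_3"),
  beta = c(0.045, 0.030, 0.012),   # SNP-exposure effect estimates
  se   = c(0.006, 0.005, 0.005)    # their standard errors
)

iv$F_approx <- (iv$beta / iv$se)^2  # squared z-score of the association
iv$weak <- iv$F_approx < 10         # conventional weak-instrument threshold
print(iv)
```

With these inputs only `rs_3` falls below the threshold and would be dropped before the MR analysis.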
 
(2) Horizontal pleiotropy
 
Horizontal pleiotropy refers to IVs affecting the outcome through non-exposure pathways, which violates the exclusivity assumption of MR analysis and is the main source of bias in MR. Although the study uses MR-Egger-intercept, MR-PRESSO and other methods to detect and correct horizontal pleiotropy (all P-values > 0.05 in the study, indicating no significant pleiotropy), these methods have certain limitations (e.g., MR-Egger has low statistical power).
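The MR-Egger intercept test can be sketched as a weighted regression of SNP-outcome effects on SNP-exposure effects with the intercept left free. The data below are simulated with directional pleiotropy built in, and every number is an assumption for illustration. Dedicated implementations also orient the SNPs so that exposure effects are positive and treat the residual standard error slightly differently.

```r
# Simulated summary statistics with directional pleiotropy (illustrative only)
set.seed(1)
k <- 60
true_effect <- 0.3
pleiotropy <- 0.02                       # constant direct effect on the outcome

bx <- runif(k, 0.03, 0.12)               # SNP-exposure effects (all positive)
byse <- rep(0.01, k)                     # SNP-outcome standard errors
by <- true_effect * bx + pleiotropy + rnorm(k, sd = byse)

# IVW: weighted regression through the origin
ivw_fit <- lm(by ~ 0 + bx, weights = 1 / byse^2)
ivw_slope <- unname(coef(ivw_fit)["bx"])

# MR-Egger: same regression with an intercept
egger_fit <- lm(by ~ bx, weights = 1 / byse^2)
egger_coef <- summary(egger_fit)$coefficients
egger_intercept <- egger_coef["(Intercept)", "Estimate"]
egger_intercept_p <- egger_coef["(Intercept)", "Pr(>|t|)"]
egger_slope <- egger_coef["bx", "Estimate"]

cat("IVW slope:        ", round(ivw_slope, 3), "\n")
cat("Egger slope:      ", round(egger_slope, 3), "\n")
cat("Egger intercept:  ", round(egger_intercept, 4), "\n")
cat("Intercept P-value:", signif(egger_intercept_p, 3), "\n")
```

An intercept that differs from zero points to directional pleiotropy. In this simulation the IVW slope is inflated, whereas the Egger slope is estimated separately from the pleiotropic offset.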
 
(3) Linkage disequilibrium (LD) interference
 
Genetic variants in the genome are often in LD (i.e., alleles at nearby loci are inherited together more often than expected by chance), which may lead to the selection of non-independent SNPs as IVs and to overestimated causal effects. The study uses clumping (thresholds: window of 10,000 kb, r² < 0.001) to remove SNPs in LD, and excludes proxy SNPs to ensure the independence of IVs.
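Clumping is essentially a greedy procedure: take the most significant SNP, drop everything in LD with it, and repeat. The sketch below shows that logic on a toy r² matrix. The function name `clump_greedy`, the SNP names and the LD values are hypothetical, and the distance window is ignored (all SNPs are treated as lying within it).

```r
# Greedy LD clumping on a toy r^2 matrix (distance window ignored for brevity)
clump_greedy <- function(snp, pval, ld_r2, r2_threshold = 0.001) {
  kept <- character(0)
  for (i in order(pval)) {               # most significant SNP first
    in_ld <- length(kept) > 0 && any(ld_r2[snp[i], kept] >= r2_threshold)
    if (!in_ld) kept <- c(kept, snp[i])  # keep only if independent of all kept SNPs
  }
  kept
}

snp <- c("rs1", "rs2", "rs3", "rs4")
pval <- c(1e-12, 3e-9, 2e-10, 4e-9)

# Toy pairwise LD: rs1-rs2 and rs3-rs4 are correlated
ld_r2 <- matrix(0, 4, 4, dimnames = list(snp, snp))
diag(ld_r2) <- 1
ld_r2["rs1", "rs2"] <- ld_r2["rs2", "rs1"] <- 0.6
ld_r2["rs3", "rs4"] <- ld_r2["rs4", "rs3"] <- 0.2

independent_snps <- clump_greedy(snp, pval, ld_r2)
print(independent_snps)
```

The lead SNP of each LD block survives (`rs1` and `rs3` here), which is the behaviour the clumping thresholds are meant to enforce.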
 
(4) Limitations of statistical methods
 
Traditional MR methods (e.g., IVW) are sensitive to abnormal IVs and heterogeneity, and novel methods (e.g., ConMix, MR-RAPS) have their own applicable scenarios (e.g., ConMix is suitable for a large number of IVs, MR-RAPS is suitable for weak IVs). There is no “one-size-fits-all” method, and multiple methods need to be combined for analysis.
 

3.2 Practical application limitations (Reflected in the AR-cancer study)

 
(1) Population stratification and poor generalizability
 
The AR-cancer study only uses GWAS data of European ancestry, and the research results cannot be directly extended to other ethnic groups (e.g., Asian, African). Genetic background differences between ethnic groups will lead to differences in the frequency of genetic variants and the strength of gene-trait association, which is a common limitation of MR studies based on European GWAS data.
 
(2) Uneven statistical power of different outcomes
 
Statistical power is the probability of correctly rejecting a false null hypothesis, and a power of about 80% or higher is conventionally considered sufficient. In the AR-cancer study:
 
  • Sufficient power: ER- breast cancer (80.00%), skin cancer (SC, 77.20%);
  • Moderate power: Lung squamous cell carcinoma (LUSC, 65.90%), basal cell carcinoma (BCC, 58.00%);
  • Insufficient power: Lung adenocarcinoma (LUAD, 9.60%), ER+ breast cancer (43.90%), glioma (44.10%).
 
The null results of outcomes with insufficient statistical power (e.g., no significant association between AR and LUAD) cannot be directly considered as “no causal relationship”, and need to be verified by expanding the sample size.
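To see how power depends on sample size, one commonly used asymptotic approximation for MR with a binary outcome can be coded directly. The study's own power calculation is not described here, so the formula, the function name `mr_power_binary` and all input values below should be read as assumptions. The odds ratio is taken per standard deviation of the exposure, and `r2` is the variance in the exposure explained by the IVs.

```r
# Approximate power of MR with a binary outcome (asymptotic normal approximation)
mr_power_binary <- function(n, case_fraction, r2, odds_ratio, alpha = 0.05) {
  # Expected z-score of the causal estimate under the alternative
  z_expected <- abs(log(odds_ratio)) *
    sqrt(n * r2 * case_fraction * (1 - case_fraction))
  pnorm(z_expected - qnorm(1 - alpha / 2))
}

# Hypothetical scenario: IVs explain 0.5% of exposure variance, 10% cases
sample_sizes <- c(50000, 200000, 500000)
power_by_n <- sapply(
  sample_sizes, mr_power_binary,
  case_fraction = 0.10, r2 = 0.005, odds_ratio = 1.2
)
print(data.frame(n = sample_sizes, power = round(power_by_n, 3)))
```

The table makes the point of this subsection concrete: with a small outcome GWAS the same true effect is unlikely to be detected, so a null result there is weak evidence of no effect.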
 
(3) Lack of stratified analysis and prospective verification
 
The study does not conduct age, gender, or clinical subtype stratified analysis, and the results lack supporting evidence from prospective cohort studies. MR analysis is based on cross-sectional GWAS data, and the combination with prospective studies can further improve the persuasiveness of causal inference.
 
(4) Inability to analyze gene-environment interaction
 
MR analysis focuses on the genetic causal effect of exposure on outcome, and cannot effectively analyze the interaction between exposure and environmental factors (e.g., whether the association between AR and cancer is different in smokers and non-smokers). This is a common limitation of MR analysis, which needs to be supplemented by traditional observational studies.

4. Standard Analysis Workflow of Two-sample MR

Combined with the AR-cancer association study, the standard workflow of two-sample MR analysis for complex traits and diseases is summarized as follows, which is the basis of practical R code implementation:
 
  1. Data acquisition: Obtain GWAS summary data of exposure and outcome from public databases (IEU OpenGWAS, GWAS Catalog, FinnGen, etc.), and ensure the consistency of population ancestry.
  2. IVs screening: Strictly screen SNPs according to the criteria (genome-wide significance P < 5×10⁻⁸, clumping to remove LD, Steiger Test to verify causal direction, exclude SNPs associated with outcomes/confounders, RadialMR to remove abnormal IVs).
  3. IVs strength evaluation: Calculate the F-statistic of the screened SNPs to ensure no weak IVs (F > 10).
  4. Core MR analysis: Use the main method (IVW) for primary analysis, and multiple robust methods (MR-Egger, Weighted median, Weighted mode) and novel methods (cML-MA, ConMix, MR-RAPS, dIVW) for cross-verification.
  5. Sensitivity analysis:
    • Cochran Q test: Evaluate the heterogeneity of IVs;
    • MR-Egger-intercept + MR-PRESSO: Detect horizontal pleiotropy;
    • Leave-one-out analysis: Verify whether the results are driven by a single abnormal SNP.
     
  6. Result visualization: Draw forest plots, scatter plots, leave-one-out plots to intuitively display the causal effect and sensitivity analysis results.
  7. Result interpretation: Combine statistical power, population characteristics, and biological mechanisms to interpret the results, and clarify the limitations of the study.
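The core of step 4, the IVW estimate, can be computed by hand from harmonized summary statistics, which helps in checking what the packaged functions return. The sketch below uses hypothetical effect sizes; `ivw_from_summary` is an illustrative helper, not a function from TwoSampleMR.

```r
# Fixed-effect IVW estimate from summary statistics
ivw_from_summary <- function(bx, by, byse) {
  ratios <- by / bx                 # per-SNP Wald ratio estimates
  weights <- bx^2 / byse^2          # inverse variance of each Wald ratio
  beta <- sum(weights * ratios) / sum(weights)
  se <- sqrt(1 / sum(weights))
  c(
    beta = beta,
    se = se,
    or = exp(beta),                 # odds ratio for a binary outcome
    or_lower = exp(beta - 1.96 * se),
    or_upper = exp(beta + 1.96 * se),
    pval = 2 * pnorm(-abs(beta / se))
  )
}

# Hypothetical harmonized summary statistics for five SNPs
bx <- c(0.050, 0.062, 0.041, 0.075, 0.055)     # SNP-exposure effects
by <- c(0.012, 0.020, 0.008, 0.024, 0.015)     # SNP-outcome effects (log-odds)
byse <- c(0.006, 0.007, 0.006, 0.008, 0.007)   # SNP-outcome standard errors

ivw_result <- ivw_from_summary(bx, by, byse)
print(round(ivw_result, 4))

# Cross-check: IVW equals weighted regression through the origin
wls_slope <- unname(coef(lm(by ~ 0 + bx, weights = 1 / byse^2)))
```

The cross-check shows why IVW is often described as a weighted regression of SNP-outcome effects on SNP-exposure effects with the intercept fixed at zero.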

5. Practical R Code Implementation of Two-sample MR

The AR-cancer study uses R 4.3.3 for all statistical analyses, with core packages TwoSampleMR (for MR data processing, analysis and visualization) and MRPRESSO (for horizontal pleiotropy detection). The following code is a complete, reproducible two-sample MR analysis code based on the study, taking allergic rhinitis (AR) as exposure and ER- breast cancer as outcome as an example, and the code can be directly modified for other exposure-outcome combinations.
 

5.1 Environment Construction

 
Install and load the required R packages, and set the working directory
 
 

# Set working directory
setwd("your_working_directory")

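# Install packages (run only once)
# CRAN packages: tidyverse (includes ggplot2), remotes, MendelianRandomization (used for dIVW)
install.packages(c("tidyverse", "remotes", "MendelianRandomization"))
# TwoSampleMR and MRPRESSO are distributed via GitHub rather than CRAN
remotes::install_github("MRCIEU/TwoSampleMR")
remotes::install_github("rondolab/MR-PRESSO")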

# Load packages
library(TwoSampleMR)
library(MRPRESSO)
library(tidyverse)
library(ggplot2)

# Set seed for reproducibility
set.seed(123)

5.2 Data Acquisition

 
Obtain GWAS summary data of exposure (AR) and outcome (ER- breast cancer) from the IEU OpenGWAS database (the most commonly used database for two-sample MR). The dataset IDs are from the AR-cancer study:
 
  • AR: ukb-b-7178 (IEU OpenGWAS)
  • ER- breast cancer: ieu-a-1166 (IEU OpenGWAS)
 

# Note: recent versions of the OpenGWAS API may require an access token;
# see the TwoSampleMR / ieugwasr documentation for how to configure it.

# Obtain exposure data (AR)
# extract_instruments() has no pop argument; both datasets here are of European ancestry
exposure_dat <- extract_instruments(
  outcomes = "ukb-b-7178", # AR dataset ID
  p1 = 5e-8, # Genome-wide significance threshold
  clump = TRUE, # Clump to remove SNPs in LD
  kb = 10000, # Clumping window (kb)
  r2 = 0.001 # Clumping r2 threshold
)

# Obtain outcome data (ER- breast cancer)
outcome_dat <- extract_outcome_data(
  snps = exposure_dat$SNP, # SNPs from exposure data
  outcomes = "ieu-a-1166", # ER- breast cancer dataset ID
  proxies = FALSE # Do not substitute LD proxies for missing SNPs
)

# Harmonize exposure and outcome data (unify allele direction, flag incompatible SNPs)
mr_dat <- harmonise_data(
  exposure_dat = exposure_dat,
  outcome_dat = outcome_dat,
  action = 2 # Align alleles; use allele frequencies to resolve palindromic SNPs
)

# Keep only SNPs that passed harmonization
mr_dat <- subset(mr_dat, mr_keep)
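What harmonization does can be illustrated on a toy example: when the outcome GWAS reports the opposite effect allele, the sign of the outcome effect is flipped, and SNPs whose alleles cannot be matched are dropped. This base R sketch uses hypothetical SNPs and ignores strand flips and palindromic SNPs, which `harmonise_data()` handles in addition.

```r
# Toy exposure and outcome summary statistics (hypothetical values)
exposure <- data.frame(
  SNP = c("rs1", "rs2", "rs3"),
  effect_allele = c("A", "C", "G"),
  other_allele  = c("G", "T", "A"),
  beta = c(0.05, 0.04, 0.03)
)
outcome <- data.frame(
  SNP = c("rs1", "rs2", "rs3"),
  effect_allele = c("A", "T", "G"),
  other_allele  = c("G", "C", "C"),
  beta = c(0.010, -0.020, 0.015)
)

merged <- merge(exposure, outcome, by = "SNP",
                suffixes = c(".exposure", ".outcome"))

# Alleles already aligned
same <- merged$effect_allele.exposure == merged$effect_allele.outcome &
  merged$other_allele.exposure == merged$other_allele.outcome
# Effect and other allele swapped in the outcome data
swapped <- merged$effect_allele.exposure == merged$other_allele.outcome &
  merged$other_allele.exposure == merged$effect_allele.outcome

# Flip the outcome effect so both betas refer to the same allele
merged$beta.outcome[swapped] <- -merged$beta.outcome[swapped]
merged$keep <- same | swapped          # drop SNPs with incompatible alleles

print(merged[, c("SNP", "beta.exposure", "beta.outcome", "keep")])
```

In this example `rs2` has its outcome effect flipped and `rs3` is excluded because its alleles do not match.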

5.3 IVs Strength Evaluation (Calculate F-statistic)

 

Calculate the R² and F-statistic of each SNP to verify that there are no weak IVs (F > 10). The formula is from the AR-cancer study:

$$R^2 = 2 \times \mathrm{EAF} \times (1 - \mathrm{EAF}) \times \beta^2, \qquad F = \frac{R^2 (N - K - 1)}{K (1 - R^2)}$$

  • β: Regression coefficient of SNP-exposure association;
  • EAF: Effect allele frequency;
  • N: Sample size of exposure GWAS;
  • K: Number of IVs (SNPs).
 

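# Sample size of the exposure GWAS (AR sample size N=112583)
N <- 112583
K <- nrow(mr_dat) # Number of IVs

# Per-SNP R2 and F-statistic
# (each SNP is a single instrument, so K = 1 in the formula for the per-SNP F)
mr_dat <- mr_dat %>%
  mutate(
    R2 = 2 * eaf.exposure * (1 - eaf.exposure) * beta.exposure^2,
    F_stat = R2 * (N - 2) / (1 - R2)
  )

# Overall F-statistic for the full instrument set (total R2, K instruments)
R2_total <- sum(mr_dat$R2)
F_total <- R2_total * (N - K - 1) / (K * (1 - R2_total))

# View IVs strength (F-statistic >10 is strong IVs)
print(paste("Number of final IVs:", K))
print(data.frame(SNP = mr_dat$SNP, F = round(mr_dat$F_stat, 2)))
print(paste("Minimum per-SNP F-statistic:", round(min(mr_dat$F_stat), 2)))
print(paste("Overall F-statistic:", round(F_total, 2)))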

5.4 Core MR Analysis

 
Implement traditional MR methods (IVW, MR-Egger, Weighted median, Weighted mode) and the main novel method (dIVW) used in the study, and output the causal effect index (OR and 95% CI) — the AR-cancer study uses odds ratio (OR) because the outcome is a binary variable (cancer/non-cancer).
 
 

# Core MR analysis (traditional methods)
# Method names must match those listed by mr_method_list()
mr_results <- mr(mr_dat, method_list = c(
  "mr_ivw_fe", # Fixed effects IVW (main method)
  "mr_egger_regression", # MR-Egger
  "mr_weighted_median", # Weighted median
  "mr_weighted_mode" # Weighted mode
))

# Convert beta to OR (OR = exp(beta), 95% CI = exp(beta ± 1.96*se))
mr_results_or <- mr_results %>%
  mutate(
    OR = exp(b),
    OR_lower = exp(b - 1.96 * se),
    OR_upper = exp(b + 1.96 * se),
    p_value = pval
  ) %>%
  select(method, OR, OR_lower, OR_upper, p_value) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4))) # Round numeric columns only

# View results
print(mr_results_or)

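# Debiased IVW (dIVW, novel method) - reduces weak IVs bias
# dIVW is provided by the MendelianRandomization package (CRAN), not by TwoSampleMR.
# The package is called with :: so that it does not mask TwoSampleMR functions.
divw_input <- MendelianRandomization::mr_input(
  bx = mr_dat$beta.exposure,
  bxse = mr_dat$se.exposure,
  by = mr_dat$beta.outcome,
  byse = mr_dat$se.outcome
)
divw_results <- MendelianRandomization::mr_divw(divw_input)
divw_or <- data.frame(
  method = "debiased_ivw",
  OR = round(exp(divw_results@Estimate), 4),
  OR_lower = round(exp(divw_results@CILower), 4),
  OR_upper = round(exp(divw_results@CIUpper), 4),
  p_value = round(divw_results@Pvalue, 4)
)
print(divw_or)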
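To make clear why the weighted median is treated as a robust method, the estimator can be written out in a few lines: it is the median of the per-SNP ratio estimates, weighted by their precision, so a minority of invalid IVs cannot dominate it. The sketch below is a simplified version with hypothetical data (one deliberately outlying SNP) and omits the bootstrap standard error that packaged implementations add.

```r
# Weighted median of per-SNP ratio estimates (point estimate only)
weighted_median_est <- function(ratios, weights) {
  ord <- order(ratios)
  ratios <- ratios[ord]
  weights <- weights[ord]
  # Cumulative weight at the midpoint of each SNP, scaled to [0, 1]
  cum_w <- (cumsum(weights) - 0.5 * weights) / sum(weights)
  below <- max(which(cum_w < 0.5))
  # Linear interpolation to the 50% point
  ratios[below] + (ratios[below + 1] - ratios[below]) *
    (0.5 - cum_w[below]) / (cum_w[below + 1] - cum_w[below])
}

# Hypothetical data: four SNPs agree on 0.3, the fifth is an outlier
bx <- c(0.05, 0.06, 0.04, 0.07, 0.05)
by <- c(0.015, 0.018, 0.012, 0.021, 0.060)
byse <- rep(0.005, 5)

ratios <- by / bx
weights <- bx^2 / byse^2

median_estimate <- weighted_median_est(ratios, weights)
ivw_estimate <- sum(weights * ratios) / sum(weights)

cat("Weighted median:", round(median_estimate, 3), "\n")
cat("IVW:            ", round(ivw_estimate, 3), "\n")
```

The outlying SNP pulls the IVW estimate upward but leaves the weighted median at the value supported by the majority of the weight.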

5.5 Sensitivity Analysis

 
Implement the three key sensitivity analyses in the AR-cancer study: heterogeneity test (Cochran Q), horizontal pleiotropy test (MR-Egger-intercept + MR-PRESSO), and leave-one-out analysis.
 
 

#### 5.5.1 Heterogeneity test (Cochran Q)
hetero_results <- mr_heterogeneity(mr_dat)
print(hetero_results[, c("method", "Q", "Q_pval")])

#### 5.5.2 Horizontal pleiotropy test
# MR-Egger-intercept
pleio_egger <- mr_pleiotropy_test(mr_dat)
print(paste("MR-Egger-intercept P-value:", round(pleio_egger$pval, 4)))

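# MR-PRESSO (detect and correct abnormal IVs)
# mr_presso() takes column names as strings and a plain data.frame
presso_results <- mr_presso(
  BetaOutcome = "beta.outcome",
  BetaExposure = "beta.exposure",
  SdOutcome = "se.outcome",
  SdExposure = "se.exposure",
  data = as.data.frame(mr_dat),
  OUTLIERtest = TRUE,
  DISTORTIONtest = TRUE,
  NbDistribution = 1000,
  SignifThreshold = 0.05
)
print(presso_results$`Main MR results`) # Raw and outlier-corrected estimates
print(presso_results$`MR-PRESSO results`$`Global Test`$Pvalue) # Global test P-value
print(presso_results$`MR-PRESSO results`$`Outlier Test`) # Abnormal SNP detection (present only if outliers are tested)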

#### 5.5.3 Leave-one-out analysis
leave1out_results <- mr_leaveoneout(mr_dat)
# Visualize leave-one-out results (the plot function returns a list of plots)
p_leave1out <- mr_leaveoneout_plot(leave1out_results)
print(p_leave1out[[1]])
ggsave("leaveoneout_plot.pdf", p_leave1out[[1]], width = 10, height = 6)
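For readers who want to see what the heterogeneity and leave-one-out steps compute, both can be reproduced by hand from summary statistics. The sketch below uses hypothetical data in which the fifth SNP is an outlier; `ivw_beta` is an illustrative helper, not a TwoSampleMR function.

```r
# IVW point estimate from summary statistics
ivw_beta <- function(bx, by, byse) {
  sum(bx * by / byse^2) / sum(bx^2 / byse^2)
}

# Hypothetical data: the fifth SNP has a much larger ratio estimate
bx <- c(0.05, 0.06, 0.04, 0.07, 0.05)
by <- c(0.015, 0.018, 0.012, 0.021, 0.060)
byse <- rep(0.005, 5)

# Cochran Q: weighted squared deviations of ratio estimates from the IVW estimate
beta_all <- ivw_beta(bx, by, byse)
weights <- bx^2 / byse^2
q_stat <- sum(weights * (by / bx - beta_all)^2)
q_df <- length(bx) - 1
q_pval <- pchisq(q_stat, df = q_df, lower.tail = FALSE)
cat("Cochran Q =", round(q_stat, 2), "df =", q_df, "P =", signif(q_pval, 3), "\n")

# Leave-one-out: recompute IVW after dropping each SNP in turn
loo_beta <- sapply(seq_along(bx), function(i) {
  ivw_beta(bx[-i], by[-i], byse[-i])
})
print(data.frame(dropped_snp = seq_along(bx), ivw_beta = round(loo_beta, 3)))
```

A large Q signals heterogeneity among the IVs, and the leave-one-out table shows which SNP drives it: the estimate changes markedly only when the fifth SNP is removed.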

5.6 Result Visualization

 
Draw the forest plot (core result of causal effect) and scatter plot (SNP exposure-outcome association) — the most important visualization graphs in MR analysis, consistent with the figures in the AR-cancer study.
 
 

#### 5.6.1 Forest plot (per-SNP and combined causal estimates with 95% CI)
# The plotting functions return a list of plots, one per exposure-outcome pair
singlesnp_results <- mr_singlesnp(mr_dat)
p_forest <- mr_forest_plot(singlesnp_results)
print(p_forest[[1]])
ggsave("mr_forest_plot.pdf", p_forest[[1]], width = 12, height = 8)

#### 5.6.2 Scatter plot (SNP level exposure-outcome association)
p_scatter <- mr_scatter_plot(mr_results, mr_dat)
print(p_scatter[[1]])
ggsave("mr_scatter_plot.pdf", p_scatter[[1]], width = 8, height = 6)

#### 5.6.3 Funnel plot (asymmetry suggests directional pleiotropy)
p_funnel <- mr_funnel_plot(singlesnp_results)
print(p_funnel[[1]])
ggsave("mr_funnel_plot.pdf", p_funnel[[1]], width = 8, height = 6)

5.7 Extended Analysis (Novel Methods)

 
The other three novel methods used in the AR-cancer study (cML-MA, ConMix, MR-RAPS) are implemented in separate R packages. The code logic is consistent with the dIVW example above, and the harmonized GWAS summary data can be input directly for analysis.
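As one possible route, the MendelianRandomization package provides `mr_conmix()` (ConMix) and `mr_cML()` (cML-MA). Whether the AR-cancer study used this package is not stated here, so the sketch below is an assumption. It runs on simulated summary statistics; in practice the inputs would be the `beta.exposure`, `se.exposure`, `beta.outcome` and `se.outcome` columns of the harmonized data, and `n` would be the relevant GWAS sample size. MR-RAPS is not shown; it is available through the separate mr.raps package.

```r
# Simulated summary statistics (illustrative only)
set.seed(42)
k <- 30
bx <- runif(k, 0.03, 0.10)              # SNP-exposure effects
bxse <- rep(0.005, k)                   # their standard errors
byse <- rep(0.01, k)                    # SNP-outcome standard errors
by <- 0.2 * bx + rnorm(k, sd = byse)    # SNP-outcome effects, true effect 0.2

# Run only if the optional package is installed
if (requireNamespace("MendelianRandomization", quietly = TRUE)) {
  mr_obj <- MendelianRandomization::mr_input(
    bx = bx, bxse = bxse, by = by, byse = byse
  )

  # Contamination mixture (ConMix)
  conmix_fit <- MendelianRandomization::mr_conmix(mr_obj)
  print(conmix_fit)

  # Constrained maximum likelihood with model averaging (cML-MA);
  # DP = FALSE skips the slower data-perturbation step; n is an assumed sample size
  cml_fit <- MendelianRandomization::mr_cML(
    mr_obj, MA = TRUE, DP = FALSE, n = 100000
  )
  print(cml_fit)
}
```

Calling the package with `::` instead of `library()` avoids masking TwoSampleMR functions that share the same names.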
 
 
 

6. Summary and Research Suggestions

 MR analysis is a revolutionary epidemiological method for causal inference, which has become an important tool for studying the causal association between complex traits and diseases by using genetic variants as instrumental variables. The two-sample MR study of AR and cancers fully demonstrates the application value of MR analysis: it clarifies that genetically predicted AR is a risk factor for ER- breast cancer and a protective factor for LUSC, SC and BCC, and provides a reliable causal basis for subsequent biological mechanism research.

 
However, MR analysis has inherent limitations, and practical research needs to take multiple measures to avoid bias and improve the reliability of results:
 
  1. Strictly screen IVs: Follow the three basic assumptions, use clumping to remove LD, Steiger Test to verify causal direction, and F-statistic to exclude weak IVs.
  2. Multi-method cross-verification: Combine traditional methods (IVW, MR-Egger) and novel methods (cML-MA, MR-RAPS) for analysis, and take the results with consistent directions of multiple methods as the final conclusion.
  3. Comprehensive sensitivity analysis: Use Cochran Q, MR-Egger-intercept, MR-PRESSO and leave-one-out analysis to evaluate heterogeneity and horizontal pleiotropy, and remove abnormal IVs.
  4. Pay attention to statistical power: For outcomes with low statistical power, expand the sample size or combine multiple GWAS datasets for meta-analysis to avoid false negative results.
  5. Strengthen the combination with other studies: MR analysis provides genetic causal evidence, and the combination with prospective cohort studies, in vitro cell experiments and in vivo animal experiments can further verify the causal relationship and explore the underlying biological mechanisms.
  6. Expand the research population: Collect GWAS data of different ethnic groups to improve the generalizability of the research results.
 
In the future, with the continuous development of GWAS technology and MR statistical methods (e.g., multi-exposure MR, MR for gene-environment interaction), MR analysis will play a more important role in the field of precision medicine and epidemiological research, and provide more reliable causal evidence for the prevention and treatment of complex diseases.
