Journal of the College of Physicians and Surgeons Pakistan
ISSN: 1022-386X (PRINT)
ISSN: 1681-7168 (ONLINE)
Affiliations
doi: 10.29271/jcpsp.2025.12.1511ABSTRACT
Objective: To identify potential pivotal genes associated with Crohn's disease (CD) that may serve as potential diagnostic biomarkers.
Study Design: A bioinformatics analysis.
Place and Duration of the Study: Department of Gastrointestinal Dysfunction Centre, Anhui No.2 Provincial People's Hospital, Hefei, China, from June 2023 to June 2024.
Methodology: Multiple gene expression datasets from the NCBI Gene Expression Omnibus (GEO) were used to identify potential diagnostic biomarkers. Weighted gene co-expression network analysis (WGCNA) was conducted; CD-related genes were selected through WGCNA, gene-gene interaction network analysis, and Least Absolute Shrinkage and Selection Operator (LASSO) regression. Logistic regression and six supervised learning techniques—including support vector machine, random forest, K-nearest neighbour, neural network, decision tree, and extreme gradient boosting (XGB)—were used to assess and compare the diagnostic performance of various models.
Results: Hub gene screening, immune infiltration analysis, and gene set enrichment analysis were used to explore the association between the potential genes and CD. A nomogram based on the logistic regression model was developed, and the clinical utility of the hub genes was evaluated through decision curve analysis (DCA). Additionally, six machine learning models were developed, with the XGB model demonstrating the highest performance. The area under the curve (AUC) for the XGB model was 99.4% for the training set, 77.7% for the validation set, and 83.5% for external validation data, indicating its superior diagnostic potential.
Conclusion: This study identified five genes that are closely associated with CD. Based on these genes, a CD diagnostic model was established, with the XGB model showing the most promising results among the six tested models.
Key Words: Crohn's disease, Genes, Prediction model, Machine learning.
INTRODUCTION
In the early 21st century, the incidence of inflammatory bowel disease (IBD) rose sharply in newly industrialised countries, turning it into a major global health issue.1 IBD, which encompasses CD and ulcerative colitis (UC), is marked by persistent intestinal inflammation and characterised by alternating periods of remission and flare-ups. Major complications of IBD can include the development of fibrotic stenosis, fistulas, impaired absorption, an increased risk of cancer, and the need for surgical intervention. Research on biomarkers that support CD diagnosis, along with studies on the underlying immune mechanisms, continues to be a vital area for improving both the under- standing and treatment of the disease.
CD is an autoimmune condition marked by repeated bouts of intestinal inflammation. The development of CD is thought to be closely linked to alterations in immune cell types and numbers, which disrupt immune homeostasis, trigger an inflammatory cascade, and cause sustained damage to the intestinal lining.2 Recent research on immune cell infiltration and pathway enrichment reinforces this idea, showing that differentially expressed genes are highly linked to immune responses and associated pathways.
In recent years, machine learning (ML) has emerged as a widely used bioinformatics approach for mining medical data.3 Additionally, biological networks, including weighted gene co-expression network analysis (WGCNA) and gene-gene interaction networks, are essential for uncovering the relationships between genes and their biological functions. This study aimed to identify potential biomarkers through bionetwork analysis and ML methods, assessing their uniqueness using ML-based diagnostic models.
METHODOLOGY
Seven datasets were retrieved from the GEO database. Each dataset included over 10 samples and was generated using four distinct microarray platforms. The raw data underwent pre- processing using the oligo and affy R packages, followed by background correction and normalisation through the Robust Multichip Average (RMA) approach. The GSE83448 dataset was utilised to build a WGCNA. Next, four additional datasets (GSE16879, GSE59071, GSE75214, and GSE102133) were analysed using the limma R package to identify differentially expressed genes (DEGs) between CD patients and healthy controls. DEGs were filtered with a threshold of |log2FC| >1 and p <0.05, and the findings were visualised using heat maps and volcano plots generated by the complex heat maps and ggplot2 R packages. The GSE186582 dataset included 317 samples from CD patients and 25 intestinal mucosal samples from healthy controls. The Combat algorithm from the sva R package was applied to adjust for batch effects across various platforms and experiments. Lastly, the GSE95095 dataset was used as an independent dataset for external validation (EV).
The R package WGCNA was used to identify gene modules and assess the association between each module and the sample type. The GSE83448 dataset provided 14 healthy control samples and 19 CD patient samples, totalling 33 samples; these samples were categorised into a CD group and a Healthy Control group. The hclust function was employed to cluster samples and detect outliers using the average method on the expression matrix. A soft threshold of seven was selected to optimise the scale-free topological fitting index, yielding an index value of 0.86, which enhanced the average network connectivity. The correlation between module Eigen genes and sample types was assessed by examining the characteristic gene expression profiles and the classification of clinical samples. Inclusion criteria included patients aged between 18 and 65 years and colonoscopy pathological diagnosis of Crohn's disease. Exclusion criteria included patients aged under 18 years, diagnosed with inflammatory diseases other than Crohn's disease (such as ulcerative colitis), and patients with intestinal tuberculosis or irritable bowel syndrome. Genes labelled as green within the selected module were exported to Cytoscape for detailed examination. To explore the roles of common DEGs in greater detail, Gene Ontology (GO) annotation and Kyoto Encyclopaedia of Genes and Genomes (KEGG) pathway enrichment analyses were conducted using the clustering analyser in R.
The intersection of 196 genes screened from four gene sets, with 68 genes in the green module analysed by WGCNA, was taken to obtain 22 common genes. Their relationship with immune cell infiltration was evaluated, and finally, 5 hub genes were obtained through Least Absolute Shrinkage and Selection Operator (LASSO) regression dimension reduction processing. The intersection of these analyses pinpointed five immune cell types that exhibited significantly distinct infiltration patterns. Spearman’s correlation analysis was performed in R software to assess the relationship between the five hub genes and 22 significantly elevated immune cell types. The ggplot2 R package was then used to create visual representations of these correlations.
The datasets were randomly divided into training and testing sets, with 70% allocated to the training set and 30% to the testing set. Repeated stratified cross-validation was then applied to the training set. The stratified K-fold method maintained consistent proportions of sample types in each fold, aligning with the overall dataset distribution. To further assess model robustness, external validation was conducted using an independent dataset, GSE95095. This study evaluated six commonly used ML algorithms to determine the most effective classifier for the diagnostic model. The algorithm tested models included extreme gradient boosting (XGB), k-nearest neighbour (KNN), support vector machine (SVM), random forest (RF), decision tree (DT), and neural network (NN). Ultimately, model performance on both the test and external validation datasets was evaluated using a comprehensive array of metrics.
RESULTS
As shown in Figure 1, the differential genes between patients the CD group and the Healthy Control group were analysed using the limma package in R. After filtering the genes with thresholds of |log2FC| >1 and p <0.05, a total of 1,843 DEGs were screened in the GSE16879 dataset. A total of 691 DEGs were screened in the GSE59071 dataset. The GSE75214 dataset filtered out a total of 400 differential genes. In the GSE102133 dataset, 534 DEGs were identified (Figure 1A-D). After taking the intersection of the Venn diagram of the above differential genes, 196 common differential genes were identified (Figure 1E). The DEGs table intersected four gene clusters, which included 147 commonly upregulated genes and 49 commonly downregulated genes.
The WGCNA R package was used to assess the correlations between clinical features and genes by building a gene network with 820 genes and 33 samples from the GSE83448 dataset. A sample clustering diagram was generated (Figure 2A). To ensure a scale-free topology and preserve zero mean connectivity, a threshold value of 17 was selected (Figure 2C). The module merging threshold was set at 0.2, resulting in the creation of four distinct modules (Figure 2B). Yellow and green modules have the highest positive correlation and are important modules for subsequent analysis (Figure 2D).
To further elucidate the functions of these shared DEGs, GO annotation together with KEGG pathway enrichment analyses were performed (Table I).
To identify the pivotal genes, a cross-analysis was performed between the 68 related genes in the green module and the previously identified 196 DEGs, resulting in 22 key genes. These genes were analysed by immune infiltration, and subsequently, dimensionality reduction was performed by LASSO regression. Finally, five key genes were obtained: GPX8, COL1A1, RAB31, S100A9, and STX11 (Figure 3A-B). In CD samples, S100A9 showed a positive correlation with resting NK cell infiltration levels, while demonstrating a negative correlation with macrophage M1, monocytes, and eosinophils. STX11 was also positively associated with resting NK cell infiltration.
Figure 1: Identification of common DEGs. (A-D) Volcano maps show the differential genes between CD patients and the healthy controls in the GSE16879, GSE75214, GSE59071, and GSE102133 datasets, respectively. (E) The Venn diagram shows 196 common DEGs in the intersection of the four datasets.
Figure 2: Progress of GSE83448 WGCNA (A) Cluster tree of 33 samples from GSE83448. (B) Aggregation in GSE83448 across conditions (p <0.05; |logFC|>1). A tree of 820 genes was screened. Using the topological overlap matrix, the genes were organised into distinct modules. (C) Determine soft thresholds (left) and average connectivity (right) for the fit index of the optimal scale-free topological model. (D) Heat map showing relationship between GSE83448 module feature genes and clinical traits.
Figure 3: Association between hub gene and immune infiltration. (A) LASSO coefficient distribution of 5 hub genes. (B) Tuning parameters (λ) select cross-validation error curves. (C-N) Visualisation of the correlation between immune cells and hub genes.
Figure 4: Final selection of the hub gene logistic regression modelling. (A) Drawing of the nomogram of the prediction model for CD; (B) Calibration curve of the model; (C) The representation of single factor and comprehensive factor in the model DCA curve.
Table I: Analysis of the biological processes, molecular functions, cellular components and KEGG pathway enrichment of genes.
|
Categories |
Terms |
Counts |
p-values |
|
Biological process |
Cytokine-mediated signalling pathway |
20~30 |
p <0.01 |
|
Biological process |
Leucocyte migration |
20~30 |
p <0.01 |
|
Biological process |
Cell chemotaxis |
20~30 |
p <0.01 |
|
Biological process |
Response to molecule of bacterial origin |
20~30 |
p <0.01 |
|
Biological process |
Response to lipopolysaccharide |
20~30 |
p <0.01 |
|
Biological process |
Leucocyte chemotaxis |
20~30 |
p <0.01 |
|
Biological process |
Myeloid leucocyte migration |
20~30 |
p <0.01 |
|
Biological process |
Granulocyte migration |
20~30 |
p <0.01 |
|
Biological process |
Neutrophil migration |
20~30 |
p <0.01 |
|
Biological process |
Neutrophil chemotaxis |
20~30 |
p <0.01 |
|
Molecular function |
Receptor ligand activity |
8~20 |
p <0.01 |
|
Molecular function |
Signalling receptor activator activity |
8~20 |
p <0.01 |
|
Molecular function |
Extracellular matrix structural constituent |
8~20 |
p <0.01 |
|
Molecular function |
Cytokine activity |
8~20 |
p <0.01 |
|
Molecular function |
Cytokine receptor binding |
8~20 |
p <0.01 |
|
Molecular function |
Immune receptor activity |
8~20 |
p <0.01 |
|
Molecular function |
Chemokine receptor binding |
8~20 |
p <0.01 |
|
Molecular function |
Chemokine activity |
8~20 |
p <0.01 |
|
Molecular function |
CXCR chemokine receptor binding |
8~20 |
p <0.01 |
|
Molecular function |
Extracellular matrix structural constituent conferring tensile strength |
8~20 |
p <0.01 |
|
Cellular component |
Collagen-containing extracellular matrix |
5~25 |
p <0.01 |
|
Cellular component |
External side of plasma membrane |
5~25 |
p <0.01 |
|
Cellular component |
Endoplasmic reticulum lumen |
5~25 |
p <0.01 |
|
Cellular component |
Secretory granule membrane |
5~25 |
p <0.01 |
|
Cellular component |
Secretory granule lumen |
5~25 |
p <0.01 |
|
Cellular component |
Cytoplasmic vesicle lumen |
5~25 |
p <0.01 |
|
Cellular component |
Vesicle lumen |
5~25 |
p <0.01 |
|
Cellular component |
Collagen trimer |
5~25 |
p <0.01 |
|
Cellular component |
Platelet alpha granule |
5~5 |
p <0.01 |
|
Cellular component |
Complex of collagen trimers |
5~25 |
p <0.01 |
|
KEGG pathway |
Cytokine-cytokine receptor interaction |
9~21 |
p <0.01 |
|
KEGG pathway |
Rheumatoid arthritis |
9~21 |
p <0.01 |
|
KEGG pathway |
IL-17 signalling pathway |
9~21 |
p <0.01 |
|
KEGG pathway |
AGE-RAGE signalling pathway in diabetic complications |
9~21 |
p <0.01 |
|
KEGG pathway |
Viral protein interaction with cytokine and cytokine receptor |
9~21 |
p <0.01 |
|
KEGG pathway |
TNF signalling pathway |
9~21 |
p <0.01 |
|
KEGG pathway |
Staphylococcus aureus infection |
9~21 |
p <0.01 |
|
KEGG pathway |
Amoebiasis |
9~21 |
p <0.01 |
|
KEGG pathway |
Leishmaniasis |
9~21 |
p <0.01 |
|
KEGG pathway |
Malaria |
9~21 |
p <0.01 |
Table II: Accuracy, Kappa value, sensitivity, specificity, precision, model recall, F, and AUC score of six supervised ML models.
|
Datasets |
ML models |
Accuracy |
Kappa values |
Sensitivity |
Specificity |
Precision |
Recall |
F1 |
AUC |
|
Training sets |
KNN |
0.8833 |
0.5096 |
0.8739 |
1 |
1 |
0.8739 |
0.9327 |
0.971 |
|
DT |
0.8167 |
0.3462 |
0.8108 |
0.8889 |
0.989 |
0.8108 |
0.8911 |
0.888 |
|
|
NN |
0.775 |
0.3182 |
0.7568 |
1 |
1 |
0.7568 |
0.8615 |
0.94 |
|
|
RF |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
1 |
|
|
XGB |
0.9542 |
0.7421 |
0.9505 |
1 |
1 |
0.9505 |
0.9746 |
0.994 |
|
|
SVM |
0.9458 |
0.6688 |
0.955 |
0.8333 |
0.986 |
0.955 |
0.9703 |
0.933 |
|
|
Testing sets |
KNN |
0.8529 |
0.4226 |
0.8421 |
1 |
1 |
0.8421 |
0.9143 |
0.921 |
|
DT |
0.8235 |
0.1692 |
0.8526 |
0.4286 |
0.9529 |
0.8526 |
0.9 |
0.651 |
|
|
NN |
0.7647 |
0.2078 |
0.7684 |
0.7143 |
0.9733 |
0.7684 |
0.8588 |
0.886 |
|
|
RF |
0.9412 |
0.3742 |
0.9895 |
0.2857 |
0.9495 |
0.9895 |
0.9691 |
0.844 |
|
|
XGB |
0.8922 |
0.2087 |
0.9368 |
0.2857 |
0.9468 |
0.9368 |
0.9418 |
0.777 |
|
|
SVM |
0.8824 |
0.3412 |
0.9053 |
0.5714 |
0.9663 |
0.9053 |
0.9348 |
0.839 |
|
|
External validation sets |
KNN |
0.8056 |
0.5532 |
0.875 |
0.6667 |
0.84 |
0.875 |
0.8571 |
0.773 |
|
DT |
0.7778 |
0.4783 |
0.875 |
0.5833 |
0.8077 |
0.875 |
0.84 |
0.769 |
|
|
NN |
0.8056 |
0.5714 |
0.8333 |
0.75 |
0.8696 |
0.8333 |
0.8511 |
0.802 |
|
|
RF |
0.6667 |
0.1429 |
0.875 |
0.25 |
0.7 |
0.875 |
0.7778 |
0.769 |
|
|
XGB |
0.8333 |
0.625 |
0.875 |
0.75 |
0.875 |
0.875 |
0.875 |
0.835 |
|
|
SVM |
0.7222 |
0.4 |
0.75 |
0.6667 |
0.8182 |
0.75 |
0.7826 |
0.688 |
Similarly, GPX8 exhibited a positive correlation with the infiltration of both resting NK cells and macrophages M0. RAB31 was positively linked to the infiltration of macrophages M0, macrophages M2, resting NK cells, and activated mast cells (Figure 3C-N).
First, a logistic regression model was developed for the five selected hub genes, followed by the creation of a column graph and a calibration curve, and subsequently, a DCA curve of single factor and comprehensive factor for comparative analysis was drawn. Afterwards, six models were built using ML algorithms (Figure 4A-C), and ROC curves for all six of the CD categorical diagnostic assistants were established using the predicted probabilities for different categories. The diagnostic performance of the six CD prediction models was further assessed using the EV dataset, which included 24 CD samples and 12 healthy controls. The training dataset results for all six models are presented in Table II. Among them, the RF model demonstrated the highest accuracy at 100%, achieving the best specificity and sensitivity. In contrast, the NN model demonstrated the weakest performance, with an accuracy of merely 77.5% and a sensitivity of 75.7%, whereas the SVM model exhibited a specificity of only 83.3%. The testing dataset performance of all diagnostic models is illustrated in Table II. After training, the RF model continued to outperform others with an accuracy of 94.1%. Meanwhile, the NN model remained the least accurate, achieving nearly 76.5%. Sensitivity analysis revealed that this metric was generally the weakest, with a sensitivity of 76.8%. On the other hand, the RF model demonstrated the highest sensitivity at 98.9%, highlighting its strong predictive ability for detecting CD samples. Regarding specificity, the KNN algorithm performed best, achieving 100% specificity for CD samples. The specificity of the SVM algorithm is the lowest at 57.1%. The RF and XGB algorithms predicted the highest specificity of normal samples (approximately 72.5%). The results show that each of the six models has its own advantages and disadvantages. The F1 score can be used as a more reliable indicator. According to the F1 score, among the evaluated models, the RF algorithm exhibited superior performance, achieving the accuracy and sensitivity values of 100% and 96.9%, respectively. The performance metrics for all six CD diagnostic models on the external validation dataset are presented in Table II. The results indicate that all six models accurately classified normal samples; however, their performance varied for CD samples. In terms of accuracy, the XGB algorithm performed the best, achieving 83.3%, while the RF model showed the lowest accuracy at 66.7%. The sensitivity of all models was relatively consistent (approximately 87%). The NN and XGB boosting models demonstrated the highest specificity, both reaching 75%. In the end, gradient-lifted F1 scored the highest at 87.5%. The AUC of gradient lifting on EV was 83.5%.
DISCUSSION
The protein encoded by the GPX8 gene is Glutathione Peroxidase 8, which is a member of the glutathione peroxidase family. Glutathione peroxidase is an important class of antioxidant enzymes; it protects cells from oxidative damage. GPx7 (NPGPx) and GPx8 exhibit a high level of structural similarity.4 Studies have demonstrated that GPx8-deficient mice are more prone to colitis. GPx8 is particularly important in macrophage function. Colon tissue samples from IBD patients showed low GPx8 expression and high caspase-4 levels, suggesting that GPx8 may protect against colitis by inhibiting caspase-4/11 activity.5
S100A9 is a member of the S100 protein family. It is involved in intracellular and intercellular signalling, as well as the migration and activation of inflammatory cells. S100A9 also function independently as monomers, contributing to inflammatory processes when released into the extracellular environment.6 Research has demonstrated a correlation between the expression levels of S100A9 in blood leucocytes and the clinical characteristics of IBD.7 S100A9 is, therefore, considered to be a potential biomarker that can be used to assess the activity of inflammation and the prognosis of disease.
The COL1A1 gene encodes a protein that is crucial for preserving normal tissue structure and function. Fibrosis in CD results from diverse pathological mechanisms, including the accumulation of extracellular matrices (ECM). These processes compromise intestinal tissue integrity.8 High COL1A1 expression is closely linked to the fibrotic phenotype in the intestine,9 as it promotes ECM accumulation and thereby accelerates the development of intestinal fibrosis.10
The STX11 gene encodes the protein grammarin 11, which plays a crucial role in membrane and vesicle fusion processes both within and between cells. Grammarin 11 functions mainly in immune cells, especially in lymphocytes and antigen-presenting cells. It also plays a role in intercellular signalling and the modulation of immune responses.11 Additionally, STX11 may contribute to regulating inflammation and contributing to the onset and progression of autoimmune diseases.12 RAB31, a member of the RAB protein family,13 shows high RAB31 expression in breast cancer tissues.14 Nevertheless, the specific molecular pathways through which RAB31 influences the onset and progression of IBD remain insufficiently elucidated.
The underlying mechanisms of CD are believed to involve changes in the types and numbers of immune cells, leading to disrupted immune balance and sustained damage to the intestinal mucosa.15 The analyses of immune cell infiltration and pathway enrichment have demonstrated that genes with differential expression are intimately linked to immune responses and their associated signalling pathways. Notably, the intestinal samples from CD patients display significant diversity in immune cell composition, including effector memory CD8+ T cells, neutrophils, mast cells, activated dendritic cells, and γδ T cells. Furthermore, macrophages are recognised as the primary source of IL-23 in the gut, and they crosstalk with T cell subsets and innate lymphocytes to amplify intestinal inflammatory responses.16 Patients with CD often show a reduced number of regulatory T cells and impaired inhibitory function.17 In addition, other immune cells, including dendritic cells and mast cells, are important contributors to the pathogenesis of CD.18,19 Neutrophils activate and recruit other immune cells through degranulation, further exacerbating the inflammatory response.20 Spearman’s correlation analysis was performed on five key hub genes and 22 distinct immune cell types.21 The study's results further imply that S100A9, RAB31, STX11, and GPx8 could influence the fibrotic processes in CD by modulating immune responses and the ECM process.
However, the study has some limitations. First, this study relies on bioinformatics analysis of multiple databases, thus further experimental verification is needed. Second, while the study conducted extensive bioinformatics analysis, the underlying mechanisms of the identified hub genes remain under- explored.
CONCLUSION
During the process of gene screening, it was found that RAB31 and STX11 were newly related to the occurrence and development of CD, in addition to COL1A1, S100A9, and GPX8. These genes can be used as potential biomarkers to build clinical predictive models through ML methods.
ETHICAL APPROVAL:
The public databases involved have been ethically reviewed and this article compiles with the terms and policies of the data provider, so ethical approval is not required.
COMPETING INTEREST:
The authors declared no conflict of interest.
AUTHORS’ CONTRIBUTION:
RQ: Statistical analysis of the data and writing of the paper.
LY: Research design and scattered work.
GGS: Design of the study.
All authors approved the final version of the manuscript to be published.
REFERENCES