Journal of the College of Physicians and Surgeons Pakistan
ISSN: 1022-386X (PRINT)
ISSN: 1681-7168 (ONLINE)
doi: 10.29271/jcpsp.2025.12.1590

ABSTRACT
The present systematic review aimed to evaluate the utilisation of artificial intelligence (AI) across several aspects of periodontal diagnosis and treatment planning by studying and analysing recent literature on the assessment of periodontitis through various AI-based radiographic analysis models. The PubMed, Cochrane, ScienceDirect, and Google Scholar databases were searched between 1st June and August 2024. From the shortlisted studies, 15 original research articles were included in the review and assessed for risk of bias using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool developed by the Cochrane Collaboration. All models showed sensitivity comparable to that of human examiners. AI can serve as a time-saving aid for clinicians; however, further studies are required that use a well-defined and accepted gold standard, are applied in clinical settings, and draw on datasets of intraoral periapical series.
Key Words: Artificial intelligence, Periodontitis, Diagnosis.
INTRODUCTION
Alan Turing introduced the concept of machine intelligence in 1950;1 however, McCarthy provided the first definition of artificial intelligence (AI) in 1956.2 AI is an umbrella term describing various fundamental technologies that allow electronic machines to perform tasks that would otherwise require human-like ability.3 As the name indicates, artificial neural networks use artificial neurons, analogous to human neurons, and can thus mimic the human brain and reproduce cognitive skills such as problem-solving, learning, and judgement. Such a system consists of three layers: an input layer that receives the data, a hidden layer that refines the data, and an output layer that makes the final decision for the task.4
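For illustration, the three-layer structure described above can be sketched in Keras, a framework used by several of the included studies; the layer sizes and the binary diseased-versus-healthy output below are hypothetical choices, not taken from any reviewed model.

```python
# Minimal sketch of a three-layer artificial neural network (hypothetical sizes).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(64,)),              # input layer: receives the data
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer: refines the data
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer: final decision (e.g., diseased vs. healthy)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```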
Recently, this technology has been applied across various fields, particularly engineering and medicine. Alatrany et al., for example, used AI-based models to detect early Alzheimer's disease on MRI data,5 illustrating the technology's potential in the early recognition and treatment of major diseases. AI in dentistry is likewise thriving, supporting clinicians in delivering optimal care, and several studies worldwide have highlighted its various uses.6-8
The use of AI has increased in all fields of dentistry; however, it remains immature and has not yet found satisfactory implementation in periodontology.9
Periodontal disease, commonly known as gum disease, is a widespread oral health disorder.10 It is a complex, microbially driven inflammatory condition that causes progressive breakdown of the tooth's supporting tissues, leading to periodontal attachment and bone loss.11 The disease is diagnosed clinically by probing and measuring recession.12 However, this method is not fully satisfactory, as its reliability depends on the force, type, tip diameter, and angulation of the instrument.13 Measuring the amount of bone loss (ABL) on radiographs is another diagnostic method; however, agreement between multiple evaluators is limited, reducing accuracy and reliability, as demonstrated in several studies.14 In recent years, AI has shown promise in improving diagnostic accuracy across various medical and dental disciplines.15
Despite increasing interest and a number of essential studies addressing AI applications in periodontology, the current body of evidence is fragmented owing to variability in the AI models used, sample sizes, performance metrics, and validation techniques.16 As a result, there is no consolidated synthesis of the current capabilities, limitations, and future directions of AI in diagnosing and assessing chronic periodontitis. Therefore, this systematic review aimed to evaluate the present application of AI across several aspects of periodontal diagnosis and treatment planning by studying and analysing recent literature on the assessment of periodontitis through radiographic analysis.
METHODOLOGY
This systematic review followed the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).17 The research question was formulated using the PICO format: In patients undergoing diagnosis and assessment for periodontitis (P), how do convolutional neural network (CNN)-based AI models (I) compare with clinicians using a well-defined gold standard or with other AI-based models (C) in terms of their clinical utility for the diagnosis, detection, prognosis, or treatment planning of periodontitis (O), based on the models' performance metrics?
In terms of eligibility criteria, studies were included if they were original research articles available in open access, published between 2018 and 2024, and used artificial neural networks (ANNs) or convolutional neural networks (CNNs) for the diagnosis, assessment, or evaluation of periodontal bone loss, comparing AI-based models with clinicians or a well-defined gold standard. Systematic reviews, randomised controlled trials, editorials, and book chapters were excluded, as were articles published before 2018, articles not written in English, and articles that did not clearly explain the methodology used to construct the AI model.
The PubMed, Cochrane, ScienceDirect, and Google Scholar databases were initially searched from 1st June to August 2024 using the following keywords: (AI [Mesh] OR Machine Learning [Mesh] OR Deep Learning OR Neural Networks OR Computer-aided Diagnosis OR Decision Support Systems, Clinical [Mesh] OR AI OR ML) AND (Periodontitis, Chronic [Mesh] OR Chronic Periodontitis OR Periodontal Disease OR Gum Disease) AND (Diagnosis [Mesh] OR Assessment OR Detection OR Classification OR Grading OR Staging). The search was repeated once a week for the following two months. To broaden the dataset, articles from the reference lists of the selected studies were also added if they met the inclusion criteria.
One independent researcher carefully evaluated the retrieved studies. Titles and then abstracts were screened, and eligible articles were retained for full-text assessment. The selected articles were then reviewed by a senior professor to verify the selection, and any conflicts were resolved through mutual agreement.
The selected articles underwent full-text review by another researcher. After application of the inclusion criteria, the following information was extracted into an Excel sheet: first author's name and year of publication, country, AI model used, the comparator against which the model was evaluated, data source, dataset, and main findings (Table I).18-32
The selected studies were evaluated for risk of bias using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool. Two examiners carefully evaluated the studies by following its guidelines (Table II).33
RESULTS
A total of fifteen studies employing AI for the detection and evaluation of periodontal bone loss were examined. The research encompassed multiple countries, such as Turkiye, South Korea, China, Germany, the United States, Saudi Arabia, the United Kingdom, Russia, the Netherlands, and Thailand. All studies utilised CNNs or other deep learning models as the main AI model, using radiographic images—such as orthopantomograms (OPGs), periapical radiographs, or cone-beam computed tomography (CBCT)—as input data.
In most studies, AI models demonstrated moderate to high diagnostic accuracy, often similar to that of skilled dental professionals. Bayrakdar et al. found high accuracy in identifying diseased cases, with the model misclassifying only 6 of 105 instances.18 Chang et al. noted a significant correlation between the model and experienced radiologists.19 Jiang et al. showed that AI models outperformed general dentists in detecting early-stage bone loss, whereas human examiners achieved superior performance for advanced lesions.20
Krois et al. observed that stricter diagnostic thresholds led to a decline in model sensitivity, resulting in poorer performance than that of clinicians (p = 0.067).21 Cerda Mardini et al. found that the AI model was effective at detecting mild to moderate bone loss (F1-score = 0.29) but completely ineffective for severe loss (F1-score = 0); the periodontists outperformed the model on every parameter.22
Kim et al. revealed that model performance improved greatly after several training cycles; however, third molar regions remained a weak point of the model.23 Alotaibi et al. reported low diagnostic quality for severe bone loss cases using the VGG-16 model.24
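All the performance values reported here and in Table III derive from the same four confusion-matrix counts. As a minimal sketch, the Python function below reproduces the figures of Bayrakdar et al.18 from the counts in Table I (99 true positives, 6 false negatives, and 12 false positives; the 93 true negatives assume 105 disease-free cases in the 210-image test set):

```python
# Standard diagnostic metrics from confusion-matrix counts.
def diagnostic_metrics(tp, fp, fn, tn):
    sensitivity = tp / (tp + fn)                # diseased cases correctly flagged
    specificity = tn / (tn + fp)                # disease-free cases correctly cleared
    precision = tp / (tp + fp)                  # flagged cases that are truly diseased
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, accuracy, f1

# Counts consistent with Bayrakdar et al.18 (Table I); tn = 93 is an assumption.
print(diagnostic_metrics(tp=99, fp=12, fn=6, tn=93))
# approximately (0.9429, 0.8857, 0.8919, 0.9143, 0.9167), matching Table III
```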
Figure 1: Characteristics of the included studies.
Table I: Characteristics of the included studies.
| First author (year) | Country | Type of AI model (CNN) | Input type | Comparison | Dataset | Main findings |
|---|---|---|---|---|---|---|
| Sevda Kurt (2020) | Turkiye | Pre-trained CNN (GoogleNet Inception v3 network) | Orthopantomograms (OPG) | Maxillofacial radiologist and periodontologist (≥9 years of experience) | 2,276 panoramic images divided into training (1,856), validation (210), and testing (210) sets | Of 105 diseased cases, the model evaluated six incorrectly and 99 correctly; it was less accurate for disease-free cases, with 12 incorrect diagnoses.18 |
| Chang HJ (2020) | South Korea | R-CNN (supported by a pyramid network with ResNet-101) | OPG | Three OMFS radiologists (professor: 10 years of experience; fellow: 5 years; resident: 3 years) | 340 OPGs; to evaluate multiple variables, 330, 115, and 73 images were analysed; the images in each set were distributed into a training set (306) and a testing set (34) | Mean average difference values were lower for canines than for incisors and molars; the ICC between the professor and the AI algorithm showed the highest correlation, indicating superior reliability.19 |
| Linhong Jiang (2022) | China | CNN (U-Net and YOLO-v4) | OPG | Three general dentists (three years of experience each) | 640 panoramic radiographs, after segmentation, separated into a preparation set (512 images) and two trial sets of 64 each | The model scored better than the dentists for stage I and II lesions; however, general dentists showed better results for stage III lesions.20 |
| Joachim Krois (2019) | Germany | Seven-layered network supported by the TensorFlow framework and the Keras software | OPG | Six dentists: one periodontist, one endodontist, and four general dentists | 2,001 manually segmented single-tooth images obtained from 85 radiographs, randomly divided into preparation and validation sets by reordering | The AI model was less accurate than the investigators (p = 0.067); increasing the cut-off value decreased model sensitivity compared with the examiners.21 |
| Diego Cerda Mardini (2024) | USA | Deep convolutional neural network (DCNN), trained with Google TensorFlow/Keras and designed on Xception networks | OPG | Two radiologists, two periodontists, and one general dentist | 500 panoramic radiographs segmented into 2,010 rectangular images: 1,576 for training, 394 for internal testing, and 40 for final testing | Satisfactory performance in diagnosing low-to-medium bone loss (F1-score = 0.29) but ineffective for severe bone loss (F1-score = 0); the periodontists outperformed the model on all features.22 |
| Jaeyoung Kim (2019) | South Korea | DCNN called DeNTNet | OPG | Clinicians with 5, 9, 16, 17, and 19 years of experience | 12,179 panoramic dental radiographs randomly divided into training (11,189), validation (190), and test (800) sets | The baseline DeNTNet model, trained directly, performed satisfactorily compared with clinicians; after multiple rounds of training it achieved better results, but its performance on third molars remained considerably inferior to that of clinicians.23 |
| Ghala Alotaibi (2022) | Saudi Arabia | VGG-16 (Visual Geometry Group) network supported by TensorFlow and Keras | Intraoral periapical films | Three examiners, including a periodontist | 1,724 intraoral periapical images arbitrarily divided into preparation (70%), validation (20%), and testing (10%) sets | Diagnostic quality was lowest for severe bone loss.24 |
| Raymond P. Danks (2021) | United Kingdom | Deep network with symmetric hourglass blocks | Periapical films | Modified hourglass network compared with a baseline ResNet-based regression model | 340 fully anonymised periapical radiographs divided into three groups for 3-fold cross-validation | The proposed model, assessed on radicular structures, showed high performance (88.9%) for anterior teeth.25 |
| Kubra Ertas (2022) | Turkiye | Multiple machine learning classifiers: Support Vector Machine (SVM), Nearest Neighbours, Random Forest, Naive Bayes, and Logistic Regression | OPG | Multiple algorithms compared with each other | Of 280 OPGs, 236 were selected | Success was higher when objective and radiographic evaluations were used; the ResNet50 + SVM dual model showed the highest performance on pre-processed images, with a classification accuracy of 88.2%.26 |
| Bilge Cansu Uzun Saylan (2023) | Turkiye | PyTorch-implemented YOLO-v5 model | OPG | Domain-specific (local) bone loss detection compared with general bone loss detection by the same AI model | 685 panoramic radiographs divided into training (80%) and assessment (20%) sets | The model was more effective at determining bone loss in the maxilla and more accurate at uncovering regional bone loss.27 |
| Ezhov M (2021) | USA and Russia | Diagnocat | Cone-beam computed tomography (CBCT) | Aided group (dentists assisted by the AI model) compared with an unaided group | 99 CBCTs for the periodontitis module | The AI-assisted group had higher operational efficiency; the model decreased the time required to evaluate a single CBCT by 1.19 min (6.78%).28 |
| Nektarios Tsoromokos (2022) | Netherlands | 13-layer deep model with ReLU and four MaxPooling layers | Periapical radiographs | Manual annotation by a radiologist | 446 annotated radiographs: training 327, validation 49, and test 70 | The model underestimated bone loss overall, significantly so for multi-rooted teeth (8.5%) and teeth with angular defects (10%).29 |
| Bhornsawan Thanathornwong (2020) | Thailand | R-CNN network with a ResNet architecture | OPG | Manual annotation by three experts in periodontology | 100 anonymised panoramic radiographs: 70% randomly selected for training, 10% for validation, and 20% for testing | The model achieved an exemplary average recall rate, showing that the bone loss region demarcated by the model excluded most areas of normal teeth.30 |
| Jae-Hong Lee (2018) | South Korea | 13-layered model based on the Keras framework in Python | Periapical radiographs | A single periodontist | 1,740 radiographs divided into training (n = 1,044), validation (n = 348), and test (n = 348) sets | The model had a higher AUC for premolars than clinicians, whereas clinicians showed superior AUC values for molars in tooth-prognosis evaluation; neither difference was significant.31 |
| Patrick Hoss (2023) | Germany | Multiple pre-trained CNNs: ResNet-18, MobileNet V2, ConvNeXT/Small, ConvNeXT/Base, and ConvNeXT/Large | Periapical films | Dentists classified all radiographs as healthy, mild, intermediate, or severe bone loss; experienced examiners re-evaluated each diagnosis independently | 21,819 radiographs divided into a training set (n = 18,819) and a test set (n = 3,000) | Superior performance for mandibular compared with maxillary teeth, with accuracy between 82% and 84%; no model reached 90% accuracy, and performance also differed between quadrants.32 |
Danks et al. achieved 88.9% accuracy for anterior teeth using a symmetric hourglass network,25 and Uzun Saylan et al. found that their YOLO-v5 model had higher accuracy for bone loss in the maxilla.27 Among studies comparing multiple models, such as SVM, Random Forest, and ResNet50, Ertas et al. reported the best results with the ResNet50 + SVM combination, at 88.2% accuracy.26 Hoss et al. observed that mandibular teeth were examined more accurately than maxillary teeth, and no model attained more than 90% accuracy.32
Table II: Risk of bias of the selected studies.
| Study | Year | Country | Patient selection | Index test | Reference standard | Flow and timing | Overall risk of bias |
|---|---|---|---|---|---|---|---|
| Hoss et al.32 | 2023 | Germany | Low | Low | Low | Low | Low |
| Kim et al.23 | 2019 | South Korea | Intermediate | Low | Intermediate | Intermediate | Intermediate |
| Thanathornwong et al.30 | 2020 | Thailand | Intermediate | Low | Intermediate | Intermediate | Intermediate |
| Tsoromokos et al.29 | 2022 | Netherlands | Intermediate | Low | Intermediate | Intermediate | Intermediate |
| Krois et al.21 | 2019 | Germany | Intermediate | Low | Intermediate | Low | Intermediate |
| Cerda Mardini et al.22 | 2024 | USA | Intermediate | Intermediate | Intermediate | Intermediate | Intermediate |
| Jiang et al.20 | 2022 | China | Intermediate | Low | Low | Intermediate | Intermediate |
| Chang et al.19 | 2020 | South Korea | Intermediate | Low | Intermediate | Low | Intermediate |
| Alotaibi et al.24 | 2022 | Saudi Arabia | Intermediate | Low | High | Low | High |
| Danks et al.25 | 2021 | United Kingdom | Intermediate | Low | Intermediate | Low | Intermediate |
| Uzun Saylan et al.27 | 2023 | Turkiye | Intermediate | Low | Intermediate | Low | Intermediate |
| Ezhov et al.28 | 2021 | Russia | Intermediate | Low | Intermediate | Intermediate | Intermediate |
| Ertas et al.26 | 2022 | Turkiye | Intermediate | Low | High | Low | High |
| Lee et al.31 | 2018 | South Korea | Intermediate | Low | Low | Intermediate | Intermediate |
| Bayrakdar et al.18 | 2020 | Turkiye | Intermediate | Low | Intermediate | Low | Intermediate |
Table III: Values of model performance in the selected studies.
| Author (year) | Sensitivity | Specificity | F1-score | Accuracy | Precision |
|---|---|---|---|---|---|
| Bayrakdar et al.18 (2020) | 0.9429 | 0.8857 | 0.9167 | 0.9143 | 0.8919 |
| Jiang et al.20 (2022) | 0.77 | 0.88 | 0.77 | 0.77 | 0.77 |
| Krois et al.21 (2019) | 0.81 | 0.81 | 0.78 | 0.81 | NA |
| Cerda Mardini et al.22 (2024) | 0.230 | 0.260 | 0.150 | NA | 0.110 |
| Kim et al.23 (2019) | 0.77 | 0.95 | 0.75 | NA | NA |
| Uzun Saylan et al.27 (2023) | 0.75 | NA | 0.75 | NA | 0.76 |
| Thanathornwong et al.30 (2020) | 0.84 | 0.88 | 0.81 | NA | 0.81 |
| Alotaibi et al.24 (2022) | 0.73 | 0.79 | 0.73 | 0.73 | 0.73 |
| Tsoromokos et al.29 (2022) | 0.96 | 0.41 | NA | 0.80 | NA |
| Lee et al.31 (2018) | NA | NA | NA | 73.4-82.8% | NA |
| Hoss et al.32 (2023) | 93.9% | 72.7% | NA | 82.0-84.8% | NA |
| Ezhov et al.28 (2021) | 0.9489 | 0.9661 | NA | NA | NA |
| Chang et al.19 (2020) | NA | NA | NA | 0.8143 | NA |
| Danks et al.25 (2021) | NA | NA | NA | 0.58 | NA |
| Ertas et al.26 (2022) | NA | NA | 0.872 | 0.882 | 0.864 |

NA = Not available.
Ezhov et al. underscored the usefulness of AI-assisted diagnosis, as it shortened the examination time.28 Tsoromokos et al. and Lee et al. showed limited performance for molar assessment and prognosis (Table III).29,31
DISCUSSION
Periodontitis is a prevalent chronic inflammatory condition characterised by microbial-induced destruction of tissue, including gingival recession, widening of the periodontal ligament space, and ultimately loss of attachment. These pathological changes contribute to increased tooth mobility and may result in tooth loss, thereby adversely affecting patients’ oral health-related quality of life.34,35 As a result, timely and accurate diagnosis is essential.
Recent studies have explored the use of AI, particularly CNNs, for the detection of periodontitis. AI systems simulate the cognitive functions of the human brain and offer several advantages over conventional diagnostic methods.36 Typically, clinicians annotate datasets using gold-standard criteria, and these annotations are subsequently used to train, validate, and test AI algorithms.37 Among the included studies, CNN-based systems demonstrated reduced diagnostic error rates, particularly when compared with human assessors, who are more prone to fatigue-related inaccuracies.38 Additionally, these systems can identify subtle radiographic features that clinicians may overlook and offer data storage capabilities, enabling future reference and model improvement.39
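A minimal sketch of that annotate-then-split workflow is shown below; `images` and `gold_labels` are hypothetical placeholders for the clinician-annotated data, and the 70/20/10 proportions mirror those reported by Alotaibi et al.24

```python
# Split clinician-annotated data into training, validation, and test sets.
from sklearn.model_selection import train_test_split

X_train, X_rest, y_train, y_rest = train_test_split(
    images, gold_labels, test_size=0.30, random_state=42)  # 70% training
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=1/3, random_state=42)        # 20% validation, 10% testing
```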
A significant source of heterogeneity among the included studies was the variation in the AI models employed. For instance, Ozden et al. compared multiple algorithms and reported that the decision tree classifier showed the highest diagnostic accuracy for disease detection.40 The absence of a standardised model, however, led to inconsistent accuracy rates across studies. To address this issue, an agreement on a baseline model architecture, preferably one applying decision tree classifiers, should be established. This would allow uniformity in performance assessment while maintaining adaptability for future improvement.
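To make the proposal concrete, the sketch below shows the kind of decision-tree baseline such an agreement might standardise on; the features and toy data are hypothetical and are not drawn from any included study.

```python
# Hypothetical decision-tree baseline for periodontitis classification.
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Each row: [mean probing depth (mm), radiographic bone loss (%), bleeding on probing (0/1)]
X = [[2.8, 5.0, 0], [3.1, 10.0, 0], [5.4, 35.0, 1], [6.2, 50.0, 1]]
y = [0, 0, 1, 1]  # 0 = healthy, 1 = periodontitis (toy labels)

baseline = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(baseline, X, y, cv=2))  # per-fold accuracy on the toy data
```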
Regarding the type of radiographic input, most studies used OPGs, while five studies employed periapical radiographs and one used CBCT. OPGs have gained popularity owing to their lower cost, shorter acquisition time, and better patient compliance.41 However, these advantages come with inherent limitations, such as lower resolution and magnification-related distortion.42 Moreover, segmenting OPGs into individual tooth regions for AI input is labour-intensive and may further degrade image quality.43 To resolve these issues, future research should prioritise high-resolution periapical radiographs and employ paralleling techniques for improved accuracy and consistency.44
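As a minimal sketch of that segmentation step (the file name and bounding-box coordinates are hypothetical), cropping and resizing a tooth region from an OPG illustrates where fine detail can be lost:

```python
# Hypothetical crop of a single tooth region from an OPG (Pillow).
from PIL import Image

opg = Image.open("opg_0001.png").convert("L")  # hypothetical OPG file, grayscale
box = (410, 220, 540, 400)                     # hypothetical (left, top, right, bottom) in pixels
tooth_tile = opg.crop(box).resize((128, 128))  # resizing to a fixed input size can blur fine detail
tooth_tile.save("tooth_0001.png")
```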
Further analysis revealed that AI models demonstrated higher diagnostic accuracy for maxillary anterior teeth and premolars. Conversely, diagnostic accuracy was consistently diminished for mandibular anterior teeth and maxillary molars. This discrepancy is attributed to two main factors: overlapping anatomical structures in the mandibular anterior region on OPGs, which conceal radiographic detail,45 and significant anatomical variability in maxillary molar furcation areas.46 These findings highlight the need for larger and more diverse input datasets to train AI systems that can reliably handle anatomical variation.47
Although AI offers considerable potential, current models still depend heavily on clinician input for data annotation and the establishment of ground truth. The training process is sensitive to the expertise of the annotating clinician, resulting in variability in diagnostic accuracy across studies.48,49 This suggests that human supervision remains integral to AI model development, underscoring the need for standardised training and annotation protocols.
An overarching limitation in the examined literature is the disproportionate emphasis placed on radiographic data, frequently to the detriment of clinical judgment. Periodontitis is a multifactorial disease that necessitates a comprehensive clinical assessment, including probing depth, bleeding on probing, and clinical attachment level.50 Future studies should explore integrative models that combine radiographic and clinical parameters to support long-term treatment planning.
CONCLUSION
With the increasing use of digital radiographic diagnostic tools, AI models can be used as an auxiliary tool by examiners to evaluate radiographic bone loss in the assessment of periodontitis. Despite their limitations, these models show acceptable accuracy and precision compared with clinicians. However, for these models to fully replace the clinician in the diagnostic process, multiple limitations must be addressed, including the absence of a baseline model for the evaluation of periodontitis, the scarcity of studies with acceptable datasets, and the lack of an adequate reference standard for AI-based models.51
COMPETING INTEREST:
The authors declared no conflict of interest.
AUTHORS’ CONTRIBUTION:
FT: Conception, literature review, intellectual content, and proofreading.
EH: Manuscript write-up, data analysis, interpretation, literature search, and data collection.
TB: Literature search, plagiarism check and improvement, data collection, and assessment.
All authors approved the final version of the manuscript to be published.