Original Article

Korean J Orthod 2022; 52(1): 3-19

Published online January 25, 2022 https://doi.org/10.4041/kjod.2022.52.1.3

Copyright © The Korean Association of Orthodontists.

Accuracy of one-step automated orthodontic diagnosis model using a convolutional neural network and lateral cephalogram images with different qualities obtained from nationwide multi-hospitals

Sunjin Yima, Sungchul Kimb, Inhwan Kimb, Jae-Woo Parkc, Jin-Hyoung Chod, Mihee Honge, Kyung-Hwa Kangf, Minji Kimg, Su-Jung Kimh, Yoon-Ji Kimi, Young Ho Kimj, Sung-Hoon Limk, Sang Jin Sungi, Namkug Kiml, Seung-Hak Baekm

aDepartment of Orthodontics, School of Dentistry, Seoul National University, Seoul, Korea
bDepartment of Biomedical Engineering, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea
cPrivate Practice, Incheon, Korea
dDepartment of Orthodontics, Chonnam National University School of Dentistry, Gwangju, Korea
eDepartment of Orthodontics, School of Dentistry, Kyungpook National University, Daegu, Korea
fDepartment of Orthodontics, School of Dentistry, Wonkwang University, Iksan, Korea
gDepartment of Orthodontics, College of Medicine, Ewha Womans University, Seoul, Korea
hDepartment of Orthodontics, Kyung Hee University School of Dentistry, Seoul, Korea
iDepartment of Orthodontics, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea
jDepartment of Orthodontics, Institute of Oral Health Science, Ajou University School of Medicine, Suwon, Korea
kDepartment of Orthodontics, College of Dentistry, Chosun University, Gwangju, Korea
lDepartment of Convergence Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea
mDepartment of Orthodontics, School of Dentistry, Dental Research Institute, Seoul National University, Seoul, Korea

Correspondence to: Seung-Hak Baek.
Professor, Department of Orthodontics, School of Dentistry, Dental Research Institute, Seoul National University, 101, Daehak-ro, Jongno-gu, Seoul 03080, Korea.
Tel +82-2-2072-3952 e-mail drwhite@unitel.co.kr
Corresponding author: Namkug Kim.
Professor, Department of Convergence Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Korea.
Tel +82-2-3010-6573 e-mail namkugkim@gmail.com

Sunjin Yim and Sungchul Kim contributed equally to this work (as co-first authors).

Received: March 23, 2021; Revised: June 1, 2021; Accepted: July 2, 2021

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Objective: The purpose of this study was to investigate the accuracy of one-step automated orthodontic diagnosis of skeletodental discrepancies using a convolutional neural network (CNN) and lateral cephalogram images with different qualities obtained from multiple hospitals nationwide. Methods: Among 2,174 lateral cephalograms, 1,993 cephalograms from two hospitals were used for the training and internal test sets, and 181 cephalograms from eight other hospitals were used for an external test set. They were divided into three classification groups according to anteroposterior skeletal discrepancies (Class I, II, and III), vertical skeletal discrepancies (normodivergent, hypodivergent, and hyperdivergent patterns), and vertical dental discrepancies (normal overbite, deep bite, and open bite) as a gold standard. Pre-trained DenseNet-169 was used as a CNN classifier model. Diagnostic performance was evaluated by receiver operating characteristic (ROC) analysis, t-stochastic neighbor embedding (t-SNE), and gradient-weighted class activation mapping (Grad-CAM). Results: In the ROC analysis, the mean area under the curve and the mean accuracy of all classifications were high with both internal and external test sets (all > 0.89 and > 0.80, respectively). In the t-SNE analysis, our model succeeded in creating good separation among the three classification groups. Grad-CAM figures showed differences in the location and size of the focus areas among the three classification groups in each diagnosis. Conclusions: Since the accuracy of our model was validated with both internal and external test sets, it shows the possible usefulness of a one-step automated orthodontic diagnosis tool using a CNN model. However, it still needs technical improvement in terms of classifying vertical dental discrepancies.

Keywords: One-step automated orthodontic diagnosis, Convolutional neural networks, Lateral cephalogram, Multi-center study

INTRODUCTION

Accurate positioning of cephalometric landmarks is one of the most important steps in successful cephalometric analyses. Since the location and visibility of some anatomic landmarks are highly influenced by superimposition of the anatomical structures in the face between the right and left sides,1,2 it is not easy to identify these anatomic landmarks consistently and accurately.

For the last several decades, clinicians have manually indicated the cephalometric landmarks and measured several angles and distances between these landmarks to assess dentofacial deformities.3 Although this manual cephalometric analysis has been substituted with digital cephalometric analysis,4,5 the process is still laborious, time-consuming, and sometimes inaccurate in detection of cephalometric landmarks.3,6-8

Recently, research on automatic detection of cephalometric landmarks using artificial intelligence (AI) with convolutional neural networks (CNNs) has gained popularity.1-3,9-11 These studies have focused mainly on automatic detection of cephalometric landmarks and reported that most cephalometric landmarks were detected within a 2-mm range of accuracy.1,10 However, these approaches still require further measurements of cephalometric parameters including distance, angle, and ratio. Although Kunz et al.11 developed an AI algorithm to analyze 12 cephalometric parameters, they did not make a one-step automated orthodontic diagnosis tool in practice. Therefore, it is necessary to develop a one-step automated orthodontic diagnosis algorithm based on a CNN to avoid the need of additional measurements of cephalometric parameters.

In terms of a one-step CNN algorithm for classification of skeletal discrepancies, Yu et al.8 reported > 90% accuracy, sensitivity, and specificity for diagnosis of the sagittal and vertical skeletal discrepancies in three models (Models I, II, and III). However, they intentionally excluded some portion of the data adjacent to the classification cutoff with intervals of 0.2 standard deviations (SDs) in Model II and 0.3 SDs in Model III in the test set.8 As a result, Models II and III showed a significant increase in the values for accuracy, sensitivity, and specificity compared to Model I.8

The major limitations of previous studies can be summarized as follows:1-3,8-11 (1) Most studies used lateral cephalograms from only one or two hospitals, rather than from several hospitals nationwide with different machine types, radiation exposure conditions, sensors, and image conditions; (2) no study has simultaneously reported dental and skeletal discrepancies using a one-step automated classification algorithm; and (3) if some portion of the data adjacent to the classification cutoff is excluded from the test set, the continuity of the test set is compromised and the accuracy is exaggerated. Therefore, the purpose of this study was to investigate the accuracy of a novel one-step automated orthodontic diagnosis model for determining anteroposterior skeletal discrepancies (APSDs: Class I, Class II, and Class III), vertical skeletal discrepancies (VSDs: normodivergent, hyperdivergent, and hypodivergent), and vertical dental discrepancies (VDDs: normal overbite, open bite, and deep bite) using a CNN and lateral cephalogram images with different qualities from 10 unrelated dental hospitals nationwide in Korea.

MATERIALS AND METHODS

Description of the dataset

A total of 2,174 lateral cephalogram images were retrospectively obtained from the Departments of Orthodontics of 10 hospitals nationwide in Korea: Seoul National University Dental Hospital (SNUDH), Kooalldam Dental Hospital (KADH), Ajou University Dental Hospital (AJUDH), Asan Medical Center (AMC), Chonnam National University Dental Hospital (CNUDH), Chosun University Dental Hospital (CSUDH), Ewha University Medical Center (EUMC), Kyung Hee University Dental Hospital (KHUDH), Kyungpook National University Dental Hospital (KNUDH), and Wonkwang University Dental Hospital (WKUDH). The inclusion criterion was Korean adult patients who underwent orthodontic treatment with or without orthognathic surgery between 2013 and 2020. The exclusion criteria were (1) patients in childhood or adolescence and (2) patients with mixed dentition. All datasets were strictly anonymized before use. The study protocol was reviewed and approved by the Institutional Review Board of SNUDH (ERI20022), the Korean National Institute for Bioethics Policy for KADH (P01-202010-21-020), the Ajou University Hospital Human Research Protection Center (AJIRB-MED-MDB-19-039), AMC (2019-0927), CNUDH (CNUDH-2019-004), CSUDH (CUDHIRB 1901 005), EUMC (EUMC 2019-04-017-003), KHUDH (D19-007-003), KNUDH (KNUDH-2019-03-02-00), and WKUDH (WKDIRB201903-01).

Of these, 1,993 lateral cephalogram images from two hospitals were used for the training set (n = 1,522) and the internal test set (n = 471), and 181 images from the eight other hospitals were used as the external test set to validate our model (Figure 1). Table 1 summarizes the product, radiation exposure condition, sensor, and image conditions of each hospital, which were diverse.

Table 1. Information on the product, radiation exposure condition, sensor, and image condition of the cephalometric radiograph systems in the 10 hospitals

| Hospital | Company | Model | kVp | mA | sec | Image sensor | Sensor size | Image size (pixel × pixel) | Actual resolution (mm/pixel) | Images used in this study (n) |
|---|---|---|---|---|---|---|---|---|---|---|
| SNUDH | Asahi | CX-90SP-II | 76 | 80 | 0.32 | Cassette (CR system) | 10 × 12 inch | 2,000 × 2,510 / 2,010 × 1,670 | 0.150 / 0.100 | 1,129 |
| KADH | Vatech | Uni3D NC | 85 | 10 | 0.9 | CCD sensor | 30 × 25 cm | 2,360 × 1,880 | 0.110 | 864 |
| AJUDH | Planmeca | Proline XC | 68 | 7 | 2.3 | CCD sensor | 10.6 × 8.85 inch | 1,039 × 1,200 | 0.250 | 22 |
| AMC | Carestream | CS9300 | 80 | 12 | 0.63 | CCD sensor | 30 × 30 cm | 2,045 × 2,272 / 1,012 × 2,020 | 0.132 / 0.145 | 21 |
| CNUDH | Instrumentarium | OrthoCeph OC 100 | 85 | 12 | 1.6 | Cassette (CR system) | 10 × 12 inch | 2,500 × 2,048 | 0.115 | 20 |
| CSUDH | Planmeca | Proline XC | 80 | 12 | 1.8 | Cassette (CR system) | 8 × 10 inch | 2,392 × 1,792 / various | 0.100 | 30 |
| EUMC | Asahi | Ortho stage (Auto III N CM) | 75 | 15 | 1 | Cassette (CR system) | 8 × 12 inch | 2,510 × 2,000 | 0.100 | 26 |
| KHUDH | Asahi | CX-90SP | 70 | 15 | 0.3–0.35 | Cassette (CR system) | 10 × 12 inch | 2,500 × 2,048 | 0.110 | 23 |
| KNUDH | Asahi | CX-90SP-II | 70 | 80 | 0.32 | Cassette (CR system) | 11 × 14 inch | 1,950 × 2,460 / 2,108 × 1,752 | 0.100 | 19 |
| WKUDH | Planmeca | Promax | Female 72, Male 74 | 10 | 1.87 | CCD sensor | 27 × 30 cm | 1,818 × 2,272 | 0.132 | 20 |

SNUDH, Seoul National University Dental Hospital; KADH, Kooalldam Dental Hospital; AJUDH, Ajou University Dental Hospital; AMC, Asan Medical Center; CNUDH, Chonnam National University Dental Hospital; CSUDH, Chosun University Dental Hospital; EUMC, Ewha University Medical Center; KHUDH, Kyung Hee University Dental Hospital; KNUDH, Kyungpook National University Dental Hospital; WKUDH, Wonkwang University Dental Hospital; CR, computed radiography; CCD, charge-coupled device.


Figure 1. Flowchart of dataset and experimental setup.
CNN, convolutional neural network.

Setting a gold standard for the diagnosis of APSDs, VSDs, and VDDs

After detection of the cephalometric landmarks including A point, nasion, B point, orbitale, porion, gonion, menton, sella, maxilla 1 crown, maxilla 6 distal, mandible 1 crown, and mandible 6 distal by a single operator (SY), the cephalometric parameters including A point-Nasion-B point (ANB) angle, Frankfort mandibular plane angle (FMA), Jarabak’s posterior/anterior facial height ratio (FHR), and overbite were calculated using V-Ceph 8.0 (Osstem, Seoul, Korea) to set a gold standard.

All cephalometric images were classified into the three classification groups by a single operator (SY) as follows. For classification of APSDs, we defined an ANB value between –1 SD and 1 SD from the ethnic norm of each sex12 as skeletal Class I; > 1 SD as skeletal Class II; and < –1 SD as skeletal Class III. For classification of VSDs, we combined the FMA and FHR values relative to the ethnic norm of each sex.12 First, we normalized the FMA and FHR values using the SD values. Second, the sign of the normalized FHR values was flipped, because FHR varies inversely with FMA. Third, the normalized FMA and flipped FHR values were added, with each regarded as carrying equal weight. Fourth, the mean and SD values of the combined score were obtained for classification into three groups. We then defined values between –1 SD and 1 SD from the mean as the normodivergent pattern, > 1 SD as the hyperdivergent pattern, and < –1 SD as the hypodivergent pattern. For classification of VDDs, we defined an overbite value between 0 mm and 3 mm as a normal overbite, > 3 mm as a deep bite, and < 0 mm as an open bite (Tables 2 and 3).
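These labeling rules reduce to simple threshold functions. The following is a minimal sketch of that logic (not the authors' code; the function names and use of NumPy are illustrative), with the sex-specific norms passed in as (mean, SD) pairs from Table 2:

```python
import numpy as np

def classify_apsd(anb, anb_norm):
    """APSD label from the ANB angle and the sex-specific ethnic norm (mean, SD)."""
    z = (anb - anb_norm[0]) / anb_norm[1]
    return "Class II" if z > 1 else "Class III" if z < -1 else "Class I"

def vsd_score(fma, fhr, fma_norm, fhr_norm):
    """Combined VSD score: normalized FMA plus sign-flipped normalized FHR."""
    z_fma = (fma - fma_norm[0]) / fma_norm[1]
    z_fhr = (fhr - fhr_norm[0]) / fhr_norm[1]
    return z_fma - z_fhr  # FHR is flipped because it varies inversely with FMA

def classify_vsd(score, scores_all):
    """VSD label from the combined score, re-standardized over the dataset."""
    z = (score - np.mean(scores_all)) / np.std(scores_all)
    return "hyperdivergent" if z > 1 else "hypodivergent" if z < -1 else "normodivergent"

def classify_vdd(overbite_mm):
    """VDD label from overbite in mm, using the absolute 0 mm and 3 mm cutoffs."""
    return "deep bite" if overbite_mm > 3 else "open bite" if overbite_mm < 0 else "normal overbite"
```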

Table 2. Classification criteria for the anteroposterior skeletal discrepancies (APSDs), vertical skeletal discrepancies (VSDs), and vertical dental discrepancies (VDDs) for orthodontic analysis

| Sex | ANB mean (SD) | FMA mean (SD) | FHR mean (SD) | Overbite mean (SD) |
|---|---|---|---|---|
| Female | 2.4 (1.8) | 24.2 (4.6) | 65 (9) | 1.5 (1.5) |
| Male | 1.78 (2.02) | 26.78 (1.79) | 66.37 (5.07) | |

ANB, angle among A point, nasion, and B point; FMA, Frankfort mandibular plane angle; FHR, Jarabak’s posterior/anterior facial height ratio; SD, standard deviation.


Table 3. Distribution of classification groups in each diagnosis for the human gold standard in the training set, internal test set, and external test set

| Diagnosis | Group | Train SNUDH | Train KADH | Train sum (%) | Int. test SNUDH | Int. test KADH | Int. test sum (%) | Ext. AJUDH | Ext. AMC | Ext. EUMC | Ext. CNUDH | Ext. CSUDH | Ext. KHUDH | Ext. KNUDH | Ext. WKUDH | Ext. test sum (%) | Int. + ext. test sets (%) | Total (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| APSDs | Class I | 238 | 323 | 561 (36.9) | 122 | 40 | 162 (34.4) | 8 | 6 | 5 | 4 | 7 | 11 | 7 | 2 | 50 (27.6) | 212 (32.5) | 773 (35.6) |
| | Class II | 183 | 263 | 446 (29.3) | 112 | 44 | 156 (33.1) | 8 | 8 | 11 | 8 | 13 | 4 | 4 | 6 | 62 (34.3) | 218 (33.4) | 664 (30.5) |
| | Class III | 359 | 156 | 515 (33.8) | 115 | 38 | 153 (32.5) | 6 | 7 | 10 | 8 | 10 | 8 | 8 | 12 | 69 (38.1) | 222 (34.0) | 737 (33.9) |
| | Sum | 780 | 742 | 1,522 | 349 | 122 | 471 | 22 | 21 | 26 | 20 | 30 | 23 | 19 | 20 | 181 | 652 | 2,174 |
| VSDs | Normodivergent | 331 | 389 | 720 (47.3) | 146 | 50 | 196 (41.6) | 10 | 6 | 7 | 9 | 17 | 10 | 7 | 7 | 73 (40.3) | 270 (41.4) | 989 (45.5) |
| | Hyperdivergent | 314 | 241 | 555 (36.5) | 135 | 40 | 175 (37.2) | 5 | 9 | 12 | 6 | 3 | 7 | 8 | 6 | 56 (30.9) | 231 (35.4) | 786 (36.2) |
| | Hypodivergent | 135 | 112 | 247 (16.2) | 68 | 32 | 100 (21.2) | 7 | 6 | 7 | 5 | 10 | 6 | 4 | 7 | 52 (28.7) | 151 (23.2) | 399 (18.4) |
| | Sum | 780 | 742 | 1,522 | 349 | 122 | 471 | 22 | 21 | 26 | 20 | 30 | 23 | 19 | 20 | 181 | 652 | 2,174 |
| VDDs | Normal overbite | 440 | 493 | 933 (61.3) | 196 | 53 | 249 (52.9) | 11 | 11 | 10 | 8 | 9 | 10 | 10 | 10 | 79 (43.6) | 328 (50.3) | 1,261 (58.0) |
| | Open bite | 209 | 194 | 403 (26.5) | 99 | 41 | 140 (29.7) | 4 | 7 | 9 | 5 | 9 | 8 | 4 | 5 | 51 (28.2) | 191 (29.3) | 594 (27.3) |
| | Deep bite | 131 | 55 | 186 (12.2) | 54 | 28 | 82 (17.4) | 7 | 3 | 7 | 7 | 12 | 5 | 5 | 5 | 51 (28.2) | 133 (20.4) | 319 (14.7) |
| | Sum | 780 | 742 | 1,522 | 349 | 122 | 471 | 22 | 21 | 26 | 20 | 30 | 23 | 19 | 20 | 181 | 652 | 2,174 |

Values are presented as number only or number (%).

APSDs, anteroposterior skeletal discrepancies; VSDs, vertical skeletal discrepancies; VDDs, vertical dental discrepancies; SNUDH, Seoul National University Dental Hospital; KADH, Kooalldam Dental Hospital; AJUDH, Ajou University Dental Hospital; AMC, Asan Medical Center; EUMC, Ewha University Medical Center; CNUDH, Chonnam National University Dental Hospital; CSUDH, Chosun University Dental Hospital; KHUDH, Kyunghee University Dental Hospital; KNUDH, Kyungpook National University Dental Hospital; WKUDH, Wonkwang University Dental Hospital.



To assess intra-examiner reliability, all classifications of APSDs, VSDs, and VDDs were performed again after one month by the same investigator (SY). Since the suggested minimum sample size13 for a 3 × 3 Cohen's kappa agreement test is 49, 100 images were randomly selected for re-classification of APSDs, VSDs, and VDDs. Cohen's kappa agreement test showed "almost perfect" agreement (kappa values: 0.939 for APSDs, 0.984 for VSDs, and 0.907 for VDDs).14 Therefore, the first classification results were used for further statistical analysis.

To evaluate inter-examiner reliability, the same images used to assess intra-examiner reliability were classified by another investigator (KL). Cohen's kappa agreement test showed "almost perfect" agreement for APSDs and VSDs (kappa values: 0.985 for APSDs and 0.919 for VSDs) and "substantial" agreement for VDDs (0.601).14
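For reference, the statistic used here is the unweighted Cohen's kappa over the 3 × 3 contingency of two rating passes; a minimal sketch with scikit-learn (the label lists below are toy stand-ins, not study data):

```python
from sklearn.metrics import cohen_kappa_score

# Toy stand-ins for two rating passes over the same images
first_pass  = ["Class I", "Class II", "Class III", "Class I", "Class II"]
second_pass = ["Class I", "Class II", "Class III", "Class II", "Class II"]

kappa = cohen_kappa_score(first_pass, second_pass)
print(f"kappa = {kappa:.3f}")  # 0.81-1.00 is "almost perfect" agreement14
```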

Preprocessing of the data

Augmentation techniques including cropping, padding, spatial transformations, and appearance transformations were applied in real time during training.
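The paper names the augmentation families but not the library or parameter values, so the on-the-fly pipeline below is an illustrative sketch in torchvision with all magnitudes chosen arbitrarily:

```python
import torchvision.transforms as T

train_transform = T.Compose([
    T.Grayscale(num_output_channels=3),  # replicate the single radiographic channel
    T.Resize((512, 512)),
    T.Pad(32),                           # padding
    T.RandomCrop(512),                   # cropping
    T.RandomAffine(degrees=5, translate=(0.05, 0.05), scale=(0.95, 1.05)),  # spatial
    T.ColorJitter(brightness=0.2, contrast=0.2),                            # appearance
    T.ToTensor(),
])
```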

Model architecture (Figure 2)

As the backbone of the model, DenseNet-169 pre-trained with weights of the ImageNet dataset was used with group normalization (GN).15-20 After the global average pooling (GAP) of the backbone, ArcFace was added in parallel with the softmax layer in order to overcome imbalanced data sets and obtain discriminative features during training.21

Figure 2. Diagrams of the model architecture. A, During training, an ArcFace head was added to the last convolutional layer of the backbone in parallel with the softmax layer. B, After training, the ArcFace head was removed and inference was implemented using only the softmax layer.

After training, the ArcFace head was removed, and inference was implemented using only the softmax layer as a basic CNN classifier. Because sex was included as a classification criterion for APSDs and VSDs, a one-hot vector encoding sex was concatenated with the feature vector after GAP.
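A condensed PyTorch sketch of this architecture follows. It is a reconstruction under stated assumptions, not the authors' code: the GroupNorm swap, the ArcFace hyperparameters (s, m), and the exact head wiring are illustrative, while the overall topology (ImageNet-pretrained DenseNet-169, GAP, sex one-hot concatenation, parallel ArcFace and softmax heads during training) follows Figure 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import densenet169

def bn_to_gn(module: nn.Module, groups: int = 32) -> None:
    """Recursively swap BatchNorm2d for GroupNorm (a simplification; the GN
    layers start untrained even though the convolutions keep ImageNet weights)."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(module, name, nn.GroupNorm(groups, child.num_features))
        else:
            bn_to_gn(child, groups)

class ArcFaceHead(nn.Module):
    """Additive angular margin head (Deng et al.21); s and m are assumed values."""
    def __init__(self, in_dim, n_classes, s=30.0, m=0.5):
        super().__init__()
        self.w = nn.Parameter(torch.randn(n_classes, in_dim))
        self.s, self.m = s, m

    def forward(self, x, labels):
        cos = F.linear(F.normalize(x), F.normalize(self.w)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)  # margin on true class
        return self.s * logits

class OneStepClassifier(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        backbone = densenet169(weights="IMAGENET1K_V1")  # pretrained on ImageNet
        bn_to_gn(backbone.features)
        self.features = backbone.features
        feat_dim = backbone.classifier.in_features + 2   # +2 for the sex one-hot
        self.softmax_head = nn.Linear(feat_dim, n_classes)
        self.arcface_head = ArcFaceHead(feat_dim, n_classes)

    def forward(self, img, sex_onehot, labels=None):
        x = F.adaptive_avg_pool2d(F.relu(self.features(img)), 1).flatten(1)  # GAP
        x = torch.cat([x, sex_onehot], dim=1)
        if labels is not None:                      # training: both heads in parallel
            return self.softmax_head(x), self.arcface_head(x, labels)
        return self.softmax_head(x)                 # inference: softmax head only
```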

Model training (Figures 1 and 2)

Training for APSDs, VSDs, and VDDs was performed using only a gold standard determined by a single operator (SY), not by measurement of cephalometric parameters including ANB, FMA, FHR, and overbite.

Model testing

After training was completed, one-step classification was performed with both the internal and external test sets to validate the performance of the constructed model. It took 55 seconds (sec) to diagnose the internal test set (0.1168 sec per lateral cephalogram) and 22 sec to diagnose the external test set (0.1215 sec per lateral cephalogram). The results for the internal and external test sets were compared with gold standard diagnostic data.

Analysis method

Receiver operating characteristic (ROC) analysis

The performance of our model was evaluated in terms of accuracy, area under the curve (AUC), sensitivity, and specificity, using both binary and multiple-class ROC analyses.8,22,23
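As a concrete illustration of the binary (one-vs-rest) part of this analysis, the sketch below computes per-class AUC and accuracy from softmax outputs with scikit-learn; the random arrays are placeholders for the model's predictions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)          # integer gold-standard labels
probs = rng.dirichlet(np.ones(3), size=200)    # stand-in for softmax outputs

for c in range(3):  # one-vs-rest per class
    auc = roc_auc_score((y_true == c).astype(int), probs[:, c])
    acc = accuracy_score((y_true == c).astype(int), (probs.argmax(1) == c).astype(int))
    print(f"class {c}: AUC = {auc:.4f}, accuracy = {acc:.4f}")
```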

t-stochastic neighbor embedding (t-SNE)

Since this technique can visualize high-dimensional data by giving each datapoint a location in a two- or three-dimensional map, it was used to check the feature distribution of the training set, internal test set, and external test set after the GAP layer.24 In each diagnosis, the labels of ground truth (GT) and prediction (PD) were set to check the distribution of each dataset.
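A sketch of the corresponding embedding step (the feature dimensions, perplexity, and random data are illustrative; the real input would be the GAP-layer feature vectors with GT or PD labels):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.random((300, 1666))       # stand-in for GAP-layer features
labels = rng.integers(0, 3, size=300)    # GT or PD class labels

embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=8)
plt.title("t-SNE of GAP features")
plt.show()
```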

Gradient-weighted class activation mapping (Grad-CAM)25

As this technique produces visual explanations of AI models, it can show the regions on which the AI focuses for its PD. It was used to confirm the regions on which our model mainly focused in the diagnosis of APSDs, VSDs, and VDDs.
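A compact sketch of the Grad-CAM computation (forward/backward hooks on a convolutional block; the model signature matches the architecture sketch above and is an assumption, not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, img, sex_onehot, target_class, conv_layer):
    """Return a [0, 1] heat map for target_class, e.g., conv_layer=model.features."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    try:
        logits = model(img, sex_onehot)            # batch of 1, inference head
        model.zero_grad()
        logits[0, target_class].backward()
    finally:
        h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # GAP over gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted activation map
    cam = F.interpolate(cam.unsqueeze(1), size=img.shape[-2:], mode="bilinear")
    return (cam / (cam.max() + 1e-8)).squeeze()
```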

RESULTS

Metrology distribution of the APSDs, VSDs, and VDDs per dataset (Figure 3)

The continuity of the dataset between the normal groups (Class I in APSDs, normodivergent pattern in VSDs, and normal overbite in VDDs) and the other two groups (Class II and III in APSDs, hyperdivergent and hypodivergent patterns in VSDs, and open bite and deep bite in VDDs) was confirmed.

Figure 3. Metrology distribution of the anteroposterior skeletal discrepancies (APSDs: Class I, Class II, and Class III), vertical skeletal discrepancies (VSDs: normodivergent pattern, hyperdivergent pattern, and hypodivergent pattern), and vertical dental discrepancies (VDDs: normal overbite, open bite, and deep bite) per dataset. Red lines in APSDs and VSDs indicate one standard deviation of the normal classification. Red lines in VDDs indicate the boundary values, which were 0 mm and 3 mm.
ANB, angle among A point, nasion, and B point; FMA, Frankfort mandibular plane angle; FHR, Jarabak’s posterior/anterior facial height ratio; norm, normalized; Man, mandible 1 crown; Max, maxilla 1 crown; dist, distance.

Accuracy and AUC of the internal test set in binary ROC analysis (Table 4 and Figure 4)

In APSDs, Class III had the highest accuracy and AUC (0.9372 and 0.9807, respectively), followed by Class II (0.8972 and 0.9533, respectively) and Class I (0.8488 and 0.9212, respectively). In VSDs, hypodivergent pattern had the highest accuracy and AUC (0.9346 and 0.9824, respectively), followed by hyperdivergent pattern (0.9019 and 0.9730, respectively) and normodivergent pattern (0.8365 and 0.9186, respectively). In VDDs, open bite had the highest accuracy and AUC (0.8730 and 0.9475, respectively), followed by deep bite (0.8637 and 0.9286, respectively) and normal overbite (0.7376 and 0.8177, respectively).

Table 4. Performance of our model for the diagnosis of the APSDs, VSDs, and VDDs in the internal test set and external test set using the binary ROC analysis

| Diagnosis | Group | Accuracy (internal) | Accuracy (external) | AUC (internal) | AUC (external) | Sensitivity (internal) | Sensitivity (external) | Specificity (internal) | Specificity (external) |
|---|---|---|---|---|---|---|---|---|---|
| APSDs | Class I | 0.8488 ± 0.0103 | 0.8320 ± 0.0230 | 0.9212 ± 0.0038 | 0.9042 ± 0.0195 | 0.7938 ± 0.0328 | 0.7840 ± 0.0297 | 0.8764 ± 0.0186 | 0.8504 ± 0.0273 |
| | Class II | 0.8972 ± 0.0057 | 0.8796 ± 0.0153 | 0.9533 ± 0.0026 | 0.9601 ± 0.0067 | 0.8192 ± 0.0334 | 0.7226 ± 0.0515 | 0.9359 ± 0.0161 | 0.9613 ± 0.0046 |
| | Class III | 0.9372 ± 0.0063 | 0.9525 ± 0.0108 | 0.9807 ± 0.0025 | 0.9930 ± 0.0023 | 0.9111 ± 0.0225 | 0.9652 ± 0.0079 | 0.9497 ± 0.0086 | 0.9446 ± 0.0160 |
| | Mean | 0.8944 ± 0.0368 | 0.8880 ± 0.0518 | 0.9517 ± 0.0245 | 0.9524 ± 0.0382 | 0.8414 ± 0.0571 | 0.8239 ± 0.1076 | 0.9206 ± 0.0345 | 0.9188 ± 0.0516 |
| VSDs | Normodivergent | 0.8365 ± 0.0082 | 0.8309 ± 0.0267 | 0.9186 ± 0.0046 | 0.9157 ± 0.0151 | 0.8235 ± 0.0279 | 0.7699 ± 0.0416 | 0.8458 ± 0.0122 | 0.8722 ± 0.0178 |
| | Hyperdivergent | 0.9019 ± 0.0035 | 0.9061 ± 0.0203 | 0.9730 ± 0.0047 | 0.9730 ± 0.0047 | 0.8149 ± 0.0273 | 0.9143 ± 0.0293 | 0.9534 ± 0.0190 | 0.9024 ± 0.0360 |
| | Hypodivergent | 0.9346 ± 0.0098 | 0.9094 ± 0.0164 | 0.9824 ± 0.0015 | 0.9684 ± 0.0026 | 0.9000 ± 0.0394 | 0.8000 ± 0.0661 | 0.9445 ± 0.0127 | 0.9535 ± 0.0110 |
| | Mean | 0.8910 ± 0.0413 | 0.8821 ± 0.0410 | 0.9580 ± 0.0283 | 0.9523 ± 0.0273 | 0.8461 ± 0.0478 | 0.8280 ± 0.0757 | 0.9146 ± 0.0505 | 0.9094 ± 0.0398 |
| VDDs | Normal overbite | 0.7376 ± 0.0291 | 0.7591 ± 0.0230 | 0.8177 ± 0.0166 | 0.8359 ± 0.0152 | 0.6530 ± 0.0956 | 0.6582 ± 0.0664 | 0.8288 ± 0.0441 | 0.8373 ± 0.0557 |
| | Open bite | 0.8730 ± 0.0130 | 0.8917 ± 0.0139 | 0.9475 ± 0.0053 | 0.9626 ± 0.0074 | 0.8371 ± 0.0366 | 0.8275 ± 0.0611 | 0.8882 ± 0.0304 | 0.9262 ± 0.0228 |
| | Deep bite | 0.8637 ± 0.0270 | 0.8586 ± 0.0127 | 0.9286 ± 0.0099 | 0.9238 ± 0.0055 | 0.8000 ± 0.1100 | 0.8196 ± 0.0836 | 0.8781 ± 0.0530 | 0.8723 ± 0.0457 |
| | Mean | 0.8248 ± 0.0654 | 0.8365 ± 0.0584 | 0.8979 ± 0.0582 | 0.9074 ± 0.0538 | 0.7634 ± 0.1111 | 0.7684 ± 0.1006 | 0.8651 ± 0.0468 | 0.8786 ± 0.0535 |

Values are presented as mean ± standard deviation.

APSDs, anteroposterior skeletal discrepancies; VSDs, vertical skeletal discrepancies; VDDs, vertical dental discrepancies; ROC, receiver operating characteristic; AUC, area under the curve; SD, standard deviation.


Figure 4. The results of the binary receiver operating characteristic curve analysis (A) in the internal test set from two hospitals and (B) in the external test set from the eight other hospitals for diagnosis of anteroposterior skeletal discrepancies (APSDs), vertical skeletal discrepancies (VSDs), and vertical dental discrepancies (VDDs).
AUC, area under the curve.

In APSDs and VSDs, the total accuracy reached nearly 0.9 and the total AUC exceeded 0.95 (0.9517 and 0.9580, respectively). However, VDDs showed a relatively lower total accuracy (0.8248 vs. 0.8944 and 0.8910) and total AUC (0.8979 vs. 0.9517 and 0.9580) than APSDs and VSDs.

Accuracy and AUC of the external test set in binary ROC analysis (Table 4 and Figure 4)

In APSDs, Class III had the highest accuracy and AUC (0.9525 and 0.9930, respectively), followed by Class II (0.8796 and 0.9601, respectively) and Class I (0.8320 and 0.9042, respectively). In VDDs, open bite had the highest accuracy and AUC (0.8917 and 0.9626, respectively), followed by deep bite (0.8586 and 0.9238, respectively) and normal overbite (0.7591 and 0.8359, respectively). However, VSDs showed a different pattern between accuracy and AUC. Although the accuracy was highest for hypodivergent pattern (0.9094), followed by hyperdivergent pattern (0.9061) and normodivergent pattern (0.8309), the AUC was highest for hyperdivergent pattern (0.9730), followed by hypodivergent pattern (0.9684) and normodivergent pattern (0.9157).

In APSDs and VSDs, the total accuracy reached nearly 0.9 and the total AUC exceeded 0.95. However, VDDs showed a relatively lower total accuracy (0.8365 vs. 0.8880 and 0.8821) and total AUC (0.9074 vs. 0.9524 and 0.9523) than APSDs and VSDs.

Comparison of AUC values between internal and external test sets in binary ROC analysis (Table 4)

In APSDs and VDDs, Class III and open bite showed the highest AUCs among their respective classifications (Class III: 0.9807 in the internal test set and 0.9930 in the external test set; open bite: 0.9475 and 0.9626, respectively). However, VSDs showed a different pattern. The internal test set showed the highest AUC for the hypodivergent pattern (0.9824), while the external test set showed the highest AUC for the hyperdivergent pattern (0.9730). However, the difference between these AUC values was less than 0.01.

Comparison of AUC values between internal and external test sets in multiple ROC analysis (Table 5)

In terms of pairwise AUCs in the internal and external test sets of APSDs, VSDs, and VDDs, Class II vs. Class III ([II→III, 0.9913; II←III, 0.9920]; [II→III, 0.9992; II←III, 0.9989]; Δ value [II→III, 0.0079; II←III, 0.0069]), hyperdivergent pattern vs. hypodivergent pattern ([hyper→hypo, 0.9998; hyper←hypo, 0.9998]; [hyper→hypo, 0.9930; hyper←hypo, 0.9977]; Δ value [hyper→hypo, –0.0068; hyper←hypo, –0.0021]), and open bite vs. deep bite ([open→deep, 0.9982; open←deep, 0.9951]; [open→deep, 0.9924; open←deep, 0.9956]; Δ value [open→deep, –0.0058; open←deep, 0.0005]) showed the highest values in both the internal and external test sets and the smallest differences compared to other pairwise classifications.

Table 5. Performance of our model for the diagnosis of the APSDs, VSDs, and VDDs in the internal test set and external test set using the multiple ROC analysis

| Diagnosis | Pair (direction) | Accuracy (internal) | Accuracy (external) | Pairwise AUC (internal) | Pairwise AUC (external) | Pairwise sensitivity (internal) | Pairwise sensitivity (external) | Pairwise specificity (internal) | Pairwise specificity (external) |
|---|---|---|---|---|---|---|---|---|---|
| APSDs | Class I → Class II | 0.8503 ± 0.0086 | 0.8054 ± 0.0222 | 0.8943 ± 0.0106 | 0.8222 ± 0.3830 | 0.8802 ± 0.0283 | 0.9080 ± 0.0098 | 0.8192 ± 0.0299 | 0.7226 ± 0.0461 |
| | Class I ← Class II | 0.8503 ± 0.0086 | 0.8054 ± 0.0222 | 0.9175 ± 0.0039 | 0.9061 ± 0.0136 | 0.8192 ± 0.0299 | 0.7226 ± 0.0461 | 0.8802 ± 0.0283 | 0.9080 ± 0.0098 |
| | Class I → Class III | 0.9143 ± 0.0092 | 0.9277 ± 0.0147 | 0.9486 ± 0.0057 | 0.9780 ± 0.0039 | 0.9173 ± 0.0149 | 0.8760 ± 0.0320 | 0.9111 ± 0.0201 | 0.9652 ± 0.0071 |
| | Class I ← Class III | 0.9143 ± 0.0092 | 0.9277 ± 0.0147 | 0.9698 ± 0.0035 | 0.9856 ± 0.0032 | 0.9111 ± 0.0201 | 0.9652 ± 0.0071 | 0.9173 ± 0.0149 | 0.8760 ± 0.0320 |
| | Class II → Class III | 0.9754 ± 0.0033 | 0.9725 ± 0.0142 | 0.9913 ± 0.0014 | 0.9992 ± 0.0009 | 0.9654 ± 0.0077 | 0.9419 ± 0.0299 | 0.9856 ± 0.0026 | 1.0000 ± 0.0000 |
| | Class II ← Class III | 0.9754 ± 0.0033 | 0.9725 ± 0.0142 | 0.9920 ± 0.0013 | 0.9989 ± 0.0013 | 0.9856 ± 0.0026 | 1.0000 ± 0.0000 | 0.9654 ± 0.0077 | 0.9419 ± 0.0299 |
| VSDs | Hyper → Hypo | 0.9905 ± 0.0037 | 0.9778 ± 0.0126 | 0.9998 ± 0.0002 | 0.9930 ± 0.0019 | 0.9851 ± 0.0058 | 1.0000 ± 0.0000 | 1.0000 ± 0.0000 | 0.9538 ± 0.0261 |
| | Hyper ← Hypo | 0.9905 ± 0.0037 | 0.9778 ± 0.0126 | 0.9998 ± 0.0001 | 0.9977 ± 0.0003 | 1.0000 ± 0.0000 | 0.9538 ± 0.0261 | 0.9851 ± 0.0058 | 1.0000 ± 0.0000 |
| | Hyper → Normo | 0.8755 ± 0.0040 | 0.8791 ± 0.0223 | 0.9593 ± 0.0063 | 0.9587 ± 0.0068 | 0.8149 ± 0.0244 | 0.9143 ± 0.0262 | 0.9296 ± 0.0257 | 0.8521 ± 0.0485 |
| | Hyper ← Normo | 0.8755 ± 0.0040 | 0.8791 ± 0.0223 | 0.9034 ± 0.0088 | 0.9329 ± 0.0119 | 0.9296 ± 0.0257 | 0.8521 ± 0.0485 | 0.8149 ± 0.0244 | 0.9143 ± 0.0262 |
| | Hypo → Normo | 0.8959 ± 0.0139 | 0.8688 ± 0.0212 | 0.9669 ± 0.0024 | 0.9459 ± 0.0042 | 0.9000 ± 0.0352 | 0.8000 ± 0.0591 | 0.8939 ± 0.0231 | 0.9178 ± 0.0173 |
| | Hypo ← Normo | 0.8959 ± 0.0139 | 0.8688 ± 0.0212 | 0.9451 ± 0.0153 | 0.8972 ± 0.0316 | 0.8939 ± 0.0231 | 0.9178 ± 0.0173 | 0.9000 ± 0.0352 | 0.8000 ± 0.0591 |
| VDDs | Open → Deep | 0.9766 ± 0.0112 | 0.9706 ± 0.0186 | 0.9982 ± 0.0012 | 0.9924 ± 0.0044 | 0.9814 ± 0.0116 | 0.9922 ± 0.0096 | 0.9683 ± 0.0412 | 0.9490 ± 0.0319 |
| | Open ← Deep | 0.9766 ± 0.0112 | 0.9706 ± 0.0186 | 0.9951 ± 0.0066 | 0.9956 ± 0.0042 | 0.9683 ± 0.0412 | 0.9490 ± 0.0319 | 0.9814 ± 0.0116 | 0.9922 ± 0.0096 |
| | Open → Normal | 0.8463 ± 0.0141 | 0.8538 ± 0.0201 | 0.9308 ± 0.0063 | 0.9434 ± 0.0084 | 0.8414 ± 0.0318 | 0.8314 ± 0.0520 | 0.8490 ± 0.0363 | 0.8684 ± 0.0284 |
| | Open ← Normal | 0.8463 ± 0.0141 | 0.8538 ± 0.0201 | 0.8190 ± 0.0452 | 0.8373 ± 0.0341 | 0.8490 ± 0.0363 | 0.8684 ± 0.0284 | 0.8414 ± 0.0318 | 0.8314 ± 0.0520 |
| | Deep → Normal | 0.8066 ± 0.0338 | 0.8062 ± 0.0132 | 0.8911 ± 0.0130 | 0.8775 ± 0.0089 | 0.8000 ± 0.0984 | 0.8275 ± 0.0788 | 0.8088 ± 0.0741 | 0.7924 ± 0.0682 |
| | Deep ← Normal | 0.8066 ± 0.0338 | 0.8062 ± 0.0132 | 0.8156 ± 0.0388 | 0.8345 ± 0.0277 | 0.8088 ± 0.0741 | 0.7924 ± 0.0682 | 0.8000 ± 0.0984 | 0.8275 ± 0.0788 |

Values are presented as mean ± standard deviation. Accuracy is shared by both directions of each class pair.

ROC curve analysis with multiple classification tasks was performed.

APSDs, anteroposterior skeletal discrepancies; VSDs, vertical skeletal discrepancies; VDDs, vertical dental discrepancies; ROC, receiver operating characteristic; AUC, area under the curve; SD, standard deviation; Hyper, hyperdivergent; Hypo, hypodivergent; Normo, normodivergent; Open, open bite; Deep, deep bite; Normal, normal overbite.
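For clarity, a pairwise (directional) AUC as in Table 5 can be computed by conditioning on the two classes of a pair and scoring one of them, in the spirit of the multiple-class ROC methods cited in the Methods;22,23 whether this exact conditioning matches the authors' implementation is an assumption. A sketch with toy data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def pairwise_auc(y_true, probs, i, j):
    """AUC of the class-i score over samples whose ground truth is i or j."""
    mask = (y_true == i) | (y_true == j)   # keep only the two classes of the pair
    return roc_auc_score((y_true[mask] == i).astype(int), probs[mask, i])

rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)
probs = rng.dirichlet(np.ones(3), size=200)
print(pairwise_auc(y_true, probs, 1, 2))   # e.g., Class II -> Class III
```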



t-SNE of APSDs, VSDs, and VDDs per dataset (Figure 5)

The GT in the training set, internal test set, and external test set showed that dots with different colors were mixed irregularly in the classification cutoff areas (dotted circle in Figure 5, GT) between the normal group (Class I in APSDs, normodivergent pattern in VSDs, and normal overbite in VDDs) and the other two groups (Class II and III for APSDs, hyperdivergent and hypodivergent patterns for VSDs, and open bite and deep bite for VDDs).

Figure 5. The results of t-stochastic neighbor embedding in anteroposterior skeletal discrepancies (APSDs), vertical skeletal discrepancies (VSDs), and vertical dental discrepancies (VDDs) per dataset. The labels of ground truth (GT) and prediction (PD) were set to check their distribution. Dotted circles indicate areas with irregular mixing. Dotted lines indicate cutoff lines.

However, in the AI PD, the areas of irregular mixing had almost disappeared, such that a cutoff line could be indicated between the normal group and the other two groups in the training set, internal test set, and external test set (Figure 5, PD). This indicates that our model succeeded in creating good separation among the three classification groups in each diagnosis, resulting in consistent classification within each group.

Grad-CAM for each diagnosis (Figure 6)

Heat maps show differences in the location and size of the focus areas between three classification groups in each diagnosis. These indicate that our model can effectively use the information in the lateral cephalogram images.

Figure 6. Gradient-weighted class activation mapping plots for anteroposterior skeletal discrepancies (APSDs), vertical skeletal discrepancies (VSDs), and vertical dental discrepancies (VDDs).

DISCUSSION

The present study has some meaningful outcomes: (1) despite the different qualities of lateral cephalogram images produced by the diverse cephalometric radiograph systems of the 10 hospitals nationwide (Table 1), clinically acceptable diagnostic accuracy was obtained for APSDs, VSDs, and VDDs; and (2) since a proper diagnosis of APSDs, VSDs, and VDDs could be made with input of lateral cephalograms only, our model shows the potential of a general-purpose one-step orthodontic diagnosis tool.

Clinical meaning of the comparison results between internal and external test sets in binary and multiple ROC analysis

Since the differences in AUC values between the internal and external test sets for APSDs, VSDs, and VDDs were almost negligible in both binary and multiple ROC analyses (Tables 4 and 5), our model can be regarded as well validated on the external test set.

Comparison of accuracy with a previous study using binary ROC analysis results

Compared to Model I of Yu et al.,8 our model showed slightly lower total accuracy (difference < 0.011) and slightly higher total AUC (difference ≤ 0.021) (Table 6). Although our dataset had some disadvantages compared to that of Yu et al.8 (n = 5,890 lateral cephalogram images, with an even distribution after under-sampling), including a smaller number of images and class imbalance, our model exhibited nearly the same performance as their Model I. To overcome this disadvantageous environment, we carefully constructed the architecture of our model using GN, ArcFace, and a softmax layer (Figure 2).

Table 6. Comparison of the binary ROC analysis results between the multiple models of a previous study and the single model of this study

APSDs

| Model | Sensitivity (Yu et al.8) | Sensitivity (this study) | Specificity (Yu et al.8) | Specificity (this study) | Accuracy (Yu et al.8) | Accuracy (this study) | AUC (Yu et al.8) | AUC (this study) |
|---|---|---|---|---|---|---|---|---|
| Model I (no exclusion of data) | 0.8575 | 0.8414 | 0.9288 | 0.9206 | 0.9050 | 0.8944 | 0.938 | 0.9517 |
| Model II (exclusion of data within an interval of 0.2 SD) | 0.9079 | NA | 0.9539 | NA | 0.9386 | NA | 0.970 | NA |
| Model III (exclusion of data within an interval of 0.3 SD) | 0.9355 | NA | 0.9677 | NA | 0.9570 | NA | 0.978 | NA |

VSDs

| Model | Sensitivity (Yu et al.8) | Sensitivity (this study) | Specificity (Yu et al.8) | Specificity (this study) | Accuracy (Yu et al.8) | Accuracy (this study) | AUC (Yu et al.8) | AUC (this study) |
|---|---|---|---|---|---|---|---|---|
| Model I (no exclusion of data) | 0.8427 | 0.8461 | 0.9213 | 0.9146 | 0.8951 | 0.8910 | 0.937 | 0.9580 |
| Model II (exclusion of data within an interval of 0.2 SD) | 0.9222 | NA | 0.9611 | NA | 0.9481 | NA | 0.985 | NA |
| Model III (exclusion of data within an interval of 0.3 SD) | 0.9459 | NA | 0.9729 | NA | 0.9640 | NA | 0.984 | NA |

ROC, receiver operating characteristic; APSDs, anteroposterior skeletal discrepancies; VSDs, vertical skeletal discrepancies; AUC, area under the curve; SD, standard deviation; NA, not applicable.



Excluding specific data, especially in the test set, may increase the risk of sample selection bias and lead to inaccurate validation of the model. Therefore, in the present study, all datasets with a whole distribution were included to properly validate the model (Figure 3).

Differences in the AUC values of the Class II and Class III groups in APSDs and the hyperdivergent and hypodivergent groups in VSDs in binary and multiple ROC analyses

The hypodivergent group showed a higher AUC score than the hyperdivergent group in the internal test set, while the hyperdivergent group showed a higher AUC than the hypodivergent group in the external test set (0.9824 vs. 0.9730 in the internal test set, respectively; 0.9684 vs. 0.9730 in the external test set, respectively; Table 4).

The Class III group showed higher AUC values than the Class II group in both the internal and external test sets (0.9807 vs. 0.9533 and 0.9930 vs. 0.9601, respectively; Table 4), in accordance with the results of Yu et al.8 The reason might be a difference in the location and size of the focus areas in the diagnosis of APSDs and VSDs (i.e., a relatively larger difference between the Class II and Class III groups than between the hyperdivergent and hypodivergent groups; Figure 6). Further studies are necessary to investigate why the Class III group showed a higher AUC than the Class II group.

Lower AUC values in VDDs compared to APSDs and VSDs in binary ROC analysis

The lower AUCs in VDDs in both the internal and external test sets (Table 4) and the relatively unclear separation of the normal overbite group from the deep bite and open bite groups in the GT of the t-SNE result (Figure 5) might be due to two reasons: (1) the imbalanced data composition of the training set, internal test set, and external test set (normal overbite: 61.3%, 52.9%, and 43.6%; open bite: 26.5%, 29.7%, and 28.2%; deep bite: 12.2%, 17.4%, and 28.2%, respectively; Table 3) or (2) an inherent problem in the superimposed image of the anterior teeth.

Current status of CNN-based orthodontic diagnosis

Most previous CNN studies have focused on detecting cephalometric landmarks and/or calculating cephalometric variables for a two-step automated diagnosis.1-3,8-11 The study designs, methods, and results of previous CNN studies are summarized in Table 7. In the present study, we proposed a one-step orthodontic diagnosis model, which needs only the input of lateral cephalograms. The performance of the AI model used in this study was comparable to the human gold standard (Tables 4 and 5). Automated AI-assisted procedures might save clinicians valuable time and labor in classifying skeletodental characteristics in large samples. However, the final decision still rests with a human expert, especially in borderline cases.

Table 7. Summary of the study design, methods, and results of the orthodontic diagnosis in previous CNN studies and this study

Author (year)SamplesModel and its applicationData setResults
Arık et al.
(2017)1

400 publicly available cephalograms

19 landmarks

8 cephalometric parameters

2 human examiners

Deep learning with CNN and shape-based model

Landmark detection

Cephalometric analysis

Training set: 150

Test set: 250

High anatomical landmark detection accuracy (∼1% to 2% higher success detection rate for a 2-mm range compared with the top benchmarks in the literature)

High anatomical type classification accuracy (~76% average classification accuracy for test set)

Park et al.
(2019)9

1,028 lateral cephalograms

80 landmarks

1 human examiner

Deep learning with YOLOv3 and SSD

Landmark detection

Training set: 1,028

Test set: 283

The YOLOv3 algorithm outperformed SSD in accuracy for 38 of 80 landmarks

The other 42 of 80 landmarks did not show a statistically significant difference between YOLOv3 and SSD

Error plots of YOLOv3 showed not only a smaller error range but also a more isotropic tendency

The mean computational time spent per image was 0.05 seconds and 2.89 seconds for YOLOv3 and SSD, respectively

YOLOv3 showed approximately 5% higher accuracy compared with the top benchmarks in the literature

Nishimoto
et al. (2019)3

219 lateral cephalograms from the internet

10 skeletal landmarks

12 cephalometric parameters

Human examiners – not mentioned

Personal desktop computer

CNN

Landmark detection

Cephalometric analysis

Training set: 153 (expanded 51 folds)

Test set: 66

Average and median prediction errors were 17.02 and 16.22 pixels

No difference in angles and lengths between CNN-predicted and manually plotted points

Despite the variety of image quality, using cephalogram images on the internet is a feasible approach for landmark prediction

Hwang et al.
(2020)10

1,028 lateral cephalograms

80 landmarks

2 human examiners

Deep learning with YOLOv3

Landmark detection

Training set: 1,028

Test set: 283

Upon repeated trials, AI always detected identical positions on each landmark

Human intra-examiner variability of repeated manual detections demonstrated a detection error of 0.97 ± 1.03 mm

The mean detection error between AI and human: 1.46 ± 2.97 mm

The mean difference between human examiners: 1.50 ± 1.48 mm

Comparisons in the detection errors between AI and human examiners: less than 0.9 mm, which did not seem to be clinically significant

Kunz et al.
(2020)11

1,792 cephalograms

18 landmarks

12 orthodontic parameters

12 human examiners

CNN deep learning algorithm

Landmark detection

Cephalometric analysis

Humans' gold standard: median values of the 12 examiners

Training set: 1,731

Validation set: 61

Test set: 50

No clinically significant differences between humans' gold standard and the AI's predictions

Yu et al.
(2020)8

5,890 lateral cephalograms and demographic data from one institute

4 cephalometric parameters

2 human examiners

One-step diagnostic system for skeletal classification

Multimodal CNN model

<Model I>
Sagittal

Training set: n = 1,644

Validation set: n = 351

Test set: n = 351

Vertical

Training set: n = 1,912

Validation set: n = 375

Test set: n = 375

Vertical and sagittal skeletal diagnosis: > 90% sensitivity, specificity, and accuracy

Vertical classification: highest accuracy at 96.40 (95% CI, 93.06 to 98.39; model III)

Binary ROC analysis: excellent performance (mean area under the curve > 95%)

Heat maps of cephalograms: visually representing the region of the cephalogram

Kim et al.
(2020)2

2,075 lateral cephalograms from two institutes

400 open dataset

23 landmarks

8 cephalometric parameters

2 human examiners

Stacked hourglass deep learning

Two-stage automated algorithm

Web-based application

Landmark detection

Cephalometric analysis

Evaluation group 1:

Training set: n = 1,675

Validation set: n = 200

Test set: n = 200

Evaluation group 2:

Training set: n = 1,675

Validation set: n = 175

Test set: n = 225

Evaluation group 3:

ISBI 2015 test set: n = 400

Landmark detection error: 1.37 ± 1.79 mm

Successful classification rate: 88.43%

This study
(2020)

2,174 lateral cephalograms from ten institutes

4 cephalometric parameters

1 human examiner

One-step diagnostic system for skeletal and dental discrepancy

CNN including Densenet-169, Arcface, Softmax

External validation

Training set: n = 1,522 from 2 institutes

Internal test set: n = 471 from 2 institutes

External test set: n = 181 from the other 8 institutes

Binary ROC analysis: accuracy and area under the curve were high in both the internal and external test sets (ranges: 0.8248–0.8944 and 0.8979–0.9580 in the internal test set; 0.8365–0.8880 and 0.9074–0.9524 in the external test set) in diagnosis of the skeletal and dental discrepancies

Multiple ROC analysis: accuracy and area under the curve were high in both the internal and external test sets (ranges: 0.8066–0.9905 and 0.8156–0.9998 in the internal test set; 0.8054–0.9778 and 0.8222–0.9992 in the external test set) in diagnosis of the skeletal and dental discrepancies

t-SNE analysis succeeded in creating the well-separated boundaries between the three classification groups in each diagnosis

Grad-CAM showed different patterns and sizes of the focus areas according to three classification groups in each diagnosis

CNN, convolutional neural network; YOLO, “you only look once” real-time object detection; SSD, single shot detector; ISBI, International Symposium on Biomedical Imaging; AI, artificial intelligence; CI, confidence interval; ROC, receiver operating characteristic; t-SNE, t-stochastic neighbor embedding; Grad-CAM, gradient-weighted class activation mapping.



Limitations of this study and suggestions for future studies

The present study has some limitations. First, this study had a relative imbalance in the data sets of some centers. Second, more demographic, clinical, and cephalometric parameters should be included in setting the gold standard and training AI models in future studies.

As suggestions for future studies, it is necessary to develop a one-step automated classification algorithm for diagnosis of transverse and asymmetry problems. Prospective studies with larger diagnostic cohort data sets will allow more robust validation of the model.

CONCLUSION

  • The accuracy of our model was well-validated with internal test sets from two hospitals as well as external test sets from eight other hospitals without issues regarding the continuity of the data sets or exaggerated accuracy.

  • Our model shows the possible usefulness of a one-step automated orthodontic diagnosis tool for classifying skeletal and dental discrepancies with input of lateral cephalograms only in an end-to-end manner. However, it still needs technical improvement in terms of classifying VDDs.

ACKNOWLEDGEMENTS

This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C1638). This article was based on the study of Dr. Yim’s PhD dissertation. We thank Professor Won-hee Lim and Dr. Keunoh Lim for their contribution in performing the inter-examiner reliability test.

CONFLICTS OF INTEREST

No potential conflict of interest relevant to this article was reported.

References

  1. Arık SÖ, Ibragimov B, Xing L. Fully automated quantitative cephalometry using convolutional neural networks. J Med Imaging (Bellingham) 2017;4:014501.
    Pubmed KoreaMed CrossRef
  2. Kim H, Shim E, Park J, Kim YJ, Lee U, Kim Y. Web-based fully automated cephalometric analysis by deep learning. Comput Methods Programs Biomed 2020;194:105513.
    Pubmed CrossRef
  3. Nishimoto S, Sotsuka Y, Kawai K, Ishise H, Kakibuchi M. Personal computer-based cephalometric landmark detection with deep learning, using cephalograms on the internet. J Craniofac Surg 2019;30:91-5.
    Pubmed CrossRef
  4. Erkan M, Gurel HG, Nur M, Demirel B. Reliability of four different computerized cephalometric analysis programs. Eur J Orthod 2012;34:318-21.
    Pubmed CrossRef
  5. Wen J, Liu S, Ye X, Xie X, Li J, Li H, et al. Comparative study of cephalometric measurements using 3 imaging modalities. J Am Dent Assoc 2017;148:913-21.
    Pubmed CrossRef
  6. Rudolph DJ, Sinclair PM, Coggins JM. Automatic computerized radiographic identification of cephalometric landmarks. Am J Orthod Dentofacial Orthop 1998;113:173-9.
    Pubmed CrossRef
  7. Mosleh MA, Baba MS, Malek S, Almaktari RA. Ceph-X: development and evaluation of 2D cephalometric system. BMC Bioinformatics 2016;17(Suppl 19):499.
    Pubmed KoreaMed CrossRef
  8. Yu HJ, Cho SR, Kim MJ, Kim WH, Kim JW, Choi J. Automated skeletal classification with lateral cephalometry based on artificial intelligence. J Dent Res 2020;99:249-56.
    Pubmed CrossRef
  9. Park JH, Hwang HW, Moon JH, Yu Y, Kim H, Her SB, et al. Automated identification of cephalometric landmarks: part 1-comparisons between the latest deep-learning methods YOLOV3 and SSD. Angle Orthod 2019;89:903-9.
    Pubmed KoreaMed CrossRef
  10. Hwang HW, Park JH, Moon JH, Yu Y, Kim H, Her SB, et al. Automated identification of cephalometric landmarks: part 2-might it be better than human? Angle Orthod 2020;90:69-76.
    Pubmed KoreaMed CrossRef
  11. Kunz F, Stellzig-Eisenhauer A, Zeman F, Boldt J. Artificial intelligence in orthodontics: evaluation of a fully automated cephalometric analysis using a customized convolutional neural network. J Orofac Orthop 2020;81:52-68.
    Pubmed CrossRef
  12. Korean Association of Orthodontics Malocclusion White Paper Publication Committee. Cephalometric analysis of normal occlusion in Korean adults. Seoul: Korean Association of Orthodontists; 1997.
  13. Bujang MA, Baharum N. Guidelines of the minimum sample size requirements for Cohen's Kappa. Epidemiol Biostat Public Health 2017;14:e12267.
  14. McHugh ML. Interrater reliability: the Kappa statistic. Biochem Med (Zagreb) 2012;22:276-82.
    Pubmed KoreaMed CrossRef
  15. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115:211-52.
    Pubmed CrossRef
  16. Huang G, Liu Z, van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. Paper presented at: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Jul 21-26; Honolulu, USA: Piscataway: IEEE, 2017. p. 2261-9.
    KoreaMed CrossRef
  17. Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, et al. Accurate, large minibatch SGD: training ImageNet in 1 hour [Internet]. arxiv. 2017 Jun 8 [updated 2018 Apr 30; cited 2020 Aug 7]. Available from: https://arxiv.org/abs/1706.02677.
  18. Jia X, Song S, He W, Wang Y, Rong H, Zhou F, et al. Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes [Internet]. arxiv. 2018 Jul 30 [cited 2020 Aug 7]. Available from: https://arxiv.org/abs/1807.11205.
  19. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift [Internet]. arxiv. 2015 Feb 11 [updated 2015 Mar 2; cited 2020 Sep 8]. Available from: https://arxiv.org/abs/1502.03167.
  20. Wu Y, He K. Group normalization. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, eds. ECCV 2018: Computer vision - ECCV 2018. Cham: Springer; 2018. p. 3-19.
    CrossRef
  21. Deng J, Guo J, Xue N, Zafeiriou S. ArcFace: additive angular margin loss for deep face recognition. Paper presented at: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15-20; Long Beach, USA: Piscataway: IEEE, 2019. p. 4690-9.
    Pubmed KoreaMed CrossRef
  22. Dreiseitl S, Ohno-Machado L, Binder M. Comparing three-class diagnostic tests by three-way ROC analysis. Med Decis Making 2000;20:323-31.
    Pubmed CrossRef
  23. Li J, Fine JP. ROC analysis with multiple classes and multiple tests: methodology and its application in microarray studies. Biostatistics 2008;9:566-76.
    Pubmed CrossRef
  24. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579-605.
  25. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Paper presented at: 2017 IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22-29; Venice, Italy: Piscataway: IEEE, 2017. p. 618-26.
    CrossRef

Article

Original Article

Korean J Orthod 2022; 52(1): 3-19

Published online January 25, 2022 https://doi.org/10.4041/kjod.2022.52.1.3

Copyright © The Korean Association of Orthodontists.

Accuracy of one-step automated orthodontic diagnosis model using a convolutional neural network and lateral cephalogram images with different qualities obtained from nationwide multi-hospitals

Sunjin Yima , Sungchul Kimb , Inhwan Kimb, Jae-Woo Parkc, Jin-Hyoung Chod, Mihee Honge, Kyung-Hwa Kangf, Minji Kimg, Su-Jung Kimh, Yoon-Ji Kimi, Young Ho Kimj, Sung-Hoon Limk, Sang Jin Sungi, Namkug Kiml , Seung-Hak Baekm

aDepartment of Orthodontics, School of Dentistry, Seoul National University, Seoul, Korea
bDepartment of Biomedical Engineering, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea
cPrivate Practice, Incheon, Korea
dDepartment of Orthodontics, Chonnam National University School of Dentistry, Gwangju, Korea
eDepartment of Orthodontics, School of Dentistry, Kyungpook National University, Daegu, Korea
fDepartment of Orthodontics, School of Dentistry, Wonkwang University, Iksan, Korea
gDepartment of Orthodontics, College of Medicine, Ewha Womans University, Seoul, Korea
hDepartment of Orthodontics, Kyung Hee University School of Dentistry, Seoul, Korea
iDepartment of Orthodontics, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea
jDepartment of Orthodontics, Institute of Oral Health Science, Ajou University School of Medicine, Suwon, Korea
kDepartment of Orthodontics, College of Dentistry, Chosun University, Gwangju, Korea
lDepartment of Convergence Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Korea
mDepartment of Orthodontics, School of Dentistry, Dental Research Institute, Seoul National University, Seoul, Korea

Correspondence to:Seung-Hak Baek.
Professor, Department of Orthodontics, School of Dentistry, Dental Research Institute, Seoul National University, 101, Daehak-ro, Jongno-gu, Seoul 03080, Korea.
Tel +82-2-2072-3952 e-mail drwhite@unitel.co.kr
Corresponding author: Namkug Kim.
Professor, Department of Convergence Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88 Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Korea.
Tel +82-2-3010-6573 e-mail namkugkim@gmail.com

Sunjin Yim and Sungchul Kim contributed equally to this work (as co-first authors).

Received: March 23, 2021; Revised: June 1, 2021; Accepted: July 2, 2021

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Objective: The purpose of this study was to investigate the accuracy of one-step automated orthodontic diagnosis of skeletodental discrepancies using a convolutional neural network (CNN) and lateral cephalogram images with different qualities from nationwide multi-hospitals. Methods: Among 2,174 lateral cephalograms, 1,993 cephalograms from two hospitals were used for training and internal test sets and 181 cephalograms from eight other hospitals were used for an external test set. They were divided into three classification groups according to anteroposterior skeletal discrepancies (Class I, II, and III), vertical skeletal discrepancies (normodivergent, hypodivergent, and hyperdivergent patterns), and vertical dental discrepancies (normal overbite, deep bite, and open bite) as a gold standard. Pre-trained DenseNet-169 was used as a CNN classifier model. Diagnostic performance was evaluated by receiver operating characteristic (ROC) analysis, t-stochastic neighbor embedding (t-SNE), and gradientweighted class activation mapping (Grad-CAM). Results: In the ROC analysis, the mean area under the curve and the mean accuracy of all classifications were high with both internal and external test sets (all, > 0.89 and > 0.80). In the t-SNE analysis, our model succeeded in creating good separation between three classification groups. Grad-CAM figures showed differences in the location and size of the focus areas between three classification groups in each diagnosis. Conclusions: Since the accuracy of our model was validated with both internal and external test sets, it shows the possible usefulness of a one-step automated orthodontic diagnosis tool using a CNN model. However, it still needs technical improvement in terms of classifying vertical dental discrepancies.

Keywords: One-step automated orthodontic diagnosis, Convolutional neural networks, Lateral cephalogram, Multi-center study

INTRODUCTION

Accurate positioning of cephalometric landmarks is one of the most important steps in successful cephalometric analyses. Since the location and visibility of some anatomic landmarks are highly influenced by superimposition of the anatomical structures in the face between the right and left sides,1,2 it is not easy to identify these anatomic landmarks consistently and accurately.

For the last several decades, clinicians have manually indicated the cephalometric landmarks and measured several angles and distances between these landmarks to assess dentofacial deformities.3 Although this manual cephalometric analysis has been substituted with digital cephalometric analysis,4,5 the process is still laborious, time-consuming, and sometimes inaccurate in detection of cephalometric landmarks.3,6-8

Recently, research on automatic detection of cephalometric landmarks using artificial intelligence (AI) with convolutional neural networks (CNNs) has gained popularity.1-3,9-11 These studies have focused mainly on automatic detection of cephalometric landmarks and reported that most cephalometric landmarks were detected within a 2-mm range of accuracy.1,10 However, these approaches still require further measurements of cephalometric parameters including distance, angle, and ratio. Although Kunz et al.11 developed an AI algorithm to analyze 12 cephalometric parameters, they did not make a one-step automated orthodontic diagnosis tool in practice. Therefore, it is necessary to develop a one-step automated orthodontic diagnosis algorithm based on a CNN to avoid the need of additional measurements of cephalometric parameters.

In terms of a one-step CNN algorithm for classification of skeletal discrepancies, Yu et al.8 reported > 90% accuracy, sensitivity, and specificity for diagnosis of the sagittal and vertical skeletal discrepancies in three models (Models I, II, and III). However, they intentionally excluded some portion of the data adjacent to the classification cutoff with intervals of 0.2 standard deviations (SDs) in Model II and 0.3 SDs in Model III in the test set.8 As a result, Models II and III showed a significant increase in the values for accuracy, sensitivity, and specificity compared to Model I.8

The major limitations in previous studies can be summarized as follows:1-3,8-11 (1) Most studies used lateral cephalograms from only one or two hospitals, not from nationwide several different hospitals which had different machine types, radiation exposure conditions, sensors, and image conditions; (2) No study has simultaneously reported dental and skeletal discrepancies using a one-step automated classification algorithm; and (3) If some portion of the data adjacent to the classification cutoff were excluded in the test set, there would be issues in the continuity of the test set and an exaggerated increase in accuracy. Therefore, the purpose of this study was to investigate the accuracy of a novel one-step automated orthodontic diagnosis model for determining anteroposterior skeletal discrepancies (APSDs: Class I, Class II, and Class III), vertical skeletal discrepancies (VSDs: normodivergent, hyperdivergent, and hypodivergent), and vertical dental discrepancies (VDDs: normal overbite, open bite, and deep bite) using a CNN and lateral cephalogram images with different qualities from nationwide 10 unrelated dental hospitals in Korea.

MATERIALS AND METHODS

Description of the dataset

A total of 2,174 lateral cephalogram images were retrospectively obtained from the Departments of Orthodontics of 10 hospitals nationwide in Korea: Seoul National University Dental Hospital (SNUDH), Kooalldam Dental Hospital (KADH), Ajou University Dental Hospital (AJUDH), Asan Medical Center (AMC), Chonnam National University Dental Hospital (CNUDH), Chosun University Dental Hospital (CSUDH), Ewha University Medical Center (EUMC), Kyung Hee University Dental Hospital (KHUDH), Kyungpook National University Dental Hospital (KNUDH), and Wonkwang University Dental Hospital (WKUDH). The inclusion criterion was Korean adult patients who underwent orthodontic treatment with or without orthognathic surgery between 2013 and 2020. The exclusion criteria were (1) patients in childhood or adolescence and (2) patients with mixed dentition. All datasets were strictly anonymized before use. The study protocol was reviewed and approved by the Institutional Review Boards of SNUDH (ERI20022), the Korean National Institute for Bioethics Policy for KADH (P01-202010-21-020), Ajou University Hospital Human Research Protection Center (AJIRB-MED-MDB-19-039), AMC (2019-0927), CNUDH (CNUDH-2019-004), CSUDH (CUDHIRB 1901 005), EUMC (EUMC 2019-04-017-003), KHUDH (D19-007-003), KNUDH (KNUDH-2019-03-02-00), and WKUDH (WKDIRB201903-01).

Of the 2,174 lateral cephalogram images, 1,993 from two hospitals were used for the training set (n = 1,522) and internal test set (n = 471), and 181 from the eight other hospitals were used as the external test set to validate our model (Figure 1). Table 1 summarizes the product, radiation exposure conditions, sensor, and image conditions of each hospital's cephalometric radiograph system, which varied widely across hospitals.

Table 1 . Information on the product, radiation exposure condition, sensor, and image condition of the cephalometric radiograph system in 10 multi-centers.

Hospital | Company | Model | kVp | mA | sec | Image sensor | Sensor size | Image size (pixel × pixel) | Actual resolution (mm/pixel) | Images used (n)
SNUDH | Asahi | CX-90SP-II | 76 | 80 | 0.32 | Cassette (CR system) | 10 × 12 inch | 2,000 × 2,510 / 2,010 × 1,670 | 0.150/0.100 | 1,129
KADH | Vatech | Uni3D NC | 85 | 10 | 0.9 | CCD sensor | 30 × 25 cm | 2,360 × 1,880 | 0.110 | 864
AJUDH | Planmeca | Proline XC | 68 | 7 | 2.3 | CCD sensor | 10.6 × 8.85 inch | 1,039 × 1,200 | 0.250 | 22
AMC | Carestream | CS9300 | 80 | 12 | 0.63 | CCD sensor | 30 × 30 cm | 2,045 × 2,272 / 1,012 × 2,020 | 0.132/0.145 | 21
CNUDH | Instrumentarium | OrthoCeph OC 100 | 85 | 12 | 1.6 | Cassette (CR system) | 10 × 12 inch | 2,500 × 2,048 | 0.115 | 20
CSUDH | Planmeca | Proline XC | 80 | 12 | 1.8 | Cassette (CR system) | 8 × 10 inch | 2,392 × 1,792 / various | 0.100 | 30
EUMC | Asahi | Ortho stage (Auto III N CM) | 75 | 15 | 1 | Cassette (CR system) | 8 × 12 inch | 2,510 × 2,000 | 0.100 | 26
KHUDH | Asahi | CX-90SP | 70 | 15 | 0.3–0.35 | Cassette (CR system) | 10 × 12 inch | 2,500 × 2,048 | 0.110 | 23
KNUDH | Asahi | CX-90SP-II | 70 | 80 | 0.32 | Cassette (CR system) | 11 × 14 inch | 1,950 × 2,460 / 2,108 × 1,752 | 0.100 | 19
WKUDH | Planmeca | Promax | Female 72, Male 74 | 10 | 1.87 | CCD sensor | 27 × 30 cm | 1,818 × 2,272 | 0.132 | 20

SNUDH, Seoul National University Dental Hospital; KADH, Kooalldam Dental Hospital; AJUDH, Ajou University Dental Hospital; AMC, Asan Medical Center; CNUDH, Chonnam National University Dental Hospital; CSUDH, Chosun University Dental Hospital; EUMC, Ewha University Medical Center; KHUDH, Kyung Hee University Dental Hospital; KNUDH, Kyungpook National University Dental Hospital; WKUDH, Wonkwang University Dental Hospital; CR, computed radiography; CCD, charge-coupled device.


Figure 1. Flowchart of dataset and experimental setup.
CNN, convolutional neural network.

Setting a gold standard for the diagnosis of APSDs, VSDs, and VDDs

After detection of the cephalometric landmarks including A point, nasion, B point, orbitale, porion, gonion, menton, sella, maxilla 1 crown, maxilla 6 distal, mandible 1 crown, and mandible 6 distal by a single operator (SY), the cephalometric parameters including A point-Nasion-B point (ANB) angle, Frankfort mandibular plane angle (FMA), Jarabak’s posterior/anterior facial height ratio (FHR), and overbite were calculated using V-Ceph 8.0 (Osstem, Seoul, Korea) to set a gold standard.

All cephalometric images were classified into the three classification groups by a single operator (SY) as follows. For classification of APSDs, we defined an ANB value between –1 SD and 1 SD from the ethnic norm of each sex12 as skeletal Class I; > 1 SD as skeletal Class II; and < –1 SD as skeletal Class III. For classification of VSDs, we combined the FMA and FHR values relative to the ethnic norm of each sex.12 First, we normalized the FMA and FHR values using the SD values. Second, the normalized FHR values were sign-flipped because FHR varies in the opposite direction to FMA. Third, the normalized FMA and flipped FHR values were added, because each is regarded as having equal weight. Fourth, the mean and SD of the combined score were obtained for classification into three groups: we defined values between –1 SD and 1 SD from the mean as normodivergent pattern, > 1 SD as hyperdivergent pattern, and < –1 SD as hypodivergent pattern. For classification of VDDs, we defined an overbite value between 0 mm and 3 mm as a normal overbite, > 3 mm as a deep bite, and < 0 mm as an open bite (Tables 2 and 3).
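
For readers who want the labeling rules in executable form, the following Python sketch (our illustration, not the authors' code; all function names are hypothetical) applies the Table 2 norms and the cutoffs described above:

```python
import numpy as np

# Sex-specific ethnic norms (mean, SD) from Table 2
NORMS = {
    "female": {"ANB": (2.4, 1.8), "FMA": (24.2, 4.6), "FHR": (65.0, 9.0)},
    "male": {"ANB": (1.78, 2.02), "FMA": (26.78, 1.79), "FHR": (66.37, 5.07)},
}

def classify_apsd(anb, sex):
    """Skeletal Class I/II/III from the ANB z-score against the sex norm."""
    mean, sd = NORMS[sex]["ANB"]
    z = (anb - mean) / sd
    return "Class II" if z > 1 else ("Class III" if z < -1 else "Class I")

def combined_vsd_score(fma, fhr, sex):
    """Normalized FMA plus sign-flipped normalized FHR, with equal weights."""
    fma_mean, fma_sd = NORMS[sex]["FMA"]
    fhr_mean, fhr_sd = NORMS[sex]["FHR"]
    return (fma - fma_mean) / fma_sd - (fhr - fhr_mean) / fhr_sd

def label_vsd(combined_scores):
    """Classify against the mean and SD of the combined score distribution
    (assumed here to be computed over the whole dataset)."""
    z = (combined_scores - combined_scores.mean()) / combined_scores.std()
    return np.select([z > 1, z < -1], ["hyperdivergent", "hypodivergent"],
                     default="normodivergent")

def classify_vdd(overbite_mm):
    """Overbite cutoffs are absolute: 0 mm and 3 mm."""
    if overbite_mm > 3:
        return "deep bite"
    if overbite_mm < 0:
        return "open bite"
    return "normal overbite"
```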

Table 2 . Classification criteria for the anteroposterior skeletal discrepancies (APSDs), vertical skeletal discrepancies (VSDs), and vertical dental discrepancies (VDDs) for orthodontic analysis.

Sex | ANB (APSDs) mean | SD | FMA (VSDs) mean | SD | FHR (VSDs) mean | SD | Overbite (VDDs) mean | SD
Female | 2.4 | 1.8 | 24.2 | 4.6 | 65 | 9 | 1.5 | 1.5
Male | 1.78 | 2.02 | 26.78 | 1.79 | 66.37 | 5.07 | – | –

ANB, angle among A point, nasion, and B point; FMA, Frankfort mandibular plane angle; FHR, Jarabak's posterior/anterior facial height ratio; SD, standard deviation.


Table 3 . Distribution of classification groups in each diagnosis for human gold standard in the training set, internal test set, and external test set.

Columns: training set (SNUDH, KADH, sum), internal test set (SNUDH, KADH, sum), external test set (AJUDH, AMC, EUMC, CNUDH, CSUDH, KHUDH, KNUDH, WKUDH, sum), internal + external test sets, and total.

APSDs
Class I | 238 | 323 | 561 (36.9) | 122 | 40 | 162 (34.4) | 8 | 6 | 5 | 4 | 7 | 11 | 7 | 2 | 50 (27.6) | 212 (32.5) | 773 (35.6)
Class II | 183 | 263 | 446 (29.3) | 112 | 44 | 156 (33.1) | 8 | 8 | 11 | 8 | 13 | 4 | 4 | 6 | 62 (34.3) | 218 (33.4) | 664 (30.5)
Class III | 359 | 156 | 515 (33.8) | 115 | 38 | 153 (32.5) | 6 | 7 | 10 | 8 | 10 | 8 | 8 | 12 | 69 (38.1) | 222 (34.0) | 737 (33.9)
Sum | 780 | 742 | 1,522 | 349 | 122 | 471 | 22 | 21 | 26 | 20 | 30 | 23 | 19 | 20 | 181 | 652 | 2,174

VSDs
Normodivergent | 331 | 389 | 720 (47.3) | 146 | 50 | 196 (41.6) | 10 | 6 | 7 | 9 | 17 | 10 | 7 | 7 | 73 (40.3) | 270 (41.4) | 989 (45.5)
Hyperdivergent | 314 | 241 | 555 (36.5) | 135 | 40 | 175 (37.2) | 5 | 9 | 12 | 6 | 3 | 7 | 8 | 6 | 56 (30.9) | 231 (35.4) | 786 (36.2)
Hypodivergent | 135 | 112 | 247 (16.2) | 68 | 32 | 100 (21.2) | 7 | 6 | 7 | 5 | 10 | 6 | 4 | 7 | 52 (28.7) | 151 (23.2) | 399 (18.4)
Sum | 780 | 742 | 1,522 | 349 | 122 | 471 | 22 | 21 | 26 | 20 | 30 | 23 | 19 | 20 | 181 | 652 | 2,174

VDDs
Normal overbite | 440 | 493 | 933 (61.3) | 196 | 53 | 249 (52.9) | 11 | 11 | 10 | 8 | 9 | 10 | 10 | 10 | 79 (43.6) | 328 (50.3) | 1,261 (58.0)
Open bite | 209 | 194 | 403 (26.5) | 99 | 41 | 140 (29.7) | 4 | 7 | 9 | 5 | 9 | 8 | 4 | 5 | 51 (28.2) | 191 (29.3) | 594 (27.3)
Deep bite | 131 | 55 | 186 (12.2) | 54 | 28 | 82 (17.4) | 7 | 3 | 7 | 7 | 12 | 5 | 5 | 5 | 51 (28.2) | 133 (20.4) | 319 (14.7)
Sum | 780 | 742 | 1,522 | 349 | 122 | 471 | 22 | 21 | 26 | 20 | 30 | 23 | 19 | 20 | 181 | 652 | 2,174

Values are presented as number only or number (%).

APSDs, anteroposterior skeletal discrepancies; VSDs, vertical skeletal discrepancies; VDDs, vertical dental discrepancies; SNUDH, Seoul National University Dental Hospital; KADH, Kooalldam Dental Hospital; AJUDH, Ajou University Dental Hospital; AMC, Asan Medical Center; EUMC, Ewha University Medical Center; CNUDH, Chonnam National University Dental Hospital; CSUDH, Chosun University Dental Hospital; KHUDH, Kyung Hee University Dental Hospital; KNUDH, Kyungpook National University Dental Hospital; WKUDH, Wonkwang University Dental Hospital.



To assess intra-examiner reliability, all classifications of APSDs, VSDs, and VDDs were repeated after one month by the same investigator (SY). Since the minimum sample size for a 3 × 3 Cohen's kappa agreement test was suggested to be 49,13 100 images were randomly selected and re-classified for APSDs, VSDs, and VDDs. Cohen's kappa agreement test showed "almost perfect" agreement (kappa values: 0.939 for APSDs, 0.984 for VSDs, and 0.907 for VDDs).14 Therefore, the first classification results were used for further statistical analysis.

To evaluate inter-examiner reliability, the same 100 images used to assess intra-examiner reliability were classified for APSDs, VSDs, and VDDs by another investigator (KL). Cohen's kappa agreement test showed "almost perfect" agreement for APSDs and VSDs (kappa values: 0.985 and 0.919, respectively) and "substantial" agreement for VDDs (0.601).14
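
Both reliability checks reduce to a single scikit-learn call; a minimal sketch with hypothetical label lists standing in for the two classification rounds:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical example: two labeling rounds for the same images, one month apart
first_pass  = ["Class I", "Class II", "Class III", "Class I", "Class II"]
second_pass = ["Class I", "Class II", "Class III", "Class II", "Class II"]

kappa = cohen_kappa_score(first_pass, second_pass)
print(f"Cohen's kappa = {kappa:.3f}")  # >= 0.81 counts as "almost perfect" agreement
```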

Preprocessing of the data

Augmentation techniques including cropping, padding, spatial transformations, and appearance transformations were applied in real time during training.
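
The text does not name an augmentation library or parameter values; the following torchvision-based sketch is one way such a real-time ("on-the-fly") pipeline could look, with all parameter values being illustrative assumptions:

```python
import torchvision.transforms as T

# Illustrative on-the-fly augmentation pipeline; parameter values are assumptions
train_transform = T.Compose([
    T.RandomResizedCrop(512, scale=(0.9, 1.0)),          # cropping
    T.Pad(16, padding_mode="reflect"),                   # padding
    T.RandomAffine(degrees=5, translate=(0.02, 0.02)),   # spatial transformation
    T.ColorJitter(brightness=0.2, contrast=0.2),         # appearance transformation
    T.ToTensor(),
])
# Applied inside the Dataset/DataLoader, so each epoch sees a new variant of each image
```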

Model architecture (Figure 2)

As the backbone of the model, DenseNet-169 pre-trained on the ImageNet dataset was used with group normalization (GN).15-20 After the global average pooling (GAP) layer of the backbone, an ArcFace head was added in parallel with the softmax layer to mitigate the imbalanced data sets and obtain discriminative features during training.21

Figure 2. Diagrams of the model architecture. A, During training, an ArcFace head was added to the last convolutional layer of the backbone in parallel with the softmax layer. B, After training, the ArcFace head was removed and inference was implemented using only the softmax layer.

After training, the ArcFace head was removed, and inference was implemented using only the softmax layer, as in a basic CNN classifier. Because sex was included as a classification criterion for APSDs and VSDs, a one-hot vector encoding sex was concatenated with the feature vector after GAP.
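
Figure 2 can be made concrete with a short PyTorch sketch. This is our reconstruction, not the released code: the ArcFace scale s and margin m are assumed values, and the GN substitution inside the torchvision DenseNet-169 backbone is noted in a comment rather than implemented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import densenet169

class ArcFaceHead(nn.Module):
    """Additive angular margin head (Deng et al.21); returns margin-adjusted logits."""
    def __init__(self, in_features, n_classes, s=30.0, m=0.5):  # s, m are assumptions
        super().__init__()
        self.weight = nn.Parameter(torch.empty(n_classes, in_features))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, x, labels):
        # Cosine similarity between normalized features and class weights
        cos = F.linear(F.normalize(x), F.normalize(self.weight)).clamp(-1 + 1e-7, 1 - 1e-7)
        theta = torch.acos(cos)
        target = F.one_hot(labels, cos.size(1)).float()
        return self.s * torch.cos(theta + self.m * target)  # margin on the target class only

class OneStepClassifier(nn.Module):
    def __init__(self, n_classes=3):
        super().__init__()
        # torchvision's DenseNet-169 ships with batch norm; the paper reports
        # group normalization (GN), a module swap omitted here for brevity
        self.backbone = densenet169(weights="IMAGENET1K_V1").features
        feat_dim = 1664 + 2  # DenseNet-169 GAP features + 2-dim sex one-hot
        self.softmax_head = nn.Linear(feat_dim, n_classes)
        self.arcface_head = ArcFaceHead(feat_dim, n_classes)

    def forward(self, image, sex_onehot, labels=None):
        x = F.adaptive_avg_pool2d(F.relu(self.backbone(image)), 1).flatten(1)
        x = torch.cat([x, sex_onehot], dim=1)  # sex enters the APSD/VSD criteria
        logits = self.softmax_head(x)
        if labels is not None:                 # training: ArcFace branch in parallel
            return logits, self.arcface_head(x, labels)
        return logits                          # inference: softmax branch only
```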

Model training (Figures 1 and 2)

Training for APSDs, VSDs, and VDDs was supervised only by the gold-standard class labels determined by a single operator (SY); the measured cephalometric parameters (ANB, FMA, FHR, and overbite) were not provided to the model.
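
As an illustration of how the two heads in Figure 2A could be trained jointly, a hypothetical PyTorch step against the OneStepClassifier sketch above; the relative weighting of the ArcFace and softmax losses is not reported, so equal weights are assumed:

```python
import torch.nn.functional as F

def train_step(model, optimizer, image, sex_onehot, labels):
    # Both heads see the same GAP-plus-sex feature vector; only the class
    # labels (the human gold standard) supervise training
    logits, arc_logits = model(image, sex_onehot, labels)
    loss = F.cross_entropy(logits, labels) + F.cross_entropy(arc_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```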

Model testing

After training was completed, one-step classification was performed with both the internal and external test sets to validate the performance of the constructed model. It took 55 seconds (sec) to diagnose the internal test set (0.1168 sec per lateral cephalogram) and 22 sec to diagnose the external test set (0.1215 sec per lateral cephalogram). The results for the internal and external test sets were compared with gold standard diagnostic data.

Analysis method

Receiver operating characteristic (ROC) analysis

The performance of our model was evaluated in terms of accuracy, area under the curve (AUC), sensitivity, and specificity using both binary and multiple-class ROC analyses.8,22,23
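
As one concrete way to compute the per-class (one-vs-rest) binary figures, a scikit-learn sketch; the function name is ours, and y_true and y_prob are hypothetical arrays of integer labels and softmax outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def binary_ovr_metrics(y_true, y_prob, class_names):
    """Per-class one-vs-rest accuracy and AUC from softmax outputs."""
    y_pred = y_prob.argmax(axis=1)
    for k, name in enumerate(class_names):
        is_k = (y_true == k).astype(int)
        acc = ((y_pred == k).astype(int) == is_k).mean()  # binary accuracy for class k
        auc = roc_auc_score(is_k, y_prob[:, k])
        print(f"{name}: accuracy = {acc:.4f}, AUC = {auc:.4f}")

# e.g., binary_ovr_metrics(labels, softmax_outputs, ["Class I", "Class II", "Class III"])
```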

t-stochastic neighbor embedding (t-SNE)

Since this technique can visualize high-dimensional data by giving each datapoint a location in a two- or three-dimensional map, it was used to check the feature distribution of the training set, internal test set, and external test set after the GAP layer.24 In each diagnosis, ground truth (GT) and prediction (PD) labels were plotted to check the distribution of each data set.
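
A minimal sketch of this check, assuming scikit-learn and matplotlib; feats and labels are hypothetical arrays holding the GAP feature vectors and class indices:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(feats, labels, class_names):
    """feats: (n, d) GAP feature vectors; labels: (n,) integer class indices."""
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(feats)
    for k, name in enumerate(class_names):
        pts = emb[labels == k]
        plt.scatter(pts[:, 0], pts[:, 1], s=5, label=name)
    plt.legend()
    plt.show()
```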

Gradient-weighted class activation mapping (Grad-CAM)25

As this technique can produce visual explanations for AI models, it can show the regions on which the AI focuses when making a PD. It was used to confirm the regions on which our model mainly focused in the diagnosis of APSDs, VSDs, and VDDs.
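
A compact sketch of the cited technique,25 written as our own implementation against the hypothetical OneStepClassifier above (not the authors' code):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, sex_onehot, target_class):
    """Class-discriminative heat map from the last convolutional feature map.
    image: (1, 3, H, W) tensor; returns an (h, w) map normalized to [0, 1]."""
    acts = {}
    handle = model.backbone.register_forward_hook(
        lambda module, inp, out: acts.update(act=out))
    logits = model(image, sex_onehot)          # inference path (softmax head only)
    handle.remove()
    act = acts["act"]                          # (1, C, h, w) feature maps
    grad = torch.autograd.grad(logits[0, target_class], act)[0]
    weights = grad.mean(dim=(2, 3), keepdim=True)   # GAP over the gradients
    cam = F.relu((weights * act).sum(dim=1))        # weighted activation sum
    return (cam / cam.max()).squeeze(0)
```

In practice, the returned map is upsampled to the cephalogram's resolution and overlaid as a heat map, as in Figure 6.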

RESULTS

Metrology distribution of the APSDs, VSDs, and VDDs per dataset (Figure 3)

The continuity of the dataset between the normal groups (Class I in APSDs, normodivergent pattern in VSDs, and normal overbite in VDDs) and the other two groups (Class II and III in APSDs, hyperdivergent and hypodivergent patterns in VSDs, and open bite and deep bite in VDDs) was confirmed.

Figure 3. Metrology distribution of the anteroposterior skeletal discrepancies (APSDs: Class I, Class II, and Class III), vertical skeletal discrepancies (VSDs: normodivergent pattern, hyperdivergent pattern, and hypodivergent pattern), and vertical dental discrepancies (VDDs: normal overbite, open bite, and deep bite) per dataset. Red lines in APSDs and VSDs indicate one standard deviation of the normal classification. Red lines in VDDs indicate the boundary values, which were 0 mm and 3 mm.
ANB, angle among A point, nasion, and B point; FMA, Frankfort mandibular plane angle; FHR, Jarabak’s posterior/anterior facial height ratio; norm, normalized; Man, mandible 1 crown; Max, maxilla 1 crown; dist, distance.

Accuracy and AUC of the internal test set in binary ROC analysis (Table 4 and Figure 4)

In APSDs, Class III had the highest accuracy and AUC (0.9372 and 0.9807, respectively), followed by Class II (0.8972 and 0.9533, respectively) and Class I (0.8488 and 0.9212, respectively). In VSDs, hypodivergent pattern had the highest accuracy and AUC (0.9346 and 0.9824, respectively), followed by hyperdivergent pattern (0.9019 and 0.9730, respectively) and normodivergent pattern (0.8365 and 0.9186, respectively). In VDDs, open bite had the highest accuracy and AUC (0.8730 and 0.9475, respectively), followed by deep bite (0.8637 and 0.9286, respectively) and normal overbite (0.7376 and 0.8177, respectively).

Table 4 . Performance of our model for the diagnosis of the APSDs, VSDs, and VDDs in the internal test set and external test set using the binary ROC analysis.

Classification | Accuracy (internal / external) | AUC (internal / external) | Sensitivity (internal / external) | Specificity (internal / external)

APSDs
Class I | 0.8488 ± 0.0103 / 0.8320 ± 0.0230 | 0.9212 ± 0.0038 / 0.9042 ± 0.0195 | 0.7938 ± 0.0328 / 0.7840 ± 0.0297 | 0.8764 ± 0.0186 / 0.8504 ± 0.0273
Class II | 0.8972 ± 0.0057 / 0.8796 ± 0.0153 | 0.9533 ± 0.0026 / 0.9601 ± 0.0067 | 0.8192 ± 0.0334 / 0.7226 ± 0.0515 | 0.9359 ± 0.0161 / 0.9613 ± 0.0046
Class III | 0.9372 ± 0.0063 / 0.9525 ± 0.0108 | 0.9807 ± 0.0025 / 0.9930 ± 0.0023 | 0.9111 ± 0.0225 / 0.9652 ± 0.0079 | 0.9497 ± 0.0086 / 0.9446 ± 0.0160
Mean | 0.8944 ± 0.0368 / 0.8880 ± 0.0518 | 0.9517 ± 0.0245 / 0.9524 ± 0.0382 | 0.8414 ± 0.0571 / 0.8239 ± 0.1076 | 0.9206 ± 0.0345 / 0.9188 ± 0.0516

VSDs
Normodivergent | 0.8365 ± 0.0082 / 0.8309 ± 0.0267 | 0.9186 ± 0.0046 / 0.9157 ± 0.0151 | 0.8235 ± 0.0279 / 0.7699 ± 0.0416 | 0.8458 ± 0.0122 / 0.8722 ± 0.0178
Hyperdivergent | 0.9019 ± 0.0035 / 0.9061 ± 0.0203 | 0.9730 ± 0.0047 / 0.9730 ± 0.0047 | 0.8149 ± 0.0273 / 0.9143 ± 0.0293 | 0.9534 ± 0.0190 / 0.9024 ± 0.0360
Hypodivergent | 0.9346 ± 0.0098 / 0.9094 ± 0.0164 | 0.9824 ± 0.0015 / 0.9684 ± 0.0026 | 0.9000 ± 0.0394 / 0.8000 ± 0.0661 | 0.9445 ± 0.0127 / 0.9535 ± 0.0110
Mean | 0.8910 ± 0.0413 / 0.8821 ± 0.0410 | 0.9580 ± 0.0283 / 0.9523 ± 0.0273 | 0.8461 ± 0.0478 / 0.8280 ± 0.0757 | 0.9146 ± 0.0505 / 0.9094 ± 0.0398

VDDs
Normal overbite | 0.7376 ± 0.0291 / 0.7591 ± 0.0230 | 0.8177 ± 0.0166 / 0.8359 ± 0.0152 | 0.6530 ± 0.0956 / 0.6582 ± 0.0664 | 0.8288 ± 0.0441 / 0.8373 ± 0.0557
Open bite | 0.8730 ± 0.0130 / 0.8917 ± 0.0139 | 0.9475 ± 0.0053 / 0.9626 ± 0.0074 | 0.8371 ± 0.0366 / 0.8275 ± 0.0611 | 0.8882 ± 0.0304 / 0.9262 ± 0.0228
Deep bite | 0.8637 ± 0.0270 / 0.8586 ± 0.0127 | 0.9286 ± 0.0099 / 0.9238 ± 0.0055 | 0.8000 ± 0.1100 / 0.8196 ± 0.0836 | 0.8781 ± 0.0530 / 0.8723 ± 0.0457
Mean | 0.8248 ± 0.0654 / 0.8365 ± 0.0584 | 0.8979 ± 0.0582 / 0.9074 ± 0.0538 | 0.7634 ± 0.1111 / 0.7684 ± 0.1006 | 0.8651 ± 0.0468 / 0.8786 ± 0.0535

Values are presented as mean ± SD (internal test set / external test set).

APSDs, anteroposterior skeletal discrepancies; VSDs, vertical skeletal discrepancies; VDDs, vertical dental discrepancies; ROC, receiver operating characteristic; AUC, area under the curve; SD, standard deviation.


Figure 4. The results of the binary receiver operating characteristic curve analysis (A) in the internal test set from two hospitals and (B) in the external test set from other 8 hospitals for diagnosis of anteroposterior skeletal discrepancies (APSDs), vertical skeletal discrepancies (VSDs), and vertical dental discrepancies (VDDs).
AUC, area under the curve.

In APSDs and VSDs, the total accuracy reached nearly 0.9 and the total AUC exceeded 0.95 (0.9517 and 0.9580, respectively). However, VDDs showed a relatively lower total accuracy (0.8248 vs. 0.8944 and 0.8910) and total AUC (0.8979 vs. 0.9517 and 0.9580) than APSDs and VSDs.

Accuracy and AUC of the external test set in binary ROC analysis (Table 4 and Figure 4)

In APSDs, Class III had the highest accuracy and AUC (0.9525 and 0.9930, respectively), followed by Class II (0.8796 and 0.9601, respectively) and Class I (0.8320 and 0.9042, respectively). In VDDs, open bite had the highest accuracy and AUC (0.8917 and 0.9626, respectively), followed by deep bite (0.8586 and 0.9238, respectively) and normal overbite (0.7591 and 0.8359, respectively). However, VSDs showed a different pattern between accuracy and AUC. Although the accuracy was highest for hypodivergent pattern (0.9094), followed by hyperdivergent pattern (0.9061) and normodivergent pattern (0.8309), the AUC was highest for hyperdivergent pattern (0.9730), followed by hypodivergent pattern (0.9684) and normodivergent pattern (0.9157).

In APSDs and VSDs, the total accuracy reached nearly 0.9 and the total AUC exceeded 0.95. However, VDDs showed a relatively lower total accuracy (0.8365 vs. 0.8880 and 0.8821) and total AUC (0.9074 vs. 0.9524 and 0.9523) than APSDs and VSDs.

Comparison of AUC values between internal and external test sets in binary ROC analysis (Table 4)

In APSDs and VDDs, Class III and open bite showed the highest AUC among the classifications in both test sets (Class III, 0.9807 in the internal and 0.9930 in the external test set; open bite, 0.9475 and 0.9626, respectively). However, VSDs showed a different pattern: the internal test set showed the highest AUC for the hypodivergent pattern (0.9824), while the external test set showed the highest AUC for the hyperdivergent pattern (0.9730). Nevertheless, the difference in the AUC values was less than 0.01.

Comparison of AUC values between internal and external test sets in multiple ROC analysis (Table 5)

In terms of pairwise AUCs for APSDs, VSDs, and VDDs, the pairs between the two non-normal groups showed the highest values in both the internal and external test sets and the smallest internal-external differences (Δ = external minus internal) among the pairwise classifications: Class II vs. Class III (internal, 0.9913 [II→III] and 0.9920 [II←III]; external, 0.9992 and 0.9989; Δ, 0.0079 and 0.0069), hyperdivergent vs. hypodivergent patterns (internal, 0.9998 and 0.9998; external, 0.9930 and 0.9977; Δ, –0.0068 and –0.0021), and open bite vs. deep bite (internal, 0.9982 and 0.9951; external, 0.9924 and 0.9956; Δ, –0.0058 and 0.0005).

Table 5 . Performance of our model for the diagnosis of the APSDs, VSDs, and VDDs in the internal test set and external test set using the multiple ROC analysis.

Classification | Accuracy (internal / external) | Pairwise AUC (internal / external) | Pairwise sensitivity (internal / external) | Pairwise specificity (internal / external)

APSDs
Class I → Class II | 0.8503 ± 0.0086 / 0.8054 ± 0.0222 | 0.8943 ± 0.0106 / 0.8222 ± 0.3830 | 0.8802 ± 0.0283 / 0.9080 ± 0.0098 | 0.8192 ± 0.0299 / 0.7226 ± 0.0461
Class I ← Class II | – | 0.9175 ± 0.0039 / 0.9061 ± 0.0136 | 0.8192 ± 0.0299 / 0.7226 ± 0.0461 | 0.8802 ± 0.0283 / 0.9080 ± 0.0098
Class I → Class III | 0.9143 ± 0.0092 / 0.9277 ± 0.0147 | 0.9486 ± 0.0057 / 0.9780 ± 0.0039 | 0.9173 ± 0.0149 / 0.8760 ± 0.0320 | 0.9111 ± 0.0201 / 0.9652 ± 0.0071
Class I ← Class III | – | 0.9698 ± 0.0035 / 0.9856 ± 0.0032 | 0.9111 ± 0.0201 / 0.9652 ± 0.0071 | 0.9173 ± 0.0149 / 0.8760 ± 0.0320
Class II → Class III | 0.9754 ± 0.0033 / 0.9725 ± 0.0142 | 0.9913 ± 0.0014 / 0.9992 ± 0.0009 | 0.9654 ± 0.0077 / 0.9419 ± 0.0299 | 0.9856 ± 0.0026 / 1.0000 ± 0.0000
Class II ← Class III | – | 0.9920 ± 0.0013 / 0.9989 ± 0.0013 | 0.9856 ± 0.0026 / 1.0000 ± 0.0000 | 0.9654 ± 0.0077 / 0.9419 ± 0.0299

VSDs
Hyper → Hypo | 0.9905 ± 0.0037 / 0.9778 ± 0.0126 | 0.9998 ± 0.0002 / 0.9930 ± 0.0019 | 0.9851 ± 0.0058 / 1.0000 ± 0.0000 | 1.0000 ± 0.0000 / 0.9538 ± 0.0261
Hyper ← Hypo | – | 0.9998 ± 0.0001 / 0.9977 ± 0.0003 | 1.0000 ± 0.0000 / 0.9538 ± 0.0261 | 0.9851 ± 0.0058 / 1.0000 ± 0.0000
Hyper → Normo | 0.8755 ± 0.0040 / 0.8791 ± 0.0223 | 0.9593 ± 0.0063 / 0.9587 ± 0.0068 | 0.8149 ± 0.0244 / 0.9143 ± 0.0262 | 0.9296 ± 0.0257 / 0.8521 ± 0.0485
Hyper ← Normo | – | 0.9034 ± 0.0088 / 0.9329 ± 0.0119 | 0.9296 ± 0.0257 / 0.8521 ± 0.0485 | 0.8149 ± 0.0244 / 0.9143 ± 0.0262
Hypo → Normo | 0.8959 ± 0.0139 / 0.8688 ± 0.0212 | 0.9669 ± 0.0024 / 0.9459 ± 0.0042 | 0.9000 ± 0.0352 / 0.8000 ± 0.0591 | 0.8939 ± 0.0231 / 0.9178 ± 0.0173
Hypo ← Normo | – | 0.9451 ± 0.0153 / 0.8972 ± 0.0316 | 0.8939 ± 0.0231 / 0.9178 ± 0.0173 | 0.9000 ± 0.0352 / 0.8000 ± 0.0591

VDDs
Open → Deep | 0.9766 ± 0.0112 / 0.9706 ± 0.0186 | 0.9982 ± 0.0012 / 0.9924 ± 0.0044 | 0.9814 ± 0.0116 / 0.9922 ± 0.0096 | 0.9683 ± 0.0412 / 0.9490 ± 0.0319
Open ← Deep | – | 0.9951 ± 0.0066 / 0.9956 ± 0.0042 | 0.9683 ± 0.0412 / 0.9490 ± 0.0319 | 0.9814 ± 0.0116 / 0.9922 ± 0.0096
Open → Normal | 0.8463 ± 0.0141 / 0.8538 ± 0.0201 | 0.9308 ± 0.0063 / 0.9434 ± 0.0084 | 0.8414 ± 0.0318 / 0.8314 ± 0.0520 | 0.8490 ± 0.0363 / 0.8684 ± 0.0284
Open ← Normal | – | 0.8190 ± 0.0452 / 0.8373 ± 0.0341 | 0.8490 ± 0.0363 / 0.8684 ± 0.0284 | 0.8414 ± 0.0318 / 0.8314 ± 0.0520
Deep → Normal | 0.8066 ± 0.0338 / 0.8062 ± 0.0132 | 0.8911 ± 0.0130 / 0.8775 ± 0.0089 | 0.8000 ± 0.0984 / 0.8275 ± 0.0788 | 0.8088 ± 0.0741 / 0.7924 ± 0.0682
Deep ← Normal | – | 0.8156 ± 0.0388 / 0.8345 ± 0.0277 | 0.8088 ± 0.0741 / 0.7924 ± 0.0682 | 0.8000 ± 0.0984 / 0.8275 ± 0.0788

ROC curve analysis with multiple classification tasks was performed. Values are presented as mean ± SD (internal test set / external test set); accuracy is shared between the two directions (→, ←) of each pairwise comparison.

APSDs, anteroposterior skeletal discrepancies; VSDs, vertical skeletal discrepancies; VDDs, vertical dental discrepancies; ROC, receiver operating characteristic; AUC, area under the curve; SD, standard deviation; Hyper, hyperdivergent; Hypo, hypodivergent; Normo, normodivergent; Open, open bite; Deep, deep bite; Normal, normal overbite.



t-SNE of APSDs, VSDs, and VDDs per dataset (Figure 5)

The GT in the training set, internal test set, and external test set showed that dots with different colors were mixed irregularly in the classification cutoff areas (dotted circle in Figure 5, GT) between the normal group (Class I in APSDs, normodivergent pattern in VSDs, and normal overbite in VDDs) and the other two groups (Class II and III for APSDs, hyperdivergent and hypodivergent patterns for VSDs, and open bite and deep bite for VDDs).

Figure 5. The results of t-stochastic neighbor embedding in anteroposterior skeletal discrepancies (APSDs), vertical skeletal discrepancies (VSDs), and vertical dental discrepancies (VDDs) per dataset. The labels of ground truth (GT) and prediction (PD) were set to check their distribution. Dotted circles indicate areas with irregular mixing. Dotted lines indicate cutoff lines.

However, in the AI PD, the areas of irregular mixing almost disappeared, such that a cutoff line could be drawn between the normal group and the other two groups in the training set, internal test set, and external test set (Figure 5, PD). This indicates that our model succeeded in creating good separation between the three classification groups in each diagnosis, resulting in consistent classification within each group.

Grad-CAM for each diagnosis (Figure 6)

Heat maps show differences in the location and size of the focus areas between three classification groups in each diagnosis. These indicate that our model can effectively use the information in the lateral cephalogram images.

Figure 6. Gradient-weighted class activation mapping plots for anteroposterior skeletal discrepancies (APSDs), vertical skeletal discrepancies (VSDs), and vertical dental discrepancies (VDDs).

DISCUSSION

The present study has some meaningful outcomes as follows: (1) despite the varying quality of lateral cephalogram images from the diverse cephalometric radiograph systems of 10 hospitals nationwide (Table 1), clinically acceptable diagnostic accuracy was obtained for APSDs, VSDs, and VDDs; and (2) since a proper diagnosis of APSDs, VSDs, and VDDs could be produced from the input of lateral cephalograms alone, our model shows the possibility of a general-purpose one-step orthodontic diagnosis tool.

Clinical meaning of the comparison results between internal and external test sets in binary and multiple ROC analysis

Since the differences in AUC values for APSDs, VSDs, and VDDs between the internal and external test sets were negligible in both binary and multiple ROC analyses (Tables 4 and 5), our model can be regarded as well-validated on the external test set.

Comparison of accuracy with a previous study using binary ROC analysis results

Compared to Model I of Yu et al.,8 our model showed slightly lower total accuracy (by < 0.011) and slightly higher total AUC (by < 0.020) (Table 6). Although our dataset had some disadvantages compared with Yu et al.'s study8 (n = 5,890 lateral cephalogram images, with an even class distribution after under-sampling), namely a smaller number of images and class imbalance, our model exhibited nearly the same performance as their Model I. To overcome this disadvantageous environment, we carefully designed the architecture of our model using GN, ArcFace, and a softmax layer (Figure 2).

Table 6 . Comparison of the binary ROC analysis results between multi-models in a previous study and a single model in this study.

Model | Study | APSDs sensitivity | APSDs specificity | APSDs accuracy | APSDs AUC | VSDs sensitivity | VSDs specificity | VSDs accuracy | VSDs AUC
Model I (no exclusion of data set) | Yu et al.'s study8 | 0.8575 | 0.9288 | 0.9050 | 0.938 | 0.8427 | 0.9213 | 0.8951 | 0.937
Model I (no exclusion of data set) | This study | 0.8414 | 0.9206 | 0.8944 | 0.9517 | 0.8461 | 0.9146 | 0.8910 | 0.9580
Model II (exclusion of data set within interval of 0.2 SD) | Yu et al.'s study8 | 0.9079 | 0.9539 | 0.9386 | 0.970 | 0.9222 | 0.9611 | 0.9481 | 0.985
Model II (exclusion of data set within interval of 0.2 SD) | This study | NA | NA | NA | NA | NA | NA | NA | NA
Model III (exclusion of data set within interval of 0.3 SD) | Yu et al.'s study8 | 0.9355 | 0.9677 | 0.9570 | 0.978 | 0.9459 | 0.9729 | 0.9640 | 0.984
Model III (exclusion of data set within interval of 0.3 SD) | This study | NA | NA | NA | NA | NA | NA | NA | NA

ROC, receiver operating characteristic; APSDs, anteroposterior skeletal discrepancies; VSDs, vertical skeletal discrepancies; AUC, area under the curve; SD, standard deviation; NA, not applicable.



Excluding specific data, especially in the test set, may increase the risk of sample selection bias and lead to inaccurate validation of the model. Therefore, in the present study, all datasets with a whole distribution were included to properly validate the model (Figure 3).

Differences in the AUC values between the Class II and Class III groups in APSDs and between the hyperdivergent and hypodivergent groups in VSDs in binary and multiple ROC analyses

The hypodivergent group showed a higher AUC than the hyperdivergent group in the internal test set, while the hyperdivergent group showed a higher AUC than the hypodivergent group in the external test set (internal test set, 0.9824 vs. 0.9730; external test set, 0.9684 vs. 0.9730; Table 4).

The Class III group showed higher AUC values than the Class II group in both the internal and external test sets (0.9807 vs. 0.9533 and 0.9930 vs. 0.9601, respectively; Table 4), in accordance with the results of Yu et al.8 The reason might be a difference in the location and size of the focus areas in the diagnosis of APSDs and VSDs (i.e., a relatively larger difference between the Class II and Class III groups than between the hyperdivergent and hypodivergent groups; Figure 6). Further studies are necessary to investigate why the Class III group showed a higher AUC than the Class II group.

Lower AUC values in VDDs compared to APSDs and VSDs in binary ROC analysis

The lower AUCs for VDDs in both the internal and external test sets (Table 4) and the relatively unclear separation of the normal overbite group from the deep bite and open bite groups in the GT of the t-SNE result (Figure 5) might be due to two reasons: (1) the imbalanced data composition of the training set, internal test set, and external test set (normal overbite, 61.3%, 52.9%, and 43.6%; open bite, 26.5%, 29.7%, and 28.2%; deep bite, 12.2%, 17.4%, and 28.2%, respectively; Table 3) or (2) an inherent problem of superimposition of the anterior teeth in the image.

Current status of CNN-based orthodontic diagnosis

Most previous CNN studies have focused on detecting cephalometric landmarks and/or calculating cephalometric variables for a two-step automated diagnosis.1-3,8-11 The study designs, methods, and results of previous CNN studies are summarized in Table 7. In the present study, we proposed a one-step orthodontic diagnosis model that needs only the input of lateral cephalograms. The performance of the AI model used in this study was comparable to the human gold standard (Tables 4 and 5). Automated AI-assisted procedures might save clinicians valuable time and labor when classifying skeletodental characteristics in large samples. However, the final decision should still be made by a human expert, especially in borderline cases.

Table 7 . Summary of the study design, methods and results in the orthodontic diagnosis of previous CNN studies and this study.

Author (year)SamplesModel and its applicationData setResults
Arık et al.
(2017)1

400 publicly available cephalograms.

19 landmarks.

8 cephalometric parameters.

2 human examiners.

Deep learning with CNN and shape-based model.

Landmark detection.

Cephalometric analysis.

Training set: 150.

Test set: 250.

High anatomical landmark detection accuracy (∼1% to 2% higher success detection rate for a 2-mm range compared with the top benchmarks in the literature).

High anatomical type classification accuracy (~76% average classification accuracy for test set).

Park et al.
(2019)9

1,028 lateral cephalograms.

80 landmarks.

1 human examiner.

Deep learning with YOLOv3 and SSD.

Landmark detection.

Training set: 1,028.

Test set: 283.

The YOLOv3 algorithm outperformed SSD in accuracy for 38 of 80 landmarks.

The other 42 of 80 landmarks did not show a statistically significant difference between YOLOv3 and SSD.

Error plots of YOLOv3 showed not only a smaller error range but also a more isotropic tendency.

The mean computational time spent per image was 0.05 seconds and 2.89 seconds for YOLOv3 and SSD, respectively.

YOLOv3 showed approximately 5% higher accuracy compared with the top benchmarks in the literature.

Nishimoto
et al. (2019)3

219 lateral cephalograms from the internet.

10 skeletal landmarks.

12 cephalometric parameters.

Human examiners – not mentioned.

Personal desktop computer.

CNN.

Landmark detection.

Cephalometric analysis.

Training set: 153 (expanded 51 folds).

Test set: 66.

Average and median prediction errors were 17.02 and 16.22 pixels.

No difference in angles and lengths between CNN-predicted and manually plotted points.

Despite the variety of image quality, using cephalogram images on the internet is a feasible approach for landmark prediction.

Hwang et al.
(2020)10

1,028 lateral cephalograms.

80 landmarks.

2 human examiners.

Deep learning with YOLOv3.

Landmark detection.

Training set: 1,028.

Test set: 283.

Upon repeated trials, AI always detected identical positions on each landmark.

Human intra-examiner variability of repeated manual detections demonstrated a detection error of 0.97 ± 1.03 mm.

The mean detection error between AI and human: 1.46 ± 2.97 mm.

The mean difference between human examiners: 1.50 ± 1.48 mm.

Comparisons in the detection errors between AI and human examiners: less than 0.9 mm, which did not seem to be clinically significant.

Kunz et al.
(2020)11

1,792 cephalograms.

18 landmarks.

12 orthodontic parameters.

12 human examiners.

CNN deep learning algorithm.

Landmark detection.

Cephalometric analysis.

Humans' gold standard: median values of the 12 examiners.

Training set: 1,731.

Validation set: 61.

Test set: 50.

No clinically significant differences between humans' gold standard and the AI's predictions.

Yu et al.
(2020)8

5,890 lateral cephalograms and demographic data from one institute.

4 cephalometric parameters.

2 human examiners.

One-step diagnostic system for skeletal classification.

Multimodal CNN model.

<Model I>
Sagittal

Training set: n = 1,644.

Validation set: n = 351.

Test set: n = 351.

Vertical

Training set: n = 1,912.

Validation set: n = 375.

Test set: n = 375.

Vertical and sagittal skeletal diagnosis: > 90% sensitivity, specificity, and accuracy.

Vertical classification: highest accuracy at 96.40 (95% CI, 93.06 to 98.39; model III).

Binary ROC analysis: excellent performance (mean area under the curve > 95%).

Heat maps of cephalograms: visually representing the region of the cephalogram.

Kim et al.
(2020)2

2,075 lateral cephalograms from two institutes.

400 open dataset.

23 landmarks.

8 cephalometric parameters.

2 human examiners.

Stacked hourglass deep learning.

Two-stage automated algorithm.

Web-based application.

Landmark detection.

Cephalometric analysis.

Evaluation group 1:

Training set: n = 1,675.

Validation set: n = 200.

Test set: n = 200.

Evaluation group 2:

Training set: n = 1,675.

Validation set: n = 175.

Test set: n = 225.

Evaluation group 3:

ISBI 2015 test set: n = 400.

Landmark detection error: 1.37 ± 1.79 mm.

Successful classification rate: 88.43%.

This study
(2020)

2,174 lateral cephalograms from ten institutes.

4 cephalometric parameters.

1 human examiner.

One-step diagnostic system for skeletal and dental discrepancy.

CNN including DenseNet-169, ArcFace, and a softmax layer.

External validation.

Training set: n = 1,522 from 2 institutes.

Internal test set: n = 471 from 2 institutes.

External test set: n = 181 from the other 8 institutes.

Binary ROC analysis: Accuracy and area under the curve were high in both internal and external test set (range: 0.8248–0.8944 and 0.8979–0.9580 in internal test set; 0.8821–0.8880 and 0.9074–0.9524 in external test set) in diagnosis of the skeletal and dental discrepancies.

Multiple ROC analysis: Accuracy and area under the curve were high in both internal and external test sets (range: 0.8066–0.9905 and 0.8156–0.9998 in internal test set; 0.8054–0.9725 and 0.8222–0.9992 in external test set) in diagnosis of the skeletal and dental discrepancies.

t-SNE analysis succeeded in creating the well-separated boundaries between the three classification groups in each diagnosis.

Grad-CAM showed different patterns and sizes of the focus areas according to three classification groups in each diagnosis.

CNN, convolutional neural network; YOLO, "you only look once" real-time object detection; SSD, single shot detector; ISBI, International Symposium on Biomedical Imaging; AI, artificial intelligence; CI, confidence interval; ROC, receiver operating characteristic; t-SNE, t-stochastic neighbor embedding; Grad-CAM, gradient-weighted class activation mapping.



Limitations of this study and suggestions for future studies

The present study has some limitations. First, this study had a relative imbalance in the data sets of some centers. Second, more demographic, clinical, and cephalometric parameters should be included in setting the gold standard and training AI models in future studies.

As suggestions for future studies, it is necessary to develop a one-step automated classification algorithm for diagnosis of transverse and asymmetry problems. Prospective studies with larger diagnostic cohort data sets will allow more robust validation of the model.

CONCLUSION

  • The accuracy of our model was well-validated with internal test sets from two hospitals as well as external test sets from eight other hospitals without issues regarding the continuity of the data sets or exaggerated accuracy.

  • Our model shows the possible usefulness of a one-step automated orthodontic diagnosis tool for classifying skeletal and dental discrepancies with input of lateral cephalograms only in an end-to-end manner. However, it still needs technical improvement in terms of classifying VDDs.

ACKNOWLEDGEMENTS

This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI18C1638). This article was based on the study of Dr. Yim’s PhD dissertation. We thank Professor Won-hee Lim and Dr. Keunoh Lim for their contribution in performing the inter-examiner reliability test.

CONFLICTS OF INTEREST

No potential conflict of interest relevant to this article was reported.


References

  1. Arık SÖ, Ibragimov B, Xing L. Fully automated quantitative cephalometry using convolutional neural networks. J Med Imaging (Bellingham) 2017;4:014501.
  2. Kim H, Shim E, Park J, Kim YJ, Lee U, Kim Y. Web-based fully automated cephalometric analysis by deep learning. Comput Methods Programs Biomed 2020;194:105513.
  3. Nishimoto S, Sotsuka Y, Kawai K, Ishise H, Kakibuchi M. Personal computer-based cephalometric landmark detection with deep learning, using cephalograms on the internet. J Craniofac Surg 2019;30:91-5.
  4. Erkan M, Gurel HG, Nur M, Demirel B. Reliability of four different computerized cephalometric analysis programs. Eur J Orthod 2012;34:318-21.
  5. Wen J, Liu S, Ye X, Xie X, Li J, Li H, et al. Comparative study of cephalometric measurements using 3 imaging modalities. J Am Dent Assoc 2017;148:913-21.
  6. Rudolph DJ, Sinclair PM, Coggins JM. Automatic computerized radiographic identification of cephalometric landmarks. Am J Orthod Dentofacial Orthop 1998;113:173-9.
  7. Mosleh MA, Baba MS, Malek S, Almaktari RA. Ceph-X: development and evaluation of 2D cephalometric system. BMC Bioinformatics 2016;17(Suppl 19):499.
  8. Yu HJ, Cho SR, Kim MJ, Kim WH, Kim JW, Choi J. Automated skeletal classification with lateral cephalometry based on artificial intelligence. J Dent Res 2020;99:249-56.
  9. Park JH, Hwang HW, Moon JH, Yu Y, Kim H, Her SB, et al. Automated identification of cephalometric landmarks: part 1-comparisons between the latest deep-learning methods YOLOV3 and SSD. Angle Orthod 2019;89:903-9.
  10. Hwang HW, Park JH, Moon JH, Yu Y, Kim H, Her SB, et al. Automated identification of cephalometric landmarks: part 2-might it be better than human? Angle Orthod 2020;90:69-76.
  11. Kunz F, Stellzig-Eisenhauer A, Zeman F, Boldt J. Artificial intelligence in orthodontics: evaluation of a fully automated cephalometric analysis using a customized convolutional neural network. J Orofac Orthop 2020;81:52-68.
  12. Korean Association of Orthodontics Malocclusion White Paper Publication Committee. Cephalometric analysis of normal occlusion in Korean adults. Seoul: Korean Association of Orthodontists; 1997.
  13. Bujang MA, Baharum N. Guidelines of the minimum sample size requirements for Cohen's Kappa. Epidemiol Biostat Public Health 2017;14:e12267.
  14. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012;22:276-82.
  15. Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis 2015;115:211-52.
  16. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. Paper presented at: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2017 Jul 21-26; Honolulu, USA. Piscataway: IEEE; 2017. p. 2261-9.
  17. Goyal P, Dollár P, Girshick R, Noordhuis P, Wesolowski L, Kyrola A, et al. Accurate, large minibatch SGD: training ImageNet in 1 hour [Internet]. arXiv. 2017 Jun 8 [updated 2018 Apr 30; cited 2020 Aug 7]. Available from: https://arxiv.org/abs/1706.02677.
  18. Jia X, Song S, He W, Wang Y, Rong H, Zhou F, et al. Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes [Internet]. arXiv. 2018 Jul 30 [cited 2020 Aug 7]. Available from: https://arxiv.org/abs/1807.11205.
  19. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift [Internet]. arXiv. 2015 Feb 11 [updated 2015 Mar 2; cited 2020 Sep 8]. Available from: https://arxiv.org/abs/1502.03167.
  20. Wu Y, He K. Group normalization. In: Ferrari V, Hebert M, Sminchisescu C, Weiss Y, eds. ECCV 2018: Computer vision - ECCV 2018. Cham: Springer; 2018. p. 3-19.
  21. Deng J, Guo J, Xue N, Zafeiriou S. ArcFace: additive angular margin loss for deep face recognition. Paper presented at: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019 Jun 15-20; Long Beach, USA. Piscataway: IEEE; 2019. p. 4690-9.
  22. Dreiseitl S, Ohno-Machado L, Binder M. Comparing three-class diagnostic tests by three-way ROC analysis. Med Decis Making 2000;20:323-31.
  23. Li J, Fine JP. ROC analysis with multiple classes and multiple tests: methodology and its application in microarray studies. Biostatistics 2008;9:566-76.
  24. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579-605.
  25. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Paper presented at: 2017 IEEE International Conference on Computer Vision (ICCV); 2017 Oct 22-29; Venice, Italy. Piscataway: IEEE; 2017. p. 618-26.