Big Data Research for Diabetes-Related Diseases Using the Korean National Health Information Database
Abstract
The Korean National Health Information Database (NHID), which contains nationwide real-world claims data including sociodemographic data, health care utilization data, health screening data, and healthcare provider information, is a powerful resource for testing various hypotheses. It is also longitudinal in nature because a health checkup is recommended every 2 years, making it appropriate for long-term follow-up studies and for evaluating the relationships between health outcomes and changes in parameters such as lifestyle factors, anthropometric measurements, and laboratory results. However, because these data are not collected for research purposes, precise operational definitions of diseases are required to facilitate big data analysis using the Korean NHID. In this review, we describe the characteristics of the Korean NHID, summarize operational definitions of diseases used in diabetes-related research, and introduce representative research on diabetes-related diseases using the Korean NHID.
Highlights
• The Korean NHID is a powerful resource to test various hypotheses.
• The Korean NHID is longitudinal in nature and is appropriate for long-term follow-up studies.
• Research papers addressing diabetes using the Korean NHID have steadily increased.
• Precise operational definitions are required to facilitate big data analysis.
INTRODUCTION
In clinical research, the importance of real-world evidence obtained by big data analysis, alongside that obtained by randomized controlled trials, is steadily increasing. Big data analyses can provide research insights that are hard to obtain using existing data sources and research methods. The Korean National Health Information Database (NHID) contains a wealth of information that can be analyzed using big data analysis methods, and numerous publications have used the information in this database [1-6]. In this review, we describe the characteristics of the Korean NHID, provide commonly used operational definitions of diabetes-related diseases, and introduce representative studies of diabetes-related diseases using the Korean NHID.
KOREAN NATIONAL HEALTH INFORMATION DATABASE
The Ministry of Health and Welfare supervises the National Health Insurance Service (NHIS) and the Health Insurance Review & Assessment Service (HIRA) in Korea. The HIRA evaluates the adequacy of healthcare service costs by reviewing medical billing and claims and provides the review results to the NHIS and healthcare service providers [7,8]. The NHIS, a non-profit organization, is the single insurer in Korea (Supplementary Fig. 1). Approximately 97% of the Korean population subscribes to the Korean NHIS, while the remaining 3% is covered by medical aid programs. The general health screening program involves examinations at least once every 2 years for the entire population of Korean adults aged 40 years or older. General health screening for regional household members and dependents was expanded to include people aged 20 years or older in 2019 [9].
The NHID, established in 2011, combines information obtained from the NHIS and health examinations [9]. It incorporates all data from the NHIS and is unique in that it includes health screening information, including detailed lifestyle questionnaires, laboratory results, and anthropometric measurements, which are not included in other claims databases. The NHID consists of five databases: an eligibility database, a national health screening database, a healthcare usage database, a long-term care insurance database, and a healthcare provider database [10].
A limitation of the NHID is that its data are not collected for research purposes. It is difficult to define causal relationships when performing outcome studies [9,10]. Furthermore, the NHID does not include data on medications or procedures not covered by the NHIS. There may be discrepancies between the diagnosis encoded for medical claims and the actual disease. To reduce inaccuracy, appropriate operational definitions and validation of these definitions are crucial. The biggest advantage of the Korean NHID is that it includes a very large number of individuals (approximately 50 million) and is representative of the entire Korean population [9,10]. In addition, the NHID includes health screening information as well as claims data and can be linked to mortality data from Statistics Korea. The NHID is longitudinal in nature and is also appropriate for long-term follow-up studies. Specifically, given that a health checkup is recommended once every 2 years for individuals aged 20 years or older, it can be used to identify health outcomes and their relationships to changes in various parameters including lifestyle factors, anthropometric measurements, and laboratory results.
RESEARCH TRENDS USING THE NATIONAL HEALTH INFORMATION DATABASE
Since the establishment of the NHID, research and publications based on this database have increased explosively. We searched PubMed using the keyword ‘Korean National Health Insurance’ and found more than 6,000 research papers published as of 2024 (Fig. 1). In 2008, 7.5% of research papers were retrieved using the keywords ‘Korean National Health Insurance’ and ‘diabetes,’ but since then the proportion addressing diabetes has steadily increased to 19.5% (170/874) of all research papers using the NHID in 2023. As of the end of 2024, more than 1,000 articles were identified in PubMed using the keywords ‘Korean National Health Insurance’ and ‘diabetes.’ We hope that future studies addressing diabetes using the NHID will improve the welfare of the public by promoting public health, reducing medical costs, and guiding healthcare policies.
OPERATIONAL DEFINITIONS IN BIG DATA RESEARCH RELATED TO DIABETES
To perform appropriate big data analyses using the NHID, precise operational definitions of diseases are mandatory. Because there may be a discrepancy between the actual disease and the diagnosis claimed by healthcare providers, International Classification of Diseases, 10th Revision (ICD-10) codes alone are not sufficient to define diseases appropriately [9,10]. To improve operational definitions, health screening results can be incorporated, data can be integrated with prescription records, diagnoses can be limited to those registered during admission, repeated outpatient visits with the diagnosis can be required, or a special registration code for payment reduction can be added.
For example, type 2 diabetes mellitus (T2DM) is usually defined by a fasting plasma glucose concentration ≥126 mg/dL or the presence of at least 1 prescription claim per year for antidiabetic drugs under ICD-10 codes E11–14 [11]. One study evaluated the validity and reliability of the NHIS data-based definition of T2DM by comparing it with data from another population-based database, the Korea National Health and Nutrition Examination Survey (KNHANES), as a standard reference [12]. In the study population (n=13,006), two algorithms were used to determine whether the diagnostic claim codes for T2DM in the NHIS dataset were accompanied by prescription codes for antidiabetic drugs (algorithm 1) or not (algorithm 2). Although both algorithms showed good reliability in defining T2DM, the accuracy (0.93 vs. 0.89) and specificity (0.96 vs. 0.90) tended to be higher for algorithm 1 than algorithm 2. This study showed that population-based NHIS claims data can be useful in identifying subjects with T2DM using diagnostic and prescription codes as diagnostic criteria. Commonly used operational definitions of diseases related to diabetes are summarized in Table 1.
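The T2DM definition and its validation can be sketched in code. The Python below is a minimal illustration under stated assumptions, not the implementation used by the NHIS or by any cited study: the `PersonYear` record, its field names, and the string format of the ICD-10 codes are hypothetical. It flags a person-year as T2DM when fasting plasma glucose is ≥126 mg/dL or when at least one antidiabetic prescription claim carries an ICD-10 code in E11–14, and it computes accuracy and specificity against a reference standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

# ICD-10 prefixes covering E11-E14, as cited in the text.
T2DM_PREFIXES = ("E11", "E12", "E13", "E14")


@dataclass
class PersonYear:
    """Hypothetical one-year summary for one person (layout is an assumption)."""
    fasting_glucose_mg_dl: Optional[float] = None  # None if no screening result
    # ICD-10 codes attached to antidiabetic prescription claims in that year
    antidiabetic_claim_codes: List[str] = field(default_factory=list)


def has_t2dm(record: PersonYear) -> bool:
    """Fasting glucose >= 126 mg/dL, or >= 1 antidiabetic claim under E11-E14."""
    high_glucose = (record.fasting_glucose_mg_dl is not None
                    and record.fasting_glucose_mg_dl >= 126)
    treated = any(code.strip().upper().startswith(T2DM_PREFIXES)
                  for code in record.antidiabetic_claim_codes)
    return high_glucose or treated


def accuracy_and_specificity(predicted: Sequence[bool],
                             reference: Sequence[bool]) -> Tuple[float, float]:
    """Compare claims-based labels with a reference standard.

    Assumes at least one reference-negative subject, otherwise specificity
    is undefined.
    """
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    fp = sum(1 for p, r in pairs if p and not r)
    accuracy = (tp + tn) / len(pairs)
    specificity = tn / (tn + fp)
    return accuracy, specificity
```

Requiring a prescription code in addition to the diagnostic code corresponds to the stricter of the two algorithms described above, which is one way a definition can trade sensitivity for specificity.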
REPRESENTATIVE RESEARCH RELATED TO DIABETES USING THE NATIONAL HEALTH INFORMATION DATABASE
Gestational diabetes mellitus
The prevalence of gestational diabetes mellitus (GDM) may be overestimated in Korea because many clinicians enter related codes to avoid pushback by the NHIS when they prescribe an oral glucose tolerance test [13]. To overcome this bias, patients with GDM have been defined as those who visited the outpatient clinic more than twice with GDM codes (O24.4 or O24.9) (Table 1) [14]. Using this definition, the overall prevalence of GDM in Korean women was 12.70% from 2011 to 2015. Advanced maternal age, pre-pregnancy body mass index, waist circumference, fasting plasma glucose, high income, smoking, and drinking were associated with an increased risk for GDM [14].
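The repeated-visit rule for GDM can be expressed as a small filter. This is a hedged sketch: the input (a list of ICD-10 codes, one per outpatient visit) and the function names are hypothetical, and "more than twice" is read literally as at least three visits, which is an assumption about the cited definition.

```python
from typing import Iterable

# GDM codes cited in the text.
GDM_CODES = ("O24.4", "O24.9")


def _normalize(code: str) -> str:
    """Upper-case and drop the dot so 'o24.4' and 'O244' compare equal."""
    return code.strip().upper().replace(".", "")


def is_gdm_case(outpatient_visit_codes: Iterable[str], min_visits: int = 3) -> bool:
    """True if enough outpatient visits carry a GDM code.

    min_visits=3 encodes 'more than twice' (an assumption about the cutoff).
    """
    targets = tuple(_normalize(c) for c in GDM_CODES)
    n_visits = sum(1 for code in outpatient_visit_codes
                   if _normalize(code).startswith(targets))
    return n_visits >= min_visits
```

Keeping the cutoff as a parameter makes it easy to check how sensitive a prevalence estimate is to the visit threshold.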
Fatty liver disease
Hepatic steatosis should be diagnosed histologically or by imaging, but that information is usually unavailable in nationwide large-scale databases. The fatty liver index (FLI), an alternative to ultrasound or liver biopsy, is a simple and accurate surrogate marker of hepatic steatosis and has been validated in many studies [15-17]. FLI was used to define hepatic steatosis and was calculated using the following equation [15]:

$$\mathrm{FLI} = \frac{e^{x}}{1 + e^{x}} \times 100, \qquad x = 0.953 \times \ln(\mathrm{TG}) + 0.139 \times \mathrm{BMI} + 0.718 \times \ln(\mathrm{GGT}) + 0.053 \times \mathrm{WC} - 15.745$$

where TG is triglycerides, BMI is body mass index, GGT is gamma-glutamyl transferase, and WC is waist circumference. In Western populations, FLI ≥60 accurately identifies the presence of hepatic steatosis [15], but FLI ≥30 has been validated for hepatic steatosis in the general population of Korea [18]. Using the Korean NHID, the association of nonalcoholic fatty liver disease (NAFLD) with the risks of cardiovascular disease and all-cause death in patients with T2DM was evaluated [19]. NAFLD was defined as the presence of hepatic steatosis without viral hepatitis or excessive alcohol consumption (≥30 g/day). Patients were divided into the following three groups: no NAFLD: FLI <30; grade 1 (G1) NAFLD: 30≤ FLI <60; and grade 2 (G2) NAFLD: FLI ≥60. NAFLD in patients with T2DM was associated with a higher risk of cardiovascular disease (myocardial infarction or ischemic stroke) (Table 1) and all-cause death, even in patients with mild liver disease. Risk differences for cardiovascular disease and all-cause death between the no NAFLD group and the grade 1 or grade 2 NAFLD groups were higher in patients with T2DM than in those without T2DM. A study that defined fatty liver as FLI ≥30 showed that subjects with mixed-etiology metabolic dysfunction-associated fatty liver disease (MAFLD) had an approximately 1.3-fold increased risk of cancer incidence and a 1.5-fold higher risk of cancer mortality than those without MAFLD, whereas those with single-etiology MAFLD only had modestly increased risks [20].
In addition, NAFLD (defined as FLI ≥30) was associated with an increased risk of young-onset stomach, colorectal, liver, pancreatic, biliary tract, and gallbladder cancers among more than 5 million individuals aged 20 to 39 years [21].
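The FLI calculation and the three-group split used in this section can be written out as a short sketch. The function and argument names are hypothetical, and the units (triglycerides in mg/dL, GGT in U/L, waist circumference in cm, body mass index in kg/m²) are assumptions that the text does not state. The grading does not apply the exclusions for viral hepatitis or excessive alcohol consumption, which would need separate claims and questionnaire data.

```python
import math


def fatty_liver_index(tg_mg_dl: float, bmi: float,
                      ggt_u_l: float, waist_cm: float) -> float:
    """Fatty liver index on a 0-100 scale, using the coefficients in the text."""
    x = (0.953 * math.log(tg_mg_dl)
         + 0.139 * bmi
         + 0.718 * math.log(ggt_u_l)
         + 0.053 * waist_cm
         - 15.745)
    return math.exp(x) / (1 + math.exp(x)) * 100


def nafld_grade(fli: float) -> str:
    """Grouping by FLI only: <30, 30 to <60, >=60 (exclusions not applied)."""
    if fli < 30:
        return "no NAFLD"
    if fli < 60:
        return "G1 NAFLD"
    return "G2 NAFLD"


if __name__ == "__main__":
    # Illustrative (made-up) screening values, not data from any study.
    fli = fatty_liver_index(tg_mg_dl=150, bmi=25, ggt_u_l=30, waist_cm=85)
    print(round(fli, 1), nafld_grade(fli))
```

Because the index is a logistic transform, it stays between 0 and 100 and rises monotonically with each of the four inputs.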
Variability in body weight and glucose as risk factors for various diseases
Because all enrollees in the Korean NHIS are advised to receive medical checkups every 2 years, the intraindividual visit-to-visit variability in glucose levels and body weight can be calculated using values obtained from serial health examinations [9,10]. Variability independent of the mean, coefficient of variation, and average real variability are representative indices of variability [22]. Body weight variability was associated with increased risks of myocardial infarction, stroke, and all-cause mortality in patients with T2DM and was a predictor of cardiovascular outcomes [23]. In a study investigating the association of body weight or glucose variability or their combination with the risk of hip fracture in people with diabetes, the risk was approximately 30% higher in the groups with high variability in body weight or glucose than in the group without high variability [24]. In addition, combined high variability of body weight and glucose level had an additive effect with a greater than 60% higher risk of hip fracture. In patients with predialysis chronic kidney disease, higher body mass index variability was significantly associated with higher risks of all-cause mortality, myocardial infarction, stroke, and progression to need for kidney replacement [25].
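Two of the variability indices named above can be computed directly from one person's serial measurements. The sketch below uses hypothetical function names and assumes the sample standard deviation. Variability independent of the mean is omitted because it requires a power coefficient fitted across the whole cohort, which cannot be derived from one person's values.

```python
from statistics import fmean, stdev
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Sample SD divided by the mean, as a percentage (needs >= 2 values)."""
    if len(values) < 2:
        raise ValueError("at least two measurements are required")
    return stdev(values) / fmean(values) * 100


def average_real_variability(values: Sequence[float]) -> float:
    """Mean absolute difference between consecutive measurements."""
    if len(values) < 2:
        raise ValueError("at least two measurements are required")
    diffs = [abs(later - earlier) for earlier, later in zip(values, values[1:])]
    return fmean(diffs)


if __name__ == "__main__":
    # Made-up body weights (kg) from four consecutive health examinations.
    weights = [70.0, 72.0, 68.0, 70.0]
    print(coefficient_of_variation(weights), average_real_variability(weights))
```

Average real variability depends on the order of the visits, whereas the coefficient of variation does not, so the two can rank individuals differently.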
Body weight change as a risk factor for various diseases
Although obesity is a proven risk factor for cardiovascular disease and metabolic diseases, there are limited data on the associations between weight change and those health outcomes. Body weight change has been calculated as the difference in body weight between the first and second general health checkups in the NHID [9,10]. Patients were categorized into five groups according to body weight change: severe weight loss (weight change ≤–10%), moderate weight loss (weight change of –10% to ≤–5%), stable weight (weight change of –5% to ≤5%), moderate weight gain (weight change of 5% to ≤10%), and severe weight gain (weight change >10%) [26,27]. A study of patients newly diagnosed with T2DM showed a significant U-shaped association between weight change and the risk of all-cause dementia (Table 1) [26]. Body weight change >10% was significantly associated with an increased risk of all-cause dementia. In addition, weight loss >10% was significantly associated with an increased risk of Alzheimer disease [26]. In a study of 1,522,241 patients with T2DM, a U-shaped association was found between body weight change and major cardiovascular event risks such as myocardial infarction, ischemic stroke, atrial fibrillation, heart failure, and all-cause death [27].
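The five weight-change groups can be sketched as a simple classifier. Function names are hypothetical, and two details are assumptions: the change is expressed as a percentage of the first measurement, and each range is treated as exclusive at its lower bound and inclusive at its upper bound, following the cutoffs as written above.

```python
def percent_weight_change(first_kg: float, second_kg: float) -> float:
    """Change between two checkups as a percentage of the first weight."""
    if first_kg <= 0:
        raise ValueError("first weight must be positive")
    return (second_kg - first_kg) / first_kg * 100


def weight_change_group(pct: float) -> str:
    """Map a percentage change to one of the five groups in the text."""
    if pct <= -10:
        return "severe weight loss"
    if pct <= -5:
        return "moderate weight loss"
    if pct <= 5:
        return "stable weight"
    if pct <= 10:
        return "moderate weight gain"
    return "severe weight gain"


if __name__ == "__main__":
    # Made-up example: 80 kg at the first checkup, 71 kg at the second.
    change = percent_weight_change(80.0, 71.0)
    print(round(change, 2), weight_change_group(change))
```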
Metabolic syndrome
Anthropometric measurements and laboratory data from the NHIS health checkup and claims data include components of metabolic syndrome (waist circumference, triglycerides, high-density lipoprotein cholesterol, blood pressure, and fasting plasma glucose), which can be diagnosed using these data [9,10]. Among 8,320 earlier-onset colorectal cancer cases, metabolic syndrome and obesity were positively associated with earlier-onset colorectal cancer, particularly in the distal colon and rectum, but not the proximal colon [28]. Individuals who recovered from metabolic syndrome were shown to have a higher risk of pancreatic cancer than those free of metabolic syndrome but a lower risk than those with persistent metabolic syndrome [29]. Furthermore, in young men, development of metabolic syndrome was associated with increased risk of incident gout, and recovery from metabolic syndrome was associated with reduced risk of incident gout [30].
Lifestyle behavior
In the NHIS health screening, physical activity is measured using the International Physical Activity Questionnaire developed by the World Health Organization [31,32]. The questionnaire includes exercise intensity, duration, and frequency per week. One study defined regular physical activity as ≥30 minutes of moderate physical activity at least five times per week or ≥20 minutes of vigorous physical activity at least three times per week [33]. This study reported that regular physical activity was independently associated with lower risks of all-cause dementia, Alzheimer disease, and vascular dementia among participants with new-onset T2DM [33]. The interval change in regular physical activity can also be determined using consecutive health screenings [9,10]. When regular exercise was defined as moderate intensity exercise for more than 30 minutes or vigorous intensity exercise for more than 20 minutes at least once a week, starting exercise, maintaining exercise, and even cessation of exercise after thyroidectomy for treatment of thyroid cancer were associated with a lower risk of incident T2DM [34]. In addition, starting and maintaining regular physical activity were both associated with lower risk of incident atrial fibrillation in patients with T2DM (Table 1) [35].
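The first definition of regular physical activity above, and the change between two consecutive screenings, can be sketched as follows. The `ActivityAnswer` record, its field names, and the label for people inactive at both screenings are hypothetical; the actual questionnaire items may be structured differently.

```python
from dataclasses import dataclass


@dataclass
class ActivityAnswer:
    """Hypothetical summary of one screening questionnaire (an assumption)."""
    moderate_minutes_per_session: float
    moderate_sessions_per_week: int
    vigorous_minutes_per_session: float
    vigorous_sessions_per_week: int


def is_regular(answer: ActivityAnswer) -> bool:
    """>=30 min moderate activity >=5 times/week, or >=20 min vigorous >=3 times/week."""
    moderate_ok = (answer.moderate_minutes_per_session >= 30
                   and answer.moderate_sessions_per_week >= 5)
    vigorous_ok = (answer.vigorous_minutes_per_session >= 20
                   and answer.vigorous_sessions_per_week >= 3)
    return moderate_ok or vigorous_ok


def activity_change(first: ActivityAnswer, second: ActivityAnswer) -> str:
    """Classify the interval change between two consecutive screenings."""
    before, after = is_regular(first), is_regular(second)
    if before and after:
        return "maintained"
    if after:
        return "started"
    if before:
        return "stopped"
    return "persistently inactive"  # label chosen for this sketch
```

Studies that use a looser definition, such as exercise at least once a week, would change only the thresholds inside `is_regular`.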
Information on the frequency of alcohol intake per week and the amount of alcohol consumed per drinking episode is collected during the biennial health checkup using a self-administered questionnaire and is included in the Korean NHIS-Health Screening Cohort database [9,10]. In one study, subjects were classified into one of three groups based on average amount of alcohol intake per day: (1) no alcohol consumption (0 g/day); (2) mild alcohol consumption (<20 g/day); and (3) moderate to heavy consumption (≥20 g/day). Alcohol abstainers, constant drinkers, and nondrinkers were also defined to evaluate the impact of alcohol behavioral changes on various outcomes. This study found that, among 1,112,682 patients newly diagnosed with T2DM, those who abstained from alcohol had a low risk of atrial fibrillation (Table 1) [36].
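The three alcohol groups can be derived from the two questionnaire items with a short calculation. This is a sketch with hypothetical function names, and it assumes that the average daily intake is the weekly frequency multiplied by the grams per episode, divided by 7, and that the amount per episode has already been converted to grams of alcohol.

```python
def daily_alcohol_g(episodes_per_week: float, grams_per_episode: float) -> float:
    """Average grams of alcohol per day (averaging rule is an assumption)."""
    if episodes_per_week < 0 or grams_per_episode < 0:
        raise ValueError("inputs must be non-negative")
    return episodes_per_week * grams_per_episode / 7


def alcohol_group(grams_per_day: float) -> str:
    """0 g/day, <20 g/day, or >=20 g/day, as in the text."""
    if grams_per_day == 0:
        return "none"
    if grams_per_day < 20:
        return "mild"
    return "moderate to heavy"


if __name__ == "__main__":
    # Made-up example: two drinking episodes per week, 35 g each.
    intake = daily_alcohol_g(2, 35)
    print(intake, alcohol_group(intake))
```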
Income status and diabetes
Previous studies have suggested that low socioeconomic status may contribute to a poor T2DM prognosis and an increased risk of mortality [37,38]. However, it is not easy to evaluate the effects of socioeconomic status on various health problems since many databases do not contain data on income or only include baseline data. In the Korean NHIS, household income is evaluated using health insurance premiums because the NHIS does not provide actual household income data [9]. Monthly health insurance premiums that are determined by wages and property do not change throughout a 1-year period unless an extreme income change occurs and are divided into 20 groups. To investigate whether income status was associated with various health outcomes, individuals’ baseline income status was categorized into quartiles from 1 (lowest) to 4 (highest), and changes in income status were compared between the first assessment and the last. A study that included approximately 7.8 million Korean adults found that sustained low income and decreases in income were associated with elevated T2DM risk, whereas a sustained high income was associated with lower T2DM risk (Table 1) [39]. In addition, sustained low-income status and declines in income were associated with increased risk of mortality in a study of >1.9 million adults with T2DM [40]. Higher income variability and sustained low income over 5 years were associated with increased cardiovascular disease risk in 1,528,108 adults aged 30 to 64 years with T2DM and no history of cardiovascular disease [41].
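The income classification can be sketched as a mapping from premium group to quartile, followed by a comparison of the first and last assessments. The mapping of the 20 premium groups into four equal bands of five, the function names, and the pattern labels are all assumptions made for illustration; the cited studies may define quartiles and change patterns differently.

```python
def premium_group_to_quartile(group: int) -> int:
    """Map a premium group (1-20) to a quartile (1 = lowest, 4 = highest).

    Equal bands of five groups are an assumption for this sketch.
    """
    if not 1 <= group <= 20:
        raise ValueError("premium group must be between 1 and 20")
    return (group - 1) // 5 + 1


def income_change_pattern(first_group: int, last_group: int) -> str:
    """Compare income quartile at the first and last assessments."""
    first_q = premium_group_to_quartile(first_group)
    last_q = premium_group_to_quartile(last_group)
    if first_q == 1 and last_q == 1:
        return "sustained low"
    if first_q == 4 and last_q == 4:
        return "sustained high"
    if last_q < first_q:
        return "decreased"
    if last_q > first_q:
        return "increased"
    return "unchanged (middle quartiles)"
```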
All-cause and cause-specific mortality risks in diabetes
The NHID includes information on the cause and date of death because it is linked to death certificates from Statistics Korea. The cause of death can be identified based on ICD-10 codes, and specific causes of death can be classified as cardiovascular (code I), neoplasm (code C), respiratory (code J), infectious (codes A and B), and so on. In a study of nearly 2 million patients with T2DM, hepatic steatosis and advanced fibrosis were significantly associated with risks of all-cause and cause-specific mortality, including cardiovascular, cancer, respiratory, and liver disease [42]. In addition, individuals with diabetes living alone (IDLA) were at a 20% higher risk of all-cause mortality than those not living alone in a study of nearly 2.5 million individuals with diabetes [43]. The risks of mortality from cardiovascular disease, cancer, respiratory disease, infectious disease, and other causes were all significantly higher in the IDLA group by 7% to 33% compared with the non-IDLA group.
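The grouping of causes of death by the leading letter of the ICD-10 code can be sketched as a lookup. The function name and the "other" and "unknown" labels are hypothetical, and only the categories named in the text are mapped.

```python
# Leading ICD-10 letter -> cause-of-death category, as listed in the text.
CAUSE_BY_LETTER = {
    "I": "cardiovascular",
    "C": "neoplasm",
    "J": "respiratory",
    "A": "infectious",
    "B": "infectious",
}


def classify_cause_of_death(icd10_code: str) -> str:
    """Return the category for a cause-of-death code, or 'other'."""
    code = icd10_code.strip().upper()
    if not code:
        return "unknown"  # missing code; label chosen for this sketch
    return CAUSE_BY_LETTER.get(code[0], "other")


if __name__ == "__main__":
    for example in ("I21.9", "C25.0", "J18.9", "A41.9", "K74.6"):
        print(example, classify_cause_of_death(example))
```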
Cancer risk and diabetes
A cancer case can be defined using the NHID as the presence of an ICD-10 code of ‘C’ and an admission history with the cancer code as the principal diagnosis. In patients with diabetes, the risks of many cancers are increased through increased endogenous insulin levels resulting from insulin resistance, hyperglycemia, chronic inflammation, and increased oxidative stress [44]. A study that enrolled a total of 25,709,497 patients showed that the risk for stomach, colorectal, liver, pancreas, and kidney cancer appeared to be higher in patients with diabetes than in those without diabetes regardless of sex or duration of diabetes [45]. In a study of over 9 million individuals, even light-to-moderate alcohol consumption was associated with an increased risk of biliary tract cancer in individuals with prediabetes and diabetes, but not in normoglycemic individuals [46].
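The cancer case definition at the start of this section can be sketched as a check over a person's claims. The `Claim` record and its fields are hypothetical and do not reflect the actual NHID table layout.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Claim:
    """Hypothetical claim record (field names are assumptions)."""
    principal_diagnosis: str  # ICD-10 code of the principal diagnosis
    is_admission: bool        # True for inpatient claims


def is_cancer_case(claims: Iterable[Claim]) -> bool:
    """True if any admission has a 'C' code as its principal diagnosis."""
    return any(
        claim.is_admission
        and claim.principal_diagnosis.strip().upper().startswith("C")
        for claim in claims
    )
```

Requiring an admission with the cancer code as the principal diagnosis excludes outpatient claims in which a cancer code was entered only for a rule-out workup.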
CONCLUSIONS
The Korean NHID contains nationwide claims data including sociodemographic data, health care utilization data, health screening data, and healthcare provider information, representing an attractive research resource for real-world data. The database has been used extensively in clinical and public health research related to diabetes. Advantages of the NHID include its longitudinal nature and the ability to evaluate associations between health outcomes and changes in lifestyle factors, anthropometric measurements, and laboratory results, made possible by the recommendation of a health checkup every 2 years in Korea. To improve the quality of research using the NHID, it is important to understand its characteristics, design research accordingly, and clarify operational definitions of diseases. We are optimistic that research using the NHID will help advance medicine and improve human health.
SUPPLEMENTARY MATERIALS
Supplementary materials related to this article can be found online at https://doi.org/10.4093/dmj.2024.0780.
Supplementary Fig. 1. Operational structure of the National Health Insurance System (NHIS). Reproduced from Kim et al. [21]. HIRA, Health Insurance Review & Assessment Service.
Notes
CONFLICTS OF INTEREST
Kyung-Soo Kim has been associate editor of the Diabetes & Metabolism Journal since 2024. He was not involved in the review process of this article. Otherwise, there was no conflict of interest.
FUNDING
None
Acknowledgements
None