LEVEL ANALYSIS OF RHEUMATOID ARTHRITIS VIA MULTI-DOMAIN MACHINE LEARNING

By
ZHAOYI CHEN

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA
2019
© 2019 Zhaoyi Chen
To my mom and dad, girlfriend and Daona
ACKNOWLEDGMENTS

This dissertation would not have been possible without the guidance and help of my PhD committee. Drs. Mattia Prosperi, Amanda Hicks, Jinying Zhao, and Todd Manini have provided me with incredible mentorship as an epidemiologist and scientist. I want to especially thank my research mentor, Dr. Mattia Prosperi, for his tremendous guidance, support, and mentorship over the years. I am highly grateful for his willingness to take me as a student during a tough time in my first year of PhD study. He has trained me to become a strong and productive scientist. I will always be grateful for the opportunities, encouragement, and trust he has given me.

I would like to thank Dr. Robert Cook for his guidance and support over the years. As my academic advisor, he has always been positive, encouraging, and supportive. He always made sure I met all of my programmatic milestones and helped me identify career paths.

I would like to thank Drs. Victoria Bird, John Ash, and Carolina Otero Mejia. They and their students have collaborated with me on many fun research projects. Their guidance and support have helped me to become a better researcher than I was, and their dedication and determination will always guide me in my future career.

I am deeply grateful to the University of Florida Informatics Institute (UFII) for supporting this dissertation work in the past two years. I am always grateful for the wonderful opportunities the UFII provided for me and other fellows. The staff of UFII are always there for our troubles. I am highly indebted and thoroughly grateful for the opportunity to work with the faculty, staff, and fellow students in the Department of Epidemiology at the University of Florida.

Finally, and perhaps most importantly, many thanks and appreciation should go to my parents, Yan Pei and Yan Chen, who worked really hard to give me the opportunity to study abroad. I sincerely thank them for their unconditional love and sacrifices. I would like to thank my girlfriend, Yiyang Liu, for her support, understanding, and patience; I would not be who I am today if it were not for you.
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 BACKGROUND
    Epidemiology of Rheumatoid Arthritis
    Research Gap and Significance
    Multi-domain Approach and Theoretical Framework
    Utility of Machine Learning in Disease Risk Prediction
    Research Objectives
    Data Source

2 PREDICTING RISK OF RHEUMATOID ARTHRITIS IN AT RISK POPULATION USING DEMOGRAPHIC, CLINICAL AND SOCIO-ECOLOGICAL DOMAINS
    Introduction
    Methods
        Study population
        Predictors
        Statistical analysis
    Results
    Discussion

3 CLINICAL CORRELATES OF RHEUMATOID ARTHRITIS USING A STATISTICAL ROBUST FRAMEWORK
    Introduction
    Methods
    Results
    Discussion

4 THE HETEROGENEITY OF OTHER RHEUMATIC DISEASES IN PATIENTS WITH RHEUMATOID ARTHRITIS: A CLUSTERING ANALYSIS
    Introduction
    Methods
    Results
    Discussion

5 CONCLUSIONS
    Summary of Research Objectives
    Accomplishment of this Dissertation
    Limitations and Strengths
    Future Directions and Concluding Thoughts

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

Table
2-1 Population characteristics of the study population.
2-2 Top 10 (where applicable) most important variables in each domain, variable sorted by Chi-square value.
2-3 Frequencies of rheumatic diseases in the two groups.
2-4 Performance of prediction models.
2-5 Final multi-domain model from the logitboost algorithm.
3-1 Population characteristics of the two groups before propensity score matching.
3-2 Population characteristics of the two groups after propensity score matching.
3-3 Top post-diagnosis clinical variables identified in multivariable analysis from the three measurements and their information gain, AUROC, and odds ratios values (bold indicates the variables are positively associated with RA).
4-1 Population characteristics.
4-2 Class assignment and demographic factors and selected clinical conditions bivariate analyses.
4-3 Adjusted odds ratios (aOR) and 95% confidence intervals from multivariate logistic regression models (class 1 is used as reference).
LIST OF FIGURES

Figure
1-1 A schematic overview to illustrate the different endotype of RA patient and their overlap, and how endotype can facilitate treatment outcome.
1-2 Theoretical framework.
1-3 Flow chart of study population selection and inclusion criteria for each aim.
2-1 Model comparison via AUROC (Left panel: model comparison; Right panel: domain comparison).
3-1 The top 10 diagnosis in RA group after diagnosis.
3-2 Venn diagram of top post-diagnosis clinical variables identified in multivariable analysis from the three measurements, the ICD-9 codes indicates these are positively associated with RA.
4-1 Dendrogram from hierarchical clustering showing how the RDs are correlated.
4-2 Results from expectation maximization (EM) clustering showing how patients are grouped and the probability of having each RD.
4-3 Results from k-means (KM) clustering showing how patients are grouped and the probability of having each RD.
LIST OF ABBREVIATIONS

ACR       American College of Rheumatology
AIC       Akaike information criterion
anti-CCP  anti-cyclic citrullinated peptide
AQI       Air Quality Index
AUROC     area under the receiver operating characteristic curve
CCI
CO        carbon monoxide
DMARD     disease-modifying antirheumatic drugs
DNA       deoxyribonucleic acid
EHR       electronic health record
EM        expectation maximization
EULAR     European League Against Rheumatism
HC        hierarchical clustering
HCUP      Healthcare Cost and Utilization Project
HLA       human leukocyte antigen
ICD       International Classification of Diseases
IQR       interquartile range
KM        k-means
MTX       methotrexate
NADW      National Arthritis Data Workgroup
NO2       nitrogen dioxide
OR        odds ratio
PM        particulate matter
RA        rheumatoid arthritis
RD        rheumatic disease
RDCI      rheumatic disease comorbidity index
RF        rheumatoid factor
SASD      State Ambulatory Surgery and Services Databases
SEDD      State Emergency Department Databases
SID       State Inpatient Databases
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

LEVEL ANALYSIS OF RHEUMATOID ARTHRITIS VIA MULTI-DOMAIN MACHINE LEARNING

By
Zhaoyi Chen
May 2019

Chair: Mattia Prosperi
Major: Epidemiology

In this work, using a multi-domain framework and robust machine learning techniques, I examined the prediction of RA diagnosis, the clinical consequences of RA, and the underlying heterogeneity within the RA population. A large database drawn from the Florida statewide inpatient, outpatient, and emergency data of the HCUP was used.

In Chapter 2, prediction models for RA were built from multiple information domains using a collection of statistical and machine learning models with feature selection. I compared the performance of the different models in a 10-fold cross-validation setting. The final model from this chapter showed fair performance at the optimal cut point. In terms of the discriminative power of the individual domains, the model using the social-ecological domain had the best performance, followed by clinical diagnoses, demographics, and clinical procedures. Several predictive factors for RA were also identified.

In Chapter 3, I used a matched case-control study design to explore the clinical consequences of RA. Multivariable logistic regression and a combination of different measurements were used to assess the impact of RA on these clinical conditions. Several clinical conditions were identified as positively associated with RA at a significance level of 0.05 (adjusted p-value). This is a comprehensive analysis of RA and the risks of clinical comorbidities. The results suggest that RA may result in severe physical burdens. The findings from this chapter could guide further personalized management of RA patients to reduce their physical burden and improve overall wellbeing.

In Chapter 4, I examined the patterns of other rheumatic diseases in patients who have RA. The patient population was clustered into groups based on their status for other rheumatic diseases. Several unsupervised learning techniques were used. Some interesting patterns in how the rheumatic diseases may be related to each other were identified. As for patient clustering, a total of five classes of patients were identified, including one group of disease-free patients, two groups of patients with certain rheumatic diseases, and two groups of patients with all rheumatic diseases at different frequencies. The results from this chapter suggest that RA patients can be grouped based on their rheumatic diseases. These groups may be the result of disease progression and accumulation and may be associated with the presented phenotypes of RA patients.
CHAPTER 1
BACKGROUND

Epidemiology of Rheumatoid Arthritis

Rheumatoid arthritis (RA) is a disease of unclear etiology that carries a substantial healthcare burden and impairs quality of life. It affects about 1% of adults in the U.S. each year [1]. Therapeutics for RA are limited to relieving symptoms and reducing long-term implications. The chemotherapy agents used for RA treatment (such as methotrexate and biologics) usually have serious side effects and suppress the immune system, leading to multiple morbidities. The typical symptoms of RA include warm, swollen, and painful joints, but the disease can also affect other parts of the body, for example causing a low red blood cell count or inflammation around the lungs and/or heart [1]. In most cases, the symptoms of RA are very similar to those of other rheumatic diseases (RD) or other inflammatory arthritis (e.g., psoriatic arthritis, ankylosing spondylitis), so early diagnosis of RA is important for effective and targeted treatment. The diagnostic criteria have changed considerably in the past 30 years, with the latest guideline established in 2010. The 2010 classification criteria [2] combine objective exams (e.g., symptoms, duration) with serological parameters to identify patients with a high likelihood of developing a chronic form of RA. Although the latest guidelines put more emphasis on biomarkers such as the anti-cyclic citrullinated peptide (anti-CCP) test, the sensitivity of these biomarkers is still low, and performance is often measured in a selected population (e.g., those with current symptoms of arthritis). The sensitivity and specificity of the traditionally used rheumatoid factor (RF) test were around 70% and 80%, respectively, while the newer anti-CCP test for the diagnosis of RA showed a high specificity (~90%) but low sensitivity (~60%) [2-4]. Diagnostic tools that can improve both sensitivity and specificity are needed [5].
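To make the trade-off between the two tests concrete, the following minimal Python sketch computes sensitivity and specificity from confusion-matrix counts. The cohort sizes (100 RA and 100 non-RA patients per test) and the function name are hypothetical and were chosen only so that the counts reproduce the approximate percentages quoted above; they are not data from this study.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true RA cases the test detects.
    specificity = TN / (TN + FP): share of non-RA patients the test clears.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical counts (assumption): 100 RA and 100 non-RA patients per test,
# picked to match the approximate figures cited in the text.
rf_test = sensitivity_specificity(tp=70, fn=30, tn=80, fp=20)
anti_ccp_test = sensitivity_specificity(tp=60, fn=40, tn=90, fp=10)

print("RF test:       sensitivity=%.2f specificity=%.2f" % rf_test)
print("anti-CCP test: sensitivity=%.2f specificity=%.2f" % anti_ccp_test)
```

Under these assumed counts, the RF test misses fewer true cases, while the anti-CCP test produces fewer false positives; neither is high on both measures.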
Therefore, better diagnostic tools are needed to distinguish RA from other rheumatoid conditions and inflammatory arthritis, and to identify discrete subsets of RA to guide treatment and stratified medicine. In addition, because RA is an etiologically complex disease, it remains unclear whether or how many risk factors contribute to its development. Some factors, such as gender, autoantibodies, smoking, and genetic variants, have repeatedly been shown to be associated with RA, but other factors (such as infections, different comorbidities, and environments) have shown inconsistent findings [6-9]. One of the strongest risk factors for RA is racial/ethnic group. Previous studies have reported that people of Mexican American, Hispanic white, or other Hispanic ethnicity have a lower risk of having RA (reference, non-Hispanic white; OR = 0.54, 95% CI [0.31-0.96], P = 0.036) [6]. Racial and ethnic disparity not only affects the risk of RA but also results in different management and outcomes in RA patients [10,11]. Many genetic and environmental factors play roles in the development of RA. The human leukocyte antigen (HLA) locus has been demonstrated to be the most important genetic risk factor for RA [12]; results from twin studies showed that genetic factors have an important impact on susceptibility to RA, with an estimated contribution ranging from 50% to 60% [12]. Environmental and lifestyle factors such as smoking, history of live births, obesity, and diet have all been linked to increased risk of RA [8]. In addition, recent research using DNA sequence-based analyses suggests that gut microbiota and infection may also affect the development and progression of RA [13]. In this work, I will use a large statewide inpatient data set to develop a risk prediction model for RA using a multiple-domain approach.
The domains that I plan to examine include demographics, clinical variables (from medical history), and social-ecological variables. The goal is to improve the accuracy of RA prediction by incorporating information from different domains and to identify potential new risk factors; this approach also allows us to understand the contribution of each domain to RA occurrence. In addition to its similarity with other rheumatoid diseases, RA itself is a heterogeneous condition: the RA patient population can be divided into different subgroups based on different criteria. For example, based on the presence of anti-CCP in blood test results, two phenotypes can be defined, seropositive and seronegative, and it is estimated that 60 to 80% of RA patients are seropositive [14]. Another common classification is based on whether a patient possesses RF, another antibody used to determine the presence of RA; patients can be classified as RF positive or RF negative. Different subtypes may be associated with different health conditions; for example, a previous study showed that patients with the RF-negative subtype reported more time with depression, a longer diagnostic delay, and greater emotional expression, although the differences were not statistically significant [15]. The main problem with these classification approaches is that the subtypes have little or no direct relationship to the disease process; therefore, a new subtype classification approach based on distinct functional or pathobiological mechanisms of RA is needed. In addition, the classification criteria of RA may introduce heterogeneity as well. Under the latest guideline, a patient is diagnosed with RA if they have synovitis in at least 1 joint without an alternative diagnosis and achieve a total score of 6 or more (of a possible 10) from 4 categories. The 4 categories are: (1) number and site of involved joints (score range 0-5); (2) serologic abnormality (score range 0-3); (3) elevated acute-phase response (score range 0-1); (4) symptom duration (2 levels; range 0-1) [16].
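The scoring rule just described can be sketched in a few lines of Python. This is an illustrative simplification under stated assumptions: the category names, the function name, and the two example patients are hypothetical, and only the score ranges and the threshold of 6 are taken from the text. It is not a clinical implementation of the full criteria.

```python
# Maximum score per category, as given in the text (possible total: 10).
CATEGORY_MAX = {
    "joint_involvement": 5,      # (1) number and site of involved joints
    "serology": 3,               # (2) serologic abnormality
    "acute_phase_response": 1,   # (3) elevated acute-phase response
    "symptom_duration": 1,       # (4) symptom duration
}
THRESHOLD = 6  # a total score of 6 or more classifies the patient as RA


def classify_ra(scores, has_synovitis=True):
    """Return (total_score, meets_criteria) for one patient.

    `scores` maps each category name to that patient's score. Synovitis in
    at least one joint without an alternative diagnosis is the entry
    condition stated in the text.
    """
    for name, maximum in CATEGORY_MAX.items():
        if not 0 <= scores[name] <= maximum:
            raise ValueError(f"{name} score must be between 0 and {maximum}")
    total = sum(scores[name] for name in CATEGORY_MAX)
    return total, bool(has_synovitis and total >= THRESHOLD)


# Two hypothetical patients who reach the same total by different routes.
patient_a = {"joint_involvement": 5, "serology": 0,
             "acute_phase_response": 1, "symptom_duration": 0}
patient_b = {"joint_involvement": 2, "serology": 3,
             "acute_phase_response": 0, "symptom_duration": 1}

print(classify_ra(patient_a))  # joint-driven profile
print(classify_ra(patient_b))  # serology-driven profile
```

Both hypothetical patients reach a total of 6 and would be classified as RA, even though one profile is dominated by joint involvement and the other by serology.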
As a result, one patient may have a high score in category (1) but low scores in the other categories, while another patient may have a different scoring pattern. In this work, I will cluster different clinical conditions and other features of RA patients, aiming to identify the underlying patterns and connections between these clusters.

Research Gap and Significance

The current differential diagnosis of RA is problematic because the sensitivity and specificity of diagnostic testing are insufficient: the sensitivity and specificity of the traditionally used RF test were around 70% and 80%, while the recently developed anti-CCP test showed a high specificity (~90%) but low sensitivity (~60%) [2,3,14]. Diagnostic tools that can improve both sensitivity and specificity are needed. [...] CCP test), if the patient tests positive for rheumatoid factor and/or anti-CCP antibodies, then the patient is classif[...] their negative test results may result from the fact that the patients possess very low levels of the antibodies. Therefore, discovery of endotypes in the heterogeneous RA patient population is important. An endotype is a subtype of a condition or disease that is defined by a distinct functional or pathobiological mechanism [17,18]. Patients who have the same endotype usually share a similar disease process. The key to the concept is that endotypes are classified by underlying disease mechanisms; in contrast, the concept of phenotype focuses solely on observable characteristics or traits of a disease, such as morphology, development, biochemical or physiological properties, or behavior, without any implication of a mechanism. Because of the heterogeneity among RA patients, responses to therapies vary as well, as demonstrated in previous studies. The most commonly initiated treatment is methotrexate (MTX) therapy or other disease-modifying antirheumatic
drugs (DMARDs) in combination with corticosteroids; if DMARD therapy fails, additional treatments with different targets may be used [16,19,20]. The performance of biologic agents is evaluated by the level of improvement in disease activity based on criteria established by the American College of Rheumatology (ACR), defining 20% (ACR20), 50% (ACR50), and 70% (ACR70) improvement. The approximate response rates in randomized controlled studies of patients who have failed DMARD therapy have typically been 50 to 65% for the ACR20, 20 to 45% for the ACR50, and 10 to 25% for the ACR70 [13,20]. These data suggest the existence of heterogeneity in treatment response. This highlights the need to discover RA endotypes, as doing so would provide insights on how to optimize treatment options for each patient based on their endotype and underlying disease process. Karsdal et al. [5] highlighted the need for endotype discovery and a precision medicine approach in the clinical management of RA. Figure 1-1 shows a schematic overview illustrating that different endotypes of RA patients exist that to some extent overlap, but only through targeting of the right subpopulation may the optimal cost-benefit ratio be achieved [5]. Panel A shows that patients with different unidentified RA endotypes, which consist of different molecular and clinical features, will respond differentially to therapies, but there may be overlapping, therefore, [...]fits of the whole patient population, as the treatment is not currently targeted by endotype. In panel B, with the identification of RA endotypes, patients can receive personalized health care that targets a given endotype, which results in a greater potential to respond to the specific therapy.
Figure 1-1. A schematic overview to illustrate the different endotype of RA patient and their overlap, and how endotype can facilitate treatment outcome

Multi-domain Approach and Theoretical Framework

One important aspect of this dissertation work is the development of personalized medicine models by exploiting big data sets with information from multiple domains. This approach requires the integration of multi-source data and the use of a wide range of big data analytics techniques. The rationale is that an approach incorporating information from multiple domains (such as demographic, clinical, environmental, genetic, and social media data) could significantly improve the prediction of disease occurrence and prognostic outcomes; by examining a large volume of variables from different domains, new predictors and risk factors can also be identified, thus providing insights into the mechanisms of the disease. For my PhD dissertation, I will focus on the utilization of information from the demographic, clinical, and environmental domains. Figure 1-2 depicts the theoretical framework of my dissertation work.
Figure 1-2. Theoretical framework

Utility of Machine Learning in Disease Risk Prediction

Risk prediction of diseases is complex because the etiological mechanisms of many diseases are complex, the number of risk factors and their corresponding associations vary, and these risk factors may sometimes interact with each other. Therefore, the development of a risk prediction model for disease diagnosis usually requires the simultaneous consideration of tens or hundreds of variables to produce an accurate classification. Machine learning is a potentially effective approach for the development of risk predictions, especially for complex diseases, as it can process large amounts of information and is capable of deriving non-linear models, which may reflect the distribution of risk factors better than traditional linear models. Machine learning algorithms have several advantages over traditional statistical approaches for the development of prediction models. First, as the name suggests, machine learning methods determine the most effective algorithm on their own by processing through many
iterations and learn the best set of operations and variables for data classification, rather than having these predetermined by researchers. For example, most traditional statistical methods require the researcher to specify what variables need to be included in the model, while with machine learning methods, we can input all variables that are potentially related to the outcome of interest, and the machine will select the most influential ones. These features of machine learning could improve predictivity, identify previously unknown predictors, and minimize potential bias from [...] However, it is critical to have a good study design when using machine learning techniques, as biases and other errors may be picked up in the models rather [...] comparison group and reasonable inclusion and exclusion criteria are needed. In addition, machine learning algorithms can take complex combinations of variables into consideration, such as interactions between variables. Machine learning approaches are also capable of processing high-dimensional data. Compared to traditional approaches, machine learning approaches are generally more resistant to overfitting or optimism. There is a wide range of machine learning algorithms, and for risk diagnosis, supervised learning algorithms are the most suitable, as I aim to predict a definitive outcome (whether the subject will have the disease or not), and prediction models are developed based on both input and output data. Commonly used supervised machine learning algorithms include decision trees, random forests, and rule learning algorithms. In contrast to supervised learning, unsupervised learning is used to group or cluster the input data, using methods such as Expectation Maximization (EM) clustering or nearest neighbor methods, in which we aim to examine the patterns of input data [21,22]. In this dissertation work, supervised machine learning algorithms will be used to develop the predictive models for RA diagnosis in aim 1, while
unsupervised machine learning algorithms will be used to examine the patterns of rheumatic diseases in RA patients in aim 3. When using machine learning techniques, one important issue is the choice between black-box and white-box models [23]. A black-box technique is one in which the workflow of the algorithm is somewhat or very difficult to interpret, while a white-box technique refers to an approach that has high interpretability. For example, the decision tree is a white-box approach because the nodes and splits of the tree are visible; one can easily identify what variables are included in the prediction model and what roles they play [24,25]. Deep learning or neural networks, on the other hand, are black-box approaches, as they contain multiple hidden layers, and we cannot readily identify what variables are used in the model or how to interpret its inner workings [26,27]. In theory, given a sufficient volume of data, the flexibility and complexity of deep learning models and other black-box models usually enable them to outperform other machine learning models most of the time. F[...] model should be the overall performance. For example, if a black-box method ensures a 10% increase in performance (in terms of AUROC, sensitivity, and/or specificity) over a white-box method, and if this increase is clinically relevant, then the black box will be our choice. However, in addition to prediction performance, epidemiologists and clinicians are also interested in the interpretation of the prediction model, because from the prediction model we can identify new risk factors and ultimately better understand the etiology and risk factors of a disease and the biological mechanism behind its occurrence. Therefore, when the performances of black-box and white-box models are comparable, the white box is the preferable alternative. For example, in a study conducted by Fraccaro et al.
[28], several black-box and white-box models predicting age-related macular degeneration were developed and compared. In their study, the
more complex black-box method (random forest) yielded an average AUROC of 92%, while some more interpretable white-box methods such as the decision tree had comparable performance (90% AUROC). As a result, the decision tree was selected as one of their final models because it combined interpretability and performance: the tree diagram can be easily followed by a clinician during the diagnostic process.

Research Objectives

In my dissertation work, I will focus on (1) increasing the predictivity of RA diagnosis and better distinguishing RA from other rheumatoid conditions or conditions with similar symptoms; and (2) examining the underlying heterogeneity within the RA population and laying a foundation for future RA endotype discovery. The success of this work will provide useful prognostic information, further the understanding of RA pathogenesis, and identify under-testing or other missed opportunities in rheumatology healthcare. The specific aims I am proposing are as follows:

Aim 1: to develop multi-domain precision models/scores for rheumatoid arthritis (RA) using demographic, clinical, and environmental data. I will develop a diagnostic algorithm based on individual-level information domains (including demographics and clinical comorbidities), combined with environmental data (e.g., socioeconomic level, pollution). The hypothesis is that an approach incorporating demographic and clinical characteristics and additional environmental data would improve the predictivity of an RA prediction model, leading to earlier diagnosis.

Aim 2: to explore medical consequences (clinical diagnoses) associated with RA. This is an exploratory analysis: the goal is to explore what health conditions may be consequences of having RA, which will provide information for the clinical management of RA patients.
Aim 3: to identify subtypes in the RA population with putatively different clusters of other rheumatoid conditions. The hypothesis is that RA is a progressive and cumulative process involving multiple other rheumatoid conditions, and that subtypes of RA can be defined by distinct functional or pathophysiological mechanisms, in this case by how these rheumatoid conditions cluster.

Data source

To accomplish these aims, the Florida statewide databases from the Healthcare Cost and Utilization Project (HCUP) were used 29. HCUP is a family of health care databases and related software tools and products developed through a Federal-State-Industry partnership and sponsored by the Agency for Healthcare Research and Quality. The State Inpatient Databases (SID), State Ambulatory Surgery and Services Databases (SASD), and State Emergency Department Databases (SEDD) for the state of Florida, USA, between 2005 and 2014 (inclusive) were purchased after completion of the required trainings and data use agreement. The SID, SASD, and SEDD contain anonymized inpatient, outpatient, and emergency room visit data, including US zip codes, insurance status, hospital information, diagnoses, and procedures billed (encoded through the International Classification of Diseases, 9th revision (ICD-9) ontology) 30. HCUP contains a unique, de-identified patient code, which was used to link each patient's data longitudinally in this study. The HCUP databases did not contain laboratory or pharmacy data. I used 10 years (2005-2014) of Florida SID for this work. I chose this time period for two reasons: first, the encrypted person identifier was only available after 2005, and including this identifier helps ensure that each individual has complete medical records during this period; second, the ICD-10
ontology was implemented in HCUP starting in 2015; to avoid heterogeneity and potential errors in recoding, only data encoded with the ICD-9 ontology were used.

The observational unit of this work was the single patient aged 18 years or older. We designed a case risk control study by grouping patients who had a diagnosis of RA and patients who had a diagnosis of RD. RA and RD were each defined by a twice-confirmed diagnostic code. A twice-confirmed diagnosis of RA/RD was chosen to increase specificity in the absence of codes for the ordering of related tests and prescriptions. The baseline time for RA and RD patients was the date of the first RA or RD diagnosis. Other study inclusion criteria were: non-missing age, gender, and race/ethnicity; availability of zip code; and medical records for at least five years before the diagnosis year.

In addition to the variables in HCUP, social ecological indices were obtained from the American Community Survey (n=65 indices, including education, income, poverty, foreign born) 31 and the U.S. Environmental Protection Agency (n=8 air quality indices, including median Air Quality Index and atmospheric levels of particulate matter 2.5, particulate matter 10, sulfur dioxide, nitrogen dioxide, ozone, and carbon monoxide) 32, corresponding to the zip code of residence. Since patients usually had more than one visit and possibly more than one residence over the study period, the zip code most frequently reported prior to baseline was used. All statistical analysis was conducted at the zip code level, but results were also aggregated by county. Social ecological indices were extracted from the aforementioned sources for every year (2005-2014) and matched to patient zip codes by year of diagnosis. Missing values in the social ecological indices were imputed. Figure 1-3 displays the sample population and study flow used in each of the three aims.
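The twice-confirmed case definition, the baseline date, and the choice of residence zip code described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used in this work: the visit records, zip codes, and function names are hypothetical, and two distinct visit dates are used as a stand-in for two distinct visits.

```python
from collections import Counter
from datetime import date

RA_CODES = {"714.0"}  # RA code quoted in the text

def confirmed_baseline(visits, target_codes):
    """Return the first diagnosis date if the code appears on at least
    two distinct visit dates (assumed proxy for distinct visits), else None."""
    dates = sorted({v["date"] for v in visits if target_codes & set(v["codes"])})
    return dates[0] if len(dates) >= 2 else None

def modal_zip_before(visits, baseline):
    """Return the zip code reported most often on visits strictly before baseline."""
    zips = [v["zip"] for v in visits if v["date"] < baseline]
    return Counter(zips).most_common(1)[0][0] if zips else None

# Hypothetical visit history for one patient
visits = [
    {"date": date(2006, 3, 1), "zip": "32601", "codes": ["401"]},
    {"date": date(2008, 7, 9), "zip": "32601", "codes": ["285.9"]},
    {"date": date(2009, 1, 5), "zip": "32608", "codes": ["599.0"]},
    {"date": date(2011, 5, 2), "zip": "32608", "codes": ["714.0"]},
    {"date": date(2011, 9, 20), "zip": "32608", "codes": ["714.0", "401"]},
]

baseline = confirmed_baseline(visits, RA_CODES)
home_zip = modal_zip_before(visits, baseline)
```

In this toy history the patient moved shortly before diagnosis, so the most frequent pre-baseline zip code differs from the zip code recorded at diagnosis.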
demographic variables were extracted. External social ecological variables from the ACS and EPA databases were then linked with individual-level data using zip code. Within this sample population, we used the National Arthritis Data Workgroup definition of arthritis and other rheumatic diseases, based on ICD-9 codes, to separate the patients into RA and RD groups. Patients whose first RA/RD diagnosis was made before 2010 were excluded, as were those with less than 5 years of medical history. The case-control design was used for Aims 1 and 2. For Aim 3, only patients with RA were used.

Figure 1-3. Flow chart of study population selection and inclusion criteria for each aim
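The exclusion steps in the flow chart can be written as a single eligibility check. The sketch below is a hedged illustration: the function name and argument names are hypothetical, and the length of medical history is approximated as elapsed days divided by 365.25, which the text does not specify.

```python
from datetime import date

def meets_inclusion(baseline, first_record, age_at_baseline,
                    earliest_year=2010, min_history_years=5, min_age=18):
    """Apply the stated criteria: confirmed diagnosis, baseline in 2010 or later,
    adult at baseline, and at least five years of prior records."""
    if baseline is None:                      # no twice-confirmed diagnosis
        return False
    if baseline.year < earliest_year:         # first diagnosis before 2010
        return False
    if age_at_baseline < min_age:
        return False
    history_years = (baseline - first_record).days / 365.25  # assumed definition
    return history_years >= min_history_years

# Hypothetical patients
ok = meets_inclusion(date(2011, 5, 2), date(2006, 3, 1), 61)
too_early = meets_inclusion(date(2009, 5, 2), date(2003, 1, 1), 61)
short_history = meets_inclusion(date(2011, 5, 2), date(2008, 3, 1), 61)
```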
CHAPTER 2
PREDICTING THE RISK OF RHEUMATOID ARTHRITIS IN AT-RISK POPULATION USING DEMOGRAPHIC, CLINICAL, AND SOCIO-ECOLOGICAL DOMAINS

Introduction

Rheumatoid arthritis (RA) is a disease of unclear etiology that carries a substantial healthcare burden and impairs quality of life. A recent analysis suggested that the overall age-adjusted prevalence of RA ranged from 0.53 to 0.55% in 2014 and that the age-adjusted prevalence has steadily increased over the past decade 33. RA is usually characterized by painful and swollen joints, and it can cause extra-articular manifestations such as anemia, interstitial lung diseases, and vasculitis 34. The symptoms of RA are very similar to those of other rheumatic diseases (RD), and distinguishing patients with RA from patients with other RDs is important for effective and targeted treatment. The diagnosis of RA remains challenging, and commercially available biomarkers are imperfect: for example, the sensitivity and specificity of the traditionally used rheumatoid factor (RF) test are less than 80%, while the anti-cyclic citrullinated peptide (anti-CCP) test shows high specificity (~90%) but low sensitivity (~60%) 3,14. In addition, the etiology of RA remains poorly understood, and it is still unclear how many of the previously postulated risk factors for RA contribute to the development of the disease. Some factors such as gender, smoking, and genetic variants have repeatedly been shown to be associated with RA 6,34,35, but these factors can only partially explain individual risk for RA. Other factors or conditions (including infections, comorbidities such as depression, environmental exposure, and weather) have shown inconsistent findings 36-38. In this study, we use a large statewide electronic health record data set to develop a risk prediction model for RA using a multiple-domain approach.
The goal is to improve the accuracy of RA prediction by incorporating information from different domains and identify clinical
comorbidities associated with the development of RA, including prodromal and concurrent comorbidities.

Methods

This study was reviewed by the institutional review board under protocol no. IRB201701906 and classified as exempt.

Study Population

The Healthcare Cost and Utilization Project (HCUP) State Inpatient Databases (SID), State Ambulatory Surgery and Services Databases (SASD), and State Emergency Department Databases (SEDD) for the state of Florida, USA, between 2005 and 2014 (inclusive) were used 29,39. The SID, SASD, and SEDD contain anonymized, longitudinally linked inpatient, outpatient, and emergency room visit data, including zip codes, insurance status, hospital information, diagnoses, and procedures billed at each visit. The diagnoses and procedures were classified using the International Classification of Diseases version 9 (ICD-9) ontology. The observational unit of our study was the single patient with a diagnosis of RA or RD. The definitions of cases and controls were adapted from the National Arthritis Data Workgroup (NADW) definition of arthritis and other rheumatic conditions, which uses a set of ICD-9 codes. A diagnosis of RA or RD was considered confirmed when the corresponding diagnostic code was recorded at two distinct visits. Specifically, RA was defined as ICD-9 diagnostic code 714.0, and RD patients were those diagnosed with any of the other rheumatic diseases on the NADW list (including 710, 711, 712, 713, 715, 716, 719, 720, 721, 725, 726, 727, 728, 729.0, 729.1, 729.4, 095.6, 095.7, 098.5, 099.3, 136.1, 274, 277.2, 287.0, 344.6, 353.0, 354.0, 355.5, 357.1, 390, 391, 437.4, 443.0, 446, 447.6, 696.0). Because of the change in diagnostic guidelines in 2010, patients whose RA or RD was diagnosed before 2010 were excluded. Other
inclusion criteria were age of 18 or older at baseline and availability of medical records for at least five years prior to baseline.

Predictors

To each patient we associated all of their previous diagnoses and procedures using three-digit ICD-9 codes, the Charlson comorbidity index (CCI, calculated from ICD-9 codes) 40,41, and a set of social ecological variables obtained from the American Community Survey 31 (n=65, including education, income, poverty, foreign born) and the U.S. Environmental Protection Agency 32 (n=18 air quality indices). The social ecological variables were linked with individual-level data using zip code. Because patients usually had more than one visit and possibly more than one residence over the study period, the zip code most frequently reported prior to diagnosis was used. Social ecological variables were extracted for every year (2005-2014) from the ACS and EPA datasets and were matched to patient zip codes and year of diagnosis. Diagnostic and procedure codes that occurred in less than 1% of the RA group were removed. Missing values in the social ecological domain were imputed with the population median or mode.

Statistical Analysis

To identify demographic, clinical, and socio-ecological predictors of RA, we first used univariate analysis to assess the distribution of the predictors in RA and RD using the t-test, Wilcoxon rank test, or chi-square test, where appropriate; p-values were adjusted using the Bonferroni correction. We then fitted a collection of multivariable models with different techniques using predictors from all input domains. Specifically, we evaluated the following models: (a) a decision tree 42; (b) the LogitBoost algorithm in conjunction with logistic regression 43; (c) a random forest (optimizing the number of trees up to 1,000) 44; (d) one rule 45. Model comparison, evaluation,
and selection were carried out using a 10-fold cross-validation framework, and the performance and discriminative ability of the models were assessed using sensitivity (true positive rate), specificity (true negative rate), and the area under the receiver operating characteristic curve (AUROC) 46. To analyze and compare the importance of each domain, we fitted the algorithm with the best performance and interpretability on each information domain included in our analysis. These individual domains are: (1) demographic variables; (2) clinical diagnoses; (3) procedures conducted; (4) social ecological variables; and (5) all variables available. The performances of models using different information domains were assessed using sensitivity, specificity, and AUROC under a 10-fold cross-validation setting. All statistical analyses were conducted using R and Weka ver. 3.9.

Results

The 2005-2014 HCUP SID/SASD/SEDD contained a total of 21,091,289 patients after data merging and cleaning. There were 145,512 patients meeting the inclusion criteria (first baseline year in 2010), of whom 20,351 had confirmed RA and 125,161 had confirmed RD. Table 2-1 shows the population characteristics of the RA and RD groups. There was a higher proportion of females and of White race in the RA group than in the RD group. The RA group also had a higher median age (61 years) and a higher median CCI (1) than the RD group (median age 55 and median CCI 0).

Table 2-1. Population characteristics of the study population
                                                    RA                  RD
Total N                                             20351               125161
Female N (%)                                        15533 (76.3)        72746 (58.1)
Race White N (%)                                    15012 (73.4)        81335 (65.0)
Race African American N (%)                         2755 (13.5)         25243 (20.2)
Ethnicity Hispanic N (%)                            2145 (10.5)         14922 (11.9)
Race Asian or Pacific Islander N (%)                105 (0.5)           796 (0.6)
Race Native American N (%)                          15 (0.07)           129 (0.1)
Race Other N (%)                                    320 (1.6)           2736 (2.2)
Age median (interquartile range, IQR)               61 (47, 72)         55 (41, 67)
Year of RA/RD diagnosis (or baseline) median (IQR)  2012 (2011, 2013)   2012 (2011, 2013)
Years of prior medical history median (IQR)         6 (5, 7)            7 (5, 8)
CCI median (IQR)                                    1 (0, 2)            0 (0, 2)

A total of 377 three-digit ICD-9 diagnostic codes and 108 procedure codes were identified after the removal of codes with low frequencies. We performed univariate analysis of variables from the diagnosis, procedure, and socio-ecological domains, comparing RA with RD. Table 2-2 displays frequencies of the top 10 variables from the diagnosis, procedure, and socio-ecological domains. All variables had Bonferroni-adjusted p-values below 0.05. Higher prevalences of several infections, anemia, osteoporosis, asthma, and pain were observed in the RA group. There were also differences in the procedures performed in each group. Interestingly, a number of social ecological variables differed between the two groups. The RA group had a higher deprivation index and lower income, while the RD group had higher values for several air quality measures.

Table 2-2. Top 10 (where applicable) most important variables in each domain, sorted by chi-square value
Variable                                                        RA n (%)        RD n (%)
Diagnosis (ICD-9 codes)
Post-inflammatory pulmonary fibrosis (515)                      439 (7.5)       6068 (2.6)
Osteoporosis (733.0)                                            1269 (21.8)     30089 (12.7)
Candidiasis (112)                                               700 (12.0)      14953 (6.3)
Urinary tract infection, site not specified (599.0)             2346 (40.3)     71762 (30.2)
Pneumonia, organism unspecified (486)                           1654 (28.4)     47198 (19.9)
Pain not elsewhere classified (338)                             1379 (23.7)     38045 (16.0)
Anemia, unspecified (285.9)                                     2414 (41.5)     75999 (32.0)
Disorders of fluid, electrolyte, and acid-base balance (276)    3677 (63.2)     126517 (53.2)
Anemia in chronic illness (285.2)                               845 (14.5)      21115 (8.9)
Asthma (493)                                                    1433 (24.6)     41105 (17.3)
Procedures (ICD-9 codes)
Esophagogastroduodenoscopy [EGD] With Closed Biopsy (45.16)     5110 (25.1)     22332 (17.8)
Transfusion of Packed Cells (99.04)                             2389 (11.7)     10254 (8.2)
Colonoscopy (45.23)                                             3598 (17.7)     19338 (15.5)
Other Manually Assisted Delivery (73.59)                        280 (1.4)       3512 (2.8)
Other Immobilization, Pressure, and Attention to Wound (93.59)  421 (2.1)       2020 (1.6)
Injection of Anesthetic into Peripheral Nerve for Analgesia (04.81)  785 (3.9)  4950 (4.0)
Injection or Infusion of Other Therapeutic or Prophylactic Substance (99.29)  2388 (11.7)  8030 (6.4)
Insertion of Intraocular Lens Prosthesis at Time of Cataract Extraction, One Stage (13.71)  893 (4.4)  4134 (3.3)
Angiocardiography of Left Heart Structures (88.53)              2950 (14.5)     14593 (11.7)
Other Incision with Drainage of Skin and Subcutaneous Tissue (86.04)  1183 (5.8)  7894 (6.3)
Social ecological variables
Deprivation Index                                               106.0 (100.2, 109.3)    105.0 (99.3, 108.9)
Median Household Income                                         44962 (37110, 53841)    46432 (36560, 54680)
Household Income                                                10611 (7508, 14331)     11483 (7825, 14671)
2nd highest 1-hour measurement of CO in the year                1.9 (1.7, 2.3)          2.2 (1.7, 2.3)
2nd highest 8-hour average measurement of CO in the year        1.3 (1.1, 1.6)          1.4 (1.3, 1.9)
PM2.5                                                           8.1 (7.6, 8.1)          8.0 (7.8, 8.5)
Highest daily AQI value in the year                             119 (100, 147)          133 (101, 147)

In addition, we examined the frequencies of rheumatic diseases in the two groups. Table 2-3 shows the results. Overall, the RD group had higher frequencies of most of these diseases. Interestingly, RA patients had higher rates of diffuse diseases of connective tissue (ICD-9 code: 710), Reiter's disease (ICD-9 code: 99.3), and rheumatic fever without heart involvement (ICD-9 code: 390), even though these codes were used to define the RD group.

Table 2-3.
Frequencies of rheumatic diseases in the two groups
Rheumatic diseases                                                                      RA n (%)       RD n (%)       adjusted p value
Diseases of the musculoskeletal system and connective tissue
Diffuse diseases of connective tissue (710)                                             1542 (9.9)     649 (7.0)      <.0001
Infectious arthropathies (711)                                                          215 (1.4)      116 (2.0)      0.3599
Osteoarthritis and allied disorders (715)                                               5499 (35.4)    2186 (37.5)    <.0001
Other/unspecified arthropathies (e.g. allergic arthritis, transient arthropathy) (716)  1412 (1.3)     524 (9.0)      <.0001
Other/unspecified joint disorders (e.g. joint effusion, walking difficulty) (719)       871 (5.6)      375 (6.4)      <.0001
Ankylosing spondylitis/inflammatory spondylopathies (720)                               140 (0.9)      60 (1.0)       <.0001
Spondylosis and allied disorders (721)                                                  1041 (6.7)     415 (7.1)      <.0001
Polymyalgia rheumatica (725)                                                            350 (2.3)      142 (2.4)      <.0001
Peripheral enthesopathies and allied conditions (726)                                   369 (2.4)      159 (2.7)      <.0001
Other disorders synovium/tendon/bursa (e.g. tenosynovitis, synovial cyst) (727)         367 (2.4)      176 (3.0)      0.0256
Disorders of muscle/ligament/fascia (728)                                               598 (3.9)      253 (4.3)      <.0001
Rheumatism (729)                                                                        2587 (16.7)    1089 (18.7)    <.0001
Selected codes from other ICD-9-CM chapters
Gonococcal infection of joint (98.5)                                                    0              0              0.4595
Reiter's disease (99.3)                                                                 7 (0.05)       0              0.133
Behçet's syndrome (136.1)                                                               9 (0.06)       5 (0.1)        0.1107
Gout (274)                                                                              865 (5.6)      352 (6.0)      <.0001
Allergic purpura (287.0)                                                                3 (0.02)       2 (0.03)       0.3319
Cauda equina syndrome (344.6)                                                           16 (0.1)       8 (0.1)        <.0001
Brachial plexus lesion/thoracic outlet syndromes (353.0)                                7 (0.05)       3 (0.05)       0.0195
Carpal tunnel syndrome (354.0)                                                          72 (0.5)       30 (0.5)       0.0019
Tarsal tunnel syndrome (355.5)                                                          0              0              0.2147
Polyneuropathy collagen vascular disease (357.1)                                        14 (0.09)      10 (0.2)       <.0001
Rheumatic fever no heart involvement (390)                                              7 (0.05)       1 (0.02)       0.004
Cerebral arteritis (437.4)                                                              6 (0.04)       3 (0.05)       0.5078
Raynaud's syndrome (443.0)                                                              145 (0.9)      57 (1.0)       <.0001
Polyarteritis nodosa and allied conditions (446)                                        189 (1.2)      89 (1.5)       <.0001
Arteritis, unspecified (447.6)                                                          74 (0.5)       46 (0.8)       <.0001
Psoriatic arthropathy (696.0)                                                           219 (1.4)      80 (1.4)       <.0001

In order to build a multi-domain prediction model of RA diagnosis, we fitted a number of models on selected input domains as specified in the Methods section. Table 2-4 summarizes the performance indices for these models.
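The adjusted p-values reported for the group comparisons rest on chi-square tests followed by Bonferroni correction. As a minimal sketch of that arithmetic, assuming a 2x2 table and Pearson's statistic without continuity correction (the counts below are hypothetical, not taken from the tables), the calculation can be written with the standard library alone:

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the table [[a, b], [c, d]] (no continuity correction).
    Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical counts: 30 of 100 cases vs 10 of 100 controls carry a code
stat, p = chi_square_2x2(30, 70, 10, 90)
adjusted = bonferroni([0.01, 0.04, 0.5])
```

With several hundred codes tested at once, the Bonferroni multiplier is large, so only strong differences remain below 0.05 after adjustment.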
We first fitted multivariable models on single and merged domains, using the top variables for each domain (200 variables, optimized through a bootstrapped grid search). In terms of AUROC, when using variables from all domains on the
whole data set, LogitBoost (containing 10 variables after feature selection) performed the best, followed by the decision tree and random forest. The model using one rule (median household income) had the worst predictivity. Interestingly, the model using random forest had the highest sensitivity (0.694), while the model using one rule had the highest specificity (0.991). The model using LogitBoost had fairly good values for both sensitivity (0.685) and specificity (0.662). For the single-domain analysis, we then used LogitBoost. Figure 2-1 shows the discriminatory power of each domain and of the domains together. When distinguishing RA from RD, the domain with the highest discriminatory power was the social ecological domain (AUROC 69.3%), followed by clinical diagnoses (65.4%), demographics (64.7%), and clinical procedures (58%). Merging the domains improved the overall performance by approximately 5%, yielding an AUROC of 73.5%. Notably, the models with the clinical diagnosis or procedure domain had the highest specificity (0.681 and 0.685 respectively), while the model with the demographic domain had the highest sensitivity (0.747). The model with all domains and the model with the social ecological domain achieved a relatively good balance between sensitivity and specificity.

Table 2-4. Performance of prediction models
Model          Domain(s)            AUROC (95% CI)        Sensitivity  Specificity  Cutoff
Decision tree  All                  0.702 (0.698, 0.706)  0.652        0.66         0.123
Random forest  All                  0.695 (0.691, 0.699)  0.655        0.628        0.152
LASSO          All                  0.622 (0.618, 0.626)  0.533        0.655        0.15
One rule       All                  0.522 (0.520, 0.524)  0.052        0.991        0.152
LogitBoost     All                  0.735 (0.731, 0.739)  0.685        0.662        0.139
LogitBoost     Social ecological    0.693 (0.689, 0.697)  0.666        0.644        0.139
LogitBoost     Clinical diagnoses   0.654 (0.650, 0.658)  0.603        0.634        0.127
LogitBoost     Demographics         0.647 (0.643, 0.651)  0.631        0.579        0.139
LogitBoost     Clinical procedures  0.580 (0.575, 0.584)  0.472        0.663        0.121
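The three performance indices in Table 2-4 can be computed directly from predicted probabilities and a cutoff. The sketch below is illustrative only: the labels and scores are hypothetical, the cutoff 0.139 is borrowed from the table purely as an example value, and AUROC is computed as the proportion of case-control pairs in which the case receives the higher score (ties counted as one half).

```python
def sens_spec(y_true, scores, cutoff):
    """Sensitivity and specificity when scores >= cutoff are called positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= cutoff)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < cutoff)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < cutoff)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

def auroc(y_true, scores):
    """Probability that a random case scores higher than a random control."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = RA, 0 = RD) and predicted probabilities
y = [1, 1, 1, 0, 0, 0, 0]
prob = [0.30, 0.20, 0.10, 0.25, 0.12, 0.08, 0.05]

sensitivity, specificity = sens_spec(y, prob, cutoff=0.139)
area = auroc(y, prob)
```

Because RA cases make up a minority of the sample, the cutoffs in Table 2-4 sit well below 0.5; moving the cutoff trades sensitivity against specificity without changing the AUROC.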
Figure 2-1. Model comparison via AUROC (Left panel: model comparison; Right panel: domain comparison)

The final choice of model is the LogitBoost model with all input domains; a summary of the model is displayed in Table 2-5. A total of 12 variables were selected in the final model, including demographic variables (age, gender), clinical variables (4 different diagnoses and CCI), and social ecological variables (4 air quality measurements and the area deprivation index).

Table 2-5. Final multi-domain model from the LogitBoost algorithm
Variable                                         OR (95% CI)         p value
Age                                              1.01 (1.01, 1.02)   <0.0001
Gender (female)                                  2.37 (2.24, 2.52)   <0.0001
(710) Diffuse diseases of connective tissue      3.33 (2.97, 3.73)   <0.0001
(715) Osteoarthrosis and allied disorders        1.40 (1.31, 1.48)   <0.0001
(716) Other and unspecified arthropathies        2.08 (1.94, 2.22)   <0.0001
(719) Other and unspecified disorders of joint   1.42 (1.34, 1.51)   <0.0001
Max AQI of the year                              0.97 (0.97, 0.98)   <0.0001
NO2 annual mean                                  0.92 (0.89, 0.96)   <0.0001
PM10 annual mean                                 1.12 (1.09, 1.13)   <0.0001
2nd highest 8-hour CO measurement                0.40 (0.34, 0.47)   <0.0001
Area Deprivation Index                           1.00 (0.99, 1.01)   0.1192
CCI                                              0.99 (0.98, 1.01)   0.5503

Discussions

In this study, we developed a compact model to predict rheumatoid arthritis (RA) utilizing a large database from the state of Florida. The final model we built showed fair performance, with an AUROC of 0.74, sensitivity of 0.68, and specificity of 0.65 at the optimal cut
point. The results of our analysis demonstrated that by integrating multiple information domains, we are able to improve the accuracy and utility of a prediction model compared to models with individual domains, and some interesting findings warrant further investigation.

The current diagnostic criteria for RA rely heavily on clinical symptoms and blood test results, especially the anti-CCP test. The sensitivity and specificity of the test range from 53% to 71% and from 95% to 96%, respectively 3,14. Thus, there is still a need to improve the predictivity of RA diagnosis, especially for early diagnosis, and there is no tool for prediction in the at-risk population. In this study, we compared the characteristics of RA patients with those of a selected population who had RD. This comparison group may be more prone to RA. Although the overall predictivity of our model is not good enough for it to be used for disease prediction or diagnosis by itself, it can be used to complement differential diagnosis. The combination of our model with lab test results and other domains should yield more accurate predictions of RA.

In addition to its potential use for RA prediction, we also identified several interesting variables associated with risk of RA in our analysis, including clinical diagnoses such as post-inflammatory pulmonary fibrosis and anemia, socio-ecological variables such as deprivation index and income level, and clinical procedures such as blood transfusion. These variables occurred or were collected before or at the same time as the diagnosis of RA, so they can provide important knowledge on concurrent and prodromal risk factors or comorbidities of RA. RA is a chronic inflammatory disorder, and previous studies suggested that systemic inflammation and autoimmunity in RA usually begin years before the onset of detectable joint inflammation and other symptoms 47,48.
Therefore, diseases and procedures that are usually associated with inflammation tend to be more prevalent in RA patients. For example, anemia usually occurs in people who have ongoing inflammation in their body, including anemia of chronic inflammation
and iron deficiency anemia 49, which is consistent with our observation that multiple ICD-9 codes for anemia are positively associated with RA. The higher frequency of anemia in the RA group is also related to the higher frequency of blood transfusion procedures (transfusion of packed cells) 50. RA patients tend to have increased concentrations of homocysteine and leptin 51, which could lead to higher frequencies of heart conditions and heart procedures. Lung diseases such as pulmonary fibrosis are common extra-articular manifestations of RA, and inflammation could also increase the risks of these diseases 52. In addition, an abnormal immune system can trigger RA and lead to lasting damage to cells in the body; combined with the immunosuppressant medications that are generally taken by RA patients, this can make patients prone to infections. However, in our analysis, some variables (e.g. asthma) were positively associated with RA for reasons that are not fully understood. Future studies could verify these findings and seek to explore the biological mechanisms behind these associations.

Perhaps the most intriguing finding of this study is that the socio-ecological domain is the strongest predictor of RA among all information domains. Lower income level, lower socio-economic status, and different air quality measures were observed in the RA group. Previous studies have demonstrated that some social ecological factors may contribute to a higher risk of RA, and residence at certain latitudes has been associated with elevated risk for RA. Compared to women who lived in the West, those who lived in the East were at greater risk for RA (RR 1.36; 95% CI 1.01, 1.84) after adjusting for potential confounders 38. They also found that women living within 50 meters of a road had an elevated risk for RA (HR: 1.31; 95% CI, 0.98-1.74) compared with those living 200 meters or farther away from a road, but no statistically significant difference was observed in risks for RA with
exposure to the different pollutants 53. In our analysis, the model with social ecological variables (including socio-economic status and air quality measures) has the strongest predictivity among all input domains tested, which is consistent with previous reports and indicates that environmental factors may play important roles in the pathogenic mechanisms of RA. Future studies could focus on examining the independent contribution of these socio-ecological variables.

The final model of choice was a multi-domain model derived using the LogitBoost algorithm (Table 2-5). In this model, female gender (OR=2.37), older age (OR=1.01), and several other types of arthropathy or arthritis were positively associated with RA. Consistent with our findings, previous studies have reported that females and older individuals are more likely to have RA than males and younger individuals 6,33,34, and that patients with other connective tissue diseases or rheumatoid conditions are at higher risk for RA than those without these conditions 48,54. Our observation suggests that RA is closely related to RDs. In addition, several socio-ecological variables were identified as strong predictors of RA: a higher value of PM10 was associated with higher risk of RA (OR=1.12), while other variables such as NO2 and CO measurements were negatively associated with the occurrence of RA. The mechanisms of the associations between RA and these air pollutants are not fully understood, and previous studies have yielded inconsistent results 55-58. In this analysis, we included only a limited number of social ecological indicators (especially air quality measures) and made broad assumptions about these characteristics over time and space, which may introduce misclassification given their large temporal variations. However, given the strong predictive signal of the social ecological domain, a more thorough analysis is warranted.
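Because odds ratios act multiplicatively, an OR close to 1 for a continuous predictor can still correspond to a sizeable effect across its range. The short sketch below works through that arithmetic using the reported ORs for age (1.01) and female gender (2.37). It assumes that the age OR is expressed per one year, which the table does not state, and it uses a hypothetical baseline probability of 0.14 purely for illustration.

```python
import math

def scale_or(odds_ratio, delta):
    """Odds ratio for a change of `delta` units, given the per-unit odds ratio."""
    return math.exp(math.log(odds_ratio) * delta)

def apply_or(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by an odds ratio."""
    odds = baseline_prob / (1 - baseline_prob) * odds_ratio
    return odds / (1 + odds)

# Age: OR 1.01 per unit (assumed to be one year) compounded over 10 and 30 units
or_age_10 = scale_or(1.01, 10)
or_age_30 = scale_or(1.01, 30)

# Female gender: OR 2.37 applied to a hypothetical baseline probability of 0.14
prob_female = apply_or(0.14, 2.37)
```

Under these assumptions, a 30-unit difference in age corresponds to an odds ratio of roughly 1.35, and an odds ratio of 2.37 moves a baseline probability of 0.14 to roughly 0.28, so an odds ratio should not be read as a ratio of probabilities.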
There are some limitations in this study. First, RA and RD were defined solely by ICD-9 codes, since our data contain no information on prescriptions or lab markers. We used an
existing definition of RA in EHR settings 59,60 in our study: if a patient had two separate ICD codes for RA/RD from two distinct visits within 6 months, the patient was categorized into the respective group. Previous studies showed that such a classification had 90% sensitivity and about 60% positive predictive value (PPV) 2,3,14, which indicates that there may be misclassification in our study. Future studies that use a more definitive definition of RA/RD are needed. Second, we used only ICD-9 codes for clinical diagnoses and procedures. A strength of using ICD-9 is that, under a purely predictive approach, these codes can be pulled automatically from an EHR, so that the models can be directly applied to external EHR databases. However, if we want to fully understand and interpret how these predictors work from an actionable point of view, clinical interpretation of these codes is needed. In this analysis, we utilized the hierarchical structure of the ICD-9 ontology: when an identified 3-digit code contained a broad range of conditions, we used the 5-digit codes under it to identify more meaningful conditions. In addition, the translational relevance of using ICD-9 exclusively is limited, as the codes need to be translated into other ontologies (such as ICD-10). Strategies for integrating ICD-9 with other ontologies are needed in the future. The findings from this work may also need additional semantic integration strategies to link the ICD codes to free-text information domains. For example, ICD code 338 (pain, not elsewhere classified) was identified as an important predictor in our analysis, and future studies could link this ICD code with free-text descriptions of pain.

Despite these limitations, this study provided an algorithm that has the potential to improve the differential diagnosis and early prediction of RA and that can be used in conjunction with other diagnostic information. We also found a significant contribution of the socio-ecological domain; future studies examining independent socio-ecological and environmental risk factors and predictors of RA are needed.
CHAPTER 3
CLINICAL CORRELATES OF RHEUMATOID ARTHRITIS USING A STATISTICALLY ROBUST FRAMEWORK

Introduction

Rheumatoid arthritis (RA) is a chronic inflammatory disorder that carries a substantial healthcare burden and impairs quality of life. The prevalence of RA in the United States was estimated to be 0.53 to 0.55% in 2014, with a higher prevalence in females than in males (0.73-0.78% vs 0.29-0.31%) 33. While there is currently no cure for RA, physiotherapy and medication are usually used to treat it. Disease-modifying antirheumatic drugs (DMARDs), along with other medications and lifestyle changes, could slow the disease's progression. It has been reported that in an insured RA patient population, only two thirds of newly diagnosed RA patients were prescribed a DMARD in year one and 28% received no antirheumatic therapy 61. Because of the lack of effective treatment and the natural progression of the disease itself, patients who have RA are prone to many health consequences. Previous studies have demonstrated that RA patients are at higher risk for a variety of conditions, including persistent pain, functional disability, fatigue, and depression 62-66. While these studies compared individuals with RA to control groups with respect to specific mortality or the risk of a single disease, a comprehensive study of all morbidity has not been carried out. The objective of this study is to identify clinical consequences of RA by assessing the independent association between RA and the risk of diagnosis of specific clinical comorbidities among individuals who are at high risk of RA.

Methods

This study was reviewed by the institutional review board under protocol no. IRB201701906 and classified as exempt.
The Healthcare Cost and Utilization Project (HCUP) State Inpatient Databases (SID), State Ambulatory Surgery and Services Databases (SASD), and State Emergency Department Databases (SEDD) for the state of Florida, USA, between 2005 and 2014 were used 29. The SID, SASD, and SEDD contain anonymized, longitudinally linked inpatient, outpatient, and emergency room visit data. From the Florida SID/SASD/SEDD, patients with RA and RD were identified using definitions adapted from the National Arthritis Data Workgroup (NADW) definition of arthritis and other rheumatic conditions, which uses a set of ICD-9 codes. Specifically, the cases were patients diagnosed with RA (ICD-9: 714.0), and the controls were patients diagnosed with other rheumatic diseases (RD) as defined in the NADW list; RA and RD were considered confirmed when the corresponding diagnostic code was recorded at two distinct visits. In addition, we excluded patients whose first RA or RD was diagnosed before 2010.

To each patient we associated their other diagnoses before and after the diagnosis of RA/RD using three-digit ICD-9 codes. The Charlson comorbidity index (CCI) 40,41 and the rheumatic disease comorbidity index (RDCI) 67 were calculated for each individual using ICD-9 codes recorded prior to the RA/RD diagnosis. Diagnostic codes that occurred in less than 1% of the RA group were removed.

To ensure that the two groups had comparable overall health, propensity score matching was used, with the two groups matched on the propensity score at a 1:3 ratio. Specifically, we first fitted a logistic regression using a disease (e.g. essential hypertension) as the outcome variable and CCI and RDCI as covariates to generate the logistic probabilities, or propensity scores, for each observation. Next, each patient in the RA group was matched to 3 patients in the RD group based on their propensity scores to
generate the study sample population. The association between RA status and disease comorbidities was assessed by multivariable logistic regression, with RA status as the independent variable and each ICD-9 code recorded after the occurrence of RA/RD as the outcome. The models were adjusted for age, gender, race, insurance status, CCI, RDCI, and the status of the corresponding ICD-9 code prior to diagnosis. After fitting all multivariable logistic models, we ranked the models by the adjusted odds ratio of RA, the Akaike information criterion (AIC) [68], and the area under the receiver operating characteristic curve (AUROC) [69], each according to its own criterion. Results from these three measurements were combined to create a more robust measurement for identifying important consequences. All statistical analyses were conducted using R [14].

Results

The 2005-2014 HCUP SID/SASD/SEDD contained a total of 21,091,289 patients after data merging and cleaning. A total of 19,609 RA patients and 58,827 RD patients were identified after propensity score matching, and a total of 237 three-digit diagnostic codes were included in the analysis (all above 1% frequency). Table 3-1 displays population characteristics before the propensity score matching.

Table 3-1. Population characteristics of the two groups before propensity score matching

Characteristic                                          RA                   RD
Total, N                                                20351                125161
Female, N (%)                                           15533 (76.3)         72746 (58.1)
Race: White, N (%)                                      15012 (73.4)         81335 (65.0)
Race: African American, N (%)                           2755 (13.5)          25243 (20.2)
Ethnicity: Hispanic, N (%)                              2145 (10.5)          14922 (11.9)
Race: Asian or Pacific Islander, N (%)                  105 (0.5)            796 (0.6)
Race: Native American, N (%)                            15 (0.07)            129 (0.1)
Race: Other, N (%)                                      320 (1.6)            2736 (2.2)
Age, median (interquartile range, IQR)                  61 (47, 72)          55 (41, 67)
Year of RA/RD diagnosis (or baseline), median (IQR)     2012 (2011, 2013)    2012 (2011, 2013)
CCI (IQR)                                               1 (0, 2)             0 (0, 2)
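The 1:3 matching step described in the Methods can be illustrated with a minimal sketch. The analyses in this chapter were run in R; the Python below is only an illustration. It assumes that propensity scores have already been estimated, and it assumes greedy nearest-neighbour matching without replacement and without a caliper, neither of which is stated in the text. The function name `match_one_to_k`, the patient identifiers, and the toy scores are hypothetical.

```python
def match_one_to_k(case_scores, control_scores, k=3):
    """Greedy nearest-neighbour matching without replacement (an assumed scheme).

    case_scores / control_scores: dicts mapping patient id -> propensity score.
    Returns a dict mapping each matched case id to a list of k control ids.
    """
    available = dict(control_scores)  # controls that have not been used yet
    matches = {}
    # Process cases in order of increasing score so the result is deterministic.
    for case_id, score in sorted(case_scores.items(), key=lambda item: item[1]):
        if len(available) < k:
            break  # too few controls left to complete another 1:k set
        nearest = sorted(
            available,
            key=lambda cid: (abs(available[cid] - score), cid),
        )[:k]
        matches[case_id] = nearest
        for cid in nearest:
            del available[cid]  # without replacement
    return matches


# Hypothetical toy scores: two RA cases and seven RD controls.
cases = {"ra1": 0.30, "ra2": 0.70}
controls = {
    "rd1": 0.28, "rd2": 0.31, "rd3": 0.35,
    "rd4": 0.66, "rd5": 0.69, "rd6": 0.72, "rd7": 0.95,
}
matched = match_one_to_k(cases, controls, k=3)
```

In this toy example each case receives the three controls with the closest scores, and the control with score 0.95 is left unmatched. A production analysis would normally also check covariate balance after matching, as Tables 3-1 and 3-2 do.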
Table 3-2. Population characteristics of the comparison groups after propensity score matching

Characteristic                                          RA                   RD
Total, N                                                19609                58827
Female, N (%)                                           11063 (56.4)         33014 (56.1)
Race: White, N (%)                                      15095 (77.0)         45027 (76.5)
Race: African American, N (%)                           2186 (11.2)          6744 (11.5)
Ethnicity: Hispanic, N (%)                              1786 (9.1)           5510 (9.4)
Race: Asian or Pacific Islander, N (%)                  117 (0.6)            317 (0.5)
Race: Native American, N (%)                            18 (0.1)             72 (0.1)
Race: Other, N (%)                                      349 (1.8)            1017 (1.7)
Age, median (interquartile range, IQR)                  67 (54, 78)          67 (54, 78)
RDCI (IQR)                                              3 (2, 5)             3 (2, 5)
CCI (IQR)                                               3 (1, 4)             3 (1, 4)

After the propensity score matching, the two groups have similar ranges of RDCI and CCI, indicating that the two groups of patients have similar overall health. Interestingly, demographic variables such as gender, race, and age were also similar in the two groups, suggesting that these variables may be associated with CCI and RDCI and that matching on CCI and RDCI controlled for them to some degree. Table 3-2 displays the population characteristics of the two groups after propensity score matching. Among the RA group, the most common diagnosis after baseline was essential hypertension (401): more than 70% of RA patients were diagnosed with essential hypertension after the occurrence of RA. Figure 3-1 shows the 10 most frequent diagnoses in the RA group and their frequencies in the RD group. Among these ten ICD-9 codes, only essential hypertension (401) and osteoarthrosis and allied disorders (715) had higher percentages in the RA group (72.4% and 37.5%, respectively) than in the RD group (71.8% and 33.7%, respectively).
Figure 3-1. The top 10 diagnoses in the RA group after diagnosis

Next, the potential clinical consequences of RA were identified by assessing the associations between RA and diagnoses made after RA/RD. Figure 3-2 shows the top 20 ICD-9 codes identified using each of the three distinct measurements: AIC, AUROC, and absolute coefficient. To ensure the robustness of our analysis, ICD-9 codes that were identified by two or more measurements were considered important consequences of RA. A total of 131 variables were associated with RA at a significance level of 0.05 after Bonferroni adjustment, and 31 of these variables were positively associated with RA. Table 3-3 displays the top overlapping variables in each comparison and their AIC, AUROC, and odds ratio values. Among these top variables, eight are positively associated with RA: Disorders of lipoid metabolism (272), Disorders of fluid electrolyte and acid base balance (276), Other and unspecified anemias (285), Essential hypertension (401), Postinflammatory pulmonary fibrosis (515), Diffuse diseases of connective tissue (710), Other disorders of bone and cartilage (733), and Fracture of neck of femur (820).
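The selection rule described above (rank the fitted models by each of the three measurements, keep the top codes per measurement, and retain codes named by at least two of them) can be sketched as follows. This is an illustrative Python sketch rather than the R code used for the analysis. The function names are hypothetical, and the direction of each ranking (lower AIC, higher AUROC, larger absolute log odds ratio) is an assumption consistent with the Discussion rather than a stated detail. The toy input reuses five rows of Table 3-3 with a top-2 cutoff instead of the top-20 cutoff used in the study.

```python
import math


def top_n(scores, n, largest):
    """Return the set of n codes with the largest (or smallest) score."""
    ranked = sorted(scores, key=scores.get, reverse=largest)
    return set(ranked[:n])


def consensus_codes(aic, auroc, odds_ratio, n=20, min_votes=2):
    """Codes that appear in at least min_votes of the three top-n lists."""
    # The absolute logistic regression coefficient equals |log(OR)|.
    abs_coef = {code: abs(math.log(value)) for code, value in odds_ratio.items()}
    top_lists = [
        top_n(aic, n, largest=False),      # assumed: lower AIC ranks higher
        top_n(auroc, n, largest=True),     # assumed: higher AUROC ranks higher
        top_n(abs_coef, n, largest=True),  # larger |coefficient| ranks higher
    ]
    votes = {}
    for top in top_lists:
        for code in top:
            votes[code] = votes.get(code, 0) + 1
    return sorted(code for code, count in votes.items() if count >= min_votes)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni adjustment."""
    return alpha / n_tests


# Five rows taken from Table 3-3; n=2 keeps the example small.
aic = {"272": 88189.037, "401": 88177.187, "515": 88179.533,
       "583": 87998.634, "733": 88116.296}
auroc = {"272": 0.509, "401": 0.509, "515": 0.505, "583": 0.511, "733": 0.513}
odds_ratio = {"272": 1.19, "401": 1.24, "515": 1.48, "583": 0.37, "733": 1.41}

important = consensus_codes(aic, auroc, odds_ratio, n=2)
threshold = bonferroni_threshold(0.05, 237)  # 237 diagnostic codes were tested
```

With these five rows, codes 583 and 733 are each named by at least two of the three top-2 lists, and the Bonferroni threshold for 237 tests is roughly 0.00021.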
Figure 3-2. Venn diagram of top post-diagnosis clinical variables identified in multivariable analysis from the three measurements; the ICD-9 codes shown are positively associated with RA

Table 3-3. Top post-diagnosis clinical variables identified in multivariable analysis from the three measurements and their AIC, AUROC, and odds ratio values (bold indicates the variables are positively associated with RA)

Condition (ICD)                                                       AIC          AUROC    OR (95% CI)
Disorders of lipoid metabolism (272)                                  88189.037    0.509    1.19 (1.12, 1.25)
Disorders of fluid electrolyte and acid base balance (276)            88185.187    0.510    1.16 (1.10, 1.21)
Other and unspecified anemias (285)                                   88178.841    0.512    1.17 (1.12, 1.23)
Hereditary and idiopathic peripheral neuropathy (356)                 88145.595    0.510    0.69 (0.63, 0.75)
Inflammatory and toxic neuropathy (357)                               88105.426    0.510    0.59 (0.54, 0.66)
Essential hypertension (401)                                          88177.187    0.509    1.24 (1.17, 1.32)
Hypertensive heart and chronic kidney disease (404)                   88093.350    0.509    0.50 (0.44, 0.57)
Cardiomyopathy (425)                                                  88185.965    0.507    0.78 (0.72, 0.85)
Other peripheral vascular disease (443)                               88187.408    0.507    0.80 (0.74, 0.86)
Postinflammatory pulmonary fibrosis (515)                             88179.533    0.505    1.48 (1.32, 1.65)
Nephritis and nephropathy not specified as acute or chronic (583)     87998.634    0.511    0.37 (0.42, 0.43)
Disorders resulting from impaired renal function (588)                88008.293    0.510    0.70 (0.32, 0.43)
Hyperplasia of prostate (600)                                         88108.558    0.510    0.57 (0.51, 0.64)
Diffuse diseases of connective tissue (710)                           88135.741    0.507    1.95 (1.70, 2.23)
Other disorders of bone and cartilage (733)                           88116.296    0.513    1.41 (1.32, 1.50)
Fracture of neck of femur (820)                                       88177.313    0.507    1.46 (1.31, 1.63)
Complications peculiar to certain specified procedures (996)          88176.646    0.508    0.78 (0.72, 0.84)

Discussion

In this study, we used a large statewide database to explore the potential clinical consequences of RA. We found that RA could result in severe physical burdens and comorbidities; these results could be used to inform research on the clinical management and care of patients. In our analysis, compared to RD patients, RA patients tend to have higher risks of certain clinical conditions, such as hypertension and disorders of joints and tissues. To identify which of these clinical conditions have a stronger association with RA, we used a robust framework that determines the importance of the variables by combining three distinct measurements from regression models. The AIC estimates the relative information lost by a given model: the less information a model loses, the higher the quality of that model [68]. The AUROC measures the discrimination power of a model: a higher value indicates that the model predicts the outcome more accurately [69]. The odds ratio (OR) measures the magnitude of the association. For an association with OR > 1, a larger OR usually indicates a stronger association between the independent variable (predictor) and the dependent variable (outcome). The combination of these three measures provides a more robust overall estimate of the relationship.

Among all identified clinical conditions that are positively associated with the occurrence of RA, disorders of lipoid metabolism, disorders of fluid electrolyte and acid base balance, anemias, essential hypertension, postinflammatory pulmonary fibrosis, diffuse diseases of connective tissue, other disorders of bone and cartilage, and fracture of neck of femur were considered the most important ones using our approach. RA and its associated inflammation could lead to metabolic disturbances and inflammation of adipose tissue, which in turn leads to insulin resistance
and associated molecular disorders and abnormal lipid metabolism [70-72]. RA patients tend to have changes in the acid-base status of the synovial fluid, which could lead to fluid, electrolyte, and acid-base imbalance [73-75]. In addition, RA is usually associated with different types of anemia. Anemia can occur in RA patients who have ongoing inflammation, including anemia of chronic inflammation and iron deficiency anemia, and medications used to treat RA may sometimes also trigger anemia [76-78]. Consistent with our finding, previous studies have also demonstrated that RA patients are at high risk of hypertension. It is hypothesized that hypertension in RA patients is not associated with generalized systemic inflammation or insulin resistance but is associated with increasing concentrations of homocysteine and leptin [79,80]. The pathogenesis of hypertension in RA may involve pathways that are associated with fat and vascular homeostasis. RA could also cause pulmonary fibrosis as an extra-articular manifestation, due to inflammation [52,81,82]. A previous study has shown that high counts of RA antibodies are linked to the development of interstitial lung disease. Lastly, we found that, compared with RD patients, RA patients have higher risks of diseases of connective tissues and bones. Studies have shown increased risks of bone loss, fracture, and osteoporosis in individuals with RA. These associations may be explained by pain and loss of joint function caused by RA. Glucocorticoid medications that are often prescribed for RA treatment could also trigger significant bone loss [83,84]. These mechanisms of RA pathogenesis are supported by our study, and our findings suggest that targeted interventions should be given to RA patients to improve their treatment outcomes and quality of life.

There are several limitations in this work which merit discussion.
It is a retrospective analysis of an observational dataset from a single state, although Florida is the state with the third-largest population in the United States. In this study, RA and RD were defined based on ICD-9 codes
because the HCUP database does not include pharmacy or laboratory data. We adapted an existing definition of RA in EHR settings: if a patient had two separate ICD-9 codes for RA/RD from two distinct visits within six months, the patient was categorized into the respective group. It has been reported that such a classification has 90% sensitivity and about 60% positive predictive value [85]. Therefore, our study may contain misclassification, and future studies that use serological tests to confirm RA/RD are needed. In addition, the use of ICD-9 codes alone may be problematic, as these codes were designed primarily for billing purposes and some codes are therefore not well specified from a clinical point of view.

In conclusion, we conducted a comprehensive analysis of RA and the risk of clinical comorbidities using a large statewide clinical database. Several clinical conditions were identified as potential health consequences of RA, suggesting directions for further investigation into the mechanisms of RA and the management of RA patients.
CHAPTER 4
HETEROGENEITY OF OTHER RHEUMATIC DISEASES IN PATIENTS WITH RHEUMATOID ARTHRITIS: A CLUSTERING ANALYSIS

Introduction

Rheumatoid arthritis (RA) is a chronic inflammatory disorder that carries a substantial healthcare burden and impairs quality of life. The prevalence of RA in the United States was estimated to be 0.53 to 0.55% in 2014, with a higher prevalence in females than in males (0.73-0.78% vs 0.29-0.31%) [33]. Previous evidence has shown that RA is a heterogeneous disease in several ways. First, heterogeneous gene expression profiles have been observed in RA patients. For example, the gene expression signatures of synovial tissues from RA patients showed significant variability in certain pathways and genetic alleles, resulting in the identification of molecularly distinct forms of RA tissues [86,87]. Second, based on the response to the two main serological tests, RA patients can be categorized into different phenotypes: RF positive and RF negative (from the RF test), and anti-CCP positive and anti-CCP negative (from the anti-CCP test). Third, RA patients report variation in the presence, location, and severity of symptoms. Physicians use symptoms, blood antibody tests, and other laboratory tests, along with X-rays, to make a diagnosis of rheumatoid arthritis, which may be a starting point from which to further refine the diagnosis, determine phenotypes and endotypes, and adjust therapy accordingly. However, few studies have investigated other aspects of RA that are related to the observed heterogeneity. Rheumatic disease (RD) is a class of conditions that cause inflammation of connecting and supporting structures such as joints, tendons, and bones. The symptoms of RDs include chronic, intermittent pain in joints and connective tissue, and these diseases can ultimately cause loss of function in those body parts.
In this study, we aim to assess the heterogeneity of other rheumatic diseases (RD) in patients with RA by using several different unsupervised statistical learning
techniques. The goal is to explore whether RA patients can be clustered by their prodromal or concurrent RDs.

Methods

This study has been approved by the University of Florida under protocol no. IRB201701906 as exempt. The Florida HCUP SID, SASD, and SEDD from 2005 to 2014 were used [29], which include inpatient, outpatient (ambulatory), and emergency department demographics, insurance status, hospital information, diagnoses, and procedures encoded via the ICD-9 ontology. Adult patients whose first RA diagnosis was made after 2010 were included. A diagnosis of RA was considered confirmed when the corresponding ICD-9 code (714.0) was recorded at two distinct visits within 6 months. To each patient, we associated all of their previous diagnoses of other rheumatic diseases (RD). The definitions of RDs were adapted from the National Arthritis Data Workgroup (NADW) list, using a set of ICD-9 codes (including Diffuse diseases of connective tissue (710), 711, 712, 713, Osteoarthrosis and allied disorders (715), Other and unspecified arthropathies (716), Other and unspecified disorders of joint (719), 720, Spondylosis and allied disorders (721), 725, Peripheral enthesopathies and allied syndromes (726), Other disorders of synovium tendon and bursa (727), Disorders of muscle ligament and fascia (728), Other disorders of soft tissues (729.0, 729.1, 729.4), 095.6, 095.7, 098.5, 099.3, 136.1, Gout (274), 277.2, 287.0, 344.6, 353.0, 354.0, 355.5, 357.1, 390, 391, 437.4, 443.0, 446, 447.6, 696.0). RDs with low frequency
ICD-9 codes) [40,41] were associated with each patient. In addition, patients who had no RD diagnosis recorded were excluded.

To explore the patterns of RDs in the RA population, several unsupervised statistical learning techniques were applied, including k-means clustering (KM), the expectation maximization (EM) algorithm, and hierarchical clustering. K-means clustering assigns observations to k clusters based on the distance between each observation and the nearest cluster mean; the distance is calculated using the features associated with the observations [88]. In this study, binary variables for the status of different RDs were used, and several values of k were tested. EM clustering, on the other hand [89], calculates the probabilities of cluster membership based on one or more probability distributions and then assigns a probability distribution to each observation. With EM, the number of clusters can be determined automatically by cross-validation. Hierarchical clustering examines the similarities between observations [90]: it starts by treating each observation as a separate cluster and then repeatedly merges the two clusters that are closest together until all clusters have been identified. In our analysis, hierarchical clustering was used to cluster the RDs using a binary distance matrix. The results from the different clustering methods were compared to see whether consistent patterns were observed. Specifically, results from EM clustering and k-means clustering were compared directly to examine the class assignments of patients, and agreement was quantified with the adjusted Rand index. Results from EM clustering and k-means clustering were also compared with results from hierarchical clustering to examine whether the disease clusters were consistent. Next, the class assignments identified from EM clustering or k-means clustering were used to examine whether these classes are associated with different health outcomes and demographic variables. Univariate analysis and multinomial logistic regression were used.
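As a concrete illustration of the k-means step described above, the following minimal Python sketch clusters patients represented as binary RD indicator vectors. It is not the R code used for the analysis. The farthest-point initialization is an assumption made only to keep the example deterministic (the text does not state how centroids were initialized), and the function names and the six toy patients are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumed choice)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every patient to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical patients; columns are indicators for four RDs (1 = diagnosed).
patients = [
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],
    [0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1],
]
labels, centroids = kmeans(patients, k=2)
```

Because the inputs are 0/1 indicators, each centroid coordinate is the proportion of patients in that class who have the corresponding RD, which is the kind of per-class frequency profile reported in the Results.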
In the multinomial regression model, class assignment was used as the dependent variable, and demographic variables (age, sex, and race) and selected common comorbidities of RA (anemia, postinflammatory pulmonary fibrosis, and pain) were used as independent variables. All statistical analyses were conducted using R.

Results

After data merging and cleaning, a total of 20,351 RA patients were eligible to be included in this analysis. Table 4-1 displays the demographic characteristics and total RD frequencies in this population. The median age is 61, and the median CCI is 1. A large proportion of RA patients are female (76.3%), and the race/ethnicity group with the largest percentage is White (73.8%), followed by Black (13.5%) and Hispanic (10.5%). The most prevalent RD in this population is Osteoarthrosis and allied disorders: more than one third (35.8%) of the patients have had Osteoarthrosis and allied disorders, followed by Other disorders of soft tissues (31.5%), Other and unspecified disorders of joint (26.8%), and Other and unspecified arthropathies (23.1%).

Table 4-1. Population characteristics

Variable                                              Value
Age (median, IQR)                                     61 (47, 72)
Diagnosis year (median, IQR)                          2012 (2011, 2013)
CCI (median, IQR)                                     1 (0, 2)

Variable                                              N (%)
Female                                                15533 (76.3)
Race
  White                                               15012 (73.8)
  Black                                               2755 (13.5)
  Hispanic                                            2145 (10.5)
  Asian or Pacific Islander                           104 (0.5)
  Native American                                     15 (0.1)
  Other                                               320 (1.6)
Diffuse diseases of connective tissue                 1376 (6.8)
Osteoarthrosis and allied disorders                   6674 (35.8)
Other and unspecified arthropathies                   4708 (23.1)
Other and unspecified disorders of joint              5461 (26.8)
Spondylosis and allied disorders                      2469 (12.1)
Peripheral enthesopathies and allied syndromes        1943 (9.6)
Other disorders of synovium tendon and bursa          2130 (10.5)
Disorders of muscle ligament and fascia               1355 (6.7)
Other disorders of soft tissues                       6418 (31.5)
Gout                                                  1097 (5.4)
Carpal tunnel syndrome                                1147 (5.6)

We used hierarchical clustering (HC) to examine how the RDs cluster with each other. Figure 4-1 displays the dendrogram from HC; it shows the hierarchical relationship between the clusters, as well as the Approximately Unbiased (AU) p-value and Bootstrap Probability (BP). For a cluster with AU p-value > 0.95, the hypothesis that "the cluster does not exist" is rejected. In our result, almost all clusters had AU p-value > 0.95, suggesting that there is evidence that these RDs have distinct clustering patterns. When we used the mean distance between any two RDs (0.90, se = 0.11) as the cut point, we did not observe a clear clustered structure, but a distinct pattern of how these RDs are related was observed. Peripheral enthesopathies and allied syndromes (ICD code: 726) and Other disorders of synovium tendon and bursa (ICD code: 727) were grouped together, while Spondylosis and allied disorders (ICD code: 721), Other and unspecified arthropathies (ICD code: 716), Osteoarthrosis and allied disorders (ICD code: 715), Other and unspecified disorders of joint (ICD code: 719), and Other disorders of soft tissues (ICD code: 729) were grouped together. ICD codes that cluster together have a relatively small distance between them, so the corresponding conditions are more likely to appear together. For example, an RA patient who has osteoarthrosis and allied disorders is more likely to have other and unspecified disorders of joint as well.

Next, we used EM clustering and k-means clustering to examine how patients were clustered based on their RD status. Using EM clustering, a 5-class solution was identified.
Figure 4-2 displays these five classes and the probability of different RDs in each of the classes. In this solution, class 1 (N = 12,256, 60.2%) has the lowest frequencies of all RDs, while class 5 (N = 954, 4.7%) has relatively higher frequencies of all RDs. Class 4 (N = 3,517, 17.3%) shows a similar trend to class 5, i.e., most of the RDs are present in this group of patients, but the patients who were classified in class 4 have lower probabilities of having the RDs. Class 3 (N = 2,175, 10.7%) and class 2 (N = 1,449, 7.1%) have a high prevalence of Osteoarthrosis and allied disorders but lower frequencies of other RDs, except Other and unspecified arthropathies. All the patients who were classified into class 2 had Other and unspecified arthropathies.

To directly compare the results from the different clustering methods, we set k = 5 in the k-means clustering. Figure 4-3 displays the five classes and the probability of different RDs in each class. Similar to the results from EM clustering, a class with very low frequencies of RDs was identified (class 1: N = 9,660, 47.5%), and a class with relatively high frequencies of all RDs was also identified (class 5: N = 1,566, 7.7%). Class 4 (N = 3,048, 15.0%) had slightly higher frequencies of most RDs compared with class 1 but had a much higher frequency of Other disorders of soft tissues. Finally, class 2 (N = 3,614, 17.8%) and class 3 (N = 2,463, 12.1%) had a higher frequency of Other disorders of soft tissues, and also had high frequencies of Osteoarthrosis and allied disorders and Other and unspecified disorders of joint, respectively.
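The Methods propose the adjusted Rand index (ARI) for quantifying agreement between the EM and k-means class assignments. A minimal sketch of that statistic is given below in Python for illustration (the analysis itself was run in R). The function name and the toy label vectors are hypothetical; the formula is the standard pair-counting form, in which 1 indicates identical partitions and values near 0 indicate agreement no better than chance.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same observations."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two observations: trivially identical

    # Pairs that fall in the same cell of the contingency table,
    # in the same class of partition A, and in the same class of partition B.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    same_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    same_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = same_a * same_b / total_pairs  # agreement expected by chance
    maximum = (same_a + same_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. every observation in one class
    return (same_both - expected) / (maximum - expected)


# Hypothetical assignments for six patients from two methods.
em_classes = [1, 1, 2, 2, 3, 3]
km_classes = [5, 5, 4, 4, 1, 1]   # same grouping, different class numbers
ari_same = adjusted_rand_index(em_classes, km_classes)
ari_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which patients are grouped together, it is unaffected by the arbitrary numbering of classes, which matters here since class 1 to class 5 in the EM solution need not correspond to the same numbers in the k-means solution.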
Figure 4-1. Dendrogram from hierarchical clustering showing how the RDs are correlated

Figure 4-2. Results from expectation maximization (EM) clustering showing how patients are grouped and the probability of having each RD

Figure 4-3. Results from k-means (KM) clustering showing how patients are grouped and the probability of having each RD

Next, we examined whether the patient class is associated with other aspects of RA patients using the class assignments from EM clustering and selected demographic factors and
comorbidities. Table 4-2 displays results from the univariate analysis. Patients in classes 1, 2, and 3 have similar ages, and they were older than patients in classes 4 and 5. Heterogeneity in race and gender was also observed across classes, although no distinctive pattern was present. Classes 4 and 5 have higher rates of almost all included comorbidities compared with the other classes, while class 1 tends to have lower rates of these comorbidities. Results from the multinomial regression showed similar findings. Compared with class 1, all other classes have higher odds of higher CCI, anemia, pain, and postinflammatory pulmonary fibrosis, with the largest OR observed for class 5. Interestingly, patients in class 2 have higher odds of being Hispanic, while patients in classes 2 and 5 are more likely to be Black. These findings suggest that the patient class assignments we identified may be associated with other aspects of heterogeneity in RA patients. Table 4-3 shows the adjusted odds ratios and 95% confidence intervals from the multinomial logistic regression model.

Table 4-2. Class assignment, demographic factors, and selected clinical conditions: bivariate analyses

                                       class 1         class 2         class 3         class 4         class 5         p value
N                                      12256           1449            2175            3517            954
median (IQR)
  Age                                  62 (49, 73)     63 (50, 73)     64 (53, 73)     53 (42, 66)     55 (46, 68)     <0.0001
  CCI                                  1 (0, 2)        2 (0, 3)        1 (0, 3)        1 (0, 2)        2 (1, 4)        <0.0001
N (%)
  White                                9108 (74.3)     1020 (70.4)     1706 (78.4)     2509 (71.3)     669 (70.1)      <0.0001
  Black                                1519 (12.4)     237 (16.4)      262 (12.1)      554 (15.8)      183 (19.2)
  Hispanic                             1342 (11.0)     166 (11.5)      173 (8.0)       378 (10.8)      86 (9.0)
  Asian or Pacific Islander            71 (0.6)        7 (0.5)         7 (0.3)         16 (0.5)        3 (0.3)
  Native American                      9 (0.1)         0               0               6 (0.2)         0
  Other                                207 (1.7)       19 (1.3)        27 (1.2)        54 (1.5)        13 (1.4)
  Female                               9386 (76.6)     1095 (75.6)     1462 (67.2)     2873 (81.7)     717 (75.2)      <0.0001
  Unspecified anemia                   2945 (24.0)     670 (46.2)      832 (38.3)      1207 (34.3)     497 (52.1)      <0.0001
  Pain                                 1522 (12.4)     1449 (7.1)      2175 (10.7)     3517 (17.3)     954 (100)       <0.0001
  Postinflammatory pulmonary fibrosis  233 (1.9)       58 (4.0)        55 (2.5)        113 (3.2)       36 (3.77)       <0.0001
Table 4-3. Adjusted odds ratios (aOR) and 95% confidence intervals from the multinomial logistic regression model (class 1 is used as reference)

                                       class 2              class 3              class 4              class 5
                                       OR (95% CI)          OR (95% CI)          OR (95% CI)          OR (95% CI)
Age                                    1.01 (1.00, 1.01)    1.01 (1.01, 1.01)    0.98 (0.97, 0.98)    0.99 (0.98, 1.00)
CCI                                    1.18 (1.14, 1.21)    1.10 (1.07, 1.13)    1.20 (1.17, 1.23)    1.22 (1.18, 1.26)
Black                                  1.35 (1.15, 1.58)    0.98 (0.85, 1.14)    1.02 (0.91, 1.14)    1.39 (1.15, 1.66)
Hispanic                               1.22 (1.02, 1.46)    0.77 (0.65, 0.91)    0.98 (0.86, 1.11)    0.98 (0.77, 1.25)
Asian or Pacific Islander              0.99 (0.45, 2.19)    0.60 (0.27, 1.31)    0.85 (0.48, 1.48)    0.69 (0.21, 2.23)
Native American                        n/a (sample size too small)
Other
Female                                 0.98 (0.86, 1.12)    0.66 (0.60, 0.73)    1.31 (1.19, 1.45)    0.90 (0.77, 1.06)
Unspecified anemia                     1.79 (1.58, 2.02)    1.58 (1.42, 1.75)    1.20 (1.10, 1.32)    1.94 (1.67, 2.25)
Pain                                   2.54 (2.23, 2.91)    1.70 (1.50, 1.93)    1.94 (1.76, 2.13)    4.95 (4.27, 5.75)
Postinflammatory pulmonary fibrosis    1.40 (1.03, 1.89)    0.98 (0.73, 1.33)    1.45 (1.14, 1.83)    1.27 (0.88, 1.84)

Discussion

In this study, we examined the heterogeneity of RDs among patients with RA utilizing a large data set with more than 20,000 RA patients and several robust unsupervised learning techniques. We explored the patterns of RD distributions in RA patients using EM and KM, and how the RDs cluster with each other using hierarchical clustering. The heterogeneity in the RA population is confirmed, as we identified multiple groups of patients using these methods, and some interesting observations from our analysis are worth further discussion.

From the hierarchical clustering, although a clear clustered structure was not identified, we did observe an interesting pattern of how these RDs relate to each other.
We found that Peripheral enthesopathies and allied syndromes (ICD code: 726) and Other disorders of synovium tendon and bursa (ICD code: 727) were grouped together, while Spondylosis and allied disorders (ICD code: 721), Other and unspecified arthropathies (ICD code: 716), Osteoarthrosis and allied disorders (ICD code: 715), Other and unspecified disorders of joint (ICD code: 719), and Other disorders of soft tissues (ICD code: 729) were grouped together. Importantly, the results from hierarchical clustering did not reflect the hierarchical structure of the ICD ontology. For example, Diffuse diseases of connective tissue (ICD code: 710) and Other and unspecified arthropathies (ICD code: 716) are in the same chapter of ICD-9 (Arthropathies and Related Disorders), but these two conditions are far apart in the hierarchical clustering.

When comparing results from EM to KM, similar trends in the identified patterns were observed. Both methods identified a large portion of patients who had a very low probability of other RDs and a smaller number of patients who had a generally high probability of having other RDs, and certain RDs were more likely to be diagnosed together in certain patients. For example, both EM and KM identified classes with a high probability of both Osteoarthrosis and allied disorders and Other disorders of soft tissues. Interestingly, results from EM and KM showed a cumulative pattern in the probability and prevalence of RDs across patient classes. The classes we identified include one class with the absence of any RDs; two classes with higher frequencies of one or two RDs; one class with almost every RD at low frequencies; and one class with almost every RD at high frequencies. These observations suggest that these classes may be the result of disease progression and the accumulation of RDs. Previous studies have shown that the progression of RA is reflected by a variety of changes in biomarkers and symptoms, including interleukin (IL) 7 level, rheumatoid factor (RF) positivity, C-reactive protein (CRP) measurements, and an increased number of swollen and painful joints; however, little is known about how the disease progression is associated with other RDs [10-13].
The results of our study indicate that the progression of RA may be characterized by the presentation of other rheumatic diseases, and future studies are needed to confirm our findings and to associate the presence of RDs with other biomarkers to better predict RA progression. In our multivariate regression model, higher odds of different comorbidities were observed in classes with higher frequencies of RDs, which supports the hypothesis that the patient classes reflect cumulative RD progression.

When comparing results from EM and KM to the dendrogram from hierarchical clustering, consistency can be observed: the patterns from hierarchical clustering are reflected in the results of EM and KM clustering. In the dendrogram, Osteoarthrosis and allied disorders and Other and unspecified disorders of joint were clustered together. In the EM and KM clustering results, the classes of patients who have a high probability of Osteoarthrosis and allied disorders also tend to have Other and unspecified disorders of joint at a similar level, while those who have a low probability of Osteoarthrosis and allied disorders also tend to have a low probability of Other and unspecified disorders of joint. Peripheral enthesopathies and allied syndromes and Other disorders of synovium tendon and bursa were also clustered together in the hierarchical clustering, and a similar probability of having these two conditions was also observed in the EM and KM results. Ankylosing spondylitis and other inflammatory spondylopathies, Gout, Diffuse diseases of connective tissue, Carpal tunnel syndrome, and Disorders of muscle ligament and fascia were clustered far away from the rest of the RD codes in the hierarchical clustering, possibly due to the small overall prevalence of these conditions in the whole study population. This could also be explained by the EM and KM clustering results: the probability of having these conditions is likely to be independent of the probability of having other RDs, but those who have one or more of these five conditions may be more likely to have another of them.

There are a few limitations in this study. First, RA and RD were defined based on ICD-9 codes because the HCUP database does not have information on medication or laboratory tests.
We adapted an existing definition of RA in EHR settings: if a patient had two separate ICD codes for RA/RD from two distinct visits within six months, the patient was categorized into the respective group. It has been reported that such a classification has 90% sensitivity and about 60% positive predictive value (PPV) [85]; therefore, our study may contain misclassification. In addition, the use of ICD codes in the analysis can be potentially problematic due to the nature of ICD-9. Some of the codes for RDs do not correspond to a specific condition, which limits the interpretability and cli ot the disorders listed elsewhere in ICD. In the hierarchical clustering, we found that joint disorders and soft tissue disorders are linked, but we do not actually know which joint and soft tissue disorders they are. These concerns apply wherever ICD codes t by their prodromal and concurrent RDs. To our knowledge, little is known about the relationships between RA and RD, and whether the relationship is linked with other aspects of RA patients. The use of other information domains may provide important information for understanding the etiology of RA and discovering RA endotypes. One can examine the patterns of genetics, blood testing, and presentations of symptoms (e.g., locations of pain) to identify the associations between different patterns. Future studies could also examine whether and how these clusters are associated with the clinical outcomes of RA, such as comorbidities and treatment responses.

Despite these limitations, in this exploratory analysis we observed a large heterogeneity of RDs in RA patients, and we identified some interesting patterns of patients and RDs. The RDs did not have a clear clustered structure, but a cumulative structure may be present, and the clusters identified in this analysis can be considered as sequences of disease progression in RA patients. Future studies are needed to confirm our findings and to associate these clusters with other aspects of RA.
CHAPTER 5
CONCLUSIONS

Summary of Research Objectives

Rheumatoid arthritis (RA) is a long-term autoimmune disorder that primarily causes chronic inflammation affecting many joints, most frequently in the hands and feet. The symptoms of RA usually include warm, swollen, and painful joints, and these symptoms are very similar to those of other rheumatic conditions and other forms of inflammatory arthritis. In the United States, an estimated 1.3 million adults are currently living with RA [91], and a recent study using insurance claim data suggests that the prevalence of RA in the US is 0.53 to 0.55%, with a higher prevalence in females (0.73 to 0.78%) than in males (0.29 to 0.31%) [33]. The etiology of RA is still unclear. There are some well-established risk factors, such as gender, racial disparities, certain genetic variants, and smoking, while findings for other risk factors (e.g., socio-economic status, environment) have been inconsistent in the literature.

The diagnosis of RA remains challenging. The diagnostic guideline was revised in 2010 to emphasize the role of serological tests; however, the biomarkers used have relatively low sensitivity (~70%) [3], which means that the tests may return false negatives and cases of disease may be missed. There is a need for better diagnostic tools that distinguish RA from other rheumatic diseases, so that treatment can be effective and targeted.

In addition to the difficulties in the differential diagnosis of RA, the current classification criteria for RA also highlight the heterogeneity of the patient population. Under the current criteria, a patient who has synovitis in at least 1 joint without an alternative diagnosis and who achieves a total score of 6 or more from the following categories is considered to have RA.
These categories include [16]: (1) number and site of involved joints (score range 0 to 5); (2) serologic abnormality (score range 0 to 3); (3) elevated acute-phase response (score range 0 to 1); (4) symptom duration (2 levels; range 0 to 1). Different patients may have different scoring patterns in these categories, and these differences may be associated with different pathogenic pathways. Several [...] level of biomarkers and responses to the medications. The observed heterogeneity is associated with the progression of RA and with therapy outcomes; therefore, a better understanding of the underlying pathological and biological mechanisms is needed.

Finally, there is currently no cure for RA; treatment is limited to relieving symptoms and improving overall wellbeing by stopping inflammation, preventing joint and organ damage, and reducing long-term complications [1]. The progression of the disease and the lack of effective treatments make RA patients prone to many adverse health outcomes. Previous studies have demonstrated that RA patients are at higher risk of persistent pain, functional disability, fatigue, and depression [63,81]. While these studies compared individuals with RA to control groups determined by specific mortality or risk of a single disease, a comprehensive study of all morbidity has not been carried out.

In this dissertation work, I [...] a systematic and comprehensive point of view. I examined risk factors and predictors of RA using information collected before the occurrence of RA; I also examined the concurrent and prodromal comorbidities of RA, as well as the clinical consequences of the disease. The success of this work, along with future research, could provide useful prognostic information, further the understanding of RA pathogenesis, and identify under-testing or other missed opportunities in rheumatology healthcare. There are three specific aims for the dissertation:
Aim 1: To develop a risk prediction model for rheumatoid arthritis (RA) using a multiple-domain approach.
Aim 2: To identify clinical
consequences of RA by assessing the independent association of RA and the risk of diagnosis of specific clinical comorbidities.
Aim 3: To assess the heterogeneity of prodromal or concurrent RDs in patients with RA.

To accomplish these objectives, the Florida State database from the HCUP was used. A case-control design was used for Aims 1 and 2. In Aim 1, demographics, clinical variables, and social-ecological variables were used to build a prediction model for RA in an at-risk population. In Aim 2, propensity score matching was used to ensure an appropriate comparison group, and a robust statistical framework was used to identify important and significant clinical consequences that are associated with RA. In Aim 3, several unsupervised statistical learning methods were used to analyze the patterns of the spectrum of rheumatic conditions in RA patients.

Accomplishment of this Dissertation

In Chapter 2, we aimed to develop a risk prediction model for rheumatoid arthritis (RA) using a multiple-domain approach, and to identify clinical comorbidities associated with the development of RA, including prodromal and concurrent comorbidities. We utilized a statewide database of inpatient, outpatient, and emergency room visits. Patients diagnosed with RA were compared to groups of patients with other rheumatic conditions (RD). We analyzed multiple information domains (demographic, clinical diagnosis, procedures conducted, social-ecological variables) using a collection of statistical and machine learning models with feature selectors. We compared the sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) of different models using 10-fold cross-validation. A total of 141,729 patients met the inclusion criteria, of whom 56,052 were diagnosed with RA and 85,677 with RD. The final model showed moderate performance, with an AUROC of 0.74, sensitivity of 0.68, and specificity of 0.65 at the optimal cut point.
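As a minimal sketch of how sensitivity and specificity "at the optimal cut point" can be obtained from predicted probabilities, the Python below scans candidate cutoffs and keeps the one that maximises Youden's J (sensitivity + specificity - 1). The choice of Youden's J is an assumption, since the summary above does not state which criterion defined the optimal cut point, and the function names and the eight scores are hypothetical illustrations rather than the dissertation's code or data.

```python
def rates_at(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called RA (label 1)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    # Assumes both classes are present, so neither denominator is zero.
    return tp / (tp + fn), tn / (tn + fp)


def youden_cut_point(scores, labels):
    """Return (J, cutoff, sensitivity, specificity) for the cutoff maximising J."""
    best = None
    for cutoff in sorted(set(scores)):
        sens, spec = rates_at(scores, labels, cutoff)
        j = sens + spec - 1.0
        if best is None or j > best[0]:  # ties keep the lowest cutoff
            best = (j, cutoff, sens, spec)
    return best


# Hypothetical predicted probabilities and true labels (1 = RA, 0 = other RD).
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
j, cutoff, sens, spec = youden_cut_point(scores, labels)
print(cutoff, sens, spec)
```

In a cross-validated setting, the cutoff would normally be chosen on training folds and then applied to the held-out fold, so that the reported sensitivity and specificity are not optimistically biased.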
In terms of the discriminative power of individual domains, the model using the socio-ecological domain had the highest AUROC (0.693), followed by clinical diagnosis (0.654), demographic (0.647), and clinical procedures (0.580). We also identified several predictive factors for RA. This analysis provided an algorithm that has the potential to improve the differential diagnosis and early prediction of RA, and which may be used in conjunction with other diagnostic information. We also found a stronger contribution from the socio-ecological domain than from other domains. Future studies examining independent socio-ecological and environmental risk factors and predictors of RA are needed. The key messages from this chapter can be summarized as: (1) clinical and socio-ecological domains can predict a portion of RA in an at-risk population; (2) a significant contribution of socio-ecological variables was observed in RA prediction models; (3) the combination of multiple information domains is needed for better risk prediction in the future.

In Chapter 3, we conducted a comprehensive analysis of all morbidity among rheumatoid arthritis (RA) patients to identify clinical consequences of RA. A statewide database of inpatient, outpatient, and emergency room visits was used. A case-control study was conducted by grouping patients who had RA and patients who had other rheumatic diseases (RD). The two groups were matched at a 1:3 ratio by a propensity score calculated with the Charlson Comorbidity Index and the Rheumatic Disease Comorbidity Index. Multivariable logistic regression was used to assess the associations between RA/RD and disease comorbidities. The adjusted odds ratios of RA, AIC, and AUROC of each model were ranked and used to identify important clinical consequences of RA. A total of 19,609 RA patients and 58,827 RD patients were included after propensity score matching, and a total of 237 clinical conditions (coded by three-digit ICD-9 codes) were included in the analysis (all above 1% frequency). After fitting the regression models, the clinical conditions with the top OR, AIC, or AUROC values were ranked.
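To make the ranking step concrete, the sketch below computes an odds ratio with a Wald 95% confidence interval for each condition from a 2x2 table and sorts the conditions by odds ratio. It is a simplification under stated assumptions: the odds ratios here are unadjusted, whereas the chapter reports adjusted odds ratios from multivariable logistic regression, and the condition names and counts are hypothetical placeholders, not results from the dissertation.

```python
import math


def odds_ratio(ra_with, ra_without, rd_with, rd_without):
    """Unadjusted odds ratio of a condition (RA vs. RD) with a Wald 95% CI."""
    or_ = (ra_with * rd_without) / (ra_without * rd_with)
    # Standard error of log(OR) from the four cell counts (all must be > 0).
    se = math.sqrt(1 / ra_with + 1 / ra_without + 1 / rd_with + 1 / rd_without)
    lower = math.exp(math.log(or_) - 1.96 * se)
    upper = math.exp(math.log(or_) + 1.96 * se)
    return or_, lower, upper


# Hypothetical counts: (RA with, RA without, RD with, RD without) per condition.
tables = {
    "condition_A": (30, 70, 20, 80),
    "condition_B": (10, 90, 15, 85),
    "condition_C": (45, 55, 25, 75),
}

# Rank conditions from the largest to the smallest odds ratio.
ranked = sorted(tables, key=lambda name: odds_ratio(*tables[name])[0], reverse=True)
for name in ranked:
    or_, lower, upper = odds_ratio(*tables[name])
    print(f"{name}: OR={or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Ranking on several criteria at once (OR, AIC, AUROC), as described above, would repeat the same sort with a different key for each criterion.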
Among these top variables, 8 were positively associated with RA: disorders of lipid metabolism; disorders of fluid, electrolyte and acid-base balance; other and unspecified anemias; essential hypertension; post-inflammatory pulmonary fibrosis; diffuse diseases of connective tissue; other disorders of bone and cartilage; and fracture of neck of femur. In this comprehensive analysis of RA and the risk of clinical comorbidities, several clinical conditions were identified as potential health consequences of RA. The key findings from this chapter can be summarized as: (1) in this comprehensive big data analysis, we observed that patients with RA are prone to a variety of clinical consequences; (2) these clinical consequences are closely related to the pathogenic mechanisms of RA. These findings can guide future targeted interventions for RA patients.

Chapter 4 started from the observation that heterogeneity has been reported in rheumatoid arthritis (RA) patients in terms of serological testing results and gene expression profiling, while other aspects of RA have not been examined before. In this analysis, we aimed to explore whether RA patients can be clustered by their prodromal or concurrent RDs. RA patients whose first RA diagnosis was made after 2010 were identified from the [...] (HCUP) database. To explore the patterns of RDs in the RA population, several unsupervised statistical learning techniques were applied, including k-means clustering (KM), the expectation-maximization (EM) algorithm, and hierarchical clustering (HC). A total of 20,351 RA patients were identified. After removing RDs with low frequencies, a total of 13 RDs were included in the analysis. Results from HC showed evidence of clusters, but a clear clustered structure was not observed. Results from EM and KM clustering suggested that RA patients can be grouped based on their RDs, but the patterns of these groups may be the result of disease progression and the accumulation of RDs.
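The k-means step can be illustrated with a small, self-contained sketch. Each patient is represented as a vector of 0/1 indicators for the presence of each RD, and Lloyd's algorithm alternates between assigning patients to the nearest centroid and recomputing centroids. Everything here is an illustrative assumption: the three-RD toy data, the two hand-picked starting centroids, and the function names are hypothetical, and the dissertation's actual software, initialisation, and number of clusters are not specified in this summary.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between a patient vector and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def assign(points, centroids):
    """Index of the nearest centroid for every point (ties go to the lower index)."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]


def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                # Mean of each column = prevalence of that RD within the cluster.
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[k])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return assign(points, centroids), centroids


# Hypothetical patients; columns are 0/1 flags for three RDs.
patients = [[0, 0, 0], [0, 0, 0], [1, 0, 0], [1, 1, 1], [1, 1, 1], [1, 1, 0]]
labels, centroids = kmeans(patients, centroids=[[0, 0, 0], [1, 1, 1]])
print(labels)
print(centroids)
```

Because the inputs are binary, each centroid coordinate is the within-cluster frequency of one RD, which is the kind of per-class frequency profile used to describe the patient classes in the summary that follows.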
A total of 5 classes were identified: one class with no RDs; two classes with higher frequencies of one or two RDs; one class with almost every RD at low frequency; and one class with almost every RD at high frequency. We observed a large heterogeneity of RDs in RA patients. Although the RDs did not have a clear clustered structure, a cumulative structure may be present: the clusters identified in this analysis can be considered sequences of disease progression in RA patients. Future studies are needed to confirm our findings and to associate these clusters with other aspects of RA. The key findings from this chapter can be summarized as: (1) heterogeneity in the RA population was observed, as we identified multiple groups of patients; (2) a cumulative pattern in the probabilities and prevalence of other rheumatic diseases across patient classes was observed, suggesting that these classes may be the result of disease progression and the accumulation of these rheumatic diseases.

Limitations and Strengths

There are several limitations of this work. Some have been discussed in previous chapters, but here they are discussed in a more systematic fashion. First, the data used in this work came from one state. Although Florida has the third-largest population in the United States, external validation of our findings in other datasets is needed. Second, for the social-ecological domain analyzed in this work, we included only a limited number of social-ecological indicators and made [...]. The use of a single year to link social-ecological indicators may introduce misclassification for variables such as air pollution, given their large temporal variation. In addition, the socio-ecological variables included in our analysis were obtained from external sources at the zip-code level and linked to patient-level data by zip code, so the measurements of these variables may not be accurate for each individual. Future studies with individually measured variables could improve on the results of this work.
Third, we only analyzed ten years of the Florida HCUP data, which may have excluded patients seen only earlier in the study timeframe and/or patients who do not regularly seek medical care. Fourth, the use of ICD-9 codes to define RA and RD may be problematic. We adapted an existing definition of RA in EHR settings: if a patient had two separate ICD codes for RA from two distinct [...]. Such a classification had 90% sensitivity and about 60% positive predictive value (PPV), which means we may have misclassification in our study population. Future studies that define RA based on serological testing results or an additional gold standard are needed. Fifth, there are some shortcomings of using the ICD-9 ontology. The ICD-9 ontology was designed for billing purposes, so clinical interpretation is needed, and the translational relevance of using ICD-9 exclusively is limited because the codes need to be translated into other ontologies (such as ICD-10). Strategies for semantic integration that link the ICD codes to other information domains are needed. In this work, we utilized the hierarchical structure of the ICD-9 ontology: when an identified 3-digit code contained a broad range of conditions, we used the 5-digit codes under it to identify more meaningful conditions. For example, ICD code 338 (pain, not classified elsewhere) was identified as an important predictor in our analysis; future studies [...] media can be useful. In addition, some of the ICD codes do not always correspond to a specific condition, which further limits the interpretability and clinical relevance of our findings. For example, [...] (ICD-9 code: 719) is not actually a condition, but a code that explicitly avoids specifying a condition. Another example is that in the hierarchical clustering, we found that joint disorders (ICD-9 code: 719) and soft tissue disorders (ICD-9 code: 729) may be linked, but we do not actually know which joint and soft tissue disorders they are. These concerns [...]. These particular limitations of ICD-9 make it difficult to
assess whether the signal we picked up in our analysis is due to the coding system [...]. In addition, the HCUP data only contain diagnoses and procedures for each [...] notes) that may be useful for this work were not included. As a result, the prediction models we built had limited predictive ability and may only be used in conjunction with other clinical diagnostic tools. For example, the severity, location, and duration of pain are important for evaluating the clinical management of RA, but no corresponding information is available in the database.

However, this work has many strengths worth mentioning. First, the sample size in this analysis is very large. With over 20,000 RA patients and over 100,000 RD patients included in the analysis, the statistical power is high. Second, we used several statistically robust methods and measurements in our analysis. These tools gave us the opportunity to examine large data sets while maintaining study validity. In addition, we [...] in this work, which allowed us to start from a general overarching hypothesis and test a large volume of factors. Such an approach is very useful for diseases with complex or unclear etiology (such as RA) and could be used to generate hypotheses for further testing with traditional epidemiological study designs. Finally, under a purely predictive approach, the use of ICD-9 codes is useful because the codes can be pulled automatically from an EHR, so that the models and findings from this work can be directly applied to external EHR databases.

Future Directions and Concluding Thoughts

In this dissertation work, using a multi-domain approach, a level analysis of RA was conducted through systematic evaluation of characteristics related to RA. We identified some novel predictors and clinical consequences of RA, and we examined the patterns of RDs and how they are clustered in RA patients. This multi-domain level analysis can be applied to other diseases to
explore what patients look like before the diagnosis, which conditions commonly appear with the disease, and what happens after the diagnosis. Through this dissertation work, we provided an algorithm that has the potential to improve the differential diagnosis and early prediction of RA; it could be used in conjunction with other diagnostic information. We found a stronger contribution to RA prediction from the social-ecological domain than from other domains; therefore, future studies could examine independent socio-ecological risk factors and predictors of RA. We identified several clinical conditions as potential clinical consequences of RA and demonstrated that RA is associated with a variety of physical burdens. These findings suggest directions for further investigation into the mechanisms of RA and the targeted management of RA patients. Finally, we observed heterogeneity of rheumatic diseases in RA patients. Although the rheumatic diseases did not have a clear clustered structure, a cumulative structure may be present. The clusters/classes identified in this analysis can be considered a sequence of disease progression in RA patients. Future studies are needed to confirm our findings and to associate these clusters with other aspects of RA.

In addition to applying the entire conceptual framework of this dissertation work to other diseases or health outcomes, research on RA could also be extended from our findings. First, the methods and concepts used in Chapters 3 and 4 can be applied to other information domains in this data set. For example, procedures associated with RA can be assessed using the methods in Chapter 3 to identify which procedures may be consequences of RA. Future analyses could also cluster on other variables to further examine the heterogeneity in the RA population. Second, because the HCUP does not contain other domains that may be of interest, future studies with information from other domains are needed.
For example, a medication domain would allow us to evaluate the effects of RA treatment on health consequences; it could also be used to confirm RA cases. Other domains, such as laboratory tests, biomarkers, genetic profiles, and symptoms, may provide important information for understanding the etiology of RA and discovering RA endotypes. One can examine patterns in genetics, blood testing, and symptom presentation (e.g., locations of pain) to identify associations between different patterns. Future studies could also examine whether and how these clusters are associated with the clinical outcomes of RA, such as comorbidities and treatment responses. Statistical and machine learning models are important tools for epidemiologists and clinicians as they seek to better understand the pathogenesis of complex diseases and to design personalized prevention and treatment strategies.
LIST OF REFERENCES

1. Majithia V, Geraci SA. Rheumatoid Arthritis: Diagnosis and Management. doi:10.1016/j.amjmed.2007.04.005.
2. Aletaha D, Neogi T, Silman AJ, et al. 2010 Rheumatoid arthritis classification criteria: An American College of Rheumatology/European League Against Rheumatism collaborative initiative. Arthritis Rheum. 2010. doi:10.1166/jctn.2016.5902.
3. Lee DM, Schur PH. Clinical utility of the anti-CCP assay in patients with rheumatic diseases. Ann Rheum Dis. 2003;62(9):870-874. doi:10.1136/ard.62.9.870.
4. Karlson EW, Van Schaardenburg D, Van der Helm-van Mil AH. Strategies to predict rheumatoid arthritis development in at-risk populations. Rheumatol (United Kingdom). 2016;55(1):6-15. doi:10.1093/rheumatology/keu287.
5. Karsdal MA, Bay-Jensen A-C, Henriksen K, et al. Rheumatoid Arthritis: A Case for Personalized Health Care? Arthritis Care Res (Hoboken). 2014;66(9):1273-1280. doi:10.1002/acr.22289.
6. Xu B, Lin J. Characteristics and risk factors of rheumatoid arthritis in the United States: an NHANES analysis. PeerJ. 2017;5:e4035. doi:10.7717/peerj.4035.
7. Vieira VM, Hart JE, Webster TF, et al. Association between residences in U.S. northern [...] Environ Health Perspect. 2010;118(7):957-961. doi:10.1289/ehp.0901861.
8. Oliver JE, Silman AJ. Risk factors for the development of rheumatoid arthritis. Scand J Rheumatol. 2006. doi:10.1080/03009740600718080.
9. Silman A, Pearson J. Epidemiology and genetics of rheumatoid arthritis. Arthritis Res. 2002. doi:10.1186/ar578.
10. McBurney CA, Vina ER. Racial and ethnic disparities in rheumatoid arthritis. Curr Rheumatol Rep. 2012;14(5):463-471. doi:10.1007/s11926-012-0276-0.
11. Greenberg JD, Spruill TM, Shan Y, et al. Racial and ethnic disparities in disease activity in patients with rheumatoid arthritis. Am J Med. 2013;126(12):1089-1098. doi:10.1016/j.amjmed.2013.09.002.
12. Bax M, van Heemst J, J Huizinga TW, M Toes RE. Genetics of rheumatoid arthritis: what have we learned? 2011. doi:10.1007/s00251-011-0528-6.
13. Bergman GJD, Hochberg MC, Boers M, Wintfeld N, Kielhorn A, Jansen JP. Indirect Comparison of Tocilizumab and Other Biologic Agents in Patients with Rheumatoid Arthritis and Inadequate Response to Disease-Modifying Antirheumatic Drugs. YSARH. 39:425-441. doi:10.1016/j.semarthrit.2009.12.002.
14. Vos I, Van Mol C, Trouw LA, et al. Anti-citrullinated protein antibodies in the diagnosis of rheumatoid arthritis (RA): diagnostic performance of automated anti-CCP 2 and anti-CCP 3 antibodies assays. Clin Rheumatol. 2017;36(7):1487-1492. doi:10.1007/s10067-017-3684-8.
15. Tillmann T, Krishnadas R, Cavanagh J, Petrides K. Possible rheumatoid arthritis subtypes in terms of rheumatoid factor, depression, diagnostic delay and emotional expression: an exploratory case-control study. Arthritis Res Ther. 2013. doi:10.1186/ar4204.
16. Gossec L, Smolen JS, Gaujoux-Viala C, et al. European league against rheumatism recommendations for the management of psoriatic arthritis with pharmacological therapies. Ann Rheum Dis. 2012. doi:10.1136/annrheumdis-2011-200350.
17. Kazani S, Israel E. Update in asthma 2010. Am J Respir Crit Care Med. 2011. doi:10.1164/rccm.201103-0557UP.
18. Anderson GP. Endotyping asthma: new insights into key pathogenic mechanisms in a complex, heterogeneous disease. www.thelancet.com. https://ac.els-cdn.com/S014067360861452X/1-s2.0-S014067360861452X-main.pdf?_tid=57b0fca6-ca37-11e7-931a-00000aacb360&acdnat=1510772651_8f687c9757ed1ab307805b631c18b615. Accessed November 15, 2017.
19. Saag KG, Gim GT, Patkar NM, et al. American College of Rheumatology 2008 recommendations for the use of nonbiologic and biologic disease-modifying antirheumatic drugs in rheumatoid arthritis. Arthritis Care Res. 2008. doi:10.1002/art.23721.
20. Chan ESL, Fernandez P, Cronstein BN. Methotrexate in rheumatoid arthritis. Expert Rev Clin Immunol. 2007. doi:10.1586/1744666X.3.1.27.
21. Kruppa J, Ziegler A, König IR. Risk estimation and risk prediction using machine learning methods. Hum Genet. 2012. doi:10.1007/s00439-012-1194-y.
22. Leung MKK, Delong A, Alipanahi B, Frey BJ. Machine learning in genomic medicine: A review of computational problems and data sets. Proc IEEE. 2016. doi:10.1109/JPROC.2015.2494198.
23. Goodfellow IJ, Erhan D, Luc Carrier P, et al. Challenges in representation learning: A report on three machine learning contests. Neural Networks. 2015. doi:10.1016/j.neunet.2014.09.005.
24. Podgorelec V, Zorman M. Decision Tree Learning. In: Encyclopedia of Complexity and Systems Science. 2015. doi:10.1007/978-3-642-27737-5_117-2.
25. Kotsiantis SB. Decision trees: A recent overview. Artif Intell Rev. 2013. doi:10.1007/s10462-011-9272-4.
26. Lecun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015. doi:10.1038/nature14539.
27. Schmidhuber J. Deep Learning in neural networks: An overview. Neural Networks. 2015. doi:10.1016/j.neunet.2014.09.003.
28. Fraccaro P, Nicolo M, Bonetto M, et al. Combining macula clinical signs and patient characteristics for age-related macular degeneration diagnosis: A machine learning approach Retina. BMC Ophthalmol. 2015. doi:10.1186/1471-2415-15-10.
29. HCUP-US Home Page. https://www.hcup-us.ahrq.gov/. Accessed January 2, 2019.
30. Prevention C for DC and, Statistics NC for H. ICD ICD-9-CM International Classification of Diseases, Ninth Revision, Clinical Modification. Classif Dis Funct Disabil. 2013.
31. American Community Survey (ACS). https://www.census.gov/programs-surveys/acs/. Accessed January 3, 2019.
32. Air Data: Air Quality Data Collected at Outdoor Monitors Across the US. https://www.epa.gov/outdoor-air-quality-data. Accessed January 3, 2019.
33. Hunter TM, Boytsov NN, Zhang X, Schroeder K, Michaud K, Araujo AB. Prevalence of rheumatoid arthritis in the United States adult population in healthcare claims databases, 2004-2014. Rheumatol Int. 2017;37(9):1551-1557. doi:10.1007/s00296-017-3726-1.
34. Smolen JS, Aletaha D, Mcinnes IB. Rheumatoid arthritis. www.thelancet.com. 2016. doi:10.1016/S0140-6736(16)30173-8.
35. Deane KD, Demoruelle MK, Kelmenson LB, Kuhn KA, Norris JM, Holers VM. Genetic and environmental risk factors for rheumatoid arthritis. doi:10.1016/j.berh.2017.08.003.
36. Dickens C, Mcgowan L, Clark-Carter D, Creed F. Depression in RA A Systematic Review of the Literature with Meta-Analysis Dickens et Al.; 2002. doi:10.1097/00006842-200201000-00008.
37. Yinshi Yue YY. Microbial Infection and Rheumatoid Arthritis. J Clin Cell Immunol. 2013;04(06). doi:10.4172/2155-9899.1000174.
38. Costenbader KH. Geographic Variation in Rheumatoid Arthritis Incidence Among Women in the United States. Arch Intern Med. 2008;168(15):1664. doi:10.1001/archinte.168.15.1664.
39. Services. USDoHH. Hcup Facts and Figures: Statistics on Hospital-Based Care in the United States, 2008. Qual AfHRa, Rockville, MD. 20011:1-62. doi:http://www.hcup-us.ahrq.gov/reports.jsp.
40. Quan H, Sundararajan V, Halfon P, et al. Coding Algorithms for Defining Comorbidities in ICD-9-CM and ICD-10 Administrative Data. Med Care. 2005;43(11):1130-1139. doi:10.1097/01.mlr.0000182534.19832.83.
41. Deyo RA, Cherkin DC, Ciol MA. Adapting a clinical comorbidity index for use with ICD-9-CM administrative databases. J Clin Epidemiol. 1992;45(6):613-619. doi:10.1016/0895-4356(92)90133-8.
42. Quinlan JR. C4.5: Programs for Machine Learning. Vol 1.; 1992. doi:10.1016/S0019-9958(62)90649-6.
43. Li P. Robust logitboost and adaptive base class (abc) logitboost. arXiv Prepr arXiv12033491. 2012.
44. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32. doi:10.1023/A:1010933404324.
45. Piatetsky-Shapiro G. Discovery, analysis and presentation of strong rules. Knowl Discov Databases. 1991. doi:10.1080/0748800910070110.
46. Schisterman EF, Perkins NJ, Liu A, Bondell H. Optimal cut-point and its corresponding Youden index to discriminate individuals using pooled blood samples. Epidemiology. 2005;16(1):73-81. doi:10.1097/01.ede.0000147512.81966.ba.
47. Demoruelle MK, Deane KD, Holers VM. When and where does inflammation begin in rheumatoid arthritis? Curr Opin Rheumatol. 2014;26(1):64-71. doi:10.1097/BOR.0000000000000017.
48. Tracy A, Buckley CD, Raza K. Pre-symptomatic autoimmunity in rheumatoid arthritis: when does the disease start? doi:10.1007/s00281-017-0620-6.
49. Vreugdenhil G, Wognum AW, Van Eijk HG, Swaak AJG. Anaemia in rheumatoid arthritis: The role of iron, vitamin B12, and folic acid deficiency, and erythropoietin responsiveness. Ann Rheum Dis. 1990;49(2):93-98. doi:10.1136/ard.49.2.93.
50. Simpson NRW, Kersley GD, Brooks DH. Effect of Blood Transfusion on Rheumatoid Arthritis. Ann Rheum Dis. 1949;8(4):277-284. doi:10.1136/ard.8.4.277.
51. Roubenoff R, Dellaripa P, Nadeau MR, et al. Abnormal homocysteine metabolism in rheumatoid arthritis. Arthritis Rheum. 1997. doi:10.1002/art.1780400418.
52. Roschmann RA, Rothenberg RJ. Pulmonary fibrosis in rheumatoid arthritis: A review of clinical features and therapy. Semin Arthritis Rheum. 1987. doi:10.1016/0049-0172(87)90020-5.
53. Hart JE, Laden F, Puett RC, Costenbader KH, Karlson EW. Exposure to traffic pollution and increased risk of rheumatoid arthritis. Environ Health Perspect. 2009;117(7):1065-1069. doi:10.1289/ehp.0800503.
54. Chen H-H, Chao W-C, Liao T-L, Lin C-H, Chen D-Y. Risk of autoimmune rheumatic diseases in patients with palindromic rheumatism: A nationwide, population-based, cohort study. 2018. doi:10.1371/journal.pone.0201340.
55. Hart JE, Laden F, Puett RC, Costenbader KH, Karlson EW. Exposure to traffic pollution and increased risk of rheumatoid arthritis. Environ Health Perspect. 2009;117(7):1065-1069. doi:10.1289/ehp.0800503.
56. Essouma M, Noubiap JJN. Is air pollution a risk factor for rheumatoid arthritis? J Inflamm (United Kingdom). 2015. doi:10.1186/s12950-015-0092-1.
57. Sigaux J, Biton J, André E, Semerano L, Boissier MC. Air pollution as a determinant of rheumatoid arthritis. Joint Bone Spine. 2018.
58. Hart JE, Källberg H, Laden F, et al. Ambient air pollution exposures and risk of [...] Arthritis Care Res. 2013;65(7):1190-1196. doi:10.1002/acr.21975.
59. Liao KP, Cai T, Gainer V, et al. Electronic medical records for discovery research in rheumatoid arthritis NIH Public Access. Arthritis Care Res. 2010;62(8):1120-1127. doi:10.1002/acr.20184.
60. Chung CP, Rohan P, Krishnaswami S, Mcpheeters ML. A systematic review of validated methods for identifying patients with rheumatoid arthritis using administrative or claims data. Vaccine. 2013;31:K41-K61. doi:10.1016/j.vaccine.2013.03.075.
61. Crane MM, Juneja M, Allen J, et al. Epidemiology and Treatment of New-Onset and Established Rheumatoid Arthritis in an Insured US Population. Arthritis Care Res. 2015. doi:10.1002/acr.22646.
62. Wang SL, Chang CH, Hu LY, Tsai SJ, Yang AC, You ZH. Risk of developing depressive disorders following rheumatoid arthritis: A nationwide population-based study. PLoS One. 2014. doi:10.1371/journal.pone.0107791.
63. Pollard L, Choy EH, Scott DL. The consequences of rheumatoid arthritis: Quality of life measures in the individual patient. Clin Exp Rheumatol. 2005.
64. Scott DL, Pugner K, Kaarela K, et al. The links between joint damage and disability in rheumatoid arthritis. Rheumatology. 2000. doi:10.1093/rheumatology/39.2.122.
65. Garip Y, Eser F, Aktekin LA, Bodur H. Fatigue in rheumatoid arthritis: Association with severity of pain, disease activity and functional status. Acta Reumatol Port. 2011.
66. Bombardier C, Barbieri M, Parthan A, et al. The relationship between joint damage and functional disability in rheumatoid arthritis: A systematic review. Ann Rheum Dis. 2012. doi:10.1136/annrheumdis-2011-200343.
67. England BR, Sayles H, Mikuls TR, Johnson DS, Michaud K. Validation of the rheumatic disease comorbidity index. Arthritis Care Res. 2015. doi:10.1002/acr.22456.
68. Akaike H. A New Look at the Statistical Model Identification. IEEE Trans Automat Contr. 1974. doi:10.1109/TAC.1974.1100705.
69. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982. doi:10.1148/radiology.143.1.7063747.
70. García-Gómez C. Inflammation, lipid metabolism and cardiovascular risk in rheumatoid arthritis: A qualitative relationship? World J Orthop. 2014. doi:10.5312/wjo.v5.i3.304.
71. Kerekes G, Nurmohamed MT, González-Gay MA, et al. Rheumatoid arthritis and metabolic syndrome. Nat Rev Rheumatol. 2014. doi:10.1038/nrrheum.2014.121.
72. McGrath CM, Young SP. Lipid and Metabolic Changes in Rheumatoid Arthritis. Curr Rheumatol Rep. 2015. doi:10.1007/s11926-015-0534-z.
73. Seifter JL, Chang H-Y. Disorders of Acid-Base Balance: New Perspectives. Kidney Dis. 2016. doi:10.1159/000453028.
74. [...] L, Housa D, Vernerová Z, et al. Resistin in rheumatoid arthritis synovial tissue, synovial fluid and serum. Ann Rheum Dis. 2007. doi:10.1136/ard.2006.054734.
75. Raza K, Falciani F, Curnow SJ, et al. Early rheumatoid arthritis is characterized by a distinct and transient synovial fluid cytokine profile of T cell and stromal cell origin. Arthritis Res Ther. 2005. doi:10.1186/ar1733.
76. Bowman SJ. Hematological manifestations of rheumatoid arthritis. Scand J Rheumatol. 2002. doi:10.1080/030097402760375124.
77. Wilson A, Yu HT, Goodnough LT, Nissenson AR. Prevalence and outcomes of anemia in rheumatoid arthritis: A systematic review of the literature. In: American Journal of Medicine. 2004. doi:10.1016/j.amjmed.2003.12.012.
78. Masson C. Rheumatoid anemia. Jt Bone Spine. 2011. doi:10.1016/j.jbspin.2010.05.017.
79. Panoulas VF, Metsios GS, Pace AV, et al. Hypertension in rheumatoid arthritis. Rheumatology. 2008. doi:10.1093/rheumatology/ken159.
80. Otero M, Lago R, Gomez R, et al. Changes in plasma levels of fat-derived hormones adiponectin, leptin, resistin and visfatin in patients with rheumatoid arthritis. Ann Rheum Dis. 2006. doi:10.1136/ard.2005.046540.
81. Antin-Ozerkis D, Evans J, Rubinowitz A, Homer RJ, Matthay RA. Pulmonary Manifestations of Rheumatoid Arthritis. Clin Chest Med. 2010. doi:10.1016/j.ccm.2010.04.003.
82. Shaw M, Collins BF, Ho LA, Raghu G. Rheumatoid arthritis-associated lung disease. Eur Respir Rev. 2015. doi:10.1183/09059180.00008014.
83. Walsh DA, McWilliams DF. Pain in rheumatoid arthritis. Curr Pain Headache Rep. 2012. doi:10.1007/s11916-012-0303-x.
84. Dougados M. Comorbidities in rheumatoid arthritis. Curr Opin Rheumatol. 2016. doi:10.1097/BOR.0000000000000267.
85. Wei WQ, Teixeira PL, Mo H, Cronin RM, Warner JL, Denny JC. Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. J Am Med Informatics Assoc. 2016. doi:10.1093/jamia/ocv130.
86. Van der Pouw Kraan TCTM, Van Gaalen FA, Kasperkovitz PV, et al. Rheumatoid arthritis is a heterogeneous disease: Evidence for differences in the activation of the STAT-1 pathway between rheumatoid tissues. Arthritis Rheum. 2003. doi:10.1002/art.11096.
87. Derksen VFAM, Ajeganova S, Trouw LA, et al. Rheumatoid arthritis phenotype at presentation differs depending on the number of autoantibodies present. Ann Rheum Dis. 2017. doi:10.1136/annrheumdis-2016-209794.
88. Hartigan JA, Wong MA. A k-means clustering algorithm. Appl Stat. 1979. doi:10.2307/2346830.
89. Moon TK. The expectation-maximization algorithm. IEEE Signal Process Mag. 1996. doi:10.1109/79.543975.
90. Infante L. Hierarchical clustering. In: Revista Mexicana de Astronomia y Astrofisica: Serie de Conferencias. 2002. doi:10.1007/978-0-85729-287-2_7.
91. Nurmagambetov T, Kuwahara R, Garbe P. The economic burden of asthma in the United States, 2008-2013. Ann Am Thorac Soc. 2018;15(3):348-356. doi:10.1513/AnnalsATS.201703-259OC.
BIOGRAPHICAL SKETCH
Zhaoyi Chen received his Doctor of Philosophy from the Department of Epidemiology at the University of Florida in May 2019. Prior to that, he received his Bachelor of Science in pharmacy from China Pharmaceutical University in 2010 and his Master of Science in epidemiology from Tulane University in 2015. Zhaoyi was actively involved in research throughout his student career: he held several research assistant appointments as an undergraduate student and interned at the Louisiana Public Health Institute during his master's studies. After completing his master's degree, Zhaoyi joined the PhD program in the Department of Epidemiology at the University of Florida, where he was involved in several research projects across interdisciplinary areas. Zhaoyi's research focuses on the prediction of disease occurrence and outcomes using data science techniques. Apart from big data analysis, he is also interested in pain-related diseases and associated comorbidities.