Machine Learning Model: Finding the Best-Matched Doctors

Categories: Data

At Grand Rounds, we help our Members find providers who deliver efficient and high-quality healthcare. The best providers base their care on the strongest and most up-to-date evidence about a patient’s specific condition. Unfortunately, distinguishing high-quality providers from low-quality ones can be difficult. As a result, millions of patients suffer pain and unnecessary costs due to misdiagnosis and mistreatment.

Doctors who focus their clinical practice on a particular condition or group of conditions, even within a specialty, gain more relevant clinical experience. They’re thus better able to keep abreast of new developments in that area, allowing them to deliver better care. This is especially true for challenging conditions. To identify these conditions, we work with our staff physicians to find areas where subspecialty training doesn’t fully explain how doctors focus their expertise, and we pick out cases where seeing a doctor who focuses specifically on treating a particular patient’s condition could make a considerable difference. For example, there is substantial differentiation in the clinical focus of neurologists after they leave training, even though they may not have completed a subspecialty fellowship. If a Member with multiple sclerosis goes to a neurologist who has chosen to make this condition their focus, that Member is more likely to receive up-to-date and evidence-based care than if they go to a neurologist who splits their attention among headaches, seizures, dementia, and other conditions.

Goal

The goal of this project is to incorporate providers’ clinical expertise in a condition into our provider matching framework. Whenever a user searches for specialists in a condition, each provider’s expertise in that condition should adjust their ranking, so that a focused specialist is more likely to be surfaced. The following figure illustrates how this project would ideally impact provider matching when headache specialists are queried (note that this is an oversimplification, since expertise is only one of multiple factors in our matching framework).

Figure 1. Envisioned impact: list of providers surfaced through Grand Rounds’ provider match engine before and after blending providers’ clinical expertise into their ranking.

Challenges

At the beginning of development, we faced two challenges. First, how can we accurately measure a provider’s expertise in a condition? The conventional approach is to simply assess the volume of patients with that condition a provider has seen. However, we can’t accurately calculate a provider’s volume of patients because our data sources are rarely provider-complete. In addition, the provider who has seen the most patients for a particular condition doesn’t necessarily have the highest degree of expertise. So what should we measure when we look for a true expert? 

To counter this challenge, we iteratively consulted our staff physicians and determined that the right measure is the extent to which a provider focuses on treating the condition. Experts in the conditions we select typically receive extended training and pursue a particularly focused career path, making them more experienced than providers with less focus on a specific area.

The second challenge is labeling clinical expertise: there’s rarely a group of providers who can be cleanly categorized as either exclusively focused on one condition (perfect expertise) or not focused at all. Rather, most providers fall on a spectrum from quite generalist to quite specialist. In other words, clinical expertise lies on a continuous scale rather than a binary one, so a classification model is less suitable here.

To solve this problem, we chose unsupervised learning, which doesn’t require labels and adapts more readily as we expand the methodology to new conditions. Validating model output is harder without labels, so we leverage our medical team’s domain knowledge to assess the results.

Data Collection

For each condition of interest, we collect its subtype diagnoses and common treatment procedures as its clinical profile. A provider’s clinical expertise in a condition is measured by how closely their clinical experience matches that profile. We collect the counts of each diagnosis and procedure a provider has recorded in the past few years as their clinical experience document. Although our data is rarely provider-complete, these documents are still highly informative about a provider’s area of focus: because we aim to measure whether a provider tends to see patients with the condition of interest more than other patients, the distribution of diagnoses and procedures in the document matters more than raw volume.

Conditions’ diagnoses and procedures are also organized as clinical documents. Our metric is the degree to which a provider’s clinical experience document matches a condition’s document. The following figure shows how documents on both sides are stored.

Figure 2. Examples of how provider and condition clinical documents are stored.
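
To make this structure concrete, here is a purely hypothetical sketch of how a provider’s and a condition’s clinical document might be represented in Python; the field names, codes, and counts are invented for illustration and don’t reflect our actual schema.

```python
# Hypothetical bag-of-codes representation of clinical documents.
# All identifiers, codes, and counts below are made up for illustration.

provider_document = {
    "provider_id": "12345",
    "codes": {                # diagnosis/procedure code -> count over recent years
        "G43.909": 180,       # migraine, unspecified (diagnosis)
        "G44.209": 35,        # tension-type headache (diagnosis)
        "64615": 22,          # chemodenervation for chronic migraine (procedure)
        "G40.909": 4,         # epilepsy, unspecified (diagnosis)
    },
}

condition_document = {
    "condition": "headache",
    "codes": {                # subtype diagnoses and common treatment procedures
        "G43.909": 1,
        "G44.209": 1,
        "64615": 1,
    },
}
```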

Model Building

The model we chose is Latent Dirichlet Allocation (LDA), a statistical model for topic modeling. It learns the distribution of diagnoses and procedures in the raw clinical documents and transforms these documents into model-learned topics. A topic, in general, is a collection of terms and the probabilities of those terms showing up in that topic. In a natural language processing task, topics resemble what we traditionally think of as topics, such as entertainment, sports, or science. In our context, a topic can be interpreted as a clinical scenario the model uses to summarize the complex medical world: a collection of diagnosis and procedure codes and the probabilities with which those codes appear in the topic. For example, in the topic most probable for epilepsy, the following diagnoses are highly likely to show up: infantile spasms, convulsions in newborn, focal epilepsy, generalized nonconvulsive epilepsy, etc.

These topics are then used to featurize each provider’s document and each condition’s document, which can be seen as a dimension reduction. Usually an LDA model with fewer than 50 topics suffices to fit tens of thousands of providers’ documents, so transforming the raw documents into ~50 topics significantly reduces the number of features. Another motivation is to surface hidden structure that a statistical model detects better than humans can, such as the complicated underlying patterns of diagnoses and treatments across providers with many different specialties. The following figure shows an example of how we translate raw document data into abstract LDA topics.

Figure 3. Illustration of how provider and condition clinical documents are translated into LDA topics and associated topic probabilities.
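
As a rough sketch of this featurization step (not our production pipeline), the snippet below uses the open-source gensim library to fit an LDA model on toy clinical documents and transform a document into a topic-probability vector; the documents, codes, and topic count are placeholders.

```python
# A minimal LDA featurization sketch with gensim; data and settings are toy values.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# One list of diagnosis/procedure codes per provider, repeated once per occurrence.
provider_docs = [
    ["G43.909"] * 180 + ["G44.209"] * 35 + ["64615"] * 22,  # headache-focused provider
    ["G40.909"] * 90 + ["G43.909"] * 10,                    # epilepsy-focused provider
]

dictionary = Dictionary(provider_docs)
corpus = [dictionary.doc2bow(doc) for doc in provider_docs]

# In practice ~50 topics fit tens of thousands of provider documents;
# a handful is enough for this toy corpus.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5, passes=10, random_state=0)

def topic_vector(doc, model=lda, vocab=dictionary, num_topics=5):
    """Transform a raw clinical document into a dense topic-probability vector."""
    bow = vocab.doc2bow(doc)
    dense = [0.0] * num_topics
    for topic_id, prob in model.get_document_topics(bow, minimum_probability=0.0):
        dense[topic_id] = float(prob)
    return dense

condition_doc = ["G43.909", "G44.209", "64615"]  # headache condition profile
print(topic_vector(condition_doc))
```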

Model Training

We built multiple LDA models, each covering a group of clinically related conditions and the providers in a relevant specialty; for example, one model covers neurological conditions and neurologists. To train an LDA model, we feed providers’ clinical documents into it. Training involves a grid search over the number of topics, with repeated five-fold cross-validation at each candidate value.
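
A hedged sketch of that search, using scikit-learn’s LDA implementation rather than our production code, could look like the following; the count matrix and hyperparameter values are placeholders.

```python
# Sketch of a grid search over the number of topics with repeated five-fold CV.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import GridSearchCV, RepeatedKFold

# Toy stand-in for the providers-by-codes count matrix
# (rows: providers, columns: diagnosis/procedure codes).
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 300))

search = GridSearchCV(
    LatentDirichletAllocation(learning_method="batch", random_state=0),
    param_grid={"n_components": [10, 20, 30, 40, 50]},    # candidate topic counts
    cv=RepeatedKFold(n_splits=5, n_repeats=3, random_state=0),
    # With no explicit scorer, GridSearchCV uses the estimator's score(),
    # i.e. the approximate log-likelihood of the held-out documents.
)
search.fit(X)
best_lda = search.best_estimator_
print(search.best_params_)
```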

To select the optimal model, we calculate multiple metrics during cross-validation, including the per-term log-likelihood, the likelihood that a condition’s diagnoses show up in its most probable topic, and the pairwise similarity between conditions. The best model should score high on the first two metrics and low on the third. In other words, it should fit the development set well, identify the right topic for each condition, and distinguish different conditions from one another. The selected model is then used to transform each provider’s document and each condition’s document into topic-probability vectors. We use the cosine similarity between these two vectors, which ranges from 0 to 1, as the provider’s expertise in that condition. A higher score indicates a better match between the provider’s experience and the condition, and thus stronger expertise.

Figure 4. Workflow of model training and metric calculation.
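
For concreteness, here is a minimal sketch of the cosine-similarity expertise score described above; the topic vectors below are invented for illustration.

```python
# Expertise score as cosine similarity between topic-probability vectors.
import numpy as np

def expertise_score(provider_topics, condition_topics):
    """Cosine similarity between two topic-probability vectors; because topic
    probabilities are non-negative, the score falls between 0 and 1."""
    denom = np.linalg.norm(provider_topics) * np.linalg.norm(condition_topics)
    return float(np.dot(provider_topics, condition_topics) / denom) if denom else 0.0

provider_vec = np.array([0.70, 0.25, 0.05])   # provider mostly in the "headache" topic
condition_vec = np.array([0.85, 0.10, 0.05])  # headache condition document
print(expertise_score(provider_vec, condition_vec))  # close to 1: strong focus match
```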

Model Validation

The disadvantage of unsupervised learning is that we can’t directly guide model training toward the desired outcome. Instead, we validate the model’s output against information about providers gathered from multiple sources. In addition, our in-house medical staff manually check providers’ expertise and compare their assessments against the model’s predictions.

The implemented workflow currently runs from ingesting provider and condition documents, through training the LDA models and collecting samples for manual validation, to producing diagnostic reports. It can expand rapidly to new conditions and specialties. We also continuously improve the framework as our staff physicians validate new conditions and give us feedback. With their ongoing support, we iteratively adjust the model’s weight in the matching framework to ensure that an adequate number of focused specialists are surfaced as top-matched doctors.

If solving for messy and complicated data sounds like a fun challenge, join us!

See open positions