Let’s say you’re looking for something in an image, e.g. a tumor in a CT scan. Most likely, you’ll train a deep neural network (or “DNN”) both to flag the scan as containing a tumor and to highlight the region where the tumor is found. The neural net can take in millions of pixels as input, pool them into groups based on proximity, and compare those grouped pixels against learned patterns. With each layer of the net, it recognizes higher-level patterns until it finally arrives at a target class (“Tumor” = True). Neural networks are great for high-dimensional pattern recognition where you have a lot of training examples to train a LOT of parameters. So they’re typically the right choice for image recognition.
That doesn’t work so well for the presumably low-dimensional, low-volume questionnaire data Verily will be using in its just-launched COVID-19 risk screening and testing tool (available in Santa Clara and San Mateo counties only, at present).
They’ll most likely use a decision tree.
Decision trees are pretty intuitive. In the simplest case, you just run through the list of known symptoms of this particular disease, and if you have enough of them, then there’s some probability you have the disease and should be screened. Such a decision tree can be manually constructed by subject matter experts, even without any training data.
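A hand-built tree of this kind is just a chain of conditionals. Here's a minimal sketch in Python; the symptom names and rules are made up for illustration, not real clinical criteria:

```python
# A manually constructed "decision tree" is just nested conditionals.
# The symptoms and thresholds below are illustrative, not clinical guidance.

def should_screen(symptoms):
    """Return True if these hand-written rules recommend screening."""
    if symptoms.get("fever") and symptoms.get("dry_cough"):
        return True
    if symptoms.get("shortness_of_breath"):
        return True
    if symptoms.get("recent_travel") and symptoms.get("sore_throat"):
        return True
    return False

print(should_screen({"fever": True, "dry_cough": True}))  # True
print(should_screen({"sore_throat": True}))               # False
```

No training data required: a subject matter expert just writes the rules down.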
But sometimes the answer is not so obvious. Say you recently returned from Italy, and you are congested and short of breath, but you also have a heart condition and seasonal allergies. Should you be screened, given the scarcity of available test kits?
Manually defined decision trees have a much harder time managing this kind of complexity. So in this case, we should train the decision tree via supervised learning, using past questionnaires from patients who tested positive and from patients who tested negative.
The algorithm “learns” how to branch the tree by computing the “information gain” of each candidate split. Ideally, when we branch the decision tree, we want all the positive diagnoses to go one way and all the negative diagnoses to go the other. That won’t happen, of course, but we might get closer to that ideal, with, say, two-thirds of the positives going one way and one-third going the other. That movement toward purity is the information gain. And all we do is keep computing which attribute, split at which value, gives us the highest information gain, all the way down the tree.
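Information gain is typically measured as the drop in entropy from a parent node to its children. Here's a toy sketch of that computation (an illustration of the standard formula, not Verily's actual implementation):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

# The ideal split: all positives one way, all negatives the other.
parent = ["pos"] * 6 + ["neg"] * 6
print(information_gain(parent, ["pos"] * 6, ["neg"] * 6))  # 1.0

# The imperfect split from the text: two-thirds of the positives go left.
left = ["pos"] * 4 + ["neg"] * 2
right = ["pos"] * 2 + ["neg"] * 4
print(round(information_gain(parent, left, right), 3))  # 0.082
```

The training algorithm evaluates this quantity for every candidate attribute and split point, and picks the winner at each node.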
Typically, we’ll stop at some maximum depth, because we’ll overfit the model otherwise. That is, we’ll get a decision tree so specifically tuned to the training data that it won’t generalize to new data.
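Putting those two ideas together, a depth-limited greedy tree builder might look like this sketch (self-contained, with toy boolean features; illustrative only):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, depth=0, max_depth=3):
    """Greedily split on the boolean feature with the highest information
    gain, stopping at max_depth (or at a pure node) to avoid overfitting."""
    if depth == max_depth or len(set(labels)) <= 1:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    best_gain, best_feature = 0.0, None
    for feature in rows[0]:
        yes = [l for r, l in zip(rows, labels) if r[feature]]
        no = [l for r, l in zip(rows, labels) if not r[feature]]
        if not yes or not no:
            continue  # this split doesn't separate anything
        gain = (entropy(labels)
                - (len(yes) / len(labels)) * entropy(yes)
                - (len(no) / len(labels)) * entropy(no))
        if gain > best_gain:
            best_gain, best_feature = gain, feature
    if best_feature is None:
        return Counter(labels).most_common(1)[0][0]
    yes_idx = [i for i, r in enumerate(rows) if r[best_feature]]
    no_idx = [i for i, r in enumerate(rows) if not r[best_feature]]
    return {
        "feature": best_feature,
        "yes": build_tree([rows[i] for i in yes_idx],
                          [labels[i] for i in yes_idx], depth + 1, max_depth),
        "no": build_tree([rows[i] for i in no_idx],
                         [labels[i] for i in no_idx], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk the tree until we hit a leaf label."""
    while isinstance(tree, dict):
        tree = tree["yes"] if row[tree["feature"]] else tree["no"]
    return tree

rows = [{"fever": True, "cough": True}, {"fever": True, "cough": False},
        {"fever": False, "cough": True}, {"fever": False, "cough": False}]
labels = ["pos", "pos", "neg", "neg"]
tree = build_tree(rows, labels, max_depth=2)
print(predict(tree, {"fever": True, "cough": False}))  # pos
```

The `max_depth` check at the top of the recursion is the stopping rule described above.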
This trained decision tree now reflects the best information available to us in the sample data. Sort of. When you randomly selected a portion of the sample data to train the decision tree, you held out about 20% of your data for testing. It would be a good idea to resample your training data and re-train the decision tree. Probably several times.
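One common way to do that resampling is bootstrapping: each tree gets a same-size training set drawn with replacement. A sketch with toy data and hypothetical field names:

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Toy dataset: (questionnaire answers, test result) pairs.
data = [({"fever": i % 2 == 0, "cough": i % 3 == 0},
         "pos" if i < 6 else "neg") for i in range(10)]

# Hold out roughly 20% of the data for testing.
rng.shuffle(data)
test_set, train_set = data[:2], data[2:]

def bootstrap_sample(rows, rng):
    """Draw a training set of the same size, sampling with replacement."""
    return [rng.choice(rows) for _ in rows]

# Each resample would train one decision tree in the ensemble.
samples = [bootstrap_sample(train_set, rng) for _ in range(5)]
print(len(samples), len(samples[0]))  # 5 8
```

Because sampling is with replacement, each tree sees a slightly different slice of the data, which is exactly what makes the trees differ.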
At this point, you have several trees, which will differ in non-trivial ways. Each tree returns a result, which acts as a vote. The majority now decides whether you should get screened, and the size of that majority provides a level of confidence in that decision.
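In code, the vote and its confidence can be as simple as this sketch (the stand-in “trees” here are hard-coded rules; in a real forest they’d be trained classifiers):

```python
from collections import Counter

def ensemble_vote(trees, patient):
    """Each tree votes; the majority decides, and the vote share
    doubles as a rough confidence level."""
    votes = Counter(tree(patient) for tree in trees)
    decision, count = votes.most_common(1)[0]
    return decision, count / len(trees)

# Stand-in "trees": hard-coded rules instead of trained classifiers.
trees = [
    lambda p: "screen" if p["fever"] else "no_screen",
    lambda p: "screen" if p["cough"] else "no_screen",
    lambda p: "screen" if p["fever"] and p["cough"] else "no_screen",
]

decision, confidence = ensemble_vote(trees, {"fever": True, "cough": False})
print(decision, round(confidence, 2))  # no_screen 0.67
```

A 3-0 vote would come back with confidence 1.0; a 2-1 split, as here, with 0.67.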
The good news is that training this ensemble of decision trees (or forest, if you will) isn’t that difficult or computationally complex. The real challenge for a tool like this is ensuring the data is accurate, both when training the tool, and when it is in production. Thus, there are 3 primary design challenges for a company like Verily (subsidiary of Alphabet and sibling to Google) in building this tool:
- Gathering and extracting initial diagnostic data from existing health records
- Selecting the features to use in training the decision trees and then to incorporate into the questionnaire
- Structuring the diagnostic tool’s data-intake wording and question sequence to maximize the accuracy of self-reported data
Imagine you’re in a COVID-19 hotspot, and you have a sore throat and fever. You’re already convinced you need to be screened, so you may be inclined to answer every question in whatever way maximizes the likelihood you’ll be recommended for screening. Or perhaps you’re the opposite, and you’ll downplay and minimize symptoms to avoid being screened.
Big Idea: It’s likely Verily will use a decision forest to recommend screenings to users. However, the design challenges of this application, as well as those of any successful AI application, extend well beyond the training of the classification model.
I look forward to watching the roll-out of this tool from my Minnesota bunker constructed entirely from 2-ply, extra soft toilet paper.
About the Author: Eric Nelson
Eric Nelson is a Senior Software Engineer with MS3, Inc., living in Minneapolis, Minnesota with his wife Alisa, 7-year-old daughter Freyda, and 5-year-old son Arthur. Eric received his Bachelor of Science in Electrical Engineering from University of Minnesota and worked in photovoltaics and thin films. In 2015, he founded his own cloud software consulting firm where he trained dozens of student interns in software design and development skills. He is now focused on building large-scale, secure, futureproof production AI applications and software systems for smart brands as a member of the MS3 family.