What can online searches reveal about a potential cancer diagnosis? Scientists at Microsoft are interested in finding out, and as they analyzed the data of millions of people, lead researcher Eric Horvitz had two friends at the top of his mind.
Horvitz was catching up with his friend Ronald (Ronnie) Nadel on Dec. 21, 2004, when Nadel mentioned some odd symptoms he was experiencing. Horvitz, a doctor himself, told his friend to see a physician about the symptoms. Less than 1 year later, Nadel passed away from pancreatic cancer. The following year, Horvitz lost another friend, Richard Newton, to pancreatic cancer just a few months after he was diagnosed.
“I had first experienced the challenges of diagnosis and treatment of pancreatic cancer as a medical student at Stanford University,” explained Horvitz, a technical fellow and the managing director of the Microsoft Research lab, in a recent interview. “However, the challenges of catching this devastating illness early hit home with two friends.”
Less than a decade later, Horvitz started a study—along with colleagues Ryen White, an information retrieval expert at Microsoft Research, and John Paparrizos, a graduate student at Columbia University and an intern at Microsoft at the time—aimed at identifying pancreatic cancer earlier.
Their study first analyzed the Bing.com search logs of 9.2 million individuals to identify those who had recently been diagnosed with pancreatic cancer. The research team then looked at data from the previous 18 months to find symptom patterns, expressed as searches, and was able to identify 5% to 15% of pancreatic cancer cases.
“[My friend’s death] was a big motivation, I think, to take this disease head-on and see if we could make a dent in it,” Horvitz said. In this interview, Horvitz discusses highlights of his study, what motivates him, and what’s ahead.
Why did you want to conduct this study on pancreatic cancer?
I've done quite a bit of work with my colleague here at Microsoft Research, Ryen White. White has been a powerhouse in multiple topics in information retrieval. I’m interested in artificial intelligence and leveraging large-scale data. My interest and intuitions about challenges in healthcare go back to my experiences doing an MD and PhD.
We're curious about things like web use among people who become ill, how the web works for diagnoses—does it worry people? Does it help them? Once someone is diagnosed with a challenging illness like breast cancer, how well does web search support them in episodes of treatment, recovery, and recurrence, if that happens?
As part of all this work, we have to discriminate between experiential queries and exploratory queries. With experiential queries, there is strong evidence that someone has just been diagnosed with an illness. We looked at these types of queries and saw evidence of users pursuing help with recent diagnoses. One direction I’ve been interested in is detecting and helping people with devastating diseases. The ones that come to mind here are lung cancer and pancreatic cancer. In medical school, I learned that by the time a lung cancer case reached the hospital, it was almost always too late for surgery. For pancreatic cancer, that remains the case today.
I was talking with one of my closest friends on the phone—he saw me discussing artificial intelligence with Charlie Rose on his television show—and we were chatting about that. He mentioned as an aside that he had odd symptoms that were bothering him a bit. I asked him a few questions and told him that I didn't want to alarm him, but that he should get himself checked out. Within a few weeks, he was diagnosed with pancreatic cancer and he wasn't even 45 yet at the time.
An illness like pancreatic cancer, especially when it is caught before it metastasizes, tends to show up through nonspecific symptoms: strange back pain, abdominal pain, light-colored stools, and general itchiness, for example. Each symptom taken separately may not alarm someone enough to run to a doctor, and physicians may not react with deep concern. We thought, though, if we had evidence from thousands of patients who were diagnosed, and we could go back in time 18 months from when they were diagnosed, we might be able to see information in the order and accrual of symptoms as reported by people on the web. We thought that we might be able to use subtle clues over time to make inferences.
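Editor's illustration: the look-back idea described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method used in the study. The symptom phrases come from the examples in the interview, while the function names, the simple substring matching, and the data layout are hypothetical.

```python
from datetime import date

# Hypothetical symptom phrases, based on the examples mentioned in the interview.
SYMPTOM_TERMS = ("back pain", "abdominal pain", "light-colored stool", "itch")


def months_before(d, months):
    """Return the date `months` calendar months before `d` (day clamped to 28)."""
    total = d.year * 12 + (d.month - 1) - months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def symptom_timeline(queries, diagnosis_date, lookback_months=18):
    """List (date, symptom) pairs for symptom-like queries in the look-back window.

    `queries` is an iterable of (date, query_text) pairs for one anonymized
    searcher; `diagnosis_date` is the date of the first query suggesting a
    diagnosis. Both are assumed inputs for this sketch.
    """
    start = months_before(diagnosis_date, lookback_months)
    timeline = []
    for when, text in sorted(queries):
        # Keep only queries inside the window that precedes the diagnosis.
        if start <= when < diagnosis_date:
            for term in SYMPTOM_TERMS:
                if term in text.lower():
                    timeline.append((when, term))
    return timeline
```

The ordered timeline is the kind of input a classifier could use to weigh the order and accrual of symptoms over time. A real system would need far more careful query interpretation than substring matching.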
The answer was yes, the web can show us clues and patterns. I was surprised by how well we could discriminate between the web searches of people who were diagnosed and those of people who were not, though there is still a false positive rate to deal with. It was a feasibility study and we labeled it as such, but it shows us something about the power, possibilities, and methods in this area.
Your personal connection to pancreatic cancer—how big of a motivation was it to do this study?
It was at the forefront of my mind. What I found stunning was how quickly the disease progressed from diagnosis to death. I remember talking to my friend's surgeon and he said, ‘We opened him up and looked, and he's a very unfortunate young man.’
It was a big motivation, I think, to take this disease head-on and see if we could make a dent in it. I sent a copy of the research article to his wife and his sister, both of whom I've been in touch with, and I said it was dedicated to him, to my friend whom I met in second grade.
In the study, the identities of individuals were kept anonymous. Why was this important to you and how did you do it?
We do a lot of work with anonymized logs of user data. Companies like Google, Microsoft and Apple have access to user data and abide by strict policies to keep it safe.
Our research labs have access to this anonymized data. There's no naming information—just a random identifier assigned to logs.
Our findings led to questions about how this technology might one day be fielded. We have many ideas about that. We could enable an opt-in system, so people could ask to have access to a health and wellness suite of applications. If someone did that, they'd give explicit permission to be monitored and alerted. Another compelling use would be to build classifiers or automatic systems that would then be deployed in the privacy of one's own laptop or smartphone, sharing nothing with anybody but still drawing on the intelligence accrued from studies of large-scale populations of searchers.
What are some of the next steps? What do you hope to achieve going forward?
There are a couple of steps. First, we have to think about ways to deploy the technology. We also have to think about other challenging screening problems and opportunities to do screening when it would really help.
We're interested in lung cancer, in particular. It's not just about finding new ways to identify people earlier and get them to treatment earlier; it's also about finding new kinds of symptoms and demographics and other kinds of observations to extend clinical medicine. Wouldn't it be nice if the results from our studies gave clinical care new avenues for screening that wouldn't even require search engines or search logs? In a way, we would be directing screening policies.
Another direction I'm very excited about is taking this whole paradigm to the next step. Ryen White and I have been talking about working with oncologists to do a patient-centric study. With patient approval in an actual research/review setting, we can get volunteers who have just been diagnosed to fill out a form that would tell us who they are and reveal to us their web search logs. We could then look back 18 months at those logs, alongside the electronic health record, and link the record to how the person had been using the internet to search for information. This is in the works; we've had conversations with clinical colleagues.
What are the cost implications of this work?
It costs money to get people to come to a screening, and it costs money to actually screen individuals. The idea of web searches working as a background observer would lower the cost of screening and raise compliance. The concern, of course, is false positives. You're casting a wider net, and even with a low false positive rate, you'd potentially end up with many people who are being told to get something checked out when it's really nothing.
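Editor's illustration: the false positive concern is easy to see with a little arithmetic. The sketch below is not drawn from the study. The population size, number of cases, and false positive rate are hypothetical, and the 10% detection rate is simply a value inside the 5% to 15% range reported above.

```python
def screening_outcomes(population, cases, sensitivity, false_positive_rate):
    """Expected counts when a screen is applied to a whole population.

    All four inputs are illustrative assumptions, not results from the study.
    """
    true_positives = cases * sensitivity
    false_positives = (population - cases) * false_positive_rate
    flagged = true_positives + false_positives
    # Share of flagged people who actually have the disease.
    precision = true_positives / flagged if flagged else 0.0
    return true_positives, false_positives, precision


# Hypothetical scenario: 1,000,000 searchers, 100 of whom have the disease,
# a screen that catches 10% of cases, and a 0.1% false positive rate.
tp, fp, precision = screening_outcomes(1_000_000, 100, 0.10, 0.001)
print(f"true alerts: {tp:.0f}, false alarms: {fp:.0f}, precision: {precision:.1%}")
```

Under these assumed numbers, about ten true alerts would come with roughly a thousand false alarms, so only about 1% of the people alerted would actually have the disease. That is why even a low false positive rate matters when the net is cast widely.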
What were the limitations of the study?
A key limitation is the lack of ground truth about patients' diagnoses. We're excited about the possibility of collaborating with clinical colleagues and working with electronic health records. We have evidence that the experiential diagnoses we see in logs are actual diagnoses, but we'd love to have ground truth and to consider the timing, background, and other details in the clinical record.
Your own motivation behind wanting to do this research is understandable. As a corporation, what is Microsoft's reason for supporting this research?
Microsoft Research is a leading computer science research and development laboratory. We're charged with looking to the future and pushing the frontiers of computer science. That's our mission and this research is part of that work.
As we develop ideas for research, we always work with the company and product teams to ask where we can go with Microsoft products and services.
For example, we'd love to have Bing search be the place for top-notch, reliable healthcare information. This work can be viewed as part of making our services better for people in the future.
How would you address the worries about hacking and data loss, with regard to both electronic health records and search data?
We don't deal with any identity information. In general, though, Microsoft is very serious about information security. Microsoft Azure Cloud Services are HIPAA-compliant.
There's other work going on in another group right now about how to do data analysis and machine learning with actual medical records and encrypted data.
I think we'll see many solutions coming out of our research labs that reduce the chances a patient will lose healthcare data. Data security and healthcare cybersecurity are top-priority challenges for the whole industry.