What do spam email and HIV have in common? They’re examples of the range of problems that longtime Microsoft researcher Dr. David Heckerman has battled during his career — applying his background as an MD and his work in computer science to make advances in some surprising areas.
The spam battle dates back to 1997, when Heckerman received his first piece of junk email and decided to do something about it, setting him and his colleagues on a multi-year battle with spammers that resulted in sophisticated protections still used in Microsoft products today.
Some of the same machine learning approaches have also put Heckerman in the middle of the fight against HIV, helping to create a vaccine that promises to teach the immune system to attack the virus with far greater precision and efficiency. The work has been under way for years, but he and collaborators are moving closer to key tests of the vaccine.
Next up: advances in genomics, helping scientists make use of massive amounts of data.
Meet our new Geek of the Week, and continue reading for excerpts from our recent conversation with Heckerman on the Microsoft campus in Redmond.
What have you been working on recently?
The last several years, you could describe it as machine learning meets health/biology. We like to find problems where it’s high impact, important for society, and where there’s no solution yet. We like to come in, work with scientists and get a feel for what they’re doing, see where they’re lacking the tools to do what they want to do, and fill those gaps.
One of the areas you’ve been working on is human genomics. What’s your focus there?
There’s been an exponential growth in data in genomics because the cost and the time it takes to sequence the human genome, if you look at a curve, it’s dropping much, much faster than Moore’s Law did. Moore’s Law is impressive. This is even faster. A dozen years ago, we sequenced the first human genome for many billions of dollars, and it took years and years to do it. Now you can sequence a genome for $5,000 and it takes 24 hours or 48 hours. Beyond that, you don’t have to sequence the whole genome to learn a lot.
So there’s this massive amount of data that’s being collected, and one of the goals of using this data is personalized medicine. Using your genome to warn you that you might get a disease or let you know, don’t worry, it’s very unlikely you’ll get this disease, or saying, well, you’re the type of person that, if you were to exercise, you actually could lower your cholesterol, as opposed to, forget it, it’s not going to happen. To let you know whether a drug is going to work well for you, or have horrible side effects.
What can you do with all that data, and what are the challenges?
The general mechanism that’s being used to find these links between your genome and interesting things about you, it’s called genome-wide association studies. You take these millions of markers and you correlate them one or more than one at a time with some trait, like whether you’re going to get this disease or not. … But the signals are very weak. Any one signal, any one marker influences your trait very weakly. So what do you do? With machine learning, you just get more data. If you want to find weak signals, you need lots of data. Imagine collecting a set of 500,000 people to do a study like this, which is already under way right now. There are already people who have data sets with 100,000 people in them. But people are shooting for really big data sets.
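As a rough illustration of correlating markers with a trait one at a time, here is a minimal Python sketch. It is a simplification: real studies use regression or similar statistical tests rather than a bare correlation, and the function names and the toy data below are made up for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant marker or trait carries no signal
    return cov / sqrt(var_x * var_y)

def scan_markers(genotypes, trait):
    """Correlate each marker (column) with the trait, one at a time.

    genotypes: one row per person, one column per marker
               (0, 1 or 2 copies of a variant).
    Returns (marker_index, correlation) pairs, strongest first.
    """
    n_markers = len(genotypes[0])
    scores = []
    for j in range(n_markers):
        column = [row[j] for row in genotypes]
        scores.append((j, pearson(column, trait)))
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)

# Made-up toy data: 6 people, 3 markers; trait is 1 (disease) or 0 (no disease).
toy_genotypes = [
    [0, 2, 1],
    [1, 0, 1],
    [2, 1, 0],
    [0, 1, 2],
    [2, 2, 1],
    [1, 0, 0],
]
toy_trait = [0, 0, 1, 0, 1, 1]

ranked = scan_markers(toy_genotypes, toy_trait)
for marker, r in ranked:
    print(f"marker {marker}: correlation {r:+.3f}")
```

With only six people, even a strong-looking correlation could be chance, which is one way to see why weak signals call for very large data sets.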
What happens is, the data gets messy from a machine learning standpoint. You get these people that are closely related to each other. You have different ethnicities. And that leads to getting the wrong answer. But you run these standard statistical algorithms, and they say yes, there’s a link between this genomic marker and whether or not you’re going to get this disease, and it turns out to be completely wrong, because the data is dirty.
Fortunately, animal breeders and plant breeders had this problem a long time ago, before humans got into the business of doing genomics. And some clever mathematicians a long time ago figured out how to solve this problem. They invented what’s called a mixed model. And it’s great, it works, it solves the problem I just told you about, but there’s one catch — it’s super computationally expensive. So if you have N people in your data set, the amount of runtime it takes is N-cubed. So 500,000-cubed, that’s going to take a long time. You wouldn’t even try. There’s also a memory problem, you have to have N-squared memory to do this. It’s a big mess.
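To put rough numbers on that scaling, here is a back-of-the-envelope sketch in Python. It assumes exactly N-cubed basic steps and an N-by-N matrix of 8-byte numbers; the real constant factors are not given in the interview, so treat the output as an order of magnitude only.

```python
def mixed_model_cost(n_people, bytes_per_entry=8):
    """Rough scaling for a method needing about N^3 steps and an N x N matrix.

    bytes_per_entry=8 is an assumption (double-precision numbers).
    """
    operations = n_people ** 3
    memory_bytes = n_people ** 2 * bytes_per_entry
    return operations, memory_bytes

ops, mem = mixed_model_cost(500_000)
print(f"{ops:.3e} basic steps")                      # 1.250e+17 basic steps
print(f"{mem / 1e12:.1f} TB for the N x N matrix")   # 2.0 TB for the N x N matrix
```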
Well, we came along and figured out how to do it in linear time.
What does that mean?
Algebraic tricks, basically. Some clever machine learning things that were known by the machine learning community that we brought over into the genomics problem, plus some clever algebraic tricks, allowed us to make this problem fast, and doable, so we can now do these (genome-wide association studies) with 500,000 people. We’re gearing up to do that.
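The interview does not say which algebraic tricks were used. As a generic illustration of how algebra can turn an expensive N-by-N solve into a linear-time one, here is a Python sketch using the textbook Sherman-Morrison identity. It assumes, purely for illustration, that the N-by-N matrix has the special form delta*I + w*w^T with delta greater than zero. That structure and the names below are assumptions, not a description of the team's method.

```python
def solve_fast(w, delta, y):
    """Solve (delta*I + w w^T) x = y in O(N) time via Sherman-Morrison.

    Identity used:
        (delta*I + w w^T)^-1 y = (y - w * (w.y) / (delta + w.w)) / delta
    The N x N matrix is never built, so memory stays O(N) as well.
    """
    w_dot_y = sum(wi * yi for wi, yi in zip(w, y))
    w_dot_w = sum(wi * wi for wi in w)
    scale = w_dot_y / (delta + w_dot_w)
    return [(yi - wi * scale) / delta for wi, yi in zip(w, y)]

def apply_matrix(w, delta, x):
    """Multiply (delta*I + w w^T) by x, again without forming the matrix."""
    w_dot_x = sum(wi * xi for wi, xi in zip(w, x))
    return [delta * xi + wi * w_dot_x for wi, xi in zip(w, x)]

# Made-up example values, used to check that the shortcut really solves the system.
w = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 0.0, -1.0, 5.0]
delta = 0.5

x = solve_fast(w, delta, y)
residual = [a - b for a, b in zip(apply_matrix(w, delta, x), y)]
print(max(abs(r) for r in residual))  # should be close to zero
```

A general-purpose solver would treat this as an arbitrary N-by-N system and pay roughly N-cubed time for it; exploiting the structure is what brings the cost down to a few passes over N numbers.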
What are the next big problems that you want to solve?
I think sequencing weird organisms is a nice one in genomics.
Weird organisms? Like what?
An example is sugar cane. Sugar cane is a very good fuel. It’s much, much better than corn, if only you could grow it. Wouldn’t it be nice if you could allow it to grow in regions that are not just small regions in Brazil, for example, or Hawaii? And wouldn’t it be great if you could have it fight off viruses more effectively and get more energy yield out of it? How people do this now is they take random variants, this version of sugar cane and this version of sugar cane, and put them together to see what happens. That’s a long cycle. It’s a many-year cycle to see what works. So wouldn’t it be good to use the genome to be able to help you decide which ones to put together to make much faster progress?
How do you decide which problems to focus on?
If it’s important to society, I would look at it. These just come along. I talked to someone about sugar cane, they said, ‘We should do this, we don’t know the genome.’ So if something comes along that’s important, I’m willing to listen and see if we can help.
It seems like humans, armed with computer science, are making progress against these giant threats that we’ve been grappling with for centuries. Who’s going to win?
Oh, we’re going to win. All of these problems in science are derived from the laws of physics, which are fixed; they don’t get to change. And we’re clever and we can sit there and figure out what’s going on, and be clever about solving the problem. And Mother Nature can’t change. We’re human, and we’re very flexible. I think that gives us the advantage, fundamentally.
OK, let me ask a few of our traditional Geek of the Week questions …
Favorite app? It would have to be Excel. I use it every day. I couldn’t do my job without Excel.
Best game ever? Tennis. I love playing tennis, whenever I can squeeze it in.
Transporter, time machine, or cloak of invisibility? Oh, definitely not the cloak of invisibility. Definitely time machine. You could change everything.
[Photos of Heckerman courtesy Microsoft.]