Big data (and volunteers) help UW scientists solve protein puzzles

Protein diagrams: The molecular diagram at left is a representation of a protein molecule known as DMT superfamily transporter YddG, generated by Rosetta@Home software. The diagram at right is a representation of the molecule as determined by experiments. (Sergey Ovchinnikov et al. / UW via AAAS / Science)

Molecular biologists have enlisted cutting-edge trends in genomics and big data to get a grip on one of the grand challenges of biotech: figuring out how protein molecules fold.

But they couldn’t have done it without the help of tens of thousands of volunteers.

The fruits of all that crowdsourced computer labor went public today in the journal Science. Researchers from the University of Washington and other institutions say they’ve solved more than 600 protein-folding mysteries, or roughly 12 percent of the estimated 5,200 protein families whose molecular structure was unknown.

Still more solutions are in the works, and solving those puzzles could lead to new types of medicines and synthetic molecular machinery.

“Proteins are like little machines,” senior study author David Baker, director of the UW’s Institute for Protein Design, explained in an interview. Their molecular structure determines their function, just as the shape of a key determines which door it unlocks.

A protein with the right shape could open the way for restoring cellular functions, or block the spread of diseases ranging from Alzheimer’s to cancer.

Protein families are made up of large groups of individual proteins that serve specific functions in different organisms. One well-known example of a protein family is hemoglobin, which transports oxygen through the blood. Mice, whales and humans all use hemoglobin, but the protein’s structure varies from species to species.

In the past, the molecular shapes of the different protein families were worked out primarily through experimental methods such as nuclear magnetic resonance spectroscopy, or NMR. But Baker and his colleagues have been pioneering methods to simulate protein mechanics with software.

One of their tools is a distributed-computing platform called Rosetta@Home, which draws upon computer resources offered up by 1.2 million users. The screen-saver software crunches piles of data about chemical interactions to figure out the likeliest shape that a protein will take. More data leads to greater accuracy.

The data typically comes from analysis of genomic sequences from known organisms. But UW researcher Sergey Ovchinnikov took a different tack: He used a bigger batch of metagenome data, derived from microbial DNA harvested from humble sources such as pond water.

The researchers didn’t know exactly what kind of organism the protein sequences came from, but that didn’t matter. All that mattered was that they now had a database of 2 billion sequences to work from.

That database gave a boost to the Rosetta@Home puzzle-solving effort. First, the research team verified their results by checking them against the known structures of 27 large protein families. Then they generated models for 614 protein families with unknown structures.

Since those models were generated, the structures for five protein families were determined experimentally – and turned out to be similar to the structures that were predicted.

“If you had asked me 20 years ago how we’d solve the protein-folding problem, I never would have guessed it would be using sequence information on pond water and volunteers from around the world,” Baker told GeekWire.

Baker paid special tribute to the contributions of Rosetta@Home and Charity Engine volunteers, who are acknowledged in the Science paper. “Their contributions were absolutely essential, not only to this paper but to a large fraction of work they’ve done over the years,” he said.

There’s still plenty of work to do on the protein-folding front: Because the metagenome DNA came from microbes, the 614 protein families tend to deal with relatively basic functions. But researchers are working to expand their database of protein sequences and focus on more complex cell functions.

One of the big benefits of the Rosetta@Home system is the fact that surplus computing power is provided free of charge. Baker estimated that the cost of decoding the structure for a single protein family amounts to mere hundreds of dollars, compared to a cost of tens of thousands of dollars for more traditional methods.

“Our cost for computing a structure is really very small,” Baker said, “because it’s volunteer computing.”

In addition to Ovchinnikov and Baker, the authors of “Protein Structure Determination Using Metagenome Sequence Data” include Hahnbeom Park, Neha Varghese, Po-Ssu Huang, Georgios Pavlopoulos, David Kim, Hetunandan Kamisetty and Nikos Kyrpides.

To volunteer for Rosetta@Home, sign up at http://boinc.bakerlab.org.
