MS MARCO may sound like the name of a giant ocean liner (and, in fact, the MS Marco Polo does have a few voyages to its name).
But this story is not about cruising the open seas. It’s about a massive new dataset, dubbed MS MARCO, that Microsoft researchers are releasing. The name stands for Microsoft MAchine Reading COmprehension.
Microsoft announced plans today to release the dataset of 100,000 questions and answers, which could give artificial intelligence researchers new ways to develop methods for machines to “read and answer questions as well as humans do.”
The dataset is based on anonymized real-world data, and Microsoft said it hopes researchers will use it to propel the field of machine reading forward, much as image and speech recognition have advanced. The dataset is being made available for free.
“To move toward artificial general intelligence, we need to take a step toward being able to read a document and understand it as well as a person,” said Rangan Majumder, a program manager with Microsoft’s Bing search engine division who is leading the effort. “This is a step in that direction.”
Some of the questions and answers include: “In what type of circulation does the oxygenated blood flow between the heart and the cells of the body?” (answer: systemic) and “Will I qualify for OSAP (Ontario Student Assistance Program) if I’m new in Canada?” (answer: You must be a Canadian citizen, a permanent resident or a Protected person).
Real-world reading comprehension and question-answering “is an extremely challenging undertaking involving the amalgamation of multiple difficult tasks such as reading, processing, comprehending, inferencing/reasoning, and finally summarizing the answer,” notes a paper on MARCO released last month by Microsoft AI and Research.
Right now, Majumder said, systems that can answer sophisticated questions are still in their infancy.
Search engines like Bing and virtual assistants like Cortana and Siri can answer basic questions, like “What day does Hanukkah start?” or “What’s 2,000 times 43?” But in many cases, instead of answering directly, the search engines and virtual assistants will point the questioner to a set of search-engine results. Users can still get the information they need, but they must comb through the results and find the answer on a web page.
“Given much of the world’s knowledge is found in a written format, if we can get machines to be able to read and understand documents as well as humans, we can unlock all of these kinds of scenarios,” Majumder said.
To make automated question-and-answer systems better, researchers need a strong source of what is called training data. The hope is that datasets like MS MARCO can be used to teach artificial intelligence systems to recognize questions and formulate answers and, eventually, to create systems that can come up with their own answers to questions they haven’t seen before.
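For readers who want a concrete picture of training data, the following is a minimal Python sketch and not Microsoft's code. It assumes a hypothetical record layout (`QAPair`) and hypothetical helpers (`split_pairs`, `exact_match`), none of which reflect the actual MS MARCO file format. It shows the basic idea of setting some question-and-answer pairs aside so a system can later be checked on questions it was not trained on.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QAPair:
    # Hypothetical record layout for illustration only;
    # this is not the actual MS MARCO file format.
    question: str
    answer: str


def split_pairs(pairs, held_out_fraction=0.5, seed=0):
    """Shuffle the pairs and split them into (training, held-out) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # Always keep at least one pair aside for checking unseen questions.
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


def exact_match(predicted, expected):
    """Return True when two answers match, ignoring case and outer spaces."""
    return predicted.strip().lower() == expected.strip().lower()


# The two example pairs quoted earlier in the article.
pairs = [
    QAPair(
        "In what type of circulation does the oxygenated blood flow "
        "between the heart and the cells of the body?",
        "systemic",
    ),
    QAPair(
        "Will I qualify for OSAP (Ontario Student Assistance Program) "
        "if I'm new in Canada?",
        "You must be a Canadian citizen, a permanent resident "
        "or a Protected person",
    ),
]

train, held_out = split_pairs(pairs)
```

A system would learn from the pairs in `train`, and its answers to the questions in `held_out` could then be compared with the stored answers using something like `exact_match`. Real evaluations of free-form answers typically need more forgiving comparisons than an exact string match.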