Four years after the late Seattle billionaire Paul Allen challenged researchers to come up with an artificial intelligence program smart enough to pass an eighth-grade science test, that feat has been declared accomplished — by the hometown team.
The Allen Institute for Artificial Intelligence, or AI2, announced today that its Aristo software scored better than 90% on a multiple-choice test geared for eighth graders, and better than 80% on a test for high school seniors.
There are caveats, of course: The exam, which was based on New York Regents aptitude tests, excluded questions that depended on interpreting pictures or diagrams. Those questions would have required visual interpretation skills that aren't yet programmed into Aristo. Direct-answer questions — those requiring a written response, such as essay questions, rather than a multiple-choice selection — were also left out. And for what it's worth, Aristo would be useless outside the areas of science in which it was trained.
Nevertheless, the exercise illustrated how far AI has come just since 2016, when all of the programs competing in the $80,000 Allen AI Science Challenge flunked.
“This is a breakthrough because it’s a remarkable result on standardized test questions which require a degree of natural language understanding, reasoning, and even common sense,” AI2 CEO Oren Etzioni told GeekWire in an email. “This is very different from both standard research benchmarks and board games like Go. Even a year ago, no one would have anticipated such rapid progress on 8th and 12th grade science questions!”
The work builds on a series of language-interpreting, question-answering AI agents including AI2’s ELMo program and the BERT program developed at Google’s research facility in Seattle. Aristo takes advantage of eight types of problem-solving agents — ranging from an agent that merely looks up answers in a database, to an agent that checks lists of associated concepts (known as tuples), to an agent that performs qualitative reasoning.
Each problem-solver produces a confidence score for each multiple-choice answer, and Aristo weights the different scores to select the most likely choice. The program optimizes its performance through rounds of training and calibration.
For example, one question asks: “How are the particles in a block of iron affected when the block is melted? (A) The particles gain mass. (B) The particles contain less energy. (C) The particles move more rapidly. (D) The particles increase in volume.”
To answer the question, Aristo retrieves the knowledge that particles move faster as heat increases, associates the term "melted" with "heat," associates the term "faster" with "more rapidly," and scores C as the correct choice.
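The combination step can be sketched in a few lines. This is not AI2's actual code — the solver names, the toy score vectors, and the weights below are all invented for illustration — but it shows the general shape: each solver emits a score per answer option, and per-solver weights (tuned during calibration) determine the final choice.

```python
def combine(solver_scores, weights):
    """Weighted sum of per-solver score vectors; returns (best index, totals)."""
    n_options = len(next(iter(solver_scores.values())))
    totals = [0.0] * n_options
    for name, scores in solver_scores.items():
        w = weights[name]
        for i, score in enumerate(scores):
            totals[i] += w * score
    best = max(range(n_options), key=totals.__getitem__)
    return best, totals

# Hypothetical per-option scores (A-D) for the iron-particles question
# from three of Aristo's solver types (values invented for illustration):
solver_scores = {
    "lookup":    [0.1, 0.1, 0.6, 0.2],
    "tuples":    [0.2, 0.1, 0.5, 0.2],
    "reasoning": [0.0, 0.2, 0.7, 0.1],
}
# Weights would come from calibration on training questions:
weights = {"lookup": 0.5, "tuples": 0.3, "reasoning": 1.0}

best, totals = combine(solver_scores, weights)
print("ABCD"[best])  # prints "C"
```

In the real system the eight solvers are far more sophisticated, and the calibration is learned rather than hand-set, but the weighted-vote structure is the same.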
Combining different problem-solving approaches cleared the way for Aristo to raise its test score from roughly 60% in 2016 to 91.6% on the eighth-grade test. The program scored nearly as well, 83.5%, on the 12th-grade exam.
In a research paper about the project, Etzioni and other AI2 researchers — including Peter Clark, senior manager for Project Aristo — say the program’s passing grade “is only a step on the long road toward a machine that has a deep understanding of science and achieves Paul Allen’s original dream of a Digital Aristotle.”
The researchers are aiming to extend Aristo’s skills to encompass diagram-based questions and essay questions. Eventually, the technology should advance the state of the art when it comes to providing natural-language answers to questions that would tax the brains of grown-ups as well as eighth-graders.
That’s likely to lead to digital assistants that are smarter than the current iterations of Amazon’s Alexa, Microsoft’s Cortana and Apple’s Siri — as well as a whole new wave of AI applications and startups.
In separate emails, Etzioni and Clark both paid tribute to Paul Allen, who passed away last October at the age of 65. And they both said he’d want more.
“Paul would’ve been very pleased, but wouldn’t let us rest on our laurels,” Etzioni told GeekWire. “He would ask: What’s your next major step towards understanding of language?”
Clark agreed: "I would imagine him saying 'Congratulations! What's next?'"
Update for 1:25 p.m. PT Sept. 4: I sent Clark some follow-up questions via email, and here are a few answers that expand upon the significance of the research. The Q&A has been edited for brevity and clarity (especially for the Q’s):
GeekWire: How is this approach different from IBM’s Watson? If Aristo were to compete against Watson, who would win?
Clark: “The two systems were designed for very different kinds of questions. Watson was focused on encyclopedia-style ‘factoid’ questions where the answer was explicitly written down somewhere in text, typically many times. In contrast, Aristo answers science questions where the answer is not always written down somewhere, and may involve reasoning about a scenario, e.g.:
- “Otto pushed a toy car across a floor. The car traveled fast across the wood, but it slowed to a stop on the carpet. Which best explains what happened when the car reached the carpet? (A) Friction increased (B) Friction decreased…”
- “City administrators can encourage energy conservation by (1) lowering parking fees (2) building larger parking lots (3) decreasing the cost of gasoline (4) lowering the cost of bus and subway fares.”
“Out of the box, Watson would likely struggle with science questions, and Aristo would struggle with the cryptic way that ‘Jeopardy’ questions were phrased. They’d each fail each other’s test.
“Under the hood they are quite different too. In particular, Watson didn’t use deep learning (it was created before the deep learning technology) while Aristo makes heavy use of deep learning. Watson had many modules that tried different ways of looking for the answer. Aristo has a few (eight) modules that try a variety of methods of answering questions, including lookup, several reasoning methods and language modeling.”
Q: Please pass along the usual caveats. For example, questions with pictures were not used because they’d require computer vision. Any other caveats?
A: "Aristo isn't able to handle questions with diagrams very well except in a few special cases. For instance, Aristo can answer questions about food chains, but it can't answer those that require reading a map or studying a bar chart. It also has difficulty dealing with hypothetical situations. For example, Aristo struggles with the following question: 'If you pull the leaves off a plant, what would the result be?' A good answer would be that the plant would no longer be able to make its own food. But Aristo struggles with this question because it requires the system to create an imaginary world and imagine what would happen in that world. Finally, our benchmark is a multiple choice test, another limitation."
Q: Could you say a bit about the potential applications? Do you see a “question-answering” program like Watson, or do you see more novel applications?
A: “Aristo’s long-term goal is not just about passing science tests, it’s about creating a system that has a deeper understanding of science, with many potential applications. There are three areas in particular that seem promising. The first is in the area of education and personalized education, where Aristo could help a child understand science by providing personal tutoring. The second is in helping scientists. I can imagine Aristo offering background information on scientific concepts and prior work to a scientist in a laboratory. Finally, longer term, Aristo might help in scientific discovery itself, connecting the dots where people haven’t been able to in the past, in areas such as medicine or engineering. Aristo currently has a long way to go to reach these goals, of course, but performing so well on the Regents Science exam is a tremendous step forward.”
Authors of the AI2 team’s paper, “From ‘F’ to ‘A’ on the N.Y. Regents Science Exam: An Overview of the Aristo Project,” include Clark and Etzioni as well as Tushar Khot, Bhavana Dalvi Mishra, Kyle Richardson, Ashish Sabharwal, Carissa Schoenick, Oyvind Tafjord, Niket Tandon, Sumithra Bhakthavatsalam, Dirk Groeneveld and Michal Guerquin.