Rohit Prasad, senior vice president and head scientist for the Amazon Alexa team. (GeekWire File Photo)

When Amazon’s first Alexa-enabled smart speaker debuted in 2014, it was something of a novelty: a voice-activated, natural-language interface that could perform a handful of simple tasks.

Fast forward to today, and the internet-connected platform has rapidly expanded and become its own electronic ecosystem. With tens of thousands of Alexa-enabled devices available and hundreds of millions of units sold, Alexa has become nearly ubiquitous as a virtual assistant.

But while Alexa is now incorporated in everything from televisions to microwaves to earphones, Amazon’s vision of ambient computing is still very much in its infancy. Though tremendous advances in natural language processing and other areas of artificial intelligence have been made to serve a potential market of billions of users, there’s still much room for improvement.

Looking ahead, Amazon wants to eventually make these devices capable of understanding and supporting users nearly as well as a human assistant does. But in order to do this, significant advances need to be made in several areas, including contextual decision making and reasoning.

To take a deeper dive into the potential of Alexa and ambient computing in general, I asked Senior Vice President and Head Scientist for Alexa Rohit Prasad about the platform’s future and what Amazon’s goals are for the increasingly intelligent virtual assistant platform.

Richard Yonck: Alexa is sometimes referred to as “ambient computing.” What are a few examples or use cases for ambient AI?

Rohit Prasad: Ambient computing is technology that’s there when you need it, and fades into the background when you don’t. It anticipates your needs and makes life easier by always being available without being intrusive. For example, with Alexa you can use Routines to automate your home, like turning on your lights at sunset, or you can use Alexa Guard to have Alexa proactively notify you if it detects sounds like glass breaking or a smoke alarm.
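For readers who like to see the mechanics, an automation like that sunset routine boils down to a declarative trigger-and-action rule evaluated in the background. The sketch below is purely illustrative, written in Python; it is not the Alexa Routines schema.

```python
from datetime import time

# A purely illustrative rule format; not the Alexa Routines schema.
routine = {
    "trigger": {"type": "sunset", "at": time(19, 42)},  # sunset time resolved daily
    "actions": [{"device": "living_room_lights", "command": "turn_on"}],
}

def due_actions(rule: dict, now: time) -> list:
    """Return the actions to run once the trigger time has passed."""
    return rule["actions"] if now >= rule["trigger"]["at"] else []

print(due_actions(routine, time(20, 0)))   # after sunset: the turn_on action
print(due_actions(routine, time(12, 0)))   # midday: nothing to do
```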

Yonck: During your recent CogX presentation, you mentioned Alexa “getting into reasoning and autonomy on your behalf.” What are some near-future examples of this compared to where we are right now?

Prasad: Today, we have features like Hunches, with Alexa suggesting actions to take in response to anomalous sensor data, from alerting you that the garage door is open when you’re going to bed, to convenient reordering when your printer ink is low. More recently, owners of a Ring Video Doorbell Pro can choose to have Alexa act on their behalf, greeting visitors and offering to take a message or providing directions for package deliveries.

Overall, we’ve progressed to more contextual decision making and taken initial strides in reasoning and autonomy via self-learning, or Alexa’s ability to improve and expand its capabilities without human intervention. Last year, we took another step with a new Alexa capability that can infer a customer’s latent goal. Let’s say a customer asks for the weather at the beach; Alexa might use the request, in combination with other contextual information, to infer that the customer may be interested in a trip to the beach.
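To illustrate the idea of latent-goal inference, here is a toy Python sketch; the rules and contextual signals are invented for illustration and are not Amazon's actual models.

```python
# Toy latent-goal inference: the explicit request plus contextual signals
# suggest a broader goal and a proactive follow-up. All rules and signal
# names here are invented for illustration.
def infer_latent_goal(request: str, context: dict):
    if "weather" in request and "beach" in request:
        # Stated goal: a forecast. Possible latent goal: a beach trip.
        if context.get("weekend_upcoming") and context.get("past_beach_trips", 0) > 0:
            return "plan_beach_trip", "Would you like directions to the beach?"
    return None, None

goal, follow_up = infer_latent_goal(
    "what's the weather at the beach this weekend",
    {"weekend_upcoming": True, "past_beach_trips": 2},
)
print(goal, "->", follow_up)
```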

The new Echo Show 10. (Amazon Photo)

Yonck: Edge computing is a means of performing a portion of the computation on or near the device rather than in the cloud. Do you think enough of Alexa’s processing can eventually be done at the edge to sufficiently reduce latency, support federated learning and address privacy concerns?

Prasad: From the moment we introduced Echo and Alexa in 2014, our approach has combined processing in the cloud, on device, and at the edge. The relationship is symbiotic. Where the computing occurs will depend on several factors including connectivity, latency, and customer privacy.

As an example, we understood that customers would want basic capabilities to work even if they happen to lose network connectivity. As a result, in 2018 we launched a hybrid mode where smart home intents, including controlling lights and switches, would continue to work even when connectivity was lost. This also applies to taking Alexa on the go, including in the car where connectivity can be intermittent.
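A rough sketch of that hybrid routing logic is below; the intent names are invented and this is only an illustration of the offline-fallback idea, not Amazon's architecture.

```python
# Simplified hybrid routing: smart home intents fall back to a local handler
# when the cloud is unreachable, so lights and switches keep working offline.
# Intent names are invented for illustration.
LOCAL_INTENTS = {"TurnOnLight", "TurnOffLight", "ToggleSwitch"}

def handle_intent(intent: str, cloud_available: bool) -> str:
    if cloud_available:
        return f"cloud: handled {intent} with full context"
    if intent in LOCAL_INTENTS:
        return f"device: handled {intent} locally (offline fallback)"
    return "device: that request needs a connection"

print(handle_intent("TurnOnLight", cloud_available=False))
print(handle_intent("PlayMusic", cloud_available=False))
```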

In recent years, we’ve pursued various techniques to make neural networks efficient enough to run on-device, minimizing the memory and compute footprint without losing accuracy. Now, with neural accelerators like our AZ1 Neural Edge processor, we are pioneering new experiences such as natural turn-taking, a feature coming to customers this year that uses on-device algorithms to fuse acoustic and visual cues and infer whether participants in a conversation are interacting with each other or with Alexa.
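As a rough picture of what "fusing acoustic and visual cues" can look like, here is a minimal late-fusion classifier in PyTorch; the feature dimensions and architecture are assumptions for illustration, not the AZ1 pipeline.

```python
import torch
import torch.nn as nn

# Minimal late-fusion classifier: project acoustic and visual features,
# concatenate them, and score whether speech is addressed to the device.
# Dimensions and architecture are assumptions for illustration.
class TurnTakingFusion(nn.Module):
    def __init__(self, acoustic_dim=40, visual_dim=128, hidden=64):
        super().__init__()
        self.acoustic_proj = nn.Linear(acoustic_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.classifier = nn.Linear(2 * hidden, 1)

    def forward(self, acoustic, visual):
        a = torch.relu(self.acoustic_proj(acoustic))
        v = torch.relu(self.visual_proj(visual))
        fused = torch.cat([a, v], dim=-1)
        return torch.sigmoid(self.classifier(fused))  # P(device-directed speech)

model = TurnTakingFusion()
score = model(torch.randn(1, 40), torch.randn(1, 128))
print(f"device-directed probability: {score.item():.2f}")
```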

Yonck: You’ve described several features we need in our social bots and task bots in your AI Pillars for the Future. Can you share projected timelines for any of these, even if they’re broad ones?

Prasad: Open-domain, multi-turn conversations remain an unsolved problem. However, I’m pleased to see students in academia advancing conversational AI through the Alexa Prize competition tracks. Participating teams have improved the state of the art by developing better natural language understanding and dialogue policies, leading to more engaging conversations. Some have even worked on recognizing humor and generating humorous responses or selecting contextually relevant jokes.

These are hard AI problems that will take time to solve. While I believe we’re 5-to-10 years out in achieving the goals of these challenges, one area I’m particularly excited about in conversational AI is where the Alexa team recently received a best-paper award: infusing commonsense knowledge graphs explicitly and implicitly into large pre-trained language models to give machines greater intelligence. Such work will make Alexa more intuitive and intelligent for our customers.
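One simple way to picture "explicit" knowledge infusion is to retrieve commonsense triples tied to entities in an utterance and prepend them to the text a pre-trained language model would consume. The toy sketch below is my own illustration of that idea, not the award-winning method, and the tiny knowledge graph is invented.

```python
# Toy "explicit" knowledge infusion: look up commonsense triples for entities
# mentioned in the utterance and prepend them to the input a pre-trained
# language model would consume. The tiny knowledge graph is invented.
TOY_KG = {
    "umbrella": [("umbrella", "UsedFor", "staying dry")],
    "rain": [("rain", "Causes", "wet ground")],
}

def infuse_commonsense(utterance: str) -> str:
    triples = []
    for entity, facts in TOY_KG.items():
        if entity in utterance.lower():
            triples.extend(facts)
    knowledge = " ".join(f"<{s} {r} {o}>" for s, r, o in triples)
    return f"{knowledge} [SEP] {utterance}" if triples else utterance

print(infuse_commonsense("It looks like rain. Should I bring an umbrella?"))
```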


Yonck: For open domain conversations, you mentioned combining transformer-based neural response generators with knowledge selection to generate more engaging responses. Very briefly, how is the knowledge selection performed?

Prasad: We’re pushing the boundaries with open domain conversations, including as part of the Alexa Prize SocialBot Challenge where we continually invent for the participating university teams. One such innovation is a neural-transformer-based language generator (i.e., neural response generator or NRG). We have extended NRG to generate even better responses by integrating a dialogue policy and fusing world knowledge. The policy determines the optimal form of the response — for example, where appropriate, the next turn from the AI should acknowledge the previous turn and then ask a question. For integrating knowledge, we are indexing publicly available knowledge on the web, and retrieving sentences that are the most relevant to the dialogue context. NRG’s goal is to produce optimal responses that conform to the policy decision and incorporate the selected knowledge.
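To make "retrieving sentences that are the most relevant to the dialogue context" concrete, here is a bare-bones relevance ranking using TF-IDF similarity; Amazon's retrieval stack is certainly more sophisticated, and the candidate sentences below are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Score indexed knowledge sentences against the dialogue context and keep
# the best match to hand to the response generator. Candidates are invented.
candidates = [
    "The Mariners play at T-Mobile Park in Seattle.",
    "Baseball games typically last about three hours.",
    "Seattle averages around 150 rainy days per year.",
]
dialogue_context = "I'm thinking about going to a Mariners game this weekend."

matrix = TfidfVectorizer().fit_transform(candidates + [dialogue_context])
scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()

best = scores.argmax()
print(candidates[best], f"(relevance {scores[best]:.2f})")
```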

Yonck: For naturalness, you ideally want a large contextual basis for conversations: learning, storing and having access to a huge amount of personal information and preferences in order to provide each user uniquely personalized responses. This feels very compute- and storage-intensive. Where is Amazon’s hardware now relative to where it will need to be to eventually achieve this?

Prasad: This is where processing on the edge comes into play. To provide the best customer experience, certain processing — such as computer vision for figuring out who in the room is addressing the device — has to be done locally. This is an active area of research and invention, and our teams are working diligently to make machine learning — both inference and model updates — more efficient on the device. In particular, I am excited about large pre-trained deep learning-based models that can be distilled for efficient processing on the edge.
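Knowledge distillation is one common way (among several) to shrink a large pre-trained model for on-device inference. The compact sketch below shows the standard student-teacher loss as an illustration of the general technique, not Amazon's specific recipe.

```python
import torch
import torch.nn.functional as F

# Standard student-teacher distillation loss: a small "student" learns to
# match the teacher's temperature-softened distribution plus the true labels.
def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

loss = distillation_loss(torch.randn(8, 10), torch.randn(8, 10), torch.randint(0, 10, (8,)))
print(loss.item())
```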

Yonck: What do you think is the greatest challenge in achieving fully developed ambient AI as you’ve described?

Prasad: The greatest challenge for achieving our vision is moving from reactive responses to proactive assistance, where Alexa is able to detect anomalies and alert you (e.g., a hunch that you left the garage door open) or anticipate your needs to complete your latent goals. While AIs can be pre-programmed for such proactive assistance, doing so will not scale given the myriad of use cases.

Therefore, we need to move towards more general intelligence, which is the ability for an AI to: 1) perform multiple tasks without requiring significant task-specific intelligence, 2) self-adapt to variability within a set of known tasks, and 3) learn completely new tasks.

In Alexa’s context, that means being more self-learning, without requiring human supervision; more self-service, by making it easier to integrate Alexa into new devices, dramatically reducing the onus on developers to build conversational experiences, and even enabling customers to customize Alexa and directly teach it new concepts and personal preferences; and more self-aware of the ambient state, to proactively anticipate customer needs and seamlessly assist them.
