[Image: This sequence of images shows how a drawing bot gradually creates an image of a bird with a yellow body, black wings and a short beak. (Tao Xu et al. via Microsoft)]

Want to order up a drawing? Say you want a picture of a bird with a yellow crown and black rings around its eyes. Or a still life showing swirls of pasta with broccoli, carrots and onions.

Now there’s a bot for that.

Researchers have enlisted artificial intelligence tools, including computer vision and natural language processing, to program a “drawing bot” that can create a picture from the ground up, based merely on a descriptive caption.

“If you go to Bing and you search for a bird, you get a bird picture. But here, the pictures are created by the computer, pixel by pixel, from scratch,” Microsoft researcher Xiaodong He said in a report on the project. “These birds may not exist in the real world — they are just an aspect of our computer’s imagination of birds.”

He has been exploring the creative frontiers of artificial intelligence for several years as a principal researcher and research manager at the Deep Learning Technology Center at Microsoft’s research lab in Redmond, Wash.

He and his colleagues started out with CaptionBot, a program that can write photo captions automatically, based on its analysis of an image. Then they developed Seeing AI, a software tool that can tell blind and visually impaired people what’s happening in a visual scene.

“Now we want to use the text to generate the image, so it is a cycle,” said Qiuyuan Huang, a postdoctoral researcher in He’s group at Microsoft.

Generating an image from a text caption is tougher than generating a caption from an image, for the simple reason that the artist has to fill in the details the caption doesn’t describe.

“That means you need your machine learning algorithms running your artificial intelligence to imagine some missing parts of the images,” said Pengchuan Zhang, an associate researcher on He’s team.

The team relied on what’s known as a generative adversarial network, or GAN. One computer model generates the image, based on learned linkages between descriptive terms and pictures. A parallel model checks how authentic the image looks. The back-and-forth between the generator model and the discriminator model fine-tunes the look of the image.

One extra twist is required: If the GAN system tries to create the image from a descriptive caption in one step, elements in the image tend to get muddied together. He’s team tweaked the system to break a caption down into separate terms and match those words to specific regions of the image.

The approach, known as Attentional GAN or AttnGAN, parallels how human artists focus their attention on different parts of a drawing as they create it. “Attention is a human concept; we use math to make attention computational,” He said.

As it’s being trained, the system absorbs the common sense that goes into creating a work of art. Because most pictures show birds sitting on twigs, the bot will add a twig to its bird picture unless it’s told not to.

[Image: The drawing bot had a hard time translating descriptions of out-of-the-ordinary scenarios into realistic-looking images. (Tao Xu et al.)]

But if the description gets too crazy, so will the picture. For example, when the bot is told to draw a picture of a “red double-decker bus floating on a lake,” the result is a blurry, drippy image of something that looks like a cross between a boat and a bus, floating on a lake surrounded by mountains.

“The image suggests the bot had an internal struggle between knowing that boats float on lakes and the text specification of bus,” John Roach writes in today’s report on Microsoft’s AI Blog.

The system, described in a research paper posted on the arXiv preprint server, dramatically outperformed previous GAN models for text-to-image generation, boosting the best previously reported inception score by about 14 percent on a benchmark dataset of bird images and by roughly 170 percent on the more challenging COCO dataset of everyday scenes.

Microsoft’s He said more advanced versions of the drawing bot could serve as digital assistants for artists, interior designers, photo editors and even filmmakers. Theoretically, text-to-image generators could transform the description of a place in a screenplay into a full-frame animated scene — which is something director James Cameron might want to keep in mind for his far-off “Avatar” sequels.

In addition to Xiaodong He, Pengchuan Zhang and Qiuyuan Huang at Microsoft, the authors of “AttnGAN: Fine-Grained Text to Image Generation With Attentional Generative Adversarial Networks” include former Microsoft interns Tao Xu from Lehigh University and Zhe Gan from Duke University; plus Han Zhang from Rutgers University and Xiaolei Huang from Lehigh University.
