OpenAI’s latest strange yet fascinating creation is DALL-E, which by way of hasty summary might be called “GPT-3 for images.” It creates illustrations, photos, renders or whatever method you prefer, of anything you can intelligibly describe, from “a cat wearing a bow tie” to “a daikon radish in a tutu walking a dog.” But don’t write stock photography and illustration’s obituaries just yet.
As usual, OpenAI’s description of its invention is quite readable and not overly technical. But it bears a bit of contextualizing.
What researchers created with GPT-3 was an AI that, given a prompt, would attempt to generate a plausible version of what it describes. So if you say “a story about a child who finds a witch in the woods,” it will try to write one — and if you hit the button again, it will write it again, differently. And again, and again, and again.
Some of these attempts will be better than others; indeed, some will be barely coherent while others may be nearly indistinguishable from something written by a human. But it doesn’t output garbage or serious grammatical errors, which makes it suitable for a variety of tasks, as startups and researchers are exploring right now.
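To make the "hit the button again" idea concrete, here is a minimal sketch of why the same prompt yields a different result each time: the next word is drawn at random from weighted options. The `TOY_MODEL` table and the `generate` function are hypothetical illustrations, not GPT-3's vocabulary, probabilities or API. A real model computes such weights with a neural network.

```python
import random

# Hypothetical toy model: each word maps to possible next words and their weights.
# These entries are invented for illustration only.
TOY_MODEL = {
    "child": (["finds", "meets", "follows"], [0.5, 0.3, 0.2]),
    "finds": (["a"], [1.0]),
    "meets": (["a"], [1.0]),
    "follows": (["a"], [1.0]),
    "a": (["witch", "cottage", "path"], [0.6, 0.2, 0.2]),
}


def generate(start, rng, max_words=4):
    """Extend `start` one word at a time by sampling from the toy weights."""
    words = [start]
    while len(words) < max_words and words[-1] in TOY_MODEL:
        choices, weights = TOY_MODEL[words[-1]]
        # Weighted random draw: likelier words come up more often, but not always.
        words.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(words)


if __name__ == "__main__":
    # Each call draws fresh random choices, like pressing the button again.
    rng = random.Random()
    for _ in range(3):
        print(generate("child", rng))
```

Because the draws are random, repeated runs can produce different four-word fragments from the same starting word, some more sensible than others.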
DALL-E (a combination of Dalí and WALL-E) takes this concept one step further. Turning text into images has been done for years by AI agents, with varying but steadily increasing success. In this case the agent uses the language understanding and context provided by GPT-3 and its underlying structure to create a plausible image that matches a prompt.
As OpenAI puts it:
“GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks. Image GPT showed that the same type of neural network can also be used to generate images with high fidelity. We extend these findings to show that manipulating visual concepts through language is now within reach.”
What they mean is that an image generator of this type can be manipulated naturally, simply by telling it what to do. Sure, you could dig into its guts and find the token that represents color, and decode its pathways so you can activate and change them, the way you might stimulate the neurons of a real brain. But you wouldn’t do that when asking your staff illustrator to make something blue rather than green. You just say, “a blue car” instead of “a green car” and they get it.
So it is with DALL-E, which understands these prompts and rarely fails in any serious way, although it must be said that even when looking at the best of a hundred or a thousand attempts, many images it generates are more than a little… off. Of which later.
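Picking "the best of a hundred or a thousand attempts" amounts to generating many candidates and keeping the top-ranked few. The sketch below shows that pattern under loud assumptions: `generate_candidate` and `score` are hypothetical stand-ins (a random number plays the role of image quality), and nothing here reflects DALL-E's actual pipeline.

```python
import random


def generate_candidate(prompt, rng):
    """Hypothetical stand-in for an image generator: returns the prompt and a made-up quality."""
    return {"prompt": prompt, "quality": rng.random()}


def score(candidate):
    """Hypothetical stand-in for a ranking step that rates how well a candidate fits the prompt."""
    return candidate["quality"]


def best_of(prompt, attempts, keep, rng):
    """Generate `attempts` candidates and return the `keep` highest-scoring ones."""
    candidates = [generate_candidate(prompt, rng) for _ in range(attempts)]
    return sorted(candidates, key=score, reverse=True)[:keep]


if __name__ == "__main__":
    top = best_of("a cat wearing a bow tie", attempts=100, keep=5, rng=random.Random())
    for candidate in top:
        print(round(candidate["quality"], 3))
```

The point of the pattern is that what gets shown is the cream of a large batch, so even the displayed results say little about how a typical single attempt looks.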
In the OpenAI post, the researchers give copious interactive examples of how the system can be told to do minor variations of the same idea, and the result is plausible and often quite good. The truth is these systems can be very fragile, as they admit DALL-E is in some ways. Saying “a green leather purse shaped like a pentagon” may produce what’s expected, but “a blue suede purse shaped like a pentagon” might produce nightmare fuel. Why? It’s hard to say, given the black-box nature of these systems.
But DALL-E is remarkably robust to such changes, and reliably produces pretty much whatever you ask for. A torus of guacamole, a sphere of zebra; a large blue block sitting on a small red block; a front view of a happy capybara, an isometric view of a sad capybara; and so on and so forth. You can play with all the examples in the post.
It also exhibited some unintended but useful behaviors, using intuitive logic to understand requests such as making multiple sketches of the same (non-existent) cat, with the original on top and the sketch on the bottom. No special coding here: “We did not anticipate that this capability would emerge, and made no modifications to the neural network or training procedure to encourage it.” This is fine.