Algorithm draws descriptions: an avocado armchair as the future of AI


With GPT-3, OpenAI showed that a single deep learning model can be trained to complete or even compose text realistically, simply by feeding the system a gigantic mass of text as training data. It then became clear that the same approach also works when text is replaced by pixels: an AI could be trained to complete half-finished images. GPT-3 mimics how humans use language; Image GPT predicts what we see.

OpenAI has now brought these two ideas together and created two new models, DALL · E and CLIP, each of which combines language and images in a way that helps AI better understand what words mean and what they refer to. “We live in a visual world,” says Ilya Sutskever, Chief Scientist at OpenAI. “In the long run, you will have models that understand both text and images. AI will be able to understand language better because it can see what words and sentences mean.”

For all the flair that GPT-3 shows, its output can still seem detached from reality, as if the system doesn’t know what it is actually talking about. No wonder: it doesn’t. Now, by combining text with images, researchers at OpenAI and elsewhere are trying to give language models a better understanding of the everyday concepts people use to make sense of things.

DALL · E and CLIP approach the problem from different directions. At first glance, CLIP (short for “Contrastive Language-Image Pre-training”) is just another image recognition system.

But there is more to it: the system has learned to recognize images not from labeled examples in a data set curated by humans, as most existing models do, but from images and their captions taken from the internet. It learns from a description of what can be seen in a picture rather than from a single term like “cat” or “banana”.

CLIP is trained to predict which caption, out of a random selection of 32,768, is the correct one for a given image. To achieve this, CLIP learns to link a wide variety of objects to the names and words that describe them. It can then identify objects in images that were not part of the training set.
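The caption-matching idea described above can be sketched in a few lines. This is only a toy illustration under stated assumptions, not OpenAI's implementation: the three-dimensional vectors are made-up stand-ins for learned image and caption embeddings, the function names are hypothetical, and the temperature value is simply an illustrative choice. The sketch scores every candidate caption by cosine similarity to the image and turns the scores into probabilities with a softmax.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def caption_probabilities(image_vec, caption_vecs, temperature=0.07):
    """Softmax over image-caption similarities (temperature is an assumed value)."""
    scores = [cosine(image_vec, c) / temperature for c in caption_vecs]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_caption(image_vec, captions, caption_vecs):
    """Return the most probable caption and its probability."""
    probs = caption_probabilities(image_vec, caption_vecs)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return captions[idx], probs[idx]

# Hypothetical embeddings; a real system would compute these with trained encoders.
captions = [
    "a photo of a cat",
    "a photo of a banana",
    "an armchair shaped like an avocado",
]
caption_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
image_vec = [0.1, 0.3, 0.8]

choice, prob = best_caption(image_vec, captions, caption_vecs)
print(choice, round(prob, 3))
```

Because the captions are free text rather than a fixed list of labels, the same matching step can in principle be pointed at a new set of descriptions without retraining, which is the kind of generalization the article attributes to CLIP.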

(Image: OpenAI)

Most image recognition systems are trained to identify certain types of objects, such as faces in surveillance videos or buildings in satellite images. Like GPT-3, CLIP can generalize across tasks without any additional training.

The system is also less likely than other state-of-the-art image recognition models to be misled by adversarial examples: images that have been altered so slightly that a human might not notice any difference, yet which typically confuse algorithms.

DALL · E (presumably a pun on the film title “WALL · E” and the painter Dalí), however, does not recognize pictures; it draws them. The model is a smaller version of GPT-3 and was likewise trained on text-image pairs taken from the internet. Given a short description in natural language, such as “picture of a capybara sitting in a field at sunrise” or “cross-sectional view of a walnut”, DALL · E generates many images that should match it: dozens of capybaras of all shapes and sizes against an orange or yellow background, and rows of walnuts (though not all of them in cross-section).

More from Technology Review

The results are fascinating, but still a mixed bag. The description “a stained glass window with an image of a blue strawberry” produces many accurate results, but also some with blue windows and red strawberries. Others contain nothing that resembles a window or a strawberry. The results recently published by OpenAI were not cherry-picked by hand, however; they were ranked by CLIP.

For each description, CLIP selected the 32 DALL · E images that it judged to match it best. “Text-to-image is a research challenge that has been around for a long time,” says Mark Riedl, who works on natural language processing (NLP) and computational creativity at the Georgia Institute of Technology in Atlanta. “But this is a pretty impressive set of examples.”
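The ranking step mentioned above, in which one model scores the output of another and only the best 32 images per description are kept, amounts to a top-k selection. The following is a minimal sketch under assumptions: the candidate names and their match scores are invented for illustration, and `top_matches` is a hypothetical helper, not part of any OpenAI tool. In a real pipeline the score would come from a model such as CLIP comparing each generated image with the description.

```python
import heapq

def top_matches(candidates, score_fn, keep=32):
    """Return the `keep` highest-scoring candidates, best first."""
    return heapq.nlargest(keep, candidates, key=score_fn)

# Made-up match scores for 100 hypothetical generated images.
# (i * 37) % 100 simply spreads the values 0..99 across the candidates.
toy_scores = {f"image_{i:03d}": ((i * 37) % 100) / 100 for i in range(100)}

selected = top_matches(toy_scores, toy_scores.get, keep=32)
print(selected[:3])
```

Selecting by an automatic score rather than by hand is what distinguishes this from cherry-picking: the same rule is applied to every description, whether or not the results are flattering.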