OpenAI today debuted two multimodal AI systems that combine computer vision and NLP. One of them is DALL-E, a system that generates images from text. For example, the photo above for this story was generated from the text prompt “an illustration of a baby daikon radish in a tutu walking a dog.” DALL-E uses a 12-billion-parameter version of GPT-3 and, like GPT-3, is a Transformer language model. The name is meant to evoke the artist Salvador Dalí and the robot WALL-E.
Tests shared by OpenAI today appear to demonstrate that DALL-E can manipulate and rearrange objects in generated imagery and also create things that don’t exist, like a cube with the texture of a porcupine or a cube of clouds. Depending on the text prompt, some images generated by DALL-E appear as if they were taken in the real world, while others depict works of art.
“We recognize that work involving generative models has the potential for significant, broad societal impacts. In the future, we plan to analyze how models like DALL·E relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer term ethical challenges implied by this technology,” OpenAI said in a blog post about DALL-E today.
OpenAI also introduced CLIP today, a multimodal model trained on 400 million pairs of images and text collected from the internet. CLIP has zero-shot learning capabilities akin to those of the GPT-2 and GPT-3 language models.
“We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models,” a paper about the model by 12 OpenAI coauthors reads.
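To make the idea of zero-shot transfer concrete, here is a minimal Python sketch of how a model that embeds images and text in a shared space could classify an image without task-specific training: score the image against a text prompt for each candidate label and pick the closest match. This is an illustration under stated assumptions, not OpenAI's code. The three-dimensional vectors are made-up stand-ins for real model embeddings, and the function names and the `temperature` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, label_embeddings, temperature=100.0):
    """Rank candidate text labels by similarity to one image embedding.

    `temperature` is an assumed scaling factor that sharpens the softmax;
    a real model would supply its own value.
    """
    labels = list(label_embeddings)
    scores = [temperature * cosine(image_embedding, label_embeddings[name])
              for name in labels]
    return dict(zip(labels, softmax(scores)))


# Toy embeddings, invented for illustration only. A real system would get
# these vectors from its image encoder and text encoder.
TOY_LABELS = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a radish": [0.0, 0.1, 0.9],
}
toy_image = [0.8, 0.2, 0.1]

probs = zero_shot_classify(toy_image, TOY_LABELS)
best = max(probs, key=probs.get)
print(best)
```

Because the candidate labels are just text prompts, switching to a new task in this sketch only means swapping in a different set of prompts, with no retraining step.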
OpenAI chief scientist Ilya Sutskever is a coauthor of the paper detailing CLIP, and seems to have alluded to the coming release of CLIP when he told deeplearning.ai recently that multimodal models would be a major machine learning trend in 2021. Google AI chief Jeff Dean made a similar prediction for 2020 in an interview with VentureBeat.
The release of DALL-E follows the release of a number of generative models with the power to mimic or distort reality or predict how people paint landscape and still life art. Some, like StyleGAN, have demonstrated a propensity for racial bias.
OpenAI researchers working on CLIP and DALL-E called for additional research into the potential societal impact of both systems. GPT-3 displayed significant anti-Muslim bias and negative sentiment scores for Black people, so the same shortcomings could be embedded in DALL-E. A bias test included in the CLIP paper found that the model was most likely to miscategorize people under 20 as criminals or non-human, that people classified as men were more likely to be labeled as criminals than people classified as women, and that some label data contained in the dataset are heavily gendered.
How OpenAI built DALL-E, along with additional details, will be shared in an upcoming paper. Large language models that use data scraped from the internet have been criticized by researchers who say the AI industry needs to undergo a culture change.