Creativity is something humanity has long prided itself on, with artists and designers revered for works ranging from Renaissance masterpieces to modern-day open-world video games, from the artistic choices of photography to the imagination behind entirely new worlds. Five years ago, no one would have believed you if you told them that computer programs could be just as creative, much less deliver stunning original artwork from just a couple of words.
DALLE 2 can generate completely new, never-before-seen images. It can also modify existing images and create variations of them while maintaining their distinguishing features. Its results are remarkable.
DALLE 2 began as a research project meant to demonstrate and push the limits of deep learning, and to help people understand how AI systems see and understand our world.
In April 2022, the DALLE 2 paper was published and the system was made available to a limited number of users for testing and press, gaining a lot of traction. Roughly six months later, on September 28, 2022, DALLE 2 was opened to the public for everyone to use and try.
How Does DALLE 2 Work
DALLE 2, developed by OpenAI, is a text-to-image AI system built on two components: Contrastive Language-Image Pre-Training (CLIP) and diffusion. Its approach, described in the paper "Hierarchical Text-Conditional Image Generation with CLIP Latents," sits at the intersection of deep learning for natural language processing (NLP) and computer vision (CV). Generation proceeds in four steps:
- The Text Embedding step:
  - The caption is transformed into a CLIP text embedding using a neural network trained on 400 million (image, text) pairs.
- The Prior step:
  - A Transformer with attention, trained as a diffusion model, generates an image embedding from the text embedding.
- The Image Embedding step:
  - The output of the prior is a CLIP image embedding, a vector in the same embedding space that the aforementioned CLIP network uses for images.
- The Decoder step:
  - A diffusion model is used again to transform the image embedding into a 64x64 image.
  - The image is then passed through two upsampling networks built on convolutional neural networks (CNNs) that upscale it from 64x64 to 256x256, and finally to 1024x1024.
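The flow of data through these steps can be sketched in a few lines of Python. This is only a shape-level illustration under heavy assumptions: every function below (`encode_text`, `prior`, `decode`, `upsample`, `generate`) is a hypothetical stand-in built from random matrices, the 16-dimensional embedding size is made up, and nearest-neighbour repetition replaces the learned upsamplers. None of it is OpenAI's code or API.

```python
import numpy as np

EMBED_DIM = 16  # hypothetical size; real CLIP embeddings are far larger


def encode_text(caption: str) -> np.ndarray:
    """Stand-in for the CLIP text encoder: caption -> unit-length vector."""
    seed = sum(caption.encode("utf-8"))  # toy deterministic seed, not a tokenizer
    vec = np.random.default_rng(seed).normal(size=EMBED_DIM)
    return vec / np.linalg.norm(vec)


def prior(text_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the prior: text embedding -> image embedding."""
    mixing = np.random.default_rng(1).normal(size=(EMBED_DIM, EMBED_DIM))
    image_embedding = mixing @ text_embedding
    return image_embedding / np.linalg.norm(image_embedding)


def decode(image_embedding: np.ndarray) -> np.ndarray:
    """Stand-in for the diffusion decoder: image embedding -> 64x64 RGB array."""
    projection = np.random.default_rng(2).normal(size=(64 * 64 * 3, EMBED_DIM))
    return (projection @ image_embedding).reshape(64, 64, 3)


def upsample(image: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upscaling; the real upsamplers are learned models."""
    return image.repeat(factor, axis=0).repeat(factor, axis=1)


def generate(caption: str) -> np.ndarray:
    text_embedding = encode_text(caption)      # Text Embedding step
    image_embedding = prior(text_embedding)    # Prior / Image Embedding steps
    image_64 = decode(image_embedding)         # Decoder step
    image_256 = upsample(image_64, 4)          # 64x64 -> 256x256
    return upsample(image_256, 4)              # 256x256 -> 1024x1024


image = generate("an astronaut riding a horse")
print(image.shape)  # (1024, 1024, 3)
```

The point of the sketch is the hand-off between stages: a caption becomes a vector, that vector becomes another vector, and only then do pixels appear, first at low resolution and then upscaled twice by a factor of four.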
CLIP / Contrastive Language-Image Pre-Training
Traditional models that combine NLP and CV rely on very large datasets with extensive labeling, in which every image is sorted into a designated category. Such models work, but they are limited to a finite number of categories…
CLIP changes this by identifying not just the category of a given image but potential captions for it, which enables a more precise description of various scenes. This image-to-caption direction is the exact opposite of what DALLE 2 does, which is to go from a caption to an image. By embedding (that is, representing mathematically) both images and captions in the same vector space, CLIP can find the best-fitting captions for an image.
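To make "best-fitting caption" concrete, here is a minimal sketch of the matching step, assuming the embeddings have already been computed. The three-dimensional vectors, the captions, and the helper names `normalize` and `rank_captions` are all made up for illustration, and the temperature value is an assumed setting rather than something taken from DALLE 2. A real system would obtain the embeddings from trained image and text encoders.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each vector to unit length so dot products are cosine similarities."""
    return vectors / np.linalg.norm(vectors, axis=-1, keepdims=True)


def rank_captions(image_embedding, caption_embeddings, temperature=0.07):
    """Return a probability for each caption given one image embedding."""
    image_embedding = normalize(image_embedding)
    caption_embeddings = normalize(caption_embeddings)
    similarities = caption_embeddings @ image_embedding  # one cosine per caption
    logits = similarities / temperature                  # assumed temperature
    logits = logits - logits.max()                       # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()                       # softmax


# Toy, hand-written embeddings (hypothetical values).
captions = ["a photo of a dog", "a photo of a cat", "a diagram of a circuit"]
caption_embeddings = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.0],
    [0.0, 0.1, 0.9],
])
image_embedding = np.array([0.8, 0.3, 0.1])

probabilities = rank_captions(image_embedding, caption_embeddings)
best_caption = captions[int(np.argmax(probabilities))]
print(best_caption)  # a photo of a dog
```

Because captions are compared by similarity in a shared space instead of being picked from a fixed label set, any caption that can be embedded can be scored, which is what frees CLIP from a finite list of categories.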
Diffusion Model
DALLE 2 has learned the relationship between images and the text used to describe them. It uses a process called “diffusion,” which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image. Think of it as repeatedly denoising a very noisy image until it fits a given description.
To produce more than garbled pixels, the diffusion model is conditioned on the CLIP model. The diffusion process keeps rearranging pixels until the result is an image that CLIP would recognize as accurately fitting the caption. Of course, the technical details of the math and implementation are extremely complex. Essentially, DALLE 2 is an optimization and refinement of CLIP and diffusion working together.
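The noising half of diffusion is simple enough to write down. The sketch below assumes the closed-form forward process and linear noise schedule commonly used in the diffusion literature; the schedule values, the 8x8 "image", and the function names `add_noise` and `recover_image` are illustrative assumptions, not details taken from DALLE 2. The final lines show why predicting the noise is enough to undo it: here the true noise is passed in as if a perfect denoiser had predicted it, whereas a trained network can only approximate it and therefore works in many small steps.

```python
import numpy as np

STEPS = 1000
betas = np.linspace(1e-4, 0.02, STEPS)     # assumed linear noise schedule
signal_kept = np.cumprod(1.0 - betas)      # fraction of signal variance left at step t


def add_noise(image, step, noise):
    """Jump straight to a given noise level (closed-form forward process)."""
    return (np.sqrt(signal_kept[step]) * image
            + np.sqrt(1.0 - signal_kept[step]) * noise)


def recover_image(noisy_image, step, predicted_noise):
    """Invert add_noise, given a prediction of the noise that was added."""
    return ((noisy_image - np.sqrt(1.0 - signal_kept[step]) * predicted_noise)
            / np.sqrt(signal_kept[step]))


rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(8, 8))    # toy 8x8 grayscale "image"
noise = rng.normal(size=image.shape)

slightly_noisy = add_noise(image, 10, noise)          # still mostly image
pure_static = add_noise(image, STEPS - 1, noise)      # almost entirely noise

# With a perfect noise prediction, the original image comes back exactly.
restored = recover_image(pure_static, STEPS - 1, predicted_noise=noise)
print(np.allclose(restored, image))  # True
```

Generation runs this in reverse: start from pure static and repeatedly remove a little of the predicted noise until an image remains.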
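The intuition of rearranging pixels until they fit the caption can be imitated with a deliberately simplified toy. The sketch below is not diffusion and not how DALLE 2's decoder is implemented: it assumes a hypothetical linear `encoder` in place of CLIP, a random vector in place of a caption embedding, and plain gradient ascent on cosine similarity in place of learned denoising steps. It only shows how a matching score can steer random pixels toward something that fits a caption.

```python
import numpy as np

rng = np.random.default_rng(0)
PIXELS, EMBED_DIM = 64, 8                        # hypothetical toy sizes

encoder = rng.normal(size=(EMBED_DIM, PIXELS))   # stand-in for an image encoder
caption_embedding = rng.normal(size=EMBED_DIM)   # stand-in for a caption embedding
caption_embedding /= np.linalg.norm(caption_embedding)


def similarity(pixels):
    """Cosine similarity between the encoded pixels and the caption embedding."""
    embedding = encoder @ pixels
    return float(embedding @ caption_embedding / np.linalg.norm(embedding))


def similarity_gradient(pixels):
    """Gradient of the cosine similarity with respect to the pixels."""
    embedding = encoder @ pixels
    norm = np.linalg.norm(embedding)
    score = embedding @ caption_embedding / norm
    grad_embedding = caption_embedding / norm - score * embedding / norm**2
    return encoder.T @ grad_embedding


pixels = rng.normal(size=PIXELS)                 # start from random "dots"
initial_score = similarity(pixels)

for _ in range(2000):                            # small, repeated adjustments
    pixels = pixels + 0.2 * similarity_gradient(pixels)

final_score = similarity(pixels)
print(final_score > initial_score)
```

In the real system the steering signal comes from conditioning a trained diffusion model on CLIP embeddings, so each step both removes noise and moves toward the caption.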
Conclusions
Training models like these, with billions of parameters and hundreds of millions of training examples, took a lot of time and resources before DALLE 2 became the fascinating model it is today. There are other models similar to DALLE 2, such as Midjourney and Stable Diffusion, but neither has quite the popularity and presence of DALLE 2, which started it all.
DALLE 2 is inspiring more and more developers to delve into the world of deep learning and AI, turning these complex algorithms into something tangible and usable. Many data science algorithms and deep learning models work behind the scenes, hard to spot, yet they drive our everyday lives. DALLE 2 brings all of that to the forefront of the internet, inspiring minds to explore the complexity, ingenuity, and creativity an AI can display.