Is it possible to see the sources that AI pulls data from? Thanks to a search tool built with Datasette, an open-source data-exploration tool, you can now search 12 million of the training images used in the popular AI model, Stable Diffusion.
If you’re reading this, there’s a good chance you’ve already taken a dive into the world of AI using MidJourney, Night Cafe, Wonder, or any of the other popular interfaces released in 2022.
But now, many artists and creators have started questioning the legality of AI and where it pulls content from for training.
Where Does AI Collect Data From?
There are a few different sources that AI systems use to gather data.
One common source of data for AI systems is human-generated data. This can include things like text written by people, images and videos taken by people, or even audio recordings of people speaking. This type of data is often collected through online platforms or apps, as well as through sensors or cameras that are placed in the physical world.
Other AI systems are designed to generate information from scratch, which is often referred to as machine-generated data. Popular AI art generators like MidJourney or Night Cafe use algorithms that can create patterns, shapes, and colors in novel ways. These algorithms can be designed to mimic specific art styles, or to create completely original works.
AI can “learn” from itself and, with help and feedback from human users, AI art generators have become quite sophisticated in a short period of time. But it’s important to understand the difference between the “software” and the “source” for AI.
For example, MidJourney and Night Cafe use the Stable Diffusion model in their art generation. Stable Diffusion is a deep learning, text-to-image model that can be applied to tasks like inpainting, outpainting, and generating image-to-image translations guided by a text prompt.
In essence, Stable Diffusion is the source of the AI — whereas MidJourney or Night Cafe are interfaces used to access it.
A common misconception about AI art generators is that when a user inputs a prompt, the software is “pulling data from the internet” to generate the image.
This isn’t exactly true, because the model has already been trained on millions of images before the user ever types a prompt. The text the user inputs is encoded, run through the model, and finally produces a randomized image based on the information the user has provided.
In the case of Night Cafe, here’s a simplified diagram of how it works.
- User inputs a text-based description.
- The text-based description becomes a “prompt” in the Night Cafe web interface.
- After the user hits “enter”, the prompt is then directed to the AI model, Stable Diffusion.
- Stable Diffusion runs the prompt through its massive neural network (trained on a huge dataset of image-text pairs) to create a graphic result, which is sent back to the Night Cafe interface and presented to the user.
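The round trip above can be sketched in a few lines of toy Python. Everything here is a stand-in: `encode_prompt` and `run_model` are hypothetical stubs for illustration only, not real Night Cafe or Stable Diffusion internals (which involve text encoders, a denoising network, and sampling schedulers).

```python
import random

def encode_prompt(prompt: str) -> list[int]:
    """Steps 1-2: the user's text is encoded into numbers the model understands.
    (A real model uses a learned text encoder, not character codes.)"""
    return [ord(ch) for ch in prompt]

def run_model(encoded: list[int], seed: int = 42) -> list[list[int]]:
    """Steps 3-4: the (stubbed) model turns the encoding plus a random seed
    into pixel data. A real diffusion model iteratively denoises random
    noise instead of generating pixels directly like this."""
    rng = random.Random(seed + sum(encoded))
    return [[rng.randint(0, 255) for _ in range(4)] for _ in range(4)]

def generate(prompt: str) -> list[list[int]]:
    """The full round trip: interface -> model -> image back to interface."""
    return run_model(encode_prompt(prompt))

image = generate("a lighthouse at sunset")
print(len(image), len(image[0]))  # a tiny 4x4 "image"
```

Note that the same prompt and seed always yield the same output, which is why real generators expose a “seed” setting: randomness comes from the seed, not from the internet.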
The model does not technically “go online” to create an image, but rather draws on its already-existing neural network, trained on images and descriptive text. Models need to be trained on updated datasets, then released to the public. From there, software developers can tap into the new AI models.
That’s why different versions of AI Art Generators are released.
AI isn’t tapping into the “live” internet. It uses data that has been scraped from the internet at a specific time.
So in the case of Night Cafe, it taps into a specific “version” of the Stable Diffusion AI Model.
But where does Stable Diffusion obtain its images?
According to Andy Baio, a former CTO of Kickstarter:
“Stable Diffusion was trained off three massive datasets collected by LAION, a nonprofit whose compute time was largely funded by Stable Diffusion’s owner, Stability AI.
All of LAION’s image datasets are built off of Common Crawl, a nonprofit that scrapes billions of webpages monthly and releases them as massive datasets. LAION collected all HTML image tags that had alt-text attributes, classified the resulting 5 billion image-pairs based on their language, and then filtered the results into separate datasets using their resolution, a predicted likelihood of having a watermark, and their predicted “aesthetic” score (i.e. subjective visual quality).”
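The filtering pipeline Baio describes can be sketched roughly as follows. The field names, thresholds, and sample data here are made up for illustration; LAION’s real pipeline uses trained classifiers to predict watermark likelihood and aesthetic score across billions of pairs.

```python
# Illustrative sketch of LAION-style filtering of image/alt-text pairs.
# All field names and thresholds below are hypothetical examples.
pairs = [
    {"url": "https://example.com/a.jpg", "alt": "a red barn", "width": 1024,
     "height": 768, "p_watermark": 0.02, "aesthetic": 6.1, "lang": "en"},
    {"url": "https://example.com/b.jpg", "alt": "logo", "width": 120,
     "height": 120, "p_watermark": 0.91, "aesthetic": 3.2, "lang": "en"},
    {"url": "https://example.com/c.jpg", "alt": "ein roter Hund", "width": 800,
     "height": 600, "p_watermark": 0.05, "aesthetic": 5.5, "lang": "de"},
]

def keep(pair, min_side=512, max_watermark=0.5, min_aesthetic=5.0):
    """Keep pairs that are large enough, unlikely to carry a watermark,
    and above the aesthetic-score threshold."""
    return (min(pair["width"], pair["height"]) >= min_side
            and pair["p_watermark"] <= max_watermark
            and pair["aesthetic"] >= min_aesthetic)

# Classify by language first, then filter -- mirroring the quoted steps.
by_lang = {}
for pair in pairs:
    by_lang.setdefault(pair["lang"], []).append(pair)

filtered = {lang: [p for p in ps if keep(p)] for lang, ps in by_lang.items()}
print({lang: len(ps) for lang, ps in filtered.items()})  # -> {'en': 1, 'de': 1}
```

The small logo image is dropped for low resolution and a high watermark score, which is exactly why so much web imagery never makes it into the final training sets.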
He has an extremely informative website that covers this subject in depth, and you can read more about it here.
How to See Original Stable Diffusion Training Images
With Datasette, you can explore the database of over 12 million images that were used to train Stable Diffusion. Each entry lists the source it was scraped from, the alt text (descriptive metadata attached to digital images), the image’s width and height, and some additional information used to sort the search results.
For example, I plugged in “Gensler” (the world’s leading architecture firm) and the following results were provided by Datasette:
With these results, you can easily trace each image back to the original source it was scraped from.
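Because Datasette exposes every table as JSON, a search like the “Gensler” example can also be scripted. The instance, database, and table names below match the Baio/Willison deployment as I understand it, but they may change, so treat them as an assumption.

```python
from urllib.parse import urlencode

# Assumed location of the LAION-Aesthetic Datasette instance; the database
# and table names ("laion-aesthetic-6pls", "images") may differ or move.
BASE = "https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/images.json"

def search_url(term: str, limit: int = 20) -> str:
    """Build a Datasette JSON API URL that searches the images table.
    `_search` is Datasette's standard full-text search parameter and
    `_size` caps the number of rows returned."""
    return BASE + "?" + urlencode({"_search": term, "_size": limit})

print(search_url("Gensler"))
```

Fetching the resulting URL (with `urllib.request` or any HTTP client) returns JSON rows containing the image URL, alt text, and dimensions, the same fields shown in the web interface.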
So with both graphic and text information, Stable Diffusion can build a baseline of images to train from. When a user inputs a prompt through software like Night Cafe, Stable Diffusion attempts to match the user’s text against the metadata of the images it was trained on.
If you want to see a reverse-engineered, partial version of Stable Diffusion’s image database, check out this incredible search engine developed by Andy Baio and Simon Willison:
So now that we know where Stable Diffusion is getting its training information, how does copyright factor into the equation?
Well, that’s where there’s a lot of gray area… So be sure to check out this post, where I unpack all of the good and bad things about AI in architecture.