How to Read an AI Image

Image of two humans kissing. Generated by OpenAI’s DALL·E 2.

Every AI-generated image is an infographic about its training dataset. AI images are data patterns inscribed into pictures, and they tell us stories about that dataset and the human decisions behind it.

Media studies has acknowledged that “culture depends on its participants interpreting meaningfully what is around them […] in broadly similar ways” (Hall 1997: 2). Roland Barthes suggests that images make sense because they gather up and reproduce myths, a “collective representation” of ideas which turns “the social, the cultural, the ideological, and the historical into the natural” (Barthes 1977: 165). These myths can include our assumptions about the world: how we make sense of things. They are encoded into images when people make them, and then decoded when people look at them (cf. Hall 1992: 117). For the most part, these assumptions have operated on the presumption that humans, not machines, were the ones encoding these meanings into images.

When we make images, we consciously or unconsciously bring our own references. When we look at images, we make sense of them through our own references, which typically overlap. Those references orient us toward or away from certain understandings of the image. Unlike a human photographer, an AI system has no conscious or unconscious mind. Instead, these AI systems (specifically, diffusion-based models) produce images from datasets. These datasets are produced and assembled by humans. The images in them reflect the collective myths and unstated assumptions of the people who made those images, along with the decisions of those who collected them into datasets.

In other words, these biases aren’t encoded into the unconscious minds of machines. They’re inscribed into datasets because people made those datasets. Machine learning models identify patterns across vast numbers of images in these datasets: DALL·E 2, for instance, was trained on 250 million text and image pairings (cf. Ramesh et al. 2021: 4). These models have only become larger: Stable Diffusion was trained on billions of images.

These datasets, like the images they contain, were created within specific cultural, political, social, and economic contexts. Machines are programmed in those contexts too. In the end, image synthesis tools pick up on the biases humans introduce into these datasets: in what they choose to include and exclude, allow and forbid. This means that the conscious and unconscious assumptions of human data-gatherers can be reflected in the images produced when models train on that data.

Building a dataset is much like the way a photographer takes a picture. We have a subject we'd like to investigate. We find samples of the world that represent that subject. We snap the photo when those samples come into specific alignments.

We collect data this way, too: We have questions. We identify where the useful data for answering that question might be. We seek out ways to capture that data. Once captured, we allow machines to contemplate the result, or we work through it ourselves.

It is the scale of data that makes AI different, but the selection and curation of that data reflects a similar set of decisions. Some datasets grab images from all over the web without any human intervention, like a photographer snapping their shutter at random. Even then, the images gathered up reflect human biases and orientations: someone posted an image and labeled it, and that’s what the machine uses.

Can we read human myths through machine generated images — can we treat them as cultural, social, economic, and political artifacts?

To begin answering these questions, I've created a loose methodology. It's based on my training in media analysis at the London School of Economics, drawing from writings by Roland Barthes and Judith Williamson.

Typically we apply media analysis to film, photographs, or advertisements made by people. AI images are not films or photographs. They're infographics for a dataset. But they lack a key, and their information isn't distributed along easy-to-read x and y axes.

How might we read these unlabeled maps? Can a methodology meant to understand how human, unconscious assumptions and ideologies circulate through images help us understand, interpret, and critique the inhuman outputs of synthesized images?

The images in this example were created using DALL-E 2. The process applies to this and many other image synthesis tools, specifically tools based on diffusion models. That includes Midjourney, Stable Diffusion, DALL-E 2 & 3, and almost every popular app for image synthesis in 2024.


If you prefer video to reading, the video below is a presentation of this methodology delivered to students at Aarhus University in November 2023, which roughly follows this text! Otherwise, the text continues below.


How Diffusion Works

Training the Model

To understand the relationship between datasets and the images they produce, it’s helpful to know how that data becomes an image. That’s a process called Diffusion.

Diffusion models are trained by watching images decay. Every image in the data has its information removed over a sequence of steps. This introduces noise, and the model is designed to trace the dispersal of this noise (or diffusion of noise, hence the name) across the image. The noise follows a Gaussian distribution pattern, and as the images break down, noise encroaches more slowly into areas where similar pixels are clustered. In human terms, this is like raindrops scattering an ink drawing across a page. Based on what remains of the image, the trajectory of droplets and motion of the ink, we may be able to infer where the droplet landed and what the image represented before the splash.
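To make “watching images decay” concrete, here is a minimal sketch of that forward, noising process, assuming a simple linear noise schedule. Production models such as Stable Diffusion use tuned schedules and operate in a compressed latent space, so treat this as an illustration of the idea rather than any particular vendor’s implementation.

```python
import numpy as np

def forward_diffusion(image, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Progressively corrupt an image with Gaussian noise, step by step.

    `image` is a float array scaled to [-1, 1]. At each step a little more
    of the original signal is replaced by Gaussian static; by the final
    step almost nothing of the picture survives. A diffusion model trains
    on examples like these, learning to predict the noise that was added.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)  # the noise schedule
    alphas_cumprod = np.cumprod(1.0 - betas)              # how much signal remains at each step

    snapshots = []
    for t in range(num_steps):
        noise = np.random.randn(*image.shape)
        noisy = (
            np.sqrt(alphas_cumprod[t]) * image            # shrinking share of the original image
            + np.sqrt(1.0 - alphas_cumprod[t]) * noise    # growing share of Gaussian noise
        )
        snapshots.append(noisy)
    return snapshots
```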

A diffusion model is designed to sample these images, with their small differences in clusters of noise, and compare them. In doing this, the model makes a map of how the noise came in: like learning how the ink smeared. It measures what changed between the clear image and the slightly noisier one, a trail of breadcrumbs leading back to the previous image. If we examine images partway through the process, we see clusters of pixels remaining around the densest concentrations of the original image. In Fig.1, we see an image of flowers compared to the “residue” left behind as it is broken down.

Fig.1: You can see how the densest clusters of pixels, the petals, stay even as other information is stripped away (resulting in noise). This suggests that the shapes and colors — patterns in the image — are strongly correlated to the idea of the caption, for example, “flower.”

For example, flower petals, with their bright colors, stay visible after multiple generations of noise have been introduced. Gaussian noise follows a loose pattern, but one that tends to preserve information around a central space (like the petals). This digital residue of the image is enough to suggest a possible starting point for generating a similar image.

The machine “learns” (actually, applies calculations and memorizes the outcome of those calculations) to describe this distribution of noise — then, calculates a way to reverse it. In other words, it learns how to “clean up” the noise in the image and “restore” it to its starting point.

Once complete, information about the way this image breaks apart enters into a larger algorithmic abstraction, which is categorized by any associated captions or text in the dataset. The category flowers, for example, contains information about the breakdown of millions of images with the caption flowers. As a result, the model can work its way backward from noise, and if given this prompt, flowers, it can arrive at some generalized representation of a flower common to all of those patterns of clustering noise. That is to say: it can produce a perfect stereotype of a flower, a representation of any central tendencies found within the patterns of decay.

Producing an Image

That’s how the system “learns” images (again, it is really just applying calculations and storing values). So, how does this become a new image?

Every image produced by diffusion models like DALL-E 2, Stable Diffusion or Midjourney begins as a random image of Gaussian noise. Gaussian noise is a kind of static that follows certain rules. When we prompt a Diffusion model to create an image, it takes this static and seeks out any patterns of noise that vaguely match the patterns of noise connected to the words in your prompt. After a series of steps, it may arrive at a picture that matches its mathematical model of that prompt fairly closely.

The prompt is interpreted as a caption, and the algorithm works to “find” the image in random noise based on that caption. Consider the way we look for constellations in the nighttime sky: If I tell you a constellation is up there, you might find it – even if it isn’t there! Diffusion models are designed to find constellations among ever-changing random stars.

When the model encounters a new, randomized frame of static, it applies those stereotypes in reverse, seeking these central tendencies anew, guided by the prompt. It will follow the path drawn from the digital residue of these flower images. Each image has broken down in its own way, but shares patterns of breakdown: clusters of noise around the densest concentrations of pixels, representing the strongest signal within the original images. It aims to recreate an image based on averages.

As the model works backward from noise, our prompts constrain the possible pathways that the model is allowed to take. Prompted with flowers, the model cannot use what it has learned about the breakdown of cat photographs. We might constrain it further: Flowers in the nighttime sky. This introduces new sets of constraints: Flowers, but also night, and sky. All of these words are the result of datasets of image-caption pairs taken from the world wide web.

These images, labeled by internet users, are averaged, and these averages are assigned to categories in the model. The result is a boiling down of core visual patterns into abstractions like “flower” or “kissing.” The words trigger a set of instructions for how to recreate them. But these instructions are always paired with random noise — so they have to make new approximations of flowers, rather than recreating any one flower it has seen.
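The loop below is a schematic sketch of that reverse process, based on the standard DDPM sampling update. The `denoiser` function is a placeholder standing in for the trained network and its prompt conditioning, so this illustrates the mechanism rather than any specific model’s code.

```python
import numpy as np

def generate(denoiser, prompt_embedding, shape=(64, 64, 3), num_steps=1000,
             beta_start=1e-4, beta_end=0.02):
    """Work backward from pure static toward an image that fits the prompt.

    `denoiser(x, t, prompt_embedding)` is a placeholder for the trained
    network: given a noisy image and a step index, it returns its guess of
    the noise mixed in at that step. Subtracting that guess, step by step,
    steers the static toward the "average" the prompt points to.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alphas_cumprod = np.cumprod(alphas)

    x = np.random.randn(*shape)  # start from a random frame of Gaussian noise
    for t in reversed(range(num_steps)):
        predicted_noise = denoiser(x, t, prompt_embedding)
        # Simplified DDPM update: remove the noise predicted for this step.
        x = (x - (betas[t] / np.sqrt(1.0 - alphas_cumprod[t])) * predicted_noise) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * np.random.randn(*shape)  # fresh noise keeps each run different
    return x
```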

Of course, the captions people write online often present a biased and limited view of how things look or what they are. Especially when it comes to people, these captions can carry racist or misogynistic ideas, or assumptions specific to a particular culture. The categories themselves are often shaped by arbitrary, sometimes malicious, and almost always thoughtless boundaries about what belongs in them. If you want to learn more about some of the problematic content inside of these datasets, you can read more from Dr. Abeba Birhane or myself.

Reading an AI Image

Now that we have an idea of what a diffusion model is and how it works, let's start trying to “read” an image. Here’s how I do it — we’ll break these down step by step.

  1. Produce images until you find one image of particular interest.

  2. Describe the image simply, making note of interesting and uninteresting features.

  3. Create a new set of samples, drawing from the same prompt or dataset.

  4. Conduct a content analysis of these sample images to identify strengths and weaknesses.

  5. Connect these patterns to corresponding strengths and weaknesses in the underlying dataset.

  6. Re-examine the original image of interest.

We’re going to walk through the process of analyzing an image.

Step One: Produce an Image

The first step is simple: have an image in mind that you want to understand. Here’s an image from OpenAI's DALL-E 2, a diffusion-based generative image model. DALL-E 2 creates images on demand from a prompt, offering four interpretations. Some are bland. But as Roland Barthes said, "What's noted is notable."

So I noted this one.

Image of two humans kissing. Generated by OpenAI’s DALL-E 2.

Here is an AI-generated image of two humans kissing. It’s obviously weird, triggering the uncanny valley effect. But what else is going on? How might we “read” this image?

Step Two: Describe the Image

Start literally. What do you see?

We see a heterosexual white couple. A reluctant-looking man is being kissed by a woman. In this case, the man’s lips are protruding, which is rare within our sample. The man is also weakly represented: his eyes and ears have notable distortions.

What does it all mean? To find out, we need to start with a series of concrete questions for AI images:

  1. Where did the dataset come from?

  2. What is in the dataset and what isn't?

  3. How was the dataset collected?

This information, combined with more established forms of critical image analysis, can give us ways to “read” the images.

While this example studies a photograph, similar patterns can be found in a variety of images produced through diffusion systems. For example, if oil paintings frequently depict houses, trees, or a particular style of dress, those can be read as strong features, corresponding to strong patterns in the dataset. You may discover that producing oil paintings in the style of 18th century European artists does not generate images of black women. This would be a weak signal from the data, suggesting that the referenced datasets of 18th century European portraiture did not contain many portraits of black women. (Note that these are hypotheticals and have not been specifically verified.)

Step Three: Create a Sample Set

At first, it’s challenging to find insights into a dataset through a single image. One thing that makes generative photography unique as a medium is scale: millions of images can be produced in a day, with streaks of variations and anomalies. None of these reflect a single author’s choices. They blend thousands, even millions, of choices.

By examining many images produced by a single model, from the same prompt, we can begin to “make sense” of the underlying properties of the data they draw from. The more images we have, the more we can start to understand the central tendencies in the dataset. This central tendency emerges from large amounts of data as the model creates an image from noise.

Remember, if you ask for a cat, it does not create a cat. Rather, it has to find a cat in a frame of pure digital static. Every step of this process is checked against the abstract concept of a “cat” it assembled in the training process. The result is that the model steers noise sharply into the central tendency connected to your prompt — in other words, it will steer towards the most average cat that it can find in that noise.

This is helpful to remember, because that’s what we want to read: the image as an infographic is all about showing us one possible version of this average cat, or average kiss. Generate enough of these images, though, and the patterns start to show us the boundaries of this average. That’s why we say it reflects central tendencies: it’s not one, perfect or ideal “average.” It’s tracing random noise back to something that looks like this statistical average of flowers or kissing pairs. This average is revealed over time, across multiple instances of that prompt.

AI images are different from photography and film because they can be generated endlessly. But even when you generate just a few, patterns will emerge. These patterns are where the underlying data reveals itself. Create lots of images, and these patterns become clearer. We can think of AI imagery as a series of film stills: a sequence of images, oriented toward telling the same story. The story is the dataset. To see that story, we have to see a good number of images and figure out what they have in common.

If you’ve created the image yourself, you’ll want to create a set of variations using the same prompt or model. I usually start with nine. Nine is an arbitrary number; I’ve picked it because nine images can be placed side by side and compared in a grid. In practice, you may want to aim for 18 or 27. For some projects, I’ve generated 90-120. An article in Rest of World, published in 2023, applied this methodology and generated 3,000 images for each prompt to see how models create stereotypes of all kinds of people.
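If you are working with an open model, building that sample set can be scripted. Here is a minimal sketch using the open-source diffusers library and Stable Diffusion; the essay’s own examples come from DALL·E 2’s web interface, so the model checkpoint and prompt below are stand-ins.

```python
# A minimal sketch for generating a nine-image sample set from one prompt,
# using the open-source diffusers library and Stable Diffusion.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",   # stand-in checkpoint; any open diffusion model works
    torch_dtype=torch.float16,
).to("cuda")

prompt = "studio photography of humans kissing"

# Nine images, one per seed, so the set can be compared side by side in a 3x3 grid.
for seed in range(9):
    generator = torch.Generator("cuda").manual_seed(seed)  # vary only the starting noise
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"sample_{seed}.png")
```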

If you’ve found the image in the wild, you can try to recreate it by describing it in as much detail as possible in your prompt. However, this technique, for now, assumes you can control for the prompt and that you have made the image. Once you understand how this works, though, you can start applying your “image reading skills” to all kinds of synthetic images.

Here are nine more images created from the exact same prompt. If you want to generate your own, you can type “studio photography of humans kissing” into an image synthesis tool of your choice and grab your own samples. These samples were created for illustration purposes, so they use additional modifiers.

Nine images created from the same prompt as our source image. If you want to generate your own, you can type Photograph of humans kissing into DALL·E 2 and grab samples for comparison yourself.

It’s tempting to try to be “objective” and download images at random. Right now, this is a mistake. The image you are interested in caught your eye for a reason. Our first priority is to understand that reason. So, draw out other images that are notable, however vague and messy this notability may be. They don’t have to look like your source per se, they just have to catch your eye. The trick is in finding out why they caught your attention.

From there, we’ll start to create our real hypothesis — after that, we will apply that hypothesis to random images.

Step Four: Conduct a Content Analysis

Now we can study your new set of images for patterns and similarities.

Again: what do you see? Describe it.

Are there particularly strong correlations between any of the images? Look for certain compositions and arrangements, color schemes, lighting effects, figures or poses, or other expressive elements, that are strong across all, or some meaningful subsection, of the sample pool.

These indicate certain biases in the source data. When patterns are present, we can call these “strong.” What are the patterns? What strengths are present across all of them?

In the example, the images render skin textures quite well. They seem professionally lit, with studio backgrounds. They are all close-ups focused on the couple. Women tend to have protruding lips, while men tend to have their mouths closed.

What are their weaknesses? Weaknesses are a comparison of those patterns to what else might be possible. Looking at this sample, three important things are apparent to me.

First, all of the couples are heteronormative, i.e., pairings of men and women. Second, there is only one multiracial couple. Third, what’s missing is any form of convincing interpersonal contact.

The “strong” pattern across the kissing itself is that every kiss is marked by hesitancy, as if an invisible barrier exists between the two “partners” in the image. The lips of the figures are inconsistent and never perfect. It’s as if the machine has never studied photographs of people kissing.
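A lightweight way to keep this content analysis honest is to code each image against a small set of features and tally the results. The sketch below is hypothetical: the feature names and codes are invented for this kissing example, not drawn from an actual coding sheet.

```python
from collections import Counter

# Hypothetical hand-coding of a nine-image sample: one dict per image,
# recording the features chosen while describing the set.
codes = [
    {"pairing": "man-woman", "same_race": True,  "contact": "hesitant"},
    {"pairing": "man-woman", "same_race": True,  "contact": "hesitant"},
    {"pairing": "man-woman", "same_race": False, "contact": "hesitant"},
    # ... one entry per image in the sample
]

for feature in ["pairing", "same_race", "contact"]:
    tally = Counter(image[feature] for image in codes)
    print(feature, dict(tally))

# A value that dominates a tally is a "strong" pattern; a value that never
# appears at all points toward a possible weakness in the underlying data.
```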

Now we can begin asking some critical questions:

  1. What data would need to be present to explain these strengths?

  2. What data would need to be absent to explain these weaknesses? 

Weaknesses in your images are usually a result of:

  • sparse training data,

  • training biased toward other outcomes, or

  • system interventions (such as content moderation or censorship).

Strengths are usually the result of prevalence in your training data — the more there is in the data, the more often it will be emphasized in the image.

In short: you can “see” what’s in the data. You can’t “see” what isn’t in the data. So when something is weird, or unconvincing, or impossible to produce, that can give us insight into the underlying model.

Sidebar: Strong vs Weak?

Think of “strong” vs “weak” as a way of measuring the strength of the signal in your data. Kind of like checking your cell phone service. If the signal is strong, you get five bars. If it’s weak, you get one.

If something is represented 100,000 times in a dataset of 1 million, it’s going to produce a pretty strong signal. If it’s only represented 100 times, it’s going to produce a weaker signal.

Here’s another example. Years ago, I was studying the FFHQ dataset used to generate images of human faces for something called StyleGAN.

I noted that the faces of black women were consistently more distorted than the faces of other races and genders. I asked the same question: What data was present to make white faces so strong? What data was absent to make black women’s faces so weak?

Here you can begin to formulate a hypothesis. In the case of black women’s faces being distorted, I could hypothesize that black women were underrepresented in the dataset. When I opened up the dataset, I found that to be the case.

Black women were represented in less than three percent of the images, while white women were represented in nearly a quarter of them. The result: white women’s faces were more detailed. Black women’s faces were less clear, and often had distortions.
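The underlying check is simple arithmetic: tally how often each group is labeled in the dataset and compare the proportions. In the sketch below, the raw counts are invented for illustration; only the rough percentages echo the FFHQ example above.

```python
# Hypothetical label counts for a face dataset; only the rough proportions
# (under 3% vs. nearly 25%) correspond to the FFHQ example described above.
label_counts = {"white women": 17_000, "black women": 2_000, "everyone else": 51_000}
total = sum(label_counts.values())

for label, count in label_counts.items():
    print(f"{label}: {count:,} images, {count / total:.1%} of the dataset")
```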

In the case of kissing, something else is missing. One hypothesis would be that OpenAI didn’t have images of anyone at all kissing. That would explain the awkwardness of the poses. The other possibility is that LGBTQ people kissing are absent from the dataset.

But is that plausible? To test that theory, or whatever you find in your own samples, we would move to step five: look at, or think through, the possible data sources.

Step Five: Look at the Data (If You Can)

You can often find the original training data in white papers associated with the model you are using. Sometimes, you can also use tools to look at the images in the training data for your particular prompt. This can give you another sense of whether you are interpreting the image-data relationship correctly. 

We know that OpenAI trained DALL·E 2 on hundreds of millions of images with associated captions. While the data used in DALL·E 2 isn’t publicly available, you can often peek at the underlying training dataset for other models to see what references they are using to produce their images.

For the sake of this exercise, we’ll look through the LAION dataset, which is used for rival diffusion engines like Stable Diffusion and Midjourney. (LAION is currently offline, for reasons I wrote about here).

Another method is to find training datasets and download portions of them, but this has become harder and harder as datasets have grown larger.

Another option, though far less accurate, is to run a Google image search for your prompt, to see the kinds of images it brings up on the World Wide Web. From a Google search, however, you can only make inferences. If you want to make claims, you’d have to look at the training data — which is why open models are helpful for researchers, while closed or proprietary models pose real challenges.

So, let’s start thinking about these images. What are some patterns? (I’ve posted the same image here again for ease of analysis).

Kissing is Weird

When we look at the images that LAION uses for “photograph of humans kissing,” one other thing becomes apparent: pictures of humans kissing are honestly kind of weird to begin with. The training data consists largely of stock photographs, in which actors are posed together and asked to kiss. This would explain some of the weirdness in the AI images of people kissing: it’s not genuine emotion on display here.

Racial Homogeneity

First, we can consider that the couples appear to be racially homogeneous pairs. This seems like it could be a reflection of the training data, and of a bias in society at large toward presenting media images of same-race couples. We could go deeper into this using the same methods I describe below, but this bias, while important, is well documented in media studies, so it makes sense that it would appear here.

Heteronormativity

You might note that none of them depict gay men or women kissing. I’ll explore this, chiefly, because it doesn’t correlate to the bias of the dataset, and makes for an interesting exception to our rule!

It might be tempting to say that the prevalence of heterosexual couples in stock photography contributes to the absence of LGBTQ subjects in the images. To test that, you could type “kissing” into the training data search engine.
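For datasets with a hosted search index, that query can also be made programmatically. Here is a sketch using the clip-retrieval client against LAION’s public index; the endpoint and index name below are the ones the LAION project has hosted, and, as noted above, the index is currently offline, so treat this as illustrative rather than something you can run today.

```python
# A sketch of querying a LAION index for "kissing" with the clip-retrieval
# client rather than the web search interface.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",  # LAION's public endpoint (currently offline)
    indice_name="laion5B-L-14",
    num_images=40,
)

results = client.query(text="kissing")
for result in results[:10]:
    # Each result carries the caption and source URL of a matching training image.
    print(result.get("caption"), result.get("url"))
```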

Surprisingly, the result for that dataset is almost exclusively pictures of women kissing women — a reflection of the kind of material people post online, and where that data was pulled from. As scholars such as Dr. Safiya Noble have documented, much of the web is centered around pornography — and any dataset built by pulling arbitrary images from random websites will pull in a lot of pornographic, or explicit, content.

In other words, there is plenty of training data for same-sex couples kissing, at least for women. So why isn’t this present in these images? DALL-E 2 clearly wasn’t being biased by the data. If anything, the bias in the data runs the other way — if the training data is overwhelmingly images of women kissing, you would expect to see women kissing more often in your images.

So what’s happening here? Is this idea of the image as an infographic falling apart? To answer that, we have to look at what I call “system level interventions.”

Sidenote! System Level Interventions

An intervention is a system-level design choice, such as a content filter, which prevents the generation of certain images. Here we do have data for DALL·E 2 that can inform this conclusion. We know that pornographic images were removed from OpenAI’s dataset to ensure nobody made explicit content. Other models, because their datasets were scraped from the internet, contain vast amounts of explicit and violent material (see Birhane 2021). OpenAI has made some attempts to mitigate this (in contrast to some open source models).

Removing pornographic content had some interesting, and depressing, results.

From OpenAI’s model card:

We conducted an internal audit of our filtering of sexual content to see if it concentrated or exacerbated any particular biases in the training data. We found that our initial approach to filtering of sexual content reduced the quantity of generated images of women in general, and we made adjustments to our filtering approach as a result.

Could this filtering decision explain the “barrier effect” between kissing faces in our sample images? We can begin to raise questions about the boundaries that OpenAI drew around the notion of “explicit” and “sexual content.” Because it used an automated filter to remove these images, did it remove all images of women kissing?

So we have another question: where were boundaries set between explicit/forbidden and “safe”/allowed in OpenAI’s decision-making? What cultural or social values were reflected in those boundaries?

We can begin to test some of our questions directly; a sketch for automating such probes follows the list below. OpenAI will give you a content warning if you attempt to create images depicting pornographic, violent, or hateful imagery. The following was true as of 2023:

  • If you request an image of two men kissing, it creates an image of two men kissing.

  • If you request an image of two women kissing, you are given a flag for requesting explicit content.
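Probes like these can be scripted against the API. The sketch below uses the OpenAI Python SDK (v1-style client); the model name, image size, and the assumption that a content-policy refusal surfaces as a BadRequestError are mine, and the 2023-era behavior described above has since changed.

```python
# A sketch for probing which prompts a hosted image model will refuse.
# Assumptions: model name, image size, and that a content-policy refusal
# is raised as BadRequestError; actual behavior varies by model and date.
from openai import OpenAI, BadRequestError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

probes = [
    "studio photograph of two men kissing",
    "studio photograph of two women kissing",
]

for prompt in probes:
    try:
        client.images.generate(model="dall-e-2", prompt=prompt, n=1, size="512x512")
        print(f"ALLOWED: {prompt}")
    except BadRequestError as err:
        print(f"REFUSED: {prompt} ({err})")
```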

So, we have very clear evidence that women kissing women is deemed problematic by OpenAI. This is an example of how cultural values become inscribed into AI imagery.

  • First, through the dataset and what is collected and trained.

  • Then, through interventions in what is allowed by users.

After publishing my findings, I noticed that OpenAI began forcibly injecting diversifying keywords into prompts: if I asked for “photographs of humans kissing,” for example, it might add “gay,” “black,” “Asian,” and so on, in order to diversify the output associated with the keyword “human.”

More recently (as of 2023), newer versions of these models have built this in. DALL-E 3, for example, uses OpenAI’s text generation model (GPT-4) to expand your prompts in ways that inject diversity into your requests. Even DALL-E 2, before being replaced by DALL-E 3, began to diversify images by inserting diversifying keywords without the user’s knowledge — a practice called shadow prompting, originally documented in DALL-E 2 by Fabian Offert and Thao Phan.

Today, generating an image of “humans kissing” produces diverse couples as a result — including multiracial and same-sex couples. But this is a result of OpenAI’s deliberate interference with the prompt — not a result of the training data itself. Other models still reflect these biases, and testing for them remains an important but overlooked part of evaluating these systems. It’s a bit of a band-aid, as opposed to a real solution to statistical biases.

Another intervention can be the failures and limits of the model itself. The rendering of kissing lips may reflect a well-known flaw in rendering human anatomy — see my 2020 experiments with hands. The model has no way to constrain the properties of fingers, so they become like tree roots, branching in multiple directions, with multiple fingers per hand and no set length. Lips may do something similar. As models improve, it’s likely that such features will become more natural.

Step Six: Re-Examine the Original Image

We now have a hypothesis for understanding our original image. We may decide that the content filter excludes women kissing from the training data as a form of explicit content. We deduce this because women kissing is flagged as explicit content on the output side, suggesting an ideological, cultural, or social bias against gay women. This bias is evidenced in at least one content moderation decision (banning their generation) and may be present in decisions about what is and is not included in the training data.

The strangeness of the pose in the initial image, and of others showing couples kissing, may also be a result of content restrictions in the training data that reflect OpenAI’s bias toward, and selection for, G-rated content. How was “G-rated” defined, however, and how was the data parsed from one category to another? Human, not machinic, editorial processes were likely involved. Including more “explicit” images in the training data likely wouldn’t solve this problem – and could even create new ones. Pornographic content would create additional distortions. But in a move to exclude explicit content, the system has also filtered out women kissing women, resulting in a series of images that recreate dominant social expectations of relationships and kisses as “normal” between men and women.

Does this mean we should train AI on pornography? Probably not. But it means that the way AI companies collect data seems to sweep up a lot of problematic content, and suggests that gathering data without any human oversight creates real biases in the way these models represent people.

Returning to the target image, we may ask: How do we make sense of it after all this? It’s normal to have lost a lot of interest in that image once you’ve made sense of it! The idea is to use what interests you as a guide to poking into and exploring how these systems work. Once you do, you’re likely to surface something that explains the image — and therefore, makes it less interesting to you.

But by going through the process, you’ve engaged with a lot of meaningful questions that you can now answer.

  • What was encoded into the image through data and decisions?

  • How can we make sense of the information encoded into this image by the data that produced it?

With a few theories in mind, you could run the experiment again: this time, rather than selecting images for the patterns they shared with the notable image, use any images generated from the prompt, asking:

  • Are the same patterns replicated across these images?

  • How many of these images support the theory?

  • How many images challenge or complicate the theory?

Looking at the broader range of generated images, we can see if our observations apply consistently — or consistently enough — to make a confident assertion. Crucially, the presence of images that don’t support your theory does not undermine claims about what is weak or strong in the dataset. Chance is a big part of it. You may get 2000 weak images, and 10 strong ones. That’s why it’s important to remember:

A low number of strong images is still a weak signal.

Images reveal weaknesses in data. Every image is a statistical product: odds are weighted toward certain outcomes. Any consistent failure offers insight into gaps, strengths, and weaknesses of those weights. They may occasionally – or predominantly – be rendered well. What matters to us is what the failures suggest about the underlying data. Likewise, conducting new searches across time can be a useful means of tracking evolutions, acknowledgments, and calibrations for recognized biases. As stated earlier, my sampling of AI images from DALL·E 2 showed swings in bias from predominantly white, heterosexually coded images toward greater representations of genders and skin tones. Sometimes, this can even change day to day.

Conclusions

This approach can be applied to understanding all kinds of AI images. Our case study suggests that examples of cultural, social, and economic values are embedded into the dataset. This approach, combined with more established forms of critical image analysis, can give us ways to read the images as infographics. The method is meant to generate insights and questions for further inquiry, rather than producing statistical claims, though you could design research for quantifying the resulting claims or hypotheses.

Ideally, these insights and techniques move us away from the magic spell of spectacle that these images are so often granted. The method is intended to provide a deeper literacy about where these images are drawn from. Identifying the widespread use of stock photography, and what that means about the system’s limited understanding of human relationships and emotional and physical connections, is another pathway for critical analysis and interpretation.

The method is meant to move us further from the illusion of neutral and unbiased technologies, which is still prevalent when we talk about these tools. We often see AI systems deployed as if they are free of human biases – the Edmonton police (Canada) recently issued a wanted poster including an AI-generated image of a suspect based on his DNA (Xiang 2022).

That’s pure mystification. They are bias engines. Every image should be read as a map of those biases, and they are made more legible using this approach. For artists and the general public creating synthetic images, it also points to a strategy for revealing these problems. One constraint of this approach is that models can change at any given time. It is obvious that OpenAI could recalibrate their model to include images of women kissing tomorrow.

However, calibrating a model for bias on the user end does not erase the presence of that bias. Models form abstractions of categories based on the corpus of images they analyze. Weaknesses and strengths in the dataset shape the images that come out of these models. Removing access to those images, on the user’s end, does not remove their contribution to shaping other images. Early, uncalibrated results are still useful in analyzing contemporary and future outputs. Generating samples over time also presents opportunities for another methodology: tracking the evolution (or lack thereof) of a system’s stereotypes in response to social changes. Media studies benefits from the observation and analysis of how models adapt or update their underlying training images or system interventions.

Likewise, this approach has limits. One critique is that it asks us to look at training data that is often not accessible. As these models move away from research contexts and toward technology companies seeking to make a profit from them, proprietary models are likely to be more protected, akin to trade secrets. The largest open-access dataset, LAION-5B, was recently taken down.

We are left making informed inferences about proprietary datasets by referencing sources of a comparable size and time frame, such as Google image search. Even when we can find the underlying data, researchers may use this method only as a starting point for analysis: it raises the question of where to begin when there are billions of images in a dataset. The method marks only a starting point for examining the underlying training structures at the site where audiences encounter the products of that dataset, which is the AI-produced image.

Now, we may need to learn to read AI images without access to the training data, as a way of making inferences about the biases that training data might contain. If so, the skill of AI image literacy is more important than ever.


This article was originally published on Substack in 2022 and was updated in January 2024. For more writing on the subject, follow Cybernetic Forests.



Sources

Barthes, Roland: Image, Music, Text. London [Fontana Press] 1977

Birhane, Abeba; Vinay Uday Prabhu; Emmanuel Kahembwe: Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes. In: arXiv:2110.01963. 05.10.2021. https://arxiv.org/abs/2110.01963 [accessed February 16, 2023]

Chandler, Daniel; Rod Munday: A Dictionary of Media and Communication. Oxford [Oxford University Press] 2011

Hall, Stuart: Encoding/Decoding. In: Culture, Media, Language: Working Papers in Cultural Studies, 1972–1979. London [Routledge] 1992, pp. 117-127

Hall, Stuart: The Work of Representation. In: Representation: Cultural Representations and Signifying Practices. London [Sage] 1997, pp. 15-74

Harris, Robert: Information Graphics: A Comprehensive Illustrated Reference. New York [Oxford University Press] 1999

OpenAI: DALL·E 2 Preview - Risks and Limitations. In: GitHub. 19.07.2022. https://github.com/openai/dalle-2-preview/blob/main/system-card.md [accessed February 16, 2023]

Offert, Fabian; Thao Phan: A Sign That Spells: DALL-E 2, Invisual Images and the Racial Politics of Feature Space. In: arXiv:2211.06323. 2022. http://arxiv.org/abs/2211.06323

Ramesh, Aditya; Mikhail Pavlov; Gabriel Goh; Scott Gray; Chelsea Voss; Alec Radford; Mark Chen; Ilya Sutskever: Zero-Shot Text-to-Image Generation. In: arXiv:2102.12092. 24.02.2021. https://arxiv.org/abs/2102.12092 [accessed February 16, 2023]

Rose, Gillian: Visual Methodologies: An Introduction to Researching with Visual Materials. London [Sage] 2001

Salvaggio, Eryk: How to Read an AI Image: The Datafication of a Kiss. In: Cybernetic Forests. 02.10.2022. https://cyberneticforests.substack.com/p/how-to-read-an-ai-image [accessed February 16, 2023]

Xiang, Chloe: Police Are Using DNA to Generate Suspects They’ve Never Seen. In: Vice. 11.10.2022. https://www.vice.com/en/article/pkgma8/police-are-using-dna-to-generate-3d-images-of-suspects-theyve-never-seen [accessed February 18, 2023]