AI explainers
How AI image generation works
If you understand how AI images are made, you understand where they break. This is a non-technical explanation of diffusion models — the engine behind Midjourney, Stable Diffusion, DALL·E, and most current image generators — and what their architecture implies for detection.
From noise to image
A diffusion model is trained on images that have been destroyed with random noise, step by step, until the original is unrecognizable. The noising itself is fixed; what the model learns is how to reverse it, to denoise. Once trained, the model can start from pure random noise and 'denoise' its way to an image that matches a text prompt.
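For readers who want to see the shape of the algorithm, here is a minimal, heavily simplified sketch in Python. It uses a toy array in place of an image, and a do-nothing placeholder where a real system like Stable Diffusion would run a trained neural network conditioned on the prompt; the step count, the linear blending schedule, and the function names are illustrative, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(image, t, num_steps=1000):
    """Forward process: blend the image toward pure Gaussian noise.
    At t=0 the image is untouched; near t=num_steps it is almost all noise."""
    signal = 1.0 - t / num_steps
    return signal * image + (1.0 - signal) * rng.standard_normal(image.shape)

def denoise_step(noisy, t, prompt):
    """Reverse process. In a real model this is a trained network that
    predicts the noise to remove, conditioned on the text prompt.
    Here it is a placeholder so the sketch stays runnable."""
    return noisy

# Training (conceptually): noise a real image, ask the network to undo it.
training_image = rng.random((64, 64))
noisy_example = add_noise(training_image, t=500)

# Generation: start from pure noise and denoise step by step.
image = rng.standard_normal((64, 64))
for t in reversed(range(1000)):
    image = denoise_step(image, t, prompt="a cat")
```

The key point the sketch makes concrete: generation is nothing but the reverse loop at the bottom, run from pure noise.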
Why this leaves artifacts
The model has no concept of physical objects. It has learned statistical correlations: what 'cat fur' tends to look like next to 'eye'. When asked to render structures it has seen from fewer angles (hands, ears, the backs of teeth), it falls back on averaged patterns. Result: the famous extra-finger problem.
It also has no concept of light sources. Lighting is consistent only insofar as the training data happened to make it so. Reflections and shadows are generated independently of any 3D scene, so they often disagree across the image.
What detection exploits
Detectors take advantage of the same statistical structure. AI images have characteristic frequency-domain patterns — the residue of the denoising process — that real photos don't share. They also have texture continuity quirks: real fabric and skin follow physics; generated versions follow learned averages.
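As an illustration of what one frequency-domain check can look like, here is a hedged sketch that computes a radially averaged power spectrum with NumPy. The random stand-in image, the function name, and the idea of summarizing the high-frequency tail as a single number are all illustrative assumptions; real detectors train classifiers on many such features rather than thresholding one value.

```python
import numpy as np

def radial_power_spectrum(gray_image):
    """Average the 2D FFT power spectrum over rings of equal frequency,
    giving a 1D curve from low (left) to high (right) spatial frequency."""
    f = np.fft.fftshift(np.fft.fft2(gray_image))
    power = np.abs(f) ** 2
    h, w = power.shape
    y, x = np.indices(power.shape)
    r = np.hypot(y - h // 2, x - w // 2).astype(int)
    # Mean power within each integer radius band.
    sums = np.bincount(r.ravel(), weights=power.ravel())
    counts = np.bincount(r.ravel())
    return sums / np.maximum(counts, 1)

# Example: summarize the high-frequency tail, where denoising residue
# tends to show up. A real grayscale image would replace the random array.
img = np.random.rand(256, 256)
spectrum = radial_power_spectrum(img)
tail_energy = spectrum[len(spectrum) // 2:].mean()
print(f"high-frequency energy: {tail_energy:.3e}")
```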
Where this is heading
Newer approaches (ControlNet guidance, multi-stage diffusion, transformer-based image models) are reducing the obvious tells. Hand artifacts have shrunk dramatically since 2023. Detection is a cat-and-mouse game: each generation of models produces fewer easy signals, and each generation of detectors retrains on the new patterns.
Try the tool
AI Image Detector
If you suspect an image is generated, run it through the detector; it scores the kinds of patterns described above.