Lesson 1.1 · Module 1 · AI & ML Foundations · 12 min read

How AI models actually work

Open the bonnet just far enough. You don't need the maths, but you do need the mental picture of what's happening when you hit Generate.

It starts every image from pure static — and walks the noise back toward your courtyard.

Watch a diffusion model render a Chettinad courtyard and it looks like a magician pulling a building from thin air. It isn't. Under the hood, the model literally begins with a square of random visual noise — TV static — and removes a little of that noise at a time, nudging each step toward something that matches your words. Thirty or so steps later, a courtyard. The language model that drafts your spec is doing the same trick in another medium: predicting one plausible word, then the next, then the next. Once you can picture those two loops, every AI tool in this course stops being magic and starts being machinery you can drive.

The idea

Two loops: predict the next word, denoise the next pixel

Step 01 — The neural network

A net of numbers that learned which patterns go together

Underneath an LLM or a diffusion model sits a neural network: millions or billions of little numbers called weights, arranged in layers. Nothing in there is a rule a programmer typed. During training, the network was shown enormous amounts of data and the weights were nudged, again and again, until the patterns it had absorbed could reproduce that data well. Think of it less as a brain and more as an unfathomably detailed map of which things tend to go with which — the way 'laterite' goes with 'Kerala', the way a sloping roof goes with heavy rain.
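For the curious, that nudging can be sketched in a few lines of Python. This is a toy under loud assumptions: one weight instead of billions, three made-up data points where the answer is always three times the input, and a hypothetical `train` function invented for this illustration. Real training follows the same idea at a vastly larger scale.

```python
# Toy "training": a single weight, nudged until it reproduces the data.
# ASSUMPTION: the data is invented for illustration (y is always 3 times x).
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]


def train(steps=200, learning_rate=0.01):
    """Hypothetical helper: returns the weight after repeated small nudges."""
    weight = 0.0  # the network starts out knowing nothing
    for _ in range(steps):
        for x, target in data:
            prediction = weight * x
            error = prediction - target
            # Nudge the weight slightly in the direction that shrinks the error.
            weight -= learning_rate * error * x
    return weight


learned = train()
print(round(learned, 3))
```

Nobody typed "multiply by three" anywhere in that code. The weight drifted there because that is the pattern that reproduces the data.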

You never touch the weights. You touch the input — your prompt — and the network runs it through all those layers to produce an output. The intelligence is statistical: it has seen the pattern of 'courtyard house' tens of thousands of times and learned, in numbers, what usually appears in one. That is power and limitation in the same sentence.

Two engines, one idea. The LLM predicts the next word; the diffusion model removes noise toward the next image. Both walk step by step from a starting point toward the most plausible result.
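If it helps to see the two loops side by side, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the four-number "image", the `target` list standing in for what a trained model believes your prompt looks like, and the tiny hand-written `NEXT_WORD` table standing in for a language model. A real diffusion model predicts the noise to remove at each step, and a real LLM scores every possible next token; this only shows the shape of the loops.

```python
import random

# ASSUMPTION: a hand-written lookup standing in for a trained language model.
NEXT_WORD = {"the": "courtyard", "courtyard": "stays", "stays": "cool"}


def complete(first_word, max_words=3):
    """Loop 1: add one plausible word, then the next, then the next."""
    words = [first_word]
    for _ in range(max_words):
        following = NEXT_WORD.get(words[-1])
        if following is None:
            break  # the toy table has nothing plausible to add
        words.append(following)
    return " ".join(words)


def denoise(target, steps=30, seed=0):
    """Loop 2: start from static and remove a little noise per step."""
    rng = random.Random(seed)
    image = [rng.uniform(0.0, 1.0) for _ in target]  # pure static
    for step in range(steps):
        remaining = steps - step
        # Close a fraction of the remaining gap; the final step closes it fully.
        image = [pixel + (goal - pixel) / remaining
                 for pixel, goal in zip(image, target)]
    return image


print(complete("the"))
print(denoise([0.2, 0.9, 0.4, 0.7]))
```

Both functions do the same thing in different media: take a starting point, then walk step by step toward something plausible.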

A model has no rules inside it — only patterns it absorbed. That's why it can't quote your bye-law but can nail the _vibe_ of a Goan villa.

Step 02 — Tokens, embeddings, latent space

How words and images become numbers the model can think in

Three bits of vocabulary unlock the rest of the course. A token is a chunk of text, roughly a word or part of a word. An LLM doesn't read 'reinforced' letter by letter; it reads it as a token (or two) and works in tokens, not letters. That's why it sometimes miscounts characters: it never saw them.
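A rough sketch of what tokenising looks like, for anyone who wants to see it. The `VOCAB` set and the `tokenize` function are invented for this example; real tokenisers learn a vocabulary of tens of thousands of pieces from data and split words differently. The point is only that the model receives chunks, not letters.

```python
# ASSUMPTION: a tiny made-up vocabulary; real ones are learned from data.
VOCAB = {"rein", "forced", "canti", "lever", "court", "yard"}


def tokenize(word):
    """Hypothetical greedy splitter: always take the longest known piece."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, shrinking until one is known.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            # Fall back to a single character if no known piece fits.
            if piece in VOCAB or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens


print(tokenize("reinforced"))
print(tokenize("cantilever"))
```

In this toy, a ten-letter word reaches the model as two chunks, so a question about its individual letters is a question about something the model never directly saw.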

An embedding is the clever part. Every token, and every image patch, gets turned into a long list of numbers — coordinates in a vast mathematical space. In that space, things that mean similar things sit close together: 'terracotta' lands near 'brick' and 'warm', far from 'glass curtain wall'. The model doesn't know what terracotta is; it knows where it sits relative to everything else.
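"Sits close together" can be made concrete with a small sketch. The three-number coordinates below are made up for illustration (real embeddings have hundreds or thousands of dimensions and are learned, not typed), and `cosine_similarity` is a standard way to measure how closely two such lists point in the same direction.

```python
import math

# ASSUMPTION: invented coordinates, chosen by hand to illustrate the idea.
EMBEDDINGS = {
    "terracotta": [0.9, 0.8, 0.1],
    "brick": [0.8, 0.7, 0.2],
    "glass curtain wall": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Closer to 1.0 means the two vectors point in a more similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


near = cosine_similarity(EMBEDDINGS["terracotta"], EMBEDDINGS["brick"])
far = cosine_similarity(EMBEDDINGS["terracotta"], EMBEDDINGS["glass curtain wall"])
print("terracotta vs brick:", round(near, 2))
print("terracotta vs glass curtain wall:", round(far, 2))
```

With these made-up numbers, 'terracotta' scores much higher against 'brick' than against 'glass curtain wall'. That is all "neighbours" means: a distance you can calculate.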

Latent space is just the name for that compressed inner space where the model does its work — a map of meaning. When a diffusion model 'imagines', it's moving through latent space; when an LLM picks the next word, it's reading coordinates there. You'll hear 'latent space' a lot. It only means: the model's internal map, where similar ideas are neighbours.

Latent space is the model's map of meaning. Words and images become coordinates; similar ideas sit close. 'Terracotta' lands near 'brick' and 'warm', far from a glass curtain wall.

Step 03 — Why the same prompt gives a different building

There is no single answer inside — only a cloud of plausible ones

Type the identical prompt twice into Midjourney or FLUX and you get two different houses. New users find this maddening; it's the most important behaviour to understand. The model holds no single 'correct' courtyard. It holds a distribution — a fog of plausible courtyards — and each generation starts from a different random seed, so it lands on a different point in that fog.
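The role of the seed can be shown with Python's built-in random number generator. This is a sketch, not how any particular image tool is implemented: `starting_noise` is a hypothetical function, and four numbers stand in for the full square of static a real model would begin from.

```python
import random


def starting_noise(seed, size=4):
    """Hypothetical stand-in for the static a generation begins from."""
    rng = random.Random(seed)  # the seed fully determines the numbers that follow
    return [round(rng.random(), 3) for _ in range(size)]


first_run = starting_noise(seed=42)
repeat_run = starting_noise(seed=42)   # same seed: identical static
fresh_run = starting_noise(seed=43)    # new seed: different static

print(first_run == repeat_run)
print(first_run == fresh_run)
```

Same seed, same starting static, same image. A new seed gives new static and a different point in the fog, which is why fixing the seed is the reproducibility control.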

LLMs do the same, more subtly. At each step the model has a ranked list of plausible next words, each with a probability, and a setting called temperature decides how adventurously it picks from that list. Low temperature: safe and repeatable. Higher temperature: more varied and creative. That randomness isn't a bug; it's exactly what lets AI diverge for you, giving you twenty distinct directions when you ask for twenty. The spine of this course in one line: diverge with the machine, converge with your judgement. The model gives you a plausible cloud; you pick the point that's actually true to the site, the client and the code.
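Temperature has a simple arithmetic core, sketched below. The candidate words and their scores are invented for this example, and the function names are hypothetical; the calculation itself (divide the scores by the temperature, then convert them to probabilities) is the standard softmax-with-temperature step.

```python
import math
import random

# ASSUMPTION: made-up scores for three candidate next words.
SCORES = {"shading": 2.0, "ventilation": 1.5, "fountains": 0.5}


def next_word_probabilities(scores, temperature):
    """Turn raw scores into probabilities; temperature reshapes the spread."""
    scaled = {word: score / temperature for word, score in scores.items()}
    biggest = max(scaled.values())  # subtracted for numerical stability
    exps = {word: math.exp(value - biggest) for word, value in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def pick_next_word(scores, temperature, rng):
    """Sample one word according to the temperature-adjusted probabilities."""
    probs = next_word_probabilities(scores, temperature)
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words])[0]


low = next_word_probabilities(SCORES, temperature=0.2)
high = next_word_probabilities(SCORES, temperature=2.0)
print("low temperature:", {w: round(p, 2) for w, p in low.items()})
print("high temperature:", {w: round(p, 2) for w, p in high.items()})
print(pick_next_word(SCORES, 2.0, random.Random(7)))
```

At low temperature the top-scoring word takes almost all of the probability, so the answer barely changes between runs. At high temperature the probabilities even out and the runner-up words get picked far more often.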

Read it your way

For the architect

Knowing it works in tokens and latent space, not facts, tells you why it can't be your code consultant. There is no clause stored in there to look up — only the _shape_ of sentences that mention clauses. The same machinery explains its genius at concept: latent space lets it blend 'Brutalist' and 'tropical' into something coherent because those ideas are points it can interpolate between. Use that blending power early; never mistake it for retrieval of a fact you can sign off.
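Interpolating between two ideas is, at its simplest, a weighted average of their coordinates. The sketch below assumes invented three-number coordinates for 'Brutalist' and 'tropical' and a hypothetical `blend` function; real models blend in far higher dimensions and by less direct means, but the arithmetic intuition is the same.

```python
# ASSUMPTION: made-up coordinates for two styles in a toy latent space.
BRUTALIST = [1.0, 0.0, 0.2]
TROPICAL = [0.0, 1.0, 0.8]


def blend(a, b, mix=0.5):
    """Hypothetical helper: mix=0 returns a, mix=1 returns b, 0.5 is halfway."""
    return [(1 - mix) * x + mix * y for x, y in zip(a, b)]


halfway = blend(BRUTALIST, TROPICAL)
mostly_brutalist = blend(BRUTALIST, TROPICAL, mix=0.25)
print(halfway)
print(mostly_brutalist)
```

The blended point is a new location on the map, not a stored building. That makes it useful for concept work and useless as a fact you can sign off.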

For the interior designer

Embeddings are why a single well-chosen word swings a whole mood board. 'Wabi-sabi', 'Chettinad', 'biophilic' each land in a rich neighbourhood of latent space and pull the entire image with them. That's a styling superpower. But the same fog of plausibility is why the model can't hold an _exact_ sofa across two renders without help — each generation re-samples the cloud. Lock things you must keep (we cover img2img and editing later); let the model roam where you want options.

For the student & solo studio

You don't need to learn the maths, and nobody will quiz you on transformers. But this one picture — predict-the-next-word and denoise-the-next-pixel, both moving through a map of meaning — is worth memorising. It's the difference between fighting the tools and steering them. When a render comes out wrong, you'll know to re-roll the seed rather than retype; when an LLM rambles, you'll know to lower the temperature or tighten the prompt. That's professional control, free.

The engines doing the predicting and denoising (as of 2026)

tools date fast · verify

Stable Diffusion (open-source)

Diffusion model you can see inside

Because it runs locally and is open, it's the clearest place to watch the denoising loop and even step through it. Add-ons such as ControlNet are built on top of it, and it sits under the hood of most architecture render tools, but raw, it needs real setup and a decent GPU.

FLUX1.1 [pro] / FLUX.2 (Black Forest Labs)

Diffusion model, fast and photoreal

Generates a sharp photoreal image in roughly four to five seconds, commercially licensed. Excellent realism. As of 2026 it's the engine behind a lot of 'design AI', including Studio Matrx's wall-only recolour (FLUX Kontext). Still samples a fresh cloud each run, so expect variation.

Claude (Anthropic)

Large language model, next-token prediction

A clean example of the LLM loop at scale, with a 200k-token context window so it holds long specs and briefs coherently. Studio Matrx's own platform is built on it. Like every LLM it predicts the plausible word, not the verified fact, so it will sound certain about a code it invented.

Common misconception

“The model is searching a giant database of real buildings and pictures and showing me the closest match.”

It isn't searching anything. There is no library of stored images inside a diffusion model — the entire training set has been compressed into weights, into patterns. Each render is freshly generated from noise, not retrieved. That's why it can produce a courtyard that has never existed, and equally why it can't fetch the one real product photo you actually need. Generation, not lookup, is the whole game.

Hands-on workshop

Free: any chat AI (Claude / ChatGPT / Gemini) + any image AI (Midjourney, FLUX, or a free generator).

Workshop — watch the two loops with your own eyes

Fifteen minutes to make 'next-token prediction' and 'denoising' real, using the same villa brief in both an image AI and a chat AI. Do it before you read on; the picture sticks better than the paragraph.

Copy & adapt

IMAGE AI -- run this exact prompt FOUR times:
"sunlit Chettinad courtyard house, lime-plaster walls,
teak columns, oxide red floor, photorealistic"

CHAT AI -- paste this and ask it to continue ONE word
at a time, pausing so you can see the choices:
"In a hot-dry Indian climate, the most important
passive cooling strategy for a courtyard house is"

1. Run the image prompt four times. Lay the four results side by side. You did not change a word, yet you got four different houses. Name what differs (massing, light, plant choices) and what stays constant (the courtyard, the oxide floor). That constant-vs-varied split is the plausibility cloud, visible.
2. Re-run it once more but, if your tool allows, fix the seed to the previous value. Notice the image now repeats. That single setting is your reproducibility control; remember it exists.
3. Switch to the chat AI and run the sentence-completion prompt. Watch how it commits to one word, then builds the next on top of it. That is next-token prediction, live.
4. Re-run the same sentence two or three times. The completions diverge: same loop as the images, different medium. If your tool exposes temperature, drop it low and watch the answers converge.
5. In your notes, write one line linking the two: 'Both started from randomness and walked toward the plausible: pixels for the image, words for the text.'
6. Finally, pick the single best courtyard of your four and write why you chose it. That sentence is you converging, the job the machine cannot do.

You’ll walk away with
A four-up sheet of the same prompt giving four buildings, plus a one-line note that fuses denoising and next-token prediction into a single mental model you can apply to every tool that follows.

Try it

Two quick probes, if you have five more minutes.

01 Ask a chat AI to 'count the letters in CANTILEVER' a few times. Watch it fumble: proof it works in tokens, not characters.
02 Give an image AI two clashing style words ('Brutalist tropical bungalow') and see it interpolate a coherent blend. That's latent space doing its thing.

The idea to carry forward

An LLM predicts the next plausible word; a diffusion model denoises toward the most plausible image; both think in tokens and embeddings inside a latent space, a map where similar ideas are neighbours. There's no stored fact and no single answer inside — only a cloud of plausibles, freshly generated. So re-roll to diverge, then converge with your own judgement.

In one breath

Neural nets are patterns, not rules. Tokens and embeddings turn words and images into coordinates in latent space. LLMs predict the next token; diffusion denoises pixels. Randomness (seed, temperature) is why the same prompt gives a different building — that's the divergence engine, not a fault.

Make it real

Questions

Why does AI give me a different image every time I use the same prompt?

Because the model holds no single correct answer — it holds a distribution of plausible ones, and each generation starts from a different random seed, landing on a different point in that cloud. It's a feature, not a bug: that variation is exactly what lets AI throw you many directions. Fix the seed if you want the same image back.

What is 'latent space' in simple terms?

It's the model's internal map of meaning. Every word and image patch becomes a list of numbers — coordinates — and things that mean similar things sit close together: 'terracotta' near 'brick' and 'warm'. When the model 'imagines', it moves through this space. You don't need the maths; just know that similar ideas are neighbours, which is why one word can swing a whole render.

Do I need to understand neural networks to use AI as a designer?

No. You need the mental picture, not the mathematics. Knowing that an LLM predicts the next token and a diffusion model denoises toward an image — and that both work from patterns, not stored facts — is enough to direct the tools with confidence and to know why they hallucinate. You'll never write code or touch a weight.

Now you know HOW the model produces the plausible. The next question is darker: plausible according to WHOSE world — because everything it knows came from training data with a heavy accent.
