Unified Intelligence Systems: Notes from Amit Jain (Luma AI) at Stanford CS153

One transformer for text, images, video, and audio. Why Luma pivoted from 3D to video to unified, and what the agent shape around it looks like.

Overview

Luma AI is a San Francisco frontier lab building generative video and multimodal models. Best known for Dream Machine (text-to-video), now shipping the Uni1 unified model behind their agent product. Customers include Coca-Cola (~$3B/yr in content production moving to Luma), Netflix, Amazon Prime Studios, and the Publicis ad group. Raised ~$1.5B total, $1B in the last 12 months.
Stanford CS153 Frontier Systems is a class on how AI factories actually get built. Amit Jain (Luma founder) gave the week-3 guest lecture on what he calls unified intelligence systems.
Thesis: stop bolting separate specialist models together with thin bridges. Put text, image, video, and audio in one transformer backbone the way LLMs already handle text.
Talk covers Luma’s two pivots (3D → video → unified), the agent shape on top of the model, and what is still missing before image and video models become as broadly useful as ChatGPT.

Why did Luma start with 3D and end up doing video?

Original thesis: 3D point clouds carry more information than flat photos. If you encode them so gradient descent works, you can train models the way other labs train LLMs.
Reality check: a single capture app cannot match the volume of photos and videos already on the public internet. Algorithm quality stops mattering once your modality is data-starved.

Wherever there is scale in data, that is the only thing that will work. You design algorithms around where the data is, not the other way around.

2023 pivot: NVIDIA H100 lands, Luma jumps to video. Video is photos plus a time axis, which is also how the human brain learns 3D.
Dream Machine launched March 2024. Six million users in its first month.
2025 ceiling: video models render scenes well but lack causality, instruction-following, and sequence reasoning. The things LLMs do well were missing.

Why does stitching specialist towers together hit a wall?

Most 2025 multimodal models look like this. Amit’s read on Google nano-banana fits the shape:

Big language tower writes a long text prompt describing what the image should look like.
Big image tower generates pixels conditioned on that text.
700–800M parameter encoder bridges the two towers.

The bridge is the bottleneck:

Dense spatial information (labels, schematics, precise diagrams) does not survive the trip through a 700M parameter pipe.
Amit tried generating the CS153 factory slide with nano-banana. Got something that looked like a diagram but was not one.
LLMs do not have this problem. Understanding and generation happen in the same model with no thin bridge in between.

Isn’t this just “multimodal”?

“Multimodal” is too broad. Almost anything that accepts more than one input type qualifies. Unified is a specific subset.

Type	Understanding	Generation	Example
VLM	Images	Text only	GPT-4V, early Gemini
Diffusion + text encoder	Text only	Images only	Stable Diffusion, Flux
Fused (big towers + thin bridge)	Both	Both	nano-banana, Sora
Unified	Both	Both, in one backbone	Luma Uni1

Distinction is “one backbone.” Fused models look unified from the outside but are two separate towers with a small encoder between them inside.
Unified has modality-specific encoders and decoders only at the input and output. Reasoning happens over all tokens in one shared space.
Analogy: multimodal = a person with an interpreter next to them. Unified = a person who thinks in both languages with the same mind.
Why it matters: dense spatial info survives, multi-turn memory works, cross-modality reasoning (“look at this image, continue as video”) becomes possible.
Every unified model is multimodal, but most multimodal models are not unified.
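
The table above can be read as a small decision rule. The sketch below encodes that rule in Python; the class name, field names, and the "text-only" fallback are illustrative assumptions and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Architecture:
    name: str
    understands_images: bool  # accepts images as input
    generates_images: bool    # produces images as output
    shared_backbone: bool     # understanding and generation live in one backbone


def classify(arch: Architecture) -> str:
    """Map an architecture onto the rows of the table above."""
    if arch.understands_images and arch.generates_images:
        # Both directions work; only the internals separate fused from unified.
        return "unified" if arch.shared_backbone else "fused"
    if arch.understands_images:
        return "VLM"
    if arch.generates_images:
        return "diffusion + text encoder"
    return "text-only"  # not a row in the table; added so the function is total


# Rows as the notes describe them (the talk's classification, not a measurement).
fused = Architecture("two towers + thin bridge", True, True, False)
unified = Architecture("one backbone", True, True, True)
vlm = Architecture("vision-language model", True, False, False)
```

The only field that separates the last two rows is `shared_backbone`, which is the point of the "one backbone" distinction.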

What is a unified model in one sentence?

Encode every modality (text, image, video, audio) into a shared representation, then have one transformer reason over all of it.

Brain analogy: visual cortex and auditory cortex are separate, but reasoning, judgment, and planning converge in the neocortex. Unified models are shaped the same way.
Text encodes best as discrete tokens. Images and audio encode best as continuous vectors. Video sits in between.
One transformer backbone reasons over all tokens in one space. Modality-specific decoders sit at the output.
Transformers do not care what kind of token they process. Failures show up at the encoder/decoder boundary, not in the backbone.
Luma spent about a year on failed scaling attempts before landing on this. They now believe it scales to hundreds of billions of parameters.
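
A minimal sketch of the "one sequence, one space" idea, in plain Python with no ML library. Text ids go through an embedding lookup (discrete path), image patches go through a linear projection (continuous path), and a single attention step runs over the concatenated sequence. Everything here is an assumption for illustration: the sizes, the random weights, and the identity Q/K/V maps. It is not Luma's architecture, and it omits the modality-specific decoders.

```python
import math
import random

D_MODEL = 4    # hypothetical shared embedding width
PATCH_DIM = 6  # hypothetical size of one flattened image patch


def embed_text(token_ids, table):
    """Discrete path: look each token id up in an embedding table."""
    return [table[t] for t in token_ids]


def project_patch(patch, weights):
    """Continuous path: linear projection of a patch vector into the shared width."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Single-head attention with identity Q/K/V maps over one mixed sequence."""
    scale = math.sqrt(len(seq[0]))
    outputs, weights = [], []
    for query in seq:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in seq]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * value[d] for w, value in zip(row, seq))
                        for d in range(len(query))])
    return outputs, weights


rng = random.Random(0)  # untrained random weights, fixed seed for repeatability
vocab_table = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(10)]
patch_weights = [[rng.uniform(-1, 1) for _ in range(PATCH_DIM)] for _ in range(D_MODEL)]

text_tokens = embed_text([3, 7], vocab_table)
image_tokens = [project_patch(p, patch_weights)
                for p in ([0.1] * PATCH_DIM, [0.9] * PATCH_DIM)]
sequence = text_tokens + image_tokens  # both modalities share one sequence
outputs, attn = self_attention(sequence)
```

Each row of `attn` assigns nonzero weight to every position, so an image token's output mixes in text tokens and vice versa. The backbone step never checks which modality a token came from; the modality-specific code is confined to `embed_text` and `project_patch`.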

What does the agent shape around the model look like?

The product is a three-layer stack with the unified model at the bottom:

Skills (top): domain knowledge as context. An internal expert wrote a 50-page document on what makes a slide well-designed. That document is loaded as a skill at task time. No fine-tuning. Same base model serves advertising, energy, robotics, and studio work.
Tool harness (middle): general-purpose ability to call APIs, run code, deploy artifacts, drive a Linux box.
Unified model (bottom): reads the input, picks which skill applies, decides which tools to call, generates the output, iterates over multiple turns.
Runtime is a REPL (read, eval, print, loop). Computers have had this shape for 70 years; the eval slot is now a unified model.
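
The stack above can be sketched as a loop. This is a toy under stated assumptions: `stub_model` stands in for the unified model, `pick_skill` is a keyword match rather than a model decision, and the `word_count` tool and both skill texts are hypothetical placeholders. None of these names come from Luma's product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Action:
    tool: Optional[str]  # tool to call next, or None when the model is finished
    argument: str        # tool input, or the final answer when tool is None


def pick_skill(task: str, skills: Dict[str, str]) -> str:
    """Stub for skill selection: first skill whose name appears in the task."""
    for name in skills:
        if name in task:
            return name
    return next(iter(skills))  # fall back to the first registered skill


def stub_model(context: List[str]) -> Action:
    """Stand-in for the unified model: one tool call, then a final answer."""
    tool_lines = [line for line in context if line.startswith("tool:")]
    if not tool_lines:
        return Action(tool="word_count", argument=context[-1])
    return Action(tool=None, argument="finished; saw " + tool_lines[-1])


def run_agent(task: str,
              skills: Dict[str, str],
              tools: Dict[str, Callable[[str], str]],
              model: Callable[[List[str]], Action] = stub_model,
              max_turns: int = 5) -> str:
    skill = pick_skill(task, skills)
    context = ["skill: " + skills[skill], "task: " + task]  # read
    for _ in range(max_turns):                               # loop
        action = model(context)                              # eval
        if action.tool is None:
            return action.argument                           # print
        result = tools[action.tool](action.argument)
        context.append("tool:" + action.tool + " -> " + result)
    raise RuntimeError("turn budget exhausted before the task finished")


TOOLS = {"word_count": lambda text: str(len(text.split()))}
SKILLS = {
    "slide": "Hypothetical slide-design guidance.",
    "schematic": "Hypothetical schematic guidance.",
}

answer = run_agent("draft a slide about data scale", SKILLS, TOOLS)
```

Tool results are appended to the context, so each pass through the loop sees everything from earlier turns. That is the multi-turn iteration the notes describe, in miniature.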

What do customers want in 2026?

Not a 5-second clip. The whole job:

Studios want full shots. Luma agents are reportedly producing nearly all visual work for an upcoming Prime Video Moses series (~$4.5M per episode, Ben Kingsley starring).
Advertisers want full campaigns generated end to end.
Robotics companies want full action sequences plus the model judging whether each action was correct.
An energy customer fed the model its grid diagrams and grid code. The unified model now draws better schematics than coding-only LLMs because it can actually see the layout.
Common thread: success measured by ability to finish the task, not by single-shot output quality.

What’s the gap before image and video models match ChatGPT?

Amit’s one-word answer: intelligence.

Working with a model that is “not intelligent yet” feels like working with a person you do not trust:

Forgets what you said. Today’s video models have weak memory.
Takes words too literally. No causal model.
Handles small tasks but falls apart on bigger ones. Cannot compose end to end.
Historical analogue: RLHF for LLMs. Turned chat from a research demo into a product by enabling multi-turn iteration.
Image and video models have not had their RLHF moment yet. Unified architecture is Luma’s bet on what triggers it.

Side note: GANs and diffusion

GANs survive in distillation and real-time systems but lost the mainstream because researchers do not enjoy working with them. In Amit's telling, “what researchers enjoy is what gets built” is effectively a law of physics for where AI goes.
Pure diffusion is on the way out at Luma too. The lab is moving to a hybrid autoregressive + diffusion design because, in its view, pure diffusion has scaling pathologies that more compute does not fix.
This is a single frontier lab’s bet, not industry consensus. Read it that way.

FAQ

Why couldn’t Luma keep scaling 3D capture?

A single capture app cannot match the data volume already on the public internet. Algorithm quality stops mattering once the modality is data-starved. Direct consequence of “design around where the data is.”

What is the actual bottleneck in nano-banana style models?

The 700–800M parameter encoder bridging the language tower and the image tower is too narrow. Cannot carry enough semantic detail for things like cleanly labeled diagrams. Output looks like the right thing but is not precisely what was asked for.

How is a unified model different from “just a bigger multimodal model”?

Structural change, not size change. Understanding and generation collapse into one backbone instead of living in separate towers. Modalities only differ at the encoder/decoder boundary; everything in the middle is shared.

Why bet on one mega-model instead of many specialists with a judge?

A federated system inherits the bottleneck of its weakest specialist. A unified model can route capacity to whatever the task needs. Amit’s framing: intelligence looks like a brain, not a pipeline of databases.

What is the skills layer and why is it outside the model?

Skills are domain knowledge (like a 50-page slide design document) injected at task time, not baked into weights. Updating a skill is cheap; retraining a model is not. Same base model serves multiple verticals through different skills.
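
A small sketch of why a skill update is cheap. The skill is text prepended at call time, so editing it changes behavior without touching the model. `frozen_model`, `with_skill`, and the guidance strings are hypothetical stand-ins, not Luma's interface.

```python
from typing import Dict


def frozen_model(prompt: str) -> str:
    """Stand-in for a base model whose weights never change in this sketch."""
    # Echo the first guidance line so the effect of the skill text is visible.
    return "following: " + prompt.splitlines()[1]


def with_skill(task: str, skill_text: str) -> str:
    """Prepend the skill document to the task at call time; nothing is trained."""
    return "[skill]\n" + skill_text + "\n[task]\n" + task


skills: Dict[str, str] = {"slides": "one idea per slide"}

before = frozen_model(with_skill("make a deck", skills["slides"]))

# Updating a skill is a text edit; the model function above is untouched.
skills["slides"] = "one idea per slide, label every axis"

after = frozen_model(with_skill("make a deck", skills["slides"]))
```

The same `frozen_model` serves any vertical by swapping the dictionary entry, which mirrors the "same base model, different skills" claim.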

What does end-to-end work need that 5-second clip generation didn’t?

Multi-turn memory, causal and physical understanding, long-sequence coherence, and the ability to judge its own output against the original instruction. Single-shot generation cannot deliver any of these.

Is diffusion really on the way out, or is this Luma’s bet alone?

Closer to Luma’s bet alone. Luma’s diagnosis is that pure diffusion has scaling pathologies that are hard to fix, but other labs have not publicly reached the same conclusion. Read it as one frontier lab’s direction, not the industry’s verdict.
