1 year of building a CUA service

Where this started

I started building my first CUA (computer use agent) app in the second week of April 2025. As of June 2026, it has been a little over a year. I was not working on it full time; I was doing SWE work and other things alongside it. Still, enough time has passed that I can say this clearly: the first CUA concept I had probably failed, at least in its original form.

The original idea was simple. I wanted an agent that could remotely control my computer for me. I would give it a task, it would look at the screen, plan what to do, click around, type, and finish the work. The dream was obvious: if an agent could do real labor on my computer, that would be incredibly valuable.

In the first version, almost all decision-making went through the LLM. Planning, action selection, and execution all depended on the model. At the time I was experimenting with GPT-4o and o3. Both were too slow for this kind of product, and their computer use capability was nowhere near good enough.

The bigger problem was not just speed. The model had to actually act on the screen. It had to find coordinates, click the right button, follow a workflow, recover when the UI changed, and not get lost. In practice, that broke constantly. A VLM can understand a screenshot in a broad sense, but a product needs much more than broad understanding. It needs boring precision.

What technically worked

The first important change was splitting planning from execution. I still used the LLM to decide what should happen next, but I stopped asking it to perform every low-level action directly. Instead, local code handled execution.
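A minimal sketch of that split, not the project's actual code. The planner is replaced by a fixed stand-in function, and `PlanStep`, `run_plan`, `fake_planner`, and the executor table are all hypothetical names made up for illustration. The point it shows is that the planner emits symbolic steps and local handlers carry them out, handing control back when a step fails.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlanStep:
    """One high-level instruction produced by the planner."""
    action: str      # e.g. "click" or "type"
    target: str      # symbolic name of a UI element, not a coordinate
    text: str = ""   # payload for "type" steps


def run_plan(plan: List[PlanStep],
             executors: Dict[str, Callable[[PlanStep], bool]]) -> List[str]:
    """Run each step with a local handler and stop at the first failure."""
    log: List[str] = []
    for step in plan:
        handler = executors.get(step.action)
        if handler is None or not handler(step):
            log.append(f"failed:{step.action}:{step.target}")
            break  # a real system would return to the planner to re-plan here
        log.append(f"ok:{step.action}:{step.target}")
    return log


def fake_planner(task: str) -> List[PlanStep]:
    """Stand-in for the LLM call (assumption): returns a fixed two-step plan."""
    return [PlanStep("click", "search_box"),
            PlanStep("type", "search_box", task)]


# Local executors. Real ones would locate the target on screen and send input;
# these only check whether the target is already known.
known_targets = {"search_box"}
executors = {
    "click": lambda s: s.target in known_targets,
    "type": lambda s: s.target in known_targets and bool(s.text),
}

log = run_plan(fake_planner("weather"), executors)
```

With this shape, the model is only consulted when a plan is needed or when a step fails, rather than once per click.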

For the local side, I used classic vision algorithms. AKAZE, template matching, pre-learned image vectors, and cosine similarity were more useful than I expected. If the system already knew what a button looked like, it could find that button again very quickly. It did not need another full model round trip just to click something it had already seen.
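To make the vector lookup concrete, here is a small sketch of matching a screen crop against stored image vectors with cosine similarity. The three-dimensional vectors, the target names, and the 0.9 threshold are illustrative assumptions; how the real vectors were produced is not covered here.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_target(query: List[float],
                known: Dict[str, List[float]],
                threshold: float = 0.9) -> Optional[Tuple[str, float]]:
    """Return the best-matching known target, or None if nothing is close enough."""
    best_name: Optional[str] = None
    best_score = -1.0
    for name, vector in known.items():
        score = cosine_similarity(query, vector)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < threshold:
        return None
    return best_name, best_score


# Hypothetical stored vectors for two previously seen buttons.
known = {
    "submit_button": [0.9, 0.1, 0.0],
    "cancel_button": [0.0, 0.2, 0.9],
}

match = find_target([0.88, 0.12, 0.01], known)  # close to submit_button
miss = find_target([0.5, 0.5, 0.5], known)      # ambiguous, below threshold
```

A `None` result is the natural place to fall back to the slower model round trip.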

That solved two problems.

First, latency got much better. Every action no longer needed to go out to a remote model and come back. The local runtime could do the small execution steps quickly.

Second, accuracy got better. Pattern matching is not intelligent in the way an LLM is, but for repeated UI actions, that was exactly the point. Once the system learned a target, it could recognize it more reliably than a general VLM guessing where to click from scratch every time.
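The accuracy point can be illustrated with the simplest form of template matching: slide a known patch over the screen and keep the position with the smallest sum of squared differences. This is a pure-Python toy on tiny grayscale grids, written only to show the idea; a production version would more likely use a library routine such as OpenCV's `matchTemplate`. The function name and sample data are made up for this example.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of grayscale pixel values


def match_template(screen: Image, template: Image) -> Tuple[int, int, int]:
    """Return (row, col, score) of the best match; a lower score is better."""
    th, tw = len(template), len(template[0])
    sh, sw = len(screen), len(screen[0])
    best = None
    for r in range(sh - th + 1):
        for c in range(sw - tw + 1):
            # Sum of squared differences between the patch and the template.
            score = sum(
                (screen[r + i][c + j] - template[i][j]) ** 2
                for i in range(th)
                for j in range(tw)
            )
            if best is None or score < best[2]:
                best = (r, c, score)
    if best is None:
        raise ValueError("template is larger than the screen")
    return best


screen = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [
    [9, 8],
    [7, 9],
]

row, col, score = match_template(screen, template)
```

Here the patch is found at row 1, column 2 with a score of 0, and the same input always gives the same answer, which is the property that matters for repeated UI actions.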

So technically, I did make progress. The system became more practical. It had a Tauri UI, local execution, a runtime for CUA models, and pieces of a training loop. But a working technical direction is not the same thing as a business.

The part that failed

The problem was simple: I could not find customers.

I started the project because it was fun. The thought was, “what if an agent could do work for me?” That is still the right question. But the actual harness and the models were not good enough to perform the complex tasks humans do every day. The product could demonstrate something interesting, but it was not reliable enough to become a daily tool for most people.

B2B RPA was one possible path. I had worked at a bank before, so I understood why businesses might care about automation. But selling there requires trust, networking, and a very specific wedge. I did not have that distribution.

B2C was also hard. I tried making the agent feel more like a character. I played with a VTuber-style agent, a more consumer-friendly surface, and a more playful direction. But without a real use case, even I started to feel unsure what I was building. A character wrapper does not fix the absence of a job people need done.

The bigger mistake was distribution. I was too focused on the product and too loose about getting it in front of people. I was building alone. I was doing development, DevOps, model testing, product thinking, and my regular SWE work at the same time. That is a real constraint, but it is not an excuse. I should have posted more on Reddit and X. I should have talked to more users. I should have tested demand earlier.

So, put plainly, I spent about a year building software that nobody wanted. That sounds brutal, but it is probably the accurate version. It is what Paul Graham warns against: making something nobody wants.

What is still valuable

The strange part is that I do not think the work was worthless.

What I built is not just a failed app. It is also an end-to-end pipeline for collecting and training on CUA data. I have a runtime that can run CUA models inside a Tauri-based UI. I have an always-on feature that can collect data while I work. That data can be normalized, labeled, exported to a server, trained on RunPod through SFT, pulled back into the Tauri app, evaluated, and eventually used for RL tasks.
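As a rough illustration of the normalize-and-export stage only, the sketch below maps raw input events to resolution-independent records and serializes them as JSON Lines. The field names, the event shape, and the choice of JSONL are assumptions made for this example, not a description of the actual pipeline.

```python
import json
from typing import Dict, Iterable, List


def normalize_event(raw: Dict, screen_w: int, screen_h: int) -> Dict:
    """Map a raw input event to a resolution-independent record."""
    return {
        "t": round(float(raw["timestamp"]), 3),
        "action": raw["type"].lower(),
        "x": round(raw["x"] / screen_w, 4),  # 0.0 to 1.0 across the screen
        "y": round(raw["y"] / screen_h, 4),
        "label": raw.get("label"),           # None until someone labels it
    }


def to_jsonl(records: Iterable[Dict]) -> str:
    """Serialize records one JSON object per line, ready for upload."""
    return "\n".join(
        json.dumps(r, ensure_ascii=False, sort_keys=True) for r in records
    )


# Hypothetical raw events captured on a 1920x1080 screen.
raw_events: List[Dict] = [
    {"timestamp": 0.0, "type": "CLICK", "x": 960, "y": 540, "label": "open_menu"},
    {"timestamp": 1.25, "type": "Click", "x": 480, "y": 270},
]

records = [normalize_event(e, 1920, 1080) for e in raw_events]
jsonl = to_jsonl(records)
```

Normalizing coordinates to the 0 to 1 range is one way to keep data from different screen sizes comparable before training.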

That is a lot more specific than “I made an AI app.” It is infrastructure for creating better computer use agents.

The original product failed because the end-user use case was not clear enough and the models were not ready enough. But the pipeline itself still feels unique. The market may not be “everyone gets a remote worker on their PC today.” The market may be “labs building CUA models need high-quality interaction data, especially in domains their existing data does not cover.”

That changes the question. Instead of asking, “can I sell a consumer CUA app right now?” the better question may be, “can I produce data that helps CUA models become better?”

The pivot

So the current direction is infrastructure and data.

More specifically, I want to use the pipeline I built to create high-quality data for labs that are training CUA models. I am especially interested in Korean domain data. Most frontier computer use work will naturally be biased toward English-language software, English web flows, and US-style workflows. But real computer use is local. Korea has its own websites, UI patterns, login flows, forms, public services, commerce flows, financial workflows, and workplace software habits.

If CUA is going to work globally, models need that kind of data. Not just screenshots, but trajectories. Action sequences. Before and after states. Failure cases. Labels. Verifier results. Environment metadata. The boring stuff that actually teaches a model how work happens on a computer.
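One possible shape for such a record, sketched with dataclasses. Every field name and sample value here is a hypothetical choice for illustration; the text above only lists what kinds of information a trajectory should carry, not a concrete schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional


@dataclass
class TrajectoryStep:
    action: str                   # what was done, e.g. "click login_button"
    before_state: str             # reference to the screen state before the action
    after_state: str              # reference to the screen state after the action
    label: Optional[str] = None   # human or automatic annotation
    failed: bool = False          # failure cases are kept, not discarded


@dataclass
class Trajectory:
    task: str
    environment: Dict[str, str]                    # e.g. locale, OS, application
    steps: List[TrajectoryStep] = field(default_factory=list)
    verifier_passed: Optional[bool] = None         # result of a task verifier

    def failure_steps(self) -> List[int]:
        """Indices of steps that were marked as failures."""
        return [i for i, step in enumerate(self.steps) if step.failed]


traj = Trajectory(
    task="log in to a public service site",
    environment={"locale": "ko-KR", "os": "windows"},
    steps=[
        TrajectoryStep("click login_button", "s0.png", "s1.png", label="open_login"),
        TrajectoryStep("type password", "s1.png", "s1.png", failed=True),
        TrajectoryStep("type password", "s1.png", "s2.png", label="retry"),
    ],
    verifier_passed=True,
)

record = asdict(traj)  # plain dicts and lists, ready to serialize
```

Keeping the failed step next to its successful retry preserves the recovery behavior, which a screenshot alone would not show.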

That is why I have not fully given up. The first product failed. The first market thesis was wrong, or at least too early. But I still believe in the CUA vision.

CUA is one of the last phases of AI agents becoming useful in the actual workflow of knowledge work. Chat is thinking. Code agents are building. Computer use agents are execution. If an agent can operate the same software humans operate, then it can start replacing real labor, not just producing text about labor.

My mistake was trying to sell the future too directly before the harness, model, and market were ready. The pivot is to sell the missing ingredient instead: data and infrastructure that make the future less impossible.
