Scaling Agents for Computer Use: Behavior Judge

Overview

“Scaling Agents for Computer Use” is an arXiv paper from Simular Research about computer-use agents (CUAs): AI systems that operate real desktop and mobile interfaces (Ubuntu, Windows, Android) by clicking, typing, reading screenshots, and sometimes running code.
The main problem is variance. A CUA may solve the same long task once and fail the next time because one small mistake can poison the rest of the trajectory.
The paper’s bet is simple: instead of trusting one attempt, run several attempts in parallel and pick the best full behavior. This is called wide scaling.
Their method, Behavior Judge (BJudge), converts each raw rollout into a compact behavior narrative, then asks a judge model to compare the candidates.
On OSWorld, BJudge reaches 72.6% success at 100 steps, beating the previous best result of 63.4% and narrowly exceeding the reported human level of 72.36%.

Why does one rollout fail so often?

Imagine asking an agent to edit a spreadsheet, save it, export it, and email the result. If it misses one click near the start, every later step may still look confident, but the task is already off track.
Computer-use tasks are long-horizon tasks. The agent has to stay correct across dozens or hundreds of UI actions, not just answer one prompt.
The environment is noisy. Pop-ups, latency, UI layout changes, and hidden state can make two runs of the same instruction behave differently.
This makes a single rollout brittle. More thinking inside one rollout can help locally, but it still commits the agent to one path.
Wide scaling uses a different source of strength: several agents or several stochastic runs tend to fail at different points, so on any given task one of them may still find a working path.
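The intuition can be made concrete with a small calculation. The sketch below assumes each rollout succeeds independently with the same probability, which is an idealization and not a claim from the paper: real rollouts from the same model are correlated, and a judge still has to pick the successful one. The function name is hypothetical.

```python
def chance_at_least_one_succeeds(p_single: float, n_rollouts: int) -> float:
    """Probability that at least one of n rollouts succeeds.

    Assumption (not from the paper): rollouts are independent and each
    succeeds with the same probability p_single.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_rollouts < 1:
        raise ValueError("n_rollouts must be at least 1")
    # Complement rule: 1 minus the chance that every rollout fails.
    return 1.0 - (1.0 - p_single) ** n_rollouts


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        print(n, round(chance_at_least_one_succeeds(0.5, n), 3))
```

Under these assumptions the pool quickly becomes likely to contain a success, which is why the remaining difficulty shifts to choosing the right rollout.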

What is wide scaling?

Single-rollout scaling means spending more compute inside one trajectory, often by choosing among candidate actions at each step.
Wide scaling means generating multiple full trajectories from the same starting state, then selecting one final answer or action history.
The key difference is commitment. Step-wise selection can keep improving a bad route, while full-rollout selection can choose a totally different route that happened to work.
The hard part is evaluation. Screenshots and actions over 100 steps are too dense for a judge to compare directly.
BJudge is the paper’s answer to that bottleneck: make each trajectory easier to read before comparing it.

Scaling style	What gets compared	Main risk	Why BJudge cares
Step-wise scaling	Candidate next actions	Agent over-commits to a poor plan	Local choices do not reveal final success
Wide scaling	Candidate full rollouts	Judge may pick the wrong rollout	Needs a readable trajectory representation
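
As a rough illustration of the wide-scaling loop, the sketch below generates several full rollouts and keeps one. Everything here is a stand-in: `Rollout`, `wide_scale`, `toy_rollout`, and `toy_selector` are hypothetical names, a rollout is reduced to a list of action strings, and the rollouts run sequentially rather than in parallel.

```python
from typing import Callable, List, Sequence

# Hypothetical simplification: a rollout is just a list of action strings.
Rollout = List[str]


def wide_scale(
    run_rollout: Callable[[int], Rollout],
    select_best: Callable[[Sequence[Rollout]], int],
    n_rollouts: int,
) -> Rollout:
    """Run n full rollouts from the same start state, then keep one."""
    if n_rollouts < 1:
        raise ValueError("n_rollouts must be at least 1")
    # Each seed stands for one independent attempt from the same snapshot.
    candidates = [run_rollout(seed) for seed in range(n_rollouts)]
    return candidates[select_best(candidates)]


def toy_rollout(seed: int) -> Rollout:
    """Toy policy: only some attempts remember to save the file."""
    last = "save" if seed % 3 == 2 else "close_without_saving"
    return ["open_file", "edit_cell", last]


def toy_selector(candidates: Sequence[Rollout]) -> int:
    """Toy judge: pick the first rollout that ended by saving."""
    for index, rollout in enumerate(candidates):
        if rollout[-1] == "save":
            return index
    return 0  # nothing succeeded, fall back to the first attempt


if __name__ == "__main__":
    print(wide_scale(toy_rollout, toy_selector, n_rollouts=1))
    print(wide_scale(toy_rollout, toy_selector, n_rollouts=4))
```

The selector sees whole rollouts, so it can discard an attempt that went wrong early instead of trying to repair it step by step.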

How does Behavior Judge read a rollout?

A raw rollout is mostly screenshots. Humans can inspect that, but it is slow, expensive, and full of irrelevant pixels.
BJudge first creates a behavior narrative, a step-by-step list of facts about what changed after each action.
For a click, tap, drag, or move, it marks the pointer location before the action and zooms into the relevant region after the action. This helps the model that generates the narrative check whether the intended UI change really happened.
The narrative keeps the first screenshot, the final screenshot, and the action-effect facts in between. It drops detail that does not matter for task success.
This changes the judge’s job from “understand a movie of the whole desktop” to “compare what each candidate actually accomplished.”
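One way to picture the narrative is as a small record: two screenshots plus a list of per-step facts. The sketch below is an assumption about shape only, not the paper's data format. `Step`, `BehaviorNarrative`, and `build_narrative` are hypothetical names, screenshots are represented by string IDs, and `describe_effect` stands in for the vision-language model call that writes each fact.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    action: str
    screenshot_before: str  # hypothetical: an ID or path, not pixel data
    screenshot_after: str


@dataclass
class BehaviorNarrative:
    first_screenshot: str
    final_screenshot: str
    facts: List[str] = field(default_factory=list)


def build_narrative(
    steps: List[Step],
    describe_effect: Callable[[Step], str],
) -> BehaviorNarrative:
    """Keep the first and final screenshots plus one fact per action."""
    if not steps:
        raise ValueError("a rollout needs at least one step")
    facts = [
        f"Step {number}: {describe_effect(step)}"
        for number, step in enumerate(steps, start=1)
    ]
    return BehaviorNarrative(
        first_screenshot=steps[0].screenshot_before,
        final_screenshot=steps[-1].screenshot_after,
        facts=facts,
    )


def toy_describe(step: Step) -> str:
    """Toy stand-in for the model: compare screenshot IDs only."""
    if step.screenshot_before == step.screenshot_after:
        return f"{step.action} had no visible effect"
    return f"{step.action} changed the screen"


if __name__ == "__main__":
    steps = [Step("click(Save)", "s0", "s0"), Step("click(Save)", "s0", "s1")]
    print(build_narrative(steps, toy_describe))
```

Intermediate screenshots are dropped on purpose in this sketch; only their effects survive as text, which mirrors the compression the section describes.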

How does BJudge choose the best candidate?

Multiple base policies produce candidate rollouts. The paper uses diversity from both stochastic decoding and different models.
Each candidate is converted into a behavior narrative before selection.
A vision-language model evaluator receives the narratives together and answers a multiple-choice question: which trajectory best satisfies the task?
The paper prefers direct comparison over independent scoring. If each rollout is scored alone, the judge may miss that one candidate succeeded on a requirement that the others skipped.
Their best OSWorld setup draws rollouts from both GPT-5 and Claude Opus 4.5, uses Opus to generate the narrative facts, and uses GPT-5 to choose the final trajectory.
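The comparative step can be sketched as building one multiple-choice prompt and parsing the reply. The prompt wording, the fallback rule, and the names `build_mcq_prompt`, `parse_choice`, and `judge` are assumptions for illustration, not the paper's prompt. The sketch is also text-only, while the evaluator described above is a vision-language model; `ask_model` is a placeholder for whatever model call is used.

```python
import re
from typing import Callable, Sequence


def build_mcq_prompt(task: str, narratives: Sequence[str]) -> str:
    """Put all candidate narratives into one multiple-choice question."""
    lines = [f"Task: {task}", "", "Candidate trajectories:"]
    for number, narrative in enumerate(narratives, start=1):
        lines.append(f"[{number}] {narrative}")
    lines.append("")
    lines.append(
        "Which trajectory best satisfies the task? "
        f"Answer with a single number from 1 to {len(narratives)}."
    )
    return "\n".join(lines)


def parse_choice(reply: str, n_candidates: int) -> int:
    """Return a zero-based index; fall back to 0 if the reply is unusable."""
    match = re.search(r"\d+", reply)
    if match:
        choice = int(match.group())
        if 1 <= choice <= n_candidates:
            return choice - 1
    return 0


def judge(
    task: str,
    narratives: Sequence[str],
    ask_model: Callable[[str], str],
) -> int:
    """Ask one comparative question instead of scoring each rollout alone."""
    if not narratives:
        raise ValueError("need at least one candidate narrative")
    return parse_choice(ask_model(build_mcq_prompt(task, narratives)), len(narratives))


if __name__ == "__main__":
    candidates = ["Edited the sheet but never saved.", "Edited, saved, exported."]
    print(judge("Edit and export the sheet", candidates, lambda prompt: "2"))
```

Because every candidate appears in the same prompt, the model can notice that only one of them met a particular requirement, which independent scoring would not surface.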

What is Agent S3?

BJudge still needs good candidates. The authors build Agent S3, an improved computer-use agent framework, on top of Agent S2.
Agent S3 removes the manager-worker hierarchy and uses a flatter policy that can replan from the current observation and history.
It can choose between direct GUI actions and a coding agent. For bulk edits, file transforms, or structured parsing, code may be faster and more reliable than clicking through menus.
After a code call finishes, the agent writes a short summary of what changed and how to verify it, then returns to GUI control.
This baseline matters because selection cannot recover if none of the candidate rollouts are good.
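A flat control loop of this kind might look like the sketch below. It is a schematic under stated assumptions, not Agent S3's implementation: the `(kind, payload)` action format and the names `run_flat_agent`, `policy`, `observe`, `do_gui`, and `run_code` are all hypothetical.

```python
from typing import Callable, List, Tuple

# Hypothetical action format: (kind, payload), kind is "gui", "code", or "done".
Action = Tuple[str, str]


def run_flat_agent(
    policy: Callable[[str, List[str]], Action],
    observe: Callable[[], str],
    do_gui: Callable[[str], None],
    run_code: Callable[[str], str],
    max_steps: int,
) -> List[str]:
    """One flat loop: replan every step from the observation and history."""
    history: List[str] = []
    for _ in range(max_steps):
        kind, payload = policy(observe(), history)
        if kind == "done":
            break
        if kind == "gui":
            do_gui(payload)
            history.append(f"gui: {payload}")
        elif kind == "code":
            # The coding agent returns a short summary of what changed and
            # how to verify it; the loop then continues with GUI control.
            history.append(f"code: {run_code(payload)}")
        else:
            raise ValueError(f"unknown action kind: {kind}")
    return history


def toy_policy(observation: str, history: List[str]) -> Action:
    """Toy policy: bulk work in code first, then one GUI action, then stop."""
    if not history:
        return ("code", "rename 200 files")
    if len(history) == 1:
        return ("gui", "click(File > Save)")
    return ("done", "")


if __name__ == "__main__":
    print(run_flat_agent(
        policy=toy_policy,
        observe=lambda: "screenshot-placeholder",
        do_gui=lambda action: None,
        run_code=lambda task: f"ran '{task}'; verify by listing the folder",
        max_steps=10,
    ))
```

Writing the code summary into the shared history is what lets the next GUI step see that the work was already done.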

What did the experiments show?

On OSWorld, Agent S3 with GPT-5 scores 62.6% at 100 steps. Adding BJudge with 10 GPT-5 rollouts raises it to 69.9%.
The best reported result is 72.6% using a mixture of GPT-5 and Claude Opus 4.5 rollouts.
Behavior narratives outperform simpler representations. With 10 GPT-5 Mini rollouts, screenshot-only judging scores 56.0%, trajectory summaries 55.0%, naive captioning 56.8%, and behavior narratives 60.2%.
BJudge also transfers beyond Ubuntu. It improves Agent S3 on WindowsAgentArena by 6.4 points at 100 steps and on AndroidWorld by 3.5 points.
Agent S3 itself is more efficient than Agent S2: higher success, about half as many LLM calls per task, and much lower average task time in the paper’s OSWorld setup.

Benchmark, model	Agent S3 baseline	With BJudge	Gain (points)
OSWorld, GPT-5	62.6%	69.9%	+7.3
OSWorld, GPT-5 Mini	49.8%	60.2%	+10.4
WindowsAgentArena, GPT-5	50.2%	56.6%	+6.4
AndroidWorld, GPT-5	68.1%	71.6%	+3.5
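
The gains in the table are differences in percentage points, not relative improvements. The short sketch below recomputes them from the table's own numbers and also derives a relative figure; the relative percentages are computed here for illustration and are not reported in the source. Function names are hypothetical.

```python
# Numbers copied from the table above: (baseline %, with BJudge %).
results = {
    "OSWorld, GPT-5": (62.6, 69.9),
    "OSWorld, GPT-5 Mini": (49.8, 60.2),
    "WindowsAgentArena, GPT-5": (50.2, 56.6),
    "AndroidWorld, GPT-5": (68.1, 71.6),
}


def gain_in_points(baseline: float, with_judge: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal."""
    return round(with_judge - baseline, 1)


def relative_gain_percent(baseline: float, with_judge: float) -> float:
    """Gain as a percentage of the baseline (derived, not from the paper)."""
    return round(100.0 * (with_judge - baseline) / baseline, 1)


if __name__ == "__main__":
    for name, (baseline, with_judge) in results.items():
        print(
            name,
            gain_in_points(baseline, with_judge),
            relative_gain_percent(baseline, with_judge),
        )
```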

When does the method break down?

BJudge assumes you can safely reset the computer state and run several independent attempts. That is natural in benchmarks and virtual machines, but harder on a live personal desktop.
Shared external state can interfere across rollouts. Email inboxes, cloud files, shopping carts, and accounts may not stay independent just because the local VM was reset.
The judge can still be fooled when the behavior narrative is wrong. In the paper’s failure analysis, most remaining human-labeled failures came from narrative hallucinations.
Code-GUI handoff is another weak spot. A coding agent may complete the real work in one step, but the GUI agent may not recognize that and later overwrite it.
More rollouts are not always better under a small total budget. If the budget is split so that each rollout gets too few steps, none of them can finish the task.
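The budget trade-off is simple arithmetic. The sketch below assumes a fixed total step budget divided evenly across rollouts and a task that needs some minimum number of steps; both the even split and the example numbers are illustrative assumptions, and the function names are hypothetical.

```python
def steps_per_rollout(total_step_budget: int, n_rollouts: int) -> int:
    """Even split of a fixed step budget across rollouts."""
    if n_rollouts < 1:
        raise ValueError("n_rollouts must be at least 1")
    return total_step_budget // n_rollouts


def can_any_rollout_finish(
    total_step_budget: int,
    n_rollouts: int,
    min_steps_needed: int,
) -> bool:
    """True if each rollout gets at least the steps the task requires."""
    return steps_per_rollout(total_step_budget, n_rollouts) >= min_steps_needed


if __name__ == "__main__":
    # Illustrative numbers: 100 total steps, a task that needs 30 steps.
    for n in (1, 2, 3, 4, 10):
        print(n, steps_per_rollout(100, n), can_any_rollout_finish(100, n, 30))
```

With these numbers, three rollouts still have room to finish, while four or more leave every attempt short of the required steps.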

FAQ

Why compare full rollouts instead of actions one step at a time?

Step-wise judging can make a local action better while trapping the agent on a bad route. Full-rollout judging lets the system choose a different attempt that actually finished the task.

Why are behavior narratives better than screenshot summaries?

They describe action effects, not just screen contents. The judge sees what changed after each action, which is closer to the evidence needed for deciding whether the task succeeded.

Does BJudge require stronger base models?

It benefits from them, but the main idea is selection. Even with a weaker model, some rollouts succeed where others fail, so BJudge can still raise success by choosing one that reached the goal. The GPT-5 Mini result on OSWorld (49.8% to 60.2%) is an example.

Why does model diversity help?

Different models tend to solve different subsets of tasks. A mixed rollout pool raises the chance that at least one candidate succeeds before the judge has to choose.

Can this run on my actual laptop safely?

Only with care. The method works best in isolated VMs or containers where each rollout starts from the same snapshot and cannot corrupt shared accounts or files.

What is the biggest practical bottleneck?

Reliable evaluation is still the bottleneck. Running more agents is easy; knowing which long behavior actually satisfied the user’s instruction is the hard part.
