Salesforce AI Released GTA1: A Test-Time Scaled GUI Agent That Outperforms OpenAI’s CUA


Salesforce AI Research has introduced GTA1, a new graphical user interface (GUI) agent that redefines the state-of-the-art in agentic human-computer interaction. Designed to autonomously operate in real operating system environments such as Linux, GTA1 addresses two critical bottlenecks in GUI agent development: ambiguous task planning and inaccurate grounding of actions. With a 45.2% task success rate on the OSWorld benchmark, GTA1 surpasses OpenAI’s CUA (Computer-Using Agent), establishing a new record among open-source models.

Core Challenges in GUI Agents

GUI agents typically translate high-level user instructions into action sequences—clicks, keystrokes, or UI interactions—while observing UI updates after each action to plan subsequent steps. However, two issues persist:

  1. Planning Ambiguity: Multiple valid action sequences can fulfill a task, leading to execution paths with varying efficiency and reliability.
  2. Grounding Precision: Translating abstract action proposals into accurate, coordinate-level GUI interactions is especially challenging in high-resolution, dynamic interfaces.

GTA1 introduces novel mechanisms to resolve both.
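The observe–plan–act loop described above can be sketched in a few lines. The `Planner`, `Grounder`, and `env` interfaces below are hypothetical stand-ins for illustration, not the actual GTA1 API:

```python
# Minimal sketch of the observe-plan-act loop a GUI agent runs.
# All class and method names here are illustrative placeholders.

def run_agent(env, planner, grounder, instruction, max_steps=15):
    """Drive the environment until the task is done or steps run out."""
    for _ in range(max_steps):
        screenshot = env.observe()                           # current UI state
        proposal = planner.propose(instruction, screenshot)  # e.g. "click the Save button"
        if proposal == "DONE":
            return True
        x, y = grounder.locate(proposal, screenshot)         # coordinate-level grounding
        env.click(x, y)                                      # execute; the UI then updates
    return False
```

The two failure modes the paper targets map directly onto the two calls inside the loop: planning ambiguity lives in `planner.propose`, grounding precision in `grounder.locate`.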

Smarter Planning via Test-Time Scaling

Traditional planners commit to a single action proposal at each decision point, limiting robustness. GTA1’s test-time scaling introduces a simple yet effective solution: concurrently sample multiple candidate actions at each step, and employ a multimodal judge model—typically a large language model—to evaluate and select the most appropriate one.

This technique avoids premature commitment to suboptimal plans and enables the agent to better explore execution paths without requiring future rollout, which is infeasible in GUI environments due to irreversible actions. Importantly, this method can work with any planner and scales well with increasing task complexity and action space size.
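The sampling-and-judging step can be sketched as follows, assuming a `planner.sample` that draws one candidate action per call and a `judge.score` that returns a scalar preference (both hypothetical interfaces, not GTA1's actual signatures):

```python
# Test-time scaling sketch: draw several candidate actions at one step,
# then let a multimodal judge pick the best. `planner` and `judge` are
# hypothetical stand-ins for the LLM-backed components.

def select_action(planner, judge, instruction, screenshot, n_candidates=5):
    """Sample n candidate actions, return the one the judge scores highest."""
    candidates = [planner.sample(instruction, screenshot)
                  for _ in range(n_candidates)]
    scores = [judge.score(instruction, screenshot, c) for c in candidates]
    best = max(range(n_candidates), key=lambda i: scores[i])
    return candidates[best]
```

Note that selection happens from a single observation, with no environment rollout: the judge ranks proposals before any action is taken, which is what makes the approach viable when actions are irreversible.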

Reinforcement Learning for Grounding Accuracy

For GUI grounding, most prior models rely on supervised fine-tuning to predict the center of target UI elements, which limits generalization. GTA1 adopts a reinforcement learning (RL) framework based on Group Relative Policy Optimization (GRPO). Rather than relying on intermediate reasoning (“thinking”) or predicting bounding boxes, the model learns directly from click-based rewards: it is rewarded only when the predicted coordinate falls within the correct UI element.
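The click-based reward, and the group-relative normalization GRPO builds on top of it, can be sketched like this (the box convention and grouping shown are illustrative assumptions, not the paper's exact formulation):

```python
# Sketch of the click-based reward: 1 if the predicted point lands inside
# the target element's bounding box, else 0. GRPO then normalizes rewards
# within a group of sampled predictions to get relative advantages.

def click_reward(pred_xy, element_box):
    """element_box = (x1, y1, x2, y2); reward only a hit inside it."""
    x, y = pred_xy
    x1, y1, x2, y2 = element_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

Because the reward depends only on whether the click lands inside the element, no bounding-box regression or intermediate reasoning trace is needed during training.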

Through this reward structure, GTA1 achieves state-of-the-art accuracy without the complexity or overhead of chain-of-thought style supervision. Notably, an ablation study shows that removing auxiliary signals such as “thinking” or IoU-based box rewards actually improves grounding performance—particularly in static environments.

Performance Across Benchmarks

GTA1 sets a new standard in several evaluations:

  • OSWorld (Task Success Rate): GTA1-7B reaches 45.2%, outperforming OpenAI CUA (42.9%) and Claude 3.7 (28.0%).
  • ScreenSpot-Pro (Grounding Accuracy): GTA1-7B scores 50.1%, ahead of models like UGround-72B (34.5%).
  • ScreenSpot-V2 (Cross-platform Grounding): GTA1-72B hits 94.8%, nearly matching the top proprietary models.
  • OSWorld-G (Linux GUI Grounding): GTA1-7B reaches 67.7%, outperforming all prior open-source approaches.

These results validate the effectiveness of both the planning and grounding innovations introduced in GTA1.

Additional Design Highlights

  • Data Cleaning: Misaligned annotations from datasets like Aria-UI and OS-Atlas are filtered using OmniParser to improve training signal fidelity.
  • Model Scaling: The approach scales well across models from 7B to 72B parameters, with GTA1-7B offering the best trade-off between performance and compute.
  • Judge Reusability: The multimodal judge used in test-time scaling can be the same LLM used for planning, reducing overhead.

Conclusion

GTA1 demonstrates that robust and accurate GUI agents can be built using a modular two-stage framework enhanced by test-time planning diversity and precise RL-based grounding. By forgoing unnecessary complexity—such as chain-of-thought reasoning in static tasks—Salesforce AI has introduced a lean, effective agent architecture that pushes the frontier in open-ended digital interaction.


Check out the Paper, Codes, 7B Model, 32B Model, and 72B Model. All credit for this research goes to the researchers of this project.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.


