
While llms.txt helps AI read the web and APIs help them connect, neither solves the infinite customization found in the economically important tasks in enterprise software. The real solution lies in computer-use agents that operate at the pixel level, learning from human demonstrations to navigate screens directly. This approach bypasses brittle connectors, allowing AI to handle complex workflows while humans remain in the loop for critical verification.

The Screen Is the API

2025/12/10 15:36

"Why not just use llms.txt to understand the page?"

My friend was watching an AI agent work through a complex enterprise workflow. Clicking through menus, filling forms, handling the kind of nested configuration screens that were the definition of scope creep.

It was a reasonable question. Everyone is excited about llms.txt right now. A simple text file that tells AI systems what your website contains. Finally, the thinking goes, we have a standardized way for machines, or LLMs, to understand the web.

But my friend was confusing two very different problems. Reading is not doing.

The web did not become useful when machines learned to read it. It became useful when machines learned to act on it. And right now, reading alone only gets us so far. The focus has to shift to doing.

Reading Isn’t the Hard Part

Let me be clear about what llms.txt actually does. It is a curated map for LLM inference. A structured way for language models to understand what exists on a website and where to find it.

This is useful for bringing information to an LLM. But it is not a control mechanism. It does not let AI systems actually do anything. The gap between reading and acting is where the real work begins.

The Action Space

When people talk about AI automation, they usually mean APIs. Expose endpoints, let the AI call them, and you have automation. Simple.

Except it is not simple at all.

APIs expose only what developers choose to expose. They represent a curated subset of functionality that someone decided was worth the engineering effort to formalize. And in enterprise software, that subset is usually tiny compared to what users actually need to do.

Then came MCP, the Model Context Protocol. MCP tries to solve the connector problem. Instead of every AI system needing custom integrations with every application, you build one MCP connector and any MCP-compatible AI can use it.

This is an improvement. It solves the M×N problem where M AI systems need to integrate with N applications. But it assumes someone builds the connector in the first place.
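To make the M×N arithmetic concrete, here is a minimal sketch. The function names and the example counts are hypothetical, chosen only for illustration; the point is simply that a shared protocol turns a product into a sum, provided someone actually builds each connector.

```python
def point_to_point_integrations(num_ai_systems: int, num_apps: int) -> int:
    """Without a shared protocol, every AI system needs a custom
    integration with every application: M x N."""
    return num_ai_systems * num_apps


def shared_protocol_integrations(num_ai_systems: int, num_apps: int) -> int:
    """With a shared protocol, each AI system implements the protocol once
    and each application needs one connector: M + N."""
    return num_ai_systems + num_apps


if __name__ == "__main__":
    m, n = 5, 200  # hypothetical counts, not taken from the article
    print("point-to-point:", point_to_point_integrations(m, n))
    print("shared protocol:", shared_protocol_integrations(m, n))
```

Note that the N connectors in the second count are exactly the ones that may never get built.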

Building these connectors is still hard. It requires understanding both the application and the MCP protocol. Most enterprise software will never get proper MCP support because the economics, I believe, are hard to justify.

Attempts to automate API-to-MCP conversion have become popular, but they mostly produce brittle, low-level tools. As Han Lee and others point out, REST APIs are designed around nouns (resources with GET/PUT/POST/DELETE), while MCP works best when tools are verbs (deleteRow, createTask). Auto-wrapping one into the other hides that mismatch instead of solving it.
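A rough sketch of the noun/verb mismatch, using a toy in-memory backend rather than any real REST service or MCP SDK. Everything here (`rest_call`, `create_task`, the paths, the user data) is a made-up stand-in. An auto-wrapper would hand the model something like `rest_call` and leave it to work out the sequence; a hand-designed tool exposes the intent directly.

```python
from typing import Any, Dict, Optional

# Hypothetical in-memory stand-in for a REST backend: resources are nouns.
_tasks: Dict[int, Dict[str, Any]] = {}
_users: Dict[str, int] = {"ana": 7}  # username -> user id (made-up data)


def rest_call(method: str, path: str, body: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Noun-oriented interface: generic methods applied to resource paths."""
    if method == "GET" and path.startswith("/users?name="):
        name = path.split("=", 1)[1]
        return {"id": _users[name]}
    if method == "POST" and path == "/tasks":
        task_id = len(_tasks) + 1
        _tasks[task_id] = dict(body or {})
        return {"id": task_id}
    if method == "PUT" and path.startswith("/tasks/"):
        task_id = int(path.rsplit("/", 1)[1])
        _tasks[task_id].update(body or {})
        return {"id": task_id}
    raise ValueError(f"unsupported call: {method} {path}")


def create_task(title: str, assignee: str) -> Dict[str, Any]:
    """Verb-oriented tool: one intent, with the resource calls hidden inside."""
    user = rest_call("GET", f"/users?name={assignee}")
    task = rest_call("POST", "/tasks", {"title": title})
    rest_call("PUT", f"/tasks/{task['id']}", {"assignee_id": user["id"]})
    return {"task_id": task["id"], **_tasks[task["id"]]}
```

Calling `create_task("Renew contract", "ana")` performs three resource calls behind a single verb. Knowing which three, and in what order, is the design work that auto-wrapping skips.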

The M×N×P Problem

There is a deeper issue that neither APIs nor MCP can address. Call it the P variable: interface diversity.

P represents the number of unique ways the same software can be configured. And in enterprise software, P grows to enormous scale.

Consider SAP. A single SAP S/4HANA server contains tens of thousands of customizing tables. Every implementation is different. Every organization has its own approval chains, its own business rules, its own custom ABAP developments.

Here is a concrete example. Take something supposedly simple: a purchase order approval workflow. In a real SAP implementation, this involves parallel approval processes with all-or-nothing requirements. Custom rules like auto-approve if a contract covers the full purchase order amount. Multi-level approval chains where limits are maintained in custom tables. Dynamic role assignment based on cost center responsibility.

None of this is standard.

The approval chain requires both the Department Manager and Finance Department to approve simultaneously. Either rejection kills the whole workflow.

Then come the rules. If the purchase order references a contract and the totals match, auto-approve. Otherwise, check approval limits in a custom table. If the first approver lacks sufficient authority, cascade to the next level.

And the approvers themselves? Assigned dynamically. Sometimes it is the Manager of Workflow Initiator. Sometimes the Cost Center Responsible. Sometimes specific users pulled from yet another custom table.
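The rules above can be sketched in a few lines of Python, which is part of the point: the logic is not hard, it is just different everywhere. This is an illustration under stated assumptions, not SAP code. The role names, the limit values, and the layout of the limits table are hypothetical, and amounts are integers in cents to avoid floating-point comparisons.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PurchaseOrder:
    amount_cents: int
    contract_amount_cents: Optional[int] = None  # None: no contract referenced


# Stand-in for the custom limits table: (approver role, maximum amount in cents),
# ordered from lowest to highest authority. Roles and limits are made up.
APPROVAL_LIMITS: List[Tuple[str, int]] = [
    ("team_lead", 500_000),
    ("department_manager", 5_000_000),
    ("director", 50_000_000),
]


def route(po: PurchaseOrder) -> str:
    """Return "auto-approve", or the first role whose limit covers the order."""
    # Rule 1: a referenced contract whose total matches the order auto-approves.
    if po.contract_amount_cents is not None and po.contract_amount_cents == po.amount_cents:
        return "auto-approve"
    # Rule 2: otherwise cascade up the limits table until someone has authority.
    for role, limit in APPROVAL_LIMITS:
        if po.amount_cents <= limit:
            return role
    raise ValueError("no approver has sufficient authority")


def parallel_decision(department_manager_ok: bool, finance_ok: bool) -> bool:
    """All-or-nothing parallel approval: either rejection kills the workflow."""
    return department_manager_ok and finance_ok
```

Dynamic approver assignment (manager of the initiator, cost center responsible, users from another table) would add yet another lookup layer on top of `route`, and each organization would write that layer differently.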

This is one workflow in one module.

It takes consultants with deep domain knowledge to implement because the out-of-the-box logic is too simple for how real organizations actually work.

This is the M×N×P problem. Even if you solved M×N with a perfect connector protocol (like MCP), you would still face the reality that every enterprise implementation is effectively a unique interface.

Computer-Use as the Universal Layer

There is one interface that is universal: the screen.

Computer-use agents operate at the pixel level. They see what humans see. They click where humans click. They navigate the same menus and fill out the same forms.

This sounds crude compared to elegant API calls. But it has one massive advantage: it works with everything. No connector required. No API exposure decisions. No MCP protocol adoption. If a human can do it, a computer-use agent can learn to do it.
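For readers who have not seen one, the core of a computer-use agent is an observe-decide-act loop. The sketch below is a toy, assuming a fake screen and a scripted policy in place of real pixels and a real model; `FakeScreen`, `Action`, `scripted_policy`, and the button coordinates are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    kind: str  # "click" or "type"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeScreen:
    """Toy stand-in for a real display: one text field and one submit button."""
    field_text: str = ""
    submitted: bool = False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; a dict keeps the sketch runnable.
        return {"field_text": self.field_text, "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "type":
            self.field_text += action.text
        elif action.kind == "click" and (action.x, action.y) == (200, 300):
            self.submitted = True  # (200, 300) is where the toy button lives


Policy = Callable[[dict], Optional[Action]]


def scripted_policy(observation: dict) -> Optional[Action]:
    """Placeholder for a model: look at the screen, choose the next action."""
    if observation["submitted"]:
        return None  # task finished
    if not observation["field_text"]:
        return Action("type", text="PO-1001")
    return Action("click", x=200, y=300)


def run(screen: FakeScreen, policy: Policy, max_steps: int = 10) -> int:
    """Observe, decide, act, repeat. Returns the number of actions taken."""
    for step in range(max_steps):
        action = policy(screen.screenshot())
        if action is None:
            return step
        screen.apply(action)
    return max_steps
```

Nothing in `run` knows what application is on the other side of the screen. That is the universality argument in miniature; all of the difficulty lives in the policy.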

The question is whether computer-use works well enough for production use. And here the research is early but encouraging.

The Demonstration Effect

The SCUBA benchmark tests AI agents on real Salesforce CRM workflows. In zero-shot settings, meaning no task-specific training, open-source models achieved less than 5% success rates. Even strong models that perform well on generic desktop benchmarks failed catastrophically when confronted with actual enterprise software.

But with demonstrations, meaning examples of humans completing the workflows, success rates jumped to 50%. At the same time, the agents' time and cost dropped by 13% and 16%, respectively.

General capability is not enough. You need specific training on specific workflows.

Data Efficiency

In my experience, collecting computer-use trajectories is painful. Domain experts rarely understand what actually challenges a model. The infrastructure stacks on top of brittle web environments. Building those environments is pure tedium. When every example costs this much, data efficiency stops being nice-to-have.

Which is why the PC Agent-E research matters. Trained on just 312 trajectories, the model achieved a 141% improvement over the base model.
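A quick note on reading that number: a 141% improvement is relative to the base model's score, so the trained model scores roughly 2.41 times the base, not 141%. A minimal sketch of the arithmetic follows; the base score in the usage example is hypothetical, since the actual scores are not quoted here.

```python
def relative_improvement(base_score: float, new_score: float) -> float:
    """Relative improvement as a fraction: 1.41 means a 141% improvement."""
    return (new_score - base_score) / base_score


def score_after_improvement(base_score: float, improvement: float) -> float:
    """Score implied by a base score and a relative improvement fraction."""
    return base_score * (1.0 + improvement)


if __name__ == "__main__":
    hypothetical_base = 10.0  # made-up base score, for illustration only
    print(score_after_improvement(hypothetical_base, 1.41))
```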

312 examples. Not millions. Not even thousands. A few hundred carefully chosen demonstrations of the exact workflows.

The model outperformed Claude 3.7 Sonnet with extended thinking on the WindowsAgentArena benchmark. And it generalized well to different operating systems, suggesting the learned behaviors were not brittle.

The economics of enterprise AI automation are simple: you do not need massive datasets. You need the right datasets from the right workflows.

The Honest Trade-Off

Now for the uncomfortable part. Generalization is necessary but not sufficient for high-stakes operations.

The same research that shows promising results also reveals gaps. Some agents that perform well on generic benchmarks like OSWorld achieve less than 5% success on specialized enterprise environments. Despite advances, today's RL systems struggle to generalize beyond narrow training contexts.

The sim-to-real gap persists. An agent that performs flawlessly in simulation may fail in production due to unmodeled variables.

For high-volume, repetitive workflows like expense approvals, CRM updates, and standard procurement, trained computer-use agents are approaching production readiness. The error rate is acceptable because any single mistake is recoverable.

For one-off, high-stakes operations like schema migrations, financial reconciliations, and compliance configurations, the calculus is different. A database configuration error can cost millions. A compliance failure can trigger regulatory action.

The honest answer is that computer-use can handle navigation and execution for these tasks, but humans must remain at verification checkpoints. The agent does the clicking. The human confirms the consequences.

This is not a failure of the technology. It is appropriate risk management. And it still represents an enormous productivity gain. Navigating to the right screen, filling in the right fields, and preparing the right configurations is most of the work. Human verification at critical decision points is the remaining essential piece. At least for now.
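One way to picture a verification checkpoint is as a gate in front of the steps that are hard to reverse. The sketch below assumes each step carries a `high_stakes` flag and that a human decision arrives through an `approve` callback. Both are hypothetical simplifications, and a real system would also need audit logging and a way to resume after review.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Step:
    description: str
    high_stakes: bool = False  # True for steps with hard-to-reverse consequences


def execute_with_checkpoints(
    steps: Iterable[Step],
    perform: Callable[[Step], None],
    approve: Callable[[Step], bool],
) -> List[str]:
    """Run steps in order, pausing for human approval before high-stakes ones.

    Stops at the first rejected step and returns the descriptions of the
    steps that were actually performed.
    """
    done: List[str] = []
    for step in steps:
        if step.high_stakes and not approve(step):
            break  # the human declined; nothing further is executed
        perform(step)
        done.append(step.description)
    return done
```

In this shape the agent still does the navigation and form filling (`perform`), while the human only sees the handful of steps flagged as consequential.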

Down The Middle: Agents and Humans

The path forward is not pure automation or pure human control. It is hybrid workflows where computer-use agents handle the interface complexity while humans handle the judgment calls. Human-in-the-loop is already the norm for production AI agents.

This requires new infrastructure. You need training pipelines for enterprise-specific demonstrations. You need simulation environments that match production configurations. You need checkpoint mechanisms that pause for human review at appropriate moments. Companies like Applied Compute, Theta, Osmosis, and Scale AI are starting to build this infrastructure.

But the hard technical problem, making computers reliably operate arbitrary interfaces, is being solved. The remaining problems are organizational and economic. Those problems have a tendency to get solved when the benefits are large enough.

The best agents still fail on most real enterprise tasks. But a few years ago they could barely hit a single submit button. The screen is the only universal interface. That's where the work should go.
