While llms.txt helps AI read the web and APIs help them connect, neither solves the infinite customization found in the economically important tasks in enterprise software. The real solution lies in computer-use agents that operate at the pixel level, learning from human demonstrations to navigate screens directly. This approach bypasses brittle connectors, allowing AI to handle complex workflows while humans remain in the loop for critical verification.

The Screen Is the API

2025/12/10 15:36

"Why not just use llms.txt to understand the page?"

My friend was watching an AI agent work through a complex enterprise workflow: clicking through menus, filling in forms, and handling the kind of nested configuration screens that are the definition of scope creep.

It was a reasonable question. Everyone is excited about llms.txt right now. A simple text file that tells AI systems what your website contains. Finally, the thinking goes, we have a standardized way for machines, or LLMs, to understand the web.

But my friend was confusing two very different problems. Reading is not doing.

The web did not become useful when machines learned to read it. It became useful when machines learned to act on it. And right now, the reading part is limited and we must shift focus to the doing.

Reading Isn’t the Hard Part

Let me be clear about what llms.txt actually does. It is a curated map for LLM inference. A structured way for language models to understand what exists on a website and where to find it. 

This is useful for bringing information to an LLM. But it is not a control mechanism. It does not let AI systems actually do anything. The gap between reading and acting is where the real work begins.

The Action Space

When people talk about AI automation, they usually mean APIs. Expose endpoints, let the AI call them, and you have automation. Simple.

Except it is not simple at all.

APIs expose only what developers choose to expose. They represent a curated subset of functionality that someone decided was worth the engineering effort to formalize. And in enterprise software, that subset is usually tiny compared to what users actually need to do.

Then came MCP, the Model Context Protocol. MCP tries to solve the connector problem. Instead of every AI system needing custom integrations with every application, you build one MCP connector and any MCP-compatible AI can use it.

This is an improvement. It solves the M×N problem where M AI systems need to integrate with N applications. But it assumes someone builds the connector in the first place.

Building these connectors is still hard. It requires understanding both the application and MCP itself. Most enterprise software will never get proper MCP support because the economics, I believe, are hard to justify.

Attempts to automate API-to-MCP conversion have become popular, but they mostly produce brittle, low-level tools. As Han Lee and others point out, REST APIs are designed around nouns (resources with GET/PUT/POST/DELETE), while MCP works best when tools are verbs (deleteRow, createTask). Auto-wrapping one into the other hides that mismatch instead of solving it.
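To make the noun-versus-verb mismatch concrete, here is a minimal sketch in plain Python. It does not use any real MCP SDK. FakeTaskApi, noun_tools, and create_assigned_task are hypothetical names invented for illustration, and the "tools" are ordinary functions standing in for whatever a connector would register.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTaskApi:
    """In-memory stand-in for a REST-style task service (hypothetical)."""
    tasks: dict = field(default_factory=dict)
    next_id: int = 1

    def post(self, path: str, body: dict) -> dict:
        # POST /tasks creates a resource and returns it.
        assert path == "/tasks"
        task = {**body, "id": self.next_id}
        self.tasks[self.next_id] = task
        self.next_id += 1
        return task

    def put(self, path: str, body: dict) -> dict:
        # PUT /tasks/{id} replaces the stored resource.
        task_id = int(path.rsplit("/", 1)[1])
        self.tasks[task_id] = {**body, "id": task_id}
        return self.tasks[task_id]


def noun_tools(api: FakeTaskApi) -> dict:
    """Auto-wrapped style: one tool per HTTP method on a resource.

    The model has to know the payload shapes and chain the calls itself.
    """
    return {
        "post_tasks": lambda body: api.post("/tasks", body),
        "put_task": lambda task_id, body: api.put(f"/tasks/{task_id}", body),
    }


def create_assigned_task(api: FakeTaskApi, title: str, assignee: str) -> dict:
    """Verb style: one tool per user intent, hiding the two-call sequence."""
    task = api.post("/tasks", {"title": title, "status": "open"})
    return api.put(f"/tasks/{task['id']}", {**task, "assignee": assignee})
```

With the noun-style tools, an agent needs two correctly shaped calls to get an assigned task. With the verb-style tool it needs one. Someone still has to design that verb, which is the engineering effort auto-wrapping skips.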

The M×N×P Problem

There is a deeper issue that neither APIs nor MCP can address. Call it the P variable: interface diversity.

P represents the number of unique ways the same software can be configured. And in enterprise software, P grows to enormous scale.

Consider SAP. A single SAP S/4HANA server contains tens of thousands of customizing tables. Every implementation is different. Every organization has its own approval chains, its own business rules, its own custom ABAP developments.

Here is a concrete example. Take something supposedly simple: a purchase order approval workflow. In a real SAP implementation, this involves parallel approval processes with all-or-nothing requirements. Custom rules like auto-approve if a contract covers the full purchase order amount. Multi-level approval chains where limits are maintained in custom tables. Dynamic role assignment based on cost center responsibility.

None of this is standard.

The approval chain requires both the Department Manager and Finance Department to approve simultaneously. Either rejection kills the whole workflow. 

Then come the rules. If the purchase order references a contract and the totals match, auto-approve. Otherwise, check approval limits in a custom table. If the first approver lacks sufficient authority, cascade to the next level.

And the approvers themselves? Assigned dynamically. Sometimes it is the Manager of Workflow Initiator. Sometimes the Cost Center Responsible. Sometimes specific users pulled from yet another custom table.
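Written as code, the rules above look roughly like the sketch below. This is plain Python, not ABAP or any SAP API. The limit table, role names, and amounts are hypothetical, and "the totals match" is read here as "the contract covers the full order amount".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a custom approval-limit table:
# (level, approver role, maximum amount that role may approve).
APPROVAL_LIMITS = [
    (1, "department_manager", 10_000),
    (2, "division_head", 100_000),
    (3, "cfo", float("inf")),
]


@dataclass
class PurchaseOrder:
    amount: float
    contract_amount: Optional[float] = None  # None when no contract is referenced


def route(po: PurchaseOrder) -> str:
    """Return 'auto_approve' or the first role with enough authority."""
    # Rule 1: a referenced contract that covers the full amount auto-approves.
    if po.contract_amount is not None and po.contract_amount >= po.amount:
        return "auto_approve"
    # Rule 2: otherwise walk the limit table, cascading to the next level
    # whenever the current approver lacks sufficient authority.
    for _level, role, limit in APPROVAL_LIMITS:
        if po.amount <= limit:
            return role
    raise ValueError("no approver has sufficient authority")


def parallel_decision(manager_ok: bool, finance_ok: bool) -> str:
    """All-or-nothing parallel approval: either rejection kills the workflow."""
    return "approved" if (manager_ok and finance_ok) else "rejected"


print(route(PurchaseOrder(5_000, contract_amount=5_000)))  # auto_approve
print(route(PurchaseOrder(50_000)))                        # division_head
print(parallel_decision(True, False))                      # rejected
```

Even this toy version hard-codes three decisions (the limit table, the contract rule, the parallel step) that a different organization would make differently. Dynamic approver assignment is left out entirely.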

This is one workflow in one module. 

It takes consultants with specialized domain knowledge to implement because the out-of-the-box logic is too simple for how real organizations actually work.

This is the M×N×P problem. Even if you solved M×N with a perfect connector protocol (like MCP), you would still face the reality that every enterprise implementation is effectively a unique interface.
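A back-of-the-envelope calculation shows why P dominates. The counting model and the numbers below are my own illustrative assumptions, not measurements: a shared protocol turns M×N integrations into M clients plus N connectors, but if each application ships in P materially different configurations, each connector has to cover P variants.

```python
def point_to_point(m: int, n: int, p: int) -> int:
    """Every AI system integrated separately with every configured variant."""
    return m * n * p


def with_shared_protocol(m: int, n: int, p: int) -> int:
    """One client per AI system, plus one connector per application variant."""
    return m + n * p


# Hypothetical figures chosen only to show the shape of the growth.
M, N, P = 5, 200, 50

print(point_to_point(M, N, 1))        # 1000   (the classic M×N count)
print(with_shared_protocol(M, N, 1))  # 205    (what a protocol buys you)
print(with_shared_protocol(M, N, P))  # 10005  (once configurations diverge)
print(point_to_point(M, N, P))        # 50000  (the full M×N×P count)
```

Under these assumptions the protocol still helps, but the N×P term swamps the savings.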

Computer-Use as the Universal Layer

There is one interface that is universal: the screen.

Computer-use agents operate at the pixel level. They see what humans see. They click where humans click. They navigate the same menus and fill out the same forms.

This sounds crude compared to elegant API calls. But it has one massive advantage: it works with everything. No connector required. No API exposure decisions. No MCP adoption. If a human can do it, a computer-use agent can learn to do it.
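The core loop of such an agent is small: observe the screen, choose an action, apply it, repeat. The sketch below is a toy under loud assumptions. FakeScreen stands in for a real desktop and returns a dict instead of pixels, scripted_policy stands in for a vision model, and the coordinates are made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    kind: str      # "click", "type", or "done"
    x: int = 0
    y: int = 0
    text: str = ""


class FakeScreen:
    """Toy stand-in for a desktop: one text field and one submit button."""
    FIELD = (100, 50)
    SUBMIT = (100, 120)

    def __init__(self) -> None:
        self.focused = False
        self.field_text = ""
        self.submitted = False

    def screenshot(self) -> dict:
        # A real agent would receive pixels; a dict keeps the sketch runnable.
        return {"focused": self.focused, "field_text": self.field_text,
                "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "click" and (action.x, action.y) == self.FIELD:
            self.focused = True
        elif action.kind == "click" and (action.x, action.y) == self.SUBMIT:
            self.submitted = True
        elif action.kind == "type" and self.focused:
            self.field_text += action.text


def scripted_policy(obs: dict) -> Action:
    """Placeholder for a model that maps an observation to the next action."""
    if obs["submitted"]:
        return Action("done")
    if not obs["focused"]:
        return Action("click", *FakeScreen.FIELD)
    if obs["field_text"] == "":
        return Action("type", text="PO-1001")
    return Action("click", *FakeScreen.SUBMIT)


def run_agent(screen: FakeScreen, policy: Callable[[dict], Action],
              max_steps: int = 10) -> List[Action]:
    """Observe, act, repeat until the policy says done or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        action = policy(screen.screenshot())
        if action.kind == "done":
            break
        screen.apply(action)
        history.append(action)
    return history
```

Nothing in run_agent knows what application it is driving. All of the application-specific knowledge lives in the policy, which is why the training data matters so much.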

The question is whether computer-use works well enough for production use. And here the research is early but encouraging.

The Demonstration Effect

The SCUBA benchmark tests AI agents on real Salesforce CRM workflows. In zero-shot settings, meaning no task-specific training, open-source models achieved less than 5% success rates. Even strong models that perform well on generic desktop benchmarks failed catastrophically when confronted with actual enterprise software.

But with demonstrations, meaning examples of humans completing the workflows, success rates jumped to 50%. At the same time, the agents' time and cost dropped by 13% and 16%, respectively.
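One simple way to put a demonstration in front of an agent is to retrieve the most similar recorded workflow and include it in the prompt. The sketch below illustrates that general idea only. It is not a description of how SCUBA supplies demonstrations, and the word-overlap retrieval, the demo format, and the sample tasks are all assumptions.

```python
from typing import Dict, List


def words(text: str) -> set:
    """Lowercased bag of words, good enough for a toy similarity score."""
    return set(text.lower().split())


def pick_demo(task: str, demos: List[Dict]) -> Dict:
    """Return the demo whose task description shares the most words with task."""
    return max(demos, key=lambda d: len(words(task) & words(d["task"])))


def build_prompt(task: str, demos: List[Dict]) -> str:
    """Prepend the closest human demonstration to the new task."""
    demo = pick_demo(task, demos)
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(demo["steps"], 1))
    return (f"Example task: {demo['task']}\n"
            f"Example steps:\n{steps}\n\n"
            f"New task: {task}\n")


# Hypothetical recorded demonstrations.
DEMOS = [
    {"task": "create a new sales lead",
     "steps": ["Open the Leads tab", "Click New", "Fill in name and company", "Click Save"]},
    {"task": "export the quarterly report",
     "steps": ["Open Reports", "Select the quarter", "Click Export"]},
]

print(build_prompt("create a sales lead for Acme", DEMOS))
```

A production system would likely use screenshots and richer retrieval, but the shape is the same: the demonstration carries the workflow-specific steps the base model does not know.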

General capability is not enough. You need specific training on specific workflows.

Data Efficiency

In my experience, collecting computer-use trajectories is painful. Domain experts rarely understand what actually challenges a model. The infrastructure stacks on top of brittle web environments. Building those environments is pure tedium. When every example costs this much, data efficiency stops being nice-to-have.

Which is why the PC Agent-E research matters. Trained on just 312 trajectories, the model achieved a 141% improvement over the base model.
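It helps to be precise about what a percentage like that means. Assuming the 141% is a relative gain over the base model's score, the trained model scores about 2.41 times the baseline. The scores in the sketch below are hypothetical and chosen only to make the arithmetic visible. They are not the benchmark's actual numbers.

```python
def relative_improvement(base_score: float, new_score: float) -> float:
    """Percentage gain of new_score over base_score."""
    if base_score <= 0:
        raise ValueError("base_score must be positive")
    return (new_score - base_score) / base_score * 100


# Hypothetical scores: a base model at 10.0 and a trained model at 24.1.
gain = relative_improvement(10.0, 24.1)
print(round(gain, 1))  # 141.0
```

A large relative gain can still leave the absolute success rate low when the starting point is low.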

312 examples. Not millions. Not even thousands. A few hundred carefully chosen demonstrations of the exact workflows.

The model outperformed Claude 3.7 Sonnet with extended thinking on the WindowsAgentArena benchmark. And it generalized well to different operating systems, suggesting the learned behaviors were not brittle.

The economics of enterprise AI automation are simple: you do not need massive datasets. You need the right datasets from the right workflows.

The Honest Trade-Off

Now for the uncomfortable part. Generalization is necessary but not sufficient for high-stakes operations.

The same research that shows promising results also reveals gaps. Some agents that perform well on generic benchmarks like OSWorld achieve less than 5% success on specialized enterprise environments. Despite advances, today's RL systems struggle to generalize beyond narrow training contexts.

The sim-to-real gap persists. An agent that performs flawlessly in simulation may fail in production due to unmodeled variables. 

For high-volume, repetitive workflows like expense approvals, CRM updates, and standard procurement, trained computer-use agents are approaching production readiness. The error rate is acceptable because any single mistake is recoverable.

For one-off, high-stakes operations like schema migrations, financial reconciliations, and compliance configurations, the calculus is different. A database configuration error can cost millions. A compliance failure can trigger regulatory action.

The honest answer is that computer-use can handle navigation and execution for these tasks, but humans must remain at verification checkpoints. The agent does the clicking. The human confirms the consequences.
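A verification checkpoint can be as simple as a gate in front of the steps that have consequences. The sketch below is one way to structure it, with hypothetical step names and a callback standing in for the human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical set of step names that must never run without human sign-off.
HIGH_STAKES = {"commit_migration", "post_journal_entry", "save_compliance_config"}


@dataclass
class Step:
    name: str
    description: str = ""


def run_with_checkpoints(steps: List[Step],
                         execute: Callable[[Step], None],
                         approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order, pausing for approval before any high-stakes step.

    Returns the names of the steps that actually ran. Execution stops at the
    first rejected checkpoint so nothing downstream of it is attempted.
    """
    completed: List[str] = []
    for step in steps:
        if step.name in HIGH_STAKES and not approve(step):
            break
        execute(step)
        completed.append(step.name)
    return completed


plan = [Step("open_migration_screen"), Step("fill_target_schema"),
        Step("commit_migration"), Step("close_session")]

# The reviewer rejects: the agent has done the navigation but not the commit.
print(run_with_checkpoints(plan, execute=lambda s: None, approve=lambda s: False))
# ['open_migration_screen', 'fill_target_schema']
```

The agent still does the navigation and data entry. Only the irreversible step waits on a person.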

This is not a failure of the technology. It is appropriate risk management. And it still represents an enormous productivity gain. Navigating to the right screen, filling in the right fields, and preparing the right configurations is most of the work. Human verification at critical decision points is the remaining essential piece. At least for now.

Down The Middle: Agents and Humans

The path forward is neither pure automation nor pure human control. It is hybrid workflows, where computer-use agents handle the interface complexity while humans handle the judgment calls. Human-in-the-loop is already the norm for production AI agents.

This requires new infrastructure. You need training pipelines for enterprise-specific demonstrations. You need simulation environments that match production configurations. You need checkpoint mechanisms that pause for human review at appropriate moments. Companies like Applied Compute, Theta, Osmosis, and Scale AI are starting to build this infrastructure.

But the hard technical problem, making computers reliably operate arbitrary interfaces, is being solved. The remaining problems are organizational and economic. Those problems have a tendency to get solved when the benefits are large enough.

The best agents still fail on most real enterprise tasks. But a few years ago they could barely hit a single submit button. The screen is the only universal interface. That's where the work should go.


