AI evaluation is a tricky engineering challenge. With so many diverse tasks that we're trying to solve with AI, it will become increasingly complex to get it right. I propose the following framework: decompose the pipeline into small steps, design a measurable and reproducible evaluation approach, assess the interactions between steps and adjust accordingly.AI evaluation is a tricky engineering challenge. With so many diverse tasks that we're trying to solve with AI, it will become increasingly complex to get it right. I propose the following framework: decompose the pipeline into small steps, design a measurable and reproducible evaluation approach, assess the interactions between steps and adjust accordingly.

Evaluating AI Is Harder Than Building It

\ For the past few months the mentions of AI evaluation by leaders in the industry have became more and more frequent, with greatest minds tackling the challenges of ensuring AI safety, reliability and alignment. It got me thinking about the topic, and in this post I'll share my view on it.

The Problem

Creating a robust evaluation system is a tricky engineering challenge. With so many diverse tasks that we're trying to solve with AI, it will become increasingly complex to get it right. In pre-agentic era, most of the problems were narrow and specific. An example is making sure that the user gets better recommended posts, measured by engagement time, likes and so on. More engagement - better performance.

But as AI advanced and unlocked new experiences and scenarios, things became much more difficult. Even without agentic systems we started facing the challenge of getting the measurement right, especially with things like conversational AI. In contrast to the previous example, the exact thing to measure here is practically unknown. Here we have criteria like customer satisfaction rate (for customer support applications), "vibe" for creativity tasks, various benchmarks like SWE for pure coding ability and so on. The problem is that these criteria are actually proxy values for our evaluation approaches. This prevents us from achieving the same quality of measurement as we had with simpler tasks.

Today’s Main Concerns

As we accelerate in the agentic era, existing eval issues compound. Imagine you have a multi-step process that you're designing the agent system for. For each of these steps you have to create a proper quality control system to prevent points of failure or bottlenecks. Then, given that you're working with a pipeline, you must ensure that the chain of small steps that depend on each other completes flawlessly. What if one of the steps is an automated conversation with the user? This one is tricky to evaluate by itself, but when an abstract task like this becomes a part of your business pipeline, it will affect the entire thing.

A Proposed Solution

This might seem concerning, and it really is. In my opinion, we can still get it right if we apply systematic thinking to such problems. I propose the following framework:

  1. ==Decompose the pipeline into small steps==
  2. ==Design a measurable and reproducible evaluation approach==
  3. ==Assess the interactions between steps and adjust accordingly==

When we decompose the pipeline, we should try to match the step complexity with the current intellectual capacity of agentic tools that we currently have available. A good eval design will ensure that the results of each step are reliable and robust. And if we get the interplay of these steps in check, we can harden the overall pipeline integrity. When there are many moving parts, it’s important to get this step right, especially at scale.

Conclusion

Of course, the complexity doesn't end there. There's a huge amount of diverse problems that need careful and thoughtful approach, individual to the specific domain.

An example that excites me personally is how we apply non-invasive BCI technology to previously unimaginable things. From properly interpreting abstract data like brain states, to correctly measuring the effectiveness of incremental changes as we make progress in this field, this will require much more advanced approaches to evaluation than we have now.

So far things look promising, and with many great minds dedicating their time to designing better systems alongside the primary AI research I’m sure we’ll get a safe and aligned technology. Let me know what you think!

Market Opportunity
null Logo
null Price(null)
--
----
USD
null (null) Live Price Chart
Disclaimer: The articles reposted on this site are sourced from public platforms and are provided for informational purposes only. They do not necessarily reflect the views of MEXC. All rights remain with the original authors. If you believe any content infringes on third-party rights, please contact [email protected] for removal. MEXC makes no guarantees regarding the accuracy, completeness, or timeliness of the content and is not responsible for any actions taken based on the information provided. The content does not constitute financial, legal, or other professional advice, nor should it be considered a recommendation or endorsement by MEXC.

You May Also Like

U.Today Crypto Review: Ethereum (ETH) Loses 30-Day Progress, Shiba Inu’s (SHIB) End of Bears; Bitcoin’s (BTC) Last Recovery Chance

U.Today Crypto Review: Ethereum (ETH) Loses 30-Day Progress, Shiba Inu’s (SHIB) End of Bears; Bitcoin’s (BTC) Last Recovery Chance

The post U.Today Crypto Review: Ethereum (ETH) Loses 30-Day Progress, Shiba Inu’s (SHIB) End of Bears; Bitcoin’s (BTC) Last Recovery Chance appeared on BitcoinEthereumNews
Share
BitcoinEthereumNews2026/01/22 10:51
Headwind Helps Best Wallet Token

Headwind Helps Best Wallet Token

The post Headwind Helps Best Wallet Token appeared on BitcoinEthereumNews.com. Google has announced the launch of a new open-source protocol called Agent Payments Protocol (AP2) in partnership with Coinbase, the Ethereum Foundation, and 60 other organizations. This allows AI agents to make payments on behalf of users using various methods such as real-time bank transfers, credit and debit cards, and, most importantly, stablecoins. Let’s explore in detail what this could mean for the broader cryptocurrency markets, and also highlight a presale crypto (Best Wallet Token) that could explode as a result of this development. Google’s Push for Stablecoins Agent Payments Protocol (AP2) uses digital contracts known as ‘Intent Mandates’ and ‘Verifiable Credentials’ to ensure that AI agents undertake only those payments authorized by the user. Mandates, by the way, are cryptographically signed, tamper-proof digital contracts that act as verifiable proof of a user’s instruction. For example, let’s say you instruct an AI agent to never spend more than $200 in a single transaction. This instruction is written into an Intent Mandate, which serves as a digital contract. Now, whenever the AI agent tries to make a payment, it must present this mandate as proof of authorization, which will then be verified via the AP2 protocol. Alongside this, Google has also launched the A2A x402 extension to accelerate support for the Web3 ecosystem. This production-ready solution enables agent-based crypto payments and will help reshape the growth of cryptocurrency integration within the AP2 protocol. Google’s inclusion of stablecoins in AP2 is a massive vote of confidence in dollar-pegged cryptocurrencies and a huge step toward making them a mainstream payment option. This widens stablecoin usage beyond trading and speculation, positioning them at the center of the consumption economy. The recent enactment of the GENIUS Act in the U.S. gives stablecoins more structure and legal support. Imagine paying for things like data crawls, per-task…
Share
BitcoinEthereumNews2025/09/18 01:27
GBP trades firmly against US Dollar

GBP trades firmly against US Dollar

The post GBP trades firmly against US Dollar appeared on BitcoinEthereumNews.com. Pound Sterling trades firmly against US Dollar ahead of Fed’s policy outcome The Pound Sterling (GBP) clings to Tuesday’s gains near 1.3640 against the US Dollar (USD) during the European trading session on Wednesday. The GBP/USD pair holds onto gains as the US Dollar remains on the back foot amid firm expectations that the Federal Reserve (Fed) will cut interest rates in the monetary policy announcement at 18:00 GMT. At the time of writing, the US Dollar Index (DXY), which tracks the Greenback’s value against six major currencies, holds onto losses near a fresh two-month low of 96.60 posted on Tuesday. Read more… UK inflation unchanged at 3.8%, Pound shrugs The British pound is unchanged on Wednesday, trading at 1.3645 in the European session. Today’s inflation report was a dour reminder that UK inflation remains entrenched. CPI for August was unchanged at 3.8% y/y, matching the consensus and its highest level since January 2024. Airfares decreased but this was offset by food and petrol prices. Monthly, CPI rose 0.3%, up from 0.1% in July and matching the consensus. Core CPI, which excludes volatile items such as food and energy, eased to 3.6% from 3.8%. Monthly, core CPI ticked up to 0.3% from 0.2%. The inflation report comes just a day before the Bank of England announces its rate decision. Inflation is almost double the BoE’s target of 2% and today’s release likely means that the BoE will not reduce rates before 2026. Read more… Source: https://www.fxstreet.com/news/pound-sterling-price-news-and-forecast-gbp-trades-firmly-against-us-dollar-ahead-of-feds-policy-outcome-202509171209
Share
BitcoinEthereumNews2025/09/18 01:50