Discover how Bright Data optimizes its Web Archive to handle petabytes of data in AWS. Learn how a $100,000 billing mistake revealed the trade-off between write speed, read speed, and cloud costs, and how we fixed it with a cost-effective Rearrange Pipeline. Spoiler: we are hiring!

Building a Petabyte-Scale Web Archive

2025/12/09 21:07

In an engineer’s ideal world, architecture is always beautiful. In the real world of high-scale systems, you have to make compromises. One of the fundamental problems an engineer must think about at the start is the vicious trade-off between Write Speed and Read Speed.

Usually, you sacrifice one for the other. But in our case, working with petabytes of data in AWS, this compromise didn’t hit our speed; it hit our wallet.

We built a system that wrote data perfectly, but every time it read from the archive, it burned through the budget in the most painful way imaginable. After all, reading petabytes from AWS costs money for data transfer, request counts, and storage class retrievals… A lot of money!

This is the story of how we optimized it to make it more efficient and cost-effective!

Part 0: How We Ended Up Spending $100,000 in AWS Fees!

True story: a few months back, one of our solution architects wanted to pull a sample export from a rare, low-traffic website to demonstrate the product to a potential client. Due to a bug in the API, the safety limit on file count wasn’t applied.

Because the data for this “rare” site was scattered across millions of archives alongside high-traffic sites, the system tried to restore nearly half of our entire historical storage to find those few pages.

That honest mistake ended up costing us nearly $100,000 in AWS fees!

Now, I fixed the API bug immediately (and added strict limits), but the architectural vulnerability remained. It was a ticking time bomb…

Let me tell you the story of the Bright Data Web Archive architecture: how I drove the system into the trap of “cheap” storage and how I climbed out using a Rearrange Pipeline.

Part 1: The “Write-First” Legacy

When I started working on the Web Archive, the system was already ingesting a massive data stream: millions of requests per minute, tens of terabytes per day. The foundational architecture was built with a primary goal: capture everything without data loss.

It relied on the most durable strategy for high-throughput systems: an append-only log.

  1. Data (HTML, JSON) is buffered.
  2. Once the buffer hits ~300 MB, it is “sealed” into a TAR archive.
  3. The archive flies off to S3.
  4. After 3 days, files move to S3 Glacier Deep Archive.

For the ingestion phase, this design was flawless. Storing data in Deep Archive costs pennies, and the write throughput is virtually unlimited.
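
To make the write path concrete, here is a minimal sketch assuming the AWS SDK v3 S3 client. The bucket name, the 300 MB threshold, the key layout, and the sealToTar() stand-in are illustrative placeholders rather than the production implementation; the lifecycle rule simply expresses the “after 3 days, sink into Deep Archive” policy described above.

```typescript
// Minimal sketch of the append-only write path, assuming the AWS SDK v3
// S3 client. Bucket name, threshold, key layout, and sealToTar() are
// illustrative placeholders, not the production implementation.
import {
  S3Client,
  PutObjectCommand,
  PutBucketLifecycleConfigurationCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "web-archive-example";       // hypothetical bucket name
const SEAL_THRESHOLD = 300 * 1024 * 1024;   // ~300 MB per sealed TAR, as above

let pending: { name: string; body: Buffer }[] = [];
let pendingBytes = 0;

// Stand-in for real TAR packing (e.g. via the tar-stream package);
// concatenation keeps the sketch self-contained and runnable.
function sealToTar(entries: { name: string; body: Buffer }[]): Buffer {
  return Buffer.concat(entries.map((e) => e.body));
}

// Called for every captured page (HTML / JSON payload).
export async function ingest(name: string, body: Buffer): Promise<void> {
  pending.push({ name, body });
  pendingBytes += body.length;
  if (pendingBytes < SEAL_THRESHOLD) return;

  const entries = pending;
  pending = [];
  pendingBytes = 0;

  // Chronological key: archives are grouped by the day they were written,
  // e.g. 2024/05/05/<timestamp>.tar, with no notion of domain.
  const d = new Date();
  const day = `${d.getUTCFullYear()}/${String(d.getUTCMonth() + 1).padStart(2, "0")}/${String(d.getUTCDate()).padStart(2, "0")}`;
  const key = `${day}/${Date.now()}.tar`;
  await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: key, Body: sealToTar(entries) }));
}

// One-time setup: after 3 days, sealed archives transition to Glacier Deep Archive.
export async function applyDeepArchiveLifecycle(): Promise<void> {
  await s3.send(
    new PutBucketLifecycleConfigurationCommand({
      Bucket: BUCKET,
      LifecycleConfiguration: {
        Rules: [
          {
            ID: "to-deep-archive",
            Status: "Enabled",
            Filter: { Prefix: "" },   // apply to every sealed archive
            Transitions: [{ Days: 3, StorageClass: "DEEP_ARCHIVE" }],
          },
        ],
      },
    }),
  );
}
```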

The Problem: That Pricing Nuance

The architecture worked perfectly for writing… until clients came asking for historical data. That’s when I faced a fundamental contradiction:

  • The System Writes by Time: An archive from 12:00 PM contains a mix of cnn.com, google.com, and shop.xyz.
  • The System Reads by Domain: The client asks: “Give me all pages from cnn.com for the last year.”

Here lies the mistake that inspired this article. Like many engineers, I’m used to thinking about latency, IOPS, and throughput. But I overlooked the AWS Glacier billing model.

I thought: “Well, retrieving a few thousand archives is slow (48 hours), but it’s not that expensive.”

The Reality: AWS charges not just for the API call, but for the volume of data restored ($ per GB retrieved).

The “Golden Byte” Effect

Imagine a client requests 1,000 pages from a single domain. Because the writing logic was chronological, these pages can be spread across 1,000 different TAR archives.

To give the client these 50 MB of useful data, a disaster occurs:

  1. The system has to trigger a Restore for 1,000 archives.
  2. It lifts 300 GB of data out of the “freezer” (1,000 archives × 300 MB).
  3. AWS bills us for restoring 300 GB.
  4. I extract the 50 MB required and throw away the other 299.95 GB 🤯.

We were paying to restore terabytes of trash just to extract grains of gold. It was a classic Data Locality problem that turned into a financial black hole.
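
A quick back-of-the-envelope calculation shows just how lopsided this gets. The per-GB retrieval price below is an assumed placeholder (Glacier Deep Archive pricing depends on region and retrieval tier), so treat the output as illustrative only.

```typescript
// Back-of-the-envelope cost of the “Golden Byte” effect for one request.
// RETRIEVAL_PRICE_PER_GB is an assumption; check current S3 Glacier Deep
// Archive pricing for your region and retrieval tier.
const ARCHIVES_TOUCHED = 1_000;        // TARs containing at least one wanted page
const ARCHIVE_SIZE_GB = 0.3;           // ~300 MB per sealed TAR
const USEFUL_DATA_GB = 0.05;           // ~50 MB the client actually asked for
const RETRIEVAL_PRICE_PER_GB = 0.02;   // assumed USD per GB restored

const restoredGb = ARCHIVES_TOUCHED * ARCHIVE_SIZE_GB;   // 300 GB lifted out of the freezer
const wastedGb = restoredGb - USEFUL_DATA_GB;            // 299.95 GB of “trash”
const usefulShare = (USEFUL_DATA_GB / restoredGb) * 100; // ≈ 0.017% useful payload
const cost = restoredGb * RETRIEVAL_PRICE_PER_GB;

console.log(`restored ${restoredGb} GB, wasted ${wastedGb.toFixed(2)} GB`);
console.log(`useful share ≈ ${usefulShare.toFixed(3)}%, retrieval cost ≈ $${cost.toFixed(2)}`);
// Scale the same ratio up to a runaway restore across a multi-petabyte
// archive and the bill climbs into six figures.
```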

Part 2: Fixing the Mistake: The Rearrange Pipeline

I couldn’t quickly change the ingestion method: the incoming stream is too parallel and massive to sort “on the fly” (though I am working on that), and I needed a solution that worked for already-archived data, too.

So, I designed the Rearrange Pipeline, a background process that “defragments” the archive.

This is an asynchronous ETL (Extract, Transform, Load) process with several critical components:

  1. Selection: It makes no sense to sort data that clients aren’t asking for. Thus, I direct all new data into the pipeline, as well as data that clients have specifically asked to restore. We overpay for the retrieval the first time, but it never happens a second time.


  2. Shuffling (Grouping): Multiple workers download unsorted files in parallel and organize buffers by domain. Since the system is asynchronous, I don’t worry about the incoming stream overloading memory. The workers handle the load at their own pace.


  3. Rewriting: I write the sorted files back to S3 under a new prefix (to distinguish sorted files from raw ones).

  • Before: 2024/05/05/random_id_ts.tar → [cnn, google, zara, cnn]
  • After: 2024/05/05/cnn/random_id_ts.tar → [cnn, cnn, cnn...]
  4. Metadata Swap: In Snowflake, the metadata table is append-only. Doing MERGE INTO or UPDATE is prohibitively expensive.
  • The Solution: I found it was far more efficient to take all records for a specific day, write them to a separate table using a JOIN, delete the original day’s records, and insert the entire day back with the modified records. This let me process 300+ days, the equivalent of 160+ billion UPDATE operations, in just a few hours on a 4X-Large Snowflake warehouse. (Both the shuffling/rewriting worker and this daily swap are sketched below.)
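
Here is a minimal sketch of a Rearrange worker covering the shuffling and rewriting steps, again assuming the AWS SDK v3 S3 client. The bucket name, the Entry shape, the sorted/ prefix, and the TAR helpers are hypothetical placeholders; a real worker would stream archives and use a proper TAR library instead of the in-memory stubs shown here.

```typescript
// Sketch of a Rearrange worker: pull one unsorted, time-ordered TAR,
// regroup its entries by domain, and write them back under a domain prefix.
// Assumes @aws-sdk/client-s3 v3; bucket, prefixes, Entry shape, and the
// TAR helpers are illustrative placeholders.
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({});
const BUCKET = "web-archive-example";   // hypothetical bucket name

interface Entry {
  url: string;    // original page URL, used to derive the domain
  body: Buffer;   // raw HTML / JSON payload
}

// Placeholders so the sketch stays self-contained; a real worker would use
// a streaming TAR library (e.g. tar-stream) for both directions.
function extractTarEntries(_tar: Buffer): Entry[] { return []; }
function packTar(entries: Entry[]): Buffer {
  return Buffer.concat(entries.map((e) => e.body));
}

export async function rearrange(rawKey: string): Promise<void> {
  // 1. Download one raw chronological archive, e.g. "2024/05/05/random_id_ts.tar".
  const raw = await s3.send(new GetObjectCommand({ Bucket: BUCKET, Key: rawKey }));
  const entries = extractTarEntries(Buffer.from(await raw.Body!.transformToByteArray()));

  // 2. Shuffle: bucket entries by domain.
  const byDomain = new Map<string, Entry[]>();
  for (const entry of entries) {
    const domain = new URL(entry.url).hostname;
    if (!byDomain.has(domain)) byDomain.set(domain, []);
    byDomain.get(domain)!.push(entry);
  }

  // 3. Rewrite: one archive per domain under a new "sorted" prefix,
  //    e.g. sorted/2024/05/05/cnn.com/<timestamp>.tar.
  const day = rawKey.split("/").slice(0, 3).join("/");
  for (const [domain, group] of byDomain) {
    const key = `sorted/${day}/${domain}/${Date.now()}.tar`;
    await s3.send(new PutObjectCommand({ Bucket: BUCKET, Key: key, Body: packTar(group) }));
  }
}
```

With this layout, a restore can target only the sorted prefix for the requested domain instead of touching every chronological archive. And here is a sketch of the daily Metadata Swap, expressed as the SQL statements such a worker might send to Snowflake (for example through the snowflake-sdk connector); all table and column names are hypothetical, since the real schema isn’t shown in this article.

```typescript
// Sketch of the daily Metadata Swap, shown as the SQL a worker might send to
// Snowflake. Table and column names (pages, rearranged_keys, archive_key, ...)
// are hypothetical.
const day = "2024-05-05";

const swapDaySteps: string[] = [
  // 1. Rebuild the entire day with the new, domain-sorted archive keys via a JOIN.
  `CREATE OR REPLACE TABLE pages_rearranged_day AS
     SELECT p.page_id, p.url, p.captured_at, r.new_archive_key AS archive_key
     FROM pages p
     JOIN rearranged_keys r ON r.page_id = p.page_id
     WHERE p.captured_at::DATE = '${day}'`,

  // 2. Delete the original day's records in one shot...
  `DELETE FROM pages WHERE captured_at::DATE = '${day}'`,

  // 3. ...and append the rewritten day back, avoiding per-row UPDATE / MERGE entirely.
  `INSERT INTO pages (page_id, url, captured_at, archive_key)
     SELECT page_id, url, captured_at, archive_key FROM pages_rearranged_day`,
];

// A driver would execute these statements in order inside one session.
console.log(swapDaySteps.join(";\n\n"));
```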

The Result

This change radically altered the product’s economics:

  • Pinpoint Accuracy: Now, when a client asks for cnn.com, the system restores only the data where cnn.com lives.
  • Efficiency: Depending on the granularity of the request (entire domain vs. specific URLs via regex), I achieved a 10% to 80% reduction in “garbage data” retrieval (which is directly proportional to the cost).
  • New Capabilities: Beyond just saving money on dumps, this unlocked entirely new business use cases. Because retrieving historical data is no longer agonizingly expensive, we can now afford to extract massive datasets for training AI models, conducting long-term market research, and building knowledge bases for agentic AI systems to reason over (think specialized search engines). What was previously a financial suicide mission is now a standard operation.

We Are Hiring

Bright Data is scaling the Web Archive even further. If you enjoy:

  • High‑throughput distributed systems,
  • Data engineering at massive scale,
  • Building reliable pipelines under real‑world load,
  • Pushing Node.js to its absolute limits,
  • Solving problems that don’t appear in textbooks…

Then I’d love to talk.

We’re hiring strong Node.js engineers to help build the next generation of the Web Archive. Data engineering and ETL experience is highly advantageous. Feel free to send your CV to [email protected].

More updates coming as I continue scaling the archive—and as I keep finding new and creative ways to break it!
