This study evaluates a transformer-based framework for detecting anomalies in large-scale system logs. Experiments were conducted on four public datasets—HDFS, BGL, Spirit, and Thunderbird—using adaptive log-sequence generation to handle varying sequence lengths and data rates. The model architecture includes two transformer encoder layers with multi-head attention and was optimized using AdamW and OneCycleLR. Implemented in PyTorch and trained on an HPC system, the setup demonstrates an efficient and scalable approach for benchmarking log anomaly detection methods.

How Transformer Models Detect Anomalies in System Logs

Abstract

1 Introduction

2 Background and Related Work

2.1 Different Formulations of the Log-based Anomaly Detection Task

2.2 Supervised vs. Unsupervised

2.3 Information within Log Data

2.4 Fixed-Window Grouping

2.5 Related Works

3 A Configurable Transformer-based Anomaly Detection Approach

3.1 Problem Formulation

3.2 Log Parsing and Log Embedding

3.3 Positional & Temporal Encoding

3.4 Model Structure

3.5 Supervised Binary Classification

4 Experimental Setup

4.1 Datasets

4.2 Evaluation Metrics

4.3 Generating Log Sequences of Varying Lengths

4.4 Implementation Details and Experimental Environment

5 Experimental Results

5.1 RQ1: How does our proposed anomaly detection model perform compared to the baselines?

5.2 RQ2: How much does the sequential and temporal information within log sequences affect anomaly detection?

5.3 RQ3: How much do the different types of information individually contribute to anomaly detection?

6 Discussion

7 Threats to Validity

8 Conclusions and References

4 Experimental Setup

4.1 Datasets

We evaluate our proposed approach with four commonly used public datasets: HDFS [8], Blue Gene/L (BGL), Spirit, and Thunderbird [32], all of which appear frequently in existing studies [1, 5, 12]. The HDFS dataset [8] was collected from the Amazon EC2 platform. It comprises over 11 million log events, each linked to a block ID, which allows us to partition the log data into sessions. The annotations are block-wise: each session is labeled as either normal or abnormal. In total, there are 575,061 log sessions, of which 16,838 (2.9%) are identified as anomalies. The BGL, Spirit, and Thunderbird datasets were recorded on the supercomputer systems after which they are named. Unlike HDFS, these datasets are annotated per log item, but they provide no block ID or other identifier for grouping log items into sequences. The BGL dataset spans 215 days and contains 4,747,963 log items, of which 348,460 (7.3%) are labeled as anomalies. Because the Spirit and Thunderbird datasets each contain more than 200 million log items, which is too large to process in full, we use subsets of 5 million and 10 million log items, respectively, following the practice of previous works [7, 11, 15]. We split each dataset into an 80% training set and a 20% test set. For the HDFS dataset, we randomly shuffle the sessions before splitting; for the remaining datasets, we split in chronological order of the logs. Table 2 summarizes the properties of the datasets used in our evaluation.
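As a concrete illustration of this splitting strategy, the sketch below shuffles sessions for an HDFS-style split and takes a chronological cut for the other datasets; the function names and the fixed seed are illustrative, not taken from the paper's implementation.

```python
import random

def split_shuffled(sessions, train_frac=0.8, seed=42):
    """HDFS-style split: shuffle the labeled sessions, then cut 80/20."""
    sessions = list(sessions)
    random.Random(seed).shuffle(sessions)
    cut = int(len(sessions) * train_frac)
    return sessions[:cut], sessions[cut:]

def split_chronological(log_items, train_frac=0.8):
    """BGL/Spirit/Thunderbird-style split: preserve time order."""
    cut = int(len(log_items) * train_frac)
    return log_items[:cut], log_items[cut:]
```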

4.3 Generating Log Sequences of Varying Lengths

Except for the HDFS dataset, whose block IDs allow the logs to be grouped into sequences, the datasets in our study provide no identifier for splitting the whole log stream into sub-sequences. In practice, systems and applications do not produce logs at a fixed rate, so fixed-window or fixed-time grouping with a sliding window fails to accommodate this variability and may lead to inaccurate anomaly detection in real scenarios. Moreover, according to previous studies [1, 7, 15], the best grouping setting varies by dataset, and these settings can significantly influence the performance of the anomaly detection model, making it challenging to compare the effectiveness of different methods. We therefore generate log sequences of varying lengths and use these sequences to train the model within our anomaly detection framework. The generation process is controlled by three parameters: the minimum sequence length, the maximum sequence length, and a step size. The step size sets the interval between the first log events of consecutive sequences, which controls how much the sequences overlap and, in turn, how many samples the dataset contains. The length of each log sequence is drawn uniformly at random between the minimum and the maximum; we assume a sequence of the minimum length still offers enough context to detect a possible anomaly, while the maximum length bounds the number of parameters in the model. These parameters should be aligned with the data distribution and the computational resources available. In the experiments conducted in this study, we set the minimum length to 128, the maximum length to 512, and the step size to 64 for the datasets without a grouping identifier.
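To make this procedure concrete, here is a minimal sketch under the stated settings (minimum length 128, maximum 512, step 64); the function name and the fixed random seed are illustrative assumptions rather than details of the paper's implementation.

```python
import random

def generate_sequences(logs, min_len=128, max_len=512, step=64, seed=0):
    """Cut a log stream into sub-sequences of random length.

    Sequence start positions are spaced `step` events apart, which
    controls the overlap between consecutive sequences; each length
    is drawn uniformly from [min_len, max_len].
    """
    rng = random.Random(seed)
    sequences = []
    for start in range(0, len(logs) - min_len + 1, step):
        length = rng.randint(min_len, max_len)
        sequences.append(logs[start:start + length])
    return sequences

# Example: count the overlapping sequences cut from a dummy stream
# of 10,000 log events.
print(len(generate_sequences(list(range(10_000)))))
```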

4.4 Implementation Details and Experimental Environment

In our experiments, the proposed transformer-based anomaly detection model has two transformer encoder layers. The number of attention heads is 12, and the dimension of the feedforward network within each transformer block is set to 2048. We use AdamW with an initial learning rate of 5e-4 as the optimization algorithm and employ the OneCycleLR learning rate scheduler for better convergence. We selected these hyperparameters following standard practice while also considering computational efficiency. Our implementation is based on Python 3.11 and PyTorch 2.2.1. All experiments run on a high-performance computing (HPC) system, using a node equipped with an Intel Gold 6148 Skylake CPU @ 2.4 GHz, 16 GB of RAM, and an NVIDIA V100 GPU.
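The following sketch assembles a PyTorch encoder with the stated configuration. The embedding dimension of 768 (chosen here only because it is divisible by the 12 attention heads) and the schedule lengths passed to OneCycleLR are assumptions; the section does not state them.

```python
import torch
import torch.nn as nn

# Stated configuration: 2 encoder layers, 12 heads, FFN dim 2048.
# d_model = 768 is an assumed embedding dimension (not given above).
d_model, n_heads, ffn_dim, n_layers = 768, 12, 2048, 2

encoder_layer = nn.TransformerEncoderLayer(
    d_model=d_model, nhead=n_heads,
    dim_feedforward=ffn_dim, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
# The full model would also include log embedding, positional/temporal
# encoding, and a binary classification head on top of this encoder.

optimizer = torch.optim.AdamW(encoder.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=5e-4,
    steps_per_epoch=1000, epochs=10)  # assumed schedule lengths

# Typical usage: call optimizer.step() and then scheduler.step()
# once per training batch.
```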

:::info Authors:

  1. Xingfang Wu
  2. Heng Li
  3. Foutse Khomh

:::

:::info This paper is available on arXiv under a CC BY 4.0 Deed (Attribution 4.0 International) license.

:::
