This article explores various formulations and methodologies for log-based anomaly detection, including binary classification, prediction, masked log modeling, and clustering. It contrasts supervised and unsupervised approaches, highlighting trade-offs between labeled accuracy and real-world practicality. The paper reviews how contextual, sequential, temporal, and semantic information from log data influences detection accuracy and discusses empirical studies comparing traditional versus deep-learning methods. Ultimately, the research proposes a Transformer-based anomaly detection model capable of capturing richer log features, offering a more holistic understanding of how AI identifies system anomalies across diverse datasets.

An Overview of Log-Based Anomaly Detection Techniques


Abstract

1 Introduction

2 Background and Related Work

2.1 Different Formulations of the Log-based Anomaly Detection Task

2.2 Supervised vs. Unsupervised

2.3 Information within Log Data

2.4 Fixed-Window Grouping

2.5 Related Work

3 A Configurable Transformer-based Anomaly Detection Approach

3.1 Problem Formulation

3.2 Log Parsing and Log Embedding

3.3 Positional & Temporal Encoding

3.4 Model Structure

3.5 Supervised Binary Classification

4 Experimental Setup

4.1 Datasets

4.2 Evaluation Metrics

4.3 Generating Log Sequences of Varying Lengths

4.4 Implementation Details and Experimental Environment

5 Experimental Results

5.1 RQ1: How does our proposed anomaly detection model perform compared to the baselines?

5.2 RQ2: How much does the sequential and temporal information within log sequences affect anomaly detection?

5.3 RQ3: How much do the different types of information individually contribute to anomaly detection?

6 Discussion

7 Threats to validity

8 Conclusions and References


2 Background and Related Work

2.1 Different Formulations of the Log-based Anomaly Detection Task

Previous works formulate the log-based anomaly detection task in different ways. Generally, the common formulations fall into the following categories.

**Binary Classification.** The most common way to formulate the log-based anomaly detection task is to transform it into a binary classification task, where machine learning models classify logs or log sequences as anomalous or normal [1]. Both supervised [18–20] and unsupervised [8] classifiers can be used under this formulation. In unsupervised schemes, a threshold is usually applied to an anomaly score that reflects the degree of pattern violation: samples whose scores exceed the threshold are flagged as anomalies.
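
As a concrete illustration, the sketch below shows the thresholding step, assuming an unsupervised model that assigns each sample an anomaly score (e.g., a reconstruction error); the scores and percentile choice are purely illustrative, not from any particular approach.

```python
import numpy as np

def classify_by_threshold(scores: np.ndarray, threshold: float) -> np.ndarray:
    """Flag samples whose anomaly score exceeds the threshold as anomalies (1)."""
    return (scores > threshold).astype(int)

# Hypothetical usage: scores could be reconstruction errors from a model
# trained on normal logs only; the threshold is often set to a high
# percentile of the scores observed on normal validation data.
rng = np.random.default_rng(0)
scores = rng.random(1000)              # placeholder anomaly scores
threshold = np.percentile(scores, 99)  # e.g., the 99th percentile
labels = classify_by_threshold(scores, threshold)
print(labels.sum(), "samples flagged as anomalous")
```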

**Future Event Prediction.** Some approaches formulate the anomaly detection task as a prediction task [10]. Sequential models are trained to predict potential future events given the past few logs within a fixed window. In the prediction phase, the model generates the top-N most probable candidates for the next event. If the actual event is not among the predicted candidates, the unexpected log is considered an anomaly that violates the normal pattern of log sequences.
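
For illustration, a minimal sketch of the top-N check described above, assuming the model outputs a probability distribution over known log templates; the distribution and function names are placeholders.

```python
import numpy as np

def violates_pattern(probs: np.ndarray, actual_event: int, top_n: int) -> bool:
    """Return True if the actual event is not among the top-N predicted candidates."""
    candidates = np.argsort(probs)[::-1][:top_n]  # indices of the N most probable events
    return actual_event not in candidates

# Hypothetical usage: probs would come from a sequential model (e.g., an LSTM)
# given the preceding window of log events; here it is a placeholder.
probs = np.array([0.05, 0.60, 0.20, 0.10, 0.05])  # distribution over 5 templates
print(violates_pattern(probs, actual_event=3, top_n=2))  # True: event 3 not in top 2
```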

**Masked Log Prediction.** The log-based anomaly detection task can also be formulated as a masked log prediction task [21], where models trained on normal log sequences are expected to predict randomly masked log events in a sequence. Similar to future event prediction, a log sequence is considered normal if the actual masked events are among the predicted candidates.

**Others.** Some works formulate the anomaly detection task as a clustering task, where feature vectors of normal and abnormal log sequences are expected to fall into different clusters [22]. The label of a log sequence is determined by the distance between its feature vector and the cluster centroids. There are also approaches that use invariant mining [9], which identify anomalies by detecting violations of invariant relationships among the feature vectors of log sequences.
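
For illustration, a minimal sketch of the nearest-centroid decision, assuming centroids have already been learned (e.g., with k-means) from feature vectors of training log sequences; the centroid values here are placeholders.

```python
import numpy as np

def label_by_centroid(x: np.ndarray, centroids: dict) -> str:
    """Assign the label of the nearest centroid to the feature vector x."""
    return min(centroids, key=lambda label: np.linalg.norm(x - centroids[label]))

# Hypothetical usage: in practice the feature vectors would come from a
# log representation technique, and the centroids from a clustering step.
centroids = {"normal": np.array([0.0, 0.0]), "anomalous": np.array([5.0, 5.0])}
print(label_by_centroid(np.array([0.5, 0.2]), centroids))  # -> "normal"
```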

2.2 Supervised vs. Unsupervised

Another dimension along which anomaly detection approaches differ is the training mechanism. Supervised methods require labeled logs as training data to learn to discern abnormal samples from normal ones, while unsupervised methods learn the normal patterns from normal log data and do not require labels during training. Unsupervised methods offer greater practicality, as well-annotated log data is rarely available in practice. However, supervised methods usually achieve superior and more stable performance according to previous empirical studies.

2.3 Information within Log Data

Generally, log data, formed by sequences of log events, contains several types of information. Within a log sequence, the occurrences of logs from different templates serve as context and are a distinctive feature of the sequence. Similar to the Bag-of-Words model, a numerical representation based on the frequency of template occurrences can represent log sequences and be used for anomaly detection; various works [1] use the message count vector (MCV) to capture this information. Moreover, the sequential information within the log items provides richer information about the order of log occurrences, which often reflects the execution sequence of applications and services; DeepLog [10] uses an LSTM model to encode this sequential information. Furthermore, the temporal information in log data provides even richer detail: the time intervals between log events may offer valuable insights for anomaly detection and other log analysis tasks regarding system status, workload, and potential blocking. Du et al. [10] utilized this information in a parameter value anomaly detection model.
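
As a concrete illustration, the sketch below builds such a count-based representation from already-parsed template IDs; the vocabulary and sequence are placeholders rather than data from any of the studied datasets.

```python
from collections import Counter

def message_count_vector(sequence, vocabulary):
    """Count how often each known template occurs in the sequence (Bag-of-Words style)."""
    counts = Counter(sequence)
    return [counts[template] for template in vocabulary]

# Hypothetical usage: template IDs would come from a log parser such as Drain.
vocabulary = ["E1", "E2", "E3"]
sequence = ["E1", "E2", "E1", "E3", "E1"]
print(message_count_vector(sequence, vocabulary))  # -> [3, 1, 1]
```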

Besides, the textual or semantic information provided by log messages has garnered significant attention in recent studies [5, 11, 12]. Log messages written by developers articulate crucial information in natural language about a system's operations, errors, and events, making them valuable for troubleshooting and system analysis. Various natural language processing techniques are employed to extract textual features and generate embeddings for log messages, from basic numerical statistics such as TF-IDF, to word embedding techniques like Word2Vec, to contextual embedding methods like BERT. These advances aim to capture the semantic information in log messages more accurately, distinguishing unrelated logs and connecting similar ones, thereby supplying more informative and discriminative features for downstream models.
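
For illustration, a minimal sketch of the simplest of these techniques, using scikit-learn's TfidfVectorizer to embed log messages; treating each message as a document is one common choice, and the messages themselves are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical log messages; in practice these would be parsed log templates.
messages = [
    "Received block of size 67108864 from node",
    "Exception in receiveBlock for block",
    "PacketResponder terminating",
]

vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(messages)  # sparse (n_messages, n_terms) matrix
print(embeddings.shape)
```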

In addition, the parameters carried by log messages offer more diverse information about the systems. However, as most parameters are system-specific and lack a consistent format or range, deciding how best to model the information from different parameters is a formidable challenge. In most previous works, the parameters, usually numbers and tokens, are removed in the pre-processing stage. In DeepLog [10], a parameter value anomaly detection model for each log key (i.e., log template) is used to detect anomalies associated with parameter values, as an auxiliary measure to the log key anomaly detection model. In a more recent study [12], a parameter encoding module produces character-level encodings for parameters; each output is then assigned a learnable scalar that functions as a bias term within the self-attention mechanism. Moreover, log data generated by various systems and applications often contains system-specific information that may require domain knowledge and tailored approaches to optimize downstream performance.
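
As a concrete illustration of the common pre-processing step that removes parameters, the sketch below masks variable tokens with a placeholder; the regular expressions are illustrative, and real log parsers (e.g., Drain) use more sophisticated template-extraction strategies.

```python
import re

def mask_parameters(message: str) -> str:
    """Replace common variable tokens (IPs, hex values, numbers) with a placeholder."""
    message = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "<*>", message)  # IPv4 addresses
    message = re.sub(r"\b0x[0-9a-fA-F]+\b", "<*>", message)           # hex values
    message = re.sub(r"\b\d+\b", "<*>", message)                      # plain numbers
    return message

print(mask_parameters("Connection from 10.251.42.84 failed after 3 retries"))
# -> "Connection from <*> failed after <*> retries"
```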

2.4 Fixed-Window Grouping

Available public datasets for log-based anomaly detection have either sequence-level or event-level annotations. For datasets without a grouping identifier, fixed-length or fixed-time grouping is often employed during pre-processing to form log sequences that can be processed by log representation techniques and anomaly detection models. Previous studies have used various grouping settings for the public datasets [1]. Different grouping settings produce different numbers of samples and different contextual windows of log data, making direct performance comparisons across studies impossible. Moreover, logs are not generated at fixed rates or in fixed lengths, so using fixed-window grouped log sequences as training and testing samples does not align with real-world scenarios.
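
For illustration, a minimal sketch of both grouping strategies, assuming each log is either a bare event or a (timestamp, event) pair; the window sizes and events are placeholders.

```python
def group_by_length(events, window):
    """Split a log stream into consecutive fixed-length sequences."""
    return [events[i:i + window] for i in range(0, len(events), window)]

def group_by_time(logs, span):
    """Group (timestamp, event) pairs into fixed-duration windows."""
    if not logs:
        return []
    start = logs[0][0]
    windows = {}
    for ts, event in logs:
        windows.setdefault(int((ts - start) // span), []).append(event)
    return [windows[k] for k in sorted(windows)]

print(group_by_length(["E1", "E2", "E3", "E4", "E5"], window=2))
# -> [['E1', 'E2'], ['E3', 'E4'], ['E5']]
print(group_by_time([(0.0, "E1"), (1.2, "E2"), (5.5, "E3")], span=5.0))
# -> [['E1', 'E2'], ['E3']]
```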

2.5 Related Work

Recent empirical studies on log-based anomaly detection aim to deepen the understanding of existing models and the public datasets used for evaluation. Le et al. [15] conducted an in-depth analysis of recent deep-learning anomaly detection models across several aspects of model evaluation. Their findings suggest that the configuration of the different stages of the anomaly detection pipeline can greatly impact evaluation results; therefore, using diverse datasets and analyzing logical relationships between logs are important for assessing log-based anomaly detection approaches.

Wu et al. [7] conducted an empirical study on vectorization (i.e., representation) techniques for log-based anomaly detection, evaluating the effectiveness of existing classical and semantic-based techniques with different anomaly detection models. Their results suggest that classical ways of transforming textual logs into feature vectors can achieve results competitive with more complex semantic embeddings. A more recent work [23] compared classical and deep-learning log-based anomaly detection methods; their results likewise suggest that simple models with classical vectorization can outperform deep-learning approaches, and their work highlights the need to critically analyze the datasets used in evaluation. Moreover, Landauer et al. [16] critically reviewed the common log datasets used to evaluate anomaly detection techniques. Their analysis suggests that most anomalies in these datasets are not directly associated with sequential information within the log sequences, so sophisticated detection methods are unnecessary for attaining excellent detection performance. Their findings also highlight the need for new datasets that incorporate sequential anomalies for evaluating anomaly detection approaches.

In our work, we propose a Transformer-based anomaly detection model capable of capturing sequential and temporal information within log sequences, in addition to event occurrence and semantic information. Thanks to the flexibility of the proposed model, we can easily use various combinations of log features as input for our evaluations. Through a series of carefully designed experiments, we scrutinize four common public datasets and deepen our understanding of the roles of different types of information in identifying anomalies within log sequences. Our findings are generally in accordance with previous empirical studies; however, our analysis offers a more comprehensive and detailed understanding of the anomaly detection task and the studied public datasets.

:::info Authors:

  1. Xingfang Wu
  2. Heng Li
  3. Foutse Khomh

:::

:::info This paper is available on arXiv under the CC BY 4.0 DEED (Attribution 4.0 International) license.

:::


