Date: 2025-11-04

Abstract

We discuss the lessons learned from our experimental results.

Semantic information contributes to anomaly detection

The findings of this study confirm the efficacy of utilizing semantic information within log messages for log-based anomaly detection. Recent studies show classical machine learning models and simple log representation (vectorization) techniques can outperform complex DL counterparts [7, 23]. In these simple approaches, log events within log data are substituted with event IDs or tokens, and semantic information is lost. However, according to our experimental results, the semantic information is valuable for subsequent models to distinguish anomalies, while the event occurrence information is also prominent.
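To make the distinction concrete, the following minimal sketch contrasts an event-count representation with a text-based view of the same templates. The event IDs, the templates, and the `token_overlap` helper are hypothetical, and token overlap is only a crude stand-in for learned semantic embeddings; it is not the representation used in this study.

```python
from collections import Counter


def event_count_vector(event_ids, vocabulary):
    """Count how often each known event ID occurs in a log sequence."""
    counts = Counter(event_ids)
    return [counts[event_id] for event_id in vocabulary]


def token_overlap(template_a, template_b):
    """Jaccard overlap of lower-cased tokens (toy proxy for semantic similarity)."""
    tokens_a = set(template_a.lower().split())
    tokens_b = set(template_b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical templates, loosely modeled on HDFS-style messages.
templates = {
    "E1": "Receiving block <*> src: <*> dest: <*>",
    "E2": "Exception while serving block <*>",
    "E3": "Exception in receiveBlock for block <*>",
}

sequence = ["E1", "E2", "E1"]
vocabulary = ["E1", "E2", "E3"]

# The count vector records occurrences, but E2 and E3 are just unrelated columns.
counts = event_count_vector(sequence, vocabulary)

# The template text shows that E2 and E3 are closer to each other than to E1.
sim_exceptions = token_overlap(templates["E2"], templates["E3"])
sim_mixed = token_overlap(templates["E1"], templates["E2"])
```

With event IDs alone, a model cannot tell that `E2` and `E3` both describe exceptions; the template text carries that signal, which is the information lost when events are replaced by IDs.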

We call for future contributions of new, high-quality datasets that can be combined with our flexible approach to evaluate the influence of different components in logs for anomaly detection.

The results of our study confirm the findings of recent works [16, 23]. Most anomalies may not be associated with sequential information within log sequences. The occurrence of certain log templates and the semantics within log templates contribute to the anomalies. This finding highlights the importance of employing new datasets to validate the recent designs of DL models (e.g., LSTM [10], Transformer [11]). Moreover, our flexible approach can be used off-the-shelf with the new datasets to evaluate the influences of different components and contribute to high-quality anomaly detection that leverages the full capacity of logs.

The publicly available log datasets that are well-annotated for anomaly detection are limited, which greatly hinders the evaluation and development of anomaly detection approaches that have practical impacts. Except for the HDFS dataset, whose anomaly annotations are session-based, the existing public datasets contain annotations for each log entry within log data, which implies the anomalies are only associated with certain specific log events or associated parameters within the events. Under this setting, the causality or sequential information that may imply anomalous behaviors is ignored.
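The difference between the two annotation styles can be sketched as follows. The sketch assumes the commonly used convention that a sequence built from entry-level labels is marked anomalous if any of its entries is anomalous; the function names and the sample data are hypothetical.

```python
from collections import defaultdict


def group_by_session(entries):
    """Group (session_id, event_id) pairs into per-session event sequences."""
    sessions = defaultdict(list)
    for session_id, event_id in entries:
        sessions[session_id].append(event_id)
    return dict(sessions)


def sequence_label(entry_labels):
    """Assumed convention: a sequence is anomalous if any entry is anomalous."""
    return int(any(entry_labels))


# Session-based annotation: one label per session, whatever the entries are.
entries = [("blk_1", "E1"), ("blk_2", "E1"), ("blk_1", "E2")]
sessions = group_by_session(entries)
session_labels = {"blk_1": 1, "blk_2": 0}

# Entry-based annotation: the sequence label is derived from individual entries.
window_entry_labels = [0, 0, 1, 0]
label = sequence_label(window_entry_labels)
```

Under the entry-based convention, the label depends only on whether a flagged entry falls inside the sequence, not on the order of the surrounding events.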

7 Threats to validity

We have identified the following threats to the validity of our findings:

Construct Validity

In our proposed anomaly detection method, we adopt the Drain parser to parse the log data. Although Drain performs well and generates relatively accurate parsing results, parsing errors still exist. Such errors may affect the generation of log event embeddings (i.e., logs from the same log event may receive different embeddings) and thus the performance of the anomaly detection model. To mitigate this threat, we pass extra regular expressions for each dataset to the parser. These regular expressions help the parser filter out known dynamic fields in log messages and thus achieve more accurate results.
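The idea of filtering known dynamic fields can be illustrated with plain regular expressions applied before parsing. The patterns below are hypothetical examples for HDFS-style messages, not the expressions used in our experiments, and the sketch does not call the Drain parser itself.

```python
import re

# Hypothetical masking rules, applied in order: specific patterns first,
# the generic number pattern last so it does not break up IP addresses.
MASKS = [
    (re.compile(r"blk_-?\d+"), "<*>"),                                # block IDs
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<*>"),     # IP[:port]
    (re.compile(r"\b\d+\b"), "<*>"),                                  # bare numbers
]


def mask_dynamic_fields(message):
    """Replace known dynamic fields with a wildcard before template extraction."""
    for pattern, placeholder in MASKS:
        message = pattern.sub(placeholder, message)
    return message


line_a = "Receiving block blk_-1608999687919862906 src: /10.250.19.102:54106 dest: /10.250.19.102:50010"
line_b = "Receiving block blk_7503483334202473044 src: /10.251.215.16:55695 dest: /10.251.215.16:50010"

masked_a = mask_dynamic_fields(line_a)
masked_b = mask_dynamic_fields(line_b)
```

Both lines reduce to the same masked string, so a parser is less likely to split them into separate templates and assign them different embeddings.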

Internal Validity

There are various hyperparameters involved in our proposed anomaly detection model and experiment settings: 1) In the process of generating samples for both the training and test sets, we define minimum and maximum lengths, along with step sizes, to generate log sequences of varying lengths. We do not have prior knowledge about the range of sequence lengths in which anomalies may reside. However, we set these parameters according to the common practices of previous studies, which adopt fixed-length grouping. 2) The Transformer-based anomaly detection model entails numerous hyperparameters, such as the number of transformer layers, the number of attention heads, and the size of the fully-connected layer. As the number of combinations is huge, we were not able to perform a grid search. However, we referred to the settings of similar models and experimented with different combinations of hyperparameters, selecting the best-performing combination accordingly.
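One way to generate sequences of varying lengths from minimum length, maximum length, and step size is sketched below. The sampling scheme (a length drawn uniformly at random for each window start) and all names are assumptions made for illustration; the exact procedure used in our experiments may differ.

```python
import random


def variable_length_windows(events, min_len, max_len, step, seed=0):
    """Slide over `events` in strides of `step`, cutting a window of random
    length in [min_len, max_len] at each start position (assumed scheme)."""
    rng = random.Random(seed)  # seeded so that sample generation is repeatable
    windows = []
    start = 0
    # Stop once fewer than min_len events remain after the start position.
    while start + min_len <= len(events):
        length = rng.randint(min_len, max_len)
        # Slicing truncates at the end of the log, never below min_len.
        windows.append(events[start:start + length])
        start += step
    return windows


events = [f"E{i % 5}" for i in range(20)]  # hypothetical event ID stream
windows = variable_length_windows(events, min_len=3, max_len=6, step=2)
```

Here the step size controls how much consecutive windows overlap, while the minimum and maximum lengths bound the range of sequence lengths the model sees.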

External Validity

In this study, we conducted experiments on four public log datasets for anomaly detection. Some findings and conclusions obtained from our experimental results are constrained to the studied datasets. However, the studied datasets are the most widely used for evaluating log-based anomaly detection models and have become the standard for evaluation. As annotating log datasets demands a lot of human effort, only a few datasets are publicly available for log-based anomaly detection tasks. The studied datasets are representative, so the findings shed light on prevalent challenges in anomaly detection.

Reliability

The reliability of our findings may be influenced by the reproducibility of results, as variations in dataset preprocessing, hyperparameter tuning, and log parsing configurations across different implementations could lead to discrepancies. To mitigate this threat, we adhered to widely used preprocessing procedures and hyperparameter settings, which are detailed in the paper. However, even minor differences in experimental setups or parser configurations may yield divergent outcomes, potentially affecting the consistency of the model’s performance across independent studies.
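A lightweight way to make such differences visible is to record a fingerprint of the full experimental configuration alongside each result. The sketch below is illustrative only: the configuration keys and values are hypothetical and do not describe the settings used in our experiments.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a configuration dict in canonical form so that two runs can be
    checked for identical preprocessing, parser, and model settings."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical configuration; every key and value here is a placeholder.
run_a = {
    "parser": {"name": "Drain", "extra_regex": [r"blk_-?\d+"]},
    "windows": {"min_len": 3, "max_len": 6, "step": 2},
    "model": {"layers": 2, "heads": 4},
    "seed": 0,
}

# Same settings written in a different key order: same fingerprint.
run_b = {"seed": 0, "model": {"heads": 4, "layers": 2},
         "windows": run_a["windows"], "parser": run_a["parser"]}

# One changed parser expression: different fingerprint.
run_c = {**run_a, "parser": {"name": "Drain", "extra_regex": []}}

fp_a, fp_b, fp_c = (config_fingerprint(c) for c in (run_a, run_b, run_c))
```

Comparing fingerprints across independent runs shows whether a performance gap could stem from a configuration difference rather than from the model itself.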

8 Conclusions and References

The existing log-based anomaly detection approaches have used different types of information within log data. However, it remains unclear how these different types of information contribute to the identification of anomalies. In this study, we first propose a Transformer-based anomaly detection model, with which we conduct experiments with different input feature combinations to understand the role of different information in detecting anomalies within log sequences. The experimental results demonstrate that our proposed approach achieves competitive and more stable performance compared to simple machine learning models when handling log sequences of varying lengths. With the proposed model and the studied datasets, we find that sequential and temporal information do not contribute to the overall performance of anomaly detection when the event occurrence information is present. The event occurrence information is the most prominent feature for identifying anomalies, while the inclusion of semantic information from log templates is helpful for anomaly detection models. Our results and findings generally confirm those of recent empirical studies and indicate the deficiency of using the existing public datasets to evaluate anomaly detection methods, especially deep learning models. Our work highlights the need for new datasets that contain different types of anomalies and align more closely with real-world systems to evaluate anomaly detection models. Our flexible approach can be readily applied to such datasets to evaluate the influences of different components and enhance anomaly detection by leveraging the full capacity of log information.

:::info Supplementary information: The source code of the proposed method is publicly available in our supplementary material package 1.

:::

Acknowledgements

We would like to gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN-2021-03900) and the Fonds de recherche du Québec – Nature et technologies (FRQNT, 326866) for their funding support for this work.

