Toto is a decoder-only transformer built for multivariate time series forecasting. It adapts innovations from large language models—like RMSNorm, SwiGLU, and rotary embeddings—while introducing a novel “Proportional Factorized Space-Time Attention” mechanism. This design balances time- and space-wise attention to handle complex, high-cardinality data efficiently. Combined with a robust probabilistic prediction head using Student-T mixture models, Toto delivers flexible, scalable, and uncertainty-aware forecasts suitable for real-world applications.

How Toto Reimagines Multi-Head Attention for Multivariate Forecasting

  1. Background
  2. Problem statement
  3. Model architecture
  4. Training data
  5. Results
  6. Conclusions
  7. Impact statement
  8. Future directions
  9. Contributions
  10. Acknowledgements and References

Appendix

3 Model architecture

Toto is a decoder-only forecasting model. This model employs many of the latest techniques from the literature, and introduces a novel method for adapting multi-head attention to multivariate time series data (Fig. 1).

3.1 Transformer design

Transformer models for time series forecasting have variously used encoder-decoder [12, 13, 21], encoder-only [14, 15, 17], and decoder-only architectures [19, 23]. For Toto, we employ a decoder-only architecture. Decoder architectures have been shown to scale well [25, 26], and allow for arbitrary prediction horizons. The causal next-patch prediction task also simplifies the pre-training process.

We use techniques from some of the latest large language model (LLM) architectures, including pre-normalization [27], RMSNorm [28], and SwiGLU feed-forward layers [29].
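For a concrete picture of how these pieces fit together, the sketch below shows a pre-normalized decoder block combining RMSNorm and a SwiGLU feed-forward layer in PyTorch. It is a minimal illustration, not Toto's actual implementation: the attention module is a plain placeholder (Toto's time- and space-wise attention variants are described in Section 3.3), and all dimensions are arbitrary.

```python
# Minimal sketch of a pre-normalized decoder block with RMSNorm and SwiGLU.
# The attention module and sizes are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root-mean-square of the features, then rescale.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SwiGLU: a SiLU-gated linear unit followed by a down projection.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class DecoderBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden=4 * dim)

    def forward(self, x, attn_mask=None):
        # Pre-normalization: normalize before each sublayer, add residuals after.
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + h
        return x + self.ffn(self.ffn_norm(x))
```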

3.2 Input embedding

Time series transformers in the literature have used various approaches for creating input embeddings. We use non-overlapping patch projections (Fig. 3), first introduced for Vision Transformers [30, 31] and popularized in the time series context by PatchTST [14]. Toto was trained using a fixed patch size of 32.
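A minimal sketch of non-overlapping patch embedding follows. The patch size of 32 comes from the text above, while the embedding dimension and tensor shapes are illustrative assumptions.

```python
# Minimal sketch of non-overlapping patch embedding for time series.
# embed_dim and the example shapes are illustrative, not Toto's actual values.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size: int = 32, embed_dim: int = 512):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size, embed_dim)

    def forward(self, x):
        # x: (batch, variates, time); time must be a multiple of patch_size.
        b, v, t = x.shape
        patches = x.reshape(b, v, t // self.patch_size, self.patch_size)
        return self.proj(patches)  # (batch, variates, num_patches, embed_dim)

series = torch.randn(8, 4, 256)    # 8 series, 4 variates, 256 time steps
tokens = PatchEmbedding()(series)  # -> (8, 4, 8, 512)
```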


3.3 Attention mechanism

Observability metrics are often high-cardinality, multivariate time series. Therefore, an ideal model will natively handle multivariate forecasting. It should be able to analyze relationships both in the time dimension (what we refer to as “time-wise” interactions) and in the channel dimension (what we refer to as “space-wise” interactions, following the convention in the Datadog platform of describing different groups or tag sets of a metric as the “space” dimension).

In order to model both space and time-wise interactions, we need to adapt the traditional multi-head attention architecture [11] from one to two dimensions. Several approaches have been proposed in the literature to do this, including:

• Assuming channel independence, and computing attention only in the time dimension [14]. This is efficient, but throws away all information about space-wise interactions.

• Computing attention only in the space dimension, and using a feed-forward network in the time dimension [17, 18].

• Concatenating variates along the time dimension and computing full cross-attention between every space/time location [15]. This can capture every possible space and time interaction, but it is computationally costly.

• Computing “factorized attention,” where each transformer block contains a separate space and time attention computation [16, 32, 33]. This allows both space and time mixing, and is more efficient than full cross-attention. However, it doubles the effective depth of the network.

In order to design our attention mechanism, we follow the intuition that for many time series, the time relationships are more important or predictive than the space relationships. As evidence, we observe that even models that completely ignore space-wise relationships (such as PatchTST [14] and TimesFM [19]) can still achieve competitive performance on multivariate datasets. However, other studies (e.g., Moirai [15]) have shown through ablations that there is some clear benefit to including space-wise relationships.

We therefore propose a novel variant of factorized attention, which we call “Proportional Factorized Space-Time Attention.” We use a mixture of alternating space-wise and time-wise attention blocks. As a configurable hyperparameter, we can change the ratio of time-wise to space-wise blocks, thus allowing us to devote more or less compute budget to each type of attention. For our base model, we selected a configuration with one space-wise attention block for every two time-wise blocks.
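The sketch below illustrates how such a proportional stack could be assembled, with a configurable number of time-wise blocks per space-wise block. The factory callables (and the `nn.Identity` stand-ins in the usage line) are hypothetical placeholders for the actual attention blocks.

```python
# Minimal sketch of proportional factorized stacking: interleave time-wise
# and space-wise blocks at a configurable ratio (here 2 time-wise : 1 space-wise).
import torch.nn as nn

def build_blocks(n_layers: int, time_per_space: int, make_time, make_space):
    """Return a stack where every (time_per_space + 1)-th block is space-wise."""
    blocks = []
    for i in range(n_layers):
        if (i + 1) % (time_per_space + 1) == 0:
            blocks.append(make_space())  # bidirectional attention over variates
        else:
            blocks.append(make_time())   # causal attention over time patches
    return nn.ModuleList(blocks)

# Example: a 12-layer stack with two time-wise blocks per space-wise block.
# nn.Identity is only a stand-in for the real attention blocks.
stack = build_blocks(12, 2, make_time=nn.Identity, make_space=nn.Identity)
```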

In the time-wise attention blocks, we use causal masking and rotary positional embeddings [34] with XPOS [35] in order to autoregressively model time-dependent features. In the space-wise blocks, by contrast, we use full bidirectional attention in order to preserve permutation invariance of the covariates, with a block-diagonal ID mask to ensure that only related variates attend to each other. This masking allows us to pack multiple independent multivariate time series into the same batch, in order to improve training efficiency and reduce the amount of padding.
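As a rough illustration of the block-diagonal ID mask, the snippet below builds a boolean mask from per-variate series IDs so that variates belonging to different packed series cannot attend to each other; the exact masking mechanics in Toto may differ.

```python
# Minimal sketch of a block-diagonal ID mask for space-wise attention.
# Assumes each variate in a packed batch carries an integer series ID.
import torch

def block_diagonal_mask(series_ids: torch.Tensor) -> torch.Tensor:
    # series_ids: (num_variates,) ID of the series each variate belongs to.
    # Returns a (num_variates, num_variates) boolean mask where True marks
    # positions that are masked out (i.e., variates from different series),
    # matching the convention of nn.MultiheadAttention's boolean attn_mask.
    return series_ids.unsqueeze(0) != series_ids.unsqueeze(1)

ids = torch.tensor([0, 0, 0, 1, 1, 2])  # three packed series with 3, 2, 1 variates
mask = block_diagonal_mask(ids)         # off-diagonal blocks are masked
```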

3.4 Probabilistic prediction head

In order to be useful for forecasting applications, a model should produce probabilistic predictions. A common practice in time series models is to use an output layer where the model regresses the parameters of a probability distribution. This allows for prediction intervals to be computed using Monte Carlo sampling [7].

Common choices for an output layer are Normal [7] and Student-T [23, 36]; the latter can improve robustness to outliers. Moirai [15] allows for more flexible residual distributions by proposing a novel mixture model incorporating a weighted combination of Gaussian, Student-T, Log-Normal, and Negative-Binomial outputs.

However, real-world time series can often have complex distributions that are challenging to fit, with outliers, heavy tails, extreme skew, and multimodality. In order to accommodate these scenarios, we introduce an even more flexible output likelihood. To do this, we employ a method based on Gaussian mixture models (GMMs), which can approximate any density function [37]. To avoid training instability in the presence of outliers, we use a Student-T mixture model (SMM), a robust generalization of GMMs [38] that has previously shown promise for modeling heavy-tailed financial time series [39, 40]. The model predicts k Student-T distributions (where k is a hyperparameter) for each time step, as well as a learned weighting.
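A minimal sketch of such a Student-T mixture head is shown below, using PyTorch distributions. The parameterization (softplus-constrained scale and degrees of freedom, k = 8) and layer sizes are illustrative assumptions rather than Toto's exact head.

```python
# Minimal sketch of a Student-T mixture model (SMM) output head.
# Layer sizes, k, and the constraint choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, StudentT, MixtureSameFamily

class StudentTMixtureHead(nn.Module):
    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.proj = nn.Linear(dim, 4 * k)  # logits, df, loc, scale per component

    def forward(self, h):
        logits, df, loc, scale = self.proj(h).chunk(4, dim=-1)
        df = 2.0 + F.softplus(df)          # keep degrees of freedom > 2
        scale = F.softplus(scale) + 1e-6   # keep scale strictly positive
        return MixtureSameFamily(
            Categorical(logits=logits),    # learned component weighting
            StudentT(df, loc, scale),      # k Student-T components
        )

# Training signal: negative log-likelihood of the observed targets.
head = StudentTMixtureHead(dim=512)
hidden = torch.randn(8, 16, 512)           # (batch, time steps, model dim)
targets = torch.randn(8, 16)
loss = -head(hidden).log_prob(targets).mean()
```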

Figure 4. Example metric query in the Datadog platform. The metric name (1) determines which metric is being queried. The filter clause (2) limits which contexts are queried, in this case restricting the query to the prod environment. The space aggregation (3) indicates that the average metric value should be returned for each unique combination of the group-by keys. The time aggregation (4) indicates that metric values should be aggregated to the average for each 60-second interval. The query results will be a multivariate time series with 1-minute time steps, and with separate individual variates for each unique (service, datacenter) tuple.

When we perform inference, we draw samples from the mixture distribution at each timestamp, then feed each sample back into the decoder for the next prediction. This allows us to produce prediction intervals at any quantile, limited only by the number of samples; for more precise tails, we can choose to spend more computation on sampling (Fig. 2).
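The snippet below sketches this sample-based procedure for a single series, assuming a hypothetical `forecast_step` callable that returns the predictive distribution given the context so far; quantiles of the sampled paths then form the prediction intervals.

```python
# Minimal sketch of sample-based prediction intervals via autoregressive
# sampling. `forecast_step` is a hypothetical stand-in for the model call.
import torch

def sample_forecast(forecast_step, context, horizon, n_samples=256):
    # context: (context_len,) observed values for a single series.
    paths = context.unsqueeze(0).repeat(n_samples, 1)  # one path per sample
    draws = []
    for _ in range(horizon):
        dist = forecast_step(paths)   # predictive distribution for each path
        nxt = dist.sample()           # (n_samples,)
        draws.append(nxt)
        paths = torch.cat([paths, nxt.unsqueeze(-1)], dim=-1)  # feed back
    samples = torch.stack(draws, dim=-1)                # (n_samples, horizon)
    # More samples give more precise tail quantiles.
    return torch.quantile(samples, torch.tensor([0.05, 0.5, 0.95]), dim=0)
```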

3.5 Input/output scaling

As in other time series models, we perform instance normalization on input data before passing it through the patch embedding, in order to make the model generalize better to inputs of different scales [41]. We scale the inputs to have zero mean and unit standard deviation. The output predictions are then rescaled back to the original units.
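A minimal sketch of this instance normalization and rescaling, assuming per-series statistics computed over the time dimension:

```python
# Minimal sketch of instance normalization and output rescaling.
import torch

def instance_normalize(x, eps=1e-6):
    # x: (batch, variates, time); statistics are computed per series.
    mean = x.mean(dim=-1, keepdim=True)
    std = x.std(dim=-1, keepdim=True).clamp_min(eps)
    return (x - mean) / std, mean, std

def rescale(pred, mean, std):
    # Map model outputs back to the original units of each series.
    return pred * std + mean
```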

3.6 Training objective

As a decoder-only model, Toto is pre-trained on the next-patch prediction task. We minimize the negative log-likelihood of the next predicted patch with respect to the distribution output of the model. We train the model using the AdamW optimizer [42].
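To make the objective concrete, the sketch below runs one optimization step of next-patch negative log-likelihood with AdamW on a toy model. For brevity it uses a single Normal output rather than the Student-T mixture head described in Section 3.4, and all hyperparameters are illustrative.

```python
# Minimal sketch of the next-patch NLL objective with AdamW on a toy model.
# The model, distribution choice, and hyperparameters are illustrative only.
import torch
import torch.nn as nn
from torch.distributions import Normal

PATCH = 32

class ToyNextPatchModel(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(PATCH, dim), nn.GELU(), nn.Linear(dim, 2 * PATCH)
        )

    def forward(self, patch):
        # Predict a distribution over the next patch from the current one.
        loc, log_scale = self.net(patch).chunk(2, dim=-1)
        return Normal(loc, log_scale.exp())

model = ToyNextPatchModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

series = torch.randn(16, 2 * PATCH)              # toy batch: context + next patch
context, target = series[:, :PATCH], series[:, PATCH:]
loss = -model(context).log_prob(target).mean()   # negative log-likelihood
optimizer.zero_grad()
loss.backward()
optimizer.step()
```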

3.7 Hyperparameters

The hyperparameters used for Toto are detailed in Table A.1, with 103 million total parameters.


:::info Authors:

(1) Ben Cohen ([email protected]);

(2) Emaad Khwaja ([email protected]);

(3) Kan Wang ([email protected]);

(4) Charles Masson ([email protected]);

(5) Elise Rame ([email protected]);

(6) Youssef Doubli ([email protected]);

(7) Othmane Abou-Amal ([email protected]).

:::


:::info This paper is available on arxiv under CC BY 4.0 license.

:::

