Why does Chronos-Bolt achieve significantly better results and performance compared to Chronos-T5 #231
-
Why does Chronos-Bolt achieve significantly better results and performance compared to Chronos-T5? What are the main contributing factors?
-
Regarding performance: while both model families rely on the T5 architecture under the hood, chronos-bolt models embed the context observations in non-overlapping windows of multiple observations. This is the usual "patch-based" embedding used by other models, most notably PatchTST. In the models we released, the patch length is 16, which effectively "compresses" the context length by a factor of 16 in the embedding space and enables much of the speedup. On the decoding side, instead of autoregressive generation, chronos-bolt models perform direct multi-step prediction of 9 quantiles, which is also faster.
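Here is a minimal PyTorch sketch of those two ideas; the layer names, dimensions, and the `patch_size` variable are illustrative assumptions, not the actual chronos-bolt code:

```python
import torch
import torch.nn as nn

# Sketch of patch-based embedding + direct multi-step quantile decoding.
# All names and sizes are illustrative, not the real implementation.
batch, context_len, patch_size = 32, 512, 16
d_model, horizon, num_quantiles = 256, 64, 9

context = torch.randn(batch, context_len)  # raw observations

# Patch embedding: group the context into non-overlapping windows of 16
# observations and project each window to a single d_model-sized token.
patches = context.reshape(batch, context_len // patch_size, patch_size)
patch_embed = nn.Linear(patch_size, d_model)
tokens = patch_embed(patches)  # (32, 32, 256): 512 steps become 32 tokens

# The transformer now attends over 32 tokens instead of 512 time steps,
# which is where much of the speedup comes from.

# Direct multi-step decoding: one forward pass outputs every horizon step
# for all 9 quantile levels, instead of sampling tokens autoregressively.
encoded = tokens.mean(dim=1)  # stand-in for the actual transformer output
output_head = nn.Linear(d_model, horizon * num_quantiles)
forecast = output_head(encoded).reshape(batch, horizon, num_quantiles)
```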
Regarding accuracy, it is hard to say which specific aspect of the model leads to most of the improvement. My intuition is as follows: because of their architecture, chronos-bolt models are trained for quantile regression, using a quantile loss. This is directly the task on which they are evaluated in the WQL experiments.
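For reference, this is what a quantile (pinball) loss looks like; a generic sketch rather than the repository's exact training loss:

```python
import torch

def quantile_loss(pred, target, quantile_levels):
    """Pinball loss averaged over quantile levels (generic sketch).

    pred:   (batch, horizon, num_quantiles) predicted quantiles
    target: (batch, horizon) observed future values
    """
    q = torch.tensor(quantile_levels, dtype=pred.dtype)  # e.g. 0.1 ... 0.9
    error = target.unsqueeze(-1) - pred  # broadcast over quantile levels
    # Under-prediction is penalized by q, over-prediction by (1 - q).
    return torch.maximum(q * error, (q - 1) * error).mean()
```

WQL is essentially an aggregated, scale-normalized version of this same quantity, so the training objective and the evaluation metric line up almost exactly.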
On the other hand, chronos-t5 models output 20 samples by default (not that many, if you think about it), which already adds sampling noise. Quantile regression may even be more token-efficient at training time, in the sense that the model does not need to learn to output probabilities over 4096 classes, as the chronos-t5 models do, but instead learns a task much closer to the downstream evaluation.
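To get a feel for why 20 samples is "not so many", here is a toy illustration (standard-normal data, nothing to do with the actual models) of how noisy empirical quantiles estimated from 20 draws can be:

```python
import torch

torch.manual_seed(0)
levels = torch.linspace(0.1, 0.9, 9)  # the 9 quantile levels

# Exact quantiles of a standard normal vs. empirical quantiles computed
# from only 20 samples, mimicking how sample-based forecasts are reduced
# to quantile predictions.
true_q = torch.distributions.Normal(0.0, 1.0).icdf(levels)
empirical_q = torch.quantile(torch.randn(20), levels)

print(torch.stack([true_q, empirical_q], dim=1))
# Individual levels can be off by a sizeable fraction of a standard
# deviation; that estimation noise flows directly into the WQL score.
```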