More Consistent, More Dynamic!

TRAIN SHORT, INFERENCE LONG: Training-free Horizon Extension for Autoregressive Video Generation

Jia Li1,2,*, Xiaomeng Fu3,*, Xurui Peng2, Weifeng Chen2, Youwei Zheng2, Tianyu Zhao2, Jiexi Wang2, Fangmin Chen2, Xing Wang2,†, Hayden Kwok-Hay So1,†

1The University of Hong Kong · 2ByteDance · 3Institute of Information Engineering, Chinese Academy of Sciences

*Equal contribution · †Corresponding authors

Abstract

Autoregressive video diffusion models have emerged as a scalable paradigm for long video generation. However, they often suffer from severe extrapolation failure, where rapid error accumulation leads to significant temporal degradation when extending beyond training horizons. In this paper, we propose FLEX (Frequency-aware Length EXtension), a training-free inference-time framework that bridges the gap between short-term training and long-term inference.

On VBench-Long, FLEX outperforms state-of-the-art methods at 6x extrapolation (30s) and remains competitive with long-video fine-tuned baselines at 12x extrapolation (60s). As a plug-and-play augmentation, FLEX extends the generation horizon of existing pipelines (e.g., LongLive) to consistent, dynamic synthesis at the minute scale.

Features

1. Frequency-aware 3D RoPE Modulation

Adaptively interpolates under-trained low-frequency components while extrapolating high-frequency components to preserve multi-scale temporal discriminability.
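
The idea of treating frequency bands differently can be illustrated with a small sketch. This is not the FLEX formula, which this page does not state; it swaps in a YaRN-style "by-parts" rule as a stand-in and applies it to the temporal axis only. The function names, the thresholds `low_rot` and `high_rot`, and the values `train_len=21` and `scale=6.0` are all assumptions chosen for illustration.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE angular frequencies, one per channel pair."""
    return base ** (-np.arange(0, dim, 2) / dim)

def modulate_temporal_freqs(freqs, train_len, scale, low_rot=0.5, high_rot=2.0):
    """Hypothetical by-parts rule (an assumption, not the paper's method).

    A band that completes fewer than `low_rot` rotations over the training
    horizon is treated as under-trained and interpolated (frequency / scale).
    A band that completes more than `high_rot` rotations is left unchanged,
    i.e. it extrapolates. Bands in between are blended linearly.
    """
    rotations = train_len * freqs / (2.0 * np.pi)
    keep = np.clip((rotations - low_rot) / (high_rot - low_rot), 0.0, 1.0)
    return freqs * (keep + (1.0 - keep) / scale)

freqs = rope_frequencies(dim=64)
new_freqs = modulate_temporal_freqs(freqs, train_len=21, scale=6.0)
```

With these illustrative settings the highest-frequency band is unchanged, while the lowest-frequency band is divided by the extension factor, so positions beyond the training horizon map back into the phase range that band saw during training.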

2. Antiphase Noise Sampling

A structured noise initialization strategy that injects high-frequency dynamic priors to encourage rich temporal variations.
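
This page does not spell out how the antiphase noise is constructed. One possible reading, sketched below purely as an assumption, is to mix fresh per-frame Gaussian noise with a shared component whose sign alternates from frame to frame. Each frame stays approximately unit-variance Gaussian, while adjacent frames become negatively correlated, which concentrates the shared energy at the highest temporal frequency. The function name and the mixing weight `alpha` are hypothetical.

```python
import numpy as np

def antiphase_noise(num_frames, frame_shape, alpha=0.5, seed=0):
    """Illustrative structured noise (an assumed construction, not the paper's).

    noise[t] = sqrt(1 - alpha^2) * fresh[t] + alpha * (-1)^t * shared
    In expectation, adjacent frames have correlation -alpha^2.
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(frame_shape)
    fresh = rng.standard_normal((num_frames, *frame_shape))
    # Alternating +1 / -1 sign per frame, broadcast over the frame dimensions.
    signs = (-1.0) ** np.arange(num_frames)
    signs = signs.reshape(-1, *([1] * len(frame_shape)))
    return np.sqrt(1.0 - alpha**2) * fresh + alpha * signs * shared

noise = antiphase_noise(num_frames=8, frame_shape=(16, 32, 32), alpha=0.6)
corr_adjacent = np.corrcoef(noise[0].ravel(), noise[1].ravel())[0, 1]
corr_skip_one = np.corrcoef(noise[0].ravel(), noise[2].ravel())[0, 1]
```

Here `corr_adjacent` comes out negative and `corr_skip_one` positive, which is the alternating pattern the word "antiphase" suggests.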

3. Inference-only Attention Sink

Adds global structural anchors at inference time for long-range stability, without changing the training procedure or fine-tuning the model weights.
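
A common way to realize an attention sink at inference time is to pin the first few cached frames in the key/value cache and combine them with a sliding window of recent frames. The sketch below shows only that index selection. It is a generic illustration, and the function name and the parameters `num_sink` and `window` are assumptions rather than FLEX's actual settings.

```python
def kv_indices_with_sink(current, num_sink, window):
    """Indices of cached frames that frame `current` may attend to.

    The first `num_sink` frames are always kept as global anchors, and the
    most recent `window` frames are kept as local context. Everything in
    between is dropped from the cache.
    """
    sink = list(range(min(num_sink, current)))
    start = max(num_sink, current - window)
    recent = list(range(start, current))
    return sink + recent

# Frame 10 sees the two sink frames plus the three most recent frames.
visible = kv_indices_with_sink(current=10, num_sink=2, window=3)
```

Because only the set of visible cache entries changes, a scheme of this kind needs no retraining, which matches the inference-only framing above.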

Visual Results

30s Videos (6x Horizon Extension)

Scenes & Objects

Methods compared: CausVid · Self Forcing · LongLive · Rolling Forcing · Self Forcing+FLEX
Prompts: 165 · 052 · 075

Sports

Methods compared: CausVid · Self Forcing · LongLive · Rolling Forcing · Self Forcing+FLEX
Prompts: 033 · 164 · 203

Portraits

Methods compared: CausVid · Self Forcing · LongLive · Rolling Forcing · Self Forcing+FLEX
Prompts: 317 · 000 · 010

Animation

Methods compared: CausVid · Self Forcing · LongLive · Rolling Forcing · Self Forcing+FLEX
Prompts: 865 · 185 · 950

60s Videos (12x Horizon Extension)

Methods compared: Self Forcing · LongLive · Rolling Forcing · Self Forcing+FLEX
Prompts: 016 · 231 · 296 · 294

Ultra-long Video Generation

FLEX is compatible with existing long-video inference pipelines for further generation horizon extension. For instance, it supports more stable multi-minute video generation with LongLive.

Appearance Consistency: LongLive · LongLive+FLEX (Prompt 070)

Scene Dynamics: LongLive · LongLive+FLEX (Prompt 087)

Citation

@misc{li2026trainshortinferencelong,
      title={Train Short, Inference Long: Training-free Horizon Extension for Autoregressive Video Generation}, 
      author={Jia Li and Xiaomeng Fu and Xurui Peng and Weifeng Chen and Youwei Zheng and Tianyu Zhao and Jiexi Wang and Fangmin Chen and Xing Wang and Hayden Kwok-Hay So},
      year={2026},
      eprint={2602.14027},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.14027}, 
}