Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

Introduction

Hybrid Forcing consistently achieves state-of-the-art performance. Thanks to its hybrid attention design, our model generates video in real time at 29.5 FPS (832x480) on a single NVIDIA H100 GPU, without quantization or model compression.

Demo video

Implementation Details

Overview

Framework of our training and inference pipeline. (a) An auxiliary regularization loss is introduced to mitigate mode collapse during training. (b) The output of our hybrid attention is computed as the sum of sparse local window attention and linear history attention. (c) The linear state is updated before outdated frames are evicted from the KV cache, so that long-term information is preserved. (d) The temporal RoPE index is constrained to stay within the maximum temporal index of the pretrained model.
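Points (b) and (c) can be illustrated with a minimal sketch. It is not the released implementation: the class name `HybridCache`, the feature map `phi` (ELU + 1, a common choice for linear attention), and the use of one token per frame are all assumptions made for illustration, and the local branch uses dense softmax attention inside the window rather than a sparse kernel.

```python
import math
from collections import deque


def phi(x):
    # Positive feature map elu(x) + 1 (assumed; the source does not name one).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class HybridCache:
    """Toy hybrid attention: softmax over a recent window plus a linear
    summary of everything that has been evicted from the window."""

    def __init__(self, dim, window):
        self.dim = dim
        self.window = window
        self.kv = deque()                              # recent (key, value) pairs
        self.S = [[0.0] * dim for _ in range(dim)]     # sum of phi(k) v^T
        self.z = [0.0] * dim                           # sum of phi(k)

    def append(self, k, v):
        self.kv.append((k, v))
        while len(self.kv) > self.window:
            old_k, old_v = self.kv[0]
            # (c) fold the oldest entry into the linear state first ...
            f = phi(old_k)
            for i in range(self.dim):
                self.z[i] += f[i]
                for j in range(self.dim):
                    self.S[i][j] += f[i] * old_v[j]
            # ... and only then evict it from the KV cache.
            self.kv.popleft()

    def attend(self, q):
        # Local branch: softmax attention over the window (must be non-empty).
        scale = 1.0 / math.sqrt(self.dim)
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k, _ in self.kv]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        local = [sum(wi * v[j] for wi, (_, v) in zip(w, self.kv)) / total
                 for j in range(self.dim)]
        # History branch: normalized linear attention over the evicted entries.
        f = phi(q)
        denom = sum(fi * zi for fi, zi in zip(f, self.z))
        if denom > 0:
            hist = [sum(f[i] * self.S[i][j] for i in range(self.dim)) / denom
                    for j in range(self.dim)]
        else:
            hist = [0.0] * self.dim                    # nothing evicted yet
        # (b) the hybrid output is the sum of the two branches.
        return [l + h for l, h in zip(local, hist)]
```

In this sketch the history branch contributes nothing until the first eviction, and the memory used by `S` and `z` stays constant no matter how many frames have been generated.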

Attention design

Illustration of our hybrid attention paradigm for streaming video generation (SVG). (a) The standard sliding window attention (SWA) approach caches only the most recent frames, leading to significant error accumulation in long-form generation. (b) Sinking the first frame can reduce early drift, but it sacrifices motion diversity and fails to capture long-horizon dependencies effectively. (c) Our hybrid attention combines long-term history retention with efficient local modeling, using low-cost linear modeling of the history together with sparse SWA to improve performance.
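The difference between the three caching schemes can be made concrete by listing which past frames each one still has access to at a given step. The helper below is a hypothetical illustration, not code from the project: the function name `cached_frames`, the policy labels, and the choice to give the sink frame one slot of the window are assumptions.

```python
def cached_frames(t, window, policy):
    """Return the frame indices available at step t under each scheme.

    "softmax" lists frames held in the KV cache; "linear" lists frames that
    are only represented through a compressed linear-attention state.
    """
    recent = list(range(max(0, t - window + 1), t + 1))
    if policy == "swa":
        # (a) only the most recent frames are kept; older ones are lost.
        return {"softmax": recent, "linear": []}
    if policy == "sink":
        # (b) frame 0 is pinned and takes one slot of the window.
        if 0 in recent:
            return {"softmax": recent, "linear": []}
        return {"softmax": [0] + recent[1:], "linear": []}
    if policy == "hybrid":
        # (c) recent frames stay in the window; every evicted frame is
        # summarized in the linear state instead of being dropped.
        return {"softmax": recent, "linear": list(range(0, recent[0]))}
    raise ValueError(f"unknown policy: {policy}")
```

For example, with `t=10` and `window=4`, the SWA scheme sees frames 7 to 10, the sink scheme sees frame 0 plus frames 8 to 10, and the hybrid scheme sees frames 7 to 10 in the window with frames 0 to 6 summarized in the linear state.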

Qualitative Results

Qualitative results demonstrating the capabilities of Hybrid Forcing.

5s short-duration inference, case 1 (832x480)

A bicycle gliding smoothly through a snowy field at twilight. The bicycle is a classic model with a wooden frame and leather handlebars, painted a deep shade of green. The rider, a young woman with long, wavy blonde hair tied back in a ponytail, is wearing a warm winter coat, boots, and gloves. She pedals effortlessly, her face lit by the soft glow of the setting sun. The snow-covered field stretches out behind her, with tall, frosted grasses and occasional patches of bare ground. In the distance, a few trees stand still, their branches heavy with snow. The sky is a mix of pale blues and purples, with a few stars beginning to twinkle. A gentle breeze rustles the snow, creating a peaceful, serene atmosphere. The scene is captured in a smooth, cinematic style, with a medium shot following the bicycle as it moves forward.

Hybrid Forcing

CasVid

Self-Forcing

Rolling Forcing

LongLive

MemFlow

∞-RoPE

5s short-duration inference, case 2 (832x480)

A motorcycle racing through a winding road, making a sharp turn. The rider, a young man with short dark hair and intense focus, leans into the curve, his body slightly hunched over the handlebars. His helmet is securely fastened, and he grips the handlebars tightly. The motorcycle has a sleek, black exterior with red accents, and its engine roars as it speeds around the bend. The background shows blurred trees and scenery passing by quickly. The road is wet and slightly muddy, adding to the sense of speed and action. The camera follows the motorcycle from behind, capturing the dynamic motion and the thrill of the ride. High-speed, dynamic shot with a smooth panning effect.

Hybrid Forcing

CasVid

Self-Forcing

Rolling Forcing

LongLive

MemFlow

∞-RoPE

5s short-duration inference, case 3 (832x480)

A storm trooper from Star Wars cleaning the beach in a methodical manner. The storm trooper is wearing the classic white armor with a helmet that partially obscures their face, giving them a stern expression. They are using a large, industrial vacuum cleaner, moving slowly across the sand to pick up debris. The beach is partially visible, with waves gently lapping at the shore in the background. The sky is overcast, with dark clouds suggesting an approaching storm. The shot is medium close-up, focusing on the storm trooper's movements and the vacuum cleaner's operation. Natural sweeping motions and the trooper's deliberate steps add to the scene's authenticity. The background shows a few other storm troopers in the distance, walking towards the camera.

Hybrid Forcing

CasVid

Self-Forcing

Rolling Forcing

LongLive

MemFlow

∞-RoPE

30s long-duration inference, case 1 (832x480)

Animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. The art style is 3D and realistic, with a focus on lighting and texture. The mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. Its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. The use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.

Hybrid Forcing

CasVid

Self-Forcing

Rolling Forcing

LongLive

MemFlow

∞-RoPE

30s long-duration inference, case 2 (832x480)

A turtle swimming gracefully in the vast ocean, surrounded by clear blue water and gentle waves. The turtle has a smooth, dark green shell with a few light yellow stripes, and its head is slightly turned to the side as it swims. Its flippers move rhythmically, propelling it forward. The background features a vibrant ocean landscape with colorful coral reefs, schools of fish, and floating seaweed. The water is crystal clear, allowing the viewer to see the underwater world vividly. The turtle appears content and serene, gliding through the water with ease. The shot is a medium shot from a slightly overhead angle, capturing the turtle in mid-swim.

Hybrid Forcing

CasVid

Self-Forcing

Rolling Forcing

LongLive

MemFlow

∞-RoPE

30s long-duration inference, case 3 (832x480)

Two playful-looking pandas, one with a slightly puzzled expression and the other with a curious gaze, are sitting across from each other on a bamboo mat in a serene bamboo forest. They are discussing an academic paper, their ears perked up as they exchange ideas. One panda holds a rolled-up paper in its front paws, while the other leans forward, listening intently. The background is lush and green, with sunlight filtering through the dense foliage, creating dappled patterns on the ground. The atmosphere is peaceful and intellectual. The pandas move their heads slightly to show interest and occasionally tap their paws together in agreement. The scene is captured in a medium shot, with a soft focus on the pandas' faces and a blurred, natural forest background. The style is realistic and detailed, with a warm color palette.

Hybrid Forcing

CasVid

Self-Forcing

Rolling Forcing

LongLive

MemFlow

∞-RoPE