Apple introduced a new generative video model called STARFlow-V on December 6, 2025, built on “normalizing flows” rather than diffusion. The goal, according to a report by the-decoder.com’s Jonathan Kemper, is faster, more stable long clips, though outputs are currently capped at 640 × 480 pixels at 16 frames per second and demo videos run up to 30 seconds.
STARFlow-V and normalizing flows: what Apple just unveiled
Kemper frames STARFlow-V as a technical fork away from the current trend: diffusion-based tools like Sora, Veo, and Runway. In his words,
“Apple’s new STARFlow-V model takes a different technical approach to video generation than popular tools like Sora or Veo, relying on ‘Normalizing Flows’ instead of the widely used diffusion models.”
He writes that Apple is emphasizing stability on longer clips and faster generation:
“This allows for faster, more stable production of longer video clips, though current outputs are limited to 640 × 480 pixels at 16 frames per second.”
Out of the box, Kemper says the system supports three modes without swapping architectures (a hypothetical interface sketch follows the list):
- Text-to-video
- Image-to-video (treats the input image as a starting frame)
- Video-to-video editing (add or remove objects)
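Kemper doesn’t describe how these modes are exposed, but a single model covering all three typically just varies what it conditions on. Here is a hypothetical sketch of such a unified interface, with every name and signature invented for illustration:

```python
import numpy as np
from typing import Optional

# Hypothetical unified interface (invented for illustration; not Apple's API):
# one model, three modes, selected purely by which conditioning inputs are set.
def generate(
    prompt: Optional[str] = None,               # text-to-video
    start_frame: Optional[np.ndarray] = None,   # image-to-video (starting frame)
    source_video: Optional[np.ndarray] = None,  # video-to-video editing
    num_frames: int = 16,
) -> np.ndarray:
    # Placeholder body: a real model would run the flow conditioned on
    # whichever inputs were provided. Here we only mimic the output shape.
    frames = np.zeros((num_frames, 480, 640, 3), dtype=np.float32)
    if start_frame is not None:
        frames[0] = start_frame  # per the report: input image is the first frame
    return frames

# Same architecture, three tasks:
clip = generate(prompt="a dog running on a beach")
clip = generate(prompt="animate this scene", start_frame=clip[0])
clip = generate(prompt="remove the car", source_video=clip)
```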
To extend beyond its training-length clips, Kemper describes a “sliding window” that rolls the sequence forward while carrying context from the last frames so the next segment aligns:
“For clips exceeding the training length, the model employs a sliding window technique: it generates a section, retains context from the final frames, and continues seamlessly.”
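The report includes no implementation details, so purely as an illustrative sketch of the pattern Kemper describes, here is one way such a sliding-window loop can be structured. `generate_chunk` stands in for a hypothetical model call, the chunk and context lengths are assumed values, and the resolution is downscaled to keep the demo light:

```python
import numpy as np

# Illustrative sliding-window loop (not Apple's code): generate a chunk,
# retain the last few frames as context, and continue from there.
CHUNK_LEN = 48    # frames generated per window (assumed value)
CONTEXT_LEN = 8   # trailing frames carried into the next window (assumed)
H, W = 48, 64     # downscaled from the report's 640 x 480 to keep this light

rng = np.random.default_rng(0)

def generate_chunk(context):
    # Placeholder for the real model: an actual generator would condition on
    # `context` so the new chunk continues seamlessly from the previous one.
    return rng.random((CHUNK_LEN, H, W, 3), dtype=np.float32)

def generate_video(total_frames: int) -> np.ndarray:
    frames, context = [], None
    while len(frames) < total_frames:
        chunk = generate_chunk(context)
        frames.extend(chunk)
        context = chunk[-CONTEXT_LEN:]  # keep context from the final frames
    return np.stack(frames[:total_frames])

# 30 seconds at 16 fps, matching the demo limits in the report.
video = generate_video(total_frames=30 * 16)
print(video.shape)  # (480, 48, 64, 3)
```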
Kemper also notes a caveat from Apple’s demos: clips up to 30 seconds “show limited variance over time.”
How it stacks up to Sora, Veo, and Runway
Kemper says STARFlow-V isn’t topping the diffusion heavyweights overall, at least in the benchmarks he cites:
“In benchmarks, STARFlow-V trails top diffusion models in overall score but clearly outperforms other autoregressive models, especially in maintaining video quality and stability over longer sequences.”
That sets up a neat split. Diffusion models like OpenAI’s Sora, Google’s Veo, and Runway’s systems still lead on aggregate metrics and high-fidelity showpieces, according to Kemper’s summary. The flow-based model, by contrast, leans into steadier motion over longer horizons and faster sampling. The trade-off is visible: 640 × 480 at 16 fps is a hard ceiling right now, which narrows use cases even if the motion holds together better over time.
Kemper does credit Apple with pursuing a different lane:
“With STARFlow-V, Apple has introduced a video generation model that diverges technically from competitors like Sora, Veo, and Runway.”
Inside the tech: flows, windows, and trade-offs
Kemper’s summary of the method is straightforward. Diffusion cleans up noise over many iterative steps. Normalizing flows learn a direct, invertible mapping between noise and data. He writes:
“While diffusion models generate clean video by gradually removing noise from images in multiple steps, normalizing flows learn a direct mathematical transformation between random noise and complex video data.”
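Kemper keeps it at the prose level, but the standard change-of-variables identity behind normalizing flows (generic flow math, not taken from the report) makes that “direct mathematical transformation” concrete:

```latex
% Standard normalizing-flow likelihood via change of variables
% (generic flow math; not from Kemper's report).
% An invertible map f sends data x to noise z = f(x):
\log p_X(x) = \log p_Z\!\left(f(x)\right)
            + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
% Training maximizes this exact likelihood in a single pass;
% sampling inverts the map directly: x = f^{-1}(z), with z drawn from p_Z.
```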
That matters for training and sampling. In Kemper’s rundown:
“This allows for training in a single pass rather than through many small iterations.”
“Once trained, the model generates video directly from random values, eliminating the need for iterative calculations.”
“Apple argues this makes training more efficient and reduces the errors often seen in step-by-step generation.”
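Apple hasn’t published STARFlow-V’s internals in this report, so as a generic illustration of those three claims, here is a minimal affine coupling layer, the textbook flow building block: the forward pass yields an exact likelihood in one shot, and sampling is a single inversion rather than an iterative denoising loop. All names and toy parameters below are invented:

```python
import numpy as np

# Generic affine coupling layer, the textbook building block of normalizing
# flows (illustrative only; not STARFlow-V's actual architecture).
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2)) * 0.1  # toy parameters for the scale net
W2 = rng.normal(size=(2, 2)) * 0.1  # toy parameters for the shift net

def forward(x: np.ndarray):
    x1, x2 = np.split(x, 2, axis=-1)      # split features in half
    s, t = np.tanh(x1 @ W1), x1 @ W2      # scale/shift computed from first half
    z2 = x2 * np.exp(s) + t               # affine-transform the second half
    log_det = s.sum(axis=-1)              # exact Jacobian log-determinant
    return np.concatenate([x1, z2], axis=-1), log_det

def inverse(z: np.ndarray):
    z1, z2 = np.split(z, 2, axis=-1)
    s, t = np.tanh(z1 @ W1), z1 @ W2      # same nets, so the map inverts exactly
    x2 = (z2 - t) * np.exp(-s)
    return np.concatenate([z1, x2], axis=-1)

x = rng.normal(size=(4, 4))               # a toy batch of "data"
z, log_det = forward(x)
# Exact log-likelihood in a single pass: base density plus log-det term.
log_p = -0.5 * (z**2 + np.log(2 * np.pi)).sum(axis=-1) + log_det
print(np.allclose(inverse(z), x))         # True: sampling is one direct inversion
```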
The approach extends to long clips by pairing flows with that sliding-window generator. Generate a chunk, keep the last frames’ context, and roll forward to the next chunk. Kemper cites Apple’s claim that this design helps tamp down error accumulation over time, a persistent headache for video models:
“Generating long sequences remains a major hurdle for video AI, as frame-by-frame generation often leads to accumulating errors.”
Viewed through Kemper’s lens, the choices line up cleanly with the results: STARFlow-V emphasizes stable motion and speed over headline resolution. Benchmarks reflect that tilt—behind top diffusion models overall, but ahead of other autoregressive systems when the sequence gets long.
Apple’s flow push: from images to video
Kemper points out that this didn’t come out of nowhere:
“Apple has explored this method since at least last year, publishing a paper on image generation via normalizing flows over the summer.”
Now those ideas carry over to moving pictures:
“Now applied to video, Apple claims STARFlow-V is the first of its kind to rival diffusion models in visual quality and speed, albeit at a relatively low resolution of 640 × 480 pixels at 16 frames per second.”
Put together, Kemper’s reporting paints STARFlow-V as a test of whether normalizing flows can hang with diffusion in video. The current ceiling—640 × 480, 16 fps, demo clips up to 30 seconds—keeps expectations grounded. The upside Apple is chasing, per Kemper’s account, is fewer generation steps, steadier long-form motion, and a single model that covers text-to-video, image-to-video, and basic edits without architectural swaps.
Specs and limits noted in the report
- Output resolution: 640 × 480 pixels
- Frame rate: 16 fps
- Demo length: up to 30 seconds
- Tasks: text-to-video, image-to-video (starting frame), video-to-video editing (add/remove objects)
- Generation method: normalizing flows with a sliding window for long clips
- Benchmark summary: trails top diffusion models overall; outperforms other autoregressive models on longer-sequence stability
All details above come from Jonathan Kemper’s report at the-decoder.com, which characterizes Apple’s move as a deliberate bet on flows for video. If Apple pushes this line further, the next thing to watch is whether those stability gains at length can survive a jump in resolution and frame rate. For now, STARFlow-V stands as Apple’s flow-based entry in video AI: flows, windows, and a clear emphasis on long-clip steadiness over showpiece pixels.