[HN Gopher] Packing Input Frame Context in Next-Frame Prediction...
___________________________________________________________________
Packing Input Frame Context in Next-Frame Prediction Models for
Video Generation
Author : GaggiX
Score : 215 points
Date : 2025-04-19 13:17 UTC (9 hours ago)
(HTM) web link (lllyasviel.github.io)
(TXT) w3m dump (lllyasviel.github.io)
| ZeroCool2u wrote:
| Wow, the examples are fairly impressive and the resources used to
| create them are practically trivial. Seems like inference can be
| run on previous generation consumer hardware. I'd like to see
| throughput stats for inference on a 5090 too at some point.
| Jaxkr wrote:
| This guy is a genius; for those who don't know, he also
| brought us ControlNet.
|
| This is the first decent video generation model that runs on
| consumer hardware. Big deal, and I expect ControlNet pose
| support soon too.
| msp26 wrote:
| I haven't bothered with video gen because I'm too impatient,
| but isn't Wan pretty good on regular hardware too?
| dewarrn1 wrote:
| LTX-Video isn't quite the same quality as Wan, but the new
| distilled 0.9.6 version is pretty good and screamingly fast.
|
| https://github.com/Lightricks/LTX-Video
| vunderba wrote:
| Wan 2.1 is solid, but you start to get pretty bad continuity
| and drift issues when genning more than 81 frames (approx. 5
| seconds of video), whereas FramePack lets you generate a
| minute or more.
| dragonwriter wrote:
| Wan 2.1 (and Hunyuan and LTXV, in descending order of overall
| video quality, though each has unique strengths) work well on
| consumer hardware for short videos (single-digit seconds at
| their usual frame rates: 16 fps for Wan, 24 for LTXV, I
| forget for Hunyuan), though slowly, except for LTXV. But this
| blows them entirely out of the water on the length it can
| handle, so if it does so with coherence and quality across
| general prompts (especially if it is competitive with Wan and
| Hunyuan on trainability for concepts it may not handle
| normally), it is potentially a radical game changer.
| dragonwriter wrote:
| For completeness, I should note I'm talking about the 14B
| i2v and t2v Wan 2.1 models; there are others in the family,
| notably a set of 1.3B models that are presumably much
| faster, but I haven't worked with them as much.
| artninja1988 wrote:
| He also brought us IC-Light! I wonder why he's still
| contributing to open source... Surely all the big companies
| have made him huge offers. He's so talented.
| dragonwriter wrote:
| I think he is working on his Ph.D. at Stanford. I assume
| whatever offers he has haven't been attractive enough to
| abandon that. Whether he'll still be doing open work or get
| sucked into the bowels of some proprietary corporate behemoth
| afterwards remains to be seen, but I suspect he won't have
| trouble monetizing his skills either way.
| IshKebab wrote:
| Funny how it _really_ wants people to dance. Even the guy sitting
| down for an interview just starts dancing sitting down.
| Jaxkr wrote:
| There's a massive open TikTok training set that lots of
| video researchers use.
| jonas21 wrote:
| Presumably they're dancing because it's in the prompt. You
| could change the prompt to have them do something else (but
| that would be less fun!)
| IshKebab wrote:
| I'm no expert, but are you sure there is a prompt?
| dragonwriter wrote:
| Yes, while the page here does not directly mention the
| prompts, the linked paper does, and the linked code repo
| shows that prompts are used as well.
| bravura wrote:
| It's a peculiar and fascinating observation you make.
|
| With static images, we always look for eyes.
|
| With video, we always look for dancing.
| fregocap wrote:
| looks like the only motion it can do... is to dance
| jsolson wrote:
| It can dance if it wants to...
|
| It can leave LLMs behind...
|
| 'Cause LLMs don't dance, and if they don't dance, well, they're
| no friends of mine.
| rhdunn wrote:
| That's a certified bop! ;) You should get elybeatmaker to do
| a remix!
|
| Edit: I didn't realize that this was actually a reference to
| Men Without Hats - The Safety Dance. I was referencing a
| different parody/allusion to that song!
| MyOutfitIsVague wrote:
| The AI Safety dance?
| dragonwriter wrote:
| There is plenty of non-dance motion (only one or two where
| it's non-dance _foot_ motion, but feet aren't the only
| things that move).
| enlyth wrote:
| It takes a text prompt along with the image input; dancing
| is presumably what they've used for the examples.
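A minimal sketch of that call shape, in Python; the VideoModel
class and its argument names below are hypothetical stand-ins,
not the actual FramePack API:

    # Illustrative only: an image-to-video call takes a starting
    # image plus a text prompt; the prompt drives the motion.
    class VideoModel:
        """Hypothetical stand-in for a FramePack-style model."""

        def generate(self, image, prompt, num_frames):
            # A real model would return num_frames frames that
            # continue the input image according to the prompt.
            raise NotImplementedError

    model = VideoModel()
    # Swapping "dances" for any other action would change the
    # motion in the output:
    # frames = model.generate(image=start_image,
    #                         prompt="The person dances.",
    #                         num_frames=480)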
| WithinReason wrote:
| Could you do this spatially as well? E.g. generate the image
| top-down instead of all at once.
| modeless wrote:
| Could this be used for video interpolation instead of
| extrapolation?
| yorwba wrote:
| Their "inverted anti-drifting" basically amounts to first
| extrapolating a lot and then interpolating backwards.
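A minimal sketch of that sampling order, in Python; here
generate_section() is a hypothetical stand-in for the actual
FramePack sampler. The final section is generated first (the
big extrapolation), then earlier sections are filled in reverse
so each step interpolates between already-fixed anchors:

    # "Inverted anti-drifting": extrapolate far ahead once, then
    # interpolate backwards toward the start frame.
    def generate_section(prompt, anchors):
        """Hypothetical next-frame-section predictor conditioned
        on already-fixed anchor frames."""
        raise NotImplementedError

    def inverted_anti_drifting(prompt, start_frame, num_sections):
        sections = [None] * num_sections
        # 1. Extrapolate: fix the ending first, conditioned only
        #    on the start frame.
        sections[-1] = generate_section(prompt, [start_frame])
        # 2. Interpolate backwards: each earlier section is
        #    conditioned on the start frame plus the later,
        #    already-generated sections, so errors cannot drift
        #    and accumulate toward the end of the video.
        for i in range(num_sections - 2, -1, -1):
            later = sections[i + 1:]
            sections[i] = generate_section(prompt,
                                           [start_frame] + later)
        return [start_frame] + sections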
| ilaksh wrote:
| Amazing. If you have more RAM or something, can it go faster? Can
| you get even more speed on an H100 or H200?
___________________________________________________________________
(page generated 2025-04-19 23:00 UTC)