[HN Gopher] Packing Input Frame Context in Next-Frame Prediction...
       ___________________________________________________________________
        
       Packing Input Frame Context in Next-Frame Prediction Models for
       Video Generation
        
       Author : GaggiX
       Score  : 215 points
       Date   : 2025-04-19 13:17 UTC (9 hours ago)
        
 (HTM) web link (lllyasviel.github.io)
 (TXT) w3m dump (lllyasviel.github.io)
        
       | ZeroCool2u wrote:
       | Wow, the examples are fairly impressive and the resources used to
       | create them are practically trivial. Seems like inference can be
        | run on previous-generation consumer hardware. I'd like to see
       | throughput stats for inference on a 5090 too at some point.
        
       | Jaxkr wrote:
        | This guy is a genius; for those who don't know, he also brought
        | us ControlNet.
        | 
        | This is the first decent video generation model that runs on
        | consumer hardware. Big deal, and I expect ControlNet pose
        | support soon too.
        
         | msp26 wrote:
         | I haven't bothered with video gen because I'm too impatient but
         | isn't Wan pretty good too on regular hardware?
        
           | dewarrn1 wrote:
           | LTX-Video isn't quite the same quality as Wan, but the new
           | distilled 0.9.6 version is pretty good and screamingly fast.
           | 
           | https://github.com/Lightricks/LTX-Video
        
           | vunderba wrote:
            | Wan 2.1 is solid, but you start to get pretty bad
            | continuity/drift issues when genning more than 81 frames
            | (approx. 5 seconds of video), whereas FramePack lets you
            | generate 1+ minute.
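            | 
            | (For the arithmetic -- assuming Wan's usual 16 fps, per the
            | sibling comment -- a quick check:)
            | 
            |     # 81 frames at Wan 2.1's usual 16 fps
            |     frames, fps = 81, 16
            |     print(f"{frames / fps:.2f} s")  # 5.06 s, i.e. ~5 seconds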
        
           | dragonwriter wrote:
            | Wan 2.1 (and Hunyuan and LTXV, in descending order of
            | overall video quality, though each has unique strengths)
            | work well--but slowly, except LTXV--for short videos
            | (single-digit seconds at their usual frame rates: 16 for
            | Wan, 24 for LTXV, I forget for Hunyuan) on consumer
            | hardware. But this blows them entirely out of the water on
            | the length it can handle, so if it does so with coherence
            | and quality across general prompts (especially if it is
            | competitive with Wan and Hunyuan on trainability for
            | concepts it may not handle normally), it is potentially a
            | radical game changer.
        
             | dragonwriter wrote:
              | For completeness, I should note I'm talking about the 14B
              | i2v and t2v Wan 2.1 models; there are others in the
              | family, notably a set of 1.3B models that are presumably
              | much faster, but I haven't worked with them as much.
        
         | artninja1988 wrote:
          | He also brought us IC-Light! I wonder why he's still
          | contributing to open source... Surely all the big companies
          | have made him huge offers. He's so talented.
        
           | dragonwriter wrote:
            | I think he is working on his Ph.D. at Stanford. I assume
            | whatever offers he has haven't been attractive enough to
            | abandon that. Whether he'll still be doing open work or get
            | sucked into the bowels of some proprietary corporate
            | behemoth afterwards remains to be seen, but I suspect he
            | won't have trouble monetizing his skills either way.
        
       | IshKebab wrote:
       | Funny how it _really_ wants people to dance. Even the guy sitting
       | down for an interview just starts dancing sitting down.
        
         | Jaxkr wrote:
          | There's a massive open TikTok training set that lots of video
          | researchers use.
        
         | jonas21 wrote:
         | Presumably they're dancing because it's in the prompt. You
         | could change the prompt to have them do something else (but
         | that would be less fun!)
        
           | IshKebab wrote:
            | I'm no expert, but are you sure there is a prompt?
        
             | dragonwriter wrote:
             | Yes, while the page here does not directly mention the
             | prompts, the linked paper does, and the linked code repo
             | shows that prompts are used as well.
        
         | bravura wrote:
         | It's a peculiar and fascinating observation you make.
         | 
         | With static images, we always look for eyes.
         | 
         | With video, we always look for dancing.
        
       | fregocap wrote:
       | looks like the only motion it can do...is to dance
        
         | jsolson wrote:
         | It can dance if it wants to...
         | 
         | It can leave LLMs behind...
         | 
         | 'Cause LLMs don't dance, and if they don't dance, well, they're
         | no friends of mine.
        
           | rhdunn wrote:
           | That's a certified bop! ;) You should get elybeatmaker to do
           | a remix!
           | 
           | Edit: I didn't realize that this was actually a reference to
           | Men Without Hats - The Safety Dance. I was referencing a
           | different parody/allusion to that song!
        
           | MyOutfitIsVague wrote:
           | The AI Safety dance?
        
         | dragonwriter wrote:
          | There is plenty of non-dance motion (only one or two examples
          | have non-dance _foot_ motion, but feet aren't the only things
          | that move).
        
         | enlyth wrote:
          | It takes a text prompt along with the image input; dancing is
          | presumably what they've used for the examples.
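          | 
          | A minimal sketch of the interface shape implied above
          | (hypothetical names, not the repo's actual API):
          | 
          |     # hypothetical image-to-video call: the input image anchors
          |     # the first frame, the text prompt steers the motion
          |     def generate_video(image, prompt: str, seconds: float = 5.0):
          |         """Return frames continuing `image` as described by
          |         `prompt` (placeholder body, for illustration only)."""
          |         return [image] * int(seconds * 30)
          | 
          |     # e.g.: generate_video(portrait, "The man dances energetically")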
        
       | WithinReason wrote:
        | Could you do this spatially as well? E.g., generate the image
        | top-down instead of all at once?
        
       | modeless wrote:
       | Could this be used for video interpolation instead of
       | extrapolation?
        
         | yorwba wrote:
         | Their "inverted anti-drifting" basically amounts to first
         | extrapolating a lot and then interpolating backwards.
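          | 
          | A rough sketch of that sampling order (my reading of the
          | paper; `model.sample` is a hypothetical stand-in, not the
          | repo's actual code):
          | 
          |     # Generate sections in inverted temporal order: the final
          |     # section is extrapolated first from the start frame, and
          |     # each earlier section then interpolates between the start
          |     # frame and the already-generated future sections, so
          |     # errors cannot accumulate forward (anti-drifting).
          |     def generate(start_frame, prompt, n_sections, model):
          |         sections = [None] * n_sections
          |         for i in reversed(range(n_sections)):
          |             future = [s for s in sections[i + 1:] if s is not None]
          |             context = [start_frame] + future
          |             sections[i] = model.sample(prompt, context)
          |         return [start_frame] + sections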
        
       | ilaksh wrote:
       | Amazing. If you have more RAM or something, can it go faster? Can
       | you get even more speed on an H100 or H200?
        
       ___________________________________________________________________
       (page generated 2025-04-19 23:00 UTC)