[HN Gopher] Text-to-4D Dynamic Scene Generation
       ___________________________________________________________________
        
       Text-to-4D Dynamic Scene Generation
        
       Author : Sebastian_09
       Score  : 108 points
       Date   : 2023-01-27 10:41 UTC (12 hours ago)
        
 (HTM) web link (make-a-video3d.github.io)
 (TXT) w3m dump (make-a-video3d.github.io)
        
       | Sebastian_09 wrote:
        | Link to the paper: https://arxiv.org/abs/2301.11280. The
        | dynamic visualisations only seem to work in Chrome (?)
        
         | jerpint wrote:
          | Can confirm it doesn't work in Brave on mobile
        
       | stale2002 wrote:
       | Another paper, with no code released?
       | 
       | What's the point then?
        
         | kamray23 wrote:
          | It's perfectly reasonable to release a publicly accessible
          | paper while keeping the code to yourself, especially if
          | you're Meta or OpenAI and wish to commercialize it at some
          | point.
          | 
          | You can recreate things from papers just fine. I've done it
          | for several projects; it's often nicer than just copy-
          | pasting code, and it avoids the situation where one group is
          | using Montreal's AI toolkit, another is using PyTorch, and
          | yet another is using Keras.
         | 
          | Although for a tool like this, they clearly used pre-trained
          | models as a large component, ones with publicly accessible
          | weights as well. So a replication will probably appear in
          | the coming months, even if Meta (understandably) doesn't
          | release code they very clearly plan to use for their own
          | Metaverse product.
        
         | radarsat1 wrote:
          | Code is nice, but a paper should be written well enough to
          | get the ideas across so that the solution can be replicated.
          | The _ideas_ are the point, not the implementation.
        
       | smusamashah wrote:
        | These videos look a lot like the things I see in dreams, and
        | the way they move. They're blurry-ish and seem to make sense,
        | but actually don't. E.g. the running rabbit: its legs are
        | moving, but it isn't going anywhere. This is almost exactly
        | how I remember dreams. When I see people moving, I can rarely
        | make out their limbs moving accordingly. When I look at my own
        | hands, they might have more than five fingers and very vague,
        | blurry palm lines. When I try to run, walk, or fly, it's just
        | as weird as these videos.
        | 
        | This reminds me of how the first generation of these kinds of
        | image generators was said to be 'dreaming'. It also makes me
        | wonder whether our brains really work like these algorithms
        | (or whether these algorithms mimic brains very closely).
        
       | littlestymaar wrote:
        | I'd been expecting NeRF + diffusion models for a while, but it
        | looks like there's still a lot of work needed before this gets
        | practical.
        
         | GaggiX wrote:
          | Performing these optimization processes at inference time
          | has never been very practical for generative tasks: it takes
          | a lot of time and memory (to store the gradients), and the
          | quality is usually mediocre. I still remember VQGAN+CLIP,
          | where the optimization was to find a latent embedding that
          | would maximize the cosine similarity between the CLIP-
          | encoded image and the CLIP-encoded prompt. It worked, but it
          | wasn't very practical.
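          | 
          | Roughly, the loop looked like this. A minimal sketch in
          | PyTorch with OpenAI's clip package; vqgan_decode is a toy
          | stand-in for a frozen pretrained VQGAN decoder, not a real
          | API:
          | 
          |     import torch
          |     import torch.nn.functional as F
          |     import clip  # https://github.com/openai/CLIP
          | 
          |     device = "cuda" if torch.cuda.is_available() else "cpu"
          |     model, _ = clip.load("ViT-B/32", device=device)
          | 
          |     def vqgan_decode(z):
          |         # Toy stand-in: a real VQGAN decoder maps the latent
          |         # to a sharp RGB image instead of this upsampling.
          |         return torch.sigmoid(F.interpolate(z[:, :3], size=224))
          | 
          |     text = clip.tokenize(["a corgi on the beach"]).to(device)
          |     text_feat = model.encode_text(text)
          |     text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
          | 
          |     # Optimize the latent directly; gradients flow through
          |     # the frozen decoder and CLIP's image encoder, which is
          |     # where the memory cost of storing gradients comes from.
          |     z = torch.randn(1, 256, 16, 16, device=device, requires_grad=True)
          |     opt = torch.optim.Adam([z], lr=0.05)
          |     for step in range(500):
          |         img_feat = model.encode_image(vqgan_decode(z))
          |         img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
          |         # Negative cosine similarity (features are unit norm)
          |         loss = -(img_feat * text_feat).sum()
          |         opt.zero_grad()
          |         loss.backward()
          |         opt.step()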
        
       | dukeofdoom wrote:
        | Getting something that generates multiple angles of the same
        | subject in different typical poses would go a long way. I can
        | get Midjourney to kind of do this by asking for "multiple
        | angles", but it's hit or miss.
        
       | ajjenkins wrote:
       | Can someone explain what's 4D about this? Is it 4D because the 3D
       | models are animated (moving)?
        
         | spdustin wrote:
         | 4D: Height, width, depth, and time.
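          | 
          | In other words, the scene is (loosely) a learned function
          | f(x, y, z, t) -> (color, density) that you query along
          | camera rays at a chosen time t. A toy sketch of that idea,
          | my illustration rather than the paper's actual architecture:
          | 
          |     import torch
          | 
          |     class DynamicScene(torch.nn.Module):
          |         # Maps an (x, y, z, t) point to (r, g, b, sigma).
          |         def __init__(self, hidden=128):
          |             super().__init__()
          |             self.net = torch.nn.Sequential(
          |                 torch.nn.Linear(4, hidden), torch.nn.ReLU(),
          |                 torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
          |                 torch.nn.Linear(hidden, 4),
          |             )
          | 
          |         def forward(self, xyzt):  # (N, 4)
          |             return self.net(xyzt)
          | 
          |     # Query a batch of 3D points at time t = 0.5:
          |     scene = DynamicScene()
          |     pts = torch.rand(1024, 3)
          |     t = torch.full((1024, 1), 0.5)
          |     rgb_sigma = scene(torch.cat([pts, t], dim=-1))  # (1024, 4)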
        
         | [deleted]
        
       | radarsat1 wrote:
       | > trained only on Text-Image pairs and unlabeled videos
       | 
        | This is fascinating. It's able to pick up the fundamentals of
        | 3D motion from 2D videos, while needing only static images
        | with descriptions to infer semantics.
        
       | jackling wrote:
        | I really wish these datasets were more openly accessible. I
        | always want to try replicating these models, but it seems the
        | data is the blocker. Renting the compute needed to create an
        | inferior model doesn't seem to be an issue; it's always the
        | data.
        
       | jug wrote:
       | Here we go again. The samples look uncannily similar to the early
       | text-to-image stuff we had.
        
       ___________________________________________________________________
       (page generated 2023-01-27 23:01 UTC)