[HN Gopher] Text2Video-Zero Code and Weights Released by Picsart...
       ___________________________________________________________________
        
       Text2Video-Zero Code and Weights Released by Picsart AI Research
       (12G VRAM)
        
       Author : ftxbro
       Score  : 387 points
       Date   : 2023-03-29 04:15 UTC (18 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | meghan_rain wrote:
       | Wow, scroll to the part where stick figures are provided as
       | guidance https://github.com/Picsart-AI-Research/Text2Video-
       | Zero#text-... and tell me this won't upend the VFX industry!
        
         | DrewADesign wrote:
         | OK... This won't upend the VFX industry. Maybe a subsequent
          | generation that is vastly more precise, but this isn't even in
          | the ballpark. The stick-figure results are far less impressive
          | than what current methods produce with less time and nearly as
          | little
         | effort. Look at Unreal Engine's talk at GDC and then consider
         | that movie studios generally won't use those tools for anything
         | visually important because they just aren't as good as pre-
         | rendered 3D. Even in places like The Mandalorian it was only
         | used for background imagery.
        
           | londons_explore wrote:
           | Someone will see these results, and think "if only it was 10x
           | better, I could be rich by selling to the vfx industry".
           | They'll then start training a 10x bigger model (there is
           | plenty of video training data out there).
           | 
           | And in 3 months, it will be upending the vfx industry...
        
             | DrewADesign wrote:
             | Once again, the hubris of developers when trying to
             | evaluate art just kills me. The most important part of art
             | is deciding exactly what needs to be in the image and how
                | it should look-- not the act of actually creating it.
                | Even in plain English, describing these things with the
                | precision needed would take way longer than current
                | experts take to make them with existing tools, and the
                | refinement process is full of uncertainty. At most, you'd
                | get a base layer for people to modify using the
                | professional tools of the trade. Much like development
             | work, the final 10% takes 90% of the time. In dev, you're
             | figuring out how to make it. Usually in VFX, it's figuring
             | out how it should look. That's a fundamentally human
             | process. Whether it accomplishes its goal is evaluated by
             | humans and it needs to satisfy human tastes. No matter how
             | good the tools get at creating images, they just won't be
             | able to speak to human needs and emotions like people will,
             | maybe ever.
        
               | refulgentis wrote:
                | I'm sorry about these replies, they're so funny. I'm an
                | engineer, and I wish more of us understood that these
                | aren't workflow replacements; we'd build much better
                | tools. I've been in AI art for 2 years, so I also grok
                | that we're not one more model away from reading your mind
                | and always-perfect results.
        
               | DrewADesign wrote:
               | Yeah. Convincing people that they don't know something
               | they think they know is always a tough task. Convincing
                | _developers_ they don't know something they think they
               | know just might be impossible. I feel like the kid in the
               | Starfish parable.
        
               | satvikpendem wrote:
                | > _they just won't be able to speak to human needs and
               | emotions like people will, maybe ever._
               | 
               | Until the AI knows what you want and like by analyzing
               | your browser and viewing history. There is a part in the
               | Three Body Problem trilogy where the aliens create art
               | that's on par if not better than human-made art, and
               | humans are shocked because the aliens do not even
               | resemble any sort of human intelligence at all.
        
               | DrewADesign wrote:
                | Customizing marketing experiences for consumers is a
                | vastly different task from creating art. Creating images
                | is a vastly different process from creating art, though
                | creating art might involve it. If you think art is merely
                | customizing imagery to suit people's preferences, you
                | don't understand art. No matter how hard sci-fi gets,
                | what you're talking about is still very much "fi".
        
               | satvikpendem wrote:
                | > _If you think art is merely customizing imagery to
                | suit people's preferences, you don't understand art._
               | 
               | No, art is whatever the artist wants to call art, and
               | it's whatever people want to find meaning in. Sure, you
               | can't equate it to producing images, but the vast
                | majority of people won't really care; if they can see
                | some cool images or videos, then that's what they'll
                | want. This is the same argument that was used when Stable
                | Diffusion was released, yet the inexorable wheel of
                | progress turns on.
        
         | whitemary wrote:
         | It won't upend the VFX industry.
         | 
         | If a workflow cannot support iterating on every detail in
         | isolation with psychotic levels of control, it won't even be
         | adopted by the VFX industry.
        
           | detrites wrote:
           | This is exactly what it supports, and limitlessly.
           | 
           | An AI model doesn't get tired of receiving direction and
           | doing retakes.
           | 
           | It's a matter of time.
        
           | echelon wrote:
           | Once this tech matures, they won't have any work left to do.
           | 
            | Kids at home will be making Star Wars, not Industrial Light
            | & Magic.
        
             | CyberDildonics wrote:
             | What do you think the steps will be between 1000 people
              | with 100 million dollars and "kids making star wars"?
        
             | 101008 wrote:
             | What? No. Kids already have all their tools in their pocket
             | to make (and distribute) a short film. How many of them are
             | doing it right now?
        
             | [deleted]
        
             | DrewADesign wrote:
             | That's like saying kids at home will be making enterprise
             | software once code generation tools become smooth enough.
             | ChatGPT won't replace novelists, either, no matter how
             | solid its grammar gets.
             | 
             | People who don't mind having output that's an aesthetic
             | amalgam of whatever it has already ingested won't mind
             | using these sorts of tools to generate work. For a movie
             | studio that lives and dies on presenting imagery so
             | precisely and thoughtfully crafted that it leaves a lasting
             | impression for _decades,_ I doubt it will be anything more
             | than a tool in the toolkit to smooth things along for the
              | people who've made careers figuring out how to do that.
             | 
              | I think there are two reasons this sort of thinking is so
              | prevalent. A) Since developers are only really exposed to
             | the tools-end of other people's professions, they tend to
             | forget that the most important component is the brain
             | that's dreaming up the imagery and ideas to begin with. Art
             | school or learning to be an artist is a lot more about
             | thinking, ideas, analyzing, interpreting, and honing your
             | eye than about using tools... people without those skills
             | who can use amazing generative tools will make smooth,
             | believable looking garbage in no time flat from the comfort
             | of their own living rooms. Great. B) Most people,
             | especially those from primarily STEM backgrounds, don't
             | really understand what art communicates beyond what it
             | physically represents and maybe some intangible vibe or
             | mood. Someone who really knows what they're talking about
             | would probably take longer to accurately describe an
             | existing artistic image than would be reasonable to feed to
             | a prompt. Once again, that will be fine for a lot of
             | people-- even small game studios, for example, that need to
             | generate assets for their 3rd mobile release this month,
             | but it's got miiiiles to go before it's making a movie
             | scene or key game moment that people will remember for
             | decades.
        
               | ttjjtt wrote:
               | There are so many well described points in this post.
               | Thank you.
        
               | DrewADesign wrote:
               | Thank you.
        
               | satvikpendem wrote:
               | > _For a movie studio that lives and dies on presenting
               | imagery so precisely and thoughtfully crafted that it
               | leaves a lasting impression for decades_
               | 
                | I'd hope so, but a lot of studios are churning out the
                | next Michael Bay Transformers-type movie, and those films
                | get the vast majority of VFX work and make the most
                | revenue, so upending the VFX industry might remove a lot
                | of jobs.
        
               | DrewADesign wrote:
               | You again _vastly_ underestimate the amount of work that
               | goes into the intellectual and artistic components of
               | VFX, and vastly underestimate what the artistic
               | components require of someone. Those movies are nearly
                | _entirely_ VFX. 50% of the budget in some cases. That's
               | not because the tools are expensive or the pieces just
               | take a long time to build-- it's repeatedly iterating and
               | putting things in context to figure out what should be
               | there. No matter how fast those iterations get, that's
               | just not something machines will be able to do
               | themselves. Ever. No machine will be able to say "this is
               | the most emotionally impactful texture and drape for that
               | superhero's cape in this scene." They might be able to
               | give you a thousand different versions of it, but someone
               | still has to look at all of that and respond to it.
               | Replicants from Blade Runner couldn't do that and to say
               | we're a ways off from that is a pretty huge
               | understatement.
        
           | nine_k wrote:
           | It will help add video illustrations to ordinary
            | presentations or talking-head videos. A market big enough
            | for some attention from, say, Adobe, and for a bunch of less
           | expensive offerings.
        
             | Zetobal wrote:
             | So, it doesn't upend a multibillion-dollar market because
             | someone who has no idea how the market works made an overly
             | enthusiastic comment on HN? Who would have thought.
        
               | tanseydavid wrote:
               | Give it a couple of generations...
        
               | Zetobal wrote:
                | I won't need to. As long as it's diffusion-based, it
                | won't happen. Something else will happen, but it will not
                | be diffusion-based.
        
               | crakenzak wrote:
               | elaborate.
        
           | [deleted]
        
         | rvz wrote:
         | No it won't. At least not now when we are seeing deformed
         | artefacts in videos like this one: [0]
         | 
         | [0]
         | https://old.reddit.com/r/StableDiffusion/comments/1244h2c/wi...
        
           | SV_BubbleTime wrote:
           | Funnier than almost any network TV I've seen in years.
        
           | ftxbro wrote:
            | some call it deformed artefacts, some call it meme magic
        
             | rvz wrote:
             | As long as 'OpenAI.com' is in last place and loses the race
             | to the bottom, I'm fine with it.
             | 
             | Good for anything that is actually open source and doesn't
             | close off research, like Text2Video.
        
         | reasonabl_human wrote:
          | ControlNet is the new development that will really allow us to
          | guide diffusion model outputs more granularly - this is the
          | first time I'm seeing it used for video generation.
        
         | sockaddr wrote:
         | I think there's a boob on that camel's hip.
        
         | dragonwriter wrote:
         | It will create an entirely new low end to the industry (which
         | may be used in prototyping workflows by the high end), but the
         | high end is going to want both more detailed control of the
         | avatars (I would say models, but that means different things in
         | the colliding domains) and already does skeletal animation with
         | many more control points.
         | 
          | If you want to upend the VFX industry, you probably want
          | text-to-3d-model and text-to-script-for-existing-animation-
          | systems, where either output can still be fine-tuned manually
          | with existing tools.
        
           | pjonesdotca wrote:
           | First, tweening. If you can get an AI to do the job of a
           | tweener you've already massively impacted the VFX/Animation
           | pipeline.
        
             | CyberDildonics wrote:
             | There is no "job of a tweener" in computer animation.
             | Geometry transformations are interpolated and always have
             | been.
             | 
             | Where are you getting this idea?
        
       | mightytravels wrote:
       | Wow - this looks awesome! Love the video2video.
        
       | EGreg wrote:
       | I want to output an animated GIF. What is the command?
       | 
       | Do I run pix2pix at the end?
       | 
       | Also can I somehow have more frames and set the GIF speed?
        
       | jonas-w wrote:
        | I see that there is a 12GB VRAM requirement. Can my 6GB GPU do
        | anything to at least provide some performance advantage instead
        | of running it entirely on the CPU?
        
       | lifeisstillgood wrote:
        | Weirdly, Corridor Digital had an AI-generated video and they
        | suffered from what is slightly happening here - the image of the
        | bear / panda or whatever is a different animal each time (i.e.
        | it's a panda, just a different one "hallucinated" each frame).
        | 
        | Corridor Digital handled it by training their model on specific
        | images of specific people - and so they effectively said "video
        | of the panda called Phil that we have trained you on images of".
        | 
        | Clearly this is not possible here - so I am missing how they got
        | it this close.
        
       | bunnywantspluto wrote:
       | Can we run this on a CPU?
        
         | novaRom wrote:
          | Yes, you can run any inference with any model on a CPU, but
          | here are some examples:
          | 
          | Creating a single 512x512 frame with Stable Diffusion (4-5B
          | parameters, 20 iterations) takes 5-30 minutes depending on your
          | CPU. On any modern GPU it's only 0.1-20 seconds!
          | 
          | Similarly with LLMs: to produce one token with a transformer of
          | 30-50 layers and 7-12B parameters, you will wait several
          | minutes on a CPU, while it takes a few seconds on a Pascal-
          | generation GPU and a tiny fraction of a second on Ampere.
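          | 
          | For a concrete sense of the CPU case, a minimal sketch with
          | the diffusers library (model ID and prompt are illustrative):
          | 
          |   import time
          |   import torch
          |   from diffusers import StableDiffusionPipeline
          | 
          |   pipe = StableDiffusionPipeline.from_pretrained(
          |       "runwayml/stable-diffusion-v1-5",
          |       torch_dtype=torch.float32,
          |   )
          |   pipe = pipe.to("cpu")  # or "cuda" to compare
          | 
          |   start = time.time()
          |   image = pipe("a horse galloping on a street",
          |                height=512, width=512,
          |                num_inference_steps=20).images[0]
          |   print(f"20 steps took {time.time() - start:.1f}s")
          |   image.save("frame.png")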
        
           | gpderetta wrote:
           | > To create a single frame with Stable Diffusion 4-5B
           | parameters, 512x512, 20 iterations takes 5-30 minutes
           | depending on your CPU
           | 
           | Depends a lot on the cpu. Are you specifically talking about
           | Text2Video or SD in general? IIRC, last time I tried SD on my
           | CPU (10 core 10850k, not exactly cutting edge) it did take
           | less than one minute for more than 20 iters. This was about
           | 4-5 months ago, things might have gotten better.
           | 
           | The GPU (even a vintage 1070) was faster still of course.
        
           | tjoff wrote:
            | That is much better than expected. Depending on how easy it
            | is to set up, it is very much worthwhile if you don't have
            | access to a decent GPU.
        
           | probablynish wrote:
           | The *.cpp adaptations of popular models appear to have
           | optimized them for the CPU, for example llama.cpp and
           | alpaca.cpp let me generate several tokens in a matter of
           | seconds.
        
       | voxelghost wrote:
       | what a time to be alive!
        
         | RajSinghLA wrote:
         | Dear fellow scholars
        
           | crakenzak wrote:
           | this is two minute papers, with Dr. Karoly Zsolnai-Feher
        
       | evolveyourmind wrote:
       | Seems very limited. I wonder if the same can be achieved with
       | just stable diffusion and neighbor latent walks with very small
       | steps. On the other hand the interpolation techniques with the
       | GigaGAN txt2img produce much higher quality "videos" than this
        
         | valine wrote:
         | Right, stable diffusion and control net depth network could
         | give very good results if you have a source video.
        
       | rasz wrote:
       | at least two of the example dogs have 5 paws
        
       | tehsauce wrote:
        | I haven't seen a single example from this model that demonstrates
        | video with any kind of temporal continuity. It appears every frame
        | is independent of the others.
        
         | mlboss wrote:
         | Pose based controlnet looks like it demonstrates continuity
        
       | nprateem wrote:
       | So long giphy
        
       | [deleted]
        
       | adeon wrote:
       | It kinda works. I tried "kermit doing push ups" and it looks like
       | kermit. And Kermit sorta looks like he is in push up position.
        | Other than that, it looks a lot like those early image
        | generation AIs that vaguely look like what you asked for but are
        | all mutated.
       | Animation itself is not very good for this prompt.
       | 
       | But hey, still pretty good for the early days. Maybe need to
       | figure out how to prompt engineer it and tune it. Seems like it's
       | very heavily based on existing image models? Wondering how easy
       | it is to adapt to other image models. I think I need to read the
       | paper. https://imgur.com/a/h3ciJNn
        
         | spokeonawheel wrote:
         | I tried "spongebob hugging a cactus with his butt hanging out"
         | and it said "error".
         | 
          | perhaps I don't understand this tool
        
           | adeon wrote:
           | You are welcome: https://imgur.com/a/bjfjhQq
        
             | whoomp12342 wrote:
             | that is completely horrific lol
        
             | HopenHeyHi wrote:
             | Thanks for the nightmare fuel. And complete absence of
             | butts. :/
        
       | dcreater wrote:
       | Can this get the ggerganov treatment so that I can run it on
       | Apple silicon?
        
         | crakenzak wrote:
         | with llama.cpp getting GPT4all inference support the same day
         | it came out, I feel like llama.cpp might soon become a general
         | purpose high performance inference library/toolkit. excited!
        
           | ggerganov wrote:
            | Heh, things seem to be moving in this direction, but I think
            | there's still a very long way to go. But who knows - the
            | amount of contributions to the project keeps growing. I guess
            | when we have a solid foundation for LLM inference we can
            | think about supporting SD as well.
        
             | reasonabl_human wrote:
             | Thanks for your work in this space, it's incredible!
        
         | Labo333 wrote:
         | SD models are much smaller so you can probably already run it
         | with the github code.
         | 
         | Image models work well at least:
         | https://huggingface.co/docs/diffusers/optimization/mps
         | 
         | It is even possible to run them in the browser:
         | https://stablediffusionweb.com/
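          | 
          | For the mps route specifically, a minimal sketch following the
          | linked diffusers docs (model ID and prompt are illustrative;
          | the one-step warm-up is what those docs suggest on older
          | PyTorch builds):
          | 
          |   from diffusers import StableDiffusionPipeline
          | 
          |   pipe = StableDiffusionPipeline.from_pretrained(
          |       "runwayml/stable-diffusion-v1-5"
          |   )
          |   pipe = pipe.to("mps")
          |   pipe.enable_attention_slicing()  # eases memory pressure
          | 
          |   # One-step warm-up pass, per the linked docs.
          |   _ = pipe("warm-up", num_inference_steps=1)
          | 
          |   image = pipe("a panda surfing a wave").images[0]
          |   image.save("panda.png")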
        
       | ftxbro wrote:
       | The associated paper is "Text2Video-Zero: Text-to-Image Diffusion
       | Models are Zero-Shot Video Generators" at
       | https://arxiv.org/abs/2303.13439
        
       | bdcravens wrote:
       | I entered "A dachshund doing a backflip"
       | 
       | It was a dachshund. Doing some sort of a dance in the air.
       | Nothing resembling a flip.
       | 
       | https://imgur.com/a/eqHpuo7
        
         | 2c2c2c wrote:
         | looks like the leg motions of a kickflip, but didn't include
         | the board
        
         | fleischhauf wrote:
         | maybe the frame rate is not high enough to capture the flip,
         | could just be a super fast dachshund..
        
         | cubefox wrote:
         | As I mentioned recently, imgur hijacks the back button, at
         | least in my config (mobile Chrome). I recommend not using it
         | anymore.
        
           | Workaccount2 wrote:
           | Someone here showed me this extension yesterday:
           | 
           | https://github.com/libredirect/libredirect
           | 
            | I don't think it will work on mobile Chrome, but for others
            | like me who can't take imgur's bloat anymore, it's great.
        
           | tinus_hn wrote:
           | > As I mentioned recently
           | 
           | If only people would just listen to your sage advice!
           | 
           | This kind of sentence makes it come across as if you are on a
           | high horse talking down to the unwashed masses. You might
           | want to avoid this.
        
             | cubefox wrote:
              | Oh sorry, this didn't come off as intended.
        
               | tough wrote:
               | I appreciate you telling us imgur is no longer a clean UX
               | site, I've felt the same
               | 
                | They now tell you to disable your adblocker to upload.
               | 
               | PS: Hello from my high horse
        
               | zamnos wrote:
                | As a phrase it's fine, it's just that in the context of
                | "random internet commenter on a random HN thread", no one
                | knows you or has any idea what you said recently.
               | 
               | My horse is also really high. We've been smoking weed
               | together.
        
         | leereeves wrote:
         | The examples on GitHub are so much better than the examples
         | here. I wonder if the authors are cherry picking or we just
         | don't know how to get good results.
         | 
         | Why are people downvoting this? The examples on GitHub are
          | clearly doing the action described (unlike the dachshund above)
         | and don't have the wild deformations that the "Spongebob
         | hugging a cactus" or "Will Smith eating spaghetti" animations
         | have.
        
           | gremlinsinc wrote:
            | it only matters to the impatient among us; give it a few
            | weeks and things will only get better. The pace of AI
           | innovation is frightening but exciting too.
        
           | godelski wrote:
            | This is, unfortunately, standard practice in generative
            | papers. The best practice is to show both cherry-picked and
            | random selections, but if you don't cherry pick you don't get
            | published. Even including random samples can put you in a
            | tough position.
            | 
            | To be fair, metrics for vision are limited and many don't
            | know how to correctly interpret them. You want to know both
            | the best images the network can produce as well as the
            | average. Reviewers, unfortunately, just want to reject
            | (there's no incentive to accept). That said, cherry picking
            | can still be a useful signal because you know the best (or
            | nearly the best) the model can produce. This is also why
            | platforms like HuggingFace are so critical.
            | 
            | I do think we should ask more of the research community, but
            | just know that this is in fact normal and even standard. Most
            | researchers judge works by things like citations and by just
            | reading the works themselves. The more hype in a research
            | area, the noisier the signal for good research. (This is good
            | research imo, fwiw.)
        
           | throw101010 wrote:
           | In my experience AI generative processes always involve a lot
           | of cherry picking, rarely have I generated 1 image or 1 text
           | and it was perfect or very representative of what can be
           | attained with some refining.
           | 
           | Seems fair to choose good results when you try to demonstrate
           | what the software "can" do. Maybe a mention of the process
           | often being iterative would be the most honest, but I think
            | anyone familiar with these tools assumes this by now.
        
           | nullsense wrote:
           | They're probably cherry picking. I know on ModelScope
           | text2video I have to generate 10 clips per prompt to get 2 or
           | 3 usable clips and out of that will come one really good one.
           | But it's fast enough to do that I just sort of accept that as
           | the workflow and generate in batches of 10. I assume it's
           | likely the same for this.
           | 
           | Can't wait to try it though.
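            | 
            | The batch-and-pick workflow is easy to script. Since the
            | exact text2video API isn't shown here, a sketch with a plain
            | diffusers pipeline and fixed seeds stands in for it:
            | 
            |   import torch
            |   from diffusers import StableDiffusionPipeline
            | 
            |   pipe = StableDiffusionPipeline.from_pretrained(
            |       "runwayml/stable-diffusion-v1-5",
            |       torch_dtype=torch.float16,
            |   ).to("cuda")
            | 
            |   prompt = "a horse galloping on a street"
            |   for i in range(10):  # 10 candidates, keep the best few
            |       g = torch.Generator("cuda").manual_seed(i)
            |       img = pipe(prompt, generator=g).images[0]
            |       img.save(f"candidate_{i:02d}.png")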
        
         | fortyseven wrote:
         | He's tryin' his best. :{
        
           | zamnos wrote:
           | They're Good Dogs Brent
        
       | patientplatypus wrote:
       | [dead]
        
       | detrites wrote:
       | For the adventurous among us, here's that boolean you might want
       | to change:
       | 
       | https://github.com/Picsart-AI-Research/Text2Video-Zero/blob/...
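        | 
        | (The linked line isn't quoted here, so treating the flag as the
        | usual NSFW safety checker found in diffusers-based pipelines is
        | an assumption. The plain-diffusers equivalent of such a switch
        | looks like this:)
        | 
        |   from diffusers import StableDiffusionPipeline
        | 
        |   # Assumes the boolean gates the post-hoc NSFW filter.
        |   pipe = StableDiffusionPipeline.from_pretrained(
        |       "runwayml/stable-diffusion-v1-5",
        |       safety_checker=None,            # skip the NSFW filter
        |       requires_safety_checker=False,  # silence the warning
        |   )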
        
         | novaRom wrote:
         | Is there any estimate how much training data someone like
         | MindGeek has?
        
         | stevefan1999 wrote:
          | Come on, somebody please make this an environment variable
        
         | totetsu wrote:
         | This is exactly the kind of thing the Taoist philosophers were
         | warning us about.
        
         | belter wrote:
         | You are the wind beneath my wings...
        
         | SentinelLdnma wrote:
         | Economists will ponder for decades why LLMs never increased
         | labor productivity.
        
       | fauxpause_ wrote:
       | I feel like they really missed an opportunity to make their
       | example Horse in Motion rather than Horse Galloping on a Street
        
       | mnemotronic wrote:
        | [edit] _Almost_ everything I try except the predefined examples
        | returns "error". "a pencil with wings" returns something that
        | looks nothing like a pencil but does, in fact, have wings.
       | 
       | There will be a learning curve.
        
       | rexreed wrote:
       | Question about models and weights. When an organization says they
       | release the weights, how is that different from when an
       | organization releases a model, say Whisper from OpenAI? In what
       | way are the model releases different in these cases?
        
         | superb-owl wrote:
         | The model is just an untrained architecture. You'd need to
         | spend a lot of money (a) gathering data, and (b) running GPUs
         | to train it.
         | 
         | The weights are the fruition of training. They make the model
         | actually useful.
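          | 
          | A toy illustration of that split (a hypothetical two-layer
          | network, nothing to do with Whisper or Text2Video-Zero):
          | 
          |   import torch
          |   import torch.nn as nn
          | 
          |   # The "model"/architecture: just untrained structure.
          |   net = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
          |                       nn.Linear(32, 2))
          | 
          |   # The "weights" are its parameters; training is what makes
          |   # them useful. They ship as a state dict.
          |   torch.save(net.state_dict(), "weights.pt")
          | 
          |   # A weights release lets anyone rebuild the architecture
          |   # and load the trained parameters for inference.
          |   fresh = nn.Sequential(nn.Linear(16, 32), nn.ReLU(),
          |                         nn.Linear(32, 2))
          |   fresh.load_state_dict(torch.load("weights.pt"))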
        
           | rexreed wrote:
           | When you download a trained model for use by Python, I'm
           | assuming the model contains both the architecture (the neural
           | net or even a boosted tree) as well as the weights / tree
           | structure that makes the model actually usable in inference.
           | When organizations release a trained model, I'm assuming that
           | the weights are necessary to make use of that model? If not,
           | then are they not really releasing the model, but just the
           | architecture and training data?
        
             | vdfs wrote:
              | As I understand it:
              | 
              | - Model is the code
              | 
              | - Data is the text/images used for training
              | 
              | - Weights are the training results
              | 
              | Take Lucene, for example: the model would be the Java
              | library, the data would be text like Wikipedia, and the
              | weights would be the Lucene index. If you have all three
              | you can start searching right away. If you have model+data
              | you have to generate the index, which can take a lot of
              | time (training/indexing takes longer than searching or
              | using the model). If you have just the model, you need to
              | get your own data and run the training on it.
        
             | freeone3000 wrote:
             | Usually not the training data either!
        
         | leodriesch wrote:
         | OpenAI also released the weights for Whisper. Some model
         | releases, like LLaMa from Meta, just contain the code for
         | training and the data. You can train the weights yourself, but
         | for LLaMa that takes multiple weeks on very expensive hardware.
         | 
          | (Meta did release the LLaMA weights to researchers, and they
          | leaked online, but the weights are not open source.)
         | 
         | When a company releases a model including the weights, you can
          | download their pre-trained weights and run inference with
          | them, without having to train them yourself.
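          | 
          | With Whisper, for instance, the released weights are fetched
          | on first use and inference then runs locally (a minimal sketch
          | with the openai-whisper package; the audio path is a
          | placeholder):
          | 
          |   import whisper  # pip install openai-whisper
          | 
          |   # Downloads the pre-trained "base" weights on first call.
          |   model = whisper.load_model("base")
          |   result = model.transcribe("audio.mp3")  # placeholder path
          |   print(result["text"])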
        
       ___________________________________________________________________
       (page generated 2023-03-29 23:02 UTC)