[HN Gopher] Text2Video-Zero Code and Weights Released by Picsart...
___________________________________________________________________
Text2Video-Zero Code and Weights Released by Picsart AI Research
(12G VRAM)
Author : ftxbro
Score : 387 points
Date : 2023-03-29 04:15 UTC (18 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| meghan_rain wrote:
| Wow, scroll to the part where stick figures are provided as
| guidance https://github.com/Picsart-AI-Research/Text2Video-
| Zero#text-... and tell me this won't upend the VFX industry!
| DrewADesign wrote:
| OK... This won't upend the VFX industry. Maybe a subsequent
| generation that is vastly more precise, but this isn't even in
| the ballpark. The stick figure results are vastly less impressive
| than current methods made with less time and nearly as little
| effort. Look at Unreal Engine's talk at GDC and then consider
| that movie studios generally won't use those tools for anything
| visually important because they just aren't as good as pre-
| rendered 3D. Even in places like The Mandalorian it was only
| used for background imagery.
| londons_explore wrote:
| Someone will see these results, and think "if only it was 10x
| better, I could be rich by selling to the vfx industry".
| They'll then start training a 10x bigger model (there is
| plenty of video training data out there).
|
| And in 3 months, it will be upending the vfx industry...
| DrewADesign wrote:
| Once again, the hubris of developers when trying to
| evaluate art just kills me. The most important part of art
| is deciding exactly what needs to be in the image and how
| it should look-- not the act of actually creating it. Even
| describing these things in plain English with the precision
| needed would take way longer than current experts take to make
| them using existing tools, and the refinement process is full
| of uncertainty. At most, you'd get a base layer for people to
| modify using the professional tools of the trade. Much like
| development
| work, the final 10% takes 90% of the time. In dev, you're
| figuring out how to make it. Usually in VFX, it's figuring
| out how it should look. That's a fundamentally human
| process. Whether it accomplishes its goal is evaluated by
| humans and it needs to satisfy human tastes. No matter how
| good the tools get at creating images, they just won't be
| able to speak to human needs and emotions like people will,
| maybe ever.
| refulgentis wrote:
| I'm sorry about these replies; they're so funny. I'm an
| engineer, and I wish more of us understood that these aren't
| workflow replacements - we'd build much better tools. I've
| been in AI art for 2 years, so I also grok that we're not one
| more model away from reading your mind and getting perfect
| results every time.
| DrewADesign wrote:
| Yeah. Convincing people that they don't know something
| they think they know is always a tough task. Convincing
| _developers_ they don't know something they think they
| know just might be impossible. I feel like the kid in the
| Starfish parable.
| satvikpendem wrote:
| > _they just won't be able to speak to human needs and
| emotions like people will, maybe ever._
|
| Until the AI knows what you want and like by analyzing
| your browser and viewing history. There is a part in the
| Three Body Problem trilogy where the aliens create art
| that's on par if not better than human-made art, and
| humans are shocked because the aliens do not even
| resemble any sort of human intelligence at all.
| DrewADesign wrote:
| Customizing marketing experiences for consumers is a
| vastly different task than creating art. Creating images
| is a vastly different process than creating art, though
| creating art might involve it. If you think art is merely
| customizing imagery to suit people's preferences,
| you don't understand art. No matter how hard sci-fi gets,
| what you're talking about is still very much "fi".
| satvikpendem wrote:
| > _If you think art is merely customizing imagery to suit
| people's preferences, you don't understand art._
|
| No, art is whatever the artist wants to call art, and
| it's whatever people want to find meaning in. Sure, you
| can't equate it to producing images, but the vast
| majority of people won't really care, if they can see
| some cool images or videos then that's what they'll want.
| This is the same argument that was used when Stable
| Diffusion was released, yet the inexorable wheel of progress
| turns on.
| whitemary wrote:
| It won't upend the VFX industry.
|
| If a workflow cannot support iterating on every detail in
| isolation with psychotic levels of control, it won't even be
| adopted by the VFX industry.
| detrites wrote:
| This is exactly what it supports, and limitlessly.
|
| An AI model doesn't get tired of receiving direction and
| doing retakes.
|
| It's a matter of time.
| echelon wrote:
| Once this tech matures, they won't have any work left to do.
|
| Kids at home will be making Star Wars, not Industrial Light
| and Magic.
| CyberDildonics wrote:
| What do you think the steps will be between 1000 people
| with 100 million dollars and "kids making star wars"?
| 101008 wrote:
| What? No. Kids already have all their tools in their pocket
| to make (and distribute) a short film. How many of them are
| doing it right now?
| [deleted]
| DrewADesign wrote:
| That's like saying kids at home will be making enterprise
| software once code generation tools become smooth enough.
| ChatGPT won't replace novelists, either, no matter how
| solid its grammar gets.
|
| People who don't mind having output that's an aesthetic
| amalgam of whatever it has already ingested won't mind
| using these sorts of tools to generate work. For a movie
| studio that lives and dies on presenting imagery so
| precisely and thoughtfully crafted that it leaves a lasting
| impression for _decades,_ I doubt it will be anything more
| than a tool in the toolkit to smooth things along for the
| people who've made careers figuring out how to do that.
|
| I think there are two reasons this sort of thinking is so
| prevalent. A) Since developers are only really exposed to
| the tools end of other people's professions, they tend to
| forget that the most important component is the brain
| that's dreaming up the imagery and ideas to begin with. Art
| school or learning to be an artist is a lot more about
| thinking, ideas, analyzing, interpreting, and honing your
| eye than about using tools... people without those skills
| who can use amazing generative tools will make smooth,
| believable looking garbage in no time flat from the comfort
| of their own living rooms. Great. B) Most people,
| especially those from primarily STEM backgrounds, don't
| really understand what art communicates beyond what it
| physically represents and maybe some intangible vibe or
| mood. Someone who really knows what they're talking about
| would probably take longer to accurately describe an
| existing artistic image than would be reasonable to feed to
| a prompt. Once again, that will be fine for a lot of
| people-- even small game studios, for example, that need to
| generate assets for their 3rd mobile release this month,
| but it's got miiiiles to go before it's making a movie
| scene or key game moment that people will remember for
| decades.
| ttjjtt wrote:
| There are so many well described points in this post.
| Thank you.
| DrewADesign wrote:
| Thank you.
| satvikpendem wrote:
| > _For a movie studio that lives and dies on presenting
| imagery so precisely and thoughtfully crafted that it
| leaves a lasting impression for decades_
|
| I'd hope so, but there are a lot of studios churning out
| the next Michael Bay Transformers-type movie, and those films
| get the vast majority of VFX work and make the most revenue,
| so upending that part of the VFX industry might remove a lot
| of jobs.
| DrewADesign wrote:
| You again _vastly_ underestimate the amount of work that
| goes into the intellectual and artistic components of
| VFX, and vastly underestimate what the artistic
| components require of someone. Those movies are nearly
| _entirely_ VFX. 50% of the budget in some cases. That's
| not because the tools are expensive or the pieces just
| take a long time to build-- it's repeatedly iterating and
| putting things in context to figure out what should be
| there. No matter how fast those iterations get, that's
| just not something machines will be able to do
| themselves. Ever. No machine will be able to say "this is
| the most emotionally impactful texture and drape for that
| superhero's cape in this scene." They might be able to
| give you a thousand different versions of it, but someone
| still has to look at all of that and respond to it.
| Replicants from Blade Runner couldn't do that and to say
| we're a ways off from that is a pretty huge
| understatement.
| nine_k wrote:
| It will help add video illustrations to ordinary
| presentations or talking-head videos. A market big enough for
| some attention from, say, Adobe, and for a bunch of less
| expensive offerings.
| Zetobal wrote:
| So, a multibillion-dollar market doesn't get upended just
| because someone who has no idea how it works made an overly
| enthusiastic comment on HN? Who would have thought.
| tanseydavid wrote:
| Give it a couple of generations...
| Zetobal wrote:
| I won't need to. As long as it's diffusion-based, it won't
| happen. Something else will happen, but it will not be
| diffusion-based.
| crakenzak wrote:
| elaborate.
| [deleted]
| rvz wrote:
| No it won't. At least not now when we are seeing deformed
| artefacts in videos like this one: [0]
|
| [0]
| https://old.reddit.com/r/StableDiffusion/comments/1244h2c/wi...
| SV_BubbleTime wrote:
| Funnier than almost any network TV I've seen in years.
| ftxbro wrote:
| some call it deformed artefacts some call it meme magic
| rvz wrote:
| As long as 'OpenAI.com' is in last place and loses the race
| to the bottom, I'm fine with it.
|
| Good for anything that is actually open source and doesn't
| close off research, like Text2Video.
| reasonabl_human wrote:
| ControlNet is the new development that will really allow us to
| guide diffusion model outputs more granularly - this is the
| first time I'm seeing it used for video generation
| sockaddr wrote:
| I think there's a boob on that camel's hip.
| dragonwriter wrote:
| It will create an entirely new low end to the industry (which
| may be used in prototyping workflows by the high end), but the
| high end is going to want both more detailed control of the
| avatars (I would say models, but that means different things in
| the colliding domains) and already does skeletal animation with
| many more control points.
|
| If you want to upend the VFX industry, you probably want text-
| to-3D-model and text-to-script-for-existing-animation-systems,
| with either of those still supporting manual fine-tuning with
| existing tools.
| pjonesdotca wrote:
| First, tweening. If you can get an AI to do the job of a
| tweener you've already massively impacted the VFX/Animation
| pipeline.
| CyberDildonics wrote:
| There is no "job of a tweener" in computer animation.
| Geometry transformations are interpolated and always have
| been.
|
| Where are you getting this idea?
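|
| To make "interpolated" concrete, here's a minimal Python sketch
| of keyframe interpolation; the function and the keyframe values
| are illustrative, not from any particular animation package:
|
|     import numpy as np
|
|     def lerp(a, b, t):
|         """Linearly interpolate between two keyframe values."""
|         return (1.0 - t) * np.asarray(a) + t * np.asarray(b)
|
|     # Two keyframes for an object's translation, 24 frames apart.
|     key0 = [0.0, 0.0, 0.0]   # frame 0
|     key1 = [2.0, 1.0, 0.0]   # frame 24
|
|     # The "in-between" frames are computed, not drawn by hand.
|     frames = [lerp(key0, key1, f / 24.0) for f in range(25)]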
| mightytravels wrote:
| Wow - this looks awesome! Love the video2video.
| EGreg wrote:
| I want to output an animated GIF. What is the command?
|
| Do I run pix2pix at the end?
|
| Also can I somehow have more frames and set the GIF speed?
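|
| (If the pipeline can just write out individual frames, Pillow
| can assemble them into a GIF and control the speed; the file
| names below are assumptions, not this repo's actual output.)
|
|     from PIL import Image
|
|     # Assumes frames were saved as frame_000.png, frame_001.png, ...
|     frames = [Image.open(f"frame_{i:03d}.png") for i in range(8)]
|
|     # duration is milliseconds per frame: 125 ms is roughly 8 fps.
|     frames[0].save(
|         "output.gif",
|         save_all=True,
|         append_images=frames[1:],
|         duration=125,
|         loop=0,  # loop forever
|     )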
| jonas-w wrote:
| I see that there is a 12GB VRAM requirement. Can my 6GB GPU do
| anything to at least provide some performance advantage, as
| opposed to running it entirely on the CPU?
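|
| (For plain Stable Diffusion in diffusers there are standard
| memory-saving switches - fp16 weights, attention slicing,
| sequential CPU offload. Whether the Text2Video-Zero pipeline
| exposes the same hooks is an assumption, not something I've
| checked.)
|
|     import torch
|     from diffusers import StableDiffusionPipeline
|
|     pipe = StableDiffusionPipeline.from_pretrained(
|         "runwayml/stable-diffusion-v1-5",
|         torch_dtype=torch.float16,       # halves the weight memory
|     )
|     pipe.enable_attention_slicing()      # lower peak VRAM, a bit slower
|     pipe.enable_sequential_cpu_offload() # keep weights in RAM, stream to GPU
|
|     image = pipe("an astronaut riding a horse").images[0]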
| lifeisstillgood wrote:
| Weirdly, Corridor Digital had an AI-generated video and they
| suffered from what is slightly happening here - the image of the
| bear / panda or whatever is a different animal each time (i.e. it's
| a panda, just a different one "hallucinated" each frame).
|
| Corridor Digital handled it by training their model on specific
| images of specific people - so they effectively said "video
| of the panda called Phil that we have trained you on images of".
|
| Clearly that is not possible here - so I am missing how they got
| it this close.
| bunnywantspluto wrote:
| Can we run this on a CPU?
| novaRom wrote:
| Yes, you can run any inference with any model on CPU, but here
| are some examples:
|
| To create a single frame with Stable Diffusion 4-5B parameters,
| 512x512, 20 iterations takes 5-30 minutes depending on your
| CPU. On any modern GPU it's only 0.1-20 seconds!
|
| Similarly with LLMs, to produce one token with a transformer of
| 30-50 layers and 7-12B parameters you will wait several minutes
| on a CPU, while it takes a few seconds on a Pascal-generation
| GPU and a tiny fraction of a second on Ampere.
| gpderetta wrote:
| > To create a single frame with Stable Diffusion 4-5B
| parameters, 512x512, 20 iterations takes 5-30 minutes
| depending on your CPU
|
| Depends a lot on the CPU. Are you specifically talking about
| Text2Video or SD in general? IIRC, last time I tried SD on my
| CPU (10-core 10850K, not exactly cutting edge) it took
| less than one minute for more than 20 iters. This was about
| 4-5 months ago, things might have gotten better.
|
| The GPU (even a vintage 1070) was faster still of course.
| tjoff wrote:
| That is much better than expected. Depending on how easy it
| is to set up, it is very much worthwhile if you don't have
| access to a decent GPU.
| probablynish wrote:
| The *.cpp adaptations of popular models appear to have
| optimized them for the CPU, for example llama.cpp and
| alpaca.cpp let me generate several tokens in a matter of
| seconds.
| voxelghost wrote:
| what a time to be alive!
| RajSinghLA wrote:
| Dear fellow scholars
| crakenzak wrote:
| this is two minute papers, with Dr. Karoly Zsolnai-Feher
| evolveyourmind wrote:
| Seems very limited. I wonder if the same can be achieved with
| just stable diffusion and latent walks between neighboring
| points with very small steps. On the other hand, the
| interpolation techniques with the GigaGAN txt2img model produce
| much higher-quality "videos" than this.
| valine wrote:
| Right, Stable Diffusion and the ControlNet depth network could
| give very good results if you have a source video.
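|
| A rough sketch of what that per-frame workflow could look like
| with diffusers' ControlNet support; the depth checkpoint name,
| the pre-computed depth frames, and re-using one seed for frame-
| to-frame consistency are my assumptions, not anything from this
| repo:
|
|     import torch
|     from PIL import Image
|     from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
|
|     controlnet = ControlNetModel.from_pretrained(
|         "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
|     )
|     pipe = StableDiffusionControlNetPipeline.from_pretrained(
|         "runwayml/stable-diffusion-v1-5",
|         controlnet=controlnet,
|         torch_dtype=torch.float16,
|     ).to("cuda")
|
|     # Assumed: depth maps already extracted from the source video.
|     depth_maps = [Image.open(f"depth_{i:03d}.png") for i in range(24)]
|
|     frames = []
|     for depth in depth_maps:
|         # Re-seeding with the same value each frame keeps the look stable.
|         generator = torch.Generator("cuda").manual_seed(0)
|         out = pipe("a robot dancing in the street", image=depth,
|                    generator=generator)
|         frames.append(out.images[0])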
| rasz wrote:
| at least two of the example dogs have 5 paws
| tehsauce wrote:
| I haven't seen a single example from this model that demonstrates
| video with any kind of temporal continuity. It appears every frame
| is independent of the others.
| mlboss wrote:
| Pose-based ControlNet looks like it demonstrates continuity.
| nprateem wrote:
| So long giphy
| [deleted]
| adeon wrote:
| It kinda works. I tried "kermit doing push ups" and it looks like
| Kermit. And Kermit sorta looks like he is in a push-up position.
| Other than that, it looks a lot like those early image generation
| AIs that vaguely look like what you asked for but are all mutated.
| The animation itself is not very good for this prompt.
|
| But hey, still pretty good for the early days. Maybe need to
| figure out how to prompt engineer it and tune it. Seems like it's
| very heavily based on existing image models? Wondering how easy
| it is to adapt to other image models. I think I need to read the
| paper. https://imgur.com/a/h3ciJNn
| spokeonawheel wrote:
| I tried "spongebob hugging a cactus with his butt hanging out"
| and it said "error".
|
| Perhaps I don't understand this tool.
| adeon wrote:
| You are welcome: https://imgur.com/a/bjfjhQq
| whoomp12342 wrote:
| that is completely horrific lol
| HopenHeyHi wrote:
| Thanks for the nightmare fuel. And complete absence of
| butts. :/
| dcreater wrote:
| Can this get the ggerganov treatment so that I can run it on
| Apple silicon?
| crakenzak wrote:
| With llama.cpp getting GPT4All inference support the same day
| it came out, I feel like llama.cpp might soon become a general-
| purpose, high-performance inference library/toolkit. Excited!
| ggerganov wrote:
| Heh, things seem to be moving in this direction, but I think
| there's still a very long way to go. But who knows - the amount
| of contributions to the project keeps growing. I guess once we
| have a solid foundation for LLM inference we can think about
| supporting SD as well.
| reasonabl_human wrote:
| Thanks for your work in this space, it's incredible!
| Labo333 wrote:
| SD models are much smaller, so you can probably already run
| them with the GitHub code.
|
| Image models work well at least:
| https://huggingface.co/docs/diffusers/optimization/mps
|
| It is even possible to run them in the browser:
| https://stablediffusionweb.com/
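|
| The MPS path in that doc boils down to something like this
| (from memory, so treat the details as approximate):
|
|     from diffusers import StableDiffusionPipeline
|
|     pipe = StableDiffusionPipeline.from_pretrained(
|         "runwayml/stable-diffusion-v1-5"
|     )
|     pipe = pipe.to("mps")            # Apple silicon GPU via Metal
|     pipe.enable_attention_slicing()  # helps on machines with less unified memory
|
|     # The first call is slow while kernels warm up; later calls are faster.
|     image = pipe("a photo of an astronaut riding a horse on mars").images[0]
|     image.save("astronaut.png")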
| ftxbro wrote:
| The associated paper is "Text2Video-Zero: Text-to-Image Diffusion
| Models are Zero-Shot Video Generators" at
| https://arxiv.org/abs/2303.13439
| bdcravens wrote:
| I entered "A dachshund doing a backflip"
|
| It was a dachshund. Doing some sort of a dance in the air.
| Nothing resembling a flip.
|
| https://imgur.com/a/eqHpuo7
| 2c2c2c wrote:
| looks like the leg motions of a kickflip, but didn't include
| the board
| fleischhauf wrote:
| maybe the frame rate is not high enough to capture the flip,
| could just be a super fast dachshund..
| cubefox wrote:
| As I mentioned recently, imgur hijacks the back button, at
| least in my config (mobile Chrome). I recommend not using it
| anymore.
| Workaccount2 wrote:
| Someone here showed me this extension yesterday:
|
| https://github.com/libredirect/libredirect
|
| I don't think it will work on mobile Chrome, but for
| others like me who can't take imgur's bloat anymore, it's
| great.
| tinus_hn wrote:
| > As I mentioned recently
|
| If only people would just listen to your sage advice!
|
| This kind of sentence makes it come across as if you are on a
| high horse talking down to the unwashed masses. You might
| want to avoid this.
| cubefox wrote:
| Oh sorry, that didn't come across as intended.
| tough wrote:
| I appreciate you telling us imgur is no longer a clean-UX
| site; I've felt the same.
|
| They now tell you to disable your adblocker to upload.
|
| PS: Hello from my high horse
| zamnos wrote:
| As a phrase it's fine; it's just that in the context of
| "random internet commenter on a random HN thread", no one
| knows you or has any idea what you said recently.
|
| My horse is also really high. We've been smoking weed
| together.
| leereeves wrote:
| The examples on GitHub are so much better than the examples
| here. I wonder if the authors are cherry picking or we just
| don't know how to get good results.
|
| Why are people downvoting this? The examples on GitHub are
| clearly doing the action described (unlike the dachshund above)
| and don't have the wild deformations that the "Spongebob
| hugging a cactus" or "Will Smith eating spaghetti" animations
| have.
| gremlinsinc wrote:
| it only matters to the impatient among us; give it a few
| weeks and things will only get better. The pace of AI
| innovation is frightening but exciting too.
| godelski wrote:
| This is, unfortunately, standard practice in generative
| papers. Best practice is to show both cherry-picked and random
| selections. But if you don't cherry-pick, you don't get
| published. Even including random samples can put you in a
| tough position.
|
| To be fair, metrics for vision are limited and many don't
| know how to correctly interpret them. You want to know both
| the best images the network can produce as well as the
| average. Reviewers, unfortunately, just want to reject (no
| incentive to accept). That said, cherry-picking can still be a
| useful signal because you know the best (or near-best) that
| the model can produce. This is also why platforms like
| HuggingFace are so critical.
|
| I do think we should ask more of the research community, but
| just know that this is in fact normal and even standard. Most
| researchers judge by things like citations and by reading the
| works themselves. The more hype in a research area, the
| noisier the signal for good research. (This is good research
| imo, fwiw.)
| throw101010 wrote:
| In my experience AI generative processes always involve a lot
| of cherry-picking; rarely have I generated one image or one text
| and had it come out perfect or very representative of what can
| be attained with some refining.
|
| It seems fair to choose good results when you try to demonstrate
| what the software "can" do. Maybe a mention of the process
| often being iterative would be the most honest, but I think
| anyone familiar with these tools assumes this by now.
| nullsense wrote:
| They're probably cherry picking. I know on ModelScope
| text2video I have to generate 10 clips per prompt to get 2 or
| 3 usable clips and out of that will come one really good one.
| But it's fast enough to do that I just sort of accept that as
| the workflow and generate in batches of 10. I assume it's
| likely the same for this.
|
| Can't wait to try it though.
| fortyseven wrote:
| He's tryin' his best. :{
| zamnos wrote:
| They're Good Dogs Brent
| patientplatypus wrote:
| [dead]
| detrites wrote:
| For the adventurous among us, here's that boolean you might want
| to change:
|
| https://github.com/Picsart-AI-Research/Text2Video-Zero/blob/...
| novaRom wrote:
| Is there any estimate of how much training data someone like
| MindGeek has?
| stevefan1999 wrote:
| Come on, somebody please make this an environment variable.
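|
| Something like this, presumably - the variable name is made up,
| and what the boolean actually controls is whatever that line in
| the repo does:
|
|     import os
|
|     # Read the flag from the environment instead of hard-coding it.
|     FLAG_ENABLED = os.environ.get("TEXT2VIDEO_FLAG", "0").lower() in ("1", "true")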
| totetsu wrote:
| This is exactly the kind of thing the Taoist philosophers were
| warning us about.
| belter wrote:
| You are the wind beneath my wings...
| SentinelLdnma wrote:
| Economists will ponder for decades why LLMs never increased
| labor productivity.
| fauxpause_ wrote:
| I feel like they really missed an opportunity to make their
| example Horse in Motion rather than Horse Galloping on a Street
| mnemotronic wrote:
| [edit] _Almost_ everything I try except the predefined examples
| returns "error". "a pencil with wings" returns something nothing
| like a pencil that does, in fact, have wings.
|
| There will be a learning curve.
| rexreed wrote:
| Question about models and weights. When an organization says they
| release the weights, how is that different from when an
| organization releases a model, say Whisper from OpenAI? In what
| way are the model releases different in these cases?
| superb-owl wrote:
| The model is just an untrained architecture. You'd need to
| spend a lot of money (a) gathering data, and (b) running GPUs
| to train it.
|
| The weights are the fruition of training. They make the model
| actually useful.
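|
| A tiny PyTorch illustration of the split (the architecture and
| the file name here are made up):
|
|     import torch
|     import torch.nn as nn
|
|     # "Model" in the architecture sense: code with randomly
|     # initialized parameters.
|     model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
|
|     # "Weights": the trained parameter values, shipped separately.
|     torch.save(model.state_dict(), "weights.pt")     # what a weights release contains
|     model.load_state_dict(torch.load("weights.pt"))  # what you do with one
|     model.eval()                                      # ready for inference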
| rexreed wrote:
| When you download a trained model for use in Python, I'm
| assuming the model contains both the architecture (the neural
| net or even a boosted tree) as well as the weights / tree
| structure that makes the model actually usable in inference.
| When organizations release a trained model, I'm assuming that
| the weights are necessary to make use of that model? If not,
| then are they not really releasing the model, but just the
| architecture and training data?
| vdfs wrote:
| As I understand it:
|
| - Model is the code
|
| - Data is the text/images used for training
|
| - Weights are the training results
|
| Take Lucene as an example: the model would be the Java library,
| the data would be text such as Wikipedia, and the weights would
| be the Lucene index. If you have all three you can start
| searching right away. If you have model+data you have to
| generate the index, which can take a lot of time -
| training/indexing takes much longer than searching or using the
| model. If you have just the model, you need to get your own
| data and run training on it.
| freeone3000 wrote:
| Usually not the training data either!
| leodriesch wrote:
| OpenAI also released the weights for Whisper. Some model
| releases, like LLaMA from Meta, just contain the code and a
| description of the training data. You could train the weights
| yourself, but for LLaMA that would take multiple weeks on very
| expensive hardware.
|
| (Meta also released the LLaMA weights to researchers, and they
| leaked online, but those weights are not open source.)
|
| When a company releases a model including the weights, you can
| download their pre-trained weights and run inference with them,
| without having to train by yourself.
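|
| For example, with OpenAI's whisper package the released weights
| are downloaded on first use and you only run inference (the
| audio file name is a placeholder):
|
|     import whisper
|
|     model = whisper.load_model("base")      # fetches the released weights
|     result = model.transcribe("audio.mp3")  # inference only, no training
|     print(result["text"])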
___________________________________________________________________
(page generated 2023-03-29 23:02 UTC)