[HN Gopher] Stable Video Diffusion
       ___________________________________________________________________
        
       Stable Video Diffusion
        
       Author : roborovskis
       Score  : 675 points
       Date   : 2023-11-21 19:01 UTC (3 hours ago)
        
 (HTM) web link (stability.ai)
 (TXT) w3m dump (stability.ai)
        
       | minimaxir wrote:
       | Model weights (two variations, each 10GB) are available without
       | waitlist/approval: https://huggingface.co/stabilityai/stable-
       | video-diffusion-im...
       | 
       | The LICENSE is a special non-commercial one:
       | https://huggingface.co/stabilityai/stable-video-diffusion-im...
       | 
        | It's unclear how exactly to run it easily: diffusers has video
        | generation support now, but it remains to be seen whether this
        | model plugs in seamlessly.
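        | 
        | If diffusers does pick it up the same way as the image
        | pipelines, I'd expect the interface to look roughly like this
        | (sketch only; the pipeline class name is my guess until video
        | support actually ships):
        | 
        |     # assumptions: a StableVideoDiffusionPipeline class exists in
        |     # your diffusers build; repo id assumed from the HF model card
        |     import torch
        |     from diffusers import StableVideoDiffusionPipeline
        |     from diffusers.utils import load_image, export_to_video
        | 
        |     pipe = StableVideoDiffusionPipeline.from_pretrained(
        |         "stabilityai/stable-video-diffusion-img2vid-xt",
        |         torch_dtype=torch.float16,
        |     ).to("cuda")
        | 
        |     image = load_image("input.png").resize((1024, 576))
        |     frames = pipe(image, decode_chunk_size=4).frames[0]
        |     export_to_video(frames, "output.mp4", fps=7)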
        
         | ronsor wrote:
         | Regular reminder that it is very likely that model weights
         | can't be copyrighted (and thus can't be licensed).
        
         | chankstein38 wrote:
          | It looks like the huggingface page links to their GitHub,
          | which seems to have Python scripts to run these:
          | https://github.com/Stability-AI/generative-models
        
           | minimaxir wrote:
           | Those scripts aren't as easy to use or iterate upon since
           | they are CLI apps instead of a REPL like a Colab/Jupyter
           | Notebook (although these models probably will not run in a
           | normal Colab without shenanigans).
           | 
           | They can be hacked into a Jupyter Notebook but it's really
           | not fun.
        
       | valine wrote:
        | The rate of progress in ML this past year has been breathtaking.
       | 
       | I can't wait to see what people do with this once controlnet is
       | properly adapted to video. Generating videos from scratch is
       | cool, but the real utility of this will be the temporal
       | consistency. Getting stable video out of stable diffusion
       | typically involves lots of manual post processing to remove
       | flicker.
        
         | Der_Einzige wrote:
          | ControlNet is already adapted to video today; the issue is that
          | it's very slow. Haven't you seen the insane quality of the
          | videos on civitai?
        
           | valine wrote:
            | I have seen them; the workflows to create those videos are
            | extremely labor-intensive. ControlNet lets you maintain poses
            | between frames, but it doesn't solve the temporal consistency
            | of small details.
        
             | mattnewton wrote:
             | People use animatediff's motion module (or other models
             | that have cross frame attention layers). Consistency is
             | close to being solved.
        
               | valine wrote:
               | Hopefully this new model will be a step beyond what you
               | can do with animatediff
        
               | dragonwriter wrote:
               | Temporal consistency is improving, but "close to being
               | solved" is very optimistic.
        
               | mattnewton wrote:
                | No, I think we're actually close. My source: I'm working
                | on this problem, and the incredible progress of our tiny
                | 3-person team at drip.art (http://api.drip.art) - we can
                | generate a lot of frames that are consistent and, with
                | interpolation between them, smoothly restyle even long
                | videos. Cross-frame attention works for most cases; it
                | just needs to be scaled up.
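                | 
                | To make "cross-frame attention" concrete - a toy sketch,
                | not our production code - the simplest version just swaps
                | each frame's attention keys/values for those of a shared
                | anchor frame:
                | 
                |     import torch
                |     import torch.nn.functional as F
                | 
                |     def cross_frame_attention(q, k, v, anchor=0):
                |         # q, k, v: (frames, tokens, dim); every frame
                |         # attends to the anchor frame's keys/values,
                |         # which ties small details together over time
                |         k_a = k[anchor:anchor + 1].expand_as(k)
                |         v_a = v[anchor:anchor + 1].expand_as(v)
                |         return F.scaled_dot_product_attention(q, k_a, v_a)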
               | 
               | And that's just for diffusion focused approaches like
               | ours. There are probably other techniques from the token
               | flow or nerf family of approaches close to breakout
               | levels of quality, tons of talented researchers working
               | on that too.
        
           | capableweb wrote:
           | > Haven't you seen the insane quality of videos on civitai?
           | 
           | I have not, so I went to https://civitai.com/ which I guess
           | is what you're talking about? But I cannot find a single
           | video there, just images and models.
        
             | Kevin09210 wrote:
             | https://www.youtube.com/shorts/ZN-NbdFwfNQ
             | 
             | https://www.youtube.com/watch?v=3WWy98ylLT4
             | 
              | The inconsistencies are actually what's most interesting in
              | these videos.
        
         | alberth wrote:
         | What was the big "unlock" that allowed so much progress this
         | past year?
         | 
         | I ask as a noob in this area.
        
           | mlboss wrote:
            | The Stable Diffusion open-source release and the LLaMA release.
        
             | alberth wrote:
             | But what technically allowed for so much progress?
             | 
             | There's been open source AI/ML for 20+ years.
             | 
             | Nothing comes close to the massive milestones over the past
             | year.
        
               | Chabsff wrote:
               | Public availability of large transformer-based foundation
               | models trained at great expense, which is what OP is
               | referring to, is definitely unprecedented.
        
               | kmeisthax wrote:
               | Attention, transformers, diffusion. Prior image synthesis
               | techniques - i.e. GANs - had problems that made it
               | difficult to scale them up, whereas the current
               | techniques seem to have no limit other than the amount of
               | RAM in your GPU.
        
               | jasonjmcghee wrote:
               | People figuring out how to train and scale newer
                | architectures (like transformers) effectively, to be
               | wildly larger than ever before.
               | 
               | Take AlexNet - the major "oh shit" moment in image
               | classification.
               | 
               | It had an absolutely mind-blowing number of parameters at
               | a whopping 62 million.
               | 
               | Holy shit, what a large network, right?
               | 
               | Absolutely unprecedented.
               | 
               | Now, for language models, anything under 1B parameters is
               | a toy that barely works.
               | 
               | Stable diffusion has around 1B or so - or the early
               | models did, I'm sure they're larger now.
               | 
               | A whole lot of smart people had to do a bunch of cool
               | stuff to be able to keep networks working at all at that
               | size.
               | 
               | Many, many times over the years, people have tried to
               | make larger networks, which fail to converge (read: learn
               | to do something useful) in all sorts of crazy ways.
               | 
               | At this size, it's also expensive to train these things
               | from scratch, and takes a shit-ton of data, so
               | research/discovery of new things is slow and difficult.
               | 
               | But, we kind of climbed over a cliff, and now things are
               | absolutely taking off in all the fields around this kind
               | of stuff.
               | 
               | Take a look at XTTSv2 for example, a leading open source
               | text-to-speech model. It uses multiple models in its
               | architecture, but one of them is GPT.
               | 
               | There are a few key models that are still being used in a
               | bunch of different modalities like CLIP, U-Net, GPT, etc.
               | or similar variants. When they were released / made
               | available, people jumped on them and started
               | experimenting.
        
               | dragonwriter wrote:
               | > Stable diffusion has around 1B or so - or the early
               | models did, I'm sure they're larger now.
               | 
               | SDXL is 6.6 billion.
        
           | 4death4 wrote:
           | I think these are the main drivers behind the progress:
           | 
           | - Unsupervised learning techniques, e.g. transformers and
           | diffusion models. You need unsupervised techniques in order
           | to utilize enough data. There have been other unsupervised
           | techniques in the past, e.g. GANs, but they don't work as
           | well.
           | 
           | - Massive amounts of training data.
           | 
           | - The belief that training these models will produce
           | something valuable. It costs between hundreds of thousands to
           | millions of dollars to train these models. The people doing
           | the training need to believe they're going to get something
           | interesting out at the end. More and more people and teams
           | are starting to see training a large model as something worth
           | pursuing.
           | 
            | - Better GPUs, which enable training larger models.
           | 
           | - Honestly the fall of crypto probably also contributed,
           | because miners were eating a lot of GPU time.
        
             | mkaic wrote:
             | I don't think transformers or diffusion models are
             | inherently "unsupervised", especially not the way they're
             | used in Stable Diffusion and related models (which are very
             | much trained in a supervised fashion). I agree with the
             | rest of your points though.
        
               | ebalit wrote:
               | Generative methods have usually been considered
               | unsupervised.
               | 
                | You're right that conditional generation starts to blur
                | the lines though.
        
           | Cyphase wrote:
            | One factor is that Stable Diffusion and ChatGPT were released
            | about 3 months apart - August 22, 2022 and November 30, 2022,
            | respectively. That brought a lot of attention and
           | excitement to the field. More excitement, more people, more
           | work being done, more progress.
           | 
           | Of course those two releases didn't fall out of the sky.
        
         | hanniabu wrote:
         | > but the real utility of this will be the temporal consistency
         | 
          | The main utility will be misinformation.
        
       | ericpauley wrote:
       | I'm still puzzled as to how these "non-commercial" model licenses
       | are supposed to be enforceable. Software licenses govern the
        | redistribution of the _software_, not products produced with it.
       | An image isn't GPL'd because it was produced with GIMP.
        
         | cubefox wrote:
         | Nobody claimed otherwise?
        
           | not2b wrote:
           | There are sites that make Stable Diffusion-derived models
           | available, along with GPU resources, and they sell the
           | service of generating images from the models. The company
           | isn't permitting that use, and it seems that they could find
           | violators and shut them down.
        
           | littlethoughts wrote:
           | Fantasy.ai was subject to controversy for attempting to
           | license models.
        
         | Der_Einzige wrote:
         | They're not enforceable.
        
         | yorwba wrote:
         | The license is a contract that allows you to use the software
         | provided you fulfill some conditions. If you do not fulfill the
         | conditions, you have no _right_ to a _copy_ of the software and
         | can be sued. This enforcement mechanism is the same whether the
         | conditions are that you include source code with copies you
         | redistribute, or that you may only use it for evil, or that you
         | must pay a monthly fee. Of course this enforcement mechanism
          | may turn out to be ineffective if it's hard to discover that
         | you're violating the conditions.
        
           | comex wrote:
           | It also somewhat depends on open legal questions like whether
           | models are copyrightable and, if so, whether model outputs
           | are derivative works of the model. Suppose that models are
           | not copyrightable, due to their not being the product of
           | human creativity (this is debatable). Then the creator can
           | still require people to agree to contractual terms before
           | downloading the model from them, presumably including the
           | usage limitations as well as an agreement not to redistribute
           | the model to anyone else who does not also agree. Agreement
           | can happen explicitly by pressing a button, or potentially
           | implicitly just by downloading the model from them, if the
           | terms are clearly disclosed beforehand. But if someone
           | decides on their own (not induced by you in any way) to
           | violate the contract by uploading it somewhere else, and you
           | passively download it from there, then you may be in the
           | clear.
        
             | ronsor wrote:
             | > Then the creator can still require people to agree to
             | contractual terms before downloading the model from them,
             | presumably including the usage limitations as well as an
             | agreement not to redistribute the model to anyone else who
             | does not also agree.
             | 
             | I don't think it's possible to invent copyright-like
             | rights.
        
         | dist-epoch wrote:
         | Visual Studio Community (and many other products) only allows
         | "non-commercial" usage. Sounds like it limits what you can do
         | with what you produce with it.
         | 
         | At the end of the day, a license is a legal contract. If you
         | agree that an image which you produce with some software will
         | be GPL'ed, it's enforceable.
         | 
         | As an example, see the Creative Commons license, ShareAlike
         | clause:
         | 
         | > If you remix, transform, or build upon the material, you must
         | distribute your contributions under the same license as the
         | original.
        
           | blibble wrote:
           | > At the end of the day, a license is a legal contract. If
           | you agree that an image which you produce with some software
           | will be GPL'ed, it's enforceable.
           | 
           | you can put whatever you want in a contract, doesn't mean
           | it's enforceable
        
           | antonyt wrote:
            | Do you have a link for the VS Community terms you're
           | describing? What I've found is directly contradictory: "Any
           | individual developer can use Visual Studio Community to
           | create their own free or paid apps." From
           | https://visualstudio.microsoft.com/vs/community/
        
             | dist-epoch wrote:
             | Enterprise organizations are not allowed to use VS
             | Community for commercial purposes:
             | 
             | > _In enterprise organizations (meaning those with >250 PCs
             | or >$1 Million US Dollars in annual revenue), no use is
             | permitted beyond the open source, academic research, and
             | classroom learning environment scenarios described above._
        
         | kmeisthax wrote:
         | So, there's a few different things interacting here that are a
         | little confusing.
         | 
         | First off, you have copyright law, which grants monopolies on
         | the act of copying to the creators of the original. In order to
         | legally make use of that work you need to either have
         | permission to do so (a license), or you need to own a copy of
         | the work that was made by someone with permission to make and
         | sell copies (a sale). For the purposes of computer software,
         | you will almost always get rights to the software through a
         | license and _not_ a sale. In fact, there is an argument that
         | usage of computer software requires a license and that a sale
          | wouldn't be enough because you wouldn't have permission to
         | load it into RAM[0].
         | 
         | Licenses are, at least under US law, contracts. These are
         | Turing-complete priestly rites written in a special register of
         | English that legally bind people to do or not do certain
         | things. A license can grant rights, or, confusingly, take them
         | away. For example, you could write a license that takes away
         | your fair use rights[1], and courts will actually respect that.
         | So you can also have a license that says you're only allowed to
         | use software for specific listed purposes but not others.
         | 
         | In copyright you also have the notion of a derivative work.
         | This was invented whole-cloth by the US Supreme Court, who
         | needed a reason to prosecute someone for making a SSSniperWolf-
         | tier abridgement[2] of someone else's George Washington
         | biography. Normal copyright infringement is evidenced by
         | substantial similarity and access: i.e. you saw the original,
         | then you made something that's nearly identical, ergo
         | infringement. The law regarding derivative works goes a step
         | further and counts hypothetical works that an author _might_
         | make - like sequels, translations, remakes, abridgements, and
         | so on - as requiring permission in order to make. Without that
          | permission, you don't own anything and your work has no right
         | to exist.
         | 
         | The GPL is the anticopyright "judo move", invented by a really
         | ornery computer programmer that was angry about not being able
         | to fix their printer drivers. It disclaims _almost_ the entire
         | copyright monopoly, but it leaves behind one license
         | restriction, called a  "copyleft": any derivative work must be
         | licensed under the GPL. So if you modify the software and
         | distribute it, you have to distribute your changes under GPL
         | terms, thus locking the software in the commons.
         | 
         | Images made with software are not derivative works of the
         | software, nor do they contain a substantially similar copy of
         | the software in them. Ergo, the GPL copyleft does not trip. In
          | fact, _even if it did trip_, your image is still not a
         | derivative work of the software, so you don't lose ownership
         | over the image because you didn't get permission. This also
          | applies to model licenses on AI software, inasmuch as the AI
         | companies don't own their training data[3].
         | 
         | However, there's still something that licenses can take away:
         | your right to use the software. If you use the model for
         | "commercial" purposes - whatever those would be - you'd be in
         | breach of the license. What happens next is also determined by
         | the license. It could be written to take away your
         | noncommercial rights if you breach the license, or it could
         | preserve them. In either case, however, the primary enforcement
         | mechanism would be a court of law, and courts usually award
         | money damages. If particularly justified, they _could_ demand
         | you destroy all copies of the software.
         | 
         | If it went to SCOTUS (unlikely), they might even decide that
         | images made by software are derivative works of the software
         | after all, just to spite you. The Betamax case said that
         | advertising a copying device with potentially infringing
         | scenarios was fine as long as that device could be used in a
         | non-infringing manner, but then the Grokster case said it was
         | "inducement" and overturned it. Static, unchanging rules are
         | ultimately a polite fiction, and the law can change behind your
         | back if the people in power want or need it to. This is why you
         | don't talk about the law in terms of something being legal or
         | illegal, you talk about it in terms of risk.
         | 
         | [0] Yes, this is a real argument that courts have actually
         | made. Or at least the Ninth Circuit.
         | 
         | The actual facts of the case are even more insane - basically a
          | company trying to sue former employees for fixing its
          | customers' computers. Imagine if Apple sued Louis Rossman for
          | pirating macOS every time he turned on a customer laptop. The
          | only reason why they _can't_ is because Congress actually
         | created a special exemption for computer repair and made it
         | part of the DMCA.
         | 
         | [1] For example, one of the things you agree to when you buy
         | Oracle database software is to give up your right to benchmark
         | the software. I'm serious! The tech industry is evil and needs
         | to burn down to the ground!
         | 
         | [2] They took 300 pages worth of material from 12 books and
         | copied it into a separate, 2 volume work.
         | 
         | [3] Whether or not copyright on the training data images flows
         | through to make generated images a derivative work is a
         | separate legal question in active litigation.
        
           | dragonwriter wrote:
           | > Licenses are, at least under US law, contracts
           | 
           | Not necessarily; gratuitous licenses are not contracts.
           | Licenses which happen to also meet the requirements for
           | contracts (or be embedded in agreements that do) are
           | contracts or components of contracts, but that's not all
           | licenses.
        
         | SXX wrote:
          | It doesn't have to be enforceable. This licensing model works
          | exactly the same as Microsoft Windows licensing or WinRAR
          | licensing. Lots and lots of people have pirated Windows or
          | bought cheap keys off eBay, but no one in their right mind
          | would use anything like that at their company.
          | 
          | In the same way, you can easily violate the "non-commercial"
          | clauses of models like this one as a private person or a tiny
          | startup, but a company that decides to use them for its
          | business will more likely just go and pay.
          | 
          | So it's possible to ignore the license, but the legal and
          | financial risks are not worth it for businesses.
        
       | helpmenotok wrote:
       | Can this be used for porn?
        
         | theodric wrote:
         | If it can't, someone will massage it until it can. Porn, and
         | probably also stock video to sell to YouTubers.
        
         | citrusui wrote:
         | Very unusual comment.
         | 
         | I do not think so as the chance of constructing a fleshy
         | eldritch horror is quite high.
        
           | tstrimple wrote:
           | > I do not think so as the chance of constructing a fleshy
           | eldritch horror is quite high.
           | 
           | There is a market for everything!
        
           | johndevor wrote:
           | How is that not the first question to ask? Porn has proven to
           | be a fantastic litmus test of fast market penetration when it
           | comes to new technologies.
        
             | citrusui wrote:
             | This is true. I was hoping my educated guess of the outcome
             | would minimize the possibility of anyone attempting this.
             | And yet, here we are - the only losing strategy in the
             | technology sector is to not try at all.
        
             | throwaway743 wrote:
             | No pun intended?
        
             | xanderlewis wrote:
             | Market what?
        
           | crtasm wrote:
           | That didn't stop people using PornPen for images and it
           | wouldn't stop them using something else for video.
        
           | ben_w wrote:
           | A surprisingly large number of people are into fleshy
           | eldritch horrors.
        
         | 1024core wrote:
         | The question reminded me of this classic:
         | https://www.youtube.com/watch?v=YRgNOyCnbqg
        
         | Racing0461 wrote:
          | Nope, all commercial models are severely gated.
        
         | hbn wrote:
         | Depends on whether trains, cars, and/or black cowboys tickle
         | your fancy.
        
           | boppo1 wrote:
           | Whatever this is:
           | 
           | https://i.4cdn.org/g/1700595378919869.png
        
         | artursapek wrote:
         | Porn will be one of the main use cases for this technology.
         | Porn sites pioneered video streaming technologies back in the
         | day, and drove a lot of the innovation there.
        
         | SXX wrote:
         | It's already posted to Unstable Diffusion discord so soon we'll
         | know.
         | 
          | After all, fine-tuning wouldn't take that long.
        
       | christkv wrote:
        | Looks like I'm still good for my bet with some friends that
        | before 2028 a team of 5-10 people will create, on a shoestring
        | budget, a blockbuster-style movie of the kind that today costs
        | 100+ million USD, and we won't be able to tell the difference.
        
         | CamperBob2 wrote:
         | It'll happen, but I think you're early. 2038 for sure, unless
         | something drastic happens to stop it (or is forced to happen.)
        
         | accrual wrote:
         | The first full-length AI generated movie will be an important
         | milestone for sure, and will probably become a "required watch"
         | for future AI history classes. I wonder what the Rotten
         | Tomatoes page will look like.
        
           | jjkaczor wrote:
            | As per the reviews - it will be hard to say, as both positive
            | and negative takes will be uploaded by ChatGPT bots (or its
            | myriad descendants).
        
           | qiine wrote:
           | "I wonder what the Rotten Tomatoes page will look like"
           | 
            | Surely it will be written using machine vision and LLMs!
        
         | throwaway743 wrote:
          | Definitely a big first for benchmarks. After that, hyper-
          | personalized content/media generated on demand.
        
         | ben_w wrote:
         | I wouldn't bet either way.
         | 
         | Back in the mid 90s to 2010 or so, graphical improvements were
         | hailed as photorealistic only to be improved upon with each
         | subsequent blockbuster game.
         | 
         | I think we're in a similar phase with AI[0]: every new release
         | in $category is better, gets hailed as super fantastic world
         | changing, is improved upon in the subsequent Two Minute Papers
         | video on $category, and the cycle repeats.
         | 
         | [0] all of them: LLMs, image generators, cars, robots, voice
         | recognition and synthesis, scientific research, ...
        
           | Keyframe wrote:
           | Your comment reminded me of this: https://www.reddit.com/r/ga
           | ming/comments/ktyr1/unreal_yes_th...
           | 
           | Many more examples, of course.
        
             | ben_w wrote:
             | Yup, that castle flyby, those reflections. I remember being
             | mesmerised by the sequence as a teenager.
             | 
             | Big quality improvement over Marathon 2 on a mid-90s Mac,
             | which itself was a substantial boost over the Commodore 64
             | and NES I'd been playing on before that.
        
         | marcusverus wrote:
         | I'm pumped for this future, but I'm not sure that I buy your
         | optimistic timeline. If the history of AI has taught us
          | anything, it is that the last 1% of progress is the hardest
         | half. And given the unforgiving nature of the uncanny valley,
         | the video produced by such a system will be worthless until it
         | is damn-near perfect. That's a tall order!
        
         | deckard1 wrote:
         | I'm imagining more of an AI that takes a standard movie
         | screenplay and a sidecar file, similar to a CSS file for the
         | web and generates the movie. This sidecar file would contain
         | the "director" of the movie, with camera angles, shot length
         | and speed, color grading, etc. Don't like how the new Dune
         | movie looks? Edit the stylesheet and make it your own.
         | Personalized remixed blockbusters.
         | 
         | On a more serious note, I don't think Roger Deakins has
         | anything to worry about right now. Or maybe ever. We've been
         | here before. DAWs opened up an entire world of audio production
         | to people that could afford a laptop and some basic gear. But
         | we certainly do not have a thousand Beatles out there. It still
         | requires talent and effort.
        
           | timeon wrote:
           | > thousand Beatles out there. It still requires talent and
           | effort
           | 
           | As well as marketing.
        
       | btbuildem wrote:
       | In the video towards the bottom of the page, there are two birds
       | (blue jays), but in the background there are two identical
       | buildings (which look a lot like the CN Tower). CN Tower is the
       | main landmark of Toronto, whose baseball team happens to be the
       | Blue Jays. It's located near the main sportsball stadium
       | downtown.
       | 
       | I vaguely understand how text-to-image works, and so it makes
       | sense that the vector space for "blue jays" would be near
       | "toronto" or "cn tower". The improvements in scale and speed
       | (image -> now video) are impressive, but given how incredibly
        | capable the image generation models are, they simultaneously feel
       | crippled and limited by their lack of editing / iteration
       | ability.
       | 
       | Has anyone come across a solution where model can iterate (eg,
       | with prompts like "move the bicycle to the left side of the
       | photo")? It feels like we're close.
        
         | appplication wrote:
         | I don't spend a lot of time keeping up with the space, but I
         | could have sworn I've seen a demo that allowed you to iterate
         | in the way you're suggesting. Maybe someone else can link it.
        
           | accrual wrote:
           | It's not exactly like GP described (e.g. move bike to the
           | left) but there is a more advanced SD technique called
           | inpainting [0] that allows you to manually recompose parts of
           | the image, e.g. to fix bad eyes and hands.
           | 
           | [0] https://stable-diffusion-art.com/inpainting_basics/
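            | 
            | A minimal sketch of that flow via the diffusers inpainting
            | pipeline (the model id is just the commonly used runwayml
            | checkpoint; swap in whichever inpainting model you prefer):
            | 
            |     import torch
            |     from diffusers import StableDiffusionInpaintPipeline
            |     from diffusers.utils import load_image
            | 
            |     pipe = StableDiffusionInpaintPipeline.from_pretrained(
            |         "runwayml/stable-diffusion-inpainting",
            |         torch_dtype=torch.float16,
            |     ).to("cuda")
            | 
            |     image = load_image("portrait.png")   # original render
            |     mask = load_image("hands_mask.png")  # white = redo this
            |     fixed = pipe(prompt="detailed hands", image=image,
            |                  mask_image=mask).images[0]
            |     fixed.save("portrait_fixed.png")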
        
           | ssalka wrote:
           | My guess is you're thinking of InstructPix2Pix[1], with
           | prompts like "make the sky green" or "replace the fruits with
           | cake"
           | 
           | [1] https://github.com/timothybrooks/instruct-pix2pix
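            | 
            | Rough sketch of how that runs through its diffusers wrapper
            | (checkpoint name is the commonly used one from the paper
            | author; treat the exact parameters as placeholders):
            | 
            |     import torch
            |     from diffusers import StableDiffusionInstructPix2PixPipeline
            |     from diffusers.utils import load_image
            | 
            |     Pipe = StableDiffusionInstructPix2PixPipeline
            |     pipe = Pipe.from_pretrained(
            |         "timbrooks/instruct-pix2pix",
            |         torch_dtype=torch.float16,
            |     ).to("cuda")
            | 
            |     image = load_image("photo.png")
            |     out = pipe("make the sky green", image=image,
            |                num_inference_steps=20,
            |                image_guidance_scale=1.5).images[0]
            |     out.save("photo_green_sky.png")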
        
             | appplication wrote:
             | This is exactly it!
        
           | tjoff wrote:
           | Emu-Edit is the closest I've seen.
           | 
           | https://emu-edit.metademolab.com/
           | 
           | https://ai.meta.com/blog/emu-text-to-video-generation-
           | image-...
        
         | kshacker wrote:
         | Assuming we can post links, you mean this video:
         | https://youtu.be/G7mihAy691g?si=o2KCmR2Uh_97UQ0N
         | 
          | Also, maybe you can't edit post facto, but when you give
          | prompts, would you not be able to say: two blue jays but no CN
          | tower?
        
           | FrozenTuna wrote:
            | Yes, it's called a negative prompt. Idk if txt2video has it,
            | but both LLMs and Stable Diffusion have it, so I'd assume
            | it's good to go.
        
             | nottheengineer wrote:
             | Haven't implemented negative prompts yet, but from what I
              | can tell it's as simple as subtracting from the prompt in
             | embedding space.
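              | 
              | Concretely, in the common implementations the negative
              | prompt's embedding stands in for the empty "unconditional"
              | prompt in classifier-free guidance, so the prediction gets
              | pushed away from it - roughly (assuming a diffusers-style
              | UNet call):
              | 
              |     def cfg_pred(unet, x, t, cond, neg, scale=7.5):
              |         # the negative prompt embedding plays the role of
              |         # the "unconditional" branch of CFG
              |         e_cond = unet(x, t, encoder_hidden_states=cond).sample
              |         e_neg = unet(x, t, encoder_hidden_states=neg).sample
              |         return e_neg + scale * (e_cond - e_neg)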
        
         | FrozenTuna wrote:
          | Not _exactly_ what you're asking for, but AnimateDiff has
          | brought GIF generation to SD. Still takes quite a bit of
          | tweaking IME.
        
         | xianshou wrote:
         | Emu edit should be exactly what you're looking for:
         | https://ai.meta.com/blog/emu-text-to-video-generation-image-...
        
           | smcleod wrote:
           | It doesn't look like the code for that is available anywhere
           | though?
        
         | TacticalCoder wrote:
         | > Has anyone come across a solution where model can iterate
         | (eg, with prompts like "move the bicycle to the left side of
         | the photo")? It feels like we're close.
         | 
         | I feel like we're close too, but for another reason.
         | 
         | For although I love SD and these video examples are great...
         | It's a flawed method: they never get lighting correctly and
         | there are many incoherent things just about everywhere. Any 3D
         | artist or photographer can immediately spot that.
         | 
         | However I'm willing to bet that we'll soon have something
          | _much_ better: you'll describe something and you'll get a full
          | 3D scene, with 3D models, light sources set up, etc.
         | 
         | And the scene shall be sent into Blender and you'll click on a
         | button and have an actual rendering made by Blender, with
         | correct lighting.
         | 
         | Wanna move that bicycle? Move it in the 3D scene exactly where
         | you want.
         | 
         | That is coming.
         | 
          | And for audio it's the same: why generate an audio file when
          | models will soon be able to generate the individual tracks,
          | with all the instruments and whatnot, letting you assemble the
          | audio file yourself?
         | 
         | That is coming too.
        
           | p1esk wrote:
           | Are you working on all that?
        
             | cptaj wrote:
             | Probably not. But there does seem to be a clear path to it.
             | 
              | The main issue is going to be having the right dataset. You
              | basically need to record user actions in something like
              | Blender (e.g. moving a model of a bike to the left of a
              | scene), match them to a text description of the action
              | (e.g. "move bike to the left"), and match those to
              | before/after snapshots of the resulting file format.
             | 
             | You need a whole metric fuckton of these.
             | 
             | After that, you train your model to produce those 3d scene
             | files instead of image bitmaps.
             | 
             | You can do this for a lot of other tasks. These general
             | purpose models can learn anything that you can usefully
             | represent in data.
             | 
             | I can imagine AGI being, at least in part, a large set of
             | these purpose trained models. Heck, maybe our brains work
             | this way. When we learn to throw a ball, we train a model
             | in a subset of our brain to do just this and then this
             | model is called on by our general consciousness when
             | needed.
             | 
              | Sorry, I'm just rambling here but it's very exciting stuff.
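              | 
              | Just to sketch the shape of one such training record (names
              | made up, nothing standard):
              | 
              |     from dataclasses import dataclass
              | 
              |     @dataclass
              |     class SceneEditExample:
              |         instruction: str     # "move the bike to the left"
              |         scene_before: bytes  # serialized scene snapshot
              |         scene_after: bytes   # snapshot after the edit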
        
               | sterlind wrote:
                | The hard part of AGI is the self-training and learning
                | from few examples. Your parents didn't attach strings to
                | your body
               | and puppeteer you through a few hundred thousand games of
               | baseball. And the humans that invented baseball had zero
               | training data to go on.
        
           | atentaten wrote:
            | What's your reasoning for feeling that we're close?
        
             | cptaj wrote:
             | We do it for text, audio and bitmapped images. A 3D scene
              | file format is no different: you could train a model to
              | output a Blender file instead of a bitmap.
             | 
             | It can learn anything you have data for.
             | 
             | Heck, we do it with geospatial data already, generating
             | segmentation vectors. Why not 3D?
        
               | boppo1 wrote:
               | >3D scene file format is no different
               | 
               | Not in theory, but the level of complexity is way higher
               | and the amount of data available is much smaller.
               | 
               | Compare bitmaps to this: https://fossies.org/linux/blende
               | r/doc/blender_file_format/my...
        
               | kaibee wrote:
               | Also the level of fault tolerance... if your pixels are a
               | bit blurry, chances are no one notices at a high enough
               | resolution. If your json is a bit blurry you have
               | problems.
        
               | jncfhnb wrote:
               | Text, audio, and bitmapped images are data. Numbers and
               | tokens.
               | 
                | A 3D scene is vastly more complex, and the way you
                | consume it is tangential to the rendering we use to
                | interpret it. It is a collection of arbitrary data
                | structures.
               | 
               | We'll need a new approach for this kind of problem
        
               | dragonwriter wrote:
               | > Text, audio, and bitmapped images are data. Numbers and
               | tokens.
               | 
               | > A 3D scene is vastly more complex
               | 
               | 3D scenes, in fact, are also data, numbers and tokens.
               | (Well, numbers, but so are tokens.)
        
               | dragonwriter wrote:
               | We do it for 3D, too.
               | 
               | https://guytevet.github.io/mdm-page/
        
           | bob1029 wrote:
           | > However I'm willing to bet that we'll soon have something
           | much better: you'll describe something and you'll get a full
           | 3D scene, with 3D models, source of lights set up, etc.
           | 
           | I agree with this philosophy - Teach the AI to work with the
           | same tools the human does. We already have a lot of human
           | experts to refer to. Training material is everywhere.
           | 
           | There isn't a "text-to-video" expert we can query to help us
           | refine the capabilities around SD. It's a one-shot, Jupiter-
           | scale model with incomprehensible inertia. Contrast this with
           | an expert-tuned model (i.e. natural language instructions)
            | that can be nuanced precisely and to the point of
           | imperceptibility with a single sentence.
           | 
           | The other cool thing about the "use existing tools" path is
           | that if the AI fails part way through, it's actually possible
           | for a human operator to step in and attempt recovery.
        
           | epr wrote:
           | > you'll describe something and you'll get a full 3D scene,
           | with 3D models, source of lights set up, etc.
           | 
           | I'm always confused why I don't hear more about projects
           | going in this direction. Controlnets are great, but there's
           | still quite a lot of hallucination and other tiny mistakes
           | that a skilled human would never make.
        
             | boppo1 wrote:
             | Blender files are dramatically more complex than any image
             | format, which are basically all just 2D arrays of 3-value
             | vectors. The blender filetype uses a weird DNA/RNA struct
             | system that would probably require its own training run.
             | 
             | More on the Blender file format: https://fossies.org/linux/
             | blender/doc/blender_file_format/my...
        
               | mikepurvis wrote:
               | But surely you wouldn't try to emit that format directly,
               | but rather some higher level scene description? Or even
               | just a set of instructions for how to manipulate the UI
               | to create the imagined scene?
        
               | BirdieNZ wrote:
                | I've seen this done by producing Python scripts that you run
               | in Blender, e.g.
               | https://www.youtube.com/watch?v=x60zHw_z4NM (but I saw
               | something marginally more impressive, not sure where
               | though!)
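                | 
                | Presumably just ordinary bpy calls you paste into
                | Blender's scripting tab - a toy example of that kind of
                | output:
                | 
                |     import bpy
                | 
                |     # a sphere on a ground plane lit by a sun lamp
                |     bpy.ops.mesh.primitive_plane_add(size=20)
                |     bpy.ops.mesh.primitive_uv_sphere_add(
                |         radius=1.0, location=(0, 0, 1))
                |     bpy.ops.object.light_add(type='SUN',
                |                              location=(5, -5, 10))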
        
               | Keyframe wrote:
               | Scene layouts, models and their attributes are a result
               | of user input (ok and sometimes program output). One
               | avenue to take there would be to train on input expecting
               | an output. Like teaching a model to draw instead of
                | generating images... which in a sense we already did by
               | broadly painting out silhouettes and then rendering
               | details.
        
               | guyomes wrote:
               | Voxel files could be a simpler step for 3D images.
        
             | bozhark wrote:
             | One was on the front page the other day, I'll search for a
             | link
        
             | jowday wrote:
              | There are a lot of issues with it, but perhaps the biggest is
             | that there aren't just troves of easily scrapable and
             | digestible 3D models lying around on the internet to train
             | on top of like we have with text, images, and video.
             | 
             | Almost all of the generative 3D models you see are actually
             | generative image models that essentially (very crude
             | simplification) perform something like photogrammetry to
             | generate a 3D model - 'does this 3D object, rendered from
             | 25 different views, match the text prompt as evaluated by
             | this model trained on text-image pairs'?
             | 
             | This is a shitty way to generate 3D models, and it's why
             | they almost all look kind of malformed.
        
               | sterlind wrote:
               | If reinforcement learning were farther along, you could
               | have it learn to reproduce scenes as 3D models. Each
               | episode's task is to mimic an image, each step is a
               | command mutating the scene (adding a polygon, or rotating
               | the camera, etc.), and the reward signal is image
               | similarity. You can even start by training it with
               | synthetic data: generate small random scenes and make
               | them increasingly sophisticated, then later switch over
               | to trying to mimic images.
               | 
               | You wouldn't need any models to learn from. But my
               | intuition is that RL is still quite weak, and that the
               | model would flounder after learning to mimic background
               | color and placing a few spheres.
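                | 
                | Purely as a sketch of the loop I mean (agent, renderer,
                | apply_action and image_similarity are all placeholders,
                | not real libraries):
                | 
                |     def run_episode(agent, renderer, target, steps=50):
                |         scene = []  # start from an empty scene
                |         for _ in range(steps):
                |             view = renderer.render(scene)
                |             action = agent.act(view, target)
                |             # e.g. add a polygon, move the camera, ...
                |             scene = apply_action(scene, action)
                |             reward = image_similarity(
                |                 renderer.render(scene), target)
                |             agent.observe(reward)
                |         return scene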
        
               | skdotdan wrote:
               | Deepmind tried something similar in 2018
               | https://deepmind.google/discover/blog/learning-to-write-
               | prog...
        
             | dragonwriter wrote:
             | > I'm always confused why I don't hear more about projects
             | going in this direction.
             | 
             | Probably because they aren't as advanced and the demos
             | aren't as impressive to nontechnical audiences who don't
             | understand the implications: there's lots of work on text-
             | to-3d-model generation, and even plugins for some stable
              | diffusion UIs (e.g., MotionDiff for ComfyUI).
        
           | a_bouncing_bean wrote:
            | Thanks! This is exactly what I have been thinking, only
            | you've expressed it much more eloquently than I would be
            | able to.
        
           | internet101010 wrote:
            | I am guessing it will be similar to inpainting in normal
            | Stable Diffusion, which is easy when using the workflow
            | feature in the InvokeAI UI.
        
           | coldtea wrote:
           | > _For although I love SD and these video examples are
            | great... It's a flawed method: they never get lighting
           | correctly and there are many incoherent things just about
           | everywhere. Any 3D artist or photographer can immediately
           | spot that._
           | 
            | The question is whether 99% of the audience would even
            | care...
        
           | sheepscreek wrote:
           | Excellent point.
           | 
           | Perhaps a more computationally expensive but better looking
           | method will be to pull all objects in the scene from a 3D
           | model library, then programmatically set the scene and render
           | it.
        
           | Kuinox wrote:
            | This isn't coming, it's already here:
            | https://github.com/gsgen3d/gsgen Yes, it's just 3D models for
            | now, but it can do whole-scene generation; it's just not
            | great at it yet. The tech is there but just needs to improve.
        
         | treesciencebot wrote:
          | Have you seen fal.ai/dynamic, where you can perform image-to-
          | image synthesis (basically editing an existing image with the
          | help of the diffusion process) using LCMs to provide a real-
          | time UI?
        
         | filterfiber wrote:
         | > Has anyone come across a solution where model can iterate
         | (eg, with prompts like "move the bicycle to the left side of
         | the photo")? It feels like we're close.
         | 
         | Emu can do that.
         | 
         | The bluejay/toronto thing may be addressable later (I suspect
         | via more detailed annotations a la dalle3) - these current
         | video models are highly focused on figuring out temporal
         | coherence
        
         | JoshTriplett wrote:
         | I also wonder if the model takes capitalization into account.
         | Capitalized "Blue Jays" seems more likely to reference the
         | sports team; the birds would be lowercase.
        
         | psunavy03 wrote:
         | > sportsball
         | 
         | This is not the flex you think it is. You don't have to like
         | sports, but snarking on people who do doesn't make you
         | intellectual, it just makes you come across as a douchebag, no
         | different than a sports fan making fun of "D&D nerds" or
         | something.
        
           | chaps wrote:
           | Ah, Mr. Kettle, I see you've met my friend, Mr. Pot!
        
           | Zetaphor wrote:
           | This has become a colloquial term for describing all sports,
           | not the insult you're perceiving it to be.
           | 
           | Rather than projecting your own hangups and calling people
           | names, try instead assuming that they're not trying to offend
           | you personally and are just using common vernacular.
        
         | amoshebb wrote:
         | I wonder what other odd connections are made due to city-name
         | almost certainly being the most common word next to sportsball-
         | name.
         | 
         | Do the parameters think that Jazz musicians are mormon? Padres
         | often surf? Wizards like the Lincoln Memorial?
        
         | ProfessorZoom wrote:
          | That sounds like v0 by Vercel; you can iterate just like you
          | asked. Combining that type of iteration with video would be
          | really awesome.
        
       | dinvlad wrote:
       | Seems relatively unimpressive tbh - it's not really a video, and
       | we've seen this kind of thing for a few months now
        
         | accrual wrote:
         | It seems like the breakthrough is that the video generating
         | method is now baked into the model and generator. I've seen
         | several fairly impressive AI animations as well, but until now,
         | I assumed they were tediously cobbled together by hacking on
         | the still-image SD models.
        
       | youssefabdelm wrote:
       | Can't wait for these things to not suck
        
         | accrual wrote:
         | It's definitely pretty impressive already. If there could be
         | some kind of "final pass" to remove the slightly glitchy
          | generative artifacts, these look completely passable for simple
          | .gif/.webm header images. Especially if they could be made to
          | loop smoothly a la Snapchat's bounce filter.
        
       | accrual wrote:
       | Fascinating leap forward.
       | 
       | It makes me think of the difference between ancestral and non-
       | ancestral samplers, e.g. Euler vs Euler Ancestral. With Euler,
       | the output is somewhat deterministic and doesn't vary with
       | increasing sampling steps, but with Ancestral, noise is added to
       | each step which creates more variety but is more
       | random/stochastic.
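        | 
        | Simplified from the k-diffusion samplers, the only real
        | difference is whether fresh noise gets injected after each step
        | (rough sketch, treating sigmas as plain floats):
        | 
        |     import torch
        | 
        |     def euler_step(x, denoised, sigma, sigma_next):
        |         d = (x - denoised) / sigma       # noise direction
        |         return x + d * (sigma_next - sigma)
        | 
        |     def euler_ancestral_step(x, denoised, sigma, sigma_next):
        |         # same update, but re-inject noise each step, which is
        |         # why results keep changing as you add sampling steps
        |         up = min(sigma_next, (sigma_next**2 *
        |                  (sigma**2 - sigma_next**2) / sigma**2) ** 0.5)
        |         down = (sigma_next**2 - up**2) ** 0.5
        |         d = (x - denoised) / sigma
        |         return x + d * (down - sigma) + torch.randn_like(x) * up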
       | 
       | I assume to create video, the sampler needs to lean heavily on
       | the previous frame while injecting some kind of sub-prompt, like
       | rotate <object> to the left by 5 degrees, etc. I like the phrase
       | another commenter used, "temporal consistency".
       | 
       | Edit: Indeed the special sauce is "temporal layers". [0]
       | 
       | > Recently, latent diffusion models trained for 2D image
       | synthesis have been turned into generative video models by
       | inserting temporal layers and finetuning them on small, high-
       | quality video datasets
       | 
       | [0] https://stability.ai/research/stable-video-diffusion-
       | scaling...
        
         | adventured wrote:
         | The hardest problem the Stable Diffusion community has dealt
         | with in terms of quality has been in the video space, largely
         | in relation to the consistency between frames. It's probably
         | the most commonly discussed problem for example on
         | r/stablediffusion. Temporal consistency is the popular term for
         | that.
         | 
         | So this example was posted an hour ago, and it's jumping all
         | over the place frame to frame (somewhat weak temporal
         | consistency). The author appears to have used pretty straight-
         | forward text2img + Animatediff:
         | 
         | https://www.reddit.com/r/StableDiffusion/comments/180no09/on...
         | 
         | Fixing that frame to frame jitter related to animation is
         | probably the most in-demand thing around Stable Diffusion right
         | now.
         | 
         | Animatediff motion painting made a splash the other day:
         | 
         | https://www.reddit.com/r/StableDiffusion/comments/17xnqn7/ro...
         | 
         | It's definitely an exciting time around SD + animation. You can
         | see how close it is to reaching the next level of generation.
        
       | torginus wrote:
        | I admit I'm ignorant about these models' inner workings, but I
        | don't understand why text is the chosen input format for these
        | models.
        | 
        | It was the same for image generation, where one needed to produce
        | text prompts to create the image, and it took stuff like img2img
        | and ControlNet to allow things like controlling poses and
        | inpainting, or having multiple prompts with masks controlling
        | which part of the image is influenced by which prompt.
        
         | gorbypark wrote:
         | According to the GitHub repo this is an "image-to-video model".
          | They tease an upcoming "text to video" interface on the
         | linked landing page, though. My guess is that interface will
         | use a text-to-image model and then feed that into the image-to-
         | video model.
        
         | pizzafeelsright wrote:
         | Imago Deo? The Word is what is spoken when we create.
         | 
         | The input eventually becomes meanings mapped to reality.
        
       | awongh wrote:
       | It makes sense that they had to take out all of the cuts and
       | fades from the training data to improve results.
       | 
        | In the background section of the research paper they mention
       | "temporal convolution layers", can anyone explain what that is?
       | What sort of training data is the input to represent temporal
       | states between images that make up a video? Or does that mean
       | something else?
        
         | machinekob wrote:
          | I would assume it's something similar to joining multiple
          | frames/attentions in the channel dimension and then shifting
          | values around so the convolution has access to some channels
          | from other video frames.
          | 
          | I was working on a similar idea a few years ago using this
          | paper as a reference, and it worked extremely well for
          | consistency and also helped with flicker.
          | https://arxiv.org/abs/1811.08383
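          | 
          | The core trick from that paper (TSM) boils down to roughly
          | this: shift a slice of channels one frame forward and one
          | frame back before the 2D convolution, so each frame's conv
          | sees a bit of its neighbours (simplified sketch):
          | 
          |     import torch
          | 
          |     def temporal_shift(x, div=8):
          |         # x: (batch, frames, channels, h, w)
          |         b, t, c, h, w = x.shape
          |         fold = c // div
          |         out = torch.zeros_like(x)
          |         out[:, 1:, :fold] = x[:, :-1, :fold]       # forward
          |         out[:, :-1, fold:2*fold] = x[:, 1:, fold:2*fold]
          |         out[:, :, 2*fold:] = x[:, :, 2*fold:]      # untouched
          |         return out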
        
       | spaceman_2020 wrote:
       | A seemingly off topic question, but with enough compute and
       | optimization, could you eventually simulate "reality"?
       | 
       | Like, at this point, what are the technical counters to the
       | assertion that our world is a simulation?
        
         | refulgentis wrote:
          | A little too freshman's-first-hit-off-a-bong for me. There are,
          | of course, substantial differences between video and reality.
         | 
         | Let's steel-man -- you mean 3D VR. Let's stipulate there's a
         | headset today that renders 3D visually indistinguishable from
         | reality. We're still short the other 4 senses
         | 
         | Much like faith, there's always a way to sort of escape the
         | traps here and say "can you PROVE this is base reality"
         | 
          | The general technical argument against "brain in a vat being
          | stimulated" would be the computational expense of doing so, but
          | you can also write that off with the equivalent of foveated
          | rendering for all senses / entities.
        
         | 2-718-281-828 wrote:
         | > Like, at this point, what are the technical counters to the
         | assertion that our world is a simulation?
         | 
          | How about: this theory is neither verifiable nor falsifiable.
        
           | vidarh wrote:
           | The _general concept_ is not falsifiable, but many variations
           | might be, or their inverse might be. E.g. the theory that we
           | are _not_ in a simulation would in general be falsifiable by
           | finding an  "escape" from a simulation and so showing we are
           | in one (but not finding an escape of course tells us
           | nothing).
           | 
           | It's not a very useful endeavour to worry about, but it can
           | be fun to speculate about what might give rise to testable
           | hypotheses and what that might tell us about the world.
        
         | tracerbulletx wrote:
         | The brain does simulate reality in the sense that what you
         | experience isn't direct sensory input, but more like a dream
         | being generated to predict what it thinks is happening based on
         | conflicting and imperfect sensory input.
        
           | accrual wrote:
           | To illustrate your point, an easily accessible example of
           | this is how the second hand on clocks appears to freeze for
           | longer than a second when you quickly glance at it. The brain
           | is predicting/interpolating what it expects to see, creating
           | the illusion of a delay.
           | 
           | https://www.popsci.com/how-time-seems-to-stop/
        
           | danielbln wrote:
           | Take vision, for example: it comes in from the optic nerve
           | warped and upside down, as small patches of high resolution
           | captured by the eyes zigzagging across the visual field
           | (saccades), all of which is assembled and integrated into a
           | coherent field of vision by our trusty old grey blob.
        
         | beepbooptheory wrote:
         | Why does it matter? Not trying to dismiss, but truly, what
         | would it mean to you if you could somehow verify the
         | "simulation"?
         | 
         | If it _would_ mean something drastic to you, I would be very
         | curious to hear your preexisting existential
         | beliefs/commitments.
         | 
         | People say this sometimes and it's kind of slowly revealed to
         | me that it's just a new kind of geocentrism: it's not _just_ a
         | simulation people have in mind, but one where earth/humans
         | are centered, and the rest of the universe is just for the
         | benefit of "our" part of the simulation.
         | 
         | Which is a fine theory I guess, but is also just essentially
         | wanting God to exist with extra steps!
        
         | KineticLensman wrote:
         | (disclaimer: worked in the sim industry for 25 years, still
         | active in terms of physics-based rendering).
         | 
         | First off, there are zero technical proofs that we are in a
         | sim, just a number of philosophical arguments.
         | 
         | In practical terms, we cannot yet simulate a single human cell
         | at the molecular level, given the massive number of
         | interactions that occur every microsecond. Simulating our
         | entire universe is not technically possible within the lifetime
         | of our universe, according to our current understanding of
         | computation and physics. You either have to assume that 'the
         | sim' is very narrowly focussed in scope and fidelity, and / or
         | that the outer universe that hosts 'the sim' has laws of
         | physics that are essentially magic from our perspective. In
         | which case the simulation hypothesis is essentially a religious
         | argument, where the creator typed 'let there be light' into his
         | computer. If there isn't such a creator, the sim hypothesis
         | 'merely' suggests that our universe, at its lowest levels,
         | looks somewhat computational, which is an entirely different
         | argument.
        
           | freedomben wrote:
           | I don't think you would need to simulate the entire universe,
           | just enough of it that the consciousness receiving sense data
           | can't encounter any missing info or "glitches" in the
           | metaphorical matrix. Still hard of course, but substantially
           | less compute intensive than every molecule in the universe.
        
             | gcanyon wrote:
             | And if you're in charge of the simulation, you get to
             | decide how many "consciousnesses" there are, constraining
             | them to be within your available compute. Maybe that's ~8
             | billion -- maybe it's 1. Yeah, I'm feeling pretty
             | Boltzmann-ish right now...
        
             | KineticLensman wrote:
             | > but substantially less compute intensive than every
             | molecule in the universe
             | 
             | Very true, but to me this view of the universe and one's
             | existence within it as a sort of second-rate solipsist
             | bodge isn't a satisfyingly profound answer to the question
             | of life the universe and everything.
             | 
             | Although put like that it explains quite a lot.
             | 
             | [Edit] There is also a sense in which the sim-as-a-
             | focussed-mini-universe view is even less falsifiable,
             | because sim proponents address any doubt about the sim by
             | moving the goal posts to accommodate what they claim is
             | actually achievable by the putative creator/hacker on
             | Planet Tharg or similar.
        
             | kaashif wrote:
             | And you don't have to simulate it in real time, maybe 1
             | second here takes years or centuries to simulate outside
             | the simulation. It's not like we'd have any way to tell.
        
               | hackerlight wrote:
               | These are all open questions in philosophy of mind.
               | Nobody knows what causes consciousness/qualia, so
               | nobody knows whether it's substrate-dependent, and
               | therefore nobody knows if it can be simulated in a
               | computer; and if it can, nobody knows what type of
               | computer is required for consciousness to be a property
               | of the resulting simulation.
        
         | SXX wrote:
         | Actually it was already done by sentdex with GAN Theft Auto:
         | 
         | https://youtu.be/udPY5rQVoW0
         | 
         | To an extent...
         | 
         | PS: Video is 2 years old, but still really impressive.
        
       | epiccoleman wrote:
       | This is really, really cool. A few months ago I was playing with
       | some of the "video" generation models on Replicate, and I got
       | some really neat results[1], but it was very clear that the
       | resulting videos were made from prompting each "frame" with the
       | previous one. This looks like it can actually figure out how to
       | make something that has a higher level context to it.
       | 
       | It's crazy to see this level of progress in just a bit over half
       | a year.
       | 
       | [1]: https://epiccoleman.com/posts/2023-03-05-deforum-stable-
       | diff...
        
       | richthekid wrote:
       | This is gonna change everything
        
         | jetsetk wrote:
         | Is it? How so?
        
         | Chabsff wrote:
         | It's really not.
         | 
         | Don't get me wrong, this is insanely cool, but it's still a
         | long way from good enough to be truly disruptive.
        
           | echelon wrote:
           | One year.
           | 
           | All of Hollywood falls.
        
             | Chabsff wrote:
             | No offense, but this is absolutely delusional.
             | 
             | As long as people can "clock" content generated from these
             | models, it will be treated by consumers as low-effort
              | drivel, no matter how much actual artistic effort goes into
             | the exercise. Only once these systems push through the
             | threshold of being indistinguishable from artistry will all
             | hell break loose, and we are still very far from that.
             | 
             | Paint-by-numbers low-effort market-driven stuff will take a
             | hit for sure, but that's only a portion of the market, and
             | frankly not one I'm going to be missing.
        
               | ben_w wrote:
               | Very far, yes, but also in a fast moving field.
               | 
               | CGI in films used to be obvious all the time, no matter
               | how good the artists using it; now it's everywhere and
               | only noticeable when that's the point. The gap from Tron
               | to Fellowship of the Ring was 19.5 years.
               | 
               | My guess is the analogy here puts the quality of existing
               | genAI somewhere near the equivalent of early TV CGI,
               | given its use in one of the Marvel title sequences etc.,
               | but it is just an analogy and there's no guarantees of
               | anything either way.
        
               | r3d0c wrote:
               | Something unrelated improved over time, so something
               | else unrelated will also improve toward whatever goal
               | you've set in your mind?
               | 
               | Weird logic circles y'all keep making to justify your
               | beliefs. I mean, the world is very easy, like you just
               | described, if you completely strip out all nuance and
               | complexity.
               | 
               | People used to believe at the start of the space race
               | that we'd have Mars colonies by now, because they looked
               | at the rate of technological advancement from 1910 to
               | 1970, from the first flight to landing on the Moon; yet
               | that didn't happen, because everything doesn't follow
               | the same repeatable patterns.
        
               | pessimizer wrote:
               | People also believed that recorded music would destroy
               | the player piano industry and the market for piano rolls.
               | Just because recorded music is cheaper doesn't mean that
               | the audience will be willing to give up the actual sound
               | of a piano being played.
        
               | ben_w wrote:
               | First, lotta artists already upset with genAI and the
               | impact it has.
               | 
               | Second, I _literally_ wrote the same point you seem to
               | think is a gotcha:
               | 
               | > it is just an analogy and there's no guarantees of
               | anything either way
        
             | woeirua wrote:
              | Every time something like this is released, someone
              | comments how it's going to blow up the legacy studios. The
              | only way you can possibly think that is if: 1-the studios
              | themselves will somehow be prevented from using this tech,
             | and 2-that somehow customers will suddenly become amenable
             | to low grade garbage movies. Hollywood already produces
             | thousands of low grade B or C movies every year that cost
             | fractions of what it costs to make a blockbuster. Those
             | movies make almost nothing at the box office.
             | 
             | If anything, a deluge of cheap AI generated movies is going
             | to lead to a flight to quality. The big studios will be
             | more powerful because they will reap the productivity gains
             | and use traditional techniques to smooth out the rough
             | edges.
        
               | underscoring wrote:
               | > 2-that somehow customers will suddenly become amenable
               | to low grade garbage movies
               | 
               | People have been amenable to low grade garbage movies for
               | a long, long time. See Adam Sandler's back catalog.
        
           | evrenesat wrote:
           | In a few years' time, teenagers will be consuming shows and
           | films made by their peers, not by streaming providers.
           | They'll forgive and perhaps even appreciate the technical
           | imperfections for the sake of uncensored, original content
           | that fits perfectly with their cultural identity.
           | 
           | Actually, when processing power catches up, I'm expecting a
           | movie engine with well-defined characters, scenes, entities,
           | etc., so people will be able to share mostly text-based
           | scenarios to watch on their hardware players.
        
             | Chabsff wrote:
             | Similar to how all the kids today only play itch.io games
              | thanks to Unity and Unreal dramatically lowering the
              | barrier to entry into game development.
             | 
             | Oh wait... No.
             | 
              | All it has done is create an environment where indie
              | games are now assumed to be trash unless proven
              | otherwise, making getting traction as a small developer
              | orders of magnitude harder than it has ever been, because
              | their efforts are drowning in a sea of mediocrity.
              | 
              | The same thing is already starting to happen on YouTube
              | with AI content, and there's no reason for me to expect
              | this to go any other way.
        
               | evrenesat wrote:
                | It took ~2 years for my 10-year-old daughter to get
                | bored, give up the shitty user-made Roblox games, and
                | start playing on the Switch, Steam, or PS4.
        
             | nwienert wrote:
              | They do that now (I forget the name, but there's a
              | popular one my niece uses to make animated comics; others
              | do similar things in Minecraft, etc.), and they have been
              | doing it since forever - nearly 30 years ago my friends
              | and I were scribbling comic panels into our notebooks and
              | sharing them around class.
        
       | nbzso wrote:
       | Model chain:
       | 
       | Instance One: Act as a top tier Hollywood scenarist, use the
       | public available data for emotional sentiment to generate a
       | storyline, apply the well known archetypes from proven
       | blockbusters for character development. Move to instance two.
       | 
       | Instance Two: Act as top tier producer. {insert generated
       | prompt}. Move to instance three.
       | 
       | Instance Three: Generate Meta-humans and load personality traits.
       | Move to instance four.
       | 
       | Instance Four: Act as a top tier director.{insert generated
       | prompt}. Move to instance five.
       | 
       | Instance Five: Act as a top tier editor.{insert generated
       | prompt}. Move to instance six.
       | 
       | Instance Six: Act as a top tier marketing and advertisement
       | agency.{insert generated prompt}. Move to instance seven.
       | 
       | Instance Seven: Act as a top tier accountant, generate an
       | interface to real-time ROI data and give me the results on an
       | optimized timeline into my AI induced dream.
       | 
       | Personal GPT: Buy some stocks, diversify my portfolio, stock up
       | on synthetic meat, bug-coke and Soma. Call my mom and tell her I
       | made it.
        
       | aliljet wrote:
       | I've been following this space very, very closely, and the
       | killer feature would be the ability to generate these full-
       | featured videos for longer than a few seconds with consistently
       | shaped "characters" (e.g., flowers, grass, houses, cars,
       | actors, etc.). Right now, it's not clear to me that this is
       | achieving that objective. This feels like it could be great to
       | create short GIFs, but at what cost?
       | 
       | To be clear, this remains wicked, wicked, wicked exciting.
        
       | speedgoose wrote:
       | Has anyone managed to run the thing? I got the streamlit demo to
       | start after fighting with pytorch, mamba, and pip for half an
       | hour, but the demo runs out of GPU memory after a little while.
       | The machine I used has 24GB of GPU memory; does it need more?
        
         | mkaic wrote:
         | I've heard from others attempting it that it needs 40GB, so
         | basically an A100/A6000/H100 or other large card. Or an Apple
         | Silicon Mac with a bunch of unified memory, I guess.
        
           | mlboss wrote:
           | Give it a week.
        
           | speedgoose wrote:
           | Alright thanks for the information. I will try to justify
           | using one A100 for my "very important" research activities.
        
         | skonteam wrote:
         | Yeah, I got it running on a 24GB 4090; try reducing the
         | number of frames decoded at a time to something like 4 or 8.
         | Keep in mind, though, that it maxes out the 24GB and spills
         | over into system RAM (with the latest NVIDIA drivers).
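         | 
         | (If you end up going through diffusers instead of the
         | streamlit demo, the equivalent knob there should be
         | decode_chunk_size -- a rough, unverified sketch assuming the
         | announced SVD pipeline support:)
         | 
         |     import torch
         |     from diffusers import StableVideoDiffusionPipeline
         |     from diffusers.utils import load_image, export_to_video
         | 
         |     pipe = StableVideoDiffusionPipeline.from_pretrained(
         |         "stabilityai/stable-video-diffusion-img2vid-xt",
         |         torch_dtype=torch.float16, variant="fp16",
         |     )
         |     pipe.enable_model_cpu_offload()   # trade speed for VRAM headroom
         | 
         |     image = load_image("input.png").resize((1024, 576))
         |     # Decode only a few frames at a time to keep VRAM usage down.
         |     frames = pipe(image, decode_chunk_size=4).frames[0]
         |     export_to_video(frames, "output.mp4", fps=7)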
        
           | speedgoose wrote:
           | Oh yes it works, thanks!
        
         | nwoli wrote:
         | Is the checkpoint fp16 or fp32 by default?
        
       | neaumusic wrote:
       | It's funny that we still don't really have video wallpapers on
       | most devices (I'm only aware of Wallpaper Engine on Windows).
        
       | pcj-github wrote:
       | Soon the Hollywood strike won't even matter; we won't need any
       | of those jobs. Entire West Coast economy obliterated.
        
       | jonplackett wrote:
       | Will this be available in the Stability API any time soon?
        
       | chrononaut wrote:
       | Much like in static images, the subtle unintended imperfections
       | are quite interesting to observe.
       | 
       | For example, the man in the cowboy hat looks like he is almost
       | gagging. In the train video, the tracks seem too wide while the
       | train ice-skates across them.
        
       ___________________________________________________________________
       (page generated 2023-11-21 23:00 UTC)