[HN Gopher] Stable Fast 3D: Rapid 3D Asset Generation from Single Images
___________________________________________________________________
Stable Fast 3D: Rapid 3D Asset Generation from Single Images
Author : meetpateltech
Score : 239 points
Date : 2024-08-01 15:16 UTC (7 hours ago)
(HTM) web link (stability.ai)
(TXT) w3m dump (stability.ai)
| talldayo wrote:
| > 0.5 seconds per 3D asset generation on a GPU with 7GB VRAM
|
| Holy cow - I was thinking this might be one of those datacenter-
| only models but here I am proven wrong. 7GB of VRAM suggests this
| could run on a lot of hardware that 3D artists own already.
| msp26 wrote:
| You can interact with the models on their project page:
| https://stable-fast-3d.github.io/
| calini wrote:
| I'm going to 3D print so much dumb stuff with this.
| jsheard wrote:
| They're still hesitant to show the untextured version of the
| models so I would assume it's like previous efforts where most
| of the detail is in the textures, and the model itself, the
| part you would 3D print, isn't so impressive.
| yazzku wrote:
| I was going to make the same comment; these 3D reconstructions
| often generate a mess of a topology, and this post does not
| show any of the mesh triangulations, so I assume they're
| still not good. Arguably, the meshes are bad even for
| rendering.
| dlivingston wrote:
| Presumably, these meshes can be cleaned up using standard
| mesh refinement algorithms, like those found in MeshLab:
| https://www.meshlab.net/#features
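| MeshLab's filters are also scriptable from Python via pymeshlab.
| A minimal cleanup sketch (the filter names below come from recent
| pymeshlab releases and may differ by version; the file names are
| just examples):
|
|     import pymeshlab
|
|     ms = pymeshlab.MeshSet()
|     ms.load_new_mesh("generated.obj")  # export the .glb to .obj first
|
|     # Basic hygiene: drop duplicate/unreferenced geometry
|     ms.meshing_remove_duplicate_vertices()
|     ms.meshing_remove_unreferenced_vertices()
|
|     # Retopologize toward evenly sized triangles, then decimate
|     ms.meshing_isotropic_explicit_remeshing()
|     ms.meshing_decimation_quadric_edge_collapse(targetfacenum=5000)
|
|     ms.save_current_mesh("cleaned.obj")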
| Keyframe wrote:
| Hopefully that's in the (near) future, but as of now
| there still exists 'retopo' in 3D work for a reason. Just
| like roto and similar menial tasks. We're getting there
| with automation though.
| mft_ wrote:
| You can download a .glb file (from the HuggingFace demo page)
| and open it locally (e.g. in MS 3D Viewer). I'm looking at a
| mesh from one of the better examples I tried and it's
| actually pretty good...
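| If you'd rather poke at the mesh programmatically than in a
| viewer, a minimal sketch with the trimesh library (the file name
| is just an example; details may vary by trimesh version):
|
|     import trimesh
|
|     # force="mesh" concatenates the GLB scene into a single mesh
|     mesh = trimesh.load("output.glb", force="mesh")
|
|     print("vertices:", len(mesh.vertices))
|     print("faces:", len(mesh.faces))
|     print("watertight:", mesh.is_watertight)  # matters for 3D printing
|
|     mesh.show()  # interactive viewer (needs pyglet installed)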
| jayd16 wrote:
| You know, I do wonder about this. If it's just for static
| assets does it really matter? In something like Unreal, the
| textures are going to be virtualized and the geometry is
| going to be turned into LOD'd triangle soup anyway.
|
| Has anyone tried to build an Unreal scene with these
| generated meshes?
| jsheard wrote:
| Usually the problem is that the model itself is severely lacking
| in detail; sure, Nanite could make light work of a poorly
| _optimized_ model, but it's not going to fix the model
| being a vague blob which doesn't hold up to close scrutiny.
| kaibee wrote:
| Generate the accompanying normal map and then just
| tesselate it?
| andybak wrote:
| So don't use them in a context where they require close
| scrutiny?
| fragmede wrote:
| hueforge
| bloopernova wrote:
| Closer and closer to the automatic mapping drones from
| _Prometheus_.
|
| I wonder what the optimum group of technologies is that would
| enable that kind of mapping? Would you pile on LIDAR, RADAR, this
| tech, ultrasound, magnetic sensing, etc etc. Although, you're
| then getting a flying tricorder. Which could enable some cool
| uses even outside the stereotypical search and rescue.
| nycdatasci wrote:
| High-res images from multiple perspectives should be
| sufficient. If you have a consumer drone, this product (no
| affiliation) is extremely impressive:
| https://www.dronedeploy.com/
|
| You basically select an area on a map that you want to model in
| 3d, it flies your drone (take-off, flight path, landing), takes
| pictures, uploads to their servers for processing, generates
| point cloud, etc. Very powerful.
| thetoon wrote:
| What you could do with WebODM is already quite impressive
| alsodumb wrote:
| Are you talking about mapping tunnels with drones? That's
| already done and it doesn't really need any 'AI': it's plain
| old SLAM.
|
| DARPA's subterranean challenge had many teams that did some
| pretty cool stuff in this direction:
| https://spectrum.ieee.org/darpa-subterranean-challenge-26571...
| sorenjan wrote:
| You don't need or want generative AI for mapping, you "just"
| need lidar and drones for slam.
|
| https://www.youtube.com/watch?v=1CWWP9jb4cE
| pzo wrote:
| You already have Depth Anything V2, which can generate depth maps
| in realtime even on an iPhone. Quality is pretty good and will
| probably improve further. Actually, in many ways those depth maps
| are much better quality than the iPhone LiDAR or TrueDepth camera
| (which cannot handle transparent, metallic, or reflective
| surfaces, and are also quite noisy).
|
| https://github.com/DepthAnything/Depth-Anything-V2
|
| https://huggingface.co/spaces/pablovela5620/depth-compare
|
| https://huggingface.co/apple/coreml-depth-anything-v2-small
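| A minimal sketch of running it locally via the transformers
| depth-estimation pipeline (the checkpoint id is an assumption
| based on the links above; check the hub for the exact name):
|
|     from transformers import pipeline
|     from PIL import Image
|
|     depth = pipeline("depth-estimation",
|                      model="depth-anything/Depth-Anything-V2-Small-hf")
|
|     result = depth(Image.open("photo.jpg"))
|
|     # "depth" is a ready-to-view PIL image;
|     # "predicted_depth" holds the raw tensor
|     result["depth"].save("depth_map.png")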
| specproc wrote:
| Be still my miniature-painting heart.
| timr wrote:
| For all of the hype around LLMs, this general area (image
| generation and graphical assets) seems to me to be the big long-
| term winner of current-generation AI. It hits the sweet spot for
| the fundamental limitations of the methods:
|
| * so-called "hallucination" (actually just how generative models
| work) is a feature, not a bug.
|
| * anyone can easily _see_ the unrealistic and biased outputs
| without complex statistical tests.
|
| * human intuition is useful for evaluation, and not fundamentally
| misleading (i.e. the equivalent of _"this text sounds fluent, so
| the generator must be intelligent!"_ hype doesn't really exist
| for imagery. We're capable of treating it as _technology_ and
| evaluating it fairly, because there's no equivalent human
| capability.)
|
| * even lossy, noisy, collapsed and over-trained methods can be
| valuable for different creative pursuits.
|
| * perfection is not required. You can easily see distorted
| features in output, and iteratively try to improve them.
|
| * consistency is not required (though it will unlock hugely
| valuable applications, like video, should it ever arrive).
|
| * technologies like LoRA allow even unskilled users to train
| character-, style- or concept-specific models with ease.
|
| I've been amazed at how much better image / visual generation
| models have become in the last year, and IMO, the pace of
| improvement has not been slowing as much as text models.
| Moreover, it's becoming increasingly clear that the future isn't
| the wholesale replacement of photographers, cinematographers,
| etc., but rather, a generation of crazy AI-based power tools that
| can do things like add and remove _concepts_ in imagery with a
| few text prompts. It's insanely useful, and just like Photoshop
| in the 90s, a new generation of power-users is already emerging,
| and doing wild things with the tools.
| CuriouslyC wrote:
| Image models are a great way to understand generative AI. It's
| like surveying a battlefield from the air as opposed to the
| ground.
| ibash wrote:
| > anyone can easily see the unrealistic outputs without complex
| statistical tests.
|
| This is key, we're all pre-wired with fast correctness tests.
|
| Are there other data types that match this?
| batch12 wrote:
| Audio to a lesser degree
| sounds wrote:
| Software (I mean the product, not the code)
|
| Mundane tasks that can be visually inspected at the end
| (cleaning, organizing, maintenance and mechanical work)
| leetharris wrote:
| > For all of the hype around LLMs, this general area (image
| generation and graphical assets) seems to me to be the big
| long-term winner of current-generation AI. It hits the sweet
| spot for the fundamental limitations of the methods:
|
| I am biased (I work at Rev.com and Rev.ai), but I totally agree
| and would add one more thing: transcription. Accurate human
| transcription takes a really, really long time to do right.
| Often a ratio of 3:1-10:1 of transcriptionist time to original
| audio length.
|
| Though ASR is only ~90-95% accurate on many "average" recordings,
| it is often 100% accurate on high-quality audio.
|
| It's not only a cost savings thing, but there are entire
| industries that are popping up around AI transcription that
| just weren't possible before with human speed and scale.
| timr wrote:
| I agree. I think it's more of a niche use-case than image
| models (and fundamentally harder to evaluate), but
| transcription and summarization is my current front-runner
| for winning use-case of LLMs.
|
| That said, "hallucination" is more of a fundamental problem
| for this area than it is for imagery, which is why I still
| think imagery is the most interesting category.
| llm_trw wrote:
| Are there any models that can do diarization well yet?
|
| I need one for a product and the state of the art, e.g.
| pyannote, is so bad it's better to not use them.
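| For context, running it locally is only a few lines; a minimal
| sketch (the checkpoint id and the HF token needed to fetch the
| gated weights are assumptions; check the pyannote docs):
|
|     from pyannote.audio import Pipeline
|
|     # Downloads the weights once, then runs fully locally
|     pipeline = Pipeline.from_pretrained(
|         "pyannote/speaker-diarization-3.1",
|         use_auth_token="hf_...",  # token only used to fetch weights
|     )
|
|     diarization = pipeline("meeting.wav")
|     for turn, _, speaker in diarization.itertracks(yield_label=True):
|         print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")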
| throw03172019 wrote:
| Deepgram has been pretty good for our product. Fast and
| fairly accurate for English.
| llm_trw wrote:
| Do they have a local model?
|
| I keep getting burned by APIs with stupid restrictions
| that make use cases impossible which would be trivial if you
| could run the thing locally.
| toddmorey wrote:
| Also the other way around: text to speech. We're at the point
| where I can finally listen to computer generated voice for
| extended periods of time without fatigue.
|
| There was a project mentioned here on HN where someone was
| creating audio book versions of content in the public domain
| that would never have been converted through the time and
| expense of human narrators because it wouldn't be
| economically feasible. That's a huge win for accessibility.
| Screen readers are also about to get dramatically better.
| qup wrote:
| > a project mentioned here on HN where someone was creating
| audio book versions of content in the public domain
|
| Maybe this: https://news.ycombinator.com/item?id=40961385
| toddmorey wrote:
| That's the one! Thanks!
| mitthrowaway2 wrote:
| > it's becoming increasingly clear that the future isn't the
| wholesale replacement of photographers, cinematographers, etc.
|
| I'd refrain from making any such statements about the future;*
| the pace of change makes it hard to see the horizon beyond a
| few years, especially relative to the span of a career. It's
| already wholesale-replacing many digital artists and editorial
| illustrators, and while it's still early, there's a clear push
| starting in the cinematography direction. (I fully agree with
| the rest of your comment, and it's strange how much diffusion
| models seem to be overlooked relative to LLMs when people think
| about AI progress these days.)
|
| * (edit: about the future _impact of AI on jobs_.)
| timr wrote:
| I mean, my whole comment is a prediction of the future, so
| that's water under the bridge. Maybe you're right and this is
| the start of the apocalypse for digital artists, but it feels
| more like photoshop in 1990 to me -- and people were saying
| the same stuff back then.
|
| > It's already wholesale-replacing many digital artists and
| editorial illustrators
|
| I think you're going to need to cite some data on a claim
| like that. Maybe it's replacing the fiverr end of the market?
| It's certainly much harder to justify paying someone to
| generate a (bad) logo or graphic when a diffusion model can
| do the same thing, but there's no way that a model, today,
| can replace a _skilled_ artist. Or said differently: a
| skilled artist, combined with a good AI model, is vastly more
| productive than an unskilled artist with the same model.
| cjbgkagh wrote:
| What happens when the AI takes the low end of the market is
| that the people who catered to the low end now have to try
| to compete more in the mid-to-high end. The mid end facing
| increased competition has to try to move up to the high
| end. So while AI may not be able to compete directly with
| the high end it will erode the negotiating power and thus
| the earning potential of the high end.
| sroussey wrote:
| We have watched this same process repeat a few times over
| the last century with photography.
| timr wrote:
| Or graphic design, or video editing, or audio mastering,
| or...every new tool has come with a bunch of people
| saying things like _"what will happen to the linotype
| operators!?"_
|
| I sort of hate this line of argument, but it also has
| been manifestly true of the past, and rhymes with the
| present.
| llm_trw wrote:
| >For all of the hype around LLMs, this general area (image
| generation and graphical assets) seems to me to be the big
| long-term winner of current-generation AI.
|
| Let me show you the future:
| https://www.youtube.com/watch?v=eVlXZKGuaiE
|
| This is an LLM controlling an embodied VR body in a physics
| simulation.
|
| It is responding to human voice input not only with voice but
| body movements.
|
| Transformers aren't just chatbots, they are general symbolic
| manipulation machines. Anything that can be expressed as a
| series of symbols is a thing they can do.
| latentsea wrote:
| >This is an LLM controlling an embodied VR body in a physics
| simulation.
|
| No it's not. It's VAM that is controlling the character; it's
| literally just using a bog-standard LLM as a chatbot and
| feeding the text into a plugin in VAM, and VAM itself does the
| animation. Don't get me wrong, it's absolutely next level to
| experience chatbots this way, but it's still a chatbot.
| llm_trw wrote:
| The animation, not the movement decisions.
|
| This is as naive as calling an industrial robot 'just a
| calculator'.
| kkukshtel wrote:
| > This general area (image generation and graphical assets)
| seems to me to be the big long-term winner of current-
| generation AI
|
| I think it's easy to totally miss that LLMs are just being
| completely and quietly subsumed into a ton of products. They
| have been far more successful, and many image generation models
| use LLMs on the backend to generate "better" prompts for the
| models themselves. LLMs are the bedrock.
| derefr wrote:
| I would argue the opposite -- image generation is the clear
| loser. If you've ever tried to do it yourself, grabbing a bunch
| of LoRAs from Civitai to try to convince a model to draw
| something it doesn't initially know how to draw -- it becomes
| clear that there's far too much unavoidable correlation between
| "form" and "representation" / "style" going on in even a SOTA
| diffusion model's hidden layers.
|
| Unlike LLMs, which really seem to translate the text into
| "concepts" at a certain embedding layer, the (current, 2D)
| diffusion models will store (and thus require to be trained on)
| a completely different idea of a thing, if it's viewed from a
| slightly different angle, or is a different size. Diffusion
| models can _interpolate_ but not _extrapolate_ -- they can't
| see a prompt that says "lion goat dragon monster" and come up
| with the ancient-greek Chimera, unless they've actually been
| _trained on_ a Chimera. You can tell them "asian man, blond
| hair" -- and if their training dataset contains asian men and
| men with blonde hair but never _at the same time_, then they
| _won't_ be able to "hallucinate" a blond asian man for you,
| because that won't be an established point in the model's
| latent space.
|
| ---
|
| On a tangent: IMHO the _true_ breakthrough would be a model for
| "text to textured-3D-mesh" -- where it builds the model out of
| parts that it shapes individually and assembles in 3D space not
| out of tris, but _by writing/manipulating tokens representing
| shader code_ (i.e. it creates "procedural art"); and then it
| consistency-checks itself at each step not just against a
| textual embedding, but _also_ against an arbitrary (i.e.
| controlled for each layer at runtime by data) set of 2D
| projections that can be decoded out _to_ textual embeddings.
|
| (I imagine that such a model would need some internal
| "blackboard" of representational memory that it can set up
| arbitrarily-complex "lenses" for between each layer -- i.e. a
| camera with an arbitrary projection matrix, through which is
| read/written a memory matrix. This would allow the model to
| arbitrarily re-project its internal working visual "conception"
| of the model between each step, _in a way controllable by the
| output of each step_. Just like a human would rotate and zoom a
| 3D model while working on it[1]. But (presumably) with all the
| edits needing a particular perspective done in parallel on the
| first layer where that perspective is locked in.)
|
| Until we have something like that, though, all we're really
| getting from current {text,image}-to-{image,video} models is
| the parallel layered inpainting of a _decently, but not
| remarkably_ exhaustive pre-styled patch library, with each
| patch of each layer being applied with an arbitrary Photoshop-
| like "layer effect" (convolution kernel.) Which is the big
| reason that artists get mad at AI for "stealing their work" --
| but also why the results just aren't very flexible. Don't have
| a patch of a person's ear with a big earlobe seen in profile?
| No big-earlobe ear in profile for you. It either becomes a
| small-earlobe ear or the whole image becomes not-in-profile.
| (Which is an improvement from earlier models, where _just the
| ear_ became not-in-profile.)
|
| [1] Or just like our _minds_ are known to rotate and zoom
| objects in our "spatial memory" to snap them into our mental
| visual schemas!
| earthnail wrote:
| I think you're arguing about slightly different things. OP
| said that image generation is useful despite all its
| shortcomings, and that the shortcomings are easy to deal with
| for humans. OP didn't argue that the image generation AIs are
| actually smart. Just that they are useful tech for a variety
| of use cases.
| mrandish wrote:
| > Until we have something like that...
|
| The kind of granular, human-assisted interaction interface
| and workflow you're describing is, IMHO, the high-value path
| for the evolution of AI creative tools for non-text
| applications such as imaging, video and music, etc. Using a
| single or handful of images or clips as a starting place is
| good but as a semi-talented, life-long aspirational creative,
| current AI generation isn't that practically useful to me
| without the ability to interactively guide the AI toward what
| I want in more granular ways.
|
| Ideally, I'd like an interaction model akin to real-time
| collaboration. Due to my semi-talent, I've often done initial
| concepts myself and then worked with more technically
| proficient artists, modelers, musicians and sound designers
| to achieve my desired end result. By far the most valuable
| such collaborations weren't necessarily with the most
| technically proficient implementers, but rather those who had
| the most evolved real-time collaboration skills. The 'soft
| skill' of interpreting my directional inputs and then
| interactively refining or extrapolating them into new options
| or creative combinations proved simply invaluable.
|
| For example, with graphic artists I've developed a strong
| preference for working with those able to start out by
| collaboratively sketching rough ideas on paper in real-time
| before moving to digital implementation. The interaction and
| rapid iteration of tossing evolving ideas back and forth
| tended to yield vastly superior creative results. While I
| don't expect AI-assisted creative tools to reach anywhere
| near the same interaction fluidity as a collaboratively-
| gifted human anytime soon, even minor steps in this direction
| will make such tools far more useful for concepting and
| creative exploration.
| derefr wrote:
| ...but I wasn't describing a "human-assisted interaction
| interface and workflow." I was describing a different way
| for an AI to do things "inside its head" in a feed-forward
| span-of-a-few-seconds inference pass.
| thrance wrote:
| Honestly, I have yet to see an AI-generated image that makes me
| go "oh wow". It's missing those last 10 percent that always
| seem to elude neural networks.
|
| Also, the very bad press gen AI gets is very much slowing down
| adoption, particularly among creative-minded people, who would
| be the most likely users.
| jokethrowaway wrote:
| Hop on civitai
|
| There are plenty of mind-blowing images.
| quantumwoke wrote:
| Great result. Just had a play around with the demo models and
| they preserve structure really nicely, although the textures are
| still not great. It's kind of a voxelized version of the input
| image.
| mft_ wrote:
| I'm really excited for something in this area to really deliver,
| and it's really cool that I can just drag pictures into the demo
| on HuggingFace [0] to try it.
|
| However... mixed success. It's not good with (real) cats yet -
| which was obvs the first thing I tried. It did reasonably well
| with a simple image of an iPhone, and actually pretty
| impressively with a pancake with fruit on top, terribly with a
| rocket, and impressively again with a rack of pool balls.
|
| [0] https://huggingface.co/spaces/stabilityai/stable-fast-3d
| kleiba wrote:
| This is good news for the indie game dev scene, I suppose?
| jayd16 wrote:
| The models aren't really optimized for game dev. Fine for
| machinima, probably.
| ww520 wrote:
| This is a great step forward.
|
| I wonder whether RAG-based 3D animation generation can be done
| with this (a rough sketch follows the list):
|
| 1. Textual description of a story.
|
| 2. Extract/generate keywords from the story using LLM.
|
| 3. Search and look up 2D images by the keywords.
|
| 4. Generate 3D models from the 2D images using Stable Fast 3D.
|
| 5. Extract/generate path description from the story using LLM.
|
| 6. Generate movement/animation/gait using some AI.
|
| ...
|
| 7. Profit??
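|
| A rough skeleton of how those steps could chain together (every
| helper below is a hypothetical placeholder, not a real API; it
| only shows the data flow):
|
|     def extract_keywords(story: str) -> list[str]:
|         # Step 2: ask an LLM for the key entities in the story (stubbed)
|         return ["knight", "dragon", "castle"]
|
|     def search_image(keyword: str) -> str:
|         # Step 3: look up a 2D reference image for the keyword (stubbed)
|         return f"images/{keyword}.png"
|
|     def image_to_3d(image_path: str) -> str:
|         # Step 4: run Stable Fast 3D on the image, return a mesh path (stubbed)
|         return image_path.replace(".png", ".glb")
|
|     def extract_paths(story: str, assets: dict) -> dict:
|         # Step 5: ask an LLM for per-asset movement paths (stubbed)
|         return {name: [(0, 0, 0), (1, 0, 2)] for name in assets}
|
|     def animate(assets: dict, paths: dict) -> None:
|         # Step 6: hand meshes and paths to an animation/gait model (stubbed)
|         for name, path in paths.items():
|             print(f"animating {assets[name]} along {path}")
|
|     story = "A knight rides from the castle to face the dragon."
|     assets = {kw: image_to_3d(search_image(kw))
|               for kw in extract_keywords(story)}
|     animate(assets, extract_paths(story, assets))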
| nwoli wrote:
| Pre-generate a bunch of images via SDXL, convert them to 3D, and
| then serve the nearest mesh after querying.
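| A minimal sketch of the lookup side, using CLIP embeddings so a
| text query can be matched against the pre-generated images (the
| library paths are hypothetical):
|
|     import numpy as np
|     from PIL import Image
|     from sentence_transformers import SentenceTransformer
|
|     # Hypothetical pre-baked library: SDXL image -> Stable Fast 3D mesh
|     library = {"images/robot.png": "meshes/robot.glb",
|                "images/chair.png": "meshes/chair.glb"}
|
|     model = SentenceTransformer("clip-ViT-B-32")  # embeds images and text
|     image_embs = model.encode([Image.open(p) for p in library])
|     mesh_paths = list(library.values())
|
|     def nearest_mesh(query: str) -> str:
|         # Cosine similarity between the query text and library images
|         q = model.encode(query)
|         sims = image_embs @ q / (
|             np.linalg.norm(image_embs, axis=1) * np.linalg.norm(q))
|         return mesh_paths[int(np.argmax(sims))]
|
|     print(nearest_mesh("a shiny humanoid robot"))  # -> meshes/robot.glb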
| nwoli wrote:
| Huggingface space to try it
| https://huggingface.co/spaces/stabilityai/stable-fast-3d
| Y_Y wrote:
| It really looks like they've been doing that classic infomercial
| tactic of desaturating the images of the things they're comparing
| against to make theirs seem better.
| woolion wrote:
| This is the third image-to-3D AI I've tested, and in all cases
| the examples they give look like 2D renders of 3D models already.
| My tests were with cel-shaded images (cartoony, not with
| realistic lighting) and the model outputs something very flat but
| with very bad topology, which is worse than starting with a low
| poly or extruding the drawing. I suspect it is unable to give
| decent results without accurate shadows from which the normal
| vectors could be recomputed and thus lacks any 'understanding' of
| what the structure would be from the lines and forms.
|
| In any case it would be cool if they specified the set of inputs
| that is expected to give decent results.
| quitit wrote:
| It might not just be your tests.
|
| All of my tests of img2mesh technologies have produced poor
| results, even when using images that are very similar to the
| ones featured in their demo. I've never got fidelity like what
| they've shown.
|
| I'll give this a whirl and see if it performs better.
| quitit wrote:
| Tried it with a collection of images, and in my opinion it
| performs -worse- than earlier releases.
|
| It is however fast.
| woolion wrote:
| All right, I was hesitating to try shading some images to see
| if that improves the quality. It's probably still too early.
| diggan wrote:
| What stuck out to me from this release was this:
|
| > Optional quad or triangle remeshing (adding only 100-200ms to
| processing time)
|
| But it seems to be optional. Did you try it with that
| turned on? I'd be very interested in those results, as I had
| the same experience as you: the models don't generate good
| enough meshes, so I was hoping this one would be a bit better at
| that.
|
| Edit: I just tried it out myself on their Huggingface demo and
| even with the predefined images they have there, the mesh
| output is just not good enough. https://i.imgur.com/e6voLi6.png
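| If you want to script the Space rather than click through it, the
| gradio_client package can drive it; a minimal sketch (the endpoint
| name and parameter order are assumptions read off the demo UI, so
| check view_api() for the real signature):
|
|     from gradio_client import Client
|
|     client = Client("stabilityai/stable-fast-3d")
|     print(client.view_api())  # lists the actual endpoints and parameters
|
|     glb_path = client.predict(
|         "input.png",             # source image
|         0.85,                    # foreground ratio (assumed default)
|         "triangle",              # remesh option: "none" | "triangle" | "quad" (assumed)
|         api_name="/run_button",  # assumed endpoint name
|     )
|     print(glb_path)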
| nextworddev wrote:
| For those reading from Stability - just tried it - API seems to
| be down and the notebook doesn't have the example code it claimed
| to have.
| hansc wrote:
| Looks very good on examples, but testing a few Ikea chairs or a
| Donald Duck image gives very wrong results.
|
| You can test here:
| https://huggingface.co/spaces/stabilityai/stable-fast-3d
| ksec wrote:
| Given that the graphics asset part of AA or AAA games is the
| most expensive, I wonder if 3D asset generation could perhaps
| drastically lower that cost by 50% or more? At least for the
| same output. Because in reality I guess artists will just spend
| more time in other areas.
| causi wrote:
| Man it would be so cool to get AI-assisted photogrammetry.
| Imagine that instead of taking a hundred photos or a slow scan
| and having to labor over a point cloud, you could just take like
| three pictures and then go down a checklist. "Is this circular?
| How long is this straight line? Is this surface flat? What's the
| angle between these two pieces?" and get a perfect replica or
| even a STEP file out of it. Heaven for 3D printers.
| puppycodes wrote:
| I really can't wait for this technology to improve. Unfortunately,
| just from testing this, it seems not very useful. It takes more
| work to modify the bad model it approximates from the image
| than to start from scratch with a good foundation. I would
| rather see something that took a series of steps to slowly reach
| a higher quality end product instead of expecting everything to
| come from one image. Perhaps I'm missing the use case?
| MrTrvp wrote:
| Perhaps it'll require a series of segmentations and transforms
| that improve individual components and then work up towards
| the full 3D model of the image.
| andybak wrote:
| > not very useful
|
| Useful for what? I think use cases will emerge.
|
| A lot of critiques assume you're working in VFX or game
| development. Making image-to-3D (and by extension text-to-image-
| to-3D) effortless opens up a whole host of new applications,
| which might not be anywhere near so demanding.
| fsloth wrote:
| Not the holy grail yet, but pretty cool!
|
| I see these being usable not as main assets, but as something you
| would add as a low-effort embellishment to add complexity to the
| main scene. The fact that they maintain their profile makes them
| usable for situations where a mere 2D billboard impostor (i.e. the
| original image always oriented towards the camera) would not cut
| it.
|
| You can totally create a figure image (Midjourney|Bing|Dalle3),
| drag and drop it into the image input, and get a surprisingly good
| 3D presentation that is not a highly detailed model, but is
| something you could very well put on a shelf in a 3D scene as an
| embellishment where the camera never sees the back of it and the
| model is never at the center of attention.
| abidlabs wrote:
| Official Gradio demo is here:
| https://huggingface.co/spaces/stabilityai/stable-fast-3d
___________________________________________________________________
(page generated 2024-08-01 23:01 UTC)