[HN Gopher] Meta Segment Anything Model 3
       ___________________________________________________________________
        
       Meta Segment Anything Model 3
        
       Author : lukeinator42
       Score  : 158 points
       Date   : 2025-11-19 17:14 UTC (5 hours ago)
        
 (HTM) web link (ai.meta.com)
 (TXT) w3m dump (ai.meta.com)
        
       | fzysingularity wrote:
       | SAM3 is cool - you can already do this more interactively on
       | chat.vlm.run [1], and do much more. It's built on our new Orion
       | [2] model; we've been able to integrate with SAM and several
       | other computer-vision models in a truly composable manner. Video
       | segmentation and tracking is also coming soon!
       | 
       | [1] https://chat.vlm.run
       | 
       | [2] https://vlm.run/orion
        
         | visioninmyblood wrote:
          | Wow, this is actually pretty cool. I was able to segment out
          | the people and the dog in the same chat.
         | https://chat.vlm.run/chat/cba92d77-36cf-4f7e-b5ea-b703e612ea...
        
           | fzysingularity wrote:
           | Nice, that's pretty neat.
        
       | yeldarb wrote:
       | We (Roboflow) have had early access to this model for the past
       | few weeks. It's really, really good. This feels like a seminal
       | moment for computer vision. I think there's a real possibility
       | this launch goes down in history as "the GPT Moment" for vision.
       | The two areas I think this model is going to be transformative in
       | the immediate term are for rapid prototyping and distillation.
       | 
        | Two years ago we released autodistill[1], an open source
        | framework that uses large foundation models to create training
        | data for small realtime models. I'm convinced the idea was
        | right, but too early; there wasn't a big model good enough to
        | be worth distilling from back then. SAM3 is finally that model
        | (and it will be available in Autodistill today).
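        | 
        | As a concrete sketch of that flow, here's the Autodistill loop
        | with today's GroundedSAM base model as a stand-in (the SAM3
        | module name and interface are my assumption until it ships):
        | 
        |   # pip install autodistill autodistill-grounded-sam \
        |   #   autodistill-yolov8
        |   from autodistill.detection import CaptionOntology
        |   # Stand-in base model; swap in the SAM3 module once it
        |   # lands in Autodistill (module name assumed).
        |   from autodistill_grounded_sam import GroundedSAM
        |   from autodistill_yolov8 import YOLOv8
        | 
        |   # Map text prompts to the class names the small model learns.
        |   base_model = GroundedSAM(
        |       ontology=CaptionOntology({"person": "person",
        |                                 "dog": "dog"}))
        | 
        |   # Label a folder of raw images with the big model...
        |   base_model.label(input_folder="./images",
        |                    output_folder="./dataset")
        | 
        |   # ...then train a small realtime model on that dataset.
        |   target_model = YOLOv8("yolov8n.pt")
        |   target_model.train("./dataset/data.yaml", epochs=200)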
       | 
       | We are also taking a big bet on SAM3 and have built it into
       | Roboflow as an integral part of the entire build and deploy
       | pipeline[2], including a brand new product called Rapid[3], which
       | reimagines the computer vision pipeline in a SAM3 world. It feels
       | really magical to go from an unlabeled video to a fine-tuned
       | realtime segmentation model with minimal human intervention in
       | just a few minutes (and we rushed the release of our new SOTA
       | realtime segmentation model[4] last week because it's the perfect
       | lightweight complement to the large & powerful SAM3).
       | 
       | We also have a playground[5] up where you can play with the model
       | and compare it to other VLMs.
       | 
       | [1] https://github.com/autodistill/autodistill
       | 
       | [2] https://blog.roboflow.com/sam3/
       | 
       | [3] https://rapid.roboflow.com
       | 
       | [4] https://github.com/roboflow/rf-detr
       | 
       | [5] https://playground.roboflow.com
        
         | dangoodmanUT wrote:
         | I was trying to figure out from their examples, but how are you
         | breaking up the different "things" that you can detect in the
         | image? Are you just running it with each prompt individually?
        
           | rocauc wrote:
           | The model supports batch inference, so all prompts are sent
           | to the model, and we parse the results.
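            | 
            | Roughly, in pseudocode (loader and method names here are
            | illustrative assumptions, not the published sam3 API; see
            | the facebookresearch/sam3 repo for the real entry points):
            | 
            |   # Hypothetical names -- illustration only.
            |   prompts = ["person", "dog", "helmet"]
            |   model = load_sam3("sam3.pt")  # assumed loader
            |   # One batched forward pass over all prompts...
            |   results = model.predict(image, prompts=prompts)
            |   # ...then split the detections back out per prompt.
            |   for prompt, dets in zip(prompts, results):
            |       for det in dets:
            |           handle(prompt, det.mask, det.box, det.score)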
        
         | sorenjan wrote:
         | SAM3 is probably a great model to distill from when training
         | smaller segmentation models, but isn't their DINOv2 a better
         | example of a large foundation model to distill from for various
          | computer vision tasks? I've seen it used as a starting point
          | for models doing segmentation and depth estimation. Maybe
         | there's a v3 coming soon?
         | 
         | https://dinov2.metademolab.com/
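          | 
          | For context, the frozen-backbone pattern with DINOv2 looks
          | like this (the torch.hub entry point and output keys are
          | from the dinov2 repo; the dense head on top is your own):
          | 
          |   import torch
          | 
          |   # Small DINOv2 backbone; weights download on first use.
          |   backbone = torch.hub.load(
          |       "facebookresearch/dinov2", "dinov2_vits14")
          |   backbone.eval()
          | 
          |   # Input sides must be multiples of the 14px patch size.
          |   x = torch.randn(1, 3, 224, 224)
          |   with torch.no_grad():
          |       out = backbone.forward_features(x)
          | 
          |   cls_token = out["x_norm_clstoken"]      # (1, 384)
          |   patches = out["x_norm_patchtokens"]     # (1, 256, 384)
          |   # Feed `patches` to a segmentation or depth head.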
        
       | xfeeefeee wrote:
        | I can't wait until it's easy and accessible to rotoscope /
        | greenscreen / mask this stuff out for videos. I had tried
        | Runway ML, but it was... lacking, and the web UI for fixing
        | parts of it had similar issues.
        | 
        | I'm curious how this works for hair and transparent/
        | translucent things. Probably not the best, but it doesn't seem
        | to be mentioned anywhere? Presumably the mask is just a hard
        | polygon/vector edge rather than an alpha matte?
        
         | nodja wrote:
         | I'm pretty sure davinci resolve does this already, you can even
         | track it, idk if it's available in the free version.
        
         | rocauc wrote:
         | I tried it on transparent glass mugs, and it does pretty well.
         | At least better than other available models:
         | https://i.imgur.com/OBfx9JY.png
         | 
         | Curious if you find interesting results -
         | https://playground.roboflow.com
        
       | sciencesama wrote:
       | Does the license allow for commercial purposes?
        
         | visioninmyblood wrote:
          | I just checked and it seems to be commercially permissible.
          | Companies like vlm.run and Roboflow are using it for
          | commercial purposes, as shown by their comments in this
          | thread. So I guess it can be used commercially.
        
           | rocauc wrote:
           | Yes. But also note that redistribution of SAM 3 requires
           | using the same SAM 3 license downstream. So libraries that
           | attempt to, e.g., relicense the model as AGPL are non-
           | compliant.
        
         | rocauc wrote:
          | Yes. It's a custom license with an Acceptable Use Policy that
          | prohibits military use and imposes export restrictions. The
          | custom license permits commercial use.
        
         | colesantiago wrote:
         | Yes, the license allows you to grift for your "AI startup"
        
       | gs17 wrote:
       | The 3D mesh generator is really cool too:
       | https://ai.meta.com/sam3d/ It's not perfect, but it seems to
       | handle occlusion very well (e.g. a person in a chair can be
       | separated into a person mesh and a chair mesh) and it's very
       | fast.
        
         | Animats wrote:
         | It's very impressive. Do they let you export a 3D mesh, though?
         | I was only able to export a video. Do you have to buy tokens or
         | something to export?
        
           | TheAtomic wrote:
            | I couldn't download it. The model appears to be comparable
            | to Sparc3D, Hunyuan, etc., but without a download, who can
            | say? It is much faster, though.
        
             | visioninmyblood wrote:
              | You can download it at
              | https://github.com/facebookresearch/sam3 and, for 3D,
              | https://github.com/facebookresearch/sam-3d-objects
             | 
             | I actually found the easiest way was to run it here
             | directly to see if it works for my use case of person
             | deidentification https://chat.vlm.run/chat/63953adb-a89a-4c
             | 85-ae8f-2d501d30a4...
        
           | modeless wrote:
           | The model is open weights, so you can run it yourself.
        
       | dangoodmanUT wrote:
        | This model is incredibly impressive. Text is definitely the
        | right modality, and the ability to intertwine it with an LLM
        | creates insane unlocks - my mind is already brimming with
        | ideas for projects that are now not only possible, but trivial.
        
       | HowardStark wrote:
       | Curious if anyone has done anything meaningful with SAM2 and
        | streaming. SAM3 has built-in streaming support, which is
        | _very_ exciting.
       | 
        | I've seen versions where people use an in-memory FS to write
        | frames of the stream for SAM2. Maybe that is good enough?
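        | 
        | For anyone who wants to try the in-memory FS trick, a rough
        | sketch against SAM2's video predictor (the predictor calls are
        | from the sam2 repo's notebooks; paths, config, and the webcam
        | source are my assumptions):
        | 
        |   import os
        |   import cv2
        |   import numpy as np
        |   from sam2.build_sam import build_sam2_video_predictor
        | 
        |   # tmpfs dir, so "frames on disk" never leaves RAM.
        |   frame_dir = "/dev/shm/sam2_frames"
        |   os.makedirs(frame_dir, exist_ok=True)
        | 
        |   cap = cv2.VideoCapture(0)  # or your stream URL
        |   for i in range(60):        # buffer a short window
        |       ok, frame = cap.read()
        |       if not ok:
        |           break
        |       cv2.imwrite(f"{frame_dir}/{i:05d}.jpg", frame)
        | 
        |   predictor = build_sam2_video_predictor(
        |       "configs/sam2.1/sam2.1_hiera_t.yaml",  # your config
        |       "checkpoints/sam2.1_hiera_tiny.pt")    # your weights
        |   state = predictor.init_state(video_path=frame_dir)
        | 
        |   # One click on frame 0, then propagate over the window.
        |   predictor.add_new_points_or_box(
        |       inference_state=state, frame_idx=0, obj_id=1,
        |       points=np.array([[320, 240]], dtype=np.float32),
        |       labels=np.array([1], dtype=np.int32))
        |   for i, ids, logits in predictor.propagate_in_video(state):
        |       masks = (logits > 0.0).cpu().numpy()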
        
       | rocauc wrote:
        | A brief history:
        | 
        | SAM 1 - Visual prompt to create pixel-perfect masks in an
        | image. No video. No class names. No open vocabulary.
        | 
        | SAM 2 - Visual prompting for tracking on images and video. No
        | open vocab.
        | 
        | SAM 3 - Open vocab concept segmentation on images and video.
       | 
       | Roboflow has been long on zero / few shot concept segmentation.
       | We've opened up a research preview exploring a SAM 3 native
       | direction for creating your own model:
       | https://rapid.roboflow.com/
        
       | hodgehog11 wrote:
       | This is an incredible model. But once again, we find an
       | announcement for a new AI model with highly misleading graphs.
       | That SA-Co Gold graph is particularly bad. Looks like I have
       | another bad graph example for my introductory stats course...
        
       | clueless wrote:
        | With an avg latency of 4 seconds, this still couldn't be used
        | for real-time video, correct?
        | 
        | [Update: should have mentioned I got the 4-second figure from
        | the roboflow.com links in this thread]
        
         | Etheryte wrote:
         | Didn't see where you got those numbers, but surely that's just
         | a problem of throwing more compute at it? From the blog post:
         | 
         | > This excellent performance comes with fast inference -- SAM 3
         | runs in 30 milliseconds for a single image with more than 100
         | detected objects on an H200 GPU.
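          | 
          | (For the math: 30 ms per frame is ~33 fps, so a single H200
          | should in principle keep up with real-time video. A 4 second
          | figure presumably reflects end-to-end hosted inference
          | rather than raw model time.)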
        
       | daemonologist wrote:
       | First impressions are that this model is _extremely_ good - the
       | "zero-shot" text prompted detection is a huge step ahead of what
       | we've seen before (both compared to older zero-shot detection
       | models and to recent general purpose VLMs like Gemini and Qwen).
       | With human supervision I think it's even at the point of being a
       | useful teacher model.
       | 
       | I put together a YOLO tune for climbing hold detection a while
       | back (trained on 10k labels) and this is 90% as good out of the
       | box - just misses some foot chips and low contrast wood holds,
       | and can't handle as many instances. It would've saved me a huge
       | amount of manual annotation though.
        
         | rocauc wrote:
          | As someone who works on a platform users have used to label
          | 1B images, I'm bullish that SAM 3 can automate at least 90%
          | of the work. Data prep flips to models being human-assisted
          | instead of humans being model-assisted (see "autolabel"
          | https://blog.roboflow.com/sam3/). I'm optimistic the
          | majority of users can now start by deploying a model and
          | then curating data, instead of the inverse.
        
       | bangaladore wrote:
       | Probably still can't get past a Google Captcha when on a VPN. Do
       | I click the square with the shoe of the person who's riding the
       | motorcycle?
        
         | conception wrote:
          | There are services that will bypass those for you via a
          | browser extension.
        
       | exe34 wrote:
        | Can anyone confirm whether this fits in a 3090? The files look
        | to be about 3.5 GB, but I can't work out what the overall
        | memory needs will be.
        
       | foota wrote:
       | Obligatory xkcd: https://xkcd.com/1425/
        
       | maelito wrote:
        | Can it detect the speed of a vehicle in any video,
        | unsupervised?
        
       | Benjamin_Dobell wrote:
       | For background removal (at least my niche use case of background
       | removal of kids drawings -- https://breaka.club/blog/why-were-
       | building-clubs-for-kids) I think birefnet v2 is still working
       | slightly better.
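        | 
        | For comparison, the BiRefNet path I'm testing against looks
        | roughly like this (model id and pre/post-processing follow its
        | Hugging Face README; treat this as a sketch):
        | 
        |   import torch
        |   from PIL import Image
        |   from torchvision import transforms
        |   from transformers import AutoModelForImageSegmentation
        | 
        |   model = AutoModelForImageSegmentation.from_pretrained(
        |       "ZhengPeng7/BiRefNet", trust_remote_code=True)
        |   model.eval()
        | 
        |   tfm = transforms.Compose([
        |       transforms.Resize((1024, 1024)),
        |       transforms.ToTensor(),
        |       transforms.Normalize([0.485, 0.456, 0.406],
        |                            [0.229, 0.224, 0.225]),
        |   ])
        |   img = Image.open("drawing.jpg").convert("RGB")
        |   with torch.no_grad():
        |       # Last output is the final mask; sigmoid -> [0, 1].
        |       pred = model(tfm(img).unsqueeze(0))[-1].sigmoid()[0]
        | 
        |   # Use the foreground probability as an alpha matte.
        |   mask = transforms.ToPILImage()((pred * 255).byte())
        |   img.putalpha(mask.resize(img.size))
        |   img.save("drawing_cutout.png")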
       | 
        | SAM3 seems to trace the images less precisely -- it'll discard
        | bits where kids draw outside the lines, which is okay, but it
        | also seems to struggle around sharp corners and includes a bit
        | of the white page that I'd like cut out.
       | 
        | Of course, SAM3 is _significantly_ more powerful in that it
        | does _much_ more than simply cut out images. It seems to be
        | able to identify what these kids' drawings represent. That's
        | very impressive; AI models are typically trained on photos and
        | adult illustrations, and they struggle with children's
        | drawings. So I could perhaps still use this for identifying
        | content, giving kids more freedom to draw what they like, and
        | then, unprompted, attach appropriate behavior to their
        | drawings in-game.
        
       | ge96 wrote:
        | Dang, that seems like it would work great for 3D game asset
        | generation.
        
       | tonyhart7 wrote:
        | This would be good for video editors.
        
       ___________________________________________________________________
       (page generated 2025-11-19 23:00 UTC)