How to Use SAM 2 for Video Segmentation

Written by Piotr Skalski · Aug 1, 2024 · 7 min read

Segment Anything Model 2 (SAM 2) is a unified video and image segmentation model.

Video segmentation presents unique challenges compared to image segmentation. Object motion, deformation, occlusion, lighting changes, and other factors can vary dramatically from frame to frame. Videos are also often lower quality than images due to camera motion, blur, and lower resolution, which further increases the difficulty. SAM 2 delivers improved accuracy in video segmentation while requiring 3 times fewer interactions than previous approaches. For image segmentation, SAM 2 is both more accurate and 6 times faster than the original Segment Anything Model (SAM).

Load SAM 2 Model for Video Processing

Open the notebook that accompanies this guide. First, clone the repository and install the required dependencies using the following commands:

```
git clone https://github.com/facebookresearch/segment-anything-2.git
cd segment-anything-2
pip install -e .
python setup.py build_ext --inplace
```

Due to a bug in the segment-anything-2 codebase, you need to run the command python setup.py build_ext --inplace after installation. Installing SAM 2 locally may take a few minutes. Be patient!

In this project, we will also use the supervision package, which will help us visualize SAM 2 results, among other tasks.

```
pip install supervision
```

SAM 2 is available in 4 different model sizes, ranging from the lightweight "sam2_hiera_tiny" (38.9M parameters) to the more powerful "sam2_hiera_large" (224.4M parameters). The models also differ in inference speed: the smallest processes approximately 47 frames per second, while the largest processes around 30. These values were measured on an NVIDIA A100 with PyTorch 2.3.1 and CUDA 12.1, using automatic mixed precision with bfloat16 and an image encoder compiled with torch.compile.

For this demo, we will use the largest model. Links to the weights for the other model sizes can be found in the README.md file of the repository. We can download the model weights as follows:

```
wget -q https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_large.pt
```

For image use cases, we load SAM 2 using build_sam2, while for videos, we use build_sam2_video_predictor.
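For comparison, here is a minimal sketch of the image path. It assumes the weights downloaded above, a placeholder image file named image.jpg, and the SAM2ImagePredictor wrapper from the repository; treat it as an illustration rather than part of the video workflow described in this post.

```
import cv2
import numpy as np
import torch
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

CHECKPOINT = "checkpoints/sam2_hiera_large.pt"
CONFIG = "sam2_hiera_l.yaml"

# build the image model and wrap it in the image predictor
image_model = build_sam2(CONFIG, CHECKPOINT)
image_predictor = SAM2ImagePredictor(image_model)

# load an image (placeholder path) and convert BGR -> RGB for the predictor
image_bgr = cv2.imread("image.jpg")
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

with torch.inference_mode():
    image_predictor.set_image(image_rgb)
    # a single positive point prompt; the coordinates are illustrative
    masks, scores, logits = image_predictor.predict(
        point_coords=np.array([[703, 303]], dtype=np.float32),
        point_labels=np.array([1]),
    )
```

The rest of this guide focuses on the video predictor.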
The two loaders differ because video processing uses model memory, which is not initialized when processing single images. More on model memory later in this article. To load the model, we need the path to the previously downloaded weights file and the name of the YAML configuration file. Configuration files for each model size can also be found in the repository.

```
import torch
from sam2.build_sam import build_sam2_video_predictor

CHECKPOINT = "checkpoints/sam2_hiera_large.pt"
CONFIG = "sam2_hiera_l.yaml"

sam2_model = build_sam2_video_predictor(CONFIG, CHECKPOINT)
```

SAM 2 Data Preprocessing

SAM 2 is equipped with memory that stores information about the object and previous interactions, allowing it to generate mask predictions throughout the video and effectively correct them based on the object context stored from previously observed frames.

Image 1. SAM 2 architecture uses memory to generate correct masks for video. Source: Segment Anything 2 paper.

Before starting segmentation, SAM 2 needs to know the content of all frames. For this purpose, the frames must be saved to disk. It is crucial to save the frames in JPEG format, as this is currently the only supported format. Since all of these frames will be loaded into VRAM in the next step, it may be necessary to downscale them before saving, depending on the resolution of your video. We can accomplish all of this using the supervision package. In the snippets below, SOURCE_VIDEO_PATH, SOURCE_FRAMES_DIR, and TARGET_VIDEO_PATH are placeholders for your own input video path, frame directory, and output video path.

```
import supervision as sv

SOURCE_VIDEO_PATH = "video.mp4"      # placeholder: path to your input video
SOURCE_FRAMES_DIR = "video_frames"   # placeholder: directory for the extracted JPEG frames

frames_generator = sv.get_video_frames_generator(SOURCE_VIDEO_PATH)
sink = sv.ImageSink(
    target_dir_path=SOURCE_FRAMES_DIR,
    image_name_pattern="{:05d}.jpeg")

with sink:
    for frame in frames_generator:
        sink.save_image(frame)
```

Initialize the Inference State for SAM 2

SAM 2 requires stateful inference for interactive video segmentation, so we need to initialize an inference state on this video. During initialization, it loads all the JPEG frames in video_path and stores their pixels in inference_state.

```
inference_state = sam2_model.init_state(video_path=SOURCE_FRAMES_DIR)
```

If you have run any previous tracking using this inference_state, please reset it first via reset_state.

```
sam2_model.reset_state(inference_state)
```

Segment and Track One Object with SAM 2

To get started, let's try to segment the ball in the first frame of the video. Label 1 indicates a positive click (to add a region), while label 0 indicates a negative click (to remove a region). Beyond points and labels, the prompt also needs the index of the frame we interact with and a unique ID for each object we interact with (it can be any integer).

```
import numpy as np

points = np.array([[703, 303]], dtype=np.float32)
labels = np.array([1])

frame_idx = 0
tracker_id = 1

_, object_ids, mask_logits = sam2_model.add_new_points(
    inference_state=inference_state,
    frame_idx=frame_idx,
    obj_id=tracker_id,
    points=points,
    labels=labels,
)
```

Image 2. Prompt. The positive point prompt is marked with a filled circle.

Image 3. Segment Anything 2 result for the above single-object prompt.
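Before moving on to refinement, it can be useful to render the predicted mask on the prompted frame. The snippet below is a minimal sketch that reuses the SOURCE_FRAMES_DIR placeholder from the preprocessing step and converts mask_logits to boolean masks in the same way as the propagation code later in this post.

```
import cv2
import numpy as np
import supervision as sv

# read back the prompted frame from the saved JPEGs
# (the file name pattern matches the ImageSink configured above)
frame = cv2.imread(f"{SOURCE_FRAMES_DIR}/{frame_idx:05d}.jpeg")

# threshold the logits into boolean masks and flatten (N, 1, H, W) to (N, H, W)
masks = (mask_logits > 0.0).cpu().numpy()
N, X, H, W = masks.shape
masks = masks.reshape(N * X, H, W)

detections = sv.Detections(
    xyxy=sv.mask_to_xyxy(masks=masks),
    mask=masks,
    tracker_id=np.array(object_ids)
)

annotated_frame = sv.MaskAnnotator(
    color_lookup=sv.ColorLookup.TRACK).annotate(frame.copy(), detections)
sv.plot_image(annotated_frame)
```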
Refine Predictions for SAM 2

Similar to SAM, SAM 2 allows prompting the model with negative points - points that do not belong to the object. This enables a precise definition of the boundaries of the object of interest.

```
import numpy as np

points = np.array([
    [703, 303],
    [731, 256],
    [713, 356],
    [740, 297]
], dtype=np.float32)
labels = np.array([1, 0, 0, 0])

frame_idx = 0
tracker_id = 1

_, object_ids, mask_logits = sam2_model.add_new_points(
    inference_state=inference_state,
    frame_idx=frame_idx,
    obj_id=tracker_id,
    points=points,
    labels=labels,
)
```

Image 4. Prompt. The positive point prompt is marked with a filled circle, while the negative point prompts are marked with empty circles.

According to the Segment Anything 2 paper, the model's accuracy on video tasks increases with the number of labeled video frames. Don't be afraid to annotate several frames in different parts of the video. Make sure to use the appropriate frame_idx in the various add_new_points calls.

Image 5. Zero-shot accuracy over 9 datasets in interactive offline and online evaluation settings. Source: Segment Anything 2 paper.

Propagate Prompts Across the Video

To apply our point prompts to all video frames, we use the propagate_in_video generator. Each call returns frame_idx (the index of the current frame), object_ids (the IDs of the objects detected in the frame), and mask_logits (the logit values corresponding to object_ids), which we can convert to masks using thresholding. We then read each frame, apply the masks to it using a MaskAnnotator, and finally write the annotated frame to the output video.

```
import cv2
import numpy as np
import supervision as sv

TARGET_VIDEO_PATH = "result.mp4"   # placeholder: path for the annotated output video

colors = ['#FF1493', '#00BFFF', '#FF6347', '#FFD700']
mask_annotator = sv.MaskAnnotator(
    color=sv.ColorPalette.from_hex(colors),
    color_lookup=sv.ColorLookup.TRACK)

video_info = sv.VideoInfo.from_video_path(SOURCE_VIDEO_PATH)
frames_paths = sorted(sv.list_files_with_extensions(
    directory=SOURCE_FRAMES_DIR, extensions=["jpeg"]))

with sv.VideoSink(TARGET_VIDEO_PATH, video_info=video_info) as sink:
    for frame_idx, object_ids, mask_logits in sam2_model.propagate_in_video(inference_state):
        frame = cv2.imread(str(frames_paths[frame_idx]))
        masks = (mask_logits > 0.0).cpu().numpy()
        N, X, H, W = masks.shape
        masks = masks.reshape(N * X, H, W)
        detections = sv.Detections(
            xyxy=sv.mask_to_xyxy(masks=masks),
            mask=masks,
            tracker_id=np.array(object_ids)
        )
        frame = mask_annotator.annotate(frame, detections)
        sink.write_frame(frame)
```

Segment and Track Multiple Objects with SAM 2

SAM 2 can also segment and track two or more objects simultaneously. One way is to process them individually; however, it is more efficient to prompt them together, so that image features can be shared between objects and computational cost is reduced. Each object should be assigned a different object ID; a sketch of such a multi-object prompt follows the figures below.

Image 6. Prompt. The positive point prompt is marked with a filled circle. Use a different color for each object.

Image 7. Segment Anything 2 result for the above multi-object prompt.
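Building on the single-object example above, a minimal sketch of prompting two objects on the same frame might look like this. The point coordinates and object IDs are illustrative placeholders, not values from the original clip.

```
import numpy as np

# start from a clean state so earlier single-object prompts do not interfere
sam2_model.reset_state(inference_state)

# one positive point prompt per object, keyed by object ID (placeholder coordinates)
prompts = {
    1: np.array([[703, 303]], dtype=np.float32),  # e.g. the ball
    2: np.array([[400, 350]], dtype=np.float32),  # e.g. a player
}

for obj_id, points in prompts.items():
    _, object_ids, mask_logits = sam2_model.add_new_points(
        inference_state=inference_state,
        frame_idx=0,
        obj_id=obj_id,
        points=points,
        labels=np.array([1]),
    )
```

Running propagate_in_video afterwards, exactly as in the previous section, will then track and annotate both objects, with the TRACK color lookup giving each object ID its own color.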
Tracking Objects Across Multiple Videos with SAM 2

During our experiments, we discovered that SAM 2 can detect the same objects visible in shots from different cameras. In our experiment, we used two additional clips showing the same basketball play. We performed labeling only on frames from one shot and ran inference on frames from all three clips. Even though the model had not seen frames from the other shots, SAM 2 was able to detect the objects almost perfectly in all three clips.

SAM 2 for Video Limitations

SAM 2 may struggle with segmenting objects across shot changes and can lose track of or confuse objects in crowded scenes, after long occlusions, or in long videos. It also faces challenges with accurately tracking objects that have very thin or fine details, especially when they are moving quickly. Another difficult scenario occurs when objects with similar appearances are nearby. While SAM 2 can track multiple objects in a video simultaneously, it processes each object separately, using only shared per-frame embeddings without inter-object communication.

Conclusions

Segment Anything Model 2 (SAM 2) is a significant advancement in image and video segmentation, offering a unified model with improved accuracy, speed, and context awareness. While it faces limitations in certain scenarios, SAM 2 represents a powerful tool in the field of image and video segmentation with broad applications across various domains. The release of the original SAM model sparked a wave of projects like HQ-SAM, FastSAM, and MobileSAM, and I am excited about the research papers and models that will be published in the coming months as the community builds upon SAM 2's capabilities.