In a major leap for computer vision research and creative AI tools, Meta has introduced a new generation of models designed to dramatically enhance the way machines understand images, videos, physical objects, and even human forms. These models — named SAM 3 and SAM 3D — are the latest additions to Meta’s rapidly evolving Segment Anything Collection, an initiative that has already reshaped how segmentation tasks are performed across industries.

With the launch of these systems, Meta is not merely introducing minor upgrades. Instead, the company is laying the foundation for a new class of AI that combines the fidelity of human visual perception with the speed, flexibility, and generative capabilities of large multimodal systems. SAM 3 brings unprecedented precision to object detection, tracking, and segmentation driven by natural language descriptions, while SAM 3D pushes the boundaries of reconstructing 3D shapes and human bodies from a single image, a task long considered one of the hardest challenges in computer vision research.
The release also includes Meta’s new interactive platform, Segment Anything Playground, designed to democratize access to cutting-edge visual AI. Users can upload images or videos, issue intuitive text prompts, and instantly see the models identify, isolate, or reconstruct objects without requiring deep technical expertise. The new environment provides a gateway for creators, developers, and curious enthusiasts to explore how these models reinterpret the world through a computational lens.
Together, SAM 3 and SAM 3D signify Meta’s intention to build an AI ecosystem where creative professionals, AR/VR developers, scientists, and everyday consumers can manipulate visual content with effortless precision. The advancements also hint at Meta’s broader ambitions across mixed reality, commerce tools, creative media, and the next generation of digital interfaces.
The Evolution Toward Language-Driven Segmentation
Earlier versions of the Segment Anything Model — SAM 1 and SAM 2 — were groundbreaking in their own right, enabling users to isolate parts of an image through simple visual cues such as clicks, bounding boxes, or scribbles. This approach, while powerful, still required a degree of manual interaction and lacked a deeper semantic understanding of objects beyond their shapes.
SAM 3 builds upon this foundation by introducing full-scale natural language prompting, allowing the system to interpret detailed verbal descriptions and apply them directly to segmentation tasks. This leap transforms segmentation from a largely hands-on task into a linguistically guided process that aligns more closely with how humans naturally identify and describe objects.
Traditional segmentation models typically relied on a rigid catalog of predefined labels — a system that worked reasonably well for simple categories like “dog,” “car,” or “tree” but struggled with specificity. Asking an older model to identify “the person wearing the blue scarf behind the counter,” or “a yellow school bus with black windows,” would likely result in failure because such models lacked the ability to connect nuanced language to the visual details of a scene.
SAM 3 eliminates this constraint entirely. Users can type a phrase like “the woman wearing a silver necklace but not holding a cup”, and the model will isolate precisely the elements matching that description across a complex scene. The richness of natural language opens the door to far more granular segmentation, enabling a form of visual reasoning that resembles the way humans interpret their surroundings.
This language-driven approach becomes even more powerful when paired with large multimodal language models. When linked to an LLM, SAM 3 can interpret multi-sentence instructions, contextual storytelling, or highly descriptive queries that incorporate logic, exclusions, or multi-object relationships. The result is an unprecedented level of control and flexibility, making SAM 3 one of the most semantically aware segmentation systems currently available.
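The compositional logic behind such prompts, matching some attributes while excluding others, can be illustrated with a toy sketch. To be clear, this is not SAM 3's actual mechanism: the real model operates on pixels and learned text-image embeddings, whereas the detection records and attribute tags below are invented, hand-labeled stand-ins used only to show the filtering logic a prompt with exclusions implies.

```python
# Toy illustration of compositional, language-style filtering over detections.
# Mimics the logic of a prompt like "the woman wearing a silver necklace but
# not holding a cup" using hand-labeled attribute tags (all names are
# hypothetical; SAM 3 itself works on learned embeddings, not tags).

from dataclasses import dataclass, field

@dataclass
class Detection:
    label: str                                     # coarse category, e.g. "person"
    attributes: set = field(default_factory=set)   # fine-grained descriptive tags

def query(detections, require, exclude):
    """Keep detections that carry every required tag and no excluded tag."""
    return [
        d for d in detections
        if require <= d.attributes and not (exclude & d.attributes)
    ]

scene = [
    Detection("person", {"woman", "silver necklace", "holding cup"}),
    Detection("person", {"woman", "silver necklace"}),
    Detection("person", {"man", "holding cup"}),
]

matches = query(scene,
                require={"woman", "silver necklace"},
                exclude={"holding cup"})
print(len(matches))  # prints 1: only the second detection satisfies the prompt
```

The point of the sketch is the combination of positive and negative constraints in a single query, which is precisely what rigid label catalogs could not express.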
Transforming Creative Tools Through Precise Visual Editing
One of Meta’s primary goals with SAM 3 is to push the creative tool industry beyond traditional boundaries. The company has already begun integrating SAM 3 into its Edits video creation app, which aims to simplify complex video workflows for creators who may not have access to advanced editing software.
With SAM 3 powering the backend, creators will be able to apply effects, enhancements, or transformations to very specific elements within a video — such as changing the color of a piece of clothing, applying motion accents to a moving person, isolating a pet, or altering the appearance of a background object. Unlike conventional tools that require frame-by-frame masking or complicated timeline adjustments, SAM 3 handles the segmentation automatically, even tracking objects across multiple frames with consistent accuracy.
Meta also plans to expand SAM 3’s presence into Vibes on the Meta AI app and through its meta.ai interface, signaling a shift toward more interactive and AI-assisted media creation experiences. These tools will allow users to blend video editing with content generation, styling, and interactive visual effects, unlocking new storytelling possibilities across short-form content platforms.
SAM 3D: Bridging the Physical and Digital Worlds Through Single-Image Reconstruction
While SAM 3 is built to refine 2D understanding, SAM 3D focuses on one of the most ambitious goals in computer vision: generating detailed 3D reconstructions from a single image. This task traditionally required specialized multi-camera rigs, depth sensors, or structured datasets. By contrast, SAM 3D uses advanced learning techniques to infer depth, geometry, and realistic structure using only a single perspective input.
The SAM 3D system is composed of two major open-source models:
- SAM 3D Objects, which reconstructs 3D shapes, items, and entire scenes.
- SAM 3D Body, which focuses specifically on human body structure, pose estimation, and shape understanding.
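A small geometric sketch helps ground what inferring 3D structure from one image involves. Once a model predicts a per-pixel depth map, each pixel can be back-projected into a 3D point using the standard pinhole camera equations. This is textbook geometry, not a description of SAM 3D's internals, and the depth values and camera intrinsics below are invented purely for illustration.

```python
# Back-project a predicted depth map into a 3D point cloud with a pinhole
# camera model. Standard geometry, not SAM 3D's actual method; the depth
# values and intrinsics (fx, fy, cx, cy) are made up for illustration.

def backproject(depth, fx, fy, cx, cy):
    """depth: 2D list of per-pixel depths (meters) -> list of (X, Y, Z)."""
    points = []
    for v, row in enumerate(depth):        # v = pixel row (image y)
        for u, z in enumerate(row):        # u = pixel column (image x)
            x = (u - cx) * z / fx          # pinhole: X = (u - cx) * Z / fx
            y = (v - cy) * z / fy          # pinhole: Y = (v - cy) * Z / fy
            points.append((x, y, z))
    return points

# A tiny 2x2 "depth map" with the camera center between the four pixels.
depth = [[2.0, 2.0],
         [2.5, 2.5]]
cloud = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
print(cloud[0])  # (-1.0, -1.0, 2.0)
```

The hard part, and what makes single-image reconstruction so difficult, is everything this sketch takes as given: estimating plausible depth, geometry, and occluded structure from a lone 2D view.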
These models set a new standard in the field of 3D reconstruction; according to Meta, they significantly outperform prior approaches across a range of benchmarks. Meta also developed SAM 3D Artist Objects, a unique evaluation dataset created in collaboration with professional artists. This dataset serves as a robust measurement tool for assessing the artistic, structural, and geometric accuracy of reconstructions generated by AI, marking a shift toward more rigorous and realistic benchmarking standards.
The potential applications of SAM 3D span multiple industries. In robotics, a robot equipped with SAM 3D capabilities could analyze unfamiliar objects and understand their 3D structure in real time, enabling safer navigation and more precise manipulation. In sports medicine, the model could assist therapists by reconstructing body movements or diagnosing posture issues from a single captured image. Scientific research, too, could benefit from its ability to generate 3D models of biological specimens, artifacts, or environmental structures quickly and accurately.
Meta is already putting SAM 3D into practical use through the new View in Room feature on Facebook Marketplace, which allows customers to visualize furniture or decorative items in their own homes. This capability brings more confidence to online purchases by bridging the gap between digital imagery and real-world scale.
Segment Anything Playground: Visual AI for Everyone
To accompany the release of SAM 3 and SAM 3D, Meta has launched Segment Anything Playground, an interactive platform that eliminates the technical barrier for users who want to explore the models’ capabilities. From the moment an image or video is uploaded, users can simply describe the object they wish to isolate or reconstruct, and the interface will respond instantly with a visual output.
The platform features templates for both practical and playful tasks. Users can pixelate faces, blur license plates, mask screens, add highlights to specific elements, apply movement-based visual effects, or experiment with transformation effects that manipulate the composition of a scene. This accessible interface positions Meta as a leader in human-centric AI platforms, letting everyday users experiment directly with the frontier of visual computing.
Open Source Commitment and Research Collaboration
As part of the release, Meta is sharing SAM 3 model weights, a new benchmark dataset for open-vocabulary segmentation, and a detailed research paper documenting the architecture and training methodology behind SAM 3. Similarly, SAM 3D comes with model checkpoints, inference code, and a new dataset intended to elevate the standard of 3D research across the entire community.
Meta has also partnered with Roboflow, a popular data annotation platform, to allow researchers and developers to fine-tune SAM 3 for project-specific use cases. This collaboration streamlines workflows for professionals building custom datasets or training specialized segmentation tools.
By choosing to open-source these models, Meta is reinforcing its strategic role in shaping the future of visual AI research and expanding the capabilities available to both academic and industry developers worldwide.
A Glimpse Into the Future of Visual Computing
The introduction of SAM 3 and SAM 3D demonstrates Meta’s long-term vision of creating AI systems capable of deep perception, contextual understanding, and creative content generation. These systems will likely play key roles in Meta’s broader ecosystem — from mixed-reality experiences in Meta Quest devices to AR interfaces in Ray-Ban Meta smart glasses, and eventually across immersive environments where digital and physical worlds merge seamlessly.
With each iteration of the Segment Anything Collection, Meta edges closer to constructing an AI layer that sees, interprets, and interacts with the world in a profoundly human-like manner. The implications extend across entertainment, commerce, education, virtual production, industrial design, research, and everyday communication.
SAM 3 and SAM 3D are not just upgrades; they represent a major shift in visual AI, marking the beginning of a decade where the boundaries of perception — human or machine — may blur more than ever before.