Meta has introduced Segment Anything, an image segmentation project that comprises the Segment Anything Model (SAM) and what Meta describes as the largest segmentation dataset ever built. The aim is to enable a broad range of applications and to promote further research on foundation models for computer vision.
Segment Anything, the new Meta project at the service of users
Segmentation is the identification of the image pixels that belong to an object. It is a fundamental task in computer vision and is used in a wide variety of applications, from scientific image analysis to photo editing.
However, building an accurate segmentation model for a specific task typically requires highly specialized work by technical experts with access to AI training infrastructure and large volumes of carefully annotated in-domain data.
Meta’s goal was to build a foundation model for image segmentation: a promptable model that is trained on diverse data and can adapt to specific tasks, similar to how prompting is used in natural language processing models.
However, the segmentation data needed to train such a model is not readily available online or elsewhere, unlike images, video and text, which are plentiful on the Internet. Therefore, with Segment Anything, Meta set out to simultaneously develop a general, promptable segmentation model and use it to create a segmentation dataset of unprecedented scale.
What the Segment Anything model can do: image domains
The Segment Anything model has learned a general notion of what objects are. It can generate masks for any object in any image or video, including objects and image types that it did not encounter during training.
The model is also general enough to cover a wide range of use cases and can be used out of the box on new image “domains”, whether underwater photos or cell microscopy, without requiring additional training (a capability often referred to as zero-shot transfer).
In the future, it could be used to power applications in numerous domains that require searching and segmenting any object in any image.
The model at the service of artificial intelligence
For the artificial intelligence research community, the Segment Anything model could become a component of broader AI systems aimed at a more general multimodal understanding of the world, for example, understanding both the visual and textual content of a web page.
The model at the service of virtual and augmented reality
The model could also be useful for virtual and augmented reality. How? It could allow an object to be selected based on a user’s gaze and then “lifted” into 3D. For content creators, SAM can enhance creative applications such as extracting image regions for collages or video editing.
How the Segment Anything model works
The Meta team designed SAM to return a valid segmentation mask for any prompt. A prompt can be foreground/background points, a box or mask, free-form text or, in general, any information that indicates what to segment in an image.
The requirement of a valid mask simply means that even when a prompt is ambiguous and could refer to multiple objects (for example, a point on a shirt could indicate either the shirt or the person wearing it), the output should be a reasonable mask for one of those objects. This task is used to pre-train the model and to solve general downstream segmentation tasks via prompting.
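The ambiguity requirement can be illustrated with a toy sketch. The code below is not SAM's implementation: the CandidateMask type, the rectangular "masks" and the scores are all hypothetical, chosen only to show that a single point can fall inside several nested objects and that any one of the matching masks counts as a valid answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateMask:
    """Hypothetical candidate: an axis-aligned box standing in for a pixel mask."""
    label: str
    x0: int
    y0: int
    x1: int
    y1: int
    score: float  # made-up confidence value, for illustration only

    def contains(self, x: int, y: int) -> bool:
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1


def valid_masks(candidates, point):
    """Return every candidate that contains the prompt point."""
    x, y = point
    return [c for c in candidates if c.contains(x, y)]


def pick_mask(candidates, point):
    """Return one reasonable mask for the point, or None if nothing matches."""
    matches = valid_masks(candidates, point)
    return max(matches, key=lambda c: c.score) if matches else None


candidates = [
    CandidateMask("person", 10, 0, 60, 120, score=0.81),
    CandidateMask("shirt", 20, 30, 50, 70, score=0.93),
    CandidateMask("dog", 80, 90, 110, 120, score=0.88),
]

click = (35, 50)  # a point on the shirt, which is also on the person
print([c.label for c in valid_masks(candidates, click)])
print(pick_mask(candidates, click).label)
```

In this toy setup both "person" and "shirt" are valid answers for the click, and the sketch simply returns the one with the higher assumed score.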
With SAM, collecting new segmentation masks is faster than ever. With this tool, it takes only about 14 seconds to interactively annotate a mask.
The gears of the data engine
Meta created a data engine to build the SA-1B dataset. This data engine has three “gears”. In the first gear, the model assists the annotators, as described above. The second gear mixes fully automatic annotation with assisted annotation, helping to increase the diversity of the collected masks. The last gear is fully automatic mask creation, which allows the dataset to scale.
In natural language processing and, more recently, computer vision, one of the most exciting developments is that of foundation models that can perform zero-shot and few-shot learning on new datasets and tasks using “prompting” techniques. The Meta team was inspired by this line of work.
The image encoder
An image encoder produces a one-time embedding for the image, while a lightweight encoder converts any prompt into an embedding vector in real time. These two sources of information are then combined in a lightweight decoder that predicts the segmentation masks. Once the image embedding has been computed, SAM can produce a segment in just 50 milliseconds for any prompt in a web browser.
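The split between a one-time image step and a cheap per-prompt step can be sketched with a toy class. Everything here is an assumption made for illustration: ToySegmenter, its method names and the "embedding" (a plain copy of a grid of region ids) are stand-ins, not SAM's actual encoder or decoder. The point is only that the expensive step runs once per image and its result is reused for every prompt.

```python
class ToySegmenter:
    """Toy stand-in for the encoder/decoder split; not SAM's real code."""

    def __init__(self):
        self.image_encoder_calls = 0
        self._embedding = None

    def set_image(self, image):
        """Expensive step (in a real model): run once per image and cache it."""
        self.image_encoder_calls += 1
        # Hypothetical "embedding": a copy of the grid of region ids.
        self._embedding = [row[:] for row in image]

    def predict(self, point):
        """Cheap step: combine the cached embedding with a point prompt."""
        if self._embedding is None:
            raise RuntimeError("call set_image() before predict()")
        x, y = point
        target = self._embedding[y][x]
        # The "mask" marks every cell sharing the region id under the point.
        return [[cell == target for cell in row] for row in self._embedding]


image = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
]

segmenter = ToySegmenter()
segmenter.set_image(image)  # one-time cost per image
masks = [segmenter.predict(p) for p in [(0, 0), (1, 1), (3, 2)]]
print(segmenter.image_encoder_calls)  # the encoder ran once for three prompts
print(masks[1])
```

Caching the image embedding is what makes interactive use plausible: each new click or box only pays for the lightweight part of the pipeline.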
To train the model, the team needed a huge and diverse source of data, which did not exist when the work began. The segmentation dataset that Meta released is by far the largest to date. The data was collected using the model itself. Specifically, the annotators used SAM to interactively annotate the images, and then the newly annotated data was used to update SAM in turn. This cycle was repeated many times to iteratively improve both the model and the dataset.
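The annotate-then-update cycle can be written down as a short schematic loop. This is a sketch under stated assumptions, not Meta's pipeline: run_data_engine, toy_annotate and toy_retrain are hypothetical names, and the "model" is reduced to a version number so that the flow of data is easy to follow.

```python
def run_data_engine(model, image_batches, annotate, retrain):
    """Alternate between model-assisted annotation and model updates."""
    dataset = []
    for batch in image_batches:
        # Annotators label the batch with help from the current model.
        dataset.extend(annotate(model, image) for image in batch)
        # The model is then updated on everything collected so far.
        model = retrain(model, dataset)
    return model, dataset


# Toy stand-ins: the "model" is just a version number.
def toy_annotate(model_version, image):
    return {"image": image, "annotated_with": model_version}


def toy_retrain(model_version, dataset):
    return model_version + 1


batches = [["img_a", "img_b"], ["img_c"], ["img_d", "img_e"]]
final_model, dataset = run_data_engine(0, batches, toy_annotate, toy_retrain)
print(final_model, len(dataset))
print([item["annotated_with"] for item in dataset])
```

Each batch is annotated with the newest model available at that point, which is the property the cycle relies on: better models produce annotations faster, and more annotations produce better models.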
At about 14 seconds per interactively annotated mask, the per-mask annotation process is only 2x slower than annotating bounding boxes, which takes about 7 seconds using the fastest annotation interfaces.
Compared with previous large-scale segmentation data collection efforts, this model-assisted approach is 6.5x faster than COCO’s fully manual polygon-based mask annotation and 2x faster than the previous largest data annotation effort, which was also model-assisted.
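The timing figures above can be checked with a few lines of arithmetic. The 14-second, 7-second, 6.5x and 2x values come from the text; the per-mask times for COCO and for the previous largest effort are not stated in the source and are only what those ratios would imply if "x times faster" is read as a ratio of per-mask annotation times.

```python
# Figures quoted in the text.
SECONDS_PER_MASK = 14      # interactive mask annotation with SAM
SECONDS_PER_BOX = 7        # bounding box, fastest annotation interfaces
SPEEDUP_VS_COCO = 6.5      # versus fully manual polygon annotation
SPEEDUP_VS_PREVIOUS = 2    # versus the previous largest, model-assisted effort

# Ratio stated in the text: masks take about twice as long as boxes.
slowdown_vs_box = SECONDS_PER_MASK / SECONDS_PER_BOX

# Derived values (assumption: "Nx faster" means N times less time per mask).
implied_coco_seconds = SECONDS_PER_MASK * SPEEDUP_VS_COCO
implied_previous_seconds = SECONDS_PER_MASK * SPEEDUP_VS_PREVIOUS


def annotation_hours(num_masks, seconds_per_mask):
    """Total annotator time in hours for a given number of masks."""
    return num_masks * seconds_per_mask / 3600


print(slowdown_vs_box)
print(implied_coco_seconds, implied_previous_seconds)
print(round(annotation_hours(1_000_000, SECONDS_PER_MASK)))
```

Under that reading, a fully manual polygon mask would take roughly a minute and a half, which is why model assistance matters at the scale of a billion masks.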
The data engine built by Meta, at the service of SAM
The final dataset includes more than 1.1 billion segmentation masks collected on approximately 11 million licensed, privacy-preserving images. SA-1B has 400 times more masks than any existing segmentation dataset and, as verified by human evaluation studies, the masks are of high quality and diversity, in some cases even comparable in quality to masks from previous, much smaller datasets that were annotated entirely by hand.
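A quick calculation puts these dataset figures in perspective. The totals and the 400x ratio are taken from the text; since the text gives them as approximate ("over", "approximately"), the results below are rough orders of magnitude, and the implied size of the largest earlier dataset is a derived estimate rather than a figure from the source.

```python
# Figures quoted in the text (both are approximate).
TOTAL_MASKS = 1_100_000_000
TOTAL_IMAGES = 11_000_000
RATIO_TO_LARGEST_PREVIOUS = 400

# Average number of masks per image in SA-1B.
masks_per_image = TOTAL_MASKS / TOTAL_IMAGES

# Derived estimate: mask count of the largest earlier dataset, assuming
# "400 times more masks" is a simple ratio of total mask counts.
implied_previous_masks = TOTAL_MASKS / RATIO_TO_LARGEST_PREVIOUS

print(masks_per_image)
print(implied_previous_masks)
```

An average on the order of 100 masks per image suggests that the automatic stage segments images densely, down to small parts and background objects, rather than labelling only a few salient objects.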