SAM (Segment Anything Model) Tool
The SAM tool brings Meta AI’s powerful Segment Anything Model to your application, enabling instant object segmentation with just a click.
Overview
The SAM tool delegates all inference to a backend server via a predictFn. This architecture:
- ✅ Avoids browser memory issues - No ONNX WASM in browser (critical for Safari)
- ✅ Reduces bundle size - No ~20MB ONNX runtime in client
- ✅ Enables GPU acceleration - Backend can use CUDA/Metal
- ✅ Real-time preview - Hover to see segmentation before clicking
- ✅ Smart caching - Reuses preview mask when clicking
How it works:
- User hovers over an object → backend runs inference → blue preview overlay appears
- User clicks on the preview to create annotation
- Cached mask is converted to polygon (~instant)
- Annotation is automatically created and selected
Quick Start
```ts
import { SamTool } from 'annota';
import type { SamPredictFn, SamPredictInput, SamPredictOutput } from 'annota';

// Create a predict function that calls your backend
const predictFn: SamPredictFn = async (input: SamPredictInput): Promise<SamPredictOutput> => {
  const response = await fetch('/api/sam', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      embeddingPath: input.embedding, // Path to .npy file on server
      clickX: input.clickX,
      clickY: input.clickY,
      imageWidth: input.imageWidth,
      imageHeight: input.imageHeight,
    }),
  });

  const result = await response.json();

  // Convert base64 to Blob
  const binary = atob(result.maskBase64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) {
    bytes[i] = binary.charCodeAt(i);
  }

  return {
    maskBlob: new Blob([bytes], { type: 'image/png' }),
    iouScore: result.iouScore,
    maskStats: result.maskStats,
  };
};

// Create SAM tool with remote prediction
const samTool = new SamTool({
  predictFn,
  imageWidth: 1024,
  imageHeight: 1024,
  annotationProperties: {
    classification: 'positive',
  },
});

// Initialize (validates configuration)
await samTool.initializeModel();

// Set embedding path when image loads
samTool.setEmbedding('/embeddings/image_001.npy', 2048, 1536);

// Activate the tool
annotator.setTool(samTool);
```
Installation
SAM support is built into Annota. You need:
- ONNX Decoder Model (for backend inference)
- Python Encoder Model (for generating embeddings)
ONNX Decoder Models (for backend)
Download one of the quantized decoder models (~4.5MB each):
```bash
# ViT-B decoder (recommended - good balance of speed and quality)
wget https://bitrepo.oss-cn-shanghai.aliyuncs.com/models/sam/sam_onnx_quantized_vit_b.onnx

# ViT-H decoder (higher quality, same size due to quantization)
wget https://bitrepo.oss-cn-shanghai.aliyuncs.com/models/sam/sam_onnx_quantized_vit_h.onnx
```
Place these in your backend's models/ directory.
Python Encoder Models (for generating embeddings)
Download the corresponding PyTorch checkpoint for embedding generation:
```bash
# ViT-B encoder (~358MB - recommended)
wget https://bitrepo.oss-cn-shanghai.aliyuncs.com/models/sam/sam_vit_b_01ec64.pth

# ViT-H encoder (~2.4GB - highest quality)
wget https://bitrepo.oss-cn-shanghai.aliyuncs.com/models/sam/sam_vit_h_4b8939.pth
```
Model Pairing:
- Use sam_vit_b_01ec64.pth (Python) with sam_onnx_quantized_vit_b.onnx (backend)
- Use sam_vit_h_4b8939.pth (Python) with sam_onnx_quantized_vit_h.onnx (backend)
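If you support both variants, it can help to record the pairing in one place. The constant below is purely illustrative (not part of Annota), just a sketch of keeping the embedding pipeline and the backend on matching checkpoints:

```ts
// Illustrative only: keep encoder/decoder pairings together so the embedding
// pipeline and the backend never end up using mismatched checkpoints.
const SAM_MODEL_PAIRS = {
  vit_b: {
    encoder: 'sam_vit_b_01ec64.pth',          // Python, for generating embeddings
    decoder: 'sam_onnx_quantized_vit_b.onnx', // backend, for mask decoding
  },
  vit_h: {
    encoder: 'sam_vit_h_4b8939.pth',
    decoder: 'sam_onnx_quantized_vit_h.onnx',
  },
} as const;
```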
Generating Embeddings
```python
import torch
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM model (one-time setup)
# Use "vit_b" for sam_vit_b_01ec64.pth or "vit_h" for sam_vit_h_4b8939.pth
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
sam.to(device='cuda')  # or 'cpu' if no GPU
predictor = SamPredictor(sam)

# Load your image
import cv2
image = cv2.imread("image.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

# Generate embedding
predictor.set_image(image)
embedding = predictor.get_image_embedding().cpu().numpy()

# Save as .npy file
np.save("embeddings/image_001.npy", embedding)
```
Tip: Use the provided script for batch processing:
```bash
bash scripts/gen_sam_embeddings.sh -m vit_b \
  -i docs/public/playground/images/test \
  -o docs/public/playground/embeddings/test
```
Implementing a Backend
You need a backend that runs SAM inference. Here is one approach:
Node.js API (Next.js / Express)
Using onnxruntime-node for server-side inference:
```ts
// app/api/sam/route.ts (Next.js App Router)
import { NextRequest, NextResponse } from 'next/server';
import * as ort from 'onnxruntime-node';
import * as fs from 'fs';
import sharp from 'sharp';

// Singleton session for performance
let cachedSession: ort.InferenceSession | null = null;

async function getSession(modelPath: string) {
  if (!cachedSession) {
    cachedSession = await ort.InferenceSession.create(modelPath, {
      executionProviders: ['cpu'],
    });
  }
  return cachedSession;
}

export async function POST(request: NextRequest) {
  const { embeddingPath, clickX, clickY, imageWidth, imageHeight } = await request.json();

  // Load embedding from .npy file
  const embeddingData = loadNpyFile(embeddingPath);

  // Get ONNX session
  const session = await getSession('models/sam_onnx_quantized_vit_b.onnx');

  // Prepare inputs
  const modelScale = 1024 / Math.max(imageWidth, imageHeight);
  const feeds = {
    image_embeddings: new ort.Tensor('float32', embeddingData, [1, 256, 64, 64]),
    point_coords: new ort.Tensor('float32', [
      clickX * modelScale, clickY * modelScale,
      0, 0 // padding point
    ], [1, 2, 2]),
    point_labels: new ort.Tensor('float32', [1, -1], [1, 2]),
    mask_input: new ort.Tensor('float32', new Float32Array(256 * 256), [1, 1, 256, 256]),
    has_mask_input: new ort.Tensor('float32', [0], [1]),
    orig_im_size: new ort.Tensor('float32', [imageHeight, imageWidth], [2]),
  };

  const results = await session.run(feeds);

  // Find best mask by IoU score
  const iouData = results.iou_predictions.data as Float32Array;
  let bestIdx = 0;
  for (let i = 1; i < iouData.length; i++) {
    if (iouData[i] > iouData[bestIdx]) bestIdx = i;
  }

  // Convert mask to PNG using sharp
  const maskData = results.masks.data as Float32Array;
  const maskWidth = results.masks.dims[3];
  const maskHeight = results.masks.dims[2];
  const maskSize = maskWidth * maskHeight;

  const binaryMask = new Uint8Array(maskSize);
  let whiteCount = 0;
  for (let i = 0; i < maskSize; i++) {
    if (maskData[bestIdx * maskSize + i] > 0) {
      binaryMask[i] = 255;
      whiteCount++;
    }
  }

  const pngBuffer = await sharp(Buffer.from(binaryMask), {
    raw: { width: maskWidth, height: maskHeight, channels: 1 }
  }).png().toBuffer();

  return NextResponse.json({
    maskBase64: pngBuffer.toString('base64'),
    iouScore: iouData[bestIdx],
    maskStats: {
      width: maskWidth,
      height: maskHeight,
      whiteCount,
      blackCount: maskSize - whiteCount,
      foregroundRatio: whiteCount / maskSize,
      isEmpty: whiteCount === 0,
      isTiny: whiteCount / maskSize < 0.001,
    }
  });
}
```
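The route above calls a loadNpyFile helper that is not shown. Below is a minimal standalone sketch, assuming the embeddings were written by np.save() as little-endian float32 with shape [1, 256, 64, 64] (the SAM default); it is not a general-purpose .npy parser:

```ts
import * as fs from 'fs';

// Minimal .npy reader for SAM embeddings (assumes .npy format version 1.x,
// little-endian float32, shape [1, 256, 64, 64]).
function loadNpyFile(path: string): Float32Array {
  const buffer = fs.readFileSync(path);

  // .npy layout: magic "\x93NUMPY" (6 bytes), version (2 bytes),
  // header length (2 bytes, little-endian for v1.x), header string, raw data.
  const headerLength = buffer.readUInt16LE(8);
  const dataOffset = 10 + headerLength;

  // Copy the payload into an aligned ArrayBuffer and reinterpret as float32.
  const payload = buffer.subarray(dataOffset);
  const data = new Float32Array(
    payload.buffer.slice(payload.byteOffset, payload.byteOffset + payload.length)
  );

  // A SAM embedding is 1 * 256 * 64 * 64 floats (~4 MB).
  if (data.length !== 256 * 64 * 64) {
    throw new Error(`Unexpected embedding size in ${path}: ${data.length} floats`);
  }
  return data;
}
```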
Hover Preview & Click-to-Create
The SAM tool features a real-time hover preview that dramatically improves user experience:
Preview Behavior:
- Hover over any object → backend runs inference → blue overlay appears
- Preview updates in real-time as you move the mouse (throttled to 100ms)
- Move mouse outside image bounds → preview disappears
Click-to-Create:
- Click on the preview → instantly creates polygon annotation
- The cached preview mask is reused (no redundant inference)
- Clicking within 5 pixels of the last preview position uses the cache
- Annotation matches the preview shape exactly
Customization:
```ts
const samTool = new SamTool({
  predictFn,
  imageWidth: 1024,
  imageHeight: 1024,

  // Disable hover preview
  showHoverPreview: false,

  // Adjust preview transparency
  previewOpacity: 0.3, // 0.0 = invisible, 1.0 = opaque
});
```
API Reference
SamToolOptions
```ts
interface SamToolOptions {
  // Required: prediction function for backend inference
  predictFn: SamPredictFn;

  // Image dimensions
  imageWidth: number;
  imageHeight: number;

  // Optional: initial embedding (can be set later with setEmbedding)
  embedding?: Float32Array | string;

  // Preview options
  showHoverPreview?: boolean; // Show preview on hover (default: true)
  previewOpacity?: number;    // Preview overlay opacity (default: 0.5)

  // Annotation properties
  annotationProperties?: Partial<Annotation>;

  // Callbacks
  onAnnotationCreated?: (annotation: Annotation) => void;
  onPredictionStart?: () => void;
  onPredictionComplete?: (iouScore: number) => void;
  onError?: (error: Error) => void;
}
```
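The callbacks are handy for surfacing inference progress in your UI. A small sketch follows; the #sam-status element is an assumption of this example, not something Annota provides:

```ts
import { SamTool } from 'annota';

// Wire the optional callbacks to a status element in the page.
// '#sam-status' is hypothetical - use whatever your UI already has.
// `predictFn` is the prediction function from the Quick Start example.
const statusEl = document.querySelector('#sam-status')!;

const samTool = new SamTool({
  predictFn,
  imageWidth: 1024,
  imageHeight: 1024,
  onPredictionStart: () => {
    statusEl.textContent = 'Segmenting…';
  },
  onPredictionComplete: (iouScore) => {
    statusEl.textContent = `Mask ready (IoU ${iouScore.toFixed(2)})`;
  },
  onAnnotationCreated: (annotation) => {
    console.log('SAM annotation created', annotation);
  },
  onError: (error) => {
    statusEl.textContent = `SAM error: ${error.message}`;
  },
});
```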
SamPredictFn Type
```ts
// Input passed to your prediction function
interface SamPredictInput {
  embedding: Float32Array | string; // Embedding data or path
  clickX: number;
  clickY: number;
  imageWidth: number;
  imageHeight: number;
  positivePoints?: Array<{ x: number; y: number }>;
  negativePoints?: Array<{ x: number; y: number }>;
}

// Output your prediction function must return
interface SamPredictOutput {
  maskBlob: Blob;       // PNG mask as Blob
  iouScore: number;     // IoU confidence score
  maskStats: MaskStats; // Mask statistics
}

type SamPredictFn = (input: SamPredictInput) => Promise<SamPredictOutput>;
```
Methods
- initializeModel(): Validates configuration (lightweight, no model loading)
- isModelInitialized(): Check if the tool is ready
- setEmbedding(embedding, width, height): Update the embedding for a new image
- destroy(): Clean up resources
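A typical lifecycle that ties these methods together might look like the sketch below; the image-change and teardown hooks are placeholders for however your viewer signals those events:

```ts
// Validate configuration once, then activate the tool.
await samTool.initializeModel();
annotator.setTool(samTool);

// Whenever a new image is shown, point the tool at that image's embedding.
// `onImageChanged` is a placeholder for your own viewer event.
function onImageChanged(imageId: string, width: number, height: number) {
  samTool.setEmbedding(`/embeddings/${imageId}.npy`, width, height);
}

// When tearing down the viewer, release the tool's resources.
function onViewerDestroyed() {
  samTool.destroy();
}
```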
Do I need Python?
Short answer: Only once, to generate embeddings. After that, backend inference uses the pre-generated embeddings.
Why can’t I generate embeddings in the browser?
The SAM encoder model is ~350MB (too large for the browser) and computationally expensive (~5-10 seconds per image even on GPU). Pre-generating embeddings is much more practical.
How much disk space do embeddings take?
Each embedding is ~4MB ([1, 256, 64, 64] float32 tensor).
- 100 images: ~400 MB
- 1000 images: ~4 GB
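(The size follows directly from the tensor shape: 1 × 256 × 64 × 64 floats × 4 bytes per float = 4,194,304 bytes ≈ 4 MB per embedding.)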
Does the preview affect performance?
The preview runs inference on-demand while hovering, but the result is cached. When you click, the cached mask is instantly reused, making annotation creation feel instant (~0ms inference vs ~100-300ms without the cache).
Can I disable the hover preview?
Yes, set showHoverPreview: false in the SamTool options. The tool will still work - clicking will trigger inference and create annotations, just without the preview.
Why was browser ONNX removed?
ONNX Runtime's WASM build has significant memory leak issues on Safari, especially on Intel Macs. Moving inference to the backend completely avoids these issues and also reduces the client bundle size by ~20MB.