Grounded-SAM-2/sam2/configs/sam2.1/sam2.1_hiera_s.yaml

# @package _global_

# Model
model:
  _target_: sam2.modeling.sam2_base.SAM2Base
  image_encoder:
    _target_: sam2.modeling.backbones.image_encoder.ImageEncoder
    scalp: 1
    trunk:
      _target_: sam2.modeling.backbones.hieradet.Hiera
      embed_dim: 96
      num_heads: 1
      stages: [1, 2, 11, 2]
      global_att_blocks: [7, 10, 13]
      window_pos_embed_bkg_spatial_size: [7, 7]
    neck:
      _target_: sam2.modeling.backbones.image_encoder.FpnNeck
      position_encoding:
        _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
        num_pos_feats: 256
        normalize: true
        scale: null
        temperature: 10000
      d_model: 256
      backbone_channel_list: [768, 384, 192, 96]
      fpn_top_down_levels: [2, 3]  # output level 0 and 1 directly use the backbone features
      fpn_interp_model: nearest
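      # Note (derived from the values above, assuming Hiera doubles the channel
      # width at each stage, which this file does not state):
      #   stages [1, 2, 11, 2] -> 1 + 2 + 11 + 2 = 16 blocks, indexed 0-15
      #   widths per stage     -> 96, 192, 384, 768
      # Under that assumption backbone_channel_list is the same sequence listed
      # from the deepest stage to the shallowest, and global_att_blocks
      # [7, 10, 13] all fall inside the third stage (blocks 3-13).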

  memory_attention:
    _target_: sam2.modeling.memory_attention.MemoryAttention
    d_model: 256
    pos_enc_at_input: true
    layer:
      _target_: sam2.modeling.memory_attention.MemoryAttentionLayer
      activation: relu
      dim_feedforward: 2048
      dropout: 0.1
      pos_enc_at_attn: false
      self_attention:
        _target_: sam2.modeling.sam.transformer.RoPEAttention
        rope_theta: 10000.0
        feat_sizes: [64, 64]
        embedding_dim: 256
        num_heads: 1
        downsample_rate: 1
        dropout: 0.1
      d_model: 256
      pos_enc_at_cross_attn_keys: true
      pos_enc_at_cross_attn_queries: false
      cross_attention:
        _target_: sam2.modeling.sam.transformer.RoPEAttention
        rope_theta: 10000.0
        feat_sizes: [64, 64]
        rope_k_repeat: True
        embedding_dim: 256
        num_heads: 1
        downsample_rate: 1
        dropout: 0.1
        kv_in_dim: 64
    num_layers: 4
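    # Note (assumption: the backbone features fed to memory attention have
    # stride 16, which this file does not state): feat_sizes [64, 64] then
    # corresponds to image_size / 16 = 1024 / 16 = 64 per side.
    # kv_in_dim: 64 in cross_attention equals memory_encoder.out_dim below,
    # so the memory features are narrower than the 256-dim queries.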

  memory_encoder:
      _target_: sam2.modeling.memory_encoder.MemoryEncoder
      out_dim: 64
      position_encoding:
        _target_: sam2.modeling.position_encoding.PositionEmbeddingSine
        num_pos_feats: 64
        normalize: true
        scale: null
        temperature: 10000
      mask_downsampler:
        _target_: sam2.modeling.memory_encoder.MaskDownSampler
        kernel_size: 3
        stride: 2
        padding: 1
      fuser:
        _target_: sam2.modeling.memory_encoder.Fuser
        layer:
          _target_: sam2.modeling.memory_encoder.CXBlock
          dim: 256
          kernel_size: 7
          padding: 3
          layer_scale_init_value: 1e-6
          use_dwconv: True  # depth-wise convs
        num_layers: 2

  num_maskmem: 7
  image_size: 1024
  # apply scaled sigmoid on mask logits for memory encoder, and directly feed input mask as output mask
  sigmoid_scale_for_mem_enc: 20.0
  sigmoid_bias_for_mem_enc: -10.0
  use_mask_input_as_output_without_sam: true
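  # Sketch of the rescaling configured above, assuming the scale is applied
  # before the bias (the order is not stated in this file):
  #   mask_for_mem = sigmoid(mask_logits) * 20.0 + (-10.0)
  # which would map the sigmoid output from (0, 1) to (-10, 10).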
  # Memory
  directly_add_no_mem_embed: true
  no_obj_embed_spatial: true
  # use high-resolution feature map in the SAM mask decoder
  use_high_res_features_in_sam: true
  # output 3 masks on the first click on initial conditioning frames
  multimask_output_in_sam: true
  # SAM heads
  iou_prediction_use_sigmoid: True
  # cross-attend to object pointers from other frames (based on SAM output tokens) in the encoder
  use_obj_ptrs_in_encoder: true
  add_tpos_enc_to_obj_ptrs: true
  proj_tpos_enc_in_obj_ptrs: true
  use_signed_tpos_enc_to_obj_ptrs: true
  only_obj_ptrs_in_the_past_for_eval: true
  # object occlusion prediction
  pred_obj_scores: true
  pred_obj_scores_mlp: true
  fixed_no_obj_ptr: true
  # multimask tracking settings
  multimask_output_for_tracking: true
  use_multimask_token_for_obj_ptr: true
  multimask_min_pt_num: 0
  multimask_max_pt_num: 1
  use_mlp_for_obj_ptr_proj: true
  # Compilation flag
  compile_image_encoder: False