add grounded samurai demo with dino-x
141
README.md
@@ -1,139 +1,28 @@
|
|||||||
<div align="center">
|
## Grounded SAMURAI
|
||||||
<img align="left" width="100" height="100" src="https://github.com/user-attachments/assets/1834fc25-42ef-4237-9feb-53a01c137e83" alt="">
|
|
||||||
|
|
||||||
# SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
|
This is our attempt at implementing Grounded SAMURAI for object tracking and segmentation in long videos.
|
||||||
|
|
||||||
[Cheng-Yen Yang](https://yangchris11.github.io), [Hsiang-Wei Huang](https://hsiangwei0903.github.io/), [Wenhao Chai](https://rese1f.github.io/), [Zhongyu Jiang](https://zhyjiang.github.io/#/), [Jenq-Neng Hwang](https://people.ece.uw.edu/hwang/)
|
[![Video Name]()](https://github.com/user-attachments/assets/51db13b6-1083-4c22-af14-c34e09403591)
|
||||||
|
|
||||||
[Information Processing Lab, University of Washington](https://ipl-uw.github.io/)
|
## Installation
|
||||||
</div>
|
|
||||||
|
|
||||||
|
### Install SAMURAI
|
||||||
|
Please refer to [SAMURAI Install](./SAMURAI_README.md) for more details.
|
||||||
|
|
||||||
[](https://paperswithcode.com/sota/visual-object-tracking-on-lasot-ext?p=samurai-adapting-segment-anything-model-for-1)
|
### Register on the Official Website to Get an API Token
|
||||||
[](https://paperswithcode.com/sota/visual-object-tracking-on-got-10k?p=samurai-adapting-segment-anything-model-for-1)
|
|
||||||
[](https://paperswithcode.com/sota/visual-object-tracking-on-needforspeed?p=samurai-adapting-segment-anything-model-for-1)
|
|
||||||
[](https://paperswithcode.com/sota/visual-object-tracking-on-lasot?p=samurai-adapting-segment-anything-model-for-1)
|
|
||||||
[](https://paperswithcode.com/sota/visual-object-tracking-on-otb-2015?p=samurai-adapting-segment-anything-model-for-1)
|
|
||||||
|
|
||||||
[[Arxiv]](https://arxiv.org/abs/2411.11922) [[Project Page]](https://yangchris11.github.io/samurai/) [[Raw Results]](https://drive.google.com/drive/folders/1ssiDmsC7mw5AiItYQG4poiR1JgRq305y?usp=sharing)
|
- **First-Time Application**: If you are interested in our project and want to try our algorithm, first apply for an API token through our [request API token website](https://cloud.deepdataspace.com/apply-token?from=github).
|
||||||
|
|
||||||
This repository is the official implementation of SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
|
- **Request Additional Token Quotas**: If you find our project helpful and need more API token quotas, you can request additional tokens by [filling out this form](https://docs.google.com/forms/d/e/1FAIpQLSfjogAtkgoVyFX9wvCAE15mD7QtHdKdKOrVmcE5GT1xu-03Aw/viewform?usp=sf_link). Our team will review your request and allocate more tokens for your use in one or two days. You can also apply for more tokens by sending us an email.
|
||||||
|
|
||||||
https://github.com/user-attachments/assets/9d368ca7-2e9b-4fed-9da0-d2efbf620d88
|
**Note:** If you encounter errors with the API, please install the latest version of `dds-cloudapi-sdk`:
|
||||||
|
|
||||||
All rights are reserved to the copyright owners (TM & © Universal (2019)). This clip is not intended for commercial use and is solely for academic demonstration in a research paper. Original source can be found [here](https://www.youtube.com/watch?v=cwUzUzpG8aM&t=4s).
|
```bash
|
||||||
|
pip install dds-cloudapi-sdk --upgrade
|
||||||
## Getting Started
|
|
||||||
|
|
||||||
#### SAMURAI Installation
|
|
||||||
|
|
||||||
SAM 2 needs to be installed first before use. The code requires `python>=3.10`, as well as `torch>=2.3.1` and `torchvision>=0.18.1`. Please follow the instructions [here](https://github.com/facebookresearch/sam2?tab=readme-ov-file) to install both PyTorch and TorchVision dependencies. You can install **the SAMURAI version** of SAM 2 on a GPU machine using:
|
|
||||||
```
|
|
||||||
cd sam2
|
|
||||||
pip install -e .
|
|
||||||
pip install -e ".[notebooks]"
|
|
||||||
```
|
```
|
||||||
|
|
||||||
Please see [INSTALL.md](https://github.com/facebookresearch/sam2/blob/main/INSTALL.md) from the original SAM 2 repository for FAQs on potential issues and solutions.
|
### Demos
|
||||||
|
|
||||||
Install other requirements:
|
```bash
|
||||||
```
|
python grounded_samurai_dinox.py
|
||||||
pip install matplotlib==3.7 tikzplotlib jpeg4py opencv-python lmdb pandas scipy loguru
|
|
||||||
```
|
|
||||||
|
|
||||||
#### SAM 2.1 Checkpoint Download
|
|
||||||
|
|
||||||
```
|
|
||||||
cd checkpoints && \
|
|
||||||
./download_ckpts.sh && \
|
|
||||||
cd ..
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Data Preparation
|
|
||||||
|
|
||||||
Please prepare the data in the following format:
|
|
||||||
```
|
|
||||||
data/LaSOT
|
|
||||||
├── airplane/
|
|
||||||
│ ├── airplane-1/
|
|
||||||
│ │ ├── full_occlusion.txt
|
|
||||||
│ │ ├── groundtruth.txt
|
|
||||||
│ │ ├── img
|
|
||||||
│ │ ├── nlp.txt
|
|
||||||
│ │ └── out_of_view.txt
|
|
||||||
│ ├── airplane-2/
|
|
||||||
│ ├── airplane-3/
|
|
||||||
│ ├── ...
|
|
||||||
├── basketball
|
|
||||||
├── bear
|
|
||||||
├── bicycle
|
|
||||||
...
|
|
||||||
├── training_set.txt
|
|
||||||
└── testing_set.txt
|
|
||||||
```
|
|
||||||
|
|
||||||
#### Main Inference
|
|
||||||
```
|
|
||||||
python scripts/main_inference.py
|
|
||||||
```
|
|
||||||
|
|
||||||
## Demo on Custom Video
|
|
||||||
|
|
||||||
To run the demo with your custom video or frame directory, use the following examples:
|
|
||||||
|
|
||||||
**Note:** The `.txt` file contains a single line with the bounding box of the first frame in `x,y,w,h` format.
|
|
||||||
|
|
||||||
### Input is Video File
|
|
||||||
|
|
||||||
```
|
|
||||||
python scripts/demo.py --video_path <your_video.mp4> --txt_path <path_to_first_frame_bbox.txt>
|
|
||||||
```
|
|
||||||
|
|
||||||
### Input is Frame Folder
|
|
||||||
```
|
|
||||||
# Only JPG images are supported
|
|
||||||
python scripts/demo.py --video_path <your_frame_directory> --txt_path <path_to_first_frame_bbox.txt>
|
|
||||||
```
|
|
||||||
|
|
||||||
## FAQs
|
|
||||||
**Question 1:** Does SAMURAI need training? [issue 34](https://github.com/yangchris11/samurai/issues/34)
|
|
||||||
|
|
||||||
**Answer 1:** Unlike real-life samurai, the proposed SAMURAI does not require additional training. It is a zero-shot method: we directly use the weights from SAM 2.1 to conduct VOT experiments. The Kalman filter is used to estimate the current and future state (bounding box location and scale in our case) of a moving object based on measurements over time. It is a common approach that has long been adopted in the tracking field and does not require any training. Please refer to the code for more detail.
|
|
||||||
|
|
||||||
**Question 2:** Does SAMURAI support streaming input (e.g. webcam)?
|
|
||||||
|
|
||||||
**Answer 2:** Not yet. The existing code doesn't support live/streaming video as we inherit most of the codebase from the amazing SAM 2. Some discussion that you might be interested in: facebookresearch/sam2#90, facebookresearch/sam2#388 (comment).
|
|
||||||
|
|
||||||
**Question 3:** How to use SAMURAI in longer video?
|
|
||||||
|
|
||||||
**Answer 3:** See the discussion from sam2 https://github.com/facebookresearch/sam2/issues/264.
|
|
||||||
|
|
||||||
|
|
||||||
## Acknowledgment
|
|
||||||
|
|
||||||
SAMURAI is built on top of [SAM 2](https://github.com/facebookresearch/sam2?tab=readme-ov-file) by Meta FAIR.
|
|
||||||
|
|
||||||
The VOT evaluation code is modified from [VOT Toolkit](https://github.com/votchallenge/toolkit) by Luka Čehovin Zajc.
|
|
||||||
|
|
||||||
## Citation
|
|
||||||
|
|
||||||
Please consider citing our paper and the wonderful `SAM 2` if you found our work interesting and useful.
|
|
||||||
```
|
|
||||||
@article{ravi2024sam2,
|
|
||||||
title={SAM 2: Segment Anything in Images and Videos},
|
|
||||||
author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
|
|
||||||
journal={arXiv preprint arXiv:2408.00714},
|
|
||||||
url={https://arxiv.org/abs/2408.00714},
|
|
||||||
year={2024}
|
|
||||||
}
|
|
||||||
|
|
||||||
@misc{yang2024samurai,
|
|
||||||
title={SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory},
|
|
||||||
author={Cheng-Yen Yang and Hsiang-Wei Huang and Wenhao Chai and Zhongyu Jiang and Jenq-Neng Hwang},
|
|
||||||
year={2024},
|
|
||||||
eprint={2411.11922},
|
|
||||||
archivePrefix={arXiv},
|
|
||||||
primaryClass={cs.CV},
|
|
||||||
url={https://arxiv.org/abs/2411.11922},
|
|
||||||
}
|
|
||||||
```
|
```
|
||||||
|
139
SAMURAI_README.md
Normal file
@@ -0,0 +1,139 @@
|
|||||||
|
<div align="center">
|
||||||
|
<img align="left" width="100" height="100" src="https://github.com/user-attachments/assets/1834fc25-42ef-4237-9feb-53a01c137e83" alt="">
|
||||||
|
|
||||||
|
# SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
|
||||||
|
|
||||||
|
[Cheng-Yen Yang](https://yangchris11.github.io), [Hsiang-Wei Huang](https://hsiangwei0903.github.io/), [Wenhao Chai](https://rese1f.github.io/), [Zhongyu Jiang](https://zhyjiang.github.io/#/), [Jenq-Neng Hwang](https://people.ece.uw.edu/hwang/)
|
||||||
|
|
||||||
|
[Information Processing Lab, University of Washington](https://ipl-uw.github.io/)
|
||||||
|
</div>
|
||||||
|
|
||||||
|
|
||||||
|
[](https://paperswithcode.com/sota/visual-object-tracking-on-lasot-ext?p=samurai-adapting-segment-anything-model-for-1)
|
||||||
|
[](https://paperswithcode.com/sota/visual-object-tracking-on-got-10k?p=samurai-adapting-segment-anything-model-for-1)
|
||||||
|
[](https://paperswithcode.com/sota/visual-object-tracking-on-needforspeed?p=samurai-adapting-segment-anything-model-for-1)
|
||||||
|
[](https://paperswithcode.com/sota/visual-object-tracking-on-lasot?p=samurai-adapting-segment-anything-model-for-1)
|
||||||
|
[](https://paperswithcode.com/sota/visual-object-tracking-on-otb-2015?p=samurai-adapting-segment-anything-model-for-1)
|
||||||
|
|
||||||
|
[[Arxiv]](https://arxiv.org/abs/2411.11922) [[Project Page]](https://yangchris11.github.io/samurai/) [[Raw Results]](https://drive.google.com/drive/folders/1ssiDmsC7mw5AiItYQG4poiR1JgRq305y?usp=sharing)
|
||||||
|
|
||||||
|
This repository is the official implementation of SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory
|
||||||
|
|
||||||
|
https://github.com/user-attachments/assets/9d368ca7-2e9b-4fed-9da0-d2efbf620d88
|
||||||
|
|
||||||
|
All rights are reserved to the copyright owners (TM & © Universal (2019)). This clip is not intended for commercial use and is solely for academic demonstration in a research paper. Original source can be found [here](https://www.youtube.com/watch?v=cwUzUzpG8aM&t=4s).
|
||||||
|
|
||||||
|
## Getting Started
|
||||||
|
|
||||||
|
#### SAMURAI Installation
|
||||||
|
|
||||||
|
SAM 2 needs to be installed first before use. The code requires `python>=3.10`, as well as `torch>=2.3.1` and `torchvision>=0.18.1`. Please follow the instructions [here](https://github.com/facebookresearch/sam2?tab=readme-ov-file) to install both PyTorch and TorchVision dependencies. You can install **the SAMURAI version** of SAM 2 on a GPU machine using:
|
||||||
|
```
|
||||||
|
cd sam2
|
||||||
|
pip install -e .
|
||||||
|
pip install -e ".[notebooks]"
|
||||||
|
```
|
||||||
|
|
||||||
|
Please see [INSTALL.md](https://github.com/facebookresearch/sam2/blob/main/INSTALL.md) from the original SAM 2 repository for FAQs on potential issues and solutions.
|
||||||
|
|
||||||
|
Install other requirements:
|
||||||
|
```
|
||||||
|
pip install matplotlib==3.7 tikzplotlib jpeg4py opencv-python lmdb pandas scipy loguru
|
||||||
|
```
|
||||||
|
|
||||||
|
#### SAM 2.1 Checkpoint Download
|
||||||
|
|
||||||
|
```
|
||||||
|
cd checkpoints && \
|
||||||
|
./download_ckpts.sh && \
|
||||||
|
cd ..
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Data Preparation
|
||||||
|
|
||||||
|
Please prepare the data in the following format:
|
||||||
|
```
|
||||||
|
data/LaSOT
|
||||||
|
├── airplane/
|
||||||
|
│ ├── airplane-1/
|
||||||
|
│ │ ├── full_occlusion.txt
|
||||||
|
│ │ ├── groundtruth.txt
|
||||||
|
│ │ ├── img
|
||||||
|
│ │ ├── nlp.txt
|
||||||
|
│ │ └── out_of_view.txt
|
||||||
|
│ ├── airplane-2/
|
||||||
|
│ ├── airplane-3/
|
||||||
|
│ ├── ...
|
||||||
|
├── basketball
|
||||||
|
├── bear
|
||||||
|
├── bicycle
|
||||||
|
...
|
||||||
|
├── training_set.txt
|
||||||
|
└── testing_set.txt
|
||||||
|
```
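
Before running inference, it can help to confirm that every sequence folder contains the files shown above. The following is a minimal sketch and is not part of the SAMURAI codebase: the helper name `find_incomplete_sequences` and the decision to require only `groundtruth.txt` and `img` are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Assumption: these two entries are treated as the minimum a sequence needs.
REQUIRED_ENTRIES = ("groundtruth.txt", "img")


def find_incomplete_sequences(lasot_root):
    """Return {"category/sequence": [missing entries]} for incomplete sequences."""
    problems = {}
    root = Path(lasot_root)
    # Category folders (airplane, basketball, ...) sit directly under the root;
    # plain files such as training_set.txt are skipped by the is_dir() filter.
    for category in sorted(p for p in root.iterdir() if p.is_dir()):
        for sequence in sorted(p for p in category.iterdir() if p.is_dir()):
            missing = [name for name in REQUIRED_ENTRIES
                       if not (sequence / name).exists()]
            if missing:
                problems[f"{category.name}/{sequence.name}"] = missing
    return problems


# Demo on a throwaway directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "LaSOT"
    (root / "airplane" / "airplane-1" / "img").mkdir(parents=True)
    (root / "airplane" / "airplane-1" / "groundtruth.txt").write_text("1,2,3,4\n")
    (root / "airplane" / "airplane-2" / "img").mkdir(parents=True)  # no groundtruth.txt
    demo_report = find_incomplete_sequences(root)
    print(demo_report)
```

On a real dataset you would call `find_incomplete_sequences("data/LaSOT")` and fix any sequence it reports before starting the run.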
|
||||||
|
|
||||||
|
#### Main Inference
|
||||||
|
```
|
||||||
|
python scripts/main_inference.py
|
||||||
|
```
|
||||||
|
|
||||||
|
## Demo on Custom Video
|
||||||
|
|
||||||
|
To run the demo with your custom video or frame directory, use the following examples:
|
||||||
|
|
||||||
|
**Note:** The `.txt` file contains a single line with the bounding box of the first frame in `x,y,w,h` format.
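
As an illustration of the bounding box file, the sketch below writes and reads a single-line `x,y,w,h` file and converts it to corner coordinates. The helper names are hypothetical, and it assumes that `x,y` is the top-left corner in pixels; check `scripts/demo.py` if your boxes look shifted.

```python
import tempfile
from pathlib import Path


def write_first_frame_bbox(txt_path, x, y, w, h):
    """Write one line in `x,y,w,h` format."""
    Path(txt_path).write_text(f"{x},{y},{w},{h}\n")


def read_first_frame_bbox(txt_path):
    """Read the first line back as four integers (x, y, w, h)."""
    first_line = Path(txt_path).read_text().splitlines()[0]
    x, y, w, h = (int(float(value)) for value in first_line.split(","))
    return x, y, w, h


def xywh_to_xyxy(box):
    """Convert (x, y, w, h) to corner format (x1, y1, x2, y2)."""
    x, y, w, h = box
    return x, y, x + w, y + h


# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    bbox_file = Path(tmp) / "first_frame_bbox.txt"
    write_first_frame_bbox(bbox_file, 10, 20, 100, 50)
    demo_box = read_first_frame_bbox(bbox_file)
    demo_corners = xywh_to_xyxy(demo_box)
    print(demo_box, demo_corners)
```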
|
||||||
|
|
||||||
|
### Input is Video File
|
||||||
|
|
||||||
|
```
|
||||||
|
python scripts/demo.py --video_path <your_video.mp4> --txt_path <path_to_first_frame_bbox.txt>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Input is Frame Folder
|
||||||
|
```
|
||||||
|
# Only JPG images are supported
|
||||||
|
python scripts/demo.py --video_path <your_frame_directory> --txt_path <path_to_first_frame_bbox.txt>
|
||||||
|
```
|
||||||
|
|
||||||
|
## FAQs
|
||||||
|
**Question 1:** Does SAMURAI need training? [issue 34](https://github.com/yangchris11/samurai/issues/34)
|
||||||
|
|
||||||
|
**Answer 1:** Unlike real-life samurai, the proposed SAMURAI does not require additional training. It is a zero-shot method: we directly use the weights from SAM 2.1 to conduct VOT experiments. The Kalman filter is used to estimate the current and future state (bounding box location and scale in our case) of a moving object based on measurements over time. It is a common approach that has long been adopted in the tracking field and does not require any training. Please refer to the code for more detail.
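
To make the "no training" point concrete, here is a minimal sketch of a constant-velocity Kalman filter over a box state. It is a generic textbook formulation, not the implementation used in this repository: the class name `BoxKalmanFilter`, the `[cx, cy, w, h]` state layout, and the noise values are all assumptions chosen for illustration.

```python
import numpy as np


class BoxKalmanFilter:
    """Constant-velocity Kalman filter over [cx, cy, w, h] plus their velocities."""

    def __init__(self, box, process_var=1e-2, measure_var=1e-1):
        dim = 4
        # State: box values followed by one velocity per box value (start at zero).
        self.x = np.concatenate([np.asarray(box, dtype=float), np.zeros(dim)])
        self.P = np.eye(2 * dim)                 # state covariance
        self.F = np.eye(2 * dim)                 # transition: value += velocity
        self.F[:dim, dim:] = np.eye(dim)
        self.H = np.hstack([np.eye(dim), np.zeros((dim, dim))])  # observe box only
        self.Q = process_var * np.eye(2 * dim)   # process noise (assumed value)
        self.R = measure_var * np.eye(dim)       # measurement noise (assumed value)

    def predict(self):
        """Advance the state by one frame and return the predicted box."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:4].copy()

    def update(self, box):
        """Correct the state with a measured box and return the filtered box."""
        z = np.asarray(box, dtype=float)
        innovation = z - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ innovation
        self.P = (np.eye(self.P.shape[0]) - K @ self.H) @ self.P
        return self.x[:4].copy()


# Usage: an object moving 2 px per frame to the right with a fixed size.
kf = BoxKalmanFilter([100.0, 50.0, 40.0, 80.0])
for t in range(1, 31):
    kf.predict()
    kf.update([100.0 + 2.0 * t, 50.0, 40.0, 80.0])
next_box = kf.predict()  # estimate for the frame after the last measurement
print(next_box)
```

All of the filter's parameters are set by hand, so there are no weights to learn; the estimate comes purely from the motion model and the incoming measurements.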
|
||||||
|
|
||||||
|
**Question 2:** Does SAMURAI support streaming input (e.g. webcam)?
|
||||||
|
|
||||||
|
**Answer 2:** Not yet. The existing code doesn't support live/streaming video as we inherit most of the codebase from the amazing SAM 2. Some discussion that you might be interested in: facebookresearch/sam2#90, facebookresearch/sam2#388 (comment).
|
||||||
|
|
||||||
|
**Question 3:** How to use SAMURAI in longer video?
|
||||||
|
|
||||||
|
**Answer 3:** See the discussion from sam2 https://github.com/facebookresearch/sam2/issues/264.
|
||||||
|
|
||||||
|
|
||||||
|
## Acknowledgment
|
||||||
|
|
||||||
|
SAMURAI is built on top of [SAM 2](https://github.com/facebookresearch/sam2?tab=readme-ov-file) by Meta FAIR.
|
||||||
|
|
||||||
|
The VOT evaluation code is modified from [VOT Toolkit](https://github.com/votchallenge/toolkit) by Luka Čehovin Zajc.
|
||||||
|
|
||||||
|
## Citation
|
||||||
|
|
||||||
|
Please consider citing our paper and the wonderful `SAM 2` if you found our work interesting and useful.
|
||||||
|
```
|
||||||
|
@article{ravi2024sam2,
|
||||||
|
title={SAM 2: Segment Anything in Images and Videos},
|
||||||
|
author={Ravi, Nikhila and Gabeur, Valentin and Hu, Yuan-Ting and Hu, Ronghang and Ryali, Chaitanya and Ma, Tengyu and Khedr, Haitham and R{\"a}dle, Roman and Rolland, Chloe and Gustafson, Laura and Mintun, Eric and Pan, Junting and Alwala, Kalyan Vasudev and Carion, Nicolas and Wu, Chao-Yuan and Girshick, Ross and Doll{\'a}r, Piotr and Feichtenhofer, Christoph},
|
||||||
|
journal={arXiv preprint arXiv:2408.00714},
|
||||||
|
url={https://arxiv.org/abs/2408.00714},
|
||||||
|
year={2024}
|
||||||
|
}
|
||||||
|
|
||||||
|
@misc{yang2024samurai,
|
||||||
|
title={SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory},
|
||||||
|
author={Cheng-Yen Yang and Hsiang-Wei Huang and Wenhao Chai and Zhongyu Jiang and Jenq-Neng Hwang},
|
||||||
|
year={2024},
|
||||||
|
eprint={2411.11922},
|
||||||
|
archivePrefix={arXiv},
|
||||||
|
primaryClass={cs.CV},
|
||||||
|
url={https://arxiv.org/abs/2411.11922},
|
||||||
|
}
|
||||||
|
```
|
226
grounded_samurai_dinox.py
Normal file
@@ -0,0 +1,226 @@
|
|||||||
|
# libraries for SAMURAI
|
||||||
|
import os
|
||||||
|
import cv2
|
||||||
|
import torch
|
||||||
|
import numpy as np
|
||||||
|
import supervision as sv
|
||||||
|
import sys
|
||||||
|
from pathlib import Path
|
||||||
|
from tqdm import tqdm
|
||||||
|
from PIL import Image
|
||||||
|
sys.path.append("./sam2")
|
||||||
|
from sam2.build_sam import build_sam2_video_predictor
|
||||||
|
|
||||||
|
# dds cloudapi for DINO-X
|
||||||
|
from dds_cloudapi_sdk import Config
|
||||||
|
from dds_cloudapi_sdk import Client
|
||||||
|
from dds_cloudapi_sdk.tasks.dinox import DinoxTask
|
||||||
|
from dds_cloudapi_sdk.tasks.types import DetectionTarget
|
||||||
|
from dds_cloudapi_sdk import TextPrompt
|
||||||
|
|
||||||
|
"""
|
||||||
|
Hyperparameters for grounding and tracking
|
||||||
|
"""
|
||||||
|
VIDEO_PATH = "demo.mp4"
|
||||||
|
TEXT_PROMPT = "person."
|
||||||
|
OUTPUT_VIDEO_PATH = "./tracking_demo.mp4"
|
||||||
|
SOURCE_VIDEO_FRAME_DIR = "./custom_video_frames"
|
||||||
|
SAVE_TRACKING_RESULTS_DIR = "./tracking_results"
|
||||||
|
API_TOKEN_FOR_DINOX = "Your API token"
|
||||||
|
PROMPT_TYPE_FOR_VIDEO = "box" # choose from ["point", "box", "mask"]
|
||||||
|
BOX_THRESHOLD = 0.2
|
||||||
|
|
||||||
|
"""
|
||||||
|
Step 1: Environment settings and model initialization for SAM 2
|
||||||
|
"""
|
||||||
|
# use bfloat16 autocast for the entire script
|
||||||
|
torch.autocast(device_type="cuda", dtype=torch.bfloat16).__enter__()
|
||||||
|
|
||||||
|
if torch.cuda.get_device_properties(0).major >= 8:
    # turn on tfloat32 for Ampere GPUs (https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-32-tf32-on-ampere-devices)
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
|
||||||
|
|
||||||
|
# init sam image predictor and video predictor model
|
||||||
|
sam2_checkpoint = "/comp_robot/rentianhe/code/samurai/sam2/checkpoints/sam2.1_hiera_large.pt"
|
||||||
|
model_cfg = "configs/samurai/sam2.1_hiera_l.yaml"
|
||||||
|
|
||||||
|
video_predictor = build_sam2_video_predictor(model_cfg, sam2_checkpoint)
|
||||||
|
|
||||||
|
# # `video_dir` a directory of JPEG frames with filenames like `<frame_index>.jpg`
|
||||||
|
# video_dir = "notebooks/videos/bedroom"
|
||||||
|
|
||||||
|
"""
|
||||||
|
Custom video input directly using video files
|
||||||
|
"""
|
||||||
|
video_info = sv.VideoInfo.from_video_path(VIDEO_PATH) # get video info
|
||||||
|
print(video_info)
|
||||||
|
frame_generator = sv.get_video_frames_generator(VIDEO_PATH, stride=1, start=0, end=None)
|
||||||
|
|
||||||
|
# saving video to frames
|
||||||
|
source_frames = Path(SOURCE_VIDEO_FRAME_DIR)
|
||||||
|
source_frames.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
|
with sv.ImageSink(
    target_dir_path=source_frames,
    overwrite=True,
    image_name_pattern="{:05d}.jpg"
) as sink:
    for frame in tqdm(frame_generator, desc="Saving Video Frames"):
        sink.save_image(frame)
|
||||||
|
|
||||||
|
# scan all the JPEG frame names in this directory
|
||||||
|
frame_names = [
    p for p in os.listdir(SOURCE_VIDEO_FRAME_DIR)
    if os.path.splitext(p)[-1] in [".jpg", ".jpeg", ".JPG", ".JPEG"]
]
|
||||||
|
frame_names.sort(key=lambda p: int(os.path.splitext(p)[0]))
|
||||||
|
|
||||||
|
# init video predictor state
|
||||||
|
inference_state = video_predictor.init_state(video_path=SOURCE_VIDEO_FRAME_DIR)
|
||||||
|
|
||||||
|
ann_frame_idx = 0 # the frame index we interact with
|
||||||
|
"""
|
||||||
|
Step 2: Prompt DINO-X with Cloud API for box coordinates
|
||||||
|
"""
|
||||||
|
|
||||||
|
# prompt DINO-X to get the box coordinates on a specific frame
|
||||||
|
img_path = os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[ann_frame_idx])
|
||||||
|
image = Image.open(img_path)
|
||||||
|
|
||||||
|
# Step 1: initialize the config
|
||||||
|
config = Config(API_TOKEN_FOR_DINOX)
|
||||||
|
|
||||||
|
# Step 2: initialize the client
|
||||||
|
client = Client(config)
|
||||||
|
|
||||||
|
# Step 3: run the task with the DinoxTask class
|
||||||
|
# image_url = "https://algosplt.oss-cn-shenzhen.aliyuncs.com/test_files/tasks/detection/iron_man.jpg"
|
||||||
|
# if you are processing local image file, upload them to DDS server to get the image url
|
||||||
|
image_url = client.upload_file(img_path)
|
||||||
|
|
||||||
|
task = DinoxTask(
    image_url=image_url,
    prompts=[TextPrompt(text=TEXT_PROMPT)],
    bbox_threshold=BOX_THRESHOLD,  # use the threshold configured at the top of the script
    targets=[DetectionTarget.BBox],
)
|
||||||
|
|
||||||
|
client.run_task(task)
|
||||||
|
result = task.result
|
||||||
|
|
||||||
|
objects = result.objects # the list of detected objects
|
||||||
|
|
||||||
|
|
||||||
|
input_boxes = []
|
||||||
|
confidences = []
|
||||||
|
class_names = []
|
||||||
|
|
||||||
|
for idx, obj in enumerate(objects):
    input_boxes.append(obj.bbox)
    confidences.append(obj.score)
    class_names.append(obj.category)
|
||||||
|
|
||||||
|
input_boxes = np.array(input_boxes)
|
||||||
|
|
||||||
|
print(input_boxes)
|
||||||
|
|
||||||
|
# process the detection results
|
||||||
|
OBJECTS = class_names
|
||||||
|
|
||||||
|
print(OBJECTS)
|
||||||
|
|
||||||
|
"""
|
||||||
|
Step 3: Register each object's prompt to the video predictor with a separate add_new_points_or_box call
|
||||||
|
"""
|
||||||
|
|
||||||
|
assert PROMPT_TYPE_FOR_VIDEO in ["point", "box", "mask"], "SAM 2 video predictor only supports point/box/mask prompts"
|
||||||
|
|
||||||
|
# Using box prompt
|
||||||
|
if PROMPT_TYPE_FOR_VIDEO == "box":
    for object_id, (label, box) in enumerate(zip(OBJECTS, input_boxes), start=1):
        _, out_obj_ids, out_mask_logits = video_predictor.add_new_points_or_box(
            inference_state=inference_state,
            frame_idx=ann_frame_idx,
            obj_id=object_id,
            box=box,
        )
        # stop after registering the first detected object
        break
|
||||||
|
|
||||||
|
"""
|
||||||
|
Step 4: Propagate the video predictor to get the segmentation results for each frame
|
||||||
|
"""
|
||||||
|
video_segments = {} # video_segments contains the per-frame segmentation results
|
||||||
|
for out_frame_idx, out_obj_ids, out_mask_logits in video_predictor.propagate_in_video(inference_state):
    video_segments[out_frame_idx] = {
        out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
        for i, out_obj_id in enumerate(out_obj_ids)
    }
|
||||||
|
|
||||||
|
"""
|
||||||
|
Step 5: Visualize the segment results across the video and save them
|
||||||
|
"""
|
||||||
|
|
||||||
|
if not os.path.exists(SAVE_TRACKING_RESULTS_DIR):
    os.makedirs(SAVE_TRACKING_RESULTS_DIR)
|
||||||
|
|
||||||
|
ID_TO_OBJECTS = {i: obj for i, obj in enumerate(OBJECTS, start=1)}
|
||||||
|
|
||||||
|
for frame_idx, segments in video_segments.items():
    img = cv2.imread(os.path.join(SOURCE_VIDEO_FRAME_DIR, frame_names[frame_idx]))

    object_ids = list(segments.keys())
    masks = list(segments.values())
    masks = np.concatenate(masks, axis=0)

    detections = sv.Detections(
        xyxy=sv.mask_to_xyxy(masks),  # (n, 4)
        mask=masks,  # (n, h, w)
        class_id=np.array(object_ids, dtype=np.int32),
    )
    box_annotator = sv.BoxAnnotator()
    annotated_frame = box_annotator.annotate(scene=img.copy(), detections=detections)
    label_annotator = sv.LabelAnnotator()
    annotated_frame = label_annotator.annotate(annotated_frame, detections=detections, labels=[ID_TO_OBJECTS[i] for i in object_ids])
    mask_annotator = sv.MaskAnnotator()
    annotated_frame = mask_annotator.annotate(scene=annotated_frame, detections=detections)
    cv2.imwrite(os.path.join(SAVE_TRACKING_RESULTS_DIR, f"annotated_frame_{frame_idx:05d}.jpg"), annotated_frame)
|
||||||
|
|
||||||
|
|
||||||
|
"""
|
||||||
|
Step 6: Convert the annotated frames to video
|
||||||
|
"""
|
||||||
|
|
||||||
|
def create_video_from_images(image_folder, output_video_path, frame_rate=25):
    # define valid extensions
    valid_extensions = [".jpg", ".jpeg", ".JPG", ".JPEG", ".png", ".PNG"]

    # get all image files in the folder
    image_files = [f for f in os.listdir(image_folder)
                   if os.path.splitext(f)[1] in valid_extensions]
    image_files.sort()  # sort the files in alphabetical order
    print(image_files)
    if not image_files:
        raise ValueError("No valid image files found in the specified folder.")

    # load the first image to get the dimensions of the video
    first_image_path = os.path.join(image_folder, image_files[0])
    first_image = cv2.imread(first_image_path)
    height, width, _ = first_image.shape

    # create a video writer
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')  # codec for saving the video
    video_writer = cv2.VideoWriter(output_video_path, fourcc, frame_rate, (width, height))

    # write each image to the video
    for image_file in tqdm(image_files):
        image_path = os.path.join(image_folder, image_file)
        image = cv2.imread(image_path)
        video_writer.write(image)

    # release the video writer
    video_writer.release()
    print(f"Video saved at {output_video_path}")
|
||||||
|
|
||||||
|
|
||||||
|
create_video_from_images(SAVE_TRACKING_RESULTS_DIR, OUTPUT_VIDEO_PATH)
|