- ailia input shape: (16 * n_text_labels, 77)
  - `n_text_labels` is the number of text input labels
  - 16 is the number of prompt augmentations
- Preprocessing: apply the text prompts to the input text labels
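As a rough illustration of how this input can be built, the sketch below repeats a placeholder prompt template 16 times and tokenizes every label/template pair with the OpenAI `clip` package. The actual 16 templates and the tokenizer bundled with the sample script may differ.

```python
# Minimal sketch: building the (16 * n_text_labels, 77) text-encoder input.
# Assumes the OpenAI `clip` package is installed for tokenization; the 16 real
# ActionCLIP templates are replaced by a single placeholder repeated 16 times.
import numpy as np
import clip  # clip.tokenize pads/truncates every prompt to 77 tokens

labels = ["driving", "eating", "reading"]     # n_text_labels = 3
templates = ["a photo of action {}"] * 16     # placeholder for the 16 prompt templates

prompts = [t.format(label) for label in labels for t in templates]
tokens = clip.tokenize(prompts)               # token ids, shape (16 * 3, 77)
text_input = tokens.numpy().astype(np.int64)  # text-encoder input
print(text_input.shape)                       # (48, 77)
```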
- ailia input shape: (batch_size * num_segments, 3, 224, 224), RGB channel order
  - `num_segments` is the number of segments the model has been trained on
- Preprocessing: normalization using means [0.48145466, 0.4578275, 0.40821073] and standard deviations [0.26862954, 0.26130258, 0.27577711] (RGB channel order)
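A minimal sketch of that normalization, assuming the frames have already been sampled as RGB arrays (the resizing and frame-sampling details here are illustrative, not taken from the sample script):

```python
# Minimal sketch: per-frame preprocessing for the image encoder.
import cv2
import numpy as np

MEAN = np.array([0.48145466, 0.4578275, 0.40821073], dtype=np.float32)
STD = np.array([0.26862954, 0.26130258, 0.27577711], dtype=np.float32)

def preprocess(frames_rgb):
    """frames_rgb: iterable of HxWx3 uint8 RGB frames (batch_size * num_segments of them)."""
    batch = []
    for frame in frames_rgb:
        img = cv2.resize(frame, (224, 224)).astype(np.float32) / 255.0
        img = (img - MEAN) / STD              # per-channel normalization, RGB order
        batch.append(img.transpose(2, 0, 1))  # HWC -> CHW
    return np.stack(batch)                    # (batch_size * num_segments, 3, 224, 224)
```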
- ailia input shape: (batch_size, num_segments, 512)
  - `num_segments` is the number of segments the model has been trained on
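The fusion input is presumably the image-encoder output regrouped per clip; a minimal sketch with stand-in data (`num_segments = 8` matches the vit-32-8f models):

```python
# Minimal sketch: regrouping per-segment image features for the fusion model.
import numpy as np

batch_size, num_segments = 1, 8
image_features = np.zeros((batch_size * num_segments, 512), dtype=np.float32)  # stand-in data
fusion_input = image_features.reshape(batch_size, num_segments, 512)
print(fusion_input.shape)  # (1, 8, 512)
```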
- Zero-Shot Prediction
```
### Predicts the top 5 most likely labels among input text labels ###
==============================================================
class_count = 10
+ idx = 0
  category = 2 [driving]
  prob = 0.5602660179138184
+ idx = 1
  category = 3 [driving car]
  prob = 0.3114832639694214
+ idx = 2
  category = 4 [driving truck]
  prob = 0.12353289872407913
+ idx = 3
  category = 9 [talking phone]
  prob = 0.0027049153577536345
+ idx = 4
  category = 7 [reading]
  prob = 0.0006968703237362206
Script finished successfully.
```
- ailia Predict API output: `text_features` (encoded text features)
  - Shape: (n, 512)
- Features need to be normalized by the norm over `axis=1` before computing the similarity
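The normalization over `axis=1` is an L2 normalization of each feature row, e.g.:

```python
# L2-normalize feature rows (axis=1) before computing cosine similarities.
import numpy as np

def l2_normalize(features):
    """features: (n, 512) array; returns the rows scaled to unit length."""
    return features / np.linalg.norm(features, axis=1, keepdims=True)
```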
- ailia Predict API output: `image_features` (encoded image features)
  - Shape: (batch_size * num_segments, 512)
- ailia Predict API output: `fused_image_features` (fused image features)
  - Shape: (batch_size, 512)
- Features need to be normalized by the norm over `axis=1` before computing the similarity
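Putting the two normalized outputs together, the top-5 prediction shown in the example output can be sketched as below. The logit scale of 100 (the usual CLIP scaling) and the assumption that the text features have already been reduced to one row per label are illustrative choices, not necessarily what the sample script does.

```python
# Minimal sketch: normalized features -> softmax over labels -> top-5 categories.
import numpy as np

def top5(fused_image_features, text_features):
    """fused_image_features: (batch_size, 512); text_features: (n_text_labels, 512)."""
    img = fused_image_features / np.linalg.norm(fused_image_features, axis=1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=1, keepdims=True)
    logits = 100.0 * img @ txt.T                 # (batch_size, n_text_labels)
    logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    order = np.argsort(-probs[0])[:5]
    for rank, idx in enumerate(order):
        print(f"+ idx = {rank}")
        print(f"  category = {idx}")
        print(f"  prob = {probs[0, idx]}")
    return order
```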
This model requires additional packages.
```bash
pip3 install ftfy regex
```
The ONNX and prototxt files are downloaded automatically on the first run; an Internet connection is required while downloading.
The following runs a basic inference on the sample GIF.
```bash
$ python3 action_clip.py
```
By adding the `--video` option, you can run inference on a video file. Webcam input is not supported.
```bash
$ python3 action_clip.py --video VIDEO_PATH
```
You can use the `--text` option to specify custom text input labels.
```bash
$ python3 action_clip.py --video VIDEO_PATH --text "drinking" --text "eating" --text "laughing"
```
If you want to load custom text input labels from a file, use the `--desc_file` option (one label per line).
```bash
$ python3 action_clip.py --video VIDEO_PATH --desc_file imagenet_classes.txt
```
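The label file is plain text with one label per line; for example, a hypothetical `actions.txt` (passed as `--desc_file actions.txt`) could contain:

```
driving
driving car
eating
reading
```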
Framework: PyTorch 1.8.1

ONNX opset = 10

Model files:
- vit-32-8f-text_clip.onnx.prototxt
- vit-32-8f-image_clip.onnx.prototxt
- vit-32-8f-fusion.onnx.prototxt