This is the official implementation of our paper:
SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions
- Install all requirements in `requirements.txt`.
- Download the models into the `models/` folder.
- Create a new empty folder named `uploads/`.
- Create an ngrok account so you can use the model from a browser over the internet.
- Copy your ngrok authtoken.
- Edit the `website.py` script by replacing the `YOUR_TOKEN_GOES_HERE` string with your ngrok authtoken (see the sketch below).
- Run the script: `python website.py`
- Enjoy ;)
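For reference, here is a minimal sketch of how an ngrok authtoken is typically wired into a web demo with the `pyngrok` package. This is an illustration only, not the actual contents of `website.py`; the Flask app, route, and port are assumptions.

```python
# A minimal sketch, assuming website.py serves a Flask app tunneled via pyngrok.
# The app, route, and port below are hypothetical placeholders.
from flask import Flask
from pyngrok import ngrok

NGROK_AUTHTOKEN = 'YOUR_TOKEN_GOES_HERE'  # replace with your ngrok authtoken

app = Flask(__name__)

@app.route('/')
def index():
    return 'SBVQA 2.0 demo is up!'

if __name__ == '__main__':
    ngrok.set_auth_token(NGROK_AUTHTOKEN)   # register the authtoken with ngrok
    public_url = ngrok.connect(5000)        # open a public tunnel to local port 5000
    print(f'Public URL: {public_url}')
    app.run(port=5000)                      # serve the app locally behind the tunnel
```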
The recorded question was: "What kind of animals is this?"
The recorded question was: "What is the color of the topping of the cake?"
SBVQA 2.0 dataset = SBVQA 1.0 dataset + The complementary spoken questions
- SBVQA 1.0 original data (identical copy of the data from zted/sbvqa repo): Download
- The complementary spoken questions: Download
You can also download `mp3_files_by_question.pkl`, a mapper where the key is the textual question and the value is the `.mp3` file name, from this link.
To load the mapper, use the following code snippet:
```python
import re
import pickle

def clean_question(text):
    # Lowercase, strip all non-alphabetic characters, and collapse whitespace
    text = text.lower()
    return ' '.join(re.sub(u"[^a-zA-Z ]", "", text, flags=re.UNICODE).split())

# Load the question -> .mp3 file name mapper
with open('mp3_files_by_question.pkl', 'rb') as f:
    mp3_files_by_question_mapper = pickle.load(f)

textual_question = 'Is this a modern interior?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000010.mp3'

textual_question = 'Where can milk be obtained?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000011.mp3'

textual_question = 'What are the payment method of the parking meter?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000012.mp3'
```
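Note that a plain indexing lookup raises `KeyError` when a question is missing from the mapper. A small convenience wrapper, reusing `clean_question` and the mapper from the snippet above, might look like the following; the helper name and `audio_dir` default are assumptions for illustration, not part of the repo:

```python
import os

def find_mp3(question, mapper, audio_dir='.'):
    # Hypothetical helper: returns the full path of the spoken question's
    # .mp3 file, or None if the (cleaned) question is not in the mapper.
    mp3_name = mapper.get(clean_question(question))
    return os.path.join(audio_dir, mp3_name) if mp3_name is not None else None

print(find_mp3('Is this a modern interior?', mp3_files_by_question_mapper))
# e.g. './complementary_0000010.mp3'
```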
These links were taken from the VQA website.
- BLIP features (train2014 images): Download
- BLIP features (val2014 images): Download
- Speech features of the whole SBVQA 2.0 dataset (Joanna voice only): Download
- NeMo speech encoder checkpoint: `stt_en_conformer_ctc_large_24500_hours_bpe.nemo`
- Best SBVQA 2.0 checkpoint: `best_sbvqa_2.0_model.pt`
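To sanity-check the NeMo checkpoint after downloading, it can be restored with the standard NeMo API. This is a minimal sketch: `restore_from()` is NeMo's standard loading entry point, while the `EncDecCTCModelBPE` class is an assumption based on the checkpoint's name.

```python
import nemo.collections.asr as nemo_asr

# Restore the Conformer-CTC checkpoint from the local .nemo file.
# EncDecCTCModelBPE is assumed from the "ctc" / "bpe" parts of the file name.
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(
    'stt_en_conformer_ctc_large_24500_hours_bpe.nemo'
)
asr_model.eval()  # switch to inference mode for feature extraction
```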
- Faris Alasmary - farisalasmary
This project is licensed under the MIT License - see the LICENSE file for details.
- This code is mainly adapted from this repo: Bottom-Up and Top-Down Attention for Visual Question Answering
- NeMo Conformer checkpoint we used to develop the model: Download
- BLIP model checkpoint finetuned on image captioning used in this repo: Download
- Pretrained VGG19 model used in the SBVQA 1.0 implementation: Download
- speech feature extraction script (NeMo Conformer)
- noise injection script
- inference script
- visual feature extraction script (BLIP ViT)
- main model training scripts
- upload the `find_the_best_speech_encoder.py` script
- our SBVQA 1.0 implementation scripts
- visualization scripts (GradCAM + attention maps)
- upload SBVQA 2.0 dataset
- upload precomputed visual and speech features
- upload our pretrained models
```bibtex
@article{alasmary2023sbvqa,
  author={Alasmary, Faris and Al-Ahmadi, Saad},
  journal={IEEE Access},
  title={SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions},
  year={2023},
  volume={11},
  pages={140967-140980},
  doi={10.1109/ACCESS.2023.3339537}
}
```