This is the official implementation of our paper:
SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions
- Install all requirements in `requirements.txt`.
- Download the models into the `models/` folder.
- Create a new empty folder named `uploads/`.
- Create an ngrok account so you can use the model from a browser over the internet.
- Copy your ngrok authtoken.
- Edit the `website.py` script by replacing the `YOUR_TOKEN_GOES_HERE` string with your ngrok authtoken (see the sketch below).
- Run the script: `python website.py`
- Enjoy ;)
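For reference, here is a minimal sketch of how an ngrok authtoken is typically wired into a web demo with the `pyngrok` package. This is an illustration only, not the actual contents of `website.py`; the Flask app, route, and port are assumptions.

```python
# A minimal sketch, assuming website.py serves a Flask app tunneled via pyngrok.
# The app, route, and port below are hypothetical placeholders.
from flask import Flask
from pyngrok import ngrok

NGROK_AUTHTOKEN = 'YOUR_TOKEN_GOES_HERE'  # replace with your ngrok authtoken

app = Flask(__name__)

@app.route('/')
def index():
    return 'SBVQA 2.0 demo is up!'

if __name__ == '__main__':
    ngrok.set_auth_token(NGROK_AUTHTOKEN)   # register the authtoken with ngrok
    public_url = ngrok.connect(5000)        # open a public tunnel to local port 5000
    print(f'Public URL: {public_url}')
    app.run(port=5000)                      # serve the app locally behind the tunnel
```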
The recorded question was: "What kind of animals is this?"
The recorded question was: "What is the color of the topping of the cake?"
SBVQA 2.0 dataset = SBVQA 1.0 dataset + The complementary spoken questions
- SBVQA 1.0 original data (identical copy of the data from zted/sbvqa repo): Download
- The complementary spoken questions: Download
You can also download `mp3_files_by_question.pkl`, a mapper where the key is the textual question and the value is the `.mp3` file name, from this link.
To load the mapper, use the following code snippet:
```python
import re
import pickle

def clean_question(text):
    # Lowercase, strip all non-alphabetic characters, and collapse whitespace
    text = text.lower()
    return ' '.join(re.sub(u"[^a-zA-Z ]", "", text, flags=re.UNICODE).split())

# Load the question -> .mp3 file name mapper
with open('mp3_files_by_question.pkl', 'rb') as f:
    mp3_files_by_question_mapper = pickle.load(f)

textual_question = 'Is this a modern interior?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000010.mp3'

textual_question = 'Where can milk be obtained?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000011.mp3'

textual_question = 'What are the payment method of the parking meter?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000012.mp3'
```
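Note that a plain indexing lookup raises `KeyError` when a question is missing from the mapper. A small convenience wrapper, reusing `clean_question` and the mapper from the snippet above, might look like the following; the helper name and `audio_dir` default are assumptions for illustration, not part of the repo:

```python
import os

def find_mp3(question, mapper, audio_dir='.'):
    # Hypothetical helper: returns the full path of the spoken question's
    # .mp3 file, or None if the (cleaned) question is not in the mapper.
    mp3_name = mapper.get(clean_question(question))
    return os.path.join(audio_dir, mp3_name) if mp3_name is not None else None

print(find_mp3('Is this a modern interior?', mp3_files_by_question_mapper))
# e.g. './complementary_0000010.mp3'
```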
These links were taken from the VQA website.
- BLIP features (train2014 images): Download
- BLIP features (val2014 images): Download
- Speech features of the whole SBVQA 2.0 dataset (Joanna voice only): Download
- NeMo speech encoder checkpoint: `stt_en_conformer_ctc_large_24500_hours_bpe.nemo`
- Best SBVQA 2.0 checkpoint: `best_sbvqa_2.0_model.pt`
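To sanity-check the NeMo checkpoint after downloading, it can be restored with the standard NeMo API. This is a minimal sketch: `restore_from()` is NeMo's standard loading entry point, while the `EncDecCTCModelBPE` class is an assumption based on the checkpoint's name.

```python
import nemo.collections.asr as nemo_asr

# Restore the Conformer-CTC checkpoint from the local .nemo file.
# EncDecCTCModelBPE is assumed from the "ctc" / "bpe" parts of the file name.
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from(
    'stt_en_conformer_ctc_large_24500_hours_bpe.nemo'
)
asr_model.eval()  # switch to inference mode for feature extraction
```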
- Faris Alasmary - farisalasmary
This project is licensed under the MIT License - see the LICENSE file for details.
- This code is mainly adapted from this repo: Bottom-Up and Top-Down Attention for Visual Question Answering
- NeMo Conformer checkpoint we used to develop the model: Download
- BLIP model checkpoint finetuned on image captioning used in this repo: Download
- Pretrained VGG19 model used in the SBVQA 1.0 implementation: Download
- speech feature extraction script (NeMo Conformer)
- noise injection script
- inference script
- visual feature extraction script (BLIP ViT)
- main model training scripts
- upload the `find_the_best_speech_encoder.py` script
- our SBVQA 1.0 implementation scripts
- visualization scripts (GradCAM + attention maps)
- upload SBVQA 2.0 dataset
- upload precomputed visual and speech features
- upload our pretrained models
```bibtex
@article{alasmary2023sbvqa,
  author={Alasmary, Faris and Al-Ahmadi, Saad},
  journal={IEEE Access},
  title={SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions},
  year={2023},
  volume={11},
  pages={140967-140980},
  doi={10.1109/ACCESS.2023.3339537}
}
```