The official implementation of the paper: SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

farisalasmary/sbvqa2.0

SBVQA 2.0 Official Implementation

This is the official implementation of our paper:

SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions

How to run?

Steps

  1. Install all the requirements listed in requirements.txt.
  2. Download the pretrained models into the models/ folder.
  3. Create a new empty folder named uploads/.
  4. Create an ngrok account so that the model can be used from a browser over the internet.
  5. Copy your ngrok authtoken.
  6. Edit the website.py script, replacing the string YOUR_TOKEN_GOES_HERE with your ngrok authtoken.
  7. Run the script: python website.py
  8. Enjoy ;)
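
Before running website.py, the setup steps above can be checked with a small script. The following is only a sketch and is not part of the repository: the helper name preflight is hypothetical, and it assumes that models/, uploads/ and website.py all sit in the same root folder. It cannot verify that the requirements are installed or that the token is valid; it only checks that the placeholder string is gone.

```python
from pathlib import Path

# Placeholder string named in step 6 of the instructions.
PLACEHOLDER = "YOUR_TOKEN_GOES_HERE"


def preflight(root="."):
    """Return a list of setup problems; an empty list means nothing was found.

    Hypothetical helper: it mirrors steps 2, 3 and 6 only.
    """
    root = Path(root)
    problems = []

    # Step 2: models/ should exist and contain at least one file.
    models_dir = root / "models"
    if not models_dir.is_dir() or not any(models_dir.iterdir()):
        problems.append("models/ is missing or empty (step 2)")

    # Step 3: create uploads/ if it does not exist yet.
    (root / "uploads").mkdir(exist_ok=True)

    # Step 6: the ngrok placeholder must have been replaced in website.py.
    script = root / "website.py"
    if not script.is_file():
        problems.append("website.py not found")
    elif PLACEHOLDER in script.read_text(encoding="utf-8"):
        problems.append("ngrok authtoken placeholder still in website.py (step 6)")

    return problems
```

Calling preflight() from the repository root returns the list of problems, so an empty list suggests that python website.py is ready to run.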

Demo Examples

The recorded question was: "What kind of animals is this?"

The recorded question was: "What is the color of the topping of the cake?"

Data

Audio files

SBVQA 2.0 dataset = SBVQA 1.0 dataset + The complementary spoken questions

  • SBVQA 1.0 original data (identical copy of the data from zted/sbvqa repo): Download
  • The complementary spoken questions: Download

Also, you can download mp3_files_by_question.pkl, a mapper where the key is the textual question and the value is the .mp3 file name, from this link.

To load the mapper, use the following code snippet:

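import pickle
import re

def clean_question(text):
    # Lowercase the question, keep only ASCII letters and spaces, and collapse repeated whitespace.
    text = text.lower()
    return ' '.join(re.sub(r"[^a-zA-Z ]", "", text).split())

# Load the mapper (key: cleaned textual question, value: .mp3 file name).
with open('mp3_files_by_question.pkl', 'rb') as f:
    mp3_files_by_question_mapper = pickle.load(f)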

textual_question = 'Is this a modern interior?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000010.mp3'

textual_question = 'Where can milk be obtained?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000011.mp3'

textual_question = 'What are the payment method of the parking meter?'
mp3_files_by_question_mapper[clean_question(textual_question)]
# Output: 'complementary_0000012.mp3'
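
Indexing the mapper directly raises a KeyError for any question that is not in it. A minimal sketch of a safer lookup follows. The helper name find_mp3 is hypothetical, and a one-entry stand-in dictionary replaces the real pickle file so the snippet runs on its own. It assumes, as the snippet above implies, that the keys are stored in cleaned form.

```python
import re


def clean_question(text):
    # Same normalization as in the snippet above: lowercase, keep only
    # ASCII letters and spaces, collapse repeated whitespace.
    text = text.lower()
    return ' '.join(re.sub(r"[^a-zA-Z ]", "", text).split())


def find_mp3(mapper, question):
    """Return the .mp3 file name for `question`, or None if it is absent.

    Hypothetical helper; `mapper` is any dict keyed by cleaned questions.
    """
    return mapper.get(clean_question(question))


# Stand-in for the real mapper, using one entry from the example above.
demo_mapper = {"is this a modern interior": "complementary_0000010.mp3"}

found = find_mp3(demo_mapper, "Is this a MODERN interior??")
missing = find_mp3(demo_mapper, "How many dogs are there?")
```

Here found is 'complementary_0000010.mp3' and missing is None. Because the regular expression keeps only letters and spaces, digits are dropped as well: 'What is 2 plus 2?' is cleaned to 'what is plus'.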

Image files

These links were taken from the VQA Website

Precomputed features

  • BLIP features (train2014 images): Download
  • BLIP features (val2014 images): Download
  • Speech features of the whole SBVQA 2.0 dataset (Joanna only): Download

Pretrained Models

  1. NeMo speech encoder checkpoint: stt_en_conformer_ctc_large_24500_hours_bpe.nemo
  2. Best SBVQA 2.0 checkpoint: best_sbvqa_2.0_model.pt

Authors

License

This project is licensed under the MIT License; see the LICENSE file for details.

Resources

ToDo

  • speech feature extraction script (NeMo Conformer)
  • noise injection script
  • inference script
  • visual feature extraction script (BLIP ViT)
  • main model training scripts
  • upload find_the_best_speech_encoder.py script
  • our SBVQA 1.0 implementation scripts
  • visualization scripts (GradCAM + attention maps)
  • upload SBVQA 2.0 dataset
  • upload precomputed visual and speech features
  • upload our pretrained models

Citation

@article{alasmary2023sbvqa,
	author={Alasmary, Faris and Al-Ahmadi, Saad},
	journal={IEEE Access},
	title={SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions},
	year={2023},
	volume={11},
	number={},
	pages={140967-140980},
	doi={10.1109/ACCESS.2023.3339537}
}
