
powered by PyTorch

BLIP BeaconSaliency: Harnessing BERT and ViT Synergy for Text-Guided Visual Focus in Deep Learning Models

The official code repository for the paper: BLIP BeaconSaliency: Harnessing BERT and ViT Synergy for Text-Guided Visual Focus in Deep Learning Models.


❖ Introduction

BLIP BeaconSaliency is a text-guided saliency prediction model based on BLIP and GradCam. It leverages the synergy between visual-text matching, a forefront of multi-modal learning, and text-guided saliency modeling, and marks a significant stride in this domain. This repo implements our approach, which shows strong and consistent performance across diverse datasets. Beyond articulating the technical merits of our model, we hope this work bridges the gap between complex computational concepts and their practical implications for understanding visual attention. Our code and finetuned models can be obtained here.

In essence, BLIP BeaconSaliency is more than an addition to the compendium of visual saliency methodologies; it is an approach that aligns with the complexities of visual saliency prediction and intricate image processing technologies.

🤗 Please cite BLIP BeaconSaliency in your publications if it helps with your work. Please star 🌟 this repo to help others notice BLIP BeaconSaliency if you think it is useful. It really means a lot to our open-source research. Thank you! BTW, you may also like BLIP and pysaliency, the two great open-source repositories upon which we built our architecture.

📣 Attention please:
> BLIP BeaconSaliency is developed within the framework of pysaliency, a Python toolbox for predicting visual saliency maps that comes pre-configured with several datasets and MATLAB-implemented models. An example of using pysaliency to predict on the SJTU-VIS dataset is shown below. With pysaliency, easy peasy! 😉

👉 Click here to see the example 👀
cd pysaliency
pip install -e .
import os
import matplotlib.pyplot as plt
import pysaliency

# Data preprocessing. Tedious, but pysaliency can help. 🤓
data_location = "../../datasets"
mit_stimuli, mit_fixations = pysaliency.external_datasets.get_mit1003(location=data_location) # For datasets in the pysaliency database, pysaliency automatically downloads, extracts, and stores them in HDF5 format.
index = 10
plt.imshow(mit_stimuli.stimuli[index])
f = mit_fixations[mit_fixations.n == index]
plt.scatter(f.x, f.y, color='r')
_ = plt.axis('off')
cutoff = 20

model: pysaliency.SaliencyMapModel = pysaliency.AIM(location='../../models', cache_location=os.path.join('model_caches', 'AIM'))

from pysaliency.external_datasets.sjtuvis import TextDescriptor
text_descriptor = TextDescriptor('../../datasets/test/original_sjtuvis_dataset/text.xlsx')
data_location = "../../datasets/test"
original_dataset_path = "../../datasets/test/original_sjtuvis_dataset"
mit_stimuli, mit_fixations = pysaliency.external_datasets.get_sjtu_vis(original_dataset_path=original_dataset_path, location=data_location, text_descriptor=text_descriptor)
short_stimuli = pysaliency.FileStimuli(filenames=mit_stimuli.filenames[:cutoff]) # hold the first cutoff visual stimuli as a pilot test set.
short_fixations = mit_fixations[mit_fixations.n < cutoff]

# Model inference. This is pysaliency showtime. 💪
smap = model.saliency_map(mit_stimuli[10])
plt.imshow(smap)
plt.show()


# Compute several evaluation metrics.
auc_uniform = model.AUC(short_stimuli, short_fixations, nonfixations='uniform', verbose=True)  # Measures the accuracy of predicting fixations by evaluating the model's ability to distinguish between fixated and non-fixated regions.
auc_shuffled = model.AUC(short_stimuli, short_fixations, nonfixations='shuffled', verbose=True)
auc_identical_nonfixations = model.AUC(short_stimuli, short_fixations, nonfixations=short_fixations, verbose=True)
kl_uniform = model.fixation_based_KL_divergence(short_stimuli, short_fixations, nonfixations='uniform') #  Quantifies the dissimilarity between fixation and nonfixation distributions, providing insights into the model's ability to capture saliency patterns.
kl_shuffled = model.fixation_based_KL_divergence(short_stimuli, short_fixations, nonfixations='shuffled')
kl_identical = model.fixation_based_KL_divergence(short_stimuli, short_fixations, nonfixations=short_fixations)
nss = model.NSS(short_stimuli, short_fixations) # Measures the alignment between the model's saliency values and human eye fixations, indicating consistency with human gaze behavior.
gold_standard = pysaliency.FixationMap(short_stimuli, short_fixations, kernel_size=30)
image_based_kl = model.image_based_kl_divergence(short_stimuli, gold_standard) # Calculates the similarity between the model's saliency predictions and the ground truth fixation map using KL divergence.
cc = model.CC(short_stimuli, gold_standard) # Measures the linear relationship between the model's saliency predictions and the ground truth fixation map.
sim = model.SIM(short_stimuli, gold_standard) # Measures the similarity (histogram intersection) between the model's saliency predictions and the ground truth fixation map.

❖ Contributions and Performance

⦿ Contributions:

  1. Our model innovatively explores text-guided saliency, a previously under-researched area. It focuses on how textual prompts affect visual saliency map prediction, advancing the field's boundaries.
  2. The architecture integrates BERT and GradCam. BERT excels at aligning text and image features, while GradCam effectively draws saliency maps from estimated gradient flows (a minimal illustrative sketch is given below). Their combination enhances model performance on various metrics.
  3. Our model achieves consistently strong results, comparable to state-of-the-art models across several metrics such as sAUC, AUC, SSIM, and CC, confirming its efficacy.

⦿ Performance: BLIP BeaconSaliency outperforms various models on several saliency benchmarks, especially on the SJTU-VIS dataset.
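To make the second contribution more concrete, here is a minimal, illustrative sketch of how a GradCam-style map can be driven by an image-text matching score. It is not the exact blipsaliency implementation: itm_model stands for a hypothetical wrapper that returns a scalar image-text matching score, and feature_layer is whichever visual feature layer you choose to hook.

import torch
import torch.nn.functional as F

def text_guided_gradcam(itm_model, image, text, feature_layer):
    """Grad-CAM of an image-text matching score w.r.t. one visual feature layer (illustrative)."""
    store = {}

    def hook(_module, _inputs, output):
        output.retain_grad()          # keep gradients for this non-leaf feature map
        store["fmap"] = output        # assumed shape: (B, C, H, W)

    handle = feature_layer.register_forward_hook(hook)
    score = itm_model(image, text)    # hypothetical: scalar image-text matching score
    score.backward()
    handle.remove()

    fmap = store["fmap"]
    weights = fmap.grad.mean(dim=(2, 3), keepdim=True)           # channel importance (pooled gradients)
    cam = F.relu((weights * fmap).sum(dim=1, keepdim=True))      # weighted combination + ReLU
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)     # normalize to [0, 1]
    return cam[0, 0].detach()                                    # (H, W) text-guided saliency map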

❖ Brief Graphical Illustration of Our Methodology

Here we only show the main component of our method: the joint-optimization training approach, which combines three encoders while freezing their weights. For a detailed description and explanation, please read our full paper if you are interested.

Fig. 1: Training approach
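For orientation, the sketch below captures the general idea of Fig. 1 under simplifying assumptions: three pretrained encoders (image, text, and multimodal) are kept frozen, and only a small fusion head is optimized against ground-truth fixation maps. All names, feature shapes, and the KL-divergence loss are illustrative placeholders rather than the exact blipsaliency training code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionHead(nn.Module):
    """Small trainable head that fuses features from three frozen encoders into a saliency map."""
    def __init__(self, dim: int, out_hw=(32, 32)):
        super().__init__()
        self.out_hw = out_hw
        self.proj = nn.Sequential(
            nn.Linear(3 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, out_hw[0] * out_hw[1]),
        )

    def forward(self, f_img, f_txt, f_joint):
        fused = torch.cat([f_img, f_txt, f_joint], dim=-1)      # (B, 3*dim) pooled features
        return self.proj(fused).view(-1, 1, *self.out_hw)       # (B, 1, H, W) predicted map

def freeze(module: nn.Module) -> nn.Module:
    """Freeze a pretrained encoder so that only the fusion head receives updates."""
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()

def train_step(head, optimizer, encoders, images, prompts, target_maps):
    """One joint-optimization step: frozen encoders forward, trainable head backward."""
    image_enc, text_enc, joint_enc = encoders                   # assumed already passed through freeze()
    with torch.no_grad():                                       # no gradients through frozen encoders
        f_img, f_txt = image_enc(images), text_enc(prompts)
        f_joint = joint_enc(images, prompts)
    pred = head(f_img, f_txt, f_joint)
    loss = F.kl_div(torch.log_softmax(pred.flatten(1), dim=1),
                    torch.softmax(target_maps.flatten(1), dim=1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()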

❖ Repository Structure

The implementation of BLIP BeaconSaliency is in the blipsaliency directory. Please install it via pip install -e . or python setup.py install. Due to time and resource limits, we haven't performed extensive parameter fine-tuning experiments; if you like this repo, please fork it and open a PR to help us improve it! 💚 💛 🤎

❖ Development Environment

We run on Ubuntu 22.04 LTS with an NVIDIA RTX A40 GPU.

  • Use conda to create an env for BLIP BeaconSaliency and activate it.
conda env create --file=enviornment.yaml
conda activate blipsaliency
  • Then install blipsaliency as a package
cd blipsaliency
pip install -e .

Additionally, if you want to reproduce the results for the gradient-based models in the paper (extensive experiments on Vanilla Gradient, SmoothGrad, Integrated Gradients, GradCam... based on InceptionV3, CNN, or MLP architectures), please refer to the sanity_check directory! 🤗

To follow the original implementation of these gradient-based models, we have to create a separate conda environment to test them!

conda env create --file=tf_env.yaml
conda activate sanity
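As a quick reminder of what these attribution methods compute, here is a minimal SmoothGrad-style sketch. It is written in PyTorch purely for illustration; the code in sanity_check follows the original TensorFlow implementations, and model and target_class below are generic placeholders.

import torch

def smoothgrad(model, image, target_class, n_samples=25, noise_std=0.15):
    """Average vanilla gradients over noisy copies of one (C, H, W) input image."""
    accumulated = torch.zeros_like(image)
    for _ in range(n_samples):
        noisy = (image + noise_std * torch.randn_like(image)).requires_grad_(True)
        logit = model(noisy.unsqueeze(0))[0, target_class]        # class score for this noisy sample
        logit.backward()
        accumulated += noisy.grad
    saliency = (accumulated / n_samples).abs().max(dim=0).values  # collapse channels -> (H, W)
    return saliency / (saliency.max() + 1e-8)                     # normalize for visualization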

Also, if you want to see how I process the SJTU-VIS dataset into HDF5 format and integrate it into the pysaliency framework, please refer to the pysaliency directory, and more specifically the file pysaliency/external_datasets/sjtuvis.py (we are planning to open a PR to the original pysaliency repo, since SJTU-VIS is an awesome dataset enriched with the power of textual prompts)! 😎

❖ Datasets

We run on the following datasets: MIT1003 (MIT300), SJTU-VIS, CAT2000, FIGRIM, SALICON, Toronto, DUT-OMRON, OSIE, PASCAL-S, and NUSEF-Public.

Here are some samples taken randomly from the datasets:


Some samples taken from the SJTU-VIS dataset, coupled with multi-scale textual prompts (covering non-salient objects, salient objects, and global and local context).


Now the directory tree should be the following:

- datasets
- deepgaze
    - deepgaze_pytorch
- docs
- blipsaliency
    - examples
    - blipsaliency
    - projects
- pysaliency
    - notebooks
    - optpy
    - pysaliency
- models
- pretrained_weights
    - deepgaze
- saliency_toolbox
- sanity_check
    - notebooks
    - src

❖ Usage

We use the pretrained BLIP model as our backbone and fine-tune it on the SJTU-VIS dataset. Also, please take a tour of the pysaliency repo for further details.
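For evaluation, one convenient pattern is to wrap the fine-tuned predictor in pysaliency's SaliencyMapModel interface so that the metric calls from the example above (AUC, NSS, CC, ...) apply unchanged. The wrapper below is only a sketch: predict_saliency is a hypothetical callable standing in for the actual BLIP BeaconSaliency inference code, and it assumes pysaliency's usual pattern of overriding _saliency_map.

import numpy as np
import pysaliency

class TextGuidedSaliencyModel(pysaliency.SaliencyMapModel):
    """Hypothetical wrapper: (image, prompt) -> 2D saliency map, exposed to pysaliency."""
    def __init__(self, predict_saliency, prompt, **kwargs):
        super().__init__(**kwargs)
        self.predict_saliency = predict_saliency   # e.g. the fine-tuned BLIP BeaconSaliency inference call
        self.prompt = prompt                       # textual prompt that guides the prediction

    def _saliency_map(self, stimulus):
        # pysaliency hands us the stimulus data; the text prompt steers the predicted map.
        return self.predict_saliency(np.asarray(stimulus), self.prompt)

An instance of this wrapper can then be evaluated exactly like the AIM model in the pysaliency example above.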

❖ Quick Run

👉 Click here to see the example 👀

Please see the .ipynb notebooks under the pysaliency/notebooks/ directory for reference about the training procedure and inference pass of our best model.

Please see the .ipynb notebooks under the sanity_check/notebooks/ directory for experiments on various gradient-based models, with backbones chosen from InceptionV3, CNN, and MLP architectures.

❗️Note that the paths of datasets and saving dirs may differ on your machine; please check them in the configuration files.

❖ Experimental Results

[Experimental result figures]

❖ Acknowledgments

I extend my heartfelt gratitude to the esteemed faculty and dedicated teaching assistants of CS3324 for their invaluable guidance and support throughout my journey in image processing. Their profound knowledge, coupled with an unwavering commitment to nurturing curiosity and innovation, has been instrumental in my academic and personal growth. I am deeply appreciative of their efforts in creating a stimulating and enriching learning environment, which has significantly contributed to the development of this paper and my understanding of the field. My sincere thanks to each one of them for inspiring and challenging me to reach new heights in my studies.

✨Stars/forks/issues/PRs are all welcome!

👏 Click to View Contributors:

Stargazers repo roster for @Learner209/text-guided-saliency

❖ Last but Not Least

If you have any additional questions or are interested in collaboration, please take a look at my GitHub profile and feel free to contact me 😃.
