Definition: Identifying connections between elements of multiple modalities
- Supervised Approach
- Unsupervised Approach
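As a minimal sketch of the unsupervised case, assume every element is already embedded in a shared vector space. Connections can then be scored by cosine similarity and normalized into a soft alignment. The toy embeddings, the temperature value, and the names `cosine` and `soft_align` are illustrative assumptions, not taken from the source.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def soft_align(src, tgt, temperature=0.1):
    """For each source element, return a softmax distribution over target elements."""
    rows = []
    for u in src:
        scores = [cosine(u, v) / temperature for v in tgt]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

# Hypothetical toy embeddings: two words and three image regions.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]

alignment = soft_align(words, regions)
# Hard alignment: index of the most similar region for each word.
hard = [row.index(max(row)) for row in alignment]
```

A supervised approach would instead learn the scoring function from annotated correspondences; here the similarity is fixed and no labels are used.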
Definition: Model all cross-modal connections and interactions to learn better representations
Li et al., VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019
- Lu, Jiasen, et al. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks." arXiv (August 6, 2019).
- Tan, Hao, and Mohit Bansal. "LXMERT: Learning cross-modality encoder representations from transformers." arXiv (August 20, 2019).
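One way to picture "all cross-modal connections" is self-attention over the concatenation of both modalities, loosely in the single-stream style of VisualBERT. The sketch below is a simplification under stated assumptions: it uses identity projections in place of learned query, key, and value matrices, a single head, and hypothetical two-dimensional tokens.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections (an assumption)."""
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted average of all tokens, from both modalities.
        outputs.append([sum(w[j] * tokens[j][i] for j in range(len(tokens)))
                        for i in range(d)])
    return outputs, weights

# Hypothetical inputs: two text tokens and one image-region feature.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[1.0, 1.0]]

joint = text_tokens + image_regions  # one sequence, so every pair can interact
contextual, attn = self_attention(joint)
```

Because the sequence is joint, `attn[0][2]` is the weight the first text token places on the image region; no connection is ruled out in advance.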
- First advantage: Does not require all elements to be connected
- Second advantage: Allows different edge functions for modality and temporal connections
Yang et al., MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences, NAACL 2021
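The two advantages listed above can be illustrated with a small graph builder. This is a sketch only, not MTAG's actual construction: the `max_gap` pruning rule, the edge-type labels, and the function name `build_edges` are assumptions made for illustration.

```python
from itertools import product

def build_edges(nodes, max_gap=1.0):
    """nodes: list of (modality, timestamp). Returns {(src, dst): edge_type}."""
    edges = {}
    for (i, (mi, ti)), (j, (mj, tj)) in product(enumerate(nodes), repeat=2):
        if i == j or abs(ti - tj) > max_gap:
            continue  # sparse graph: distant elements stay unconnected
        order = "before" if ti < tj else "after" if ti > tj else "same"
        # The edge type combines the modality pair with the temporal order,
        # so each type could be given its own edge function.
        edges[(i, j)] = (mi, mj, order)
    return edges

# Hypothetical unaligned elements from three modalities.
nodes = [("text", 0.0), ("audio", 0.5), ("vision", 3.0)]
edges = build_edges(nodes)
edge_types = set(edges.values())
```

Here the vision node has no edges because it lies more than `max_gap` away from the others, and the two remaining edges carry distinct types that a model could parameterize separately.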
Definition: Handle ambiguity in segmentation and element granularity during alignment
Graves et al., Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks, ICML 2006
Hsu et al., HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units, arXiv 2021
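For the CTC reference above, the mapping that removes the need for a fixed segmentation can be sketched as follows: a frame-level path is collapsed by merging consecutive repeats and then dropping blanks, so many different paths yield the same label sequence. The blank symbol `-` and the example paths are assumptions chosen for illustration.

```python
from itertools import groupby

BLANK = "-"  # assumed blank symbol

def ctc_collapse(path, blank=BLANK):
    """Merge consecutive repeated symbols, then remove blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return "".join(s for s in merged if s != blank)

# Hypothetical frame-level paths with different segmentations.
examples = ["hh-e-ll-lo", "h-el-l-o-", "-hello"]
decoded = [ctc_collapse(p) for p in examples]
# decoded is ["hello", "hello", "helo"]
```

The first two paths segment the frames differently yet decode to the same string. The third shows why the blank matters: without a blank between the two `l` frames, the repeated letter is merged into one.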