Definition: Learn a joint representation that models cross-modal interactions between individual elements of different modalities
Unimodal encoders can be learned jointly with the fusion network, or pre-trained
- Additive Fusion
- Multiplicative Fusion
- Bilinear Fusion
- Tensor Fusion
- Low-rank Fusion
- High-Order Polynomial Fusion
- Hou et al., Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, NeurIPS 2019
- Gated Fusion
- Nonlinear Fusion
- Measuring Non-Additive Interactions
- Projection from nonlinear to additive (using EMAP)
- Hessel and Lee, Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!, EMNLP 2020 → introduced the EMAP method
- Complex Fusion
- Barnum et al., On the Benefits of Early Fusion in Multimodal Representation Learning, arXiv 2022
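The fusion operators listed above, and the EMAP projection used to measure non-additive interactions, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not any paper's reference implementation: the feature sizes, the weight names (`W_a`, `W_b`, `G`, `U_a`, `U_b`), and the random inputs are all hypothetical, and real models would learn the weights and apply these operators to encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
d_a, d_b, d_out, rank = 4, 3, 5, 2        # illustrative sizes (assumptions)

x_a = rng.normal(size=d_a)                # unimodal feature, modality A
x_b = rng.normal(size=d_b)                # unimodal feature, modality B
W_a = rng.normal(size=(d_out, d_a))       # hypothetical projection weights
W_b = rng.normal(size=(d_out, d_b))
G = rng.normal(size=(d_out, d_a + d_b))   # hypothetical gate weights
U_a = rng.normal(size=(rank, d_out, d_a + 1))  # hypothetical low-rank factors
U_b = rng.normal(size=(rank, d_out, d_b + 1))

def additive_fusion(x_a, x_b, W_a, W_b):
    """z = W_a x_a + W_b x_b: each modality contributes independently."""
    return W_a @ x_a + W_b @ x_b

def multiplicative_fusion(x_a, x_b, W_a, W_b):
    """z = (W_a x_a) * (W_b x_b): element-wise product of the projections."""
    return (W_a @ x_a) * (W_b @ x_b)

def tensor_fusion(x_a, x_b):
    """Outer product of [x_a; 1] and [x_b; 1]; the appended 1s keep
    the unimodal terms alongside the bimodal products."""
    return np.outer(np.append(x_a, 1.0), np.append(x_b, 1.0))

def low_rank_fusion(x_a, x_b, U_a, U_b):
    """Sum over rank-1 factors; never materializes the full fusion tensor."""
    h_a = np.append(x_a, 1.0)
    h_b = np.append(x_b, 1.0)
    # (rank, d_out, d_a+1) @ (d_a+1,) -> (rank, d_out)
    return np.sum((U_a @ h_a) * (U_b @ h_b), axis=0)

def gated_fusion(x_a, x_b, W_a, W_b, G):
    """A sigmoid gate computed from both modalities mixes the two projections."""
    gate = 1.0 / (1.0 + np.exp(-(G @ np.concatenate([x_a, x_b]))))
    return gate * (W_a @ x_a) + (1.0 - gate) * (W_b @ x_b)

def emap(f, A, B):
    """Empirical multimodally-additive projection of a scalar model f,
    evaluated at the paired samples (A[i], B[i])."""
    n = len(A)
    # scores[i, j] = f(A[i], B[j]) over every cross pairing
    scores = np.array([[f(A[i], B[j]) for j in range(n)] for i in range(n)])
    row_mean = scores.mean(axis=1)   # average over B for each A[i]
    col_mean = scores.mean(axis=0)   # average over A for each B[i]
    return row_mean + col_mean - scores.mean()
```

If a model is purely additive across modalities, `emap` reproduces its predictions exactly; a gap between a model's own predictions and its EMAP projection indicates that it relies on cross-modal (non-additive) interactions. Note that `emap` evaluates `f` on all `n * n` pairings, so it is usually run on an evaluation subset.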
Definition: Learn multimodally-contextualized representations that are coordinated through their cross-modal interactions
- Strong Coordination
- Partial Coordination
- Cosine similarity
- Kernel similarity functions
- Canonical Correlation Analysis (CCA)
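The coordination functions listed above can be illustrated with a short NumPy sketch. Cosine and kernel similarity score a single pair of embeddings, while CCA looks for linear projections of two views that are maximally correlated. The function names, the choice of an RBF kernel, the `gamma` value, and the small regularizer `reg` are assumptions made for this example; the deep variants discussed in the cited work replace the linear maps with neural networks.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rbf_kernel(a, b, gamma=1.0):
    """One possible kernel similarity: exp(-gamma * ||a - b||^2)."""
    return float(np.exp(-gamma * np.sum((a - b) ** 2)))

def _inv_sqrt(S):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * w ** -0.5) @ V.T

def cca_correlations(X, Y, reg=1e-6):
    """Canonical correlations between two views X (n, d_x) and Y (n, d_y).

    Whitens each view, then takes the singular values of the whitened
    cross-covariance. `reg` is a small ridge term for numerical stability.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    T = _inv_sqrt(Sxx) @ Sxy @ _inv_sqrt(Syy)
    return np.linalg.svd(T, compute_uv=False)
```

As a usage example, two views that are related by an affine map (for instance `Y = 2 * X + 1`) should yield canonical correlations close to 1, since a linear projection can align them almost perfectly.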
Wang et al., On deep multi-view representation learning, PMLR 2015
Xu et al., Multi-View Intact Space Learning, TPAMI 2015
Given multiple views
- There is an “intact” representation which is complete and not damaged
- The views $z_i$ are partial (and possibly degenerated) representations of the intact representation
Zhang et al., AE2-Nets: Autoencoder in Autoencoder Networks, CVPR 2019
Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, NIPS 2014
Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021
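Radford et al. coordinate image and text embeddings with a contrastive objective: matched pairs in a batch are pulled together and mismatched pairs are pushed apart. The NumPy sketch below shows one way such a symmetric contrastive loss can be written. It is only an illustration; the function names and the fixed `temperature` value are assumptions here, and a real system would compute the embeddings with trained encoders and backpropagate through this loss.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(img, txt, temperature=0.07):
    """Contrastive loss over a batch where img[i] and txt[i] are a matched pair.

    img, txt: arrays of shape (batch, dim) holding unnormalized embeddings.
    """
    # L2-normalize so that dot products are cosine similarities
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarity matrix
    idx = np.arange(len(img))                   # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)
```

The loss is low when each embedding is most similar to its own partner and high when partners are misaligned, which is what pushes the two modality-specific spaces to become coordinated.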
Definition: Learn a new set of representations that reflect multimodal internal structure, such as data factorization or clustering
- Tsai et al., Learning Factorized Multimodal Representations, ICLR 2019
- Tsai et al., Self-Supervised Learning from a Multi-View Perspective, ICLR 2021
- Hu et al., Deep Multimodal Clustering for Unsupervised Audiovisual Learning, CVPR 2019