Challenge 1: Representation

Sub-Challenge 1a: Representation Fusion

Definition: Learn a joint representation that models cross-modal interactions between individual elements of different modalities

(Figure 2-01)

Fusion with Unimodal Encoders

Unimodal encoders can be learned jointly with the fusion network, or pre-trained separately. Common fusion operators are listed below; a code sketch of several of them follows the list.

(Figure 2-02)

  • Additive Fusion (Figure 2-06)
  • Multiplicative Fusion
    • Multiplicative fusion (element-wise product of the unimodal embeddings)
    • Bilinear fusion (outer-product interactions, $z_a^\top W z_b$)
  • Tensor Fusion
    • Zadeh et al., Tensor Fusion Network for Multimodal Sentiment Analysis, EMNLP 2017 (Figure 2-03)
  • Low-rank Fusion
    • Liu et al., Efficient Low-rank Multimodal Fusion with Modality-Specific Factors, ACL 2018 (Figure 2-04)
  • High-Order Polynomial Fusion
    • Hou et al., Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, NeurIPS 2019
  • Gated Fusion
    • Arevalo et al., Gated Multimodal Units for Information Fusion, ICLR Workshop 2017
    • Tsai et al., Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel, EMNLP 2019 (Figure 2-05)
  • Nonlinear Fusion (Figure 2-07)
  • Measuring Non-Additive Interactions
    • Projection from nonlinear to additive using EMAP: replace $f(x_a, x_b)$ by its best additive approximation $\hat{f}(x_a, x_b) = \mathbb{E}_{x_b'}[f(x_a, x_b')] + \mathbb{E}_{x_a'}[f(x_a', x_b)] - \mathbb{E}_{x_a', x_b'}[f(x_a', x_b')]$; if performance is unchanged, the model is not exploiting cross-modal interactions
    • Hessel and Lee, Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!, EMNLP 2020 (introduces the EMAP method)
  • Complex Fusion
    • Barnum et al., On the Benefits of Early Fusion in Multimodal Representation Learning, arXiv 2022
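
The core operators above are compact enough to write out. Below is a minimal PyTorch sketch (class name, dimensions, and the `rank` hyperparameter are illustrative assumptions, not the papers' released code) of additive, multiplicative, bilinear, tensor, low-rank, and gated fusion for two modality embeddings $z_a$ and $z_b$:

```python
import torch
import torch.nn as nn

class TwoModalityFusion(nn.Module):
    """Illustrative fusion operators for embeddings z_a (d_a) and z_b (d_b)."""

    def __init__(self, d_a: int, d_b: int, d_out: int, rank: int = 4):
        super().__init__()
        self.W_a = nn.Linear(d_a, d_out)              # additive: W_a z_a + W_b z_b
        self.W_b = nn.Linear(d_b, d_out)
        self.bilinear = nn.Bilinear(d_a, d_b, d_out)  # bilinear: z_a^T W z_b
        # Tensor fusion: project the flattened outer product of [z_a; 1], [z_b; 1]
        self.tensor_proj = nn.Linear((d_a + 1) * (d_b + 1), d_out)
        # Low-rank fusion: rank-r modality-specific factors avoid the full tensor
        self.U_a = nn.Parameter(torch.randn(rank, d_a + 1, d_out))
        self.U_b = nn.Parameter(torch.randn(rank, d_b + 1, d_out))
        # Gated fusion: a learned gate mixes the two unimodal projections
        self.gate = nn.Linear(d_a + d_b, d_out)

    def additive(self, z_a, z_b):
        return self.W_a(z_a) + self.W_b(z_b)

    def multiplicative(self, z_a, z_b):
        return z_a * z_b  # element-wise product; requires d_a == d_b

    def tensor(self, z_a, z_b):
        ones = z_a.new_ones(z_a.size(0), 1)
        z_a1 = torch.cat([z_a, ones], dim=-1)  # appended 1 keeps unimodal terms
        z_b1 = torch.cat([z_b, ones], dim=-1)  # alive inside the outer product
        outer = torch.einsum('bi,bj->bij', z_a1, z_b1)
        return self.tensor_proj(outer.flatten(1))

    def low_rank(self, z_a, z_b):
        ones = z_a.new_ones(z_a.size(0), 1)
        z_a1 = torch.cat([z_a, ones], dim=-1)
        z_b1 = torch.cat([z_b, ones], dim=-1)
        # Equivalent to tensor fusion with a rank-constrained weight tensor,
        # computed without ever materializing the (d_a+1) x (d_b+1) outer product.
        h_a = torch.einsum('bi,rio->bro', z_a1, self.U_a)
        h_b = torch.einsum('bj,rjo->bro', z_b1, self.U_b)
        return (h_a * h_b).sum(dim=1)

    def gated(self, z_a, z_b):
        g = torch.sigmoid(self.gate(torch.cat([z_a, z_b], dim=-1)))
        return g * self.W_a(z_a) + (1.0 - g) * self.W_b(z_b)
```

Nonlinear fusion would instead feed the concatenated $[z_a; z_b]$ through a deep network, trading interpretable interaction terms for flexibility; EMAP is then the tool for checking whether that flexibility is actually used.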

Sub-Challenge 1b: Representation Coordination

Definition: Learn multimodally-contextualized representations that are coordinated through their cross-modal interactions

  • Strong Coordination (all dimensions of the representations are pushed to agree, e.g., via a similarity objective)
  • Partial Coordination (only some dimensions or relations are coordinated, e.g., correlation or order)

Coordination Function

  • Cosine similarity
  • Kernel similarity functions
  • Canonical Correlation Analysis (CCA)
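
As a minimal illustration (the function name and loss form are assumptions, not from a specific paper), a coordination objective based on cosine similarity can be written as:

```python
import torch
import torch.nn.functional as F

def cosine_coordination_loss(z_a: torch.Tensor, z_b: torch.Tensor) -> torch.Tensor:
    """Pull paired embeddings from the two modalities toward the same direction.

    z_a, z_b: (batch, dim) outputs of the two unimodal encoders.
    """
    return 1.0 - F.cosine_similarity(z_a, z_b, dim=-1).mean()
```

Kernel similarity functions replace the inner product with a kernel evaluation, while CCA instead learns projections $u, v$ that maximize $\mathrm{corr}(u^\top z_a, v^\top z_b)$.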

Deep Canonically Correlated Autoencoders (DCCAE)

Wang et al., On Deep Multi-view Representation Learning, ICML 2015

(Figure 2-08)
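
As a rough sketch of the objective (notation here is assumed and simplified from the paper): DCCAE ties the bottlenecks of two autoencoders together with a CCA term while each view keeps its own reconstruction loss,

$$\min_{f,\,g,\,p,\,q}\; -\,\mathrm{corr}\big(f(x),\, g(y)\big) \;+\; \lambda\,\big(\lVert x - p(f(x))\rVert^2 + \lVert y - q(g(y))\rVert^2\big)$$

where $f, g$ are the view encoders, $p, q$ the decoders, and the correlation is computed over the batch as in CCA.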

Multi-view Latent “Intact” Space

Xu et al., Multi-View Intact Space Learning, TPAMI 2015

Given multiple views $z_i$ from the same “object”:

(Figure 2-09)

  • There is an “intact” representation which is complete and not damaged
  • The views $z_i$ are partial (and possibly degraded) representations of the intact representation
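
A simplified way to write this (notation assumed here): each observed view is generated from the latent intact representation $x$ by a view-specific mapping, $z_i \approx W_i x$, and learning jointly recovers $x$ and the mappings $W_i$ from the views alone.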

Auto-Encoder in Auto-Encoder Network

Zhang et al., AE2-Nets: Autoencoder in Autoencoder Networks, CVPR 2019

(Figure 2-10)

Gated Coordination

(Figure 2-11)

Coordination with Contrastive Learning

(Figure 2-12)

Example – Visual-Semantic Embeddings

Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, NIPS 2014

(Figure 2-13)
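
A minimal sketch of the pairwise ranking (max-margin) objective behind visual-semantic embeddings; the margin value and the way negative pairs are sampled are assumptions here:

```python
import torch

def pairwise_ranking_loss(s_pos: torch.Tensor, s_neg: torch.Tensor,
                          margin: float = 0.2) -> torch.Tensor:
    """Hinge loss: matched image-caption pairs (s_pos) should score at least
    `margin` higher than mismatched pairs (s_neg)."""
    return torch.clamp(margin - s_pos + s_neg, min=0.0).mean()
```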

Example – CLIP (Contrastive Language–Image Pre-training)

Radford et al., Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021

(Figure 2-14)
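
CLIP coordinates the two encoders with a symmetric contrastive loss over a batch of image-text pairs. A minimal PyTorch sketch (the temperature value and tensor shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb: torch.Tensor, text_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired image/text embeddings."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; contrast each image against all texts
    # in the batch, and each text against all images.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```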

Sub-Challenge 1c: Representation Fission

Definition: Learn a new set of representations that reflects multimodal internal structure, such as data factorization or clustering

Modality-Level Fission

(Figure 2-15)

  • Tsai et al., Learning Factorized Multimodal Representations, ICLR 2019
  • Tsai et al., Self-Supervised Learning from a Multi-View Perspective, ICLR 2021
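
A minimal sketch in the spirit of modality-level fission (layer sizes and the shared/private split are illustrative assumptions, not the papers' architectures): each modality's representation is split into a factor shared across modalities and a modality-specific factor.

```python
import torch
import torch.nn as nn

class ModalityFission(nn.Module):
    """Split each modality's embedding into a factor shared across modalities
    and a modality-specific (private) factor."""

    def __init__(self, d_a: int, d_b: int, d_shared: int, d_private: int):
        super().__init__()
        self.shared_a = nn.Linear(d_a, d_shared)
        self.shared_b = nn.Linear(d_b, d_shared)
        self.private_a = nn.Linear(d_a, d_private)
        self.private_b = nn.Linear(d_b, d_private)

    def forward(self, z_a, z_b):
        # Training would add objectives tying shared_a(z_a) to shared_b(z_b)
        # (e.g., a similarity or mutual-information term) plus reconstruction
        # from the shared + private factors, so the split is not arbitrary.
        return (self.shared_a(z_a), self.private_a(z_a),
                self.shared_b(z_b), self.private_b(z_b))
```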

Fine-Grained Fission

(Figure 2-16)

  • Hu et al., Deep Multimodal Clustering for Unsupervised Audiovisual Learning, CVPR 2019