Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.
Definition: Defining or learning the relationships over which reasoning occurs.
Temporal structure in multi-view sequences
Key idea: use memory to capture cross-modal interactions across time
- Structuring multimodal memory: ideas from representation fusion, coordination, and fission
- Rajagopalan et al., Extending Long Short-Term Memory for Multi-View Structured Learning. ECCV 2016
- Writing: Coordination function that measures the similarity between a feature and the memory, then uses that similarity to weight the feature
- Wang et al., Multimodal Memory Modelling for Video Captioning. CVPR 2018
- Compose: Weighted function to compose previous memory and new addition
- Xiong et al., Dynamic Memory Networks for Visual and Textual Question Answering. arXiv 2016
- Reading: Summary function to summarize multimodal information
- Hazarika et al., ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection. EMNLP 2018
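The write, compose, and read steps listed above can be sketched as a small memory update loop. This is a minimal illustration under stated assumptions, not the formulation of any cited paper: the function names (`write_gate`, `compose`, `read`, `run_memory`), the sigmoid gate, and the dot-product similarity are all hypothetical choices made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def write_gate(feature, memory):
    """Writing: similarity between feature and memory, squashed to (0, 1).
    Assumption: dot-product similarity followed by a sigmoid."""
    return 1.0 / (1.0 + math.exp(-dot(feature, memory)))

def compose(memory, feature, gate):
    """Compose: weighted mix of the previous memory and the new addition."""
    return [(1.0 - gate) * m + gate * f for m, f in zip(memory, feature)]

def read(memories, query):
    """Reading: attention-weighted summary over all stored memory states."""
    weights = softmax([dot(m, query) for m in memories])
    dim = len(query)
    return [sum(w * m[i] for w, m in zip(weights, memories)) for i in range(dim)]

def run_memory(features, dim):
    """Process a sequence of (already fused) feature vectors step by step."""
    memory = [0.0] * dim
    history = []
    for feature in features:
        gate = write_gate(feature, memory)
        memory = compose(memory, feature, gate)
        history.append(memory)
    return history

# Two time steps; each feature could come from a different modality.
history = run_memory([[1.0, 0.0], [0.0, 1.0]], dim=2)
summary = read(history, query=[0.0, 0.0])
```

In a trained model the gate and the read query would be learned; here they are fixed so that the data flow between the three steps is easy to follow.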
Hong et al., Learning to Compose and Reason with Language Tree Structures for Visual Grounding. IEEE TPAMI 2019
Structure defined through interactive environment
Main difference from temporal structure: actions taken at previous time steps affect future states
Integrates multimodality into the reinforcement learning framework
Luketina et al., A Survey of Reinforcement Learning Informed by Natural Language. IJCAI 2019
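A toy interaction loop can make the dependence of future states on past actions concrete. The sketch below is only illustrative: `CorridorEnv`, `policy`, and `rollout` are hypothetical names, the environment is a made-up 1-D corridor, and the policy is hand-written rather than learned with reinforcement learning.

```python
from dataclasses import dataclass

@dataclass
class CorridorEnv:
    """Hypothetical 1-D corridor whose goal cell depends on a text instruction."""
    length: int = 5
    position: int = 2
    instruction: str = "go right"

    def observe(self):
        # Multimodal observation: a one-hot "visual" grid plus the instruction text.
        grid = [1 if i == self.position else 0 for i in range(self.length)]
        return grid, self.instruction

    def step(self, action):
        # The action changes the state, so it shapes every later observation.
        self.position = min(self.length - 1, max(0, self.position + action))
        goal = self.length - 1 if "right" in self.instruction else 0
        done = self.position == goal
        reward = 1.0 if done else 0.0
        return self.observe(), reward, done

def policy(observation):
    """Hand-written stand-in for a learned policy: reads the language channel."""
    _grid, instruction = observation
    return 1 if "right" in instruction else -1

def rollout(env, max_steps=10):
    obs = env.observe()
    trajectory, total_reward = [], 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        trajectory.append(env.position)
        total_reward += reward
        if done:
            break
    return trajectory, total_reward

trajectory, total_reward = rollout(CorridorEnv())
```

Unlike a fixed temporal sequence, the observations here are not given in advance: each one depends on the action chosen at the previous step.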
Xu et al., MUFASA: Multimodal Fusion Architecture Search for Electronic Health Records. AAAI 2021
Definition: The parameterization of individual multimodal concepts in the reasoning process.
Hand-crafted concepts based on domain knowledge
Andreas et al., Neural Module Networks. CVPR 2016
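The idea of hand-crafted concepts composed per question can be sketched with symbolic stand-ins for modules. This is an assumption-laden toy: in Neural Module Networks the modules are neural networks operating on image features, whereas the `find`, `filter_color`, `count`, and `exists` functions below are hypothetical, purely symbolic substitutes operating on a list of annotated objects.

```python
def find(scene, shape):
    """Select objects of a given shape."""
    return [obj for obj in scene if obj["shape"] == shape]

def filter_color(objects, color):
    """Keep only objects of a given color."""
    return [obj for obj in objects if obj["color"] == color]

def count(objects):
    return len(objects)

def exists(objects):
    return len(objects) > 0

MODULES = {
    "find": find,
    "filter_color": filter_color,
    "count": count,
    "exists": exists,
}

def execute(layout, scene):
    """Run a question-specific layout: a list of (module_name, argument) steps."""
    value = scene
    for name, arg in layout:
        module = MODULES[name]
        value = module(value) if arg is None else module(value, arg)
    return value

scene = [
    {"shape": "cube", "color": "red"},
    {"shape": "cube", "color": "red"},
    {"shape": "cube", "color": "blue"},
    {"shape": "sphere", "color": "red"},
]

# "How many red cubes are there?"
how_many_red_cubes = execute(
    [("find", "cube"), ("filter_color", "red"), ("count", None)], scene
)
# "Is there a green sphere?"
any_green_sphere = execute(
    [("find", "sphere"), ("filter_color", "green"), ("exists", None)], scene
)
```

Each module corresponds to one human-defined concept, and different questions reuse the same modules in different layouts.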
Definition: How increasingly abstract concepts are inferred from individual pieces of multimodal evidence.
Towards explicit inference paradigms:
- Logical inference: given premises inferred from multimodal evidence, how can one derive logical conclusions?
- Causal inference: how can one determine the actual causal effect of a variable in a larger system?
Gokhale et al., VQA-LOL: Visual Question Answering Under the Lens of Logic. ECCV 2020
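One simple way to illustrate logical inference over multimodal premises is to combine the "yes" probabilities of sub-questions with soft logic operators. The sketch below uses product-based fuzzy logic, which is a swapped-in illustration and not the method of the cited paper; the function names are hypothetical, and the product rule assumes the sub-questions are independent.

```python
def p_not(p):
    return 1.0 - p

def p_and(p, q):
    # Assumption: the two premises are independent.
    return p * q

def p_or(p, q):
    return p + q - p * q

def compose_answer(p_q1, p_q2, connective):
    """Probability of 'yes' for 'Q1 <connective> Q2' given each sub-answer."""
    if connective == "and":
        return p_and(p_q1, p_q2)
    if connective == "or":
        return p_or(p_q1, p_q2)
    if connective == "and not":
        return p_and(p_q1, p_not(p_q2))
    raise ValueError(f"unknown connective: {connective}")

# Hypothetical model outputs: P("Is there a dog?") and P("Is it raining?").
p_dog, p_rain = 0.9, 0.2
p_both = compose_answer(p_dog, p_rain, "and")
p_either = compose_answer(p_dog, p_rain, "or")
p_dog_no_rain = compose_answer(p_dog, p_rain, "and not")
```

A model that answers the composed question directly can then be checked against these derived values to see whether its answers are logically consistent.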
Causal VQA: does my multimodal model capture causation or correlation?
Agarwal et al., Towards Causal VQA: Revealing & Reducing Spurious Correlations by Invariant & Covariant Semantic Editing. CVPR 2020
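The invariant and covariant editing idea can be illustrated with a symbolic scene: removing an irrelevant object should leave the answer unchanged, while removing the queried object should change it. This is a toy sketch under assumptions: real semantic editing operates on image pixels, and `robust_model` and `shortcut_model` are hypothetical stand-ins for VQA models answering "Is there a dog?".

```python
def remove_object(scene, label):
    """Semantic edit: delete every instance of one object from the scene."""
    return [obj for obj in scene if obj != label]

def robust_model(scene):
    """Answers 'Is there a dog?' from the dog itself."""
    return "dog" in scene

def shortcut_model(scene):
    """Deliberately flawed: relies on a correlated object instead of the dog."""
    return "frisbee" in scene

def invariant_consistent(model, scene, irrelevant):
    """The answer should NOT change when an irrelevant object is removed."""
    return model(scene) == model(remove_object(scene, irrelevant))

def covariant_consistent(model, scene, relevant):
    """The answer SHOULD change when the queried object is removed."""
    return model(scene) != model(remove_object(scene, relevant))

scene = ["dog", "frisbee", "tree"]

robust_report = (
    invariant_consistent(robust_model, scene, "frisbee"),
    covariant_consistent(robust_model, scene, "dog"),
)
shortcut_report = (
    invariant_consistent(shortcut_model, scene, "frisbee"),
    covariant_consistent(shortcut_model, scene, "dog"),
)
```

Both models give the same answer on the unedited scene, so only the edited scenes reveal which one depends on a spurious correlation.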
Definition: The derivation of knowledge in the study of inference, structure, and reasoning.
Knowledge can also be gained from external sources
Marino et al., OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. CVPR 2019
Gui et al., KAT: A Knowledge Augmented Transformer for Vision-and-Language. NAACL 2022
Zhu et al., Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries. arXiv 2015
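Answering with external knowledge can be sketched as retrieval from a knowledge base keyed on what was detected in the image. The example below is a minimal illustration under assumptions: the triple store `KNOWLEDGE_BASE`, its contents, and the `retrieve` and `answer` functions are hypothetical, and the object detections are given as plain strings rather than produced by a vision model.

```python
# Hypothetical toy knowledge base of (subject, relation, object) triples.
KNOWLEDGE_BASE = [
    ("banana", "is_a", "fruit"),
    ("banana", "rich_in", "potassium"),
    ("fire hydrant", "used_for", "firefighting"),
]

def retrieve(detected_objects, relation):
    """Return triples whose subject was detected and whose relation matches."""
    return [
        (s, r, o)
        for s, r, o in KNOWLEDGE_BASE
        if s in detected_objects and r == relation
    ]

def answer(detected_objects, relation):
    """Answer from the first retrieved fact, or None if nothing is found."""
    facts = retrieve(detected_objects, relation)
    return facts[0][2] if facts else None

# "What is the object next to the car used for?" with detections from the image.
result = answer(["fire hydrant", "car"], "used_for")
missing = answer(["car"], "used_for")
```

The image alone does not contain the answer; it only supplies the entity that links the question to a fact stored outside the model.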