Lecture 8: Introduction to Multimodal Machine Learning — Part 1 Representation
Last edited: 2024-11-28

Challenges in Multimodal Learning

  • Representation
  • Alignment
  • Translation
  • Fusion
  • Co-Learning

Representation

Definition: Learning representations that reflect cross-modal interactions between individual elements, across different modalities.

Representation Fusion

Definition: Learn a joint representation that models cross-modal interactions between individual elements of different modalities.
  • Unimodal encoders can be learned jointly with the fusion network, or pre-trained beforehand

Early and Late Fusion

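A minimal sketch of the two strategies, assuming the common convention that early fusion combines features before a single joint model, while late fusion runs one model per modality and combines their outputs. The `score` model, the feature values, and the weights are hypothetical and only illustrate where the fusion step sits.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def score(weights, features):
    """Toy unimodal or joint model: a linear layer followed by a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def early_fusion(x_a, x_b, joint_weights):
    """Fuse first: concatenate the features, then run one joint model."""
    return score(joint_weights, x_a + x_b)


def late_fusion(x_a, x_b, weights_a, weights_b):
    """Fuse last: run one model per modality, then average their outputs."""
    return 0.5 * (score(weights_a, x_a) + score(weights_b, x_b))


# Hypothetical toy features and weights, chosen only for illustration.
x_audio = [1.0, 2.0]
x_text = [3.0, 4.0, 5.0]
p_early = early_fusion(x_audio, x_text, [0.1, -0.2, 0.3, 0.0, -0.1])
p_late = late_fusion(x_audio, x_text, [0.5, -0.5], [0.2, 0.1, -0.2])
```

In the early variant the joint model can weigh audio and text features against each other directly; in the late variant the modalities only meet when their outputs are averaged.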

Basic Concepts

  • Additive terms
  • Multiplicative ‘interaction’ term

Additive Fusion

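As a hedged illustration, additive fusion can be written as z = W_a x_a + W_b x_b, with one additive term per modality. The sketch below (plain Python lists, hypothetical weights) also checks that this equals a single linear layer applied to the concatenation [x_a; x_b]. The names `W_a`, `W_b`, `x_a`, and `x_b` are assumptions introduced for this example.

```python
def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def additive_fusion(w_a, x_a, w_b, x_b):
    """z = W_a x_a + W_b x_b: one additive term per modality."""
    return [a + b for a, b in zip(matvec(w_a, x_a), matvec(w_b, x_b))]


def concat_linear(w_a, x_a, w_b, x_b):
    """The same weights used as one linear layer on the concatenation."""
    w = [row_a + row_b for row_a, row_b in zip(w_a, w_b)]
    return matvec(w, x_a + x_b)


# Hypothetical toy values.
x_a = [1.0, 2.0]
x_b = [3.0, 4.0, 5.0]
w_a = [[1.0, 0.0],
       [0.0, 1.0]]
w_b = [[1.0, 1.0, 1.0],
       [0.0, 0.0, 2.0]]

z_additive = additive_fusion(w_a, x_a, w_b, x_b)
z_concat = concat_linear(w_a, x_a, w_b, x_b)
```

Both calls return the same vector, which is why a one-layer network on concatenated features can be read as additive fusion.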

Multiplicative Fusion

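One way to sketch multiplicative fusion is a bilinear form, z = sum over i, j of w[i][j] * x_a[i] * x_b[j], so that every cross-modal pair of elements gets its own weight. The function name `bilinear_fusion` and the toy values are assumptions made for this example, not the lecture's notation.

```python
def bilinear_fusion(x_a, x_b, w):
    """Weighted sum over all cross-modal products x_a[i] * x_b[j]."""
    return sum(
        w[i][j] * x_a[i] * x_b[j]
        for i in range(len(x_a))
        for j in range(len(x_b))
    )


# Hypothetical toy values.
x_a = [1.0, 2.0]
x_b = [3.0, 4.0, 5.0]
w = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

z = bilinear_fusion(x_a, x_b, w)                      # 1*3 + 2*4
z_silent = bilinear_fusion(x_a, [0.0, 0.0, 0.0], w)   # second modality all zeros
```

Unlike an additive term, the multiplicative term vanishes whenever one modality is zero, which is what makes it an interaction term: the contribution of x_a depends on the value of x_b.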

Tensor Fusion

The weight matrix can become very large, because the size of the fused tensor is the product of the per-modality feature dimensions.
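To make the growth concrete, here is a small sketch assuming the common tensor-fusion formulation in which a constant 1 is appended to each modality vector before taking the outer product, so that unimodal and bimodal terms both appear. The helper `weight_count` and the dimensions used are hypothetical.

```python
def tensor_fusion(x_a, x_b):
    """Outer product of [x_a; 1] and [x_b; 1] as a nested list."""
    a = x_a + [1.0]   # appended 1 keeps the unimodal terms
    b = x_b + [1.0]
    return [[a_i * b_j for b_j in b] for a_i in a]


def weight_count(dims, out_dim):
    """Weights needed to map the fused tensor to out_dim outputs."""
    count = out_dim
    for d in dims:
        count *= d + 1
    return count


fused = tensor_fusion([2.0, 3.0], [5.0])
two_modalities = weight_count([9, 9], 1)
three_modalities = weight_count([99, 99, 99], 10)
```

With three 99-dimensional modalities and 10 outputs, the count already reaches ten million weights, since each extra modality multiplies the size rather than adding to it.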

Low-rank Fusion

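In conventional tensor fusion, the output is $h = W \cdot Z$ with $Z = z_v \otimes z_l$, and the weight tensor $W$ can be extremely large.

Weight Decomposition

$W = \sum_{i=1}^{r} W_v^{(i)} \otimes W_l^{(i)}$

where $r$ is the rank, $W_v^{(i)}$ is the sub-weight matrix for the visual modality, and $W_l^{(i)}$ is the sub-weight matrix for the language modality.

Input Feature Decomposition

The input feature $Z = z_v \otimes z_l$ can likewise be decomposed into per-modality sub-features:

$h = \sum_{i=1}^{r} \left(W_v^{(i)} z_v\right) \circ \left(W_l^{(i)} z_l\right)$

where $W_v^{(i)} z_v$ is the sub-feature obtained by applying the projection matrix $W_v^{(i)}$ to the visual feature $z_v$. The language term $W_l^{(i)} z_l$ is analogous.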
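The following sketch checks the idea numerically under assumed notation: rank-r factor matrices `factors_v[i]` and `factors_l[i]` (one pair per rank component), a visual feature `z_v`, and a language feature `z_l`. It computes the low-rank output as a sum of element-wise products of projected features and compares it against a reference that builds the full weight tensor explicitly. All values are hypothetical.

```python
def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def low_rank_fusion(z_v, z_l, factors_v, factors_l):
    """h = sum_i (W_v_i z_v) * (W_l_i z_l), element-wise; no full tensor."""
    d_h = len(factors_v[0])
    h = [0.0] * d_h
    for w_v, w_l in zip(factors_v, factors_l):
        p_v = matvec(w_v, z_v)   # projected visual sub-feature
        p_l = matvec(w_l, z_l)   # projected language sub-feature
        for k in range(d_h):
            h[k] += p_v[k] * p_l[k]
    return h


def full_tensor_fusion(z_v, z_l, factors_v, factors_l):
    """Reference: form W[k][a][b] from the factors, contract with z_v x z_l."""
    d_h = len(factors_v[0])
    h = []
    for k in range(d_h):
        total = 0.0
        for a in range(len(z_v)):
            for b in range(len(z_l)):
                w_kab = sum(w_v[k][a] * w_l[k][b]
                            for w_v, w_l in zip(factors_v, factors_l))
                total += w_kab * z_v[a] * z_l[b]
        h.append(total)
    return h


# Hypothetical rank-2 example with d_v = 2, d_l = 3, d_h = 2.
z_v = [1.0, 2.0]
z_l = [3.0, 4.0, 5.0]
factors_v = [[[1.0, 0.0], [0.0, 1.0]],
             [[1.0, 1.0], [2.0, 0.0]]]
factors_l = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
             [[0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]]

h_low_rank = low_rank_fusion(z_v, z_l, factors_v, factors_l)
h_full = full_tensor_fusion(z_v, z_l, factors_v, factors_l)
```

The two results agree, but the low-rank path stores only r * (d_v + d_l) * d_h weights instead of d_v * d_l * d_h, and never materializes the outer product of the inputs.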

Contrastive Language-Image Pretraining (CLIP)

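The similarity measure used here is typically cosine similarity.

During training, CLIP builds a matrix of positive and negative samples from each batch. The similarity matrix between image and text features is used to compute the loss: diagonal entries are positive pairs and off-diagonal entries are negative pairs.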
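A minimal sketch of this batch-wise loss, assuming cosine similarity, a fixed temperature of 0.07 (in CLIP the temperature is learned; the fixed value is an assumption for illustration), and a symmetric cross-entropy over the rows and columns of the similarity matrix. The function names and toy features are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_feats, text_feats):
    """Entry [i][j] is the cosine similarity of image i and text j."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)


def clip_loss(image_feats, text_feats, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropies."""
    sims = similarity_matrix(image_feats, text_feats)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Hypothetical batch of two image-text pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.1], [0.1, 1.0]]
loss_aligned = clip_loss(images, texts)
loss_shuffled = clip_loss(images, texts[::-1])
```

When the matching pairs sit on the diagonal the loss is small; shuffling the texts moves the high similarities off the diagonal and the loss rises.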