A Unified Approach for Conventional Zero-Shot, Generalized Zero-Shot, and Few-Shot Learning

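The paper summarizes the problems of conventional supervised learning:

It relies heavily on labeled training data. When the dataset is very large, there is no guarantee that every class has sufficient and reliable annotations.
After a model has been trained, new classes may appear that the model was never trained on.
In practice, a new class can often be inferred from existing classes, but supervised learning does not take this into account.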

The goal of ZSL is to use semantic information to connect newly appearing classes with existing ones.

The paper groups existing ZSL methods into two approaches:

attribute/word vector prediction

Given an image, they attempt to approximate label embedding and then classify an unseen class image based on the similarity of predicted vector with unseen attribute/word vector.

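Given an image, a label embedding is first predicted from it. The similarity between the predicted embedding and the attribute/word vector of each unseen class is then computed, and the image is classified by that similarity.

compatibility function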

learn a compatibility function between image and label embeddings, which returns a compatibility score. An unseen instance is then assigned to the class that gives the maximum score.

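What is learned here is a function that returns a compatibility score between an image and each label embedding, and classification is based on that score. Only the image and the label embeddings are involved.

Main contributions:

It proposes Class Adapting Principal Directions (CAPD) to relate image features to semantic features.
It defines a semantic space and proposes inferring unseen classes from seen classes.
It proposes clustering samples of the same class so that the inference of unseen classes is more robust.

Main idea:

First, a matrix W is learned for each seen class. For every image, these matrices produce one vector p per seen class. Then, for each unseen class, a mapping from the seen classes to that unseen class is learned; through it, the p of each unseen class is obtained from the p's of the seen classes. Finally, the p of each unseen class is dotted with its corresponding embedding e, and the class with the largest value is the predicted unseen label.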
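
The inference path described above can be sketched as follows. This is a minimal illustration rather than the paper's implementation: the function names, the toy matrices, and the weights `thetas` are all made up for the example, and plain Python lists are used so the sketch has no dependencies.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def seen_capds(W_list, x):
    """p_s = W_s^T x for every seen class s.

    Each W_s is stored as a list of rows (feature_dim rows, embed_dim columns).
    """
    capds = []
    for W in W_list:
        embed_dim = len(W[0])
        p = [sum(W[i][k] * x[i] for i in range(len(x))) for k in range(embed_dim)]
        capds.append(p)
    return capds


def unseen_capd(P_seen, theta_u):
    """p_u = sum_s theta_{s,u} * p_s (weighted combination of seen CAPDs)."""
    dim = len(P_seen[0])
    return [sum(t * p[k] for t, p in zip(theta_u, P_seen)) for k in range(dim)]


def predict_unseen(W_list, x, thetas, E_unseen):
    """Return the index of the unseen class with the largest <p_u, e_u>."""
    P_seen = seen_capds(W_list, x)
    scores = [dot(unseen_capd(P_seen, th), e) for th, e in zip(thetas, E_unseen)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy setup (hypothetical numbers): 2 seen classes, 2 unseen classes,
# 2-D image features and 2-D semantic embeddings.
W_list = [
    [[1.0, 0.0], [0.0, 1.0]],  # W_1: identity
    [[0.0, 1.0], [1.0, 0.0]],  # W_2: swaps the two coordinates
]
thetas = [[1.0, 0.0], [0.5, 0.5]]    # theta_u for each unseen class (assumed given)
E_unseen = [[1.0, 0.0], [0.0, 1.0]]  # e_u for each unseen class
x = [2.0, 1.0]

label, scores = predict_unseen(W_list, x, thetas, E_unseen)
print(label, scores)
```

In this toy case the first unseen class wins because its combined direction has the larger projection onto its own embedding.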

Notation:

label: $\mathbf{y} = \mathbf{y}^{\mathcal{S}} \cup \mathbf{y}^{\mathcal{U}}$, where $\mathbf{y}^{\mathcal{S}} = \lbrace 1, \ldots, \mathrm{S} \rbrace$ are the seen-class labels and $\mathbf{y}^{\mathcal{U}} = \lbrace \mathrm{S}+1, \ldots, \mathrm{S}+\mathrm{U} \rbrace$ are the unseen-class labels.
semantic class embedding: $\mathbf { E } ^ { \mathcal { S } } = \lbrace \mathbf { e } _ { s } : s \in \mathbf { y } ^ { \mathcal { S } } \rbrace $, $\mathbf { E } ^ { \mathcal { U } } = \lbrace \mathbf { e } _ { u } : u \in \mathbf { y } ^ { \mathcal { U } } \rbrace $, $\mathbf { e } _ { s } , \mathbf { e } _ { u } \in \mathbb { R } ^ { d }$
Images: $\mathbf { X } _ { s } = \left[ \mathbf { x } _ { s } ^ { 1 } , \ldots , \mathbf { x } _ { s } ^ { n _ { s } } \right]$, $\mathbf { X } _ { u } = \left[ \mathbf { x } _ { u } ^ { 1 } , \ldots , \mathbf { x } _ { u } ^ { n _ { u } } \right]$, where $n_s$ is the total number of samples in seen class $s$, and likewise $n_u$ for unseen class $u$.
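The goal of ZSL is to assign a class to each image from the unseen classes.
The goal of GZSL is to assign a class to any image, which may come from either a seen class or an unseen class.
FSL has the same goal as GZSL, except that a small number of unseen-class samples are added to the training data.

Class Adapting Principal Direction (CAPD)

CAPD handles seen classes and unseen classes differently.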

CAPD on seen class
$$
\mathbf { p } _ { s } = \mathbf { W } _ { s } ^ { T } \mathbf { x } _ { s }
$$
For a seen class, a mapping function $\mathbf{W}_s$ must be learned that maps an image to its principal direction $\mathbf{p}_s$ in the semantic space.

$\mathbf{W}_s$ is learned with the following objective function:
$$
\min _ { \mathbf { W } _ { s } } \frac { 1 } { \kappa } \sum _ { c = 1 } ^ { S } \sum _ { m = 1 } ^ { n _ { c } } \log \left( 1 + \exp \lbrace L \left( \mathbf { x } _ { c } ^ { m } ; \mathbf { W } _ { s } \right) \rbrace \right) + \frac { \lambda _ { s } } { 2 } \left\| \mathbf { W } _ { s } \right\| _ { 2 } ^ { 2 }
$$

$$
L \left( \mathbf { x } _ { c } ^ { m } ; \mathbf { W } _ { s } \right) =
\begin{cases}
\left\langle \mathbf { p } _ { s } , \mathbf { e } _ { c } \right\rangle - \left\langle \mathbf { p } _ { s } , \mathbf { e } _ { s } \right\rangle , & { c \neq s } \\
\left\langle \mathbf { p } _ { s } , \frac { 1 } { \mathrm { S } - 1 } \sum _ { t \neq s } \mathbf { e } _ { t } \right\rangle - \left\langle \mathbf { p } _ { s } , \mathbf { e } _ { s } \right\rangle , & { c = s }
\end{cases}
$$

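The first case of the loss covers an image that belongs to class c with c ≠ s: the p generated by the matrix of class s should have a small projection onto the embedding of the image's own label c and a large projection onto the embedding of label s.

The second case covers an image that belongs to class s: the generated p should have a small projection onto the mean of the other labels' embeddings and a large projection onto the embedding of its own label. (The paper notes that this case also ensures the projection of p onto its own e exceeds its projection onto the mean of the other e's.)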
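
To make the two cases concrete, the objective can be evaluated for a fixed $\mathbf{W}_s$ as in the sketch below. It follows the case definition exactly as written above; the helper names, the toy data, and the treatment of `kappa` as a plain normalization constant are assumptions for illustration, since the notes do not define $\kappa$. The sketch only evaluates the objective and does not optimize it.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def capd(W, x):
    """p_s = W_s^T x, with W stored as feature_dim rows of embed_dim entries."""
    return [sum(W[i][k] * x[i] for i in range(len(x))) for k in range(len(W[0]))]


def margin(p_s, E_seen, c, s):
    """L(x_c^m; W_s) for a sample of class c under the matrix of class s."""
    if c != s:
        return dot(p_s, E_seen[c]) - dot(p_s, E_seen[s])
    # c == s: compare against the mean embedding of all other seen classes.
    others = [e for t, e in enumerate(E_seen) if t != s]
    mean_other = [sum(col) / len(others) for col in zip(*others)]
    return dot(p_s, mean_other) - dot(p_s, E_seen[s])


def objective(W, s, data, E_seen, lam, kappa):
    """Logistic loss over all seen-class samples plus the L2 penalty on W_s.

    data: list of (x, c) pairs, where c is the 0-based seen-class index.
    """
    total = 0.0
    for x, c in data:
        p_s = capd(W, x)
        total += math.log(1.0 + math.exp(margin(p_s, E_seen, c, s)))
    reg = 0.5 * lam * sum(w * w for row in W for w in row)
    return total / kappa + reg


# Toy data (hypothetical): two seen classes with one sample each.
E_seen = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
data = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]

value = objective(W, s=0, data=data, E_seen=E_seen, lam=0.1, kappa=len(data))
print(value)
```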

CAPD on unseen class

For unseen classes, the paper proposes estimating the p of each image with a bilinear map.
$$
\mathbf { p } _ { u } = \sum _ { s = 1 } ^ { S } \theta _ { s , u } \mathbf { p } _ { s } = \mathbf { P } ^ { \mathcal { S } } \theta _ { u }
$$
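In other words, the p of each unseen class is a weighted combination of the seen-class p's, so θ acts as a similarity measure between seen and unseen classes. To obtain θ, the paper first learns a metric M: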
$$
\begin{gathered}
\max _ { \mathbf { M } } \min _ { ( i , j ) \in \overline { A } } d _ { \mathbf { M } } ^ { 2 } \left( \mathbf { p } _ { i } , \mathbf { p } _ { j } \right) \quad \text { s.t. } \sum _ { ( i , j ) \in \mathbf { A } } d _ { \mathbf { M } } ^ { 2 } \left( \mathbf { p } _ { i } , \mathbf { p } _ { j } \right) \leq 1 \\
d _ { \mathbf { M } } \left( \mathbf { p } _ { i } , \mathbf { p } _ { j } \right) = \sqrt { \left( \mathbf { p } _ { i } - \mathbf { p } _ { j } \right) ^ { T } \mathbf { M } \left( \mathbf { p } _ { i } - \mathbf { p } _ { j } \right) }
\end{gathered}
$$

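This keeps the sum of squared distances over all same-class pairs $(i, j) \in \mathbf{A}$ in the training data at most 1, while maximizing the minimum squared distance over different-class pairs $(i, j) \in \overline{A}$.

Once M has been learned, the embedding e of an unseen class can be estimated as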
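
As a small illustration of the two quantities in this problem, the sketch below evaluates the constraint (sum over same-class pairs) and the objective (minimum over different-class pairs) for a given M. It does not solve for M; the function names, the toy points, and the diagonal M are hypothetical.

```python
import math


def mahalanobis_sq(p_i, p_j, M):
    """Squared distance d_M^2 = (p_i - p_j)^T M (p_i - p_j)."""
    diff = [a - b for a, b in zip(p_i, p_j)]
    M_diff = [sum(m * d for m, d in zip(row, diff)) for row in M]
    return sum(d * md for d, md in zip(diff, M_diff))


def constraint_and_objective(points, labels, M):
    """Return (sum of d_M^2 over same-class pairs,
               min of d_M^2 over different-class pairs)."""
    same_sum, diff_min = 0.0, math.inf
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d2 = mahalanobis_sq(points[i], points[j], M)
            if labels[i] == labels[j]:
                same_sum += d2
            else:
                diff_min = min(diff_min, d2)
    return same_sum, diff_min


# Toy example (hypothetical): two classes, two points each.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 2.0]]
labels = [0, 0, 1, 1]
M = [[0.5, 0.0], [0.0, 1.0]]  # a candidate metric, not a learned one

same_sum, diff_min = constraint_and_objective(points, labels, M)
print(same_sum, diff_min)
```

A candidate M is feasible when `same_sum` is at most 1, and among feasible candidates the one with the largest `diff_min` is preferred.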
$$
\hat { \mathbf { e } } _ { u } = \sum _ { s = 1 } ^ { \mathrm { S } } \alpha _ { s , u } \mathbf { e } _ { s } = \mathbf { E } ^ { \mathcal { S } } \alpha _ { u }
$$
The α above is obtained from:
$$
\min _ { \alpha _ { u } } \left( \hat { \mathbf { e } } _ { u } - \mathbf { e } _ { u } \right) ^ { T } \mathbf { M } \left( \hat { \mathbf { e } } _ { u } - \mathbf { e } _ { u } \right) + \frac { \lambda _ { u } } { 2 } \left\| \alpha _ { u } \right\| _ { 2 } ^ { 2 }
$$
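
As a worked step, this minimization has a closed form under two assumptions that are not stated in these notes: $\mathbf{M}$ is symmetric positive semidefinite, and $\alpha_u$ is unconstrained. Writing $\hat{\mathbf{e}}_u = \mathbf{E}^{\mathcal{S}} \alpha_u$ with the seen-class embeddings as the columns of $\mathbf{E}^{\mathcal{S}}$, and setting the gradient with respect to $\alpha_u$ to zero, gives:
$$
\begin{aligned}
2 \left( \mathbf{E}^{\mathcal{S}} \right)^{T} \mathbf{M} \left( \mathbf{E}^{\mathcal{S}} \alpha_u - \mathbf{e}_u \right) + \lambda_u \alpha_u &= 0 \\
\alpha_u &= \left( \left( \mathbf{E}^{\mathcal{S}} \right)^{T} \mathbf{M} \mathbf{E}^{\mathcal{S}} + \frac{\lambda_u}{2} \mathbf{I} \right)^{-1} \left( \mathbf{E}^{\mathcal{S}} \right)^{T} \mathbf{M} \mathbf{e}_u
\end{aligned}
$$
With $\lambda_u > 0$ the matrix being inverted is positive definite, so under these assumptions the solution is unique.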

reduced set description of unseen classes

The paper argues that inferring an unseen class does not require all seen classes to participate, so the S classes are reduced to N classes:
$$
\hat { \mathbf { e } } _ { u } = \sum _ { i = 1 } ^ { N } \beta _ { i , u } \mathbf { e } _ { i }
$$
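Here the N classes are found by a nearest-neighbor search over the embeddings e.

The paper also proposes using kernel density estimation to obtain a probability for the distances: the number of seen classes with the highest probability score is assigned as the value of N.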
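
A minimal sketch of the reduced-set idea follows. It assumes plain Euclidean distance between embeddings for the nearest-neighbor search, which may differ from the distance the paper uses, and the weights `betas` are simply given rather than learned. All names and numbers are hypothetical.

```python
import math


def nearest_seen_classes(e_u, E_seen, n):
    """Indices of the n seen-class embeddings closest to e_u (Euclidean)."""
    dists = [math.dist(e_u, e_s) for e_s in E_seen]
    return sorted(range(len(E_seen)), key=dists.__getitem__)[:n]


def reduced_estimate(betas, E_seen, idx):
    """e_hat_u = sum_i beta_{i,u} * e_i over the selected seen classes only."""
    dim = len(E_seen[0])
    return [sum(b * E_seen[i][k] for b, i in zip(betas, idx)) for k in range(dim)]


# Toy example (hypothetical): 4 seen classes, keep the N = 2 nearest.
E_seen = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.9, 0.2]]
e_u = [1.0, 0.1]

idx = nearest_seen_classes(e_u, E_seen, n=2)
e_hat = reduced_estimate([0.5, 0.5], E_seen, idx)
print(idx, e_hat)
```

Only the selected classes contribute to the estimate, so distant seen classes such as the third one here have no influence on the unseen class.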

to be continued.