Intro

Commonly used methods for learning face feature embeddings each have drawbacks:

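 1. Classification-based methods
 - The fully connected layer W used for classification grows with the number of training identities, so the network carries a large number of parameters.
 - In an open-set setting, the learned face features are not discriminative enough.
 2. Triplet-based methods
 - On large-scale datasets, the number of possible triplets explodes combinatorially.
 - Mining a `hard sample` is difficult.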

The authors propose the Additive Angular Margin Loss (ArcFace) to improve the discriminative power of face features.

ArcFace pipeline diagram

The pipeline consists of the following steps:

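1. Normalize the face feature $x_i$ and the weights $W$ of the last fully connected layer, then take their dot product. The result is their cosine similarity, $\cos\theta_j$.
2. Use $\arccos(\cdot)$ to recover the angle $\theta_j$ between the feature $x_i$ and $W_j$; for the ground-truth class this angle is $\theta_{y_i}$. Here $W_j$ can be viewed as the center of class $j$.
3. Add a margin $m$ to the target angle $\theta_{y_i}$ and compute $\cos(\theta_{y_i} + m)$.
4. Keep the logits of the non-target classes at $\cos\theta_j$, then multiply every logit, including the one that received the margin, by a scale $s$.
5. The remaining steps are the same as in the ordinary softmax loss.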
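
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `arcface_logits` and `softmax_cross_entropy` and the array layout (features as rows, class weights as columns) are assumptions made for this example. It also ignores the case $\theta_{y_i} + m > \pi$, which practical implementations usually guard against.

```python
import numpy as np

def arcface_logits(x, W, labels, s=64.0, m=0.5):
    """x: (N, d) features, W: (d, n) class weights, labels: (N,) class ids."""
    # Step 1: L2-normalize features (rows) and class weights (columns),
    # so that the dot product equals cos(theta_j).
    x_n = x / np.linalg.norm(x, axis=1, keepdims=True)
    W_n = W / np.linalg.norm(W, axis=0, keepdims=True)
    cos = np.clip(x_n @ W_n, -1.0, 1.0)
    # Step 2: recover the angles with arccos.
    theta = np.arccos(cos)
    # Step 3: add the margin m to the ground-truth angle only.
    rows = np.arange(x.shape[0])
    logits = cos.copy()
    logits[rows, labels] = np.cos(theta[rows, labels] + m)
    # Step 4: rescale all logits by s.
    return s * logits

def softmax_cross_entropy(logits, labels):
    """Step 5: ordinary softmax cross-entropy, averaged over the batch."""
    z = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))      # 4 samples, 8-D features (toy sizes)
W = rng.normal(size=(8, 3))      # 3 classes
labels = np.array([0, 1, 2, 1])
loss = softmax_cross_entropy(arcface_logits(x, W, labels), labels)
```

Setting `m=0.0` reduces the function to the normalized softmax logits, which is a convenient sanity check.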

Proposed Approach

The original softmax loss is:
$$
L_{1}=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{W_{y_{i}}^{T} x_{i}+b_{y_{i}}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_{i}+b_{j}}}
$$
Setting the bias to zero, $l_2$-normalizing $W$ and $x$, and introducing a scale $s$ gives:
$$
L_{2}=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos \theta_{y_{i}}}}
{e^{s \cos \theta_{y_{i}}}+\sum_{j=1, j \neq y_{i}}^{n} e^{s \cos \theta_{j}}}
$$
Adding a margin $m$ to the target angle increases intra-class compactness and inter-class discrepancy:
$$
L_{3}=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\left(\cos \left(\theta_{y_{i}}+m\right)\right)}}{e^{s\left(\cos \left(\theta_{y_{i}}+m\right)\right)}+\sum_{j=1, j \neq y_{i}}^{n} e^{s \cos \theta_{j}}}
$$
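
To see what the margin does numerically, the following sketch evaluates $L_2$ and $L_3$ for a single sample whose angles to three class centers are given directly. The angles (40°, 60°, 90°) and the helper name `per_sample_loss` are made up for illustration; $s = 64$ and $m = 0.5$ are the values listed in the experimental settings.

```python
import numpy as np

def per_sample_loss(thetas, y, s=64.0, m=0.0):
    """Loss of one sample, given its angles (radians) to every class center."""
    logits = s * np.cos(thetas)
    logits[y] = s * np.cos(thetas[y] + m)   # margin on the target angle only
    logits = logits - logits.max()          # numerical stability
    return -(logits[y] - np.log(np.exp(logits).sum()))

# Hypothetical sample: 40 degrees from its own class center,
# 60 and 90 degrees from the two other centers.
thetas = np.radians([40.0, 60.0, 90.0])
loss_l2 = per_sample_loss(thetas, y=0, m=0.0)   # L2: no margin
loss_l3 = per_sample_loss(thetas, y=0, m=0.5)   # L3: m = 0.5 rad (about 28.6 degrees)
```

Without the margin the sample is comfortably on the right side (40° < 60°) and its loss is close to zero. With the margin, the target angle is treated as roughly 68.6°, which is worse than the 60° to the nearest other class, so the loss becomes large and keeps pushing the feature toward its class center.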

A comparison on two-dimensional features makes the difference clearly visible:

Softmax V.S. ArcFace

Comparison

Several earlier works also add a margin to the softmax loss. The following are compared here:

SphereFace [CVPR 2017]
$$
L_{ang} = \frac{1}{N} \sum _{i} - \log{ \frac{e^{\|x_i\| \cos{(m \theta _{y_i, i})}}}
{e^{\|x_i\| \cos{(m \theta _{y_i, i})}} + \sum _{j \ne y _i} e^{\|x_i\| \cos{(\theta _{j, i})}}}}
$$

SphereFace applies a multiplicative margin in angle space.

CosFace [CVPR 2018]

$$
\begin{equation}
L_{l m c}=\frac{1}{N} \sum_{i}-\log \frac{e^{s\left(\cos \left(\theta_{y_{i}, i}\right)-m\right)}}{e^{s\left(\cos \left(\theta_{y_{i}, i}\right)-m\right)}+\sum_{j \neq y_{i}} e^{s \cos \left(\theta_{j, i}\right)}}
\end{equation}
$$

CosFace applies an additive margin in cosine space.

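The authors unify these three losses in a single formula, where $m_1$, $m_2$, and $m_3$ are the SphereFace, ArcFace, and CosFace margins respectively: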
$$
\begin{equation}
L_{4}=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\left(\cos \left(m_{1} \theta_{y_{i}}+m_{2}\right)-m_{3}\right)}}{e^{s\left(\cos \left(m_{1} \theta_{y_{i}}+m_{2}\right)-m_{3}\right)}+\sum_{j=1, j \neq y_{i}}^{n} e^{s \cos \theta_{j}}}
\end{equation}
$$
Comparison
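
The unified form $L_4$ differs from $L_2$ only in the target logit, so the three margins can be compared with one small function. The sketch below is illustrative: `combined_margin_target` is a hypothetical helper, and the margin values (1.35, 0.5, 0.35) and the 30° test angle are example choices rather than settings quoted from these notes.

```python
import numpy as np

def combined_margin_target(theta_y, m1=1.0, m2=0.0, m3=0.0):
    """Target logit before scaling: cos(m1 * theta_y + m2) - m3."""
    return np.cos(m1 * theta_y + m2) - m3

theta = np.radians(30.0)                                # example target angle
no_margin = combined_margin_target(theta)               # plain cos(theta)
sphereface = combined_margin_target(theta, m1=1.35)     # multiplicative angular margin
arcface = combined_margin_target(theta, m2=0.5)         # additive angular margin
cosface = combined_margin_target(theta, m3=0.35)        # additive cosine margin
```

With $m_1 = 1$, $m_2 = 0$, $m_3 = 0$ the function returns $\cos\theta_{y_i}$, so $L_4$ falls back to $L_2$. Each of the three margins lowers the target logit for the same angle, which is what forces the feature to move closer to its class center during training.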

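The authors also propose several variants that approach "reducing intra-class variation and enlarging inter-class variation" from different angles: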

Intra-Loss
$$
L_{5}=L_{2}+\frac{1}{\pi N} \sum_{i=1}^{N} \theta_{y_{i}}
$$
Inter-Loss
$$
\begin{equation}
L_{6}=L_{2}-\frac{1}{\pi N(n-1)} \sum_{i=1}^{N} \sum_{j=1, j \neq y_{i}}^{n} \arccos \left(W_{y_{i}}^{T} W_{j}\right)
\end{equation}
$$
Triplet-Loss
$$
\begin{equation}
\arccos \left(x_{i}^{p o s} x_{i}\right)+m \leq \arccos \left(x_{i}^{n e g} x_{i}\right)
\end{equation}
$$
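
The extra terms in $L_5$ and $L_6$ are simple averages of angles, as the sketch below shows. It assumes that the features `x_n` (rows) and class weights `W_n` (columns) are already $l_2$-normalized; the function names are hypothetical and chosen for this example only.

```python
import numpy as np

def intra_term(x_n, W_n, labels):
    """(1 / (pi * N)) * sum_i theta_{y_i}, the term added in L5."""
    cos_y = np.clip(np.sum(x_n * W_n[:, labels].T, axis=1), -1.0, 1.0)
    return np.arccos(cos_y).sum() / (np.pi * len(labels))

def inter_term(W_n, labels):
    """(1 / (pi * N * (n - 1))) * sum_i sum_{j != y_i} arccos(W_{y_i}^T W_j),
    the term subtracted in L6."""
    n = W_n.shape[1]
    angles = np.arccos(np.clip(W_n.T @ W_n, -1.0, 1.0))  # (n, n) center-to-center angles
    np.fill_diagonal(angles, 0.0)                         # drop the j == y_i entries
    return angles[labels].sum() / (np.pi * len(labels) * (n - 1))

# Toy check: three mutually orthogonal class centers, features on their centers.
W_n = np.eye(3)
labels = np.array([0, 2])
x_n = np.eye(3)[labels]
intra = intra_term(x_n, W_n, labels)   # features sit exactly on their centers
inter = inter_term(W_n, labels)        # every pair of centers is 90 degrees apart
```

Dividing by $\pi$ maps each angle into $[0, 1]$, so both terms stay on a scale comparable to each other: the intra term is 0 when every feature coincides with its class center, and the inter term is 0.5 when all centers are orthogonal.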

Experiments

The paper evaluates on a large number of datasets:

Datasets

Some of the experimental settings are:

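- Aligned face crops of $112\times112$
- Backbones: ResNet50 and ResNet100
- The structure after the last convolutional layer is "BN-Dropout-FC-BN"
- 512-D face feature
- s = 64
- m = 0.5
- Batch size = 512
- On CASIA: lr = 0.1, divided by 10 at 20k and 28k iterations, 32k iterations in total
- On MS1MV2: lr divided by 10 at 100k and 160k iterations
- momentum = 0.9, weight decay = 5e-4
- No identity overlap between the training and test sets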
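
The learning-rate settings above describe a step schedule. A minimal sketch follows, assuming the rate drops exactly at each listed iteration (the notes do not say whether the drop happens at or after the milestone); `step_lr` is a hypothetical helper, not part of any training framework.

```python
def step_lr(iteration, base_lr=0.1, milestones=(20_000, 28_000), gamma=0.1):
    """Base learning rate multiplied by gamma once for every milestone reached."""
    passed = sum(iteration >= m for m in milestones)
    return base_lr * gamma ** passed

# First schedule: lr 0.1, divided by 10 at 20k and 28k iterations.
lr_start = step_lr(0)
lr_mid = step_lr(25_000)
lr_end = step_lr(31_999)

# Second schedule: same rule with milestones at 100k and 160k iterations.
lr_ms1m = step_lr(120_000, milestones=(100_000, 160_000))
```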

The experimental comparisons are very thorough and are not covered in detail here; see the original paper.

Parallel Acceleration

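The paper notes that when the number of identities is very large (millions of identities), W may exceed the GPU memory. In that case a parallel acceleration strategy can be used. [A distributed training solution for face recognition]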

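Both the feature x and the fully connected layer parameters W are computed in parallel. On 8 GTX 1080 Ti GPUs (11 GB each), with ResNet50, a batch size of 8*64, a feature dimension of 512, and float32, the authors process 800 samples per second.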
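
A rough back-of-the-envelope estimate shows why the classification layer becomes a memory problem. The numbers below assume 1M identities, 512-D features, a batch of 512, and float32. They also assume that the optimizer keeps a gradient and a momentum buffer the same size as W, and that both the score matrix and its gradient are held in memory. These are assumptions for illustration, not figures from the paper.

```python
def gib(num_elements, bytes_per_element=4):
    """Size of a float32 array in GiB."""
    return num_elements * bytes_per_element / 2**30

num_ids, feat_dim, batch = 1_000_000, 512, 512

w_gib = gib(num_ids * feat_dim)      # W itself, roughly 1.9 GiB
score_gib = gib(batch * num_ids)     # score matrix, roughly 1.9 GiB

# Assumed footprint on a single GPU: W + dW + momentum, plus score + dscore.
single_gpu = 3 * w_gib + 2 * score_gib

# Splitting the identities across 8 GPUs divides all of these by 8.
per_gpu_sharded = single_gpu / 8
```

Under these assumptions the classification layer alone needs about 9.5 GiB on one GPU, which leaves almost nothing of an 11 GB card for the backbone. Sharded over 8 GPUs it drops to about 1.2 GiB per card.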

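Get the feature x

The features of each batch are gathered from the 8 GPUs to form the feature matrix.

Compute the softmax score

$score = xW$

The feature matrix is copied to every GPU, and each GPU computes its sub score in parallel (512 features * 1M/8 identities).

Get dW

$dW = x^T dscore$

On each GPU, the feature matrix is transposed and multiplied in parallel by the gradient of that GPU's sub score, which gives the gradient of W.

Get dx

$dx = dscore W^T$

On each GPU, the sub matrix of W is transposed and multiplied by the gradient of that GPU's sub score. Summing the results from the 8 GPUs gives the gradient of x.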
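
The three matrix products can be checked on a single machine by simulating the shards with NumPy. This is a sketch under simplifying assumptions: `sharded_fc` is a hypothetical name, the "GPUs" are just column slices of W, and the score gradient `dscore` is passed in as given, whereas in real training the softmax normalization needs communication across shards.

```python
import numpy as np

def sharded_fc(x, W, dscore, num_shards=8):
    """Simulate the model-parallel FC layer.
    x: (B, d) gathered features, W: (d, n) weights, dscore: (B, n) score gradient."""
    W_shards = np.array_split(W, num_shards, axis=1)        # each "GPU" owns n/num_shards classes
    d_shards = np.array_split(dscore, num_shards, axis=1)
    sub_scores, dW_shards = [], []
    dx = np.zeros_like(x)
    for W_k, d_k in zip(W_shards, d_shards):
        sub_scores.append(x @ W_k)     # sub score = x W_k
        dW_shards.append(x.T @ d_k)    # dW_k = x^T dscore_k, stays on its shard
        dx += d_k @ W_k.T              # partial dx, summed over all shards
    score = np.concatenate(sub_scores, axis=1)
    dW = np.concatenate(dW_shards, axis=1)
    return score, dW, dx

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))         # toy batch: 16 features of dimension 32
W = rng.normal(size=(32, 80))         # 80 classes, 10 per shard
dscore = rng.normal(size=(16, 80))
score, dW, dx = sharded_fc(x, W, dscore)
```

Only `dx` requires a reduction across shards; `score` and `dW` are computed and kept shard-locally, which is why the large matrix W never has to live on a single device.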