Today I read a paper titled “Neural Aggregation Network for Video Face Recognition”.
The abstract is:
In this paper, we present a Neural Aggregation Network (NAN) for video face recognition.
The network takes as input a face video or face image set of a person, with a variable number of face frames, and produces a compact, fixed-dimension visual representation of that person.
The whole network is composed of two modules.
The feature embedding module is a CNN which maps each face frame into a feature representation.
The neural aggregation module is composed of two content-based attention blocks, driven by a memory storing all the features extracted from the face video by the feature embedding module.
The output of the first attention block adapts the second block, whose output serves as the aggregated representation of the video faces.
Due to the attention mechanism, this representation is invariant to the order of the face frames.
The experiments show that the proposed NAN consistently outperforms hand-crafted aggregation strategies such as average pooling, and achieves state-of-the-art accuracy on three video face recognition datasets: YouTube Face, IJB-A, and Celebrity-1000.
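To make the aggregation mechanism concrete, here is a minimal PyTorch sketch of the two-block attention described in the abstract. The class names and the tanh-transformed linear map from the first block's output to the second block's query are my own assumptions for illustration; the abstract only says the first output "adapts" the second block.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """One content-based attention block: scores each frame feature
    against a query vector and returns the softmax-weighted sum."""
    def forward(self, features, query):
        # features: (num_frames, dim), query: (dim,)
        scores = features @ query              # e_k = q^T f_k
        weights = torch.softmax(scores, dim=0) # a_k = softmax(e_k)
        return weights @ features              # r = sum_k a_k f_k

class NeuralAggregation(nn.Module):
    """Sketch of the NAN aggregation module: two attention blocks,
    where the first block's output adapts the query of the second.
    The Linear + tanh adaptation is an assumed form."""
    def __init__(self, dim):
        super().__init__()
        self.q0 = nn.Parameter(torch.randn(dim))  # learnable query of block 1
        self.adapt = nn.Linear(dim, dim)          # maps r0 to block 2's query (assumption)
        self.block = AttentionBlock()

    def forward(self, features):
        # features: (num_frames, dim) from the CNN embedding module;
        # any number of frames is accepted.
        r0 = self.block(features, self.q0)
        q1 = torch.tanh(self.adapt(r0))           # second query adapted by r0
        return self.block(features, q1)           # fixed-dimension aggregate

# Hypothetical usage: 30 frame features of dimension 128
agg = NeuralAggregation(dim=128)
video_features = torch.randn(30, 128)
representation = agg(video_features)  # shape (128,), regardless of frame count
```

Because each frame contributes only through the dot product of its feature with the query, permuting the frames permutes the softmax weights identically, which is why the final representation is invariant to frame order.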