Article Highlight | 13-Dec-2023

YuNet: a tiny millisecond-level face detector

Beijing Zhongke Journal Publishing Co. Ltd.

Face detection has been an attractive topic in computer vision for decades. It is a prerequisite step for many face-related applications such as face recognition, face beautification, face alignment and face tracking. Given an image, face detection locates the face regions with bounding boxes. Many methods have been proposed to improve face detection performance, from early hand-crafted features such as Haar to current CNN-based features. The runtime of two-stage or multi-stage detectors depends on the number of faces. Therefore, single-stage CNN-based detectors have become popular in recent years.


Face detection is less challenging than generic object detection, and accuracy has reached saturation on the challenging WIDER FACE benchmark. Some people may therefore think face detection is a solved problem. However, it is not. The top-ranked methods all use large pre-trained backbone networks, complex feature enhancement modules and heavy test time augmentations (TTAs) to rank higher. For example, one of the best detectors, MogFace, achieves state-of-the-art accuracy with 711M parameters and 808 GFLOPs (for VGA images). This impressive accuracy comes at the cost of considerable storage and computation resources.


However, in real-world applications face detection is widely deployed on edge devices such as cell phones, service robots, surveillance cameras and Internet of things (IoT) devices. These devices have limited storage resources and computing capability due to their cost. In addition, in many applications only a few noticeable faces need to be detected, and tiny faces in the background are normally not needed. Even when deployed on a central server, a fast and efficient detector can save considerable energy and let the server handle a considerable amount of data at the same time. Compared with a huge face detector that improves the average precision (AP) only slightly on some benchmarks, researchers argue that an efficient tiny detector is more urgently needed.


The backbone networks in a face detector are essential for performance. Some popular backbone networks, such as VGG-16 from the VGGNet series, ResNet-50/101/152 from the ResNet series and MobileNet, were originally designed for image classification on ImageNet. Face detection is different from image classification, which takes the output of the deepest layer as the feature vector. To handle objects of different scales, a detector uses feature maps from several layers. Large faces are normally detected from a deeper feature map and, because they carry richer information, are easier to detect than small faces. This strongly suggests that the backbone of a face detector should focus on small faces.


The distribution of face sizes should also be noted. In the WIDER FACE dataset, most faces are small, at less than 20 pixels. The same is true of many face-related applications. Many data augmentation operations, especially random cropping, change the distribution of face sizes. If researchers train a model on a dataset with a different distribution, the AP decreases noticeably. The further the training distribution is from the original one, the lower the AP will be.


A tiny millisecond-level face detector, YuNet, is designed and presented in the rest of the paper. The contributions of the paper are as follows. Firstly, based on their own understanding of face detection, the researchers designed a tiny face detector that has a very limited number of parameters, very low latency and promising accuracy. Secondly, the researchers suggested a data sampling strategy for model training. It can noticeably improve the accuracy of a deep detector, especially a lightweight one. Thirdly, the researchers believe the proposed YuNet to be the best tiny face detector: it achieves an AP of 81.1% on the WIDER FACE validation hard set and has gained more than 11K stars for its effectiveness.


Face detection is a popular topic in object detection and is also very mature for real applications. Over the past decade, deep learning-based face detection has come to handle variations in face scale, pose, occlusion, expression, makeup, illumination, blur, etc., very well. Some benchmarks, such as WIDER FACE, have been widely used for evaluating different methods and have promoted research. Section 2 gives a brief introduction to the related studies.


Section 3 covers the methodology. Before introducing the proposed YuNet, the paper first gives some analysis and design principles. By analyzing the relationship among model size, computational cost and speed, researchers can get some ideas on how to design a good backbone for face detection. They take RetinaFace as an example to analyze how to design a good detector, since most CNN-based face detectors follow a similar design. Based on this analysis, the researchers designed a tiny network for face detection. One principle is to focus on difficult small faces and remove computational cost from easy large faces. Another is to replace standard convolution with depthwise convolution and pointwise convolution. This section shows the architecture of the proposed YuNet, which contains a backbone, a tiny feature pyramid network (TFPN) neck and a head.
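To illustrate why replacing a standard convolution with a depthwise plus pointwise pair shrinks a model, the following minimal sketch counts the weights in each variant. It is a generic calculation, not YuNet's actual layer configuration: the function names and the channel and kernel sizes (64 input channels, 64 output channels, a 3 x 3 kernel) are assumptions chosen only for illustration, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a k x k standard convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative sizes only; these are not taken from the paper.
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(std, sep, round(std / sep, 2))  # 36864 4672 7.89
```

Under these assumed sizes, the separable pair needs roughly one eighth of the weights of the standard layer, which is the kind of saving a tiny detector relies on.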


Section 4 introduces model training, which consists of scale augmentation and training details. Because faces in real-world scenarios vary enormously in scale (from several pixels to thousands of pixels), different scale augmentation strategies are employed to adjust the sample scale distribution in the training phase. The most popular scale augmentation strategies are RandomCrop and its variants.
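As a rough illustration of how a crop-based scale augmentation shifts the face size distribution, the sketch below crops a random square from an image and rescales it to a fixed training size. It is a generic random square crop and not necessarily the sampling strategy used for YuNet: the function name `random_square_crop`, the set of crop scales and the 640-pixel output size are all assumptions made for this example.

```python
import random


def random_square_crop(img_w, img_h, boxes, out_size,
                       scales=(0.3, 0.45, 0.6, 0.8, 1.0), rng=random):
    """Crop a random square, rescale it to out_size, and transform the boxes.

    boxes is a list of (x1, y1, x2, y2) tuples in pixels. The scales are
    hypothetical fractions of the shorter image side, not values from the paper.
    Returns the boxes whose centers fall inside the crop, plus the resize ratio.
    """
    side = int(min(img_w, img_h) * rng.choice(scales))
    x0 = rng.randint(0, img_w - side)  # top-left corner of the crop
    y0 = rng.randint(0, img_h - side)
    ratio = out_size / side            # factor applied to every face size
    kept = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # Keep a face only if its center lies inside the cropped square.
        if x0 <= cx < x0 + side and y0 <= cy < y0 + side:
            kept.append(((x1 - x0) * ratio, (y1 - y0) * ratio,
                         (x2 - x0) * ratio, (y2 - y0) * ratio))
    return kept, ratio


rng = random.Random(0)             # seeded so the run is repeatable
boxes = [(600, 300, 616, 316)]     # one 16-pixel face in a 1280 x 720 image
kept, ratio = random_square_crop(1280, 720, boxes, 640, rng=rng)
print(ratio, [round(b[2] - b[0], 1) for b in kept])
```

With the smallest assumed scale of 0.3, the crop side is 216 pixels and the resize ratio is about 2.96, so a 16-pixel face would become roughly 47 pixels wide. This shows how the choice of crop scales reshapes the size distribution the model sees during training.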


Experiments and results are presented in Section 5. First is the dataset: WIDER FACE is the largest public face detection dataset and has a large number of images and faces. Second is the evaluation on WIDER FACE. The state-of-the-art face detectors SCRFD, RetinaFace and YOLO5Face are collected for comparison, and their performance results are obtained under different test conditions. According to the comparison, YuNet achieves accuracy similar to that of most other small models, but it has far fewer parameters and is much faster. To better understand YuNet, the researchers further conduct experiments to examine how adding or removing some components affects performance, and present the comparison in a table in the paper. Another ablation study concerns the sample scale distribution. Finally, YuNet is found to demonstrate superior inference efficiency compared to all other detectors across all resolutions.


In this paper, an efficient tiny face detector, YuNet, is specifically designed for real-time applications. It can achieve millisecond-level speed on CPUs and is suitable for mobile and embedded devices. In the future, the researchers hope to continue reducing the size of the model and improving its speed while keeping the accuracy unchanged.



See the article:

YuNet: A Tiny Millisecond-level Face Detector
