A scientist from HSE University has developed an image recognition algorithm that works 40% faster than analogues. It can speed up real-time processing of video-based image recognition systems. The results of the study have been published in the journal Information Sciences.
Convolutional neural networks (CNNs), which include a sequence of convolutional layers, are widely used in computer vision. Each layer in a network has an input and an output. The digital description of the image goes to the input of the first layer and is converted into a different set of numbers at the output. The result goes to the input of the next layer and so on until the class label of the object in the image is predicted in the last layer. For example, this class can be a person, a cat, or a chair. For this, a CNN is trained on a set of images with a known class label. The greater the number and variability of the images of each class in the dataset are, the more accurate the trained network will be.
If there are only a few examples in the training set, the additional training (fine-tuning) of the neural network is used. CNN is trained to recognize images from a similar dataset that solves the original problem. For example, when a neural network learns to recognize faces or their attributes (emotions, gender, age), it is preliminary trained to identify celebrities from their photos. The resulting neural network is then fine-tuned on the available small dataset to identify the faces of family or relatives in home video surveillance systems. The more depth (number) of layers there are in a CNN, the more accurately it predicts the type of object in the image. However, if the number of layers is increased, more time is required to recognize objects.
The study's author, Professor Andrey Savchenko of the HSE Campus in Nizhny Novgorod, was able to speed up the work of a pre-trained convolutional neural network with arbitrary architecture, consisting of 90-780 layers in his experiments. The result was an increase in recognition speed of up to 40%, while controlling the loss in accuracy to no more than 0.5-1%. The scientist relied on statistical methods such as sequential analysis and multiple comparisons (multiple hypothesis testing).
"The decision in the image recognition problem is made by a classifier -- a special mathematical algorithm that receives an array of numbers (features/embeddings of an image) as inputs, and outputs a prediction about which class the image belongs to. The classifier can be applied by feeding it the outputs of any layer of the neural network. To recognize "simple" images, the classifier only needs to analyse the data (outputs) from the first layers of the neural network.
There is no need to waste further time if we are already confident in the reliability of the decision made. For "complex" pictures, the first layers are clearly not enough -- you need to move on to the next. Therefore, classifiers were added to the neural network into several intermediate layers. Depending on the complexity of the input image, the proposed algorithm decided whether to continue recognition or complete it. Since it is important to control errors in such a procedure, I applied the theory of multiple comparisons: I introduced many hypotheses, at which intermediate layer to stop, and sequentially tested these hypotheses," explained Professor Savchenko.
If the first classifier already produced a decision that was considered reliable by the multiple hypothesis testing procedure, the algorithm stopped. If the decision was declared unreliable, the calculations in the neural network continued to the intermediate layer, and the reliability check was repeated.
As the scientist notes, the most accurate decisions are obtained for the outputs of the last layers of the neural network. Early network outputs are classified much faster, which means it is necessary to simultaneously train all classifiers in order to accelerate recognition while controlling loss in accuracy. For example, so that the error due to an earlier stop is no more than 1%.
"High accuracy is always important for image recognition. For example, if a decision in face recognition systems is made incorrectly, then either someone outside can gain access to confidential information or conversely the user will be repeatedly denied access, because the neural network cannot identify him correctly. Speed ??can sometimes be sacrificed, but it matters, for example, in video surveillance systems, where it is highly desirable to make decisions in real time, that is, no more than 20-30 milliseconds per frame. To recognize an object in a video frame here and now, it is very important to act quickly, without losing accuracy," said Professor Savchenko.