Neural Networks for Face Recognition




Comprehensive Comparison of Vision Transformers and Traditional Convolutional Neural Networks for Face Recognition Tasks




This paper provides a comprehensive comparison of Vision Transformers (ViTs) and traditional Convolutional Neural Networks (CNNs) for face recognition tasks. We conduct extensive experiments on face identification and face verification, using three datasets with different characteristics to evaluate the models. Our contribution includes an in-depth analysis of the experimental results, with a thorough examination of the training and evaluation process and of the software and hardware configurations used. Our results show that ViTs outperform CNNs in accuracy and inference speed on these tasks, while also having a smaller memory footprint. We conclude that ViTs not only deliver better performance but are also a more efficient and environmentally friendly solution for face recognition, contributing to the ongoing ecological transition.


Figure 1. Common CNN architecture.


Figure 2. Vision Transformer architecture. 

Figure 3 summarizes the training process of six deep learning models on the VGG-Face2 dataset. Each panel shows the training and validation accuracy of one model, allowing a direct comparison of learning dynamics across architectures.


Figure 3. Accuracy in the first 25 epochs when training (blue line) and validating (orange line) the ViT_B32 (a), ResNet_50 (b), VGG_16 (c), Inception_V3 (d), MobileNet_V2 (e), and EfficientNet_B0 (f) networks.


Tables 1 and 2 summarize the results of training and testing the models on the VGG-Face2 dataset, respectively. They report the key performance metrics used to evaluate how accurately and reliably the models recognize faces.


Table 1. Training summary of the networks on the VGG-Face2 dataset. The accuracy is the highest accuracy obtained on the validation set during training.


Table 2. Face identification results obtained for the networks on the evaluation set of the VGG-Face2 dataset.

The ROC curves in Figure 4 summarize the results obtained by the networks on the face verification task using the LFW dataset. While all networks perform comparably well, ViTs perform slightly better, with a curve that lies closest to the top-left corner of the plot. ViTs also show the highest area under the curve (AUC) and the lowest equal error rate (EER), improving on the results of the CNNs.
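As an illustration of the metrics named above, the following minimal sketch shows one way an ROC curve, its AUC, and the EER can be computed from face-pair similarity scores. It is not the authors' evaluation code: the function names, the toy scores, and the convention that a higher score means "same identity" are all assumptions made for this example.

```python
def roc_points(scores, labels):
    """Sweep the decision threshold from high to low and collect ROC points.

    scores: similarity of each face pair (assumed: higher = more similar).
    labels: 1 if the pair shows the same person, 0 otherwise.
    Returns two lists (fpr, tpr), both starting at 0.0 and ending at 1.0.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one genuine and one impostor pair")

    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Pairs with identical scores are accepted together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve, using the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )


def equal_error_rate(fpr, tpr):
    """Error rate at the point where FPR equals FNR (= 1 - TPR).

    The crossing is located by linear interpolation between ROC points.
    """
    for k in range(1, len(fpr)):
        d_prev = fpr[k - 1] - (1 - tpr[k - 1])
        d_curr = fpr[k] - (1 - tpr[k])
        if d_curr >= 0:
            t = -d_prev / (d_curr - d_prev)
            return fpr[k - 1] + t * (fpr[k] - fpr[k - 1])
    return 1.0  # not reached: the last ROC point always satisfies d_curr >= 0


# Toy data (hypothetical): four genuine pairs and four impostor pairs.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

fpr, tpr = roc_points(scores, labels)
print(f"AUC={auc(fpr, tpr):.3f}, EER={equal_error_rate(fpr, tpr):.3f}")
# prints "AUC=0.875, EER=0.250"
```

A curve that lies closer to the top-left corner gives a larger AUC and a smaller EER, which is why the two numbers are reported alongside the ROC plot.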




We have made the implementation of this project, including the models’ weights and extensive results, publicly available.

Click here to download the implementation.
A GitHub version is also available here.

For questions about this implementation, please contact Marcos Rodrigo.



M. Rodrigo, C. Cuevas, and N. García, “Comprehensive Comparison of Vision Transformers and Traditional Convolutional Neural Networks for Face Recognition Tasks”, under review.