Neural Network-based Face Pose Tracking for Interactive Face Recognition System

Abstract: Security has long been an important concern in almost every field. Technological advances have made biometric systems practical for providing a higher level of security. Face recognition is one of the most popular biometric modalities. An interactive face recognition system has been developed to increase the recognition yield while still using a simple recognition algorithm. The system gives responses to the user in order to obtain a frontal face image. A neural network-based face pose estimation is proposed to generate the proper response from the captured face image. The overall accuracy of the pose recognition is 95.6%.


I. INTRODUCTION
We now live in the information era, in which openness of information is essential. At the same time, information security is needed to ensure availability, integrity and confidentiality [1]. Access to information resources often requires a password or PIN, which are the weakest means of authorization. Higher-level security systems use parts of the user's body or gestures, which can provide not only authorization but also authentication [2]. These so-called biometric systems possess a one-to-one relationship between users and the data that represent them. Many biometric modalities have been developed over the last decades, including fingerprint, face recognition, iris recognition, hand geometry, palm print, DNA, speech recognition and signature verification [1].
Among the many biometric modalities, face recognition is one of the most popular for authentication and authorization [3], despite several known problems. These problems arise from internal and external factors present during the process. The internal factors comprise the color and texture of the face, the face pose and the facial expression. The external factors consist of the background scene and the lighting environment [2][3].
Face color and texture can usually be treated as parts of the face that need to be recognized, assuming they do not change for an individual over a relatively long span of time. To reduce the effect of the other factors, those factors are standardized. By keeping the background and lighting constant, the variation due to external factors can be minimized. Variation in facial expression can be reduced by asking the user to keep a neutral expression during the authentication process. The pose can be standardized as well, although some researchers have developed techniques that extend recognition to include face pose variations [4][5][6][7]. This paper, however, uses the frontal face pose as the standard pose, so simpler face recognition techniques are applicable.
Chai et al. found that the frontal face pose gave more accurate results compared to angled poses [4]. Table I shows that the larger the yaw angle, the less accurate the face recognition result. The best result is achieved when the face is frontal to the direction of the camera. To prevent the loss of accuracy caused by the face yaw angle, our system provides a feedback mechanism that keeps the user's face pose frontal during the recognition process. For that we need to estimate the face pose. Face pose estimation techniques have been developed by many researchers. Breitenstein [2]. The criteria developed for this classification were determined heuristically, which in turn still gave some errors. In this paper we propose the use of an artificial neural network to classify the face pose.
Some researchers have used neural networks for face recognition and pose estimation. Huang et al. [5], Latta et al. [11] and Nazeer et al. [12] used PCA to generate eigenfaces and combined them with a neural network for face recognition. Anam et al. used the cropped face image as a feature and combined the neural network with a genetic algorithm [13]. Asteriadis et al. convolved face images with a series of filters and fed the results into a neural network for head pose estimation [14]. In this paper, we use landmark coordinates as the neural network inputs.
Fig. 1 illustrates the schematic diagram of a common face recognition system [12]. The system has two main subsystems, i.e. face detection and face recognition. During the face detection step, an image is captured by a camera. In this case, the scene mostly includes the user's upper body. An algorithm is then applied to check whether the captured image contains a face, and the face part of the image is extracted. The size of the face image is usually arbitrary and depends on factors such as the resolution of the camera and the distance between the camera and the user. Therefore, the face image needs to be normalized to a size that matches the face image database. In the second step, a feature vector is extracted from the normalized face image. Afterward, by applying a classifier and comparing against the face image database, a face class is obtained.

II. MATERIALS AND METHODS
In this paper, we add some steps between the two original steps of the above face recognition system. Since we only want frontal face images to be fed to the face recognition subsystem, we introduce pose estimation with a response to the user. After the face detection subsystem supplies the normalized face image, the pose of the face is determined. If the pose is frontal, the image is passed to the recognition subsystem. Otherwise, the system gives a response to the user to correct his or her pose until a frontal face image is obtained. The response is an instruction that depends on the pose captured by the camera and asks for the opposite of the current pose. For instance, if the user's face is turned to the left, the system tells the user to turn his or her face to the right. If the user's face is tilted to the right, the system asks the user to tilt his or her face to the left. The face poses defined in this paper are given in Table II. Examples of the five face poses are given in Fig. 3.
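The feedback rule described above can be sketched as a simple lookup from the estimated pose to the response listed in Table II. This is only an illustrative sketch: the pose keys, the `RESPONSES` dictionary and the `respond` function are hypothetical names, not part of the original system.

```python
# Responses as listed in Table II. The keys and the function name are
# illustrative assumptions, not identifiers from the original system.
RESPONSES = {
    "frontal": "Steady, please",
    "tilted_left": "Please tilt your head to the right",
    "tilted_right": "Please tilt your head to the left",
    "turned_left": "Please turn your head to the right",
    "turned_right": "Please turn your head to the left",
}


def respond(pose):
    """Return (response_text, accept).

    accept is True only for a frontal pose, i.e. only then would the image
    be passed on to the recognition subsystem.
    """
    if pose not in RESPONSES:
        raise ValueError("unknown pose: %r" % (pose,))
    return RESPONSES[pose], pose == "frontal"


if __name__ == "__main__":
    for pose in RESPONSES:
        text, accept = respond(pose)
        print(pose, "->", text, "(accept)" if accept else "(retry)")
```

In a running system the returned text would be spoken to the user, and the capture loop would repeat until `accept` is true.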

Table II. Face poses and system responses

Face pose              System response
Frontal                "Steady, please"
Tilted to the left     "Please tilt your head to the right"
Tilted to the right    "Please tilt your head to the left"
Turned to the left     "Please turn your head to the right"
Turned to the right    "Please turn your head to the left"

The proposed pose estimation subsystem first detects the face landmarks (both eyes, the nose and the mouth) in the face image and calculates their coordinates (Fig. 4). The camera captures the scene. Once the user's face has been detected, the face area of the image is cropped and the coordinates of every detected landmark are calculated. When all four landmarks are detected, the result is supplied to the neural network; otherwise, the process returns to frame capture from the camera. The landmark coordinates are normalized to the width and height of the face image, so the distance between the user and the camera has no significant effect. The x axis points to the right and the y axis points downward. The origin of the x axis is at the center of the face image, whereas the origin of the y axis is at the top of the image.
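A minimal sketch of this coordinate normalization is given below. It assumes that the landmark detector returns pixel coordinates relative to the cropped face image with the origin at the top-left corner, and that x is divided by the full face width after being centered. The exact scaling, the landmark names and the function name are assumptions made for illustration.

```python
# Assumed landmark order for the eight-element feature vector.
LANDMARK_ORDER = ("left_eye", "right_eye", "nose", "mouth")


def normalize_landmarks(landmarks, face_width, face_height):
    """Convert pixel landmarks to the eight normalized network inputs.

    landmarks: dict mapping landmark name -> (x_px, y_px), measured in the
    cropped face image with the origin at the top-left corner (assumption).
    Returns None if any landmark is missing, so the caller can go back to
    frame capture.
    """
    if any(name not in landmarks for name in LANDMARK_ORDER):
        return None
    features = []
    for name in LANDMARK_ORDER:
        x_px, y_px = landmarks[name]
        # x: zero at the horizontal center of the face image, positive to the right.
        features.append((x_px - face_width / 2.0) / face_width)
        # y: zero at the top of the face image, positive downward.
        features.append(y_px / float(face_height))
    return features


if __name__ == "__main__":
    detected = {
        "left_eye": (30, 60),
        "right_eye": (70, 60),
        "nose": (50, 100),
        "mouth": (50, 150),
    }
    print(normalize_landmarks(detected, face_width=100, face_height=200))
```

Because both coordinates are divided by the face dimensions, the same pose seen at a different distance from the camera yields approximately the same feature vector.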
The coordinates of the landmarks are then used as the inputs of the neural network-based face pose estimation (Fig. 5). The neural network thus has eight inputs, i.e. the x and y coordinates of the four face landmarks. The architecture is a feed-forward network with one hidden layer of 10 nodes and a logistic activation function. The output of the neural network is the face pose as defined in Table II, encoded as a five-bit binary code for each face pose as described in Table III.
The normalized face images from the face detection step are divided into training data and testing data. During the training phase, the system detects the face landmarks and calculates their coordinates, while the face poses are determined manually by the operator. The neural network is then trained until convergence is achieved. The weights of the trained neural network are then used in the tracking, or operational, phase, in which the face poses are calculated automatically by the neural network.
The user interface of the system is depicted in Fig. 6. It consists of the camera scene with the detected face and landmarks, the cropped and normalized face image, the coordinates of the landmarks and the face pose. The system response is given as speech.
It can be seen that the neural network recognizes the frontal pose completely. For the other poses, some incorrect recognitions occur. Some of the non-frontal poses are mistakenly identified as the frontal pose. Some turned poses are also incorrectly recognized as tilted, because some turned poses are very close to the corresponding tilted poses, i.e. turned to the left is similar to tilted to the right and vice versa. This paper takes the face yaw, pitch and roll angles into account together. The turned poses are the effect of the yaw and pitch angles, while the tilted poses are caused by the yaw and roll angles.
However, there are overlaps between the tilted and turned poses on some yaw-pitch and yaw-roll angle combinations. Hence, the result depends on the manual face pose decision during the neural network training.
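For reference, the forward pass of the network described in this section (eight inputs, one hidden layer of 10 logistic nodes, five outputs) can be sketched as follows. Several details are assumptions made for illustration: the five-bit code is taken to be one-hot with an arbitrary bit order (Table III is not reproduced here), the output units are assumed to be logistic as well, and the weights are random placeholders rather than trained values. All function names are hypothetical.

```python
import math
import random

# Assumed bit order of the five-bit output code (one bit per pose).
POSES = ("frontal", "tilted_left", "tilted_right", "turned_left", "turned_right")


def logistic(z):
    """Logistic (sigmoid) activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """Create n_out rows of n_in weights plus one trailing bias each."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(weights, inputs):
    """Apply one fully connected layer followed by the logistic activation."""
    return [
        logistic(sum(w * x for w, x in zip(row[:-1], inputs)) + row[-1])
        for row in weights
    ]


def predict(hidden, output, features):
    """Return (pose_name, five_bit_code, output_activations)."""
    if len(features) != 8:
        raise ValueError("expected 8 landmark coordinates")
    activations = layer_forward(output, layer_forward(hidden, features))
    # The most active output unit decides the pose (assumed decoding rule).
    best = max(range(len(POSES)), key=activations.__getitem__)
    code = [1 if i == best else 0 for i in range(len(POSES))]
    return POSES[best], code, activations


if __name__ == "__main__":
    rng = random.Random(0)            # untrained placeholder weights
    hidden = init_layer(8, 10, rng)   # 8 inputs -> 10 hidden nodes
    output = init_layer(10, 5, rng)   # 10 hidden nodes -> 5 output bits
    features = [-0.2, 0.3, 0.2, 0.3, 0.0, 0.5, 0.0, 0.75]
    print(predict(hidden, output, features))
```

With trained weights in place of the random ones, `predict` would return the pose whose output bit is most active. Since neighboring tilted and turned poses overlap, the two largest activations may be close for such inputs.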

IV. CONCLUSIONS
A neural network-based face pose estimation has been developed to provide responses in an interactive face recognition system. The face poses defined in this paper are frontal, tilted and turned, with the tilted and turned poses defined for both the left and the right directions. The feed-forward neural network uses eight face landmark coordinates as inputs and five face pose codes as outputs. The success rate of the face pose estimation is 95.6%.