Object Searching on Real-Time Video Using Oriented FAST and Rotated BRIEF Algorithm

Abstract: Pre-processing and feature extraction are the primary stages of object searching in video data. Processing every frame of a video is inefficient: frames that carry the same information should be passed to the next stage only once. The feature extraction algorithms most often used on video frames are SIFT and SURF. SIFT is very accurate but slow, whereas SURF is fast but less accurate. Object searching on real-time video therefore requires keyframe selection and feature extraction methods that are both fast and accurate. In this study, the video is pre-processed by extracting it into frames, and the mutual information entropy method is used for keyframe selection. Features are extracted from the keyframes using the ORB algorithm. Multiple objects in the video are detected by clustering the features. The features of each cluster are matched against the features of the query image, and the matches between keyframes and the query image are used to retrieve the frame information of the video. The experiments show that keyframe selection is beneficial in real-time video processing because keyframe selection is faster than feature extraction on each frame. Feature extraction using the ORB algorithm is about 2 times faster than with the SIFT and SURF algorithms, with values not much different from those of the SIFT algorithm. The results of this study can be developed into a security warning system for public places, in particular to help security staff provide evidence of criminal cases from videos.


I. INTRODUCTION
Surveillance through Closed-Circuit Television (CCTV) cameras is generally carried out in public places where there are valuables, such as ATMs, offices, schools, and shopping centers. Traditional CCTV surveillance is very ineffective because searching for an object requires the operator to watch the whole video [1]. When a criminal act such as theft occurs, CCTV operators still look for evidence of the stolen items manually by viewing the entire video from beginning to end. Recently, intelligent surveillance systems have begun to process video data with object searching algorithms [2], [3], [4].
The primary stages in video processing are pre-processing, which extracts the video into frames, and feature extraction from the frames. Jabnoun et al. [5] processed all video frames to detect household appliance objects. Processing all frames causes delay in real-time video processing, which must keep up with the frames the camera produces every second. Another disadvantage of processing all frames is that feature extraction on frames whose content has not changed is wasted work [6], [7]. Therefore, a keyframe selection method is needed to reduce the delay of real-time processing and to avoid redundantly processing frames that contain the same information.
Keyframe selection has been widely used by researchers in video processing. Ouyang et al. [8] studied several keyframe selection methods. The best result was obtained by the mutual information entropy method, which produced the right keyframes in traffic videos; its recall of 89.7% was the highest among the methods compared. Li et al. [9] also used the mutual information entropy method to select keyframes that follow the main content of the video.
The next stage is feature extraction on the query image and the keyframes. Feature extraction for object searching on video was studied by Jabnoun et al. [5] for household appliances. Their results showed that the SIFT algorithm achieved an accuracy of 82% with a processing time of 30 seconds, whereas the SURF algorithm achieved 18% accuracy with a processing time of 9 seconds for feature matching. SIFT is thus more accurate than SURF but slower, and SURF is faster but less accurate. This trade-off is a disadvantage for real-time video, which requires processing that is both fast and accurate.
Other feature extraction algorithms include Harris Corner [10], [11], Scale Invariant Feature Transform (SIFT) [12], [13], Speeded-Up Robust Features (SURF) [14], [15], and Oriented FAST and Rotated BRIEF (ORB) [16], [17], [18]. Toapanta et al. [19] compared SIFT, SURF, and ORB for recognizing human identity through the iris and obtained the highest accuracy, 99.6%, with ORB. Yu and Kong [20] compared Harris Corner, SURF, and ORB for stitching video frames and found ORB to be the fastest, at 0.105 seconds per processing. Rublee et al. [16] also studied image processing with 1000 features and found ORB to be faster than SIFT and SURF. ORB is also robust to image noise and rotation invariant.
In this research, we propose a method for real-time object searching on video that combines a pre-processing stage using mutual information entropy for keyframe selection with ORB for extracting features from keyframes. We also compare the speed and accuracy of the ORB algorithm with those of the SIFT and SURF algorithms.
The system starts by capturing real-time video data. The video is then extracted into frames. The system does not process all frames, because doing so creates delay in real-time processing. Instead, the mutual information entropy method selects keyframes from the frames. Features are extracted from the selected keyframes using the SIFT, SURF, or ORB algorithm. The keypoints and descriptors of each keyframe are then clustered using Mean-Shift to detect multiple objects. Features are extracted from the input query image with the same algorithm, and every cluster is matched against the keypoints and descriptors of the query image. Matching uses FLANN for SIFT and SURF, and the Brute-Force Matcher for ORB.
A homography matrix is then estimated from the matched features to see whether a polygon is formed. If a polygon is formed, the query image and the keyframe match; otherwise they do not. The architecture of the object searching system is shown in Figure 1.

A. Data Acquisition
Two kinds of data are used in this study: the query image and the real-time video. The query image contains the object to be searched for in the video and is resized to 100×100 pixels. Figure 2 shows an example of a query image. The video is taken from a CCTV camera mounted 1.5 meters above the floor. The video is extracted into frames, and keyframes are then selected from those frames.

B. Keyframe Selection
The keyframe selection process is shown in Figure 3. The first frame automatically becomes a keyframe. For each subsequent frame, the mutual information entropy between that frame and the previous keyframe is calculated using equation (1) [21],

$$I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \quad (1)$$

where $I(X;Y)$ is the mutual information entropy of frames $X$ and $Y$, $p(x,y)$ is the joint probability of gray levels $x$ and $y$ in frames $X$ and $Y$, and $p(x)$ and $p(y)$ are the probabilities of gray levels $x$ and $y$ in frame $X$ and frame $Y$, respectively. Mutual information measures the information that two frames share: the larger the mutual information of two frames, the more nearly they contain the same information [22].
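The selection rule can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in the study. It assumes that frames are flat sequences of 8-bit gray levels, that gray levels are quantised into a fixed number of histogram bins, that the natural logarithm is used, and that a frame becomes a new keyframe when its mutual information with the previous keyframe falls below the threshold. The function names and the direction of the threshold test are hypothetical; the defaults of 7 bins and a threshold of 1.1 are the best combination reported in the experiments.

```python
import math
from collections import Counter


def mutual_information(frame_a, frame_b, bins=7, levels=256):
    """Mutual information (in nats) of two equally sized gray-level frames.

    Frames are flat sequences of integer gray levels in [0, levels).
    """
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and of equal size")
    # Quantise every gray level into one of `bins` histogram bins.
    qa = [v * bins // levels for v in frame_a]
    qb = [v * bins // levels for v in frame_b]
    n = len(qa)
    joint = Counter(zip(qa, qb))   # joint histogram -> p(x, y)
    pa, pb = Counter(qa), Counter(qb)  # marginal histograms -> p(x), p(y)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((pa[a] / n) * (pb[b] / n)))
    return mi


def select_keyframes(frames, threshold=1.1, bins=7):
    """Return the indices of the selected keyframes.

    Assumption: a frame is a new keyframe when it shares little
    information with the previous keyframe (MI below the threshold).
    """
    if not frames:
        return []
    keyframes = [0]  # the first frame is always a keyframe
    for i in range(1, len(frames)):
        if mutual_information(frames[keyframes[-1]], frames[i], bins=bins) < threshold:
            keyframes.append(i)
    return keyframes
```

Note that two identical frames with no texture at all (a single gray level) have zero mutual information under this definition, so a practical system may need an additional check for flat frames.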

C. Scale Invariant Feature Transform (SIFT)
According to Lowe [12], the SIFT algorithm consists of four stages: searching for extreme values in scale-space, detecting keypoints, assigning orientation, and creating keypoint descriptors. Feature extraction using SIFT begins by constructing a scale-space (octaves) using Gaussian blur with equation (2),

$$L(x, y, \sigma) = G(x, y, \sigma) * I(x, y) \quad (2)$$

where $L(x, y, \sigma)$ is the blurred image, $G(x, y, \sigma)$ is the variable-scale Gaussian, $*$ denotes convolution, and $I(x, y)$ is the image intensity.
SIFT uses 4 octaves and 5 blur scales for each detection. The first octave is two times larger than the second octave. The difference between adjacent scales within each octave is then computed using the Difference of Gaussian (DoG) with equation (3), resulting in 4 octaves of DoG,

$$D(x, y, \sigma) = L(x, y, k\sigma) - L(x, y, \sigma) \quad (3)$$

where $D(x, y, \sigma)$ is the convolution of the image with the Difference of Gaussian filter. Keypoint determination begins by comparing a sample point with its 26 neighbors (8 in its own scale and 9 in each of the two adjacent scales). If the point has the smallest (local minimum) or largest (local maximum) value, it becomes a candidate keypoint. Candidate keypoints with low contrast or located near an edge are then filtered out. Finally, the gradient magnitude and angle are calculated for each remaining keypoint. This stage makes SIFT invariant to orientation.
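The 26-neighbor comparison can be illustrated with a short sketch. It assumes the DoG responses of one octave are stored as a nested list indexed as `dog[scale][row][column]` and that the tested point is not on the border of that volume; the function name is hypothetical.

```python
def is_scale_space_extremum(dog, s, y, x):
    """True if dog[s][y][x] is strictly larger or strictly smaller
    than all 26 neighbours in the 3x3x3 cube around it."""
    centre = dog[s][y][x]
    neighbours = [
        dog[s + ds][y + dy][x + dx]
        for ds in (-1, 0, 1)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (ds, dy, dx) != (0, 0, 0)
    ]
    return centre > max(neighbours) or centre < min(neighbours)
```

A point that passes this test is only a candidate; the contrast and edge filters described above still have to be applied.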
To create the descriptor, SIFT takes a 16×16-pixel area around the keypoint and divides it into 4×4 sub-areas with 8 orientation directions in each sub-area. Each sub-area is summarized as a histogram whose bins accumulate the gradient orientations that fall within a certain angular range. The SIFT descriptor is normalized so that its values are not affected by lighting changes. The final result is a 128-dimensional descriptor (4×4 sub-areas × 8 directions).

D. Speeded-Up Robust Features (SURF)
Feature extraction using the SURF algorithm searches for blobs, which are sets of pixels with similar intensity. Bay et al. [14] divide SURF into three stages: scale-space representation, keypoint detection, and keypoint description. The SURF algorithm begins by building an integral image using equation (4),

$$I_{\Sigma}(x, y) = \sum_{i=0}^{x} \sum_{j=0}^{y} I(i, j) \quad (4)$$

where $I_{\Sigma}(x, y)$ is the integral image value at $(x, y)$, that is, the sum of the gray levels $I(i, j)$ of all pixels between the origin and $(x, y)$. SURF builds the image pyramid without repeatedly blurring the image. Keypoints are searched for in this scale-space pyramid, which makes the resulting keypoints scale invariant. Candidate keypoints are selected using non-maxima suppression: they are the local maxima of the determinant of the Hessian matrix in equation (5), evaluated at test points $(x, y, \sigma)$ using the integral image,

$$H(x, y, \sigma) = \begin{bmatrix} L_{xx}(x, y, \sigma) & L_{xy}(x, y, \sigma) \\ L_{xy}(x, y, \sigma) & L_{yy}(x, y, \sigma) \end{bmatrix} \quad (5)$$

where $L_{xx}$, $L_{xy}$, and $L_{yy}$ are the second-order Gaussian derivative responses of the image.
The SURF algorithm checks the 26 neighboring points across adjacent scales. If the Hessian determinant at the test point is greater than that of all its neighbors, the test point is a keypoint. The last stage is the keypoint descriptor. Each keypoint must have a distinctive descriptor that is not affected by image rotation. This is achieved with the Haar wavelet responses in the $x$ and $y$ directions, $d_x$ and $d_y$. The SURF descriptor covers an area of size 20s, where s is the scale of the keypoint. The area is divided into 4×4 sub-areas. Each sub-area is described by the Haar wavelet responses at 5×5 sample points, summarized in a vector of 4 components. The resulting SURF descriptor therefore has 64 dimensions (4×4 sub-areas × 4 components).
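The benefit of the integral image is that the sum over any rectangle, which is what the box filters and Haar wavelets of SURF need, costs four lookups regardless of the rectangle size. The sketch below illustrates this under the assumption that the image is a nested list of gray levels. It stores the table with an extra zero row and column, so its indexing is shifted by one relative to equation (4); the function names are hypothetical.

```python
def integral_image(img):
    """Summed-area table with a zero border.

    ii[y][x] holds the sum of img[0..y-1][0..x-1].
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of img[top..bottom][left..right] (inclusive) from four lookups."""
    return (ii[bottom + 1][right + 1] - ii[top][right + 1]
            - ii[bottom + 1][left] + ii[top][left])
```

For example, with `img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, `box_sum(integral_image(img), 1, 1, 2, 2)` adds the lower-right 2×2 block, 5 + 6 + 8 + 9.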

E. Oriented FAST and Rotated BRIEF (ORB)
ORB consists of FAST to detect keypoints and BRIEF to create a descriptor for each keypoint [23]. ORB is free from the licensing restrictions of SIFT and SURF [16]. The ORB algorithm starts by building a scale pyramid of the image. It then uses the FAST detector to detect corners in the image. FAST tests a candidate pixel $p$ by comparing it with the 16 pixels that form a circle around it. The circle pixels are sorted into 3 classes: brighter than $p$, darker than $p$, and similar to $p$. If more than 8 pixels are darker or brighter than $p$, pixel $p$ becomes a keypoint. The FAST detections are then ranked using the Harris corner measure to find the best keypoints. Afterward, the orientation of each keypoint is found using the intensity centroid, as in equation (6),

$$C = (\bar{x}, \bar{y}) = \left( \frac{m_{10}}{m_{00}}, \frac{m_{01}}{m_{00}} \right) \quad (6)$$

where $(\bar{x}, \bar{y})$ is the centroid of the object in the image, $m_{00}$ is the zeroth-order moment (the area of the object), and $m_{10}$ and $m_{01}$ are the first-order moments.
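As a minimal sketch of the intensity centroid, the function below computes the moments of a square gray-level patch with coordinates measured from the patch centre. The orientation angle it returns, $\theta = \operatorname{atan2}(m_{01}, m_{10})$, follows the ORB formulation of Rublee et al. [16] and is not stated explicitly in the text above. The function name and the nested-list patch layout are assumptions made for illustration.

```python
import math


def intensity_centroid(patch):
    """Return (centroid_x, centroid_y, theta) of a square gray-level patch.

    Coordinates are measured from the patch centre, with x to the right
    and y downwards, so theta is the angle of the vector from the centre
    to the intensity centroid.
    """
    half = (len(patch) - 1) / 2.0
    m00 = m10 = m01 = 0.0
    for row, line in enumerate(patch):
        for col, value in enumerate(line):
            x, y = col - half, row - half
            m00 += value        # zeroth-order moment
            m10 += x * value    # first-order moment in x
            m01 += y * value    # first-order moment in y
    if m00 == 0:
        return 0.0, 0.0, 0.0    # empty patch: orientation undefined
    return m10 / m00, m01 / m00, math.atan2(m01, m10)
```

A patch whose only bright pixel lies to the right of the centre gives an angle of 0, and one whose only bright pixel lies below the centre gives an angle of π/2.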
The detected keypoints are then described using the BRIEF algorithm. Because BRIEF by itself is not rotation invariant, ORB rotates the sampling pattern according to the keypoint orientation. The next step is to compare all sampling pairs (a first pixel and a second pixel in the image patch). If the first pixel is darker than the second pixel, the test yields 1; otherwise it yields 0. This step is done with equation (7),

$$\tau(p; x, y) := \begin{cases} 1, & p(x) < p(y) \\ 0, & p(x) \geq p(y) \end{cases} \quad (7)$$

where $p(x)$ is the pixel intensity at point $x$ and $p(y)$ is the pixel intensity at point $y$. The test is repeated for 256 pairs. The 256 bits are then packed into bytes, resulting in a binary descriptor of 32 bytes.
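The binary test and the packing of 256 bits into 32 bytes can be sketched as follows. This is an illustration only: the sampling pairs are drawn at random with a fixed seed, whereas ORB uses a fixed learned pattern that is rotated by the keypoint orientation, and the rotation step is omitted here. The function names, the 31×31 patch size in the usage note, and the bit order within each byte are assumptions.

```python
import random


def brief_descriptor(patch, pairs):
    """Pack one binary intensity test per pair into bytes (8 tests per byte).

    Each pair is ((y1, x1), (y2, x2)); the bit is 1 when the first pixel
    is darker than the second, i.e. patch[y1][x1] < patch[y2][x2].
    """
    if len(pairs) % 8:
        raise ValueError("number of pairs must be a multiple of 8")
    descriptor = bytearray(len(pairs) // 8)
    for i, ((y1, x1), (y2, x2)) in enumerate(pairs):
        if patch[y1][x1] < patch[y2][x2]:
            descriptor[i // 8] |= 1 << (7 - i % 8)  # most significant bit first
    return bytes(descriptor)


def random_pairs(patch_size, count=256, seed=0):
    """Hypothetical sampling pattern: `count` random pixel pairs in the patch."""
    rng = random.Random(seed)

    def point():
        return (rng.randrange(patch_size), rng.randrange(patch_size))

    return [(point(), point()) for _ in range(count)]
```

Calling `brief_descriptor(patch, random_pairs(31))` on a 31×31 patch returns a `bytes` object of length 32, matching the descriptor size described above.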

F. Clustering Features on Keyframe
The feature extraction results of the SIFT, SURF, and ORB algorithms are shown in Figure 4, with the keypoints marked in green. The ORB algorithm produces only a few keypoints because it filters its detections with the Harris corner measure. Feature extraction using SIFT, SURF, or ORB on a keyframe generates keypoints and descriptors. The keypoints and descriptors of each keyframe are then clustered by their nearest neighboring points in order to detect multiple objects in the keyframe. The clustering algorithm used in this study is Mean-Shift. Mean-Shift is a centroid-based algorithm that continuously updates centroid candidates to be the mean of the points within a window [25]. The candidates are then filtered to eliminate near-duplicate centroids. A candidate centroid $x_i$ in iteration $t$ is updated with the following equations,

$$x_i^{t+1} = m(x_i^t) \quad (8)$$

$$m(x_i) = \frac{\sum_{x_j \in N(x_i)} K(x_j - x_i)\, x_j}{\sum_{x_j \in N(x_i)} K(x_j - x_i)} \quad (9)$$

where $N(x_i)$ is the neighborhood of samples within a given distance around $x_i$ and $K$ is the kernel function.
The Mean-Shift algorithm automatically determines the number of clusters. This is based on the bandwidth parameters that determine the size of the region to search through.
The clustering process using the Mean-Shift algorithm begins by taking the keypoints from feature extraction as initial cluster centers, as shown in Figure 5a. The window size (kernel bandwidth) is determined automatically through the estimate bandwidth function. As its name implies, the algorithm calculates the mean of all points in the window around each cluster center, based on the nearest neighbors, as shown in Figure 5b. It then shifts each center toward a denser area by replacing it with the mean of its neighboring points using equation (9), as shown in Figure 5c. The algorithm stops when the cluster centers no longer shift, with the final result shown in Figure 5d. Figure 6 shows an example of the Mean-Shift algorithm applied to a video keyframe, producing 4 clusters.
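The procedure can be sketched with a flat kernel, where every point inside the window has the same weight, so the update reduces to a plain mean. This is a simplified illustration rather than the implementation used in the study: the bandwidth is passed in by hand instead of being estimated automatically, only keypoint coordinates are clustered, and converged centres closer than one bandwidth are merged. The function name and the merging rule are assumptions.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel Mean-Shift on a list of coordinate tuples.

    Returns (centres, labels), where labels[i] is the index of the
    centre that points[i] converged to.
    """
    modes = []
    for start in points:
        centre = start  # every keypoint starts as a candidate centre
        for _ in range(max_iter):
            near = [p for p in points if math.dist(p, centre) <= bandwidth]
            if not near:
                break
            # Mean of all points inside the window around the centre.
            new = tuple(sum(axis) / len(near) for axis in zip(*near))
            shift = math.dist(new, centre)
            centre = new
            if shift < tol:  # the centre no longer moves
                break
        modes.append(centre)

    # Merge near-duplicate centres and assign a label to every point.
    centres, labels = [], []
    for mode in modes:
        for idx, existing in enumerate(centres):
            if math.dist(mode, existing) <= bandwidth:
                labels.append(idx)
                break
        else:
            centres.append(mode)
            labels.append(len(centres) - 1)
    return centres, labels
```

With two well-separated groups of keypoints, for example four points around (0.5, 0.5) and four around (10.5, 10.5) with a bandwidth of 3, the function returns two centres, and the number of clusters follows from the bandwidth rather than being given in advance.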

G. Matching Feature Query Image with Keyframe
In the feature matching stage, the SIFT and SURF algorithms use the FLANN method, while the ORB algorithm uses BF-Matcher. A match is considered only if there are at least 4 good keypoint matches. If the number of good matches is greater than or equal to four, a homography matrix [26,27] between the two images is searched for. The homography describes the geometric transformation between the images, such as translation, rotation, scaling, and shear. The next step is to check whether the homography matrix was formed. If not, the process stops, which indicates that the images do not match. The homography matrix is used to find the corners of the query image in the keyframe. If the connected corners form a polygon, the query image and the keyframe are declared to match; otherwise they are declared not to match.
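The final check can be sketched as follows, assuming a 3×3 homography matrix has already been estimated from at least four good matches. The text does not specify how the polygon test is carried out; the sketch assumes that "a polygon is formed" means the four projected corners of the 100×100 query image form a convex, non-degenerate quadrilateral. The function names are hypothetical.

```python
def project(h, point):
    """Apply a 3x3 homography (nested lists) to a 2-D point."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        return None  # point is mapped to infinity
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def forms_polygon(h, width=100, height=100):
    """True if the projected query-image corners form a convex quadrilateral."""
    corners = [(0, 0), (width, 0), (width, height), (0, height)]
    projected = [project(h, c) for c in corners]
    if any(p is None for p in projected):
        return False
    crosses = []
    for i in range(4):
        ax, ay = projected[i]
        bx, by = projected[(i + 1) % 4]
        cx, cy = projected[(i + 2) % 4]
        # z-component of the cross product of consecutive edges
        crosses.append((bx - ax) * (cy - by) - (by - ay) * (cx - bx))
    # Convex when every turn has the same non-zero direction.
    return all(c > 0 for c in crosses) or all(c < 0 for c in crosses)
```

A pure scaling and translation such as `[[2, 0, 30], [0, 2, 40], [0, 0, 1]]` passes the test, while a matrix that collapses the corners onto a line or folds them into a self-intersecting shape is rejected.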

1) FLANN Method: The Fast Library for Approximate Nearest Neighbors (FLANN) method is used to find the nearest neighbor of each descriptor [28,29]. The SIFT algorithm produces a 128-dimensional descriptor for each keypoint, while the SURF algorithm produces a 64-dimensional descriptor. A method such as FLANN is therefore needed to match this multidimensional data. FLANN uses the K-Dimensional Tree (KD-Tree) index type. A KD-Tree is a multidimensional binary tree that partitions the data space based on the coordinate values of the data. An illustration of the FLANN method using the KD-Tree algorithm is shown in Figure 7.
Fig. 7 The illustration of FLANN method with KD-Tree algorithm
The illustration uses 2-dimensional descriptors. Suppose there are 7 keypoints with descriptors {(5,7), (3,4), (9,2), (1,2), (2,7), (6,1), (7,8)}. Descriptor (5,7) becomes the root, and each subsequent descriptor is placed in the left or right subtree depending on the split, as shown in Figure 7a. The resulting tree is then drawn in coordinate space to show the partitions. The red lines in Figure 7b bound the regions formed. Then, suppose the asterisk in Figure 7b is a descriptor of the query image. It turns out that point A is not the nearest neighbor of the query descriptor. The nearest neighbor is measured using Euclidean distance. The search continues and finds the nearest neighbor at point B, as shown in Figure 7c. After the nearest neighbor between the query image and the keyframe is found, points A, G, F, and C are marked as not nearest neighbors, as shown in Figure 7d and Figure 7e.
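The worked example can be reproduced with a small KD-Tree sketch. It assumes that descriptors are inserted in the order listed, so that (5,7) becomes the root, and that the split axis alternates between the first and second coordinate at each level. The search shown is exact, with backtracking into the far subtree when needed, whereas FLANN performs an approximate search; the class and function names are hypothetical.

```python
import math


class Node:
    def __init__(self, point):
        self.point = point
        self.left = None
        self.right = None


def insert(root, point, depth=0):
    """Insert a point, alternating the split axis with the depth."""
    if root is None:
        return Node(point)
    axis = depth % len(point)
    if point[axis] < root.point[axis]:
        root.left = insert(root.left, point, depth + 1)
    else:
        root.right = insert(root.right, point, depth + 1)
    return root


def nearest(root, query, depth=0, best=None):
    """Exact nearest neighbour of `query` by Euclidean distance."""
    if root is None:
        return best
    if best is None or math.dist(query, root.point) < math.dist(query, best):
        best = root.point
    axis = depth % len(query)
    diff = query[axis] - root.point[axis]
    near, far = (root.left, root.right) if diff < 0 else (root.right, root.left)
    best = nearest(near, query, depth + 1, best)
    # Visit the other side only if the splitting line is closer than the best match.
    if abs(diff) < math.dist(query, best):
        best = nearest(far, query, depth + 1, best)
    return best


descriptors = [(5, 7), (3, 4), (9, 2), (1, 2), (2, 7), (6, 1), (7, 8)]
tree = None
for d in descriptors:
    tree = insert(tree, d)
```

For a hypothetical query descriptor (6, 2), `nearest(tree, (6, 2))` returns (6, 1), the same answer a brute-force comparison against all seven descriptors gives, but most subtrees are pruned without being visited.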

2) BF-Matcher Method: The ORB algorithm produces keypoints and binary descriptors for the query image and the keyframe. BF-Matcher compares each descriptor in the query image with all descriptors in the keyframe to find the smallest distance [30]. ORB generates a 32-byte descriptor for each keypoint. Table 1 shows an example of descriptors in the query image, and Table 2 shows an example of descriptors in a keyframe. The Hamming distance measures the difference between two descriptors: the smaller the distance, the better the descriptors of the query image and the keyframe match. For example, using the descriptors in Table 1 and Table 2, coordinates (1,6) with (9,8) have a distance of 5, coordinates (10,6) with (9,13) a distance of 1, coordinates (1,13) with (3,8) a distance of 1, and coordinates (10,13) with (3,13) a distance of 0. The smallest and second smallest distances are then used to search for good matches.
Fig. 8 The example of matching feature between query image and keyframe
An example of feature matching between the query image and a keyframe is shown in Figure 8. Each cluster on the keyframe is matched with the query image. The bag object in the query image matches one of the clusters on the keyframe, namely the bag object. Using the homography matrix, the keypoint matches are used to map the corners of the query image onto the keyframe, so the polygon on the video keyframe appears at the corners of the object, following the query image. A red mark on the keyframe indicates that a polygon is formed, so the query image matches that keyframe.
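Brute-force matching with the Hamming distance can be sketched as follows. Descriptors are assumed to be `bytes` objects of equal length. The good-match rule shown is a ratio test between the smallest and second smallest distances; the ratio of 0.75 is an assumed value for illustration, not one reported in this study, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def good_matches(query_desc, frame_desc, ratio=0.75):
    """Brute-force 2-nearest-neighbour matching followed by a ratio test.

    Returns a list of (query_index, frame_index, distance) tuples.
    """
    matches = []
    if len(frame_desc) < 2:
        return matches  # the ratio test needs a second-best candidate
    for qi, q in enumerate(query_desc):
        # Compare the query descriptor with every keyframe descriptor.
        ranked = sorted((hamming(q, f), fi) for fi, f in enumerate(frame_desc))
        (best, best_idx), (second, _) = ranked[0], ranked[1]
        if best < ratio * second:  # clearly better than the runner-up
            matches.append((qi, best_idx, best))
    return matches
```

A query descriptor whose two closest keyframe descriptors are equally distant is discarded as ambiguous, which is why the second smallest distance is needed in addition to the smallest one.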

A. Data Acquisition
The video is recorded with a HiLook Wi-Fi PT CCTV camera, model IPC-P120-D/W. The camera resolution is 2.0 MP at 10 frames per second (fps). The duration of the video is 60 seconds. This study uses 4 types of bags as objects with different patterns: a textured bag (Batik), a lettering patterned bag, a black and white patterned bag, and a color patterned bag.

B. The Experiment of Keyframe Selection
The keyframe selection experiment evaluates the suitability of the generated keyframes for combinations of the number of bins and the threshold value of the mutual information entropy method. The experiment uses 607 frames. Table 3 shows the results. Suitability is judged by the number of keyframes produced for a recording with no objects, which should produce only one keyframe. In the experiment, keyframe selection with 7 bins and a threshold of 1.1 gives better results than the other combinations. Keyframe selection takes 6 ms per frame, which is much faster than feature extraction: based on Table 4, the fastest algorithm needs 0.719 seconds to match features on one frame. Therefore, keyframe selection can reduce delay in real-time video processing.
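The timing argument can be made concrete with a short calculation based on the figures reported in this section (10 fps, 6 ms per frame for keyframe selection, 0.719 seconds per frame for the fastest feature matching). The function name and dictionary keys are hypothetical.

```python
def realtime_budget(fps, selection_s, matching_s):
    """Compare per-frame processing times with the camera frame interval."""
    interval = 1.0 / fps
    return {
        "frame_interval_s": interval,
        # Frames the camera delivers while one frame is being matched.
        "frames_per_match": matching_s / interval,
        # Whether keyframe selection alone keeps up with the camera.
        "selection_fits": selection_s <= interval,
        "speedup": matching_s / selection_s,
    }


budget = realtime_budget(fps=10, selection_s=0.006, matching_s=0.719)
```

At 10 fps a new frame arrives every 100 ms, so roughly seven frames arrive while a single frame is being matched, whereas keyframe selection uses only 6 ms of each 100 ms interval and is about 120 times cheaper than matching.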

C. The Experiment of Recording Resolution
The video resolution needs to be tested to determine the best resolution for F1 value and speed in real-time processing. The resolution experiment is performed on Full HD (1920×1080), HD (1280×720), and VGA (640×480). The black and white patterned bag is used to find the resolution that is optimal in both speed and F1 value. In Figure 9, the best F1 value is obtained at Full HD resolution, but based on Table 4, processing one keyframe at Full HD takes an average of 3.488 seconds with the SIFT algorithm, 3.126 seconds with the SURF algorithm, and 1.286 seconds with the ORB algorithm. A processing time above 1 second is very slow for real-time processing. Table 4 shows that the F1 value at HD resolution is quite good and that processing is faster than at Full HD. At all resolutions, the F1 value of the ORB algorithm is better than that of the SURF algorithm, and the ORB algorithm is faster than the SIFT and SURF algorithms. Figure 10 shows that the F1 value at VGA resolution is poor. As shown in Figure 10h and 10i, the object is not detected by the SURF and ORB algorithms at VGA resolution. The resolution affects the number of features obtained: a smaller resolution yields fewer features, but a resolution as large as Full HD increases the processing time.

D. The Experiment of Object Distance
In this experiment, the distance of the object from the CCTV camera is varied: 1 meter, 2 meters, and 3 meters. The objects used are the textured bag (Batik), the lettering patterned bag, the black and white patterned bag, and the color patterned bag. Figure 11 shows the resulting F1 values.
As the average results in Figure 11 show, the F1 value of the SIFT, SURF, and ORB algorithms decreases at a distance of 3 meters. At 1 meter, where the object is too close to the camera, the F1 value is also not optimal for the SURF and ORB algorithms, because part of the object is not recorded and fewer features are obtained from it. This experiment shows that all three algorithms are scale invariant, meaning that the bag object can still be detected at varying distances from the camera. The most scale-invariant algorithm is SIFT, which still detects all types of bags at 3 meters.
Fig. 11 The experiment result of effect on object distance toward F1 value
Based on Figure 12, a distance of 3 meters produces many false negatives because the object in the video becomes smaller and its features appear blurry, which reduces the number of features captured. Figures 12b, 12c, 12h, and 12i are examples of false negatives. A false negative occurs when fewer than 4 features match, so a homography matrix cannot be formed.

E. Discussion
The SIFT algorithm performs well because it has a 128-dimensional descriptor and matches features by distance, namely Euclidean distance. This costs processing time because matching uses 128 dimensions for each feature. The SURF algorithm uses blob features, so the resulting features are of lower quality, which lowers its F1 value. SURF is slightly faster than SIFT because its descriptor has 64 dimensions. The ORB algorithm produces a 32-byte descriptor and matches features with binary matching, namely Hamming distance, so it is much faster than the SIFT and SURF algorithms. ORB uses corner features, and this selection of distinctive features gives an F1 value that is not much different from that of the SIFT algorithm.
Based on the experimental results, the ORB algorithm is faster than the SIFT and SURF algorithms. In a previous study [5], accuracy and speed in processing a video frame with the SIFT and SURF algorithms were inversely related. In this study, besides being the fastest at processing a frame, the ORB algorithm achieves an F1 value of 0.81, which is not much different from the 0.97 of the SIFT algorithm and better than the 0.73 of the SURF algorithm. These results show that, for object searching in real-time video, the ORB algorithm has the shortest processing time and an F1 value not much different from that of the SIFT algorithm.
Fig. 12 The result of matching feature on 1 meter distance: a) SIFT b) SURF c) ORB, 2 meters distance: d) SIFT e) SURF f) ORB, 3 meters distance: g) SIFT h) SURF i) ORB

IV. CONCLUSION
This study provides an overview of fast and accurate algorithms for object searching in real-time video data. The system helps CCTV operators or security forces find evidence of theft in video. Through video analysis, the parties concerned can report in which frames or at which seconds the goods in question were recorded. Three local feature algorithms, SIFT, SURF, and ORB, were tested for processing real-time video data. We also added keyframe selection because many video frames carry the same information.
The mutual information entropy method can be used to select keyframes, which reduces delay in real-time processing because not all frames are processed. The ORB algorithm can be applied as the feature extractor for object searching on video: its processing time is the shortest compared with the SIFT and SURF algorithms, with an F1 value that is not much different. However, the proposed method is still slow because multiple-object detection uses the Mean-Shift method, which is computationally relatively expensive.
In future work, we will combine the proposed method with a faster clustering method.