CA3143843A1 - Systems and methods for face and object tracking and monitoring - Google Patents

Systems and methods for face and object tracking and monitoring

Info

Publication number
CA3143843A1
Authority
CA
Canada
Prior art keywords
processor
images
person
identifying
location
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CA3143843A
Other languages
French (fr)
Inventor
Danish Ahmed Ansari
Murugan Krishnasamy
Rajneesh Kant Saxena
Rajeev YADAV
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Cybersmart Technologies Inc
Original Assignee
Cybersmart Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cybersmart Technologies Inc filed Critical Cybersmart Technologies Inc
Priority to CA3143843A priority Critical patent/CA3143843A1/en
Publication of CA3143843A1 publication Critical patent/CA3143843A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 Classification, e.g. identification
    • G06V40/173 Classification, e.g. identification; face re-identification, e.g. recognising unknown faces across different face tracks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/40 Spoof detection, e.g. liveness detection
    • G06V40/45 Detection of the body part being alive
    • G PHYSICS
    • G07 CHECKING-DEVICES
    • G07C TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C9/00 Individual registration on entry or exit
    • G07C9/30 Individual registration on entry or exit not involving the use of a pass
    • G07C9/32 Individual registration on entry or exit not involving the use of a pass in combination with an identity check
    • G07C9/37 Individual registration on entry or exit not involving the use of a pass in combination with an identity check using biometric data, e.g. fingerprints, iris scans or voice recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

Methods, devices, and systems for identifying an object, such as a person, and tracking an object. In an example, a method for identifying an object includes: receiving, at a processor, two or more images of a first object from different directions; detecting, at the processor using a convolutional neural network (CNN), features of the first object in the two or more images; comparing, at the processor, the features with a second set of features extracted from two or more existing images of a second object; in response to the features matching the second set of features of the two or more existing images to a predetermined threshold, identifying, by the processor, the first object to be the same as the second object. The processor can initiate a security event when the first object is the same object as the second object.

Description

SYSTEMS AND METHODS FOR FACE AND OBJECT TRACKING AND
MONITORING
TECHNICAL FIELD
[0001] Example embodiments relate to facial recognition, tracking, and monitoring for security applications.
BACKGROUND
[0002] Traditional facial recognition (FR) technology is built on "face-forward photos".
Lighting, angle, and obstructions such as glasses, hair, or masks can cause inaccuracies.
[0003] Existing conventional FR Artificial Intelligence (AI) is difficult to use for person tracking, and typically requires identification badges to store the credentials for comparing to the person being tracked. Person tracking can be difficult due to the great volume of identifications to be verified, and the superior computing capability required for training AI models. As well, existing face tracking systems do not work well to track objects at a large scale. Finally, there is a need to control errors based on organizational standards or security considerations.
SUMMARY
[0004] Example embodiments include methods, devices, and systems for identifying an object, such as a person, and tracking the person for security purposes. In an example, a method for identifying the object includes: receiving, at a processor, two or more images of a first object from different directions; detecting, at the processor using a convolutional neural network (CNN), features of the first object in the two or more images; comparing, at the processor, the features with a second set of features extracted from two or more existing images of a second object; in response to the features matching the second set of features of the two or more existing images to a predetermined threshold, identifying, by the processor, the first object to be the same as the second object.
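The following Python sketch illustrates this identification flow under stated assumptions: the `extract_features` callable stands in for the CNN feature detector, cosine similarity is used as one possible comparison measure, and the 0.8 threshold is illustrative rather than a value from this disclosure.

```python
# Minimal sketch of the identification flow in paragraph [0004].
# extract_features() is a placeholder for the CNN feature detector; the
# similarity measure and the 0.8 threshold are illustrative assumptions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(first_images, existing_images, extract_features, threshold=0.8):
    """Return True if the first object matches the second object."""
    # Detect features of the first object in two or more images.
    first = [extract_features(img) for img in first_images]
    # Features extracted from existing images of the second object.
    second = [extract_features(img) for img in existing_images]
    # Compare every pair of views and average the similarity.
    scores = [cosine_similarity(f, s) for f in first for s in second]
    return float(np.mean(scores)) >= threshold
```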

[0005] In another example, a method for tracking an object includes receiving, at a processor, two or more images of an object taken from different directions;
receiving, at the processor, a location associated with at least one of the two or more images; identifying, at the processor using a convolutional neural network (CNN), the object in the two or more images;
in response to determining that the object is included in stored images, generating, at the processor, locations of the object based on the location associated with at least one of the two or more images and locations associated with the stored images.
[0006] In another example, a method for tracking an object includes transmitting, at a camera device, two or more images of an object taken from different directions; transmitting, by the camera device, a location associated with at least one of the two or more images;
identifying, at a display device using a convolutional neural network (CNN), the object in the two or more images; in response to determining that the object is included in stored images, generating, at the display device, locations of the object based on the location associated with at least one of the two or more images and locations associated with the stored images.
[0007] In another example, a non-transitory computer-readable medium includes instructions which, when executed by at least one processor, cause the at least one processor to perform the method as claimed in any one of the preceding examples.
BRIEF DESCRIPTION OF THE DRAWINGS
[0008] Reference will now be made, by way of example, to the accompanying drawings which show example embodiments, and in which:
[0009] Figure 1 is a schematic structural diagram of an object tracking system, in accordance with an example embodiment;
[0010] Figure 2 is a schematic diagram of a hardware structure of a camera device of the system in Figure 1;
[0011] Figure 3 is a schematic diagram of a hardware structure of a display device of the system in Figure 1;

[0012] Figure 4A is a diagram illustrating object tracking by the system of Figure 1, in accordance with an example embodiment;
[0013] Figure 4B is a diagram illustrating example images captured by the camera device in Figure 4A, in accordance with an example embodiment;
[0014] Figure 5 is a diagram illustrating steps of a facial recognition method, in accordance with an example embodiment;
[0015] Figure 6 is a diagram illustrating example steps of liveness recognition in Figure 5;
[0016] Figure 7 is a flow chart showing an object tracking method of the system in Figure 1; and
[0017] Figure 8 is a block diagram of a convolutional neural network (CNN) model for use in identifying and tracking an object by the system of Figure 1, in accordance with an example embodiment.
[0018] Similar reference numerals may have been used in different figures to denote similar components.
DETAILED DESCRIPTION
[0019] Figure 1 illustrates a block diagram of an example facial recognition, object identifying, and object tracking system 100, in accordance with an example embodiment.
Objects include persons (humans) and non-human objects. The system 100 can be used to identify a person using facial recognition, identify an object, and track an object, including the movement of people.

[0020] In the example of Figure 1, the system 100 can include: one or more camera devices 104 and one or more display devices 106. The camera device 104 can be used to capture images 102 of an object of interest. The object can be a person or a physical object, such as wearable objects including glasses, hats, masks, etc. The camera device 104 can also be used to identify and track an object. The facial recognition can be used to identify a person. The object identifying can be used to identify an object, such as facial recognition of a person, and object tracking can be used to track an object.
[0021] The camera device 104 can include rules-based models to identify an object, including to perform the facial recognition, and/or track an object. The camera device 104 can also include machine learning models, which can include one or more neural networks (NNs) such as convolutional neural networks (CNNs). The camera device 104 can be a security camera.
The display devices 106 can be configured to display the objects and coordinates of the objects to a user. The display device 106 can be a security monitoring terminal or a security monitoring mobile device.
[0022] In examples, the camera device 104 and the display device 106 can communicate over communication links 108 and communication sessions. The communication links 108 can be wireless or wired. In an example, each of the communication links can include a WebSocket protocol to provide continuous two-way communication.
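As a hedged illustration of such a continuous two-way link, the sketch below uses the third-party Python `websockets` package; the URL, port, and JSON message shape are assumptions for illustration only and are not specified by this disclosure.

```python
# Sketch of a continuous two-way link between a camera device and a display
# device over WebSocket, per paragraph [0022]. The endpoint, port, and message
# format are illustrative assumptions.
import asyncio
import json
import websockets

async def camera_client(url="ws://display-device.local:8765"):
    async with websockets.connect(url) as ws:
        # Send an observation; the display device can push commands back.
        await ws.send(json.dumps({"object_label": "person-42",
                                  "location": [45.42, -75.69]}))
        reply = await ws.recv()
        print("display device said:", reply)

# asyncio.run(camera_client())
```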
[0023] Figure 2 illustrates a block diagram of the camera device 104, in accordance with an example embodiment. The camera device 104 can be an electronic device or user equipment.
The camera device 104 can be a camera or a video camera. The camera device 104 can also be a mobile camera device 104. The camera device 104 can be operated by a user or a robot. The camera device 104 can be a security camera, which can be in a fixed location or can be mobile.
The security camera may be in a fixed location and controllable with respect to pan, tilt, and zoom (also known as PTZ).

[0024] The camera device 104 includes one or more cameras 522, which can be used to capture images of the objects from one or more directions. The example camera device 104 includes at least one memory 502, at least one processor 504, and at least one communications interface 506. The camera device 104 can include input or output (I/O) interface devices 508, including but not limited to a touch screen, display screen, keyboard, microphone, speaker, mouse, gesture feedback devices (through the camera 522) and/or a haptic feedback device. In some examples, the memory 502 can access the object database 110 and the map database 112 from cloud storage.
[0025] In the example of Figure 2, the camera device 104 includes sensors 520 which are used to detect information from the environment of the camera device 104. In an example, the sensors 520 can be used to determine a location and an orientation (e.g., pitch, roll, yaw) of the camera device 104. The sensors 520 can include: global positioning system (GPS), local positioning system (LPS), range detector or scanner such as LiDAR to determine the camera distance to objects or points of the objects, barometric pressure sensor to determine a height of the camera device 104, compass to determine orientation of the camera device 104 in relation to North, and/or accelerometers to determine orientation of the camera device 104. The GPS and/or the LPS can be used to generate the location of the camera device 104. The range detector can be used to determine a distance between the camera device 104 and the object being captured by the camera 522.
[0026] The range detector such as LiDAR can be used by the camera device 104 to determine the camera distance to objects or points of the objects, for example, the distance between the closest point of the object and the camera device 104.
[0027] In some examples, the distance between the camera device 104 and the object can be generated using photogrammetry. In some examples, Google (TM) ARCore can be used to generate the distance between the camera device 104 and the object. In some examples, a combination of photogrammetry and at least one of the sensors 520 can be used by the positioning module 518 to determine the distance.
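One common photogrammetric approximation, shown below as a minimal sketch, is the pinhole-camera relation distance = focal length (in pixels) x real object height / object height in pixels. This is a generic illustration, not the specific method used by the positioning module 518, and the calibration values in the example are assumptions.

```python
# Pinhole-camera approximation of camera-to-object distance (illustrative
# sketch only; focal length and known object height are assumptions).
def distance_to_object(focal_length_px: float,
                       real_height_m: float,
                       pixel_height: float) -> float:
    return focal_length_px * real_height_m / pixel_height

# e.g. a 1.7 m tall person imaged 400 px tall with a 1000 px focal length
# is roughly 4.25 m away.
print(distance_to_object(1000.0, 1.7, 400.0))
```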

[0028] The positioning module 518 can generate, using the sensor information: i) camera location, ii) camera orientation, and/or iii) camera distance to object. In some examples, the positioning module 518 uses data from the sensors 520. In some examples, the positioning module 518 uses data from the GPS and/or the LPS. In some examples, the object is tracked and presumed to be at the same location and optionally the same orientation as the camera device 104, e.g., when the user is holding the camera device 104.
[0029] In an example, the positioning module 518 may also include ARCore. ARCore includes a mobile augmented reality library that can be used for camera orientation estimation, which is readily available on most Android (TM) devices or smartphones. ARCore is a library by Google (TM), which uses the data from the inertial measurement unit (IMU) sensors (e.g., accelerometer, magnetometer, and gyroscope), along with image feature points, for tracking the camera orientation of the camera device 104 utilizing a Simultaneous Localization and Mapping (SLAM) algorithm. ARCore can perform camera orientation estimation in real-time. In that regard, to track the motion of the camera device 104, an Android application (i.e. the positioning module 518) using ARCore can be developed in the Unity3D environment, the Unreal environment, or other interactive 3D environments, for capturing RGB images along with the real world location of the camera device 104. The positioning module 518 can generate or determine the location and the camera orientation of the camera device 104 for each image 102.
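As a data-structure sketch only, the per-image record that the positioning module 518 could attach to each captured frame (location plus pitch/roll/yaw and optional range) might look like the following; the field names are illustrative assumptions, not the data model of this disclosure.

```python
# Illustrative per-image pose record (assumed field names).
from dataclasses import dataclass
from typing import Optional

@dataclass
class CameraPose:
    latitude: float
    longitude: float
    altitude_m: float
    pitch_deg: float
    roll_deg: float
    yaw_deg: float

@dataclass
class TaggedImage:
    image_path: str
    pose: CameraPose
    distance_to_object_m: Optional[float] = None  # e.g. from LiDAR/photogrammetry
```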
[0030] In example embodiments, the memory 502 can store modules for execution by the processor 504, including: image 2D or 3D object detection module 510, positioning module 518, and anchor point generator 524. The modules can include software stored in the memory 502, hardware, or a combination of software and hardware. In some examples, the modules of the camera device 104 include machine learning models, which can include NNs such as CNNs. For example, the image 2D or 3D object detection module 510 can include an image 2D or 3D object detector model which includes a CNN. In some examples, one or more of the modules are executed by other devices, such as a cloud server.

[0031] The anchor point generator 524 is used to generate anchor points of the feature points of the object including a person, using the location and orientation of the camera device 104. For example, the anchor points are generated for identifying and/or tracking the object. In an example, the anchor points can be generated by the camera device 104 using ARAnchorManager from AR Foundation. In some examples, each anchor point of the object is individually trackable. In examples, movements of the object, or a part of the object, can be tracked using the anchor points.
[0032] Figure 3 illustrates a block diagram of the display device 106, in accordance with an example embodiment. The display device 106 can be an electronic device or user equipment.
The display device 106 can be a desktop, a laptop, a set top box, or a mobile communication device such as a smart phone or a tablet. The display device 106 can be the same as or different from the camera device 104 (e.g., for AR purposes). The user of the display device 106 can be the same as or different from the user of the camera device 104.
[0033] The example display device 106 in Figure 3 includes at least one memory 402, at least one processor 404, at least one communications interface 406, and I/O
interface devices 408. The memory 402, the processor 404, and the communications interface 406 can be similar to those described for the camera device 104 of Figure 2. The memory 402 can store a 2D/3D display module 410 for execution by the processor 404. The modules (e.g. 2D/3D display module 410) of the display device 106 can include software stored in the memory 402, hardware, or a combination of software and hardware. The display device 106 includes a display 412, which can be a 360-degree display. The I/O interface devices 408 can include but are not limited to a touch screen, keyboard, camera, microphone, speaker, mouse, gesture feedback device (through the camera or accelerometers) and/or haptic feedback device.
[0034] The 2D/3D display module 410 can receive, from a third party mapping service, a 2D or 3D map for display on the display 412. The 2D/3D display module 410 can display movement of an object based on the map data or real time coordinates of the object. In some examples, the 2D/3D display module 410 is executed by a particular platform, such as a 3D video platform, a mobile platform, a streaming platform, a web platform, a gaming platform, application plug-ins, etc. The display device 106 can include input/output (I/O) interface devices 408 for interacting with the user. In an example embodiment, the display 412 is a computer monitor.
[0035] In examples, the system 100 is configured to identify an object or a person. In the example of Figure 4A, the system 100 is configured to identify an object 202, in accordance with an example embodiment. Examples will be described with relation to one object 202, such as a person as shown in Figure 4A. The object may be any physical object, such as a car, a chair, an animal, or a plant, etc.
[0036] In Figure 4A, the camera device 104 can be operated by a user or machine that takes images 102 of the object 202. The camera device 104 can take one or more images 102 of the object 202. In some examples, the camera device 104 captures a video of the object 202, therefore generating one or more images 102.
[0037] In some examples, the system 100 is configured to identify a person based on the image 102 using facial recognition. In some examples, the camera device 104 is configured to perform facial recognition of a human face, which is three-dimensional and changes in appearance with lighting and facial expression. The camera device 104 or display device 106 is configured to detect a face and to segment the face from the image background, and to align the segmented face image to account for face pose, image size, and photographic properties, such as illumination and grayscale, to enable the accurate localization of facial features.
[0038] In some examples, the camera device 104 or display device 106 may extract the facial features, in which features such as eyes, nose and mouth are pinpointed and measured in the image to represent the face. The camera device 104 may then match the established feature vector of the face against a database of faces on images, photos or photo IDs stored in a database, such as object database 110 or in memory 402.

[0039] The facial feature points are features detected in the image 102 by the camera device 104 or display device 106, represented by the circles 203 in Figure 4A.
Facial feature points, also known as feature edge points, Kanade–Lucas–Tomasi (KLT) corners, or Harris corners, are identified visual features of particular edges detected from the image 102. In an example, Google ARCore can be used to generate the facial feature points.
[0040] The extracted facial features can be used to search for or compare with stored images of people. The camera device 104 or the display device 106 can be configured to perform the comparison using eigenfaces, linear discriminant analysis and elastic bunch graph matching using the Fisherface algorithm, the hidden Markov model, the multilinear subspace learning using tensor representation, and/or the neuronal motivated dynamic link matching. In some examples, the comparison is performed using NNs or CNNs.
[0041] If the camera device 104 or the display device 106 determines that the extracted facial features in the image 102 match the facial features of a stored image to a predetermined threshold, the camera device 104 or display device 106 is configured to consider the identity of the person associated with the stored image to be the identity of the person in the image 102. For example, the camera device 104 or display device 106 can also generate an object score which represents the probability or confidence score of the identity of the person. The camera device 104 or display device 106 may also generate an object label in the system 100 to uniquely identify the person or object 202 in the image 102.
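A minimal sketch of this matching step follows; the gallery layout, the cosine-similarity measure, the uuid-based object label, and the 0.75 threshold are all illustrative assumptions rather than the claimed implementation.

```python
# Sketch of the matching step in paragraph [0041]: compare an extracted face
# feature vector against stored images and emit an identity, an object score
# (confidence), and an object label. All names and thresholds are assumptions.
import uuid
import numpy as np

def match_face(query: np.ndarray, gallery: dict, threshold: float = 0.75):
    """gallery maps identity -> list of stored feature vectors."""
    best_identity, best_score = None, 0.0
    for identity, vectors in gallery.items():
        for v in vectors:
            score = float(np.dot(query, v) /
                          (np.linalg.norm(query) * np.linalg.norm(v)))
            if score > best_score:
                best_identity, best_score = identity, score
    if best_score >= threshold:
        return {"identity": best_identity,
                "object_score": best_score,              # probability/confidence
                "object_label": f"person-{uuid.uuid4().hex[:8]}"}
    return None  # no match to the predetermined threshold
```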
[0042] In some examples, the system 100 may also perform 3D face recognition using cameras 522 to capture two or more images about the shape of a face from different directions.
The images are used to identify distinctive features on the surface of a face, such as the contour of the eye sockets, nose, and chin. 3D face recognition is not affected by changes in lighting like other techniques. It can also identify a face from a range of viewing angles, including a profile view. 3D data points from a face improve the precision of face recognition. In an example, 3D
images of faces may be captured by three cameras 522 that point at different angles; one camera will be pointing at the front of the subject, a second one to the side, and a third one at an angle. All these cameras 522 work together to track a subject's face in real time and to detect and recognize the face of a person.
[0043] By using 3D facial recognition or liveness recognition, the system 100 has an expanded accuracy range for angles, lighting conditions, and obstructive objects. For example, the system 100 may retain 90% accuracy at up to 45 degrees lateral, 30 degrees vertical, and with partial obstruction. With the facial recognition capacity, the system 100 is able to perform mass identification as it does not require the cooperation of the test subject to work. The system 100 is accurate for identifying and tracking objects or people of interest in controlled areas, including airports, offices, and prisons. The system 100 can also be used as traditional facial recognition technologies for security or login. For example, the system 100 may be implemented in airports, multiplexes, prisons, and other selected places to more accurately identify individuals among the crowd, without passers-by even being aware of the system.
[0044] In some examples, the system 100 is configured to perform face recognition including face verification, face recognition, and liveness recognition. For example, the camera device 104 or display device 106 may also perform a face verification by comparing a person in the image 102 with the person on a photo of a photo ID. The camera device 104 or display device 106 may also perform a face recognition by comparing a person in the image 102 with all available photos and determining the identity of the person.
[0045] Figure 5 is an example flow chart showing an example method 500 for facial recognition. The face recognition in the system 100 may include three levels. At step 552, the system 100 receives two or more images taken from different directions of a person or an object. At step 554, the system 100 first performs face recognition and verification as described in relation to Figure 4A above.
[0046] As well, by using 3D facial recognition or liveness recognition, the system 100 can take into account wearable objects of a person, such as head coverings, face masks, sunglasses, facial hair, hats, and caps, as well as facial gestures and various lighting situations. At step 555, the system 100 may identify a wearable object worn by a person. For example, the system 100 may generate images from different directions of the person and identify a wearable object worn by the person. For example, based on the position relative to the eye features, the system 100 identifies the wearable object as a hat if the object is above the eyes, a pair of glasses if the object is worn in front of the eyes, or a mask if the object is below the eyes. The identification of the wearable object can be used to assist in the identification of the person, as shown in the sketch below.
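The following is a minimal sketch of that positional rule; the bounding-box format (x, y, width, height, with y increasing downward) and the function name are assumptions for illustration.

```python
# Sketch of the rule in paragraph [0046]: classify a detected wearable object
# by its position relative to the eye features (above -> hat, at eye level ->
# glasses, below -> mask). Box format and names are assumptions.
def classify_wearable(object_box, left_eye, right_eye):
    """object_box: (x, y, w, h); left_eye/right_eye: (x, y) feature points."""
    eye_y = (left_eye[1] + right_eye[1]) / 2.0
    obj_top = object_box[1]
    obj_bottom = object_box[1] + object_box[3]
    if obj_bottom < eye_y:                # entirely above the eyes
        return "hat"
    if obj_top <= eye_y <= obj_bottom:    # worn over the eyes
        return "glasses"
    return "mask"                         # below the eyes
```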
[0047] At step 556, the system 100 may include an anti-spoofing mechanism, by which the system 100 can distinguish a real face of a person from an image or video of the person. For example, the system 100 may use 3D face recognition, in which a physically present person is needed in order to recognize the person correctly. A photo or a video of the person would not suffice to fool the system 100. As such, the system 100 prevents bots and bad actors from using stolen photos, injected deepfake videos, life-like masks, or other spoofs. This mechanism ensures only real humans can be recognized, and can distinguish a real face from an image even without user interaction. For example, as illustrated in Figure 6, the system 100 may perform liveness recognition for an enhanced security mechanism. For example, at step 602, the system 100 can detect whether the person in the image 102 is an actual live person, and can distinguish a real face of a person from a saved picture of a person. For example, the system 100 may use 3D face recognition, in which a physically present person is needed in order to recognize the person correctly.
[0048] In some examples, the system 100 can include an enhanced security mechanism for a restricted area or initiate a security event. In some examples, the security event includes access to a device in the restricted area. At step 558, the system 100 is configured to detect actions of a person. For example, at step 604 in Figure 6, the system 100 is configured to recognize a person's responses, such as head turns, lifting an arm, etc., and the system 100 is configured to monitor the movement of the relevant body parts to evaluate whether the person has responded as requested. The requested action may include one or more actions, discretely or continuously, instructed by an audio output from the I/O interface device 508. For example, the system 100 is configured to require users to perform a simple task such as following a moving dot on the screen. By determining that the person correctly responded to the request, the system 100 determines that the person in the image is indeed a live person.
In some examples, the system 100 may have initially stored a person's response, or a movement of a body part of the person. The system 100 can then compare the stored response of the person with the responses captured by the camera device 104 to determine whether the person on the camera is the same person previously stored in the system 100.
[0049] In another example, the system 100, such as the camera device 104, is configured to display several flashing dots on the surface of the camera device 104, and to monitor the movement of the pupils, evaluating whether the flashing dots have been followed correctly. If the pupils have correctly followed the flashing dots, the system 100 may grant the person access to a restricted area, such as the entrance of a room storing a safe. If the pupils have not correctly followed the flashing dots, the system 100 may deny the person access to the restricted area, for example, by keeping a door locked. In addition to facial recognition, the system 100 can also require a user to perform multiple actions concurrently or sequentially to identify a person.
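A minimal sketch of one way to evaluate whether the pupils followed the flashing dots is given below; the correlation-based check and its 0.9 threshold are illustrative assumptions, not the claimed evaluation method.

```python
# Sketch of the pupil-tracking check in paragraph [0049]: record the pupil
# position for each displayed dot and grant access only if the pupil
# trajectory tracks the dot trajectory closely. Threshold is an assumption.
import numpy as np

def pupils_followed_dots(dot_positions, pupil_positions, threshold=0.9):
    dots = np.asarray(dot_positions, dtype=float)      # shape (n, 2)
    pupils = np.asarray(pupil_positions, dtype=float)  # shape (n, 2)
    corr_x = np.corrcoef(dots[:, 0], pupils[:, 0])[0, 1]
    corr_y = np.corrcoef(dots[:, 1], pupils[:, 1])[0, 1]
    return bool(min(corr_x, corr_y) >= threshold)

def decide_access(dot_positions, pupil_positions):
    return "grant" if pupils_followed_dots(dot_positions, pupil_positions) else "deny"
```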
[0050] In some examples, the system 100 is further configured to combine the facial recognition described above with geo-positioning to identify a person or object and the location of the object or person. For example, the geo-location of the person may be used to further determine that a specific person is indeed at a predetermined location, such as a prisoner or a security guard in a prison. In some examples, the system 100 is configured to combine facial recognition with bio-informatics of a person to more accurately identify the person. For example, the system 100 is further configured to recognize a person based on identifying the person and the biological and behavioral traits of the person for enhanced security or to improve the accuracy of facial recognition. For example, the system 100 may identify the person by recognizing the person's fingerprint, face, iris, palmprint, retina, hand geometry, voice, signature, posture, and gait received in the system 100. The system 100 can also take into consideration the biological and behavioral traits of the person to increase the confidence of the probability or object score when the system 100 identifies a person by facial recognition, to assist with identifying the person in steps 554 and 706 to be described below.
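One simple way to picture such score fusion is a weighted combination of the face-match object score with geo-location agreement and a secondary biometric score, as in the sketch below; the weights and the linear combination itself are illustrative assumptions, not values or a method stated in this disclosure.

```python
# Illustrative fusion of face score, geo-location agreement, and a secondary
# biometric score into one confidence value (weights are assumptions).
def fused_confidence(face_score: float,
                     at_expected_location: bool,
                     biometric_score: float,
                     w_face: float = 0.6,
                     w_geo: float = 0.15,
                     w_bio: float = 0.25) -> float:
    geo_score = 1.0 if at_expected_location else 0.0
    return w_face * face_score + w_geo * geo_score + w_bio * biometric_score
```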

[0051] In some examples, the system 100 is configured to track the object 202. As described above, the camera device 104 can perform image 2D or 3D object detection on the image 102 to identify the object 202.
[0052] In the example of Figure 4A, the camera devices 104(1)-104(3) take three images 102(1)-102(3) of the object 202, with the locations of the camera devices 104(1)-(3) shown as the 1st location, 2nd location, and 3rd location. Each camera device 104(1)-104(3) can determine the location (coordinates) of the camera device 104 with the positioning module 518 described in Figure 2. Figure 4B illustrates example images 102 captured by the camera device 104, in accordance with an example embodiment.
[0053] Referring to Figure 4A, a first image 102(1) is captured by the camera device 104(1) from the 1st location, a second image 102(2) is captured by the camera device 104(2) from the 2nd location, and a third image 102(3) is captured by the camera device 104(3) from the 3rd location. The images 102(1)-(3) all have different POVs of the same object 202 based on where the images 102 are captured by the camera devices 104(1)-(3). In some examples, multiple images can be captured at the same orientation of the camera device 104(1)-(3), at different zoom distances to the object 202, e.g., optical zoom, digital zoom, or manually moving the camera device 104. More or fewer images 102 of the object 202 can be taken than those shown in Figure 4A.
[0054] The camera device 104 can also generate feature points in the images 102(1)-(3) in the same manner as described above for facial recognition to identify a person. Although not shown in Figure 4B, an object label and feature points of the object 202 are also generated for the second image 102(2) from the 2nd location and for the third image 102(3) from the 3rd location. For the same object 202, the object label is the same in the first image 102(1), the second image 102(2), and the third image 102(3). Consensus rules and/or object scores can be used to resolve any conflicts in the object label. As such, by identifying the same object at different locations, the system 100 can track the position of the object.
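As a sketch of one possible consensus rule, the snippet below keeps the label whose observations carry the highest total object score across cameras; the observation format and this particular rule are assumptions, since the disclosure does not fix a specific consensus scheme.

```python
# Illustrative consensus on conflicting object labels across camera devices.
from collections import defaultdict

def resolve_object_label(observations):
    """observations: iterable of (object_label, object_score) tuples."""
    totals = defaultdict(float)
    for label, score in observations:
        totals[label] += score
    return max(totals, key=totals.get)

# e.g. ("person-42", 0.9), ("person-42", 0.8), ("person-7", 0.6)
# resolve to "person-42".
```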

[0055] The camera device 104 or display device 106 can include a front detection model that includes an NN such as a CNN. For example, the CNN can be trained to return a vector that represents the front identifying information. By training the CNN, the system 100 can reduce the errors in facial recognition and object tracking. In an example, the front identifying information is the anchor points of the front of the object. In an example, the front identifying information can include descriptive text, e.g. "face" or "nose" for a human, or the "bill" of a hat. In an example, the front detection model can query the object database 110 to retrieve any one of the following example front identifying information: the descriptive text, or an image of a front of the object.
[0056] In example embodiments, using the object label, the system 100 can track an object 202 in the system 100. For example, the display device 106 can be used to track and monitor an object 202. A distance threshold for the movement of the object can be used in some examples to determine whether the object 202 had actually moved, in which the distance threshold can vary depending on the application, the size of the object 202, or the particular environment.
[0057] In some examples, the camera device 104 captures the images 102 using video capture. A video can include a plurality of video frames, which are the images 102. For example, a user or a machine can activate a video record function of the camera device 104 and move the camera device 104 to the first location, the second location, and the third location (and/or other locations). The video can then be used by extracting the images 102 (video frames), which are then used to identify an object 202 for example, using facial recognition and/or to track object 202. The video can be recorded and then processed by the object tracking method at a later time, or can be processed in real-time. In some examples, audio from the video can be used to assist the object tracking method in generating the object label, for example to identify a human voice (using voice recognition), etc.
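A short sketch of extracting video frames (the images 102) from a recorded capture is shown below, using OpenCV; the sampling stride is an illustrative assumption.

```python
# Extract frames from a recorded video for use as images 102 (per paragraph
# [0057]). The every_nth stride is an assumption for illustration.
import cv2

def extract_frames(video_path: str, every_nth: int = 10):
    frames = []
    capture = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = capture.read()
        if not ok:          # end of video or read failure
            break
        if index % every_nth == 0:
            frames.append(frame)
        index += 1
    capture.release()
    return frames
```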
[0058] Figure 7 illustrates an example block diagram of an object tracking method 700 performed by the system 100, in accordance with an example embodiment. At step 702, the camera device 104 receives at least two or more images which include an object. For example, at least one image is received from the camera 522. At step 704, the camera device 104 generates, for each image, using the positioning module 518, a camera location associated with each image.
The system 100 may consider the camera location as the object location. At step 706, the camera device 104 identifies, using the image 2D/3D object detection module 510, the object in each image. Optionally, the camera device 104 may generate an object label to uniquely identify the object detected in each image, using the object identification method described above. As illustrated in Figure 4A, different camera devices 104(1)-(3) can transmit the captured image, the extracted feature points of the object, the location data of the camera devices 104, and the generated object label to the display device 106. At step 708, the display device 106 determines whether the object is also included in previous images. In an example, the display device 106 may decide whether the object is also included in previous images by comparing the object label associated with the object. If the object label associated with a first object in the image captured by a first camera device 104 is the same as the object label associated with a second object in the image captured by a second camera device 104, the display device 106 considers the first and second objects to be the same. In another example, the display device 106 may determine whether the object is also included in previous images by comparing the feature points of a first object in the image captured by a first camera device 104 with the feature points of a second object in the image captured by a second camera device 104. If the feature points of the first object are determined to be the same or substantially the same as those of the second object, the first object is determined to be the same as the second object. In either case, the display device 106 determines that the object 202 is also included in previous images.
[0059] When the display device 106 determines that the object is also included in previous images, at step 710, the display device 106 may generate a location label corresponding to the camera location. The display device 106 may also display, for example using 2D/3D
display module 410, the location label of the object. As such, with the method 700, the system 100 can track the location of the object 202. For example, the location label may be "server room on the fifth floor" or "exit at the underground parking". The display device 106 may also display location labels of the object over a period to indicate movement of the object.
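A minimal sketch of steps 708-710 follows: decide whether a new observation matches a previously seen object (here simply by object label) and, if so, append a location label taken from the camera location; the data structures and keys are assumptions for illustration.

```python
# Illustrative bookkeeping for steps 708-710 of method 700.
def update_track(tracks, observation):
    """tracks: dict mapping object_label -> list of location labels.
    observation: dict with 'object_label' and 'camera_location_label',
    e.g. "server room on the fifth floor"."""
    label = observation["object_label"]
    location_label = observation["camera_location_label"]
    if label in tracks:                       # object seen in previous images
        tracks[label].append(location_label)  # step 710: record a location label
    else:
        tracks[label] = [location_label]      # first sighting of this object
    return tracks[label]                      # movement history for display
```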
[0060] In some examples, the display device 106 is configured to receive images and coordinates of the object 202 from the camera devices 104(1)-104(3), and perform the method 700 from steps 706 to 710.
[0061] It would be appreciated that the facial recognition method 500 and object tracking method 700 can be applied to a plurality of objects 202. For example, each object 202 can be identified using the facial recognition method 500, and/or processed at the same time through the tracking method 700, or alternatively each individual object 202 can be identified individually using the facial recognition method 500, and/or processed individually through the tracking method 700 to detect and track one individual object instance at a time. The tracking method 700 is used to determine movement of the object 202, and to locate the object using the coordinates associated with the object 202.
[0062] As described above, the system 100 may be configured to use artificial intelligence, such as a CNN, to perform the object identification, facial recognition in method 500, and/or object tracking in method 700 above. Figure 8 illustrates an example detailed block diagram of a CNN model for facial recognition and object tracking performed by the system 100, in accordance with an example embodiment. For example, at least one or more of the described modules or applications of the camera device 104 and/or display device 106 can include a CNN.
The CNN is a deep neural network with a convolutional structure, and is a deep learning architecture. The deep learning architecture indicates that a plurality of layers of learning is performed at different abstraction layers by using a machine learning algorithm. As a deep learning architecture, the CNN is a feed-forward artificial neural network. Each neural cell in the feed-forward artificial neural network may respond to an image input to the neural cell.
[0063] As shown in Figure 8, the CNN 1100 may include an input layer 1110, a convolutional layer/pooling layer 1120 (the pooling layer is optional), and a fully connected network layer 1130. In examples, the input layer 1110 can receive the image 102 and can receive other information (depending on the particular module or model).

[0064] The convolutional layer/pooling layer 1120 shown in Figure 8 can include, for example, layers 1122(1), 1122(2), ..., 1122(n). For example, in an implementation, the layer 1122(1) is a convolutional layer, the layer 1122(2) is a pooling layer, the layer 1122(3) is a convolutional layer, the layer 1122(4) is a pooling layer, the layer 1122(5) is a convolutional layer, and the layer 1122(6) is a pooling layer, and so on. In another implementation, the layers 1122(1) and 1122(2) are convolutional layers, the layer 1122(3) is a pooling layer, the layers 1122(4) and 1122(5) are convolutional layers, and the layer 1122(6) is a pooling layer. In examples, an output from a convolutional layer may be used as an input to a following pooling layer, or may be used as an input to another convolutional layer, to continue a convolution operation.
[0065] The following describes internal operating principles of a convolutional layer by using the layer 1122(1) as an example. The convolutional layer 1122(1) may include a plurality of convolutional operators. The convolutional operator is also referred to as a kernel. A role of the convolutional operator in image processing is equivalent to a filter that extracts specific information from an input image matrix. In essence, the convolutional operator may be a weight matrix. The weight matrix is usually predefined. In the process of performing a convolution operation on an image, the weight matrix is usually moved one pixel after another (or two pixels after two pixels), depending on the value of a stride, in a horizontal direction on the input image, to extract a specific feature from the image. The size of the weight matrix needs to be related to the size of the image. It should be noted that a depth dimension of the weight matrix is the same as a depth dimension of the input image. In the convolution operation process, the weight matrix extends to the entire depth of the input image.
Therefore, after convolution is performed with a single weight matrix, a convolutional output with a single depth dimension is produced. However, a single weight matrix is not used in most cases; instead, a plurality of weight matrices with the same dimensions (row x column), in other words, a plurality of matrices of the same form, are used. Outputs of all the weight matrices are stacked to form the depth dimension of the convolutional image. It can be understood that the dimension herein is determined by the foregoing "plurality". Different weight matrices may be used to extract different features from the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, still another weight matrix is used to blur unneeded noise from the image, and so on. The plurality of weight matrices have the same size (row x column). Feature maps obtained after extraction performed by the plurality of weight matrices with the same dimension also have the same size, and the plurality of extracted feature maps with the same size are combined to form an output of the convolution operation.
[0066] Weight values in the weight matrices need to be obtained through a large amount of training in actual application. The weight matrices formed by the weight values obtained through training may be used to extract information from the input image, so that the CNN 1100 performs accurate prediction. By continuously training the CNN model, the system 100 can learn automatically from previous identifications and improve accuracy in facial recognition and object tracking.
[0067] When the CNN 1100 has one or more convolutional layers, an initial convolutional layer (such as 1122(1)) usually extracts a relatively large quantity of common features. The common feature may also be referred to as a low-level feature.
As the depth of the CNN 1100 increases, a feature extracted by a deeper convolutional layer (such as 1122(6) or 1122(n)) becomes more complex, for example, a feature with high-level semantics or the like. A
feature with higher-level semantics is more applicable to a to-be-resolved problem.
[0068] An example of the pooling layer is also described. Because a quantity of training parameters usually needs to be reduced, a pooling layer usually needs to periodically follow a convolutional layer. To be specific, at the layers 1122(1), ..., 1122(n), one pooling layer may follow one convolutional layer, or one or more pooling layers may follow a plurality of convolutional layers. In an image processing process, the purpose of the pooling layer is to reduce the space size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator, to perform sampling on the input image to obtain an image of a relatively small size. The average pooling operator may compute a pixel value in the image within a specific range, to generate an average value as an average pooling result. The maximum pooling operator may obtain, as a maximum pooling result, a pixel with a largest value within the specific range. In addition, just like the size of the weight matrix in the convolutional layer needs to be related to the size of the image, an operator at the pooling layer also needs to be related to the size of the image. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer. Each pixel in the image output by the pooling layer indicates an average value or a maximum value of a subarea corresponding to the image input to the pooling layer.
[0069] The fully connected network layer 1130 is now described. After the image is processed by the convolutional layer/pooling layer 1120, the CNN 1100 is still incapable of outputting the desired output information. As described above, the convolutional layer/pooling layer 1120 only extracts features and reduces the parameters brought by the input image. However, to generate final output information (desired category information or other related information), the CNN 1100 needs to generate an output for one or a group of desired categories by using the fully connected network layer 1130. Therefore, the fully connected network layer 1130 may include a plurality of hidden layers (such as 1132(1), 1132(2), ..., 1132(n) in Figure 8) and an output layer 1140. A parameter included in the plurality of hidden layers may be obtained by performing pre-training based on related training data of a specific task type. For example, the task type may include image recognition, image classification, image super-resolution reconstruction, or the like.
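The layout described in paragraphs [0063] to [0069] can be sketched as follows, using PyTorch as one possible framework; the channel counts, kernel sizes, input resolution, and number of output classes are illustrative assumptions rather than parameters of the CNN 1100.

```python
# Illustrative sketch of the CNN 1100 structure: alternating convolutional and
# pooling layers (1120) followed by fully connected hidden layers and an
# output layer (1130/1140). All sizes are assumptions.
import torch
import torch.nn as nn

class CNN1100(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(                 # convolutional/pooling layer 1120
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(               # fully connected network layer 1130
            nn.Flatten(),
            nn.Linear(64 * 56 * 56, 128), nn.ReLU(),   # hidden layers 1132(...)
            nn.Linear(128, num_classes),               # output layer 1140
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# For a 224x224 RGB input, two 2x2 poolings leave 56x56 feature maps.
```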
[0070] The output layer 1140 follows the plurality of hidden layers 1132(1), 1132(2), ..., 1132(n) in the network layer 1130. In other words, the output layer 1140 is the final layer in the entire CNN 1100. The output layer 1140 has a loss function similar to categorical cross-entropy and is specifically used to calculate a prediction error. Once forward propagation (propagation in a direction from 1110 to 1140 in Figure 8 is forward propagation) is complete in the entire CNN 1100, back propagation (propagation in a direction from 1140 to 1110 in Figure 8 is back propagation) starts to update the weight values and offsets of the foregoing layers, to reduce a loss of the CNN 1100 and an error between an ideal result and a result output by the CNN 1100 by using the output layer.
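One training step of this forward/backward cycle can be sketched as follows, again using PyTorch as an assumed framework; the optimizer choice and the use of `nn.CrossEntropyLoss` are illustrative, standing in for the cross-entropy-style loss described above.

```python
# One training step per paragraph [0070]: forward propagation, a
# cross-entropy-style loss at the output layer, then back propagation to
# update the weight values and offsets. Optimizer choice is an assumption.
import torch
import torch.nn as nn

def training_step(model: nn.Module, images: torch.Tensor,
                  labels: torch.Tensor,
                  optimizer: torch.optim.Optimizer) -> float:
    criterion = nn.CrossEntropyLoss()
    optimizer.zero_grad()
    logits = model(images)            # forward propagation (1110 -> 1140)
    loss = criterion(logits, labels)  # prediction error at the output layer
    loss.backward()                   # back propagation (1140 -> 1110)
    optimizer.step()                  # update weights and offsets
    return float(loss.item())
```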

[0071] It should be noted that the CNN 1100 shown in Figure 8 is merely used as an example of a CNN. In actual application, the CNN may exist in a form of another network model.
[0072] In the present application, one or more of the steps 554, 555, 556, 558 in method 500, one or more of steps 602, 604 in method 600, and one or more of steps 704, 706, and 708 and 710 in method 700 may be performed by system 100 using CNN 1100.
[0073] In the present application, to implement this functionality using techniques previously known in the art, the processor 404 or 504 would need to execute a vastly complicated rule-based natural language processing algorithm that would consume many more computational resources than the deep learning-based methods described herein, and/or would produce less accurate results.
[0074] The system 100 has compatibility with other services based on a RESTful API, an application programming interface (API or web API) that conforms to the constraints of the REST architectural style and allows the system 100 to interact with RESTful web services. The system 100 is device independent and can function on Android, iOS, etc.
[0075] The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual requirements to achieve the objectives of the solutions of the embodiments.
[0076] In addition, functional units in the example embodiments may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
[0077] When the functions are implemented in the form of a software functional unit and sold or used as an independent product, the functions may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of example embodiments may be implemented in the form of a software product. The software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer, a server, or a network device) to perform all or some of the steps of the methods described in the example embodiments. The foregoing storage medium includes any medium that can store program code, such as a Universal Serial Bus (USB) flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc. In an example, the software product can be an inference model generated from a machine learning training process.
[0078] In the described methods or block diagrams, the boxes may represent events, steps, functions, processes, modules, messages, and/or state-based operations, etc. While some of the example embodiments have been described as occurring in a particular order, some of the steps or processes may be performed in a different order provided that the result of the changed order of any given step will not prevent or impair the occurrence of subsequent steps.
Furthermore, some of the messages or steps described may be removed or combined in other embodiments, and some of the messages or steps described herein may be separated into a number of sub-messages or sub-steps in other embodiments. Even further, some or all of the steps may be repeated, as necessary. Elements described as methods or steps similarly apply to systems or subcomponents, and vice-versa. Reference to such words as "sending"
or "receiving"
could be interchanged depending on the perspective of the particular device.
[0079] The described embodiments are considered to be illustrative and not restrictive.
Example embodiments described as methods would similarly apply to systems or devices, and vice-versa.
[0080] The various example embodiments are merely examples and are in no way meant to limit the scope of the example embodiments. Variations of the innovations described herein will be apparent to persons of ordinary skill in the art, such variations being within the intended scope of the example embodiments. In particular, features from one or more of the example embodiments may be selected to create alternative embodiments comprised of a sub-combination of features which may not be explicitly described. In addition, features from one or more of the described example embodiments may be selected and combined to create alternative example embodiments composed of a combination of features which may not be explicitly described.
Features suitable for such combinations and sub-combinations would be readily apparent to persons skilled in the art. The subject matter described herein intends to cover all suitable changes in technology.


Claims (21)

WHAT IS CLAIMED IS:
1. A method for identifying an object, comprising:
receiving, at a processor, two or more images of a first object from different directions;
detecting, at the processor using a convolutional neural network (CNN), features of the first object in the two or more images;
comparing, at the processor, the features with a second set of features extracted from two or more existing images of a second object; and in response to the features matching the second set of features of the two or more existing images to a predetermined threshold, identifying, by the processor, the first object to be a same object as the second object.
2. The method of claim 1, further comprising: generating, at the processor, an object label uniquely identifying the first object.
3. The method of claim 1 or 2, wherein each of the first object and the second object is a person, and the two or more images and the two or more existing images include a face of the person.
4. The method of claim 3, wherein the two or more existing images include at least one photo from a photo identification of the person.
5. The method of claim 3, wherein the two or more existing images include all available photos of the person.
6. The method of any one of claims 3 to 5, further comprising distinguishing, by the processor using 3 dimensional face recognition, a real face of the person versus an image of the person.
7. The method of any one of claims 3 to 6, further comprising detecting, by the processor using the CNN, at least one action of the first object for the identifying the first object to be the same object as the second object.
8. The method of any one of claims 3 to 7, further comprising monitoring, by the processor using the CNN, one or more responses of the person to one or more prompts from the processor, or a movement of a body part of the person, for the identifying the first object to be the same object as the second object.
9. The method of any one of claims 3 to 8, further comprising initiating, by the processor when the identifying of the first object is the same object as the second object, a security event with respect to the person.
10. The method of claim 9, wherein the security event comprises granting or denying access of the person to an area.
11. The method of claim 8, wherein the person's responses are detected by the processor concurrently or sequentially.
12. The method of any one of claims 1 to 11, wherein the features are Kanade–Lucas–Tomasi (KLT) features.
13. The method of any one of claims 1 to 12, further comprising generating, by the processor, a location label associated with a location of the first object.
14. The method of any one of claims 3 to 13, further comprising determining, by the processor, an identity of the person by iris, retina, hand geometry, or voice of the person for the identifying the first object to be the same object as the second object.
15. The method of any one of claims 3 to 14, further comprising identifying, by the processor, a wearable object worn by the person.
16. The method of any one of claims 1 to 15, wherein the identifying is performed by the processor using the CNN.
17. A method for tracking an object, comprising:
receiving, at a processor, two or more images of an object taken from different directions;
receiving, at the processor, a location associated with at least one of the two or more images;
identifying, at the processor using a convolutional neural network (CNN), the object in the two or more images; and in response to determining that the object is included in stored images, generating, at the processor, a location label associated with a location of the object based on a location of a camera generating the two or more images.
18. The method of claim 17, wherein the location of the object comprises coordinates of the object.
19. A method for tracking an object, comprising:
transmitting, at a camera device, two or more images of an object taken from different directions;
transmitting, by the camera device, a location associated with at least one of the two or more images;
identifying, at a display device using a convolutional neural network (CNN), the object in the two or more images; and in response to determining that the object is included in stored images, generating, at the display device, a location label associated with a location of the object based on a location of a camera generating the two or more images.
20. The method of claim 19, further comprising displaying, by the display device, one or more location labels associated with the object during a selected period.
21. A non-transitory computer-readable medium including instructions which, when executed by at least one processor, cause the at least one processor to perform the method as claimed in any one of claims 1-18.
CA3143843A 2021-12-23 2021-12-23 Systems and methods for face and object tracking and monitoring Pending CA3143843A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3143843A CA3143843A1 (en) 2021-12-23 2021-12-23 Systems and methods for face and object tracking and monitoring

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CA3143843A CA3143843A1 (en) 2021-12-23 2021-12-23 Systems and methods for face and object tracking and monitoring

Publications (1)

Publication Number Publication Date
CA3143843A1 true CA3143843A1 (en) 2023-06-23

Family

ID=86852170

Family Applications (1)

Application Number Title Priority Date Filing Date
CA3143843A Pending CA3143843A1 (en) 2021-12-23 2021-12-23 Systems and methods for face and object tracking and monitoring

Country Status (1)

Country Link
CA (1) CA3143843A1 (en)

Similar Documents

Publication Publication Date Title
US10915167B2 (en) Rendering rich media content based on head position information
US10853677B2 (en) Verification method and system
JP6411510B2 (en) System and method for identifying faces in unconstrained media
US9224060B1 (en) Object tracking using depth information
US9729865B1 (en) Object detection and tracking
US9607138B1 (en) User authentication and verification through video analysis
Pantic et al. Automatic analysis of facial expressions: The state of the art
US8966613B2 (en) Multi-frame depth image information identification
CN108154075A (en) The population analysis method learnt via single
CN108135469A (en) Estimated using the eyelid shape of eyes attitude measurement
WO2020125499A1 (en) Operation prompting method and glasses
US9298974B1 (en) Object identification through stereo association
KR20200063292A (en) Emotional recognition system and method based on face images
US11574424B2 (en) Augmented reality map curation
EP3700167B1 (en) Methods and systems for determining user liveness and verifying user identities
CN106462242A (en) User interface control using gaze tracking
Li et al. Robust multiperson detection and tracking for mobile service and social robots
CN113591562A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
Elagin et al. Automatic pose estimation system for human faces based on bunch graph matching technology
JP6876312B1 (en) Learning model generation method, computer program and information processing device
Zhang et al. A virtual proctor with biometric authentication for facilitating distance education
CA3143843A1 (en) Systems and methods for face and object tracking and monitoring
Lee et al. Robust emotion recognition algorithm for ambiguous facial expression using optimized AAM and k-NN
CN112527103B (en) Remote control method and device for display equipment, equipment and computer readable storage medium
Logronio et al. Age Range Classification Through Facial Recognition Using Keras Model