WO2021233357A1 - Object detection method, system and computer-readable medium - Google Patents

Object detection method, system and computer-readable medium

Info

Publication number
WO2021233357A1
WO2021233357A1 (PCT/CN2021/094720)
Authority
WO
WIPO (PCT)
Prior art keywords
incoming
category
superpoints
object point
semantic map
Prior art date
Application number
PCT/CN2021/094720
Other languages
French (fr)
Inventor
Xiang Li
Yi Xu
Yuan Tian
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd. filed Critical Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN202180030033.9A priority Critical patent/CN115428040A/en
Publication of WO2021233357A1 publication Critical patent/WO2021233357A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/20Scenes; Scene-specific elements in augmented reality scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/70Labelling scene content, e.g. deriving syntactic or semantic representations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present application relates to image processing technologies, and more particularly, to an object detection method, system and computer-readable medium.
  • Object detection could play an important role in Augmented Reality (AR) .
  • The awareness/understanding of objects in the real-world scene can enable a lot of applications for AR. For example, one can change the appearance of real objects by adjusting a virtual overlay accordingly.
  • One can also place virtual objects into the scene by certain rules of association (e.g., a matching virtual chair displayed near a real table).
  • For e-commerce applications, one can also recommend related merchandise based on the understanding of the scene.
  • AR frameworks have become mainstream, e.g., Apple Inc. ’s ARKit and Google Inc. ’s ARCore.
  • Such AR frameworks employ a SLAM algorithm, more specifically a VIO algorithm, to track the 6-Degree-of-Freedom (DoF) camera pose (i.e., position and orientation).
  • Sparse 3D point cloud data is also provided by such frameworks, which can reconstruct 3D points more than 50 meters from the camera.
  • An object of the present application is to propose an object detection method, system and computer-readable medium to use a semantic map to improve accuracy of object detection.
  • an object detection method includes:
  • creating object representation data based on outputs from a neural network and an augmented reality (AR) framework, wherein the object representation data includes object label information of an object identified by the neural network on an image and the three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework;
  • building a semantic map including object superpoints, wherein each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of the object point;
  • in the building of the semantic map, the semantic map is built from the object points whose projections on the image are within a bounding area of the object identified by the neural network.
  • a median point of the object points whose projections on the image are within the bounding area of the object is used to construct the object superpoint of the semantic map.
  • the certain distance is a maximum scale of the category to which the incoming object point belongs.
  • the updating the semantic map in response to the incoming object point includes:
  • computing the score of the incoming object point in consideration of a comparison between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison between the scale information of the incoming object point and historical scale information of the object superpoints in the set, wherein the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs.
  • the score of the incoming object point is computed based on a first weight and a second weight, wherein the first weight is associated with a minimum angular difference between a viewpoint corresponding to the incoming object point and all the viewpoints corresponding to the object superpoints in the set, and the second weight is associated with a minimum scale difference between a scale corresponding to the incoming object point and all the scales corresponding to the object superpoints in the set.
  • the first weight is set to be a first number if the minimum angular difference is less than a first predetermined degree and the first weight is set to be a second number if the minimum angular difference is greater than a second predetermined degree, and wherein the first number is less than the second number and the first predetermined degree is less than the second predetermined degree.
  • the second weight is proportional to the minimum scale difference if the minimum scale difference is within a predetermined range, and the second weight is set to a fixed number if the minimum scale difference exceeds the predetermined range.
  • the score of the incoming object point increases as the minimum angular difference and/or the minimum scale difference increases; the score of the incoming object point decreases as the minimum angular difference and/or the minimum scale difference decreases, and wherein an increase of the minimum angular difference and/or the minimum scale difference indicates a chance to use the probability of the category of the incoming object point obtained from the neural network increases; a decrease of the minimum angular difference and/or the minimum scale difference indicates a chance to use the probability of the category of the incoming object point obtained from the neural network decreases.
  • the updating the semantic map in response to the incoming object point includes:
  • updating the scores of the object superpoints in the set that are of the category identical to the category of the incoming object point identified by the neural network, wherein the object superpoints in the set get an extra score if they fall within a minimum scale of the category to which the incoming object point belongs.
  • the modifying the probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map includes:
  • the probability of the category to which the incoming object point belongs is modified in consideration of a maximum score of all the object superpoints in the set with a category as the same as the category of the incoming object point and a maximum score of all the object superpoints in the set with any other category.
  • the probability of the category to which the incoming object point belongs is modified based on a sigmoid function.
  • an object detection system includes:
  • At least one memory configured to store program instructions
  • At least one processor configured to execute the program instructions, which cause the at least one processor to perform steps including:
  • creating object representation data based on outputs from a neural network and an augmented reality (AR) framework, wherein the object representation data includes object label information of an object identified by the neural network on an image and the three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework;
  • building a semantic map including object superpoints, wherein each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of the object point;
  • the updating the semantic map in response to the incoming object point includes:
  • computing the score of the incoming object point in consideration of a comparison between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison between the scale information of the incoming object point and historical scale information of the object superpoints in the set, wherein the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs.
  • a non-transitory computer-readable medium is deployed with program instructions stored thereon, that when executed by at least one processor, cause the at least one processor to perform any of the above-described object detection methods.
  • a semantic map is used to improve the accuracy of object detection.
  • Semantic points in the map are generated by combining object detection results from a neural network and pose data and three-dimensional points results from an AR framework.
  • the semantic map consists of object superpoints with a list of scores corresponding to detected labels, a list of view directions and a list of scales.
  • the probabilities from the neural network are modified based on these lists. By modifying the probability of an object label or category, object detection accuracy is enhanced.
  • FIG. 1 is a schematic diagram illustrating the architecture of object detection according to the present invention.
  • FIG. 2 is a flowchart of an object detection method according to the present application.
  • FIG. 3 is a flowchart of a semantic map updating process according to the present application.
  • FIG. 4 is a block diagram illustrating an object detection system according to the present application.
  • FIG. 5 is a block diagram illustrating an updating module of an object detection system according to the present application.
  • FIG. 6 is a block diagram illustrating an electronic device for implementing an object detection method according to the present application.
  • Other neural networks and similar augmented reality technologies can be applied in the present application. It is not intended that the present application be limited to any illustrated examples.
  • The present application uses a (3D) semantic map to improve the accuracy of 2D object detection DNN(s).
  • the semantic point clouds are generated by combining object detection results from the DNN(s) with pose and 3D point results from the AR framework.
  • the semantic map consists of 3D superpoints with a list of scores corresponding to detected labels, a list of view directions and a list of scales.
  • the probabilities from DNN (s) are updated or modified based on them. By modifying the probability of an object label or category, object detection accuracy is enhanced.
  • a score measure for estimated object points considers not only how many times a certain label has been detected at a certain location, but also the detection view directions and detection scales.
  • This approach performs better than ordinary DNN(s) in AR scenarios where the viewpoint is constantly changing. This can decrease the probability of false positives when the object is recognized as a category which has never been seen from similar view directions recently, while another category has been seen many times at the same location. For example, this approach can correct false positives where a bed has been detected as a couch at the current frame, but the semantic map shows a bed had been detected consistently at the same location in previous frames.
  • when the object detection DNN(s) outputs a relatively low probability for an object label, but it can be known from the semantic map that this object has been detected at this location from very different directions a while ago, this approach will increase the probability of the said category.
  • This approach increases the accuracy of 2D object detection with AR framework without any additional training data to handle scale and viewpoint variance of the task.
  • This enables many AR applications. For example, it can assign semantic labels to 3D point cloud which can trigger corresponding virtual contents for the users.
  • FIG. 1 is a schematic diagram illustrating the architecture of object detection according to the present invention.
  • the architecture of object detection of the present application is described as follows.
  • a 3D semantic map is built using the output from both AR framework and DNN (s) .
  • There are challenges in object detection, such as variance in scale and viewpoint: an object detector must detect objects at different scales on the images and from different viewpoints. This challenge is addressed by using a category scale database to verify whether the detected object’s category agrees with the scale estimated from the AR framework. For example, an airplane should not appear in a 5 m × 5 m space.
  • the viewpoints generated by the AR framework when an object is detected by the DNN(s) are stored. In some embodiments, those consistent detections of the same object from different view directions and/or at different scales are favored.
  • a probabilistic model is used to insert and update object category, viewpoint, and scale information in the 3D semantic map. Information from the 3D semantic map is extracted to update the object label probability from the DNN (s) as shown in FIG. 1.
  • FIG. 2 is a flowchart of an object detection method according to the present application. The object detection method is described in detail below.
  • Step S200 creating object representation data based on outputs from a neural network (e.g., a DNN) and an AR framework (e.g., Apple Inc.’s ARKit or Google Inc.’s ARCore, which employs a SLAM algorithm, more specifically a VIO algorithm).
  • the object representation data includes object label information (e.g., a chair label shown in FIG. 1) of an object (e.g., a chair) identified by the neural network on an image, and further includes three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework.
  • 2D object points of the object (e.g., the chair) on the image have corresponding 3D object points estimated from the AR framework, wherein the mapping of the 3D object points onto the image results in the 2D object points.
  • the DNN (s) may output a list of N object categories with associated bounding boxes and probabilities.
  • an object representation data structure is created as (loc, label, view, scale), where loc is the 3D coordinates of an estimated object point in a current frame, label is the object label from the DNN(s), view is the view direction (or viewpoint) from the camera to loc, and scale is the scale information that depends on the distance from the camera position to loc.
  • For each frame, the AR framework generates the 6DoF pose of the camera and a set of sparse 3D points with global 3D coordinates. These are used to compute loc, view, and scale.
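  • As a rough illustration only, the (loc, label, view, scale) record might be held in a structure like the following Python sketch; the names ObjectPoint and make_observation are hypothetical, and using the raw camera-to-loc distance as the scale value is an assumption, since the text only says the scale depends on that distance.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ObjectPoint:
    loc: np.ndarray    # 3D coordinates of the estimated object point (global frame)
    label: str         # object label from the DNN
    view: np.ndarray   # unit view direction from the camera position to loc
    scale: float       # scale value derived from the camera-to-loc distance


def make_observation(loc, label, cam_pos):
    # view is the normalized direction from the camera to the point; the
    # distance itself stands in for the (unspecified) scale mapping.
    offset = np.asarray(loc, dtype=float) - np.asarray(cam_pos, dtype=float)
    dist = float(np.linalg.norm(offset))
    return ObjectPoint(loc=np.asarray(loc, dtype=float), label=label,
                       view=offset / dist, scale=dist)
```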
  • Step S202 building a semantic map including object superpoints.
  • each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of the object point.
  • the object superpoints are represented as (loc, list_score, list_view, list_scale) .
  • the three lists encode information from all previous frames in the AR session. 1) list_score (E_1, E_2, E_3, ..., E_l, ...) stores the list of the scores E_l for each label l that has been detected at this point; the higher the score, the higher the probability that this point is of category l.
  • 2) list_view (v_1, v_2, v_3, ...) stores the list of historical view directions (or viewpoints) from camera positions to the point when an object is detected; and 3) list_scale (s_1, s_2, s_3, ...) stores the list of historical scales from camera positions to the point when an object is detected at the point.
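  • As a rough illustration, one possible in-memory layout of such a superpoint is sketched below; the name Superpoint and the choice of a label-keyed dict to encode list_score are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field

import numpy as np


@dataclass
class Superpoint:
    loc: np.ndarray                          # 3D location of the superpoint
    list_score: dict = field(                # per-label running scores E_l
        default_factory=lambda: defaultdict(float))
    list_view: list = field(default_factory=list)   # historical unit view directions
    list_scale: list = field(default_factory=list)  # historical detection scales
```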
  • a point might be labelled as different categories during the AR session at different time instances.
  • the semantic map is built from the object points whose projections on the image are within a bounding area of the object identified by the neural network. That is, only the 3D object points that map to or fall within the bounding area (e.g., a bounding box) of the object identified by the neural network are to be concerned or interested in building the semantic map.
  • a median point of the object points whose projections on the image are within the bounding area (e.g., a bounding box) of the object is used to construct the object superpoint of the semantic map. It is ensured that the median point falls within the bounding area of the object on the image. In another aspect, this further reduces the amount of computation.
  • some form of statistics of all the 3D points whose projections on the image are within the 2D bounding box of a detected object label may be computed.
  • the median for each of the XYZ dimensions for all points is used to represent the object in the current view. In this way, it is avoided assigning object labels to irrelevant points on other objects or background in the semantic map. This may make the approach more robust and efficient.
  • the AR framework estimates the pose of camera and reconstructs a few 3D points as shown in FIG. 1 as circular points (see the right side of FIG. 1) .
  • the median p = (2.8733, 1.09483, 1.2345) of those circular points that are within the 2D bounding box of “chair” is used to represent the estimated object point loc.
  • a view direction v = (0.61497, 0.76871, 0.17458) is computed from the position of the camera to loc.
  • v is a normalized unit vector.
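  • A minimal sketch of this step, under the assumption that the 2D projections of the sparse 3D points are already available; the function name object_point_from_bbox is hypothetical.

```python
import numpy as np

def object_point_from_bbox(points_3d, points_2d, bbox, cam_pos):
    """Median (per XYZ dimension) of the 3D points whose 2D projections fall
    inside the detection's bounding box, plus the unit view direction from
    the camera position to that median point."""
    x0, y0, x1, y1 = bbox
    inside = [p3 for p3, (u, v) in zip(points_3d, points_2d)
              if x0 <= u <= x1 and y0 <= v <= y1]
    loc = np.median(np.asarray(inside), axis=0)   # e.g., p = (2.8733, 1.09483, 1.2345)
    view = loc - np.asarray(cam_pos, dtype=float)
    return loc, view / np.linalg.norm(view)       # normalized unit view vector
```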
  • Step S204 determining a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point.
  • Viewpoint will be changed during an AR session.
  • new object points may be generated from different viewpoints.
  • different labels or categories may be given to the same object.
  • a set of the object superpoints in the semantic map whose locations are within a certain distance of the incoming object point is determined. More specifically, the certain distance is determined based on a scale of a category of the object (e.g., a chair category) identified by the neural network and the incoming object point belongs to the category.
  • the scale of the object category or label may be retrieved from a category scale database as shown in FIG. 1.
  • the certain distance is a maximum scale of the category to which the incoming object point belongs.
  • the scale of a chair category ranges from 0.5m to 1.5m.
  • the maximum scale of the chair category would be 1.5m.
  • For each incoming estimated object representation (loc_in, label_in, view_in, scale_in), a set S_in of the superpoints (loc, list_score, list_view, list_scale) in the map whose loc are within a certain distance of the incoming object point loc_in is located.
  • a minimum scale and a maximum scale are defined for that category (e.g., 0.5 m to 1.5 m for the scale of a chair category). Then any superpoints within the maximum scale of the incoming object category will be added to the set S_in for processing.
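  • A minimal sketch of this candidate lookup, reusing the Superpoint sketch above; the CATEGORY_SCALES dictionary is a hypothetical stand-in for the category scale database, and its entries are illustrative.

```python
import numpy as np

# Hypothetical category scale database: per-label (min_scale, max_scale) in meters.
CATEGORY_SCALES = {"chair": (0.5, 1.5), "couch": (1.0, 3.0), "bed": (1.5, 2.5)}

def find_candidate_set(superpoints, loc_in, label_in):
    """Return S_in: the superpoints whose loc lies within the maximum scale
    of the incoming object's category."""
    _, max_scale = CATEGORY_SCALES[label_in]
    return [sp for sp in superpoints
            if np.linalg.norm(sp.loc - np.asarray(loc_in)) <= max_scale]
```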
  • Step S206 updating the semantic map in response to the incoming object point.
  • the semantic map is updated.
  • the updated semantic map is for modifying, in subsequent processes, a probability of the category to which the incoming object point belongs and which is identified by the neural network, for facilitating the object detection.
  • the scores of the object superpoints in the determined set in the semantic map are updated based on information from the incoming object point. That is, the information from the incoming object point participates in building the historical scores of the set of the object superpoints in the semantic map.
  • FIG. 3 is a flowchart of a semantic map updating process according to the present application.
  • the updating the semantic map in Step S206 may include the following steps, i.e., Steps 300 to 306.
  • Step 300 computing the score of the incoming object point.
  • weights w_v and w_s are calculated from v_diff and s_diff as follows:
  • v_diff is the minimum angular difference between the current view direction (or viewpoint) and all the view directions in the list_view of all the points in S_in.
  • the higher v_diff is, the higher the weight w_v is.
  • the weight w_v is set to zero when v_diff is within 45 degrees, so that the semantic map is only updated intermittently.
  • w_v is capped at 1 when v_diff is larger than 90 degrees.
  • s_diff is the minimum scale difference between the current scale and all the scales in the list_scale of all the points in S_in.
  • the higher s_diff is, the higher the weight w_s is.
  • k_s is the factor used to normalize s_diff.
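  • A minimal sketch of this scoring step, reusing the Superpoint sketch above: the original formulas for w_v and w_s are not reproduced in this excerpt, so the linear ramp between 45 and 90 degrees and the simple averaging of the two weights are assumptions consistent with the behavior just described.

```python
import numpy as np

def angular_diff_deg(v1, v2):
    """Angle in degrees between two unit view directions."""
    return float(np.degrees(np.arccos(np.clip(np.dot(v1, v2), -1.0, 1.0))))

def incoming_score(view_in, scale_in, s_in, k_s=1.0):
    """Score of the incoming object point from the weights w_v and w_s."""
    v_diff = min(angular_diff_deg(view_in, v)
                 for sp in s_in for v in sp.list_view)
    s_diff = min(abs(scale_in - s)
                 for sp in s_in for s in sp.list_scale)
    w_v = float(np.clip((v_diff - 45.0) / 45.0, 0.0, 1.0))  # 0 below 45 deg, 1 above 90 deg
    w_s = min(k_s * s_diff, 1.0)                            # proportional to s_diff, capped at 1
    return 0.5 * (w_v + w_s)  # assumed way of combining the two weights
```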
  • the computing the score of the incoming object point may include computing the score of the incoming object point in consideration of a comparison (e.g., v_diff) between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison (e.g., s_diff) between the scale information of the incoming object point and historical scale information of the object superpoints in the set. That is, these comparisons can be used to estimate the magnitude by which the camera pose has changed.
  • the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs. It is desired that a large change in camera pose lead to a high score of the incoming object point, since in this circumstance it would be better to assign a new label identified by the neural network to the incoming object point, and that a small change in camera pose lead to a low score of the incoming object point, since in this circumstance it would be better to keep an already-detected label for the incoming object point.
  • the score of the incoming object point is computed based on a first weight and a second weight, in which the first weight is associated with a minimum angular difference between a viewpoint corresponding to the incoming object point and all the viewpoints corresponding to the object superpoints in the set, and the second weight is associated with a minimum scale difference between a scale corresponding to the incoming object point and all the scales corresponding to the object superpoints in the set.
  • the first weight and the second weight may be w_v and w_s, respectively.
  • the minimum angular difference v_diff is used to determine the first weight.
  • the minimum scale difference s_diff is used to determine the second weight.
  • the first weight is set to be a first number if the minimum angular difference is less than a first predetermined degree and the first weight is set to be a second number if the minimum angular difference is greater than a second predetermined degree, and the first number is less than the second number and the first predetermined degree is less than the second predetermined degree.
  • the first weight may be w_v.
  • the first weight w_v is set to 0 if the minimum angular difference v_diff is less than 45 degrees, and is set to 1 if the minimum angular difference v_diff is greater than 90 degrees.
  • the second weight is proportional to the minimum scale difference if the minimum scale difference is within a predetermined range, and the second weight is set to a fixed number if the minimum scale difference exceeds the predetermined range.
  • the second weight may be w_s.
  • the second weight w_s is proportional to the minimum scale difference s_diff if s_diff is within 1/k_s, and w_s is set to 1 if s_diff exceeds 1/k_s.
  • the score of the incoming object point increases as the minimum angular difference and/or the minimum scale difference increases (e.g., increases as v_diff and/or s_diff increases); the score of the incoming object point decreases as the minimum angular difference and/or the minimum scale difference decreases (e.g., decreases as v_diff and/or s_diff decreases). An increase of the minimum angular difference and/or the minimum scale difference indicates that the chance to use the probability of the category of the incoming object point obtained from the neural network increases; that is, it is desired that a large change in camera pose lead to a high score of the incoming object point, since in this circumstance it would be better to assign a new label identified by the neural network to the incoming object point. A decrease of the minimum angular difference and/or the minimum scale difference indicates that the chance to use the probability of the category of the incoming object point obtained from the neural network decreases; that is, it is desired that a small change in camera pose lead to a low score of the incoming object point, keeping the already-detected label.
  • Step 302 updating the scores of the object superpoints in the set by utilizing the score of the incoming object point.
  • the scores of the object superpoints in the set are updated, for the category of the object superpoints in the set that is identical to the category of the incoming object point identified by the neural network, by utilizing the score of the incoming object point. More specifically, a distance from the object superpoints in the set with the same category to the incoming object point is considered. For the object superpoints falling between a maximum scale and a minimum scale of the category of the incoming object point, their scores are updated by adding the score of the incoming object point to their original scores.
  • these object superpoints get an extra score (e.g., 1) if they fall within the minimum scale of the category to which the incoming object point belongs. This takes the times a certain label has been detected at a certain location into consideration.
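  • The following sketch, reusing the Superpoint sketch above, illustrates this update; whether the extra score of 1 stacks on top of the added incoming score e_in for superpoints inside the minimum scale is an assumption.

```python
import numpy as np

def update_scores(s_in, loc_in, label_in, e_in, min_scale):
    """Add the incoming score e_in to the same-label scores of the
    superpoints in S_in; superpoints within the category's minimum scale
    get an extra score of 1."""
    for sp in s_in:
        sp.list_score[label_in] += e_in
        if np.linalg.norm(sp.loc - np.asarray(loc_in)) <= min_scale:
            sp.list_score[label_in] += 1.0
```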
  • Step 304 updating the historical data of viewpoint information and/or scale information.
  • the historical data of the viewpoint information of any one of the object superpoints in the set is updated if a minimum angular difference (e.g., v_diff) between a viewpoint corresponding to the one of the object superpoints in the set and all the viewpoints corresponding to the object superpoints in the set is greater than a predetermined degree (e.g., v_diff ≥ 45°); and/or the historical data of the scale information of any one of the object superpoints in the set is updated if a minimum scale difference (e.g., s_diff) between a scale corresponding to the one of the object superpoints in the set and all the scales corresponding to the object superpoints in the set exceeds a predetermined value (e.g., s_diff ≥ 1).
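  • A minimal sketch of this history update, reusing the Superpoint sketch above; the text describes per-superpoint trigger conditions, while this simplified version appends the incoming view and scale to every superpoint in the set once the example thresholds are crossed, which is an assumption.

```python
def update_history(s_in, view_in, scale_in, v_diff, s_diff,
                   v_thresh=45.0, s_thresh=1.0):
    """Append the incoming view/scale to the superpoint histories only when
    the pose change is large enough, keeping the lists compact."""
    for sp in s_in:
        if v_diff >= v_thresh:   # e.g., v_diff >= 45 degrees
            sp.list_view.append(view_in)
        if s_diff >= s_thresh:   # e.g., s_diff >= 1
            sp.list_scale.append(scale_in)
```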
  • Step 306 initializing the historical data of the incoming object point under certain circumstances.
  • all three lists list_score, list_view and list_scale are initialized with the corresponding current values, e.g., list_view = (v) and list_scale = (s). Then, the new superpoint (loc, list_score, list_view, list_scale) is added into the semantic map.
  • the historical data of the incoming object point is initialized if no object superpoint is within a minimum scale of the category to which the incoming object point belongs.
  • the initialization means that only current score, viewpoint information and scale information of the incoming object point are recorded on the semantic map for the incoming object point and previous or historical scores, viewpoint information and scale information are initialized as zero or deleted.
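  • A minimal sketch of this initialization, reusing the ObjectPoint and Superpoint sketches above; the helper name insert_if_new is hypothetical.

```python
import numpy as np

def insert_if_new(semantic_map, obs, e_in, min_scale):
    """Start a fresh superpoint when no existing superpoint lies within the
    category's minimum scale; its histories hold only the current
    observation."""
    near = any(np.linalg.norm(sp.loc - obs.loc) <= min_scale
               for sp in semantic_map)
    if not near:
        sp = Superpoint(loc=obs.loc)
        sp.list_score[obs.label] = e_in
        sp.list_view.append(obs.view)
        sp.list_scale.append(obs.scale)
        semantic_map.append(sp)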
  • Step S208 modifying a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map.
  • a minimum value of p_map is defined as 0.5 to make sure it does not decrease the output probability p_l from the DNN(s) dramatically. Therefore, the final probability of the object is obtained by combining p_map with p_l.
  • the final output is a list of bounding boxes, each of which has the output (l, p, bbox), where the label and bounding box are the same as the output from the DNN(s).
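  • A minimal sketch of this modification step, reusing the Superpoint sketch above: the original formula is not reproduced in this excerpt, so applying a sigmoid to the gap between the best same-label score and the best other-label score, flooring p_map at 0.5, and multiplying it into the DNN probability p_l are assumptions consistent with the description.

```python
import numpy as np

def final_probability(p_l, s_in, label_in):
    """Modified probability for the label given the DNN output p_l."""
    e_same = max((sp.list_score.get(label_in, 0.0) for sp in s_in), default=0.0)
    e_other = max((score for sp in s_in
                   for lbl, score in sp.list_score.items() if lbl != label_in),
                  default=0.0)
    p_map = 1.0 / (1.0 + np.exp(-(e_same - e_other)))  # sigmoid over the score gap
    return max(p_map, 0.5) * p_l                       # floor p_map at 0.5
```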
  • FIG. 4 is a block diagram illustrating an object detection system according to the present application. As illustrated in FIG. 4, an object detection system 40 is provided.
  • the object detection system 40 includes a creating module 400, a building module 402, a determining module 404, an updating module 406 and a modifying module 408.
  • the creating module 400 is configured to create object representation data based on outputs from a neural network and an augmented reality (AR) framework, wherein the object representation data includes object label information of an object identified by the neural network on an image and three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework.
  • AR augmented reality
  • the determining module 404 is configured to determine a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point, wherein the certain distance is determined based on a scale of a category of the object identified by the neural network and the incoming object point belongs to the category.
  • the updating module 406 is configured to update the semantic map in response to the incoming object point, wherein the scores of the object superpoints in the determined set in the semantic map are updated based on information from the incoming object point.
  • the modifying module 408 is configured to modify a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map.
  • FIG. 5 is a block diagram illustrating an updating module of an object detection system according to the present application.
  • the updating module 406 of the object detection system 40 includes a computing unit 500, a score updating unit 502, a data updating unit 504 and an initializing unit 506.
  • the computing unit 500 is configured to compute the score of the incoming object point in consideration of a comparison between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison between the scale information of the incoming object point and historical scale information of the object superpoints in the set, wherein the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs.
  • the score updating unit 502 is configured to update the scores of the object superpoints in the set for the category of the object superpoints in the set that is identical to the category of the incoming object point identified by the neural network by utilizing the score of the incoming object point, wherein for the object superpoints in the set that are of the category identical to the category of the incoming object point, the object superpoints in the set get an extra score if the object superpoints in the set fall within a minimum scale of the category to which the incoming object point belongs.
  • the data updating unit 504 is configured to update the historical data of the viewpoint information of any one of the object superpoints in the set if a minimum angular difference between a viewpoint corresponding to the one of the object superpoints in the set and all the viewpoints corresponding to the object superpoints in the set is greater than a predetermined degree; and/or to update the historical data of the scale information of any one of the object superpoints in the set if a minimum scale difference between a scale corresponding to the one of the object superpoints in the set and all the scales corresponding to the object superpoints in the set exceeds a predetermined value.
  • the initializing unit 506 is configured to initialize the historical data of the incoming object point if no object superpoint is within a minimum scale of the category to which the incoming object point belongs, wherein only the current score, viewpoint information and scale information of the incoming object point are recorded on the semantic map for the incoming object point.
  • All or part of the modules or units in the above-mentioned object detection system may be implemented by software, hardware, and a combination thereof.
  • the foregoing modules or units may be embedded in or independent from a processor of a computer equipment in the form of hardware, or may be stored in a memory of the computer equipment in the form of software, so that the processor can invoke and execute the operations corresponding to the foregoing modules or units.
  • the modules or units in the object detection system may be implemented by a computer program.
  • the computer program can be run on a terminal or a server.
  • the program module composed of the computer program can be stored in a memory of the terminal or the server.
  • Implementations also provide a non-transitory computer-readable storage medium.
  • One or more non-transitory computer-readable storage media contain computer-executable instructions which, when executed by one or more processors, cause the one or more processors to perform the operations of the object detection method.
  • FIG. 6 is a block diagram illustrating an electronic device 600 according to an embodiment of the present application.
  • the electronic device 600 can be a mobile phone, a game controller, a tablet device, medical equipment, exercise equipment, or a personal digital assistant (PDA).
  • the electronic device 600 may include one or a plurality of the following components: a housing 602, a processor 604, a storage 606, a circuit board 608, and a power circuit 610.
  • the circuit board 608 is disposed inside a space defined by the housing 602.
  • the processor 604 and the storage 606 are disposed on the circuit board 608.
  • the power circuit 610 is configured to supply power to each circuit or device of the electronic device 600.
  • the storage 606 is configured to store executable program codes. By reading the executable program codes stored in the storage 606, the processor 604 runs a program corresponding to the executable program codes to execute the object detection method of any one of the afore-mentioned embodiments.
  • the processor 604 typically controls overall operations of the electronic device 600, such as the operations associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processor 604 may include one or more processors to execute instructions to perform all or part of the steps in the above-described methods.
  • the processor 604 may include one or more modules which facilitate the interaction between the processor 604 and other components.
  • the processor 604 may include a multimedia module to facilitate the interaction between the multimedia component and the processor 604.
  • the storage 606 is configured to store various types of data to support the operation of the electronic device 600. Examples of such data include instructions for any application or method operated on the electronic device 600, contact data, phonebook data, messages, pictures, video, etc.
  • the storage 606 may be implemented using any type of volatile or non-volatile memory devices, or a combination thereof, such as a static random access memory (SRAM) , an electrically erasable programmable read-only memory (EEPROM) , an erasable programmable read-only memory (EPROM) , a programmable read-only memory (PROM) , a read-only memory (ROM) , a magnetic memory, a flash memory, a magnetic or optical disk.
  • the power circuit 610 supplies power to various components of the electronic device 600.
  • the power circuit 610 may include a power management system, one or more power sources, and any other component associated with generation, management, and distribution of power for the electronic device 600.
  • the electronic device 600 may be implemented by one or more application specific integrated circuits (ASICs) , digital signal processors (DSPs) , digital signal processing devices (DSPDs) , programmable logic devices (PLDs) , field programmable gate arrays (FPGAs) , controllers, micro-controllers, microprocessors, or other electronic components, for performing the above-described methods.
  • a non-transitory computer-readable storage medium including instructions, such as those included in the storage 606, is executable by the processor 604 of the electronic device 600 for performing the above-described methods.
  • the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM) , a CD-ROM, a magnetic tape, a floppy disc, an optical data storage device, and the like.
  • The modules described as separate components for explanation may or may not be physically separated.
  • The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed over a plurality of network modules. Some or all of the modules are used according to the purposes of the embodiments.
  • Each of the functional modules in each of the embodiments can be integrated into one processing module, be physically independent, or be integrated into one processing module together with two or more other modules.
  • If the software function module is realized, used and sold as a product, it can be stored in a computer-readable storage medium.
  • The technical solution proposed by the present application can be realized, essentially or partially, in the form of a software product.
  • The part of the technical solution that is beneficial over the conventional technology can be realized in the form of a software product.
  • The software product is stored in a storage medium and includes a plurality of commands for a computational device (such as a personal computer, a server, or a network device) to run all or some of the steps disclosed by the embodiments of the present application.
  • the storage medium includes a USB disk, a mobile hard disk, a read-only memory (ROM) , a random access memory (RAM) , a floppy disk, or other kinds of media capable of storing program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

An object detection method, system and storage medium are proposed. The method includes creating object representation data based on outputs from a neural network and an AR framework, building a semantic map comprising object superpoints, determining a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point, updating the semantic map in response to the incoming object point, and modifying a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map. The semantic map consists of object superpoints with a list of scores corresponding to detected labels, a list of view directions and a list of scales. By modifying a probability of an object label or category from the neural network based on the semantic map, object detection accuracy is enhanced.

Description

OBJECT DETECTION METHOD, SYSTEM AND COMPUTER-READABLE MEDIUM
BACKGROUND OF DISCLOSURE
1. Field of Disclosure
The present application relates to image processing technologies, and more particularly, to an object detection method, system and computer-readable medium.
2. Description of Related Art
Object detection could play an important role in Augmented Reality (AR) . The awareness/understanding of objects in the real-world scene can enable a lot of applications for AR. For example, one can change the appearance of real objects by adjusting virtual overlay accordingly. One can also place virtual objects into the scene by certain rules of association (e.g., a matching virtual chair displayed near a real table) . For e-commerce applications, one can also recommend related merchandise based on the understanding of the scene.
Object detection on an image or a video has been realized by Deep Neural Networks (DNN(s)) including SSD, YOLO and Faster R-CNN, etc. Furthermore, with the advancement of mobile integrated chips and dedicated DNN(s) solutions for mobile devices, there are more and more DNN(s) that can run on smartphones in real time. However, DNN(s) for object detection require a huge amount of training data to handle variance in scale and viewpoint. Besides, typical 2D object detection DNN(s) on mobile devices usually have relatively low mAP (mean Average Precision), e.g., MobileNetV2_SSDLite achieves 22.1% mAP, which indicates that there are many false positives during inference. With geometry information either from a CAD model or a depth camera, the accuracy can be improved. But it is not easy to obtain CAD models for many real-world objects, and the depth cameras on mobile devices are energy consuming, expensive, and limited in their operating range (<5 m).
On the other hand, mobile AR frameworks have become mainstream, e.g., Apple Inc.’s ARKit and Google Inc.’s ARCore. Such AR frameworks employ a SLAM algorithm, more specifically a VIO algorithm, to track the 6-Degree-of-Freedom (DoF) camera pose (i.e., position and orientation). Sparse 3D point cloud data is also provided by such frameworks, which can reconstruct 3D points more than 50 meters from the camera.
SUMMARY
An object of the present application is to propose an object detection method, system and computer-readable medium to use a semantic map to improve accuracy of object detection.
In a first aspect of the present application, an object detection method includes:
creating, by a processor, object representation data based on outputs from a neural network and an augmented reality (AR) framework, wherein the object representation data includes object label information of an object identified by the neural network on an image and three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework;
building, by a processor, a semantic map including object superpoints, wherein each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of  the object point;
determining, by a processor, a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point, wherein the certain distance is determined based on a scale of a category of the object identified by the neural network and the incoming object point belongs to the category;
updating, by a processor, the semantic map in response to the incoming object point, wherein the scores of the object superpoints in the determined set in the semantic map are updated based on information from the incoming object point; and
modifying, by a processor, a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map.
According to an embodiment in conjunction with the first aspect of the present application, in the building the semantic map, the semantic map is built from the object points whose projections on the image are within a bounding area of the object identified by the neural network.
According to an embodiment in conjunction with the first aspect of the present application, a median point of the object points whose projections on the image are within the bounding area of the object is used to construct the object superpoint of the semantic map.
According to an embodiment in conjunction with the first aspect of the present application, in the determining the set of the object superpoints in the semantic map whose locations are within the certain distance of the incoming object point, the certain distance is a maximum scale of the category to which the incoming object point belongs.
According to an embodiment in conjunction with the first aspect of the present application, the updating the semantic map in response to the incoming object point includes:
computing the score of the incoming object point in consideration of a comparison between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison between the scale information of the incoming object point and historical scale information of the object superpoints in the set, wherein the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs.
According to an embodiment in conjunction with the first aspect of the present application, in the computing the score of the incoming object point, the score of the incoming object point is computed based on a first weight and a second weight, wherein the first weight is associated with a minimum angular difference between a viewpoint corresponding to the incoming object point and all the viewpoints corresponding to the object superpoints in the set, and the second weight is associated with a minimum scale difference between a scale corresponding to the incoming object point and all the scales corresponding to the object superpoints in the set.
According to an embodiment in conjunction with the first aspect of the present application, the first weight is set to be a first number if the minimum angular difference is less than a first predetermined degree and the first weight is set to be a second number if the minimum angular difference is greater than a second predetermined degree, and wherein the first number is less than the second number and the first predetermined  degree is less than the second predetermined degree.
According to an embodiment in conjunction with the first aspect of the present application, the second weight is proportional to the minimum scale difference if the minimum scale difference is within a predetermined range, and the second weight is set to a fixed number if the minimum scale difference exceeds the predetermined range.
According to an embodiment in conjunction with the first aspect of the present application, the score of the incoming object point increases as the minimum angular difference and/or the minimum scale difference increases; the score of the incoming object point decreases as the minimum angular difference and/or the minimum scale difference decreases, and wherein an increase of the minimum angular difference and/or the minimum scale difference indicates a chance to use the probability of the category of the incoming object point obtained from the neural network increases; a decrease of the minimum angular difference and/or the minimum scale difference indicates a chance to use the probability of the category of the incoming object point obtained from the neural network decreases.
According to an embodiment in conjunction with the first aspect of the present application, the updating the semantic map in response to the incoming object point includes:
updating the scores of the object superpoints in the set for the category of the object superpoints in the set that is identical to the category of the incoming object point identified by the neural network by utilizing the score of the incoming object point, wherein for the object superpoints in the set that are of the category identical to the category of the incoming object point, the object superpoints in the set get an extra score if the object superpoints in the set fall within a minimum scale of the category to which the incoming object point belongs.
According to an embodiment in conjunction with the first aspect of the present application, the updating the semantic map in response to the incoming object point includes:
updating the historical data of the viewpoint information of any one of the object superpoints in the set if a minimum angular difference between a viewpoint corresponding to the one of the object superpoints in the set and all the viewpoints corresponding to the object superpoints in the set is greater than a predetermined degree; and/or updating the historical data of the scale information of any one of the object superpoints in the set if a minimum scale difference between a scale corresponding to the one of the object superpoints in the set and all the scales corresponding to the object superpoints in the set exceeds a predetermined value.
According to an embodiment in conjunction with the first aspect of the present application, the updating the semantic map in response to the incoming object point includes:
initializing the historical data of the incoming object point if no object superpoint is within a minimum scale of the category to which the incoming object point belongs, wherein only current score, viewpoint information and scale information of the incoming object point are recorded on the semantic map for the incoming object point.
According to an embodiment in conjunction with the first aspect of the present application, the modifying the probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map includes:
modifying the probability of the category to which the incoming object point belongs by using the  updated scores of the object superpoints in the set.
According to an embodiment in conjunction with the first aspect of the present application, the probability of the category to which the incoming object point belongs is modified in consideration of a maximum score of all the object superpoints in the set with a category as the same as the category of the incoming object point and a maximum score of all the object superpoints in the set with any other category.
According to an embodiment in conjunction with the first aspect of the present application, the probability of the category to which the incoming object point belongs is modified based on a sigmoid function.
In a second aspect of the present application, an object detection system includes:
at least one memory configured to store program instructions; and
at least one processor configured to execute the program instructions, which cause the at least one processor to perform steps including:
creating object representation data based on outputs from a neural network and an augmented reality (AR) framework, wherein the object representation data includes object label information of an object identified by the neural network on an image and three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework;
building a semantic map including object superpoints, wherein each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of the object point;
determining a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point, wherein the certain distance is determined based on a scale of a category of the object identified by the neural network and the incoming object point belongs to the category;
updating the semantic map in response to the incoming object point, wherein the scores of the object superpoints in the determined set in the semantic map are updated based on information from the incoming object point; and
modifying a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map.
According to an embodiment in conjunction with the second aspect of the present application, the updating the semantic map in response to the incoming object point includes:
computing the score of the incoming object point in consideration of a comparison between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison between the scale information of the incoming object point and historical scale information of the object superpoints in the set, wherein the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs.
According to an embodiment in conjunction with the second aspect of the present application, the updating the semantic map in response to the incoming object point includes:
updating the historical data of the viewpoint information of any one of the object superpoints in the set if a minimum angular difference between a viewpoint corresponding to the one of the object superpoints in the set and all the viewpoints corresponding to the object superpoints in the set is greater than a predetermined degree; and/or updating the historical data of the scale information of any one of the object superpoints in the set if a minimum scale difference between a scale corresponding to the one of the object superpoints in the set and all the scales corresponding to the object superpoints in the set exceeds a predetermined value.
According to an embodiment in conjunction with the second aspect of the present application, the modifying the probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map includes:
modifying the probability of the category to which the incoming object point belongs by using the updated scores of the object superpoints in the set, wherein the probability of the category to which the incoming object point belongs is modified in consideration of a maximum score of all the object superpoints in the set with the same category as the incoming object point and a maximum score of all the object superpoints in the set with any other category.
In a third aspect of the present application, a non-transitory computer-readable medium has program instructions stored thereon that, when executed by at least one processor, cause the at least one processor to perform any of the above-described object detection methods.
In the present application, a semantic map is used to improve the accuracy of object detection. Semantic points in the map are generated by combining object detection results from a neural network with pose data and three-dimensional point results from an AR framework. The semantic map consists of object superpoints, each carrying a list of scores corresponding to detected labels, a list of view directions and a list of scales. The probabilities output by the neural network are modified based on these lists. By modifying the probability of an object label or category, object detection accuracy is enhanced.
BRIEF DESCRIPTION OF DRAWINGS
In order to more clearly illustrate the embodiments of the present application or the related art, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present application, and a person having ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram illustrating the architecture of object detection according to the present application.
FIG. 2 is a flowchart of an object detection method according to the present application.
FIG. 3 is a flowchart of a semantic map updating process according to the present application.
FIG. 4 is a block diagram illustrating an object detection system according to the present application.
FIG. 5 is a block diagram illustrating an updating module of an object detection system according to the present application.
FIG. 6 is a block diagram illustrating an electronic device for implementing an object detection method according to the present application.
DETAILED DESCRIPTION OF EMBODIMENTS
Embodiments of the present application are described in detail below, with their technical matters, structural features, achieved objects, and effects, with reference to the accompanying drawings. Specifically, the terminologies in the embodiments of the present application are merely for describing certain embodiments and are not intended to limit the present application.
Terminologies used in the present application are defined as follows.
[Table 1 — terminology definitions, rendered as an image in the source document]
The following descriptions are illustrated using DNN(s) and AR frameworks as examples; however, other neural networks and similar augmented reality technologies can be applied in the present application. The present application is not intended to be limited to any of the illustrated examples.
The present application uses a three-dimensional (3D) semantic map to improve the accuracy of 2D object detection DNN(s). The semantic point clouds are generated by combining object detection results from the DNN(s) with pose and 3D point results from an AR framework. The semantic map consists of 3D superpoints, each carrying a list of scores corresponding to detected labels, a list of view directions and a list of scales. The probabilities from the DNN(s) are updated or modified based on these lists. By modifying the probability of an object label or category, object detection accuracy is enhanced.
In some embodiments of the present application, a score measure for estimated object points considers not only how many times a certain label has been detected at a certain location, but also the detection view directions and detection scales. This approach performs better than ordinary DNN(s) in AR scenarios where the viewpoint is constantly changing. It can decrease the probability of false positives when an object is recognized as a category that has never been seen from similar view directions recently, while another category has been seen many times at the same location. For example, this approach can correct false positives where a bed has been detected as a couch in the current frame, but the semantic map shows that a bed had been detected consistently at the same location in previous frames. In another example, when the object detection DNN(s) output a relatively low probability for an object label, but it is known from the semantic map that this object has been detected at this location from very different directions a while ago, this approach will increase the probability of that category. This approach increases the accuracy of 2D object detection with an AR framework without any additional training data to handle the scale and viewpoint variance of the task. This enables many AR applications. For example, it can assign semantic labels to a 3D point cloud, which can trigger corresponding virtual contents for the users.
FIG. 1 is a schematic diagram illustrating the architecture of object detection according to the present application. The architecture of object detection of the present application is described as follows.
In a typical AR session, when a device moves around in a scene and, after an arbitrary amount of time, comes back to a previously visited location, the 6DoF pose (location and orientation) will remain approximately the same as the previously recorded value if the device is tracked continuously and successfully. That is, an AR framework has the capability to remember places. However, tracking is sometimes lost when the SLAM (VIO) of the AR framework fails due to a lack of enough features being matched between the current frame and recent previous frames. Re-localization can be used in this situation to re-estimate the device pose with respect to the map by matching the features in the input image with stored features of previously seen images. Another problem that can arise during VIO is drift of the device trajectory, accumulated over time, away from the true trajectory. To handle this, intermittent re-localization is typically performed to detect revisited places in order to close the loop. This capability of the AR framework to memorize and recognize previously visited places can be incorporated in developing an object detection method and system.
As shown in FIG. 1, a 3D semantic map is built using the output from both the AR framework and the DNN(s). Object detection faces many challenges, such as variance in scale and viewpoint: an object detector must detect objects at different scales on the images and from different viewpoints. This challenge is addressed by using a category scale database to verify whether the detected object's category agrees with the scale estimated from the AR framework. For example, an airplane should not appear in a 5 m × 5 m space. The viewpoints generated by the AR framework when an object is detected by the DNN(s) are stored. In some embodiments, consistent detections of the same object from different view directions and/or at different scales are favored. In some embodiments, a probabilistic model is used to insert and update object category, viewpoint, and scale information in the 3D semantic map. Information from the 3D semantic map is extracted to update the object label probability from the DNN(s), as shown in FIG. 1.
FIG. 2 is a flowchart of an object detection method according to the present application. The object detection method is described in detail below.
In Step S200, creating object representation data based on outputs from a neural network and an AR framework.
In this step, outputs from a neural network (e.g., a DNN) and an AR framework (e.g., Apple Inc.'s ARKit or Google Inc.'s ARCore, which employ a SLAM algorithm, more specifically a VIO algorithm) are used to create object representation data. The object representation data includes object label information (e.g., a chair label shown in FIG. 1) of an object (e.g., a chair) identified by the neural network on an image, and further includes the three-dimensional location of an object point as well as viewpoint information and scale information of the object point from the AR framework. 2D object points of the object (e.g., the chair) on the image have corresponding 3D object points estimated from the AR framework; the mapping of the 3D object points onto the image results in the 2D object points.
In an illustrated example, for each frame, the DNN(s) may output a list of N object categories with associated bounding boxes and probabilities. For each object output, an object representation data structure is created as (loc, label, view, scale), where loc is the 3D coordinates of an estimated object point in the current frame, label is the object label from the DNN(s), view is the view direction (or viewpoint) from the camera to loc, and scale is the scale information that depends on the distance from the camera position to loc. For each frame, the AR framework generates the 6DoF pose of the camera and a set of sparse 3D points with global 3D coordinates. These are used to compute loc, view, and scale.
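For illustration only, this per-frame object representation might be assembled as in the following Python sketch; the class name, field names and helper function are hypothetical, not part of any DNN or AR framework API, and the scale convention round(log_2 d) follows the example given further below.

import numpy as np
from dataclasses import dataclass

@dataclass
class ObjectPoint:
    """Per-frame estimated object representation (loc, label, view, scale)."""
    loc: np.ndarray    # 3D world coordinates of the estimated object point
    label: str         # object label output by the DNN, e.g. "Chair"
    view: np.ndarray   # unit view direction from the camera position to loc
    scale: int         # round(log2(d)), where d is the camera-to-loc distance
    prob: float        # DNN probability p_l for this label

def make_object_point(cam_pos, loc, label, prob):
    d = float(np.linalg.norm(loc - cam_pos))
    view = (loc - cam_pos) / d           # normalized unit vector
    scale = int(round(np.log2(d)))       # e.g. d = 1 m gives scale 0
    return ObjectPoint(loc, label, view, scale, prob)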
In Step S202, building a semantic map including object superpoints.
In this step, each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of the object point.
In an illustrated example, different from the estimated object point data structure (loc, label, view, scale), the object superpoints are represented as (loc, list_score, list_view, list_scale). The three lists encode information from all previous frames in the AR session. 1) list_score (E_1, E_2, E_3, …, E_l, …) stores the list of scores E_l for each label l that has been detected at this point; the higher the score, the higher the probability that this point is of category l. 2) list_view (v_1, v_2, v_3, …) stores the list of historical view directions (or viewpoints) from camera positions to the point when an object is detected. 3) list_scale (s_1, s_2, s_3, …) stores the list of historical scales from camera positions to the point when an object is detected at the point. A point might be labelled as different categories during the AR session at different time instances.
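The superpoint record itself might be sketched as follows, again with hypothetical names; list_score is held as a mapping from label l to its accumulated score E_l, which is equivalent to the list described above:

import numpy as np
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class SuperPoint:
    """Semantic-map superpoint (loc, list_score, list_view, list_scale)."""
    loc: np.ndarray
    # label l -> accumulated score E_l; a higher score means a higher
    # probability that this point is of category l
    list_score: dict = field(default_factory=lambda: defaultdict(float))
    list_view: list = field(default_factory=list)    # historical unit view directions
    list_scale: list = field(default_factory=list)   # historical integer scales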
In an embodiment, in the building of the semantic map, the semantic map is built from the object points whose projections on the image are within a bounding area of the object identified by the neural network. That is, only the 3D object points that map to or fall within the bounding area (e.g., a bounding box) of the object identified by the neural network are of interest in building the semantic map. One reason is to reduce the amount of computation; another is to obtain more accurate results, since only the relevant information participates in the calculation.
In an embodiment, a median point of the object points whose projections on the image are within the bounding area (e.g., a bounding box) of the object is used to construct the object superpoint of the semantic map. It is ensured that the median point falls within the bounding area of the object on the image. This also further reduces the amount of computation.
In an illustrated example, to put the detected objects from the DNN(s) onto the 3D semantic map, some form of statistics of all the 3D points whose projections on the image are within the 2D bounding box of a detected object label may be computed. In one embodiment, the median of each of the XYZ dimensions over all such points is used to represent the object in the current view. In this way, assigning object labels to irrelevant points on other objects or the background in the semantic map is avoided. This may make the approach more robust and efficient. For example, the AR framework estimates the pose of the camera and reconstructs a few 3D points, shown in FIG. 1 as circular points (see the right side of FIG. 1). The median, p = (2.8733, 1.09483, 1.2345), of those circular points that are within the 2D bounding box of "chair" is used to represent the estimated object point loc. A view direction v = (0.61497, 0.76871, 0.17458) is computed from the position of the camera to loc; v is a normalized unit vector. The scale information s is an integer defined as log_2(d) rounded to the nearest integer, where d is the distance from the camera position to loc. For example, when the camera is 1 meter away from loc, s = 0. Therefore, for the detected chair, the data structure (p, "Chair", v, s) is created to represent the detected object.
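The median-based construction just described might look like the following sketch; project stands in for the AR framework's camera projection and is an assumed callable, not a real API:

import numpy as np

def estimate_object_point(points_3d, bbox, project, cam_pos):
    """Return (loc, view, scale) from the sparse 3D points whose projections
    fall inside the 2D bounding box of a detected object."""
    xmin, ymin, xmax, ymax = bbox
    inside = [p for p in points_3d
              if xmin <= project(p)[0] <= xmax and ymin <= project(p)[1] <= ymax]
    if not inside:
        return None                               # no 3D evidence for this detection
    loc = np.median(np.asarray(inside), axis=0)   # per-axis (XYZ) median
    d = float(np.linalg.norm(loc - cam_pos))
    view = (loc - cam_pos) / d                    # unit view direction
    scale = int(round(np.log2(d)))                # scale bucket, d = 1 m -> 0
    return loc, view, scale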
In Step S204, determining a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point.
The viewpoint changes during an AR session. For the same object, new object points may be generated from different viewpoints. In addition, different labels or categories may be given to the same object. In order to fuse an incoming object point with the object superpoints in the 3D semantic map, a set of the object superpoints in the semantic map whose locations are within a certain distance of the incoming object point is determined. More specifically, the certain distance is determined based on a scale of the category of the object (e.g., a chair category) identified by the neural network, and the incoming object point belongs to that category. The scale of the object category or label may be retrieved from a category scale database, as shown in FIG. 1.
In an embodiment, in the determining of the set of the object superpoints in the semantic map whose locations are within the certain distance of the incoming object point, the certain distance is a maximum scale of the category to which the incoming object point belongs. For example, the scale of a chair category ranges from 0.5 m to 1.5 m; the maximum scale of the chair category would then be 1.5 m.
In an illustrated example, from different viewpoints during an AR session, different estimated object points might be generated for the same object, because each image only sees a partial surface of the object. Moreover, the same object might be given different labels during the session. To fuse each incoming estimated object representation (loc_in, label_in, view_in, scale_in) with the superpoints (loc, list_score, list_view, list_scale) in the semantic map, a set of superpoints in the map whose loc are within a certain distance of the incoming object point loc_in is located. To do this, for each category, a minimum scale and a maximum scale are defined (e.g., 0.5 m to 1.5 m for the scale of a chair category). Then any superpoints within the maximum scale of the incoming object category are added to the set S_in for processing.
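Gathering the set S_in might be sketched as follows; the category scale database is represented by a hypothetical dictionary in which only the chair range is taken from the example above, the other entries being placeholders:

import numpy as np

# Hypothetical category scale database: category -> (min_scale, max_scale) in meters.
CATEGORY_SCALES = {"Chair": (0.5, 1.5), "Couch": (1.0, 2.5), "Bed": (1.4, 2.2)}

def find_nearby_superpoints(semantic_map, loc_in, label_in):
    """Return S_in: all superpoints within the maximum scale of the incoming category."""
    _, max_scale = CATEGORY_SCALES[label_in]
    return [sp for sp in semantic_map
            if np.linalg.norm(sp.loc - loc_in) <= max_scale]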
In Step S206, updating the semantic map in response to the incoming object point.
In response to an incoming object point during an AR session, the semantic map is updated. The updated semantic map is used in subsequent processes to modify the probability of the category to which the incoming object point belongs, as identified by the neural network, thereby facilitating object detection. In the updating of the semantic map, the scores of the object superpoints in the determined set in the semantic map are updated based on information from the incoming object point. That is, the information from the incoming object point participates in building the historical scores of the set of object superpoints in the semantic map.
FIG. 3 is a flowchart of a semantic map updating process according to the present application. The updating the semantic map in Step S206 may include the following steps, i.e., Steps 300 to 306.
In Step 300, computing the score of the incoming object point.
In an illustrated example, for each incoming object point of category l, a score E_l^in is computed. Two weights w_v and w_s are defined, where 0 ≤ w_v, w_s ≤ 1, and E_l^in is computed from the probability p_l of category l output by the DNN(s), weighted by w_v and w_s (the exact expression is rendered as an image in the source). The weights w_v and w_s are calculated as follows: w_v grows with the angular difference v_diff, with w_v set to zero if v_diff is within 45 degrees and w_v capped at 1 if v_diff is larger than 90 degrees; and w_s = k_s · s_diff if s_diff < 1/k_s, otherwise w_s = 1.
Here v_diff is the minimum angular difference between the current view direction (or viewpoint) and all the view directions in the list_view of all the points in S_in. The higher v_diff is, the higher the weight w_v. Setting w_v to zero when v_diff is within 45 degrees ensures that the semantic map is only updated intermittently.
Similarly, s_diff is the minimum scale difference between the current scale and all the scales in the list_scale of all the points in S_in. The higher s_diff is, the higher the weight w_s. k_s is the factor that normalizes s_diff. In an example, the scale is defined as s = log_2(d), where d is the distance between the camera position and the superpoint. A range must be defined for the scale difference s_diff within which the same object can still be detected; in this example, the range may be selected empirically as [0, 5], so k_s is set to 1/5 = 0.2.
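The score computation might be sketched as follows. The zero-below-45° behaviour and the 1-cap-above-90° behaviour of w_v, the k_s·s_diff form of w_s and the value k_s = 0.2 come from the text; the linear ramp between 45° and 90°, the fallback values for empty histories and the final combination of p_l with the two weights are assumptions, since the exact expressions are rendered as images in the source:

import numpy as np

def min_angular_diff_deg(view_in, S_in):
    """v_diff: minimum angle (degrees) between the incoming view direction and
    all historical view directions of all superpoints in S_in."""
    angles = [np.degrees(np.arccos(np.clip(np.dot(view_in, v), -1.0, 1.0)))
              for sp in S_in for v in sp.list_view]
    return min(angles) if angles else 180.0   # assumed: empty history counts as novel

def min_scale_diff(scale_in, S_in):
    """s_diff: minimum difference between the incoming scale and all historical scales."""
    diffs = [abs(scale_in - s) for sp in S_in for s in sp.list_scale]
    return min(diffs) if diffs else 5.0       # assumed: empty history counts as novel

def incoming_score(p_l, v_diff, s_diff, k_s=0.2):
    """E_l^in for the incoming object point of category l."""
    if v_diff < 45.0:
        w_v = 0.0                       # update the map only intermittently
    elif v_diff > 90.0:
        w_v = 1.0                       # capped at 1
    else:
        w_v = (v_diff - 45.0) / 45.0    # assumed linear ramp between the thresholds
    w_s = min(k_s * s_diff, 1.0)        # w_s = k_s * s_diff, capped at 1
    return p_l * 0.5 * (w_v + w_s)      # assumed combination of p_l with the weights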
In an embodiment, the computing of the score of the incoming object point may include computing the score in consideration of a comparison (e.g., v_diff) between the viewpoint information of the incoming object point and the historical viewpoint information of the object superpoints in the set, and/or a comparison (e.g., s_diff) between the scale information of the incoming object point and the historical scale information of the object superpoints in the set. That is, these comparisons can be used to estimate by how much the camera pose has changed. More specifically, the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs. A large change in camera pose should lead to a high score for the incoming object point, since in this circumstance it is better to assign the new label identified by the neural network to the incoming object point; a small change in camera pose should lead to a low score, since in this circumstance it is better to keep an already-detected label for the incoming object point.
In an embodiment, in the computing of the score of the incoming object point, the score is computed based on a first weight and a second weight, in which the first weight is associated with a minimum angular difference between the viewpoint corresponding to the incoming object point and all the viewpoints corresponding to the object superpoints in the set, and the second weight is associated with a minimum scale difference between the scale corresponding to the incoming object point and all the scales corresponding to the object superpoints in the set. As illustrated in the above example, the first weight and the second weight may be w_v and w_s, respectively. The minimum angular difference v_diff is used to determine the first weight, and the minimum scale difference s_diff is used to determine the second weight.
In an embodiment, the first weight is set to a first number if the minimum angular difference is less than a first predetermined degree, and the first weight is set to a second number if the minimum angular difference is greater than a second predetermined degree, where the first number is less than the second number and the first predetermined degree is less than the second predetermined degree. As illustrated in the above example, the first weight may be w_v: w_v is set to 0 if the minimum angular difference v_diff is less than 45 degrees and set to 1 if v_diff is greater than 90 degrees.
In an embodiment, the second weight is proportional to the minimum scale difference if the minimum scale difference is within a predetermined range, and the second weight is set to a fixed number if the minimum scale difference exceeds the predetermined range. As illustrated in the above example, the second weight may be w_s: w_s is proportional to the minimum scale difference s_diff if s_diff is within 1/k_s, and w_s is set to 1 if s_diff exceeds 1/k_s.
In an embodiment, the score of the incoming object point increases as the minimum angular difference and/or the minimum scale difference increases (e.g., E_l^in increases as v_diff and/or s_diff increases), and the score decreases as the minimum angular difference and/or the minimum scale difference decreases (e.g., E_l^in decreases as v_diff and/or s_diff decreases). An increase of the minimum angular difference and/or the minimum scale difference indicates that the chance of using the probability of the category of the incoming object point obtained from the neural network increases; that is, a large change in camera pose should lead to a high score for the incoming object point, since in this circumstance it is better to assign the new label identified by the neural network to the incoming object point. A decrease of the minimum angular difference and/or the minimum scale difference indicates that this chance decreases; that is, a small change in camera pose should lead to a low score, since in this circumstance it is better to keep an already-detected label for the incoming object point.
In Step 302, updating the scores of the object superpoints in the set by utilizing the score of the incoming object point.
In an illustrated example, given an incoming object representation with score E_l^in, the detection score of each nearby superpoint n ∈ S_in is updated for category l. Given the distance D between the incoming point and n, the score E_l(n) of the superpoint is updated as follows:
E_l(n) ← E_l(n) + E_l^in + 1, if D is within the minimum scale of the category;
E_l(n) ← E_l(n) + E_l^in, otherwise.
Here, for any superpoint within the neighborhood defined by the minimum scale of the incoming object point, an extra 1 is added to the score of the detected label as a reward for being inside the smallest estimated object scale of its category.
In an embodiment, in the updating of the scores of the object superpoints in the set, the scores are updated for the category of the object superpoints in the set that is identical to the category of the incoming object point identified by the neural network, by utilizing the score of the incoming object point. More specifically, the distance from each object superpoint in the set to the incoming object point is considered. For the object superpoints falling between the maximum scale and the minimum scale of the category of the incoming object point, their scores are updated by adding the score of the incoming object point to their original scores. Object superpoints of the same category as the incoming object point get an extra score (e.g., 1) if they fall within the minimum scale of the category to which the incoming object point belongs. This takes into consideration the number of times a certain label has been detected at a certain location.
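This score update might be sketched as follows, reusing the hypothetical SuperPoint record and CATEGORY_SCALES dictionary from the earlier sketches:

import numpy as np

def update_scores(S_in, loc_in, label_in, e_in):
    """Add the incoming score E_l^in to each nearby superpoint's score for the
    detected label, with an extra +1 reward inside the category's minimum scale."""
    min_scale, _ = CATEGORY_SCALES[label_in]
    for sp in S_in:
        sp.list_score[label_in] += e_in
        if np.linalg.norm(sp.loc - loc_in) <= min_scale:
            sp.list_score[label_in] += 1.0   # reward for being inside the smallest object scale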
In Step 304, updating the historical data of viewpoint information and/or scale information.
In an illustrated example, the view direction v is added into list_view if v_diff ≥ 45°, and the scale s is added into list_scale if s_diff ≥ 1, for all points within the set S_in regardless of their label.
In an embodiment, the historical data of the viewpoint information of any one of the object superpoints in the set is updated if a minimum angular difference (e.g., v_diff) between the viewpoint corresponding to the one of the object superpoints in the set and all the viewpoints corresponding to the object superpoints in the set is greater than a predetermined degree (e.g., v_diff ≥ 45°); and/or the historical data of the scale information of any one of the object superpoints in the set is updated if a minimum scale difference (e.g., s_diff) between the scale corresponding to the one of the object superpoints in the set and all the scales corresponding to the object superpoints in the set exceeds a predetermined value (e.g., s_diff ≥ 1).
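The corresponding history update might be sketched as follows, appending the new view direction and scale to every superpoint in S_in regardless of label, using the thresholds above:

def update_history(S_in, view_in, scale_in, v_diff, s_diff):
    """Append the incoming view/scale to the histories only when the change in
    pose is large enough (v_diff >= 45 degrees, s_diff >= 1)."""
    for sp in S_in:
        if v_diff >= 45.0:
            sp.list_view.append(view_in)
        if s_diff >= 1.0:
            sp.list_scale.append(scale_in)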
In Step 306, initializing the historical data of the incoming object point under certain circumstances.
In an illustrated example, finally, if there is no superpoint within the minimum scale distance of the incoming object point, all three lists list_score, list_view and list_scale are initialized with the corresponding values E_l^in, v and s, i.e., list_score = (E_l^in), list_view = (v) and list_scale = (s). Then the new superpoint (loc, list_score, list_view, list_scale) is added into the semantic map.
In an embodiment, the historical data of the incoming object point is initialized if no object superpoint is within the minimum scale of the category to which the incoming object point belongs. The initialization means that only the current score, viewpoint information and scale information of the incoming object point are recorded on the semantic map for the incoming object point, and previous or historical scores, viewpoint information and scale information are initialized to zero or deleted.
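Initializing a brand-new superpoint, under the same hypothetical data structures, might look like this:

def insert_new_superpoint(semantic_map, loc_in, label_in, e_in, view_in, scale_in):
    """When no superpoint lies within the category's minimum scale of the incoming
    point, start a new superpoint whose history holds only the current detection."""
    sp = SuperPoint(loc=loc_in)
    sp.list_score[label_in] = e_in
    sp.list_view.append(view_in)
    sp.list_scale.append(scale_in)
    semantic_map.append(sp)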
In Step S208, modifying a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map.
In an illustrated example, for each incoming object point with label l, its label probability p_l can be updated using the superpoints stored in a neighborhood of the semantic map. First, the maximum score E_l^max among all object superpoints with label l and the maximum score E_other^max among all object superpoints with any other label are located. The probability p_map of the semantic map at point p is then defined by a modified sigmoid function of these two maximum scores (the exact expression is rendered as an image in the source).
A minimum value of p_map is defined as 0.5 to make sure it does not decrease the output probability p_l from the DNN(s) dramatically. Therefore, the final probability of the object is:
p = p_map · p_l
For each frame, the final output is a list of bounding boxes, each of which has the output (l, p, bbox), where the label and bounding box are the same as the output from the DNN(s).
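The final probability modification might be sketched as follows; the logistic form used for the modified sigmoid is an assumption (the exact expression is rendered as an image in the source), while the 0.5 floor and the product p = p_map · p_l follow the text:

import math

def modify_probability(S_in, label_in, p_l):
    """Rescale the DNN probability p_l using the neighborhood superpoints."""
    e_same = max((sp.list_score.get(label_in, 0.0) for sp in S_in), default=0.0)
    e_other = max((s for sp in S_in for lbl, s in sp.list_score.items()
                   if lbl != label_in), default=0.0)
    p_map = 1.0 / (1.0 + math.exp(e_other - e_same))   # assumed modified sigmoid
    p_map = max(p_map, 0.5)   # floor so the map never reduces p_l dramatically
    return p_map * p_l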
In an embodiment, the probability of the category to which the incoming object point belongs is modified by using the updated scores of the object superpoints in the set. In an embodiment, the probability is modified in consideration of a maximum score of all the object superpoints in the set with the same category as the incoming object point and a maximum score of all the object superpoints in the set with any other category.
FIG. 4 is a block diagram illustrating an object detection system according to the present application. As illustrated in FIG. 4, an object detection system 40 is provided. The object detection system 40 includes a creating module 400, a building module 402, a determining module 404, an updating module 406 and a modifying module 408.
The creating module 400 is configured to create object representation data based on outputs from a neural network and an augmented reality (AR) framework, wherein the object representation data includes object label information of an object identified by the neural network on an image and three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework.
The building module 402 is configured to build a semantic map including object superpoints, wherein each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of the object point.
The determining module 404 is configured to determine a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point, wherein the certain distance is determined based on a scale of a category of the object identified by the neural network and the incoming object point belongs to the category.
The updating module 406 is configured to update the semantic map in response to the incoming object point, wherein the scores of the object superpoints in the determined set in the semantic map are updated based on information from the incoming object point.
The modifying module 408 is configured to modify a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map.
FIG. 5 is a block diagram illustrating an updating module of an object detection system according to the present application. As illustrated in FIG. 5, the updating module 406 of the object detection system 40 includes a computing unit 500, a score updating unit 502, a data updating unit 504 and an initializing unit 506.
The computing unit 500 is configured to compute the score of the incoming object point in consideration of a comparison between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison between the scale information of the incoming object point and historical scale information of the object superpoints in the set, wherein the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs.
The score updating unit 502 is configured to update the scores of the object superpoints in the set for the category of the object superpoints in the set that is identical to the category of the incoming object point identified by the neural network by utilizing the score of the incoming object point, wherein for the object superpoints in the set that are of the category identical to the category of the incoming object point, the object superpoints in the set get an extra score if the object superpoints in the set fall within a minimum scale of the category to which the incoming object point belongs.
The data updating unit 504 is configured to update the historical data of the viewpoint information of any one of the object superpoints in the set if a minimum angular difference between a viewpoint corresponding to the one of the object superpoints in the set and all the viewpoints corresponding to the object superpoints in the set is greater than a predetermined degree; and/or to update the historical data of the scale information of any one of the object superpoints in the set if a minimum scale difference between a scale corresponding to the one of the object superpoints in the set and all the scales corresponding to the object superpoints in the set exceeds a predetermined value.
The initializing unit 506 is configured to initialize the historical data of the incoming object point if no object superpoint is within a minimum scale of the category to which the incoming object point belongs, wherein only the current score, viewpoint information and scale information of the incoming object point are recorded on the semantic map for the incoming object point.
For the specific definition of the object detection system, reference is made to the above definition of the object detection method, which will not be repeated herein. All or part of the modules or units in the above-mentioned object detection system may be implemented by software, hardware, and a combination thereof. The foregoing modules or units may be embedded in or independent from a processor of a computer equipment in the form of hardware, or may be stored in a memory of the computer equipment in the form of software, so that the processor can invoke and execute the operations corresponding to the foregoing modules or units.
The modules or units in the object detection system may be implemented by a computer program. The computer program can be run on a terminal or a server. The program module composed of the computer program can be stored in a memory of the terminal or the server. When the computer program is executed by the processor, the operations of the method described in the implementations are realized.
Implementations also provide a non-transitory computer-readable storage medium. One or more non-transitory computer-readable storage media contain computer-executable instructions which, when executed by  one or more processors, cause the one or more processors to perform the operations of the object detection method.
FIG. 6 is a block diagram illustrating an electronic device 600 according to an embodiment of the present application. For example, the electronic device 600 can be a mobile phone, a game controller, a tablet device, a medical device, an exercise device, or a personal digital assistant (PDA).
Referring to FIG. 6, the electronic device 600 may include one or a plurality of the following components: a housing 602, a processor 604, a storage 606, a circuit board 608, and a power circuit 610. The circuit board 608 is disposed inside a space defined by the housing 602. The processor 604 and the storage 606 are disposed on the circuit board 608. The power circuit 610 is configured to supply power to each circuit or device of the electronic device 600. The storage 606 is configured to store executable program codes. By reading the executable program codes stored in the storage 606, the processor 604 runs a program corresponding to the executable program codes to execute the object detection method of any one of the afore-mentioned embodiments.
The processor 604 typically controls the overall operations of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processor 604 may include one or more processors to execute instructions to perform all or part of the steps in the above-described methods. Moreover, the processor 604 may include one or more modules which facilitate the interaction between the processor 604 and other components. For instance, the processor 604 may include a multimedia module to facilitate the interaction between a multimedia component and the processor 604.
The storage 606 is configured to store various types of data to support the operation of the electronic device 600. Examples of such data include instructions for any application or method operated on the electronic device 600, contact data, phonebook data, messages, pictures, video, etc. The storage 606 may be implemented using any type of volatile or non-volatile memory device, or a combination thereof, such as a static random access memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, or a magnetic or optical disk.
The power circuit 610 supplies power to various components of the electronic device 600. The power circuit 610 may include a power management system, one or more power sources, and any other component associated with generation, management, and distribution of power for the electronic device 600.
In exemplary embodiments, the electronic device 600 may be implemented by one or more application specific integrated circuits (ASICs) , digital signal processors (DSPs) , digital signal processing devices (DSPDs) , programmable logic devices (PLDs) , field programmable gate arrays (FPGAs) , controllers, micro-controllers, microprocessors, or other electronic components, for performing the above-described methods.
In exemplary embodiments, there is also provided a non-transitory computer-readable storage medium including instructions, such as included in the storage 606, executable by the processor 604 of the electronic device 600 for performing the above-described methods. For example, the non-transitory computer-readable  storage medium may be a ROM, a random access memory (RAM) , a CD-ROM, a magnetic tape, a floppy disc, an optical data storage device, and the like.
A person having ordinary skill in the art understands that each of the units, modules, algorithms, and steps described and disclosed in the embodiments of the present application can be realized using electronic hardware or a combination of computer software and electronic hardware. Whether the functions run in hardware or software depends on the application conditions and the design requirements of the technical solution. A person having ordinary skill in the art can use different ways to realize the functions for each specific application, and such realizations should not go beyond the scope of the present application.
It is understood by a person having ordinary skill in the art that, since the working processes of the above-mentioned system, device, and modules are basically the same, he or she can refer to the working processes described for the embodiments above. For ease and simplicity of description, these working processes will not be detailed.
It is understood that the system and method disclosed in the embodiments of the present application can be realized in other ways. The above-mentioned embodiments are exemplary only. The division of the modules is merely based on logical functions, and other divisions may exist in realization. It is possible that a plurality of modules or components are combined or integrated into another system. It is also possible that some characteristics are omitted or skipped. On the other hand, the displayed or discussed mutual coupling, direct coupling, or communicative coupling may operate through certain ports, devices, or modules, indirectly or communicatively, in electrical, mechanical, or other forms.
The modules described as separate components may or may not be physically separated. The modules shown may or may not be physical modules; that is, they may be located in one place or distributed across a plurality of network modules. Some or all of the modules are used according to the purposes of the embodiments.
Moreover, each of the functional modules in each of the embodiments can be integrated into one processing module, can exist physically independently, or two or more modules can be integrated into one processing module.
If the functional modules are realized in software and used or sold as a product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution proposed by the present application can be realized, essentially or in part, in the form of a software product, or the part of the technical solution that is beneficial over the conventional technology can be realized in the form of a software product. The software product is stored in a storage medium and includes a plurality of commands for a computational device (such as a personal computer, a server, or a network device) to run all or some of the steps disclosed in the embodiments of the present application. The storage medium includes a USB disk, a mobile hard disk, a read-only memory (ROM), a random access memory (RAM), a floppy disk, or other kinds of media capable of storing program codes.
While the present application has been described in connection with what is considered the most practical and preferred embodiments, it is understood that the present application is not limited to the disclosed embodiments but is intended to cover various arrangements made without departing from the scope of the broadest interpretation of the appended claims.

Claims (20)

  1. An object detection method, comprising:
    creating, by a processor, object representation data based on outputs from a neural network and an augmented reality (AR) framework, wherein the object representation data comprises object label information of an object identified by the neural network on an image and three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework;
    building, by a processor, a semantic map comprising object superpoints, wherein each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of the object point;
    determining, by a processor, a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point, wherein the certain distance is determined based on a scale of a category of the object identified by the neural network and the incoming object point belongs to the category;
    updating, by a processor, the semantic map in response to the incoming object point, wherein the scores of the object superpoints in the determined set in the semantic map are updated based on information from the incoming object point; and
    modifying, by a processor, a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map.
  2. The object detection method according to claim 1, wherein in the building the semantic map, the semantic map is built from the object points whose projections on the image are within a bounding area of the object identified by the neural network.
  3. The object detection method according to claim 2, wherein a median point of the object points whose projections on the image are within the bounding area of the object is used to construct the object superpoint of the semantic map.
  4. The object detection method according to claim 1, wherein in the determining the set of the object superpoints in the semantic map whose locations are within the certain distance of the incoming object point, the certain distance is a maximum scale of the category to which the incoming object point belongs.
  5. The object detection method according to claim 1, wherein the updating the semantic map in response to the incoming object point comprises:
    computing the score of the incoming object point in consideration of a comparison between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison between the scale information of the incoming object point and historical scale information of the object superpoints in the set, wherein the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs.
  6. The object detection method according to claim 5, wherein in the computing the score of the incoming object point, the score of the incoming object point is computed based on a first weight and a second weight, wherein the first weight is associated with a minimum angular difference between a viewpoint corresponding to the incoming object point and all the viewpoints corresponding to the object superpoints in the set, and the second weight is associated with a minimum scale difference between a scale corresponding to the incoming object point and all the scales corresponding to the object superpoints in the set.
  7. The object detection method according to claim 6, wherein the first weight is set to be a first number if the minimum angular difference is less than a first predetermined degree and the first weight is set to be a second number if the minimum angular difference is greater than a second predetermined degree, and wherein the first number is less than the second number and the first predetermined degree is less than the second predetermined degree.
  8. The object detection method according to claim 6, wherein the second weight is proportional to the minimum scale difference if the minimum scale difference is within a predetermined range, and the second weight is set to a fixed number if the minimum scale difference exceeds the predetermined range.
  9. The object detection method according to claim 6, wherein the score of the incoming object point increases as the minimum angular difference and/or the minimum scale difference increases; the score of the incoming object point decreases as the minimum angular difference and/or the minimum scale difference decreases, and wherein an increase of the minimum angular difference and/or the minimum scale difference indicates a chance to use the probability of the category of the incoming object point obtained from the neural network increases; a decrease of the minimum angular difference and/or the minimum scale difference indicates a chance to use the probability of the category of the incoming object point obtained from the neural network decreases.
  10. The object detection method according to claim 1, wherein the updating the semantic map in response to the incoming object point comprises:
    updating the scores of the object superpoints in the set for the category of the object superpoints in the set that is identical to the category of the incoming object point identified by the neural network by utilizing the score of the incoming object point, wherein for the object superpoints in the set that are of the category identical to the category of the incoming object point, the object superpoints in the set get an extra score if the object superpoints in the set fall within a minimum scale of the category to which the incoming object point belongs.
  11. The object detection method according to claim 1, wherein the updating the semantic map in response to the incoming object point comprises:
    updating the historical data of the viewpoint information of any one of the object superpoints in the set if a minimum angular difference between a viewpoint corresponding to the one of the object superpoints in the set and all the viewpoints corresponding to the object superpoints in the set is greater than a predetermined degree; and/or updating the historical data of the scale information of any one of the object superpoints in the set if a minimum scale difference between a scale corresponding to the one of the object superpoints in the set and all the scales corresponding to the object superpoints in the set exceeds a predetermined value.
  12. The object detection method according to claim 1, wherein the updating the semantic map in response to the incoming object point comprises:
    initializing the historical data of the incoming object point if no object superpoint is within a minimum scale of the category to which the incoming object point belongs, wherein only current score, viewpoint information and scale information of the incoming object point are recorded on the semantic map for the incoming object point.
  13. The object detection method according to claim 1, wherein the modifying the probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map comprises:
    modifying the probability of the category to which the incoming object point belongs by using the updated scores of the object superpoints in the set.
  14. The object detection method according to claim 13, wherein the probability of the category to which the incoming object point belongs is modified in consideration of a maximum score of all the object superpoints in the set with the same category as the incoming object point and a maximum score of all the object superpoints in the set with any other category.
  15. The object detection method according to claim 13, wherein the probability of the category to which the incoming object point belongs is modified based on a sigmoid function.
  16. An object detection system, comprising:
    at least one memory configured to store program instructions; and
    at least one processor configured to execute the program instructions, which cause the at least one processor to perform steps comprising:
    creating object representation data based on outputs from a neural network and an augmented reality (AR) framework, wherein the object representation data comprises object label information of an object identified by the neural network on an image and three-dimensional location of an object point and viewpoint information and scale information of the object point from the AR framework;
    building a semantic map comprising object superpoints, wherein each of the object superpoints is represented by historical data of scores, the viewpoint information and the scale information of the object point;
    determining a set of the object superpoints in the semantic map whose locations are within a certain distance of an incoming object point, wherein the certain distance is determined based on a scale of a category of the object identified by the neural network and the incoming object point belongs to the category;
    updating the semantic map in response to the incoming object point, wherein the scores of the object superpoints in the determined set in the semantic map are updated based on information from the incoming object point; and
    modifying a probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map.
  17. The object detection system according to claim 16, wherein the updating the semantic map in response to the incoming object point comprises:
    computing the score of the incoming object point in consideration of a comparison between the viewpoint information of the incoming object point and historical viewpoint information of the object superpoints in the set and/or a comparison between the scale information of the incoming object point and historical scale information of the object superpoints in the set, wherein the incoming object point is of the category identified by the neural network with the probability of the category, and the score of the incoming object point indicates a change of the probability of the category to which the incoming object point belongs.
  18. The object detection system according to claim 16, wherein the updating the semantic map in response to the incoming object point comprises:
    updating the scores of the object superpoints in the set for the category of the object superpoints in the set that is identical to the category of the incoming object point identified by the neural network by utilizing the score of the incoming object point, wherein for the object superpoints in the set that are of the category identical to the category of the incoming object point, the object superpoints in the set get an extra score if the object superpoints in the set fall within a minimum scale of the category to which the incoming object point belongs.
  19. The object detection system according to claim 16, wherein the modifying the probability of the category to which the incoming object point belongs and identified by the neural network based on the updated semantic map comprises:
    modifying the probability of the category to which the incoming object point belongs by using the updated scores of the object superpoints in the set, wherein the probability of the category to which the incoming object point belongs is modified in consideration of a maximum score of all the object superpoints in the set with the same category as the incoming object point and a maximum score of all the object superpoints in the set with any other category.
  20. A non-transitory computer-readable medium having program instructions stored thereon that, when executed by at least one processor, cause the at least one processor to perform the object detection method according to claim 1.
PCT/CN2021/094720 2020-05-20 2021-05-19 Object detection method, system and computer-readable medium WO2021233357A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202180030033.9A CN115428040A (en) 2020-05-20 2021-05-19 Object detection method, system and computer readable medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202063027798P 2020-05-20 2020-05-20
US63/027,798 2020-05-20

Publications (1)

Publication Number Publication Date
WO2021233357A1 true WO2021233357A1 (en) 2021-11-25

Family

ID=78708105

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/094720 WO2021233357A1 (en) 2020-05-20 2021-05-19 Object detection method, system and computer-readable medium

Country Status (2)

Country Link
CN (1) CN115428040A (en)
WO (1) WO2021233357A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180012411A1 (en) * 2016-07-11 2018-01-11 Gravity Jack, Inc. Augmented Reality Methods and Devices
WO2019213459A1 (en) * 2018-05-04 2019-11-07 Northeastern University System and method for generating image landmarks
CN110674696A (en) * 2019-08-28 2020-01-10 珠海格力电器股份有限公司 Monitoring method, device, system, monitoring equipment and readable storage medium
US20200082544A1 (en) * 2018-09-10 2020-03-12 Arm Limited Computer vision processing
CN110895826A (en) * 2018-09-12 2020-03-20 三星电子株式会社 Training data generation method for image processing, image processing method and device thereof


Also Published As

Publication number Publication date
CN115428040A (en) 2022-12-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21807677

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21807677

Country of ref document: EP

Kind code of ref document: A1