CN111967288A - Intelligent three-dimensional object identification and positioning system and method - Google Patents


Info

Publication number
CN111967288A
Authority
CN
China
Prior art keywords
image
light field
module
neural network
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910419430.XA
Other languages
Chinese (zh)
Inventor
李应樵
马志雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Marvel Digital Ai Ltd
Original Assignee
Marvel Digital Ai Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Marvel Digital Ai Ltd
Priority to CN201910419430.XA
Publication of CN111967288A
Current legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • G06V10/14Optical characteristics of the device performing the acquisition or on the illumination arrangements
    • G06V10/147Details of sensors, e.g. sensor lenses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Vascular Medicine (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method and a system for identifying and positioning a three-dimensional object. The method comprises the following steps: obtaining a three-dimensional light field image through a micro-lens array in a light field camera; reconstructing the obtained three-dimensional light field image through a decoding step, refocusing the reconstructed light field images in combination to obtain a high-resolution image, and calculating a depth map corresponding to the high-resolution image; performing convolutional neural network training on the obtained high-resolution image to extract the features of all targets in the image; performing target detection on the reconstructed light field image using a target detection model established by the convolutional neural network; identifying the detected targets; and finally calculating the actual distance and position of each identified object from its corresponding depth map. The method and the system provided by the invention can capture high-resolution images in real time and train a deep learning model on the data through visual intelligence technology, so that objects in photos and/or videos can be accurately identified and their positions can be accurately determined through the calculated depth map.

Description

Intelligent three-dimensional object identification and positioning system and method
Technical Field
The invention belongs to the field of intelligent identification and positioning, and particularly relates to a system and a method for intelligently identifying and positioning a three-dimensional object.
Background
Methods and devices for three-dimensional object recognition in the prior art include the one disclosed in Chinese patent application 201610712507.9, which performs convolution on an object image or on a feature map obtained by convolving the object image; performs a linear transformation on the convolution result; takes the result of the linear transformation as the input of a three-dimensional deformation model, whose output is the position points of the features to be extracted in the object image; and extracts features from the object image, or from a feature map obtained by convolving the object image, according to the position points output by the three-dimensional deformation model. Chinese patent application 201811243485.1 discloses: A. acquiring a three-dimensional object sample set; B. updating the three-dimensional object sample set; C. enveloping and gridding the samples; D. training a neural network; E. determining the three-dimensional object to be detected and the detection samples; F. setting a sliding step length; G. setting a confidence threshold; H. determining the type and position of the three-dimensional object contained in the three-dimensional digital space. These prior art methods are inefficient, the cost of object acquisition and identification remains high, and the identification accuracy is low.
Accordingly, there is a need for a three-dimensional object recognition and localization system and method that is more efficient and accurate.
Disclosure of Invention
The invention aims to provide a real-time three-dimensional data acquisition and analysis technology, which identifies an object in an image and accurately determines the position of the object through training of obtained image data.
The invention provides a method for identifying and positioning a three-dimensional object, which comprises the following steps: obtaining a three-dimensional light field image through a micro-lens array in a light field camera; reconstructing the obtained three-dimensional light field image through a decoding step, refocusing the reconstructed light field images in combination to obtain a high-resolution image, and calculating a depth map corresponding to the high-resolution image; performing convolutional neural network training on the obtained high-resolution image to extract the features of all targets in the image; performing target detection on the reconstructed light field image using a target detection model established by a convolutional neural network ("CNN"); then identifying the detected targets; and calculating the actual distance and position of each identified object from its corresponding depth map.
In one aspect of the present invention, the plurality of lenses of the microlens array are a plurality of lenses having generally equal or uniform size, and are a spherical microlens array, a hexagonal microlens array or a square microlens array.
In another aspect of the invention, information of a point on the object is transmitted to the sensor through the microlens array via the main lens of the light field camera.
In yet another aspect of the present invention, the sensor may be a CMOS sensor or a charge coupled device image sensor ("CCD sensor"), or like photosensor.
Another aspect of the present invention, wherein the image reconstructing the obtained three-dimensional light field image by the decoding step comprises: (a) obtaining a light field image imaged by a microlens array; (b) obtaining a sequence of sub-aperture images arranged according to the distance of the focal plane; (c) obtaining a single sub-aperture image; (d) arranging the multi-view sub-aperture images according to the position on the main lens; wherein the multi-view sub-aperture image array is obtained after processing the light field image.
In another aspect of the present invention, the pixel points in the light field image are re-projected into each sub-aperture image, so as to form images of different viewing angles of a scene, further synthesize and extract the light field information of the light field image, obtain a multi-view of an imaging space, and further obtain a digital refocusing sequence; and further a depth map is obtained.
In another aspect of the invention, the following formula is used:
[Equations (1) and (2) appear only as images in the original publication and are not reproduced here.]
L′(u,v,x′,y′)=L(u,v,kx′+(1-k)u,ky′+(1-k)v) (3)
I′(x′,y′)=∫∫L(u,v,kx′+(1-k)u,ky′+(1-k)v)dudv (4)
where I and I′ denote the coordinate systems of the primary and secondary imaging planes, and L and L′ denote the light-field energy on the primary and secondary imaging planes.
After the depth data of the object under each microlens is obtained, the depth map of the whole image can be calculated and a three-dimensional (3D) image can be captured.
In another aspect of the invention, all light rays passing through a pixel pass through its parent microlens and through a conjugate square on the main lens to form a sub-aperture, and all light rays passing through the sub-aperture are focused by the corresponding pixel under a different microlens.
In another aspect of the present invention, wherein the light field image I (x, y) may be represented by the formula:
I(x,y)=∫∫LF(u,v,x,y)dudv (5)
where L_F(u, v, x, y) denotes the radiance along the ray that intersects the main lens at (u, v) and the microlens plane at (x, y), and the full aperture is used. The refocused image can be calculated by shifting the sub-aperture images in the manner shown in Fig. 6(e):
the shifted light field function can be expressed as:
[Equation (6), the shifted light field function, appears only as an image in the original publication.]
in another aspect of the present invention, the target detection model established by the convolutional neural network further comprises the steps of feature extraction, candidate region and classification and regression; the convolutional neural network consists of an input layer, a convolutional layer, an activation function, a pooling layer and a full-connection layer; inputting a characteristic diagram obtained by the CNN through a convolutional layer, a pooling layer, an activation function and a full-connection layer into the candidate region; and classifying and regressing the results obtained through the candidate regions.
In another aspect of the invention, the feature extraction step examines each pixel of the light field image to determine whether the pixel represents a feature; or, when the feature extraction step is part of a larger algorithm, only the feature regions of the image are examined.
In another aspect of the invention, wherein prior to the feature extraction step, the light field image is smoothed in scale space by a gaussian blur kernel; and calculating one or more features of the light field image by local derivative operations.
In another aspect of the invention, the activation function is a ReLU function used to represent the nonlinear elements of the image, ensuring that the data input and output remain differentiable; iterative computation is performed continuously so that the value of each neuron changes in every iteration.
In another aspect of the present invention, wherein the activation function is a Sigmoid function, a Tanh function, a Leaky ReLU function, or a Maxout function.
In another aspect of the present invention, wherein the ReLU function is f(x) = max(0, x), the feature points are detected using a template-based method, an edge-based method, a grayscale-based method, or a spatial-transform-based method.
In another aspect of the present invention, the pooling layer is a maximum pooling that extracts a maximum value from the modified feature map as a pooled value of the region; the input light field image is divided into a plurality of rectangular areas, the maximum value is output for each sub-area, and the pooling layer acts on each input feature and reduces the size of the input feature.
Another aspect of the invention, wherein pooling layers are periodically inserted between convolutional layers of the CNN.
In another aspect of the present invention, before the step of classifying and regressing the results obtained through the candidate regions, the method further comprises the step of pooling the information obtained through the candidate regions step into the region of interest.
In another aspect of the present invention, the step of selecting the candidate regions uses a Regions with Convolutional Neural Network features (R-CNN), Fast R-CNN, Faster R-CNN, or Mask R-CNN method.
In another aspect of the present invention, the step of calculating the actual distance and position of the identified object from its corresponding depth map includes determining reference objects, calculating the depth map according to the known approximate distance range between the reference objects, and normalizing the depth values in the depth map within that distance range to obtain the required distance.
The invention also provides a system for identifying and positioning three-dimensional objects, comprising:
the image acquisition module is used for acquiring a three-dimensional light field image through a micro lens array in the light field camera;
the image reconstruction focusing and depth map module is used for reconstructing the image of the obtained three-dimensional light field image through a decoding step, combining the reconstructed light field image for refocusing to obtain a high-resolution image and calculating a depth map corresponding to the high-resolution image;
the CNN training and feature extraction module is used for carrying out convolutional neural network training on the obtained high-resolution image so as to extract the features of all targets in the image;
a detection and identification module for performing target detection on the reconstructed light field image by using a target detection model established by a convolutional neural network ('CNN'); then identifying the obtained target;
and the positioning module is used for calculating the actual distance and positioning of the identified object through the corresponding depth map.
The intelligent three-dimensional object recognition and positioning system provided by the invention can capture high-resolution images in real time and train a deep learning model on the data through visual intelligence technology, so that objects in pictures and/or videos can be accurately recognized and their positions can be accurately determined through the calculated depth map.
Drawings
In order to more clearly illustrate the technical solution in the embodiments of the present invention, the drawings required to be used in the embodiments will be briefly described below. It is obvious that the drawings in the following description are only examples of the invention, and that for a person skilled in the art, other drawings can be derived from them without making an inventive step.
FIG. 1 is a flow chart of a three-dimensional object recognition and localization method of the present invention.
FIG. 2 is a schematic diagram of one embodiment of a microlens array in a three-dimensional object recognition and positioning system of the present invention.
FIG. 3 is a schematic diagram of another embodiment of a microlens array in a three-dimensional object recognition and positioning system of the present invention.
FIG. 4 is a schematic diagram of yet another embodiment of a microlens array in a three-dimensional object recognition and positioning system of the present invention.
Fig. 5 is a schematic optical-path diagram of the data acquisition portion of the three-dimensional object recognition and localization system of the present invention.
FIG. 6(a) is a schematic diagram of an example of decoding a light field image obtained by the data acquisition portion of a three-dimensional object recognition and positioning system of the present invention.
Fig. 6(b) and 6(c) are schematic diagrams of the light field imaging system of the present invention.
FIG. 6(d) is an exemplary diagram of a light field image after being processed by the present invention.
FIG. 6(e) is a schematic diagram of the moving sub-aperture image of the three-dimensional object recognition and localization system of the present invention to calculate a refocused image.
FIG. 6(f) is a schematic diagram of the digital refocusing of a synthetic aperture image by the three-dimensional object recognition and localization system of the present invention.
Fig. 7 is a schematic diagram of a target (object) detection section in the three-dimensional object recognition and localization system of the present invention.
Fig. 8(a) and (b) are diagrams illustrating examples of a convolution layer of an object (object) detecting portion in the three-dimensional object recognition and localization system of the present invention.
Fig. 8(c) is an exemplary diagram of a single depth slice of the present invention.
FIG. 9 is a schematic diagram of a target (object) detection part of candidate regions and classification and regression in the three-dimensional object recognition and localization system of the present invention.
Fig. 10 is a diagram showing the detection results of two examples of the target (object) detecting section in the three-dimensional object recognition and localization system of the present invention.
FIG. 11(a) is an exemplary diagram of the detection, classification and semantic segmentation results in the three-dimensional object recognition and localization system of the present invention.
Fig. 11(b) and 11(c) are examples of determining distances from a depth map for a three-dimensional object of the present invention.
FIG. 12 is a schematic diagram of one identification example of the three-dimensional object identification and localization system of the present invention.
FIG. 13 schematically shows a block diagram of a server for performing the method according to the invention; and
fig. 14 schematically shows a memory unit for holding or carrying program code implementing the method according to the invention.
Detailed Description
Specific embodiments of the present invention will now be described with reference to the accompanying drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. These embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art. The terminology used in the detailed description of the embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention.
FIG. 1 is a flow chart 100 of the three-dimensional object recognition and localization method of the present invention. In step 101, three-dimensional light field image information is obtained by a microlens array in a light field camera. In step 102, the obtained three-dimensional light field image is reconstructed through a decoding step, the reconstructed light field images are combined and refocused to obtain a high-resolution image, and a depth map corresponding to the high-resolution image is calculated. In step 103, convolutional neural network training is performed on the obtained high-resolution image to extract the features of all targets (i.e. objects) in the image, and target detection is performed on the reconstructed image using a target (object) detection model established by the convolutional neural network. The detected targets are then identified and located in steps 104 and 105, respectively.
Figs. 2-4 are schematic diagrams of different embodiments of the microlens array in the three-dimensional object recognition and positioning system of the present invention. The core component of the data acquisition part of the system is the microlens array, i.e. the "compound eye" of the light field camera. Fig. 5 is a schematic optical-path diagram of the data acquisition portion of the system. One example of the data acquisition part is a camera 500: information from a certain point on a subject 504 is transmitted through the main lens 503 of the camera 500 and through the microlens array 502 to a sensor 501, which may be a photosensor such as a CMOS sensor or a charge-coupled device image sensor ("CCD sensor"). Each microlens in the microlens array 502 covers a plurality of pixels of the photosensor; for example, a microlens 120 microns in diameter covers 28 x 28 pixels, while a microlens 60 microns in diameter covers 14 x 14 pixels, and the light rays falling on the pixels under each microlens form one small image. The information about a point on the object 504 recorded by the photosensor 501 is a light field image, also referred to as 4D light field information. The optical design and processing of the microlens array is decisive for the three-dimensional object recognition and positioning system of the present invention. The microlens array consists of a one-dimensional or two-dimensional array of lenses on a supporting substrate; the lenses in the array are typically of equal or uniform size and may form a spherical microlens array 200, a hexagonal microlens array 300 or a square microlens array 400. Typically, each microlens is 40 to 200 microns in size. The hexagonal arrangement fits more microlenses into the same area, so its light-gathering efficiency is higher.
FIG. 6(a) is a schematic diagram of an example of decoding a light field image obtained by the data acquisition portion of the three-dimensional object recognition and positioning system of the present invention. The decoding process 600 performs image reconstruction of the obtained light field image 601 at step 605, refocuses 603 the light field image in combination with the estimated depth map 602, and combines the refocused images 603 to obtain an all-in-focus light field image 604. The image reconstruction performed in step 605 is a key step in decoding light field images and comprises: (a) obtaining the light field image formed through the microlens array; (b) obtaining a sequence of sub-aperture images arranged according to the distance of the focal plane; (c) obtaining a single sub-aperture image; and (d) obtaining a sequence of sub-aperture images arranged according to position on the main lens. Step 605(d) yields the multi-view sub-aperture image array obtained by the camera system of the present invention, which results from processing the original compound-eye image. According to the synthetic aperture technique, pixel points in the original compound-eye image are re-projected into each sub-aperture image, forming images of the scene from different viewing angles. The light field information in the original compound-eye image can thus be synthesized and extracted to obtain multiple views of the imaging space, from which a digital refocusing sequence and, further, a depth map are obtained. Fig. 6(f) is a schematic diagram of the three-dimensional object recognition and positioning system of the present invention performing digital refocusing on a synthetic aperture image; the synthetic aperture image is digitally refocused using the principle of Fig. 6(f):
[Equations (1) and (2) appear only as images in the original publication and are not reproduced here.]
L′(u,v,x′,y′)=L(u,v,kx′+(1-k)u,ky′+(1-k)v) (3)
I′(x′,y′)=∫∫L(u,v,kx′+(1-k)u,ky′+(1-k)v)dudv (4)
where I and I′ denote the coordinate systems of the primary and secondary imaging planes, and L and L′ denote the light-field energy on the primary and secondary imaging planes.
After the depth data of the object under each microlens is obtained, the depth map of the whole image can be calculated and a three-dimensional (3D) image can be captured.
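As a minimal illustrative sketch of decoding step (d) above (not the patent's implementation), the sub-aperture image array can be obtained from a raw compound-eye image by regrouping the pixels under each microlens, assuming a square microlens packing in which each microlens covers an n x n block of pixels; the 14 x 14 block size and all names below are assumptions for illustration only.

```python
import numpy as np

def extract_subaperture_images(raw_lenslet, n=14):
    """Rearrange a raw compound-eye (lenslet) image into an n x n array of
    sub-aperture images. Assumes a square microlens packing in which each
    microlens covers an n x n block of sensor pixels; hexagonal packings
    would need an extra resampling step that is omitted here."""
    h, w = raw_lenslet.shape[:2]
    rows, cols = h // n, w // n            # number of microlenses per axis
    # Crop to a whole number of microlenses, then split each n x n block:
    # pixel (u, v) inside every block belongs to sub-aperture image (u, v).
    cropped = raw_lenslet[:rows * n, :cols * n]
    blocks = cropped.reshape(rows, n, cols, n, -1)
    # Result axes: (u, v, microlens_row, microlens_col, channel)
    subapertures = blocks.transpose(1, 3, 0, 2, 4)
    return subapertures                     # shape (n, n, rows, cols, C)

# Usage sketch:
# raw = np.random.rand(1400, 1400, 3)       # stand-in for a decoded lenslet image
# views = extract_subaperture_images(raw, n=14)
# center_view = views[7, 7]                 # one sub-aperture (single-viewpoint) image
```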
Figs. 6(b) and 6(c) show the mechanism of a light field imaging system with a microlens array 608 in front of the CMOS sensor 607. In Fig. 6(b), all light rays passing through a pixel pass through its parent microlens and through the conjugate square (sub-aperture) on the main lens 609. In Fig. 6(c), all light rays passing through the sub-aperture are focused onto the corresponding pixels under different microlenses; these pixels form the picture seen through that sub-aperture.
The light-field image I (x, y) can be represented by the formula:
I(x,y)=∫∫LF(u,v,x,y)dudv (5)
where L_F(u, v, x, y) denotes the radiance along the ray intersecting the main lens at (u, v) and the microlens plane at (x, y), and the full aperture is used. FIG. 6(e) is a schematic diagram of shifting the sub-aperture images in the three-dimensional object recognition and localization system of the present invention to calculate a refocused image.
The refocused image can be calculated by moving the sub-aperture image in the manner shown in fig. 6 (e):
the shifted light field function can be expressed as:
[Equation (6), the shifted light field function, appears only as an image in the original publication.]
Light field imaging techniques allow images to be refocused and a depth map of the scene to be estimated; a basic depth range is calculated from the light field.
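The digital refocusing described above can be discretized as a shift-and-add operation over the sub-aperture stack. The sketch below is an illustrative assumption of how such a discretization might look; the parameter k, the shift law and the rounding to integer pixel shifts are simplifications for demonstration, not the patent's exact method.

```python
import numpy as np

def refocus(subapertures, k=1.0):
    """Discrete shift-and-add refocusing of a sub-aperture image stack.
    subapertures: array of shape (n, n, H, W[, C]) as produced above.
    k: refocus parameter; k = 1.0 keeps the original focal plane, other
    values shift each view in proportion to its (u, v) offset before the
    views are averaged (an illustrative discretisation of the refocusing
    integral)."""
    n = subapertures.shape[0]
    center = (n - 1) / 2.0
    acc = np.zeros_like(subapertures[0, 0], dtype=np.float64)
    for u in range(n):
        for v in range(n):
            # Shift each view by an amount proportional to its aperture offset.
            du = int(round((1.0 - 1.0 / k) * (u - center)))
            dv = int(round((1.0 - 1.0 / k) * (v - center)))
            acc += np.roll(subapertures[u, v], shift=(du, dv), axis=(0, 1))
    return acc / (n * n)                    # average over the synthetic aperture
```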
FIG. 6(d) is an exemplary diagram of a light field image after processing by the present invention. Taking semiconductor manufacturing for chip-on-board applications as an example, the compound eye can be used to find the maximum loop height of the aluminum bond wire, the first bonding height on the chip and the second bonding height on the substrate. In Fig. 6(d), a larger number (in μm) in the positive direction means a virtual focal plane closer to the objective lens; the focal plane on the surface of the objective lens is calibrated to 0 μm. In the processed light field image, the top-left image of Fig. 6(d) is the top wire layer, the top-right image is the middle layer, the bottom-left image is the bottom metal layer, and the bottom-right image is the all-in-focus image. Autofocus software will be developed to capture all wire images without any mechanical movement of a commanded vertical axis. Real-time AOI software will be developed and used in conjunction with the autofocus software. The user interface will display the image taken by the camera and the all-in-focus image, and will mark any defects detected.
Fig. 7 is a schematic diagram of the target (object) detection section of the three-dimensional object recognition and localization system of the present invention. The object detection model established based on the deep learning convolutional neural network basically comprises the steps of feature extraction 702, candidate regions 703 (region proposal) and classification and regression 704.
The convolutional neural network ("CNN") 701 is composed of an input layer, convolutional layers, activation functions, pooling layers, and fully-connected layers, i.e. INPUT-CONV (convolutional layer)-RELU (activation function)-POOL (pooling layer)-FC (fully-connected layer). CNN 701 is used to perform feature extraction 702, which is the process of converting raw image data that a machine learning algorithm cannot recognize into features that the algorithm can recognize; it examines each pixel to determine whether the pixel represents a feature. If feature extraction is part of a larger algorithm, the algorithm generally examines only the feature regions of the image. As a prerequisite operation for feature extraction 702, the input image is typically smoothed in scale space by a Gaussian blur kernel, after which one or more features of the image are computed by local derivative operations. The original features are thereby converted into a set of features with clear physical significance (Gabor features; geometric features such as corner points and invariants; texture features such as LBP and HOG), statistical significance, or kernel features.
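A minimal sketch of the preprocessing just described, Gaussian scale-space smoothing followed by local derivative operations, assuming SciPy is available; the value of sigma and the function names are illustrative choices, not values from the patent.

```python
import numpy as np
from scipy import ndimage

def blur_and_derivatives(image, sigma=1.5):
    """Smooth an image with a Gaussian kernel (scale-space smoothing) and
    compute first-order local derivatives, a common precursor to feature
    extraction; sigma is an illustrative choice."""
    smoothed = ndimage.gaussian_filter(image.astype(np.float64), sigma=sigma)
    gx = ndimage.sobel(smoothed, axis=1)        # horizontal derivative
    gy = ndimage.sobel(smoothed, axis=0)        # vertical derivative
    magnitude = np.hypot(gx, gy)                # gradient magnitude
    return smoothed, gx, gy, magnitude
```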
Figs. 8(a) and (b) illustrate an example of a convolutional layer of the target (object) detection portion of the three-dimensional object recognition and localization system of the present invention. For example, the input image 801 is 32 x 32 pixels and the convolutional layer uses a 5 x 5 convolution kernel 802 (filter or neuron). Convolving the kernel 802 with the input image 801 yields a 28 x 28 feature map; with two such kernels, two feature maps 803 are obtained. Alternatively, multiple convolutional layers may be stacked to obtain deeper feature maps; the 28 x 28 and 24 x 24 feature maps 804 and 805 are obtained through a convolutional layer (CONV) followed by an activation function (RELU), respectively. The activation function RELU is used to express the nonlinear factors of the image; the chosen activation function ensures that the data input and output remain differentiable, and iterative computation is performed so that the value of each neuron changes in every iteration. The chosen activation function is nonlinear, continuously differentiable, unsaturated in range, monotonic, and approximately linear near the origin. Besides the ReLU function, a Sigmoid function, a Tanh function, a Leaky ReLU function, or a Maxout function may be selected. The ReLU function is f(x) = max(0, x), which is used to alleviate the vanishing-gradient problem of the backpropagation (BP) algorithm when optimizing deep neural networks.
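As an illustrative sketch only (the patent does not specify a framework), the 32 x 32 input / 5 x 5 kernel / 28 x 28 output arithmetic above can be verified with a single PyTorch convolutional layer followed by a ReLU activation.

```python
import torch
import torch.nn as nn

# 32x32 single-channel input, two 5x5 kernels, no padding: the output is 28x28,
# matching the arithmetic above (32 - 5 + 1 = 28). ReLU is applied element-wise.
x = torch.randn(1, 1, 32, 32)                   # (batch, channels, H, W)
conv = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=5)
feature_maps = torch.relu(conv(x))
print(feature_maps.shape)                       # torch.Size([1, 2, 28, 28])
```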
According to different image information processing methods, a template-based method, an edge-based method, a gray-scale-based method and a spatial transformation-based method can be selected to detect the feature points.
Pooling, in particular max pooling, is then used: the maximum value is extracted from the rectified feature map as the pooled value of each region. The input image is divided into a number of rectangular regions, and the maximum value is output for each sub-region. Intuitively, this mechanism is effective because, once a feature has been found, its precise location is far less important than its position relative to other features. The pooling layer progressively reduces the spatial size of the data, so the number of parameters and the amount of computation also decrease, which to some extent also controls overfitting. Typically, pooling layers are periodically inserted between the convolutional layers of a CNN. A pooling layer usually acts on each input feature map separately and reduces its size. The most common form of pooling divides the picture into 2 x 2 blocks every 2 elements and takes the maximum of the 4 numbers in each block, which reduces the amount of data by 75%.
An example of this operation on a single depth slice is illustrated in Fig. 8(c).
After the max pooling step, a dimension-reduced feature map is obtained. Because this feature map is still two-dimensional, a flattening layer (Flatten) is applied at this step: the two-dimensional input is converted into a one-dimensional array for the transition from the convolutional layers to the fully-connected layer (full connection). Flattening the two-dimensional feature map does not affect the batch size; batching is used to better handle non-convex loss functions and to make reasonable use of memory capacity.
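A minimal sketch of the 2 x 2 max-pooling and flattening operations described above; the sample array values are assumptions for demonstration and are not taken from the patent figures.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keeps the maximum of every 2x2 block,
    reducing the number of values by 75% as described above."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fmap = np.array([[1, 1, 2, 4],
                 [5, 6, 7, 8],
                 [3, 2, 1, 0],
                 [1, 2, 3, 4]], dtype=float)
pooled = max_pool_2x2(fmap)         # [[6., 8.], [3., 4.]]
flattened = pooled.reshape(-1)      # 1-D array for the fully-connected layer
```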
As shown in Fig. 7, the feature maps 705 obtained by the CNN through its convolutional layers, pooling layers, activation functions and fully-connected layers are input to the candidate regions step 703; other embodiments of this step are described in detail below. In one embodiment, a region of interest is selected from a 3 x 3 feature map: one branch obtains candidate proposals through a reshaping function (reshape), a normalized exponential function (softmax) and a further reshaping function (reshape), while the other branch obtains candidate proposals directly. The information obtained from the candidate regions step 703 is subjected to region-of-interest pooling 706 (ROI Pooling) and the results are classified and regressed 704. Candidate-region methods that can be selected include Regions with CNN features (R-CNN), Fast R-CNN and Faster R-CNN; Mask R-CNN may also be selected. Table 2 below shows the differences between R-CNN, Fast R-CNN and Faster R-CNN:
[Table 2, comparing R-CNN, Fast R-CNN and Faster R-CNN, appears only as an image in the original publication.]
FIG. 9 is a schematic diagram of a target (object) detection part of candidate regions and classification and regression in the three-dimensional object recognition and localization system of the present invention.
In the candidate-region (region proposal) process of target detection, the input image 901 passes through feature extraction 702 of the convolutional neural network 701 and enters the candidate area network 904. The classification output by the candidate area network does not directly determine the object's class in the Common Objects in Context ("COCO") dataset; instead it outputs a score p, p ∈ [0, 1], with the threshold set to 0.5. If p ≥ 0.5, the region can be considered as possibly belonging to some as-yet-unknown class. The function of the candidate area network 904 is to select such regions and define them as regions of interest. The candidate area network 904 frames the approximate location of these regions of interest on the feature map 903, i.e. it outputs bounding boxes. In other words, the candidate area network in object detection discards the regions of no interest in the map, keeps the regions of interest, and determines the category of the content of interest. For example, in a street-view photograph, the sky and green trees are discarded, leaving the buildings that identify the street location as the regions of interest.
Assume the feature map 903 input to the candidate area network is 64 x 64; the extracted regions of interest are smaller than 64 x 64 and are extracted from the feature map. The candidate area network outputs a classification loss function 908 and a bounding-box regression loss function 909, and the sizes of all extracted regions are adjusted to the same size by pooling 905. Feeding these regions into a common classification network yields the final output classification of the whole network (i.e. a specific class in the COCO dataset), such as person or dog. After fine adjustment, the refined bounding-box regression shown at the upper right of the figure is output, together with a classification loss function 907 and a bounding-box regression loss function 906.
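For illustration only (this is not the implementation described in the patent), the following minimal sketch shows how a detector of the Faster R-CNN family discussed above can be run on a reconstructed high-resolution image, assuming the torchvision library is available; the 0.5 score threshold mirrors the threshold mentioned above, and all names and sizes are illustrative.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Pretrained on COCO, so the output labels follow the COCO category indices
# mentioned above. The weight-selection syntax depends on the torchvision version.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 600, 800)                 # stand-in for a refocused RGB image in [0, 1]
with torch.no_grad():
    outputs = model([image])[0]                 # dict with 'boxes', 'labels', 'scores'

keep = outputs["scores"] >= 0.5                 # the 0.5 confidence threshold used above
boxes, labels = outputs["boxes"][keep], outputs["labels"][keep]
```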
Fig. 10 shows the detection results of two examples of the target (object) detection section of the three-dimensional object recognition and localization system of the present invention. The upper part of Fig. 10 shows an adult and a child detected against a dark background; the lower part shows cars and people detected in a traffic environment.
FIG. 11(a) is an exemplary diagram of the detection, classification and semantic segmentation results of the three-dimensional object recognition and localization system of the present invention. Each bounding box shows the position of a detected object, and the text displayed on it shows the classification result; for example, bounding boxes 1101, 1102, 1103 and 1104 are classified as car, truck, person and potted plant respectively, and the number indicates the classification confidence. In one embodiment, the target position can also be obtained, i.e. the distance of each bounding box: for example, the car bounding box 1101 is 218 meters away, the truck bounding box 1102 is 162 meters away, the person bounding box 1103 is 58 meters away, and the potted-plant bounding box 1104 is 87 meters away.
Figs. 11(b) and 11(c) are examples of determining distances from a depth map for a three-dimensional object of the present invention; Fig. 11(b) is the depth map of Fig. 11(c). The principle of distance measurement is to determine reference objects, such as the potted plant in the lower-left corner and the white wall at the far end of Fig. 11(c), whose approximate distances are known, for example 5 meters and 30 meters. A depth map is then calculated; Fig. 11(b) shows the depth map after visualization. The depth values in the depth map are normalized within this distance range, forming the color bar on the right side of Fig. 11(b).
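A minimal sketch of the normalization just described, assuming a linear mapping between raw depth values and metric distance anchored at two reference objects at roughly 5 m and 30 m; the linear model and all function names are illustrative assumptions rather than the patent's exact procedure.

```python
import numpy as np

def depth_to_distance(depth_map, ref_near, ref_far, d_near=5.0, d_far=30.0):
    """Map raw depth-map values to metric distances using two reference
    objects with known approximate distances (here 5 m and 30 m).
    ref_near / ref_far are the raw depth values sampled at the reference
    objects; a linear mapping between them is assumed for illustration."""
    scale = (d_far - d_near) / (ref_far - ref_near)
    return d_near + (depth_map - ref_near) * scale

def object_distance(distance_map, box):
    """Distance of a detected object: median distance inside its bounding box."""
    x1, y1, x2, y2 = [int(v) for v in box]
    return float(np.median(distance_map[y1:y2, x1:x2]))
```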
FIG. 12 is a schematic diagram of one recognition example of the three-dimensional object identification and localization system of the present invention. The face detection step 1201 automatically locates human faces in visual media (digital pictures or video). For each detected face, the process reports its position and the associated size and orientation information. After a face is detected, the process can also search for characteristic points such as the eyes and nose. In one example, machine learning does not first detect feature points and then perform full-face detection based on them; instead, full-face detection and detailed feature-point detection are performed separately. Feature-point detection is therefore an optional step that is not enabled by default. The detected face is further outlined with a set of points representing the shape of the facial features, and classification indicates whether certain facial features are present.
In the face alignment step 1202, in one example, a dense face alignment (DeFA) strategy may be used to align the face image, in which three constraints may be added to the face model: a landmark fitting constraint, a contour fitting constraint and a SIFT-pair constraint. Training is performed using multiple face-alignment databases. To achieve high-quality dense face alignment, however, no correspondingly annotated training database exists: existing face-alignment databases mark no more than 68 feature points, so useful information is sought as additional supervision and embedded into the learning framework. Optionally, an additional contour constraint is added so that the predicted contour of the face shape matches the face edge detected in the image; and an additional Scale Invariant Feature Transform (SIFT) constraint is added so that, for different face images of the same person, corresponding SIFT key points should map to the same vertex of the face model.
In the face recognition step 1203, the features that can be used are classified into visual features, pixel statistical features, face-image transform-coefficient features, face-image algebraic features, and the like. Face feature extraction is the process of modeling the features of a face. A knowledge-based characterization method may be employed, or a characterization method based on algebraic features or statistical learning. Knowledge-based characterization mainly derives feature data helpful for face classification from the shape description of the facial organs and the distances between them; the feature components usually include the Euclidean distances, curvatures and angles between feature points. A human face is composed of parts such as the eyes, nose, mouth and chin, and geometric descriptions of these parts and of their structural relationships can be used as important features for recognizing a face; these are called geometric features. Knowledge-based face characterization mainly includes geometric-feature-based methods and template-matching methods. Recognition algorithms based on facial feature points, on the whole face image, or on templates may be adopted, as well as neural-network-based recognition, illumination estimation models, and optimized deformation statistical correction.
In the face-image matching and recognition step 1204, the extracted feature data of the face image is searched and matched against the feature templates stored in the database; a threshold is set, and when the similarity exceeds the threshold, the matching result is output. Face recognition compares the face features to be recognized with the stored face feature templates and judges the identity of the face according to the degree of similarity. This process falls into two categories: verification, a one-to-one image comparison, and identification, a one-to-many image matching comparison.
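A minimal sketch of the matching step, assuming cosine similarity between feature vectors and a fixed threshold (the patent does not specify a similarity measure, so this choice and all names are illustrative); the two functions mirror the verification (1:1) and identification (1:N) cases described above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def verify(feature, template, threshold=0.8):
    """1:1 comparison: accept if the similarity exceeds the threshold."""
    return cosine_similarity(feature, template) >= threshold

def identify(feature, templates, threshold=0.8):
    """1:N comparison: return the best-matching identity above the threshold,
    or None. 'templates' maps identity -> stored feature vector."""
    best_id, best_sim = None, threshold
    for identity, template in templates.items():
        sim = cosine_similarity(feature, template)
        if sim >= best_sim:
            best_id, best_sim = identity, sim
    return best_id
```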
FIG. 13 schematically shows a block diagram of a server for performing the method of the present invention, such as the object recognition and positioning system server 1301. The server includes a processor 1310, which may be a general-purpose chip, an application-specific integrated circuit (ASIC), an FPGA, an NPU or the like, and a computer program product or computer-readable medium in the form of a memory 1320. The memory 1320 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. The memory 1320 has a storage space 1330 for program code for performing any of the method steps described above. For example, the storage space 1330 may include respective program codes 1331 for implementing the various steps of the above methods, and this program code can be read from or written to by the processor 1310. The computer program product comprises a program-code carrier such as a hard disk, a compact disc (CD), a memory card or a floppy disk. Such a computer program product is typically a portable or fixed storage unit as described with reference to Fig. 14. The storage unit may have storage sections and storage space arranged similarly to the memory 1320 in the server of Fig. 13, and the program code may, for example, be compressed in a suitable form. Generally, the storage unit comprises computer-readable code 1331', i.e. code that can be read by a processor such as the processor 1310, which, when executed by the server, causes the server to perform the steps of the method described above.
Reference herein to "one embodiment," "an embodiment," or "one or more embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Moreover, it is noted that instances of the word "in one embodiment" are not necessarily all referring to the same embodiment.
The above description is only intended to illustrate the present invention, and a person skilled in the art can modify and vary the above embodiments without departing from the spirit and scope of the present invention. The scope of protection is therefore defined by the appended claims. The invention has been explained above with reference to examples; however, other embodiments than those described above are equally possible within the scope of this disclosure, and the different features and steps of the invention may be combined in ways other than those described. More generally, those of ordinary skill in the art will readily appreciate that all parameters, dimensions, materials and configurations described herein are exemplary, and that the actual parameters, dimensions, materials and/or configurations will depend upon the particular application or applications for which the teachings of the present invention are used.

Claims (41)

1. A method of three-dimensional object recognition and localization, comprising:
obtaining a three-dimensional light field image through a micro-lens array in a light field camera;
carrying out image reconstruction on the obtained three-dimensional light field image through a decoding step, combining the reconstructed light field images for refocusing to obtain a high-resolution image and calculating a depth map corresponding to the high-resolution image;
carrying out convolutional neural network training on the obtained high-resolution image to extract the characteristics of all targets in the image;
performing target detection on the reconstructed light field image by using a target detection model established by a convolutional neural network ('CNN'); then identifying the obtained target;
the identified objects are used to calculate the actual distance and location from their corresponding depth maps.
2. The method of claim 1, wherein,
the plurality of lenses of the microlens array are typically a plurality of lenses of equal or uniform size, either a spherical microlens array, a hexagonal microlens array, or a checkered microlens array.
3. The method of claim 1, wherein information of a point on the subject is transmitted to the sensor through the microlens array, through the main lens of the light field camera.
4. The method of claim 3, wherein the sensor may be a CMOS sensor or a charge-coupled device image sensor ("CCD sensor") or like photosensor.
5. The method of claim 1, wherein the image reconstructing the obtained three-dimensional light field image by the decoding step comprises: (a) obtaining a light field image imaged by a microlens array; (b) obtaining a sequence of sub-aperture images arranged according to the distance of the focal plane; (c) obtaining a single sub-aperture image; (d) arranging the multi-view sub-aperture images according to the position on the main lens; wherein the multi-view sub-aperture image array is obtained after processing the light field image.
6. The method of claim 5, wherein pixel points in the light field image are re-projected into each sub-aperture image to form images of different view angles of a scene, the light field information of the light field image is further synthesized and extracted to obtain a multi-view of an imaging space, and further obtain a digital refocusing sequence; and further a depth map is obtained.
7. The method of claim 6, wherein the following formula is employed:
[Equations (1) and (2) of this claim appear only as images in the original publication and are not reproduced here.]
L′(u,v,x′,y′)=L(u,v,kx′+(1-k)u,ky′+(1-k)v) (3)
I′(x′,y′)=∫∫L(u,v,kx′+(1-k)u,ky′+(1-k)v)dudv (4)
wherein I and I′ represent the coordinate systems of the primary and secondary imaging planes;
L and L′ represent the energy of the primary and secondary imaging planes,
and after the depth data of the object under each microlens is obtained, the depth map of the whole image can be calculated, and a three-dimensional (3D) image can be captured.
8. The method of claim 1, wherein,
all rays passing through a pixel pass through its parent microlens and through the conjugate square on the main lens forming a sub-aperture, and all rays passing through the sub-aperture are focused by the corresponding pixel under a different microlens.
9. The method of claim 1, wherein,
the light field image I (x, y) may be represented by the formula:
I(x,y)=∫∫LF(u,v,x,y)dudv (5)
where L_F(u, v, x, y) represents the radiance along the ray intersecting the main lens at (u, v) and the microlens plane at (x, y), and the full aperture is used; a refocused image can be calculated by shifting the sub-aperture images:
the shifted light field function can be expressed as:
[Equation (6), the shifted light field function, appears only as an image in the original publication.]
10. the method of claim 1, wherein the object detection model built with convolutional neural network further comprises the steps of feature extraction, candidate region and classification and regression; the convolutional neural network consists of an input layer, a convolutional layer, an activation function, a pooling layer and a full-connection layer; inputting a characteristic diagram obtained by the CNN through a convolutional layer, a pooling layer, an activation function and a full-connection layer into the candidate region; and classifying and regressing the results obtained through the candidate regions.
11. The method of claim 10, wherein the feature extraction step examines each pixel of the light field image to determine whether the pixel represents a feature; or, when the feature extraction step is part of a larger algorithm, only the feature regions of the image are examined.
12. The method of claim 11, wherein prior to the feature extraction step, the light field image is smoothed in scale space by a gaussian blur kernel; and calculating one or more features of the light field image by local derivative operations.
13. The method of claim 10, wherein the activation function is a ReLU function to represent non-linear elements of the image, ensuring that data inputs and outputs are also differentiable, and performing a loop calculation that continuously changes the value of each neuron during each generation of the loop.
14. The method of claim 10, wherein the activation function is a Sigmoid function, a Tanh function, a Leaky ReLU function, or a Maxout function.
15. The method of claim 13, wherein the ReLU function is
f(x)=max(0,x),
Feature points are detected using a template-based method, an edge-based method, a grayscale-based method, or a spatial transformation-based method.
16. The method of claim 10, wherein the pooling layer is a maximum pooling that extracts a maximum value from the modified feature map as a pooled value for the region; the input light field image is divided into a plurality of rectangular areas, the maximum value is output for each sub-area, and the pooling layer acts on each input feature and reduces the size of the input feature.
17. The method of claim 10, wherein pooling layers are periodically inserted between convolutional layers of the CNN.
18. The method of claim 10, wherein the step of classifying and regressing the results obtained through the candidate regions is preceded by the step of pooling the information obtained through the candidate regions step for the region of interest.
19. The method as claimed in claim 10, wherein the step of candidate regions adopts a region with convolutional neural network features (R-CNN), Fast R-CNN, Faster R-CNN, or Mask R-CNN method.
20. The method of claim 1, wherein the step of calculating the actual distance and position of the identified object from its corresponding depth map comprises determining reference objects, calculating the depth map according to the known approximate distance range between the reference objects, and normalizing the depth values in the depth map within that distance range to obtain the required distance.
21. A system for three-dimensional object recognition and localization, comprising:
the image acquisition module is used for acquiring a three-dimensional light field image through a micro lens array in the light field camera;
the image reconstruction focusing and depth map module is used for reconstructing the image of the obtained three-dimensional light field image through a decoding step, combining the reconstructed light field image for refocusing to obtain a high-resolution image and calculating a depth map corresponding to the high-resolution image;
the CNN training and feature extraction module is used for carrying out convolutional neural network training on the obtained high-resolution image so as to extract the features of all targets in the image;
a detection and identification module for performing target detection on the reconstructed light field image by using a target detection model established by a convolutional neural network ('CNN'); then identifying the obtained target;
and the positioning module is used for calculating the actual distance and positioning of the identified object through the corresponding depth map.
22. The system of claim 21, wherein,
the plurality of lenses of the microlens array are typically a plurality of lenses of equal or uniform size, either a spherical microlens array, a hexagonal microlens array, or a checkered microlens array.
23. The system of claim 21, wherein information of a point on the subject is transmitted to the sensor through the microlens array, through the main lens of the light field camera.
24. The system of claim 23, wherein the sensor may be a photosensor such as a CMOS sensor or a charge-coupled device image sensor ("CCD sensor").
25. The system of claim 21, wherein the image reconstruction focus and depth map module comprises: (a) obtaining a light field image imaged by a microlens array; (b) obtaining a sequence of sub-aperture images arranged according to the distance of the focal plane; (c) obtaining a single sub-aperture image; (d) arranging the multi-view sub-aperture images according to the position on the main lens; wherein the multi-view sub-aperture image array is obtained after processing the light field image.
26. The system of claim 25, wherein pixels in the light field image are re-projected into each sub-aperture image to form images of the scene from different viewing angles; the light field information of the light field image is further synthesized and extracted to obtain multiple views of the imaging space, from which a digital refocusing sequence and, in turn, a depth map are obtained.
27. The system of claim 26, wherein the following formulas are employed:
[formulas (1) and (2) are rendered only as images in the original publication]
L′(u, v, x′, y′) = L(u, v, kx′ + (1 − k)u, ky′ + (1 − k)v)    (3)
I′(x′, y′) = ∫∫ L(u, v, kx′ + (1 − k)u, ky′ + (1 − k)v) du dv    (4)
where I and I′ denote the coordinate systems of the primary and secondary imaging planes, and L and L′ denote the light field energy on the primary and secondary imaging planes; after the depth data of the object under each microlens is obtained, the depth map of the whole image can be calculated and a three-dimensional (3D) image can be captured (a refocusing sketch follows the claims).
28. The system of claim 21, wherein,
all rays passing through a pixel pass through its parent microlens and through a conjugate square region (sub-aperture) on the main lens, and all rays passing through that sub-aperture are focused onto the corresponding pixels under the different microlenses.
29. The system of claim 21, wherein,
the light field image I(x, y) may be represented by the formula:
I(x, y) = ∫∫ L_F(u, v, x, y) du dv    (5)
where L_F(u, v, x, y) represents the light traveling along the ray that intersects the main lens at (u, v) and the microlens plane at (x, y); using the full aperture, a refocused image can be calculated by shifting the sub-aperture images;
the shifted light field function can be expressed as:
[this formula is rendered only as an image in the original publication]
30. The system of claim 21, wherein the CNN training and feature extraction module further comprises a feature extraction module, a candidate-region module, and a classification and regression module; the convolutional neural network consists of an input layer, a convolutional layer, an activation function, a pooling layer and a fully connected layer; the feature map obtained by the CNN through the convolutional layer, pooling layer, activation function and fully connected layer is input to the candidate-region module; and the results obtained from the candidate-region module are classified and regressed (a layer-stack sketch follows the claims).
31. The system of claim 30, wherein the feature extraction module examines each pixel of the light field image to determine whether the pixel represents a feature; or, when the feature extraction module is part of a larger algorithm, only the feature regions of the image are examined.
32. The system of claim 31, wherein the feature extraction module further comprises smoothing the light field image in scale space with a Gaussian blur kernel, and computing one or more features of the light field image through local derivative operations (a scale-space sketch follows the claims).
33. The system of claim 30, wherein the activation function is a ReLU function used to represent the non-linear elements of the image, ensuring that the data inputs and outputs remain differentiable, and an iterative calculation is performed that continuously updates the value of each neuron in every iteration.
34. The system of claim 30, wherein the activation function is a Sigmoid function, a Tanh function, a Leaky ReLU function, or a Maxout function.
35. The system of claim 33, wherein the ReLU function is
f(x) = max(0, x); and feature points are detected using a template-based module, an edge-based module, a grayscale-based module, or a spatial-transform-based module.
36. The system of claim 30, wherein the pooling layer performs max pooling, which extracts the maximum value from the rectified feature map as the pooled value of each region; the input light field image is divided into a plurality of rectangular sub-regions, the maximum value is output for each sub-region, and the pooling layer acts on each input feature map and reduces its size.
37. The system of claim 30, wherein pooling layers are periodically inserted between convolutional layers of the CNN.
38. The system of claim 30, wherein, before the classification and regression module classifies the results obtained via the candidate-region module, the system further comprises a module that performs region-of-interest (ROI) pooling on the information obtained via the candidate-region module.
39. The system of claim 30, wherein the candidate-region module employs a Regions with Convolutional Neural Network features (R-CNN), Fast R-CNN, Faster R-CNN, or Mask R-CNN module.
40. The system of claim 21, wherein the positioning module is configured to determine a reference object, compute the depth map according to the known approximate distance range between the determined reference object and the identified object, and normalize the depth values in the depth map within that distance range to obtain the required distance.
41. A storage medium storing code for implementing the method of any one of claims 1-20.
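Refocusing sketch (claims 25-29). Claims 25-29 describe decoding a lenslet light field image into a multi-view sub-aperture array and refocusing by shifting and summing the views, as in formulas (3)-(5). The Python sketch below is a minimal NumPy illustration of that idea; it assumes an idealized, already-rectified lenslet image in which every microlens covers exactly n_v x n_u sensor pixels, and the array layout, function names and the shift parameter k are illustrative assumptions rather than the patent's actual decoder.

import numpy as np
from scipy.ndimage import shift as nd_shift

def extract_subapertures(lenslet_img, n_u, n_v):
    """Rearrange an idealized lenslet image of shape (H*n_v, W*n_u) into
    sub-aperture views indexed by (u, v): views[u, v] is one viewing direction."""
    H = lenslet_img.shape[0] // n_v
    W = lenslet_img.shape[1] // n_u
    views = np.empty((n_u, n_v, H, W), dtype=lenslet_img.dtype)
    for v in range(n_v):
        for u in range(n_u):
            # The pixel at offset (v, u) under every microlens belongs
            # to the same sub-aperture (the same direction on the main lens).
            views[u, v] = lenslet_img[v::n_v, u::n_u]
    return views

def refocus(views, k):
    """Shift-and-sum refocusing: approximates
    I'(x', y') = sum over (u, v) of L(u, v, k*x' + (1-k)*u, k*y' + (1-k)*v)
    by translating each view proportionally to its (u, v) offset before summing."""
    n_u, n_v, H, W = views.shape
    cu, cv = (n_u - 1) / 2.0, (n_v - 1) / 2.0
    acc = np.zeros((H, W), dtype=np.float64)
    for u in range(n_u):
        for v in range(n_v):
            # (1 - k) controls how far each view is translated before summing.
            dy = (1.0 - k) * (v - cv)
            dx = (1.0 - k) * (u - cu)
            acc += nd_shift(views[u, v].astype(np.float64), (dy, dx), order=1)
    return acc / (n_u * n_v)

A real plenoptic capture would additionally need demosaicing, microlens-center calibration and (for hexagonal arrays) resampling to a rectangular grid before this simple rearrangement applies.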
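Layer-stack sketch (claim 30). Claim 30 lists the constituent layers of the convolutional network (input, convolutional, activation, pooling and fully connected layers) whose feature map feeds the candidate-region module. The following hedged PyTorch sketch shows one such stack; the channel counts, kernel sizes and output dimension are arbitrary choices for illustration, not the patent's architecture.

import torch
import torch.nn as nn

class SmallFeatureExtractor(nn.Module):
    """Illustrative input -> conv -> ReLU -> pool -> fully-connected stack,
    mirroring the layer types listed in claim 30 (all sizes are arbitrary)."""
    def __init__(self, num_features=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),                      # pooling layer between convolutions
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, num_features),          # fully connected layer
        )

    def forward(self, x):
        return self.head(self.backbone(x))

# Example: features = SmallFeatureExtractor()(torch.randn(1, 3, 224, 224))  # -> (1, 128)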
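Pooling and activation sketch (claims 16, 33, 35 and 36). These claims refer to the ReLU activation f(x) = max(0, x) and to max pooling over rectangular sub-regions of a feature map. A minimal NumPy sketch of both operations follows; it assumes a feature map whose height and width are divisible by the pooling window, and the function names are illustrative.

import numpy as np

def relu(x):
    """ReLU activation: f(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def max_pool2d(feature_map, window=2):
    """Divide a 2-D feature map into window x window rectangles and keep the
    maximum of each rectangle, shrinking both spatial dimensions."""
    h, w = feature_map.shape
    assert h % window == 0 and w % window == 0, "sketch assumes divisible sizes"
    blocks = feature_map.reshape(h // window, window, w // window, window)
    return blocks.max(axis=(1, 3))

# Toy usage: a 4x4 rectified feature map pooled down to 2x2.
fm = relu(np.array([[1., -2., 3., 0.],
                    [0., 5., -1., 2.],
                    [-3., 1., 4., -4.],
                    [2., 0., -1., 6.]]))
pooled = max_pool2d(fm, window=2)  # -> [[5., 3.], [2., 6.]]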
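Scale-space sketch (claims 31 and 32). Claim 32 describes smoothing the light field image in scale space with a Gaussian blur kernel and computing features through local derivative operations. The sketch below uses SciPy's Gaussian filtering to build a small scale-space stack and a gradient-magnitude response; the particular scales and the gradient-magnitude feature are assumptions chosen only to illustrate the idea.

import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(image, sigmas=(1.0, 2.0, 4.0)):
    """Stack of progressively Gaussian-blurred copies of the image."""
    return [gaussian_filter(image.astype(np.float64), sigma=s) for s in sigmas]

def gradient_magnitude(image, sigma=1.0):
    """Local derivative feature: derivative-of-Gaussian responses along the
    vertical and horizontal axes, combined into a gradient magnitude map."""
    gy = gaussian_filter(image.astype(np.float64), sigma=sigma, order=(1, 0))
    gx = gaussian_filter(image.astype(np.float64), sigma=sigma, order=(0, 1))
    return np.hypot(gx, gy)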
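Detection sketch (claims 17-19 and 37-39). These claims leave the candidate-region stage open to the R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN, Mask R-CNN), with claims 18 and 38 adding region-of-interest pooling before classification and regression. The patent does not specify an implementation; as a hedged stand-in, the sketch below runs a pretrained Faster R-CNN from torchvision on a refocused high-resolution image. The model choice, score threshold, file name and torchvision >= 0.13 weights API are all assumptions for illustration.

import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Off-the-shelf Faster R-CNN with a ResNet-50 FPN backbone: candidate regions
# come from its region proposal network, and per-region features are pooled
# before classification and bounding-box regression.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = Image.open("refocused_view.png").convert("RGB")  # hypothetical input file
with torch.no_grad():
    outputs = model([to_tensor(image)])[0]

keep = outputs["scores"] > 0.5   # illustrative confidence threshold
boxes = outputs["boxes"][keep]   # (x1, y1, x2, y2) for each detected object
labels = outputs["labels"][keep]

For a standalone version of the ROI pooling referenced in claims 18 and 38, torchvision also exposes torchvision.ops.roi_pool, which pools backbone features over candidate boxes to a fixed output size.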
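Distance-normalization sketch (claims 20 and 40). These claims obtain a metric distance by normalizing the relative depth map within a known approximate distance range established from a reference object. A minimal NumPy sketch of that normalization follows; it assumes larger relative depth values correspond to farther points, and the function names, the near/far bounds and the use of a median over the object mask are illustrative assumptions.

import numpy as np

def depth_to_distance(depth_map, d_near, d_far):
    """Map relative depth values onto the known metric range [d_near, d_far]
    (derived from the reference object), assuming larger values are farther."""
    d = depth_map.astype(np.float64)
    rel = (d - d.min()) / (d.max() - d.min() + 1e-12)  # normalize to [0, 1]
    return d_near + rel * (d_far - d_near)

def object_distance(depth_map, mask, d_near, d_far):
    """Distance to an identified object: median metric depth over its detection mask."""
    metric = depth_to_distance(depth_map, d_near, d_far)
    return float(np.median(metric[mask]))

Here d_near and d_far stand for the known approximate distance range; taking the median over the detection mask keeps the estimate robust to depth outliers at object boundaries.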
CN201910419430.XA 2019-05-20 2019-05-20 Intelligent three-dimensional object identification and positioning system and method Pending CN111967288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910419430.XA CN111967288A (en) 2019-05-20 2019-05-20 Intelligent three-dimensional object identification and positioning system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910419430.XA CN111967288A (en) 2019-05-20 2019-05-20 Intelligent three-dimensional object identification and positioning system and method

Publications (1)

Publication Number Publication Date
CN111967288A true CN111967288A (en) 2020-11-20

Family

ID=73357943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910419430.XA Pending CN111967288A (en) 2019-05-20 2019-05-20 Intelligent three-dimensional object identification and positioning system and method

Country Status (1)

Country Link
CN (1) CN111967288A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112606758A (en) * 2020-12-25 2021-04-06 江苏日兴汽车配件有限公司 Intelligent control device and method for automobile brake lamp and automobile
CN113358659A (en) * 2021-04-25 2021-09-07 上海工程技术大学 Camera array type imaging method for automatic detection of high-speed rail box girder crack
CN113358659B (en) * 2021-04-25 2022-07-19 上海工程技术大学 Camera array type imaging method for automatic detection of high-speed rail box girder crack
CN113804124A (en) * 2021-09-16 2021-12-17 飞亚达精密科技股份有限公司 Three-dimensional measurement method and system based on neural network model
CN114062712A (en) * 2021-09-29 2022-02-18 东南大学 Single-optical-field-imaging-based synthetic aperture particle image speed measurement method and device
CN114062712B (en) * 2021-09-29 2022-09-06 东南大学 Single-optical-field-imaging-based synthetic aperture particle image speed measurement method and device
CN114612972A (en) * 2022-03-07 2022-06-10 北京拙河科技有限公司 Face recognition method and system of light field camera
CN114677577A (en) * 2022-03-23 2022-06-28 北京拙河科技有限公司 Motor vehicle detection method and system of light field camera
CN114677577B (en) * 2022-03-23 2022-11-29 北京拙河科技有限公司 Motor vehicle detection method and system of light field camera

Similar Documents

Publication Publication Date Title
CN111967288A (en) Intelligent three-dimensional object identification and positioning system and method
EP3499414B1 (en) Lightweight 3d vision camera with intelligent segmentation engine for machine vision and auto identification
US9639748B2 (en) Method for detecting persons using 1D depths and 2D texture
US20180322623A1 (en) Systems and methods for inspection and defect detection using 3-d scanning
CN109903331B (en) Convolutional neural network target detection method based on RGB-D camera
EP4195069A1 (en) Systems and methods for automatically generating metadata for media documents
WO2020207172A1 (en) Method and system for optical monitoring of unmanned aerial vehicles based on three-dimensional light field technology
CN115082924A (en) Three-dimensional target detection method based on monocular vision and radar pseudo-image fusion
CN107609475B (en) Pedestrian detection false detection extraction method based on light field camera
CN114897816A (en) Mask R-CNN mineral particle identification and particle size detection method based on improved Mask
Zhou et al. YOLO-CIR: The network based on YOLO and ConvNeXt for infrared object detection
CN106599806A (en) Local curved-surface geometric feature-based human body action recognition method
CN110210292A (en) A kind of target identification method based on deep learning
CN114298151A (en) 3D target detection method based on point cloud data and image data fusion
Brenner et al. RGB-D and thermal sensor fusion: a systematic literature review
CN113008380B (en) Intelligent AI body temperature early warning method, system and storage medium
CN117252926B (en) Mobile phone shell auxiliary material intelligent assembly control system based on visual positioning
CN109064444B (en) Track slab disease detection method based on significance analysis
CN104166840A (en) Focusing realization method based on video conference system
CN112002008A (en) Three-dimensional object detection device and method integrating optical and visual intelligent technologies
Fan et al. Human-m3: A multi-view multi-modal dataset for 3d human pose estimation in outdoor scenes
CN113408545A (en) End-to-end photoelectric detection system and method based on micro-optical device
Wu et al. An effective method for human detection using far-infrared images
CN110866420A (en) 2D deceptive pedestrian recognition method based on light field camera, HOG and SVM
CN111862106A (en) Image processing method based on light field semantics, computer device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination