CN111428729A - Target detection method and device - Google Patents

Target detection method and device

Info

Publication number
CN111428729A
Authority
CN
China
Prior art keywords
feature map
image
detection
target
depth image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910019056.4A
Other languages
Chinese (zh)
Inventor
危磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Qianshi Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201910019056.4A
Publication of CN111428729A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and device, and relates to the field of computer technology. One embodiment of the method comprises: collecting a 2D image and a 3D depth image of a detection target; using the 3D depth image as mask data to process a first feature map generated from the 2D image, obtaining a second feature map; and inputting the second feature map into a target detection network to obtain the category information and position information corresponding to the detection target. This embodiment can improve the distinguishability of the detection target and each of its planes, improve detection accuracy, and save detection cost without adding extra computational burden.

Description

Target detection method and device
Technical Field
The invention relates to the technical field of computers, in particular to a target detection method and device.
Background
With the progress of science and technology and the development of mechanical automation, more and more warehouses are adopting manipulators (robotic arms) to grasp and sort goods in order to save manpower. Although existing manipulator control systems are mature, the detection and positioning of the target to be grasped remain difficult problems.
Existing target detection schemes include:
SIFT (Scale-Invariant Feature Transform) feature-plus-template matching, which requires a template to be built in advance for each detection target. Taking warehouse goods as an example, a warehouse stocks thousands of kinds of goods, so the cost of building a template for every kind is very high; moreover, SIFT features are difficult to extract from texture-less goods, making such goods hard to identify;
edge detection, which is very sensitive to illumination changes and to texture interference on the detection target (e.g., on a goods package), and distinguishes the planes of the detection target poorly.
In the process of implementing the present invention, the inventors found that the prior art has at least the following problems:
existing schemes distinguish the detection target and each of its planes poorly, have low detection accuracy, and incur high detection cost.
Disclosure of Invention
In view of this, embodiments of the present invention provide a target detection method and apparatus that can improve the distinguishability of the detection target and each of its planes, improve detection accuracy, and save detection cost without adding extra computational burden.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided an object detection method.
A target detection method, comprising: acquiring a 2D (two-dimensional) image and a 3D (three-dimensional) depth image of a detection target; using the 3D depth image as mask data to process a first feature map generated from the 2D image, obtaining a second feature map; and inputting the second feature map into a target detection network to obtain the category information and position information corresponding to the detection target.
Optionally, the step of processing the first feature map generated from the 2D image using the 3D depth image as mask data to obtain a second feature map includes: performing N-level convolution and pooling on the 2D image to obtain N first feature maps; performing N-level pooling on the 3D depth image to obtain N pooled 3D depth images, where the pooled 3D depth images correspond one-to-one with the first feature maps and N is a positive integer; and, for each first feature map, using the corresponding pooled 3D depth image as mask data and applying a preset processing to the features in the first feature map to obtain a second feature map.
Optionally, the preset processing is one of dot multiplication, addition, and merging (concatenation).
According to another aspect of the embodiments of the present invention, there is provided an object detecting apparatus.
A target detection device, comprising: an image acquisition module for collecting a 2D image and a 3D depth image of a detection target; a feature map processing module for processing a first feature map generated from the 2D image, using the 3D depth image as mask data, to obtain a second feature map; and a target detection module for inputting the second feature map into a target detection network to obtain the category information and position information corresponding to the detection target.
Optionally, the feature map processing module is further configured to: perform N-level convolution and pooling on the 2D image to obtain N first feature maps; perform N-level pooling on the 3D depth image to obtain N pooled 3D depth images, where the pooled 3D depth images correspond one-to-one with the first feature maps and N is a positive integer; and, for each first feature map, use the corresponding pooled 3D depth image as mask data and apply a preset processing to the features in the first feature map to obtain a second feature map.
Optionally, the preset processing is one of dot multiplication, addition, and merging (concatenation).
According to yet another aspect of an embodiment of the present invention, an electronic device is provided.
An electronic device, comprising: one or more processors; a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the object detection method provided by the present invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the object detection method provided by the invention.
One embodiment of the above invention has the following advantages or benefits: a 2D image and a 3D depth image of a detection target are collected; the 3D depth image is used as mask data to process the first feature map generated from the 2D image, obtaining a second feature map; and the second feature map is input into a target detection network to obtain the category information and position information corresponding to the detection target. This can improve the distinguishability of the detection target and each of its planes, improve detection accuracy, avoid extra computational burden, and save detection cost.
Further effects of the above non-conventional alternatives are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a target detection method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a structure of a residual block according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a target detection model according to a second embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a target detection model according to a third embodiment of the present invention;
FIG. 5 is a schematic diagram of a commodity detection process according to a fourth embodiment of the present invention;
FIG. 6 is a schematic diagram of the main modules of a target detection apparatus according to a fifth embodiment of the present invention;
FIG. 7 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 8 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
Fig. 1 is a schematic diagram of main steps of a target detection method according to an embodiment of the present invention.
As shown in fig. 1, the target detection method according to an embodiment of the present invention mainly includes steps S101 to S103 as follows.
Step S101: and acquiring a 2D image and a 3D depth image of the detection target.
The detection target in embodiments of the present invention can be any target to be detected and identified in various application scenarios; for example, in a warehouse picking scenario, the detection target is a commodity to be grasped by a manipulator.
The 2D image of the detection target is a 2D color (RGB) image. The 2D image and the 3D depth image can be acquired with a 2D color camera and a 3D depth camera respectively (the two may be integrated into a single camera).
After step S101, the acquired 2D image and 3D depth image of the detection target may be aligned, and the missing portions of the 3D depth image may be blackened (set to zero) so that the black areas do not affect the identification of the detection target region.
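As an illustration only, here is a minimal NumPy sketch of this preprocessing; the assumptions that the two images are already aligned and that missing depth is marked by NaN or negative values are ours, not the patent's:

```python
import numpy as np

def preprocess(rgb: np.ndarray, depth: np.ndarray):
    """Blacken missing depth so it cannot disturb target identification.

    rgb   : (H, W, 3) uint8 color image, assumed already aligned with depth
    depth : (H, W) float32 depth map; NaN or negative values are assumed to
            mark pixels the 3D camera failed to measure
    """
    depth = np.nan_to_num(depth, nan=0.0)   # missing -> 0 ("blackened")
    depth = np.clip(depth, 0.0, None)
    if depth.max() > 0:                     # scale to [0, 1] so the depth
        depth = depth / depth.max()         # acts as a soft pixel weight
    return rgb.astype(np.float32) / 255.0, depth
```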
Step S102: and processing the first feature map generated by the 2D image of the detection target by using the 3D depth image of the detection target as the mask data to obtain a second feature map.
Step S102 may specifically include:
convolving and pooling the 2D image of the detection target through N levels of residual blocks and pooling layers to obtain N first feature maps;
performing N-level pooling on the 3D depth image of the detection target to obtain N pooled 3D depth images, where the pooled 3D depth images correspond one-to-one with the first feature maps;
and, for each first feature map, using the corresponding pooled 3D depth image as mask data and dot-multiplying it with the features in the first feature map to obtain a second feature map.
Wherein N is a positive integer.
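A minimal PyTorch sketch of one level of this computation follows; the pooling type and tensor shapes are illustrative assumptions, since the patent only requires that the pooled depth image match the first feature map's dimensions:

```python
import torch
import torch.nn.functional as F

def masked_level(first_feature_map: torch.Tensor,
                 depth: torch.Tensor) -> torch.Tensor:
    """One level of step S102.

    first_feature_map : (B, C, H, W) output of a residual block + pooling
    depth             : (B, 1, H0, W0) 3D depth image, H0 >= H, W0 >= W
    """
    # Pool the depth image down to the feature map's spatial size; this is
    # the only operation applied to the depth branch.
    pooled_depth = F.adaptive_avg_pool2d(depth, first_feature_map.shape[-2:])
    # Use the pooled depth as mask data: element-wise (dot) multiplication,
    # broadcast over the C feature channels, yields the second feature map.
    return first_feature_map * pooled_depth
```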
The structure of the residual block (ResBlock) can be as shown in fig. 2: it comprises a convolution layer with 64 convolution kernels of size 1 × 1, a convolution layer with 64 convolution kernels of size 3 × 3, and a convolution layer with 256 convolution kernels of size 1 × 1. ReLU denotes the activation function.
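Written out in PyTorch, the residual block of fig. 2 might look as follows; the 1 × 1 projection on the shortcut, needed so the 64-channel input can be added to the 256-channel output, is our assumption, as fig. 2 specifies only the three convolution layers and ReLU:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Bottleneck residual block per fig. 2: 1x1/64 -> 3x3/64 -> 1x1/256."""

    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 64, kernel_size=1)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
        self.conv3 = nn.Conv2d(64, 256, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        # Assumed 1x1 projection so the skip connection matches 256 channels.
        self.project = nn.Conv2d(in_channels, 256, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return self.relu(out + self.project(x))
```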
As an alternative embodiment, the above-described residual block may be replaced with a generic block (Block) structure.
As an alternative embodiment, the dot-multiplication processing may be replaced with addition processing or merge (concatenate) processing, as sketched below.
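For illustration, the three variants differ only in the fusion operator applied between the first feature map and the pooled depth mask; a sketch, with shapes as in the earlier level function:

```python
import torch

def fuse(feat: torch.Tensor, mask: torch.Tensor, mode: str = "mul"):
    """Fuse a (B, C, H, W) first feature map with its (B, 1, H, W) mask."""
    if mode == "mul":   # dot multiplication (embodiment of fig. 3)
        return feat * mask
    if mode == "add":   # addition (embodiment of fig. 4)
        return feat + mask
    if mode == "cat":   # merge: append the mask as an extra channel
        return torch.cat([feat, mask], dim=1)
    raise ValueError(f"unknown fusion mode: {mode}")
```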
Step S103: and inputting the second characteristic diagram into a target detection network to obtain the category information and the position information corresponding to the detected target.
The target detection network may be a convolutional network composed of convolution layers, or the detection layer or detection module of an existing target detection network; for example, the last detection layer of an SSD (Single Shot MultiBox Detector), or the RPN (Region Proposal Network) detection module of Faster R-CNN (Faster Region-based Convolutional Neural Network), may be used.
Fig. 3 is a schematic structural diagram of a target detection model according to a second embodiment of the present invention. In the second embodiment, after the 2D image and 3D depth image of the detection target are acquired, they are input into the target detection model shown in fig. 3. The target detection model includes blocks, first pooling layers, second pooling layers, and a detection layer. Each block comprises multiple convolution layers; in this embodiment the block may adopt the residual-block structure, whose details are described with reference to fig. 2 and not repeated here. The blocks and first pooling layers (each first pooling layer connected after its block) perform multi-level convolution and pooling on the 2D image, while the multi-level second pooling layers pool the 3D depth image. Fig. 3 shows four levels only as an example; the target detection model of this embodiment is not limited to four levels of blocks and pooling layers.
The first pooling layer and the second pooling layer of the same level can be implemented as the same pooling layer; that is, one pooling layer can pool both the 2D image and the 3D depth image.
The four levels of blocks and first pooling layers in fig. 3 output four first feature maps (one per level), and the second pooling layers output four pooled 3D depth images (one per level); the first feature map generated at each level corresponds to the pooled 3D depth image generated at that level. For each first feature map, the corresponding pooled 3D depth image is used as mask data and dot-multiplied with the features in the first feature map to obtain a second feature map. For example, in fig. 3 the pooled 3D depth map produced by second pooling layer 1 is used as mask data and dot-multiplied with the first feature map produced by block + first pooling layer 1, yielding one second feature map; the other three levels do the same, finally producing four second feature maps. The four second feature maps are input to the detection layer for target detection, generating the detection result, i.e., the category information and position information (detection frame information) corresponding to the detection target.
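To make the data flow concrete, here is a compact PyTorch sketch of the fig. 3 wiring under stated assumptions: plain convolution blocks stand in for the residual blocks, the channel counts and max pooling are our choices, and the detection layer is reduced to one convolution per level producing class scores plus four detection-frame offsets:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthMaskedDetector(nn.Module):
    """Fig. 3 data flow: per level, block + first pooling on the 2D branch,
    second pooling only on the depth branch, dot-multiplication masking,
    then a per-level detection convolution."""

    def __init__(self, num_classes: int, num_levels: int = 4):
        super().__init__()
        chans = [3, 64, 128, 256, 256]
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(chans[i], chans[i + 1], 3, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(num_levels))
        self.detect = nn.ModuleList(
            nn.Conv2d(chans[i + 1], num_classes + 4, 3, padding=1)
            for i in range(num_levels))

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor):
        results, feat = [], rgb
        for block, head in zip(self.blocks, self.detect):
            feat = F.max_pool2d(block(feat), 2)   # block + first pooling
            depth = F.max_pool2d(depth, 2)        # second pooling only
            feat = feat * depth                   # depth mask (dot product)
            results.append(head(feat))            # category + box maps
        return results
```

In this sketch each level halves the spatial resolution, so the four detection maps come out at strides 2, 4, 8, and 16 relative to the input pair.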
In fig. 3, the ⊙ symbol represents dot multiplication: the features extracted from the 2D color image are dot-multiplied with the 3D depth image used as a mask (i.e., mask data), so that the pixel values of different planes are weighted differently, which can greatly improve the distinguishability of the detection target and each of its planes. Moreover, the embodiment of the invention only performs pooling (pool) operations on the 3D depth image, to keep its dimensions consistent with those of the features (feature); this adds essentially no extra computational burden and thereby saves detection cost.
The detection layer in embodiments of the present invention may be a convolutional network formed of convolution layers, or may adopt the detection layer or detection module of an existing target detection network. The detection layer convolves each second feature map separately and extracts the category information and detection frame information simultaneously.
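As a sketch of what "simultaneously" can mean with such a head, the per-level output channels may simply be split into the two kinds of information; the channel layout below is an assumed convention, not fixed by the patent:

```python
import torch

def split_detections(head_out: torch.Tensor, num_classes: int):
    """Split a (B, num_classes + 4, H, W) detection-layer output into
    per-cell category scores and detection-frame offsets."""
    cls_scores = head_out[:, :num_classes]        # category information
    box_offsets = head_out[:, num_classes:]       # (dx, dy, dw, dh)
    return cls_scores.softmax(dim=1), box_offsets
```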
Fig. 4 shows the structure of a target detection model according to a third embodiment of the present invention. The structure is similar to that of the second embodiment, except that for each first feature map, the corresponding pooled 3D depth image is used as mask data and added to the features in the first feature map (the ⊕ symbol in the figure represents the addition operation) to obtain a second feature map. Likewise, the target detection model of this embodiment is not limited to four levels of blocks and pooling layers. For other implementation details, refer to the description of the second embodiment.
The target detection model in embodiments of the present invention is a trained model. In the model training stage, a 2D color camera and a 3D depth camera can be used to collect real data (2D images and 3D depth images) of detection targets as training data; the training data are labeled manually (for the goods-detection scenario, the position and category of each box, i.e., goods packaging box, are labeled); the parameters of each layer of the model are then trained, finally yielding the trained target detection model. Various training methods for target detection models may be adopted, such as stochastic gradient descent with back-propagation. Training the target detection model on both the 3D depth image and the 2D color image overcomes the limited information extracted by existing target detection models: the model uses the depth as mask data to modify the color image before detection, which improves detection accuracy. When deployed for warehouse picking, the target detection model of the embodiments of the present invention can greatly improve the detection rate of goods.
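A hedged sketch of one such training step follows; the `detection_loss` function (which would match labeled boxes to levels and combine classification and detection-frame regression terms) and the `train_loader` of manually labeled samples are hypothetical placeholders, since the patent only states that standard methods such as stochastic gradient descent with back-propagation apply:

```python
import torch

# model: the DepthMaskedDetector sketched above.
# detection_loss and train_loader are hypothetical placeholders.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

for epoch in range(50):
    for rgb, depth, targets in train_loader:
        outputs = model(rgb, depth)            # per-level class + box maps
        loss = detection_loss(outputs, targets)
        optimizer.zero_grad()
        loss.backward()                        # back-propagation
        optimizer.step()                       # stochastic gradient descent
```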
The fourth embodiment of the present invention takes commodity detection as an example; as shown in fig. 5, its commodity detection process includes steps S501 to S504.
Step S501: and collecting a 2D image and a 3D depth image of the commodity, and obtaining a training sample through manual marking.
Step S502: and constructing a commodity detection model, wherein the commodity detection model comprises a residual block, a pooling layer and a detection layer.
Step S503: and training the commodity detection model based on the depth image and the color image by using the training sample.
Step S504: and inputting the 2D image and the 3D depth image of the commodity to be detected into the trained commodity detection model to obtain the category and the detection frame information of the commodity to be detected.
This embodiment describes the target detection steps using a commodity as an example; for the detailed implementation of each step and the structure of the commodity detection model, refer to the descriptions of the other embodiments above.
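Tying the pieces together at inference time (step S504), here is a minimal usage sketch reusing the hypothetical helpers from the earlier sketches; `rgb_raw` and `depth_raw` stand for the captured images, and `num_classes=20` is an arbitrary choice:

```python
import torch

model.eval()
with torch.no_grad():
    rgb_np, depth_np = preprocess(rgb_raw, depth_raw)      # sketch above
    rgb = torch.from_numpy(rgb_np).permute(2, 0, 1)[None]  # (1, 3, H, W)
    depth = torch.from_numpy(depth_np)[None, None]         # (1, 1, H, W)
    outputs = model(rgb, depth)
    # Category and detection-frame information from the finest level:
    scores, boxes = split_detections(outputs[-1], num_classes=20)
```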
Fig. 6 is a schematic diagram of the main modules of a target detection apparatus according to a fifth embodiment of the present invention.
The target detection apparatus 600 according to the fifth embodiment of the present invention mainly includes: an image acquisition module 601, a feature map processing module 602, and a target detection module 603.
The image acquisition module 601 is configured to acquire a 2D image and a 3D depth image of a detection target.
The feature map processing module 602 is configured to process the first feature maps generated from the 2D image, using the 3D depth image as mask data, to obtain second feature maps.
The feature map processing module 602 is specifically configured to:
convolving and pooling the 2D image of the detection target through N levels of blocks and pooling layers to obtain N first feature maps;
performing N-level pooling on the 3D depth image of the detection target to obtain N pooled 3D depth images, where the pooled 3D depth images correspond one-to-one with the first feature maps and N is a positive integer;
and, for each first feature map, using the corresponding pooled 3D depth image as mask data and applying the preset processing to the features in the first feature map to obtain a second feature map.
As an alternative implementation, the block may take the structure of a residual block.
The preset processing is one of dot multiplication, addition, and merging (concatenation).
The target detection module 603 is configured to input the second feature maps into a target detection network to obtain the category information and position information corresponding to the detection target.
The target detection network may be a convolutional network composed of convolution layers, or the detection layer or detection module of an existing target detection network; for example, the last detection layer of an SSD (Single Shot MultiBox Detector), or the RPN (Region Proposal Network) detection module of Faster R-CNN (Faster Region-based Convolutional Neural Network), may be used.
In addition, the detailed implementation of the target detection device in the embodiments of the present invention has already been described in detail in the target detection method above, and is therefore not repeated here.
Fig. 7 shows an exemplary system architecture 700 to which the target detection method or target detection apparatus of embodiments of the present invention may be applied.
As shown in fig. 7, the system architecture 700 may include terminal devices 701, 702, 703, a network 704, and a server 705. The network 704 serves as a medium for communication links between the terminal devices 701, 702, 703 and the server 705. The network 704 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 701, 702, 703 to interact with the server 705 over the network 704, to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 701, 702, 703, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, and social platform software (by way of example only).
The terminal devices 701, 702, 703 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 705 may be a server providing various services, such as a back-end management server (by way of example only) providing support for shopping websites browsed by users of the terminal devices 701, 702, 703. The back-end management server may analyze and otherwise process received data such as a product information query request, and feed back a processing result (for example, product information; by way of example only) to the terminal device.
It should be noted that the target detection method provided by the embodiments of the present invention is generally executed by the server 705; accordingly, the target detection apparatus is generally disposed in the server 705.
It should be understood that the number of terminal devices, networks, and servers in fig. 7 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 8, shown is a block diagram of a computer system 800 suitable for use in implementing a server according to embodiments of the present application. The server shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, the computer system 800 includes a central processing unit (CPU) 801 that can perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the system 800 are also stored. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card, a modem, and the like. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read therefrom is installed into the storage section 808 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program comprising program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. When executed by the central processing unit (CPU) 801, the computer program performs the above-described functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprises an image acquisition module, a characteristic diagram processing module and a target detection module. The names of these modules do not in some cases constitute a limitation on the module itself, and for example, the image acquisition module may also be described as a "module for acquiring a 2D image and a 3D depth image of a detection target".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: collect a 2D image and a 3D depth image of a detection target; use the 3D depth image as mask data to process a first feature map generated from the 2D image, obtaining a second feature map; and input the second feature map into a target detection network to obtain the category information and position information corresponding to the detection target.
According to the technical scheme of the embodiments of the present invention, a 2D image and a 3D depth image of a detection target are collected; the 3D depth image is used as mask data to process the first feature map generated from the 2D image, obtaining a second feature map; and the second feature map is input into a target detection network to obtain the category information and position information corresponding to the detection target. This can improve the distinguishability of the detection target and each of its planes, improve detection accuracy, avoid extra computational burden, and save detection cost.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. A method of object detection, comprising:
collecting a 2D image and a 3D depth image of a detection target;
using the 3D depth image as mask data to process a first feature map generated from the 2D image, obtaining a second feature map;
and inputting the second feature map into a target detection network to obtain the category information and position information corresponding to the detection target.
2. The method of claim 1, wherein the step of processing the first feature map generated from the 2D image using the 3D depth image as mask data to obtain a second feature map comprises:
performing N-level convolution and pooling on the 2D image to obtain N first feature maps;
performing N-level pooling on the 3D depth image to obtain N pooled 3D depth images, wherein the pooled 3D depth images correspond one-to-one with the first feature maps and N is a positive integer;
and, for each first feature map, using the corresponding pooled 3D depth image as mask data and applying preset processing to the features in the first feature map to obtain a second feature map.
3. The method of claim 1, wherein the preset processing comprises one of dot multiplication, addition, and merging.
4. An object detection device, comprising:
the image acquisition module is used for acquiring a 2D image and a 3D depth image of a detection target;
a feature map processing module for processing a first feature map generated from the 2D image, using the 3D depth image as mask data, to obtain a second feature map;
and the target detection module is used for inputting the second characteristic diagram into a target detection network to obtain the category information and the position information corresponding to the detection target.
5. The apparatus of claim 4, wherein the feature map processing module is further configured to:
performing N-level convolution and pooling on the 2D image to obtain N first feature maps;
performing N-level pooling on the 3D depth image to obtain N pooled 3D depth images, wherein the pooled 3D depth images correspond one-to-one with the first feature maps and N is a positive integer;
and, for each first feature map, using the corresponding pooled 3D depth image as mask data and applying preset processing to the features in the first feature map to obtain a second feature map.
6. The apparatus of claim 4, wherein the preset processing comprises one of dot multiplication, addition, and merging.
7. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.
8. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-3.
CN201910019056.4A 2019-01-09 2019-01-09 Target detection method and device Pending CN111428729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910019056.4A CN111428729A (en) 2019-01-09 2019-01-09 Target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910019056.4A CN111428729A (en) 2019-01-09 2019-01-09 Target detection method and device

Publications (1)

Publication Number Publication Date
CN111428729A true CN111428729A (en) 2020-07-17

Family

ID=71545964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910019056.4A Pending CN111428729A (en) 2019-01-09 2019-01-09 Target detection method and device

Country Status (1)

Country Link
CN (1) CN111428729A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112288716A (en) * 2020-10-28 2021-01-29 中冶赛迪重庆信息技术有限公司 Steel coil bundling state detection method, system, terminal and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170304732A1 (en) * 2014-11-10 2017-10-26 Lego A/S System and method for toy recognition
CN107368810A (en) * 2017-07-20 2017-11-21 北京小米移动软件有限公司 Method for detecting human face and device
CN108171212A (en) * 2018-01-19 2018-06-15 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN108895981A (en) * 2018-05-29 2018-11-27 南京怀萃智能科技有限公司 A kind of method for three-dimensional measurement, device, server and storage medium
US20190005657A1 (en) * 2017-06-30 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd . Multiple targets-tracking method and apparatus, device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170304732A1 (en) * 2014-11-10 2017-10-26 Lego A/S System and method for toy recognition
US20190005657A1 (en) * 2017-06-30 2019-01-03 Baidu Online Network Technology (Beijing) Co., Ltd . Multiple targets-tracking method and apparatus, device and storage medium
CN107368810A (en) * 2017-07-20 2017-11-21 北京小米移动软件有限公司 Method for detecting human face and device
CN108171212A (en) * 2018-01-19 2018-06-15 百度在线网络技术(北京)有限公司 For detecting the method and apparatus of target
CN108895981A (en) * 2018-05-29 2018-11-27 南京怀萃智能科技有限公司 A kind of method for three-dimensional measurement, device, server and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘帆; 刘鹏远; 张峻宁; 徐彬彬: "RGB-D image joint detection based on a dual-stream convolutional neural network", Laser & Optoelectronics Progress, no. 02 *
王孙平; 陈世峰: "Convolutional neural network semantic segmentation method fusing depth images", Journal of Integration Technology, no. 05, pages 58-66 *
邓志鹏; 孙浩; 雷琳; 周石琳; 邹焕新: "Object detection in high-resolution remote sensing imagery based on multi-scale deformable-feature convolutional networks", Acta Geodaetica et Cartographica Sinica, no. 09 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112288716A (en) * 2020-10-28 2021-01-29 中冶赛迪重庆信息技术有限公司 Steel coil bundling state detection method, system, terminal and medium
CN112288716B (en) * 2020-10-28 2023-10-27 中冶赛迪信息技术(重庆)有限公司 Method, system, terminal and medium for detecting bundling state of steel coil

Similar Documents

Publication Publication Date Title
CN109508681B (en) Method and device for generating human body key point detection model
US10762387B2 (en) Method and apparatus for processing image
CN108734185B (en) Image verification method and device
CN110632608B (en) Target detection method and device based on laser point cloud
CN109711508B (en) Image processing method and device
CN108229418B (en) Human body key point detection method and apparatus, electronic device, storage medium, and program
CN109272543B (en) Method and apparatus for generating a model
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN109118456B (en) Image processing method and device
CN111815738B (en) Method and device for constructing map
CN110633717A (en) Training method and device for target detection model
CN111160410B (en) Object detection method and device
CN115861400A (en) Target object detection method, training method and device and electronic equipment
CN115937546A (en) Image matching method, three-dimensional image reconstruction method, image matching device, three-dimensional image reconstruction device, electronic apparatus, and medium
CN115311469A (en) Image labeling method, training method, image processing method and electronic equipment
CN113657411A (en) Neural network model training method, image feature extraction method and related device
CN113052143A (en) Handwritten digit generation method and device
CN111428536B (en) Training method and device for detecting network for detecting article category and position
CN111815683B (en) Target positioning method and device, electronic equipment and computer readable medium
CN111428729A (en) Target detection method and device
CN108921792B (en) Method and device for processing pictures
CN111488890B (en) Training method and device for object detection model
CN110634155A (en) Target detection method and device based on deep learning
CN111968030B (en) Information generation method, apparatus, electronic device and computer readable medium
CN110033420B (en) Image fusion method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210301

Address after: Room a1905, 19 / F, building 2, No. 18, Kechuang 11th Street, Daxing District, Beijing, 100176

Applicant after: Beijing Jingdong Qianshi Technology Co.,Ltd.

Address before: 101, 1st floor, building 2, yard 20, Suzhou street, Haidian District, Beijing 100080

Applicant before: Beijing Jingbangda Trading Co.,Ltd.

Effective date of registration: 20210301

Address after: 101, 1st floor, building 2, yard 20, Suzhou street, Haidian District, Beijing 100080

Applicant after: Beijing Jingbangda Trading Co.,Ltd.

Address before: 100086 8th Floor, 76 Zhichun Road, Haidian District, Beijing

Applicant before: BEIJING JINGDONG SHANGKE INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: BEIJING JINGDONG CENTURY TRADING Co.,Ltd.

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination