CN111767914A - Target object detection device and method, image processing system, and storage medium - Google Patents

Target object detection device and method, image processing system, and storage medium

Info

Publication number
CN111767914A
CN111767914A CN201910255843.9A
Authority
CN
China
Prior art keywords
target object
detection
candidate
geometric information
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910255843.9A
Other languages
Chinese (zh)
Inventor
黄耀海
李岩
金浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to CN201910255843.9A priority Critical patent/CN111767914A/en
Publication of CN111767914A publication Critical patent/CN111767914A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a target object detection apparatus and method, an image processing system, and a storage medium. The target object detection apparatus includes: a unit that extracts features from an image; a unit that generates, on the image, candidate detection regions having pre-generated geometric information, based on the extracted features and the pre-generated geometric information, wherein the pre-generated geometric information can describe at least the overall shape of a target object; a unit that detects candidate target objects in the image from the generated candidate detection regions based on the extracted features; and a unit that determines the target object in the image based on the detected candidate target objects. According to the present disclosure, the detection accuracy of the target object, that is, the recall rate of the target object and the accuracy of its localization, can be effectively improved.

Description

Target object detection device and method, image processing system, and storage medium
Technical Field
The present disclosure relates to image processing and, more particularly, to the detection of a target object in an image.
Background
Detecting a target object (such as a human face or a human body) in an image or video has important application value for subsequent image processing tasks (such as face recognition, person tracking, and people counting). The current conventional approach to detecting a target object is to scan the image in a sliding-window manner using a set of pre-generated regular rectangular detection windows (i.e., rectangular detection regions), thereby detecting the target object in the image.
For example, an exemplary technique for detecting a target object using a neural network is disclosed in U.S. Patent No. US9858496B2. The neural network used in this technique includes a Region Proposal Network (RPN) layer, and the RPN layer uses a plurality of rectangular detection regions with fixed aspect ratios and fixed scales. The technique mainly proceeds as follows: first, the features extracted from an input image are scanned by the RPN layer in a sliding-window manner, and the scanning results are projected onto the input image to obtain a plurality of candidate rectangular detection regions; then, classification and localization operations are performed on these candidate rectangular detection regions to determine the final target object.
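For illustration, the sketch below shows one way such a fixed anchor set could be generated; it is a minimal Python sketch of the general RPN anchor scheme under assumed default values, not code from the cited patent, and the function name `generate_anchors` and its parameter values are illustrative.

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    """Generate RPN-style rectangular anchors (x1, y1, x2, y2) centered at the origin.

    Every anchor has a fixed aspect ratio (h/w) and a fixed scale, so the set
    can only describe regular rectangular shapes, which is exactly the
    limitation the present disclosure addresses.
    """
    anchors = []
    for ratio in ratios:
        for scale in scales:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)  # w * h = area with h = ratio * w
            h = ratio * w
            anchors.append([-w / 2.0, -h / 2.0, w / 2.0, h / 2.0])
    return np.array(anchors)

# The same anchor set is replicated at every position of the sliding-window
# scan over the feature map, then projected onto the input image.
print(generate_anchors().shape)  # (9, 4): 3 ratios x 3 scales
```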
As described above, this method uses pre-generated detection regions consisting of a plurality of regular rectangles, so it can effectively detect a target object presented in its overall form, that is, a target object with a regular shape. In many scenes, however, the target object is often occluded (e.g., partially blocked by other objects) and/or deformed (e.g., because the pose of the target object or the capture angle of the image changes). That is, in many scenes the target object is presented in a local and/or deformed form, so that only a local shape of the target object is visible. Since a rectangular detection region cannot describe the local shape of a target object well, such a target object cannot be detected well from the image by the above method, and the detection accuracy of the target object suffers. In other words, the recall rate of the target object and the accuracy of its localization are affected.
Disclosure of Invention
In view of the above description in the background, the present disclosure is directed to solving at least one of the problems set forth above.
According to an aspect of the present disclosure, there is provided a target object detection apparatus including: an extraction unit that extracts a feature from an image; a generation unit that generates a detection region candidate having pre-generated geometric information on the image based on the extracted features and the pre-generated geometric information, the pre-generated geometric information being capable of describing at least an overall shape of a target object; a detection unit that detects a target object candidate in the image from the generated detection region candidate based on the extracted features; and a determination unit that determines a target object in the image based on the detected target object candidates.
According to another aspect of the present disclosure, there is provided a target object detection method including: an extraction step of extracting features from the image; a generation step of generating a candidate detection region having pre-generated geometric information on the image based on the extracted features and the pre-generated geometric information, wherein the pre-generated geometric information can describe at least an overall shape of a target object; a detection step of detecting a target object candidate in the image from the generated detection region candidate based on the extracted features; and a determination step of determining a target object in the image based on the detected target object candidates.
Wherein the pre-generated geometrical information is further capable of describing a local shape of the target object. Wherein the pre-generated geometric information is constituted by, for example, a bitmap or keypoints of the target object.
According to still another aspect of the present disclosure, there is provided an image processing system including: an acquisition device for acquiring an image or video; the target object detection apparatus as described above, for detecting a face in the acquired image or video; and post-processing means for performing a subsequent image processing operation based on the detected face; wherein the acquisition device, the target object detection device, and the post-processing device are connected to each other via a network.
According to yet another aspect of the present disclosure, there is provided a storage medium having stored thereon instructions that, when executed by a processor, cause a target object detection method to be performed, the target object detection method comprising: an extraction step of extracting features from the image; a generation step of generating a candidate detection region having pre-generated geometric information on the image based on the extracted features and the pre-generated geometric information, wherein the pre-generated geometric information can describe at least an overall shape of a target object; a detection step of detecting a target object candidate in the image from the generated detection region candidate based on the extracted features; and a determination step of determining a target object in the image based on the detected target object candidates. Wherein the pre-generated geometrical information is further capable of describing a local shape of the target object. Wherein the pre-generated geometric information is constituted by, for example, a bitmap or keypoints of the target object.
In the present disclosure, when generating the candidate detection regions, geometric information capable of describing the shape of the target object is used, so that the generated candidate detection regions carry information that can describe that shape. The described shape may be the overall shape of the target object or a local shape of the target object. Therefore, the present disclosure can effectively detect the target object from the image regardless of whether the target object is presented in its overall form or in a local and/or deformed form. Accordingly, the detection accuracy of the target object, that is, the recall rate of the target object and the accuracy of its localization, can be effectively improved.
Other features and advantages of the present disclosure will become apparent from the following description of exemplary embodiments, which refers to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the disclosure and together with the description of the embodiments, serve to explain the principles of the disclosure.
Fig. 1A to 1C schematically illustrate examples of occluded faces according to the present disclosure.
Fig. 2 schematically shows an example of a partially occluded human body according to the present disclosure.
Fig. 3A to 3D schematically show examples of geometric information capable of describing the overall/local shape of a target object according to the present disclosure.
Fig. 4 is a block diagram schematically illustrating a hardware configuration in which the technology according to the embodiment of the present disclosure can be implemented.
Fig. 5 is a block diagram illustrating the configuration of a target object detection apparatus according to an embodiment of the present disclosure.
Fig. 6 schematically illustrates a schematic structure of a pre-generated neural network that may be used with embodiments of the present disclosure.
Fig. 7 schematically illustrates a flow chart of a target object detection method according to an embodiment of the present disclosure.
Fig. 8A to 8C schematically show examples of generated candidate detection regions, detected candidate target objects, and determined target objects according to an embodiment of the present disclosure.
Fig. 9A to 9E schematically show examples of determination of the degree of overlap between two candidate target objects according to an embodiment of the present disclosure.
Fig. 10 shows an arrangement of an exemplary image processing system according to the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. It should be noted that the following description is merely illustrative and exemplary in nature and is in no way intended to limit the disclosure, its application, or uses. The relative arrangement of components and steps, numerical expressions, and numerical values set forth in the embodiments do not limit the scope of the present disclosure unless specifically stated otherwise. Additionally, techniques, methods, and apparatus known to those skilled in the art may not be discussed in detail, but are intended to be part of the present specification where appropriate.
Note that like reference numerals and letters refer to like items in the drawings, and thus, once an item is defined in a drawing, it is not necessary to discuss it in the following drawings.
As described above, in many scenes some target objects are occluded and/or deformed and are therefore presented in a local and/or deformed form, so that only a local shape of the target object is visible. The inventors have observed that, on the one hand, a given class of target object presented in a local shape usually exhibits only a small number of local or deformed forms. For example, when the target object is a human face, typical local forms are eyes blocked by a foreign object (e.g., as shown in fig. 1A), a mouth blocked by a mask (e.g., as shown in fig. 1B), or a face partially blocked by another person (e.g., person 110 as shown in fig. 1C). When the target object is a human body, a typical local form is a body partially occluded by others (e.g., person 210 as shown in fig. 2). Therefore, the inventors consider that, for a given class of target object, geometric information capable of describing its local shapes can be obtained by clustering or statistical methods from samples in which that class of target object is presented in a local and/or deformed form under the detection viewing angle, where the local shape of the target object is labeled in each sample. Likewise, some target objects are presented in their overall form, so that their overall shape is visible; for such a class of target objects, geometric information capable of describing the overall shape can be obtained by clustering or statistical methods from samples in which that class of target object is presented in its overall form under the detection viewing angle, where the overall shape is labeled in each sample. In this way, geometric information that can describe the overall/local shape of a class of target objects, i.e., the "pre-generated geometric information" used in the present disclosure, can be obtained.
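As a sketch of how such pre-generated geometric information might be derived by clustering, the following assumes the labeled overall/local shapes are available as binary masks already cropped and resized to a common grid; the function name, template count, and the use of k-means are illustrative assumptions rather than the disclosed procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_shape_templates(masks, n_templates=8, grid=(32, 32)):
    """Derive representative shape templates from labeled shape samples.

    masks: iterable of binary foreground masks of shape `grid`, each cropped
    to a labeled overall or local shape under the detection viewing angle.
    Returns n_templates binarized mean shapes, one bitmap per cluster.
    """
    X = np.stack([np.asarray(m, dtype=np.float32).reshape(-1) for m in masks])
    km = KMeans(n_clusters=n_templates, n_init=10, random_state=0).fit(X)
    # Each cluster center is an average shape; thresholding yields a bitmap.
    return km.cluster_centers_.reshape(n_templates, *grid) > 0.5
```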
On the other hand, whether a target object is presented in its overall shape or in a local shape, the inventors consider that the geometric information describing its shape can take several forms. In one form, where the shape of the target object itself is rectangular, the geometric information can be constituted directly by rectangular regions of different sizes (e.g., as shown in fig. 3A). In another form, the geometric information can be constituted by a bitmap obtained by projecting the foreground contour of the target object onto rectangular regions of different sizes. For example, for the target object shown in fig. 1A, the corresponding geometric information is a bitmap as shown in fig. 3B; for the target object shown in fig. 2, it is a bitmap as shown in fig. 3C. In yet another form, the geometric information can be constituted by a set of keypoints (landmarks) that describe the shape contour or apparent structure of the target object within rectangular regions of different sizes. For example, for the target object shown in fig. 1C, the corresponding geometric information is a set of keypoints as shown in fig. 3D.
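A minimal sketch of how the three forms of geometric information described above (rectangle, bitmap, keypoints) could be held in one structure; the `GeometricInfo` type and its example values are illustrative, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class GeometricInfo:
    """One pre-generated shape descriptor, in one of the three forms above."""
    box: tuple                              # enclosing rectangle (x1, y1, x2, y2)
    bitmap: Optional[np.ndarray] = None     # binary foreground mask inside the box
    keypoints: Optional[np.ndarray] = None  # (K, 2) landmark coordinates inside the box

# A plain rectangle, as in fig. 3A: the box alone describes the shape.
whole = GeometricInfo(box=(0, 0, 64, 64))
# A partially occluded face, as in fig. 3B: box plus projected foreground bitmap.
occluded = GeometricInfo(box=(0, 0, 64, 64),
                         bitmap=np.zeros((64, 64), dtype=bool))
# A face described by landmarks, as in fig. 3D: box plus a keypoint set.
landmarked = GeometricInfo(box=(0, 0, 64, 64),
                           keypoints=np.array([[20, 24], [44, 24], [32, 48]]))
```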
Thus, the present disclosure can generate corresponding candidate detection regions using the geometric information obtained and constituted in the above manner, and thereby detect the corresponding target objects from the image. Since this geometric information can describe the overall shape of the target object as well as its local shapes, the present disclosure can effectively detect the target object from the image regardless of whether it is presented in its overall form or in a local and/or deformed form. Accordingly, the detection accuracy of the target object, that is, the recall rate of the target object and the accuracy of its localization, can be effectively improved.
(hardware construction)
A hardware configuration that can implement the technique described hereinafter will be described first with reference to fig. 4.
The hardware configuration 400 includes, for example, a Central Processing Unit (CPU) 410, a Random Access Memory (RAM) 420, a Read Only Memory (ROM) 430, a hard disk 440, an input device 450, an output device 460, a network interface 470, and a system bus 480. Further, in one implementation, hardware configuration 400 may be implemented by a computer, such as a tablet, laptop, desktop, or other suitable electronic device. In another implementation, hardware configuration 400 may be implemented by a monitoring device, such as a digital camera, video camera, web camera, or other suitable electronic device. Where hardware configuration 400 is implemented by a monitoring device, hardware configuration 400 also includes, for example, an optical system 490.
In one implementation, a target object detection apparatus in accordance with the present invention is constructed from hardware or firmware and used as a module or component of hardware configuration 400. For example, the target object detection apparatus 500, which will be described in detail below with reference to fig. 5, may be used as a module or component of the hardware configuration 400. In another implementation, the target object detection apparatus according to the present invention is constructed from software stored in the ROM 430 or the hard disk 440 and executed by the CPU 410. For example, the process 700, which will be described in detail below with reference to fig. 7, may be stored as a program in the ROM 430 or the hard disk 440.
The CPU 410 is any suitable programmable control device, such as a processor, and can perform the various functions described hereinafter by executing various application programs stored in a memory such as the ROM 430 or the hard disk 440. The RAM 420 is used to temporarily store programs or data loaded from the ROM 430 or the hard disk 440, and also serves as the workspace in which the CPU 410 executes various processes (such as implementing the techniques that will be described in detail below with reference to figs. 7 to 9E) and other available functions. The hard disk 440 stores a variety of information, such as an operating system (OS), various applications, control programs, videos, images, pre-generated networks (e.g., neural networks), pre-generated geometric information, and the like.
In one implementation, input device 450 is used to allow a user to interact with hardware configuration 400. In one example, a user may input video/images through input device 450. In another example, a user may trigger a corresponding process of the present invention through input device 450. In addition, input device 450 may take a variety of forms, such as a button, a keyboard, or a touch screen. In another implementation, the input device 450 is used to receive video/images output from specialized electronic devices such as digital cameras, video cameras, and/or web cameras. Additionally, where hardware configuration 400 is implemented by a monitoring device, optical system 490 in hardware configuration 400 would directly capture video/images of the monitoring site.
In one implementation, the output device 460 is used to display the detection results (such as the detected target object) to the user. Also, the output device 460 may take various forms such as a Cathode Ray Tube (CRT) or a liquid crystal display. In another implementation, the output device 460 is used to output detection results to subsequent image processing such as face recognition, person tracking, people counting, and the like.
The network interface 470 provides an interface for connecting the hardware configuration 400 to a network. For example, the hardware configuration 400 may communicate data, via the network interface 470, with other electronic devices connected through the network. Optionally, the hardware configuration 400 may be provided with a wireless interface for wireless data communication. The system bus 480 may provide a data transmission path for mutually transmitting data among the CPU 410, the RAM 420, the ROM 430, the hard disk 440, the input device 450, the output device 460, the network interface 470, the optical system 490, and the like. Although referred to as a bus, the system bus 480 is not limited to any particular data transfer technique.
The hardware configuration 400 described above is merely illustrative and is in no way intended to limit the present invention, its applications, or uses. Also, only one hardware configuration is shown in FIG. 4 for simplicity. However, a plurality of hardware configurations may be used as necessary.
(target object detecting device and method)
The detection process according to the present invention will be described next with reference to fig. 5 to 9E.
Fig. 5 is a block diagram illustrating the configuration of a target object detection apparatus 500 according to an embodiment of the present disclosure. Some or all of the modules shown in fig. 5 may be implemented by dedicated hardware. As shown in fig. 5, the target object detection apparatus 500 includes an extraction unit 510, a generation unit 520, a detection unit 530, and a determination unit 540.
In addition, the storage device 550 shown in fig. 5 stores, for example, at least pre-generated geometric information capable of describing the shape (e.g., overall shape, local shape) of the target object. In one implementation, the storage device 550 is the ROM430 or the hard disk 440 shown in FIG. 4. In another implementation, the storage device 550 is a server or an external storage device connected to the target object detection apparatus 500 via a network (not shown).
First, in one implementation, for example, in a case where the hardware configuration 400 shown in fig. 4 is implemented by a computer, the input device 450 receives an image output from a dedicated electronic device (e.g., a camera or the like) or input by a user. The input device 450 then transmits the received image to the target object detection apparatus 500 via the system bus 480. In another implementation, for example, where the hardware configuration 400 is implemented by a monitoring device, the target object detection apparatus 500 directly uses the image captured by the optical system 490.
Then, as shown in fig. 5, the extraction unit 510 extracts features from the received image (i.e., the entire image). In one implementation, the extraction unit 510 extracts, for example, a deep convolutional feature map from the received image using various feature extraction operators, such as, but not limited to, convolutional neural networks with structures such as VGG16, ResNet, or SENet.
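For illustration, a minimal PyTorch sketch of this step with an off-the-shelf VGG16 backbone; the input resolution is an assumption, and any of the backbones named above could be substituted.

```python
import torch
import torchvision

# The convolutional part of VGG16 serves as the feature extraction operator;
# ResNet, SENet, or another backbone could be used instead.
backbone = torchvision.models.vgg16().features.eval()

image = torch.randn(1, 3, 512, 512)  # stand-in for the received image
with torch.no_grad():
    feature_map = backbone(image)    # deep convolutional feature map
print(feature_map.shape)             # torch.Size([1, 512, 16, 16]), stride 32
```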
The generation unit 520 generates, on the received image, candidate detection regions having the pre-generated geometric information, based on the features extracted by the extraction unit 510 and the pre-generated geometric information stored in the storage device 550, where the pre-generated geometric information can describe at least the overall shape of the target object. For a class of target objects, the geometric information describing the overall shape of that class may be obtained, for example, by clustering or statistical methods from samples in which that class of target object is presented in its overall form under the detection viewing angle, where the samples are labeled with the overall shape of the class. As described above, in order to also effectively detect target objects presented in a local and/or deformed form, the pre-generated geometric information can further describe the local shape of the target object. That is, the pre-generated geometric information can describe not only the overall shape of the target object but also its local shapes. For a class of target objects, the geometric information describing the local shapes of that class may be obtained, for example, by clustering or statistical methods from samples in which that class of target object is presented in a local and/or deformed form under the detection viewing angle, where the local shape of the class is labeled in the samples. As mentioned above, the pre-generated geometric information may be constituted, for example, by a bitmap or by keypoints of the target object.
In one implementation, the generation unit 520 may generate the candidate detection regions on the received image as follows: corresponding regions are first determined on the extracted features using the pre-generated geometric information obtained from the storage device 550, and the determined regions are then mapped onto the received image to obtain the candidate detection regions.
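A minimal sketch of that mapping under the common assumption that the backbone has a fixed total stride (32 for the VGG16 sketch above); the function name is illustrative.

```python
def feature_region_to_image(region, stride=32):
    """Map a region determined on the feature map back onto the input image.

    region: (x1, y1, x2, y2) in feature-map cells; the result is the
    corresponding candidate detection region in image pixel coordinates.
    """
    x1, y1, x2, y2 = region
    return (x1 * stride, y1 * stride, x2 * stride, y2 * stride)

# A 2x3-cell region on the feature map becomes a 64x96-pixel region on the image.
print(feature_region_to_image((4, 5, 6, 8)))  # (128, 160, 192, 256)
```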
After generating the candidate detection regions, the detection unit 530 detects the candidate target objects in the received image from the candidate detection regions generated by the generation unit 520 based on the features extracted by the extraction unit 510. Also, the determination unit 540 determines a target object in the received image based on the target object candidate detected by the detection unit 530.
Finally, the determination unit 540 transmits the detection result (e.g., the detected target object) to the output device 460 via the system bus 480 shown in fig. 4 for displaying the detection result to the user or for outputting the detection result to subsequent image processing such as face recognition, person tracking, people counting, and the like.
Furthermore, preferably, in one implementation, each unit in the target object detection apparatus 500 shown in fig. 5 (i.e., the extraction unit 510, the generation unit 520, the detection unit 530, and the determination unit 540) may perform its corresponding operation using a pre-generated neural network. On the one hand, as shown in fig. 6 for example, a pre-generated neural network usable with embodiments of the present disclosure includes, for example, a portion (i.e., a sub-network) for extracting features, a portion for generating candidate detection regions, a portion for detecting candidate target objects, and a portion for determining the target object. In the present disclosure, these portions of the neural network may be generated in advance, based on the above-described pre-generated geometric information capable of describing the overall/local shape of the target object, through end-to-end training with back-propagation updates. On the other hand, the pre-generated neural network may be stored, for example, in the storage device 550.
Specifically, in one aspect, the target object detection apparatus 500 retrieves a pre-generated neural network from the storage device 550. On the other hand, the extraction unit 510 extracts features from the received image using a portion for extracting features in the neural network. The generating unit 520 generates candidate detection regions on the received image based on the features extracted by the extracting unit 510 and the pre-generated geometric information, using a portion of the neural network for generating the candidate detection regions. The detection unit 530 detects a candidate target object in the received image from the candidate detection area generated by the generation unit 520 based on the feature extracted by the extraction unit 510, using a portion for detecting a candidate target object in the neural network. The determination unit 540 determines the target object in the received image based on the candidate target object detected by the detection unit 530, using a portion for determining the target object in the neural network.
The flowchart 700 shown in fig. 7 is a corresponding process of the target object detection apparatus 500 shown in fig. 5.
As shown in fig. 7, in the extraction step S710, the extraction unit 510 extracts features from the received image.
In the generation step S720, the generation unit 520 obtains the corresponding pre-generated geometric information from the storage device 550 according to the type of the target object, and generates candidate detection regions having the pre-generated geometric information on the received image based on the extracted features and the obtained pre-generated geometric information, wherein the pre-generated geometric information can describe at least the overall shape of the target object. Thus, in one implementation, the candidate detection regions generated by the generation unit 520 include detection regions having the overall shape of the target object. In another implementation, the pre-generated geometric information can describe not only the overall shape of the target object but also its local shapes. In that case, the candidate detection regions generated by the generation unit 520 include detection regions having the overall shape of the target object (referred to as "first candidate detection regions", for example) and detection regions having a local shape of the target object (referred to as "second candidate detection regions", for example). Taking the target object shown in fig. 1C as an example, after the generation step S720 the generated first candidate detection regions are, for example, the solid-line regions 811 to 813 shown in fig. 8A, and the generated second candidate detection regions are, for example, the dashed-line regions 814 to 817 shown in fig. 8A, where the pre-generated geometric information used is constituted by keypoints of the target object (shown as dots in fig. 8A).
After the candidate detection regions are generated, in the detection step S730 the detection unit 530 detects candidate target objects in the received image from the generated candidate detection regions based on the extracted features. In one aspect, the detection unit 530 performs a classification operation on the generated candidate detection regions based on the geometric information they carry. For example, the detection unit 530 classifies each candidate detection region with respect to the target object using a pre-generated classifier or the above-described pre-generated neural network: a candidate detection region that contains the overall or a local shape of the target object is determined to be a "candidate target object", and a candidate detection region that contains neither is determined to be a "non-candidate target object". In another aspect, the detection unit 530 performs a localization operation on the generated candidate detection regions based on the extracted features and the geometric information they carry. For example, the detection unit 530 applies regression processing to each candidate detection region, based on the extracted features, using a pre-generated regressor or the above-described pre-generated neural network, to obtain the final position information of each candidate detection region, where the geometric information in a generated candidate detection region can be regarded as its initial position information. Thus, for each "candidate target object" obtained by the classification operation, its final position information is obtained by the localization operation. Further, in the case where the pre-generated geometric information is constituted by keypoints of the target object, the final position information can be obtained by performing a regression operation on the keypoints in the generated candidate detection regions. Taking the target object shown in fig. 1C as an example, after the detection step S730 the detected candidate target objects are, for example, the solid-line and dashed-line regions shown in fig. 8B.
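As a sketch of the classification and localization operations in step S730, the following shows a simple per-region head with a two-way classification branch and a keypoint-regression branch; the `DetectionHead` name, layer sizes, and keypoint count are assumptions, not the disclosed network.

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Classify each candidate detection region and refine its geometry.

    The classification branch decides target / non-target; the regression
    branch refines the keypoints whose initial values come from the
    pre-generated geometric information of the candidate region.
    """
    def __init__(self, feat_dim=512, num_keypoints=5):
        super().__init__()
        self.cls = nn.Linear(feat_dim, 2)                  # candidate / non-candidate
        self.reg = nn.Linear(feat_dim, num_keypoints * 2)  # per-keypoint (dx, dy)

    def forward(self, region_features):
        return self.cls(region_features), self.reg(region_features)

head = DetectionHead()
feats = torch.randn(10, 512)       # pooled features of 10 candidate regions
scores, offsets = head(feats)      # shapes (10, 2) and (10, 10)
```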
Returning to fig. 7, in the determination step S740, the determination unit 540 determines the target object in the received image based on the detected candidate target objects. In one implementation, the determination unit 540 determines the final target object by performing a selection or merging operation on the detected candidate target objects, based on the geometric information they carry, through a Non-Maximum Suppression (NMS) method. For example, the determination unit 540 performs the selection or merging operation by judging whether candidate target objects belong to the same target object. Specifically, the determination unit 540 determines the final target object, for example, as follows:
First, for any two candidate target objects, the determination unit 540 calculates the distance between them based on the geometric information carried by the two candidates. As described above, the geometric information used by the present disclosure to describe the overall/local shape of the target object may be constituted by a bitmap that projects the foreground contour of the target object onto a rectangular region, by a set of keypoints within a rectangular region that describe the shape contour or apparent structure of the target object, or by a rectangular region that directly describes the shape of the target object itself. Therefore, for any two candidate target objects, the degree of overlap between the geometric information carried by the two candidates can be used as the distance between them.
For example, in the case where the pre-generated geometric information is constituted by bitmaps of target objects, the degree of overlap between the bitmaps carried by the two candidate target objects can be calculated as the distance between them. For example, assuming that the bitmaps of the two candidates are shown as the hatched portions in figs. 9A and 9B, and that the two candidates overlap as shown in fig. 9C, the truly overlapping portion between them is the hatched portion shown in fig. 9D, from which it can be seen that the degree of overlap between the two candidates is small (e.g., 0.3). Without the present disclosure, the overlapping portion between the two candidates would instead be taken as the hatched portion shown in fig. 9E, that is, the degree of overlap between their enclosing rectangular regions would be used as their degree of overlap; the degree of overlap obtained in this way is less accurate, which in turn affects the judgment of whether the two candidates belong to the same target object. As another example, in the case where the pre-generated geometric information is constituted by keypoints of target objects, the degree of overlap between the polygons formed by the keypoints of the two candidates can be calculated as the distance between them. As another example, in the case where the pre-generated geometric information is constituted by rectangular regions that directly describe the shape of the target object itself, the degree of overlap between the rectangular regions of the two candidates can be calculated as the distance between them.
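A minimal sketch of the bitmap-based overlap measure just described, computed over the true foreground pixels (fig. 9D) rather than the enclosing rectangles (fig. 9E); the function name is illustrative, and both masks are assumed to be rendered in the same image coordinate frame.

```python
import numpy as np

def bitmap_overlap(mask_a, mask_b):
    """Intersection-over-union of two candidates' foreground bitmaps."""
    inter = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(inter) / float(union) if union > 0 else 0.0
```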
Then, after the pairwise distances among all the candidate target objects have been calculated, the determination unit 540 merges the candidate target objects belonging to the same target object by the NMS method to obtain the final target objects. For example, for any two candidate target objects, when the distance between them is greater than or equal to a predefined threshold (e.g., TH), the two candidates are judged to belong to the same target object, and either one of them is retained or the two are merged into one. This operation is repeated until the pairwise distances among all remaining candidate target objects are less than TH; the remaining candidates are then taken as the final target objects.
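And a sketch of this merging step as a greedy NMS loop over the bitmap candidates, reusing the `bitmap_overlap` sketch above; treating "retain one of them" as keeping the higher-scoring candidate is an assumption.

```python
def merge_candidates(masks, scores, th=0.5):
    """Greedy NMS: keep the best-scoring candidate, suppress every candidate
    whose overlap with a kept one reaches TH, so that all kept candidates
    end up with pairwise overlap below TH."""
    order = sorted(range(len(masks)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(bitmap_overlap(masks[i], masks[j]) < th for j in kept):
            kept.append(i)
    return kept  # indices of the final target objects
```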
As described above, when judging whether two candidate target objects belong to the same target object, the present disclosure determines the degree of overlap between them using geometric information capable of describing the overall/local shape of the target object, so that a more accurate degree of overlap is obtained and the detection accuracy of the target object is further improved. Taking the target object shown in fig. 1C as an example, after the determination step S740 the determined final target objects are, for example, the portions where the solid-line region and the dashed-line region are located, as shown in fig. 8C.
Finally, returning to fig. 7, the determination unit 540 transmits the detection result (e.g., the detected target object) to the output device 460 via the system bus 480 shown in fig. 4, for displaying the detection result to the user or for outputting the detection result to subsequent image processing such as face recognition, person tracking, people counting, and the like.
Further, as described with reference to fig. 5, each unit in the target object detection apparatus 500 (i.e., the extraction unit 510, the generation unit 520, the detection unit 530, and the determination unit 540) may perform its corresponding operation using a pre-generated neural network. Accordingly, the steps shown in fig. 7 (i.e., the extraction step S710, the generation step S720, the detection step S730, and the determination step S740) may also be performed using the pre-generated neural network.
As described above, the present disclosure may generate respective candidate detection regions by using geometric information capable of describing a shape of a target object to detect the respective target object from an image. Since the used geometric information can describe the overall shape of the target object and also describe the local shape of the target object, the present disclosure can effectively detect the target object in the corresponding scene from the image regardless of the scene presented in the overall shape of the target object or the scene presented in the local form and/or the deformed form of the target object. Therefore, according to the present disclosure, the detection accuracy of the target object, that is, the recall rate of the target object and the accuracy of the positioning of the target object can be effectively improved.
(applications)
Further, as described above, the present invention can be implemented by a computer (e.g., a client server). Thus, as an application, taking implementation by a client server as an example, fig. 10 shows the arrangement of an exemplary image processing system 1000 according to the present invention. In this application, the image processing system 1000 is used, for example, for face recognition, person tracking, or people counting. As shown in fig. 10, the image processing system 1000 includes an acquisition device 1010 (e.g., at least one web camera), a post-processing device 1020, and the target object detection apparatus 500 shown in fig. 5, where the acquisition device 1010, the post-processing device 1020, and the target object detection apparatus 500 are connected to each other via a network 1030. The post-processing device 1020 and the target object detection apparatus 500 may be implemented by the same client server or by different client servers.
As shown in fig. 10, first, the acquisition device 1010 captures an image or video of a place of interest (e.g., a mall entrance, a supermarket entrance, etc.) and transmits the captured image/video to the target object detection device 500 via the network 1030.
The target object detection apparatus 500 detects faces from the captured images/videos as described with reference to figs. 5 to 9E. That is, in this application the target object is a face (e.g., a human face). Therefore, in this application the geometric information used by the target object detection apparatus 500 to describe the overall/local shape of a face is constituted, for example, by a set of face keypoints within a rectangular region (e.g., eye keypoints, mouth keypoints, nose-tip keypoints) that describe the shape contour of the face. Further, in this application, the geometric information capable of describing the overall/local shape of a face can be obtained, for example, from face samples at various detection angles, such as face samples of various types (e.g., frontal faces, profile faces), face samples of various sizes (e.g., large faces, small faces), and occluded face samples (e.g., faces with glasses/sunglasses, faces with masks).
The post-processing device 1020 performs subsequent image processing operations, such as face recognition, person tracking, or people counting, based on the detected face.
All of the units described above are exemplary and/or preferred modules for implementing the processes described in this disclosure. These units may be hardware units (such as field programmable gate arrays (FPGAs), digital signal processors, or application specific integrated circuits) and/or software modules (such as computer-readable programs). The units for carrying out each step have not been described exhaustively above; however, where there is a step that performs a specific procedure, there may be a corresponding functional module or unit (implemented by hardware and/or software) for that procedure. Technical solutions formed by all combinations of the described steps and of the units corresponding to these steps are included in the disclosure of the present application, as long as the technical solutions they form are complete and applicable.
The method and apparatus of the present invention may be implemented in a variety of ways. For example, the methods and apparatus of the present invention may be implemented in software, hardware, firmware, or any combination thereof. The above-described order of the steps of the method is intended to be illustrative only and the steps of the method of the present invention are not limited to the order specifically described above unless specifically indicated otherwise. Furthermore, in some embodiments, the present invention may also be embodied as a program recorded in a recording medium, which includes machine-readable instructions for implementing a method according to the present invention. Accordingly, the present invention also covers a recording medium storing a program for implementing the method according to the present invention.
While some specific embodiments of the present invention have been shown in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are intended to be illustrative only and are not limiting upon the scope of the invention. It will be appreciated by those skilled in the art that the above-described embodiments may be modified without departing from the scope and spirit of the invention. The scope of the invention is to be limited only by the following claims.

Claims (13)

1. A target object detection apparatus, characterized by comprising:
an extraction unit that extracts a feature from an image;
a generation unit that generates a detection region candidate having pre-generated geometric information on the image based on the extracted features and the pre-generated geometric information, the pre-generated geometric information being capable of describing at least an overall shape of a target object;
a detection unit that detects a target object candidate in the image from the generated detection region candidate based on the extracted features; and
a determination unit that determines a target object in the image based on the detected target object candidates.
2. The target object detection apparatus of claim 1, wherein the pre-generated geometric information further describes a local shape of the target object;
wherein the generated detection region candidates include a detection region having an overall shape of the target object and a detection region having a local shape of the target object.
3. The target object detection apparatus according to claim 1 or 2, wherein, for the geometric information in the pre-generated geometric information that can describe the overall shape of the target object, for a class of target objects, the corresponding geometric information is obtained based on samples in which the class of target object is presented in its overall form under the detection perspective, wherein the overall shape of the class of target object is labeled in the samples.
4. The target object detection apparatus according to claim 2, wherein, for the geometric information that can describe the local shape of the target object in the pre-generated geometric information, for one kind of target object, the corresponding geometric information is obtained based on a sample in which the kind of target object is presented in its local form and/or deformed form under the detection perspective, wherein the local shape of the kind of target object is marked in the sample.
5. The target object detection apparatus according to claim 1 or 2, wherein the pre-generated geometric information is constituted by a bitmap or key points of the target object.
6. The target object detection device according to claim 1 or 2, wherein the detection unit detects the target object candidate in the image by performing a classification and localization operation on the generated candidate detection regions based on the extracted features and geometric information possessed therein.
7. The target object detection device according to claim 1 or 2, wherein the determination unit determines the target object in the image by performing a selection or merging operation on the detected candidate target objects based on a distance between the detected candidate target objects;
wherein the distance between any two candidate target objects is obtained by the geometric information of the two candidate target objects.
8. The target object detection apparatus according to claim 1, wherein the extraction unit, the generation unit, the detection unit, and the determination unit perform respective operations using a pre-generated neural network.
9. A target object detection method, characterized by comprising:
an extraction step of extracting features from the image;
a generation step of generating a candidate detection region having pre-generated geometric information on the image based on the extracted features and the pre-generated geometric information, wherein the pre-generated geometric information can describe at least an overall shape of a target object;
a detection step of detecting a target object candidate in the image from the generated detection region candidate based on the extracted features; and
a determination step of determining a target object in the image based on the detected target object candidates.
10. The target object detection method of claim 9, wherein the pre-generated geometric information further describes a local shape of the target object;
wherein the generated detection region candidates include a detection region having an overall shape of the target object and a detection region having a local shape of the target object.
11. The target object detection method according to claim 9, wherein in the extraction step, the generation step, the detection step, and the determination step, respective operations are performed using a neural network generated in advance.
12. An image processing system, characterized in that the image processing system comprises:
an acquisition device for acquiring an image or video;
the target object detection apparatus according to any one of claims 1 to 8, for detecting a face in the acquired image or video; and
a post-processing device that performs a subsequent image processing operation based on the detected face;
wherein the acquisition device, the target object detection device, and the post-processing device are connected to each other via a network.
13. A storage medium having stored thereon instructions that, when executed by a processor, cause a target object detection method to be performed, the target object detection method comprising:
an extraction step of extracting features from the image;
a generation step of generating a candidate detection region having pre-generated geometric information on the image based on the extracted features and the pre-generated geometric information, wherein the pre-generated geometric information can describe at least an overall shape of a target object;
a detection step of detecting a target object candidate in the image from the generated detection region candidate based on the extracted features; and
a determination step of determining a target object in the image based on the detected target object candidates.
CN201910255843.9A 2019-04-01 2019-04-01 Target object detection device and method, image processing system, and storage medium Pending CN111767914A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910255843.9A CN111767914A (en) 2019-04-01 2019-04-01 Target object detection device and method, image processing system, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910255843.9A CN111767914A (en) 2019-04-01 2019-04-01 Target object detection device and method, image processing system, and storage medium

Publications (1)

Publication Number Publication Date
CN111767914A (en) 2020-10-13

Family

ID=72718873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910255843.9A Pending CN111767914A (en) 2019-04-01 2019-04-01 Target object detection device and method, image processing system, and storage medium

Country Status (1)

Country Link
CN (1) CN111767914A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002092592A (en) * 2000-09-12 2002-03-29 Toko Electric Corp System and method for automatic patrol
CN103679128A (en) * 2012-09-24 2014-03-26 中国航天科工集团第二研究院二O七所 Anti-cloud-interference airplane target detection method
CN107622277A (en) * 2017-08-28 2018-01-23 广东工业大学 A kind of complex-curved defect classification method based on Bayes classifier
CN108596952A (en) * 2018-04-19 2018-09-28 中国电子科技集团公司第五十四研究所 Fast deep based on candidate region screening learns Remote Sensing Target detection method
CN109242848A (en) * 2018-09-21 2019-01-18 西华大学 Based on OTSU and GA-BP neural network wallpaper defects detection and recognition methods
CN109274891A (en) * 2018-11-07 2019-01-25 北京旷视科技有限公司 A kind of image processing method, device and its storage medium

Similar Documents

Publication Publication Date Title
CN109255352B (en) Target detection method, device and system
CN109376667B (en) Target detection method and device and electronic equipment
CN108537112B (en) Image processing apparatus, image processing system, image processing method, and storage medium
CN108470332B (en) Multi-target tracking method and device
CN107358149B (en) Human body posture detection method and device
WO2019218824A1 (en) Method for acquiring motion track and device thereof, storage medium, and terminal
US9478039B1 (en) Background modeling and foreground extraction method based on depth image
CN107808111B (en) Method and apparatus for pedestrian detection and attitude estimation
US10891473B2 (en) Method and device for use in hand gesture recognition
EP2339507B1 (en) Head detection and localisation method
Wu et al. Online empirical evaluation of tracking algorithms
Moya-Alcover et al. Modeling depth for nonparametric foreground segmentation using RGBD devices
CN105556539A (en) Detection devices and methods for detecting regions of interest
US11170512B2 (en) Image processing apparatus and method, and image processing system
CN108509994B (en) Method and device for clustering character images
EP2639743A2 (en) Image processing device, image processing program, and image processing method
CN111626082A (en) Detection device and method, image processing device and system
CN114698399A (en) Face recognition method and device and readable storage medium
CN113298852A (en) Target tracking method and device, electronic equipment and computer readable storage medium
JP2017084065A (en) Identity theft detection device
JP7354767B2 (en) Object tracking device and object tracking method
JPWO2018179119A1 (en) Video analysis device, video analysis method, and program
CN109447000B (en) Living body detection method, stain detection method, electronic apparatus, and recording medium
CN110689556A (en) Tracking method and device and intelligent equipment
US20220122341A1 (en) Target detection method and apparatus, electronic device, and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination