CN113191942A - Method for generating image, method for training human detection model, program, and device - Google Patents


Info

Publication number
CN113191942A
Authority
CN
China
Prior art keywords
human
image
image block
original
pose
Prior art date
Legal status
Pending
Application number
CN202110561037.1A
Other languages
Chinese (zh)
Inventor
支蓉
郭子杰
张武强
王宝锋
Current Assignee
Mercedes Benz Group AG
Original Assignee
Daimler AG
Priority date
Filing date
Publication date
Application filed by Daimler AG
Priority to CN202110561037.1A
Publication of CN113191942A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T 3/04 — Context-preserving transformations, e.g. by using an importance map
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
        • G06F 18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
        • G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
        • G06V 40/113 — Recognition of static hand signs
        • G06V 40/117 — Biometrics derived from hands


Abstract

The present invention relates to the fields of image generation and person detection, and in particular to a method for generating an image, wherein the method comprises the following steps: S1, providing an original image (10) containing a person; S2, cropping at least one original human image block (11) from the original image (10); S3, generating a synthesized human image block (21) based on each original human image block (11), wherein the synthesized human image block (21) has the background of the corresponding original human image block (11) and a human pose different from the human pose in the corresponding original human image block (11); S4, replacing the corresponding original human image block (11) in the original image (10) with the synthesized human image block (21) to generate a synthesized image (20). The invention also relates to a method of training a human detection model (40), a computer program product, and an apparatus for processing images.

Description

Method for generating image, method for training human detection model, program, and device
Technical Field
The present invention relates to the field of image generation and the field of person detection, and in particular to a method of generating an image, a method of training a person detection model, and a corresponding computer program product and apparatus for processing an image.
Background
Person detection based on computer vision can determine the position of a person, among other attributes, by processing images or video captured by a camera. Person detection has broad application prospects: for example, it can be used for pedestrian detection, a key technology in driver assistance, automated driving, intelligent video surveillance, human behavior analysis, and similar applications. In recent years, machine learning has become a widely used approach in computer vision, and machine-learning-based person detection is receiving increasing attention from both academia and industry.
The performance of a machine-learning-based person detection model depends not only on the quality of the model itself, but also on the quality and quantity of the training data. To ensure model performance, a large number of samples is usually required for training, and collecting them consumes considerable manpower and material resources. Data augmentation is an effective way to reduce acquisition cost: it can expand the number of training samples and improve the recognition accuracy of the person detection model.
For example, existing generative networks such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) can generate new samples from a training data set containing a limited number of training samples.
However, most current generation processes are random, and it is difficult for current generative models to accurately control the pose in the target image while generating high-quality, high-definition images, so the generated images are poorly suited as training samples for a person detection model.
Therefore, the prior art still has many shortcomings, both in image generation and in improving the recognition rate of the person detection model.
Disclosure of Invention
It is an object of the invention to provide an improved method of generating images, an improved method of training a person detection model, and corresponding computer program products and apparatuses.
According to a first aspect of the present invention, there is provided a method of generating an image, wherein the method comprises the steps of:
S1, providing an original image containing a person;
S2, cropping at least one original human image block from the original image;
S3, generating a synthesized human image block based on each original human image block, wherein the synthesized human image block has the background of the corresponding original human image block and a human pose different from the human pose in the corresponding original human image block;
S4, replacing the corresponding original human image block in the original image with the synthesized human image block to generate a synthesized image.
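Steps S1–S4 above can be sketched as a minimal pipeline, assuming numpy images and a placeholder `person_generator` standing in for the pose-conditioned generator described later; all function names and the toy "generator" are illustrative assumptions, not the patented implementation:

```python
import numpy as np

def person_generator(person_block, target_pose=None):
    # Placeholder for the pose-conditioned generator: here it merely flips
    # the block horizontally to emulate "a different pose on the same
    # background". A real generator would be a trained network.
    return person_block[:, ::-1].copy()

def generate_composite(original, boxes, target_pose=None):
    """S2-S4: crop each person block, synthesize a new one, paste it back."""
    composite = original.copy()
    for (y0, y1, x0, x1) in boxes:
        block = original[y0:y1, x0:x1]                    # S2: crop block
        new_block = person_generator(block, target_pose)  # S3: synthesize
        composite[y0:y1, x0:x1] = new_block               # S4: replace
    return composite

# S1: a toy "original image" with one person-like region
original = np.zeros((8, 8), dtype=np.uint8)
original[2:6, 2:5] = np.arange(12, dtype=np.uint8).reshape(4, 3)
composite = generate_composite(original, [(2, 6, 2, 5)])
```

Because the synthesized block carries its own matching background, the paste-back in S4 is a plain slice assignment; nothing outside the box changes.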
According to the present invention, the newly generated synthesized human image block retains the background of the original human image block. That is, the synthesized human image block contains background information that matches the environment in the original image, so the original human image block can be directly replaced by the corresponding newly generated synthesized human image block. In the resulting complete composite image, no element other than the person is changed: the new synthesized human image block has a reasonable position and a reasonable size, and fits the environment of the original image well. Compared with the original images, these composite images exhibit different human poses and can therefore provide more diverse pose and bounding-box information.
According to an alternative embodiment of the invention the synthetic character image block has the appearance of a character in the corresponding original character image block.
According to an alternative embodiment of the present invention, step S1 further includes providing target pose information, and in step S3 the synthesized human image block is generated based on each original human image block and in accordance with the target human pose, such that the synthesized human image block has the target human pose represented by the target pose information.
According to an alternative embodiment of the invention, the target pose information is provided by a pose image comprising pose key points connected according to the linkage of a real human skeleton. Alternatively, the target pose information is provided by a target-pose person image containing a person having the target pose. Alternatively, the target pose information is provided by position data for a set of pose key points.
According to an alternative embodiment of the invention, the target pose information has associated annotation information, and the composite character image patch generated in step S3 has annotation information associated with the corresponding target pose information, wherein the annotation information optionally includes character intent information and/or gesture information.
According to an alternative embodiment of the present invention, in step S3, the original human image block and the target posture information are input into a human generator that generates a composite human image block.
According to an alternative embodiment of the invention, the character generator is configured to be able to perform the following steps:
S31, identifying at least one pose key point of the person in the original human image block;
S32, cropping a plurality of foreground image blocks and a plurality of background image blocks from the original human image block based on the at least one pose key point;
S33, extracting at least one first feature vector from the plurality of foreground image blocks and the plurality of background image blocks;
S34, acquiring at least one second feature vector from the target pose information; and
S35, generating a synthesized human image block from the at least one first feature vector extracted in step S33 and the at least one second feature vector acquired in step S34.
According to a second aspect of the present invention, there is provided a method of training a human detection model, wherein the method comprises the steps of: providing a training data set comprising synthetic images generated by the method of generating images according to the invention; and training the human detection model with the training data set, wherein the training data set optionally includes original images.
As described above, on the one hand, in the synthesized image, the new synthesized human image block has both a reasonable position and a reasonable size, while well conforming to the surrounding environment information in the original image. On the other hand, these composite images may provide more varied character poses and bounding box information. Therefore, the composite image is particularly suitable for training the human detection model, so that the human detection model obtains a higher recognition rate.
According to a third aspect of the invention, there is provided a computer program product comprising computer program instructions, wherein the computer program instructions, when executed by one or more processors, cause the one or more processors to perform the method of generating an image according to the invention or the method of training a human detection model according to the invention.
According to a fourth aspect of the present invention, there is provided an apparatus for processing an image, the apparatus comprising a processor and a computer-readable storage device communicatively connected to the processor, the computer-readable storage device having stored thereon a computer program for implementing the method of generating an image according to the present invention or the method of training a human detection model according to the present invention when the computer program is executed by the processor.
By means of the invention, the following effects are achieved: the person in the generated composite image has a reasonable position and size, and the problem of the foreground not matching the background information is avoided. Training the person detection model with the composite images improves its recognition rate.
Drawings
The principles, features and advantages of the present invention may be better understood by describing the invention in more detail below with reference to the accompanying drawings. The drawings comprise:
fig. 1 is a schematic block diagram illustrating an apparatus for processing an image according to an exemplary embodiment of the present invention;
FIG. 2 shows a flow diagram of a method of generating an image according to an exemplary embodiment of the invention;
FIG. 3 schematically illustrates an original image;
FIG. 4 schematically illustrates a method of generating an image according to an exemplary embodiment of the invention;
FIG. 5 schematically illustrates a process of generating a composite human image block, according to an exemplary embodiment of the present invention; and
figure 6 schematically illustrates a method of training a human detection model according to an exemplary embodiment of the invention.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and exemplary embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of the invention.
Fig. 1 illustrates a schematic structural block diagram of an apparatus for processing an image according to an exemplary embodiment of the present invention. The apparatus for processing images comprises a processor 1 and a computer readable storage device 2 communicatively connected to the processor 1. The computer-readable storage device 2 has stored therein a computer program for implementing a method of generating an image or a method of training a human detection model, which will be explained in detail below, when the computer program is executed by the processor 1.
According to an exemplary embodiment, a display device 3 is provided in communicative connection with the processor 1. By means of the display device 3, the user can view the original image 10 to be processed by the apparatus and the new composite image 20 generated by it.
According to an exemplary embodiment, an input device 4 is provided in communicative connection with the processor 1. By means of the input device 4, the user can select or input an original image 10 to be processed by the device. The input device 4 may include, for example: a keyboard, a mouse, and/or a touch screen.
According to an exemplary embodiment, a camera device 5 is provided in communicative connection with the processor 1. By means of the camera device 5, the user can take a photograph containing an image of a person as an original image 10 to be processed by the device. In particular, the original image 10 includes not only the person but also surrounding environment information such as a scene in which the person is located.
According to an exemplary embodiment, an original image set is provided which is made up of a plurality of original images 10. The raw set of images may be stored in the computer readable storage device 2 or another storage device communicatively connected to the processor 1.
Fig. 2 shows a flowchart of a method of generating an image according to an exemplary embodiment of the present invention.
In step S1, the original image 10 containing the person is provided. Fig. 3 schematically shows an original image 10, the original image 10 containing a person and a scene in which the person is located. The original image 10 may be any of the images in the original set mentioned above. The original image 10 may be, for example, an image captured by a user via the camera 5 or a frame of a person image captured from a video stream.
Then, in step S2, at least one original-person image block 11 is cut out from the original image 10. As shown in fig. 4, two original human image blocks 11 can be cropped from the original image 10 shown in fig. 3. The cropped original character image block 11 contains the complete character and also contains a small amount of background.
In one exemplary embodiment, pose keypoints of a person contained in the original image 10 are identified, and the original-person image blocks 11 are cropped according to the identified pose keypoints, so that a single original-person image block 11 contains a single whole person. For example, from the identified pose key points, a character bounding box may be determined, which is expanded outward, e.g., by a factor of 1.5, to form a cropped bounding box for cropping out the original character image block 11.
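The keypoint-based cropping described above can be sketched as follows; the 1.5× expansion factor comes from the example in the text, while the keypoint format, helper name, and image dimensions are illustrative assumptions:

```python
def crop_box_from_keypoints(keypoints, scale=1.5, img_w=1920, img_h=1080):
    """Compute a crop box: the tight bounding box of the pose key points,
    expanded outward by `scale` (e.g. 1.5x) and clamped to the image."""
    xs = [x for x, y in keypoints]
    ys = [y for x, y in keypoints]
    cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
    half_w = (max(xs) - min(xs)) / 2 * scale
    half_h = (max(ys) - min(ys)) / 2 * scale
    x0 = max(0, int(cx - half_w)); x1 = min(img_w, int(cx + half_w))
    y0 = max(0, int(cy - half_h)); y1 = min(img_h, int(cy + half_h))
    return x0, y0, x1, y1

# e.g. identified key points of one person (head, shoulders, foot, ...)
box = crop_box_from_keypoints([(100, 50), (140, 50), (120, 260)], scale=1.5)
```

Clamping to the image border keeps the expanded box valid for people near the edge of the frame.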
Next, in step S3, a synthesized human image block 21 is generated based on each original human image block 11, where the synthesized human image block 21 has the background of the corresponding original human image block 11 and a human pose different from the human pose in the corresponding original human image block 11. The synthesized human image block 21 carries new human pose and bounding-box information, making the full data set richer.
Alternatively, the synthesized human image block 21 has the appearance of a human in the corresponding original human image block 11. Thus, the synthesized human image block 21 changes only the human pose in the original human image block 11, while maintaining the human appearance and background of the original human image block 11.
Illustratively, the synthetic human image block 21 may be generated by means of a human generator 30. As shown in fig. 4, the human generator 30 generates two corresponding synthesized human image blocks 21 based on the two original human image blocks 11.
Next, in step S4, the corresponding original human image block 11 in the original image 10 is replaced with the synthesized human image block 21 to generate the synthesized image 20. The newly generated synthesized human image block 21 already contains pixel-level background information matching the background and environment of the original image 10, so the original human image block 11 can be directly replaced with the corresponding newly generated synthesized human image block 21. Thus, no elements other than the person are changed in the newly generated complete composite image 20. In the composite image 20, the new synthesized human image block 21 has both a reasonable position and a reasonable size, while conforming well to the surrounding environment. These composite images 20 can provide a greater variety of human poses and bounding-box information. Using the original image set together with the composite image set to train the person detection model 40 allows the model to achieve a higher recognition rate.
Fig. 5 illustrates a process of generating the synthesized human image block 21 according to an exemplary embodiment of the present invention. In the exemplary embodiment, step S1 also includes providing target pose information. Also, in step S3, the synthesized human image block 21 is generated based on each of the original human image blocks 11 and in accordance with the target human pose such that the synthesized human image block 21 has the target human pose represented by the target pose information.
The target pose information may be provided by a pose image that contains pose key points connected according to the linkage of a real human skeleton. The original human image block 11 and the pose image having the target pose are input into the human generator 30, which then outputs the synthesized human image block 21. The invention does not limit the source of the target pose: it may be the pose of another person in the original image set or a pose of a person in another data set.
Alternatively, the target pose information may be provided by a target pose character image containing a character having a target character pose. The target posed character image may or may not be selected from the original set of images.
Alternatively, the target pose information may be provided by position data for a set of pose key points. It should be understood that the present invention is not limited to the specific form of the target pose information.
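One of the options named above, target pose information as position data for a set of pose key points, might look like the following sketch; the keypoint names, the skeleton links, and the flattening helper are all illustrative assumptions:

```python
# Target pose as normalized (x, y) positions for a set of pose key points.
TARGET_POSE = {
    "head": (0.50, 0.10),
    "left_shoulder": (0.40, 0.25),
    "right_shoulder": (0.60, 0.25),
    "left_hip": (0.45, 0.55),
    "right_hip": (0.55, 0.55),
}
# Links approximating a real human skeleton (for rendering a pose image).
SKELETON_LINKS = [
    ("head", "left_shoulder"), ("head", "right_shoulder"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
]

def pose_to_vector(pose, order):
    """Flatten keypoint positions into a fixed-length vector that a
    generator could consume as pose-conditioning input."""
    return [coord for name in order for coord in pose[name]]

vec = pose_to_vector(TARGET_POSE, sorted(TARGET_POSE))
```

A fixed key ordering is what makes the flattened vector comparable across poses.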
In one exemplary embodiment, the target pose information has associated annotation information, and the synthetic character image block 21 generated in step S3 has annotation information associated with the corresponding target pose information. The annotation information includes, for example, character intention information and/or gesture information. Thus, the resulting composite images 20 will also have associated annotation information, and these composite images 20 may be particularly advantageously used to train a human intent recognizer or human gesture detector, or the like, for enhancing their performance.
In one exemplary embodiment, character generator 30 is configured to perform the following steps:
S31, identifying at least one pose key point of the person in the original human image block 11;
S32, cropping a plurality of foreground image patches and a plurality of background image patches from the original human image block 11 based on the at least one pose key point;
S33, extracting at least one first feature vector from the plurality of foreground image patches and the plurality of background image patches;
S34, acquiring at least one second feature vector from the target pose information; and
S35, generating a synthesized human image block 21 from the at least one first feature vector extracted in step S33 and the at least one second feature vector acquired in step S34.
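Steps S31–S35 can be illustrated with a toy pipeline using mean-intensity "features"; this is a hedged sketch under simplifying assumptions, not the network the patent describes — a real generator would be a trained model:

```python
import numpy as np

def crop_patches(block, keypoints, size=2):
    """S32: cut foreground patches around each pose key point (assumed
    already identified in S31); remaining pixels form the background."""
    fg, mask = [], np.zeros(block.shape, dtype=bool)
    for r, c in keypoints:
        r0, r1 = max(0, r - size), min(block.shape[0], r + size)
        c0, c1 = max(0, c - size), min(block.shape[1], c + size)
        fg.append(block[r0:r1, c0:c1])
        mask[r0:r1, c0:c1] = True
    bg = [block[~mask]]  # flattened non-keypoint pixels as one background patch
    return fg, bg

def extract_first_features(fg, bg):
    """S33: one toy 'feature' per patch - its mean intensity."""
    return np.array([p.mean() for p in fg + bg])

def pose_features(keypoints):
    """S34: second feature vector taken straight from target pose data."""
    return np.array(keypoints, dtype=float).ravel()

def generate_block(first_feats, second_feats, shape):
    """S35: decode the feature vectors into a synthetic block; this toy
    version just fills the block with the mean patch intensity."""
    return np.full(shape, first_feats.mean())

block = np.arange(64, dtype=float).reshape(8, 8)
fg, bg = crop_patches(block, [(2, 2), (5, 5)])
out = generate_block(extract_first_features(fg, bg),
                     pose_features([(1, 1)]), block.shape)
```

The point of the structure is the split: appearance and background come from the first feature vector, pose from the second, so the two can be recombined freely.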
The person generator 30 may be provided as another type of generator as long as the person generator 30 functionally satisfies the requirement of being able to control the appearance, posture and background of the generated person image. It should be understood that the present invention is not limited to a particular type of character generator 30.
Fig. 6 shows a schematic diagram of a method of training a human detection model 40 according to an exemplary embodiment of the invention. In the method of training the human detection model 40, the synthetic image 20 is generated by the method of generating an image according to the present invention. Then, a training data set is provided, which comprises the synthetic image 20. The training data set may be stored in a computer readable storage medium. Optionally, the training data set further comprises the original image 10. That is, the original image 10 and the synthesized image 20 may be used together to train the human detection model 40.
As described above, the person in the composite image 20 has a reasonable size and position, and the problem of the foreground not matching the background information does not occur. The composite image 20 may provide more varied character poses and bounding box information, making the training data set more informative. The original image set and the composite image set are submitted to the human detection model 40 for training, so that the human detection model 40 can obtain higher recognition rate.
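Merging the original image set with the composite image set into one training set, as described above, can be sketched as follows; the sample and label structures are illustrative assumptions:

```python
def build_training_set(originals, synthetics):
    """Merge original and composite (image, boxes) samples into one
    training list, tagging each sample with its source for bookkeeping."""
    dataset = []
    for img, boxes in originals:
        dataset.append({"image": img, "boxes": boxes, "source": "original"})
    for img, boxes in synthetics:
        dataset.append({"image": img, "boxes": boxes, "source": "synthetic"})
    return dataset

# One original image and two pose-varied composites derived from it
originals = [("img0", [(2, 6, 2, 5)])]
synthetics = [("img0_syn", [(2, 6, 2, 5)]), ("img0_syn2", [(1, 7, 2, 6)])]
train_set = build_training_set(originals, synthetics)
```

The bounding boxes of the composites double as labels, which is what makes the generated images directly usable as detection training samples.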
Furthermore, the invention relates to a computer program product comprising computer program instructions which, when executed by one or more processors, cause the processors to perform the method of generating an image according to the invention or the method of training a human detection model according to the invention. The computer program instructions may be stored in a computer-readable storage medium. In the present invention, the computer-readable storage medium may include, for example, high-speed random access memory, and may also include non-volatile memory such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device. The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or any conventional processor.
Although specific embodiments of the invention have been described herein in detail, they have been presented for purposes of illustration only and are not to be construed as limiting the scope of the invention. Various substitutions, alterations, and modifications may be devised without departing from the spirit and scope of the present invention.

Claims (10)

1. A method of generating an image, wherein the method comprises the steps of:
s1, providing an original image (10) containing a person;
s2, cutting out at least one original character image block (11) from the original image (10);
s3, generating a synthesized human image block (21) based on each original human image block (11), wherein the synthesized human image block (21) has a background in the corresponding original human image block (11) and a human pose different from the human pose in the corresponding original human image block (11);
s4, replacing the corresponding original human image block (11) in the original image (10) with the synthesized human image block (21) to generate the synthesized image (20).
2. The method according to claim 1, wherein the synthetic human image block (21) has the appearance of a human in the corresponding original human image block (11).
3. The method of claim 1, wherein step S1 further includes providing target pose information, and in step S3 the composite human image block (21) is generated based on each original human image block (11) and in accordance with the target human pose such that the composite human image block (21) has the target human pose represented by the target pose information.
4. The method of claim 3, wherein,
the target posture information is provided by a posture image, and the posture image comprises posture key points which are connected according to a real human body skeleton linking mode; or
The target pose information is provided by a target pose character image containing a character having a target character pose; or
The target pose information is provided by position data for a set of pose key points.
5. The method according to claim 3, wherein the target pose information has associated annotation information, the synthetic character image patch (21) generated in step S3 having annotation information associated with the corresponding target pose information, wherein the annotation information optionally includes character intent information and/or gesture information.
6. The method according to any one of claims 3-5, wherein in step S3, the original human image block (11) and the target pose information are input into a human generator (30), the human generator (30) generating a composite human image block (21).
7. The method of claim 6, wherein the character generator (30) is configured to perform the steps of:
s31, identifying at least one pose key point of a person in the original person image block (11);
s32, intercepting a plurality of foreground image patches and a plurality of background image patches from the original character image block (11) based on the at least one pose keypoint;
s33, extracting at least one first feature vector from the plurality of foreground image patches and the plurality of background image patches;
s34, acquiring at least one second feature vector from the target posture information; and
S35, generating a composite human image block (21) from the at least one first feature vector extracted in step S33 and the at least one second feature vector acquired in step S34.
8. A method of training a character detection model (40), wherein the method comprises the steps of:
providing a training data set comprising a synthetic image (20) generated by the method of generating an image according to any one of claims 1-7; and
a person detection model (40) is trained using a training data set, wherein the training data set optionally comprises original images (10).
9. A computer program product comprising computer program instructions, wherein the computer program instructions, when executed by one or more processors, are capable of performing the method of any one of claims 1-8.
10. An apparatus for processing an image, the apparatus comprising a processor (1) and a computer-readable storage device (2) communicatively connected to the processor (1), the computer-readable storage device (2) having stored therein a computer program for implementing the method according to any one of claims 1-8 when the computer program is executed by the processor (1).
CN202110561037.1A 2021-05-20 2021-05-20 Method for generating image, method for training human detection model, program, and device Pending CN113191942A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110561037.1A CN113191942A (en) 2021-05-20 2021-05-20 Method for generating image, method for training human detection model, program, and device


Publications (1)

Publication Number Publication Date
CN113191942A true CN113191942A (en) 2021-07-30

Family

ID=76984668

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110561037.1A Pending CN113191942A (en) 2021-05-20 2021-05-20 Method for generating image, method for training human detection model, program, and device

Country Status (1)

Country Link
CN (1) CN113191942A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116051926B (en) * 2023-01-12 2024-04-16 北京百度网讯科技有限公司 Training method of image recognition model, image recognition method and device



Legal Events

Date Code Title Description
PB01 Publication