Data augmentation method and electronic equipment

Publication number: CN117765017A
Application number: CN202211132904.0A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 浦一成, 荣逸, 周磊
Applicant/Assignee: Huawei Technologies Co Ltd
Legal status: Pending
Prior art keywords: target, image, foreground, background, preset
Landscapes: Image Analysis (AREA)
Abstract

The embodiment of the application provides a data augmentation method and electronic equipment. The method comprises the following steps: acquiring a target background image and a target semantic segmentation map corresponding to the target background image; then, determining the pose corresponding to a target foreground object based on the target semantic segmentation map, wherein the pose is the pose of the target foreground object in the target background image; then, acquiring a target foreground image based on the pose corresponding to the target foreground object, wherein the target foreground image comprises the target foreground object; and then synthesizing the target foreground image and the target background image to obtain a training image. In this way, the target foreground object appears at a reasonable position in the background of the synthesized training image, so that the augmented training images are closer to images acquired in real scenes and cover various difficult-case scenarios. After a preset algorithm (such as an automatic driving algorithm) is trained with these training images, the index, generalization and robustness of the trained preset algorithm can be improved.

Description

Data augmentation method and electronic equipment
Technical Field
The embodiment of the application relates to the field of data processing, in particular to a data augmentation method and electronic equipment.
Background
Training an autopilot algorithm (which may be a deep learning model) usually requires a large amount of training data before the algorithm can be applied during automatic driving. However, the training data (such as training images) often cannot cover various rare driving scenes (which may be called difficult scenes), such as pedestrians or animals suddenly appearing on a highway; therefore, the training data needs to be expanded to improve the generalization (i.e., adaptability to scenes that do not appear in the training data), index and robustness of the autopilot algorithm.
Supplementing data by the traditional real-vehicle acquisition approach suffers from excessive cost, an overly long acquisition period and high risk, so the training data is generally expanded by image synthesis. However, existing image synthesis techniques can only rely on manual operation to generate, in small batches, images that match the difficult cases of automatic driving.
Disclosure of Invention
In order to solve the above technical problems, the application provides a data augmentation method and electronic equipment. With the method, images covering various difficult scenes can be generated automatically and on a large scale.
In a first aspect, embodiments of the present application provide a data augmentation method, the method comprising: firstly, acquiring a target background image and a target semantic segmentation map corresponding to the target background image; then, determining the pose corresponding to the target foreground object based on the target semantic segmentation map, wherein the pose is the pose of the target foreground object in the target background image; then, acquiring a target foreground image based on the pose corresponding to the target foreground object, wherein the target foreground image comprises the target foreground object; and then synthesizing the target foreground image and the target background image to obtain a training image.
Since the target foreground image is generated according to the pose of the target foreground object in the target background image, the pose of the target foreground object in the target foreground image is exactly its pose in the target background image. Therefore, after the target foreground image and the corresponding target background image are synthesized, the target foreground object in the target foreground image is placed in the target background image according to the determined pose. As a result, the target foreground object appears at a reasonable position in the background of the synthesized training image, so that the augmented training images are closer to images acquired in real scenes and cover various difficult scenes; in this way, images covering various difficult scenes can be generated automatically and in large quantities. After a preset algorithm (such as an automatic driving algorithm) is trained with these training images, the index, generalization and robustness of the trained preset algorithm can be improved.
Illustratively, the target background image is used to synthesize a training image and serves as the background of the training image. There may be, for example, one or more target background images.
Illustratively, the target semantic segmentation map is used to determine the pose of the target foreground object. The target semantic segmentation map is obtained by performing semantic segmentation on a top view of the background corresponding to the target background image. For example, there may be one or more target semantic segmentation maps; the plurality of target background images are in one-to-one correspondence with the plurality of target semantic segmentation maps.
For example, the pose may include a position and an orientation angle.
By way of example, the target foreground object may include a variety of objects, such as humans, animals, objects (e.g., vehicles), etc., as the application is not limited in this regard. Illustratively, the target foreground object may be one or more.
Illustratively, the target foreground image is used to synthesize a training image and serves as the foreground of the training image. There may be, for example, one or more target foreground images.
For example, when there are multiple target foreground images and multiple target background images, one target foreground image and one corresponding target background image may be combined to obtain one training image; in this way, a plurality of training images can be obtained.
For example, when the target foreground images are multiple and the target background images are multiple, the multiple target foreground images and a corresponding target background image may be combined to obtain a training image; in this way, a plurality of training images can also be obtained.
The data augmentation method provided by the application can be applied to an automatic driving scene of a vehicle. In particular, it may be used to augment training data for training an autopilot algorithm to cover various difficult scenarios during autopilot. The difficult situations in the automatic driving process may include various situations, such as pedestrians, animals, etc. suddenly appearing on a highway, for example, a sudden lane change of a front vehicle, a sudden overtaking of a rear vehicle, etc., which is not limited in this application. It should be understood that the training data augmented by the data augmentation method provided in the present application may also cover non-difficult situations in the automatic driving process, which is not limited in this application.
The data augmentation method provided by the application can also be applied to a sweeping (or mopping) scene of a sweeping robot. Specifically, it can be used to augment training data for training the image perception algorithm in the sweeping robot, so as to cover various difficult cases in the sweeping (or mopping) process of the sweeping robot. The difficult scenes in the sweeping (or mopping) process may include, for example, small but important objects such as rings and necklaces on the ground. It should be understood that the training data augmented by the data augmentation method provided in the application may also cover non-difficult scenes in the sweeping (or mopping) process of the sweeping robot, which is not limited in this application.
It should be understood that the data augmentation method provided by the application can also be applied to the working scenes of other robots (such as meal delivery robots and patrol robots); specifically, training data for image perception algorithms in other robots can be added to cover various difficult cases in the working process of other robots; the present application is not limited in this regard.
According to a first aspect, the pose comprises a first pose, and determining the pose corresponding to the target foreground object based on the target semantic segmentation map comprises: determining probability information based on the target semantic segmentation map, wherein the probability information comprises the probability of each pixel point in the target semantic segmentation map to appear in a target foreground object; based on the probability information, a first pose is determined.
For example, when the target foreground object is a stationary object, the first pose is a final pose corresponding to the target foreground object.
For example, when the target foreground object is a moving object, the first pose is an initial pose of the target foreground object, and the second pose in the moving process of the target foreground object can be determined later. The first pose and the second pose can form a final pose corresponding to the target foreground object.
According to the first aspect, or any implementation manner of the first aspect, determining probability information based on the target semantic segmentation map includes: inputting the target semantic segmentation map and the category identifier of the target foreground object into a position generation network, and performing position prediction by the position generation network to obtain the probability information. In this way, using the position generation network for position prediction allows the reasonable position of the target foreground object in the target background image to be determined accurately.
According to a first aspect, or any implementation manner of the first aspect, the determining the first pose based on the probability information includes: determining the position of a target foreground object in a target background image based on probability information; determining an orientation angle of the target foreground object in the target background image based on the position of a preset reference semantic in the target semantic segmentation map and the position of the target foreground object in the target background image; the first pose is determined based on the position and orientation angle of the target foreground object in the target background image. Thus, the orientation angle of the target foreground object can be accurately determined based on the position of the reference semantic and the position of the target foreground object; and further accurately determining the first pose of the target foreground object.
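The application does not spell out the exact geometry of this step; one simple reading, assumed in the sketch below, is to orient the target foreground object toward the nearest pixel carrying the preset reference semantic (for example, a lane line). The function name, the coordinate convention, and the orientation rule are illustrative assumptions.

```python
import numpy as np

def orientation_angle(object_xy, semantic_map, reference_label):
    """Derive an orientation angle for the target foreground object from its
    sampled position and the position of a preset reference semantic in the
    target semantic segmentation map.

    Assumption (not stated in the application): the object is oriented toward
    the nearest pixel carrying the reference semantic label.
    """
    ys, xs = np.nonzero(semantic_map == reference_label)
    if xs.size == 0:
        return 0.0  # no reference semantic present; fall back to a default angle
    ref = np.stack([xs, ys], axis=1).astype(float)   # (N, 2) reference pixels as (x, y)
    obj = np.asarray(object_xy, dtype=float)
    nearest = ref[np.argmin(np.linalg.norm(ref - obj, axis=1))]
    dx, dy = nearest - obj
    return float(np.arctan2(dy, dx))                 # radians, in image coordinates

# Example: orient an object at (40, 25) toward the nearest lane-line pixel (label 2).
sem = np.zeros((50, 60), dtype=np.uint8)
sem[:, 30] = 2
angle = orientation_angle((40, 25), sem, reference_label=2)
```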
According to a first aspect, or any implementation manner of the first aspect, determining a position of the target foreground object in the target background image based on the probability information includes: generating a probability heat map based on the probability information; and sampling the probability heat map according to a preset constraint condition to determine the position of the target foreground object in the target background image. Therefore, sampling is performed according to preset constraint conditions, and the target foreground object can be enabled to appear at a reasonable position in the target background image.
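As an illustration, the sketch below samples object positions from a probability heat map while enforcing two assumed examples of preset constraint conditions: positions must lie inside a valid region (e.g. the drivable area of the target semantic segmentation map) and must keep a minimum distance from each other.

```python
import numpy as np

def sample_positions(prob_map, valid_mask, num_objects=1, min_dist=20.0,
                     max_tries=1000, rng=None):
    """Sample positions for target foreground objects from a probability heat map.

    prob_map   : HxW per-pixel placement probabilities (the probability information).
    valid_mask : HxW boolean mask encoding a preset constraint; restricting to the
                 drivable area is an assumed example of such a constraint.
    min_dist   : minimum pixel distance between two sampled objects (also assumed).
    """
    rng = rng or np.random.default_rng()
    p = prob_map * valid_mask
    p = p / p.sum()
    h, w = p.shape
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == num_objects:
            break
        idx = rng.choice(h * w, p=p.ravel())
        y, x = divmod(int(idx), w)
        # Enforce the spacing constraint against already-chosen positions.
        if all(np.hypot(x - cx, y - cy) >= min_dist for cx, cy in chosen):
            chosen.append((x, y))
    return chosen
```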
According to the first aspect, or any implementation manner of the first aspect above, the method further comprises training the position generation network: acquiring a semantic segmentation map, wherein the semantic segmentation map comprises the semantics of foreground objects and the semantics of background objects; masking the semantics of the foreground objects in the semantic segmentation map to obtain a first image; binarizing the semantic segmentation map to obtain a second image, wherein the pixel values of the pixel points corresponding to the foreground objects in the second image are different from the pixel values of the pixel points corresponding to the background objects; and training the position generation network based on the category identification of the foreground objects, the first image, and the second image. In this way, by taking the image in which the foreground objects are masked and the background objects are retained as the training input, and the image in which the foreground objects are retained and the background objects are masked as the label, the position generation network can learn the reasonable positions of foreground objects in the background; in subsequent use of the position generation network, position prediction can then be performed more accurately.
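As an illustration of how the first image and second image could be constructed from one semantic segmentation map, the sketch below masks out the foreground semantics for the network input and binarises the map for the label. The concrete label values and the use of 0 as the masking value are assumptions.

```python
import numpy as np

def make_training_pair(semantic_map, foreground_labels):
    """Build one (input, label) pair for training the position generation network.

    First image (network input): the semantic segmentation map with the semantics
    of the foreground objects masked out (set to 0 here), keeping only the
    background semantics.
    Second image (label): a binarised map in which foreground pixels and
    background pixels take different values (1 vs 0).
    """
    fg = np.isin(semantic_map, list(foreground_labels))
    first_image = np.where(fg, 0, semantic_map)   # foreground semantics occluded
    second_image = fg.astype(np.float32)          # 1 = foreground, 0 = background
    return first_image, second_image

# Example with assumed label values: 5 = pedestrian (foreground), others background.
sem = np.array([[1, 1, 5], [2, 5, 4], [4, 4, 4]])
x_in, y_label = make_training_pair(sem, foreground_labels={5})
```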
According to the first aspect, or any implementation manner of the first aspect, there are a plurality of category identifications of foreground objects, and the position generation network corresponds to a plurality of groups of weight parameters; training the position generation network based on the category identification of the foreground object, the first image, and the second image includes: for a first foreground object of the plurality of foreground objects, training the position generation network configured with one group of weight parameters based on the category identification of the first foreground object, the first image, and the second image. In this way, different weight coefficients can be trained for different types of foreground objects. Since different types of foreground objects appear at different reasonable positions in the background, the position generation network can accurately predict the positions of foreground objects of different types.
According to the first aspect, or any implementation of the first aspect above, the position generation network comprises a Gaussian kernel function and a sigmoid function; the Gaussian kernel function corresponds to the plurality of groups of weight parameters.
Illustratively, the sigmoid function is a nonlinear activation function.
Illustratively, the Gaussian kernel function may take the following weighted-Gaussian form over pixel distances:

K((i, j), (l, k)) = ω · exp(−‖(i, j) − (l, k)‖² / θ²), l ∈ (0, m), k ∈ (0, m)

wherein ‖(i, j) − (l, k)‖² represents the distance between the pixel point at (i, j) and another position (l, k) in the semantic segmentation map. ω and θ are weight coefficients: ω is the weight of the semantic corresponding to the pixel point at (l, k) for one type of foreground object, and θ is the influence range of the semantic corresponding to the pixel point at (l, k) for that type of foreground object.
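As an illustration of how such a kernel plus a sigmoid could produce the probability information, the following NumPy sketch accumulates, for one foreground-object class, the weighted Gaussian contribution of every semantic pixel (l, k) and squashes the result with a sigmoid. The per-semantic parameterisation of ω and θ, the function names, and the tiny example data are assumptions made for illustration, not the network structure disclosed in the application.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_probability(semantic_map, class_weights):
    """Per-pixel probability that a foreground object of one class appears there.

    class_weights holds one per-semantic 'omega' (weight) and 'theta' (influence
    range) array for a single foreground-object class; one such group of weight
    parameters per class is assumed.
    """
    h, w = semantic_map.shape
    ys, xs = np.mgrid[0:h, 0:w]
    score = np.zeros((h, w), dtype=np.float64)
    # Accumulate the Gaussian-kernel contribution of every semantic pixel (l, k)
    # to every query pixel (i, j): omega * exp(-||(i,j)-(l,k)||^2 / theta^2).
    for l in range(h):
        for k in range(w):
            s = semantic_map[l, k]
            omega = class_weights["omega"][s]
            theta = class_weights["theta"][s]
            dist2 = (ys - l) ** 2 + (xs - k) ** 2
            score += omega * np.exp(-dist2 / (theta ** 2))
    return sigmoid(score)  # probability information for this class

# Tiny illustrative example: an 8x8 map with 4 semantics and random parameters.
rng = np.random.default_rng(0)
sem = rng.integers(0, 4, size=(8, 8))
weights = {"omega": rng.normal(0.0, 0.1, size=4), "theta": rng.uniform(2.0, 5.0, size=4)}
prob = predict_probability(sem, weights)
```

In the method described above, one such group of weight parameters would be trained per foreground-object class and selected by the class identification at prediction time.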
According to the first aspect, or any implementation manner of the first aspect, inputting the target semantic segmentation map and the category identification of the target foreground object into the position generation network and performing position prediction by the position generation network to obtain probability information includes: inputting the target semantic segmentation map and the category identification of the target foreground object into the position generation network; the position generation network selects the group of weight parameters corresponding to the category identification of the target foreground object, configures itself with them, and then performs position prediction to obtain the probability information. In this way, the position generation network performs position prediction with the weight parameters corresponding to the type of the target foreground object, and can therefore accurately predict the reasonable position of the target foreground object in the target background image.
According to the first aspect, or any implementation manner of the first aspect, the pose further includes a second pose, and determining, based on the target semantic segmentation map, a pose corresponding to the target foreground object, further includes: when the target foreground object is a dynamic object, selecting a track generation model corresponding to the motion type of the target foreground object from a preset behavior asset library; and generating a motion track corresponding to the target foreground object based on the first pose and the track generation model, wherein the motion track comprises a plurality of second poses in the motion process of the target foreground object, the second poses are the poses of the target foreground object in the target background image, and the first pose is one. Thus, the second pose of the target foreground object in the motion process can be accurately determined through the track generation model.
For example, a first pose and a plurality of second poses may constitute a final pose of the target foreground object.
For example, the preset behavior asset library may include assets for generating motion trajectories, and may include a plurality of preset trajectory generation models, each corresponding to a foreground object behavior category. For example, the preset behavior asset library may include a preset track generation model corresponding to uniform motion, a preset track generation model corresponding to acceleration motion, a preset track generation model corresponding to deceleration motion, and the like, which is not limited in this application.
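As an illustration of a preset behavior asset library and a trajectory generation model, the sketch below derives a sequence of second poses from a first pose under a simple constant-acceleration model. The (speed, acceleration) parameterisation, the behavior categories, and the time step are assumptions; the application does not specify the internal form of its trajectory generation models.

```python
import math

def make_trajectory(first_pose, model, num_steps=10, dt=0.1):
    """Generate second poses along a motion trajectory from a first pose.

    first_pose : (x, y, heading) of the target foreground object in the target
                 background image, heading in radians (assumed representation).
    model      : a trajectory generation model from the preset behavior asset
                 library; here simply (speed, acceleration), an assumed form.
    """
    x, y, heading = first_pose
    speed, accel = model
    poses = []
    for _ in range(num_steps):
        speed += accel * dt
        x += speed * dt * math.cos(heading)
        y += speed * dt * math.sin(heading)
        poses.append((x, y, heading))   # one second pose per time step
    return poses

# An assumed preset behavior asset library keyed by behavior category.
BEHAVIOR_ASSETS = {
    "uniform": (5.0, 0.0),      # constant speed
    "accelerate": (2.0, 1.5),   # speeding up
    "decelerate": (8.0, -1.5),  # slowing down
}
trajectory = make_trajectory((100.0, 240.0, 0.0), BEHAVIOR_ASSETS["accelerate"])
```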
According to the first aspect, or any implementation manner of the first aspect, the method further includes: acquiring requirement information of data augmentation, wherein the requirement information comprises foreground requirement information and background requirement information; the method for acquiring the target background image and the target semantic segmentation map corresponding to the target background image comprises the following steps: acquiring a target background image and a target semantic segmentation map based on a preset background asset library and background demand information; determining a target foreground image based on the pose corresponding to the target foreground object comprises: and determining a target foreground image based on the pose corresponding to the target foreground object, a preset foreground asset library and foreground requirement information. Thus, a target foreground image and a target background image which meet the data augmentation requirement can be obtained; and then training images meeting the data augmentation requirements can be synthesized.
Illustratively, the preset background asset library includes assets for generating a target background image and assets for determining a corresponding pose of a target foreground object.
In one possible manner, the preset background asset library may include a plurality of preset background model groups and a plurality of preset semantic segmentation graphs, where the plurality of preset background model groups and the plurality of preset semantic segmentation graphs are in one-to-one correspondence. The preset background model group is used for generating a target background image, and the preset semantic segmentation map is used for determining the pose corresponding to the target foreground object.
In one possible manner, the preset background asset library may include a plurality of preset neural networks and a plurality of preset semantic segmentation maps, where the plurality of preset semantic segmentation maps are in one-to-one correspondence with the plurality of preset neural networks. The preset neural network is used for generating a target background image, and the preset semantic segmentation map is used for determining the pose corresponding to the target foreground object.
In one possible manner, the preset background asset library may include a plurality of preset semantic segmentation graphs and a plurality of preset background images, where the plurality of preset semantic segmentation graphs and the plurality of preset background images are in one-to-one correspondence. The preset background image is used for generating a target background image, and the preset semantic segmentation map is used for determining the pose corresponding to the target foreground object.
Illustratively, the preset foreground asset library includes assets for generating a target foreground image.
In one possible approach, the library of preset foreground assets may include three-dimensional models of a plurality of preset foreground objects. The three-dimensional model of the foreground object is preset and used for generating a target foreground image.
In one possible approach, the library of preset foreground assets may include a plurality of neural networks. The neural network is used for generating a target foreground image.
In one possible approach, the preset foreground asset library may include a plurality of preset foreground images and preset poses of preset foreground objects in the plurality of preset foreground images. Wherein the preset foreground image and the preset pose may be used to generate a target foreground image.
According to the first aspect, or any implementation manner of the first aspect, the background asset library includes a plurality of preset semantic segmentation graphs and a plurality of preset background images collected in advance, where the plurality of preset semantic segmentation graphs are in one-to-one correspondence with the plurality of preset background images; based on a preset background asset library and background demand information, acquiring a target background image and a target semantic segmentation map, wherein the method comprises the following steps: selecting a target background image matched with background demand information from a plurality of preset background images; and selecting a preset semantic segmentation map corresponding to the target background image from the plurality of preset semantic segmentation maps as a target semantic segmentation map. Thus, the target background image and the target semantic segmentation map can be selected rapidly.
In a possible manner, when the background asset library includes a plurality of preset background model groups and a plurality of preset semantic segmentation graphs, a target background model group matched with the background requirement information can be selected from the preset background model groups; and then, rendering the target background model group based on a preset view angle to obtain a target background image. And then, selecting a preset semantic segmentation map corresponding to the target background model group from a plurality of preset semantic segmentation maps as a target semantic segmentation map. The preset viewing angle may be any photographing viewing angle (i.e., camera pose) of the vehicle-mounted camera during the running process of the vehicle.
In a possible manner, when the background asset library includes a plurality of preset neural networks and a plurality of preset semantic segmentation graphs, a target neural network matched with the background demand information can be selected from the plurality of preset neural networks; then, inputting a preset visual angle into a target neural network to obtain a target background image output by the target neural network; and then, selecting a preset semantic segmentation map corresponding to the target neural network from the plurality of preset semantic segmentation maps as a target semantic segmentation map.
According to a first aspect, or any implementation manner of the first aspect, the preset foreground asset library includes three-dimensional models of a plurality of preset foreground objects, and determining the target foreground image based on the pose corresponding to the target foreground object, the preset foreground asset library and the foreground requirement information includes: selecting a three-dimensional model of a target foreground object matched with the foreground demand information from a preset foreground asset library; and rendering the three-dimensional model of the target foreground object based on the pose corresponding to the target foreground object to obtain a target foreground image. In this way, the pose of the target foreground object in the obtained target foreground image can be completely consistent with the pose determined based on the target semantic segmentation map.
In a possible manner, when the preset foreground asset library includes a plurality of preset foreground images and preset poses of preset foreground objects in the preset foreground images, first candidate foreground images may be selected from the preset foreground asset library according to the foreground requirement information. Then, for a target foreground object, second candidate foreground images containing the target foreground object may be selected from the first candidate foreground images. Then, for one pose corresponding to the target foreground object, the difference between the preset pose of each second candidate foreground image and that pose may be determined, and the second candidate foreground image whose preset pose has the smallest difference from that pose is determined as the target foreground image. In this case, the target foreground image may further be translated and scaled based on its preset pose and the determined pose, using the camera projection parameters and the perspective principle that nearer objects appear larger and farther objects appear smaller, so as to obtain a target foreground image closer to the shooting effect corresponding to the determined pose.
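As a sketch of this selection-plus-rescaling step, the snippet below picks the second candidate foreground image whose preset pose is closest to the determined pose and then resizes it according to the ratio of depths, reflecting the perspective principle that nearer objects appear larger. The (x, y, depth, heading) pose representation, the Euclidean pose difference, and the nearest-neighbour resize are assumptions made for illustration.

```python
import numpy as np

def pick_and_rescale(candidates, target_pose):
    """Pick the candidate foreground image whose preset pose is closest to the
    determined pose, then scale it using the pinhole-camera relation that
    apparent size is inversely proportional to depth.

    candidates  : list of (image, preset_pose), where a pose is assumed to be
                  (x, y, depth, heading); the representation is an assumption.
    target_pose : the pose determined from the target semantic segmentation map.
    """
    target = np.asarray(target_pose, dtype=float)
    diffs = [np.linalg.norm(np.asarray(p, dtype=float) - target) for _, p in candidates]
    image, preset_pose = candidates[int(np.argmin(diffs))]

    # Perspective principle (nearer appears larger): scale by the depth ratio.
    scale = preset_pose[2] / target_pose[2]
    h, w = image.shape[:2]
    new_h, new_w = max(1, int(round(h * scale))), max(1, int(round(w * scale)))
    # Nearest-neighbour resize in pure NumPy to keep the sketch dependency-free.
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    return image[rows][:, cols]
```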
In a possible manner, when the preset foreground asset library includes neural networks of a plurality of preset foreground objects, the neural networks of a plurality of target foreground objects matched with the foreground requirement information may be selected from the preset foreground asset library; a plurality of target foreground images are then determined based on the plurality of poses and the neural networks of the plurality of target foreground objects, where one pose and the neural network of one target foreground object determine one target foreground image.
According to the first aspect, or any implementation manner of the first aspect, synthesizing the target foreground image and the target background image to obtain the training image includes: acquiring first depth information corresponding to the target foreground image and second depth information corresponding to the target background image; fusing the target foreground image and the target background image based on the first depth information and the second depth information to obtain a fused image, wherein the pixel value of a pixel point in the target area of the fused image is the pixel value of the pixel point with the smaller depth among the target area of the target foreground image and the target area of the target background image, and the target area is the area where the target foreground object in the target foreground image is located; and processing the target area in the fused image to obtain the training image. In this way, by fusing the target foreground image and the target background image based on the first depth information and the second depth information, objects farther from the camera are correctly occluded by nearer objects.
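A minimal sketch of this depth-based fusion: inside the target area, the pixel whose depth is smaller (closer to the camera) wins, as in a z-buffer. Array names and the boolean-mask representation of the target area are assumptions for illustration.

```python
import numpy as np

def fuse_by_depth(fg_img, fg_depth, bg_img, bg_depth, fg_mask):
    """Fuse a target foreground image into a target background image.

    Inside the target area (fg_mask == True, i.e. where the target foreground
    object is), keep the pixel with the smaller depth so that nearer objects
    occlude farther ones; outside the target area, keep the background pixel.
    """
    fused = bg_img.copy()
    use_fg = fg_mask & (fg_depth <= bg_depth)   # foreground is closer here
    fused[use_fg] = fg_img[use_fg]
    return fused, use_fg                         # use_fg: where the foreground stays visible
```

The returned boolean mask records where the foreground actually remains visible, which corresponds to the occlusion processing of the foreground image mask discussed below.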
For example, processing the target area in the fused image may include color and brightness adjustment.
According to the first aspect, or any implementation manner of the first aspect, processing the target area in the fused image to obtain the training image includes: acquiring a foreground image mask corresponding to the target foreground image; performing occlusion processing on the foreground image mask according to the depth information of the pixel points in the target area of the fused image; and processing the target area in the fused image based on the occluded foreground image mask to obtain the training image. In this way, after the fused image is obtained, the target area in the fused image can be determined according to the occluded foreground image mask.
For example, the occluded foreground image mask and the fused image may be input into an image harmonization network, and the image harmonization network processes the target area in the fused image based on the occluded foreground image mask to obtain the training image output by the image harmonization network. The image harmonization network is a deep neural network with an encoder-decoder structure; by intelligently adjusting the color of the foreground image, it makes the foreground image blend more naturally with the background image.
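The application relies on the learned image harmonization network for this step; purely as an illustrative stand-in, the sketch below adjusts the colour statistics of the occluded target area to match the surrounding background, which approximates the kind of colour and brightness adjustment described above. The function name and the statistic-matching rule are assumptions, not the network disclosed here.

```python
import numpy as np

def harmonize_region(fused, region_mask):
    """Simple stand-in for the image harmonization network: match the mean and
    standard deviation of the target region's colours to those of the
    surrounding background so the pasted foreground blends in.
    This per-channel statistic matching is only an illustrative approximation.
    """
    out = fused.astype(np.float32).copy()
    fg = out[region_mask]       # pixels of the occluded target area
    bg = out[~region_mask]      # background pixels
    for c in range(out.shape[2]):
        mu_f, std_f = fg[:, c].mean(), fg[:, c].std() + 1e-6
        mu_b, std_b = bg[:, c].mean(), bg[:, c].std()
        out[..., c][region_mask] = (out[..., c][region_mask] - mu_f) / std_f * std_b + mu_b
    return np.clip(out, 0, 255).astype(np.uint8)
```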
According to the first aspect, or any implementation manner of the first aspect, the foreground requirement information includes a target foreground parameter, and the background requirement information includes a target background parameter; the method further comprises the steps of: training a preset algorithm based on the training image; testing the trained preset algorithm to obtain an evaluation index; and carrying out iterative updating on the target foreground parameter and the target background parameter based on the evaluation index. Therefore, the algorithm index of the preset algorithm can be improved, and the generalization and the robustness of the preset algorithm are further enhanced.
For example, the target foreground parameter and the target background parameter are iteratively updated until a preset number of loops is reached or an index of a preset algorithm reaches a preset value. The preset cycle number and the preset value can be set according to requirements, and are not described herein.
According to the first aspect, or any implementation manner of the first aspect, the iteratively updating the target foreground parameter and the target background parameter based on the evaluation index includes: and carrying out iterative updating on the target foreground parameter and the target background parameter based on the evaluation index and the Bayesian optimization algorithm. Therefore, the target foreground parameter and the target background parameter can be updated rapidly by adopting the Bayesian optimization algorithm, and the optimal target foreground parameter and the optimal target background parameter can be searched each time.
According to a first aspect, or any implementation manner of the first aspect above, the target foreground parameter includes at least one of: the foreground sample number, the foreground object motion speed, the foreground object motion distance, the foreground object motion track noise, the foreground object size coefficient and the maximum foreground shielding proportion; the target background parameters include at least one of: the number of background obstructions, the number of background samples, background weather conditions, and background lighting conditions.
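A sketch of the iterative update loop described above is given below. The search space, parameter ranges, and the callables wrapping data generation and training/evaluation are hypothetical, and the random proposal step is only a placeholder for the Bayesian optimization algorithm mentioned in the application.

```python
import random

# Assumed search space for a few of the target foreground/background parameters
# listed above; the ranges are illustrative, not taken from the application.
SEARCH_SPACE = {
    "num_foreground_samples": (1, 20),
    "foreground_size_coeff": (0.5, 2.0),
    "max_foreground_occlusion": (0.0, 0.6),
    "num_background_obstacles": (0, 10),
}

def propose(space, rng):
    """Placeholder proposal step. In the method described above this would be a
    Bayesian optimization algorithm (e.g. a surrogate model plus an acquisition
    function) rather than random sampling."""
    params = {}
    for name, (lo, hi) in space.items():
        if isinstance(lo, int) and isinstance(hi, int):
            params[name] = rng.randint(lo, hi)
        else:
            params[name] = rng.uniform(lo, hi)
    return params

def tune(generate_data, train_and_eval, max_loops=20, target_score=0.9):
    """Iteratively update the target parameters until the preset number of loops
    is reached or the evaluation index of the preset algorithm reaches a preset
    value. `generate_data` and `train_and_eval` are hypothetical callables
    wrapping the augmentation pipeline and the algorithm training/testing."""
    rng = random.Random(0)
    best_params, best_score = None, float("-inf")
    for _ in range(max_loops):
        params = propose(SEARCH_SPACE, rng)
        training_images = generate_data(params)   # synthesize with these parameters
        score = train_and_eval(training_images)   # evaluation index after training
        if score > best_score:
            best_params, best_score = params, score
        if best_score >= target_score:
            break
    return best_params, best_score
```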
For example, the data augmented demand information may include foreground demand information and background demand information; the foreground requirement information can be used for determining a target foreground image, and the background requirement information can be used for determining a target background image.
For example, the foreground requirement information may include target foreground parameters and other foreground parameters. For example, the target foreground parameter may be a parameter that the device may automatically update during data augmentation, and the other foreground parameters may be parameters that the device may not automatically update during data augmentation.
The number of foreground samples is used for screening three-dimensional models or preset foreground images from the preset foreground asset library; the number of foreground objects is the number of target foreground objects included in one training image; and the maximum foreground occlusion ratio is the maximum proportion of the background that foreground objects may occlude.
By way of example, other foreground parameters may include, but are not limited to: foreground object type (e.g., person, rabbit, wheel, etc.), foreground object behavior category (e.g., stationary, uniform motion, accelerated motion, etc.), total number of samples (total number of training images), etc.
By way of example, the context demand information may include target context parameters and other context parameters. The target background parameter may be, for example, a parameter that the device may automatically update during data augmentation, and the other background parameter may be a parameter that the device may not automatically update during data augmentation.
Illustratively, the number of background obstacles is the number of obstacles contained in the background; and the number of background samples is used for screening preset background model groups or preset background images from the preset background asset library.
By way of example, other background parameters may include, but are not limited to: background category (e.g., expressway, urban road, mountain road, etc.), and so forth.
It should be appreciated that the above is merely an example of the demand information; in an autopilot scenario, the demand information may include more or less information than shown above. In addition, in other application scenarios the demand information may differ from that shown above; the present application is not limited in this regard.
In a second aspect, embodiments of the present application provide a data augmentation apparatus, the apparatus comprising:
the first image acquisition module is used for acquiring a target background image and a target semantic segmentation map corresponding to the target background image;
the pose acquisition module is used for determining the pose corresponding to the target foreground object based on the target semantic segmentation map, wherein the pose is the pose of the target foreground object in the target background image;
the second image acquisition module is used for acquiring a target foreground image based on the pose corresponding to the target foreground object, wherein the target foreground image comprises the target foreground object;
and the synthesis module is used for synthesizing the target foreground image and the target background image to obtain a training image.
Any implementation manner of the second aspect and the second aspect corresponds to any implementation manner of the first aspect and the first aspect, respectively. The technical effects corresponding to the second aspect and any implementation manner of the second aspect may be referred to the technical effects corresponding to the first aspect and any implementation manner of the first aspect, which are not described herein.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor, the memory coupled to the processor; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the data augmentation method of the first aspect or any possible implementation of the first aspect.
Any implementation manner of the third aspect and any implementation manner of the third aspect corresponds to any implementation manner of the first aspect and any implementation manner of the first aspect, respectively. The technical effects corresponding to the third aspect and any implementation manner of the third aspect may be referred to the technical effects corresponding to the first aspect and any implementation manner of the first aspect, which are not described herein.
In a fourth aspect, embodiments of the present application provide a chip comprising one or more interface circuits and one or more processors; the interface circuit is used for receiving signals from the memory of the electronic device and sending signals to the processor, wherein the signals comprise computer instructions stored in the memory; the computer instructions, when executed by a processor, cause an electronic device to perform the data augmentation method of the first aspect or any possible implementation of the first aspect.
Any implementation manner of the fourth aspect and any implementation manner of the fourth aspect corresponds to any implementation manner of the first aspect and any implementation manner of the first aspect, respectively. Technical effects corresponding to any implementation manner of the fourth aspect may be referred to the technical effects corresponding to any implementation manner of the first aspect, and are not described herein.
In a fifth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, which when run on a computer or processor, causes the computer or processor to perform the data augmentation method of the first aspect or any possible implementation of the first aspect.
Any implementation manner of the fifth aspect and any implementation manner of the fifth aspect corresponds to any implementation manner of the first aspect and any implementation manner of the first aspect, respectively. Technical effects corresponding to any implementation manner of the fifth aspect may be referred to the technical effects corresponding to any implementation manner of the first aspect, and are not described herein.
In a sixth aspect, embodiments of the present application provide a computer program product comprising a software program which, when executed by a computer or processor, causes the computer or processor to perform the data augmentation method of the first aspect or any possible implementation of the first aspect.
Any implementation manner of the sixth aspect corresponds to any implementation manner of the first aspect and the first aspect, respectively. Technical effects corresponding to any implementation manner of the sixth aspect may be referred to the technical effects corresponding to any implementation manner of the first aspect, and are not described herein.
Drawings
FIG. 1a is a schematic diagram of an exemplary application scenario;
FIG. 1b is a schematic diagram of an exemplary application scenario;
FIG. 2a is a schematic diagram of an exemplary data augmentation process;
FIG. 2b is a preset semantic segmentation diagram shown by way of example;
FIG. 3a is a schematic diagram of an exemplary data augmentation process;
FIG. 3b is a schematic diagram of an associated image of an exemplary illustrated location generation network;
FIG. 3c is a schematic diagram of an exemplary illustrated location generation network;
FIG. 3d is a schematic diagram illustrating a process for determining the orientation angle of a target foreground object;
FIG. 4a is a schematic diagram of an exemplary data augmentation process;
FIG. 4b is a schematic diagram of an exemplary illustrated data augmentation framework;
fig. 5 is a schematic diagram of an exemplary illustrated data augmentation device;
fig. 6 is a schematic diagram of the structure of the device shown in an exemplary manner.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone.
The terms first and second and the like in the description and in the claims of embodiments of the present application are used for distinguishing between different objects and not necessarily for describing a particular sequential order of objects. For example, the first target object and the second target object, etc., are used to distinguish between different target objects, and are not used to describe a particular order of target objects.
In the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations, or descriptions. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, unless otherwise indicated, the meaning of "a plurality" means two or more. For example, the plurality of processing units refers to two or more processing units; the plurality of systems means two or more systems.
Fig. 1a is a schematic diagram of an exemplary application scenario.
Referring to fig. 1a, the data augmentation method provided in the present application may be applied to a vehicle autopilot scenario, for example. In particular, it may be used to augment training data for training an autopilot algorithm to cover various difficult scenarios during autopilot. The difficult situations in the automatic driving process may include various situations, such as pedestrians, animals, etc. suddenly appearing on a highway, for example, a sudden lane change of a front vehicle, a sudden overtaking of a rear vehicle, etc., which is not limited in this application. It should be understood that the training data augmented by the data augmentation method provided in the present application may also cover non-difficult situations in the automatic driving process, which is not limited in this application.
It should be noted that the training data of the autopilot algorithm may include training images. The autopilot algorithm may include algorithms for processing images during autopilot, such as classification algorithms, detection algorithms, segmentation algorithms, and the like, as the present application is not limited in this regard.
Fig. 1b is a schematic view of an exemplary application scenario.
Referring to fig. 1b, the data augmentation method provided in the present application may be applied to a sweeping (or mopping) scene of a sweeping robot. Specifically, it can be used to augment training data for training the image perception algorithm in the sweeping robot, so as to cover various difficult cases in the sweeping (or mopping) process of the sweeping robot. The difficult scenes in the sweeping (or mopping) process may include, for example, small but important objects such as rings and necklaces on the ground. It should be understood that the training data augmented by the data augmentation method provided in the application may also cover non-difficult scenes in the sweeping (or mopping) process of the sweeping robot, which is not limited in this application.
It should be appreciated that the training data of the image perception algorithm may comprise training images.
It should be understood that the data augmentation method provided by the application can also be applied to the working scenes of other robots (such as meal delivery robots and patrol robots); specifically, training data for image perception algorithms in other robots can be added to cover various difficult cases in the working process of other robots; the present application is not limited in this regard.
The present application is exemplified by augmenting training data for training an autopilot algorithm.
Fig. 2a is a schematic diagram of an exemplary data augmentation process.
S201, acquiring a target background image and a target semantic segmentation map corresponding to the target background image.
For example, a preset background asset library may be pre-established; the preset background asset library comprises assets for generating target background images and assets for determining corresponding poses of target foreground objects.
In one possible approach, the pre-set background asset library may be built using CG (Computer Graphics, a graphic drawn by computer software) modeling. For example, CG modeling can be performed on multiple preset backgrounds through CG software to obtain multiple preset background model sets corresponding to the multiple preset backgrounds; wherein, a preset background model group corresponds to a preset background. For example, a preset background model group may include a three-dimensional model of a plurality of preset background objects included in a preset background. For example, if the preset background is an outdoor parking lot, the preset background object may include a parking space line, a road, a tree, and so on, and one preset background model group corresponding to the outdoor parking lot may include a three-dimensional model of the parking space line, a three-dimensional model of the road, a three-dimensional model of the tree, and so on.
Then, CG software may generate a corresponding preset semantic segmentation map based on the preset background model set. For example, the preset semantic segmentation map may include semantics of multiple types of preset background objects, where semantics of one type of preset background object corresponds to one semantic tag value, and semantics of different types of preset background object correspond to different semantic tag values.
Fig. 2b is a preset semantic segmentation diagram, shown schematically. It should be noted that, in the actual preset semantic segmentation map, colors corresponding to different types of preset background objects are different, and in fig. 2b, gray values corresponding to different types of preset background objects are different.
Referring to fig. 2b, semantics of the plurality of types of preset background objects in the preset semantic segmentation map of fig. 2b include: the semantics of the parking space lines, the semantics of the lane lines, the semantics of the trees and the semantics of the drivable area. The semantic tag value corresponding to the parking space line can be 1, the semantic tag value corresponding to the lane line can be 2, the semantic tag value corresponding to the tree can be 3, and the semantic tag value corresponding to the drivable area can be 4.
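For concreteness, the snippet below builds a tiny preset semantic segmentation map using the semantic tag values listed above (parking space line = 1, lane line = 2, tree = 3, drivable area = 4); the map size and contents are illustrative only.

```python
import numpy as np

# Assumed encoding matching the example above; 0 is left for unlabeled pixels.
PARKING_LINE, LANE_LINE, TREE, DRIVABLE = 1, 2, 3, 4

# A tiny illustrative preset semantic segmentation map (top view, 6x8 pixels).
semantic_map = np.zeros((6, 8), dtype=np.uint8)
semantic_map[1:5, 1:7] = DRIVABLE      # drivable area in the middle
semantic_map[3, 1:7] = LANE_LINE       # a lane line crossing it
semantic_map[1, 1:4] = PARKING_LINE    # a parking space line
semantic_map[0, 7] = TREE              # a tree at the edge

# Inspect the per-semantic pixel counts, e.g. before using the map for pose determination.
labels, counts = np.unique(semantic_map, return_counts=True)
```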
In this way, for a plurality of preset background model groups, a plurality of preset semantic segmentation graphs can be generated, and the preset semantic segmentation graphs correspond to the preset background model groups one by one.
And then, generating a preset background asset library by adopting a plurality of preset background model groups and a plurality of preset semantic segmentation graphs. The preset background model group is used for generating a target background image, and the preset semantic segmentation map is used for determining the pose corresponding to the target foreground object.
In one possible manner, a default background asset library may be established by means of implicit three-dimensional reconstruction. By way of example, a plurality of preset neural networks corresponding to a plurality of preset backgrounds can be obtained by scanning a real preset background and performing implicit three-dimensional reconstruction based on an image obtained by scanning; the preset background corresponds to a preset neural network.
Then, for a preset neural network, a top-down viewing angle can be input into the preset neural network, the preset neural network performs its operation, and a top view of the preset background corresponding to the preset neural network is output. Then, the top view can be subjected to semantic segmentation to obtain a preset semantic segmentation map. In this way, for a plurality of preset neural networks, a plurality of preset semantic segmentation maps can be generated, and the plurality of preset semantic segmentation maps are in one-to-one correspondence with the plurality of preset neural networks.
Then, a plurality of preset neural networks and a plurality of preset semantic segmentation graphs are adopted to generate a preset background asset library; the preset neural network is used for generating a target background image, and the preset semantic segmentation map is used for determining the pose corresponding to the target foreground object.
In one possible mode, a real-vehicle collection mode can be adopted to establish the preset background asset library. For example, for a preset background, a plurality of preset background images can be acquired through real-vehicle collection (that is, image acquisition is performed, while the vehicle is moving, by an image acquisition device on the vehicle with any orientation).
Then, for a preset background image, a corresponding preset semantic segmentation map may be generated based on the preset background image together with images acquired by image acquisition devices at other orientations (for example, the autopilot algorithm of the vehicle may generate the preset semantic segmentation map based on the preset background image and the images acquired at other orientations). Further, for a plurality of preset background images, a plurality of preset semantic segmentation maps can be generated; the plurality of preset semantic segmentation maps are in one-to-one correspondence with the plurality of preset background images.
Then, a plurality of preset semantic segmentation graphs and a plurality of preset background images can be adopted to generate a preset background asset library; the preset background image is used for generating a target background image, and the preset semantic segmentation map is used for determining the pose corresponding to the target foreground object.
Furthermore, a target background image and a target semantic segmentation map matched with the user requirement can be obtained based on a preset background asset library. In a possible manner, the acquired target background image may be one, and the corresponding target semantic segmentation map is also one. In a possible mode, a plurality of target background images can be acquired, and a plurality of corresponding target semantic segmentation graphs are also acquired; the plurality of target background images and the plurality of target semantic segmentation maps are in one-to-one correspondence, that is to say, one target semantic segmentation map corresponds to one target background image.
For example, if the background of the user demand is a daytime expressway, the daytime expressway image can be obtained as a target background image based on a preset background asset library; and acquiring a semantic segmentation map corresponding to the daytime expressway image as a target semantic segmentation map. For another example, if the background of the user demand is a night expressway, the night expressway image can be obtained as the target background image based on the preset background asset library; and acquiring a semantic segmentation map corresponding to the expressway image at night as a target semantic segmentation map. For example, if the background of the user demand is a rainy city road, the rainy city road image can be obtained as a target background image based on a preset background asset library; and acquiring a semantic segmentation map corresponding to the urban road image in the rainy day as a target semantic segmentation map.
Illustratively, the target background image is used to synthesize a training image as a background for the training image. The target semantic segmentation map is used for determining the pose corresponding to the target foreground object.
S202, determining the pose corresponding to the target foreground object based on the target semantic segmentation map, wherein the pose is the pose of the target foreground object in the target background image.
For example, one or more target foreground objects may be first determined according to user requirements. By way of example, the foreground object may include a variety of objects such as humans, animals, objects (e.g., front vehicles, rear vehicles, adjacent lane vehicles), and the like, which are not limited in this application.
It should be noted that, when the target foreground object includes a plurality of target foreground objects, in one possible manner, the types of the plurality of target foreground objects are all different. In one possible approach, the multiple target foreground objects are all of the same type. In one possible approach, the plurality of target foreground objects includes foreground objects of a partially identical type and foreground objects of a partially different type. The present application does not limit the type to which the target foreground object corresponds.
Next, based on the one target semantic segmentation map, a pose of the one target foreground object in the target background image corresponding to the one target semantic segmentation map may be determined. For example, the pose of the target foreground object in the target background image corresponding to the target semantic segmentation map may be determined by analyzing based on the type of the target foreground object and semantic tag values of various semantics in the target semantic segmentation map. The pose may include a position and an orientation angle, among other things.
When the target foreground object is determined to be a static object according to the user requirement, a pose of the target foreground object in the target background image corresponding to the target semantic segmentation map can be determined based on the target semantic segmentation map. When the target foreground object is determined to be a moving object according to the user requirement, based on one target semantic segmentation map, a plurality of poses of the target foreground object in the target background image corresponding to the target semantic segmentation map can be determined.
Further, when the target background images are plural and the target semantic segmentation map is plural, the pose of one target foreground object in the plural target background images can be determined based on the plural target semantic segmentation maps. Thus, when there are a plurality of target foreground objects, each target foreground object can be based on a plurality of target semantic segmentation graphs, and the pose of each target foreground object in a plurality of target background images can be obtained. Wherein the pose of each target foreground object in one target background image is one or more.
It should be understood that when the target semantic segmentation map is multiple and the target background image is multiple, the pose of one target foreground object in a part of the target background image can be determined based on only a part of the target semantic segmentation map; the pose of the target foreground object in all target background images can be determined based on all target semantic segmentation graphs; and in particular, may be determined according to user requirements, which is not limited in this application. That is, for a target foreground object, one or more poses of the target foreground object in a target background image may be determined, and one or more poses of the target foreground object in respective target background images of a plurality of target background images may also be determined.
S203, acquiring a target foreground image based on the pose corresponding to the target foreground object, wherein the target foreground image comprises the target foreground object.
For example, a preset foreground asset library may be pre-established, wherein the preset foreground asset library includes assets for generating the target foreground image.
In one possible approach, a CG modeling approach may be employed to build a library of pre-set foreground assets. By way of example, modeling can be performed on a plurality of preset foreground objects through CG to obtain three-dimensional models corresponding to the plurality of preset foreground objects; for example, a three-dimensional model of a character, a three-dimensional model of an animal, a three-dimensional model of an object, and so forth. Wherein each of the preset foreground objects may correspond to one or more three-dimensional models (e.g., a stationary preset foreground object corresponds to one three-dimensional model, an accelerating preset foreground object corresponds to one or more three-dimensional models, and a decelerating preset foreground object corresponds to one or more three-dimensional models).
Then, the three-dimensional models of the plurality of preset foreground objects are used to establish the preset foreground asset library; the three-dimensional models of the preset foreground objects are used for generating the target foreground image.
In one possible manner, a preset foreground asset library may be established by means of explicit three-dimensional reconstruction. By way of example, the three-dimensional model corresponding to the various preset foreground objects can be obtained by scanning the real preset foreground object and performing explicit three-dimensional reconstruction based on the scanned image. Wherein each preset foreground object may correspond to one or more three-dimensional models.
Then, the three-dimensional models of the plurality of preset foreground objects are used to establish the preset foreground asset library; the three-dimensional models of the preset foreground objects are used for generating the target foreground image.
In one possible manner, a default library of foreground assets may be established in an implicit three-dimensional reconstruction manner. By way of example, the neural network corresponding to the various preset foreground objects can be obtained by scanning the real preset foreground object and performing implicit three-dimensional reconstruction based on the scanned image. Wherein each preset foreground object may correspond to one or more neural networks.
Then, a plurality of neural networks are adopted to build a preset foreground asset library. The neural network is used for generating a target foreground image.
In one possible mode, a real vehicle acquisition mode can be adopted to establish the preset foreground asset library. For one preset foreground object, a plurality of preset foreground images can be acquired through real vehicle acquisition, that is, image acquisition is carried out by an image acquisition device in the vehicle while the vehicle is moving (the orientation of the image acquisition device is not limited, as long as the preset foreground object can be captured); and the pose of the preset foreground object in each preset foreground image (hereafter called the preset pose) is labeled.
Then, generating a foreground asset library by adopting a plurality of preset foreground images and preset poses of preset foreground objects in the preset foreground images; wherein the preset foreground image and the preset pose may be used to generate a target foreground image.
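As a non-limiting illustration of how such a preset foreground asset library might be organized in practice, the following sketch shows one possible data structure; the class names, fields and file paths are assumptions made for illustration only and are not part of the method itself.

```python
from dataclasses import dataclass, field

@dataclass
class ForegroundAsset:
    """One entry of a hypothetical preset foreground asset library built by
    real-vehicle collection (all names and fields are illustrative only)."""
    object_type: str          # e.g. "person", "animal", "wheel_stop"
    image_path: str           # path to the preset foreground image
    preset_pose: tuple        # labeled (x, y, orientation_angle) of the object
    depth_map_path: str = ""  # optional depth info generated from the 3D point cloud

@dataclass
class ForegroundAssetLibrary:
    assets: list = field(default_factory=list)

    def add(self, asset: ForegroundAsset) -> None:
        self.assets.append(asset)

    def query(self, object_type: str) -> list:
        """Return all assets whose type matches the foreground requirement."""
        return [a for a in self.assets if a.object_type == object_type]

# usage sketch
library = ForegroundAssetLibrary()
library.add(ForegroundAsset("wheel_stop", "imgs/ws_001.png", (3.2, 1.5, 0.0), "depth/ws_001.npy"))
candidates = library.query("wheel_stop")
```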
For example, for a target foreground object, a target foreground image corresponding to the target foreground object may be determined according to a pose corresponding to the target foreground object, a user requirement, and a preset foreground asset library, and the target foreground image may include the target foreground object.
Further, when the target foreground object corresponds to a plurality of poses, a plurality of target foreground images may be determined, each of which includes the target foreground object. Correspondingly, when there are a plurality of target foreground objects, one or more target foreground images corresponding to each target foreground object may be determined.
Based on the above description, the pose of a target foreground object includes the pose of the target foreground object in a target background image; further, one target foreground image may correspond to one target background image. And, the pose of one target foreground object may include the pose of the target foreground object in a plurality of target background images, and further, one target background image may correspond to a plurality of target foreground images.
The target foreground image can be used for synthesizing a training image and can be used as the foreground of the training image.
S204, synthesizing the target foreground image and the target background image to obtain a training image.
For example, for a target foreground image, the target foreground image and a target background image corresponding to the target foreground image may be synthesized to obtain a training image. When the target background images corresponding to the target foreground images are multiple, the target foreground images can be respectively combined with the corresponding target background images, and multiple training images can be obtained.
It should be understood that the same training image may include a plurality of target foreground objects, and further, a plurality of target foreground objects and a target background image may be synthesized, so that a training image may be obtained.
The target foreground image is generated according to the pose of the target foreground object in the target background image, that is, the pose of the target foreground object in the target foreground image, that is, the pose of the target foreground object in the target background image. Therefore, after the target foreground image and the corresponding target background image are synthesized, the target foreground object in the target foreground image can be placed in the target background image according to the determined pose. Furthermore, after the training images are synthesized, the target foreground objects can appear at reasonable positions of the background, so that the augmented training images are more similar to the images acquired by the actual scenes, and images covering various difficult-to-be-detected scenes are obtained; thus, images covering various difficult scenes can be automatically generated in large quantities. After training the preset algorithm (such as an automatic driving algorithm) by using the training image, the index, generalization and robustness of the trained preset algorithm can be improved.
The following describes a process of determining the pose corresponding to the target foreground object based on the target semantic segmentation map, and a process of synthesizing the target foreground image and the target background image.
Fig. 3a is a schematic diagram of an exemplary data augmentation process.
S301, acquiring requirement information of data augmentation, wherein the requirement information comprises foreground requirement information and background requirement information.
For example, the data augmented demand information may include foreground demand information and background demand information; the foreground requirement information can be used for determining a target foreground image, and the background requirement information can be used for determining a target background image.
For example, the foreground requirement information may include target foreground parameters and other foreground parameters. The target foreground parameter may be a parameter that may be automatically updated by the device during the data augmentation process, and the other foreground parameters may be parameters that may not be automatically updated by the device during the data augmentation process, and the update process of the target foreground parameter will be described later.
In one possible approach, the target foreground parameters may include, but are not limited to: the number of foreground samples (for screening a three-dimensional model or a preset foreground image from a preset foreground asset library), the number of foreground objects (the number of target foreground objects included in one training image), the foreground object size coefficient, the foreground object motion speed, the foreground object motion distance, the foreground object motion trajectory noise, the maximum foreground occlusion proportion (i.e., the proportion of foreground objects occluding the background), and the like. Other foreground parameters may include, but are not limited to: foreground object type (e.g., person, rabbit, wheel, etc.), foreground object behavior category (e.g., stationary, uniform motion, accelerated motion, etc.), total number of samples (total number of training images), etc.
By way of example, the context demand information may include target context parameters and other context parameters. The target background parameter may be a parameter that the device may automatically update during the data augmentation process, and the other background parameters may be parameters that the device may not automatically update during the data augmentation process, and the update process of the target background parameter will be described later.
In one possible approach, the target background parameters may include, but are not limited to: the number of background obstacles (i.e., the number of obstacles contained in the background), the number of background samples (for screening a preset background model group or a preset background image from a preset background asset library), background weather conditions, background lighting conditions, and the like. Other background parameters may include, but are not limited to: background category (e.g., expressway, urban road, mountain road, etc.), and so forth.
It should be appreciated that the above is merely an example of the requirement information; in an autopilot scenario, the requirement information may include more or less information than shown above, and in other application scenarios, the requirement information may differ from that shown above; the present application is not limited in this regard.
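As a non-limiting illustration, the requirement information described above could be organized as configuration objects along the following lines; the field names and default values are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ForegroundRequirement:
    """Illustrative container for the foreground requirement information;
    the field names are assumptions, not a schema mandated by the method."""
    object_type: str = "person"          # foreground object type
    behavior: str = "stationary"         # stationary / uniform / accelerated ...
    sample_count: int = 20               # number of foreground samples
    objects_per_image: List[int] = field(default_factory=lambda: [1])
    size_coefficient: float = 1.0
    motion_speed: float = 0.0
    motion_distance: float = 0.0
    max_occlusion_ratio: float = 0.3     # maximum foreground occlusion proportion

@dataclass
class BackgroundRequirement:
    category: str = "expressway"         # background category
    sample_count: int = 30               # number of background samples
    obstacle_count: int = 0
    weather: str = "sunny"
    lighting: str = "daytime"

@dataclass
class AugmentationRequirement:
    total_samples: int = 600             # total number of training images
    foreground: ForegroundRequirement = field(default_factory=ForegroundRequirement)
    background: BackgroundRequirement = field(default_factory=BackgroundRequirement)
```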
S302, acquiring a target background image and a target semantic segmentation map corresponding to the target background image based on a preset background asset library and background demand information.
In a possible manner, when the preset background asset library includes a plurality of preset semantic segmentation graphs and a plurality of preset background images which are collected in advance, a target background image matched with the background requirement information can be selected from the plurality of preset background images; and then, selecting a preset semantic segmentation map corresponding to the target background image from the plurality of preset semantic segmentation maps as a target semantic segmentation map. Thus, the target background image and the target semantic segmentation map can be selected rapidly.
For example, a plurality of candidate background images may be selected from a plurality of preset background images according to the background category in the background demand information; and then screening target background images with the same number as the number of the background samples from the candidate background images according to the number of the background obstacles, the background weather condition, the background illumination condition and the like.
In a possible manner, when the background asset library includes a plurality of preset background model groups and a plurality of preset semantic segmentation graphs, a target background model group matched with the background requirement information can be selected from the preset background model groups; and then, rendering the target background model group based on a preset view angle to obtain a target background image. And then, selecting a preset semantic segmentation map corresponding to the target background model group from a plurality of preset semantic segmentation maps as a target semantic segmentation map. The preset viewing angle can be any photographing viewing angle of the vehicle-mounted camera in the running process of the vehicle.
For example, a plurality of candidate background model sets may be selected from a plurality of preset background model sets according to the background category in the background demand information; and then, according to the number of the background barriers, the background weather conditions and the like, selecting a target background model group with the same numerical value as the number of the background samples from the plurality of candidate background model groups.
In a possible manner, when the background asset library includes a plurality of preset neural networks and a plurality of preset semantic segmentation graphs, a target neural network matched with the background demand information can be selected from the plurality of preset neural networks; then, inputting a preset visual angle into a target neural network to obtain a target background image output by the target neural network; and then, selecting a preset semantic segmentation map corresponding to the target neural network from the plurality of preset semantic segmentation maps as a target semantic segmentation map.
For example, a plurality of candidate neural networks may be selected from a plurality of preset neural networks according to the background category in the background demand information; and then, according to the number of the background obstacles, the background weather condition, the background illumination condition and the like, screening target neural networks with the same number as the number of the background samples from the candidate neural networks.
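As a non-limiting illustration of the screening logic described above (filter by background category first, then by the number of background obstacles, weather and lighting conditions, and keep as many images as the number of background samples), the following sketch shows one possible implementation; the dictionary keys and example entries are assumptions.

```python
def screen_backgrounds(preset_backgrounds, requirement):
    """Hedged sketch of screening target background images from a preset
    background asset library; the dict keys are illustrative assumptions."""
    # first filter by background category
    candidates = [b for b in preset_backgrounds
                  if b["category"] == requirement["category"]]
    # then filter by obstacle count, weather and lighting conditions
    matched = [b for b in candidates
               if b["obstacles"] == requirement["obstacles"]
               and b["weather"] == requirement["weather"]
               and b["lighting"] == requirement["lighting"]]
    # finally keep as many images as the number of background samples
    return matched[:requirement["sample_count"]]

# usage sketch
presets = [
    {"category": "expressway", "obstacles": 2, "weather": "sunny",
     "lighting": "day", "image": "bg_001.png", "semantics": "sem_001.png"},
    {"category": "urban", "obstacles": 0, "weather": "rain",
     "lighting": "night", "image": "bg_002.png", "semantics": "sem_002.png"},
]
targets = screen_backgrounds(presets, {"category": "expressway", "obstacles": 2,
                                       "weather": "sunny", "lighting": "day",
                                       "sample_count": 30})
```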
For example, since the clear imaging range of an image collected by the real vehicle is limited, a top view converted from the camera image is usable only within a certain distance range, and the more distant part of the top view can be very blurred; therefore, the view range of a top view generated from a preset background image collected by the real vehicle is limited, and the semantic range covered by a preset semantic segmentation map in a preset background asset library established by real vehicle collection is also limited. Therefore, the poses of obstacles in the preset background can be determined based on images acquired for the same preset background with other camera poses (camera poses other than the one used when the preset background image was acquired; for example, if the preset background image was acquired with the camera pose of a vehicle-mounted front-view camera, the other camera poses may be the camera poses of a vehicle-mounted side-view camera and a vehicle-mounted rear-view camera). For example, an obstacle pose may be stored in 3D form, that is, represented by the coordinates of the 8 vertices of a cuboid, and is subsequently called a third pose. The obstacle pose group corresponding to the preset background can thus be obtained. Then, the obstacle pose group can be associated with the preset background image and the preset semantic segmentation map corresponding to the preset background. Thus, after the target semantic segmentation map is obtained, the third poses of one or more obstacles in the obstacle pose group associated with the target semantic segmentation map may be mapped into the target semantic segmentation map. In this way, the semantics of the semantic segmentation map can be enriched.
Similarly, because the clear imaging range of the images obtained by scanning the real preset background is limited, a top view converted from the camera images is usable only within a certain distance range, and the more distant part of the top view can be very blurred, so the semantic range covered by a preset semantic segmentation map in a preset background asset library established by implicit three-dimensional reconstruction is also limited. Therefore, the poses of obstacles in the preset background can likewise be determined based on images acquired for the same preset background with other camera poses (camera poses other than the one used when the preset background image was acquired; for example, if the preset background image was acquired with the camera pose of a vehicle-mounted front-view camera, the other camera poses may be the camera poses of a vehicle-mounted side-view camera and a vehicle-mounted rear-view camera); an obstacle pose may be stored in 3D form, that is, represented by the coordinates of the 8 vertices of a cuboid, and is also called a third pose. The obstacle pose group corresponding to the preset background can thus be obtained. Then, the obstacle pose group can be associated with the neural network and the preset semantic segmentation map corresponding to the preset background. Thus, after the target semantic segmentation map is obtained, the third poses of one or more obstacles in the obstacle pose group associated with the target semantic segmentation map can be mapped into the target semantic segmentation map. In this way, the semantics of the semantic segmentation map can be enriched.
By way of example, since various information in a CG scene is completely known, the semantics at any position in a top view are known; therefore, the CG software can directly generate the preset semantic segmentation map from the preset background model group, without first converting the scene into a top view and then performing semantic segmentation on the top view. The generation of the preset semantic segmentation map is thus not limited by the clearly usable range of the top view, and a preset semantic segmentation map in a preset background asset library established by CG modeling can cover a large semantic range, so that its semantics are rich. In this case, there is no need to determine an obstacle pose group corresponding to the preset background model group, and no need to map the third poses of obstacles in an obstacle pose group to the preset semantic segmentation map.
In S202, determining the pose corresponding to the target foreground object based on the target semantic segmentation map may refer to S303 to S308 as follows. In the following, S303 to S308, a target semantic segmentation map and a target foreground object are exemplified.
S303, determining probability information based on the target semantic segmentation map, wherein the probability information comprises the probability of each pixel point in the target semantic segmentation map to generate a target foreground object.
For example, the location generation network may be trained in advance, and then the trained location generation network and the target semantic segmentation map are used to determine the probability information, that is, the probability that the target foreground object appears at each pixel point in the target semantic segmentation map.
By way of example, the process of training the location generation network may be as follows:
s11, acquiring a semantic segmentation map, wherein the semantic segmentation map comprises the semantics of a foreground object and the semantics of a background object.
For example, a plurality of semantic segmentation maps may be acquired as training images for training the position generation network.
It should be noted that, the semantic segmentation map for training the position generation network is obtained by performing semantic segmentation on a top view of a scene (including a foreground object and a background object); that is, the semantic segmentation map for training the location generation network includes the semantics of the foreground object and the semantics of the background object. The target semantic segmentation map and the preset semantic segmentation map only contain the semantics of the background object, that is, the semantic segmentation map used for training the position generation network is different from the target semantic segmentation map and the preset semantic segmentation map.
Fig. 3b is a schematic diagram of images associated with an exemplary location generation network. Fig. 3b (1) is an acquired semantic segmentation map. The semantics of the background objects in the semantic segmentation map of fig. 3b (1) include: the semantics of the parking space lines, the semantics of the trees, the semantics of the lane lines and the semantics of the drivable area; the semantic tag value corresponding to the parking space line may be 1, the semantic tag value corresponding to the lane line may be 2, the semantic tag value corresponding to the tree may be 3, and the semantic tag value corresponding to the drivable area may be 4. The semantics of the foreground object in the semantic segmentation map of fig. 3b (1) include: the semantics of the wheel stop, whose semantic tag value may be 5.
S12, shielding semantics of a foreground object in the semantic segmentation map to obtain a first image.
For each semantic segmentation map, the semantics of the foreground object in the semantic segmentation map can be blocked, so that a first image can be obtained. For example, the semantic tag value corresponding to the foreground object in the semantic segmentation map is covered by the semantic tag value corresponding to a background object in the surrounding environment of the foreground object (i.e., within a preset range centered on the foreground object); that is, the semantic tag value corresponding to the foreground object in the semantic segmentation map is adjusted to the semantic tag value corresponding to a background object in the semantic segmentation map.
Referring to fig. 3b (1), the semantic tag value 5 corresponding to the wheel stop may be adjusted to the semantic tag value 4 corresponding to the drivable area, and the obtained first image is shown in fig. 3b (2).
In this way, the processing of S12 is performed on the plurality of semantic division diagrams, and a plurality of first images can be obtained.
S13, binarizing the semantic segmentation map to obtain a second image, wherein the pixel value of the pixel points contained in the foreground object in the second image is different from the pixel value of the pixel points contained in the background object.
For each semantic segmentation map, the semantic tag value corresponding to the foreground object in the semantic segmentation map can be set to 0 and the semantic tag value of the background object set to 1; or the semantic tag value corresponding to the foreground object can be set to 1 and the semantic tag value of the background object set to 0. In this way, the semantic segmentation map is binarized, and a second image can be obtained. For example, reference may be made to fig. 3b (3), where the black portions correspond to the background object and the white portions correspond to the foreground object.
In this way, the processing of S13 is performed on the plurality of semantic division diagrams, and a plurality of second images can be obtained.
S14, training the position generation network based on the category identification of the foreground object, the first image and the second image.
For example, the category identifier of the foreground object, the first image and the second image may be input to the location generation network, and the location generation network performs position prediction based on the category identifier of the foreground object and the first image and then outputs probability information (i.e., the probability that the foreground object appears at each pixel point in the first image). Next, a probability heat map may be generated based on the probability that the foreground object appears at each pixel point in the first image, as shown in fig. 3b (4). In fig. 3b (4), a gray-scale map is used for illustration: the gray value of a pixel point in the probability heat map reflects the probability that the foreground object appears there; for example, white pixel points represent the highest probability.
The probabilistic heat map is then compared to the second image to adjust the weight parameters of the location generating network. In this way, the position generation network can learn the reasonable position of the foreground object in the background by taking the image which shields the foreground object and reserves the background object as a training image and taking the image which reserves the foreground object and shields the background object as a label; in the use process of the subsequent position generation network, the position prediction can be more accurately performed.
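As a non-limiting illustration of steps S12 and S13, the following sketch builds the first image (foreground semantics occluded by a surrounding background tag) and the second image (binary foreground mask) from one semantic segmentation map; the helper name and the choice of fill tag are assumptions, and the tag values 4 and 5 follow the example of fig. 3b.

```python
import numpy as np

def make_training_pair(semantic_map, foreground_label, fill_label):
    """Hedged sketch of S12-S13: `fill_label` stands for the semantic tag of a
    surrounding background object and is chosen by the caller."""
    fg_mask = (semantic_map == foreground_label)

    # S12: cover the foreground tag value with a background tag value -> first image
    first_image = semantic_map.copy()
    first_image[fg_mask] = fill_label

    # S13: binarize -> second image (1 where the foreground object is, 0 elsewhere)
    second_image = fg_mask.astype(np.uint8)
    return first_image, second_image

# usage sketch with the tag values of fig. 3b: 4 = drivable area, 5 = wheel stop
semantic_map = np.full((8, 8), 4, dtype=np.uint8)
semantic_map[3:5, 3:5] = 5
first, second = make_training_pair(semantic_map, foreground_label=5, fill_label=4)
```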
Fig. 3c is a schematic diagram of an exemplary illustrated location generation network. Referring to fig. 3c, an exemplary location generation network may include a gaussian kernel function and a sigmoid function (which is a nonlinear activation function).
Illustratively, in the training process, the semantic tag value corresponding to each pixel in the first image is input to the position generation network, where X(0,0) represents the pixel value of the pixel at (0,0), X(0,1) represents the pixel value of the pixel at (0,1), and so on, and X(m,n) represents the pixel value of the pixel at (m,n); m and n are positive integers. The output of the position generation network is the probability that the foreground object appears at each pixel point in the first image; Y(i,j) represents the probability that the foreground object appears at the pixel point at (i,j), where i is an integer between 0 and m, and j is an integer between 0 and n.
Wherein, the Gaussian kernel function is evaluated over the pixel positions (l, k) of the first image, with l∈(0, m) and k∈(0, n); ||(i, j) − (l, k)||² represents the distance between the pixel point at (i, j) and another position (l, k) in the semantic segmentation map. ω and θ are weight coefficients: ω is the weight of the semantics corresponding to the pixel point at (l, k) for one type of foreground object, and θ is the influence range of the semantics corresponding to the pixel point at (l, k) for one type of foreground object.
For example, the location generation network may correspond to multiple sets of weight coefficients, that is, multiple sets of weight coefficients for the Gaussian kernel function; each set of weight coefficients includes an ω vector and a θ vector. The ω vector includes the weights of the various background object semantics for one type of foreground object, and the θ vector includes the influence ranges of the various background object semantics for one type of foreground object. Further, in training the location generation network, the location generation network configured with one set of weight parameters may be trained for one type of foreground object (hereinafter referred to as a first foreground object) based on the class identification of the first foreground object, the first image, and the second image. That is, for one type of foreground object, a corresponding set of weight parameters may be trained. In this way, different weight coefficients can be trained for different types of foreground objects. Since different types of foreground objects appear at different reasonable positions in the background, the location generation network can accurately predict the positions of the different types of foreground objects.
For example, the category identifier corresponding to the target foreground object can be determined according to the foreground object type in the foreground requirement information. After the training of the position generation network is completed, the target semantic segmentation map and the category identifier corresponding to the target foreground object can be input into the position generation network, and the position generation network selects the set of weight parameters corresponding to the category identifier of the target foreground object for configuration and then performs position prediction, so as to obtain the probability information output by the position generation network. In this way, the position generation network performs position prediction using the weight parameters corresponding to the type of the target foreground object, and can therefore accurately predict the reasonable position of the target foreground object in the target background image.
For example, one target semantic segmentation map and the category identifier corresponding to one target foreground object can be input to the position generation network each time; then, the position generation network may select, from the multiple sets of weight parameters, the set of weight parameters corresponding to the category identifier of the target foreground object for configuration, perform position prediction based on the target semantic segmentation map, and output the corresponding probability information.
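As a non-limiting illustration of the structure described above (a Gaussian kernel weighted by per-semantic coefficients ω and θ, followed by a sigmoid), the following sketch shows one plausible forward pass; the exact kernel form, the dictionary-based weight lookup, and the example coefficient values are assumptions and are not taken from this application.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_probability_map(first_image, omega, theta):
    """Hedged sketch of a Gaussian-kernel-plus-sigmoid position prediction.
    first_image: 2D array of background semantic tag values.
    omega, theta: dicts mapping a semantic tag to the weight and influence
    range of one foreground class (one assumed "set of weight parameters")."""
    h, w = first_image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    score = np.zeros((h, w), dtype=np.float64)
    for l in range(h):
        for k in range(w):
            tag = int(first_image[l, k])
            w_lk = omega.get(tag, 0.0)
            t_lk = theta.get(tag, 1.0)
            # squared distance ||(i,j) - (l,k)||^2 from every pixel (i,j) to (l,k)
            dist2 = (ys - l) ** 2 + (xs - k) ** 2
            score += w_lk * np.exp(-dist2 / (t_lk ** 2))
    return sigmoid(score)

# usage sketch: a tiny map where tag 4 (drivable area) attracts the foreground
# and tag 3 (tree) repels it; the coefficient values are illustrative only
semantic_map = np.full((6, 6), 3, dtype=np.uint8)
semantic_map[2:4, 2:4] = 4
prob = predict_probability_map(semantic_map,
                               omega={4: 1.0, 3: -1.0},
                               theta={4: 2.0, 3: 2.0})
```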
Subsequently, a first pose of the target foreground object in the target background image may be determined based on the probability information; reference may be made to S304 to S306:
s304, determining the position of the target foreground object in the target background image based on the probability information.
For example, a probability heat map may be generated based on the probability information; wherein, the pixel values of the pixel points with different probabilities in the probability heat map are different.
Then, the probability heat map may be sampled according to a preset constraint condition to determine a position of the target foreground object in the target background image. For example, the probability heat map may be sampled first to determine candidate locations; and for each pixel point in the probability heat map, putting a corresponding number of preset elements into a preset set according to the probability of the pixel point. For example, when the probability of the pixel is equal to 0.1, 1 preset element is put into a preset set (the 1 preset element corresponds to the pixel), and when the probability of the pixel is equal to 0.2, 2 preset elements are put into a preset set (the 2 preset elements correspond to the pixel); and so on. Then, randomly selecting a preset element from the preset set, wherein the coordinates of the pixel points corresponding to the preset element are candidate positions.
Then, whether the candidate position meets the preset constraint condition can be judged, and when the candidate position meets the preset constraint condition, the candidate position can be used for determining the position of the target foreground object in the target background image. When the candidate position does not meet the preset constraint condition, the candidate position can be reselected from the preset set; and then continuously judging whether the candidate position meets the preset constraint condition.
In one possible manner, the preset constraint may be determined based on the foreground requirement information, for example, the preset constraint includes: the foreground occlusion ratio is less than or equal to the maximum foreground occlusion ratio in the foreground demand information.
In a possible manner, the preset constraint condition may be preset according to the requirement, for example, the preset constraint condition includes: the distance between the two foreground objects is larger than a preset distance; for example, the preset constraints include: the foreground object is not able to obscure the reference background object, etc.
Therefore, sampling is performed according to preset constraint conditions, and the target foreground object can be enabled to appear at a reasonable position in the target background image.
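As a non-limiting illustration of the sampling procedure described above (each pixel contributes a number of elements proportional to its probability, and candidates are re-drawn until the preset constraints are satisfied), the following sketch shows one possible implementation; the function name, the scaling of probabilities to element counts, and the example constraint are assumptions.

```python
import random

def sample_position(prob_map, constraints, max_tries=1000):
    """Hedged sketch of S304: build a pool in which each pixel appears in
    proportion to its probability, then resample until a candidate satisfies
    the preset constraints (e.g. maximum foreground occlusion proportion)."""
    pool = []
    for (i, j), p in prob_map.items():
        # probability 0.1 contributes 1 element, 0.2 contributes 2, and so on
        pool.extend([(i, j)] * int(round(p * 10)))
    if not pool:
        return None
    for _ in range(max_tries):
        candidate = random.choice(pool)
        if all(check(candidate) for check in constraints):
            return candidate
    return None

# usage sketch: forbid the upper-left corner as an illustrative constraint
prob_map = {(0, 0): 0.1, (5, 5): 0.3, (7, 2): 0.6}
constraints = [lambda pos: pos != (0, 0)]
position = sample_position(prob_map, constraints)
```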
S305, determining the orientation angle of the target foreground object in the target background image based on the position of the reference semantic in the target semantic segmentation map and the position of the target foreground object in the target background image.
For example, reference semantics may be preset, such as a parking space line, a wall, a pillar, and other obstacles. Next, the location of the reference semantics in the target semantic segmentation map is determined. Then, taking the position corresponding to the target foreground object in the target semantic segmentation map as the center, a detection ray is emitted outward every preset angle, and the length of each ray is gradually increased until the ray collides with the reference semantics. In this way, a plurality of collision points can be obtained from the collisions of the plurality of detection rays with the reference semantics, where one collision point is obtained from the collision of one detection ray with the reference semantics. The preset angle may be set according to requirements, for example 10°, which is not limited in this application.
Fig. 3d is a schematic diagram illustrating a process of determining the orientation angle of a target foreground object.
By way of example, fig. 3d (1) is a schematic diagram of the collision points of the detection rays with the reference semantics. In fig. 3d (1), the target foreground object is a wheel stop, the reference semantics may be the parking space lines, and a plurality of collision points may be obtained between the detection rays and the parking space lines.
For example, if the preset angle is 10°, 36 collision points may be obtained, and the coordinates corresponding to the 36 collision points are respectively (x0, y0), (x1, y1), (x2, y2), …, (x35, y35). Subsequently, the difference in coordinates between two collision points that are adjacent in position may be calculated to obtain a difference point, where the coordinates of the difference point may be expressed as (Δxi, Δyi), Δxi = x(i+1) − x(i), Δyi = y(i+1) − y(i) (i is an integer). For example, the coordinates of each difference point may be normalized, and the coordinates of the obtained difference point may be expressed as (ΔCXi, ΔCYi), where the vector (ΔCXi, ΔCYi) has a modulus of 1. In this way, connecting the origin of the target semantic segmentation map to each difference point represents an orientation.
Referring to fig. 3d (2), 36 normalized difference points are shown in fig. 3d (2). It should be noted that fig. 3d (2) and fig. 3d (1) are not in the same coordinate system; and the difference points near the upper border of the image in fig. 3d (2) are not the difference points corresponding to the adjacent collision points near the upper border of the image in fig. 3d (1).
For example, a clustering algorithm (e.g., the K-means clustering algorithm) may be used to divide the difference points into a plurality of clusters, and the center point of each cluster is determined. As shown in fig. 3d (3), when the difference points are divided into 4 clusters, four center points can be determined: center point 1, center point 2, center point 3, and center point 4. Next, the center points of the two clusters containing the most elements (namely, difference points) are taken; for example, center point 1 and center point 4 are selected in fig. 3d (3). The lines connecting the origin of the target semantic segmentation map to the two center points are then determined, so as to obtain vector 1 (corresponding to center point 1) and vector 2 (corresponding to center point 4); vector 1 represents one reference direction and vector 2 represents another reference direction. Then, the difference coordinates between the position of the target foreground object and the center of the target semantic segmentation map are calculated to obtain a difference point; this difference point is normalized, and the line connecting the origin of the target semantic segmentation map to this difference point is determined, so as to obtain vector 3. From vector 1 and vector 2, the reference direction corresponding to the vector with the smallest angle difference from vector 3 is selected as the final reference direction, as shown by the white arrow in fig. 3d (4).
Illustratively, the orientation angle of the target foreground object in the target semantic segmentation map (assuming vector 1 is selected, the angle between vector 1 and vector 3) is denoted as α. Assume that the class identifier of the target foreground object is s; its final orientation angle is γ = α + T(s) + N(s), where T(s) is a fixed deviation value corresponding to the class to which the target foreground object belongs, and N(s) is random noise subject to a Gaussian distribution whose μ value is 0 and whose σ value is the standard deviation, determined in advance, of the deviations between the reference direction and the actual direction for the class to which the target foreground object belongs. The fixed deviation value can be set in a self-defined manner (for example, the deviation value corresponding to the wheel stop is 0 degrees, and the deviation value corresponding to the hydrant box is 90 degrees), or can be taken as the average, determined in advance, of the deviations between the reference direction and the actual direction for the class to which the target foreground object belongs.
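As a non-limiting illustration of the formula γ = α + T(s) + N(s), the following sketch computes a final orientation angle from an assumed per-class table of fixed deviation values and noise standard deviations; the table contents are illustrative only.

```python
import random

def final_orientation_angle(alpha, class_id, fixed_offset, noise_sigma):
    """Hedged sketch of gamma = alpha + T(s) + N(s): alpha is the angle to the
    chosen reference direction, T(s) a fixed per-class deviation (e.g. 0 deg
    for a wheel stop, 90 deg for a hydrant box), and N(s) Gaussian noise with
    mean 0 and a per-class standard deviation."""
    t_s = fixed_offset[class_id]
    n_s = random.gauss(0.0, noise_sigma[class_id])
    return alpha + t_s + n_s

# usage sketch with assumed per-class tables (values are illustrative only)
gamma = final_orientation_angle(alpha=30.0, class_id="wheel_stop",
                                fixed_offset={"wheel_stop": 0.0, "hydrant_box": 90.0},
                                noise_sigma={"wheel_stop": 5.0, "hydrant_box": 5.0})
```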
S306, determining a first pose of the target foreground object in the target background image based on the position and the orientation angle of the target foreground object in the target background image.
For example, the position and orientation angle of the target foreground object in the target background image may be taken as the first pose of the target foreground object in the target background image.
It should be noted that, when the foreground object behavior class of the target foreground object is stationary (i.e., the target foreground object is a stationary object), the following S307 to S308 may not be executed; in this case, the first pose of the target foreground object in the target background image may be determined as the final pose corresponding to the target foreground object. At this time, according to the number of foreground objects, the number of foreground samples and the total number of samples in the foreground requirement information, it is determined how many target semantic segmentation graphs are adopted for one target foreground object, and a corresponding number of first poses are generated.
It should be noted that, when the foreground object behavior category of the target foreground object is motion (that is, the target foreground object is a moving object), the first pose is an initial pose of the target foreground object in the target background image; S307 to S308 may be performed subsequently to determine the pose of the target foreground object during motion (hereinafter referred to as the second pose), that is, the pose of the target foreground object during motion in the background corresponding to the target background image. Correspondingly, the second pose is also a pose of the target foreground object in the target background image. At this time, according to the number of foreground objects, the number of foreground samples, the foreground object behavior category, the foreground object motion speed, the foreground object motion distance, and the total number of samples in the foreground requirement information, it is determined how many target semantic segmentation maps are adopted for one target foreground object, how many first poses are generated correspondingly, and how many second poses are generated.
That is, S307 to S308 are optional steps.
S307, when the target foreground object is determined to be a dynamic object, a track generation model corresponding to the motion type of the target foreground object is selected from a preset behavior asset library.
S308, generating a motion track corresponding to the target foreground object based on the first pose corresponding to the target foreground object and the track generation model, wherein the motion track comprises a plurality of second poses in the motion process of the target foreground object, and the second poses are poses of the target foreground object in the target background image.
For example, a preset behavior asset library may be pre-established; the preset behavior asset library may include assets for generating motion trajectories, and may include a plurality of preset trajectory generation models, where each preset trajectory generation model corresponds to a foreground object behavior category. For example, the preset behavior asset library may include a preset track generation model corresponding to uniform motion, a preset track generation model corresponding to acceleration motion, a preset track generation model corresponding to deceleration motion, and the like, which is not limited in this application.
For example, a track generation model corresponding to the target foreground object can be selected from a preset behavior asset library according to the behavior category of the foreground object in the foreground requirement information. Then, for a first pose corresponding to the target foreground object, inputting the first pose corresponding to the target foreground object as an initial pose, the moving speed of the foreground object and the moving distance of the foreground object into a track generation model, calculating by the track generation model, and outputting a moving track corresponding to the target foreground object; the motion trail comprises a plurality of second poses in the motion process of the target foreground object. Wherein the plurality of second poses and the first pose as an initial pose correspond to the same target background image.
Thus, the second pose of the target foreground object in the motion process can be accurately determined through the track generation model.
Then, the first pose corresponding to the target foreground object and the second pose determined based on the first pose can be used as the final pose corresponding to the target foreground object.
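As a non-limiting illustration of a track generation model for uniform motion, the following sketch emits second poses from a first pose, a motion speed, and a motion distance; the pose format (x, y, orientation in radians) and the time step are assumptions.

```python
import math

def uniform_motion_trajectory(first_pose, speed, distance, step_time=1.0):
    """Hedged sketch of a uniform-motion track generation model: starting
    from the first pose, emit second poses along the orientation direction
    until the motion distance is covered."""
    x, y, angle = first_pose
    step_len = speed * step_time          # distance covered per time step
    n_steps = int(distance / step_len)    # number of second poses to emit
    poses = []
    for s in range(1, n_steps + 1):
        d = s * step_len
        poses.append((x + d * math.cos(angle), y + d * math.sin(angle), angle))
    return poses

# usage sketch: 1 m/s over 5 m, starting at (0, 0) and facing the +x direction
second_poses = uniform_motion_trajectory((0.0, 0.0, 0.0), speed=1.0, distance=5.0)
```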
S309, determining a target foreground image based on the pose corresponding to the target foreground object, the preset foreground asset library and the foreground requirement information.
In a possible manner, when the preset foreground asset library includes the three-dimensional models of a plurality of preset foreground objects, the three-dimensional models of a plurality of target foreground objects matched with the foreground requirement information can be selected from the preset foreground asset library; the three-dimensional models of the plurality of target foreground objects are rendered based on the plurality of poses to obtain a plurality of target foreground images, where one pose and the three-dimensional model of one target foreground object correspond to one determined target foreground image. In this way, the pose of the target foreground object in the obtained target foreground image can be completely consistent with the pose determined based on the target semantic segmentation map.
For example, candidate three-dimensional models with the same number as the number of foreground samples can be selected from a preset foreground asset library according to foreground requirement information such as foreground object types, foreground object behavior categories and foreground object size coefficients. Then, for one target foreground object, a target three-dimensional model corresponding to the type of the target foreground object can be selected from the candidate three-dimensional models. Then, for a pose corresponding to the target foreground object, the target three-dimensional model can be rendered according to the pose, so as to obtain a target foreground image. When the target three-dimensional models are 1, a target foreground image can be obtained; and when the target three-dimensional model is multiple, obtaining multiple target foreground images.
In a possible manner, when the preset foreground asset library includes a plurality of preset foreground images and the preset poses of the preset foreground objects in each preset foreground image, first candidate foreground images with the same number as the number of foreground samples can be selected from the preset foreground asset library according to foreground requirement information such as the foreground object type, the foreground object behavior category and the foreground object size coefficient. Then, for a target foreground object, second candidate foreground images containing the target foreground object can be selected from the first candidate foreground images. Then, for one pose corresponding to the target foreground object, the difference between the preset pose of each second candidate foreground image and this pose can be determined, and the preset pose with the minimum pose difference can be found. Then, the second candidate foreground image whose preset pose has the minimum difference from this pose is determined as the target foreground image. In this case, for a target foreground image, the target foreground image may be translated and scaled based on the preset pose corresponding to the target foreground image and the determined pose, using the camera projection parameters and the perspective principle that nearer objects appear larger and farther objects appear smaller, so as to obtain a target foreground image closer to the shooting effect corresponding to the determined pose.
In a possible manner, when the preset foreground asset library includes the neural networks of a plurality of preset foreground objects, the neural networks of a plurality of target foreground objects matched with the foreground requirement information can be selected from the preset foreground asset library; a plurality of target foreground images are determined based on the plurality of poses and the neural networks of the plurality of target foreground objects, where one pose and the neural network of one target foreground object correspond to one determined target foreground image.
For example, candidate neural networks with the same number as the number of the foreground samples can be selected from a preset foreground asset library according to foreground requirement information such as the foreground object type, the foreground object behavior category and the foreground object size coefficient. Then, for one target foreground object, a target neural network corresponding to the type of the target foreground object can be selected from the candidate neural networks. Then, for a pose corresponding to the target foreground object, the target neural network can calculate based on the pose to determine the target foreground image. When the target neural networks are 1, a target foreground image can be obtained; when the target neural network is multiple, multiple target foreground images are obtained.
Illustratively, the synthesis of the target foreground image and the target background image in S204 to obtain the training image may be referred to as descriptions of S310 to S312 below. In S310 to S312, the combination of one target foreground image and a corresponding target background image is taken as an example.
S310, acquiring first depth information corresponding to a target foreground image and second depth information corresponding to a target background image.
When the preset foreground asset library is built by adopting a real vehicle acquisition mode, 3D point clouds of all preset foreground objects in the preset foreground images are acquired in the real vehicle acquisition process; furthermore, depth information of the preset foreground object can be generated based on the 3D point cloud of the preset foreground object acquired in the real vehicle acquisition process; and storing the depth information of the preset foreground object in a preset foreground asset library. Each preset foreground image corresponds to a group of depth information. Thus, the first depth information corresponding to the target foreground image can be obtained from the preset foreground asset library.
For example, when CG modeling, explicit three-dimensional modeling, or implicit three-dimensional modeling is adopted to build a preset foreground asset library, when rendering to obtain a target foreground image, first depth information corresponding to the target foreground image may be generated.
When the real vehicle collection mode is adopted to establish the preset background asset library, 3D point clouds of all preset background objects in the preset background images are collected in the real vehicle collection process; furthermore, depth information of the preset background object can be generated based on the 3D point cloud of the preset background object acquired in the real vehicle acquisition process; and storing the depth information of the preset background object in a preset background asset library. Wherein each preset background image corresponds to a set of depth information. In this way, the second depth information corresponding to the target background image can be obtained from the preset background asset library.
For example, when CG modeling or implicit three-dimensional modeling is adopted to build a preset background asset library, when rendering to obtain a target background image, second depth information corresponding to the target background image may be generated.
S311, fusing the target foreground image and the target background image based on the first depth information and the second depth information to obtain a fused image, wherein the pixel value of the pixel point in the target area of the fused image is the pixel value of the pixel point with the minimum depth information in the target area of the target foreground image and the target area of the target background image, and the target area is the area where the target foreground object in the target foreground image is located.
For example, the target foreground image and the target background image may be fused according to the first depth information and the second depth information to obtain a fused image. Illustratively, in the fusion process, for each target pixel point in the target area, the depth information of the target pixel point in the target foreground image is compared with the depth information of the target pixel point in the target background image, and the pixel value corresponding to the smaller depth information is used as the fused pixel value of the target pixel point, that is, the pixel value of the target pixel point of the target area in the fused image. The target area in the target foreground image is the area where the target foreground object is located, and the target area in the target background image is the area corresponding to the area where the target foreground object is located. Thus, by fusing the target foreground image and the target background image based on the first depth information and the second depth information, a farther object can be occluded by a nearer object.
For example, in the fusion process, for other areas except the target area, the pixel values of the pixels in the other areas in the target background image may be used as the fused pixel values of the pixels in the other areas, that is, the fused pixel values of the pixels in the other areas in the fused image.
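As a non-limiting illustration of the depth-based fusion in S311 (inside the target area, the pixel with the smaller depth is kept; outside it, the background pixel is kept), the following sketch shows one possible implementation on single-channel images; the array shapes and values are illustrative only.

```python
import numpy as np

def fuse_by_depth(fg_image, fg_depth, bg_image, bg_depth, fg_mask):
    """Hedged sketch of S311: inside the target area (where the foreground
    object is), keep the pixel whose depth is smaller (the nearer object);
    outside the target area, keep the background pixel."""
    fused = bg_image.copy()
    nearer_fg = fg_mask & (fg_depth < bg_depth)
    fused[nearer_fg] = fg_image[nearer_fg]
    return fused

# usage sketch on tiny single-channel images
bg_image = np.zeros((4, 4), dtype=np.uint8)
fg_image = np.full((4, 4), 200, dtype=np.uint8)
bg_depth = np.full((4, 4), 10.0)
fg_depth = np.full((4, 4), 5.0)          # foreground is nearer than the background
fg_mask = np.zeros((4, 4), dtype=bool)
fg_mask[1:3, 1:3] = True                  # target area: where the foreground object is
fused_image = fuse_by_depth(fg_image, fg_depth, bg_image, bg_depth, fg_mask)
```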
S312, processing the target area in the fusion image to obtain a training image.
For example, edge detection, semantic segmentation, or instance segmentation may also be performed on the target foreground image, to determine a foreground image mask (which may also be referred to as a foreground mask map) corresponding to the target foreground image. Then, the foreground image mask can be subjected to shielding processing according to the depth information of the pixel points of the target area in the fused image. For example, if the depth information of the pixel point of the target area in the fused image is the depth information of the pixel point corresponding to the target background image, the target pixel point of the target area (i.e., the area corresponding to the area where the target foreground object is located) in the foreground image mask corresponding to the target foreground image may be blocked, that is, the pixel value of the target pixel point in the foreground image mask corresponding to the target foreground image is set to 0. If the depth information of the pixel point of the target area in the fused image is the depth information of the pixel point corresponding to the target foreground image, the target pixel point in the foreground image mask corresponding to the target foreground image is not required to be shielded.
For example, a target area corresponding to a target foreground object in the fused image may be determined based on the blocked foreground image mask, and then the foreground object in the fused image may be processed (e.g., adjusted in color, brightness, etc.) according to the target area corresponding to the target foreground object in the fused image, so as to obtain a training image.
For example, the blocked foreground image mask and the fused image may be input to an image harmonization network, and the image harmonization network processes the target area in the fused image based on the blocked foreground image mask, so as to obtain the training image output by the image harmonization network. The image harmonization network is a deep neural network of an encoder-decoder structure; by intelligently adjusting the color of the foreground, the image harmonization network enables the foreground image to be fused with the background image more naturally.
In S310 to S312, when the number of foreground objects is 1, one target foreground image and one target background image are combined. When the number of foreground objects is h (h is an integer greater than 1), the h target foreground images correspond to the same target background image; in one possible way, h target foreground images may be combined first to obtain one target foreground image; and combining the target foreground image with a corresponding target background image to obtain a fusion image. The process of merging the h target foreground images may refer to the descriptions of S310 to S312, and will not be described herein. In one possible manner, the 1 st target foreground image and the corresponding target background image may be combined to obtain a first intermediate image; then, merging the 2 nd target foreground image with the first intermediate image to obtain a second intermediate image; and analogizing until the h target foreground image and the (h-1) intermediate image are combined, and obtaining a final fusion image.
The following is an example of the data augmentation process of S301 to S312:
assuming that the number of background samples is 30, there are 5 background categories, there are 3 foreground object types, the number of foreground samples is 20, and the number of foreground objects includes 1, 2 and 3 (i.e., part of the training images contain 1 target foreground object, part of the training images contain 2 target foreground objects, and part of the training images contain 3 target foreground objects); the total number of samples is 600, and the motion behaviors of the foreground objects are all stationary. It is further assumed that the training images containing the class 1 target foreground object contain 1 such object each, the training images containing the class 2 target foreground object contain 2 such objects each, and the training images containing the class 3 target foreground object contain 3 such objects each.
First, based on the preset background asset library, 30 target background images and the 30 corresponding target semantic segmentation maps can be obtained, where the 30 target background images correspond to the 5 background categories.
Then, 20 three-dimensional models corresponding to the 3 classes of target foreground objects can be selected from the preset foreground asset library; the class-1 target foreground object corresponds to 10 three-dimensional models, the class-2 target foreground object corresponds to 5 three-dimensional models, and the class-3 target foreground object corresponds to 5 three-dimensional models.
For the training images containing 1 class-1 target foreground object, 30 first poses may be generated based on the 30 target semantic segmentation maps. Since the class-1 target foreground object corresponds to 10 three-dimensional models, 300 target foreground images can be determined corresponding to the 30 first poses; every 10 of the 300 target foreground images correspond to 1 of the 30 target background images. Based on the assumption that these training images contain only 1 target foreground object, one target foreground image and one target background image are combined at a time, so that 300 training images can be obtained.
For the training images containing 2 class-2 target foreground objects: for the 1st class-2 target foreground object, 30 first poses can be generated based on the 30 target semantic segmentation maps; since the class-2 target foreground object corresponds to 5 three-dimensional models, 150 target foreground images A can be determined corresponding to the 30 first poses, and every 5 of the 150 target foreground images A correspond to 1 of the 30 target background images. For the 2nd class-2 target foreground object, 30 first poses may likewise be generated based on the 30 target semantic segmentation maps, and 150 target foreground images B may be determined corresponding to the 30 first poses; every 5 of the 150 target foreground images B correspond to 1 of the 30 target background images. Based on the assumption that these training images contain 2 target foreground objects, one target foreground image A, one target foreground image B, and one target background image are combined at a time, so that 150 training images can be obtained.
For the training images containing 3 class-3 target foreground objects: for the 1st class-3 target foreground object, 30 first poses can be generated based on the 30 target semantic segmentation maps; since the class-3 target foreground object corresponds to 5 three-dimensional models, 150 target foreground images A can be determined corresponding to the 30 first poses, and every 5 of the 150 target foreground images A correspond to 1 of the 30 target background images. For the 2nd class-3 target foreground object, 30 first poses can be generated based on the 30 target semantic segmentation maps, and 150 target foreground images B can be determined corresponding to the 30 first poses; every 5 of the 150 target foreground images B correspond to 1 of the 30 target background images. For the 3rd class-3 target foreground object, 30 first poses can be generated based on the 30 target semantic segmentation maps, and 150 target foreground images C can be determined corresponding to the 30 first poses; every 5 of the 150 target foreground images C correspond to 1 of the 30 target background images. Based on the assumption that these training images contain 3 target foreground objects, one target foreground image A, one target foreground image B, one target foreground image C, and one target background image are combined at a time, so that 150 training images can be obtained.
In total, 600 training images (300 + 150 + 150) can thus be obtained.
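The totals above can be verified with a short count; this is an illustrative check only, and the variable names are not from the original:

```python
# Sample-count check for the first example (values taken from the text above).
backgrounds = 30                # also the number of first poses per foreground class
class1 = backgrounds * 10       # 10 models -> 300 training images (1 object each)
class2 = backgrounds * 5        # 5 models  -> 150 images A paired with 150 images B -> 150 training images
class3 = backgrounds * 5        # 5 models  -> 150 A/B/C triples -> 150 training images
assert class1 + class2 + class3 == 600
```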
Now assume that the number of background samples is 30, the number of background categories is 5, the number of foreground object classes is 3, the number of foreground samples is 20, and the number of foreground objects includes 1, 2, and 3 (i.e., some training images contain 1 target foreground object, some contain 2 target foreground objects, and some contain 3 target foreground objects); the total number of samples is 1000, the motion behaviors of the class-2 and class-3 foreground objects are static, and the motion behavior of the class-1 foreground object is accelerated motion. Further assume that the training images containing 1 target foreground object contain a class-1 target foreground object, the training images containing 2 target foreground objects contain class-2 target foreground objects, and the training images containing 3 target foreground objects contain class-3 target foreground objects.
Illustratively, for the class-2 and class-3 foreground objects, 300 training images (150 + 150) may be obtained in the manner described above.
For 1 class-1 target foreground object, 1 first pose may be generated based on 1 target semantic segmentation map. Then, a motion track is generated based on the first pose and the track generation model corresponding to accelerated motion; the motion track may include 4 second poses, so the 1 first pose and the 4 second poses give 5 poses. Further, based on the 30 target semantic segmentation maps, 30 first poses can be generated, and each first pose determines 4 second poses; thus, 150 poses can be determined for 1 class-1 target foreground object.
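As an illustrative sketch only, a track generation model for accelerated motion might derive the 4 second poses from the first pose as follows; the constant-acceleration straight-line kinematics, the pose format, and all parameter values are assumptions, not the model actually used:

```python
import numpy as np

def accelerated_motion_track(first_pose, v0=1.0, a=0.5, dt=0.5, steps=4):
    """Generate `steps` second poses from one first pose under constant acceleration.

    first_pose: (x, y, heading) -- position in the background image plane and
                orientation angle in radians (illustrative pose format).
    """
    x, y, heading = first_pose
    direction = np.array([np.cos(heading), np.sin(heading)])
    second_poses = []
    for k in range(1, steps + 1):
        t = k * dt
        s = v0 * t + 0.5 * a * t * t              # displacement along the heading
        px, py = np.array([x, y]) + s * direction
        second_poses.append((px, py, heading))
    return second_poses                            # 4 second poses; with the first pose: 5 poses
```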
Since the class-1 target foreground object corresponds to 10 three-dimensional models and the total number of samples is 1000, in one possible manner, for each of 100 of the 150 poses, 5 of the 10 three-dimensional models are rendered, so that 500 target foreground images can be obtained; then, for each of the remaining 50 poses, 4 of the other 5 three-dimensional models are rendered, so that 200 target foreground images can be obtained. In total, 700 target foreground images are obtained, and the 700 target foreground images are then combined with the corresponding target background images respectively, so that 700 training images can be obtained.
Thus, 1000 training images (700 + 300) can be obtained in total.
For example, after the training image is obtained according to the embodiments of fig. 2a and fig. 3a, the training image may be used to train a preset algorithm (such as an autopilot algorithm), and then the target parameters (including the target foreground parameters and the target background parameters) are iteratively updated according to the evaluation index of the trained preset algorithm, so that the algorithm index of the preset algorithm can be improved, and the generalization and the robustness of the preset algorithm can be further enhanced.
Fig. 4a is a schematic diagram of an exemplary data augmentation process.
S401, acquiring requirement information for data augmentation, where the requirement information includes target parameters and other parameters; the target parameters include the target foreground parameters contained in the foreground requirement information and the target background parameters contained in the background requirement information, and the other parameters may include the other foreground parameters contained in the foreground requirement information and the other background parameters contained in the background requirement information.
The description of the requirement information may refer to S301, which is not described herein.
For example, when the target background parameters and the target foreground parameters are customized by the user, the user-customized target background parameters and target foreground parameters are acquired. If the target background parameters and the target foreground parameters are not set by the user, the 1st group of target background parameters and the 1st group of target foreground parameters preconfigured by the device are acquired.
S402, acquiring a target background image and a target semantic segmentation map corresponding to the target background image based on a preset background asset library and background demand information.
S403, determining the pose corresponding to the target foreground object based on the target semantic segmentation map, where the pose is the pose of the target foreground object in the target background image.
S404, determining the target foreground image based on the pose corresponding to the target foreground object, a preset foreground asset library, and the foreground requirement information.
S405, synthesizing the target foreground image and the target background image to obtain a training image.
For example, S402 to S405 may refer to the description of the above embodiments, and are not described herein.
S406, training a preset algorithm by adopting training images.
S407, testing the trained preset algorithm to obtain an evaluation index.
S408, based on the evaluation index, iteratively updating the target foreground parameter and the target background parameter.
Illustratively, after S402-S405 are performed based on the 1 st set of target background parameters and the 1 st set of target foreground parameters, and other parameters, a 1 st set of training images may be obtained. The preset algorithm may then be trained using the training images of group 1. After training the preset algorithm by adopting the 1 st group training image, testing the trained preset algorithm to obtain an evaluation index; thus, the 1 st group evaluation index can be obtained.
For example, when the preset algorithm is a detection algorithm, the evaluation index may be AP (Average Precision). When the preset algorithm is a classification algorithm, the evaluation index may be the accuracy rate.
Next, the 2nd group of target background parameters and the 2nd group of target foreground parameters preconfigured by the device are acquired. Then, after S402 to S405 are performed based on the other parameters, the 2nd group of target foreground parameters, and the 2nd group of target background parameters, a 2nd group of training images may be obtained. The preset algorithm is then trained with the 2nd group of training images. After the preset algorithm is trained with the 2nd group of training images, the trained preset algorithm is tested to obtain an evaluation index; thus, the 2nd group evaluation index can be obtained. By analogy, n (n is a positive integer) groups of evaluation indexes can be obtained.
And then determining new target foreground parameters and new target background parameters according to the n groups of evaluation indexes, the n groups of target foreground parameters and the n groups of target background parameters.
Illustratively, a proxy function, typically a Gaussian process, is fitted using the n groups of target foreground parameters, the n groups of target background parameters, and the n groups of evaluation indexes; after fitting, the probability P(y|x, D) can be obtained, where D denotes the n data pairs (each data pair includes 1 group of target foreground parameters, 1 group of target background parameters, and 1 group of evaluation indexes), y denotes the evaluation index, and x denotes the target foreground parameters and target background parameters. Next, an acquisition function S, such as UCB (upper confidence bound), may be selected, and new target foreground parameters and new target background parameters are searched for through the acquisition function. For example, the target parameters (target foreground parameters and target background parameters) that maximize the value of the acquisition function S may be selected, with reference to the following formula:

x_new = argmax_x S(x | D)

where x_new denotes the target foreground parameters and target background parameters with the largest acquisition function value S, that is, the new target foreground parameters and new target background parameters found by the search (not belonging to the n groups of target foreground parameters and n groups of target background parameters already selected), and x is the random variable over the target foreground parameters and target background parameters.
Then, the new target foreground parameters are taken as the (n+1)-th group of target foreground parameters, and the new target background parameters are taken as the (n+1)-th group of target background parameters; S402 to S405 are then performed based on the other parameters, the (n+1)-th group of target foreground parameters, and the (n+1)-th group of target background parameters, so as to obtain the (n+1)-th group of training images. Then, the proxy function is fitted with the n+1 groups of target foreground parameters, n+1 groups of target background parameters, and n+1 groups of evaluation indexes, and new target foreground parameters and new target background parameters are again searched for through the acquisition function, giving the (n+2)-th group of target foreground parameters and the (n+2)-th group of target background parameters. By analogy, the loop continues to iteratively update the target foreground parameters and the target background parameters until a preset number of loops is reached or the index of the preset algorithm reaches a preset value. The preset number of loops and the preset value can be set as required and are not described herein.
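A minimal sketch of one such parameter-search step is shown below, assuming a Gaussian-process proxy from scikit-learn and a finite candidate set for the UCB search; the candidate-set search and the kappa value are simplifying assumptions, not the exact procedure:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def propose_next_params(observed_params, observed_scores, candidates, kappa=2.0):
    """One Bayesian-optimization step over the target foreground/background parameters.

    observed_params: (n, d) array, each row one group of target parameters.
    observed_scores: (n,) array of evaluation indexes (e.g., AP).
    candidates:      (m, d) array of candidate parameter groups to score.
    Returns the candidate maximizing the UCB acquisition S(x) = mu(x) + kappa * sigma(x).
    """
    gp = GaussianProcessRegressor()            # proxy function (Gaussian process)
    gp.fit(observed_params, observed_scores)
    mu, sigma = gp.predict(candidates, return_std=True)
    ucb = mu + kappa * sigma                   # acquisition function S
    return candidates[np.argmax(ucb)]          # new, (n+1)-th group of target parameters
```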
It should be noted that, the data augmentation method of the present application may be implemented in the cloud, and specifically may be executed by a cloud service deployed in the cloud.
Fig. 4b is a schematic diagram of an exemplary data augmentation framework, which corresponds to the data augmentation process of fig. 4a.
Referring to fig. 4b, the cloud end deploys cloud services, preset algorithm image files, asset libraries and demand information libraries.
Illustratively, the cloud service is used to perform the steps described above in the embodiments of fig. 2a, 3a and 4 a.
For example, a preset algorithm (such as an automatic driving algorithm or an image perception algorithm) in the first terminal device (such as a vehicle or a sweeping robot) and the running environment of the preset algorithm may be packaged in advance into an image (i.e., a container image), obtaining the preset algorithm image file; the preset algorithm image file is then uploaded to the cloud.
By way of example, the asset library may include a preset background asset library, a preset foreground asset library, and a preset behavior asset library.
For example, the demand information base may include target foreground parameters and target background parameters.
In connection with the embodiment of fig. 4a, after the user inputs requirement information (which may include the other foreground parameters and other background parameters described above) on the second terminal device (such as a mobile phone, a notebook computer, or a tablet computer), the cloud service may acquire that requirement information (including the other foreground parameters and other background parameters) from the second terminal device, and may acquire requirement information (including the target foreground parameters and target background parameters) from the requirement information base. The cloud service then executes S401 to S408: generating training images based on the requirement information, training the preset algorithm with the training images, and updating the target foreground parameters and target background parameters based on the index of the trained preset algorithm. This loop is repeated to realize data augmentation and training of the preset algorithm, thereby improving the index, generalization, and robustness of the preset algorithm.
The present application further provides a data augmentation device, which may be used to perform the method of the foregoing embodiment, so that the advantages achieved by the data augmentation device may refer to the advantages of the corresponding method provided above, and will not be described herein.
Fig. 5 is a schematic diagram of a data augmentation device shown in an exemplary manner. The data augmentation device comprises:
a first image obtaining module 501, configured to obtain a target background image and a target semantic segmentation map corresponding to the target background image;
the pose acquisition module 502 is configured to determine, based on the target semantic segmentation map, a pose corresponding to a target foreground object, where the pose is a pose of the target foreground object in a target background image;
a second image obtaining module 503, configured to obtain a target foreground image based on a pose corresponding to the target foreground object, where the target foreground image includes the target foreground object;
a synthesis module 504, configured to synthesize the target foreground image and the target background image to obtain a training image.
The pose includes a first pose, and the pose acquisition module 502 is specifically configured to determine probability information based on the target semantic segmentation map, where the probability information includes the probability that the target foreground object appears at each pixel point in the target semantic segmentation map; and determine the first pose based on the probability information.
The pose obtaining module 502 is specifically configured to input the target semantic segmentation map and the category identifier of the target foreground object to the location generating network, and perform location prediction by the location generating network to obtain probability information.
The pose obtaining module 502 is specifically configured to determine, based on the probability information, a position of the target foreground object in the target background image; determining an orientation angle of the target foreground object in the target background image based on the position of a preset reference semantic in the target semantic segmentation map and the position of the target foreground object in the target background image; the first pose is determined based on the position and orientation angle of the target foreground object in the target background image.
The pose obtaining module 502 is specifically configured to generate a probability heat map based on the probability information, and sample the probability heat map according to a preset constraint condition to determine the position of the target foreground object in the target background image.
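For illustration, sampling a position from the probability heat map under a preset constraint could look like the sketch below; the constraint mask and all names are assumptions introduced for the example:

```python
import numpy as np

def sample_position(prob_map, constraint_mask, rng=np.random.default_rng()):
    """Sample one pixel position from the probability heat map under a constraint.

    prob_map:        (H, W) probability of the target foreground object at each pixel.
    constraint_mask: (H, W) binary mask of positions allowed by the preset constraint
                     (e.g., a minimum distance requirement); illustrative assumption.
    """
    weights = prob_map * constraint_mask
    weights = weights / weights.sum()                 # renormalize over allowed pixels
    flat_index = rng.choice(weights.size, p=weights.ravel())
    row, col = np.unravel_index(flat_index, weights.shape)
    return row, col                                   # position of the target foreground object
```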
The device also comprises a training module, wherein the training module is used for acquiring a semantic segmentation graph, and the semantic segmentation graph comprises the semantics of the foreground object and the semantics of the background object; shielding semantics of a foreground object in the semantic segmentation map to obtain a first image; binarizing the semantic segmentation map to obtain a second image, wherein the pixel value of the pixel point corresponding to the foreground object in the second image is different from the pixel value of the pixel point corresponding to the background object; the location generation network is trained based on the category identification of the foreground object, the first image, and the second image.
The category identifications of the foreground objects are plural, and the location generation network corresponds to a plurality of groups of weight parameters; the training module is specifically configured to, for a first foreground object of the plurality of foreground objects: train the location generation network configured with a group of weight parameters based on the category identification of the first foreground object, the first image, and the second image.
Illustratively, the location generation network includes a Gaussian kernel function and a sigmoid function; the Gaussian kernel function corresponds to the plurality of groups of weight parameters.
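A toy sketch of such a structure is shown below — per-class Gaussian kernels followed by a sigmoid — purely to illustrate how one group of weight parameters per category identifier might be organized; the feature extraction, shapes, and kernel count are assumptions, not the patent's network:

```python
import torch
import torch.nn as nn

class LocationGenerationHead(nn.Module):
    """Illustrative location-prediction head: per-class Gaussian kernels + sigmoid."""

    def __init__(self, num_classes, feat_dim, num_kernels=8):
        super().__init__()
        # One group of weight parameters per category identifier.
        self.centers = nn.Parameter(torch.randn(num_classes, num_kernels, feat_dim))
        self.bandwidth = nn.Parameter(torch.ones(num_classes, num_kernels))
        self.mix = nn.Parameter(torch.ones(num_classes, num_kernels) / num_kernels)

    def forward(self, pixel_features, class_id):
        # pixel_features: (H*W, feat_dim) features of the target semantic segmentation map.
        c, b, w = self.centers[class_id], self.bandwidth[class_id], self.mix[class_id]
        d2 = ((pixel_features[:, None, :] - c[None, :, :]) ** 2).sum(-1)   # (H*W, K)
        k = torch.exp(-d2 / (2 * b[None, :] ** 2))                         # Gaussian kernels
        return torch.sigmoid((k * w[None, :]).sum(-1))                     # probability per pixel
```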
The pose obtaining module 502 is specifically configured to input the target semantic segmentation map and the category identifier of the target foreground object to the location generation network; the location generation network selects and configures the group of weight parameters corresponding to the category identifier of the target foreground object, and then performs location prediction to obtain the probability information.
The pose further includes a second pose, and the pose obtaining module 502 is further configured to: when the target foreground object is a dynamic object, select a track generation model corresponding to the motion type of the target foreground object from a preset behavior asset library; and generate a motion track corresponding to the target foreground object based on the first pose and the track generation model, where the motion track includes a plurality of second poses in the motion process of the target foreground object, each second pose is a pose of the target foreground object in the target background image, and there is one first pose.
Illustratively, the apparatus further comprises: the demand acquisition module is used for acquiring demand information of data augmentation, wherein the demand information comprises foreground demand information and background demand information;
the first image obtaining module 501 is specifically configured to obtain a target background image and a target semantic segmentation map based on a preset background asset library and background requirement information;
the second image obtaining module 503 is specifically configured to determine a target foreground image based on the pose corresponding to the target foreground object, the preset foreground asset library and the foreground requirement information.
The background asset library comprises a plurality of preset semantic segmentation graphs and a plurality of preset background images which are collected in advance, wherein the preset semantic segmentation graphs correspond to the preset background images one by one;
the first image obtaining module 501 is specifically configured to select, from a plurality of preset background images, a target background image that matches the background requirement information; and selecting a preset semantic segmentation map corresponding to the target background image from the plurality of preset semantic segmentation maps as a target semantic segmentation map.
The second image obtaining module 503 is specifically configured to select, from the preset foreground asset library, a three-dimensional model of a target foreground object that matches the foreground requirement information; and rendering the three-dimensional model of the target foreground object based on the pose corresponding to the target foreground object to obtain a target foreground image.
The synthesis module 504 is specifically configured to obtain first depth information corresponding to the target foreground image and second depth information corresponding to the target background image; based on the first depth information and the second depth information, fusing the target foreground image and the target background image to obtain a fused image, wherein the pixel value of a pixel point in a target area of the fused image is the pixel value of a pixel point with minimum depth information in the target area of the target foreground image and the target area of the target background image, and the target area is an area where a target foreground object in the target foreground image is located; and processing the target area in the fusion image to obtain a training image.
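A minimal sketch of this minimum-depth fusion is given below, assuming NumPy arrays and a binary mask of the target area; the array names, shapes, and the returned fused-depth map are illustrative:

```python
import numpy as np

def fuse_by_depth(bg_image, bg_depth, fg_image, fg_depth, fg_mask):
    """Per-pixel fusion inside the target area: keep the pixel with the smaller depth.

    Outside the area where the target foreground object is located (fg_mask == 0),
    the background pixel is kept. Images are (H, W, 3); depths and mask are (H, W).
    """
    fused = bg_image.copy()
    fused_depth = bg_depth.copy()
    target = (fg_mask > 0) & (fg_depth < bg_depth)   # foreground is closer -> visible
    fused[target] = fg_image[target]
    fused_depth[target] = fg_depth[target]
    return fused, fused_depth
```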
Illustratively, the synthesizing module 504 is specifically configured to obtain a foreground image mask corresponding to the target foreground image; according to the depth information of the pixel points of the target area in the fusion image, carrying out shielding treatment on the foreground image mask; and processing the target area in the fused image based on the blocked foreground image mask to obtain a training image.
Illustratively, the foreground demand information includes a target foreground parameter and the background demand information includes a target background parameter; the device further includes an updating module, configured to train a preset algorithm based on the training image; test the trained preset algorithm to obtain an evaluation index; and iteratively update the target foreground parameter and the target background parameter based on the evaluation index.
The updating module is specifically configured to iteratively update the target foreground parameter and the target background parameter based on the evaluation index and the bayesian optimization algorithm.
Illustratively, the target foreground parameters include at least one of: the foreground sample number, the foreground object motion speed, the foreground object motion distance, the foreground object motion track noise, the foreground object size coefficient and the maximum foreground shielding proportion; the target background parameters include at least one of: the number of background obstructions, the number of background samples, background weather conditions, and background lighting conditions.
In one example, fig. 6 shows a schematic block diagram of an apparatus 600 according to an embodiment of the present application. The apparatus 600 may include: a processor 601 and a transceiver/transceiving pin 602, and optionally a memory 603.
The various components of device 600 are coupled together by bus 604, where bus 604 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are referred to in the figures as bus 604.
Optionally, the memory 603 may be used for storing instructions in the foregoing method embodiments. The processor 601 is operable to execute instructions in the memory 603 and control the receive pin to receive signals and the transmit pin to transmit signals.
The apparatus 600 may be an electronic device or a chip of an electronic device in the above-described method embodiments.
All relevant contents of each step related to the above method embodiment may be cited to the functional description of the corresponding functional module, which is not described herein.
The present embodiment also provides a computer-readable storage medium having stored therein computer instructions which, when executed on an electronic device, cause the electronic device to perform the above-described related method steps to implement the data augmentation method in the above-described embodiments.
The present embodiment also provides a computer program product which, when run on a computer, causes the computer to perform the above-described related steps to implement the data augmentation method of the above-described embodiments.
In addition, embodiments of the present application also provide an apparatus, which may be specifically a chip, a component, or a module, and may include a processor and a memory connected to each other; the memory is configured to store computer-executable instructions, and when the device is running, the processor may execute the computer-executable instructions stored in the memory, so that the chip performs the data augmentation method in the above-described method embodiments.
The electronic device, the computer readable storage medium, the computer program product or the chip provided in this embodiment are used to execute the corresponding method provided above, so that the beneficial effects thereof can be referred to the beneficial effects in the corresponding method provided above, and will not be described herein.
It will be appreciated by those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and the parts shown as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
Any embodiment of the present application and any features in the same embodiment may be freely combined. Any combination of the above is within the scope of the present application.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be essentially or a part contributing to the prior art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The steps of a method or algorithm described in connection with the disclosure of the embodiments disclosed herein may be embodied in hardware, or may be embodied in software instructions executed by a processor. The software instructions may be comprised of corresponding software modules that may be stored in random access memory (Random Access Memory, RAM), flash memory, read only memory (ROM), erasable programmable read only memory (Erasable Programmable ROM), electrically erasable programmable read only memory (EEPROM), registers, a hard disk, a removable disk, a compact disc read only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC.
Those skilled in the art will appreciate that in one or more of the examples described above, the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer-readable storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (23)

1. A method of data augmentation, the method comprising:
Acquiring a target background image and a target semantic segmentation map corresponding to the target background image;
determining a pose corresponding to a target foreground object based on the target semantic segmentation map, wherein the pose is the pose of the target foreground object in the target background image;
acquiring a target foreground image based on the pose corresponding to the target foreground object, wherein the target foreground image comprises the target foreground object;
and synthesizing the target foreground image and the target background image to obtain a training image.
2. The method of claim 1, wherein the pose comprises a first pose, wherein the determining the pose corresponding to the target foreground object based on the target semantic segmentation map comprises:
determining probability information based on the target semantic segmentation map, wherein the probability information comprises the probability that the target foreground object appears at each pixel point in the target semantic segmentation map;
and determining the first pose based on the probability information.
3. The method of claim 2, wherein the determining probability information based on the target semantic segmentation map comprises:
and inputting the target semantic segmentation graph and the category identification of the target foreground object to a position generation network, and carrying out position prediction by the position generation network to obtain the probability information.
4. A method according to claim 2 or 3, wherein said determining said first pose based on said probability information comprises:
determining the position of the target foreground object in the target background image based on the probability information;
determining an orientation angle of the target foreground object in the target background image based on the position of a preset reference semantic in the target semantic segmentation map and the position of the target foreground object in the target background image;
the first pose is determined based on a position and an orientation angle of the target foreground object in the target background image.
5. The method of claim 4, wherein determining the location of the target foreground object in the target background image based on the probability information comprises:
generating a probability heat map based on the probability information;
and sampling the probability heat map according to a preset constraint condition to determine the position of the target foreground object in the target background image.
6. A method according to claim 3, further comprising training the location generation network:
Acquiring a semantic segmentation map, wherein the semantic segmentation map comprises the semantics of a foreground object and the semantics of a background object;
shielding the semantics of the foreground object in the semantic segmentation map to obtain a first image;
binarizing the semantic segmentation map to obtain a second image, wherein the pixel value of the pixel point corresponding to the foreground object in the second image is different from the pixel value of the pixel point corresponding to the background object;
training the location generation network based on the category identification of the foreground object, the first image, and the second image.
7. The method of claim 6, wherein there are a plurality of category identifications of foreground objects, and the location generation network corresponds to a plurality of sets of weight parameters;
the training the location generation network based on the category identification of the foreground object, the first image, and the second image includes:
for a first foreground object of the plurality of foreground objects:
training the location generation network configured with a set of weight parameters based on the class identification of the first foreground object, the first image, and the second image.
8. The method according to claim 6 or 7, wherein,
the location generation network comprises a Gaussian kernel function and a sigmoid function;
the Gaussian kernel function corresponds to the plurality of sets of weight parameters.
9. The method according to claim 7 or 8, wherein said inputting the target semantic segmentation map and the category identification of the target foreground object to a location generation network, and performing location prediction by the location generation network to obtain the probability information, comprises:
inputting the target semantic segmentation map and the category identification of the target foreground object to the location generation network, selecting and configuring, by the location generation network, a set of weight parameters corresponding to the category identification of the target foreground object, and then performing location prediction to obtain the probability information.
10. The method according to any one of claims 2 to 9, wherein the pose further comprises a second pose, the determining a pose corresponding to a target foreground object based on the target semantic segmentation map further comprising:
when the target foreground object is a dynamic object, selecting a track generation model corresponding to the motion type of the target foreground object from a preset behavior asset library;
and generating a motion track corresponding to the target foreground object based on the first pose and the track generation model, wherein the motion track comprises a plurality of second poses in the motion process of the target foreground object, each second pose is a pose of the target foreground object in the target background image, and there is one first pose.
11. The method according to any one of claims 1 to 10, further comprising:
acquiring requirement information of data augmentation, wherein the requirement information comprises foreground requirement information and background requirement information;
the obtaining the target background image and the target semantic segmentation map corresponding to the target background image comprises the following steps:
acquiring the target background image and the target semantic segmentation map based on a preset background asset library and the background demand information;
the determining the target foreground image based on the pose corresponding to the target foreground object comprises the following steps:
and determining the target foreground image based on the pose corresponding to the target foreground object, a preset foreground asset library and the foreground requirement information.
12. The method of claim 11, wherein the background asset library comprises a plurality of preset semantic segmentation maps and a plurality of preset background images pre-acquired, the plurality of preset semantic segmentation maps being in one-to-one correspondence with the plurality of preset background images;
the obtaining the target background image and the target semantic segmentation map based on a preset background asset library and the background demand information comprises the following steps:
Selecting a target background image matched with the background demand information from the preset background images;
and selecting a preset semantic segmentation map corresponding to the target background image from the plurality of preset semantic segmentation maps as the target semantic segmentation map.
13. The method of claim 11 or 12, wherein the library of preset foreground assets comprises three-dimensional models of a plurality of preset foreground objects,
the determining the target foreground image based on the pose corresponding to the target foreground object, a preset foreground asset library and the foreground requirement information comprises the following steps:
selecting a three-dimensional model of a target foreground object matched with the foreground demand information from the preset foreground asset library;
and rendering the three-dimensional model of the target foreground object based on the pose corresponding to the target foreground object to obtain the target foreground image.
14. The method according to any one of claims 1 to 13, wherein the synthesizing the target foreground image and the target background image to obtain a training image comprises:
acquiring first depth information corresponding to the target foreground image and second depth information corresponding to the target background image;
Based on the first depth information and the second depth information, fusing the target foreground image and the target background image to obtain a fused image, wherein the pixel value of a pixel point in a target area of the fused image is the pixel value of a pixel point with minimum depth information in the target area of the target foreground image and the target area of the target background image, and the target area is an area where the target foreground object in the target foreground image is located;
and processing the target area in the fusion image to obtain the training image.
15. The method of claim 14, wherein processing the target region in the fused image to obtain the training image comprises:
acquiring a foreground image mask corresponding to the target foreground image;
according to the depth information of the pixel points of the target area in the fusion image, carrying out shielding treatment on the foreground image mask;
and processing a target area in the fused image based on the blocked foreground image mask to obtain the training image.
16. The method according to any one of claims 11 to 13, wherein the foreground demand information comprises a target foreground parameter and the background demand information comprises a target background parameter; the method further comprises the steps of:
Training a preset algorithm based on the training image;
testing the trained preset algorithm to obtain an evaluation index;
and iteratively updating the target foreground parameter and the target background parameter based on the evaluation index.
17. The method of claim 16, wherein iteratively updating the target foreground parameter and the target background parameter based on the evaluation index comprises:
and carrying out iterative updating on the target foreground parameter and the target background parameter based on the evaluation index and a Bayesian optimization algorithm.
18. The method according to claim 16 or 17, wherein,
the target foreground parameters include at least one of: the foreground sample number, the foreground object motion speed, the foreground object motion distance, the foreground object motion track noise, the foreground object size coefficient and the maximum foreground shielding proportion;
the target background parameters include at least one of: the number of background obstructions, the number of background samples, background weather conditions, and background lighting conditions.
19. A data augmentation apparatus, the apparatus comprising:
The first image acquisition module is used for acquiring a target background image and a target semantic segmentation map corresponding to the target background image;
the pose acquisition module is used for determining the pose corresponding to the target foreground object based on the target semantic segmentation map, wherein the pose is the pose of the target foreground object in the target background image;
the second image acquisition module is used for acquiring a target foreground image based on the pose corresponding to the target foreground object, wherein the target foreground image comprises the target foreground object;
and the synthesis module is used for synthesizing the target foreground image and the target background image to obtain a training image.
20. An electronic device, comprising:
a memory and a processor, the memory coupled with the processor;
the memory includes program instructions that, when executed by the processor, cause the electronic device to perform the data augmentation method of any of claims 1 to 18.
21. A chip comprising one or more interface circuits and one or more processors; the interface circuit is configured to receive a signal from a memory of an electronic device and to send the signal to the processor, the signal including computer instructions stored in the memory; the computer instructions, when executed by the processor, cause the electronic device to perform the data augmentation method of any one of claims 1 to 18.
22. A computer readable storage medium, characterized in that the computer readable storage medium comprises a computer program which, when run on a computer or a processor, causes the computer or the processor to perform the data augmentation method of any one of claims 1 to 18.
23. A computer program product comprising a software program which, when executed by a computer or processor, causes the steps of the method of any one of claims 1 to 18 to be performed.