CN114708437A - Training method of target detection model, target detection method, device and medium

Training method of target detection model, target detection method, device and medium

Info

Publication number
CN114708437A
Authority
CN
China
Prior art keywords
image
feature extraction
feature
enhancement
extraction unit
Prior art date
Legal status
Granted
Application number
CN202210618552.3A
Other languages
Chinese (zh)
Other versions
CN114708437B (en)
Inventor
陈志轩
杨敏
杨作兴
艾国
Current Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Original Assignee
Shenzhen MicroBT Electronics Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen MicroBT Electronics Technology Co Ltd
Priority to CN202210618552.3A
Publication of CN114708437A
Application granted
Publication of CN114708437B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a training method of a target detection model, a target detection method, a device and a medium. The training method specifically includes the following steps: extracting a target region from an image; determining a training image according to the target region, where the training image includes an original image and at least one enhancement map, the enhancement map being an image obtained by enhancing the original image, and the enhancement processing including position processing and size processing; performing first feature extraction on the original image to obtain a first feature; performing second feature extraction on the enhancement map to obtain a second feature; and determining error information according to the matching degree between the first feature and the second feature, and updating a first parameter of the first feature extraction unit according to the error information. The embodiment of the application can save the labeling cost of the training image, save the computation cost, increase the computation speed, and improve the generalization capability of the first feature extraction unit.

Description

Training method of target detection model, target detection method, device and medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a training method for a target detection model, a target detection method, an apparatus, and a medium.
Background
Pedestrian detection is an important research direction in the field of intelligent video monitoring. It is a computer vision technology based on machine learning, which analyzes and detects pedestrians, vehicles and other moving objects in a scene to complete tasks such as people counting and pedestrian tracking.
In the existing pedestrian detection method, the feature representation of an image to be detected is usually extracted by a pedestrian detection model, whether the image to be detected contains a pedestrian or not is detected according to the feature representation, and if yes, the position information of the pedestrian can be given.
In practical applications, the pedestrian detection model is usually trained from tagged image data, and tagging of the tagged image data usually consumes a lot of labor cost and time cost. In particular, in the case where a change in the detection scene occurs, new tagged image data needs to be prepared, which further increases the labor cost and time cost.
Disclosure of Invention
The embodiment of the application provides a training method of a target detection model, which can save the labeling cost of a training image, save the computation cost, increase the computation speed and improve the generalization capability of a first feature extraction unit.
Correspondingly, the embodiment of the application also provides a target detection method, a training device of a target detection model, a target detection device, electronic equipment and a machine readable medium, so as to ensure the realization and application of the method.
In order to solve the above problem, an embodiment of the present application discloses a training method for a target detection model, where the target detection model includes: a first feature extraction unit, the method comprising:
extracting a target region from the image;
determining a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing includes: position processing and size processing;
performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature;
performing second feature extraction on the enhancement map to obtain second features;
and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
Optionally, the performing of the second feature extraction on the enhancement map includes:
performing feature extraction on the enhancement map by using a second feature extraction unit; the second feature extraction unit and the first feature extraction unit have the same neural network structure;
and performing full connection operation on the feature representation output by the second feature extraction unit by using a seventh multilayer perceptron and an eighth multilayer perceptron to obtain a second feature.
Optionally, the method further comprises:
updating the second parameter of the second feature extraction unit according to the updated first parameter;
and updating parameters of the seventh multilayer perceptron and the eighth multilayer perceptron according to the error information.
Optionally, the enhancement map comprises: a first enhancement map and a second enhancement map;
the determining error information according to the matching degree between the first feature and the second feature includes: and determining error information according to a first matching degree between the original image and the first enhanced image and a second matching degree between the original image and the second enhanced image.
Optionally, the determining a training image according to the target region includes:
cutting an original image containing a target area from the image; the first size of the original image is larger than that of the target area;
presetting an original image to obtain a middle image;
randomly cropping a first image having a second size from the intermediate image and enlarging the first image into a first enhancement image having a first size;
the second image having the second size is cropped from the middle map in accordance with the center of the middle map, and the second image is enlarged to a second enhancement map having the first size.
Optionally, when the error information meets a preset condition, the value of the first parameter is a first target parameter value;
the method further comprises the following steps:
and carrying out migration training on the target detection model according to the first target parameter value and the labeled image data.
In order to solve the above problem, an embodiment of the present application discloses a target detection method, including:
receiving an image to be detected;
carrying out target detection on the image to be detected by using a target detection model to obtain a corresponding detection result;
wherein the target detection model comprises: a first feature extraction unit; the training process of the target detection model comprises the following steps: extracting a target region from the image; determining a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing includes: position processing and size processing; performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature; performing second feature extraction on the enhancement map to obtain second features; and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
In order to solve the above problem, an embodiment of the present application discloses a training apparatus for a target detection model, where the target detection model includes: a first feature extraction unit, the apparatus comprising:
the region extraction module is used for extracting a target region from the image;
the training image determining module is used for determining a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing includes: position processing and size processing;
the first feature extraction module is used for performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature;
the second feature extraction module is used for performing second feature extraction on the enhancement map to obtain second features;
the error determining module is used for determining error information according to the matching degree between the first characteristic and the second characteristic;
and the first parameter updating module is used for updating the first parameter of the first feature extraction unit according to the error information.
In order to solve the above problem, an embodiment of the present application discloses an object detection apparatus, including:
the receiving module is used for receiving an image to be detected;
the target detection module is used for carrying out target detection on the image to be detected by utilizing a target detection model so as to obtain a corresponding detection result;
wherein the object detection model comprises: a first feature extraction unit; the training process of the target detection model comprises the following steps: extracting a target area from the image; determining a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing includes: position processing and size processing; performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature; performing second feature extraction on the enhancement map to obtain second features; and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
The embodiment of the application also discloses an electronic device, which comprises: a processor; and a memory having executable code stored thereon that, when executed, causes the processor to perform a method as described in embodiments of the present application.
The embodiment of the application also discloses a machine-readable medium, wherein executable codes are stored on the machine-readable medium, and when the executable codes are executed, a processor is caused to execute the method according to the embodiment of the application.
The embodiment of the application has the following advantages:
in the technical scheme of the embodiment of the application, a target area is extracted from an image, and a training image is automatically constructed aiming at the target area. Therefore, the embodiment of the application can save the labeling cost of the training image.
In addition, the error information adopted by the back propagation of the first feature extraction unit in the target detection model in the embodiment of the present application is obtained according to the matching degree between the feature representations of different training images corresponding to the same target region. In this way, the back propagation of the first feature extraction unit in the embodiment of the present application may involve the operation on positive samples represented by the same target region, so that the operation on negative samples represented by different target regions can be saved compared with a conventional self-supervised learning method or contrastive learning method; therefore, the computation cost can be saved and the computation speed can be increased.
Furthermore, the position and size of the target in the image may also change due to factors such as changes in the distance between a target, such as a pedestrian, and the camera. The enhancement processing related to the enhancement map of the embodiment of the application may include: position processing and size processing, so the enhancement map can be characterized as: the original image after position conversion and size conversion. The training of the first feature extraction unit in the embodiment of the application can improve the matching degree between the feature representations of the original image and the enhancement map; in this way, the embodiment of the present application enables the first feature extraction unit to have consistent feature representation capabilities before and after position conversion and before and after size conversion, and thus can improve the generalization capability of the first feature extraction unit. In the case where the generalization ability of the first feature extraction unit is improved, the first feature extraction unit may be applied to a plurality of detection scenes before and after a change in the detection scene.
Drawings
FIG. 1 is a schematic diagram of a target detection model according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating steps of a method for training a target detection model according to an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram illustrating a method for determining a training image according to one embodiment of the present application;
FIG. 4 is a schematic structural diagram of a first feature extraction unit according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an enhancement map corresponding feature extraction module according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating steps of a method for training a target detection model according to an embodiment of the present application;
FIG. 7 is a flow chart illustrating steps of a target detection method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training apparatus for an object detection model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the structure of an object detection device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an apparatus provided in an embodiment of the present application.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
The embodiment of the application can be applied to a target detection scene. In a target detection scene, the characteristic representation of the image to be detected can be extracted by the target detection model, whether the image to be detected contains a target to be detected, such as a pedestrian, is detected according to the characteristic representation, and if yes, the position information of the target, such as the pedestrian, can be given. The targets to be detected may include: moving objects such as pedestrians and vehicles can be understood, and the embodiment of the application is not limited to the specific target to be detected.
The target detection model of the embodiment of the application can be used for outputting a corresponding detection result according to the input image to be detected. The embodiment of the application can train a mathematical model to obtain the target detection model. The mathematical model is a scientific or engineering model constructed by using a mathematical logic method and a mathematical language, and is a mathematical structure which is generally or approximately expressed by adopting the mathematical language aiming at the characteristic or quantity dependency relationship of a certain object system, and the mathematical structure is a relational structure which is described by means of mathematical symbols. The mathematical model may be one or a set of algebraic, differential, integral or statistical equations, and combinations thereof, by which the interrelationships or causal relationships between the variables of the system are described quantitatively or qualitatively. In addition to mathematical models described by equations, there are also models described by other mathematical tools, such as algebra, geometry, topology, mathematical logic, etc. The mathematical model describes the behavior and characteristics of the system rather than the actual structure of the system. Methods such as machine learning and deep learning can be adopted to train the mathematical model; the machine learning methods may include: linear regression, decision trees, random forests, etc., and the deep learning methods may include: CNN (Convolutional Neural Networks), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), and the like.
Referring to fig. 1, a schematic structural diagram of a target detection model according to an embodiment of the present application is shown, where the target detection model specifically includes: a first feature extraction unit 101, a feature fusion unit 102, and a detection head unit 103.
The first feature extraction unit 101 may be configured to perform feature extraction on an image to be detected. The first feature extraction unit 101 may be configured to receive an image to be detected, and extract a first feature of the image from the image to be detected, where the first feature may refer to a deep image feature. The first feature extraction unit 101 may be a backbone (backbone) network, and may include: VGG (Visual Geometry Group Network), ResNet (Residual Network), lightweight Network, and the like. It is understood that, in the embodiment of the present application, a specific network corresponding to the first feature extraction unit 101 is not limited.
The residual network may be a convolutional network. A convolutional network is a deep feedforward artificial neural network that performs well in image recognition. The convolutional network may specifically include a convolutional layer and a pooling layer. The convolutional layer is used to automatically extract features from an input image to obtain a feature map. The pooling layer is used for pooling the feature map to reduce the number of features in the feature map. The pooling processing of the pooling layer may include: maximum pooling, average pooling, random pooling and the like, and can be selected according to actual requirements.
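As a purely illustrative sketch of the convolutional layer and pooling layer described above (the channel counts, kernel size and input resolution are assumptions, not values from this application), a single convolution stage followed by max pooling can be written in PyTorch as follows:

```python
import torch
import torch.nn as nn

# A minimal convolution stage: the convolution extracts a feature map,
# the pooling step then reduces the number of features in that map.
# Channel counts and kernel sizes are illustrative assumptions.
conv_stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2),  # max pooling; nn.AvgPool2d would give average pooling
)

feature_map = conv_stage(torch.randn(1, 3, 240, 120))  # e.g. a 120 x 240 image
print(feature_map.shape)  # torch.Size([1, 64, 120, 60])
```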
The feature fusion unit 102 is an intermediate unit of the target detection model that connects the first feature extraction unit 101 and the detection head unit 103; it can fuse the first features extracted by the first feature extraction unit 101 to obtain fusion features, which can improve the diversity of the features and the performance of the target detection model.
The detection head unit 103 is configured to detect whether the image to be detected includes the target to be detected according to the fusion feature output by the feature fusion unit 102, and if so, may provide position information of the target to be detected.
In the conventional technology, the target detection model is usually trained from tagged image data, and tagging of the tagged image data usually consumes a great deal of labor cost and time cost. In particular, in the case where a change in the detection scene occurs, new tagged image data needs to be prepared, which further increases the labor cost and time cost.
In view of the technical problem of high labeling cost of labeled image data, an embodiment of the present application provides a training method for a target detection model, where a first feature extraction unit in the target detection model, in addition to extracting a first feature, also performs an operation related to extracting a second feature, and the method may specifically include:
extracting a target region from the image; determining a training image according to the target area; the training image includes: the original image and at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing may include: position processing and size processing;
performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature; performing second feature extraction on the enhancement map to obtain second features;
and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information. The first feature and the second feature may both refer to deep-level image features.
According to the method and the device, the target area is extracted from the image, and the training image is automatically constructed according to the target area. Therefore, the embodiment of the application can save the labeling cost of the training image.
In addition, the error information adopted by the back propagation of the first feature extraction unit in the target detection model of the embodiment of the present application is obtained according to the matching degree between the feature representations of different training images corresponding to the same target region. Therefore, in the embodiment of the present application, the back propagation of the first feature extraction unit may involve the operation on positive samples represented by the same target region, so that compared with a conventional self-supervised learning method or contrastive learning method, the operation on negative samples represented by different target regions can be saved, thereby saving the computation cost and increasing the computation speed.
Furthermore, due to factors such as changes in the distance between the target and the camera, the position and the size of the target in the image also change. The enhancement processing related to the enhancement map of the embodiment of the application may include: position processing and size processing, so the enhancement map can be characterized as: the original image after position conversion and size conversion. The training of the first feature extraction unit in the embodiment of the application can improve the matching degree between the feature representations of the original image and the enhancement map; in this way, the embodiment of the present application enables the first feature extraction unit to have consistent feature representation capabilities before and after position conversion and before and after size conversion, and thus can improve the generalization capability of the first feature extraction unit. In the case where the generalization ability of the first feature extraction unit is improved, the first feature extraction unit may be applied to a plurality of detection scenes before and after a change in the detection scene.
Method embodiment 1
The present embodiment describes a training process of a target detection model, and particularly, a training process of a first feature extraction unit in a target detection model.
The training process of the first feature extraction unit may include: a pre-training process and a migration training (or fine-tuning) process. The pre-training can train the first feature extraction unit on general images to learn general image knowledge and image rules; the migration training may then adapt the first feature extraction unit according to the tagged image data of the detection scene. One difference between the images used for pre-training and migration training is: the images used for pre-training have general applicability, while the images used for migration training have specificity, e.g., they match the detection scenario.
In the conventional technology, pre-training of the first feature extraction unit generally adopts labeled image data, and labeling of tagged image data typically costs a significant amount of labor and time. The embodiment of the present application improves the pre-training process of the first feature extraction unit by automatically constructing training images; therefore, the embodiment of the application can save the labeling cost of the training images.
Referring to fig. 2, a schematic flow chart illustrating steps of a training method of a target detection model according to an embodiment of the present application is shown, where the method may specifically include the following steps:
step 201, extracting a target area from an image;
step 202, determining a training image according to the target area; the training image may specifically include: the original image and at least one enhancement image corresponding to the target area; the enhancement map may be an image obtained by enhancing the original image; the enhancement processing may include: position processing and size processing;
step 203, performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature;
step 204, performing second feature extraction on the enhancement map to obtain a second feature;
step 205, determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
The embodiment of the application can be used for updating the first parameter of the first feature extraction unit in the training process of the target detection model, so that the labeling cost of a training image can be saved, the computation cost can be saved, the computation speed can be increased, and the generalization capability of the first feature extraction unit can be improved.
The training process of the first feature extraction unit may include: forward propagation and backward propagation.
Forward propagation may sequentially calculate, according to the first parameter of the first feature extraction unit, the output information in the order from the input layer to the output layer, where the output information may be used to determine the error information.
Back propagation may sequentially calculate and update the first parameter of the first feature extraction unit according to the error information, in the order from the output layer to the input layer. The first feature extraction unit generally adopts a neural network structure, and the first parameters may include: weights of the neural network, etc. In the back propagation process, gradient information of the first parameter of the first feature extraction unit may be determined, and the first parameter of the first feature extraction unit is updated by using the gradient information. For example, the back propagation may sequentially calculate and store gradient information of the first parameter for the processing layers (including the input layer, the intermediate layers, and the output layer) of the first feature extraction unit, in the order from the output layer to the input layer, according to the chain rule of calculus.
In step 201, the image may originate from a camera, video camera, or other acquisition device. In other words, the embodiments of the present application may acquire an image from a video or an image acquired by at least one acquisition device. It is understood that the embodiment of the present application does not limit the specific manner of acquiring the image in step 201.
The target area may characterize the area corresponding to the target, which may contain the independent complete target. A target area may contain a target. For example, the target region corresponding to a pedestrian includes a pedestrian, but does not include a car. As another example, the target area corresponding to the car includes the car, but does not include the pedestrian.
In the embodiment of the application, a selective search method or a neural-network-based search method may be used to extract the target region corresponding to a target. The selective search method is a region extraction method for target detection, and is used for clearly segmenting out possible target regions on an image, so that the targets sent to downstream training are as complete as possible and the gradient at the edges is large. It has the advantages of fast operation speed and high recall rate. In practical application, the embodiment of the application can extract the target region from the image according to image characteristics such as color, texture, size and shape. The neural-network-based search method may include: boosting (lift) methods, and the like.
In order to extract a target region capable of representing a target from an image, an embodiment of the present application may acquire regions from the image by using a sliding window with a preset size. The preset size may be matched with the image area corresponding to a target such as a pedestrian, and the width-to-length ratio characterized by the preset size may be matched to the dimensional characteristics of the target. For example, in the case where a pedestrian is the target, the sliding window is rectangular, and the width-to-length ratio represented by the preset size may be, for example, 1:2, 1:3, or 1:4, etc.
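The following sketch illustrates, under assumed window and stride values, how candidate regions with a preset 1:2 width-to-length ratio might be collected with such a sliding window; it is an illustration rather than the exact procedure of this application:

```python
def sliding_windows(img_w, img_h, win_w=60, win_h=120, stride=30):
    """Collect candidate regions with a preset size (here a 1:2 width-to-length ratio,
    matching a pedestrian-like target); win_w, win_h and stride are assumed values."""
    regions = []
    for top in range(0, img_h - win_h + 1, stride):
        for left in range(0, img_w - win_w + 1, stride):
            regions.append((left, top, left + win_w, top + win_h))  # (x1, y1, x2, y2)
    return regions

candidates = sliding_windows(img_w=640, img_h=480)
```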
According to the embodiment of the application, whether the adjacent regions belong to the same category or not can be judged according to the image feature similarity between the adjacent regions corresponding to different sliding windows, and the specific processing process can include:
step 1, determining similarity between adjacent regions according to image characteristics of the adjacent regions;
step 2, merging two adjacent regions with similarity exceeding a similarity threshold value to obtain a merged region;
step 3, determining the similarity between the merging region and the adjacent region, and merging the merging region with the similarity exceeding a similarity threshold value with the adjacent region to obtain a merging region;
and repeating step 3 until a plurality of regions are extracted from the image and serve as target regions. It is understood that the number of target regions may be one or more, and the specific number of target regions is not limited in the embodiments of the present application.
The basis for determining the similarity between adjacent regions according to the embodiment of the present application may include: color, texture, size, and shape.
Taking color as an example, the embodiment of the present application may determine a color histogram of a neighboring region. The determination process of the color histogram may include: 25 bins (containers, representing color intervals) are respectively divided on three channels RGB (Red, Green, Blue), and a color histogram of an adjacent region is obtained by counting color distribution on a single bin. Thus, assuming that x and y represent the distribution on the bin corresponding to the color histograms of the two neighboring regions, i.e. the number of pixels falling into the bin corresponding color interval, and c represents the channel, the similarity between the neighboring regions can be determined by using the vector distance method between the color vectors corresponding to the two neighboring regions, and the like.
The vector distance method may be specifically expressed as determining similarity s between neighboring regions according to formula (1):
s = (Σi Xi·Yi) / (√(Σi Xi²) · √(Σi Yi²))    (1)
where i may represent numbers corresponding to 25 bins, respectively. Xi may represent the number of pixels falling within the color interval corresponding to the ith bin of the first neighboring region, and Yi may represent the number of pixels falling within the color interval corresponding to the ith bin of the second neighboring region. The obtained s is between 0 and 1, and the closer s is to 1, the higher the similarity of two adjacent regions can be shown. In practical applications, two adjacent regions may be merged when s exceeds a similarity threshold, which may be determined by those skilled in the art according to practical application requirements, for example, the similarity threshold may be a value of 0.6, etc.
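The following sketch illustrates the 25-bin per-channel color histogram and the comparison against the 0.6 similarity threshold described above; the cosine-style form of the similarity follows the reconstruction of formula (1) and should be read as an assumption:

```python
import numpy as np

def color_histogram(region_pixels, bins=25):
    """region_pixels: (N, 3) array of RGB values in [0, 255]; 25 bins per channel."""
    hists = [np.histogram(region_pixels[:, c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    return np.concatenate(hists).astype(np.float64)

def similarity(x, y, eps=1e-12):
    """Similarity s of formula (1): closer to 1 means more similar (assumed cosine-style form)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps))

def should_merge(pixels_a, pixels_b, threshold=0.6):
    # Merge the two neighboring regions when s exceeds the similarity threshold.
    return similarity(color_histogram(pixels_a), color_histogram(pixels_b)) > threshold
```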
In step 202, the embodiment of the present application may automatically construct a training image according to the target region, so as to save the labeling cost of the training image.
The embodiment of the application can extract a local image of the target area in the image from the image as an original image.
The embodiment of the application can also perform enhancement processing on the original image to obtain at least one enhancement image. The enhancement processing may include: position processing and size processing. The position processing can realize position conversion corresponding to the target in the original image. The position transformation may refer to a change in the position of the object in the enhancement map relative to the position of the object in the original map. The size processing may implement a size transformation corresponding to the object in the original image. The size transformation may refer to a change in the size of the object in the enhancement map relative to the size of the object in the original map.
In the embodiment of the application, in the pre-training process of the first feature extraction unit, the training image may be set to include enhancement maps subjected to position processing and size processing; this can cope with the position transformation and size transformation present in the images to be detected in the detection scene, so that the generalization capability of the first feature extraction unit can be improved and the matching degree between the first feature extraction unit and the detection scene can be enhanced.
The process of determining the training image in step 202 may specifically include: cutting an original image containing a target area from the image; the first size of the original image is larger than that of the target area; presetting an original image to obtain a middle image; randomly cropping a first image having a second size from the intermediate image and enlarging the first image into a first enhancement image having a first size; and cropping a second image having a second size from the intermediate map and enlarging the second image into a second enhancement map having the first size, in accordance with the center of the intermediate map.
The preset processing may be used to change the picture quality of the original image, and the preset processing may include but is not limited to: brightness processing, noise processing, random flipping, random erasing, color dithering, etc. The preset processing may increase the difference in image quality between the enhanced image and the original image.
Randomly cropping a first image having the second size from the intermediate image, where the first image may include a partial target or a complete target. In this way, the position of the target in the first enhancement map is changed with respect to the position of the target in the original, that is, position processing of the first enhancement map with respect to the original is realized.
A second image having the second size is cropped from the intermediate map according to its center, and the second image is enlarged into a second enhancement map of the first size. In this way, the size of the target in the second enhancement map is changed with respect to the size of the target in the original, that is, size processing of the second enhancement map with respect to the original is realized.
Referring to fig. 3, a flowchart illustrating a method of determining a training image according to an embodiment of the present application is shown. Fig. 3 is used to illustrate the process of determining the training image in step 202.
In step 301, an enlarged area corresponding to the target area, for example, 150% of the area, is cropped to obtain an original image of, for example, 120 × 240. The target area may correspond to 100% of the area in the image, and 150% of the area may be the area including the target area and having a larger area. In practical applications, an area of a first size may be cropped from the image as the original according to the center of the target area. Specifically, a 150% area can be obtained by increasing 1/4 in both the lateral and longitudinal directions, respectively, centered on the center of the target area.
In step 302, the artwork may be subjected to a preset process to obtain an intermediate map. The preset process may include: brightness processing, noise processing, random inversion, random erasure, color dithering and other image quality related processing.
In step 303, a first image having a width and a height each set to a first ratio of the original, for example a 60 × 120 first image whose width and height are each one half of those of the original, is randomly cropped from the intermediate map, and the first image is enlarged into a first enhancement map having the same size as the original, for example 120 × 240. Setting the second size to one half of the original is only an example; the second size may also be 2/3, 3/4, etc. of the original.
In step 304, a second image having a width and a height each set to a second ratio for the original, for example, a 60 × 120 second image having a width and a height half of the original, is cropped from the intermediate map according to the center of the intermediate map, and the second image is enlarged to a second enhanced map having the same size as the original, for example, 120 × 240. The first ratio and the second ratio are positive numbers, and specific values can be set according to actual conditions.
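The following Pillow-based sketch illustrates steps 301 to 304 with the 120 × 240 and 60 × 120 example sizes; the enlargement factor around the target region and the brightness jitter used as the preset processing are assumptions for illustration only:

```python
import random
from PIL import Image, ImageEnhance

def build_training_images(image: Image.Image, box, out_size=(120, 240), crop_size=(60, 120)):
    """box = (x1, y1, x2, y2) of the target region; sizes follow the 120x240 / 60x120 example.
    The enlargement factor around the target region and the brightness jitter are assumptions
    standing in for the enlarged-area crop and the preset processing."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = (x2 - x1) * 1.25, (y2 - y1) * 1.25  # enlarged area around the target region (assumed factor)
    original = image.crop((int(cx - w / 2), int(cy - h / 2),
                           int(cx + w / 2), int(cy + h / 2))).resize(out_size)

    # Preset processing on the original to obtain the intermediate map (brightness jitter as an example).
    intermediate = ImageEnhance.Brightness(original).enhance(random.uniform(0.7, 1.3))

    cw, ch = crop_size
    # First enhancement map: random crop of the second size, enlarged back to the first size.
    left = random.randint(0, out_size[0] - cw)
    top = random.randint(0, out_size[1] - ch)
    enhancement_1 = intermediate.crop((left, top, left + cw, top + ch)).resize(out_size)

    # Second enhancement map: crop of the second size around the center of the intermediate map.
    left, top = (out_size[0] - cw) // 2, (out_size[1] - ch) // 2
    enhancement_2 = intermediate.crop((left, top, left + cw, top + ch)).resize(out_size)

    return original, enhancement_1, enhancement_2
```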
In step 203, a first feature extraction unit may be used to perform a first feature extraction on the artwork to obtain a first feature.
In practical applications, the first feature extraction unit may include: n convolutional layers of N levels, where N may be a positive integer, for example, N may be a value such as 5. In a particular implementation, outputs of at least one level of convolutional layers may be fused to obtain a first feature.
Referring to fig. 4, a schematic structural diagram of a first feature extraction unit according to an embodiment of the present application is shown, where the first feature extraction unit may include: a first convolutional layer 401, a second convolutional layer 402, a third convolutional layer 403, a fourth convolutional layer 404, and a fifth convolutional layer 405. The convolutional layers here are large stages, each having a certain number of blocks inside, each block comprising, for example, 3 convolutions. Of course, the number of convolutional layers described herein is for illustration only and should not be construed as a limitation on the present application. In the present application, a first multilayer perceptron (MLP) 406 may be provided after the third convolutional layer 403, a second multilayer perceptron 407 may be provided after the fourth convolutional layer 404, and a third multilayer perceptron 408 may be provided after the fifth convolutional layer 405. Furthermore, a fusion processing unit 409 may be further provided, and the fusion processing unit 409 performs fusion processing on the outputs of the first multilayer perceptron 406, the second multilayer perceptron 407, and the third multilayer perceptron 408 to obtain the first feature.
In the embodiment of the application, the first multilayer perceptron 406, the second multilayer perceptron 407 and the third multilayer perceptron 408 are utilized to perform full connection operation on the feature codes output by the convolutional layers of multiple levels, and the corresponding full connection results are fused (for example, spliced) together, so that an M-dimensional first feature can finally be obtained (M may be, for example, 128).
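The following PyTorch sketch illustrates the structure of fig. 4 under assumed channel counts, with global average pooling inserted before each multilayer perceptron (an assumption not stated in this application) and the three full connection results fused into a 128-dimensional first feature:

```python
import torch
import torch.nn as nn

def conv_stage(in_ch, out_ch):
    # One "large stage"; the real block and convolution counts per stage are not fixed here.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class FirstFeatureExtractor(nn.Module):
    """Sketch of the unit in fig. 4: MLPs after conv stages 3-5, fused into an M=128-dim feature."""
    def __init__(self, dim=128):
        super().__init__()
        chs = [3, 32, 64, 128, 256, 512]  # assumed channel counts
        self.stages = nn.ModuleList([conv_stage(chs[i], chs[i + 1]) for i in range(5)])
        self.mlps = nn.ModuleList([nn.Sequential(nn.Linear(c, dim), nn.ReLU(inplace=True),
                                                 nn.Linear(dim, dim))
                                   for c in chs[3:]])  # after stages 3, 4 and 5
        self.fuse = nn.Linear(3 * dim, dim)            # fusion of the three full connection results

    def forward(self, x):
        outs = []
        for i, stage in enumerate(self.stages):
            x = stage(x)
            if i >= 2:                        # stages 3, 4, 5 (0-indexed 2, 3, 4)
                pooled = x.mean(dim=(2, 3))   # global average pooling (assumption)
                outs.append(self.mlps[i - 2](pooled))
        return self.fuse(torch.cat(outs, dim=1))  # 128-dimensional first feature

first_feature = FirstFeatureExtractor()(torch.randn(2, 3, 240, 120))  # shape (2, 128)
```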
In step 204, a second feature extraction is performed on the enhancement map to obtain a second feature.
In an embodiment of the present application, the first feature and the second feature may be extracted by a first feature extraction unit. In another implementation manner of the present application, a first feature extraction unit may be used to extract a first feature, and a second feature extraction unit may be used to perform feature extraction on an enhancement map (such as a first enhancement map and a second enhancement map) to obtain a second feature, where the second feature extraction unit and the first feature extraction unit may have the same neural network structure. For example, the second feature extraction unit may include: convolutional layers of N levels.
In another implementation, the performing the second feature extraction on the enhancement map specifically may include: performing feature extraction on the enhancement map by using a second feature extraction unit; and performing full connection operation on the feature representation output by the second feature extraction unit by using the seventh multilayer perceptron and the eighth multilayer perceptron to obtain a second feature.
Referring to fig. 5, a schematic structural diagram of an enhancement map corresponding feature extraction module according to an embodiment of the present application is shown, where the enhancement map corresponding feature extraction module may include the second feature extraction unit 501. The second feature extraction unit 501 may include: a first convolutional layer 511, a second convolutional layer 512, a third convolutional layer 513, a fourth convolutional layer 514, a fifth convolutional layer 515, a fourth multilayer perceptron 516, a fifth multilayer perceptron 517, a sixth multilayer perceptron 518, and a fusion processing unit 519.
The enhancement map corresponding feature extraction module may further include: a seventh multilayer perceptron 502, a principal component analysis unit 503 and an eighth multilayer perceptron 504.
The seventh multilayer perceptron 502 may perform a full connection operation on the feature codes output by the fusion processing unit 519, the principal component analysis unit 503 may extract key feature codes from the feature codes output by the seventh multilayer perceptron 502, and the eighth multilayer perceptron 504 may perform a further full connection operation on the key feature codes. The number of neurons included in the seventh multilayer perceptron 502 and the eighth multilayer perceptron 504 is not limited in the embodiments of the present application; for example, the number of neurons may be 1024, 2048, 4096, or the like.
The principal component analysis unit 503 may convert a series of possibly linearly related variables into a set of linearly unrelated new variables, also called principal components, using an orthogonal transformation, thereby characterizing the data in a smaller dimension using the new variables. In the process of extracting the key feature codes, the principal component analysis unit 503 is equivalent to screening the feature codes, that is, screening out the feature codes with lower stability and retaining the feature codes with higher stability, so that the embodiment of the present application can improve the stability of the second feature, and further can improve the accuracy of updating the parameters of the first feature extraction unit. Specifically, the enhancement map is obtained by performing position conversion and size conversion on the original, and the enhancement map and the original may in practice correspond to the same target; the principal component analysis unit 503 of the embodiment of the present application can improve the stability of the features of the enhancement map, and therefore can improve the accuracy of the error information between the original and the enhancement map, as well as the accuracy of the parameter update of the first feature extraction unit.
It is understood that disposing the principal component analysis unit 503 between the seventh multilayer perceptron 502 and the eighth multilayer perceptron 504 is only an alternative embodiment of the present application, and is not to be construed as a limitation of the embodiments of the present application. In fact, the principal component analysis unit 503 may be omitted, that is, the seventh multilayer perceptron 502 and the eighth multilayer perceptron 504 may be disposed in this order after the second feature extraction unit 501.
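The following sketch illustrates the enhancement map corresponding feature extraction module of fig. 5; the second feature extraction unit is passed in as a module with the same structure as the first one, torch.pca_lowrank over the batch is used as a stand-in for the principal component analysis unit, and the hidden sizes and number of retained components are assumptions:

```python
import torch
import torch.nn as nn

class EnhancementMapBranch(nn.Module):
    """Sketch of fig. 5: second feature extraction unit, then MLP 502 -> PCA 503 -> MLP 504.
    torch.pca_lowrank over the batch stands in for the principal component analysis unit
    (an assumption); hidden sizes and n_components are assumed values."""
    def __init__(self, second_unit: nn.Module, dim=128, hidden=1024, n_components=64):
        super().__init__()
        self.second_unit = second_unit  # same neural network structure as the first unit
        self.mlp7 = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim))
        self.mlp8 = nn.Sequential(nn.Linear(n_components, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim))
        self.n_components = n_components

    def forward(self, enhancement_map):
        feats = self.mlp7(self.second_unit(enhancement_map))
        # Keep only the leading principal components of the batch of feature codes
        # (requires the batch size to be at least n_components).
        _, _, v = torch.pca_lowrank(feats, q=self.n_components)
        key_feats = feats @ v[:, : self.n_components]
        return self.mlp8(key_feats)  # second feature
```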
In practical applications, the dimensions of the feature vectors corresponding to the first feature and the second feature may be the same, for example, the first feature and the second feature may both be 128-dimensional feature vectors.
In step 205, a measurement method may be used to determine the matching degree between the first feature and the second feature. The measurement method may include: Euclidean distance, or cosine of included angle, or information entropy, etc.; it can be understood that the specific measurement method is not limited in the embodiments of the present application.
In the case that the enhancement map is one, the embodiment of the present application may determine the error information according to the matching degree between the first feature and the second feature. In this case, the error information may be the first preset value as the update target. The first preset value can represent a numerical value corresponding to the matching degree under the condition that the first characteristic and the second characteristic are the same. For example, in the case of determining the matching degree by using the cosine of the included angle, the first preset value may be 1; for another example, in the case of determining the matching degree by using the euclidean distance, the first preset value may be 0.
In the case that there are a plurality of enhancement maps, the embodiment of the present application may determine the error information according to the matching degree between the first feature and the plurality of second features.
In particular implementations, the enhancement map may include: a first enhancement map and a second enhancement map; the process of determining the error information may include: the error information is determined based on a first matching degree between the original image and the first enhanced image and a second matching degree between the original image and the second enhanced image.
For example, the error information may be determined based on the result of adding the first matching degree and the second matching degree. In this case, the error information may be the second preset value as the update target. For example, in the case of determining the matching degree by the angle cosine method, the error information may be a difference between 1 and half of the addition result, and in this case, the error information may be 0 as the update target.
The cosine method of the included angle may be specifically expressed as determining error information loss according to formula (2):
loss = 1 - (cos(v, p) + cos(v, q)) / 2    (2)
wherein v represents a first feature corresponding to the original image, and p and q represent second features corresponding to the two enhancement images, respectively.
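A minimal sketch of formula (2), computing the error information from the cosine of the included angle between the first feature and the two second features (batch averaging is an implementation assumption):

```python
import torch.nn.functional as F

def pretrain_loss(v, p, q):
    """loss = 1 - (cos(v, p) + cos(v, q)) / 2, following formula (2).
    v: first feature of the original image; p, q: second features of the two enhancement maps."""
    cos_vp = F.cosine_similarity(v, p, dim=-1)
    cos_vq = F.cosine_similarity(v, q, dim=-1)
    return (1 - (cos_vp + cos_vq) / 2).mean()
```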
The method for updating the first parameter of the first feature extraction unit may include: a gradient descent method, a newton method, a quasi-newton method, a conjugate gradient method, or the like, and it is understood that the embodiment of the present application is not limited to a specific update method.
The embodiment of the application can represent the mapping relation between the error information and the matching degree through the error function. In practical applications, a partial derivative of a parameter of the error function (e.g., a parameter corresponding to the first feature in the first feature extraction unit) may be obtained, and the obtained partial derivative of the parameter may be written in a form of a vector, where the vector corresponding to the partial derivative may be referred to as gradient information corresponding to the parameter. The updating amount corresponding to the parameter can be obtained according to the gradient information and the step length information.
When the gradient descent method is used, a batch gradient descent method, a stochastic gradient descent method, a mini-batch gradient descent method, or the like may be used. In a specific implementation, iteration may be performed according to the training images corresponding to one image; alternatively, iteration may be performed based on the training images corresponding to multiple images. The convergence condition of the iteration may be: the error information meets the preset condition. The preset condition may be: the absolute value of the difference between the error information and the first preset value or the second preset value is smaller than a difference threshold, or the number of iterations exceeds a count threshold, and so on. In other words, the iteration may be ended in case the error information meets the preset condition; in this case, the first target parameter value of the first feature extraction unit can be obtained.
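The following sketch puts these pieces together as one possible pre-training iteration with a mini-batch gradient update of the first parameter and the convergence check described above; the optimizer, learning rate and thresholds are assumptions, and the updates of the seventh/eighth multilayer perceptrons and of the second feature extraction unit (see the momentum sketch further below) are omitted for brevity:

```python
import torch
import torch.nn.functional as F

def pretrain(first_unit, enhancement_branch, loader, diff_threshold=1e-3, max_steps=10000):
    """loader yields (original, enhancement_1, enhancement_2) batches; SGD and the
    thresholds are assumptions. The update target of the error information here is 0."""
    optimizer = torch.optim.SGD(first_unit.parameters(), lr=0.05, momentum=0.9)
    for step, (orig, enh1, enh2) in enumerate(loader):
        v = first_unit(orig)                                       # first feature
        p, q = enhancement_branch(enh1), enhancement_branch(enh2)  # second features
        loss = 1 - (F.cosine_similarity(v, p, dim=-1) + F.cosine_similarity(v, q, dim=-1)).mean() / 2
        optimizer.zero_grad()
        loss.backward()   # back propagation of the error information
        optimizer.step()  # update the first parameter
        if abs(loss.item()) < diff_threshold or step >= max_steps:
            break         # error information meets the preset condition
    return first_unit.state_dict()  # first target parameter value
```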
In addition to updating the first parameter of the first feature extraction unit, the embodiment of the present application may also update the parameters of the feature extraction module corresponding to the enhancement map, and accordingly, the method may further include: updating the second parameter of the second feature extraction unit according to the updated first parameter; and updating the parameters of the seventh multilayer perceptron and the eighth multilayer perceptron according to the error information. The parameters of the seventh multilayer perceptron and the eighth multilayer perceptron may participate in the back propagation corresponding to the error information.
According to the embodiment of the application, the last second parameter can be updated according to the current first parameter and the last second parameter. The current first parameter may refer to the ith first parameter, the last second parameter may refer to the (i-1) th second parameter, i may refer to the number of iterations, and i may be a positive integer. Specifically, a first weight and a second weight corresponding to the current first parameter and the last second parameter may be set, respectively, and the current first parameter and the last second parameter may be weighted according to the first weight and the second weight. Wherein the first weight and the second weight may be between [0,1], the sum of the first weight and the second weight may be 1, and the second weight may be a value close to 1, such as 0.95.
The updating process of the second parameter is shown in formula (3):
ξ_i = m · ξ_{i-1} + (1 - m) · θ_i    (3)

where ξ_{i-1} denotes the last second parameter, ξ_i denotes the updated second parameter, θ_i denotes the current first parameter, and m denotes the second weight.
The multilayer perceptron of the embodiments of the present application is a feedforward artificial neural network model that can map multiple input data sets onto a single output data set. According to the embodiment of the application, the parameters of the seventh multilayer perceptron and the eighth multilayer perceptron can be updated according to the error information by using the back propagation algorithm of a neural network.
According to the embodiment of the application, the second parameter of the second feature extraction unit is updated according to the updated first parameter, and the parameters of the seventh multilayer perceptron and the eighth multilayer perceptron are updated according to the error information, so that the problem that the parameter updates of the feature extraction modules corresponding to the original image and the enhancement map are not synchronized can be avoided to a certain extent, and the synchronism between the parameters of the feature extraction modules corresponding to the original image and the enhancement map is improved; moreover, the correlation between two adjacent second parameters can also be improved.
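A minimal sketch of formula (3), updating each second parameter as a weighted combination of the previous second parameter (second weight m, e.g. 0.95) and the current updated first parameter; pairing the parameters one-to-one relies on the two units sharing the same neural network structure:

```python
import torch

@torch.no_grad()
def momentum_update_second_unit(first_unit, second_unit, m=0.95):
    """Formula (3): second_param <- m * second_param + (1 - m) * first_param."""
    for p_first, p_second in zip(first_unit.parameters(), second_unit.parameters()):
        p_second.mul_(m).add_((1 - m) * p_first)
```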
In summary, in the training method of the target detection model according to the embodiment of the present application, the target region is extracted from the image, and the training image is automatically constructed for the target region. Therefore, the embodiment of the application can save the labeling cost of the training image.
In addition, in the embodiment of the present application, the error information used for the back propagation of the first feature extraction unit is obtained according to the matching degree between the feature representations of different training images corresponding to the same target region. Therefore, the back propagation of the first feature extraction unit only involves operations on the positive samples represented by the same target region; compared with conventional self-supervised learning or contrastive learning methods, the operations on negative samples represented by different target regions can be omitted, thereby saving operation cost and increasing operation speed.
Moreover, because of factors such as changes in the distance between a pedestrian and the camera, the position and size of the pedestrian in the image can also change. The enhancement processing related to the enhancement map of the embodiment of the application may include position processing and size processing, so the enhancement map can be regarded as the original image after position conversion and size conversion. Training the first feature extraction unit in the embodiment of the application can improve the matching degree between the feature representations of the original image and the enhancement map; in this way, the first feature extraction unit can have consistent feature representation capability before and after position conversion and before and after size conversion, which can improve the generalization capability of the first feature extraction unit. With improved generalization capability, the first feature extraction unit can be applied to a plurality of detection scenes, even when the detection scene changes.
Method embodiment two
The present embodiment describes a training process of an object detection model, and particularly, describes a migration training process of an object detection model.
Referring to fig. 6, a schematic flow chart illustrating steps of a training method of a target detection model according to an embodiment of the present application is shown, where the method may specifically include the following steps:
step 601, extracting a target area from an image;
step 602, determining a training image according to the target area; the training image may specifically include: the original image and at least one enhancement image corresponding to the target area; the enhancement map may be an image obtained by enhancing the original image; the enhancement processing may include: position processing and size processing;
step 603, performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature;
step 604, performing second feature extraction on the enhancement map to obtain a second feature;
step 605, determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information;
Compared with the first method embodiment shown in fig. 2, the method of this embodiment may further include:
step 606, performing migration training on the target detection model according to the first target parameter value and the labeled image data; wherein the first target parameter value may be the value of the first parameter in the case where the error information meets the preset condition.
The image data with the label in the embodiment of the application can be an image sample in a detection scene, and the detection scene can correspond to places such as a supermarket, a market, a park and the like.
As shown in fig. 1, the target detection model of the embodiment of the present application may include the first feature extraction unit 101, the feature fusion unit 102, and the detection head unit 103. During the migration training process, the first target parameter value may be used as the initial parameter of the first feature extraction unit 101, and the first parameter may be updated on the basis of the first target parameter value.
The feature fusion unit 102 may correspond to a third parameter, the detection head unit 103 may correspond to a fourth parameter, and initial values of the third parameter and the fourth parameter may be determined by those skilled in the art according to practical application requirements, and the embodiment of the present application does not limit the initial values of the third parameter and the fourth parameter. The third parameter and the fourth parameter may be updated based on the initial values of the third parameter and the fourth parameter during the migration training.
The labeled image data may carry a positive label or a negative label; a positive label may characterize that the labeled image data includes a target, and a negative label may characterize that the labeled image data does not include a target. In the migration training process of the target detection model, the labeled image data can be detected by the target detection model; loss information can be determined according to the detection result obtained by the target detection model and the label corresponding to the labeled image data; the first feature extraction unit 101, the feature fusion unit 102, and the detection head unit 103 in the target detection model are updated according to the loss information; and the migration training can be considered complete when the loss information meets a convergence condition. The convergence condition may include matching the loss information against a loss threshold, or the like; it is understood that the embodiment of the present application is not limited to a specific convergence condition.
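Purely as an illustration of this migration training step, the sketch below initialises a toy first feature extraction unit from the first target parameter value and fine-tunes the whole detector on labelled data. The toy module architectures, the cross-entropy loss stand-in, the random data, and the thresholds are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three units of the target detection model shown in fig. 1.
first_extraction_unit = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveAvgPool2d(1), nn.Flatten())
feature_fusion_unit = nn.Linear(8, 8)
detection_head_unit = nn.Linear(8, 2)            # outputs: target / no target
detector = nn.Sequential(first_extraction_unit, feature_fusion_unit, detection_head_unit)

# The first target parameter value would come from the training of method
# embodiment one; here it is faked with a structurally identical copy.
pretrained_unit = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                nn.AdaptiveAvgPool2d(1), nn.Flatten())
first_target_parameter_value = pretrained_unit.state_dict()
first_extraction_unit.load_state_dict(first_target_parameter_value)   # initial parameter

optimizer = torch.optim.SGD(detector.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()                # stand-in for the detection loss
loss_threshold = 0.05                            # illustrative convergence condition

for step in range(1000):
    images = torch.randn(4, 3, 64, 64)           # stand-in for labeled image data
    labels = torch.randint(0, 2, (4,))           # positive / negative labels
    loss = criterion(detector(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < loss_threshold:
        break                                    # loss meets the convergence condition
```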
In the training process of the target detection model, the first parameter, the second parameter, the third parameter, and the fourth parameter may be obtained by learning a training sample (including a training image or labeled image data). Such parameters may include: weight parameters of the neural network, etc.
In the training process of the target detection model, some parameters cannot be obtained by learning from the training samples; such parameters are called hyper-parameters. Examples of hyper-parameters may include: the number of layers of the neural network, the number of neurons in each layer, the number of training samples processed in one training step (the batch size), the learning rate, and the like.
In order to save the calculation cost of determining the hyper-parameters, the embodiment of the application may determine a value range corresponding to a hyper-parameter and search, within that range, for the target value of the hyper-parameter based on training with the labeled image data. Specifically, for a plurality of candidate values in the value range, the corresponding loss information may be determined, and the target value may be selected from the candidate values according to the loss information. For example, the candidate values may be sorted from best to worst according to the loss information, and the top-ranked candidate value may be selected as the target value.
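A minimal sketch of this search, assuming the learning rate as the hyper-parameter being searched; the candidate values and the synthetic evaluation formula are purely illustrative and stand in for a short training run on labeled image data.

```python
# Hypothetical hyper-parameter search: evaluate each candidate value in the range
# by the loss a short training run on labeled data would reach, then keep the
# candidate with the best (lowest) loss.
candidate_values = [0.1, 0.05, 0.01, 0.005]      # assumed value range for the learning rate

def evaluate(learning_rate: float) -> float:
    """Placeholder: stands in for a short migration-training run that returns
    the loss information reached with this hyper-parameter value."""
    return abs(learning_rate - 0.02)             # synthetic loss, for illustration only

losses = {value: evaluate(value) for value in candidate_values}
# Sort candidates from best to worst loss and pick the top-ranked one as the target value.
target_value = min(losses, key=losses.get)
```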
In summary, according to the training method for the target detection model in the embodiment of the present application, migration training is performed on the target detection model according to the first target parameter value and the labeled image data. The first target parameter value is obtained from general training images, and it enables the first feature extraction unit to have consistent feature representation capability before and after position conversion and before and after size conversion, so the generalization capability of the first feature extraction unit can be improved. With improved generalization capability, the matching degree between the first target parameter value of the first feature extraction unit and the detection scene can also be improved.
In addition, performing migration training on the target detection model according to the labeled image data makes the migrated model suitable for the detection scene corresponding to the labeled image data; that is, the detection capability of the target detection model after migration training can be improved, and the accuracy of its detection results can be improved.
Method embodiment three
In this embodiment, a detection process of the target detection model is described, and the target detection model may perform target detection on an image to be detected to obtain a corresponding detection result.
Referring to fig. 7, a schematic flow chart illustrating steps of a target detection method according to an embodiment of the present application is shown, where the method may specifically include the following steps:
step 701, receiving an image to be detected;
step 702, performing target detection on the image to be detected by using a target detection model to obtain a corresponding detection result;
wherein, the target detection model may include: a first feature extraction unit; the training process of the target detection model may include: extracting a target region from the image; determining a training image according to the target area; the training image may include: the original image and at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing may include: position processing and size processing; performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature; performing second feature extraction on the enhancement map to obtain second features; and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
The image to be detected in step 701 may originate from an acquisition device. The acquisition device may acquire video; in this case, video frames can be extracted from the video to serve as images to be detected. Alternatively, the acquisition device may acquire images; in this case, the acquired images can be used directly as images to be detected.
In step 702, the target detection model may perform target detection on the image to be detected according to the process shown in fig. 1. Specifically, the first feature extraction unit in the target detection model may extract a feature representation of the image to be detected. The feature fusion unit in the target detection model can fuse the feature representation output by the first feature extraction unit so as to improve the diversity of fusion features and the performance of the target detection model.
The detection head unit in the target detection model can determine, according to the fusion features output by the feature fusion unit, whether the image to be detected contains a target such as a pedestrian, and if so, can give the position information of the target. Therefore, the detection result of the embodiment of the present application may indicate that no target is included; alternatively, the detection result may include the target and its position information, where the position information may be coordinate information or may be marked directly in the image to be detected.
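The detection flow of steps 701 and 702 might be expressed roughly as below. The output format of the model (a dictionary of boxes and scores), the function name, and the confidence threshold of 0.5 are assumptions for illustration, not a format mandated by the target detection model described here.

```python
import torch

@torch.no_grad()
def detect(detection_model, image_to_detect, score_threshold=0.5):
    """Hypothetical sketch of steps 701-702: run target detection on an image
    to be detected and report whether a target is found and where."""
    detection_model.eval()
    output = detection_model(image_to_detect.unsqueeze(0))   # add a batch dimension
    boxes, scores = output["boxes"], output["scores"]        # assumed output format
    keep = scores > score_threshold
    if not bool(keep.any()):
        return {"contains_target": False}                    # detection result: no target
    return {"contains_target": True,                         # detection result: target found
            "positions": boxes[keep].tolist()}               # coordinate information
```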
In summary, the target detection method according to the embodiment of the present application extracts a target region from an image, and automatically constructs a training image for the target region. Therefore, the embodiment of the application can save the labeling cost of the training image.
In addition, in the embodiment of the present application, the error information used for the back propagation of the first feature extraction unit is obtained according to the matching degree between the feature representations of different training images corresponding to the same target region. Therefore, the back propagation of the first feature extraction unit only involves operations on the positive samples represented by the same target region; compared with conventional self-supervised learning or contrastive learning methods, the operations on negative samples represented by different target regions can be omitted, thereby saving operation cost and increasing operation speed.
Moreover, because of factors such as changes in the distance between a pedestrian and the camera, the position and size of the pedestrian in the image can also change. The enhancement processing related to the enhancement map of the embodiment of the application may include position processing and size processing, so the enhancement map can be regarded as the original image after position conversion and size conversion. Training the first feature extraction unit in the embodiment of the application can improve the matching degree between the feature representations of the original image and the enhancement map; in this way, the first feature extraction unit can have consistent feature representation capability before and after position conversion and before and after size conversion, which can improve the generalization capability of the first feature extraction unit. With improved generalization capability, the accuracy of target detection can be improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the embodiments. Further, those of skill in the art will recognize that the embodiments described in this specification are presently preferred embodiments and that no particular act is required to implement the embodiments of the disclosure.
On the basis of the foregoing embodiment, this embodiment further provides a training apparatus for a target detection model, and with reference to fig. 8, the training apparatus may specifically include: a region extraction module 801, a training image determination module 802, a first feature extraction module 803, a second feature extraction module 804, an error determination module 805, and a first parameter update module 806.
The region extraction module 801 is configured to extract a target region from an image;
a training image determining module 802, configured to determine a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing includes: position processing and size processing;
a first feature extraction module 803, configured to perform first feature extraction on the original image by using a first feature extraction unit to obtain a first feature;
a second feature extraction module 804, configured to perform second feature extraction on the enhancement map to obtain a second feature;
an error determining module 805, configured to determine error information according to a matching degree between the first feature and the second feature;
a first parameter updating module 806, configured to update the first parameter of the first feature extraction unit according to the error information.
Optionally, the second feature extraction module 804 may include:
the second feature extraction unit is used for extracting features of the enhancement graph; the second feature extraction unit and the first feature extraction unit have the same neural network structure;
and the seventh multilayer perceptron and the eighth multilayer perceptron are used for carrying out full connection operation on the feature representation output by the second feature extraction unit to obtain a second feature.
Optionally, the apparatus may further include:
the second parameter updating module is used for updating the second parameter of the second feature extraction unit according to the updated first parameter;
and the perceptron parameter updating module is used for updating the parameters of the seventh multilayer perceptron and the eighth multilayer perceptron according to the error information.
Optionally, the enhancement map may include: a first enhancement map and a second enhancement map;
the error determining module 805 is specifically configured to determine error information according to a first matching degree between the original image and the first enhanced image and a second matching degree between the original image and the second enhanced image.
Optionally, the training image determining module 802 may specifically include:
the original image acquisition module is used for cutting an original image containing a target area from the image; the first size of the original image is larger than that of the target area;
the preset processing module is used for performing preset processing on the original image to obtain an intermediate map;
the first enhancement map acquisition module is used for randomly cropping a first image having the second size from the intermediate map and enlarging the first image into a first enhancement map having the first size;
and the second enhancement map acquisition module is used for cropping a second image having the second size from the intermediate map according to the center of the intermediate map and enlarging the second image into a second enhancement map having the first size; a sketch of this crop-and-enlarge procedure is given after this list.
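As an illustration only, the crop-and-enlarge procedure performed by these modules might look like the sketch below, assuming (C, H, W) tensor images, a square crop centred on the target area, a horizontal flip as a stand-in for the preset processing, and bilinear interpolation for enlarging; boundary handling is omitted and none of these specifics come from the original disclosure.

```python
import torch
import torch.nn.functional as F

def build_training_images(image, box, first_size=128, second_size=96):
    """Hypothetical sketch: cut an original image containing the target area, then
    derive two enhancement maps by position processing (cropping) and size
    processing (enlarging)."""
    x, y, w, h = box                                   # target area inside the image
    cx, cy = x + w // 2, y + h // 2
    half = first_size // 2                             # first size is larger than the target area
    original = image[:, cy - half:cy + half, cx - half:cx + half]

    intermediate = original.flip(-1)                   # stand-in for the preset processing

    # First enhancement map: random crop of the second size, enlarged to the first size.
    top = int(torch.randint(0, first_size - second_size + 1, (1,)))
    left = int(torch.randint(0, first_size - second_size + 1, (1,)))
    first_crop = intermediate[:, top:top + second_size, left:left + second_size]
    first_enhancement = F.interpolate(first_crop.unsqueeze(0), size=first_size,
                                      mode="bilinear", align_corners=False)[0]

    # Second enhancement map: centre crop of the second size, enlarged to the first size.
    offset = (first_size - second_size) // 2
    second_crop = intermediate[:, offset:offset + second_size, offset:offset + second_size]
    second_enhancement = F.interpolate(second_crop.unsqueeze(0), size=first_size,
                                       mode="bilinear", align_corners=False)[0]
    return original, first_enhancement, second_enhancement
```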
Optionally, when the error information meets a preset condition, the value of the first parameter is a first target parameter value;
the apparatus may further include:
and the migration training module is used for performing migration training on the target detection model according to the first target parameter value and the image data with the labels.
On the basis of the foregoing embodiment, this embodiment further provides an object detection apparatus, and with reference to fig. 9, the object detection apparatus may specifically include: a receiving module 901 and an object detecting module 902.
The receiving module 901 is configured to receive an image to be detected;
a target detection module 902, configured to perform target detection on the image to be detected by using a target detection model to obtain a corresponding detection result;
the target detection model may specifically include: a first feature extraction unit; the training process of the target detection model may specifically include: extracting a target area from the image; determining a training image according to the target area; the training image includes: the original image and at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement treatment may specifically include: position processing and size processing; performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature; performing second feature extraction on the enhancement map to obtain second features; and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
The present application further provides a non-transitory readable storage medium, where one or more modules (programs) are stored; when the one or more modules are applied to a device, the device may be caused to execute instructions for the method steps in this application.
Embodiments of the present application provide one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an electronic device to perform the methods as described in one or more of the above embodiments. In the embodiment of the present application, the electronic device includes various types of devices such as a terminal device and a server (cluster).
Embodiments of the disclosure may be implemented as an apparatus for performing desired configurations using any suitable hardware, firmware, software, or any combination thereof, which may include: electronic devices such as terminal devices and servers (clusters). Fig. 10 schematically illustrates an example apparatus 1100 that may be used to implement various embodiments described herein.
For one embodiment, fig. 10 illustrates an example apparatus 1100 having one or more processors 1102, a control module (chipset) 1104 coupled to at least one of the processor(s) 1102, a memory 1106 coupled to the control module 1104, a non-volatile memory (NVM)/storage 1108 coupled to the control module 1104, one or more input/output devices 1110 coupled to the control module 1104, and a network interface 1112 coupled to the control module 1104.
The processor 1102 may include one or more single-core or multi-core processors, and the processor 1102 may include any combination of general-purpose or special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In some embodiments, the apparatus 1100 can be implemented as a terminal device, a server (cluster), or the like in the embodiments of the present application.
In some embodiments, the apparatus 1100 may include one or more computer-readable media (e.g., the memory 1106 or the NVM/storage 1108) having instructions 1114 and one or more processors 1102 in combination with the one or more computer-readable media configured to execute the instructions 1114 to implement modules to perform the actions described in this disclosure.
For one embodiment, control module 1104 may include any suitable interface controllers to provide any suitable interface to at least one of the processor(s) 1102 and/or to any suitable device or component in communication with control module 1104.
The control module 1104 may include a memory controller module to provide an interface to the memory 1106. The memory controller module may be a hardware module, a software module, and/or a firmware module.
The memory 1106 may be used, for example, to load and store data and/or instructions 1114 for the device 1100. For one embodiment, memory 1106 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the memory 1106 may comprise a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, control module 1104 may include one or more input/output controllers to provide an interface to NVM/storage 1108 and input/output device(s) 1110.
For example, NVM/storage 1108 may be used to store data and/or instructions 1114. NVM/storage 1108 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 1108 may include storage resources that are physically part of the device on which apparatus 1100 is installed, or it may be accessible by the device and need not be part of the device. For example, NVM/storage 1108 may be accessed over a network via input/output device(s) 1110.
Input/output device(s) 1110 may provide an interface for apparatus 1100 to communicate with any other suitable device; input/output devices 1110 may include communication components, audio components, sensor components, and so forth. Network interface 1112 may provide an interface for device 1100 to communicate over one or more networks, and device 1100 may communicate wirelessly with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols, such as a wireless network based on a communication standard, e.g., WiFi, 2G, 3G, 4G, 5G, etc., or a combination thereof.
For one embodiment, at least one of the processor(s) 1102 may be packaged together with logic for one or more controller(s) (e.g., memory controller module) of the control module 1104. For one embodiment, at least one of the processor(s) 1102 may be packaged together with logic for one or more controller(s) of control module 1104 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 1102 may be integrated on the same die with logic for one or more controller(s) of control module 1104. For one embodiment, at least one of the processor(s) 1102 may be integrated on the same die with logic for one or more controller(s) of control module 1104 to form a system on chip (SoC).
In various embodiments, the apparatus 1100 may be, but is not limited to: a server, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.), among other terminal devices. In various embodiments, the apparatus 1100 may have more or fewer components and/or different architectures. For example, in some embodiments, device 1100 includes one or more cameras, keyboards, Liquid Crystal Display (LCD) screens (including touch screen displays), non-volatile memory ports, multiple antennas, graphics chips, Application Specific Integrated Circuits (ASICs), and speakers.
The detection device may adopt a main control chip as the processor or control module; sensor data, position information, and the like may be stored in the memory or the NVM/storage device; the sensor group may serve as the input/output device; and the communication interface may comprise the network interface.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The above detailed description is provided for a training method and apparatus of a target detection model, a target detection method and apparatus, an electronic device, and a machine-readable medium, and specific examples are applied herein to explain the principles and embodiments of the present application, and the descriptions of the above embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (11)

1. A method for training an object detection model, wherein the object detection model comprises: a first feature extraction unit, the method comprising:
extracting a target region from the image;
determining a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing includes: position processing and size processing;
performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature;
performing second feature extraction on the enhancement map to obtain second features;
and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
2. The method of claim 1, wherein the second feature extraction of the enhancement map comprises:
performing feature extraction on the enhancement map by using a second feature extraction unit; the second feature extraction unit and the first feature extraction unit have the same neural network structure;
and performing full connection operation on the feature representation output by the second feature extraction unit by using a seventh multilayer perceptron and an eighth multilayer perceptron to obtain a second feature.
3. The method of claim 2, further comprising:
updating the second parameter of the second feature extraction unit according to the updated first parameter;
updating parameters of the seventh multilayer perceptron and the eighth multilayer perceptron according to the error information.
4. The method of claim 1, wherein the enhancement map comprises: a first enhancement map and a second enhancement map;
the determining error information according to the matching degree between the first feature and the second feature includes: and determining error information according to a first matching degree between the original image and the first enhanced image and a second matching degree between the original image and the second enhanced image.
5. The method of claim 1, wherein determining a training image based on the target region comprises:
cutting an original image containing a target area from the image; the first size of the original image is larger than that of the target area;
performing preset processing on the original image to obtain an intermediate image;
randomly cropping a first image having a second size from the intermediate image and enlarging the first image into a first enhancement image having a first size;
the second image having the second size is cropped from the middle map in accordance with the center of the middle map, and the second image is enlarged to a second enhancement map having the first size.
6. The method according to any one of claims 1 to 5, wherein the value of the first parameter is a first target parameter value if the error information meets a preset condition;
the method further comprises the following steps:
and carrying out migration training on the target detection model according to the first target parameter value and the labeled image data.
7. A method of object detection, the method comprising:
receiving an image to be detected;
carrying out target detection on the image to be detected by using a target detection model to obtain a corresponding detection result;
wherein the target detection model comprises: a first feature extraction unit; the training process of the target detection model comprises the following steps: extracting a target area from the image; determining a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing includes: position processing and size processing; performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature; performing second feature extraction on the enhancement map to obtain second features; and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
8. An apparatus for training an object detection model, the object detection model comprising: a first feature extraction unit, the apparatus comprising:
the region extraction module is used for extracting a target region from the image;
the training image determining module is used for determining a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original image; the enhancement processing includes: position processing and size processing;
the first feature extraction module is used for performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature;
the second feature extraction module is used for performing second feature extraction on the enhancement map to obtain second features;
the error determining module is used for determining error information according to the matching degree between the first characteristic and the second characteristic;
and the first parameter updating module is used for updating the first parameter of the first feature extraction unit according to the error information.
9. An object detection apparatus, characterized in that the apparatus comprises:
the receiving module is used for receiving an image to be detected;
the target detection module is used for carrying out target detection on the image to be detected by utilizing a target detection model so as to obtain a corresponding detection result;
wherein the target detection model comprises: a first feature extraction unit; the training process of the target detection model comprises the following steps: extracting a target region from the image; determining a training image according to the target area; the training image includes: the original image and the at least one enhancement image corresponding to the target area; the enhancement map is an image obtained by enhancing the original map; the enhancement processing includes: position processing and size processing; performing first feature extraction on the original image by using a first feature extraction unit to obtain a first feature; performing second feature extraction on the enhancement map to obtain second features; and determining error information according to the matching degree between the first feature and the second feature, and updating the first parameter of the first feature extraction unit according to the error information.
10. An electronic device, comprising: a processor; and
memory having stored thereon executable code which, when executed, causes the processor to perform the method of any of claims 1-7.
11. A machine readable medium having executable code stored thereon, which when executed, causes a processor to perform the method of any of claims 1-7.