CN114359572A - Training method and device of multi-task detection model and terminal equipment

Info

Publication number
CN114359572A
Authority
CN
China
Prior art keywords
detection
network
information
training
task
Prior art date
Legal status
Pending
Application number
CN202111416716.6A
Other languages
Chinese (zh)
Inventor
李奕润
程骏
Current Assignee
Ubtech Robotics Corp
Original Assignee
Ubtech Robotics Corp
Priority date
Filing date
Publication date
Application filed by Ubtech Robotics Corp

Landscapes

  • Image Analysis (AREA)

Abstract

The application is applicable to the technical field of image processing, and provides a training method and device for a multi-task detection model, a terminal device and a computer readable storage medium, applied to a preset multi-task detection model, wherein the multi-task detection model comprises a feature extraction network and a multi-task detection network, and the method comprises the following steps: acquiring global feature information and local feature information of a training image through the feature extraction network; inputting the global feature information into the multi-task detection network to obtain a multi-task detection result; obtaining a semantic segmentation result of the training image according to the global feature information and the local feature information; and training the multi-task detection model according to the multi-task detection result and the semantic segmentation result. By the method, the calculation amount of multi-task detection can be reduced, and the accuracy of multi-task detection can be improved.

Description

Training method and device of multi-task detection model and terminal equipment
Technical Field
The application belongs to the technical field of image processing, and particularly relates to a training method and device of a multi-task detection model, a terminal device and a computer readable storage medium.
Background
With the continuous development of artificial intelligence in recent years, deep neural networks are widely applied in the field of computer vision. For example, in the field of automatic driving applications, a target detection task based on a deep neural network may be used to detect objects such as obstacles and signboards during driving, and a lane line detection task based on a deep neural network may be used to detect lane lines on a road surface.
As described in the above example, multiple detection tasks need to be performed in the same application scenario. In the prior art, a plurality of detection tasks can be executed through a multi-task detection network: an image to be detected is input into the multi-task detection network, and the respective detection results of the plurality of detection tasks are output. However, each detection task in the existing multi-task detection network is executed independently, which requires a large amount of calculation and cannot make use of the correlated features between different detection targets, so the detection accuracy is low.
Disclosure of Invention
The embodiment of the application provides a training method and device of a multi-task detection model, a terminal device and a computer readable storage medium, which can reduce the calculation amount of multi-task detection and improve the accuracy of the multi-task detection.
In a first aspect, an embodiment of the present application provides a method for training a multi-task detection model, which is applied to a preset multi-task detection model, where the multi-task detection model includes a feature extraction network and a multi-task detection network, and the method includes:
acquiring global feature information and local feature information of a training image through the feature extraction network;
inputting the global feature information into the multitask detection network to obtain a multitask detection result;
obtaining a semantic segmentation result of the training image according to the global feature information and the local feature information;
and training the multi-task detection model according to the multi-task detection result and the semantic segmentation result.
In the embodiment of the application, the global feature information is utilized for multi-task detection, which is equivalent to the fact that a plurality of detection tasks share the global feature information. Local feature information is added during semantic segmentation, which is equivalent to consideration of detail features in the image, and then the semantic segmentation result is combined with the multi-task detection result to jointly train the multi-task detection model, so that the recognition capability of the trained multi-task detection model to different target features and the correlation capability to the different target features can be effectively improved. By the method, the detection precision of the multi-task detection model is effectively improved while the calculation amount of the multi-task detection model is reduced.
In a possible implementation manner of the first aspect, the obtaining a semantic segmentation result of the training image according to the global feature information and the local feature information includes:
performing information integration processing on the global characteristic information and the local characteristic information to obtain integrated characteristic information;
and obtaining the semantic segmentation result of the training image according to the integrated characteristic information.
In a possible implementation manner of the first aspect, the performing information integration processing on the global feature information and the local feature information to obtain integrated feature information includes:
carrying out up-sampling processing on the global feature information to obtain first processing information;
performing convolution processing on the first processing information to obtain second processing information;
and performing information splicing processing on the second processing information and the local characteristic information to obtain the integrated characteristic information.
In a possible implementation manner of the first aspect, the integrated feature information includes a probability value that each pixel point on the training image belongs to each semantic category; the semantic segmentation result comprises the semantic category to which each pixel point in the training image belongs;
the obtaining of the semantic segmentation result of the training image according to the integrated feature information includes:
for each pixel point in the training image, acquiring a first target value corresponding to the pixel point, wherein the first target value is the maximum value in probability values of the pixel point belonging to each semantic category;
and determining the semantic category corresponding to the first target value as the semantic category to which the pixel point belongs.
In one possible implementation manner of the first aspect, the multitask detection network includes a vehicle detection sub-network, and the multitask detection result includes a vehicle detection frame;
the inputting the global feature information into the multitask detection network to obtain a multitask detection result includes:
inputting the global feature information into the vehicle detection sub-network to obtain a plurality of groups of detection frame information, wherein each group of detection frame information comprises a center position, a probability value corresponding to the center position, a length value, a width value and a center offset;
and generating vehicle detection frames corresponding to each group of target frame information, wherein the target frame information is detection frame information corresponding to a second target value, and the second target value is a probability value corresponding to the central position meeting a first preset threshold value in the multiple groups of detection frame information.
In a possible implementation manner of the first aspect, the multitask detection network includes a lane line detection sub-network, and the multitask detection result includes a lane line point set;
the inputting the global feature information into the multitask detection network to obtain a multitask detection result includes:
inputting the global feature information into the lane line detection sub-network to obtain a category matrix corresponding to each group of target pixels, wherein one group of target pixels is a row of pixel points in the training image, the number of groups of target pixels is smaller than the number of rows of pixel points in the training image, and the category matrix comprises a probability value of each pixel point in the group of target pixels belonging to each lane line category;
determining a category vector corresponding to each group of target pixels according to the category matrix corresponding to each group of target pixels, wherein the category vector comprises lane line categories corresponding to the maximum probability values of each pixel point in the target pixels in the category matrix;
and generating the lane line point set by the pixel points corresponding to the preset categories in each category vector.
In a possible implementation manner of the first aspect, the multitask detection network includes a first detection sub-network and a second detection sub-network, and the multitask detection result includes a first detection result output by the first detection sub-network and a second detection result output by the second detection sub-network;
the training of the multi-task detection model according to the multi-task detection result and the semantic segmentation result comprises:
calculating a first loss value of the first detection result according to a first loss function;
calculating a second loss value of the second detection result according to a second loss function;
calculating a third loss value of the semantic segmentation result according to a third loss function;
training the multi-tasking detection model according to the first loss value, the second loss value, and the third loss value.
In a second aspect, an embodiment of the present application provides a training apparatus for a multi-task detection model, which is applied to a preset multi-task detection model, where the multi-task detection model includes a feature extraction network and a multi-task detection network, and the apparatus includes:
the characteristic extraction unit is used for acquiring global characteristic information and local characteristic information of the training image through the characteristic extraction network;
the target detection unit is used for inputting the global characteristic information into the multitask detection network to obtain a multitask detection result;
the semantic segmentation unit is used for obtaining a semantic segmentation result of the training image according to the global characteristic information and the local characteristic information;
and the model training unit is used for training the multi-task detection model according to the multi-task detection result and the semantic segmentation result.
In a third aspect, an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for training a multitask detection model according to any one of the above first aspects.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the method for training a multi-task detection model according to any one of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the method for training a multitask detection model according to any one of the above first aspects.
It is understood that the beneficial effects of the second aspect to the fifth aspect can be referred to the related description of the first aspect, and are not described herein again.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without inventive effort.
FIG. 1 is a schematic diagram of a multitasking detection model provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a training method of a multi-task detection model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a model training process provided by an embodiment of the present application;
FIG. 4 is a block diagram of a training apparatus for a multi-task detection model according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise.
Referring to fig. 1, a diagram of a multitask detection model provided in the embodiment of the present application is shown. As shown in fig. 1, the multitask detection model includes a feature extraction network 11 and a multitask detection network 12. The multitask detection network is used for executing a plurality of detection tasks. The output end of the feature extraction network is connected with the input end of the multitask detection network.
For example, in an automatic line patrol (automatic cruise) application scenario of automatic driving, the detection tasks may include vehicle detection and lane line detection, i.e., ensuring that the vehicle is traveling along the lane line. In this case, the multitask detection network in the multitask detection model may include a vehicle detection sub-network and a lane line detection sub-network. Inputting an image to be detected into a feature extraction network of a multi-task detection model, outputting global feature information of the image to be detected by the feature extraction network, and respectively inputting the global feature information into a vehicle detection sub-network and a lane line detection sub-network in the multi-task detection network; the vehicle detection sub-network detects a target vehicle in the image to be detected according to the global feature information and outputs a detection frame of the target vehicle; and the lane line detection sub-network detects the lane lines in the image to be detected according to the global feature information and outputs a lane line point set. And the vehicle controller controls the target vehicle to run along the lane line according to the detection frame of the target vehicle and the lane line point set output by the multitask detection model.
In an automatic obstacle avoidance application scenario of automatic driving, the detection task may include vehicle detection and obstacle detection, that is, it is ensured that the vehicle runs while avoiding the obstacle. In this case, the multitask detection network in the multitask detection model may include a vehicle detection sub-network and an obstacle detection sub-network. Inputting an image to be detected into a feature extraction network of a multi-task detection model, outputting global feature information of the image to be detected by the feature extraction network, and respectively inputting the global feature information into a vehicle detection sub-network and an obstacle detection sub-network in the multi-task detection network; the vehicle detection sub-network detects a target vehicle in the image to be detected according to the global characteristic information and outputs a detection frame of the target vehicle; and the obstacle detection sub-network detects the obstacles in the image to be detected according to the global characteristic information and outputs a detection frame of the obstacles. And the vehicle controller controls the target vehicle to avoid the obstacle to run according to the detection frame of the target vehicle and the detection frame of the obstacle, which are output by the multitask detection model.
It should be noted that the above is only an example of the multitask detection model, and the structure of the multitask detection model is not specifically limited. In practical applications, the multi-task detection model can be used for processing a single detection task or multiple detection tasks. When processing multiple detection tasks, the multi-task detection network may include a detection sub-network corresponding to each of the multiple detection tasks.
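By way of example and not limitation, the following PyTorch-style sketch illustrates this structure; the class and attribute names (MultiTaskDetectionModel, backbone, heads) are assumptions made for illustration and do not appear in the original filing.

```python
# Illustrative sketch only (assumed PyTorch-style API): one shared feature
# extraction network feeds several per-task detection sub-networks, and only
# the sub-networks needed in the current scenario are run.
import torch
import torch.nn as nn

class MultiTaskDetectionModel(nn.Module):
    def __init__(self, backbone: nn.Module, heads: dict):
        super().__init__()
        self.backbone = backbone            # feature extraction network
        self.heads = nn.ModuleDict(heads)   # one detection sub-network per task

    def forward(self, image: torch.Tensor, tasks=None):
        global_feat = self.backbone(image)  # shared global feature information
        tasks = tasks if tasks is not None else list(self.heads.keys())
        return {name: self.heads[name](global_feat) for name in tasks}

# e.g. heads = {"vehicle": vehicle_head, "lane": lane_head}; in the line patrol
# scenario, forward(image, tasks=["vehicle", "lane"]) returns both results.
```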
In the embodiment of the application, the global feature information is utilized for multi-task detection, which is equivalent to the fact that a plurality of detection tasks share the global feature information.
In the process of training the multi-task detection model, the output result of each detection task can be comprehensively considered, so that the recognition capability of the multi-task detection model to different detection targets and the correlation capability to different target characteristics can be enhanced.
In general, in an application scenario, besides the feature of the detection target itself, other image features, such as background features, color features, etc., may also provide useful information for the detection of the target. However, in the above training method, only the output result of each detection task is considered, that is, only the feature information of the detection target itself is considered, and useful information other than the target feature is ignored.
The embodiment of the application provides a training method of a multi-task detection model, in the method, other characteristics except detection target characteristic information in an image are used for carrying out auxiliary training on the model, so that the detection precision of the model is improved. Referring to fig. 2, which is a schematic flowchart of a training method of a multi-task detection model provided in an embodiment of the present application, by way of example and not limitation, the method may include the following steps:
s201, global feature information and local feature information of the training image are obtained through a feature extraction network.
In the embodiment of the present application, the feature extraction network may include a plurality of feature extraction layers. The output of the feature extraction network is global feature information, which can include color features, texture features, shape features and the like of the image; it emphasizes the overall attributes of the image and has a high degree of correlation between features, but loses the detailed features of the image. The outputs of certain intermediate layers in the feature extraction network are acquired and taken as local feature information. The local feature information focuses on the local details of the image and can effectively make up for the image details lost by the global feature information.
Existing backbone networks such as VGG, ResNet and MobileNet can be used as the feature extraction network. The feature extraction network can be selected by balancing accuracy against calculation speed. If the accuracy of network calculation is more important, a deeper backbone network such as ResNet can be selected to extract more image features; if the calculation speed of the network is more important, a lightweight backbone network such as MobileNet can be selected, which is more suitable for embedded devices and improves the calculation speed.
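As a non-limiting illustration of S201, the sketch below takes the final stage of an existing backbone as the global feature information and an intermediate stage as the local feature information; ResNet-18 and the particular split points are assumptions made for the sketch, not requirements of the method.

```python
# Illustrative only: extract global features (deep stage) and local features
# (intermediate stage) from an existing backbone. ResNet-18 and the chosen
# split points are assumptions for the sketch.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.layer1, self.layer2 = net.layer1, net.layer2   # shallow stages
        self.layer3, self.layer4 = net.layer3, net.layer4   # deep stages

    def forward(self, x):
        x = self.stem(x)
        x = self.layer1(x)
        local_feat = self.layer2(x)    # intermediate-layer output: local detail features
        x = self.layer3(local_feat)
        global_feat = self.layer4(x)   # last-layer output: global feature information
        return global_feat, local_feat

g, l = FeatureExtractor()(torch.randn(1, 3, 224, 224))
# g: (1, 512, 7, 7) overall attributes; l: (1, 128, 28, 28) local details
```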
S202, inputting the global feature information into a multi-task detection network to obtain a multi-task detection result.
When processing a single detection task, the multi-tasking detection network may include a sub-network of detectors for the single detection task. When processing multiple detection tasks, the multi-task detection network may include a detection sub-network corresponding to each of the multiple detection tasks. Of course, a multi-tasking detection network may include multiple sub-detection networks, but in application, the output of some of the required sub-detection networks may be obtained.
For example, the multi-tasking detection network may include a vehicle detection sub-network, an obstacle detection sub-network, a lane line detection sub-network, and a pedestrian detection sub-network. Under the automatic line patrol application scene, the output results of the vehicle detection sub-network and the lane line detection sub-network can be obtained. Under the application scene of automatic obstacle avoidance, the output results of the vehicle detection sub-network and the obstacle detection sub-network can be obtained. Under the automatic avoidance application scene, the output results of the vehicle detection sub-network and the pedestrian detection sub-network can be obtained.
In the scenario of the automatic line patrol application as shown in the embodiment of fig. 1, the multitask detection network includes a vehicle detection sub-network and a lane line detection sub-network. Correspondingly, the output result of the multitask detection network comprises a vehicle detection frame and a lane line point set.
Optionally, the process of performing the vehicle detection task may include: presetting anchor points and anchor frames; acquiring a plurality of candidate detection frames of the target vehicle in the training image by taking the anchor points and the anchor frames as references; and filtering out the target detection frame from the candidate detection frames by a non-maximum suppression method.
The detection precision of the detection method depends on the preset anchor point and anchor frame, and the reliability is poor. In addition, the detection frame needs to be filtered through a non-maximum suppression algorithm, so that the calculation amount is large, and the algorithm speed is low.
In order to improve the calculation speed while ensuring the detection accuracy, in one embodiment, the process of performing the vehicle detection task may include:
inputting the global characteristic information into a vehicle detection sub-network to obtain a plurality of groups of detection frame information, wherein each group of detection frame information comprises a central position, a probability value corresponding to the central position, a length value, a width value and a central offset; and generating vehicle detection frames corresponding to each group of target frame information, wherein the target frame information is detection frame information corresponding to a second target value, and the second target value is a probability value corresponding to a center position meeting a first preset threshold value in the plurality of groups of detection frame information.
Fig. 3 is a schematic diagram of a model training process provided in the embodiment of the present application. As shown in fig. 3, the vehicle detection sub-network may include three branches: heat map detection, length and width detection, and regression detection. The heat map detection branch is used for acquiring a heat map of the training image, wherein the heat map comprises a plurality of center positions on the training image that may belong to the center of a detection frame, and the probability values corresponding to these center positions. The length and width detection branch is used for acquiring the length and width values of the detection frame of the target vehicle in the training image. The regression detection branch is used for acquiring the center offset of the detection frame of the target vehicle in the training image. The output results of the three branches are correlated, that is, a center position output by the heat map detection branch corresponds to a group of length and width values output by the length and width detection branch, and to a center offset output by the regression detection branch.
Specifically, N center positions whose probability values are greater than the first preset threshold value can be selected, together with the group of detection frame information corresponding to each of the N center positions, and the detection frames corresponding to the N groups of detection frame information are generated respectively.
For any group of target frame information, the actual center position of the detection frame is determined according to the center position and the center offset in the target frame information, and the position and shape of the detection frame are determined according to the actual center position and the length and width values. Correspondingly, the detection score of the generated detection frame is the probability value of the center position in the target frame information.
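A hedged sketch of how the three branch outputs described above might be decoded into vehicle detection frames is given below; the tensor layout, the threshold value and the function name are assumptions made for illustration, not part of the original filing.

```python
# Illustrative decoding of the three vehicle-detection branches (heat map,
# length/width, center offset) into detection frames. The tensor layout and
# the threshold value are assumptions for the sketch.
import torch

def decode_vehicle_boxes(heatmap, sizes, offsets, score_thresh=0.3):
    """heatmap: (H, W) center probabilities; sizes: (2, H, W) box height/width;
    offsets: (2, H, W) sub-pixel center offsets; returns (boxes, scores)."""
    ys, xs = torch.meshgrid(torch.arange(heatmap.shape[0]),
                            torch.arange(heatmap.shape[1]), indexing="ij")
    scores = heatmap.flatten()
    keep = scores > score_thresh                            # first preset threshold on center probability
    cy = ys.flatten()[keep] + offsets[0].flatten()[keep]    # actual center = center position + offset
    cx = xs.flatten()[keep] + offsets[1].flatten()[keep]
    bh = sizes[0].flatten()[keep]                           # length value
    bw = sizes[1].flatten()[keep]                           # width value
    boxes = torch.stack([cx - bw / 2, cy - bh / 2,
                         cx + bw / 2, cy + bh / 2], dim=1)  # (x1, y1, x2, y2)
    return boxes, scores[keep]                              # detection score = center probability value
```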
Generally, the detection process of the lane line is as follows: and classifying each pixel point in the training image to determine whether each pixel point belongs to the lane line or not and which lane line. This method is computationally intensive. To reduce the amount of computation, in one embodiment, the process of performing lane line detection tasks may include:
inputting the global feature information into a lane line detection sub-network, obtaining a category matrix corresponding to each group of target pixels, determining a category vector corresponding to each group of target pixels according to the category matrix corresponding to each group of target pixels, and generating a lane line point set by using pixel points corresponding to preset categories in each category vector.
The classification matrix comprises a probability value of each pixel point in the target pixel belonging to each lane line classification; the category vector comprises the lane line category corresponding to the maximum probability value of each pixel point in the target pixel in the category matrix.
The above method is equivalent to classifying the image along the row direction. Suppose the training image is an H × W image, where H is the number of rows of pixel points in the training image and W is the number of columns of pixel points in the training image. h rows of pixel points can be sampled from the training image, where h is smaller than H. Therefore, only the classification problems on the h sampled rows need to be processed, and the original H × W classification problems are simplified into h classification problems, which greatly reduces the calculation amount.
Optionally, among the h classification problems, each classification problem is W-dimensional. w pixel points can be further sampled from the W dimension, so that each classification problem is reduced to w dimensions, which further reduces the calculation amount.
For example, assume that the training image is a 10 × 10 image (H = W = 10), and that there are three lane line categories I, II and III, where category I indicates that a pixel point belongs to the left lane line, category II indicates that it belongs to the right lane line, and category III indicates that it does not belong to any lane line. 3 rows of pixel points are sampled (h = 3), and in each sampled row, 1 pixel point is sampled from every other pixel point (w = 5).
A category matrix corresponding to each sampled row of pixel points is obtained through the lane line detection sub-network; the specific category matrices are given as images in the original publication. In the category matrix corresponding to the 1st sampled row, the first row of values is the probability that each pixel point belongs to the left lane line, the second row of values is the probability that each pixel point belongs to the right lane line, and the third row of values is the probability that each pixel point does not belong to a lane line; the columns from left to right represent the probability values of pixel points h1w1, h1w2, h1w3, h1w4 and h1w5 respectively, where hiwj denotes the sampled pixel point in the i-th row and j-th column. Taking the maximum probability value of each pixel point along the column direction of the matrix: the maximum probability value of pixel point h1w1 is 0.8 (corresponding category III), that of h1w2 is 0.8 (category I), that of h1w3 is 0.8 (category III), that of h1w4 is 0.75 (category II), and that of h1w5 is 0.8 (category III). The category vector obtained from this category matrix is therefore [III I III II III].
Similarly, assume that the category matrices corresponding to the 2nd and 3rd sampled rows each also yield the category vector [III I III II III]. The pixel points with category I in the three category vectors form the point set of the left lane line, i.e., pixel points h1w2, h2w2 and h3w2; the pixel points with category II form the point set of the right lane line, i.e., pixel points h1w4, h2w4 and h3w4. The other pixel points represent the background.
It should be noted that the above is only an example of lane line detection, and the number of rows h and columns w of sampling is not specifically limited. Of course, the larger h and w are, the denser the sampling is, the larger the required data processing amount is, and the higher the detection precision is; the smaller h and w are, the more sparse the sampling is, the smaller the required data processing amount is, and the lower the detection accuracy is. And determining the values of h and w according to actual needs.
The method adopts the global characteristic information of the image when the lane line is detected, namely when the position of the lane line in a certain row is detected, the receptive field is the size of the whole image, so that a good effect can be realized without a complex information transmission mechanism, and the problem of inaccurate detection of the lane line caused by the small receptive field is effectively solved.
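The sketch below illustrates, under assumed shapes and names, how the category matrices of the sampled rows could be turned into category vectors and then into lane line point sets, as in the example above; it is an illustration only and not the prescribed implementation.

```python
# Illustrative only: convert per-row category matrices into category vectors
# (per-pixel argmax) and collect lane line point sets. The shapes and the
# convention that the last category means "no lane line" are assumptions.
import torch

def decode_lane_points(cls_matrices: torch.Tensor, num_lane_classes: int):
    """cls_matrices: (h, C, w) probabilities for h sampled rows, C categories
    and w sampled pixel points per row; returns {lane_class: [(row, col), ...]}."""
    cls_vectors = cls_matrices.argmax(dim=1)           # (h, w) category vector per sampled row
    lane_points = {c: [] for c in range(num_lane_classes)}
    for i, row in enumerate(cls_vectors.tolist()):
        for j, c in enumerate(row):
            if c < num_lane_classes:                    # skip the "no lane line" category
                lane_points[c].append((i, j))           # (sampled row index, sampled column index)
    return lane_points

# e.g. h=3 rows, w=5 pixels, 2 lane classes (left, right) + 1 background class:
points = decode_lane_points(torch.softmax(torch.randn(3, 3, 5), dim=1), num_lane_classes=2)
```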
And S203, obtaining a semantic segmentation result of the training image according to the global characteristic information and the local characteristic information.
In one embodiment, S203 may include the steps of:
performing information integration processing on the global characteristic information and the local characteristic information to obtain integrated characteristic information; and obtaining a semantic segmentation result of the training image according to the integrated characteristic information.
The semantic segmentation process simultaneously adopts global feature information and local feature information, which is equivalent to not only considering the overall attributes of the image such as color, texture and shape, but also considering the detail attributes of the image, and is beneficial to improving the accuracy of the semantic segmentation result.
Optionally, as shown in fig. 3, the process of integrating processing may include:
carrying out up-sampling processing on the global characteristic information to obtain first processing information; performing convolution processing on the first processing information to obtain second processing information; and performing information splicing processing on the second processing information and the local characteristic information to obtain integrated characteristic information.
Further, after the information stitching process, a convolution and upsampling operation may be performed to obtain a set of multi-dimensional images. The number of channels of the multi-dimensional image is the number of semantic categories. Correspondingly, the group of multi-dimensional images is integrated feature information, wherein the integrated feature information comprises probability values of each pixel point on the training image belonging to each semantic category.
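A hedged sketch of this integration processing follows: the global feature information is up-sampled, convolved, spliced with the local feature information, and then convolved and up-sampled again so that the number of output channels equals the number of semantic categories. The channel sizes and layer choices are assumptions made for illustration.

```python
# Illustrative fusion head: upsample global features -> convolution -> splice
# with local features -> convolution + upsampling to per-category probability
# maps. Channel sizes and kernel choices are assumptions for the sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegFusionHead(nn.Module):
    def __init__(self, global_ch: int, local_ch: int, num_classes: int):
        super().__init__()
        self.conv1 = nn.Conv2d(global_ch, local_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(local_ch * 2, num_classes, kernel_size=3, padding=1)

    def forward(self, global_feat, local_feat, image_size):
        x = F.interpolate(global_feat, size=local_feat.shape[-2:],
                          mode="bilinear", align_corners=False)  # up-sampling -> first processing information
        x = self.conv1(x)                                        # convolution -> second processing information
        x = torch.cat([x, local_feat], dim=1)                    # information splicing with local features
        x = self.conv2(x)                                        # convolution to num_classes channels
        x = F.interpolate(x, size=image_size, mode="bilinear", align_corners=False)
        return x.softmax(dim=1)  # probability of each pixel point belonging to each semantic category
```

The per-pixel semantic category can then be obtained by taking the maximum probability value over the category dimension, as described next.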
Optionally, the semantic segmentation process may include:
for each pixel point in the training image, acquiring a first target value corresponding to the pixel point, wherein the first target value is the maximum value in probability values of the pixel point belonging to each semantic category; and determining the semantic category corresponding to the first target value as the semantic category to which the pixel point belongs.
For example, assume that there are 4 semantic categories A, B, C and D, where A denotes belonging to the background, B denotes belonging to a vehicle, C denotes belonging to a lane line, and D denotes belonging to the road surface but not to a lane line. For a pixel point Cmn in the training image (the pixel point in the m-th row and n-th column), suppose the probability values of the 4 categories are 0.1, 0.2, 0.65 and 0.05 respectively. The maximum probability value 0.65 corresponds to semantic category C, which means that the semantic category of pixel point Cmn is C, i.e., it belongs to a lane line.
According to the method, the semantic category to which each pixel point in the training image belongs can be obtained, and correspondingly, the semantic segmentation result comprises the semantic category to which each pixel point in the training image belongs.
And S204, training a multi-task detection model according to the multi-task detection result and the semantic segmentation result.
Taking the example of a multitasking detection model performing two detection tasks, in one embodiment, the multitasking detection network includes a first detection subnetwork and a second detection subnetwork, and accordingly, the multitasking detection result includes a first detection result output by the first detection subnetwork and a second detection result output by the second detection subnetwork. Optionally, the training process in S204 includes:
calculating a first loss value of the first detection result according to a first loss function; calculating a second loss value of the second detection result according to a second loss function; calculating a third loss value of the semantic segmentation result according to a third loss function; and training the multi-task detection model according to the first loss value, the second loss value and the third loss value.
In the scenario of the automatic line patrol application as shown in the embodiment of fig. 1, the multitask detection network includes a vehicle detection sub-network and a lane line detection sub-network. Correspondingly, the output result of the multitask detection network comprises a vehicle detection frame and a lane line point set. The training process of the model is as follows: and acquiring a training image set, wherein the training image set comprises a plurality of training images, and inputting each training image into the multi-task detection model respectively to obtain a vehicle detection frame and a lane line point set of each training image. For each training image, calculating a first loss value corresponding to the vehicle detection frame according to a first loss function; calculating a second loss value corresponding to the lane line point set according to a second loss function; calculating a third loss value of the semantic segmentation result according to a third loss function; weighting and summing the first loss value, the second loss value and the third loss value to obtain a total loss value; and feeding back the total loss value to the feature extraction network of the multitask detection model so as to update the network parameters of the feature extraction network. And then inputting the next training image into the multi-task detection model, and extracting the features by using the updated feature extraction network.
The first loss function, the second loss function and the third loss function may be the same loss function, or different loss functions may be selected according to the characteristics of each sub-network. Commonly used loss functions are cross entropy loss functions, logarithmic loss functions, and the like.
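A hedged sketch of one joint training step under these definitions is given below; the particular loss functions, the loss weights and the assumption that the segmentation head outputs per-category probabilities are illustrative choices, not prescribed by the method.

```python
# Illustrative joint training step: one loss per detection sub-network plus a
# semantic segmentation loss, combined by weighted sum and back-propagated
# through the shared feature extraction network. Loss functions, weights and
# output formats are assumptions for the sketch.
import torch
import torch.nn.functional as F

def train_step(extractor, vehicle_head, lane_head, seg_head, optimizer,
               image, targets, weights=(1.0, 1.0, 0.5)):
    global_feat, local_feat = extractor(image)          # shared feature extraction network
    vehicle_out = vehicle_head(global_feat)              # first detection result
    lane_out = lane_head(global_feat)                    # second detection result
    seg_probs = seg_head(global_feat, local_feat, image.shape[-2:])  # semantic segmentation result

    loss1 = F.binary_cross_entropy(vehicle_out, targets["vehicle"])  # first loss value
    loss2 = F.cross_entropy(lane_out, targets["lane"])               # second loss value
    loss3 = F.nll_loss(torch.log(seg_probs.clamp_min(1e-8)), targets["seg"])  # third loss value
    total = weights[0] * loss1 + weights[1] * loss2 + weights[2] * loss3      # weighted sum

    optimizer.zero_grad()
    total.backward()      # feed the total loss back to update the shared network parameters
    optimizer.step()
    return total.item()
```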
In another embodiment, a gradient descent method can be adopted to train the multi-task detection model according to the multi-task detection result and the semantic segmentation result.
In the embodiment of the application, the local feature information is added during semantic segmentation, which is equivalent to considering the detail features in the image, and then the semantic segmentation result is combined with the multi-task detection result to jointly train the multi-task detection model, so that the recognition capability of the trained multi-task detection model to different target features and the correlation capability to different target features can be effectively improved. By the method, the detection precision of the multi-task detection model is effectively improved while the calculation amount of the multi-task detection model is reduced.
It should be noted that the method of the embodiment of the present application only uses semantic segmentation for assistance in the training phase, and the semantic segmentation task is deleted in the detection phase. Therefore, the detection precision of the trained multi-task detection model can be ensured, and the running speed in the detection stage is not influenced.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Fig. 4 is a block diagram of a training apparatus for a multi-task detection model provided in an embodiment of the present application, and only shows portions related to the embodiment of the present application for convenience of description.
Referring to fig. 4, the apparatus includes:
and a feature extraction unit 41, configured to obtain global feature information and local feature information of the training image through the feature extraction network.
And the target detection unit 42 is configured to input the global feature information into the multitask detection network to obtain a multitask detection result.
And a semantic segmentation unit 43, configured to obtain a semantic segmentation result of the training image according to the global feature information and the local feature information.
And the model training unit 44 is used for training the multi-task detection model according to the multi-task detection result and the semantic segmentation result.
Optionally, the semantic segmentation unit 43 is further configured to:
performing information integration processing on the global characteristic information and the local characteristic information to obtain integrated characteristic information; and obtaining the semantic segmentation result of the training image according to the integrated characteristic information.
Optionally, the semantic segmentation unit 43 is further configured to:
carrying out up-sampling processing on the global feature information to obtain first processing information; performing convolution processing on the first processing information to obtain second processing information; and performing information splicing processing on the second processing information and the local characteristic information to obtain the integrated characteristic information.
Optionally, the integrated feature information includes a probability value that each pixel point on the training image belongs to each semantic category; and the semantic segmentation result comprises the semantic category to which each pixel point in the training image belongs.
Correspondingly, the semantic segmentation unit 43 is further configured to:
for each pixel point in the training image, acquiring a first target value corresponding to the pixel point, wherein the first target value is the maximum value in probability values of the pixel point belonging to each semantic category; and determining the semantic category corresponding to the first target value as the semantic category to which the pixel point belongs.
Optionally, the multitask detection network includes a vehicle detection sub-network, and the multitask detection result includes a vehicle detection frame.
Accordingly, the object detection unit 42 is further configured to:
inputting the global feature information into the vehicle detection sub-network to obtain a plurality of groups of detection frame information, wherein each group of detection frame information comprises a center position, a probability value corresponding to the center position, a length value, a width value and a center offset;
and generating vehicle detection frames corresponding to each group of target frame information, wherein the target frame information is detection frame information corresponding to a second target value, and the second target value is a probability value corresponding to the central position meeting a first preset threshold value in the multiple groups of detection frame information.
Optionally, the multitask detecting network includes a lane line detecting sub-network, and the multitask detecting result includes a lane line point set.
Accordingly, the object detection unit 42 is further configured to:
inputting the global feature information into the lane line detection sub-network to obtain a category matrix corresponding to each group of target pixels, wherein one group of target pixels is a row of pixel points in the training image, the number of groups of target pixels is smaller than the number of rows of pixel points in the training image, and the category matrix comprises a probability value of each pixel point in the group of target pixels belonging to each lane line category;
determining a category vector corresponding to each group of target pixels according to the category matrix corresponding to each group of target pixels, wherein the category vector comprises lane line categories corresponding to the maximum probability values of each pixel point in the target pixels in the category matrix;
and generating the lane line point set by the pixel points corresponding to the preset categories in each category vector.
Optionally, the multitask detection network includes a first detection sub-network and a second detection sub-network, and the multitask detection result includes a first detection result output by the first detection sub-network and a second detection result output by the second detection sub-network.
Accordingly, the model training unit 44 is further configured to:
calculating a first loss value of the first detection result according to a first loss function; calculating a second loss value of the second detection result according to a second loss function; calculating a third loss value of the semantic segmentation result according to a third loss function; training the multi-tasking detection model according to the first loss value, the second loss value, and the third loss value.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
In addition, the training apparatus of the multitask detection model shown in fig. 4 may be a software unit, a hardware unit, or a combination of software and hardware unit that is built in the existing terminal device, may be integrated into the terminal device as an independent pendant, or may exist as an independent terminal device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Fig. 5 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 5, the terminal device 5 of this embodiment includes: at least one processor 50 (only one shown in fig. 5), a memory 51, and a computer program 52 stored in the memory 51 and executable on the at least one processor 50, wherein the processor 50 executes the computer program 52 to implement the steps in any of the above-described embodiments of the method for training a multi-tasking detection model.
The terminal device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The terminal device may include, but is not limited to, a processor, a memory. Those skilled in the art will appreciate that fig. 5 is only an example of the terminal device 5, and does not constitute a limitation to the terminal device 5, and may include more or less components than those shown, or combine some components, or different components, such as an input-output device, a network access device, and the like.
The processor 50 may be a Central Processing Unit (CPU), or it may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may in some embodiments be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing an operating system, an application program, a Boot Loader (Boot Loader), data, and other programs, such as program codes of the computer programs. The memory 51 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to an apparatus/terminal device, recording medium, computer Memory, Read-Only Memory (ROM), Random-Access Memory (RAM), electrical carrier wave signals, telecommunications signals, and software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A training method of a multi-task detection model is applied to a preset multi-task detection model, wherein the multi-task detection model comprises a feature extraction network and a multi-task detection network, and the method comprises the following steps:
acquiring global feature information and local feature information of a training image through the feature extraction network;
inputting the global characteristic information into the multitask detection network to obtain a multitask detection result;
obtaining a semantic segmentation result of the training image according to the global feature information and the local feature information;
and training the multi-task detection model according to the multi-task detection result and the semantic segmentation result.
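As an illustrative, non-limiting sketch of the training step recited in claim 1, the following PyTorch-style code shows one way the two feature maps and the joint loss could be wired together. The backbone, the detection heads with a `.loss()` method, the loss choices, and all variable names are assumptions introduced here for illustration only; they are not asserted to be the claimed implementation.

```python
import torch.nn.functional as F

def training_step(backbone, det_heads, seg_head, image, targets, optimizer):
    """One illustrative optimisation step for a multi-task detection model.

    Assumptions: `backbone` returns a shallow (local) and a deep (global)
    feature map; `det_heads` is a dict of task-specific detection heads that
    consume only the global features and expose a `.loss()` method;
    `seg_head` fuses global and local features into segmentation logits.
    """
    local_feat, global_feat = backbone(image)

    # Multi-task detection results from the global features only
    det_outputs = {name: head(global_feat) for name, head in det_heads.items()}

    # Semantic segmentation result from global + local features
    seg_logits = seg_head(global_feat, local_feat)

    # Joint loss: one term per detection task plus a segmentation term
    loss = sum(head.loss(det_outputs[name], targets[name])
               for name, head in det_heads.items())
    loss = loss + F.cross_entropy(seg_logits, targets["seg_mask"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```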
2. The training method of the multi-task detection model according to claim 1, wherein the obtaining of the semantic segmentation result of the training image according to the global feature information and the local feature information comprises:
performing information integration processing on the global feature information and the local feature information to obtain integrated feature information;
and obtaining the semantic segmentation result of the training image according to the integrated feature information.
3. The training method of the multi-task detection model according to claim 2, wherein the step of performing information integration processing on the global feature information and the local feature information to obtain the integrated feature information comprises:
carrying out up-sampling processing on the global feature information to obtain first processing information;
performing convolution processing on the first processing information to obtain second processing information;
and performing information concatenation (splicing) processing on the second processing information and the local feature information to obtain the integrated feature information.
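As a non-limiting illustration of the integration step in claim 3, the sketch below upsamples the global features to the resolution of the local features, convolves them, and concatenates the result with the local features along the channel dimension. The channel counts, interpolation mode, and kernel size are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureIntegration(nn.Module):
    """Upsample -> convolve -> concatenate (sketch of the step in claim 3)."""

    def __init__(self, global_channels: int = 256, mid_channels: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(global_channels, mid_channels, kernel_size=3, padding=1)

    def forward(self, global_feat: torch.Tensor, local_feat: torch.Tensor) -> torch.Tensor:
        # First processing information: global features up-sampled to the local resolution
        first = F.interpolate(global_feat, size=local_feat.shape[-2:],
                              mode="bilinear", align_corners=False)
        # Second processing information: convolution of the up-sampled features
        second = self.conv(first)
        # Integrated feature information: channel-wise concatenation with the local features
        return torch.cat([second, local_feat], dim=1)
```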
4. The method of claim 2, wherein the integrated feature information comprises a probability value of each pixel point on the training image belonging to each semantic category; the semantic segmentation result comprises the semantic category to which each pixel point in the training image belongs;
the obtaining of the semantic segmentation result of the training image according to the integrated feature information includes:
for each pixel point in the training image, acquiring a first target value corresponding to the pixel point, wherein the first target value is the maximum value in probability values of the pixel point belonging to each semantic category;
and determining the semantic category corresponding to the first target value as the semantic category to which the pixel point belongs.
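Claim 4 reduces per-pixel category probabilities to a label map by selecting, for each pixel, the category with the maximum probability (the first target value). A minimal sketch, assuming the integrated feature information has already been mapped to an (N, C, H, W) probability tensor:

```python
import torch

def segmentation_result(class_probs: torch.Tensor) -> torch.Tensor:
    """class_probs: (N, C, H, W) probability of each pixel belonging to each category.

    Returns an (N, H, W) map whose entries are the indices of the categories
    with the maximum probability at each pixel.
    """
    return class_probs.argmax(dim=1)
```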
5. The training method of the multi-task detection model according to claim 1, wherein the multi-task detection network comprises a vehicle detection sub-network, and the multi-task detection result comprises a vehicle detection frame;
the inputting the global feature information into the multi-task detection network to obtain the multi-task detection result comprises:
inputting the global feature information into the vehicle detection sub-network to obtain a plurality of groups of detection frame information, wherein each group of detection frame information comprises a center position, a probability value corresponding to the center position, a length value, a width value and a center offset;
and generating a vehicle detection frame corresponding to each group of target frame information, wherein the target frame information is the detection frame information corresponding to a second target value, and the second target value is a probability value which corresponds to the center position and meets a first preset threshold value among the multiple groups of detection frame information.
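The decoding described in claim 5 resembles keypoint-style detectors: cells whose center probability passes a threshold are kept, and a frame is reconstructed from the predicted length, width, and center offset. A hedged sketch under assumed tensor layouts, threshold, and output stride; none of these values come from the specification.

```python
import torch

def decode_vehicle_frames(center_prob, size, offset, score_thresh=0.3, stride=4):
    """center_prob: (H, W) probability that a cell is a frame center.
    size:        (2, H, W) predicted length and width per cell.
    offset:      (2, H, W) sub-cell center offset per cell.
    Returns a list of (x1, y1, x2, y2, score) frames in input-image coordinates.
    """
    frames = []
    # Cells whose center probability meets the threshold ("second target values")
    ys, xs = torch.nonzero(center_prob > score_thresh, as_tuple=True)
    for y, x in zip(ys.tolist(), xs.tolist()):
        cx = (x + offset[0, y, x].item()) * stride
        cy = (y + offset[1, y, x].item()) * stride
        w = size[0, y, x].item() * stride
        h = size[1, y, x].item() * stride
        frames.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2,
                       center_prob[y, x].item()))
    return frames
```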
6. The training method of the multi-task detection model according to claim 1, wherein the multi-task detection network comprises a lane line detection sub-network, and the multi-task detection result comprises a lane line point set;
the inputting the global feature information into the multi-task detection network to obtain the multi-task detection result comprises:
inputting the global feature information into the lane line detection sub-network to obtain a category matrix corresponding to each group of target pixels, wherein each group of target pixels is one row of pixel points in the training image, the number of groups of target pixels is smaller than the number of rows of pixel points in the training image, and the category matrix comprises a probability value of each pixel point in the group of target pixels belonging to each lane line category;
determining a category vector corresponding to each group of target pixels according to the category matrix corresponding to the group of target pixels, wherein the category vector comprises, for each pixel point in the group of target pixels, the lane line category corresponding to the maximum probability value in the category matrix;
and generating the lane line point set from the pixel points corresponding to preset categories in each category vector.
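Claim 6 amounts to row-wise classification: only a subset of image rows is processed, every sampled pixel receives the lane line category with the highest probability, and the pixels whose category is one of the preset lane line categories form the lane line point set. A hedged sketch; the tensor layout and the convention that category 0 denotes background are assumptions.

```python
import torch

def decode_lane_points(row_class_probs, sampled_rows, background_class=0):
    """row_class_probs: (R, W, C) probabilities for R sampled rows, W columns,
    and C lane line categories (category 0 assumed to be background).
    sampled_rows: the R image-row indices that were sampled (R < image height).
    Returns a list of (row, column, lane_category) lane line points.
    """
    # Category vector per sampled row: argmax over the lane line categories
    class_vectors = row_class_probs.argmax(dim=-1)  # shape (R, W)
    points = []
    for r_idx, row in enumerate(sampled_rows):
        for col in range(class_vectors.shape[1]):
            category = int(class_vectors[r_idx, col])
            if category != background_class:  # keep only preset lane line categories
                points.append((row, col, category))
    return points
```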
7. The training method of the multi-task detection model according to claim 1, wherein the multi-task detection network comprises a first detection sub-network and a second detection sub-network, and the multi-task detection result comprises a first detection result output by the first detection sub-network and a second detection result output by the second detection sub-network;
the training of the multi-task detection model according to the multi-task detection result and the semantic segmentation result comprises:
calculating a first loss value of the first detection result according to a first loss function;
calculating a second loss value of the second detection result according to a second loss function;
calculating a third loss value of the semantic segmentation result according to a third loss function;
training the multi-task detection model according to the first loss value, the second loss value, and the third loss value.
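As a non-limiting illustration of claim 7, the joint objective could be a weighted sum of the three loss values. The particular loss functions and the weights below are assumptions chosen only to make the sketch runnable.

```python
import torch.nn.functional as F

def total_loss(first_out, first_target,
               second_out, second_target,
               seg_logits, seg_mask,
               w1=1.0, w2=1.0, w3=1.0):
    """Combine three loss values into one training objective (illustrative choices)."""
    loss1 = F.binary_cross_entropy_with_logits(first_out, first_target)    # first detection task
    loss2 = F.binary_cross_entropy_with_logits(second_out, second_target)  # second detection task
    loss3 = F.cross_entropy(seg_logits, seg_mask)                          # semantic segmentation
    return w1 * loss1 + w2 * loss2 + w3 * loss3
```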
8. A training device for a multi-task detection model, applied to a preset multi-task detection model, wherein the multi-task detection model comprises a feature extraction network and a multi-task detection network, and the device comprises:
a feature extraction unit, configured to acquire global feature information and local feature information of a training image through the feature extraction network;
a target detection unit, configured to input the global feature information into the multi-task detection network to obtain a multi-task detection result;
a semantic segmentation unit, configured to obtain a semantic segmentation result of the training image according to the global feature information and the local feature information;
and a model training unit, configured to train the multi-task detection model according to the multi-task detection result and the semantic segmentation result.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111416716.6A 2021-11-25 2021-11-25 Training method and device of multi-task detection model and terminal equipment Pending CN114359572A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111416716.6A CN114359572A (en) 2021-11-25 2021-11-25 Training method and device of multi-task detection model and terminal equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111416716.6A CN114359572A (en) 2021-11-25 2021-11-25 Training method and device of multi-task detection model and terminal equipment

Publications (1)

Publication Number Publication Date
CN114359572A true CN114359572A (en) 2022-04-15

Family

ID=81096027

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111416716.6A Pending CN114359572A (en) 2021-11-25 2021-11-25 Training method and device of multi-task detection model and terminal equipment

Country Status (1)

Country Link
CN (1) CN114359572A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116630721A (en) * 2023-06-14 2023-08-22 电子科技大学中山学院 Image classification method, device, equipment and medium based on local feature completion
CN116630721B (en) * 2023-06-14 2024-02-13 电子科技大学中山学院 Image classification method, device, equipment and medium based on local feature completion

Similar Documents

Publication Publication Date Title
CN112528878B (en) Method and device for detecting lane line, terminal equipment and readable storage medium
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
WO2022126377A1 (en) Traffic lane line detection method and apparatus, and terminal device and readable storage medium
CN112966697B (en) Target detection method, device and equipment based on scene semantics and storage medium
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN114445430B (en) Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion
CN111931764A (en) Target detection method, target detection framework and related equipment
CN112750147A (en) Pedestrian multi-target tracking method and device, intelligent terminal and storage medium
CN114120289B (en) Method and system for identifying driving area and lane line
CN112395962A (en) Data augmentation method and device, and object identification method and system
CN115578590A (en) Image identification method and device based on convolutional neural network model and terminal equipment
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
CN114359572A (en) Training method and device of multi-task detection model and terminal equipment
CN111709377B (en) Feature extraction method, target re-identification method and device and electronic equipment
CN113378857A (en) Target detection method and device, electronic equipment and storage medium
CN111753766A (en) Image processing method, device, equipment and medium
Nguyen et al. Smart solution to detect images in limited visibility conditions based convolutional neural networks
CN111144361A (en) Road lane detection method based on binaryzation CGAN network
CN114998172A (en) Image processing method and related system
CN115100469A (en) Target attribute identification method, training method and device based on segmentation algorithm
CN113468908B (en) Target identification method and device
CN116071625B (en) Training method of deep learning model, target detection method and device
CN114882449B (en) Car-Det network model-based vehicle detection method and device
CN115588177B (en) Method for training lane line detection network, electronic device, program product and medium
US20240046601A1 (en) Deep recognition model training method, electronic device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination