CN114842449A - Target detection method, electronic device, medium, and vehicle - Google Patents

Target detection method, electronic device, medium, and vehicle

Info

Publication number
CN114842449A
CN114842449A
Authority
CN
China
Prior art keywords
network
loss
distillation
feature map
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210509045.6A
Other languages
Chinese (zh)
Inventor
Chen Jin (陈进)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Weilai Zhijia Technology Co Ltd
Original Assignee
Anhui Weilai Zhijia Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Weilai Zhijia Technology Co Ltd filed Critical Anhui Weilai Zhijia Technology Co Ltd
Priority to CN202210509045.6A priority Critical patent/CN114842449A/en
Publication of CN114842449A publication Critical patent/CN114842449A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of artificial intelligence, and in particular provides a target detection method, an electronic device, a medium, and a vehicle, aiming to solve the technical problem of low detection accuracy of target detection methods in the field of automatic vehicle driving. To this end, the target detection method of the invention includes: acquiring an image to be detected containing a target detection object; and inputting the image to be detected into a student network trained by a knowledge distillation method to obtain the target detection object. Training the student network by the knowledge distillation method includes: inputting a traffic scene training image into a trained teacher network to obtain a first feature map and a dynamic region; inputting the traffic scene training image into a student network to be trained to obtain a second feature map and a detection loss part; determining a distillation loss part based on the first feature map, the second feature map, and the dynamic region; and training the student network based on the distillation loss part and the detection loss part. The target detection accuracy is thereby improved.

Description

Target detection method, electronic device, medium, and vehicle
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular provides a target detection method, an electronic device, a medium, and a vehicle.
Background
Currently, in the field of automatic vehicle driving, the detection of objects such as pedestrians, traffic lights, obstacles, and lane lines is implemented based on network models. Existing network models usually adopt efficient lightweight models. When a lightweight model is trained by knowledge distillation, the constraint between the lightweight model and the teacher network is weak, and a large amount of training data remains under-fitted; the limited expressive capacity of the lightweight model makes it difficult to learn effective features from the teacher network. As a result, existing distillation methods perform poorly, the target detection accuracy in automatic driving is low, and practical requirements are difficult to meet.
Accordingly, there is a need in the art for a new target detection scheme to address the above-mentioned problems.
Disclosure of Invention
In order to overcome the above drawbacks, the present invention is proposed to solve, or at least partially solve, the technical problem of low detection accuracy of existing target detection methods in the field of automatic vehicle driving. The invention provides a target detection method, an electronic device, a medium, and a vehicle.
In a first aspect, the present invention provides a target detection method comprising the steps of: acquiring an image to be detected containing a target detection object; inputting the image to be detected into a student network trained by a knowledge distillation method to obtain the target detection object; wherein training the student network using the knowledge distillation method comprises: inputting a traffic scene training image into a trained teacher network to obtain a first feature map and a dynamic region, wherein the teacher network comprises a first feature pyramid network and a region candidate network, the first feature pyramid network outputs the first feature map, and the region candidate network takes the first feature map as input and outputs the dynamic region; inputting the traffic scene training image into a student network to be trained to obtain a second feature map and a detection loss part, wherein the student network comprises a second feature pyramid network, the second feature pyramid network outputs the second feature map, and the student network obtains the detection loss part based on a real label; determining a distillation loss part based on the first feature map, the second feature map, and the dynamic region; and training the student network based on the distillation loss part and the detection loss part.
In one embodiment, the region candidate network taking the first feature map as input and outputting the dynamic region includes: normalizing the output V(F) of the region candidate network to obtain a normalized feature map:
Map = σ(V(F));
and thresholding the normalized feature map to generate the dynamic region:
WeightedMap_{x,y} = Map_{x,y}, if Map_{x,y} > threshold; WeightedMap_{x,y} = 0, otherwise;
where σ(·) denotes a normalization function, WeightedMap_{x,y} denotes the weight of the dynamic region at coordinates (x, y), and threshold denotes the region threshold.
In one embodiment, determining the distillation loss part based on the first feature map, the second feature map, and the dynamic region includes: weighting a distillation loss function using the dynamic region, wherein the distillation loss function is used to constrain the first feature map and the second feature map to remain consistent.
In one embodiment, weighting the distillation loss function using the dynamic region includes:
LOSS = (1/N) · Σ_{x,y} WeightedMap_{x,y} · (F_{t,x,y} − F_{s,x,y})²;
where LOSS denotes the weighted square loss function, F_{t,x,y} denotes the feature of the first feature map at coordinates (x, y), F_{s,x,y} denotes the feature of the second feature map at coordinates (x, y), WeightedMap_{x,y} denotes the weight of the dynamic region at coordinates (x, y), and N is the product of the length and width of the second feature map.
In one embodiment, training the student network based on the distillation loss part and the detection loss part comprises: taking the sum of the distillation loss part and the detection loss part as the global loss of the student network; and adjusting network parameters of the student network based on the global loss until the global loss meets a preset condition.
In one embodiment, before determining the distillation loss part based on the first feature map, the second feature map, and the dynamic region, the method further comprises: inputting the second feature map into a FitNet network so that the second feature map is aligned with the first feature map.
In one embodiment, the region candidate network is any one of RPN, R-CNN, and Faster R-CNN.
In a second aspect, an electronic device is provided, comprising a processor and a storage device adapted to store a plurality of program codes, the program codes being adapted to be loaded and run by the processor to perform the target detection method of any of the preceding embodiments.
In a third aspect, a computer-readable storage medium is provided, having stored therein a plurality of program codes adapted to be loaded and run by a processor to perform the target detection method of any of the preceding embodiments.
In a fourth aspect, a vehicle is provided, comprising the electronic device described above.
One or more technical schemes of the invention at least have one or more of the following beneficial effects:
the invention provides a target detection method, which mainly inputs an acquired image to be detected into a trained student network so as to obtain a target detection object, in the course of training the student network, the distillation loss part is determined by obtaining the dynamic region from the teacher network, the first characteristic diagram and the second characteristic diagram from the student network, further, the distillation loss part and the detection loss part corresponding to the student network are utilized to distill the student network, the dynamic area with confidence coefficient guides the weight in the distillation, the fitting of the student network to the background noise is effectively avoided, meanwhile, the study on the characteristics of the detected object is enhanced, the distillation effect is improved, the trained student network has higher precision, when the method is used for target detection in the field of automatic driving, the method is beneficial to improving the detection precision of the target object obtained by detection, thereby realizing the effect of safe driving.
The output of the region candidate network is normalized to obtain a normalized feature map, and the normalized feature map is thresholded to generate the dynamic region, which distinguishes the foreground from the background and provides technical support for determining the distillation loss function later. In addition, the normalization processing can further improve the distillation effect and increase the training accuracy of the student network.
Drawings
The disclosure of the present invention will become more readily understood with reference to the accompanying drawings. As is readily understood by those skilled in the art: these drawings are for illustrative purposes only and are not intended to constitute a limitation on the scope of the present invention. Moreover, in the drawings, like numerals are used to indicate like parts, and in which:
FIG. 1 is a schematic flow chart of the main steps of a target detection method according to one embodiment of the present invention;
FIG. 2 is a schematic overall flow diagram of the knowledge distillation method according to one embodiment of the present invention;
fig. 3 is a schematic structural diagram of an RPN network according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the invention.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module" or "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, and memory, may comprise software components such as program code, or may be a combination of software and hardware. The processor may be a central processing unit, microprocessor, image processor, digital signal processor, or any other suitable processor. The processor has data and/or signal processing functionality. The processor may be implemented in software, hardware, or a combination thereof. Non-transitory computer-readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, random-access memory, and the like. The term "A and/or B" denotes all possible combinations of A and B, such as A alone, B alone, or A and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone, or both A and B. The singular forms "a", "an", and "the" may include the plural forms as well.
Currently, the detection of objects such as pedestrians, traffic lights, obstacles, and lane lines is implemented based on network models. Existing network models usually adopt efficient lightweight models. When a lightweight model is trained by knowledge distillation, the constraint between the lightweight model and the teacher network is weak, a large amount of training data remains under-fitted, and the limited expressive capacity of the lightweight model makes it difficult to learn effective features from the teacher network, so the model distillation effect is poor and the target detection accuracy in automatic driving is low. Therefore, the present application provides a target detection method and a storage medium. The method inputs an acquired image to be detected into a trained student network to obtain a target detection object. Specifically, in the process of training the student network, a distillation loss part is determined from the dynamic region and first feature map obtained from the teacher network and the second feature map obtained from the student network, and the student network is then trained using the distillation loss part and the detection loss part corresponding to the student network.
Referring to FIGS. 1 and 2, FIG. 1 illustrates a schematic flow diagram of the main steps of the target detection method according to one embodiment of the present invention. As shown in FIG. 1, the target detection method in the embodiment of the present invention mainly includes the following steps S11 to S12.
Step S11: acquiring an image to be detected containing a target detection object.
In one embodiment, the target detection object may be an obstacle, a pedestrian, a traffic light, a lane line, and the like, but is not limited thereto.
The hardware used for acquiring the image to be detected can comprise a vehicle-mounted camera, Lidar, Radar and the like.
Step S12: inputting the image to be detected into a student network trained by a knowledge distillation method to obtain a target detection object, wherein the training of the student network by the knowledge distillation method can be realized by the following steps S101 to S104.
Step S101: inputting the traffic scene training image into a trained teacher network to obtain a first feature map and a dynamic region, wherein the teacher network comprises a first feature pyramid network and a region candidate network, the first feature pyramid network outputs the first feature map, and the region candidate network inputs the first feature map and outputs the dynamic region.
Generally, traffic scene training images are traffic scene image data used for training the network model; these data include manually annotated labels such as object classes and positions.
In the automatic driving scene, the target detection model serving as the student network can be used for detecting target objects such as obstacles, pedestrians, traffic lights, lane lines and the like, so the traffic scene training image in the application can be an acquired training image containing the target objects such as the obstacles, the pedestrians, the traffic lights, the lane lines and the like.
Specifically, as shown in FIG. 2, the teacher network in the present application includes a backbone network, the first feature pyramid network, and the region candidate network, where the first feature map output by the first feature pyramid network passes through the region candidate network to obtain a binary classification result, namely background or foreground, for each position of the feature map.
In one embodiment, the region candidate network in the present application may be any one of RPN, R-CNN, and Faster R-CNN.
Taking the RPN as an example, as shown in FIG. 3, the RPN includes a convolutional layer, a classifier, and a regressor. The feature map output by the feature pyramid network is first convolved by the convolutional layer, and after the convolved features pass through the two branches of the classifier and the regressor, the position of the foreground is finally obtained.
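The two-branch structure described above can be sketched as follows. This is only an illustrative NumPy reduction (the function and parameter names such as `rpn_head` and `w_shared` are invented for the sketch), with the shared convolution simplified to a per-position 1×1 product rather than the 3×3 convolution a real RPN uses:

```python
import numpy as np

def rpn_head(feature_map, w_shared, w_cls, w_reg):
    """Illustrative RPN-style head: a shared convolution followed by a
    classifier branch (foreground/background score) and a regressor
    branch (box offsets), both reduced to 1x1 per-position products."""
    c, h, w = feature_map.shape
    x = feature_map.reshape(c, -1)                    # (C, H*W): flatten spatial grid
    shared = np.maximum(w_shared @ x, 0.0)            # shared 1x1 conv + ReLU
    cls_scores = (w_cls @ shared).reshape(-1, h, w)   # classifier branch output
    box_deltas = (w_reg @ shared).reshape(-1, h, w)   # regressor branch output
    return cls_scores, box_deltas
```

Applied to an 8-channel feature map, the head yields a per-position foreground score map and a box-offset map of the same spatial size.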
In a specific embodiment, in the process of inputting the first feature map into the region candidate network and outputting the dynamic region, the output V(F) of the region candidate network is first normalized to obtain a normalized feature map:
Map = σ(V(F));
then the normalized feature map is thresholded to generate the dynamic region:
WeightedMap_{x,y} = Map_{x,y}, if Map_{x,y} > threshold; WeightedMap_{x,y} = 0, otherwise;
where σ(·) denotes a normalization function, WeightedMap_{x,y} denotes the weight of the dynamic region at coordinates (x, y), and threshold denotes the region threshold. Specifically, positions where WeightedMap_{x,y} is greater than zero belong to the distillation dynamic region, and a larger value indicates a higher confidence for the dynamic region, so that during model distillation the student network can better learn the features of the teacher network. Positions where WeightedMap_{x,y} equals zero are usually noise and do not belong to the distillation dynamic region, so learning of the noise by the student network can be effectively avoided.
As will be readily understood by those skilled in the art, in the example where the region candidate network is an RPN, the output V(F) of the region candidate network is denoted RPN(F).
By normalizing the output of the region candidate network to obtain the normalized feature map and thresholding the normalized feature map to generate the dynamic region, the present application distinguishes the foreground from the background and provides technical support for determining the distillation loss function later. In addition, the normalization processing can further improve the distillation effect and increase the training accuracy of the student network.
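The normalization and thresholding steps can be sketched as follows. This is a minimal NumPy illustration in which the normalization function σ(·) is assumed to be a sigmoid, which the embodiment does not mandate, and the function name `dynamic_region` is invented:

```python
import numpy as np

def dynamic_region(v_f, threshold=0.5):
    """Sketch of the dynamic-region step: normalize the region candidate
    network output V(F), then zero out positions at or below the region
    threshold; surviving values act as confidence weights."""
    normalized_map = 1.0 / (1.0 + np.exp(-v_f))  # Map = sigma(V(F)), sigmoid assumed
    return np.where(normalized_map > threshold, normalized_map, 0.0)
```

Positions whose normalized score does not exceed the threshold are suppressed to zero, so they contribute nothing when the map is later used as a distillation weight.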
Step S102: inputting the traffic scene training image into the student network to be trained to obtain a second feature map and a detection loss part, wherein the student network comprises a second feature pyramid network, the second feature pyramid network outputs the second feature map, and the student network obtains the detection loss part based on the real label.
Specifically, as shown in FIG. 2, the student network includes a backbone network, the second feature pyramid network, and a prediction network. The second feature map output by the second feature pyramid network passes through the prediction network to produce a prediction result, and the prediction result and the real label (ground truth) can be used to determine the detection loss part of the student network.
In addition, the second feature map output by the second feature pyramid network is also used for calculating the distillation loss part.
Step S103: determining the distillation loss part based on the first feature map, the second feature map, and the dynamic region.
Specifically, the distillation loss part can be calculated using a squared loss function (squared L2 loss) or a maximum mean discrepancy loss function (MMD loss).
In one embodiment of the present application, this step includes: weighting the distillation loss function using the dynamic region as a weight to constrain the first feature map to be consistent with the second feature map, thereby determining the distillation loss part.
In one embodiment, taking the squared loss function as an example, weighting the distillation loss function using the dynamic region includes:
LOSS = (1/N) · Σ_{x,y} WeightedMap_{x,y} · (F_{t,x,y} − F_{s,x,y})²;
where LOSS denotes the weighted square loss function, F_{t,x,y} denotes the feature of the first feature map at coordinates (x, y), F_{s,x,y} denotes the feature of the second feature map at coordinates (x, y), WeightedMap_{x,y} denotes the weight of the dynamic region at coordinates (x, y), and N is the product of the length and width of the second feature map.
Because the distillation loss function is weighted by the dynamic region, positions with larger weights are optimized more strongly and positions with zero weight are not optimized, which improves the learning ability of the student network.
Compared with the prior art, in which the loss function is determined directly from the feature maps output by the teacher network and the student network, using the dynamic region as a weight to guide the distillation of the student network improves the fitting ability of the student network to the teacher network, avoids fitting of the student network to background noise, and allows a high-accuracy student network to be trained.
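The weighted squared loss above can be computed directly. The following NumPy sketch (function name invented) implements the formula for a single-channel feature map:

```python
import numpy as np

def weighted_distillation_loss(f_teacher, f_student, weighted_map):
    """Weighted squared distillation loss over single-channel feature maps:
    LOSS = (1/N) * sum_{x,y} WeightedMap_{x,y} * (F_t - F_s)^2,
    where N is the product of the feature map's length and width."""
    n = f_student.shape[-2] * f_student.shape[-1]  # N = H * W
    return float(np.sum(weighted_map * (f_teacher - f_student) ** 2) / n)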
Step S104: training the student network based on the distillation loss part and the detection loss part.
In one embodiment, in the course of training the student network based on the distillation loss part and the detection loss part, the sum of the distillation loss part and the detection loss part may first be taken as the global loss of the student network; the network parameters of the student network are then adjusted based on the global loss (for example, by back propagation) until the global loss meets a preset condition, that is, until the global loss converges, at which point the training of the student network is complete.
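The global-loss training procedure can be sketched as a simple loop. The convergence test below (successive global losses differing by less than a tolerance) is one assumed form of the "preset condition", and `train_step` stands in for a full forward/backward pass over the teacher and student networks:

```python
def train_student(train_step, max_steps=10_000, tol=1e-4):
    """Sketch of the training loop: each step returns the distillation
    loss part and the detection loss part; their sum is the global loss,
    and training stops once the global loss converges."""
    previous = float("inf")
    for step in range(max_steps):
        distill_loss, detect_loss = train_step(step)  # one pass + parameter update (assumed)
        global_loss = distill_loss + detect_loss      # sum of the two loss parts
        if abs(previous - global_loss) < tol:         # assumed preset condition
            return step, global_loss
        previous = global_loss
    return max_steps, previous
```

With any step function whose losses decay, the loop terminates well before `max_steps`.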
Based on the above steps S11 to S12, the acquired image to be detected is input into the trained student network to obtain the target detection object. In the process of training the student network, the dynamic region and first feature map obtained from the teacher network and the second feature map obtained from the student network are used to determine the distillation loss part, and the distillation loss part and the detection loss part corresponding to the student network are used to distill the student network. This improves the ability of the student network to learn from the teacher network and improves the distillation effect, so the trained student network has higher accuracy. When the method is used for target detection in the field of automatic driving, the detection accuracy of the detected target object is improved, thereby contributing to safe driving.
In one embodiment, before determining the distillation loss part based on the first feature map, the second feature map, and the dynamic region, the method of the present application further comprises: inputting the second feature map into the FitNet network so that the second feature map is aligned with the first feature map, as shown in FIG. 2.
Generally speaking, because the teacher network and the student network have different network structures, the height, width, number of channels, and so on of the feature maps output by the feature pyramid networks in the teacher network and the student network may differ, so a FitNet network is required to align the two feature maps, thereby further improving the distillation effect of the model.
Specifically, the FitNet network can be formed by sequentially connecting a plurality of convolutional or deconvolutional (transposed convolution) layers; by setting parameters such as the number of channels and the stride of the convolutional layers, the FitNet network adjusts the size of the second feature map output by the student network, thereby further improving the distillation effect of the model.
The FitNet network aligns the second feature map with the first feature map, the distillation loss part is calculated based on the aligned second feature map, the first feature map, and the dynamic region, and the student network is finally trained by combining the distillation loss part and the detection loss part, which improves the distillation effect of the student model.
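A channel-aligning adapter of this kind can be sketched as a 1×1 convolution. The following NumPy illustration (names invented) assumes the spatial sizes already match and only the channel counts differ; a real adapter may also use strided or transposed convolutions to match height and width:

```python
import numpy as np

def fitnet_align(student_map, projection):
    """Sketch of a FitNet-style adapter: a 1x1 convolution (per-position
    channel projection) mapping the student feature map's channel count
    to the teacher's, with spatial size unchanged."""
    c_s, h, w = student_map.shape
    x = student_map.reshape(c_s, -1)           # (C_s, H*W)
    return (projection @ x).reshape(-1, h, w)  # (C_t, H, W), channel-aligned
```

After alignment, the student map can be compared element-wise with the teacher map inside the weighted distillation loss.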
It should be noted that, although the foregoing embodiments describe each step in a specific sequence, those skilled in the art will understand that, in order to achieve the effect of the present invention, different steps do not necessarily need to be executed in such a sequence, and they may be executed simultaneously (in parallel) or in other sequences, and these changes are all within the protection scope of the present invention.
In addition, it should be noted that the above is described with an automatic driving scenario as an example; however, the present application is not limited thereto, and the target detection method provided by the present application is also applicable to other scenarios in which target detection is performed by artificial intelligence.
It can be understood by those skilled in the art that all or part of the flow of the methods of the above embodiments of the present invention may also be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, may implement the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include any entity or device capable of carrying the computer program code: a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable storage medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions the computer-readable storage medium does not include electrical carrier signals and telecommunication signals.
Further, the present invention also provides an electronic device. As shown in FIG. 4, in an embodiment of the electronic device according to the present invention, the electronic device includes a processor 70 and a storage device 71. The storage device may be configured to store a program for executing the knowledge distillation method or the target detection method of the above method embodiments, and the processor may be configured to run the program in the storage device, the program including but not limited to the program for executing the knowledge distillation method or the target detection method of the above method embodiments. For convenience of explanation, only the parts related to the embodiments of the present invention are shown, and specific technical details are not disclosed.
Further, the invention also provides a computer readable storage medium. In one computer-readable storage medium embodiment according to the present invention, the computer-readable storage medium may be configured to store a program that executes the knowledge distillation method or the object detection method of the above-described method embodiment, and the program may be loaded and executed by a processor to implement the above-described knowledge distillation method or the object detection method. For convenience of explanation, only the parts related to the embodiments of the present invention are shown, and details of the specific techniques are not disclosed. The computer readable storage medium may be a storage device formed by including various electronic devices, and optionally, the computer readable storage medium is a non-transitory computer readable storage medium in the embodiment of the present invention.
Further, the invention also provides a vehicle comprising the above electronic device. With the vehicle including the electronic device of the present application, the positions of targets such as lane lines, traffic lights, obstacles, and pedestrians can be accurately detected.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A method of target detection, comprising the steps of:
acquiring an image to be detected containing a target detection object;
inputting the image to be detected into a student network trained by a knowledge distillation method to obtain a target detection object; wherein training the student network using a knowledge distillation method comprises:
inputting a traffic scene training image into a trained teacher network to obtain a first feature map and a dynamic region, wherein the teacher network comprises a first feature pyramid network and a region candidate network, the first feature pyramid network outputs the first feature map, and the region candidate network inputs the first feature map and outputs the dynamic region;
inputting a traffic scene training image into a student network to be trained to obtain a second feature map and a detection loss part, wherein the student network comprises a second feature pyramid network, the second feature pyramid network outputs the second feature map, and the student network obtains the detection loss part based on a real label;
determining a distillation loss part based on the first feature map, the second feature map, and the dynamic region;
training the student network based on the distillation loss part and the detection loss part.
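A minimal pure-Python sketch of this training flow may help fix the data dependencies; all function names, the flattened feature maps, and the toy loss values below are illustrative and not part of the patent:

```python
def distillation_training_step(teacher, student, rpn, image, labels):
    # Hypothetical sketch of the claim-1 flow; the three callables stand in
    # for the teacher network, the student network, and the region
    # candidate network (names are illustrative, not from the patent).
    first_map = teacher(image)          # first feature pyramid network output
    region_weights = rpn(first_map)     # dynamic region from the region candidate network
    second_map, detection_loss = student(image, labels)
    # Distill by penalizing teacher/student feature differences,
    # weighted by the dynamic region
    distillation_loss = sum(
        w * (t - s) ** 2
        for w, t, s in zip(region_weights, first_map, second_map)
    ) / len(second_map)
    # Train on the sum of the two loss parts
    return distillation_loss + detection_loss

# Toy stand-ins over a flattened 2-element feature map
loss_value = distillation_training_step(
    teacher=lambda img: [1.0, 0.0],
    student=lambda img, lbl: ([0.0, 0.0], 0.2),
    rpn=lambda fmap: [1.0, 0.0],
    image=None, labels=None,
)
```

Only the teacher's feature map and the dynamic region flow into the distillation term; the detection term comes solely from the student and the ground-truth labels.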
2. The target detection method of claim 1, wherein the region candidate network taking the first feature map as input and outputting the dynamic region comprises:
normalizing the output V(F) of the region candidate network to obtain a normalized feature map:
Map = σ(V(F));
carrying out threshold processing on the normalized feature map to generate the dynamic region:
WeightedMap_{x,y} = Map_{x,y}, if Map_{x,y} > threshold; WeightedMap_{x,y} = 0, otherwise;
wherein σ(·) denotes a normalization function, WeightedMap_{x,y} represents the weight of the dynamic region at coordinates (x, y), and threshold represents the region threshold.
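The two steps of claim 2 can be sketched in a few lines of pure Python; the sigmoid choice for σ(·), the function names, and the sample values are assumptions for illustration, not specified by the patent:

```python
import math

def dynamic_region(rpn_out, threshold=0.5):
    """Illustrative sketch of claim 2 (not the patented implementation):
    normalize the region-candidate output V(F), then keep only weights
    above the region threshold."""
    # Map = sigma(V(F)): element-wise sigmoid over the H x W response map
    norm = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in rpn_out]
    # WeightedMap_{x,y} = Map_{x,y} if Map_{x,y} > threshold, else 0
    return [[m if m > threshold else 0.0 for m in row] for row in norm]

# Positive responses survive the threshold; negative ones are zeroed out
weights = dynamic_region([[2.0, -2.0], [0.5, -0.5]], threshold=0.5)
```

The thresholding is what makes the region "dynamic": it is recomputed per image from the teacher's proposal responses rather than fixed in advance.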
3. The method of claim 1, wherein determining the distillation loss part based on the first feature map, the second feature map, and the dynamic region comprises:
weighting a distillation loss function using the dynamic region, wherein the distillation loss function is used to constrain the first feature map and the second feature map to remain consistent.
4. The method of claim 3, wherein weighting the distillation loss function using the dynamic region comprises:
LOSS = (1/N) · Σ_{x,y} WeightedMap_{x,y} · (F_{t,x,y} − F_{s,x,y})²
wherein LOSS represents the weighted square loss function, F_{t,x,y} represents the feature of the first feature map at coordinate position (x, y), F_{s,x,y} represents the feature of the second feature map at coordinate position (x, y), WeightedMap_{x,y} represents the weight of the dynamic region at coordinates (x, y), and N is the product of the length and width of the second feature map.
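A direct transcription of this weighted square loss, with hand-picked toy feature maps (the values are illustrative only):

```python
def weighted_distillation_loss(f_t, f_s, weighted_map):
    """Illustrative sketch of claim 4:
    LOSS = (1/N) * sum_{x,y} WeightedMap_{x,y} * (F_t - F_s)^2,
    where N is the product of the feature map's length and width."""
    h, w = len(f_s), len(f_s[0])
    n = h * w
    total = 0.0
    for x in range(h):
        for y in range(w):
            total += weighted_map[x][y] * (f_t[x][y] - f_s[x][y]) ** 2
    return total / n

loss = weighted_distillation_loss(
    [[1.0, 0.0], [0.0, 1.0]],   # teacher feature map F_t
    [[0.0, 0.0], [0.0, 0.0]],   # student feature map F_s
    [[1.0, 0.0], [0.0, 0.5]],   # dynamic-region weights
)
```

Positions with weight 0 contribute nothing, so the student is only pushed to imitate the teacher inside the dynamic region.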
5. The method of claim 1, wherein training the student network based on the distillation loss part and the detection loss part comprises:
taking the sum of the distillation loss part and the detection loss part as the global loss of the student network;
and adjusting network parameters of the student network based on the global loss until the global loss meets a preset condition.
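A toy sketch of claim 5's stopping rule; the "preset condition" is assumed here to be the global loss falling below a target value, and the per-step loss pairs are simulated:

```python
def global_loss(distillation_loss, detection_loss):
    """Claim 5: the global loss is the sum of the two loss parts."""
    return distillation_loss + detection_loss

def train_until(loss_pairs, target=0.1):
    """Illustrative sketch (not the patented procedure): keep adjusting
    parameters until the global loss meets the preset condition, modeled
    here as dropping below `target`; returns the step where that happens."""
    for step, (dist, det) in enumerate(loss_pairs):
        if global_loss(dist, det) < target:
            return step
    return None

# Simulated per-step (distillation, detection) loss pairs
stop_step = train_until([(0.5, 0.4), (0.2, 0.15), (0.04, 0.03)], target=0.1)
```

In a real training loop the parameter adjustment would be a gradient step on the global loss; only the termination logic is shown here.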
6. The target detection method of claim 1, wherein before determining the distillation loss part based on the first feature map, the second feature map, and the dynamic region, the method further comprises:
inputting the second feature map into a FitNet network such that the second feature map is aligned with the first feature map.
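FitNet-style alignment is typically a learned 1×1 projection from the student's channel count to the teacher's; a minimal sketch under that assumption (the patent does not specify the adapter's form, and the weights here are hand-picked):

```python
def fitnet_adapter(student_feat, weights):
    """Illustrative FitNet-style 1x1 projection: maps the student's C_s
    channels to the teacher's C_t channels so the two feature maps have
    matching shapes for the distillation loss.
    `student_feat` is C_s x H x W; `weights` is C_t x C_s."""
    c_s = len(student_feat)
    h, w = len(student_feat[0]), len(student_feat[0][0])
    return [[[sum(weights[t][s] * student_feat[s][i][j] for s in range(c_s))
              for j in range(w)]
             for i in range(h)]
            for t in range(len(weights))]

# Project a 2-channel 1x1 student feature to 3 teacher channels
aligned = fitnet_adapter([[[1.0]], [[2.0]]],
                         [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In practice the projection weights are trained jointly with the student so the adapter learns the channel mapping rather than having it fixed.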
7. The target detection method according to any one of claims 1 to 6, wherein the region candidate network is any one of RPN, R-CNN, and Faster R-CNN.
8. An electronic device comprising a processor and a storage means adapted to store a plurality of program codes, characterized in that said program codes are adapted to be loaded and run by said processor to perform the object detection method according to any of claims 1-7.
9. A computer readable storage medium having stored therein a plurality of program codes, characterized in that the program codes are adapted to be loaded and run by a processor to perform the object detection method according to any one of claims 1-7.
10. A vehicle characterized in that the vehicle comprises the electronic device of claim 8.
CN202210509045.6A 2022-05-10 2022-05-10 Target detection method, electronic device, medium, and vehicle Pending CN114842449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210509045.6A CN114842449A (en) 2022-05-10 2022-05-10 Target detection method, electronic device, medium, and vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210509045.6A CN114842449A (en) 2022-05-10 2022-05-10 Target detection method, electronic device, medium, and vehicle

Publications (1)

Publication Number Publication Date
CN114842449A true CN114842449A (en) 2022-08-02

Family

ID=82569362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210509045.6A Pending CN114842449A (en) 2022-05-10 2022-05-10 Target detection method, electronic device, medium, and vehicle

Country Status (1)

Country Link
CN (1) CN114842449A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024051686A1 * 2022-09-05 2024-03-14 东声(苏州)智能科技有限公司 Compression and training method and apparatus for defect detection model
CN115880486A * 2023-02-27 2023-03-31 广东电网有限责任公司肇庆供电局 Target detection network distillation method and device, electronic equipment and storage medium
CN117173692A * 2023-11-02 2023-12-05 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117173692B * 2023-11-02 2024-02-02 安徽蔚来智驾科技有限公司 3D target detection method, electronic device, medium and driving device
CN117372685A * 2023-12-08 2024-01-09 深圳须弥云图空间科技有限公司 Target detection method, target detection device, electronic equipment and storage medium
CN117372685B * 2023-12-08 2024-04-16 深圳须弥云图空间科技有限公司 Target detection method, target detection device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111507460B (en) Method and apparatus for detecting parking space in order to provide automatic parking system
CN114842449A (en) Target detection method, electronic device, medium, and vehicle
KR102337376B1 (en) Method and device for lane detection without post-processing by using lane mask, and testing method, and testing device using the same
EP3690727A1 (en) Learning method and learning device for sensor fusion to integrate information acquired by radar capable of distance estimation and information acquired by camera to thereby improve neural network for supporting autonomous driving, and testing method and testing device using the same
KR102280395B1 (en) Learning method, learning device for detecting lane through classification of lane candidate pixels and testing method, testing device using the same
KR102279376B1 (en) Learning method, learning device for detecting lane using cnn and test method, test device using the same
KR20200027428A (en) Learning method, learning device for detecting object using edge image and testing method, testing device using the same
KR102313119B1 (en) Learning method and learning device for attention-driven image segmentation by using at least one adaptive loss weight map to be used for updating hd maps required to satisfy level 4 of autonomous vehicles and testing method and testing device using the same
EP3686795B1 (en) Learning method and learning device for improving segmentation performance to be used for detecting events including pedestrian event, vehicle event, falling event and fallen event using edge loss and test method and test device using the same
US10423840B1 (en) Post-processing method and device for detecting lanes to plan the drive path of autonomous vehicle by using segmentation score map and clustering map
WO2021082745A1 (en) Information completion method, lane line recognition method, intelligent driving method and related product
KR102337358B1 (en) Learning method and learning device for pooling roi by using masking parameters to be used for mobile devices or compact networks via hardware optimization, and testing method and testing device using the same
KR102279388B1 (en) Learning method, learning device for detecting lane using lane model and test method, test device using the same
KR102252155B1 (en) Learning method and learning device for segmenting an image having one or more lanes by using embedding loss to support collaboration with hd maps required to satisfy level 4 of autonomous vehicles and softmax loss, and testing method and testing device using the same
KR20200092842A (en) Learning method and learning device for improving segmentation performance to be used for detecting road user events using double embedding configuration in multi-camera system and testing method and testing device using the same
US10373004B1 (en) Method and device for detecting lane elements to plan the drive path of autonomous vehicle by using a horizontal filter mask, wherein the lane elements are unit regions including pixels of lanes in an input image
CN115719485A (en) Road side traffic target detection method based on category guidance
CN115421160A (en) Road edge detection method, device, equipment, vehicle and storage medium
CN113591539B (en) Target identification method, device and readable storage medium
TWI819928B (en) Method for detecting skewing of vehicle and related devices
CN114783041B (en) Target object recognition method, electronic device, and computer-readable storage medium
US20240177499A1 (en) Method for detecting lane lines and electronic device
CN117727010A (en) Target detection method, image desensitization method, device, medium and vehicle
CN117649635A (en) Method, system and storage medium for detecting shadow eliminating point of narrow water channel scene
WO2021056307A1 (en) Systems and methods for detecting lane markings for autonomous driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination