CN109934147B - Target detection method, system and device based on deep neural network

Target detection method, system and device based on deep neural network

Info

Publication number
CN109934147B
CN109934147B
Authority
CN
China
Prior art keywords
superpixel
preset
representing
fusion
network
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910167067.7A
Other languages
Chinese (zh)
Other versions
CN109934147A (en)
Inventor
龙浩
Current Assignee
Beijing Union University
Original Assignee
Beijing Union University
Priority date
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN201910167067.7A priority Critical patent/CN109934147B/en
Publication of CN109934147A publication Critical patent/CN109934147A/en
Application granted granted Critical
Publication of CN109934147B publication Critical patent/CN109934147B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method, system and device based on a deep neural network, comprising the following steps: extracting deep features of different scales from video frames of a video to be detected based on a feature learning network; performing superpixel segmentation on each video frame to obtain a superpixel structure diagram; performing feature fusion on the deep features and the superpixel structure diagram to obtain fusion features; performing target semantic classification based on a conditional random field network and according to the fusion features to obtain target semantic labels; and performing frame regression according to the target semantic labels to obtain a target detection result. The method can accurately detect small, densely packed targets against complex backgrounds in video, and is particularly suitable for target recognition in aerial videos.

Description

Target detection method, system and device based on deep neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a target detection method, system and device based on a deep neural network.
Background
In recent years, target detection technology has attracted great attention and has been widely applied in many fields, but target detection in aerial images still faces many challenges. First, most aerial images are taken from high altitude, vertically or obliquely, so their backgrounds are more confusing than those of natural scene images taken from the ground. For example, when detecting vehicles in aerial imagery, similar-looking objects such as rooftop equipment and substation boxes may produce false positive detections. Second, because the images are captured over a wide field of view, the objects in aerial images are very small and more densely packed than in natural scene images. Finally, the lack of large-scale, well-annotated datasets limits the detection performance of trained networks.
At present, most target detection methods for aerial images are based on sliding-window search and shallow, hand-crafted features. Such methods cannot acquire comprehensive information about the detected objects from aerial images, so their applicability is very limited and their detection results are inconsistent across different tasks. Although convolutional neural networks can learn strong hierarchical representations, when they are applied to object detection in aerial images the repeated max-pooling and downsampling operations lead to loss of signal resolution and a relatively weak spatial description. On the other hand, because the aerial platform is highly variable and rotates through many angles, objects in aerial images typically have small size, multiple scales and shape distortions, which further limit the spatial description capability of convolutional neural networks.
Disclosure of Invention
The invention aims to provide a target detection method, system and device based on a deep neural network, which can detect small, densely packed targets against complex backgrounds in video and improve target detection precision.
In order to achieve the above object, in a first aspect of the present invention, there is provided a deep neural network-based target detection method, including:
extracting deep features of different scales of video frames in a video to be detected based on a preset feature learning network;
performing superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame;
performing feature fusion on the deep features and the super-pixel structure diagram to obtain fusion features;
performing target semantic classification based on a preset conditional random field network and according to the fusion features to obtain target semantic labels;
and performing frame regression according to the target semantic label to obtain a target detection result.
Further, the step of "performing superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame" includes:
performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm;
calculating the pixel average value of each superpixel block obtained after superpixel segmentation;
and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
Further, before the step of performing target semantic classification based on the preset conditional random field network and according to the fusion features to obtain the target semantic label, the method further includes:
and carrying out network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion characteristics.
Further, the step of performing network training on the conditional random field network by using a maximum conditional likelihood method based on preset fusion features includes:
optimizing the network weight of the conditional random field network according to a method shown as the following formula:
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features, A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
In a second aspect of the present invention, there is also provided a deep neural network-based target detection system, including:
the characteristic extraction module is configured to extract deep characteristics of different scales of video frames in the video to be detected based on a preset characteristic learning network;
the super-pixel segmentation module is configured to perform super-pixel segmentation on the video frame to obtain a super-pixel structure diagram corresponding to the video frame;
the feature fusion module is configured to perform feature fusion on the deep features and the super-pixel structure diagram to obtain fusion features;
the semantic classification module is configured to perform target semantic classification based on a preset conditional random field network and according to the fusion features to obtain target semantic labels;
and the target detection module is configured to perform frame regression according to the target semantic tag to obtain a target detection result.
Further, the super-pixel segmentation module is further configured to perform the following operations:
performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm;
calculating the pixel average value of each superpixel block obtained after superpixel segmentation;
and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
Further, the system also includes a network training module configured to perform the following operations:
and carrying out network training on the conditional random field network by adopting a maximum conditional likelihood method based on a preset first fusion characteristic.
Further, the network training module is further configured to optimize the network weights of the conditional random field network according to a method shown in the following formula:
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features, A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
In a third aspect of the present invention, there is also provided a storage device, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned deep neural network-based object detection method.
In a fourth aspect of the present invention, there is also provided a processing apparatus, comprising a processor adapted to execute various programs; and a storage device adapted to store a plurality of programs; the program is suitable to be loaded and executed by a processor to implement the above-mentioned deep neural network-based object detection method.
The invention has the advantages that:
the target detection method based on the deep neural network can detect the targets with complex backgrounds, high density and small targets in the video and improve the target detection precision.
Drawings
Fig. 1 is a schematic diagram illustrating main steps of a deep neural network-based target detection method according to an embodiment of the present invention.
Fig. 2 is a schematic main flow chart of a target detection method based on a deep neural network in an embodiment of the present invention.
Fig. 3 is a schematic diagram of target detection results on the UAV123 data set in an embodiment of the invention.
Fig. 4 is a schematic structural diagram of a deep neural network-based target detection system in an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Referring to fig. 1, fig. 1 illustrates the main steps of a deep neural network-based target detection method, and as shown in fig. 1, the deep neural network-based target detection method of the present invention may include the following steps:
step S1: and extracting deep features of different scales of video frames in the video to be detected based on a preset feature learning network.
Specifically, the video to be detected is a video sequence on which the target detection task is to be performed, and it comprises a plurality of video frames. The feature learning network may be a deep convolutional network constructed using a machine learning algorithm. Because the position, rotation, scale and so on of the detected target vary from frame to frame, while the feature representations extracted by convolution operations are invariant to tilt, translation, scaling and the like, deep features can hierarchically express both the small targets and the background information in an aerial video and improve target detection precision; using deep features of different scales therefore makes target detection more accurate and convenient than methods based on manually extracted shallow features.
In this embodiment, in the network training stage, the feature learning network used to extract deep features is a neural network pre-trained with MatConvNet from the VLFeat toolbox; the selected feature learning network is the 21-layer "imagenet-vgg-f", and its 5th, 13th and 16th layers are used when training the feature learning network.
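As a rough illustration of step S1, the sketch below taps intermediate layers of a pre-trained convolutional network to obtain deep features of different scales for one video frame. It is only a sketch: the torchvision VGG16 backbone and the tapped layer indices are stand-in assumptions, not the "imagenet-vgg-f" configuration of the embodiment above.

import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained backbone standing in for the embodiment's "imagenet-vgg-f" (assumption).
_VGG = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

# Indices of the layers whose activations are kept as multi-scale deep features (assumed).
_TAP_LAYERS = (4, 16, 23)

_PREPROCESS = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_multiscale_features(frame_rgb):
    """Return one feature map per tapped layer for a single video frame (RGB image)."""
    x = _PREPROCESS(frame_rgb).unsqueeze(0)      # 1 x 3 x H x W
    features = []
    with torch.no_grad():
        for idx, layer in enumerate(_VGG):
            x = layer(x)
            if idx in _TAP_LAYERS:
                features.append(x.squeeze(0))    # C x h x w; deeper taps give coarser scales
    return features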
Step S2: and carrying out superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame.
Specifically, superpixel segmentation is performed on the video frame based on a simple linear iterative clustering (SLIC) algorithm; the pixel average value of each superpixel block obtained after segmentation is calculated; and the superpixel structure diagram is obtained according to the probability dependence relationships between the pixel average value of each superpixel block and the pixel average values of the other superpixel blocks. The superpixel structure diagram is a probabilistic graphical model describing conditionally independent relations among multiple random variables; it consists of a set of nodes and the edges between them, where each node represents a random variable (or a group of random variables) and each edge represents a probability dependence relationship between random variables. On this basis, a small number of scattered abnormal pixels in the video frame can be eliminated, so that the target detection precision is further improved. In addition, the number of superpixels in a video frame is far smaller than the number of pixels, so the running speed of the network can be significantly improved. The boundaries between superpixel blocks are explicitly preserved in the superpixel structure diagram, so adjacent objects can be distinguished more accurately, further improving the detection precision for small targets. In this embodiment, in the superpixel segmentation process, the size of the superpixel neighborhood is set to 15 and the normalization factor is set to 0.1.
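The following sketch illustrates one way to realize step S2, assuming scikit-image's SLIC implementation as the simple linear iterative clustering algorithm and a networkx graph as the superpixel structure diagram, with each node holding the pixel average of one superpixel block and each edge linking two spatially adjacent blocks. The segmentation parameters shown are illustrative and do not correspond exactly to the neighborhood size of 15 and normalization factor of 0.1 used in the embodiment.

import numpy as np
import networkx as nx
from skimage.segmentation import slic

def build_superpixel_structure_graph(frame_rgb, n_segments=600, compactness=10.0):
    """Segment one frame (H x W x 3 array) into superpixel blocks and build the graph.

    Nodes carry the pixel average (mean color) of each block; an edge connects
    every pair of spatially adjacent blocks."""
    labels = slic(frame_rgb, n_segments=n_segments, compactness=compactness,
                  start_label=0)
    g = nx.Graph()
    for sp in np.unique(labels):
        g.add_node(int(sp), mean_color=frame_rgb[labels == sp].mean(axis=0))
    # Neighbouring pixels that carry different labels define adjacency between blocks.
    horiz = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    vert = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    for a, b in np.unique(np.vstack([horiz, vert]), axis=0):
        if a != b:
            g.add_edge(int(a), int(b))
    return labels, g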
Step S3: performing feature fusion on the deep features and the superpixel structure diagram to obtain fusion features. Specifically, the superpixel structure diagram is used as a feature representation of the video frame, and the deep features and the superpixel structure diagram are fused to obtain the fusion features. The fusion features are deep multi-scale features.
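A minimal sketch of step S3 is given below under the assumption that fusion here means pooling each multi-scale deep feature map over every superpixel block and concatenating the pooled vectors with the block's mean color; the patent does not prescribe this exact fusion rule. The sketch reuses the outputs of the two sketches above.

import numpy as np
import torch
import torch.nn.functional as F

def fuse_features(feature_maps, labels, graph):
    """Pool each deep feature map over every superpixel block and append the
    block's mean color; returns an array of shape (num_blocks, feature_dim)."""
    h, w = labels.shape
    nodes = sorted(graph.nodes)
    parts = []
    for fmap in feature_maps:                                   # C x h' x w'
        up = F.interpolate(fmap.unsqueeze(0), size=(h, w),
                           mode="bilinear", align_corners=False)[0]
        up = up.permute(1, 2, 0).detach().numpy()               # H x W x C
        parts.append(np.stack([up[labels == sp].mean(axis=0) for sp in nodes]))
    colors = np.stack([graph.nodes[sp]["mean_color"] for sp in nodes])
    return np.concatenate(parts + [colors], axis=1)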
Step S4: and carrying out target semantic classification based on a preset conditional random field network and according to the fusion characteristics to obtain a target semantic label.
Specifically, the conditional random field network is a neural network constructed based on a conditional random field, and a conditional random field has a strong capability of explicitly learning spatial relationships.
In this embodiment, the method further comprises a step of network training for the conditional random field network. Specifically, the conditional random field network is trained based on preset fusion features using a maximum conditional likelihood method, and the network weights of the conditional random field network are optimized according to the method shown in formula (1):
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \qquad (1) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features (the number of fusion features may be equal to the number of video frames in the video to be detected), A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
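To make the objective above concrete, the following minimal sketch evaluates the regularized conditional log-likelihood for a single training sample on a tiny superpixel graph. It is an illustration only: the unary and pairwise feature functions are replaced by simple score tables, the weights w_N and w_E are collapsed to scalars, and the partition function Z is computed by brute-force enumeration, which is feasible only for the handful of blocks used here; the patent's actual definitions of the unit-term and binary-term functions are not reproduced.

import numpy as np
from itertools import product

def score(y, unary, edges, pairwise, w_N, w_E):
    """w_N * sum_i A(x, y_i)  +  w_E * sum over e_ij in E of I(x, y_i, y_j)."""
    s = w_N * sum(unary[i, y[i]] for i in range(len(y)))
    s += w_E * sum(pairwise[y[i], y[j]] for i, j in edges)
    return s

def regularized_log_likelihood(y_obs, unary, edges, pairwise, w, lam):
    """log P(y_obs | x, w) - lam * ||w||^2, with Z obtained by brute-force
    enumeration over all label assignments (tiny graphs only)."""
    w_N, w_E = w
    n_blocks, n_classes = unary.shape
    log_Z = np.logaddexp.reduce(
        [score(y, unary, edges, pairwise, w_N, w_E)
         for y in product(range(n_classes), repeat=n_blocks)])
    return (score(y_obs, unary, edges, pairwise, w_N, w_E) - log_Z
            - lam * (w_N ** 2 + w_E ** 2))

# Toy usage: 3 superpixel blocks, 2 classes, a chain-shaped structure graph.
unary = np.array([[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]])   # per-block class scores
pairwise = np.array([[0.5, -0.5], [-0.5, 0.5]])          # rewards matching labels on an edge
edges = [(0, 1), (1, 2)]
print(regularized_log_likelihood((0, 1, 0), unary, edges, pairwise,
                                 w=(1.0, 1.0), lam=0.01))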
Step S5: performing frame regression according to the target semantic labels to obtain the target detection result. Specifically, the identified targets are obtained according to the target semantic labels, and frame regression (bounding-box regression) is performed on the identified targets to obtain the position information and size information of each target in the video frame.
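As a hedged illustration of step S5, the sketch below derives one box per target from a per-pixel semantic label map by taking connected components of each class and their bounding boxes; the patent does not detail its frame-regression step, so this is only one plausible realization. The box format matches the evaluation sketch given later.

import numpy as np
from skimage.measure import label, regionprops

def boxes_from_semantic_map(class_map, target_class):
    """class_map: H x W integer array of predicted semantic labels per pixel.
    Returns a list of (min_row, min_col, max_row, max_col) boxes, one per
    connected region of the requested class."""
    mask = class_map == target_class
    return [region.bbox for region in regionprops(label(mask))]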
Referring to fig. 2, fig. 2 exemplarily shows the main flow of the deep neural network-based target detection method. As shown in fig. 2, the method may include: extracting deep features of different scales from video frames of the video to be detected based on the feature learning network; performing superpixel segmentation on the video frames to obtain the superpixel structure diagrams corresponding to the video frames; performing feature fusion on the deep features and the superpixel structure diagrams to obtain fusion features; performing target semantic classification based on the conditional random field network and according to the fusion features to obtain target semantic labels; and performing frame regression according to the target semantic labels to obtain the target detection result.
Although the foregoing embodiments describe the steps in the above sequential order, those skilled in the art will understand that, in order to achieve the effect of the present embodiments, the steps may not be executed in such an order, and may be executed simultaneously (in parallel) or in an inverted order, and these simple changes are all within the scope of the present invention.
To illustrate the effectiveness of the method of the present invention, it was evaluated on the UAV123 database. This database, created in 2016 for target tracking and detection in unmanned aerial vehicle aerial images, contains 123 videos; 33 videos covering all scene types in the database were selected to generate 48770 pictures, of which 13871 pictures were manually annotated with ground-truth detection boxes. The training set and test set were randomly divided at a ratio of 1:1. The detected objects are mainly bicycles, ships, buildings, people, cars and the like. Three internationally recognized indexes were used: Precision, Recall and F1-score, and the method was compared with current state-of-the-art target detection methods; the results are shown in Table 1. "ACF 2015" denotes the method proposed in K. Liu and G. Mattyus, "Fast Multiclass Vehicle Detection on Aerial Images," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 9, pp. 1938-1942, 2015, and "AVPN" denotes the method proposed in Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, "Toward Fast and Accurate Vehicle Detection in Aerial Images Using Coupled Region-Based Convolutional Neural Networks," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 8, pp. 3652-3664, 2017.
TABLE 1 Performance comparison results
As can be seen from Table 1, the method of the invention captures the hierarchical structure features and spatial relationship features of targets well, and obtains high values on all three indexes: Precision, Recall and F1-score.
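For reference, the three indexes can be computed from matched detections as in the sketch below. The greedy matching rule with an IoU threshold of 0.5 is an assumption; the matching rule actually used for Table 1 is not stated.

def iou(a, b):
    """Intersection over union of two boxes given as (min_row, min_col, max_row, max_col)."""
    r0, c0 = max(a[0], b[0]), max(a[1], b[1])
    r1, c1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, r1 - r0) * max(0, c1 - c0)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / float(area(a) + area(b) - inter) if inter else 0.0

def precision_recall_f1(pred_boxes, gt_boxes, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth at IoU >= thr."""
    matched_gt, tp = set(), 0
    for p in pred_boxes:
        best = max(range(len(gt_boxes)),
                   key=lambda g: iou(p, gt_boxes[g]), default=None)
        if best is not None and best not in matched_gt and iou(p, gt_boxes[best]) >= thr:
            tp += 1
            matched_gt.add(best)
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1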
Referring to fig. 3, fig. 3 exemplarily shows target detection results of the method of the present invention on the UAV123 data set. As shown in fig. 3, each row is a video sequence consisting of 4 consecutive video frames; the first row is a target detection task for detecting ships, the second row is a target detection task for detecting cars, and the third row is a target detection task for detecting pedestrians and cars. It can be seen from the figure that the method of the present invention can accurately complete detection tasks for objects of different viewing angles and categories, especially objects with complicated backgrounds, high density and small size, and is well suited to complex target detection in unmanned aerial vehicle aerial images.
Based on the same inventive concept as the method embodiment, the embodiment of the invention also provides a target detection system based on the deep neural network. The following describes a target detection system based on a deep neural network according to the present invention with reference to the accompanying drawings.
Referring to fig. 4, fig. 4 exemplarily shows the main structure of a deep neural network-based target detection system. As shown in fig. 4, the deep neural network-based target detection system may include: a feature extraction module 1 configured to extract deep features of different scales from video frames of a video to be detected based on a preset feature learning network; a superpixel segmentation module 2 configured to perform superpixel segmentation on the video frames to obtain superpixel structure diagrams corresponding to the video frames; a feature fusion module 3 configured to perform feature fusion on the deep features and the superpixel structure diagrams to obtain fusion features; a semantic classification module 4 configured to perform target semantic classification based on a preset conditional random field network and according to the fusion features to obtain target semantic labels; and a target detection module 5 configured to perform frame regression according to the target semantic labels to obtain target detection results.
Further, the superpixel segmentation module 2 is further configured to perform the following operations: performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm; calculating the pixel average value of each superpixel block obtained after superpixel segmentation; and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
Further, the system also includes a network training module configured to perform the following operation: performing network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion features.
Further, the network training module is further configured to optimize the network weights of the conditional random field network according to the method shown in formula (1).
Further, based on the above method embodiment, the embodiment of the present invention further provides a storage device, which stores a plurality of programs, and the programs are suitable for being loaded and executed by a processor to implement the above deep neural network-based target detection method.
Further, based on the foregoing method embodiment, an embodiment of the present invention further provides a processing apparatus, which includes a processor and a storage device. Wherein the processor may be adapted to execute the respective program, and the storage device may be adapted to store the plurality of programs, which are adapted to be loaded and executed by the processor to implement the above-mentioned deep neural network based object detection method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process and the related descriptions of the apparatus according to the embodiment of the present invention may refer to the corresponding process in the method according to the foregoing embodiment, and have the same beneficial effects as the method described above, and will not be described again here.
Those of skill in the art will appreciate that the various illustrative method steps and systems described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The above description covers the preferred embodiments of the present invention and the technical principles applied thereto. It will be apparent to those skilled in the art that any changes and modifications based on equivalent variations and simple substitutions of the technical solution of the present invention fall within the protection scope of the present invention, provided that they do not depart from the spirit and scope of the present invention.

Claims (6)

1. A target detection method based on a deep neural network is characterized by comprising the following steps:
extracting deep features of different scales of video frames in a video to be detected based on a preset feature learning network;
performing superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame;
performing feature fusion on the deep features and the super-pixel structure diagram to obtain fusion features;
carrying out target semantic classification based on a preset conditional random field network and according to the fusion characteristics to obtain target semantic labels;
performing frame regression according to the target semantic tag to obtain a target detection result;
before the step of performing target semantic classification based on the preset conditional random field network and according to the fusion features to obtain the target semantic label, the method further includes:
performing network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion characteristics;
the step of performing network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion characteristics comprises the following steps:
optimizing the network weight of the conditional random field network according to a method shown as the following formula:
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features, A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
2. The method for detecting a target based on a deep neural network as claimed in claim 1, wherein the step of performing superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame comprises:
performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm;
calculating the pixel average value of each superpixel block obtained after superpixel segmentation;
and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
3. A deep neural network-based object detection system, the system comprising:
the characteristic extraction module is configured to extract deep characteristics of different scales of video frames in the video to be detected based on a preset characteristic learning network;
the super-pixel segmentation module is configured to perform super-pixel segmentation on the video frame to obtain a super-pixel structure diagram corresponding to the video frame;
the feature fusion module is configured to perform feature fusion on the deep features and the super-pixel structure diagram to obtain fusion features;
the semantic classification module is configured to perform target semantic classification based on a preset conditional random field network and according to the fusion features to obtain target semantic labels;
the target detection module is configured to perform frame regression according to the target semantic tag to obtain a target detection result;
the system also includes a network training module configured to perform the following operations:
performing network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion characteristics;
the network training module is further configured to optimize the network weights of the conditional random field network according to a method shown in the following formula:
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features, A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
4. The deep neural network-based object detection system of claim 3, wherein the superpixel segmentation module is further configured to perform operations comprising:
performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm;
calculating the pixel average value of each superpixel block obtained after superpixel segmentation;
and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
5. A storage device in which a plurality of programs are stored, wherein the programs are adapted to be loaded and executed by a processor to implement the deep neural network based object detection method of any one of claims 1 to 2.
6. A processing apparatus, comprising:
a processor adapted to execute various programs; and
a storage device adapted to store a plurality of programs;
characterized in that the program is adapted to be loaded and executed by a processor to implement the deep neural network based object detection method of any one of claims 1 to 2.
CN201910167067.7A 2019-03-05 2019-03-05 Target detection method, system and device based on deep neural network Expired - Fee Related CN109934147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910167067.7A CN109934147B (en) 2019-03-05 2019-03-05 Target detection method, system and device based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910167067.7A CN109934147B (en) 2019-03-05 2019-03-05 Target detection method, system and device based on deep neural network

Publications (2)

Publication Number Publication Date
CN109934147A CN109934147A (en) 2019-06-25
CN109934147B true CN109934147B (en) 2020-11-06

Family

ID=66986583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910167067.7A Expired - Fee Related CN109934147B (en) 2019-03-05 2019-03-05 Target detection method, system and device based on deep neural network

Country Status (1)

Country Link
CN (1) CN109934147B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633327B (en) * 2020-12-02 2023-06-30 西安电子科技大学 Staged metal surface defect detection method, system, medium, equipment and application
CN113627298A (en) * 2021-07-30 2021-11-09 北京百度网讯科技有限公司 Training method of target detection model and method and device for detecting target object

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984953B (en) * 2014-04-23 2017-06-06 浙江工商大学 Semantic segmentation method based on multiple features fusion Yu the street view image of Boosting decision forests
US9239384B1 (en) * 2014-10-21 2016-01-19 Sandia Corporation Terrain detection and classification using single polarization SAR
CN107423278B (en) * 2016-05-23 2020-07-14 株式会社理光 Evaluation element identification method, device and system
CN106709924B (en) * 2016-11-18 2019-11-22 中国人民解放军信息工程大学 Image, semantic dividing method based on depth convolutional neural networks and super-pixel
CN107742133A (en) * 2017-11-08 2018-02-27 电子科技大学 A kind of sorting technique for Polarimetric SAR Image
CN108510521A (en) * 2018-02-27 2018-09-07 南京邮电大学 A kind of dimension self-adaption method for tracking target of multiple features fusion
CN108629286B (en) * 2018-04-03 2021-09-28 北京航空航天大学 Remote sensing airport target detection method based on subjective perception significance model
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning

Also Published As

Publication number Publication date
CN109934147A (en) 2019-06-25

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
JP7058669B2 (en) Vehicle appearance feature identification and vehicle search methods, devices, storage media, electronic devices
CN108510467B (en) SAR image target identification method based on depth deformable convolution neural network
Chen et al. Vehicle detection in high-resolution aerial images via sparse representation and superpixels
CN110310264A (en) A kind of large scale object detection method, device based on DCNN
Workman et al. A unified model for near and remote sensing
CN109919223B (en) Target detection method and device based on deep neural network
CN105989336B (en) Scene recognition method based on deconvolution deep network learning with weight
Han et al. Aerial image change detection using dual regions of interest networks
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
CN110008900B (en) Method for extracting candidate target from visible light remote sensing image from region to target
Vishal et al. Accurate localization by fusing images and GPS signals
Zelener et al. Cnn-based object segmentation in urban lidar with missing points
CN108734200A (en) Human body target visible detection method and device based on BING features
CN109934147B (en) Target detection method, system and device based on deep neural network
Zhang et al. Finding nonrigid tiny person with densely cropped and local attention object detector networks in low-altitude aerial images
Padmanabula et al. Object Detection Using Stacked YOLOv3.
CN112465854A (en) Unmanned aerial vehicle tracking method based on anchor-free detection algorithm
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
Yang et al. Toward country scale building detection with convolutional neural network using aerial images
Makantasis et al. Semi-supervised vision-based maritime surveillance system using fused visual attention maps
Tu et al. Detection of damaged rooftop areas from high-resolution aerial images based on visual bag-of-words model
CN117557780A (en) Target detection algorithm for airborne multi-mode learning
Samanta et al. Spatial-resolution independent object detection framework for aerial imagery
Wang et al. Oil tank detection via target-driven learning saliency model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201106