CN109934147B - Target detection method, system and device based on deep neural network

Target detection method, system and device based on deep neural network

Info

Publication number
CN109934147B
CN109934147B
Authority
CN
China
Prior art keywords
superpixel
preset
representing
fusion
network
Prior art date
Legal status
Expired - Fee Related
Application number
CN201910167067.7A
Other languages
Chinese (zh)
Other versions
CN109934147A (en)
Inventor
龙浩
Current Assignee
Beijing Union University
Original Assignee
Beijing Union University
Priority date
Filing date
Publication date
Application filed by Beijing Union University filed Critical Beijing Union University
Priority to CN201910167067.7A priority Critical patent/CN109934147B/en
Publication of CN109934147A publication Critical patent/CN109934147A/en
Application granted granted Critical
Publication of CN109934147B publication Critical patent/CN109934147B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method, system and device based on a deep neural network, comprising the following steps: extracting deep features of different scales from video frames of a video to be detected based on a feature learning network; performing superpixel segmentation on each video frame to obtain a superpixel structure diagram; performing feature fusion on the deep features and the superpixel structure diagram to obtain fusion features; performing target semantic classification based on a conditional random field network and according to the fusion features to obtain target semantic labels; and performing frame regression according to the target semantic labels to obtain a target detection result. The method can accurately detect small, densely packed targets against complex backgrounds in video, and is particularly suitable for target recognition in aerial videos.

Description

Target detection method, system and device based on deep neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a target detection method, system and device based on a deep neural network.
Background
In recent years, target detection technology has attracted great attention and has been widely applied in many fields, but target detection in aerial images still faces many challenges. First, most aerial images are taken from high altitude, vertically or obliquely, so their backgrounds are more confusing than those of natural scene images taken from the ground. For example, when detecting vehicles in aerial imagery, similar-looking objects such as rooftop equipment and substation boxes may produce false positive detections. Second, because the images are captured over a wide field of view, the objects in aerial images are very small and more densely packed than in natural scene images. Finally, the lack of large-scale, well-annotated datasets limits the detection performance of trained networks.
At present, most target detection methods for aerial images are based on sliding-window search and shallow, hand-crafted features. Such methods cannot acquire comprehensive information about the detected objects from aerial images, so their applicability is very limited and their detection results are inconsistent across different tasks. Although convolutional neural networks can learn strong hierarchical representations, when they are applied to object detection in aerial images the repeated max-pooling and downsampling operations lead to loss of signal resolution and a relatively weak spatial description. On the other hand, because the aerial platform is highly variable and rotates through many angles, objects in aerial images typically have small size, multiple scales and shape distortions, which further limit the spatial description capability of convolutional neural networks.
Disclosure of Invention
The invention aims to provide a target detection method, system and device based on a deep neural network, which can detect small, densely packed targets against complex backgrounds in video and improve target detection precision.
In order to achieve the above object, in a first aspect of the present invention, there is provided a deep neural network-based target detection method, including:
extracting deep features of different scales of video frames in a video to be detected based on a preset feature learning network;
performing superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame;
performing feature fusion on the deep features and the super-pixel structure diagram to obtain fusion features;
performing target semantic classification based on a preset conditional random field network and according to the fusion features to obtain target semantic labels;
and performing frame regression according to the target semantic label to obtain a target detection result.
Further, the step of "performing superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame" includes:
performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm;
calculating the pixel average value of each superpixel block obtained after superpixel segmentation;
and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
Further, before the step of performing target semantic classification based on the preset conditional random field network and according to the fusion features to obtain the target semantic label, the method further includes:
and carrying out network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion characteristics.
Further, the step of performing network training on the conditional random field network by using a maximum conditional likelihood method based on preset fusion features includes:
optimizing the network weight of the conditional random field network according to a method shown as the following formula:
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features, A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
In a second aspect of the present invention, there is also provided a deep neural network-based target detection system, including:
the characteristic extraction module is configured to extract deep characteristics of different scales of video frames in the video to be detected based on a preset characteristic learning network;
the super-pixel segmentation module is configured to perform super-pixel segmentation on the video frame to obtain a super-pixel structure diagram corresponding to the video frame;
the feature fusion module is configured to perform feature fusion on the deep features and the super-pixel structure diagram to obtain fusion features;
the semantic classification module is configured to perform target semantic classification based on a preset conditional random field network and according to the fusion features to obtain target semantic labels;
and the target detection module is configured to perform frame regression according to the target semantic tag to obtain a target detection result.
Further, the super-pixel segmentation module is further configured to perform the following operations:
performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm;
calculating the pixel average value of each superpixel block obtained after superpixel segmentation;
and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
Further, the system also includes a network training module configured to perform the following operations:
and carrying out network training on the conditional random field network by adopting a maximum conditional likelihood method based on a preset first fusion characteristic.
Further, the network training module is further configured to optimize the network weights of the conditional random field network according to a method shown in the following formula:
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features, A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
In a third aspect of the present invention, there is also provided a storage device, in which a plurality of programs are stored, the programs being adapted to be loaded and executed by a processor to implement the above-mentioned deep neural network-based object detection method.
In a fourth aspect of the present invention, there is also provided a processing apparatus, comprising a processor adapted to execute various programs; and a storage device adapted to store a plurality of programs; the program is suitable to be loaded and executed by a processor to implement the above-mentioned deep neural network-based object detection method.
The invention has the advantages that:
the target detection method based on the deep neural network can detect the targets with complex backgrounds, high density and small targets in the video and improve the target detection precision.
Drawings
Fig. 1 is a schematic diagram illustrating main steps of a deep neural network-based target detection method according to an embodiment of the present invention.
Fig. 2 is a schematic main flow chart of a target detection method based on a deep neural network in an embodiment of the present invention.
Fig. 3 is a schematic diagram of target detection results on the UAV123 data set in an embodiment of the invention.
Fig. 4 is a schematic structural diagram of a deep neural network-based target detection system in an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
Referring to fig. 1, fig. 1 illustrates the main steps of a deep neural network-based target detection method, and as shown in fig. 1, the deep neural network-based target detection method of the present invention may include the following steps:
step S1: and extracting deep features of different scales of video frames in the video to be detected based on a preset feature learning network.
Specifically, the video to be detected is a video sequence on which the target detection task is to be performed, and it comprises a plurality of video frames. The feature learning network may be a deep convolutional network constructed using a machine learning algorithm. Because the position, rotation, scale and so on of the detected target vary from frame to frame, while the feature representations extracted by convolution operations are invariant to tilt, translation, scaling and the like, deep features can hierarchically express both the small targets and the background information in an aerial video and improve target detection precision; using deep features of different scales therefore makes target detection more accurate and convenient than methods based on manually extracted shallow features.
In this embodiment, in the network training stage, the feature learning network used to extract deep features is a neural network pre-trained with MatConvNet from the VLFeat toolbox; the selected feature learning network is the 21-layer "imagenet-vgg-f", and its 5th, 13th and 16th layers are used when training the feature learning network.
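As a rough illustration of step S1, the sketch below taps intermediate layers of a pre-trained convolutional network to obtain deep features of different scales for one video frame. It is only a sketch: the torchvision VGG16 backbone and the tapped layer indices are stand-in assumptions, not the "imagenet-vgg-f" configuration of the embodiment above.

import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained backbone standing in for the embodiment's "imagenet-vgg-f" (assumption).
_VGG = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

# Indices of the layers whose activations are kept as multi-scale deep features (assumed).
_TAP_LAYERS = (4, 16, 23)

_PREPROCESS = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def extract_multiscale_features(frame_rgb):
    """Return one feature map per tapped layer for a single video frame (RGB image)."""
    x = _PREPROCESS(frame_rgb).unsqueeze(0)      # 1 x 3 x H x W
    features = []
    with torch.no_grad():
        for idx, layer in enumerate(_VGG):
            x = layer(x)
            if idx in _TAP_LAYERS:
                features.append(x.squeeze(0))    # C x h x w; deeper taps give coarser scales
    return features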
Step S2: and carrying out superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame.
Specifically, superpixel segmentation is performed on the video frame based on a simple linear iterative clustering (SLIC) algorithm; the pixel average value of each superpixel block obtained after segmentation is calculated; and the superpixel structure diagram is obtained according to the probability dependence relationships between the pixel average value of each superpixel block and the pixel average values of the other superpixel blocks. The superpixel structure diagram is a probabilistic graphical model describing conditionally independent relations among multiple random variables; it consists of a set of nodes and the edges between them, where each node represents a random variable (or a group of random variables) and each edge represents a probability dependence relationship between random variables. On this basis, a small number of scattered abnormal pixels in the video frame can be eliminated, so that the target detection precision is further improved. In addition, the number of superpixels in a video frame is far smaller than the number of pixels, so the running speed of the network can be significantly improved. The boundaries between superpixel blocks are explicitly preserved in the superpixel structure diagram, so adjacent objects can be distinguished more accurately, further improving the detection precision for small targets. In this embodiment, in the superpixel segmentation process, the size of the superpixel neighborhood is set to 15 and the normalization factor is set to 0.1.
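The following sketch illustrates one way to realize step S2, assuming scikit-image's SLIC implementation as the simple linear iterative clustering algorithm and a networkx graph as the superpixel structure diagram, with each node holding the pixel average of one superpixel block and each edge linking two spatially adjacent blocks. The segmentation parameters shown are illustrative and do not correspond exactly to the neighborhood size of 15 and normalization factor of 0.1 used in the embodiment.

import numpy as np
import networkx as nx
from skimage.segmentation import slic

def build_superpixel_structure_graph(frame_rgb, n_segments=600, compactness=10.0):
    """Segment one frame (H x W x 3 array) into superpixel blocks and build the graph.

    Nodes carry the pixel average (mean color) of each block; an edge connects
    every pair of spatially adjacent blocks."""
    labels = slic(frame_rgb, n_segments=n_segments, compactness=compactness,
                  start_label=0)
    g = nx.Graph()
    for sp in np.unique(labels):
        g.add_node(int(sp), mean_color=frame_rgb[labels == sp].mean(axis=0))
    # Neighbouring pixels that carry different labels define adjacency between blocks.
    horiz = np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1)
    vert = np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1)
    for a, b in np.unique(np.vstack([horiz, vert]), axis=0):
        if a != b:
            g.add_edge(int(a), int(b))
    return labels, g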
Step S3: performing feature fusion on the deep features and the superpixel structure diagram to obtain fusion features. Specifically, the superpixel structure diagram is used as a feature representation of the video frame, and the deep features and the superpixel structure diagram are fused to obtain the fusion features. The fusion features are deep multi-scale features.
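A minimal sketch of step S3 is given below under the assumption that fusion here means pooling each multi-scale deep feature map over every superpixel block and concatenating the pooled vectors with the block's mean color; the patent does not prescribe this exact fusion rule. The sketch reuses the outputs of the two sketches above.

import numpy as np
import torch
import torch.nn.functional as F

def fuse_features(feature_maps, labels, graph):
    """Pool each deep feature map over every superpixel block and append the
    block's mean color; returns an array of shape (num_blocks, feature_dim)."""
    h, w = labels.shape
    nodes = sorted(graph.nodes)
    parts = []
    for fmap in feature_maps:                                   # C x h' x w'
        up = F.interpolate(fmap.unsqueeze(0), size=(h, w),
                           mode="bilinear", align_corners=False)[0]
        up = up.permute(1, 2, 0).detach().numpy()               # H x W x C
        parts.append(np.stack([up[labels == sp].mean(axis=0) for sp in nodes]))
    colors = np.stack([graph.nodes[sp]["mean_color"] for sp in nodes])
    return np.concatenate(parts + [colors], axis=1)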
Step S4: and carrying out target semantic classification based on a preset conditional random field network and according to the fusion characteristics to obtain a target semantic label.
Specifically, the conditional random field network is a neural network constructed based on a conditional random field, and a conditional random field has a strong capability of explicitly learning spatial relationships.
In this embodiment, the method further comprises a step of network training for the conditional random field network. Specifically, the conditional random field network is trained based on preset fusion features using a maximum conditional likelihood method, and the network weights of the conditional random field network are optimized according to the method shown in formula (1):
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \qquad (1) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features (the number of fusion features may be equal to the number of video frames in the video to be detected), A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
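To make the objective above concrete, the following minimal sketch evaluates the regularized conditional log-likelihood for a single training sample on a tiny superpixel graph. It is an illustration only: the unary and pairwise feature functions are replaced by simple score tables, the weights w_N and w_E are collapsed to scalars, and the partition function Z is computed by brute-force enumeration, which is feasible only for the handful of blocks used here; the patent's actual definitions of the unit-term and binary-term functions are not reproduced.

import numpy as np
from itertools import product

def score(y, unary, edges, pairwise, w_N, w_E):
    """w_N * sum_i A(x, y_i)  +  w_E * sum over e_ij in E of I(x, y_i, y_j)."""
    s = w_N * sum(unary[i, y[i]] for i in range(len(y)))
    s += w_E * sum(pairwise[y[i], y[j]] for i, j in edges)
    return s

def regularized_log_likelihood(y_obs, unary, edges, pairwise, w, lam):
    """log P(y_obs | x, w) - lam * ||w||^2, with Z obtained by brute-force
    enumeration over all label assignments (tiny graphs only)."""
    w_N, w_E = w
    n_blocks, n_classes = unary.shape
    log_Z = np.logaddexp.reduce(
        [score(y, unary, edges, pairwise, w_N, w_E)
         for y in product(range(n_classes), repeat=n_blocks)])
    return (score(y_obs, unary, edges, pairwise, w_N, w_E) - log_Z
            - lam * (w_N ** 2 + w_E ** 2))

# Toy usage: 3 superpixel blocks, 2 classes, a chain-shaped structure graph.
unary = np.array([[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]])   # per-block class scores
pairwise = np.array([[0.5, -0.5], [-0.5, 0.5]])          # rewards matching labels on an edge
edges = [(0, 1), (1, 2)]
print(regularized_log_likelihood((0, 1, 0), unary, edges, pairwise,
                                 w=(1.0, 1.0), lam=0.01))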
Step S5: performing frame regression according to the target semantic labels to obtain the target detection result. Specifically, the identified targets are obtained according to the target semantic labels, and frame regression (bounding-box regression) is performed on the identified targets to obtain the position information and size information of each target in the video frame.
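As a hedged illustration of step S5, the sketch below derives one box per target from a per-pixel semantic label map by taking connected components of each class and their bounding boxes; the patent does not detail its frame-regression step, so this is only one plausible realization. The box format matches the evaluation sketch given later.

import numpy as np
from skimage.measure import label, regionprops

def boxes_from_semantic_map(class_map, target_class):
    """class_map: H x W integer array of predicted semantic labels per pixel.
    Returns a list of (min_row, min_col, max_row, max_col) boxes, one per
    connected region of the requested class."""
    mask = class_map == target_class
    return [region.bbox for region in regionprops(label(mask))]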
Referring to fig. 2, fig. 2 exemplarily shows the main flow of the deep neural network-based target detection method. As shown in fig. 2, the method may include: extracting deep features of different scales from video frames of the video to be detected based on the feature learning network; performing superpixel segmentation on the video frames to obtain the superpixel structure diagrams corresponding to the video frames; performing feature fusion on the deep features and the superpixel structure diagrams to obtain fusion features; performing target semantic classification based on the conditional random field network and according to the fusion features to obtain target semantic labels; and performing frame regression according to the target semantic labels to obtain the target detection result.
Although the foregoing embodiments describe the steps in the above sequential order, those skilled in the art will understand that, in order to achieve the effect of the present embodiments, the steps may not be executed in such an order, and may be executed simultaneously (in parallel) or in an inverted order, and these simple changes are all within the scope of the present invention.
To illustrate the effectiveness of the method of the present invention, it was evaluated on the UAV123 database. This database, created in 2016 for target tracking and detection in unmanned aerial vehicle aerial images, contains 123 videos; 33 videos covering all scene types in the database were selected to generate 48770 pictures, of which 13871 pictures were manually annotated with ground-truth detection boxes. The training set and test set were randomly divided at a ratio of 1:1. The detected objects are mainly bicycles, ships, buildings, people, cars and the like. Three internationally recognized indexes were used: Precision, Recall and F1-score, and the method was compared with current state-of-the-art target detection methods; the results are shown in Table 1. "ACF 2015" denotes the method proposed in K. Liu and G. Mattyus, "Fast Multiclass Vehicle Detection on Aerial Images," IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 9, pp. 1938-1942, 2015, and "AVPN" denotes the method proposed in Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, "Toward Fast and Accurate Vehicle Detection in Aerial Images Using Coupled Region-Based Convolutional Neural Networks," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 8, pp. 3652-3664, 2017.
TABLE 1 Performance comparison results
As can be seen from Table 1, the method of the invention captures the hierarchical structure features and spatial relationship features of targets well, and obtains high values on all three indexes: Precision, Recall and F1-score.
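For reference, the three indexes can be computed from matched detections as in the sketch below. The greedy matching rule with an IoU threshold of 0.5 is an assumption; the matching rule actually used for Table 1 is not stated.

def iou(a, b):
    """Intersection over union of two boxes given as (min_row, min_col, max_row, max_col)."""
    r0, c0 = max(a[0], b[0]), max(a[1], b[1])
    r1, c1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, r1 - r0) * max(0, c1 - c0)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / float(area(a) + area(b) - inter) if inter else 0.0

def precision_recall_f1(pred_boxes, gt_boxes, thr=0.5):
    """Greedy one-to-one matching of predictions to ground truth at IoU >= thr."""
    matched_gt, tp = set(), 0
    for p in pred_boxes:
        best = max(range(len(gt_boxes)),
                   key=lambda g: iou(p, gt_boxes[g]), default=None)
        if best is not None and best not in matched_gt and iou(p, gt_boxes[best]) >= thr:
            tp += 1
            matched_gt.add(best)
    precision = tp / len(pred_boxes) if pred_boxes else 0.0
    recall = tp / len(gt_boxes) if gt_boxes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1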
Referring to fig. 3, fig. 3 exemplarily shows target detection results of the method of the present invention on the UAV123 data set. As shown in fig. 3, each row is a video sequence consisting of 4 consecutive video frames; the first row is a target detection task for detecting ships, the second row is a target detection task for detecting cars, and the third row is a target detection task for detecting pedestrians and cars. It can be seen from the figure that the method of the present invention can accurately complete detection tasks for objects of different viewing angles and categories, especially objects with complicated backgrounds, high density and small size, and is well suited to complex target detection in unmanned aerial vehicle aerial images.
Based on the same inventive concept as the method embodiment, the embodiment of the invention also provides a target detection system based on the deep neural network. The following describes a target detection system based on a deep neural network according to the present invention with reference to the accompanying drawings.
Referring to fig. 4, fig. 4 exemplarily shows the main structure of a deep neural network-based target detection system. As shown in fig. 4, the deep neural network-based target detection system may include: a feature extraction module 1 configured to extract deep features of different scales from video frames of a video to be detected based on a preset feature learning network; a superpixel segmentation module 2 configured to perform superpixel segmentation on the video frames to obtain superpixel structure diagrams corresponding to the video frames; a feature fusion module 3 configured to perform feature fusion on the deep features and the superpixel structure diagrams to obtain fusion features; a semantic classification module 4 configured to perform target semantic classification based on a preset conditional random field network and according to the fusion features to obtain target semantic labels; and a target detection module 5 configured to perform frame regression according to the target semantic labels to obtain target detection results.
Further, the superpixel segmentation module 2 is further configured to perform the following operations: performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm; calculating the pixel average value of each superpixel block obtained after superpixel segmentation; and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
Further, the system also includes a network training module configured to perform the following operation: performing network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion features.
Further, the network training module is further configured to optimize the network weights of the conditional random field network according to the method shown in formula (1).
Further, based on the above method embodiment, the embodiment of the present invention further provides a storage device, which stores a plurality of programs, and the programs are suitable for being loaded and executed by a processor to implement the above deep neural network-based target detection method.
Further, based on the foregoing method embodiment, an embodiment of the present invention further provides a processing apparatus, which includes a processor and a storage device. Wherein the processor may be adapted to execute the respective program, and the storage device may be adapted to store the plurality of programs, which are adapted to be loaded and executed by the processor to implement the above-mentioned deep neural network based object detection method.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process and the related descriptions of the apparatus according to the embodiment of the present invention may refer to the corresponding process in the method according to the foregoing embodiment, and have the same beneficial effects as the method described above, and will not be described again here.
Those of skill in the art will appreciate that the various illustrative method steps and systems described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both, and that the components and steps of the various examples have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of electronic hardware and software. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The above description covers the preferred embodiments of the present invention and the technical principles applied thereto. It will be apparent to those skilled in the art that any changes and modifications based on equivalent variations and simple substitutions of the technical solution of the present invention fall within the protection scope of the present invention, provided that they do not depart from the spirit and scope of the present invention.

Claims (6)

1. A target detection method based on a deep neural network is characterized by comprising the following steps:
extracting deep features of different scales of video frames in a video to be detected based on a preset feature learning network;
performing superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame;
performing feature fusion on the deep features and the super-pixel structure diagram to obtain fusion features;
carrying out target semantic classification based on a preset conditional random field network and according to the fusion characteristics to obtain target semantic labels;
performing frame regression according to the target semantic tag to obtain a target detection result;
before the step of performing target semantic classification based on the preset conditional random field network and according to the fusion features to obtain the target semantic label, the method further includes:
performing network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion characteristics;
the step of performing network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion characteristics comprises the following steps:
optimizing the network weight of the conditional random field network according to a method shown as the following formula:
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features, A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
2. The method for detecting a target based on a deep neural network as claimed in claim 1, wherein the step of performing superpixel segmentation on the video frame to obtain a superpixel structure diagram corresponding to the video frame comprises:
performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm;
calculating the pixel average value of each superpixel block obtained after superpixel segmentation;
and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
3. A deep neural network-based object detection system, the system comprising:
the characteristic extraction module is configured to extract deep characteristics of different scales of video frames in the video to be detected based on a preset characteristic learning network;
the super-pixel segmentation module is configured to perform super-pixel segmentation on the video frame to obtain a super-pixel structure diagram corresponding to the video frame;
the feature fusion module is configured to perform feature fusion on the deep features and the super-pixel structure diagram to obtain fusion features;
the semantic classification module is configured to perform target semantic classification based on a preset conditional random field network and according to the fusion features to obtain target semantic labels;
the target detection module is configured to perform frame regression according to the target semantic tag to obtain a target detection result;
the system also includes a network training module configured to perform the following operations:
performing network training on the conditional random field network by adopting a maximum conditional likelihood method based on preset fusion characteristics;
the network training module is further configured to optimize the network weights of the conditional random field network according to a method shown in the following formula:
\[ P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) = \frac{1}{Z\bigl(x^{(n)}, w\bigr)} \exp\Bigl( \sum_{i \in V} w_N^{T} A\bigl(x^{(n)}, y_i^{(n)}\bigr) + \sum_{e_{ij} \in E} w_E^{T} I\bigl(x^{(n)}, y_i^{(n)}, y_j^{(n)}\bigr) \Bigr) \]
\[ L(w) = \sum_{n=1}^{M} \log P\bigl(y^{(n)} \mid x^{(n)}, w\bigr) - \lambda \lVert w \rVert_2^{2} \]
\[ w^{*} = \arg\max_{w} L(w) \]
wherein V represents the set of superpixel blocks in the superpixel structure diagram, E represents the set of connection relations between adjacent superpixel blocks, e_ij represents the connection between the i-th and j-th superpixel blocks, x^(n) represents the n-th fusion feature, y_i^(n) represents the weight corresponding to the i-th superpixel block in the n-th fusion feature, y_j^(n) represents the weight corresponding to the j-th superpixel block in the n-th fusion feature, n = 1, 2, 3, ..., M, where M represents the number of fusion features, A(x^(n), y_i^(n)) represents the preset unit-term (unary) function of x^(n) and y_i^(n), I(x^(n), y_i^(n), y_j^(n)) represents the preset binary-term (pairwise) function of x^(n), y_i^(n) and y_j^(n), Z(x^(n), w) represents the preset conditional inference (partition) function based on x^(n) and w, c_i and c_j respectively represent the initial classification probability values corresponding to the i-th and j-th superpixel blocks, l_i and l_j respectively represent the classification categories corresponding to the i-th and j-th superpixel blocks, w represents the weights of the conditional random field network, with w = [w_N, w_E], w* represents the optimized value of w, w_N represents the weight of the preset unit-term function, w_E represents the weight of the preset binary-term function, T denotes the transpose of a vector or matrix, P_k(y_{k,a}) represents the probability distribution function of the k-th superpixel block belonging to the a-th preset category, y_{k,a} represents the probability that the k-th superpixel block belongs to the a-th preset category, γ_k represents the weight corresponding to the color information of the k-th superpixel block, λ represents a preset non-negative L2 regularization parameter, and ||w||_2^2 represents the square of the 2-norm.
4. The deep neural network-based object detection system of claim 3, wherein the superpixel segmentation module is further configured to perform operations comprising:
performing superpixel segmentation on the video frame based on a simple linear iterative clustering algorithm;
calculating the pixel average value of each superpixel block obtained after superpixel segmentation;
and acquiring the superpixel structure diagram according to the probability dependence relationships between each superpixel block and the other superpixel blocks based on the pixel average values.
5. A storage device in which a plurality of programs are stored, wherein the programs are adapted to be loaded and executed by a processor to implement the deep neural network based object detection method of any one of claims 1 to 2.
6. A processing apparatus, comprising:
a processor adapted to execute various programs; and
a storage device adapted to store a plurality of programs;
characterized in that the program is adapted to be loaded and executed by a processor to implement the deep neural network based object detection method of any one of claims 1 to 2.
CN201910167067.7A 2019-03-05 2019-03-05 Target detection method, system and device based on deep neural network Expired - Fee Related CN109934147B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910167067.7A CN109934147B (en) 2019-03-05 2019-03-05 Target detection method, system and device based on deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910167067.7A CN109934147B (en) 2019-03-05 2019-03-05 Target detection method, system and device based on deep neural network

Publications (2)

Publication Number Publication Date
CN109934147A CN109934147A (en) 2019-06-25
CN109934147B true CN109934147B (en) 2020-11-06

Family

ID=66986583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910167067.7A Expired - Fee Related CN109934147B (en) 2019-03-05 2019-03-05 Target detection method, system and device based on deep neural network

Country Status (1)

Country Link
CN (1) CN109934147B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633327B (en) * 2020-12-02 2023-06-30 西安电子科技大学 Staged metal surface defect detection method, system, medium, equipment and application
CN113627298A (en) * 2021-07-30 2021-11-09 北京百度网讯科技有限公司 Training method of target detection model and method and device for detecting target object

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103984953B (en) * 2014-04-23 2017-06-06 浙江工商大学 Semantic segmentation method based on multiple features fusion Yu the street view image of Boosting decision forests
US9239384B1 (en) * 2014-10-21 2016-01-19 Sandia Corporation Terrain detection and classification using single polarization SAR
CN107423278B (en) * 2016-05-23 2020-07-14 株式会社理光 Evaluation element identification method, device and system
CN106709924B (en) * 2016-11-18 2019-11-22 中国人民解放军信息工程大学 Image, semantic dividing method based on depth convolutional neural networks and super-pixel
CN107742133A (en) * 2017-11-08 2018-02-27 电子科技大学 A kind of sorting technique for Polarimetric SAR Image
CN108510521A (en) * 2018-02-27 2018-09-07 南京邮电大学 A kind of dimension self-adaption method for tracking target of multiple features fusion
CN108629286B (en) * 2018-04-03 2021-09-28 北京航空航天大学 Remote sensing airport target detection method based on subjective perception significance model
CN108898145A (en) * 2018-06-15 2018-11-27 西南交通大学 A kind of image well-marked target detection method of combination deep learning

Also Published As

Publication number Publication date
CN109934147A (en) 2019-06-25

Similar Documents

Publication Publication Date Title
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
JP7058669B2 (en) Vehicle appearance feature identification and vehicle search methods, devices, storage media, electronic devices
CN108510467B (en) SAR image target identification method based on depth deformable convolution neural network
Chen et al. Vehicle detection in high-resolution aerial images via sparse representation and superpixels
CN110310264A (en) A kind of large scale object detection method, device based on DCNN
Workman et al. A unified model for near and remote sensing
CN109919223B (en) Target detection method and device based on deep neural network
CN105989336B (en) Scene recognition method based on deconvolution deep network learning with weight
Han et al. Aerial image change detection using dual regions of interest networks
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
CN110008900B (en) Method for extracting candidate target from visible light remote sensing image from region to target
Vishal et al. Accurate localization by fusing images and GPS signals
Zelener et al. Cnn-based object segmentation in urban lidar with missing points
CN108734200A (en) Human body target visible detection method and device based on BING features
CN109934147B (en) Target detection method, system and device based on deep neural network
Zhang et al. Finding nonrigid tiny person with densely cropped and local attention object detector networks in low-altitude aerial images
Padmanabula et al. Object Detection Using Stacked YOLOv3.
CN112465854A (en) Unmanned aerial vehicle tracking method based on anchor-free detection algorithm
Malav et al. DHSGAN: An end to end dehazing network for fog and smoke
Yang et al. Toward country scale building detection with convolutional neural network using aerial images
Makantasis et al. Semi-supervised vision-based maritime surveillance system using fused visual attention maps
Tu et al. Detection of damaged rooftop areas from high-resolution aerial images based on visual bag-of-words model
CN117557780A (en) Target detection algorithm for airborne multi-mode learning
Samanta et al. Spatial-resolution independent object detection framework for aerial imagery
Wang et al. Oil tank detection via target-driven learning saliency model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201106