CN112200062A - Target detection method and device based on neural network, machine readable medium and equipment - Google Patents

Target detection method and device based on neural network, machine readable medium and equipment Download PDF

Info

Publication number
CN112200062A
Authority
CN
China
Prior art keywords
loss
network
target
distillation
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011069007.0A
Other languages
Chinese (zh)
Other versions
CN112200062B (en)
Inventor
姚志强
周曦
夏伯谦
钟南昌
於景瞵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Cloudwalk Artificial Intelligence Technology Co ltd
Original Assignee
Guangzhou Cloudwalk Artificial Intelligence Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Cloudwalk Artificial Intelligence Technology Co ltd filed Critical Guangzhou Cloudwalk Artificial Intelligence Technology Co ltd
Priority to CN202011069007.0A priority Critical patent/CN112200062B/en
Publication of CN112200062A publication Critical patent/CN112200062A/en
Application granted granted Critical
Publication of CN112200062B publication Critical patent/CN112200062B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/20Education
    • G06Q50/205Education administration or guidance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Educational Technology (AREA)
  • Tourism & Hospitality (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Strategic Management (AREA)
  • Biophysics (AREA)
  • Educational Administration (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • General Business, Economics & Management (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method based on a neural network, comprising the following steps: constructing a teacher network; training the teacher network through a sample image set; constructing a student network, wherein the parameter quantity of the student network is smaller than that of the teacher network; training the student network through the sample image set while using knowledge distillation to extract the knowledge obtained by the teacher network's training and transfer it to the student network; and performing target detection on an input image through the trained student network. The method automatically extracts the features required for the target through a deep neural network, avoiding the manual feature design of traditional approaches, which yields only low-level abstract features. Through knowledge distillation, the detection precision of the small network approaches that of the large network while a high detection speed is maintained, meeting the dual requirements of accuracy and real-time performance in target detection.

Description

Target detection method and device based on neural network, machine readable medium and equipment
Technical Field
The invention relates to the field of image processing, and in particular to a target detection method and device based on a neural network, a machine-readable medium, and a device.
Background
Head and shoulder detection has many application scenarios, such as crowd counting, density estimation and passenger flow statistics. In practical applications, however, clients place high demands not only on head and shoulder detection accuracy but also on its real-time performance.
Traditional head and shoulder detection methods usually adopt HOG feature extraction with an SVM classifier, or Haar feature extraction with an AdaBoost cascade. These traditional methods rely on manually designed features, are easily disturbed by factors such as illumination and occlusion, and consequently suffer reduced detection performance, making it difficult to meet the requirements of current practical application scenarios.
On the other hand, although some methods and techniques perform head and shoulder detection with deep neural networks, current techniques struggle to balance accuracy and real-time performance, that is, to obtain both high precision and fast inference. Existing head and shoulder detection may achieve better accuracy with a deeper network, but real-time performance is then hard to guarantee; with a small deep neural network, real-time performance improves, but detection accuracy is hard to guarantee.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, the present invention provides a method, an apparatus, a machine-readable medium and a device for object detection based on a neural network, which are used to solve the problems of the prior art.
To achieve the above and other related objects, the present invention provides a target detection method based on a neural network, including:
constructing a teacher network;
training the teacher network through a sample image set;
constructing a student network, wherein the parameter quantity of the student network is smaller than the parameter quantity of the teacher network;
in the process of extracting knowledge obtained by teacher network training by adopting knowledge distillation and transferring the knowledge to the student network, training the student network through a sample image set;
and performing target detection on the input image through the trained student network.
Optionally, in the knowledge distillation process, the feature map output by the teacher network is subjected to knowledge distillation, and the distilled knowledge is migrated to the student network.
Optionally, the teacher network includes a convolution unit, a batch normalization unit, a function activation unit and a pooling unit connected in sequence; the convolution unit comprises a plurality of convolution subunits connected in sequence, each convolution subunit outputting a feature map; each convolution subunit includes a plurality of stacked convolution layers.
Optionally, the knowledge distillation of the feature map output by the teacher network comprises: determining a target distillation region and performing knowledge distillation on the feature map of the target distillation region, wherein the target distillation region is determined as follows:
mapping the target frames marked in the sample image, at different scales, to the feature maps output by the corresponding convolution subunits;
constructing a matrix with the same size as the feature map;
judging whether a target frame exists in the feature map;
if a target frame exists in the feature map, setting the value of the corresponding area of the constructed matrix to 1, otherwise setting it to 0, forming a 0-1 distillation mask; the feature region where the distillation mask value is 1 is the target distillation region.
Optionally, the convergence result of the student network is judged according to a loss function, where the loss function is:

loss = loss_A + λ·loss_B

where loss_A is the detection loss of the student network on labeled data, loss_B is the distillation loss of the student network when extracting supervision information from the teacher network, and λ is a weight coefficient between the target detection loss and the distillation loss. loss_A includes the target frame center-point loss loss_xy, the target frame size loss loss_wh, the confidence loss loss_conf for whether the target frame contains a target, and the classification loss loss_cls, specifically:

loss_A = loss_xy + loss_wh + loss_conf + loss_cls
Optionally, loss_B is:

loss_B = (1/N) · Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{c=1}^{C} M_ij · (F^s_ijc - F^t_ijc)^2

where M_ij is the distillation mask; W, H and C denote the width, height and number of channels of the feature map, respectively; F^s is the feature map output by the student network and F^t is the feature map output by the teacher network; and N is the number of entries in the distillation mask equal to 1, i.e.

N = Σ_{i=1}^{W} Σ_{j=1}^{H} M_ij
To achieve the above and other related objects, the present invention provides an object detecting apparatus based on a neural network, including:
the first network construction module is used for constructing a teacher network;
a first training module for training the teacher network through a sample image set;
the second network construction module is used for constructing a student network, wherein the parameter quantity of the student network is smaller than the parameter quantity of the teacher network;
the second training module is used for training the student network through a sample image set in the process of extracting knowledge obtained by teacher network training by adopting knowledge distillation and transferring the knowledge to the student network;
and the target detection module is used for carrying out target detection on the input image through the trained student network.
Optionally, in the knowledge distillation process, the feature map output by the teacher network is subjected to knowledge distillation, and the distilled knowledge is migrated to the student network.
Optionally, the teacher network includes a convolution unit, a batch normalization unit, a function activation unit and a pooling unit connected in sequence; the convolution unit comprises a plurality of convolution subunits connected in sequence, each convolution subunit outputting a feature map; each convolution subunit includes a plurality of stacked convolution layers.
Optionally, the second training module comprises:
a distillation region determination submodule for determining a target distillation region;
a knowledge distillation submodule for performing knowledge distillation on the feature map of the target distillation region, wherein the distillation region determination submodule includes:
a mapping unit for mapping the target frames marked in the sample image, at different scales, to the corresponding feature maps;
a matrix construction unit for constructing a matrix with the same size as the feature map;
a target frame unit for judging whether a target frame exists in the feature map; if a target frame exists in the feature map, setting the value of the corresponding area of the constructed matrix to 1, otherwise setting it to 0, forming a 0-1 distillation mask; the feature region where the distillation mask value is 1 is the target distillation region.
Optionally, the convergence result of the student network is judged according to a loss function, where the loss function is:

loss = loss_A + λ·loss_B

where loss_A is the detection loss of the student network on labeled data, loss_B is the distillation loss of the student network when extracting supervision information from the teacher network, and λ is a weight coefficient between the target detection loss and the distillation loss. loss_A includes the target frame center-point loss loss_xy, the target frame size loss loss_wh, the confidence loss loss_conf for whether the target frame contains a target, and the classification loss loss_cls, specifically:

loss_A = loss_xy + loss_wh + loss_conf + loss_cls
Optionally, loss_B is:

loss_B = (1/N) · Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{c=1}^{C} M_ij · (F^s_ijc - F^t_ijc)^2

where M_ij is the distillation mask; W, H and C denote the width, height and number of channels of the feature map, respectively; F^s is the feature map output by the student network and F^t is the feature map output by the teacher network; and N is the number of entries in the distillation mask equal to 1, i.e.

N = Σ_{i=1}^{W} Σ_{j=1}^{H} M_ij
To achieve the above and other related objects, the present invention also provides an apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform one or more of the methods described previously.
To achieve the above objects and other related objects, the present invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform one or more of the methods described above.
As described above, the neural-network-based target detection method, device, machine-readable medium and equipment provided by the present invention have the following advantages:
the invention relates to a target detection method based on a neural network, which comprises the following steps: constructing a teacher network; training the teacher network through a sample image set; constructing a student network, wherein the parameter quantity of the student network is smaller than the parameter quantity of the teacher network; in the process of extracting knowledge obtained by teacher network training by adopting knowledge distillation and transferring the knowledge to the student network, training the student network through a sample image set; and performing target detection on the input image through the trained student network. The method automatically extracts the characteristics required by the target through the deep neural network, and avoids the problem of manually designing and extracting low-level abstract characteristics. Through knowledge distillation, the detection precision of the small network is close to that of the large network, and meanwhile, the high detection speed is guaranteed, so that the double requirements for accuracy and instantaneity in target detection are met.
Drawings
FIG. 1 is a flow chart of a target detection method based on a neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a target detection method based on a neural network according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a distillation process according to the present invention;
FIG. 4 is a flow chart of a method of determining a distillation zone according to one embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a target detection apparatus based on a neural network according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present invention.
Detailed Description
The present invention is described below through specific embodiments, and those skilled in the art can readily understand other advantages and effects of the present invention from the disclosure of this specification. The invention may also be implemented or applied through other, different embodiments, and the details in this specification may be modified or changed in various respects without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments merely illustrate the basic idea of the present invention. The drawings show only the components related to the present invention rather than the number, shape and size of the components in actual implementation; the type, quantity and proportion of the components in an actual implementation may vary freely, and the component layout may be more complicated.
As shown in fig. 1, a method for constructing a target detection model includes:
S11, constructing a teacher network;
S12, training the teacher network through a sample image set;
S13, constructing a student network, wherein the parameter quantity of the student network is smaller than that of the teacher network;
S14, in the process of extracting the knowledge obtained by the teacher network's training through knowledge distillation and transferring it to the student network, training the student network through a sample image set;
and S15, performing target detection on the input image through the trained student network.
The method automatically extracts the features required for the target through a deep neural network, avoiding the manual feature design of traditional detection methods, which extracts only low-level abstract features. Through knowledge distillation, the detection precision of the small network (student network) approaches that of the large network (teacher network) while a high detection speed is maintained, meeting the dual requirements of accuracy and real-time performance in target detection.
As shown in fig. 2, one such neural network may include a backbone network (backbone), a neck network (neck) and a head network (head). Each part is composed of operations such as convolution, batch normalization, function activation and pooling. In this embodiment, the backbone network shown in fig. 2 may serve as the teacher network, and the teacher network is trained with the sample image set. Of course, in another embodiment, the teacher network may be a complete, trained neural network used directly for target detection.
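By way of illustration only, the following sketch shows how such a backbone, neck and head composition might be expressed in PyTorch; the module boundaries are assumptions for illustration, not the patented structure:

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Illustrative backbone -> neck -> head composition (assumed structure)."""

    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone  # extracts feature maps from the input image
        self.neck = neck          # fuses and refines the backbone features
        self.head = head          # predicts target frames, confidence and classes

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        features = self.backbone(images)
        fused = self.neck(features)
        return self.head(fused)
```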
In step S14, knowledge distillation is a form of knowledge extraction: the knowledge learned by one neural network can be transferred to another neural network. The knowledge distillation process can be as shown in fig. 3, in which the middle rectangular box represents knowledge, Data denotes the data, Input layer the input layer, Hidden layers the hidden layers, Output layers the output layers, and Teacher Model the teacher model. In knowledge distillation, knowledge has no strict definition and is defined differently in different task scenarios; it includes relation-based knowledge, feature-based knowledge and response-based knowledge. This embodiment uses feature-based knowledge.
In this embodiment, the one neural network may be the teacher network, and the other neural network the student network. In general, the teacher network has strong capability and performance, while the student network is more compact. Through knowledge distillation, the student network is expected to approximate the teacher network as closely as possible, achieving similar predictive results with less complexity. The teacher network is typically a network with a large model size, complex structure and high computation cost but good performance; for example, it may adopt a network structure including, but not limited to, ResNet-152, DenseNet-264 or Darknet-53. The student network is a network with small size, simple structure, low computation cost and lower standalone performance; for example, it may adopt a network including, but not limited to, ResNet-18. Knowledge migration is thus performed between the teacher network and the student network through knowledge distillation, so that the student network can learn the supervision information of the teacher network. While learning this supervision information, the student network is also trained with the sample image set.
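For a rough sense of the size gap between the example networks named above, the sketch below compares parameter counts using the torchvision implementations of ResNet-152 (teacher-sized) and ResNet-18 (student-sized); the use of torchvision here is an assumption for illustration:

```python
import torchvision.models as models

teacher = models.resnet152()  # large, high-capacity network (randomly initialized)
student = models.resnet18()   # compact network with far fewer parameters

def param_count(model):
    return sum(p.numel() for p in model.parameters())

print(f"teacher: {param_count(teacher) / 1e6:.1f}M parameters")  # roughly 60M
print(f"student: {param_count(student) / 1e6:.1f}M parameters")  # roughly 12M
```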
In one embodiment, the teacher network comprises a convolution unit, a batch normalization unit, a function activation unit and a pooling unit connected in sequence. In the process of training the teacher network, a large and deep learning network is selected as the teacher network to extract the feature information in the images, and it is trained on a training set composed of sample images to obtain a teacher network with high detection accuracy. In the teacher network, the convolution unit comprises a plurality of convolution subunits connected in sequence, each outputting a feature map; each convolution subunit includes a plurality of stacked convolution layers. For example, if the convolution unit includes 20 convolution layers, layers 1 to 5 may form one convolution subunit, layers 6 to 15 another, and layers 16 to 20 a third, with each subunit outputting a feature map.
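The sketch below illustrates one way the 1-5 / 6-15 / 16-20 grouping of convolution layers into subunits could be written, with each subunit emitting its own feature map; the channel widths and the downsampling between subunits are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

def conv_subunit(in_ch: int, out_ch: int, num_layers: int) -> nn.Sequential:
    """A stack of convolution layers acting as one convolution subunit."""
    layers = []
    for i in range(num_layers):
        layers += [
            nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.MaxPool2d(2))  # assumed downsampling between subunits
    return nn.Sequential(*layers)

class TeacherBackbone(nn.Module):
    """Three subunits mirroring the 20-layer example: layers 1-5, 6-15, 16-20."""

    def __init__(self):
        super().__init__()
        self.sub1 = conv_subunit(3, 64, 5)     # convolution layers 1-5
        self.sub2 = conv_subunit(64, 128, 10)  # convolution layers 6-15
        self.sub3 = conv_subunit(128, 256, 5)  # convolution layers 16-20

    def forward(self, x: torch.Tensor):
        f1 = self.sub1(x)
        f2 = self.sub2(f1)
        f3 = self.sub3(f2)
        return [f1, f2, f3]  # one feature map per subunit
```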
In one embodiment, the student network comprises a convolution unit, a batch normalization unit, a function activation unit and a pooling unit connected in sequence. In the student network, the convolution unit comprises a plurality of convolution subunits connected in sequence, each outputting a feature map; each convolution subunit includes a plurality of stacked convolution layers. For example, if the convolution unit includes 10 convolution layers, layers 1 to 3 may form one convolution subunit, layers 4 to 7 another, and layers 8 to 10 a third, with each subunit outputting a feature map. It should be noted that the feature maps output by the student network's convolution subunits have the same sizes as those output by the teacher network's convolution subunits, in one-to-one correspondence: the feature map output by the first subunit of the student network has the same size as that output by the first subunit of the teacher network, and likewise for the second and third subunits.
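As a small illustrative check (the helper name and list-of-tensors convention are assumptions), this one-to-one size correspondence can be asserted before distillation:

```python
import torch

def check_feature_alignment(student_feats, teacher_feats):
    """Verify that the i-th student feature map matches the i-th teacher
    feature map in shape, as required by the element-wise distillation loss."""
    assert len(student_feats) == len(teacher_feats)
    for fs, ft in zip(student_feats, teacher_feats):
        assert fs.shape[1:] == ft.shape[1:], (fs.shape, ft.shape)  # ignore batch dim
```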
In the knowledge distillation process, knowledge distillation is carried out on the feature map output by the teacher network, and the distilled knowledge is migrated to the student network.
When performing knowledge distillation on the feature map output by the teacher network, note that an image contains not only the target region but also other regions, which can be understood as interference regions. To reduce the amount of computation, only the features of the target region may be learned during knowledge distillation; this region is called the target distillation region. Distilling only the target distillation region, on the one hand, reduces the noise introduced by distilling the full feature map, noise that would otherwise drown out the student network's extraction of the teacher network's supervision information; on the other hand, it avoids distilling all regions, which reduces the amount of computation.
Specifically, as shown in fig. 4, in the knowledge distillation process, the method for determining the target distillation region includes:
S41, mapping the target frames marked in the sample image, at different scales, to the feature maps output by the corresponding convolution subunits;
The different scales of the target frame are to be understood as follows: labeling a target in the sample image forms a target frame, and this target frame is transformed at different scales, for example 0.5x, 1.0x, 1.5x, and so on. Each scale corresponds to the feature map output by one convolution subunit of the teacher network. For example, the 1.5x target frame corresponds to the feature map output by the subunit composed of convolution layers 1 to 5, the 1.0x target frame to the feature map output by the subunit composed of layers 6 to 15, and the 0.5x target frame to the feature map output by the subunit composed of layers 16 to 20.
S42, constructing a matrix with the same size as the feature map;
S43, judging whether a target frame exists in the feature map;
if a target frame exists in the feature map, setting the value of the corresponding area of the constructed matrix to 1, otherwise setting it to 0, forming a 0-1 distillation mask. The feature region where the distillation mask value is 1 is the target distillation region, as sketched below.
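The sketch below shows one plausible implementation of this mask construction; the exact coordinate-mapping and rounding rules are not spelled out in the text, so those details here are assumptions:

```python
import torch

def build_distillation_mask(boxes, feat_h, feat_w, img_h, img_w, scale=1.0):
    """Build a 0-1 distillation mask on a feature-map grid: 1 inside any
    (rescaled) target frame, 0 elsewhere. `boxes` holds (x1, y1, x2, y2)
    corners in image coordinates."""
    mask = torch.zeros(feat_h, feat_w)
    sx, sy = feat_w / img_w, feat_h / img_h
    for x1, y1, x2, y2 in boxes:
        # Grow or shrink the frame around its center by the given scale.
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        w, h = (x2 - x1) * scale, (y2 - y1) * scale
        j1 = max(int((cx - w / 2) * sx), 0)
        j2 = min(int((cx + w / 2) * sx), feat_w - 1)
        i1 = max(int((cy - h / 2) * sy), 0)
        i2 = min(int((cy + h / 2) * sy), feat_h - 1)
        mask[i1:i2 + 1, j1:j2 + 1] = 1.0
    return mask

# Usage: one annotated frame on a 416x416 image mapped to a 52x52 feature map
m = build_distillation_mask([(100, 120, 220, 260)], 52, 52, 416, 416)
print(int(m.sum()))  # number of feature cells inside the target distillation region
```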
In one embodiment, the convergence result of the student network is determined according to a loss function, where the loss function is:

loss = loss_A + λ·loss_B

where loss_A is the detection loss of the student network on labeled data, loss_B is the distillation loss of the student network when extracting supervision information from the teacher network, and λ is a weight coefficient between the target detection loss and the distillation loss.

loss_A includes the target frame center-point loss loss_xy, the target frame size loss loss_wh, the confidence loss loss_conf for whether the target frame contains a target, and the classification loss loss_cls, specifically:

loss_A = loss_xy + loss_wh + loss_conf + loss_cls
loss_B is:

loss_B = (1/N) · Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{c=1}^{C} M_ij · (F^s_ijc - F^t_ijc)^2

where M_ij is the distillation mask; W, H and C denote the width, height and number of channels of the feature map, respectively; F^s is the feature map output by the student network and F^t is the feature map output by the teacher network; and N is the number of entries in the distillation mask equal to 1, i.e.

N = Σ_{i=1}^{W} Σ_{j=1}^{H} M_ij
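Reconstructed from the definitions above, a masked feature-imitation loss of this form might be implemented as follows; this is a sketch, and the 1/N normalization follows the stated definition of N (any constant factor can be absorbed into λ):

```python
import torch

def distillation_loss(f_s: torch.Tensor, f_t: torch.Tensor, mask: torch.Tensor):
    """Masked squared difference between student and teacher feature maps.
    f_s, f_t: (B, C, H, W) tensors of identical shape; mask: (B, H, W) 0-1 mask."""
    n = mask.sum().clamp(min=1.0)                       # N: count of 1-entries
    masked_sq = (f_s - f_t).pow(2) * mask.unsqueeze(1)  # broadcast mask over channels
    return masked_sq.sum() / n

# Usage with random tensors standing in for real feature maps
f_t = torch.randn(2, 64, 32, 32)              # teacher feature map
f_s = torch.randn(2, 64, 32, 32)              # student feature map, same shape
mask = (torch.rand(2, 32, 32) > 0.7).float()  # stand-in 0-1 distillation mask
loss_b = distillation_loss(f_s, f_t, mask)
```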
After the student network is constructed through the foregoing steps, it still needs to be trained. Specifically, the student network is trained on a training set containing the sample images, and the trained student network is obtained when the loss function loss reaches a minimum without overfitting. Target detection is then performed on the input image through the trained student network.
In one embodiment, the image to be detected is input into the trained student network for detection, yielding predicted target frames and a confidence score for each frame. A threshold is specified according to the application scenario; predicted frames with confidence scores below the threshold are discarded, and those at or above the threshold are kept, yielding the detected targets.
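A minimal sketch of this confidence filtering with stand-in predictions; the threshold value is application-dependent:

```python
import torch

def filter_detections(boxes: torch.Tensor, scores: torch.Tensor, threshold: float):
    """Keep predicted frames whose confidence score is at or above the threshold."""
    keep = scores >= threshold
    return boxes[keep], scores[keep]

# Usage with stand-in predictions
boxes = torch.tensor([[10.0, 10.0, 50.0, 60.0], [20.0, 30.0, 40.0, 80.0]])
scores = torch.tensor([0.92, 0.31])
kept_boxes, kept_scores = filter_detections(boxes, scores, threshold=0.5)
print(kept_boxes, kept_scores)  # only the 0.92 prediction survives
```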
In another embodiment, the teacher network in the neural network shown in fig. 2 is replaced with the trained student network to form the target detection model, and target detection is performed on the input image through this model.
In one embodiment, the sample images need to be preprocessed before the teacher network is trained with them. The sample images may be multiple images of one or more recognition objects, such as head and shoulder images, face images, plant images, building images, automobile images and the like. The preprocessing method comprises the following steps:
the image is subjected to dynamic brightness adjustment of different degrees, dynamic contrast adjustment, dynamic brightness adjustment, dynamic saturation adjustment, random rotation, fuzzy processing and the like so as to realize the combination of an online enhancement mode and an offline enhancement mode. Interference influence caused by factors such as light, blur and shielding is avoided to the maximum extent.
As shown in fig. 5, an object detection model construction apparatus includes:
a first network construction module 51 for constructing a teacher network;
a first training module 52 for training the teacher network through a sample image set;
a second network construction module 53, configured to construct a student network, where a parameter amount of the student network is smaller than a parameter amount of the teacher network;
a second training module 54, configured to train the student network through a sample image set in a process of extracting knowledge obtained by the teacher network training by using knowledge distillation and migrating the knowledge to the student network;
and the target detection module 55 is used for performing target detection on the input image through the trained student network.
In one embodiment, in the knowledge distillation process, the feature map output by the teacher network is subjected to knowledge distillation, and the distilled knowledge is migrated to the student network.
In one embodiment, the teacher network comprises a convolution unit, a batch normalization unit, a function activation unit and a pooling unit connected in sequence; the convolution unit comprises a plurality of convolution subunits connected in sequence, each convolution subunit outputting a feature map; each convolution subunit includes a plurality of stacked convolution layers.
In one embodiment, the second training module comprises:
a distillation region determination submodule for determining a target distillation region;
a knowledge distillation submodule for performing knowledge distillation on the feature map of the target distillation region, wherein the distillation region determination submodule includes:
a mapping unit for mapping the target frames marked in the sample image, at different scales, to the corresponding feature maps;
a matrix construction unit for constructing a matrix with the same size as the feature map;
a target frame unit for judging whether a target frame exists in the feature map; if a target frame exists in the feature map, setting the value of the corresponding area of the constructed matrix to 1, otherwise setting it to 0, forming a 0-1 distillation mask; the feature region where the distillation mask value is 1 is the target distillation region.
In one embodiment, the convergence result of the student network is determined according to a loss function, where the loss function is:

loss = loss_A + λ·loss_B

where loss_A is the detection loss of the student network on labeled data, loss_B is the distillation loss of the student network when extracting supervision information from the teacher network, and λ is a weight coefficient between the target detection loss and the distillation loss; loss_A includes the target frame center-point loss loss_xy, the target frame size loss loss_wh, the confidence loss loss_conf for whether the target frame contains a target, and the classification loss loss_cls, specifically: loss_A = loss_xy + loss_wh + loss_conf + loss_cls.
In one embodiment, loss_B is:

loss_B = (1/N) · Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{c=1}^{C} M_ij · (F^s_ijc - F^t_ijc)^2

where M_ij is the distillation mask; W, H and C denote the width, height and number of channels of the feature map, respectively; F^s is the feature map output by the student network and F^t is the feature map output by the teacher network; and N is the number of entries in the distillation mask equal to 1, i.e.

N = Σ_{i=1}^{W} Σ_{j=1}^{H} M_ij
In this embodiment, the embodiment of the apparatus corresponds to the embodiment of the method, and specific functions and technical effects are only referred to the embodiment, which is not described herein again.
An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the device may be used as a terminal device, and may also be used as a server, where examples of the terminal device may include: the mobile terminal includes a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, a wearable device, and the like.
The present application further provides a non-transitory readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, the device may be caused to execute instructions (instructions) of steps included in the method in fig. 1 according to the present application.
Fig. 6 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes a module for executing functions of each module in each device, and specific functions and technical effects may refer to the foregoing embodiments, which are not described herein again.
Fig. 7 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application. FIG. 7 is a specific embodiment of the implementation of FIG. 6. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication component 1203, power component 1204, multimedia component 1205, speech component 1206, input/output interfaces 1207, and/or sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the data processing method described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the speech component 1206 further comprises a speaker for outputting speech signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 involved in the embodiment of fig. 7 can be implemented as the input device in the embodiment of fig. 6.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the claims of the present invention.

Claims (14)

1. A target detection method based on a neural network is characterized by comprising the following steps:
constructing a teacher network;
training the teacher network through a sample image set;
constructing a student network, wherein the parameter quantity of the student network is smaller than the parameter quantity of the teacher network;
in the process of extracting knowledge obtained by teacher network training by adopting knowledge distillation and transferring the knowledge to the student network, training the student network through a sample image set;
and performing target detection on the input image through the trained student network.
2. The neural network-based object detection method of claim 1, wherein in the knowledge distillation process, the feature map output by the teacher network is subjected to knowledge distillation, and the distilled knowledge is migrated to the student network.
3. The neural network-based target detection method according to claim 1, wherein the teacher network comprises a convolution unit, a batch normalization unit, a function activation unit and a pooling unit connected in sequence; the convolution unit comprises a plurality of convolution subunits connected in sequence, each convolution subunit outputting a feature map; and each convolution subunit includes a plurality of stacked convolution layers.
4. The neural network-based object detection method of claim 2, wherein the knowledge distillation of the feature map output by the teacher network comprises: determining a target distillation region and performing knowledge distillation on the feature map of the target distillation region, wherein the target distillation region is determined by:
mapping the target frames marked in the sample image, at different scales, to the feature maps output by the corresponding convolution subunits;
constructing a matrix with the same size as the feature map;
judging whether a target frame exists in the feature map;
if a target frame exists in the feature map, setting the value of the corresponding area of the constructed matrix to 1, otherwise setting it to 0, forming a 0-1 distillation mask; and the feature region where the distillation mask value is 1 is the target distillation region.
5. The neural network-based object detection method according to claim 1, wherein the convergence result of the student network is judged according to a loss function, the loss function being:

loss = loss_A + λ·loss_B

where loss_A is the detection loss of the student network on labeled data, loss_B is the distillation loss of the student network when extracting supervision information from the teacher network, and λ is a weight coefficient between the target detection loss and the distillation loss; loss_A includes the target frame center-point loss loss_xy, the target frame size loss loss_wh, the confidence loss loss_conf for whether the target frame contains a target, and the classification loss loss_cls, specifically: loss_A = loss_xy + loss_wh + loss_conf + loss_cls.
6. The neural network-based object detection method of claim 5, wherein loss_B is:

loss_B = (1/N) · Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{c=1}^{C} M_ij · (F^s_ijc - F^t_ijc)^2

where M_ij is the distillation mask; W, H and C denote the width, height and number of channels of the feature map, respectively; F^s is the feature map output by the student network and F^t is the feature map output by the teacher network; and N is the number of entries in the distillation mask equal to 1, i.e.

N = Σ_{i=1}^{W} Σ_{j=1}^{H} M_ij
7. An object detection apparatus based on a neural network, comprising:
the first network construction module is used for constructing a teacher network;
a first training module for training the teacher network through a sample image set;
the second network construction module is used for constructing a student network, wherein the parameter quantity of the student network is smaller than the parameter quantity of the teacher network;
the second training module is used for training the student network through a sample image set in the process of extracting knowledge obtained by teacher network training by adopting knowledge distillation and transferring the knowledge to the student network;
and the target detection module is used for carrying out target detection on the input image through the trained student network.
8. The neural network-based object detection device of claim 7, wherein in the knowledge distillation process, the feature map output by the teacher network is subjected to knowledge distillation, and the distilled knowledge is migrated to the student network.
9. The neural network-based object detection device of claim 7, wherein the teacher network comprises a convolution unit, a batch normalization unit, a function activation unit and a pooling unit connected in sequence; the convolution unit comprises a plurality of convolution subunits connected in sequence, each convolution subunit outputting a feature map; and each convolution subunit includes a plurality of stacked convolution layers.
10. The neural network-based object detection device of claim 8, wherein the second training module comprises:
a distillation region determination submodule for determining a target distillation region;
a knowledge distillation submodule for performing knowledge distillation on the feature map of the target distillation region, wherein the distillation region determination submodule includes:
a mapping unit for mapping the target frames marked in the sample image, at different scales, to the corresponding feature maps;
a matrix construction unit for constructing a matrix with the same size as the feature map;
a target frame unit for judging whether a target frame exists in the feature map; if a target frame exists in the feature map, setting the value of the corresponding area of the constructed matrix to 1, otherwise setting it to 0, forming a 0-1 distillation mask; and the feature region where the distillation mask value is 1 is the target distillation region.
11. The neural network-based object detecting apparatus of claim 8, wherein the convergence result of the student network is judged according to a loss function, the loss function being:

loss = loss_A + λ·loss_B

where loss_A is the detection loss of the student network on labeled data, loss_B is the distillation loss of the student network when extracting supervision information from the teacher network, and λ is a weight coefficient between the target detection loss and the distillation loss; loss_A includes the target frame center-point loss loss_xy, the target frame size loss loss_wh, the confidence loss loss_conf for whether the target frame contains a target, and the classification loss loss_cls, specifically: loss_A = loss_xy + loss_wh + loss_conf + loss_cls.
12. The neural network-based object detection device of claim 11, wherein loss_B is:

loss_B = (1/N) · Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{c=1}^{C} M_ij · (F^s_ijc - F^t_ijc)^2

where M_ij is the distillation mask; W, H and C denote the width, height and number of channels of the feature map, respectively; F^s is the feature map output by the student network and F^t is the feature map output by the teacher network; and N is the number of entries in the distillation mask equal to 1, i.e.

N = Σ_{i=1}^{W} Σ_{j=1}^{H} M_ij
13. An apparatus, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method recited by one or more of claims 1-6.
14. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-6.
CN202011069007.0A 2020-09-30 2020-09-30 Target detection method and device based on neural network, machine readable medium and equipment Active CN112200062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011069007.0A CN112200062B (en) 2020-09-30 2020-09-30 Target detection method and device based on neural network, machine readable medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011069007.0A CN112200062B (en) 2020-09-30 2020-09-30 Target detection method and device based on neural network, machine readable medium and equipment

Publications (2)

Publication Number Publication Date
CN112200062A true CN112200062A (en) 2021-01-08
CN112200062B CN112200062B (en) 2021-09-28

Family

ID=74014062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011069007.0A Active CN112200062B (en) 2020-09-30 2020-09-30 Target detection method and device based on neural network, machine readable medium and equipment

Country Status (1)

Country Link
CN (1) CN112200062B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712052A (en) * 2021-01-13 2021-04-27 安徽水天信息科技有限公司 Method for detecting and identifying weak target in airport panoramic video
CN112801236A (en) * 2021-04-14 2021-05-14 腾讯科技(深圳)有限公司 Image recognition model migration method, device, equipment and storage medium
CN112949766A (en) * 2021-04-07 2021-06-11 成都数之联科技有限公司 Target area detection model training method, system, device and medium
CN113033767A (en) * 2021-02-19 2021-06-25 北京大学 Knowledge distillation-based data compression recovery method and system for neural network
CN113095251A (en) * 2021-04-20 2021-07-09 清华大学深圳国际研究生院 Human body posture estimation method and system
CN113159073A (en) * 2021-04-23 2021-07-23 上海芯翌智能科技有限公司 Knowledge distillation method and device, storage medium and terminal
CN113221709A (en) * 2021-04-30 2021-08-06 芜湖美的厨卫电器制造有限公司 Method and device for recognizing user movement and water heater
CN113657483A (en) * 2021-08-14 2021-11-16 北京百度网讯科技有限公司 Model training method, target detection method, device, equipment and storage medium
CN114842449A (en) * 2022-05-10 2022-08-02 安徽蔚来智驾科技有限公司 Target detection method, electronic device, medium, and vehicle
CN115019060A (en) * 2022-07-12 2022-09-06 北京百度网讯科技有限公司 Target recognition method, and training method and device of target recognition model
CN115082880A (en) * 2022-05-25 2022-09-20 安徽蔚来智驾科技有限公司 Target detection method, electronic device, medium, and vehicle
CN115527083A (en) * 2022-09-27 2022-12-27 中电金信软件有限公司 Image annotation method and device and electronic equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN108830813A (en) * 2018-06-12 2018-11-16 福建帝视信息科技有限公司 A kind of image super-resolution Enhancement Method of knowledge based distillation
CN108898168A (en) * 2018-06-19 2018-11-27 清华大学 The compression method and system of convolutional neural networks model for target detection
CN108921294A (en) * 2018-07-11 2018-11-30 浙江大学 A kind of gradual piece of knowledge distillating method accelerated for neural network
CN110059740A (en) * 2019-04-12 2019-07-26 杭州电子科技大学 A kind of deep learning semantic segmentation model compression method for embedded mobile end
CN110097084A (en) * 2019-04-03 2019-08-06 浙江大学 Pass through the knowledge fusion method of projection feature training multitask student network
CN110163344A (en) * 2019-04-26 2019-08-23 北京迈格威科技有限公司 Neural network training method, device, equipment and storage medium
CN110443784A (en) * 2019-07-11 2019-11-12 中国科学院大学 A kind of effective conspicuousness prediction model method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107247989A (en) * 2017-06-15 2017-10-13 北京图森未来科技有限公司 A kind of neural network training method and device
CN108764462A (en) * 2018-05-29 2018-11-06 成都视观天下科技有限公司 A kind of convolutional neural networks optimization method of knowledge based distillation
CN108830813A (en) * 2018-06-12 2018-11-16 福建帝视信息科技有限公司 A kind of image super-resolution Enhancement Method of knowledge based distillation
CN108898168A (en) * 2018-06-19 2018-11-27 清华大学 The compression method and system of convolutional neural networks model for target detection
CN108921294A (en) * 2018-07-11 2018-11-30 浙江大学 A kind of gradual piece of knowledge distillating method accelerated for neural network
CN110097084A (en) * 2019-04-03 2019-08-06 浙江大学 Pass through the knowledge fusion method of projection feature training multitask student network
CN110059740A (en) * 2019-04-12 2019-07-26 杭州电子科技大学 A kind of deep learning semantic segmentation model compression method for embedded mobile end
CN110163344A (en) * 2019-04-26 2019-08-23 北京迈格威科技有限公司 Neural network training method, device, equipment and storage medium
CN110443784A (en) * 2019-07-11 2019-11-12 中国科学院大学 A kind of effective conspicuousness prediction model method

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112712052A (en) * 2021-01-13 2021-04-27 安徽水天信息科技有限公司 Method for detecting and identifying weak target in airport panoramic video
CN113033767A (en) * 2021-02-19 2021-06-25 北京大学 Knowledge distillation-based data compression recovery method and system for neural network
CN112949766A (en) * 2021-04-07 2021-06-11 成都数之联科技有限公司 Target area detection model training method, system, device and medium
CN112801236A (en) * 2021-04-14 2021-05-14 腾讯科技(深圳)有限公司 Image recognition model migration method, device, equipment and storage medium
CN113095251B (en) * 2021-04-20 2022-05-27 清华大学深圳国际研究生院 Human body posture estimation method and system
CN113095251A (en) * 2021-04-20 2021-07-09 清华大学深圳国际研究生院 Human body posture estimation method and system
CN113159073A (en) * 2021-04-23 2021-07-23 上海芯翌智能科技有限公司 Knowledge distillation method and device, storage medium and terminal
CN113221709A (en) * 2021-04-30 2021-08-06 芜湖美的厨卫电器制造有限公司 Method and device for recognizing user movement and water heater
CN113221709B (en) * 2021-04-30 2022-11-25 芜湖美的厨卫电器制造有限公司 Method and device for identifying user motion and water heater
CN113657483A (en) * 2021-08-14 2021-11-16 北京百度网讯科技有限公司 Model training method, target detection method, device, equipment and storage medium
CN114842449A (en) * 2022-05-10 2022-08-02 安徽蔚来智驾科技有限公司 Target detection method, electronic device, medium, and vehicle
CN115082880A (en) * 2022-05-25 2022-09-20 安徽蔚来智驾科技有限公司 Target detection method, electronic device, medium, and vehicle
CN115019060A (en) * 2022-07-12 2022-09-06 北京百度网讯科技有限公司 Target recognition method, and training method and device of target recognition model
CN115527083A (en) * 2022-09-27 2022-12-27 中电金信软件有限公司 Image annotation method and device and electronic equipment

Also Published As

Publication number Publication date
CN112200062B (en) 2021-09-28

Similar Documents

Publication Publication Date Title
CN112200062B (en) Target detection method and device based on neural network, machine readable medium and equipment
US10936919B2 (en) Method and apparatus for detecting human face
CN109543714B (en) Data feature acquisition method and device, electronic equipment and storage medium
CN111444826B (en) Video detection method, device, storage medium and computer equipment
CN112200318B (en) Target detection method, device, machine readable medium and equipment
CN111539412B (en) Image analysis method, system, device and medium based on OCR
CN112052186A (en) Target detection method, device, equipment and storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
WO2020244151A1 (en) Image processing method and apparatus, terminal, and storage medium
CN111950570B (en) Target image extraction method, neural network training method and device
CN111739027A (en) Image processing method, device and equipment and readable storage medium
CN113050860B (en) Control identification method and related device
WO2022161302A1 (en) Action recognition method and apparatus, device, storage medium, and computer program product
US20230035366A1 (en) Image classification model training method and apparatus, computer device, and storage medium
WO2023197648A1 (en) Screenshot processing method and apparatus, electronic device, and computer readable medium
CN111310725A (en) Object identification method, system, machine readable medium and device
CN113642359B (en) Face image generation method and device, electronic equipment and storage medium
CN113515994A (en) Video feature extraction method, device, equipment and storage medium
CN111914850B (en) Picture feature extraction method, device, server and medium
CN112036307A (en) Image processing method and device, electronic equipment and storage medium
CN111818364B (en) Video fusion method, system, device and medium
CN116434253A (en) Image processing method, device, equipment, storage medium and product
CN111079472A (en) Image comparison method and device
CN111144510B (en) Image semantic recognition method, system, device and medium based on multiple models
CN113196279A (en) Face attribute identification method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A neural network-based object detection method, device, machine readable medium, and device

Effective date of registration: 20230918

Granted publication date: 20210928

Pledgee: Bank of China Co.,Ltd. Nansha Branch of Guangdong Free Trade Pilot Area

Pledgor: GUANGZHOU CLOUDWALK ARTIFICIAL INTELLIGENCE TECHNOLOGY Co.,Ltd.

Registration number: Y2023980057268