CN110084313A - A method of generating object detection model - Google Patents

A method of generating object detection model Download PDF

Info

Publication number
CN110084313A
CN110084313A
Authority
CN
China
Prior art keywords
detection model
object detection
frame
prediction
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910369470.8A
Other languages
Chinese (zh)
Inventor
齐子铭
李启东
陈裕潮
张伟
李志阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meitu Technology Co Ltd
Original Assignee
Xiamen Meitu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meitu Technology Co Ltd filed Critical Xiamen Meitu Technology Co Ltd
Priority to CN201910369470.8A priority Critical patent/CN110084313A/en
Publication of CN110084313A publication Critical patent/CN110084313A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method of generating an object detection model, comprising: obtaining a training image containing annotation data, the annotation data being the positions and categories of target objects in the training image; inputting the training image into a pre-trained object detection model for processing, the object detection model comprising a feature extraction module and a prediction module coupled to each other, wherein the feature extraction module comprises deep residual network units and convolution processing units and is adapted to perform convolution processing on the training image to generate at least one feature map, and the prediction module is adapted to predict the categories and positions of target objects from the at least one feature map; and training the pre-trained object detection model based on the annotated and predicted object categories and positions, taking the trained object detection model as the generated object detection model.

Description

A method of generating object detection model
Technical field
The present invention relates to the technical field of computer vision, and more particularly to a method of generating an object detection model, an object detection method, a computing device, and a storage medium.
Background technique
Object detection is the basis of many computer vision tasks. It is used to locate and identify one or more targets in an input image, and is widely applied in fields such as scene content understanding, video surveillance, content-based image retrieval, robot navigation, and augmented reality.
Traditional object detection methods generally have three stages. First, candidate box regions are extracted: a sliding window traverses the whole image to obtain positions where objects are likely to appear. Then, features are extracted from these candidate box regions; common methods include SIFT (scale-invariant feature transform) and HOG (histograms of oriented gradients). Finally, the features are fed into a classifier; common classifiers include SVM (support vector machines) and Adaboost (an iterative algorithm). Traditional object detection methods have high time complexity and redundant windows, require hand-designed features, and are not robust to the diverse variations of objects.
Object detection methods based on deep learning have made important progress in recent years. The mainstream approaches fall into two types. One type, the two-stage algorithms based on region proposals, divides the detection problem into two stages: first, a series of sparse candidate boxes is generated by heuristic methods, and then these candidate boxes are classified and regressed. Typical examples include R-CNN (region-based convolutional neural networks), SPPNet (spatial pyramid pooling network), and various improved algorithms of the R-CNN series. This approach has higher detection accuracy but slower computation. The other type is the end-to-end one-stage algorithms, which need no region-extraction stage and directly generate the class probabilities and position coordinates of objects: positions in the image are densely and uniformly sampled, different scales and aspect ratios may be used when sampling, and after features are extracted with a convolutional neural network, classification and regression are performed directly. Typical examples include YOLO and SSD. This approach detects quickly but with lower accuracy.
Therefore, an object detection method is needed that can improve the computing speed and accuracy of the model while reducing the model size.
Summary of the invention
To this end, the present invention provides a method of generating an object detection model, in an effort to solve, or at least alleviate, at least one of the problems above.
According to one aspect of the invention, a method of generating an object detection model is provided, the method being adapted to be executed in a computing device and comprising the following steps. First, a training image containing annotation data is obtained, the annotation data being the positions and categories of target objects in the training image. Then, the training image is input into a pre-trained object detection model for processing; the object detection model comprises a feature extraction module and a prediction module coupled to each other, wherein the feature extraction module comprises deep residual network units and convolution processing units and is adapted to perform convolution processing on the training image to generate at least one feature map, and the prediction module is adapted to predict the categories and positions of target objects from the at least one feature map. Finally, the pre-trained object detection model is trained based on the annotated and predicted object categories and positions, and the trained object detection model is taken as the generated object detection model.
Optionally, in the above method, the deep residual network unit comprises multiple mutually coupled convolution processing layers with 3*3 kernels and a skip connection layer, the skip connection layer being adapted to add the feature maps output by two coupled convolution processing layers and output the sum.
Optionally, in the above method, the convolution processing layer comprises a convolutional layer, a batch normalization layer, and an activation layer, wherein the batch normalization layer is merged into the convolutional layer.
Optionally, in the above method, the prediction module comprises a category prediction unit and a position prediction unit; the category prediction unit is adapted to output the category confidence of each object in the image, and the position prediction unit is adapted to output the predicted positions of target objects in the image.
Optionally, in the above method, the annotated position of a target object is the feature-point coordinates of the target object or its ground-truth object box.
Optionally, in the above method, the prediction module further comprises a candidate box generation unit and a candidate box matching unit. The candidate box generation unit is adapted to generate, for each feature map output by the feature extraction module, multiple corresponding candidate boxes according to different sizes and aspect ratios; the candidate box matching unit is adapted to select the candidate boxes that match ground-truth object boxes, so that prediction is based on the matched candidate boxes.
Optionally, in the above method, the parameters of the object detection model are updated based on the localization loss between the annotated ground-truth box positions and the predicted box positions and the confidence loss between the annotated categories and the predicted category confidences; when the weighted sum of the localization loss and the confidence loss meets a predetermined condition, training ends.
Optionally, in the above method, the weighted sum of the localization loss and the confidence loss is calculated based on the following formula:
where L_loc is the localization loss, L_conf is the confidence loss, N is the number of matched candidate boxes, α is a weight coefficient, g is the position of the ground-truth object box, l is the position of the predicted object box, x is the annotated category, and c is the category confidence.
Optionally, in the above method, the localization loss is calculated based on the following formula:
where i is the index of a predicted object box, j is the index of a ground-truth object box, cx and cy are the center of a candidate box, w and h are the width and height of the candidate box, m ranges over the candidate box's position parameters, ĝ_j^m is the position deviation between the i-th predicted object box and the j-th ground-truth object box, Pos denotes the positive-sample candidate boxes in the training image, N denotes the number of matched candidate boxes, and x_ij^k indicates whether the i-th predicted object box matches the j-th ground-truth object box with respect to category k.
Optionally, in the above method, the confidence loss is calculated based on the following formula:
where i is the index of a predicted object box, j is the index of a ground-truth object box, N denotes the number of matched candidate boxes, Pos denotes the positive-sample candidate boxes in the training image, Neg denotes the negative-sample candidate boxes in the training image, ĉ_i^p denotes the softmax probability that the prediction is of category p, c_i^p denotes the category confidence of the i-th predicted object box for category p, and x_ij^p indicates whether the i-th predicted object box matches the j-th ground-truth object box with respect to category p.
Optionally, in the above method, the pre-trained object detection model is generated based on an image dataset, wherein the image dataset contains at least images of every object category in the training images, and the object categories in the training images include cat face, dog face, human face, and background.
Optionally, in the above method, data augmentation and normalization are performed on the training images.
Optionally, in the above method, the data augmentation includes any one or more of flipping, rotation, color jitter, random cropping, random brightness adjustment, random contrast adjustment, and blurring.
According to a further aspect of the present invention, an object detection method is provided: an image to be detected is input into an object detection model to obtain the position and category of each object box in the image, wherein the object detection model is generated using the method described above.
According to another aspect of the invention, a computing device is provided, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any one of the methods described above.
In accordance with a further aspect of the present invention, a computer-readable storage medium storing one or more programs is provided; the one or more programs include instructions which, when executed by a computing device, cause the computing device to perform any one of the methods described above.
According to the scheme of the present invention, the object detection model comprises a feature extraction module and a prediction module coupled to each other, and the convolutional layers of each module use fewer channels, which reduces the size of the model. Further, the object detection model uses deep residual network units, so that low-level features can be fused into the layer above, improving the accuracy and speed of model detection. Therefore, the object detection model provided by this scheme can match the computational efficiency and memory of a mobile terminal while satisfying the requirements of object detection precision.
Detailed description of the invention
To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in conjunction with the following description and drawings. These aspects indicate the various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout the disclosure, the same reference numerals generally refer to the same components or elements.
Fig. 1 shows a schematic diagram of the organization of a computing device 100 according to one embodiment of the invention;
Fig. 2 shows a schematic structural diagram of an object detection model 200 according to one embodiment of the invention;
Fig. 3 shows a schematic network structure of a deep residual network unit 300 according to one embodiment of the invention;
Fig. 4 shows a schematic flowchart of a method 400 of generating an object detection model according to one embodiment of the invention;
Fig. 5 shows a schematic diagram of a training image containing annotation data according to one embodiment of the invention;
Fig. 6 shows a schematic diagram of image data augmentation according to one embodiment of the invention.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided to enable a more thorough understanding of the present invention and to fully convey the scope of the disclosure to those skilled in the art.
Object detection aims to mark out, with boxes, the positions and categories of objects in an image. An SSD-based object detection model performs recognition on feature maps at different levels and can therefore cover more scales. Generally, an SSD object detection model comprises a VGG base network and a pyramid network. Because VGG has a deep network structure of 16 or 19 layers, the parameter count of the model is large and cannot meet the requirements of mobile terminals. In order to realize real-time object detection and make the model meet the memory and computing-speed requirements of mobile terminals, this scheme improves the network structure of the SSD object detection model so as to reduce the model size, improve detection accuracy, and raise computing speed, satisfying real-time object detection on mobile terminals.
Fig. 1 is a block diagram of an example computing device 100. In a basic configuration 102, the computing device 100 typically comprises a system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processors 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 104 may include one or more levels of cache, such as a level-1 cache 110 and a level-2 cache 112, a processor core 114, and registers 116. An example processor core 114 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 118 may be used together with the processor 104, or in some implementations the memory controller 118 may be an internal part of the processor 104.
Depending on the desired configuration, the system memory 106 may be any type of memory, including but not limited to volatile memory (RAM), non-volatile memory (ROM, flash memory, etc.), or any combination thereof. The system memory 106 may include an operating system 120, one or more applications 122, and program data 124. In some embodiments, the applications 122 may be arranged to operate with the program data 124 on the operating system. In some embodiments, the computing device 100 is configured to perform the method 400 of generating an object detection model, and the program data 124 contains instructions for performing the method 400.
The computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (for example, output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via a bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication via one or more I/O ports 158 with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or image input device) or other peripherals (for example, a printer or scanner). An example communication device 146 may include a network controller 160, which may be arranged to facilitate communication with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. A communication medium may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. As non-limiting examples, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used herein may include both storage media and communication media. In some embodiments, one or more programs are stored in a computer-readable medium, the one or more programs including instructions for performing certain methods.
The computing device 100 may be implemented as part of a small-sized portable (or mobile) electronic device, such as a cellular phone, a digital camera, a personal digital assistant (PDA), a personal media player device, a wireless web-browsing device, a personal headset device, an application-specific device, or a hybrid device including any of the above functions. Of course, the computing device 100 may also be implemented as a personal computer including both desktop and notebook configurations, or as a server having the above configuration. Embodiments of the present invention are not limited in this regard.
Before model training, the network structure and parameters of the model need to be configured. Fig. 2 shows a schematic structural diagram of an object detection model 200 according to one embodiment of the invention. As shown in Fig. 2, the object detection model 200 comprises a feature extraction module 210 and a prediction module 220 coupled to each other. The feature extraction module 210 comprises deep residual network units and convolution processing units and is adapted to perform convolution processing on an input image to generate at least one feature map. The prediction module 220 comprises a candidate box generation unit 221, a candidate box matching unit 222, a category prediction unit 223, and a position prediction unit 224. The candidate box generation unit 221 is adapted to generate, for each feature map output by the feature extraction module 210, multiple corresponding candidate boxes according to different sizes and aspect ratios. The candidate box matching unit 222 is adapted to select the candidate boxes that match ground-truth object boxes, so that prediction is based on the matched candidate boxes. The category prediction unit 223 is adapted to output the category confidence of each object in the image, and the position prediction unit 224 is adapted to output the positions of predicted object boxes in the image.
For an object detection model, simply increasing depth can cause the network's accuracy to decline. The reason the error rises is that the deeper the network, the more obvious the vanishing-gradient phenomenon becomes: during back-propagation the gradient cannot be effectively passed to the front network layers, the parameters of the front layers cannot be updated, and training and test performance deteriorate. If the network is designed as H(x) = F(x) + x, the problem can be converted into learning a residual function F(x) = H(x) − x: as long as F(x) = 0, the identity mapping H(x) = x is formed. A deep residual network adds such an identity mapping and transmits the current output directly to the next layer of the network, which is equivalent to taking a shortcut that skips this layer's operations; this direct connection is named a "skip connection". During back-propagation, it likewise passes the gradient of the lower layer directly to the upper layer, thereby solving the vanishing-gradient problem of deep networks.
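For illustration, a minimal PyTorch-style sketch of one such residual unit follows; the channel count and exact layer arrangement are assumptions for illustration, not parameters taken from Table 1:

```python
import torch.nn as nn

class ResidualUnit(nn.Module):
    """Two 3*3 convolution processing layers plus a skip connection:
    output = ReLU(F(x) + x), so learning F(x) = 0 yields the identity H(x) = x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # the skip connection adds the input back
```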
According to one embodiment of the present invention, the feature extraction module may use ResNet deep residual network units. Table 1 shows partial network parameters of the feature extraction module 210 according to one embodiment of the invention. The layers numbered Conv_1, layer_19_2_2, layer_19_2_3, layer_19_2_4, and layer_19_2_5 are convolution processing units; conv1, conv2, and Conv2_sum constitute one deep residual network unit; conv_3, conv_4, and Conv_4_sum constitute one deep residual network unit; conv_5, conv_6, and Conv_6_sum constitute one deep residual network unit; conv_7, conv_8, and Conv_8_sum constitute one deep residual network unit; and conv_9, conv_10, and Conv_10_sum constitute one deep residual network unit.
In Table 1, Conv is a convolutional layer, BN is a batch normalization layer, and ReLU denotes the activation function of an activation layer. Sum is a skip connection layer. kh and kw denote the height and width of the convolution kernel respectively, padding is the padding value, stride is the convolution stride, num_output denotes the number of outputs (channels) of the layer, and group denotes grouped convolution, with group=1 meaning no grouping.
Table 1: Partial network parameters of the feature extraction module
As can be seen from Table 1, the feature extraction module includes multiple ResNet units and convolution processing units. Each ResNet unit includes two mutually coupled convolution processing layers with 3*3 kernels and a skip connection layer, and each convolution processing layer includes a convolutional layer, a batch normalization layer, and an activation layer. When training a neural network model, the batch normalization layer can speed up network convergence and control overfitting; it is generally placed after the convolutional layer and before the activation layer. Although the BN layer plays a positive role during training, the extra layer operations affect the performance of the model during forward inference and occupy more memory or video memory. Therefore, the batch normalization layer can be merged into the convolutional layer, which improves the computing speed of the model and suits real-time object detection on mobile terminals. The activation layer uses the ReLU activation function; any type of activation function such as leakyReLU, tanh, or sigmoid may also be used, without limitation here. In the lightweight convolution unit, features are extracted with convolution kernels of different sizes, and connecting two outputs together can raise the feature dimension. For example, Conv2_sum connects the feature maps output by conv1 (3*3 kernel) and conv2 (3*3 kernel).
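A sketch of the merge: since BN is an affine transform at inference time, its per-channel constants (scale γ, shift β, running mean μ, and variance σ²) fold into the preceding convolution's weights and bias. The function below is illustrative, assuming NumPy arrays in the usual layout:

```python
import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN(conv(x)) into a single convolution conv'(x).
    w: (out_ch, in_ch, kh, kw) conv weights; b: (out_ch,) conv bias."""
    scale = gamma / np.sqrt(var + eps)         # per-output-channel rescaling
    w_folded = w * scale[:, None, None, None]  # rescale each output filter
    b_folded = (b - mean) * scale + beta       # shift and rescale the bias
    return w_folded, b_folded
```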
As described above, each processing layer in the feature extraction module 210 can output a corresponding feature map. According to an embodiment of the present invention, at least one feature map is extracted from these outputs for the prediction module 220 to perform position and category prediction. In one embodiment, as shown in Table 1, the feature maps output by the six processing layers numbered conv_8_sum, Conv_1, layer_19_2_2, layer_19_2_3, layer_19_2_4, and layer_19_2_5 are extracted.
The deep residual network unit realizes shortcut connections by way of skip connections, which adds no extra parameters or computation to the network yet greatly increases the training speed of the model and improves the training effect; and when the number of layers of the model is deepened, the residual structure solves the degradation problem well. Fig. 3 shows a schematic diagram of a deep residual network unit 300 according to one embodiment of the invention. As shown in Fig. 3, the deep residual network unit comprises multiple mutually coupled convolution processing layers with 3*3 kernels and a skip connection layer, the skip connection layer being adapted to add the feature maps of two coupled convolution processing layers. In a residual network, a dimension-matched shortcut is drawn as a solid-line connection, otherwise as a dotted-line connection. When the dimensions do not match, there are two options for the identity mapping: directly increasing the dimension with zero padding, or using a projection (such as a 1*1 convolution) to match the dimension.
The prediction module 220 may include a category prediction unit 223 and a position prediction unit 224. Tables 2 and 3 show the network parameters of the position prediction unit and the category prediction unit, respectively, according to one embodiment of the invention. According to one embodiment of the present invention, the prediction module 220 further includes a candidate box generation unit 221 and a candidate box matching unit 222, where the candidate box generation unit is adapted to generate, for each feature map output by the feature extraction module 210, multiple corresponding candidate boxes according to different sizes and aspect ratios, and the candidate box matching unit is adapted to select the candidate boxes that match ground-truth object boxes, so that prediction is based on the matched candidate boxes.
Table 2: Network parameters of the position prediction unit
Table 3: Partial network parameters of the category prediction unit
Here, the mbox blocks are the candidate boxes, generated from the feature maps extracted in the feature extraction module, that match ground-truth object boxes. The role of a Concat layer is to splice two or more feature maps along the channel dimension, stitching together feature maps of the same size. Table 4 shows the network parameters of the candidate box generation unit according to one embodiment of the invention, where PriorBox denotes the generated candidate boxes, aspect_ratio denotes the aspect ratios of the generated candidate boxes, min_size is the smallest scale of the generated candidate boxes, and max_size is the largest scale of the generated candidate boxes.
Table 4: Network parameters of the candidate box generation unit
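A sketch of SSD-style candidate box generation for one feature map follows; the concrete min_size, max_size, and aspect_ratio values per layer come from Table 4 and are passed in here as assumed arguments in relative [0, 1] coordinates:

```python
import itertools, math

def generate_candidate_boxes(fmap_h, fmap_w, min_size, max_size, aspect_ratios):
    """Return candidate boxes as (cx, cy, w, h) in relative coordinates,
    one group per feature-map cell."""
    boxes = []
    for i, j in itertools.product(range(fmap_h), range(fmap_w)):
        cx, cy = (j + 0.5) / fmap_w, (i + 0.5) / fmap_h
        boxes.append((cx, cy, min_size, min_size))   # square box, smallest scale
        s = math.sqrt(min_size * max_size)
        boxes.append((cx, cy, s, s))                 # square box, larger scale
        for ar in aspect_ratios:                     # e.g. 2 and 3
            r = math.sqrt(ar)
            boxes.append((cx, cy, min_size * r, min_size / r))
            boxes.append((cx, cy, min_size / r, min_size * r))
    return boxes
```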
During training, it must first be determined which candidate boxes the ground-truth object boxes in a training picture match; the matched candidate boxes are responsible for predicting the ground-truth boxes. Table 5 shows the network parameters of the candidate box matching unit. A Permute layer rearranges the dimensions of its input according to a fixed pattern. A Flatten layer "presses" the input, that is, turns a multi-dimensional input one-dimensional. The prediction module finally integrates the prediction outputs of the six feature maps. order denotes the ordering of the matched candidate boxes, and axis: 1 indicates that the corresponding operation is applied starting from axis 1.
Table 5: Network parameters of the candidate box matching unit
After the network structure and parameters of the model are set, the method of generating an object detection model of this scheme can be executed. Fig. 4 shows a schematic flowchart of a method 400 of generating an object detection model according to one embodiment of the invention. The object detection model may include a feature extraction module and a prediction module (for the structure of the model, refer to the description above, which is not repeated here). The method can be executed in the computing device 100. As shown in Fig. 4, the method 400 begins at step S410.
According to some embodiments of the present invention, before step S410 is executed, the constructed object detection model may first be pre-trained. According to one embodiment of the invention, the model may first be pre-trained based on an image dataset so as to initialize the parameters of the object detection model, that is, to generate the pre-trained object detection model. For example, the image dataset may be the VOC dataset, which includes 20 categories: humans; animals (bird, cat, cow, dog, horse, sheep); vehicles (aircraft, bicycle, boat, bus, car, motorcycle, train); and indoor objects (bottle, chair, dining table, potted plant, sofa, TV). The background must also be considered when training the model with the VOC dataset, so a model of 21 categories needs to be trained. For the different layers, the object detection model for the 4 categories of the invention (cat face, dog face, human face, background) can be initialized with the larger weight values in the corresponding layers of the pre-trained model. This pre-training method can speed up model convergence while improving the detection accuracy of the model. The COCO dataset provided by Microsoft can also be used for model pre-training; COCO has 3 annotation types — object instances, object keypoints, and image captions — and is well suited to object detection. This scheme places no limitation on the image dataset.
In step S410, a training image containing annotation data is obtained, the annotation data being the positions and categories of target objects in the training image. The position of a ground-truth object box can be marked directly, or the position of the object box can be calculated from annotated feature points. This scheme places no limitation on the annotation method of the annotation data.
Fig. 5 shows a schematic diagram of a training image containing annotation data according to one embodiment of the invention. As shown in Fig. 5, to detect the cat, dog, and human face in a picture, the box of each object to be detected is first marked in the picture, and then the objects in the boxes are labeled with categories (the background category also needs to be added during model training). For ease of display, the category of each target object — cat, dog, face — is marked beside each object box in Fig. 5. The cat-face category may also be labeled 1, the dog-face category labeled 2, the human-face category labeled 3, and the background category labeled 0. According to another implementation of the present invention, for an image containing a cat face, a dog face, and a human face at the same time, the cat-face feature points, dog-face feature points, and human-face feature points can first be marked — 30 feature points in total (the number of annotated feature points can be adjusted case by case) — along with the category label of each object. For example, cat face is labeled 1, dog face is labeled 2, human face is labeled 3, and background is labeled 0. The position of the ground-truth object box can then be calculated from the annotated feature-point coordinates. For example, obtain the maxima and minima of all feature-point coordinates, namely xmin, xmax, ymin, ymax; the coordinates of the object box are then (xmin, ymin, w, h), with w = xmax − xmin and h = ymax − ymin.
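A minimal sketch of this box-from-feature-points computation:

```python
def box_from_feature_points(points):
    """points: list of annotated (x, y) feature-point coordinates.
    Returns the ground-truth object box (xmin, ymin, w, h)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    xmin, ymin = min(xs), min(ys)
    return (xmin, ymin, max(xs) - xmin, max(ys) - ymin)
```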
According to one embodiment of the present invention, the training image may also be preprocessed at the input layer of the model, which may include data augmentation and normalization. To detect objects under various natural scenes and guarantee effective training of the model, data expansion or augmentation can be performed on the training images. Image data under various natural scenes is simulated by random rotation, random brightness, contrast setting, blurring, and so on. Fig. 6 shows a schematic diagram of image data augmentation according to one embodiment of the invention; from left to right: rotation, dimming, brightening, contrast enhancement, blurring. In addition, augmentation can include flipping (horizontal or vertical), scale change (adjusting image resolution), random cropping (randomly taking image patches from the original image), color jitter (adding slight noise to the original pixel value distribution), and so on; complex data expansion methods also include generation with GANs (generative adversarial networks), principal component analysis, and supervised cropping (taking only image patches with obvious semantic information).
It should be noted that not all data augmentation methods can be used at will; for example, vertically flipping a face image is inappropriate. During data augmentation, the image data and the annotation data must also be expanded in synchrony, as sketched below; for example, when an image is flipped or rotated, the corresponding annotation coordinates are flipped or rotated accordingly. Because the sizes of real images are not fixed, if the size of an image is changed but the annotation information is not, the annotation becomes incorrect; so whenever the image size is modified, a corresponding change is made to the annotation information. The image corresponding to the annotation information can be cropped according to the original size of the image and the proportions of the annotation information.
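As a sketch of this synchronized expansion — assuming the image is a NumPy array and boxes are (xmin, ymin, w, h) in pixels — a horizontal flip mirrors the box coordinates in step with the pixels:

```python
import random
import numpy as np

def random_hflip(image, boxes, p=0.5):
    """Horizontally flip an image with probability p, mirroring its
    box annotations so image data and labels stay synchronized."""
    if random.random() >= p:
        return image, boxes
    h, w = image.shape[:2]
    flipped = np.ascontiguousarray(image[:, ::-1])  # mirror the pixel columns
    new_boxes = [(w - (x + bw), y, bw, bh) for (x, y, bw, bh) in boxes]
    return flipped, new_boxes
```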
Then, in step S420, the training image is input into the pre-trained object detection model for processing, the object detection model comprising a feature extraction module and a prediction module coupled to each other, wherein the feature extraction module comprises multiple deep residual network units and is adapted to perform convolution processing on the training image to generate at least one feature map, and the prediction module is adapted to predict the categories and positions of target objects from the at least one feature map.
Finally, in step S430, the pre-trained object detection model is trained based on the annotated and predicted object categories and positions, and the trained object detection model is taken as the generated object detection model.
According to one embodiment of the present invention, the parameters of the object detection model can be updated based on the localization loss between the annotated ground-truth box positions and the predicted box positions and the confidence loss between the annotated categories and the predicted category confidences; training ends when the weighted sum of the localization loss and the confidence loss meets a predetermined condition. In one implementation of the invention, the localization error can be calculated with a Smooth L1 loss function, and the confidence error can be calculated with a softmax loss function.
The weighted sum of the localization loss and the confidence loss can be calculated based on the following formula:
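(The formula appears as an image in the original publication; reconstructed here from the definitions below, following the standard SSD objective:)

$$ L(x, c, l, g) = \frac{1}{N}\left( L_{conf}(x, c) + \alpha\, L_{loc}(x, l, g) \right) $$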
where L_loc is the localization loss, L_conf is the confidence loss, N is the number of candidate boxes matched with ground-truth object boxes, and α is a weight coefficient, which can be set to 1. g is the position parameter of the ground-truth object box, l is the position parameter of the predicted object box, x is the annotated category, and c is the category confidence.
The localization loss can be calculated based on the following formula:
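(Reconstructed from the definitions below, following the standard SSD formulation:)

$$ L_{loc}(x, l, g) = \sum_{i \in Pos} \sum_{m \in \{cx,\, cy,\, w,\, h\}} x_{ij}^{k}\, \mathrm{smooth}_{L1}\!\left( l_{i}^{m} - \hat{g}_{j}^{m} \right) $$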
where i is the index of a predicted object box, j is the index of a ground-truth object box, cx and cy are the center of a candidate box, w and h are the width and height of the candidate box, m ranges over the candidate box's position parameters, ĝ_j^m is the position deviation between the i-th predicted object box and the j-th ground-truth object box, Pos denotes the positive-sample candidate boxes in the training image, N denotes the number of matched candidate boxes, and x_ij^k indicates whether the i-th predicted object box matches the j-th ground-truth object box with respect to category k: 1 for a match, 0 for no match.
Because the gradients of the error in a deep neural network are cumulatively multiplied during updates, gradient values between network layers greater than 1 will, through repeated multiplication, cause the gradient to grow exponentially, and the resulting large weight updates make the network unstable. The smooth L1 loss function is therefore used: when the predicted value differs from the true value by less than 1, the squared error with a 0.5 smoothing factor is used; when the difference is greater than or equal to 1, the loss is reduced to a linear term, so that the back-propagated derivative no longer depends on the magnitude of the error, which solves the gradient explosion problem.
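In other words, this is the standard Smooth L1 definition:

$$ \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5\, x^{2}, & |x| < 1 \\ |x| - 0.5, & |x| \ge 1 \end{cases} $$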
During training, it must first be determined which candidate boxes the ground-truth object boxes in a training picture match; the matched candidate boxes will be responsible for predicting them. There are two main principles for matching candidate boxes to ground-truth boxes. The first principle: for each ground-truth box in the picture, find the candidate box with the largest intersection-over-union with it; that candidate box is matched. The second principle: for the remaining unmatched candidate boxes, if the intersection-over-union with a ground-truth box is greater than some threshold (usually 0.5), that candidate box also matches the ground-truth box. After the candidate box matching step, most candidate boxes are negative samples, which causes an imbalance between positive and negative samples. To keep the positive and negative samples as balanced as possible, the negative samples can be sampled: when sampling, they are sorted in descending order of confidence error (the smaller the predicted-background confidence, the larger the error), and the samples with the largest errors are chosen as the training negative samples, keeping the positive-to-negative ratio close to 1:3. In this way the model trains stably and is ensured to converge.
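A sketch of the two matching rules and the hard-negative sampling just described; the box format, the IoU helper, and the per-box background-confidence losses are assumptions for illustration:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_and_mine(candidates, gt_boxes, bg_conf_loss, thresh=0.5, neg_ratio=3):
    """Rule 1: each ground-truth box claims its highest-IoU candidate.
    Rule 2: any remaining candidate with IoU > thresh also matches.
    Negatives: sort unmatched candidates by descending background-confidence
    loss and keep about neg_ratio negatives per positive."""
    matched = -np.ones(len(candidates), dtype=int)          # -1 = negative
    for j, gt in enumerate(gt_boxes):
        matched[np.argmax([iou(c, gt) for c in candidates])] = j
    for i, c in enumerate(candidates):
        if matched[i] < 0:
            overlaps = [iou(c, gt) for gt in gt_boxes]
            if max(overlaps) > thresh:
                matched[i] = int(np.argmax(overlaps))
    pos = np.flatnonzero(matched >= 0)
    neg_pool = np.flatnonzero(matched < 0)
    hardest = neg_pool[np.argsort(-np.asarray(bg_conf_loss)[neg_pool])]
    neg = hardest[: neg_ratio * len(pos)]                   # keep pos:neg near 1:3
    return matched, pos, neg
```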
For the confidence loss, the selection of positive-sample and negative-sample candidate boxes in the training image must be considered; that is, only candidate boxes whose intersection-over-union reaches the threshold are positive samples. The confidence loss can be calculated based on the following formula:
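(Reconstructed from the definitions below, following the standard SSD formulation, where ĉ is the softmax of the predicted confidences c:)

$$ L_{conf}(x, c) = -\sum_{i \in Pos} x_{ij}^{p} \log\left( \hat{c}_{i}^{p} \right) - \sum_{i \in Neg} \log\left( \hat{c}_{i}^{0} \right), \qquad \hat{c}_{i}^{p} = \frac{\exp(c_{i}^{p})}{\sum_{q} \exp(c_{i}^{q})} $$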
where i is the index of a predicted object box, j is the index of a ground-truth object box, N denotes the number of matched candidate boxes, Pos denotes the positive-sample candidate boxes in the training image, Neg denotes the negative-sample candidate boxes in the training image, ĉ_i^p denotes the softmax probability that the prediction is of category p, c_i^p denotes the category confidence of the i-th predicted object box for category p, and x_ij^p indicates whether the i-th predicted object box matches the j-th ground-truth object box with respect to category p.
Based on the gradient values obtained from the above losses, the parameter values of the model are updated through multiple backward iterations. Training ends when the weighted sum of the losses meets a predetermined condition, for example when the difference between the weighted loss sums of two successive iterations is less than a predetermined threshold, or when a predetermined number of iterations is reached.
After the trained object detection model is obtained according to the method 400, the object detection method can be executed in a mobile terminal. According to one embodiment, an image to be detected (which, in an embodiment according to the present invention, may contain target objects such as cat faces, dog faces, and human faces) is input into the trained object detection model to obtain the position and category of each object box in the image. Specifically, the feature extraction module performs convolution processing on the image to be detected and generates at least one feature map; the prediction module predicts the categories and positions of the target objects (that is, of each object box) from the at least one feature map extracted by the feature extraction module. In application tests on mobile terminals, compared with the traditional SSD object detection model, the computing speed of this scheme is improved by 20%, and real-time detection of objects can be realized.
According to the solution of the present invention, the network structure of the object detection model is improved by using deep residual network units in the feature extraction module, in which multiple skip connection structures are used so that low-level features can be fused into the layer above, improving the accuracy and speed of model detection. The object detection model provided by this scheme can match the computational efficiency and memory of a mobile terminal while satisfying the requirements of object detection precision.
A8. The method as described in A7, wherein the weighted sum of the localization loss and the confidence loss is calculated based on the following formula:
where L_loc is the localization loss, L_conf is the confidence loss, N is the number of matched candidate boxes, α is a weight coefficient, g is the position of the ground-truth object box, l is the position of the predicted object box, x is the annotated category, and c is the category confidence.
A9. The method as described in A8, wherein the localization loss is calculated based on the following formula:
where i is the index of a predicted object box, j is the index of a ground-truth object box, cx and cy are the center of a candidate box, w and h are the width and height of the candidate box, m ranges over the candidate box's position parameters, ĝ_j^m is the position deviation between the i-th predicted object box and the j-th ground-truth object box, Pos denotes the positive-sample candidate boxes in the training image, N denotes the number of matched candidate boxes, and x_ij^k indicates whether the i-th predicted object box matches the j-th ground-truth object box with respect to category k.
A10. The method as described in A8, wherein the confidence loss is calculated based on the following formula:
where i is the index of a predicted object box, j is the index of a ground-truth object box, N denotes the number of matched candidate boxes, Pos denotes the positive-sample candidate boxes in the training image, Neg denotes the negative-sample candidate boxes in the training image, ĉ_i^p denotes the softmax probability that the prediction is of category p, c_i^p denotes the category confidence of the i-th predicted object box for category p, and x_ij^p indicates whether the i-th predicted object box matches the j-th ground-truth object box with respect to category p.
A11. The method as described in A1, wherein the method comprises:
generating the pre-trained object detection model based on an image dataset, the image dataset containing at least images of every object category in the training images, the object categories in the training images including cat face, dog face, human face, and background.
A12. The method as described in A1, wherein the method further comprises:
performing data augmentation and normalization on the training images.
A13. The method as described in A12, wherein the data augmentation includes any one or more of flipping, rotation, color jitter, random cropping, random brightness adjustment, random contrast adjustment, and blurring.
It should be appreciated that, in order to simplify the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. More precisely, as the following claims reflect, an inventive aspect lies in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the specific embodiments are hereby expressly incorporated into the specific embodiments, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from the devices in the examples. The modules in the foregoing examples may be combined into one module or may furthermore be divided into multiple submodules.
Those skilled in the art will understand that the modules in the devices of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment can be combined into one module, unit, or component, and they can furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features included in other embodiments but not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.
The various techniques described herein may be implemented in connection with hardware or software, or a combination thereof. Aspects or portions of the methods and apparatus of the present invention may take the form of program code (instructions) embedded in tangible media, such as floppy disks, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into a machine such as a computer and executed by the machine, the machine becomes an apparatus for practicing the invention.
Where the program code executes on programmable computers, the computing device generally comprises a processor, a processor-readable storage medium (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the method of the present invention according to the instructions in the program code stored in the memory.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media generally embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
In addition, some of the embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by other devices executing the stated functions. Therefore, a processor having the necessary instructions for implementing the method or method elements forms a device for implementing the method or method elements. Furthermore, the elements of the device embodiments described here are examples of devices for implementing the functions performed by those elements for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinals "first", "second", "third", etc. to describe ordinary objects merely represents different instances of similar objects, and is not intended to imply that the objects so described must have a given order in time, space, ranking, or any other manner.
Although the present invention has been described in terms of a limited number of embodiments, benefiting from the above description, it is clear to those skilled in the art that other embodiments can be envisaged within the scope of the invention thus described. Additionally, it should be noted that the language used in this specification has been chosen primarily for readability and instructional purposes, not to explain or limit the subject matter of the invention. Therefore, many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. As to the scope of the present invention, the disclosure made of the invention is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims (10)

1. A method of generating an object detection model, the method being adapted to be executed in a computing device, comprising:
obtaining a training image containing annotation data, the annotation data being the positions and categories of target objects in the training image;
inputting the training image into a pre-trained object detection model for processing, the object detection model comprising a feature extraction module and a prediction module coupled to each other, wherein
the feature extraction module comprises multiple deep residual network units and convolution processing units, and is adapted to perform convolution processing on the training image to generate at least one feature map;
the prediction module is adapted to predict the categories and positions of target objects from the at least one feature map; and
training the pre-trained object detection model based on the annotated and predicted object categories and positions, and taking the trained object detection model as the generated object detection model.
2. The method of claim 1, wherein the deep residual network unit comprises multiple mutually coupled convolution processing layers with 3*3 kernels and a skip connection layer, the skip connection layer being adapted to add the feature maps output by two coupled convolution processing layers and output the sum.
3. The method of claim 2, wherein the convolution processing layer comprises a convolutional layer, a batch normalization layer, and an activation layer, and wherein the batch normalization layer is merged into the convolutional layer.
4. The method of claim 1, wherein the prediction module comprises a category prediction unit and a position prediction unit, the category prediction unit being adapted to output the category confidence of each object in the image, and the position prediction unit being adapted to output the predicted positions of target objects in the image.
5. The method of claim 1, wherein the annotated position of a target object is the feature-point coordinates of the target object or its ground-truth object box.
6. method as claimed in claim 5, wherein the prediction module further includes candidate frame generation unit and candidate frame matching Unit, each characteristic pattern that the candidate frame generation unit is suitable for export the characteristic extracting module according to different sizes with Length-width ratio generates corresponding multiple candidate frames, and the candidate frame matching unit is suitable for choosing and the matched candidate of real-world object frame Frame, to be predicted based on matched candidate frame.
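Candidate box generation and matching of the kind claims 5 and 6 describe is typically done by tiling each feature map with boxes of several sizes and aspect ratios and keeping those whose overlap (IoU) with a ground-truth box exceeds a threshold. The concrete scales, ratios and threshold below are assumptions, not values taken from the patent:

```python
import itertools
import torch
from torchvision.ops import box_iou  # pairwise IoU between two sets of boxes

def generate_candidates(fmap_h, fmap_w, stride,
                        scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Tile one feature map with candidate boxes, returned as (x1, y1, x2, y2)."""
    boxes = []
    for y, x in itertools.product(range(fmap_h), range(fmap_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell centre in pixels
        for s, r in itertools.product(scales, ratios):
            w, h = s * r ** 0.5, s / r ** 0.5            # aspect ratio w/h == r
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return torch.tensor(boxes)

def match_candidates(candidates, gt_boxes, iou_threshold=0.5):
    """Indices of the candidate boxes that match some ground-truth object box."""
    ious = box_iou(candidates, gt_boxes)                 # (num_candidates, num_gt)
    return (ious.max(dim=1).values >= iou_threshold).nonzero(as_tuple=True)[0]
```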
7. The method of claim 6, wherein the step of training the pre-trained object detection model based on the annotated and the predicted object classes and positions comprises:
updating the parameters of the object detection model based on a localization loss between the annotated ground-truth object box position and the predicted object box position and on a classification confidence loss between the annotated class and the predicted class confidence, the training ending when a weighted sum of the localization loss and the classification confidence loss satisfies a predetermined condition.
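A common concrete reading of claim 7 uses smooth-L1 for the localization loss and cross-entropy for the classification confidence loss; both choices, and the weights, are assumptions of this sketch, since the claim only requires a weighted sum that satisfies a predetermined condition:

```python
import torch.nn.functional as F

def detection_loss(pred_box, gt_box, pred_logits, gt_class,
                   loc_weight=1.0, cls_weight=1.0):
    """Weighted sum of a localization loss and a classification confidence loss,
    computed over the matched candidate boxes."""
    loc_loss = F.smooth_l1_loss(pred_box, gt_box)        # annotated vs. predicted box
    cls_loss = F.cross_entropy(pred_logits, gt_class)    # annotated class vs. confidence
    return loc_weight * loc_loss + cls_weight * cls_loss
```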
8. An object detection method, the method being adapted to be executed in a terminal, comprising:
inputting an image to be detected into an object detection model to obtain the position and class of each object box in the image,
wherein the object detection model is generated using the method of any one of claims 1 to 7.
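At inference time the generated model is applied directly to a new image; the decoding below (softmax, confidence threshold) is an assumed post-process, since claim 8 only requires that the position and class of each object box be obtained:

```python
import torch

@torch.no_grad()
def detect(model, image, score_threshold=0.5):
    """Assumed output shapes for one image: pred_class (num_boxes, num_classes),
    pred_box (num_boxes, 4)."""
    model.eval()
    pred_class, pred_box = model(image.unsqueeze(0))     # add a batch dimension
    scores, labels = pred_class.softmax(dim=-1).max(dim=-1)
    keep = scores > score_threshold                      # assumed confidence cut-off
    return pred_box[keep], labels[keep], scores[keep]
```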
9. A computing device, comprising:
a memory;
one or more processors; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any one of the methods of claims 1 to 8.
10. A computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a computing device, cause the computing device to perform any one of the methods of claims 1 to 8.
CN201910369470.8A 2019-05-05 2019-05-05 A method of generating object detection model Pending CN110084313A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910369470.8A CN110084313A (en) 2019-05-05 2019-05-05 A method of generating object detection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910369470.8A CN110084313A (en) 2019-05-05 2019-05-05 A method of generating object detection model

Publications (1)

Publication Number Publication Date
CN110084313A true CN110084313A (en) 2019-08-02

Family

ID=67418597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910369470.8A Pending CN110084313A (en) 2019-05-05 2019-05-05 A method of generating object detection model

Country Status (1)

Country Link
CN (1) CN110084313A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288075A (en) * 2018-02-02 2018-07-17 沈阳工业大学 A kind of lightweight small target detecting method improving SSD
CN108416394A (en) * 2018-03-22 2018-08-17 河南工业大学 Multi-target detection model building method based on convolutional neural networks
CN109712117A (en) * 2018-12-11 2019-05-03 重庆信息通信研究院 Lightweight TFT-LCD mould group scratch detection method based on convolutional neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
曾钰廷: "Research on Object Detection and Tracking Methods Based on Deep Learning", China Masters' Theses Full-text Database, Information Science and Technology *
毕鹏程 et al.: "Research on Lightweight Convolutional Neural Network Technology", Computer Engineering and Applications *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110688893A (en) * 2019-08-22 2020-01-14 成都通甲优博科技有限责任公司 Detection method for wearing safety helmet, model training method and related device
CN110796185A (en) * 2019-10-17 2020-02-14 北京爱数智慧科技有限公司 Method and device for detecting image annotation result
CN111160156A (en) * 2019-12-17 2020-05-15 北京明略软件***有限公司 Moving object identification method and device
CN111046974A (en) * 2019-12-25 2020-04-21 珠海格力电器股份有限公司 Article classification method and device, storage medium and electronic equipment
CN111179241A (en) * 2019-12-25 2020-05-19 成都数之联科技有限公司 Panel defect detection and classification method and system
WO2021155792A1 (en) * 2020-02-03 2021-08-12 华为技术有限公司 Processing apparatus, method and storage medium
CN111428591A (en) * 2020-03-11 2020-07-17 天津华来科技有限公司 AI face image processing method, device, equipment and storage medium
CN111444828B (en) * 2020-03-25 2023-06-20 腾讯科技(深圳)有限公司 Model training method, target detection method, device and storage medium
CN111444828A (en) * 2020-03-25 2020-07-24 腾讯科技(深圳)有限公司 Model training method, target detection method, device and storage medium
CN112950613A (en) * 2020-05-19 2021-06-11 惠州高视科技有限公司 Surface defect detection method and device
CN111709310A (en) * 2020-05-26 2020-09-25 重庆大学 Gesture tracking and recognition method based on deep learning
CN111709310B (en) * 2020-05-26 2024-02-02 重庆大学 Gesture tracking and recognition method based on deep learning
CN112529940A (en) * 2020-12-17 2021-03-19 北京深睿博联科技有限责任公司 Moving target position prediction method and device under fixed camera
CN112529940B (en) * 2020-12-17 2022-02-11 北京深睿博联科技有限责任公司 Moving target position prediction method and device under fixed camera
CN113392927A (en) * 2021-07-01 2021-09-14 哈尔滨理工大学 Animal target detection method based on single-order deep neural network
CN113781416A (en) * 2021-08-30 2021-12-10 武汉理工大学 Conveyer belt tearing detection method and device and electronic equipment
CN114511041A (en) * 2022-04-01 2022-05-17 北京世纪好未来教育科技有限公司 Model training method, image processing method, device, equipment and storage medium
CN116524339A (en) * 2023-07-05 2023-08-01 宁德时代新能源科技股份有限公司 Object detection method, apparatus, computer device, storage medium, and program product
CN116524339B (en) * 2023-07-05 2023-10-13 宁德时代新能源科技股份有限公司 Object detection method, apparatus, computer device, storage medium, and program product

Similar Documents

Publication Publication Date Title
CN110084313A (en) A method of generating object detection model
CN110070072A (en) A method of generating object detection model
CN110084253A (en) A method of generating object detection model
Barsoum et al. Hp-gan: Probabilistic 3d human motion prediction via gan
Yi et al. ASSD: Attentive single shot multibox detector
CN111797893B (en) Neural network training method, image classification system and related equipment
Liu et al. Learning spatio-temporal representations for action recognition: A genetic programming approach
Babenko et al. Robust object tracking with online multiple instance learning
CN110378381A (en) Object detecting method, device and computer storage medium
CN110309856A (en) Image classification method, the training method of neural network and device
WO2020015752A1 (en) Object attribute identification method, apparatus and system, and computing device
Seyedhosseini et al. Semantic image segmentation with contextual hierarchical models
JP2020513637A (en) System and method for data management
CN109559300A (en) Image processing method, electronic equipment and computer readable storage medium
CN110096964A (en) A method of generating image recognition model
CN109934173A (en) Expression recognition method, device and electronic equipment
CN110516803A (en) Traditional computer vision algorithm is embodied as neural network
EP3987443A1 (en) Recurrent multi-task convolutional neural network architecture
US20230137337A1 (en) Enhanced machine learning model for joint detection and multi person pose estimation
CN110287857A (en) A kind of training method of characteristic point detection model
CN110276289A (en) Generate the method and human face characteristic point method for tracing of Matching Model
CN109583367A (en) Image text row detection method and device, storage medium and electronic equipment
CN109522970A (en) Image classification method, apparatus and system
CN110084312A (en) A method of generating object detection model
Kang et al. Yolo-6d+: single shot 6d pose estimation using privileged silhouette information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190802)