CN110503097A - Training method, apparatus, and storage medium for an image processing model - Google Patents
Training method, apparatus, and storage medium for an image processing model
- Publication number
- CN110503097A (application number CN201910798468.2A)
- Authority
- CN
- China
- Prior art keywords
- target object
- network
- region
- image processing
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The present invention provides a training method, apparatus, and storage medium for an image processing model. The image processing model includes a backbone network, a region proposal network, and a detection network. The method includes: performing feature extraction on a sample image containing a target object through the backbone network to obtain a feature map of the sample image; performing region selection on the feature map through the region proposal network to determine candidate regions; performing target object detection on the candidate regions through the detection network to obtain classification parameters and location parameters of the target object, where the classification parameters include the classification result of the target object and the location parameters include the bounding box, segmentation mask, and keypoint mask of the target object; determining the value of the target loss function of the image processing model based on the classification parameters and location parameters of the target object; and updating the model parameters based on the value of the target loss function. With the invention, the target object in an image can be accurately located, improving the precision of target detection.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a training method, apparatus, and storage medium for an image processing model.
Background technique
Machine learning (ML) is a branch of artificial intelligence that generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning. The purpose of machine learning is to let a machine learn from prior knowledge so that it acquires the logical capability to classify and judge. Machine learning models represented by neural networks are continuously developing and are gradually being applied to target detection in image processing.
In the related art, the training of a neural network model for detecting a target object in an image is based only on the bounding-box information of the target object in the image, so the resulting image processing model has low accuracy when performing target detection.
Summary of the invention
Embodiments of the present invention provide a training method, apparatus, and storage medium for an image processing model, which can accurately locate the target object in an image and improve the precision of target detection.
The technical solutions of the embodiments of the present invention are implemented as follows:
An embodiment of the present invention provides a training method for an image processing model. The image processing model includes a backbone network, a region proposal network, and a detection network, and the method includes:
performing feature extraction on a sample image containing a target object through the backbone network to obtain a feature map of the sample image;
performing region selection on the feature map through the region proposal network to determine candidate regions;
performing target object detection on the candidate regions through the detection network to obtain location parameters and classification parameters of the target object, where the classification parameters include the classification result of the target object and the location parameters include the bounding box, segmentation mask, and keypoint mask of the target object;
determining the value of the target loss function of the image processing model based on the bounding box, segmentation mask, keypoint mask, and classification result of the target object; and
updating the model parameters of the image processing model based on the determined value of the target loss function.
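The steps above can be sketched as a single training-step pipeline. Every function body below is a hypothetical stand-in (dummy arrays and illustrative shapes, not the patent's implementation), intended only to show how the three networks and the loss fit together:

```python
import numpy as np

# Hypothetical stand-ins for the three networks; shapes are illustrative only.
def backbone(image):
    # Feature extraction: here just a dummy channel average as the "feature map".
    return image.mean(axis=-1, keepdims=True)

def region_proposal(feature_map):
    # Region proposal network: one dummy candidate region (x1, y1, x2, y2).
    return [(0, 0, 4, 4)]

def detection_head(feature_map, rois):
    # Detection network: per ROI, (class scores, box, segmentation mask, keypoint mask).
    return [(np.array([0.2, 0.8]), np.array([0., 0., 4., 4.]),
             np.zeros((4, 4)), np.zeros((4, 4)))]

def total_loss(outputs, targets):
    # Placeholder combination of the classification, box, mask, and keypoint terms.
    return sum(float(np.abs(o - t).sum())
               for out, tgt in zip(outputs, targets)
               for o, t in zip(out, tgt))

image = np.ones((8, 8, 3))
features = backbone(image)
rois = region_proposal(features)
outputs = detection_head(features, rois)
targets = [(np.array([0., 1.]), np.array([0., 0., 4., 4.]),
            np.zeros((4, 4)), np.zeros((4, 4)))]
loss = total_loss(outputs, targets)  # would drive a gradient update in practice
```

In a real implementation each stand-in would be a trainable network and the loss value would be back-propagated to update the model parameters.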
An embodiment of the present invention provides a training apparatus for an image processing model, including:
a feature extraction module, configured to perform feature extraction on a sample image containing a target object through the backbone network to obtain a feature map of the sample image;
a region selection module, configured to perform region selection on the feature map through the region proposal network to determine candidate regions;
an object detection module, configured to perform target object detection on the candidate regions through the detection network to obtain location parameters and classification parameters of the target object, where the classification parameters include the classification result of the target object and the location parameters include the bounding box, segmentation mask, and keypoint mask of the target object;
a loss determination module, configured to determine the value of the target loss function of the image processing model based on the bounding box, segmentation mask, keypoint mask, and classification result of the target object; and
a parameter update module, configured to update the model parameters of the image processing model based on the determined value of the target loss function.
In the above scheme, the region selection module is configured to generate, through the region proposal network, multiple initial bounding boxes corresponding to the feature map; scan the multiple initial bounding boxes with a sliding window to determine those initial bounding boxes corresponding to the foreground; and perform bounding-box regression on the initial bounding boxes corresponding to the foreground to determine the candidate regions.
In the above scheme, the apparatus further includes a segmentation module, configured to crop, through the detection network, the feature region corresponding to a candidate region from the feature map to obtain a candidate feature region, and to adjust the features of the candidate feature region to a fixed feature dimension.
In the above scheme, the object detection module is further configured to perform target object detection on each candidate region through the fully connected network included in the detection network, determine the candidate regions containing the target object, and perform bounding-box regression based on the candidate regions containing the target object to obtain the bounding box of the target object.
In the above scheme, the object detection module is further configured to perform, through the convolutional network included in the detection network, semantic segmentation on the candidate regions corresponding to the target object to generate the segmentation mask of the target object.
In the above scheme, the object detection module is further configured to perform, through the fully convolutional network included in the detection network, semantic segmentation of the keypoints of the target object in the candidate regions to generate the keypoint mask of the target object's keypoints.
In the above scheme, the loss determination module is further configured to obtain a first difference between the bounding box and a target bounding box, a second difference between the segmentation mask and a target segmentation mask, a third difference between the keypoint mask and a target keypoint mask, and a fourth difference between the classification result and a target classification result, and to determine the value of the loss function of the image processing model based on the first, second, third, and fourth differences.
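A minimal sketch of how the four differences might be combined into one target loss, assuming cross-entropy for the classification term, smooth-L1 for the box term, and binary cross-entropy for the two mask terms. These are common choices in the R-CNN family; the patent itself does not prescribe the form of the individual terms:

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth-L1 (Huber) loss, a common choice for box regression."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy, a common choice for per-pixel mask losses."""
    p = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

def cross_entropy(scores, label, eps=1e-7):
    """Classification loss on normalized class scores."""
    return -np.log(np.clip(scores[label], eps, 1.0))

def target_loss(cls_scores, cls_label, box, box_gt, mask, mask_gt, kp, kp_gt):
    # The four differences: classification, bounding box, mask, keypoint.
    return (cross_entropy(cls_scores, cls_label)
            + smooth_l1(box, box_gt)
            + bce(mask, mask_gt)
            + bce(kp, kp_gt))

# With perfect predictions the loss is (numerically) zero:
loss = target_loss(np.array([0., 1.]), 1, np.zeros(4), np.zeros(4),
                   np.array([[0., 1.]]), np.array([[0., 1.]]),
                   np.zeros((1, 2)), np.zeros((1, 2)))
```

In practice each term is usually given its own weight before summation; equal weights are used here only for brevity.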
In the above scheme, the parameter update module is further configured to, when the value of the loss function exceeds a preset threshold, determine a corresponding error signal based on the loss function of the image processing model, back-propagate the error signal through the image processing model, and update the model parameters of the image processing model during the propagation.
An embodiment of the present invention also provides a training apparatus for an image processing model, including: a memory for storing executable instructions; and a processor that, when executing the executable instructions stored in the memory, implements the training method for an image processing model provided by the embodiments of the present invention.
An embodiment of the present invention provides a storage medium storing executable instructions that, when executed, cause a processor to implement the training method for an image processing model provided by the embodiments of the present invention.
The embodiments of the present invention have the following beneficial effects:
The embodiments of the present invention combine the obtained bounding box, segmentation mask, keypoint mask, and classification result of the target object to determine the value of the target loss function of the image processing model, and then update the model parameters of the image processing model, thereby training the image processing model. Because the segmentation mask and keypoint mask of the target object characterize the location of the target object in the image more accurately, an image processing model trained with the bounding box, segmentation mask, and keypoint mask together can locate the target object in an image more accurately, improving the detection precision of the target object.
Brief description of the drawings
Fig. 1 is a schematic diagram of the principle of R-CNN provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the principle of Fast R-CNN provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of the principle of Faster R-CNN provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of the architecture of Faster R-CNN provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the working principle of an RPN provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of the architecture of Mask R-CNN provided by an embodiment of the present invention;
Fig. 7 is a schematic diagram of the architecture of a training system for an image processing model provided by an embodiment of the present invention;
Fig. 8 is a schematic diagram of the structure of an electronic device 600 provided by an embodiment of the present invention;
Fig. 9 is a schematic flowchart of a training method for an image processing model provided by an embodiment of the present invention;
Fig. 10 is a schematic diagram of initial bounding boxes in a feature map provided by an embodiment of the present invention;
Fig. 11 is a schematic diagram of the principle of bounding-box regression provided by an embodiment of the present invention;
Fig. 12 is a schematic flowchart of a training method for an image processing model provided by an embodiment of the present invention;
Fig. 13 is a schematic flowchart of a training method for an image processing model provided by an embodiment of the present invention;
Fig. 14 is a schematic diagram of advertisement slot detection provided by an embodiment of the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments are not to be construed as limiting the present invention, and all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
In the following description, "some embodiments" describes a subset of all possible embodiments; it may refer to the same subset or different subsets of all possible embodiments, which can be combined with each other where no conflict arises.
In the following description, the terms "first" and "second" merely distinguish similar objects and do not denote a particular ordering of the objects. It is understood that, where permitted, "first" and "second" may be interchanged in a specific order or sequence, so that the embodiments of the present invention described herein can be implemented in an order other than that illustrated or described herein.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used herein are intended only to describe the embodiments of the present invention and are not intended to limit the present invention.
Region-based convolutional neural networks (Regions with CNN features, R-CNN) are described first. Fig. 1 is a schematic diagram of the principle of R-CNN provided by an embodiment of the present invention. Referring to Fig. 1, the model input is a picture, on which a predetermined number (e.g., 2000) of regions to be detected are proposed. Feature extraction is performed on these regions one by one (serially) through a convolutional neural network, the extracted features are classified by a support vector machine (SVM) to determine the class of the object, and the size of the target bounding box is adjusted by bounding-box regression. An image processing model based on R-CNN takes a long time for image processing, has low processing efficiency, and requires the modules of different functions in the model to be trained separately.
In the related art, R-CNN has been improved into Fast R-CNN. Fig. 2 is a schematic diagram of the principle of Fast R-CNN provided by an embodiment of the present invention. Referring to Fig. 2, a predetermined number (e.g., 2000) of regions to be detected are determined on the picture, and feature extraction is performed through a convolutional neural network; a region-of-interest pooling layer (ROI Pooling Layer) then extracts the feature corresponding to each ROI from the full-image features, after which classification and bounding-box refinement are performed through fully connected layers (FC Layers, Fully Connected Layers). An image processing model based on Fast R-CNN replaces R-CNN's serial feature extraction with a single neural network that extracts features from the full image, but its processing efficiency is still low.
In the related art, Faster R-CNN has been proposed based on Fast R-CNN. Fig. 3 is a schematic diagram of the principle of Faster R-CNN provided by an embodiment of the present invention. Referring to Fig. 3, shared convolutional layers first extract features from the full image, and the resulting feature map is fed into a region proposal network (RPN, Region Proposal Network). The RPN generates the boxes to be detected (specifying the positions of the ROIs) and performs a first refinement of the ROI bounding boxes. What follows is the Fast R-CNN framework: the ROI Pooling Layer selects the feature corresponding to each ROI on the feature map according to the output of the RPN and sets its dimension to a fixed value; finally, the bounding boxes are classified through fully connected layers, and a second bounding-box refinement is performed.
Next, the framework of Faster R-CNN is described. Fig. 4 is a schematic diagram of the architecture of Faster R-CNN provided by an embodiment of the present invention. Referring to Fig. 4, the structure of Faster R-CNN is divided into three parts: the first part is the shared convolutional layers, i.e., the backbone network (backbone); the second part is the region proposal network (RPN); and the third part is the classification network that classifies the candidate regions.
Here, the working principle of the RPN is described. Fig. 5 is a schematic diagram of the working principle of an RPN provided by an embodiment of the present invention. The RPN relies on a window that slides over the shared feature map and generates a preset number (e.g., 9 kinds) of bounding boxes (anchors) for each position. For the generated bounding boxes, the RPN does two things: first, it judges whether a bounding box corresponds to the foreground or the background, i.e., whether the box covers a target at all; second, it performs coordinate refinement on the bounding boxes belonging to the foreground.
In Faster R-CNN, features are extracted once by the shared convolutional layers (the backbone network); therefore, for each ROI, the corresponding feature needs to be extracted from the shared convolutional layers and fed into the fully connected layers for classification. The ROI pooling layer thus mainly does two things: first, it selects the corresponding feature for each ROI; second, to meet the input requirements of the fully connected layers, it converts the dimension of each ROI's feature to a fixed value.
Based on the above description of Faster R-CNN, Mask R-CNN is described next. Fig. 6 is a schematic diagram of the architecture of Mask R-CNN provided by an embodiment of the present invention. Referring to Fig. 6, Mask R-CNN improves the ROI pooling layer of Faster R-CNN by proposing ROI Align and adds a Mask branch. The effect of ROI Align is mainly to eliminate the floor (quantization) operation of the ROI pooling layer, so that the feature obtained for each ROI is better aligned with the ROI region in the original image. Through the Mask branch, semantic segmentation is performed on the candidate regions and a segmentation mask is output, so that the trained Mask R-CNN model improves detection precision compared with the Faster R-CNN model.
Next, the training system of the image processing model of the embodiment of the present invention is described. Fig. 7 is a schematic diagram of the architecture of a training system for an image processing model provided by an embodiment of the present invention. Referring to Fig. 7, to support an exemplary application, the training system 100 of the image processing model includes terminals (terminal 400-1 and terminal 400-2 are shown as examples). A terminal 400 is connected to a server 200 through a network 300, which can be a wide area network, a local area network, or a combination of the two, using wireless or wired links for data transmission.
A terminal (e.g., terminal 400-1) sends a training request for the image processing model to the server 200; the training request carries a sample image for training the image processing model, and the sample image contains a target object.
Here, in practical applications, the terminal can be various types of user terminals such as a smartphone, tablet computer, or laptop, and can also be a wearable computing device, a personal digital assistant (PDA), a desktop computer, a cellular phone, a media player, a navigation device, a game console, a television, or a combination of any two or more of these or other data processing devices.
The server 200 parses the training request to obtain the sample image;
performs feature extraction on the sample image containing the target object through the backbone network to obtain a feature map of the sample image;
performs region selection on the feature map through the region proposal network to determine candidate regions;
performs target object detection on the candidate regions through the detection network to obtain location parameters and classification parameters of the target object, where the classification parameters include the classification result of the target object and the location parameters include the bounding box, segmentation mask, and keypoint mask of the target object;
determines the value of the target loss function of the image processing model based on the bounding box, segmentation mask, keypoint mask, and classification result of the target object; and
updates the model parameters of the image processing model based on the determined value of the target loss function.
In practical applications, the server 200 can be a separately configured server supporting various services, or can be configured as a server cluster.
The terminal (e.g., terminal 400-1) is also used to send an image processing request carrying a target image to the server 200.
The server 200 is also used to perform target object detection on the target image using the trained image processing model (i.e., the model whose parameters have been updated as above), obtain location parameters including at least the bounding box of the target object, and return them to the terminal (e.g., terminal 400-1).
The terminal (e.g., terminal 400-1) is also used to mark the target object in the user interface with the returned bounding box.
The electronic device implementing the training method for an image processing model of an embodiment of the present invention is described below. In some embodiments, the electronic device can be a terminal or a server. Referring to Fig. 8, Fig. 8 is a schematic diagram of the structure of an electronic device 600 provided by an embodiment of the present invention. The electronic device 600 shown in Fig. 8 includes: a processor 610, a memory 650, a network interface 620, and a user interface 630. The components in the electronic device 600 are coupled through a bus system 640. It can be understood that the bus system 640 is used to realize connection and communication between these components. In addition to a data bus, the bus system 640 also includes a power bus, a control bus, and a status signal bus. For clarity, however, the various buses are all labelled as the bus system 640 in Fig. 8.
The processor 610 can be an integrated circuit chip with signal processing capability, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor can be a microprocessor or any conventional processor.
The user interface 630 includes one or more output devices 631 that enable the presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 630 also includes one or more input devices 632, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch-screen display, camera, and other input buttons and controls.
The memory 650 can be removable, non-removable, or a combination of the two. Exemplary hardware devices include solid-state memory, hard disk drives, optical disc drives, and the like. The memory 650 optionally includes one or more storage devices physically remote from the processor 610.
The memory 650 includes volatile memory or non-volatile memory, and can also include both. The non-volatile memory can be a read-only memory (ROM) and the volatile memory can be a random access memory (RAM). The memory 650 described in the embodiments of the present invention is intended to include any suitable type of memory.
In some embodiments, the memory 650 can store data to support various operations. Examples of these data include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 651, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, and a driver layer, used to realize various basic services and handle hardware-based tasks;
a network communication module 652 for reaching other computing devices via one or more (wired or wireless) network interfaces 620, with exemplary network interfaces 620 including Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
a presentation module 653 for enabling the presentation of information via one or more output devices 631 (e.g., a display screen or speakers) associated with the user interface 630 (e.g., a user interface for operating peripherals and displaying content and information);
an input processing module 654 for detecting one or more user inputs or interactions from one of the one or more input devices 632 and translating the detected inputs or interactions.
In some embodiments, the training apparatus for an image processing model provided by an embodiment of the present invention can be implemented in software. Fig. 8 shows a training apparatus 655 for an image processing model stored in the memory 650, which can be software in the form of a program, a plug-in, or the like, including the following software modules: a feature extraction module 6551, a region selection module 6552, an object detection module 6553, a loss determination module 6554, and a parameter update module 6555. These modules are logical and can therefore be arbitrarily combined or further split according to the functions they realize. The functions of the modules are described below.
In other embodiments, the training apparatus for an image processing model provided by an embodiment of the present invention can be implemented in hardware. As an example, it can be a processor in the form of a hardware decoding processor programmed to perform the training method for an image processing model provided by the embodiments of the present invention. For example, the processor in the form of a hardware decoding processor can adopt one or more application-specific integrated circuits (ASICs), DSPs, programmable logic devices (PLDs), complex programmable logic devices (CPLDs), field-programmable gate arrays (FPGAs), or other electronic components.
Next, the training method for an image processing model provided by an embodiment of the present invention is described. The image processing model provided by the embodiment of the present invention includes a backbone network, a region proposal network, and a detection network. Fig. 9 is a schematic flowchart of a training method for an image processing model provided by an embodiment of the present invention. In some embodiments, the training method can be implemented by a server or a terminal, or by a server and a terminal in coordination. Taking server implementation as an example, referring to Fig. 9, the training method for an image processing model provided by an embodiment of the present invention includes:
Step 701: The server performs feature extraction on a sample image containing a target object through the backbone network to obtain a feature map of the sample image.
Here, in actual implementation, the backbone network can be a convolutional neural network that extracts features from the full sample image to obtain the feature map of the sample image.
In some embodiments, the backbone network can be pre-trained, for example, by training directly on an image classification dataset such as ImageNet to obtain a backbone network used only for feature extraction.
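The mapping from sample image to feature map depends on the backbone's total stride. A minimal sketch, assuming a typical stride-16 convolutional backbone (the stride value is an assumption for illustration; the patent does not fix one):

```python
def feature_map_size(img_h, img_w, stride=16):
    """Spatial size of the backbone feature map, assuming a convolutional
    backbone whose total downsampling stride is `stride`."""
    return img_h // stride, img_w // stride

# Under this assumption, a 640 x 960 sample image yields a 40 x 60 feature map.
h, w = feature_map_size(640, 960)
```

Each feature-map cell then summarizes a stride-by-stride patch of the input image, which is what lets the region proposal network reason about image positions on the smaller map.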
Step 702: Perform region selection on the feature map through the region proposal network to determine candidate regions.
In some embodiments, the server can perform region selection on the feature map in the following way to determine the candidate regions: generate multiple initial bounding boxes corresponding to the feature map through the region proposal network; scan the multiple initial bounding boxes with a sliding window to determine those initial bounding boxes corresponding to the foreground; and perform bounding-box regression on the initial bounding boxes corresponding to the foreground to determine the candidate regions.
Here, in actual implementation, the number of candidate regions determined can be fixed. Illustratively, Fig. 10 is a schematic diagram of initial bounding boxes in a feature map provided by an embodiment of the present invention. Referring to Fig. 10, the region proposal network uses a window sliding over the feature map to generate, for each position, 9 kinds of initial bounding boxes (anchors) with preset aspect ratios and areas. These 9 kinds of initial bounding boxes cover three areas (128 × 128, 256 × 256, 512 × 512), and each area covers three aspect ratios (1:1, 1:2, 2:1). In this way, when the feature map size is 40 × 60, the region proposal network generates approximately 20,000 initial bounding boxes in total (40 × 60 × 9 = 21,600). After generating the multiple initial bounding boxes, the region proposal network performs two operations: first, judging whether an initial bounding box corresponds to the foreground or the background; second, performing a first refinement, i.e., bounding-box regression, on the initial bounding boxes belonging to the foreground.
Here, the judgment of whether an initial bounding box corresponds to foreground or background is explained. In actual implementation, Intersection over Union (IoU) thresholds can be set: when the IoU between an initial bounding box and a target (ground-truth) bounding box exceeds a first threshold (e.g., 0.7), the initial bounding box is determined to correspond to foreground; when the IoU between an initial bounding box and the target bounding box is below a second threshold (e.g., 0.3), the initial bounding box is determined to correspond to background.
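The anchor generation and the IoU-based foreground/background assignment described above can be sketched as follows (a minimal illustration; the anchor centre, the exact aspect-ratio convention, and the helper names are ours, not from the embodiment):

```python
import numpy as np

def make_anchors(cx, cy):
    """Generate the 9 anchors (3 areas x 3 aspect ratios) centred at
    (cx, cy). Boxes are (x1, y1, x2, y2)."""
    anchors = []
    for side in (128, 256, 512):          # three areas: side * side
        area = side * side
        for ratio in (1.0, 0.5, 2.0):     # h:w ratios 1:1, 1:2, 2:1 (assumed)
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt, fg_thresh=0.7, bg_thresh=0.3):
    """Foreground if IoU > 0.7, background if IoU < 0.3, else ignored."""
    v = iou(anchor, gt)
    if v > fg_thresh:
        return "foreground"
    if v < bg_thresh:
        return "background"
    return "ignore"
```

With 9 anchors per position, a 40 × 60 feature map yields 40 × 60 × 9 = 21,600 initial boxes, matching the "about 20,000" figure above.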
Bounding-box regression is explained next. Figure 11 is a schematic diagram of the bounding-box regression provided in an embodiment of the present invention. Referring to Figure 11, a bounding box is represented by a four-dimensional vector (x, y, w, h), denoting its center-point coordinates, width, and height. In Figure 11, the bounding box P (numeral 111) represents an initial bounding box and the bounding box G (numeral 112) represents a target bounding box; bounding-box regression learns a mapping such that, given the input bounding box P, it outputs a regression window Z that is closer to the target bounding box G.
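The exact mapping from P toward G is not spelled out in the embodiment; a common choice, assumed here, is the standard Faster R-CNN parameterization, which regresses normalized centre offsets and log scale factors:

```python
import math

def regression_targets(P, G):
    """Standard R-CNN box-regression targets mapping box P = (x, y, w, h)
    toward target box G; (x, y) is the centre point. This parameterization
    is an assumption, not stated in the embodiment."""
    px, py, pw, ph = P
    gx, gy, gw, gh = G
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def apply_regression(P, t):
    """Apply predicted offsets t to box P, yielding the regressed window Z."""
    px, py, pw, ph = P
    tx, ty, tw, th = t
    return (px + pw * tx, py + ph * ty,
            pw * math.exp(tw), ph * math.exp(th))
```

Applying the exact targets recovers G, i.e. Z = G when the prediction is perfect.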
Step 703: through the detection network, target object detection is performed on the candidate regions to obtain location parameters and classification parameters of the target object. The classification parameters include a classification result for the target object; the location parameters include the bounding box, segmentation mask, and keypoint mask of the target object.
In some embodiments, after the server determines the candidate regions, it also crops, through the detection network, the feature areas corresponding to the candidate regions from the feature map to obtain candidate feature regions, and adjusts the features of each candidate feature region to a fixed-size feature dimension.
The architecture of the detection network is explained. In some embodiments, the detection network includes a candidate-region alignment network (ROI Align), a target detection head (Bbox head), a segmentation head (Mask head), and a keypoint head (Keypoint head).
In actual implementation, the detection network uses the ROI Align technique to obtain fixed-size feature areas from the feature map. For example, to obtain a fixed-size (7 × 7) feature area, the detection network does not use quantization, thereby avoiding quantization error: for instance, given 665 / 32 = 20.78, the detection network uses 20.78 rather than rounding it to 20, and handles such floating-point coordinates by bilinear interpolation.
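The bilinear interpolation used to read the feature map at non-integer coordinates (e.g., 20.78 rather than 20) can be sketched for a single channel as follows (a minimal illustration; the helper name is ours):

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample a 2-D feature map at a floating-point location (x, y)
    without quantizing, as ROI Align does.
    Assumes 0 <= x < W - 1 and 0 <= y < H - 1."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    dx, dy = x - x0, y - y0
    return (feat[y0, x0] * (1 - dx) * (1 - dy) +
            feat[y0, x1] * dx * (1 - dy) +
            feat[y1, x0] * (1 - dx) * dy +
            feat[y1, x1] * dx * dy)
```

At integer coordinates the sample reduces to the stored pixel value, so no information is lost relative to quantized pooling.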
It should be noted that the feature dimension of a candidate feature region can be adjusted to different fixed sizes according to the intended input target. For example, for a candidate feature region to be input to the target detection head, the feature dimension is adjusted to 7 × 7 × 256; for one to be input to the segmentation head, the feature dimension is adjusted to 14 × 14 × 256.
In some embodiments, the server may obtain the bounding box of the target object in a candidate region in the following way: performing target object detection on each candidate region through the fully connected network (i.e., the target detection head) included in the detection network, to determine the candidate regions containing the target object; and performing bounding-box regression based on the candidate regions containing the target object, to obtain the bounding box of the target object.
In some embodiments, the fully connected network (i.e., the target detection head) included in the detection network also outputs the classification result of the target object. In embodiments of the present invention, binary classification can be used for the target object, i.e., whether the region is the target object (for example, whether it is an advertisement position).
Illustratively, the feature of a candidate region input to the target detection head is 7 × 7 × 256. This feature is flattened pixel-wise into a 12544-dimensional vector and passed through a fully connected layer with input dimension 12544 and output dimension 1024, yielding a 1024-dimensional vector. This vector is passed through a fully connected layer with output dimension 1 to obtain the classification result (e.g., whether it is an advertisement position), and in parallel through a fully connected layer with output dimension 4 to obtain the bounding-box regression result, i.e., the offsets of the four variables of the predicted bounding box (center-point abscissa and ordinate, height, and width) relative to the target bounding box.
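The dimensions above can be checked with a shape-level sketch (random untrained weights; biases and activations are omitted, so only the shapes are meaningful — the function name is ours):

```python
import numpy as np

def detection_head(feature, rng):
    """Shape-level sketch of the target detection head: flatten the
    7 x 7 x 256 candidate-region feature to 12544, one fully connected
    layer to 1024, then parallel branches of output dimension 1 and 4."""
    x = feature.reshape(-1)                               # 12544-d vector
    hidden = x @ rng.standard_normal((12544, 1024))       # FC 12544 -> 1024
    cls_score = hidden @ rng.standard_normal((1024, 1))   # is-advertisement score
    box_delta = hidden @ rng.standard_normal((1024, 4))   # (x, y, w, h) offsets
    return cls_score, box_delta

rng = np.random.default_rng(0)
cls_score, box_delta = detection_head(np.zeros((7, 7, 256)), rng)
```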
In some embodiments, the server may obtain the segmentation mask of the target object in a candidate region in the following way: through the convolutional network (the segmentation head) included in the detection network, the server performs semantic segmentation of the target object on each candidate region, generating the segmentation mask of the target object.
Illustratively, the feature of a candidate region input to the segmentation head is 14 × 14 × 256. This feature passes through 4 convolutional layers with output dimension 256, kernel size 3 × 3, and stride 1, yielding a 14 × 14 × 256 feature; then through a deconvolutional layer with output dimension 256, kernel size 3 × 3, and stride 2, which doubles the resolution, yielding a 28 × 28 × 256 feature; and finally through a convolutional layer with output dimension 1 and kernel size 1 × 1, which produces the segmentation mask.
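Tracing the feature shapes through the segmentation head just described (assuming "same" padding for the stride-1 convolutions, so the 3 × 3 kernels preserve the 14 × 14 size — an assumption on our part):

```python
def mask_head_shapes(h=14, w=14):
    """Feature shapes through the segmentation head: four 3x3 stride-1
    convolutions (256 channels), one stride-2 deconvolution (256 channels)
    that doubles the resolution, then a 1x1 convolution to a 1-channel mask."""
    shapes = [(h, w, 256)] * 4        # 3x3, stride 1, 'same' padding assumed
    h, w = h * 2, w * 2               # stride-2 deconvolution: 14 -> 28
    shapes.append((h, w, 256))
    shapes.append((h, w, 1))          # final 1x1 convolution -> segmentation mask
    return shapes
```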
In some embodiments, the server may obtain the keypoint mask of the target object in a candidate region in the following way: through the fully convolutional network (the keypoint head) included in the detection network, the server performs semantic segmentation of the keypoints of the target object on each candidate region, generating the keypoint mask of the keypoints of the target object.
Here, in actual implementation, different keypoints can be preset for different target objects. For example, when the target object is an advertisement position, the keypoints of the target object are the four corner points of the advertisement position.
Illustratively, the feature of a candidate region input to the keypoint head is 14 × 14 × 256. The keypoint head is composed of 8 convolutional layers with kernel size 3 × 3 and output dimension 512, followed by a deconvolutional layer with kernel size 3 × 3, stride 2, and output dimension 1, and a 2× bilinear upsampling layer, producing a 56 × 56 output resolution.
Step 704: based on the bounding box, segmentation mask, keypoint mask, and classification result of the target object, the value of the target loss function of the image processing model is determined.
Here, the target loss function of the image processing model is explained. It is given by:
L = L_det + L_mask + β·L_keypoint;  (1)
where
L_det = L_rpn + L_rcnn;  (2)
Here, L_det represents the loss function of Faster R-CNN and is composed of an RPN part and an R-CNN part. The RPN part contains two loss terms: the classification prediction probability and the bounding-box regression loss. The classification prediction probability uses cross entropy; the RPN performs binary classification, i.e., object versus no object. The bounding-box regression loss uses the smooth_L1 function, predicting the offsets of the object's center point, width, and height relative to the annotated ones; the four variables are regressed in the form (x, y, w, h), where x denotes the center-point abscissa, y the center-point ordinate, w the object width, and h the object height. The R-CNN part (i.e., the detection head) also contains two loss terms: the classification prediction probability and the bounding-box regression loss. The classification prediction probability uses cross entropy; in target object detection, the R-CNN also performs binary classification, distinguishing target object (e.g., the advertisement class) from non-target object (non-advertisement). Its bounding-box regression loss is identical to that of the RPN.
L_mask is the average cross entropy between the class probability of each pixel and its label (the annotated information). Each pixel is also binary-classified in the target object detection task, i.e., whether it belongs to the target object or not (for example, belongs to the advertisement or not), with softmax output. For each object, the classification cross entropy is computed over all points in the candidate region and averaged to obtain L_mask.
L_keypoint is similar to L_mask; it is also the average cross entropy between the class probability of each pixel and its label. The difference is that in the labels for L_keypoint, only the pixels at keypoints are classified as the target object, and all other pixels are classified as non-target. Since L_keypoint and L_mask are otherwise similar, the weight of this loss is adjusted by the coefficient β, which serves to emphasize the keypoints; in actual implementation, β can be set to 5.
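Formula (1) can be sketched numerically as follows, with L_det treated as already computed; a per-pixel binary cross entropy stands in for the softmax-based term described above, and the function names are ours:

```python
import numpy as np

def mean_cross_entropy(p, y):
    """Average cross entropy between per-pixel foreground probabilities p
    and binary labels y, as used for both L_mask and L_keypoint."""
    p = np.clip(p, 1e-7, 1.0 - 1e-7)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

def total_loss(l_det, mask_prob, mask_label, kp_prob, kp_label, beta=5.0):
    """L = L_det + L_mask + beta * L_keypoint, with beta = 5 emphasizing
    the keypoint term; L_det = L_rpn + L_rcnn is computed elsewhere."""
    return (l_det
            + mean_cross_entropy(mask_prob, mask_label)
            + beta * mean_cross_entropy(kp_prob, kp_label))
```

With perfect mask and keypoint predictions, the total loss reduces to L_det; a wrong keypoint prediction is penalized five times as heavily as an equally wrong mask prediction.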
In some embodiments, the server may determine the value of the target loss function of the image processing model as follows: respectively obtaining a first difference between the bounding box and the target bounding box, a second difference between the segmentation mask and the target segmentation mask, a third difference between the keypoint mask and the target keypoint mask, and a fourth difference between the classification result and the target classification result; and determining the value of the loss function of the image processing model based on the first, second, third, and fourth differences.
Here, the target bounding box, target segmentation mask, and target keypoint mask are, respectively, the bounding box, segmentation mask, and keypoint mask of the target object annotated in the sample image.
Step 705: based on the determined value of the target loss function, the model parameters of the image processing model are updated.
In some embodiments, the server may update the model parameters of the image processing model in the following way: when the value of the loss function exceeds a preset threshold, determining a corresponding error signal based on the loss function of the image processing model; back-propagating the error signal in the image processing model, and updating the model parameters of the image processing model during the propagation.
Back-propagation is explained here. Training sample data is input to the input layer of the neural network model, passes through the hidden layers, and finally reaches the output layer, where a result is output; this is the forward-propagation process of the neural network model. Since the output result of the neural network model differs from the actual result, the error between the output result and the actual value is computed and back-propagated from the output layer through the hidden layers to the input layer; during back-propagation, the model parameter values are adjusted according to the error. This process is iterated until convergence.
In some embodiments, after the training of the image processing model is completed by updating its model parameters, the trained image processing model can be used to detect the target object (e.g., an advertisement position) in an image to be recognized. Specifically, the image to be recognized is input to the image processing model; feature extraction is performed on it through the backbone network to obtain its feature map; region selection is performed on the feature map through the region candidate network to determine candidate regions; and target object detection is performed on the candidate regions through the detection network to obtain the location parameters and classification parameters of the target object.
Next, taking an advertisement position as the target object, the training method of the image processing model provided in an embodiment of the present invention is explained. Figure 12 and Figure 13 are flow diagrams of the training method of the image processing model provided in an embodiment of the present invention. In some embodiments, the method can be implemented by a server or a terminal, or by a server and a terminal in coordination. Taking server implementation as an example, and referring to Figure 12 and Figure 13, the training method of the image processing model provided in an embodiment of the present invention includes:
Step 801: the server performs feature extraction, through the backbone network, on a sample image containing an advertisement position, to obtain a feature map of the sample image.
Here, in actual implementation, the sample image is annotated with the following information: the detection/classification result of the advertisement position, i.e., whether each region belongs to the advertisement position; the bounding box of the advertisement position; the segmentation mask of the advertisement position; and the keypoint mask of the advertisement position.
Step 802: through the region candidate network, region selection is performed on the feature map to determine a fixed number of candidate regions.
In actual implementation, the server generates, through the region candidate network, multiple initial bounding boxes corresponding to the feature map; scans the multiple initial bounding boxes with a sliding window to determine which of them correspond to foreground; and performs bounding-box regression on the initial bounding boxes corresponding to foreground, so as to determine the candidate regions.
Step 803: through the detection network, the feature areas corresponding to the candidate regions are cropped from the feature map to obtain candidate feature regions, and the features of each candidate feature region are adjusted to a fixed-size feature dimension.
Step 804: through the detection network, advertisement position detection is performed on the adjusted candidate feature regions, to obtain the classification result of the advertisement position and the location parameters of the advertisement position.
Here, the classification result of the advertisement position characterizes whether the candidate feature region corresponds to an advertisement position, i.e., the classification result is either "is an advertisement position" or "is not an advertisement position".
The location parameters include the bounding box, segmentation mask, and keypoint mask of the advertisement position. Here, the keypoints of the advertisement position are its four corner points.
Step 805: based on the detection result and the location parameters of the advertisement position, the value of the target loss function of the image processing model is determined.
Here, the target loss function of the image processing model is as given in formulas (1) and (2).
Step 806: based on the determined value of the target loss function, the model parameters of the image processing model are updated.
In actual implementation, the server may update the model parameters of the image processing model in the following way: when the value of the loss function exceeds a preset threshold, determining a corresponding error signal based on the loss function of the image processing model; back-propagating the error signal in the image processing model, and updating the model parameters of the image processing model during the propagation.
Next, again taking an advertisement position as the target object, detection of advertisement positions in video is considered; an advertisement may be a poster, a framed advertisement, or a frameless advertisement. Referring to Figure 14, which is a schematic diagram of advertisement position detection provided in an embodiment of the present invention, the purpose of the image processing model is to output the position indicated by numeral 141 in Figure 14. Referring to Figure 12, the image processing model provided in an embodiment of the present invention is improved on the basis of the Mask R-CNN framework. Mask R-CNN is a framework for image instance segmentation; it is based on the target detection framework Faster R-CNN and adds, on top of Faster R-CNN, a branch for pixel-level segmentation of the target.
Mask R-CNN has four parts: a backbone network (backbone), a region proposal network (RPN), a detection head (bbox head), and a segmentation head (mask head). This framework decouples the image instance segmentation problem into a target detection problem and a semantic segmentation problem.
The backbone network is usually a convolutional neural network (CNN) used to extract picture features; in actual implementation, the backbone network is usually a CNN pre-trained on ImageNet, such as ResNet, Inception V3, or DenseNet. The region candidate network (RPN) is usually a small convolutional neural network used to propose regions of interest and judge whether each region contains an object, performing bounding-box regression on the candidate regions predicted to contain objects. The detection head is usually the target detection framework R-CNN (regions with CNN features), which performs further bounding-box regression and object category prediction on the candidate regions, obtained from the region candidate network, that contain objects. The segmentation head is usually a fully convolutional network (FCN), which performs semantic segmentation of the object on those candidate regions.
The framework of the image processing model provided in an embodiment of the present invention improves the Mask R-CNN framework by adding a pixel-level keypoint detection branch (keypoint head). The keypoint head is a fully convolutional network (FCN) that outputs a keypoint mask whose resolution is twice that of the segmentation mask; compared with segmentation, a relatively high resolution is needed for keypoint-level localization accuracy.
The target detection framework for images is usually a two-stage detector: the first stage proposes regions of interest and performs coarse position regression on them, and the second stage performs object classification and further position regression on the coarsely regressed regions of interest. The image processing model framework provided in an embodiment of the present invention contains a branch that performs pixel-level segmentation of the object, implemented by a fully convolutional network; it introduces pixel-level labels and enhances the semantic information and accurate location information of the features, which can significantly improve the accuracy of target detection. It also contains a keypoint detection head, implemented by a fully convolutional network. Combined with the particularity of advertisement objects, detecting the four corner points of an advertisement can further determine the advertisement position, while keypoint detection requires a higher output resolution than segmentation, improving the detection accuracy of the image processing model.
In image processing, a keypoint is essentially a kind of feature: an abstract description of a fixed area or spatial physical relationship that describes a combination or contextual relationship within a certain neighborhood. A keypoint does not merely carry information or represent a position; it also represents the combinational relationship between the context and the surrounding neighborhood. Keypoints in advertisements are relatively easy to model: advertisements are a class of objects with fairly regular shapes, usually quadrilaterals, whose four corner points can be regarded as four keypoints. The position of each keypoint is modeled as an individual one-hot mask, with one channel per keypoint and one feature map per channel, and Mask R-CNN is used to predict the 4 masks. For the 4 keypoints of an instance (an advertisement position), the training target is a one-hot-encoded m × m binary mask in which only one pixel is labeled as foreground.
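The one-hot keypoint targets described above can be built as follows (m = 56 is chosen to match the keypoint head's output resolution; the helper name and corner ordering are ours):

```python
import numpy as np

def keypoint_masks(corners, m=56):
    """Encode the four advertisement corner points as four one-hot m x m
    binary masks: one channel per keypoint, exactly one foreground pixel."""
    masks = np.zeros((4, m, m), dtype=np.uint8)
    for ch, (row, col) in enumerate(corners):
        masks[ch, row, col] = 1
    return masks
```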
Referring to Figure 12, for the target detection head, the feature of each candidate region obtained through the RPN and ROI Align is 7 × 7 × 256. This feature is flattened pixel-wise into a 12544-dimensional vector and passed through a fully connected layer with input dimension 12544 and output dimension 1024, yielding a 1024-dimensional vector. This vector is passed through a fully connected layer with output dimension 1 to obtain the classification result (whether it is an advertisement; there is only one advertisement class), and in parallel through a fully connected layer with output dimension 4 to obtain the bounding-box regression result, i.e., the offsets of the four variables of the predicted bounding box (center-point abscissa and ordinate, height, and width) relative to the labeled bounding box.
For the segmentation head, the feature of each candidate region obtained through the RPN and ROI Align is 14 × 14 × 256. This feature passes through 4 convolutional layers with output dimension 256, kernel size 3 × 3, and stride 1, yielding a 14 × 14 × 256 feature; then through a deconvolutional layer with output dimension 256, kernel size 3 × 3, and stride 2, which doubles the resolution, yielding a 28 × 28 × 256 feature; and finally through a convolutional layer with output dimension 1 and kernel size 1 × 1, which produces the segmentation mask.
For the keypoint head, the feature of each candidate region obtained through the RPN and ROI Align is 14 × 14 × 256. The keypoint head is composed of 8 convolutional layers with kernel size 3 × 3 and output dimension 512, followed by a deconvolutional layer with kernel size 3 × 3, stride 2, and output dimension 1, and a 2× bilinear upsampling layer, producing a 56 × 56 output resolution; compared with the segmentation mask, a relatively high resolution is needed for keypoint-level localization accuracy.
Next, the target loss function used to train the image processing model is explained. The target loss function of the image processing model is:
L = L_det + L_mask + β·L_keypoint
where
L_det = L_rpn + L_rcnn
L_det represents the loss function of Faster R-CNN and is composed of an RPN part and an R-CNN part. The RPN part contains two loss terms: the classification prediction probability and the bounding-box regression loss. The classification prediction probability uses cross entropy; the RPN performs binary classification, i.e., object versus no object. The bounding-box regression loss uses the smooth_L1 function, predicting the offsets of the object's center point, width, and height relative to the labels; the four variables are regressed in the form (x, y, w, h), where x denotes the center-point abscissa, y the center-point ordinate, w the object width, and h the object height. The R-CNN part (i.e., the detection head) also contains two loss terms: the classification prediction probability and the bounding-box regression loss. The classification prediction probability uses cross entropy; in advertisement position prediction, the R-CNN also performs binary classification, distinguishing the advertisement class from the non-advertisement class. Its bounding-box regression loss is the same as in the RPN.
L_mask is the average cross entropy between the class probability of each pixel and its label. Each pixel is also binary-classified in the advertisement position prediction task, i.e., whether the pixel belongs to the advertisement or not, with softmax output. For each object, the classification cross entropy is computed over all points in the candidate region and averaged to obtain L_mask.
L_keypoint is similar to L_mask; it is also the average cross entropy between the class probability of each pixel and its label. The difference is that in the labels for L_keypoint, only the pixels at keypoints are classified as advertisement, and all other pixels are classified as non-advertisement. Since L_keypoint and L_mask are otherwise similar, the weight of this loss is adjusted by the coefficient β, which serves to emphasize the keypoints; in actual implementation, β can be set to 5.
In the training process of the image processing model, model training is implemented by stochastic gradient descent based on the above target loss function of the image processing model.
In practical applications, a video picture in which advertisement positions are to be detected is input to the trained image processing model, forward computation is performed, and the detection result is obtained.
With the above embodiments of the present invention, since the segmentation mask and keypoint mask of a target object can characterize the location information of the target object in the image more accurately, the image processing model trained with the combination of bounding boxes, segmentation masks, and keypoint masks can locate target objects in images more accurately, improving the detection accuracy for target objects.
The software implementation of the training device of the image processing model provided in an embodiment of the present invention is now described. Referring to Fig. 8, the training device of the image processing model provided in an embodiment of the present invention includes:
a feature extraction module, configured to perform feature extraction, through the backbone network, on a sample image containing a target object, to obtain a feature map of the sample image;
a region selection module, configured to perform region selection on the feature map, through the region candidate network, to determine candidate regions;
an object detection module, configured to perform target object detection on the candidate regions, through the detection network, to obtain location parameters of the target object in the candidate regions, the location parameters including the bounding box, segmentation mask, and keypoint mask of the target object;
a loss determining module, configured to determine the value of the target loss function of the image processing model based on the bounding box, segmentation mask, and keypoint mask of the target object; and
a parameter updating module, configured to update the model parameters of the image processing model based on the determined value of the target loss function.
In some embodiments, the region selection module is configured to generate, through the region candidate network, multiple initial bounding boxes corresponding to the feature map; scan the multiple initial bounding boxes with a sliding window to determine which of them correspond to foreground; and perform bounding-box regression on the initial bounding boxes corresponding to foreground, so as to determine the candidate regions.
In some embodiments, the device further includes a segmentation module, configured to crop, through the detection network, the feature areas corresponding to the candidate regions from the feature map to obtain candidate feature regions, and to adjust the features of each candidate feature region to a fixed-size feature dimension.
In some embodiments, the object detection module is further configured to perform target object detection on each candidate region, through the fully connected network included in the detection network, to determine the candidate regions containing the target object; and to perform bounding-box regression based on the candidate regions containing the target object, to obtain the bounding box of the target object.
In some embodiments, the object detection module is further configured to perform, through the convolutional network included in the detection network, semantic segmentation of the target object on each candidate region, generating the segmentation mask of the target object.
In some embodiments, the object detection module is further configured to perform, through the fully convolutional network included in the detection network, semantic segmentation of the keypoints of the target object on each candidate region, generating the keypoint mask of the keypoints of the target object.
In some embodiments, the loss determining module is further configured to respectively obtain a first difference between the bounding box and the target bounding box, a second difference between the segmentation mask and the target segmentation mask, and a third difference between the keypoint mask and the target keypoint mask; and to determine the value of the loss function of the image processing model based on the first difference, the second difference, and the third difference.
In some embodiments, the parameter updating module is further configured to, when the value of the loss function exceeds a preset threshold, determine a corresponding error signal based on the loss function of the image processing model; back-propagate the error signal in the image processing model; and update the model parameters of the image processing model during the propagation.
It should be noted that the above description of the device is similar to the above description of the method and shares the beneficial effects of the method, so it is not repeated here; for technical details not disclosed in the device of the embodiment of the present invention, please refer to the description of the method embodiments of the present invention.
An embodiment of the present invention also provides an electronic device, including: a memory for storing an executable program; and a processor that, when executing the executable program stored in the memory, implements the above training method of the image processing model provided in an embodiment of the present invention.
An embodiment of the present invention also provides a storage medium storing executable instructions which, when executed by a processor, cause the processor to execute the training method of the image processing model provided in an embodiment of the present invention.
All or part of the steps of the above embodiments may be implemented by hardware under the control of program instructions. The foregoing program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The foregoing storage medium includes various media capable of storing program code, such as a removable storage device, a random access memory (RAM), a read-only memory (ROM), a magnetic disk, or an optical disc.
Alternatively, when the above integrated unit of the present invention is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention, or the part thereof contributing to the related art, may essentially be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media capable of storing program code, such as a removable storage device, a RAM, a ROM, a magnetic disk, or an optical disc.
The above are merely embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A training method of an image processing model, wherein the image processing model includes a backbone network, a region proposal network, and a detection network, the method comprising:
performing, by the backbone network, feature extraction on a sample image containing a target object to obtain a feature map of the sample image;
performing, by the region proposal network, region selection on the feature map to determine candidate regions;
performing, by the detection network, target object detection on the candidate regions to obtain location parameters and a classification parameter of the target object, the classification parameter including a classification result of the target object, and the location parameters including a bounding box, a segmentation mask, and a key point mask of the target object;
determining a value of a target loss function of the image processing model based on the bounding box, the segmentation mask, the key point mask, and the classification result of the target object; and
updating model parameters of the image processing model based on the determined value of the target loss function.
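As a concrete illustration of the three claimed stages (feature extraction, region selection, detection), the following is a minimal NumPy sketch in which each sub-network is replaced by a trivial stand-in; the helper names `backbone`, `region_proposal`, and `detect` are hypothetical, and the claim does not prescribe any particular architecture.

```python
import numpy as np

# Hypothetical stand-ins for the three claimed sub-networks; the claim does
# not fix their architectures, so each stage is reduced to a trivial operation.

def backbone(sample_image):
    # Feature extraction: a 2x2 average pooling of the sample image.
    h, w = sample_image.shape
    return sample_image.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def region_proposal(feature_map, threshold=0.5):
    # Region selection: keep feature-map cells whose activation is high.
    return np.argwhere(feature_map > threshold)

def detect(feature_map, candidates):
    # Target object detection: a bounding box enclosing the candidate cells
    # (location parameter) and a dummy class label (classification parameter).
    ys, xs = candidates[:, 0], candidates[:, 1]
    box = (int(ys.min()), int(xs.min()), int(ys.max()), int(xs.max()))
    return box, "target"

image = np.zeros((4, 4))
image[2:, 2:] = 1.0                        # a bright "object" in one corner
features = backbone(image)                 # step 1: feature map
candidates = region_proposal(features)     # step 2: candidate regions
box, label = detect(features, candidates)  # step 3: location + classification
```

On this toy input the only foreground cell is the pooled corner, so `box` collapses to a single cell; a real model would add the mask and key point heads of the later claims on top of this skeleton.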
2. The method according to claim 1, wherein performing, by the region proposal network, region selection on the feature map to determine the candidate regions comprises:
generating, by the region proposal network, a plurality of initial bounding boxes corresponding to the feature map;
scanning the plurality of initial bounding boxes with a sliding window to determine the initial bounding boxes corresponding to the foreground; and
performing bounding-box regression on the initial bounding boxes corresponding to the foreground to determine the candidate regions.
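The foreground selection and bounding-box regression of claim 2 can be sketched as follows; the `(y, x, h, w)` anchor format and the score threshold are assumptions, and in practice the foreground scores would come from the region proposal network itself rather than being supplied by hand.

```python
import numpy as np

def foreground_anchors(scores, anchors, threshold=0.5):
    # Sliding-window scan reduced to a vectorised comparison: an initial
    # bounding box (anchor) is kept as foreground when its score is high.
    keep = scores > threshold
    return anchors[keep]

def regress(anchors, deltas):
    # Bounding-box regression: shift each surviving anchor by predicted
    # (dy, dx, dh, dw) offsets to obtain the candidate regions.
    y, x, h, w = anchors.T
    dy, dx, dh, dw = deltas.T
    return np.stack([y + dy * h, x + dx * w,
                     h * np.exp(dh), w * np.exp(dw)], axis=1)

anchors = np.array([[10.0, 10.0, 4.0, 4.0], [50.0, 50.0, 4.0, 4.0]])
scores = np.array([0.9, 0.1])          # only the first anchor is foreground
fg = foreground_anchors(scores, anchors)
candidates = regress(fg, np.zeros((len(fg), 4)))  # zero deltas: unchanged
```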
3. The method according to claim 1, further comprising:
intercepting, by the detection network, the feature area corresponding to each candidate region from the feature map to obtain a candidate feature region; and
adjusting the features of the candidate feature region to a fixed spatial size.
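The fixed-size adjustment of claim 3 corresponds, in detectors of this family, to an RoI pooling or RoIAlign step; a minimal nearest-neighbour version, built around a hypothetical `crop_and_resize` helper, might look like:

```python
import numpy as np

def crop_and_resize(feature_map, region, size=(2, 2)):
    # Intercept the candidate's feature area from the feature map...
    y0, x0, y1, x1 = region
    crop = feature_map[y0:y1, x0:x1]
    # ...then adjust it to a fixed spatial size by nearest-neighbour
    # sampling (the claim only requires *some* fixed-size adjustment).
    rows = np.linspace(0, crop.shape[0] - 1, size[0]).round().astype(int)
    cols = np.linspace(0, crop.shape[1] - 1, size[1]).round().astype(int)
    return crop[np.ix_(rows, cols)]

fm = np.arange(36).reshape(6, 6).astype(float)
fixed = crop_and_resize(fm, (0, 0, 4, 4))
```

Whatever the candidate region's size, the output has the fixed shape expected by the downstream fully connected head.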
4. The method according to claim 1, wherein performing, by the detection network, target object detection on the candidate regions to obtain the location parameters of the target object comprises:
performing, by a fully connected network included in the detection network, target object detection on each candidate region to determine the candidate regions containing the target object; and
performing bounding-box regression based on the candidate regions containing the target object to obtain the bounding box corresponding to the target object.
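A toy version of the fully connected detection head of claim 4; the random weights stand in for trained parameters and the 8-dimensional RoI feature vectors are an arbitrary assumption, so only the shapes of the outputs are meaningful here.

```python
import numpy as np

rng = np.random.default_rng(0)
W_cls = rng.normal(size=(8, 2))   # hypothetical fully connected weights:
W_box = rng.normal(size=(8, 4))   # classification head and regression head

def detect_head(roi_features):
    # Fully connected classification: which candidate regions contain
    # the target object (argmax over two logits per region).
    logits = roi_features @ W_cls
    contains = logits[:, 1] > logits[:, 0]
    # Bounding-box regression only on the regions containing the target.
    deltas = roi_features[contains] @ W_box
    return contains, deltas

rois = rng.normal(size=(3, 8))    # three candidate regions, 8-dim features
contains, deltas = detect_head(rois)
```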
5. The method according to claim 1, wherein performing, by the detection network, target object detection on the candidate regions to obtain the location parameters of the target object comprises:
performing, by a convolutional network included in the detection network, semantic segmentation of the target object on each candidate region to generate the segmentation mask corresponding to the target object.
6. The method according to claim 1, wherein performing, by the detection network, target object detection on the candidate regions to obtain the location parameters of the target object comprises:
performing, by a fully convolutional network included in the detection network, semantic segmentation of the key points of the target object on each candidate region to generate the key point mask of the key points of the target object.
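For the key point head of claim 6, a common realisation (for example in Mask R-CNN, cited in this application) is one heat map per key point whose maximum response defines a one-hot key point mask; the sketch below assumes that convention.

```python
import numpy as np

def keypoint_mask(heatmap):
    # The fully convolutional head predicts one heat map per key point;
    # the key point mask is one-hot at the maximum response, i.e. the
    # "semantic segmentation" covers a single pixel per key point.
    mask = np.zeros_like(heatmap, dtype=np.uint8)
    idx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    mask[idx] = 1
    return mask

heat = np.array([[0.1, 0.2],
                 [0.9, 0.3]])     # strongest response at row 1, column 0
mask = keypoint_mask(heat)
```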
7. The method according to claim 1, wherein determining the value of the target loss function of the image processing model based on the bounding box, the segmentation mask, the key point mask, and the classification result of the target object comprises:
obtaining a first difference between the bounding box and a target bounding box, a second difference between the segmentation mask and a target segmentation mask, a third difference between the key point mask and a target key point mask, and a fourth difference between the classification result and a target classification result; and
determining the value of the loss function of the image processing model based on the first difference, the second difference, the third difference, and the fourth difference.
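The four differences of claim 7 can be combined into a single scalar loss; the particular distance functions below (absolute box error, mean mask error, 0/1 classification error) are illustrative choices, since the claim does not fix them.

```python
import numpy as np

def model_loss(box, box_t, seg, seg_t, kp, kp_t, cls, cls_t):
    # First difference: gap between predicted and target bounding box.
    d1 = np.abs(np.asarray(box) - np.asarray(box_t)).sum()
    # Second / third differences: per-pixel mask errors (cross-entropy is
    # common; a plain mean absolute difference keeps the sketch minimal).
    d2 = np.abs(seg - seg_t).mean()
    d3 = np.abs(kp - kp_t).mean()
    # Fourth difference: 0/1 classification disagreement.
    d4 = float(cls != cls_t)
    return d1 + d2 + d3 + d4

loss = model_loss([0, 0, 4, 4], [0, 0, 4, 4],
                  np.ones((2, 2)), np.ones((2, 2)),
                  np.zeros((2, 2)), np.zeros((2, 2)),
                  "cat", "cat")
```

When every prediction matches its target the loss is zero, and each mismatched term raises it, which is what the parameter update of claim 8 acts on.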
8. The method according to claim 1, wherein updating the model parameters of the image processing model based on the determined value of the target loss function comprises:
when the value of the loss function exceeds a preset threshold, determining a corresponding error signal based on the loss function of the image processing model; and
back-propagating the error signal through the image processing model, and updating the model parameters of the image processing model during the propagation.
9. A training device of an image processing model, the device comprising:
a feature extraction module configured to perform, by the backbone network, feature extraction on a sample image containing a target object to obtain a feature map of the sample image;
a region selection module configured to perform, by the region proposal network, region selection on the feature map to determine candidate regions;
an object detection module configured to perform, by the detection network, target object detection on the candidate regions to obtain location parameters and a classification parameter of the target object, the classification parameter including a classification result of the target object, and the location parameters including a bounding box, a segmentation mask, and a key point mask of the target object;
a loss determining module configured to determine a value of a target loss function of the image processing model based on the bounding box, the segmentation mask, the key point mask, and the classification result of the target object; and
a parameter updating module configured to update model parameters of the image processing model based on the determined value of the target loss function.
10. A storage medium storing executable instructions for causing a processor, when executing the instructions, to implement the training method of the image processing model according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910798468.2A CN110503097A (en) | 2019-08-27 | 2019-08-27 | Training method, device and the storage medium of image processing model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910798468.2A CN110503097A (en) | 2019-08-27 | 2019-08-27 | Training method, device and the storage medium of image processing model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110503097A true CN110503097A (en) | 2019-11-26 |
Family
ID=68589972
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910798468.2A Pending CN110503097A (en) | 2019-08-27 | 2019-08-27 | Training method, device and the storage medium of image processing model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110503097A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1453747A (en) * | 2002-04-25 | 2003-11-05 | Microsoft Corporation | Cluster |
US6842538B2 (en) * | 2001-03-23 | 2005-01-11 | Shih-Jong J. Lee | Automatic detection of alignment or registration marks |
CN1630439A (en) * | 2004-05-06 | 2005-06-22 | AU Optronics Corporation | Separate type mask device for manufacturing OLED display |
CN1941850A (en) * | 2005-09-29 | 2007-04-04 | Institute of Automation, Chinese Academy of Sciences | Pedestrian tracking method based on principal-axis matching under multiple video cameras |
CN104835150A (en) * | 2015-04-23 | 2015-08-12 | Shenzhen University | Learning-based fundus blood vessel geometric key point image processing method and apparatus |
CN106203423A (en) * | 2016-06-26 | 2016-12-07 | Guangdong University of Foreign Studies | Weak-structure-aware visual target tracking method integrating context detection |
CN106887008A (en) * | 2017-01-04 | 2017-06-23 | Nubia Technology Co., Ltd. | Method, device and terminal for realizing interactive image segmentation |
EP3493105A1 (en) * | 2017-12-03 | 2019-06-05 | Facebook, Inc. | Optimizations for dynamic object instance detection, segmentation, and structure mapping |
Non-Patent Citations (5)
Title |
---|
KAIMING HE et al.: "Mask R-CNN", arXiv:1703.06870v3 *
是否龙磊磊真的一无所有: "CNN-based object detection methods (RCNN, Fast-RCNN, Faster-RCNN, Mask-RCNN, YOLO, SSD) for pedestrian detection and object tracking with convolutional neural networks", https://blog.csdn.net/qq_32998593/article/details/80558449 *
ZENG Xing et al.: "Implementation of an embedded human sitting-posture detection *** based on depth images", Computer Measurement & Control *
纸上得来终觉浅: "The RPN region algorithm in object detection", https://blog.csdn.net/qq_32172681/article/details/99104310 *
苦尽甘来定不负生而善之: "RPN", https://www.cnblogs.com/pacino12134/p/11409620.html *
Cited By (77)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111027621A (en) * | 2019-12-09 | 2020-04-17 | 上海扩博智能技术有限公司 | Training method, system, equipment and storage medium of image recognition model |
CN111127502A (en) * | 2019-12-10 | 2020-05-08 | 北京地平线机器人技术研发有限公司 | Method and device for generating instance mask and electronic equipment |
CN111127502B (en) * | 2019-12-10 | 2023-08-29 | 北京地平线机器人技术研发有限公司 | Method and device for generating instance mask and electronic equipment |
CN111160434B (en) * | 2019-12-19 | 2024-06-07 | 中国平安人寿保险股份有限公司 | Training method and device for target detection model and computer readable storage medium |
CN111160434A (en) * | 2019-12-19 | 2020-05-15 | 中国平安人寿保险股份有限公司 | Training method and device of target detection model and computer readable storage medium |
CN113096134A (en) * | 2020-01-09 | 2021-07-09 | 舜宇光学(浙江)研究院有限公司 | Real-time instance segmentation method based on single-stage network, system and electronic equipment thereof |
CN113139546A (en) * | 2020-01-19 | 2021-07-20 | 北京达佳互联信息技术有限公司 | Training method of image segmentation model, and image segmentation method and device |
CN111340092A (en) * | 2020-02-21 | 2020-06-26 | 浙江大华技术股份有限公司 | Target association processing method and device |
CN111340092B (en) * | 2020-02-21 | 2023-09-22 | 浙江大华技术股份有限公司 | Target association processing method and device |
CN111341438A (en) * | 2020-02-25 | 2020-06-26 | 中国科学技术大学 | Image processing apparatus, electronic device, and medium |
CN111008622B (en) * | 2020-03-11 | 2020-06-12 | 腾讯科技(深圳)有限公司 | Image object detection method and device and computer readable storage medium |
CN111008622A (en) * | 2020-03-11 | 2020-04-14 | 腾讯科技(深圳)有限公司 | Image object detection method and device and computer readable storage medium |
CN111428875A (en) * | 2020-03-11 | 2020-07-17 | 北京三快在线科技有限公司 | Image recognition method and device and corresponding model training method and device |
CN111401376B (en) * | 2020-03-12 | 2023-06-30 | 腾讯科技(深圳)有限公司 | Target detection method, target detection device, electronic equipment and storage medium |
CN111401376A (en) * | 2020-03-12 | 2020-07-10 | 腾讯科技(深圳)有限公司 | Target detection method, target detection device, electronic equipment and storage medium |
CN111488911A (en) * | 2020-03-15 | 2020-08-04 | 北京理工大学 | Image entity extraction method based on Mask R-CNN and GAN |
CN113496158A (en) * | 2020-03-20 | 2021-10-12 | 中移(上海)信息通信科技有限公司 | Object detection model optimization method, device, equipment and storage medium |
CN111462060A (en) * | 2020-03-24 | 2020-07-28 | 湖南大学 | Method and device for detecting standard section image in fetal ultrasonic image |
CN113449538A (en) * | 2020-03-24 | 2021-09-28 | 顺丰科技有限公司 | Visual model training method, device, equipment and storage medium |
CN113468908B (en) * | 2020-03-30 | 2024-05-10 | 北京四维图新科技股份有限公司 | Target identification method and device |
CN113468908A (en) * | 2020-03-30 | 2021-10-01 | 北京四维图新科技股份有限公司 | Target identification method and device |
CN111462094A (en) * | 2020-04-03 | 2020-07-28 | 联觉(深圳)科技有限公司 | PCBA component detection method and device and computer readable storage medium |
CN111709471A (en) * | 2020-06-12 | 2020-09-25 | 腾讯科技(深圳)有限公司 | Object detection model training method and object detection method and device |
CN113807147B (en) * | 2020-06-15 | 2024-05-21 | 北京达佳互联信息技术有限公司 | Target detection and network training method and device thereof |
CN113807147A (en) * | 2020-06-15 | 2021-12-17 | 北京达佳互联信息技术有限公司 | Target detection and network training method and device |
CN113822302A (en) * | 2020-06-18 | 2021-12-21 | 北京金山数字娱乐科技有限公司 | Training method and device for target detection model |
CN111932545A (en) * | 2020-07-14 | 2020-11-13 | 浙江大华技术股份有限公司 | Image processing method, target counting method and related device thereof |
CN111860522B (en) * | 2020-07-23 | 2024-02-02 | 中国平安人寿保险股份有限公司 | Identity card picture processing method, device, terminal and storage medium |
CN111860522A (en) * | 2020-07-23 | 2020-10-30 | 中国平安人寿保险股份有限公司 | Identity card picture processing method and device, terminal and storage medium |
CN111860413A (en) * | 2020-07-29 | 2020-10-30 | Oppo广东移动通信有限公司 | Target object detection method and device, electronic equipment and storage medium |
CN111754532B (en) * | 2020-08-12 | 2023-07-11 | 腾讯科技(深圳)有限公司 | Image segmentation model searching method, device, computer equipment and storage medium |
CN111754532A (en) * | 2020-08-12 | 2020-10-09 | 腾讯科技(深圳)有限公司 | Image segmentation model searching method and device, computer equipment and storage medium |
CN111985488A (en) * | 2020-09-01 | 2020-11-24 | 江苏方天电力技术有限公司 | Target detection segmentation method and system based on offline Gaussian model |
CN111985488B (en) * | 2020-09-01 | 2022-06-10 | 江苏方天电力技术有限公司 | Target detection segmentation method and system based on offline Gaussian model |
WO2022048151A1 (en) * | 2020-09-02 | 2022-03-10 | 北京迈格威科技有限公司 | Semantic segmentation model training method and apparatus, and image semantic segmentation method and apparatus |
CN112232346A (en) * | 2020-09-02 | 2021-01-15 | 北京迈格威科技有限公司 | Semantic segmentation model training method and device and image semantic segmentation method and device |
CN111932530B (en) * | 2020-09-18 | 2024-02-23 | 北京百度网讯科技有限公司 | Three-dimensional object detection method, device, equipment and readable storage medium |
CN111932530A (en) * | 2020-09-18 | 2020-11-13 | 北京百度网讯科技有限公司 | Three-dimensional object detection method, device and equipment and readable storage medium |
CN112200115A (en) * | 2020-10-21 | 2021-01-08 | 平安国际智慧城市科技股份有限公司 | Face recognition training method, recognition method, device, equipment and storage medium |
CN112200115B (en) * | 2020-10-21 | 2024-04-19 | 平安国际智慧城市科技股份有限公司 | Face recognition training method, recognition method, device, equipment and storage medium |
EP4181067A4 (en) * | 2020-11-30 | 2023-12-27 | Samsung Electronics Co., Ltd. | Device and method for ai encoding and ai decoding of image |
CN112434715A (en) * | 2020-12-10 | 2021-03-02 | 腾讯科技(深圳)有限公司 | Target identification method and device based on artificial intelligence and storage medium |
CN112613560A (en) * | 2020-12-24 | 2021-04-06 | 哈尔滨市科佳通用机电股份有限公司 | Method for identifying front opening and closing damage fault of railway bullet train head cover based on Faster R-CNN |
CN112581567B (en) * | 2020-12-25 | 2024-05-28 | 腾讯科技(深圳)有限公司 | Image processing method, device, electronic equipment and computer readable storage medium |
CN112581567A (en) * | 2020-12-25 | 2021-03-30 | 腾讯科技(深圳)有限公司 | Image processing method, image processing device, electronic equipment and computer readable storage medium |
CN113591893A (en) * | 2021-01-26 | 2021-11-02 | 腾讯医疗健康(深圳)有限公司 | Image processing method and device based on artificial intelligence and computer equipment |
US12002254B2 (en) | 2021-02-26 | 2024-06-04 | Boe Technology Group Co., Ltd. | Method and apparatus of training object detection network and object detection method and apparatus |
WO2022178833A1 (en) * | 2021-02-26 | 2022-09-01 | 京东方科技集团股份有限公司 | Target detection network training method, target detection method, and apparatus |
CN112967200A (en) * | 2021-03-05 | 2021-06-15 | 北京字跳网络技术有限公司 | Image processing method, apparatus, electronic device, medium, and computer program product |
CN112949510A (en) * | 2021-03-08 | 2021-06-11 | 香港理工大学深圳研究院 | Human detection method based on fast R-CNN thermal infrared image |
WO2022213718A1 (en) * | 2021-04-07 | 2022-10-13 | 北京百度网讯科技有限公司 | Sample image increment method, image detection model training method, and image detection method |
CN113139441A (en) * | 2021-04-07 | 2021-07-20 | 青岛以萨数据技术有限公司 | Image processing method and system |
CN113421275A (en) * | 2021-05-13 | 2021-09-21 | 影石创新科技股份有限公司 | Image processing method, image processing device, computer equipment and storage medium |
CN113111872A (en) * | 2021-06-16 | 2021-07-13 | 智道网联科技(北京)有限公司 | Training method and device of image recognition model, electronic equipment and storage medium |
CN113111872B (en) * | 2021-06-16 | 2022-04-05 | 智道网联科技(北京)有限公司 | Training method and device of image recognition model, electronic equipment and storage medium |
CN113673505A (en) * | 2021-06-29 | 2021-11-19 | 北京旷视科技有限公司 | Example segmentation model training method, device and system and storage medium |
CN113313720A (en) * | 2021-06-30 | 2021-08-27 | 上海商汤科技开发有限公司 | Object segmentation method and device |
CN113313720B (en) * | 2021-06-30 | 2024-03-29 | 上海商汤科技开发有限公司 | Object segmentation method and device |
CN113470124A (en) * | 2021-06-30 | 2021-10-01 | 北京达佳互联信息技术有限公司 | Training method and device of special effect model and special effect generation method and device |
CN113470124B (en) * | 2021-06-30 | 2023-09-22 | 北京达佳互联信息技术有限公司 | Training method and device for special effect model, and special effect generation method and device |
CN113695256A (en) * | 2021-08-18 | 2021-11-26 | 国网江苏省电力有限公司电力科学研究院 | Power grid foreign matter detection and identification method and device |
CN113695256B (en) * | 2021-08-18 | 2023-05-23 | 国网江苏省电力有限公司电力科学研究院 | Power grid foreign matter detection and identification method and device |
CN113628208A (en) * | 2021-08-30 | 2021-11-09 | 北京中星天视科技有限公司 | Ship detection method, device, electronic equipment and computer readable medium |
CN113628208B (en) * | 2021-08-30 | 2024-02-06 | 北京中星天视科技有限公司 | Ship detection method, device, electronic equipment and computer readable medium |
CN113837205A (en) * | 2021-09-28 | 2021-12-24 | 北京有竹居网络技术有限公司 | Method, apparatus, device and medium for image feature representation generation |
CN114708569A (en) * | 2022-02-22 | 2022-07-05 | 广州文远知行科技有限公司 | Road curve detection method, device, equipment and storage medium |
CN114419337A (en) * | 2022-03-25 | 2022-04-29 | 阿里巴巴达摩院(杭州)科技有限公司 | Image detection method, three-dimensional modeling method, image analysis method and device |
CN115115825A (en) * | 2022-05-27 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Method and device for detecting object in image, computer equipment and storage medium |
CN115115825B (en) * | 2022-05-27 | 2024-05-03 | 腾讯科技(深圳)有限公司 | Method, device, computer equipment and storage medium for detecting object in image |
CN115049899B (en) * | 2022-08-16 | 2022-11-11 | 粤港澳大湾区数字经济研究院(福田) | Model training method, reference expression generation method and related equipment |
CN115049899A (en) * | 2022-08-16 | 2022-09-13 | 粤港澳大湾区数字经济研究院(福田) | Model training method, reference expression generation method and related equipment |
WO2024050207A1 (en) * | 2022-08-27 | 2024-03-07 | Qualcomm Incorporated | Online adaptation of segmentation machine learning systems |
CN116563665A (en) * | 2023-04-25 | 2023-08-08 | 北京百度网讯科技有限公司 | Training method of target detection model, target detection method, device and equipment |
CN116152758A (en) * | 2023-04-25 | 2023-05-23 | 松立控股集团股份有限公司 | Intelligent real-time accident detection and vehicle tracking method |
CN116523903A (en) * | 2023-06-25 | 2023-08-01 | 天津师范大学 | Multi-mode fracture injury detection and identification method and system |
CN117523320B (en) * | 2024-01-03 | 2024-05-24 | 深圳金三立视频科技股份有限公司 | Image classification model training method and terminal based on key points |
CN117523320A (en) * | 2024-01-03 | 2024-02-06 | 深圳金三立视频科技股份有限公司 | Image classification model training method and terminal based on key points |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110503097A (en) | Training method, device and the storage medium of image processing model | |
CN110163640B (en) | Method for implanting advertisement in video and computer equipment | |
CN111670457B (en) | Optimization of dynamic object instance detection, segmentation and structure mapping | |
US20190172223A1 (en) | Optimizations for Dynamic Object Instance Detection, Segmentation, and Structure Mapping | |
CN110390644A (en) | Adding greater realism to computer-generated images by smoothing jagged edges | |
CN108520229A (en) | Image detecting method, device, electronic equipment and computer-readable medium | |
CN108229355A (en) | Activity recognition method and apparatus, electronic equipment, computer storage media, program | |
CN108229303A (en) | Detection and recognition method, network training method for detection and recognition, and device, equipment, and medium | |
CN110321952A (en) | Training method for an image classification model and related device | |
CN109165645A (en) | Image processing method, apparatus, and related device | |
EP3493106A1 (en) | Optimizations for dynamic object instance detection, segmentation, and structure mapping | |
CN111274981B (en) | Target detection network construction method and device and target detection method | |
CN108280451A (en) | Semantic segmentation and network training method and device, equipment, medium, program | |
WO2019108250A1 (en) | Optimizations for dynamic object instance detection, segmentation, and structure mapping | |
JPWO2020240808A1 (en) | Learning device, classification device, learning method, classification method, learning program, and classification program | |
US20200320165A1 (en) | Techniques for generating templates from reference single page graphic images | |
CN112101344B (en) | Video text tracking method and device | |
CN112200041A (en) | Video motion recognition method and device, storage medium and electronic equipment | |
Han et al. | TSR-VFD: Generating temporal super-resolution for unsteady vector field data | |
CN109389660A (en) | Image generating method and device | |
US8370115B2 (en) | Systems and methods of improved boolean forms | |
Li et al. | Weakly supervised segmentation loss based on graph cuts and superpixel algorithm | |
Zhao et al. | Modified object detection method based on YOLO | |
Absetan et al. | Integration of deep learned and handcrafted features for image retargeting quality assessment | |
US20230072445A1 (en) | Self-supervised video representation learning by exploring spatiotemporal continuity |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |