CN108647641A - Video behavior segmentation method and device based on two-way model fusion - Google Patents
- Publication number: CN108647641A (application CN201810443505.3A)
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06V20/49 — Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
- G06F18/2411 — Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
- G06F18/25 — Pattern recognition: fusion techniques
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
This application discloses a video behavior segmentation method and device based on two-way model fusion. The method includes: segmenting a video into segments based on the correlation coefficient between adjacent video frames; for each video frame in a segment, recognizing the scene of the frame to obtain a scene feature vector; for each video frame in the segment, recognizing the local behavior features of the frame to obtain a local behavior feature vector; recognizing the behavior class of the frame and a corresponding confidence based on the scene feature vector and the local behavior feature vector; determining the behavior class of the segment from the behavior classes and confidences of its frames; and merging adjacent segments with identical behavior classes to obtain the segmentation result of the video. The method fuses two model branches, exploiting both the scene and the local-behavior dimensions to extract global behavior information, so that the video can be segmented quickly.
Description
Technical field
This application relates to the field of automatic image processing, and in particular to a video behavior segmentation method and device based on two-way model fusion.
Background
The rapid development of video compression algorithms and applications has brought massive amounts of video data. Video contains rich information; however, because video data is huge and, unlike text, does not directly express abstract concepts, extracting and structuring video information is relatively complex. A common approach to video information extraction and structuring is to first segment the video and then assign a class label to each segment. Segmenting video with traditional computer vision generally requires hand-engineered image features, which cannot flexibly adapt to changes across diverse scenes. Most video segmentation methods in practical use today rely only on per-frame color information: they detect changes between adjacent frames through various traditional computer-vision transforms to determine segmentation points, and then use clustering algorithms from machine learning to aggregate the adjacent segments, grouping similar ones into one class. However, these methods only achieve superficial segmentation and cannot recognize the semantics of each segment on screen.
Summary of the invention
The application aims to overcome, or at least partly solve or mitigate, the above problems.
According to one aspect of the application, a video segmentation method is provided, including:
A segment-splitting step: segmenting the video into segments based on the correlation coefficient between adjacent video frames;
A scene recognition step: for each video frame in a segment, recognizing the scene of the frame to obtain a scene feature vector;
A local behavior feature recognition step: for each video frame in the segment, recognizing the local behavior features of the frame to obtain a local behavior feature vector;
A video-frame behavior class judgment step: recognizing the behavior class of the frame and a corresponding confidence based on the scene feature vector and the local behavior feature vector;
A segment behavior class determination step: determining the behavior class of the segment based on the behavior classes and confidences of its video frames; and
A segment merging step: merging adjacent segments with identical behavior classes to obtain the segmentation result of the video.
This method fuses two model branches, exploiting both the scene and the local-behavior dimensions to extract global behavior information, so that the video can be segmented quickly.
Optionally, the segment-splitting step includes:
A histogram calculation step: calculating the YCbCr histogram of each video frame of the video;
A correlation coefficient calculation step: calculating the correlation coefficient between the YCbCr histogram of the video frame and that of the previous video frame; and
A threshold comparison step: when the correlation coefficient is less than a predetermined first threshold, taking the video frame as the start frame of a new segment.
Optionally, the scene recognition step includes:
A resolution conversion step: converting each of the RGB channels of the video frame to a fixed resolution; and
A scene feature vector generation step: inputting the converted video frame into a first network model to obtain the scene feature vector of the frame, where the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed.
Optionally, the local behavior feature recognition step includes:
A shortest-side fixing step: converting each of the RGB channels of the video frame to a resolution with a fixed shortest-side length; and
A local behavior feature vector generation step: inputting the resized video frame into the first network model, feeding the output of the first network model into a region-based convolutional neural network (Faster R-CNN) model, computing the optimal detection class result from the output of the region-based convolutional neural network, and passing the optimal detection class result through a region-of-interest pooling layer to obtain the local behavior feature vector.
Optionally, the video-frame behavior class judgment step includes:
A video frame feature vector merging step: merging the scene feature vector and the local behavior feature vector into a video frame feature vector; and
A behavior class and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior class of the frame and a corresponding confidence, where the third network is formed by four fully connected layers followed by a Softmax classifier.
Optionally, the segment behavior class determination step includes: when the ratio of the number of video frames with an identical behavior class to the total number of video frames of the segment exceeds a predetermined second threshold, taking that behavior class as the behavior class of the segment.
According to another aspect of the application, a video segmentation device is also provided, including:
A segment-splitting module, configured to segment the video into segments based on the correlation coefficient between adjacent video frames;
A scene recognition module, configured to recognize, for each video frame in a segment, the scene of the frame to obtain a scene feature vector;
A local behavior feature recognition module, configured to recognize, for each video frame in the segment, the local behavior features of the frame to obtain a local behavior feature vector;
A video-frame behavior class judgment module, configured to recognize the behavior class of the frame and a corresponding confidence based on the scene feature vector and the local behavior feature vector;
A segment behavior class determination module, configured to determine the behavior class of the segment based on the behavior classes and confidences of its video frames; and
A segment merging module, configured to merge adjacent segments with identical behavior classes to obtain the segmentation result of the video.
The device fuses two model branches, exploiting both the scene and the local-behavior dimensions to extract global behavior information, so that the video can be segmented quickly.
According to another aspect of the application, a computer device is also provided, including a memory, a processor, and a computer program stored in the memory and executable by the processor, where the processor implements the method described above when executing the computer program.
According to another aspect of the application, a computer-readable storage medium, preferably a non-volatile readable storage medium, is also provided, storing a computer program that implements the method described above when executed by a processor.
According to another aspect of the application, a computer program product is also provided, including computer-readable code that, when executed by a computer device, causes the computer device to perform the method described above.
The above and other objects, advantages, and features of the application will become clearer to those skilled in the art from the following detailed description of specific embodiments with reference to the accompanying drawings.
Description of the drawings
Some specific embodiments of the application are described in detail below, by way of example rather than limitation, with reference to the accompanying drawings. Identical reference numerals denote identical or similar components or parts throughout the drawings, which, as those skilled in the art will appreciate, are not necessarily drawn to scale. In the drawings:
Fig. 1 is a schematic flow chart of an embodiment of the video segmentation method of the application;
Fig. 2 is a schematic block diagram of the behavior prediction network of the application;
Fig. 3 is a schematic block diagram of the behavior prediction network of the application during training;
Fig. 4 is a schematic block diagram of an embodiment of the video segmentation device of the application;
Fig. 5 is a block diagram of an embodiment of the computing device of the application;
Fig. 6 is a block diagram of an embodiment of the computer-readable storage medium of the application.
Detailed description of embodiments
The above and other objects, advantages, and features of the application will become clearer to those skilled in the art from the following detailed description of specific embodiments with reference to the accompanying drawings.
An embodiment of the application provides a video segmentation method; Fig. 1 is a schematic flow chart of an embodiment of the method. The method may include:
S100, segment-splitting step: segmenting the video into segments based on the correlation coefficient between adjacent video frames;
S200, scene recognition step: for each video frame in a segment, recognizing the scene of the frame to obtain a scene feature vector;
S300, local behavior feature recognition step: for each video frame in the segment, recognizing the local behavior features of the frame to obtain a local behavior feature vector;
S400, video-frame behavior class judgment step: recognizing the behavior class of the frame and a corresponding confidence based on the scene feature vector and the local behavior feature vector;
S500, segment behavior class determination step: determining the behavior class of the segment based on the behavior classes and confidences of its video frames;
S600, segment merging step: merging adjacent segments with identical behavior classes to obtain the segmentation result of the video.
The method provided by the application fuses two model branches, exploiting both the scene and the local-behavior dimensions to extract global behavior information, so that the video can be segmented quickly. The invention uses deep learning to segment video along the dimension of human behavior classes. On the one hand, deep learning can extract more abstract generic features; on the other hand, the dynamic information and causal events in a video are chiefly defined by human behavior, so segmenting video by human behavior class is also the most reasonable choice.
Optionally, the S100 segment-splitting step may include:
S101, histogram calculation step: calculating the YCbCr histogram of each video frame of the video;
S102, correlation coefficient calculation step: calculating the correlation coefficient between the YCbCr histogram of the video frame and that of the previous video frame; and
S103, threshold comparison step: when the correlation coefficient is less than a predetermined first threshold, taking the video frame as the start frame of a new segment.
Candidate color spaces include RGB, CMY (the three primary colors), HSV (hue, saturation, value), HSI (hue, saturation, intensity), and YCbCr, where Y denotes the luminance component, Cb the blue-difference chroma component, and Cr the red-difference chroma component. Taking YCbCr as an example, in an optional embodiment the video is split into segments as follows.
Based on the YCbCr color space, the YCbCr data of the frame is normalized and a normalized YCbCr histogram is built, whose horizontal axis is the normalized level and whose vertical axis is the number of pixels at that level. In the normalization, Y, Cb, and Cr may optionally be divided into 16, 9, and 9 levels respectively (a 16-9-9 pattern), in which case the total number of normalized levels is 16 + 9 + 9 = 34. The choice of levels takes account of human visual acuity and machine processing speed: the normalization uses unequal intervals according to the range of each component and subjective color perception, i.e. a quantization process.
The correlation coefficient between the frame and its previous frame is computed as

$$r = \frac{\sum_{l=1}^{bins1}\left(H_l - \bar{H}\right)\left(H'_l - \bar{H}'\right)}{\sqrt{\sum_{l=1}^{bins1}\left(H_l - \bar{H}\right)^2 \, \sum_{l=1}^{bins1}\left(H'_l - \bar{H}'\right)^2}}$$

where $l$ is the normalized level, $bins1$ is the total number of normalized levels, $H_l$ and $H'_l$ are the pixel counts at level $l$ of the frame and of its previous frame, and $\bar{H}$ and $\bar{H}'$ are the mean pixel counts of the two histograms. Note that $bins1$ is the number of bins (boxes) of the histogram; in the YCbCr histogram it is the total number of normalized levels. For each pixel, the Y channel is divided into 16 levels and the Cb and Cr channels into 9 levels each, so $bins1$ = 16 + 9 + 9 = 34; preferably, $bins1$ is 34. Compared with chroma, the human eye is more sensitive to luminance, so the YCbCr color model is preferred because it allows luminance and chroma information to be processed separately.
This correlation coefficient (the first similarity) is compared with the first threshold; if it is below the threshold, the frame is very likely the start frame of a new segment (clip), so it is taken as the start frame of a new segment. The first threshold can be determined by experiment and practical application; optionally, it is 0.85.
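The segmentation procedure above (histogram, correlation, threshold) can be sketched as follows. This is a minimal sketch: the BT.601 RGB-to-YCbCr conversion and the equal-width binning are assumptions, since the patent only fixes the 16-9-9 level counts and the 0.85 threshold.

```python
import numpy as np

def ycbcr_histogram(rgb, y_bins=16, c_bins=9):
    """34-bin YCbCr histogram of an (H, W, 3) uint8 RGB frame (16-9-9 pattern)."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    # BT.601 full-range conversion (assumption: the patent fixes no matrix)
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    hy,  _ = np.histogram(y,  bins=y_bins, range=(0, 256))
    hcb, _ = np.histogram(cb, bins=c_bins, range=(0, 256))
    hcr, _ = np.histogram(cr, bins=c_bins, range=(0, 256))
    return np.concatenate([hy, hcb, hcr]).astype(float)  # 16 + 9 + 9 = 34 bins

def correlation(h1, h2):
    """Pearson correlation of two histograms, matching the formula above."""
    d1, d2 = h1 - h1.mean(), h2 - h2.mean()
    return float((d1 * d2).sum() / np.sqrt((d1 * d1).sum() * (d2 * d2).sum()))

def segment_starts(frames, threshold=0.85):
    """Indices of the start frames of new segments (frame 0 always starts one)."""
    hists = [ycbcr_histogram(f) for f in frames]
    starts = [0]
    for j in range(1, len(hists)):
        if correlation(hists[j - 1], hists[j]) < threshold:
            starts.append(j)
    return starts
```

A frame whose histogram correlates below 0.85 with its predecessor opens a new clip; a production version would also guard against constant histograms, where the denominator of the correlation is zero.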
For each segment clip(i) cut roughly in step S103, where i is the index of the segment, one frame image is captured per second and fed into the behavior prediction network. The network outputs the behavior identifier (id), denoted clip(i)_frame(j)_id, together with the corresponding confidence clip(i)_frame(j)_confidence. The behavior prediction network is a network dedicated to behavior prediction, with a one-to-one correspondence between behaviors and ids. It may include the first network model, the second network model, and the third network model. The flow by which a single frame image passes through the behavior prediction network to a final behavior class is described below.
Optionally, the S200 scene recognition step may include:
S201, resolution conversion step: converting each of the RGB channels of the video frame to a fixed resolution; and
S202, scene feature vector generation step: inputting the converted video frame into the first network model to obtain the scene feature vector of the frame, where the first network model is a VGG16 network model with the last fully connected layer and the Softmax classifier removed.
Fig. 2 is a schematic block diagram of the behavior prediction network of the application. The RGB channels of the image are converted to a fixed resolution, for example 224x224, and the converted frame is input into the first network model, also called the scene recognition sub-network. The first network model is a modified VGG16 network trained for scene recognition on several predefined scenes; the modification removes the last fully connected layer and the Softmax classifier. The output of the scene recognition sub-network is a vector of dimension 1x1x25088, denoted the scene feature vector place_feature_vector.
It should be noted that the Visual Geometry Group (VGG) is a group at the Department of Engineering Science of the University of Oxford; the deep-learning base models the group trained on image databases are the VGG models, whose features (for example, the FC6-layer features) are known as VGG features. VGG16Net is a deep neural network structure. The VGG16Net architecture comprises five stacked convolutional blocks (ConvNets); each block consists of several convolutional layers (Conv), each Conv layer followed by a non-linear mapping layer (ReLU), and each block followed by a pooling layer. The network ends with three fully connected layers and one softmax layer, where each fully connected layer has 4096 channels and the softmax layer has 1000 channels (a different output count can be chosen for a specific task). The network introduces smaller convolution kernels (3x3), adds ReLU layers directly after both convolutional and fully connected layers, and applies a regularization method (Dropout) in the fully connected layers fc6 and fc7. This structure greatly shortens the training time, increases the flexibility of the network, and prevents overfitting. Considering the learning and representation ability of the model, the flexibility of its structure, and the training time, the invention chooses VGG16Net as its feature extractor. The reshape function used in the model readjusts the number of rows, columns, and dimensions of a matrix.
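As a concrete sketch of the truncated VGG16 feature extractor described above, the block below builds the five stacked convolutional blocks (3x3 kernels, ReLU, 2x2 pooling) and flattens the final 7x7x512 output into the 1x1x25088 scene feature vector. PyTorch is used only as an illustration (the patent prescribes no framework), and the untrained weights stand in for the scene-pretrained ones.

```python
import torch
import torch.nn as nn

# VGG16 convolutional configuration: numbers are output channel counts,
# 'M' marks a 2x2 max-pooling layer; together these form the five blocks.
CFG = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
       512, 512, 512, 'M', 512, 512, 512, 'M']

def make_scene_feature_extractor():
    layers, in_ch = [], 3
    for v in CFG:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

extractor = make_scene_feature_extractor().eval()
with torch.no_grad():
    frame = torch.zeros(1, 3, 224, 224)          # fixed 224x224 input resolution
    place_feature_vector = extractor(frame).flatten(1)
print(place_feature_vector.shape)                # torch.Size([1, 25088])
```

The 224x224 input is halved by each of the five pooling layers down to 7x7, and 512 x 7 x 7 = 25088, matching the dimension of place_feature_vector in the patent.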
Optionally, the S300 local behavior feature recognition step may include:
S301, shortest-side fixing step: converting each of the RGB channels of the video frame to a resolution with a fixed shortest-side length; and
S302, local behavior feature vector generation step: inputting the resized video frame into the first network model, feeding the output of the first network model into a region-based convolutional neural network (Faster R-CNN) model, computing the optimal detection class result from the output of the region-based convolutional neural network, and passing the optimal detection class result through a region-of-interest pooling layer to obtain the local behavior feature vector.
Referring to Fig. 2, the RGB channels of the video frame are converted to a resolution with a fixed shortest-side length, for example 600 pixels, and the frame is input into the second network model, also called the local behavior detection sub-network. The second network model is a local behavior detection network trained for several predefined local behaviors. It may include the first network model, Faster R-CNN, an optimal detection module, and a pooling layer. Its data flow is as follows: the output of the first network model is input into the Faster R-CNN model; the optimal detection module computes the optimal detection class result from the output of the region-based convolutional neural network; and the optimal detection class result is passed through a region-of-interest (ROI) pooling layer to obtain the local behavior feature vector. The second network model is based on Faster R-CNN but uses only the optimal detection class.
The optimal detection class is determined by the following quantitative formula. For each detection target and rectangular box output by Faster R-CNN, let softmax_max be the maximum probability value of the softmax output for the detection target and S the area of the rectangular box; the optimal detection score opt_detection is computed as

opt_detection = SCALE * softmax_max + WEIGHT * S

where SCALE is a coefficient that prevents softmax_max from being swamped by the value range of S, and WEIGHT is a weighting value for the area. Optionally, SCALE = 1000 and WEIGHT = 0.7, meaning the weight of the local behavior is slightly higher than that of the area.
The optimal detection class result is converted by the region-of-interest pooling layer from a 7x7x512-dimensional output into a 1x1x25088 vector, denoted the local behavior feature vector local_action_feature_vector. In Fig. 2, after the local behavior feature vector is obtained, results are produced through FC1, FC2, FC M, and Softmax M; the result of FC2 is also input into FC M*4, and the result obtained with the bounding-box regression function Bbox_Pred can be used to evaluate the recognition quality of the local behavior feature vector, where M is the number of local behavior classes.
Optionally, the S400 video-frame behavior class judgment step may include:
S401, video frame feature vector merging step: merging the scene feature vector and the local behavior feature vector into a video frame feature vector; and
S402, behavior class and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior class of the frame and a corresponding confidence, where the third network is formed by four fully connected layers followed by a Softmax classifier.
In S401, the scene feature vector place_feature_vector and the local behavior feature vector local_action_feature_vector are merged into one video frame feature vector of size 1x1x(25088+25088), i.e. a 50176-dimensional vector, denoted feature_vector; see Fig. 2.
Optionally, the S500 segment behavior class determination step may include: when the ratio of the number of video frames with an identical behavior class to the total number of video frames of the segment exceeds a predetermined second threshold, taking that behavior class as the behavior class of the segment.
In S402, the video frame feature vector feature_vector passes through four fully connected layers, FC1 to FC4, where FC1 outputs 4096 channels, FC2 outputs 4096 channels, FC3 outputs 1000 channels, and FC4 outputs scores for C classes; see Fig. 2. C, the number of behavior classes, can be chosen according to actual needs; typically 15 to 30 is preferred. The output of FC4 feeds a Softmax classifier, which finally outputs a predicted confidence for each behavior class. The class with the highest confidence is taken as the behavior class output of the frame, denoted clip(i)_frame(j)_id with confidence clip(i)_frame(j)_confidence.
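The fusion and classification head (S401-S402) can be sketched in PyTorch as follows. The default dimensions are the patent's (25088 + 25088 input; 4096, 4096, 1000, C layers), but the constructor takes them as parameters so the sketch can be exercised cheaply; the weights here are untrained placeholders, not the patent's trained model.

```python
import torch
import torch.nn as nn

class BehaviorHead(nn.Module):
    """Third network: four FC layers followed by Softmax over fused features."""
    def __init__(self, in_dim=25088 + 25088, hidden=(4096, 4096, 1000),
                 num_classes=20):
        super().__init__()
        dims = [in_dim, *hidden]
        fcs = []
        for d_in, d_out in zip(dims, dims[1:]):     # FC1..FC3 with ReLU
            fcs += [nn.Linear(d_in, d_out), nn.ReLU()]
        fcs.append(nn.Linear(dims[-1], num_classes))  # FC4: C class scores
        self.mlp = nn.Sequential(*fcs)

    def forward(self, place_vec, local_vec):
        fused = torch.cat([place_vec, local_vec], dim=1)   # feature_vector
        probs = torch.softmax(self.mlp(fused), dim=1)
        confidence, class_id = probs.max(dim=1)  # highest-confidence class
        return class_id, confidence

# Tiny dimensions just to exercise the sketch:
head = BehaviorHead(in_dim=8, hidden=(16, 16, 8), num_classes=5).eval()
with torch.no_grad():
    cls, conf = head(torch.zeros(2, 4), torch.zeros(2, 4))
```

At full size the concatenated vector is 50176-dimensional and FC1 alone holds 50176 x 4096 weights, which is why the patent trains only these four FC layers while freezing the two feature branches.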
In the S500 segment behavior class determination step, every frame captured per second in segment clip(i) is processed through steps S200 to S400 to predict its behavior class. The percentage of frames in clip(i) sharing an identical id among all predicted frames is denoted same_id_percent. If there exists an id such that same_id_percent > same_id_percent_thres, where same_id_percent_thres is a set threshold, and more than 80% of the frames with that id have a confidence above 65%, the id is output as the behavior class of segment clip(i).
In the S600 segment merging step, every segment roughly obtained in step S100 undergoes the above processing to obtain its behavior class. If the behavior classes of adjacent segments are identical, the two segments are merged into one. The final result is the set of short videos into which the video is divided according to behavior class.
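Steps S500 and S600 reduce to a majority vote per segment followed by collapsing runs of equal labels. A plain-Python sketch (the additional 65%/80% confidence check is omitted for brevity, and the vote threshold value is an assumption, since the patent leaves same_id_percent_thres to the implementer):

```python
from collections import Counter
from itertools import groupby

def segment_label(frame_ids, same_id_percent_thres=0.5):
    """S500: majority vote over per-frame behavior ids of one segment.

    Returns None when no id clears the threshold.
    """
    top_id, count = Counter(frame_ids).most_common(1)[0]
    return top_id if count / len(frame_ids) > same_id_percent_thres else None

def merge_adjacent(segment_labels):
    """S600: merge adjacent segments sharing an identical behavior class."""
    return [label for label, _group in groupby(segment_labels)]

labels = [segment_label(ids) for ids in ([3, 3, 3, 7], [3, 3, 5], [5, 5, 5])]
merged = merge_adjacent(labels)   # [3, 3, 5] -> [3, 5]
```

The first two segments both vote for id 3, so after merging the video splits into a "3" clip followed by a "5" clip.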
It should be understood that the S300 local behavior feature recognition step and the S400 video-frame behavior class judgment step need not follow a fixed order; they may be performed simultaneously or one after the other.
Fig. 3 is a schematic block diagram of the behavior prediction network of the application during training. Optionally, the method may also include a training step for the behavior prediction network.
For the first network model, i.e. the scene prediction network, VGG16 is used to classify N predefined scenes. The number of output scene classes N is chosen according to actual demand, typically 30 to 40; for example, scene classes may be restaurant, basketball court, concert hall, and so on. The training strategy is as follows. Weights w are initialized with the formula

w = np.random.randn(n) * sqrt(2.0/n)

where np.random.randn(n) is a function generating random numbers; that is, the n weights of the filter of each channel of each convolutional layer are initialized from a Gaussian distribution, which can be generated with numpy. The square-root factor sqrt(2.0/n) ensures that the variance of the input distribution of each neuron is consistent across layers. Dropout is used for regularization to prevent overfitting: during training of the deep network, neural units are temporarily dropped from the network with a certain probability, and the activation probability of each neuron is a hyperparameter p. The pooled result passes through two fully connected layers, FC 4096 and FC N, and Softmax N, and is then fed to the cost function, which is the cross-entropy loss (Softmax). Weights are updated with SGD + Momentum (stochastic gradient descent with momentum), and the learning rate is reduced over training time according to step decay.
For the second network model, i.e. the local behavior prediction network, the network uses Faster R-CNN and is trained with the standard Faster R-CNN training method. The number of output local behavior classes M is chosen according to actual demand, typically 15 to 30; for example, local behaviors may be eating, playing basketball, dating, and so on. After the local behavior feature vector is obtained, the prediction result is produced through two FC 4096 layers, FC M, and Softmax M; the result of the second FC 4096 is also input into FC M*4, and the result obtained with the bounding-box regression function Bbox_Pred can be used to evaluate the recognition quality of the local behavior feature vector, where M is the number of local behavior classes. The results of Softmax M and FC M*4 are fed into the cross-entropy loss defined by Faster R-CNN.
After the first network model and the second network model have been trained, the third network is trained. The scene network removes the Softmax classifier and the last several fully-connected layers; the parameters of the remaining layers stay unchanged, and the output of the last pooling layer is converted into a 1x1x25088-dimensional vector, denoted as the scene feature vector. For the local behavior recognition network: when the third network model is trained, each image yields multiple local behaviors and their bounding rectangles predicted by the local behavior recognition network; the optimal detection category is selected, and the 7x7x512-dimensional output of the corresponding region-of-interest pooling layer is obtained and further converted into a 1x1x25088-dimensional local behavior feature vector. The scene feature vector and the local behavior feature vector are combined into 1x1x(25088+25088) = 50176 dimensions, denoted as the video frame feature vector. The video frame feature vector passes through 4 fully-connected layers, FC1 to FC4. The output of FC4 is fed sequentially into Softmax C and the cross-entropy loss. For the third network model, all other parameters stay unchanged; only the parameters of the 4 FC layers are trained. The parameter training strategy follows that of the first network model.
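The fusion step above, two 1x1x25088 vectors concatenated into a 50176-dimensional video frame feature vector and passed through 4 FC layers and a Softmax over C classes, can be sketched as follows. Only the 25088 and 50176 dimensions come from the text; the FC widths, C, and the random weights are placeholder assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scene branch: the truncated VGG16's last pooling layer is 7x7x512,
# flattened to 1x1x25088. Local branch: the RoI-pooling output, same shape.
scene_feat = rng.normal(size=(7, 7, 512)).reshape(-1)    # 25088 dims
local_feat = rng.normal(size=(7, 7, 512)).reshape(-1)    # 25088 dims
frame_feat = np.concatenate([scene_feat, local_feat])    # 50176 dims

def third_network(x, widths, C, rng):
    # FC1..FC4 with ReLU, then a Softmax over the C global behavior classes.
    for w in widths:
        W = rng.normal(scale=np.sqrt(2.0 / x.size), size=(x.size, w))
        x = np.maximum(x @ W, 0)
    logits = x @ rng.normal(scale=0.1, size=(x.size, C))
    z = logits - logits.max()
    return np.exp(z) / np.exp(z).sum()

probs = third_network(frame_feat, widths=[64, 64, 32, 32], C=10, rng=rng)
pred_class, confidence = int(probs.argmax()), float(probs.max())
```

The argmax of the Softmax output gives the frame's behavior category, and the corresponding probability serves as the confidence used later for segment-level voting.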
The C behavior categories predicted by the third network model, the M local behavior categories predicted by the second network model, and the N scene categories predicted by the first network model can be chosen as follows. First, the C global behavior categories are defined according to business demand, for example eating, playing basketball, dating. Then, according to these C global behaviors, the possible local behavior categories are defined; these can generally be kept consistent with the global behaviors, for example eating, playing basketball, dating. Finally, according to the global behavior categories, the N possible scenes are defined; for example, for eating, scenes such as restaurant and coffee shop can be defined.
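The three-level taxonomy described above (C global behaviors, M local behaviors, N scenes) might be encoded as a simple mapping. The entries below combine the examples from the text with invented scene names, so this is an illustrative assumption rather than a list from the patent.

```python
# C global behavior classes, the local behavior classes under them,
# and the scene classes derived from them (entries are illustrative).
TAXONOMY = {
    "eating": {"local": ["eating"], "scenes": ["restaurant", "coffee shop"]},
    "playing basketball": {"local": ["playing basketball"],
                           "scenes": ["basketball court", "gym"]},
    "dating": {"local": ["dating"], "scenes": ["park", "cinema"]},
}
C = len(TAXONOMY)                                             # global classes
M = sum(len(v["local"]) for v in TAXONOMY.values())           # local classes
N = len({s for v in TAXONOMY.values() for s in v["scenes"]})  # scene classes
```

Defining the local and scene classes from the global behaviors, as here, keeps the three classifiers' label spaces mutually consistent.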
According to another embodiment of the present application, a video segmentation device is additionally provided. Fig. 4 is a schematic block diagram of an embodiment of the video segmentation device of the present application. The device may include:
a fragment segmentation module 100, configured to segment the video into segments based on the correlation coefficient between adjacent video frames in the video;
a scene recognition module 200, configured to identify, for a video frame in a segment, the scene of the video frame to obtain a scene feature vector;
a local behavior feature identification module 300, configured to identify, for a video frame in a segment, the local behavior feature of the video frame to obtain a local behavior feature vector;
a video frame behavior category judgment module 400, configured to identify, based on the scene feature vector and the local behavior feature vector, the behavior category of the video frame and the confidence corresponding to the behavior category;
a segment behavior category determination module 500, configured to determine the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment; and
a segment merging module 600, configured to merge adjacent segments with the same behavior category to obtain the segmentation result of the video.
The device provided by the present application can fuse the two-path models and comprehensively utilize the two dimensions of scene and local behavior to extract global behavior information, thereby segmenting the video rapidly.
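The segment behavior category determination performed by module 500 (detailed in claim 6 as a frame-majority test against a second threshold) can be sketched as a vote over frame-level predictions; the 0.5 threshold here is an illustrative choice, not a value from the patent.

```python
from collections import Counter

def segment_behavior_class(frame_classes, second_threshold=0.5):
    # The most frequent frame-level class becomes the segment's class when
    # its share of the segment's frames exceeds the second threshold
    # (claim 6); otherwise no class is assigned by this sketch.
    cls, count = Counter(frame_classes).most_common(1)[0]
    if count / len(frame_classes) > second_threshold:
        return cls
    return None
```

Adjacent segments that receive the same class this way are then candidates for the merging performed by module 600.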
Optionally, the fragment segmentation module 100 may include:
a histogram calculation module 101, configured to calculate the YCbCr histogram of each video frame of the video;
a correlation coefficient calculation module 102, configured to calculate the correlation coefficient between the YCbCr histogram of a video frame and the YCbCr histogram of the previous video frame; and
a threshold comparison module 103, configured to take a video frame as the start frame of a new segment when the correlation coefficient is less than a predetermined first threshold.
Optionally, the scene recognition module 200 may include:
a resolution conversion module 201, configured to convert the RGB channels of the video frame into a fixed-size resolution; and
a scene feature vector generation module 202, configured to input the resolution-converted video frame into the first network model to obtain the scene feature vector of the video frame, wherein the first network model is: a VGG16 network model with the last fully-connected layer and the Softmax classifier removed.
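The 1x1x25088 scene feature dimension used throughout follows from VGG16's geometry. A small sketch of the arithmetic, assuming the standard 224x224 VGG16 input size (the patent only says "fixed-size"):

```python
# VGG16 geometry behind the 1x1x25088 scene feature vector: five 2x2
# max-pool stages reduce a 224x224 input (the standard VGG16 size,
# assumed here) to 7x7, and the last conv stage has 512 channels.
def pool5_shape(input_size=224, pool_stages=5, channels=512):
    side = input_size // (2 ** pool_stages)
    return side, side, channels

side, _, channels = pool5_shape()
scene_dim = side * side * channels   # 7 * 7 * 512 = 25088
```

Flattening that final pooling output is exactly the 1x1x25088 conversion the description performs on the truncated network.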
Optionally, the local behavior feature identification module 300 may include:
a shortest-side-length fixing module 301, configured to convert the RGB channels of the video frame into a resolution with a fixed shortest side length; and
a local behavior feature vector generation module 302, configured to input the video frame with fixed shortest side length into the first network model, input the output of the first network model into a region-based convolutional neural network (Faster R-CNN) model, calculate the optimal detection category result using the output of the region-based convolutional neural network, and pass the optimal detection category result through a region-of-interest pooling layer to obtain the local behavior feature vector.
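The region-of-interest pooling step above, in which an arbitrary detection box is pooled to a fixed 7x7x512 grid and flattened to 25088 dimensions, can be sketched as a max-pooling variant. The feature-map and box sizes below are arbitrary examples.

```python
import numpy as np

def roi_max_pool(feature_map, box, out=7):
    # Max-pool the region `box` = (x0, y0, x1, y1) of an HxWxC feature map
    # into a fixed out x out grid, as the RoI pooling layer does before the
    # result is flattened into the local behavior feature vector.
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1]
    h, w, c = region.shape
    ys = np.linspace(0, h, out + 1).astype(int)
    xs = np.linspace(0, w, out + 1).astype(int)
    pooled = np.empty((out, out, c))
    for i in range(out):
        for j in range(out):
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.max(axis=(0, 1))
    return pooled

rng = np.random.default_rng(0)
fmap = rng.normal(size=(38, 50, 512))            # conv feature map (toy size)
pooled = roi_max_pool(fmap, box=(5, 3, 30, 20))  # one detected box
local_feat = pooled.reshape(-1)                  # 7 * 7 * 512 = 25088 dims
```

Whatever the box size, the output grid is fixed, which is what lets the downstream FC layers receive a constant 25088-dimensional local behavior feature vector.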
Optionally, the video frame behavior category judgment module 400 may include:
a video frame feature vector merging module 401, configured to merge the scene feature vector and the local behavior feature vector into a video frame feature vector; and
a behavior category and confidence calculation module 402, configured to input the video frame feature vector into the third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network is formed by 4 fully-connected layers connected in sequence with a Softmax classifier.
Fig. 5 is a block diagram of an embodiment of the computing device of the present application. Another embodiment of the present application also provides a computing device comprising a memory 1120 and a processor 1110; the memory 1120 provides a space 1130 for program code and stores a computer program that can be run by the processor 1110; when executed by the processor 1110, the computer program carries out any one of the method steps 1131 according to the present invention.
Another embodiment of the present application further provides a computer-readable storage medium. Fig. 6 is a block diagram of an embodiment of the computer-readable storage medium of the present application; the computer-readable storage medium comprises a storage unit for program code, the storage unit being provided with a program 1131' for executing the method steps according to the invention, which program is executed by a processor.
The embodiments of the present application also provide a computer program product comprising instructions; when the computer program product is run on a computer, the computer is caused to execute the method steps according to the embodiments.
The above-described embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented in software, they may be realized entirely or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When a computer loads and executes the computer program instructions, the flows or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one web site, computer, server, or data center to another web site, computer, server, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means. The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (such as a Solid State Disk (SSD)), etc.
Those skilled in the art will further appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementation shall not be considered to exceed the scope of the present application.
One of ordinary skill in the art will appreciate that all or part of the steps of the method embodiments described above can be completed by a program instructing a processor, and the program may be stored in a computer-readable storage medium, the storage medium being a non-transitory medium such as a random access memory, read-only memory, flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disc, or any combination thereof.
The above are only preferred embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the scope of the claims.
Claims (10)
1. A video segmentation method, comprising:
a fragment segmentation step: segmenting the video into segments based on the correlation coefficient between adjacent video frames in the video;
a scene recognition step: for a video frame in a segment, identifying the scene of the video frame to obtain a scene feature vector;
a local behavior feature identification step: for a video frame in a segment, identifying the local behavior feature of the video frame to obtain a local behavior feature vector;
a video frame behavior category judgment step: identifying, based on the scene feature vector and the local behavior feature vector, the behavior category of the video frame and the confidence corresponding to the behavior category;
a segment behavior category determination step: determining the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment; and
a segment merging step: merging adjacent segments with the same behavior category to obtain the segmentation result of the video.
2. The method according to claim 1, characterized in that the fragment segmentation step comprises:
a histogram calculation step: calculating the YCbCr histogram of each video frame of the video;
a correlation coefficient calculation step: calculating the correlation coefficient between the YCbCr histogram of a video frame and the YCbCr histogram of the previous video frame; and
a threshold comparison step: when the correlation coefficient is less than a predetermined first threshold, taking the video frame as the start frame of a new segment.
3. The method according to claim 1 or 2, characterized in that the scene recognition step comprises:
a resolution conversion step: converting the RGB channels of the video frame into a fixed-size resolution; and
a scene feature vector generation step: inputting the resolution-converted video frame into a first network model to obtain the scene feature vector of the video frame, wherein the first network model is: a VGG16 network model with the last fully-connected layer and the Softmax classifier removed.
4. The method according to claim 1 or 2, characterized in that the local behavior feature identification step comprises:
a shortest-side-length fixing step: converting the RGB channels of the video frame into a resolution with a fixed shortest side length; and
a local behavior feature vector generation step: inputting the video frame with fixed shortest side length into the first network model, inputting the output of the first network model into a region-based convolutional neural network (Faster R-CNN) model, calculating the optimal detection category result using the output of the region-based convolutional neural network, and passing the optimal detection category result through a region-of-interest pooling layer to obtain the local behavior feature vector.
5. The method according to claim 4, characterized in that the video frame behavior category judgment step comprises:
a video frame feature vector merging step: merging the scene feature vector and the local behavior feature vector into a video frame feature vector; and
a behavior category and confidence calculation step: inputting the video frame feature vector into a third network to obtain the behavior category of the video frame and the confidence corresponding to the behavior category, wherein the third network is formed by 4 fully-connected layers connected in sequence with a Softmax classifier.
6. The method according to claim 1, characterized in that the segment behavior category determination step comprises: when the ratio of the number of video frames with the same behavior category to the total number of video frames of the segment exceeds a predetermined second threshold, taking that behavior category as the behavior category of the segment.
7. A video segmentation device, comprising:
a fragment segmentation module, configured to segment the video into segments based on the correlation coefficient between adjacent video frames in the video;
a scene recognition module, configured to identify, for a video frame in a segment, the scene of the video frame to obtain a scene feature vector;
a local behavior feature identification module, configured to identify, for a video frame in a segment, the local behavior feature of the video frame to obtain a local behavior feature vector;
a video frame behavior category judgment module, configured to identify, based on the scene feature vector and the local behavior feature vector, the behavior category of the video frame and the confidence corresponding to the behavior category;
a segment behavior category determination module, configured to determine the behavior category of the segment based on the behavior categories and confidences of the video frames of the segment; and
a segment merging module, configured to merge adjacent segments with the same behavior category to obtain the segmentation result of the video.
8. A computer device, comprising a memory, a processor, and a computer program stored in the memory and runnable by the processor, wherein the processor, when executing the computer program, implements the method according to any one of claims 1 to 6.
9. A computer-readable storage medium, preferably a non-volatile readable storage medium, storing a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
10. A computer program product, comprising computer-readable code which, when executed by a computer device, causes the computer device to perform the method according to any one of claims 1 to 6.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110314627.4A CN112966646B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, equipment and medium based on two-way model fusion |
CN202110314575.0A CN112906649B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, computer device and medium |
CN201810443505.3A CN108647641B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method and device based on two-way model fusion |
CN202110313073.6A CN112836687B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method, device, computer equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810443505.3A CN108647641B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method and device based on two-way model fusion |
Related Child Applications (3)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110313073.6A Division CN112836687B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method, device, computer equipment and medium |
CN202110314627.4A Division CN112966646B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, equipment and medium based on two-way model fusion |
CN202110314575.0A Division CN112906649B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, computer device and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108647641A true CN108647641A (en) | 2018-10-12 |
CN108647641B CN108647641B (en) | 2021-04-27 |
Family
ID=63754392
Family Applications (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110314627.4A Active CN112966646B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, equipment and medium based on two-way model fusion |
CN202110313073.6A Active CN112836687B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method, device, computer equipment and medium |
CN201810443505.3A Active CN108647641B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method and device based on two-way model fusion |
CN202110314575.0A Active CN112906649B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, computer device and medium |
Family Applications Before (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110314627.4A Active CN112966646B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, equipment and medium based on two-way model fusion |
CN202110313073.6A Active CN112836687B (en) | 2018-05-10 | 2018-05-10 | Video behavior segmentation method, device, computer equipment and medium |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110314575.0A Active CN112906649B (en) | 2018-05-10 | 2018-05-10 | Video segmentation method, device, computer device and medium |
Country Status (1)
Country | Link |
---|---|
CN (4) | CN112966646B (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543590A (en) * | 2018-11-16 | 2019-03-29 | 中山大学 | A kind of video human Activity recognition algorithm of Behavior-based control degree of association fusion feature |
CN110516540A (en) * | 2019-07-17 | 2019-11-29 | 青岛科技大学 | Group Activity recognition method based on multithread framework and long memory network in short-term |
CN110751218A (en) * | 2019-10-22 | 2020-02-04 | Oppo广东移动通信有限公司 | Image classification method, image classification device and terminal equipment |
WO2020119187A1 (en) * | 2018-12-14 | 2020-06-18 | 北京沃东天骏信息技术有限公司 | Method and device for segmenting video |
CN111541912A (en) * | 2020-04-30 | 2020-08-14 | 北京奇艺世纪科技有限公司 | Video splitting method and device, electronic equipment and storage medium |
CN111881818A (en) * | 2020-07-27 | 2020-11-03 | 复旦大学 | Medical action fine-grained recognition device and computer-readable storage medium |
CN113301430A (en) * | 2021-07-27 | 2021-08-24 | 腾讯科技(深圳)有限公司 | Video clipping method, video clipping device, electronic equipment and storage medium |
CN113784227A (en) * | 2020-06-10 | 2021-12-10 | 北京金山云网络技术有限公司 | Video slicing method and device, electronic equipment and storage medium |
CN113784226A (en) * | 2020-06-10 | 2021-12-10 | 北京金山云网络技术有限公司 | Video slicing method and device, electronic equipment and storage medium |
EP4024879A4 (en) * | 2019-09-06 | 2022-11-09 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Video processing method and device, terminal and computer readable storage medium |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113569703B (en) * | 2021-07-23 | 2024-04-16 | 上海明略人工智能(集团)有限公司 | Real division point judging method, system, storage medium and electronic equipment |
CN117610105B (en) * | 2023-12-07 | 2024-06-07 | 上海烜翊科技有限公司 | Model view structure design method for automatically generating system design result |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102833492A (en) * | 2012-08-01 | 2012-12-19 | 天津大学 | Color similarity-based video scene segmenting method |
US20160155024A1 (en) * | 2014-12-02 | 2016-06-02 | Canon Kabushiki Kaisha | Video segmentation method |
CN106529467A (en) * | 2016-11-07 | 2017-03-22 | 南京邮电大学 | Group behavior identification method based on multi-feature fusion |
CN107590420A (en) * | 2016-07-07 | 2018-01-16 | 北京新岸线网络技术有限公司 | Scene extraction method of key frame and device in video analysis |
CN107590442A (en) * | 2017-08-22 | 2018-01-16 | 华中科技大学 | A kind of video semanteme Scene Segmentation based on convolutional neural networks |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7296231B2 (en) * | 2001-08-09 | 2007-11-13 | Eastman Kodak Company | Video structuring by probabilistic merging of video segments |
CN102426705B (en) * | 2011-09-30 | 2013-10-30 | 北京航空航天大学 | Behavior splicing method of video scene |
US20140328570A1 (en) * | 2013-01-09 | 2014-11-06 | Sri International | Identifying, describing, and sharing salient events in images and videos |
US9244924B2 (en) * | 2012-04-23 | 2016-01-26 | Sri International | Classification, search, and retrieval of complex video events |
CN103366181A (en) * | 2013-06-28 | 2013-10-23 | 安科智慧城市技术(中国)有限公司 | Method and device for identifying scene integrated by multi-feature vision codebook |
EP3007082A1 (en) * | 2014-10-07 | 2016-04-13 | Thomson Licensing | Method for computing a similarity measure for video segments |
CN104331442A (en) * | 2014-10-24 | 2015-02-04 | 华为技术有限公司 | Video classification method and device |
CN105989358A (en) * | 2016-01-21 | 2016-10-05 | 中山大学 | Natural scene video identification method |
CN105893936B (en) * | 2016-03-28 | 2019-02-12 | 浙江工业大学 | A kind of Activity recognition method based on HOIRM and Local Feature Fusion |
CN107027051B (en) * | 2016-07-26 | 2019-11-08 | 中国科学院自动化研究所 | A kind of video key frame extracting method based on linear dynamic system |
CN107992836A (en) * | 2017-12-12 | 2018-05-04 | 中国矿业大学(北京) | A kind of recognition methods of miner's unsafe acts and system |
2018
- 2018-05-10 CN CN202110314627.4A patent/CN112966646B/en active Active
- 2018-05-10 CN CN202110313073.6A patent/CN112836687B/en active Active
- 2018-05-10 CN CN201810443505.3A patent/CN108647641B/en active Active
- 2018-05-10 CN CN202110314575.0A patent/CN112906649B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102833492A (en) * | 2012-08-01 | 2012-12-19 | 天津大学 | Color similarity-based video scene segmenting method |
US20160155024A1 (en) * | 2014-12-02 | 2016-06-02 | Canon Kabushiki Kaisha | Video segmentation method |
CN107590420A (en) * | 2016-07-07 | 2018-01-16 | 北京新岸线网络技术有限公司 | Scene extraction method of key frame and device in video analysis |
CN106529467A (en) * | 2016-11-07 | 2017-03-22 | 南京邮电大学 | Group behavior identification method based on multi-feature fusion |
CN107590442A (en) * | 2017-08-22 | 2018-01-16 | 华中科技大学 | A kind of video semanteme Scene Segmentation based on convolutional neural networks |
Non-Patent Citations (2)
Title |
---|
RUI YANG ET AL.: "Video Segmentation via Multiple Granularity Analysis", IEEE *
SHEN HAIYANG: "Research on Content-Based Surveillance Video Retrieval Algorithms", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543590A (en) * | 2018-11-16 | 2019-03-29 | 中山大学 | A kind of video human Activity recognition algorithm of Behavior-based control degree of association fusion feature |
CN111327945B (en) * | 2018-12-14 | 2021-03-30 | 北京沃东天骏信息技术有限公司 | Method and apparatus for segmenting video |
EP3896986A4 (en) * | 2018-12-14 | 2022-08-24 | Beijing Wodong Tianjun Information Technology Co., Ltd. | Method and device for segmenting video |
WO2020119187A1 (en) * | 2018-12-14 | 2020-06-18 | 北京沃东天骏信息技术有限公司 | Method and device for segmenting video |
CN111327945A (en) * | 2018-12-14 | 2020-06-23 | 北京沃东天骏信息技术有限公司 | Method and apparatus for segmenting video |
US11275950B2 (en) | 2018-12-14 | 2022-03-15 | Beijing Wodong Tianjun Information Technology Co., Ltd. | Method and apparatus for segmenting video |
CN110516540B (en) * | 2019-07-17 | 2022-04-29 | 青岛科技大学 | Group behavior identification method based on multi-stream architecture and long-term and short-term memory network |
CN110516540A (en) * | 2019-07-17 | 2019-11-29 | 青岛科技大学 | Group Activity recognition method based on multithread framework and long memory network in short-term |
EP4024879A4 (en) * | 2019-09-06 | 2022-11-09 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Video processing method and device, terminal and computer readable storage medium |
CN110751218A (en) * | 2019-10-22 | 2020-02-04 | Oppo广东移动通信有限公司 | Image classification method, image classification device and terminal equipment |
CN111541912A (en) * | 2020-04-30 | 2020-08-14 | 北京奇艺世纪科技有限公司 | Video splitting method and device, electronic equipment and storage medium |
CN113784227A (en) * | 2020-06-10 | 2021-12-10 | 北京金山云网络技术有限公司 | Video slicing method and device, electronic equipment and storage medium |
CN113784226A (en) * | 2020-06-10 | 2021-12-10 | 北京金山云网络技术有限公司 | Video slicing method and device, electronic equipment and storage medium |
CN111881818A (en) * | 2020-07-27 | 2020-11-03 | 复旦大学 | Medical action fine-grained recognition device and computer-readable storage medium |
CN111881818B (en) * | 2020-07-27 | 2022-07-22 | 复旦大学 | Medical action fine-grained recognition device and computer-readable storage medium |
CN113301430A (en) * | 2021-07-27 | 2021-08-24 | 腾讯科技(深圳)有限公司 | Video clipping method, video clipping device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112836687A (en) | 2021-05-25 |
CN112906649A (en) | 2021-06-04 |
CN112966646B (en) | 2024-01-09 |
CN112966646A (en) | 2021-06-15 |
CN108647641B (en) | 2021-04-27 |
CN112836687B (en) | 2024-05-10 |
CN112906649B (en) | 2024-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108647641A (en) | Video behavior dividing method and device based on two-way Model Fusion | |
Rao | Dynamic histogram equalization for contrast enhancement for digital images | |
CN104834933B (en) | A kind of detection method and device in saliency region | |
Zhao et al. | SCGAN: Saliency map-guided colorization with generative adversarial network | |
KR102449841B1 (en) | Method and apparatus for detecting target | |
Agrawal et al. | Grape leaf disease detection and classification using multi-class support vector machine | |
Bianco et al. | Predicting image aesthetics with deep learning | |
CN109657715B (en) | Semantic segmentation method, device, equipment and medium | |
US20130028470A1 (en) | Image processing apparatus, image processing method, and comupter readable recording device | |
CN112528058B (en) | Fine-grained image classification method based on image attribute active learning | |
Trivedi et al. | Automatic segmentation of plant leaves disease using min-max hue histogram and k-mean clustering | |
Chetouani et al. | On the use of a scanpath predictor and convolutional neural network for blind image quality assessment | |
Waldamichael et al. | Coffee disease detection using a robust HSV color‐based segmentation and transfer learning for use on smartphones | |
Mano et al. | Method of multi‐region tumour segmentation in brain MRI images using grid‐based segmentation and weighted bee swarm optimisation | |
CN111368911A (en) | Image classification method and device and computer readable storage medium | |
Li et al. | A novel feature fusion method for computing image aesthetic quality | |
Wang et al. | Distortion recognition for image quality assessment with convolutional neural network | |
KR101833943B1 (en) | Method and system for extracting and searching highlight image | |
US8131077B2 (en) | Systems and methods for segmenting an image based on perceptual information | |
Chang et al. | Semantic-relation transformer for visible and infrared fused image quality assessment | |
WO2024083152A1 (en) | Pathological image recognition method, pathological image recognition model training method and system therefor, and storage medium | |
CN108510483A (en) | A kind of calculating using VLAD codings and SVM generates color image tamper detection method | |
US20220280086A1 (en) | Management server, method of generating relative pattern information between pieces of imitation drawing data, and computer program | |
Hepburn et al. | Enforcing perceptual consistency on generative adversarial networks by using the normalised laplacian pyramid distance | |
CN110796650A (en) | Image quality evaluation method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right | ||
PE01 | Entry into force of the registration of the contract for pledge of patent right |
Denomination of invention: A Method and Device for Video Behavior Segmentation Based on Dual Path Model Fusion Effective date of registration: 20230713 Granted publication date: 20210427 Pledgee: Bank of Jiangsu Limited by Share Ltd. Beijing branch Pledgor: BEIJING MOVIEBOOK SCIENCE AND TECHNOLOGY Co.,Ltd. Registration number: Y2023110000278 |