CN106469314A - Video image classification method based on a spatio-temporal co-occurrence two-stream network - Google Patents
- Publication number
- CN106469314A CN106469314A CN201610794689.9A CN201610794689A CN106469314A CN 106469314 A CN106469314 A CN 106469314A CN 201610794689 A CN201610794689 A CN 201610794689A CN 106469314 A CN106469314 A CN 106469314A
- Authority
- CN
- China
- Prior art keywords
- time
- network
- space
- symbiosis
- binary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The present invention proposes a video image classification method based on a spatio-temporal co-occurrence two-stream network. Its main components are: data input; a spatio-temporal two-stream network; fusion; and an SVM classifier. The process is as follows: images and optical-flow information are input first, the temporal and spatial networks are combined by early fusion, and the fused output is fed to an SVM classifier as a feature vector to obtain the final classification result. Using a two-stream network that combines temporal and spatial information by early fusion (spatio-temporal co-occurrence) on a monkey-species video dataset, taking more frames (i.e., more spatial data) from each video yields a significant accuracy improvement; spatial and temporal information are complementary, and accuracy reaches 65.8%. The co-occurrence method forms smaller, better-separated clusters whose members lie more closely together, making better use of the temporal information.
Description
Technical field
The present invention relates to the field of video image classification, and in particular to a video image classification method based on a spatio-temporal co-occurrence two-stream network.
Background art
Video image classification is a very challenging problem: pose and appearance changes cause large intra-class variation, while only subtle differences in overall appearance separate the classes. Recently, deep convolutional neural networks (DCNNs) have been used to learn many powerful features, handling large variation with hierarchical models and automatically localizing discriminative regions. Although these methods have brought improvements, previous work treats object classification as a static image classification problem and ignores the complementary temporal information present in video. So far, no neural-network-based method has been applied to classifying objects in video.
The present invention introduces the problem of video-based object classification and employs a two-stream network that combines temporal and spatial information by early fusion (spatio-temporal co-occurrence). Images and optical-flow information are input first, the temporal and spatial networks are combined by early fusion, and the fused output is fed to an SVM classifier as a feature vector to obtain the final classification result. Taking more frames (i.e., more spatial data) from each video yields a significant accuracy improvement; spatial and temporal information are complementary, and accuracy reaches 65.8%. Early fusion is used because late fusion has a potential defect: the fusion of spatial and temporal information is completed only at the very end, which limits the amount of side information (or decisions) obtained from combining the SoftMax classification layers. By using the co-occurrence method, smaller, better-separated clusters are formed whose members lie more closely together, so temporal information can be better exploited.
Summary of the invention
To address the problem that classification of video data has been ignored, the object of the present invention is to provide a video image classification method based on a spatio-temporal co-occurrence two-stream network that uses a two-stream network combining temporal and spatial information by early fusion (spatio-temporal co-occurrence) on a monkey-species video dataset to improve classification performance.
To solve the above problem, the present invention provides a video image classification method based on a spatio-temporal co-occurrence two-stream network, whose main components are:
(1) data input;
(2) a spatio-temporal two-stream network;
(3) fusion;
(4) an SVM classifier.
The video image classification method based on a spatio-temporal co-occurrence two-stream network uses a two-stream network combining temporal and spatial information by early fusion (spatio-temporal co-occurrence) on a monkey-species video dataset; taking more frames (i.e., more spatial data) from each video yields a significant accuracy improvement; spatial and temporal information are complementary, and accuracy reaches 65.8%.
In the method, a dimensionality-reduction-based visualization is drawn with t-distributed stochastic neighbor embedding (t-SNE); the co-occurrence method forms smaller, better-separated clusters whose members lie more closely together, making better use of temporal information.
The data input includes images and optical-flow information. The dataset consists of video sets of 100 monkey species and is split into a training set and a test set. The monkey videos are recorded at a distance, which makes the dataset quite challenging, with large-scale camera motion and considerable pose variation.
The following data are provided for each class (monkey species): video clips with activity annotations, audio clips, bounding boxes, and classification and distribution locations.
For testing, each video clip is sampled at 5 frames per second (FPS), and optical flow is computed every 5 frames for computational efficiency.
The spatio-temporal two-stream network includes a temporal network, a spatial network, and spatio-temporal co-occurrence encoding.
The temporal and spatial networks are as follows:
(1) the temporal network T takes as input the horizontal flow Ox, the vertical flow Oy, and the flow magnitude Omag, combined into a single optical-flow feature map O ∈ R^(h×w×3), where h × w is the size of the feature map (image);
(2) the spatial network S takes an RGB frame (image) as input.
Both S and T use the same DCNN structure: five convolutional layers Sc1, Sc2, …, Sc5 followed by a fully connected layer Sfc6. The network is trained by treating each input frame (image or optical flow) of each video as a single example, starting from a pre-trained network. When classifying, each image (or frame of optical flow) is initially treated independently, so the Nf frames of a video generate Nf classification decisions.
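The three-channel flow input described above can be assembled as in the following minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def flow_to_input(ox, oy):
    """Stack horizontal flow Ox, vertical flow Oy, and the flow magnitude
    Omag into one h x w x 3 map O, mirroring the three-channel RGB input
    of the spatial stream so both streams can share one DCNN architecture."""
    omag = np.sqrt(ox**2 + oy**2)             # per-pixel flow magnitude
    return np.stack([ox, oy, omag], axis=-1)  # O in R^(h x w x 3)

h, w = 4, 4
ox, oy = np.ones((h, w)), np.zeros((h, w))    # toy flow field
O = flow_to_input(ox, oy)
```

Matching the flow input's shape to the RGB input is what lets the same pre-trained architecture initialize both streams.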
The spatio-temporal co-occurrence encoding jointly encodes the spatial and temporal features that occur together. With DCNNs, co-occurrence is computed by combining convolutional layers of the spatial and temporal networks. Specifically, let the n-th layer feature maps of the two networks be S_n ∈ R^(h×w×d_n) and T_n ∈ R^(h×w×d_n), where d_n is the feature-map dimension, and compute the combined feature map
P_(i,j) = vec(S_n^(i,j) (T_n^(i,j))^T),
where S_n^(i,j) and T_n^(i,j) are the local feature vectors of the spatial and temporal streams at position (i, j), vec(·) is the vectorization operation, and P_(i,j) is the co-occurrence feature at position (i, j). The outer product thus captures the co-occurrence pattern of appearance and motion at each spatial location. Max pooling is applied over all local encoding vectors P_(i,j) to create the final feature representation P ∈ R^(d_n²), and finally L2 normalization is applied to the encoding vector.
The spatio-temporal bilinear DCNN feature is combined with the fc6 spatio-temporal features for two-stream early fusion, which enables us to combine local and global spatial and temporal information.
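The encoding above (outer product per position, max pooling, L2 normalization) can be sketched as follows; this is our illustrative NumPy rendering of the equations, not code from the patent:

```python
import numpy as np

def cooccurrence_encode(S, T):
    """Bilinear spatio-temporal co-occurrence feature.

    S, T: h x w x d feature maps from the same layer of the spatial and
    temporal streams. At each position (i, j) the outer product of the
    two local d-vectors records which appearance and motion features
    fire together; max pooling over positions and L2 normalization give
    the final d*d encoding vector."""
    h, w, d = S.shape
    # d x d outer product at every (i, j), computed at once with einsum
    P = np.einsum('ijk,ijl->ijkl', S, T).reshape(h * w, d * d)
    pooled = P.max(axis=0)                 # max pool over spatial positions
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled  # L2 normalization

# toy maps: constant features, so every outer-product entry is 1 * 2 = 2
S = np.ones((2, 2, 3))
T = 2 * np.ones((2, 2, 3))
f = cooccurrence_encode(S, T)
```

The resulting vector has d_n² entries, one per pair of spatial/temporal feature channels, as in the equation above.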
The fusion comprises the following steps:
(1) Early fusion: a two-stream network with two independent networks, the temporal network T and the spatial network S (as used for action recognition), combines the two streams' information early by merging the fc6 outputs Sfc6 and Tfc6; fc6 is the first fully connected layer and is commonly used to extract a single feature vector from DCNNs. We refer to this network as the two-stream early-fusion network.
(2) After early fusion, the bilinear DCNN method is used to fuse the two streams, combining spatial and temporal information. The raw data are preprocessed and combined through the fully connected layer, the objects to be classified are determined, and classification is carried out on the computed objects.
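A minimal sketch of step (1): the fc6 outputs of the two streams are concatenated into one descriptor. The 4096-dimensional fc6 size is our assumption (typical of AlexNet/VGG-style networks) and is not stated in the patent.

```python
import numpy as np

def early_fuse(s_fc6, t_fc6):
    """Early fusion: concatenate the fc6 activations of the spatial and
    temporal streams so the downstream classifier sees appearance and
    motion jointly, instead of combining per-stream softmax decisions
    at the very end (late fusion)."""
    return np.concatenate([s_fc6, t_fc6])

rng = np.random.default_rng(0)
s_fc6 = rng.random(4096)   # spatial-stream fc6 (assumed 4096-d)
t_fc6 = rng.random(4096)   # temporal-stream fc6
fused = early_fuse(s_fc6, t_fc6)
```

The fused vector is what the SVM classifier later receives as its feature vector.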
The principle of the SVM classifier is as follows:
Let the linearly separable sample set be (x_i, y_i), i = 1, …, n, x ∈ R^d, where y ∈ {+1, −1} is the class label. Then
w·x + b = 0
is the classification hyperplane equation of the SVM classifier.
For the hyperplane to classify all samples correctly with maximum class margin, the following two conditions must be met:
Φ(w) = min (w^T w)
y_i(w·x_i + b) − 1 ≥ 0
The optimal hyperplane is obtained by solving this constrained optimization problem. The training samples of the two classes that are closest to the hyperplane and lie on the hyperplanes parallel to the optimal one, i.e., the special samples for which the equality in the constraint holds, support the optimal hyperplane and are therefore called support vectors. The fused output is fed to the SVM classifier as a feature vector to obtain the final classification result.
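To illustrate, here is a minimal linear SVM trained by subgradient descent on the regularized hinge loss, run on a toy 2-D dataset. The trainer is our stand-in (the patent does not specify a solver), and the data are hypothetical.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimal linear SVM via subgradient descent on the regularized
    hinge loss. X: n x d feature vectors (standing in for the fused
    descriptors), y: labels in {+1, -1}. Returns (w, b) defining the
    separating hyperplane w.x + b = 0."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:        # margin constraint violated
                w += lr * (yi * xi - lam * w)
                b += lr * yi
            else:                            # only shrink w (regularizer)
                w -= lr * lam * w
    return w, b

def predict(w, b, X):
    """Sign of the decision function gives the predicted class."""
    return np.sign(X @ w + b)

# toy linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
```

The samples that end up with margin exactly 1 under the learned (w, b) play the role of the support vectors described above.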
Brief description of the drawings
Fig. 1 is the system flow chart of the video image classification method based on a spatio-temporal co-occurrence two-stream network of the present invention.
Fig. 2 shows the monkey-species video dataset used by the method.
Fig. 3 shows the qualitative evaluation of the method using t-SNE visualization.
Fig. 4 illustrates the spatio-temporal co-occurrence concept of the method.
Fig. 5 is a schematic diagram of the early-fusion strategy of the method.
Fig. 6 shows example monkey localizations produced by the method.
Specific embodiment
It should be noted that in the case of not conflicting, the embodiment in the application and the feature in embodiment can phases
Mutually combine, with specific embodiment, the present invention is described in further detail below in conjunction with the accompanying drawings.
Fig. 1 is the system flow chart of the video image classification method based on a spatio-temporal co-occurrence two-stream network of the present invention. It mainly includes data input; the spatio-temporal two-stream network; fusion; and the SVM classifier.
The data input includes images and optical-flow information. The dataset consists of video sets of 100 monkey species and is split into a training set and a test set. The monkey videos are recorded at a distance, which makes the dataset quite challenging, with large-scale camera motion and considerable pose variation. The following data are provided for each class (monkey species): video clips with activity annotations, audio clips, bounding-box detections, and classification and distribution locations.
For testing, each video clip is sampled at 5 frames per second (FPS), and optical flow is computed every 5 frames for computational efficiency.
The spatio-temporal two-stream network includes a temporal network, a spatial network, and spatio-temporal co-occurrence encoding.
The temporal and spatial networks are as follows:
(1) the temporal network T takes as input the horizontal flow Ox, the vertical flow Oy, and the flow magnitude Omag, combined into a single optical-flow feature map O ∈ R^(h×w×3), where h × w is the size of the feature map (image);
(2) the spatial network S takes an RGB frame (image) as input.
Both S and T use the same DCNN structure: five convolutional layers Sc1, Sc2, …, Sc5 followed by a fully connected layer Sfc6. The network is trained by treating each input frame (image or optical flow) of each video as a single example, starting from a pre-trained network. When classifying, each image (or frame of optical flow) is initially treated independently, so the Nf frames of a video generate Nf classification decisions.
The spatio-temporal co-occurrence encoding jointly encodes the spatial and temporal features that occur together. With DCNNs, co-occurrence is computed by combining convolutional layers of the spatial and temporal networks. Specifically, let the n-th layer feature maps of the two networks be S_n ∈ R^(h×w×d_n) and T_n ∈ R^(h×w×d_n), where d_n is the feature-map dimension, and compute the combined feature map
P_(i,j) = vec(S_n^(i,j) (T_n^(i,j))^T),
where S_n^(i,j) and T_n^(i,j) are the local feature vectors of the spatial and temporal streams at position (i, j), vec(·) is the vectorization operation, and P_(i,j) is the co-occurrence feature at position (i, j). The outer product captures the co-occurrence pattern of appearance and motion at each spatial location. Max pooling is applied over all local encoding vectors to create the final feature representation P ∈ R^(d_n²), and finally L2 normalization is applied to the encoding vector.
The spatio-temporal bilinear DCNN feature is combined with the fc6 spatio-temporal features for two-stream early fusion, which enables us to combine local and global spatial and temporal information.
The fusion comprises the following steps:
(1) Early fusion: a two-stream network with two independent networks, the temporal network T and the spatial network S (as used for action recognition), combines the two streams' information early by merging the fc6 outputs Sfc6 and Tfc6; fc6 is the first fully connected layer and is commonly used to extract a single feature vector from DCNNs. We refer to this network as the two-stream early-fusion network.
(2) After early fusion, the bilinear DCNN method is used to fuse the two streams, combining spatial and temporal information. The raw data are preprocessed and combined through the fully connected layer, the objects to be classified are determined, and classification is carried out on the computed objects.
The principle of the SVM classifier is:
Let the linearly separable sample set be (x_i, y_i), i = 1, …, n, x ∈ R^d, where y ∈ {+1, −1} is the class label. Then
w·x + b = 0
is the classification hyperplane equation of the SVM classifier.
For the hyperplane to classify all samples correctly with maximum class margin, the following two conditions must be met:
Φ(w) = min (w^T w)
y_i(w·x_i + b) − 1 ≥ 0
The optimal hyperplane is obtained by solving this constrained optimization problem. The training samples of the two classes that are closest to the hyperplane and lie on the hyperplanes parallel to the optimal one, i.e., the special samples for which the equality in the constraint holds, support the optimal hyperplane and are therefore called support vectors. The fused output is fed to the SVM classifier as a feature vector to obtain the final classification result.
Fig. 2 shows the monkey-species video dataset of the video image classification method based on a spatio-temporal co-occurrence two-stream network of the present invention. It includes images and optical-flow information; the dataset consists of video sets of 100 monkey species and is split into a training set and a test set. The monkey videos are recorded at a distance, which makes the dataset quite challenging, with large-scale camera motion and considerable pose variation.
Fig. 3 shows the qualitative evaluation of the method using t-SNE visualization. The dimensionality-reduction-based visualization uses t-distributed stochastic neighbor embedding (t-SNE); it can be seen that the co-occurrence method forms smaller, better-separated clusters whose members lie more closely together, making better use of temporal information.
Fig. 4 illustrates the spatio-temporal co-occurrence concept of the method. Spatial and temporal features that occur together are jointly encoded: with DCNNs, co-occurrence is computed by combining convolutional layers of the spatial and temporal networks. Specifically, let the n-th layer feature maps of the two networks be S_n ∈ R^(h×w×d_n) and T_n ∈ R^(h×w×d_n), where d_n is the feature-map dimension, and compute the combined feature map
P_(i,j) = vec(S_n^(i,j) (T_n^(i,j))^T),
where S_n^(i,j) and T_n^(i,j) are the local feature vectors of the spatial and temporal streams at position (i, j), vec(·) is the vectorization operation, and P_(i,j) is the co-occurrence feature at position (i, j). The outer product captures the co-occurrence pattern of appearance and motion at each spatial location. Max pooling is applied over all local encoding vectors to create the final feature representation P ∈ R^(d_n²), and finally L2 normalization is applied to the encoding vector.
The spatio-temporal bilinear DCNN feature is combined with the fc6 spatio-temporal features for two-stream early fusion, which enables us to combine local and global spatial and temporal information.
Fig. 5 is a schematic diagram of the early-fusion strategy of the method. The fusion comprises the following steps:
(1) Early fusion: a two-stream network with two independent networks, the temporal network T and the spatial network S (as used for action recognition), combines the information of the two streams early by merging the fc6 outputs Sfc6 and Tfc6; fc6 is the first fully connected layer and is commonly used to extract a single feature vector from DCNNs; we refer to this modified network as the two-stream (early-fusion) network.
(2) For early fusion, the raw data are preprocessed and combined through the fully connected layer, the objects to be classified are determined, and classification is carried out on the computed objects; the bilinear DCNN method is then used to fuse the two streams, combining spatial and temporal information.
Fig. 6 shows example monkey localizations of the video image classification method based on a spatio-temporal co-occurrence two-stream network of the present invention. In most cases, the position of the monkey in the image is localized accurately; however, when blurred texture, cluttered objects, or occlusion appear in the picture, the localization in the video image fails.
For those skilled in the art, the present invention is not restricted to the details of the above-described embodiments, and it can be realized in other concrete forms without departing from its spirit or scope. In addition, those skilled in the art may make various changes and modifications to the present invention without departing from its spirit and scope, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the invention.
Claims (10)
1. A video image classification method based on a spatio-temporal co-occurrence two-stream network, characterized by mainly comprising: data input (1); a spatio-temporal two-stream network (2); fusion (3); and an SVM classifier (4).
2. The video image classification method based on a spatio-temporal co-occurrence two-stream network according to claim 1, characterized in that it uses a two-stream network combining temporal and spatial information by early fusion (spatio-temporal co-occurrence) on a monkey-species video dataset; taking more frames (i.e., more spatial data) from each video yields a significant accuracy improvement; spatial and temporal information are complementary, and accuracy reaches 65.8%.
3. The video image classification method based on a spatio-temporal co-occurrence two-stream network according to claim 1, characterized in that a dimensionality-reduction-based visualization is drawn with t-distributed stochastic neighbor embedding (t-SNE); the co-occurrence method forms smaller, better-separated clusters whose members lie more closely together, making better use of temporal information.
4. The data input (1) according to claim 1, characterized by including images and optical-flow information; the dataset consists of video sets of 100 monkey species and is split into a training set and a test set; the monkey videos are recorded at a distance, which makes the dataset quite challenging, with large-scale camera motion and considerable pose variation;
the following data are provided for each class (monkey species): video clips with activity annotations, audio clips, bounding boxes, and classification and distribution locations.
5. The testing according to claim 4, characterized in that each video clip is sampled at 5 frames per second (FPS), and optical flow is computed every 5 frames for computational efficiency.
6. The spatio-temporal two-stream network (2) according to claim 1, characterized by including a temporal network, a spatial network, and spatio-temporal co-occurrence encoding.
7. The temporal and spatial networks according to claim 6, characterized by including:
(1) the temporal network T, which takes as input the horizontal flow Ox, the vertical flow Oy, and the flow magnitude Omag, combined into a single optical-flow feature map O ∈ R^(h×w×3), where h × w is the size of the feature map (image);
(2) the spatial network S, which takes an RGB frame (image) as input;
both S and T use the same DCNN structure: five convolutional layers Sc1, Sc2, …, Sc5 followed by a fully connected layer Sfc6; the network is trained by treating each input frame (image or optical flow) of each video as a single example, starting from a pre-trained network; when classifying, each image (or frame of optical flow) is initially treated independently, so the Nf frames of a video generate Nf classification decisions.
8. The spatio-temporal co-occurrence encoding according to claim 6, characterized by jointly encoding the spatial and temporal features that occur together; with DCNNs, co-occurrence is computed by combining convolutional layers of the spatial and temporal networks; specifically, let the n-th layer feature maps of the two networks be S_n ∈ R^(h×w×d_n) and T_n ∈ R^(h×w×d_n), where d_n is the feature-map dimension, and compute the combined feature map
P_(i,j) = vec(S_n^(i,j) (T_n^(i,j))^T),
where S_n^(i,j) and T_n^(i,j) are the local feature vectors of the spatial and temporal streams at position (i, j), vec(·) is the vectorization operation, and P_(i,j) is the co-occurrence feature at position (i, j); the outer product captures the co-occurrence pattern of appearance and motion at each spatial location; max pooling is applied over all local encoding vectors to create the final feature representation P ∈ R^(d_n²); finally, L2 normalization is applied to the encoding vector;
the spatio-temporal bilinear DCNN feature is combined with the fc6 spatio-temporal features for two-stream early fusion, which enables us to combine local and global spatial and temporal information.
9. The fusion (3) according to claim 1, characterized by comprising the steps of:
(1) early fusion: a two-stream network with two independent networks, the temporal network T and the spatial network S (as used for action recognition), combines the two streams' information early by merging the fc6 outputs Sfc6 and Tfc6; fc6 is the first fully connected layer and is commonly used to extract a single feature vector from DCNNs; we refer to this network as the two-stream early-fusion network;
(2) after early fusion, the bilinear DCNN method is used to fuse the two streams, combining spatial and temporal information; the raw data are preprocessed and combined through the fully connected layer, the objects to be classified are determined, and classification is carried out on the computed objects.
10. The SVM classifier (4) according to claim 1, characterized in that the principle of the SVM classifier is:
let the linearly separable sample set be (x_i, y_i), i = 1, …, n, x ∈ R^d, where y ∈ {+1, −1} is the class label; then
w·x + b = 0
is the classification hyperplane equation of the SVM classifier;
for the hyperplane to classify all samples correctly with maximum class margin, the following two conditions must be met:
Φ(w) = min (w^T w)
y_i(w·x_i + b) − 1 ≥ 0;
the optimal hyperplane is obtained by solving this constrained optimization problem; the training samples of the two classes that are closest to the hyperplane and lie on the hyperplanes parallel to the optimal one, i.e., the special samples for which the equality in the constraint holds, support the optimal hyperplane and are therefore called support vectors; the fused output is fed to the SVM classifier as a feature vector to obtain the final classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610794689.9A CN106469314A (en) | 2016-08-31 | 2016-08-31 | A kind of video image classifier method based on space-time symbiosis binary-flow network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610794689.9A CN106469314A (en) | 2016-08-31 | 2016-08-31 | A kind of video image classifier method based on space-time symbiosis binary-flow network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106469314A true CN106469314A (en) | 2017-03-01 |
Family
ID=58230424
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610794689.9A Pending CN106469314A (en) | 2016-08-31 | 2016-08-31 | A kind of video image classifier method based on space-time symbiosis binary-flow network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106469314A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107516084A (en) * | 2017-08-30 | 2017-12-26 | 中国人民解放军国防科技大学 | Internet video author identity identification method based on multi-feature fusion |
CN107609460A (en) * | 2017-05-24 | 2018-01-19 | 南京邮电大学 | A kind of Human bodys' response method for merging space-time dual-network stream and attention mechanism |
CN107945146A (en) * | 2017-11-23 | 2018-04-20 | 南京信息工程大学 | A kind of space-time Satellite Images Fusion method based on depth convolutional neural networks |
CN108334901A (en) * | 2018-01-30 | 2018-07-27 | 福州大学 | A kind of flowers image classification method of the convolutional neural networks of combination salient region |
CN109190479A (en) * | 2018-08-04 | 2019-01-11 | 台州学院 | A kind of video sequence expression recognition method based on interacting depth study |
CN109325430A (en) * | 2018-09-11 | 2019-02-12 | 北京飞搜科技有限公司 | Real-time Activity recognition method and system |
CN109889849A (en) * | 2019-01-30 | 2019-06-14 | 北京市商汤科技开发有限公司 | Video generation method, device, medium and equipment |
CN110012547A (en) * | 2019-04-12 | 2019-07-12 | 电子科技大学 | A kind of method of user-association in symbiosis network |
CN110110648A (en) * | 2019-04-30 | 2019-08-09 | 北京航空航天大学 | Method is nominated in view-based access control model perception and the movement of artificial intelligence |
CN110351523A (en) * | 2019-07-22 | 2019-10-18 | 常州机电职业技术学院 | Building video monitoring system and video monitoring adjustment method |
CN110516536A (en) * | 2019-07-12 | 2019-11-29 | 杭州电子科技大学 | A kind of Weakly supervised video behavior detection method for activating figure complementary based on timing classification |
CN111027377A (en) * | 2019-10-30 | 2020-04-17 | 杭州电子科技大学 | Double-flow neural network time sequence action positioning method |
CN111402200A (en) * | 2020-02-18 | 2020-07-10 | 江苏大学 | Fried food detection system based on symbiotic double-current convolution network and digital image |
CN111428066A (en) * | 2020-04-24 | 2020-07-17 | 南京图格医疗科技有限公司 | Method for classifying and segmenting lesion image based on convolutional neural network |
CN111753782A (en) * | 2020-06-30 | 2020-10-09 | 西安深信科创信息技术有限公司 | False face detection method and device based on double-current network and electronic equipment |
CN112364852A (en) * | 2021-01-13 | 2021-02-12 | 成都考拉悠然科技有限公司 | Action video segment extraction method fusing global information |
CN112989955A (en) * | 2021-02-20 | 2021-06-18 | 北方工业大学 | Method for recognizing human body actions based on space-time double-current heterogeneous grafting convolutional neural network |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105550699A (en) * | 2015-12-08 | 2016-05-04 | 北京工业大学 | CNN-based video identification and classification method through time-space significant information fusion |
- 2016-08-31: application CN201610794689.9A filed; publication CN106469314A (status: Pending)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105550699A (en) * | 2015-12-08 | 2016-05-04 | 北京工业大学 | CNN-based video identification and classification method through time-space significant information fusion |
Non-Patent Citations (1)
Title |
---|
ZONGYUAN GE et al.: "Exploiting Temporal Information for DCNN-based Fine-Grained Object Classification", https://arxiv.org/pdf/1608.00486v1.pdf * |
Cited By (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107609460A (en) * | 2017-05-24 | 2018-01-19 | 南京邮电大学 | Human body behavior recognition method integrating space-time dual network flow and attention mechanism |
CN107609460B (en) * | 2017-05-24 | 2021-02-02 | 南京邮电大学 | Human body behavior recognition method integrating space-time dual network flow and attention mechanism |
CN107516084A (en) * | 2017-08-30 | 2017-12-26 | 中国人民解放军国防科技大学 | Internet video author identity identification method based on multi-feature fusion |
CN107516084B (en) * | 2017-08-30 | 2020-01-17 | 中国人民解放军国防科技大学 | Internet video author identity identification method based on multi-feature fusion |
CN107945146B (en) * | 2017-11-23 | 2021-08-03 | 南京信息工程大学 | Space-time satellite image fusion method based on deep convolutional neural network |
CN107945146A (en) * | 2017-11-23 | 2018-04-20 | 南京信息工程大学 | Space-time satellite image fusion method based on deep convolutional neural network |
CN108334901A (en) * | 2018-01-30 | 2018-07-27 | 福州大学 | Flower image classification method using convolutional neural networks combined with salient regions |
CN109190479A (en) * | 2018-08-04 | 2019-01-11 | 台州学院 | Video sequence expression recognition method based on interactive deep learning |
CN109325430A (en) * | 2018-09-11 | 2019-02-12 | 北京飞搜科技有限公司 | Real-time Activity recognition method and system |
CN109325430B (en) * | 2018-09-11 | 2021-08-20 | 苏州飞搜科技有限公司 | Real-time behavior identification method and system |
CN109889849A (en) * | 2019-01-30 | 2019-06-14 | 北京市商汤科技开发有限公司 | Video generation method, device, medium and equipment |
CN109889849B (en) * | 2019-01-30 | 2022-02-25 | 北京市商汤科技开发有限公司 | Video generation method, device, medium and equipment |
CN110012547B (en) * | 2019-04-12 | 2021-04-02 | 电子科技大学 | User association method in symbiotic network |
CN110012547A (en) * | 2019-04-12 | 2019-07-12 | 电子科技大学 | User association method in symbiotic network |
CN110110648A (en) * | 2019-04-30 | 2019-08-09 | 北京航空航天大学 | Action nomination method based on visual perception and artificial intelligence |
CN110110648B (en) * | 2019-04-30 | 2020-03-17 | 北京航空航天大学 | Action nomination method based on visual perception and artificial intelligence |
CN110516536A (en) * | 2019-07-12 | 2019-11-29 | 杭州电子科技大学 | Weak supervision video behavior detection method based on time sequence class activation graph complementation |
CN110516536B (en) * | 2019-07-12 | 2022-03-18 | 杭州电子科技大学 | Weak supervision video behavior detection method based on time sequence class activation graph complementation |
CN110351523A (en) * | 2019-07-22 | 2019-10-18 | 常州机电职业技术学院 | Building video monitoring system and video monitoring adjustment method |
CN111027377A (en) * | 2019-10-30 | 2020-04-17 | 杭州电子科技大学 | Double-flow neural network time sequence action positioning method |
CN111027377B (en) * | 2019-10-30 | 2021-06-04 | 杭州电子科技大学 | Double-flow neural network time sequence action positioning method |
CN111402200A (en) * | 2020-02-18 | 2020-07-10 | 江苏大学 | Fried food detection system based on symbiotic double-current convolution network and digital image |
CN111402200B (en) * | 2020-02-18 | 2021-12-21 | 江苏大学 | Fried food detection system based on symbiotic double-current convolution network and digital image |
CN111428066A (en) * | 2020-04-24 | 2020-07-17 | 南京图格医疗科技有限公司 | Method for classifying and segmenting lesion image based on convolutional neural network |
CN111753782A (en) * | 2020-06-30 | 2020-10-09 | 西安深信科创信息技术有限公司 | False face detection method and device based on double-current network and electronic equipment |
CN111753782B (en) * | 2020-06-30 | 2023-02-10 | 西安深信科创信息技术有限公司 | False face detection method and device based on double-current network and electronic equipment |
CN112364852A (en) * | 2021-01-13 | 2021-02-12 | 成都考拉悠然科技有限公司 | Action video segment extraction method fusing global information |
CN112989955A (en) * | 2021-02-20 | 2021-06-18 | 北方工业大学 | Human body action recognition method based on space-time double-flow heterogeneous grafting convolutional neural network |
CN112989955B (en) * | 2021-02-20 | 2023-09-29 | 北方工业大学 | Human body action recognition method based on space-time double-flow heterogeneous grafting convolutional neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106469314A (en) | A kind of video image classifier method based on space-time symbiosis binary-flow network | |
CN112541501B (en) | Scene character recognition method based on visual language modeling network | |
CN110879959B (en) | Method and device for generating data set, and testing method and testing device using same | |
CN105678216A (en) | Spatio-temporal data stream video behavior recognition method based on deep learning | |
CN110489395A (en) | Method for automatically acquiring multi-source heterogeneous data knowledge | |
CN107679462A (en) | Wavelet-based deep multi-feature fusion classification method | |
CN110119757A (en) | Model training method, video category detection method, device, electronic equipment and computer-readable medium | |
CN109508375A (en) | Social emotion classification method based on multi-modal fusion | |
CN108229338A (en) | Video behavior recognition method based on deep convolutional features | |
CN105550699A (en) | CNN-based video identification and classification method through time-space significant information fusion | |
CN110263822B (en) | Image emotion analysis method based on multi-task learning mode | |
CN106845329A (en) | Action recognition method based on multi-channel pyramid pooling of deep convolutional features | |
CN111898439A (en) | Deep learning-based traffic scene joint target detection and semantic segmentation method | |
CN110852295B (en) | Video behavior recognition method based on multitasking supervised learning | |
CN110781850A (en) | Semantic segmentation system and method for road recognition, and computer storage medium | |
CN110008961A (en) | Text real-time identification method, device, computer equipment and storage medium | |
CN113297370A (en) | End-to-end multi-modal question-answering method and system based on multi-interaction attention | |
CN114443899A (en) | Video classification method, device, equipment and medium | |
WO2021050769A1 (en) | Spatio-temporal interactions for video understanding | |
CN116108215A (en) | Cross-modal big data retrieval method and system based on depth fusion | |
CN113657272B (en) | Micro video classification method and system based on missing data completion | |
CN114677536A (en) | Pre-training method and device based on Transformer structure | |
CN113743389A (en) | Facial expression recognition method and device and electronic equipment | |
CN115471901A (en) | Multi-pose face frontization method and system based on generation of confrontation network | |
Zahedi et al. | Robust sign language recognition system using ToF depth cameras |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20170301 |