CN110532996A - Video classification method, information processing method, and server - Google Patents

Video classification method, information processing method, and server

Info

Publication number
CN110532996A
CN110532996A (application CN201910834142.0A)
Authority
CN
China
Prior art keywords
video frame
video
feature sequence
frame feature
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910834142.0A
Other languages
Chinese (zh)
Other versions
CN110532996B (en)
Inventor
唐永毅
马林
刘威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Co., Ltd.
Priority to CN201910834142.0A
Publication of CN110532996A
Application granted
Publication of CN110532996B
Legal status: Active


Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F 16/735: Filtering based on additional data, e.g. user or group profiles (information retrieval of video data; querying)
    • G06F 16/75: Clustering; classification (information retrieval of video data)
    • G06F 16/783: Retrieval characterised by using metadata automatically derived from the content
    • G06F 16/9535: Search customisation based on user profiles and personalisation
    • G06F 17/15: Correlation function computation including computation of convolution operations
    • G06F 18/00: Pattern recognition
    • G06F 18/24: Classification techniques
    • G06F 18/2413: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches, based on distances to training or reference patterns
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/48: Matching video sequences

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Library & Information Science (AREA)
  • Algebra (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

This application discloses an information processing method applied to the field of artificial intelligence. The method comprises: obtaining a video to be processed; sampling the video to be processed according to a temporal feature sampling rule to obtain at least one video frame feature sequence; processing the at least one video frame feature sequence with a first neural network model to obtain the feature representation result of each video frame feature sequence; and processing the feature representation results of the at least one video frame feature sequence with a second neural network model to obtain prediction results corresponding to the at least one video frame feature sequence, the prediction results being used to determine the category of the video to be processed. This application also provides a server. Because the classification process also takes into account how the video's features change along the time dimension, the video content can be expressed better, improving the accuracy and effectiveness of video classification.

Description

Video classification method, information processing method, and server
This application is a divisional application of Chinese patent application No. 201710833668.8, entitled "Video classification method, information processing method, and server", filed with the Patent Office of the People's Republic of China on September 15, 2017.
Technical field
This application relates to the field of computer technology, and in particular to a video classification method, an information processing method, and a server.
Background
With the rapid development of network multimedia technology, all kinds of multimedia information keep emerging. More and more users are accustomed to watching videos on the network. To enable users to pick the content they want to watch out of a large number of videos, videos are usually classified; video classification is therefore particularly important and effective for managing videos and for interest-based recommendation.
The video classification method commonly used at present first performs feature extraction on each video frame of the video to be labelled, then converts the frame-level features into video-level features by feature averaging, and finally feeds the video-level features into a classification network for classification.
However, the current video classification method converts the frame-level features only by averaging, which is rather simplistic: it does not take into account the influence that variation along other dimensions has on the video frames, which is detrimental to classification accuracy.
Summary of the invention
Embodiments of the present application provide a video classification method, an information processing method, and a server which, in the course of classifying a video, also take into account how the video's features change along the time dimension, so that the video content can be expressed better, improving the accuracy and effectiveness of video classification.
In view of this, a first aspect of the present application provides a video classification method, comprising:
obtaining a video to be processed, wherein the video to be processed comprises multiple video frames, and each video frame corresponds to a temporal feature;
sampling the video to be processed according to a temporal feature sampling rule to obtain at least one video frame feature sequence, wherein the temporal feature sampling rule is a correspondence between temporal features and video frame feature sequences;
processing the at least one video frame feature sequence with a first neural network model to obtain the feature representation result corresponding to each video frame feature sequence;
processing the feature representation results corresponding to the at least one video frame feature sequence with a second neural network model to obtain the prediction results corresponding to the at least one video frame feature sequence;
determining the category of the video to be processed according to the prediction results corresponding to the at least one video frame feature sequence.
A second aspect of the present application provides an information processing method, comprising:
obtaining a video to be processed, wherein the video to be processed comprises multiple video frames, and each video frame corresponds to a temporal feature;
sampling the video to be processed according to a temporal feature sampling rule to obtain at least one video frame feature sequence, wherein the temporal feature sampling rule is a correspondence between temporal features and video frame feature sequences;
processing the at least one video frame feature sequence with a first neural network model to obtain the feature representation result corresponding to each video frame feature sequence;
processing the feature representation results corresponding to the at least one video frame feature sequence with a second neural network model to obtain the prediction results corresponding to the at least one video frame feature sequence, wherein the prediction results are used to determine the category of the video to be processed.
A third aspect of the present application provides a server, comprising:
a first obtaining module, configured to obtain a video to be processed, wherein the video to be processed comprises multiple video frames, and each video frame corresponds to a temporal feature;
a second obtaining module, configured to sample, according to a temporal feature sampling rule, the video to be processed obtained by the first obtaining module, and to obtain at least one video frame feature sequence, wherein the temporal feature sampling rule is a correspondence between temporal features and video frame feature sequences;
a first input module, configured to process, with a first neural network model, the at least one video frame feature sequence obtained by the second obtaining module, and to obtain the feature representation result corresponding to each video frame feature sequence;
a second input module, configured to process, with a second neural network model, the feature representation results corresponding to the at least one video frame feature sequence obtained by the first input module, and to obtain the prediction results corresponding to the at least one video frame feature sequence, wherein the prediction results are used to determine the category of the video to be processed.
A fourth aspect of the present application provides a server, comprising a memory, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, specifically performing the following steps:
obtaining a video to be processed, wherein the video to be processed comprises multiple video frames, and each video frame corresponds to a temporal feature;
sampling the video to be processed according to a temporal feature sampling rule to obtain at least one video frame feature sequence, wherein the temporal feature sampling rule is a correspondence between temporal features and video frame feature sequences;
processing the at least one video frame feature sequence with a first neural network model to obtain the feature representation result corresponding to each video frame feature sequence;
processing the feature representation results corresponding to the at least one video frame feature sequence with a second neural network model to obtain the prediction results corresponding to the at least one video frame feature sequence, wherein the prediction results are used to determine the category of the video to be processed;
and the bus system is configured to connect the memory and the processor so that the memory and the processor can communicate.
A fifth aspect of the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods described in the above aspects.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantage:
The embodiments of the present application provide an information processing method. The server first obtains a video to be processed comprising multiple video frames, each corresponding to a temporal feature. It then samples the video to be processed according to a temporal feature sampling rule and obtains at least one video frame feature sequence, the temporal feature sampling rule being a correspondence between temporal features and video frame feature sequences. The server processes the at least one video frame feature sequence with a first neural network model to obtain the feature representation result corresponding to each video frame feature sequence, and finally processes the feature representation results corresponding to the at least one video frame feature sequence with a second neural network model to obtain the prediction results corresponding to the at least one video frame feature sequence, the prediction results being used to determine the category of the video to be processed. In this way, the classification process also considers how the video's features change along the time dimension, so the video content can be expressed better, improving the accuracy and effectiveness of video classification.
Brief description of the drawings
Fig. 1 is an architecture diagram of information processing in an embodiment of the present application;
Fig. 2 is a schematic diagram of an embodiment of the information processing method in an embodiment of the present application;
Fig. 3 is a schematic diagram of a video to be processed in an embodiment of the present application;
Fig. 4 is a schematic diagram of a convolutional neural network with an inception structure in an embodiment of the present application;
Fig. 5 is a structural diagram of the first neural network model in an embodiment of the present application;
Fig. 6 is a structural diagram of the second neural network model in an embodiment of the present application;
Fig. 7 is a schematic diagram of an embodiment of the server in an embodiment of the present application;
Fig. 8 is a schematic diagram of another embodiment of the server in an embodiment of the present application;
Fig. 9 is a schematic diagram of another embodiment of the server in an embodiment of the present application;
Fig. 10 is a schematic diagram of another embodiment of the server in an embodiment of the present application;
Fig. 11 is a schematic diagram of another embodiment of the server in an embodiment of the present application;
Fig. 12 is a schematic diagram of another embodiment of the server in an embodiment of the present application;
Fig. 13 is a schematic diagram of another embodiment of the server in an embodiment of the present application;
Fig. 14 is a schematic diagram of another embodiment of the server in an embodiment of the present application;
Fig. 15 is a structural diagram of the server in an embodiment of the present application.
Detailed description of the embodiments
Embodiments of the present application provide a video classification method, an information processing method, and a server which, in the course of classifying a video, also take into account how the video's features change along the time dimension, so that the video content can be expressed better, improving the accuracy and effectiveness of video classification.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification and claims of this application and in the above drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments described herein can, for example, be implemented in an order other than the one illustrated or described herein. Moreover, the terms "comprise" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
It should be understood that this solution is mainly intended to provide a video content classification service, specifically a video content classification service based on artificial intelligence (AI): the background server performs feature extraction, temporal sequence modelling, and feature compression on videos, and finally classifies the video features through a mixture-of-experts model, so as to classify and label videos automatically on the server. This solution can be deployed on video websites to add keywords to videos, can be used for fast search and content matching, and can also be used for personalised video recommendation.
Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, with both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operating/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises several general directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
The video classification method and the information processing method provided by this application can recognise video content through computer vision (CV). Computer vision is the science of studying how to make machines "see"; more specifically, it means using cameras and computers instead of human eyes to identify, track, and measure targets, and further performing graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies and attempts to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behaviour recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localisation and mapping, and also includes common biometric recognition technologies such as face recognition and fingerprint recognition.
For ease of introduction, please refer to Fig. 1, an architecture diagram of information processing in an embodiment of the present application. As shown in the figure, the server first obtains the video to be processed. As can be seen from Fig. 1, the video to be processed contains multiple video frames, each corresponding to a temporal feature; the different temporal features can be denoted by t. Next, the server processes each video frame of the video to be processed with a convolutional neural network to obtain the temporal feature corresponding to each video frame, and then determines the temporal feature sequence of the video to be processed from the temporal features of the video frames; the temporal feature sequence is the frame-level deep learning representation.
Continuing with Fig. 1, the server can then sample the video to be processed according to the temporal feature sampling rule, that is, sample the video features at different frame rates along the time dimension, and obtain at least one video frame feature sequence; these video frame feature sequences correspond to different time scales. The server feeds the video frame feature sequences of the different time scales separately into a forward-backward recurrent neural network, obtaining the feature representation result of each video frame feature sequence, i.e., the video feature representation at each time scale. Finally, the server feeds all the feature representation results into a mixture-of-experts model and obtains the prediction result corresponding to each video frame feature sequence; the category of the video to be processed can be determined from these prediction results, and the video is classified accordingly.
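To make this data flow concrete, the following is a minimal Python sketch of the pipeline, with toy stand-ins for the convolutional network, the forward-backward recurrent network, and the mixture-of-experts model; every name, dimension, and weight here is an illustrative assumption rather than part of the original disclosure.

    import numpy as np

    def cnn_features(frames):
        # Stand-in for the inception CNN: one feature vector per video frame.
        return np.random.rand(len(frames), 1024)

    def window_average(feats, win):
        # One scale of the temporal feature sampling rule: average each time window.
        n = len(feats) // win
        return feats[:n * win].reshape(n, win, -1).mean(axis=1)

    def birnn_representation(seq):
        # Stand-in for forward-backward recurrent compression toward the video centre.
        mid = max(len(seq) // 2, 1)
        return np.concatenate([seq[:mid].mean(axis=0), seq[mid:].mean(axis=0)])

    def moe_prediction(h, n_classes=5):
        # Stand-in for the mixture-of-experts classifier: per-class probabilities.
        rng = np.random.default_rng(0)
        return 1.0 / (1.0 + np.exp(-h @ rng.standard_normal((h.size, n_classes))))

    frames = [f"frame_{t}" for t in range(100)]   # video to be processed, T = 100
    feats = cnn_features(frames)                  # frame-level temporal feature sequence
    preds = [moe_prediction(birnn_representation(window_average(feats, w)))
             for w in (1, 5, 10)]                 # one prediction per time scale
    weights = [0.2, 0.3, 0.5]                     # learned importance of each scale
    scores = sum(w * p for w, p in zip(weights, preds))
    print("predicted category:", int(scores.argmax()))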
In common video data, users usually describe, comment on, and attach personalised labels to videos, forming rich text information related to online videos. Such text information can also serve as a basis for video classification.
The solutions provided by the embodiments of the present application involve technologies such as the computer vision branch of artificial intelligence, and are illustrated by the following embodiments. The information processing method of this application is introduced below from the perspective of the server. Referring to Fig. 2, an embodiment of the information processing method in the embodiments of the present application comprises the following steps:
101. Obtain a video to be processed, wherein the video to be processed comprises multiple video frames, and each video frame corresponds to a temporal feature.
In this embodiment, the server first obtains the video to be processed. Specifically, referring to Fig. 3, a schematic diagram of the video to be processed in an embodiment of the present application, the video to be processed contains multiple video frames: each picture in Fig. 3 is one video frame, and each video frame can correspond to one temporal feature.
Since the video to be processed has a playback duration, each video frame has a different playback moment. Suppose the temporal feature of the first video frame of the video to be processed is "1"; then the temporal feature of the second video frame is "2", and so on, the temporal feature of the T-th video frame being "T".
102. Sample the video to be processed according to a temporal feature sampling rule and obtain at least one video frame feature sequence, wherein the temporal feature sampling rule is a correspondence between temporal features and video frame feature sequences.
In this embodiment, the server next samples the video to be processed according to the temporal feature sampling rule. The temporal feature sampling rule contains a preset correspondence between temporal features and video frame feature sequences, and different video frame feature sequences can have different scales. For example, suppose a video to be processed has 1000 video frames in total, corresponding to temporal features 1 to 1000. If the temporal feature sampling rule maps each temporal feature to one element of the video frame feature sequence, the length of the video frame feature sequence at this scale is 1000; if the rule maps every 100 temporal features to one element, the length at this scale is 10; and so on.
103. Process the at least one video frame feature sequence with a first neural network model and obtain the feature representation result corresponding to each video frame feature sequence.
In this embodiment, after obtaining the at least one video frame feature sequence, the server can feed the video frame feature sequence of each scale into the first neural network model, which then outputs the feature representation result of each video frame feature sequence.
Here, different scales mean different video frame feature sequence lengths. As described in step 102, assume the total length of the video is T: if one frame forms one element of the video frame feature sequence, the sequence length is T/1; if ten frames form one element, the sequence length is T/10.
104. Process the feature representation results corresponding to the at least one video frame feature sequence with a second neural network model and obtain the prediction results corresponding to the at least one video frame feature sequence, wherein the prediction results are used to determine the category of the video to be processed.
In this embodiment, the server can feed the feature representation result corresponding to each video frame feature sequence into the second neural network model, which then outputs the prediction result corresponding to each feature representation result. Finally, the server can determine the category of the video to be processed according to the prediction results.
It can be understood that the category of the video to be processed may be "sport", "news", "music", "animation", "games", and the like, which is not limited here.
The embodiments of the present application thus provide an information processing method. The server first obtains a video to be processed comprising multiple video frames, each corresponding to a temporal feature; samples the video to be processed according to a temporal feature sampling rule to obtain at least one video frame feature sequence, the rule being a correspondence between temporal features and video frame feature sequences; feeds the at least one video frame feature sequence into the first neural network model to obtain the feature representation result corresponding to each video frame feature sequence; and finally feeds each feature representation result into the second neural network model to obtain the corresponding prediction result, which is used to determine the category of the video to be processed. In this way, the classification process also considers how the video's features change along the time dimension, so the video content can be expressed better, improving the accuracy and effectiveness of video classification.
Optionally, on the basis of the embodiment corresponding to Fig. 2, in a first optional embodiment of the information processing method provided by the embodiments of the present application, after the video to be processed is obtained, the method may further comprise:
processing each video frame in the video to be processed with a convolutional neural network (CNN) to obtain the temporal feature corresponding to each video frame;
determining the temporal feature sequence of the video to be processed according to the temporal features corresponding to the video frames, wherein the temporal feature sequence is used for sampling.
In this embodiment, after obtaining the video to be processed, the server processes each video frame in the video to be processed with a convolutional neural network (CNN) having an inception structure and extracts the temporal feature corresponding to each video frame. Finally, the server determines the temporal feature sequence of the video to be processed according to the temporal features of the video frames. Suppose the first video frame of the video to be processed is 1, the second is 2, and so on, the last being T; then the temporal feature sequence of the video to be processed can be determined to be T (seconds).
The CNN with the inception structure is explained below. Referring to Fig. 4, a schematic diagram of a convolutional neural network with an inception structure in an embodiment of the present application, the inception structure contains convolutions of three different sizes, namely a 1 × 1 convolutional layer, a 3 × 3 convolutional layer, and a 5 × 5 convolutional layer, together with a 3 × 3 max-pooling layer; the final fully connected layer is removed and replaced with a global average pooling layer (which reduces the spatial dimensions to 1 × 1).
To enhance network capability, the depth and the width of the network can be increased, but to reduce overfitting the number of free parameters should be kept down. Therefore, the same layer of the inception structure contains three different convolution kernels, the 1 × 1, 3 × 3, and 5 × 5 convolutional layers, which perform feature extraction at different sizes and mix the results. Because the max-pooling layer itself also performs feature extraction, has no parameters unlike the convolutions, and will not overfit, it serves as a separate branch. Done naively, however, the whole network would be computationally expensive without becoming deeper, so a 1 × 1 convolution is applied before the 3 × 3 and the 5 × 5 convolutions to reduce the number of input channels; in this way the network is deepened while the amount of computation actually decreases.
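As a concrete illustration of this block, here is a minimal PyTorch sketch of an inception block with 1 × 1 reductions before the 3 × 3 and 5 × 5 branches; the channel counts are illustrative assumptions, not values taken from the patent.

    import torch
    import torch.nn as nn

    class InceptionBlock(nn.Module):
        def __init__(self, in_ch):
            super().__init__()
            self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)        # plain 1x1 branch
            self.b2 = nn.Sequential(                             # 1x1 reduction, then 3x3
                nn.Conv2d(in_ch, 96, kernel_size=1),
                nn.Conv2d(96, 128, kernel_size=3, padding=1))
            self.b3 = nn.Sequential(                             # 1x1 reduction, then 5x5
                nn.Conv2d(in_ch, 16, kernel_size=1),
                nn.Conv2d(16, 32, kernel_size=5, padding=2))
            self.b4 = nn.Sequential(                             # parameter-free max pooling, then 1x1
                nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                nn.Conv2d(in_ch, 32, kernel_size=1))

        def forward(self, x):
            # Concatenate the four branches along the channel dimension.
            return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

    frame = torch.randn(1, 192, 28, 28)       # feature map of one video frame
    out = InceptionBlock(192)(frame)          # -> (1, 256, 28, 28)
    pooled = out.mean(dim=(2, 3))             # global average pooling instead of a fully connected layer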
In the embodiments of the present application, after the server obtains the video to be processed, it can thus also process each video frame of the video to be processed with a convolutional neural network to obtain the temporal feature corresponding to each video frame; these temporal features form the temporal feature sequence of the entire video to be processed. Training and processing each video frame with a convolutional neural network helps improve the accuracy and effectiveness of temporal feature extraction.
Optionally, on the basis of the first embodiment corresponding to Fig. 2, in a second optional embodiment of the information processing method provided by the embodiments of the present application, sampling the video to be processed according to the temporal feature sampling rule and obtaining at least one video frame feature sequence may comprise:
determining at least one time window according to the temporal feature sampling rule, wherein each time window contains at least one video frame of the video to be processed;
extracting the video frame feature sequence corresponding to each time window from the temporal feature sequence.
This embodiment introduces how the server obtains the at least one video frame feature sequence.
Specifically, at least one time window is first defined according to the temporal feature sampling rule, in order to sample video frame feature sequences at multiple scales. Suppose the video to be processed lasts T seconds: taking 1 video frame, 5 video frames, and 10 video frames as time windows respectively, and averaging the video frame features within each time window, yields video frame feature sequences at three different scales. If T seconds equals 100 frames, then with 1 frame as the time window the video frame feature sequence length is T/1 = T, and with 10 frames as the time window the resulting video frame feature sequence length is T/10. The video frame feature sequence length is therefore related to the size of the time window.
The time window size can be preset manually; the more video frames a time window contains, the coarser the granularity. The content within each time window is averaged, turning it into the content of "one frame".
In this embodiment, the method of extracting video frame feature sequences at different scales is thus described: at least one time window is first determined according to the temporal feature sampling rule, each containing at least one video frame of the video to be processed, and the video frame feature sequence corresponding to each time window is then extracted from the temporal feature sequence. In this way, video frame feature sequences at different scales can be obtained, providing multiple different samples for feature training, which helps improve the accuracy of the video classification result.
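A minimal numpy sketch of the window averaging described in this embodiment, using the T = 100 frames and the 1-, 5-, and 10-frame windows of the example above (the feature dimension of 1024 is an illustrative assumption):

    import numpy as np

    T, dim = 100, 1024                        # frame-level features from the CNN
    frame_feats = np.random.rand(T, dim)      # temporal feature sequence

    def sample_sequence(feats, window):
        # Average the frame features inside each time window ("one frame" per window).
        n = len(feats) // window
        return feats[:n * window].reshape(n, window, -1).mean(axis=1)

    for window in (1, 5, 10):                 # time windows of the sampling rule
        print(window, sample_sequence(frame_feats, window).shape)  # lengths 100, 20, 10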
Optionally, on the basis of the embodiment corresponding to Fig. 2, in a third optional embodiment of the information processing method provided by the embodiments of the present application, processing the at least one video frame feature sequence with the first neural network model and obtaining the feature representation result corresponding to each video frame feature sequence may comprise:
feeding the at least one video frame feature sequence into a forward recurrent neural network in the first neural network model to obtain a first representation result;
feeding the at least one video frame feature sequence into a backward recurrent neural network in the first neural network model to obtain a second representation result;
calculating the feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result.
This embodiment introduces how the feature representation result corresponding to each video frame feature sequence is obtained with the first neural network model.
Specifically, referring to Fig. 5, a structural diagram of the first neural network model in an embodiment of the present application, the whole first neural network model consists of two parts, a forward recurrent neural network and a backward recurrent neural network. Each video frame feature sequence is fed into the forward recurrent neural network, which outputs the corresponding first representation result; at the same time, each video frame feature sequence is fed into the backward recurrent neural network, which outputs the corresponding second representation result.
Finally, directly concatenating the first representation result and the second representation result yields the feature representation result corresponding to the video frame feature sequence.
In the embodiments of the present application, on the basis of the extracted video frame feature sequences, a recurrent neural network based on gated recurrent units can thus be used for temporal sequence modelling of the video frame feature sequences. Further, in order to better represent information at different time scales, the first neural network model can also compress the video features. Because the main content of most videos occurs around the middle of the video, the forward-backward recurrent neural network compresses and represents the features from the front and from the back, respectively, towards the temporal centre of the video to be processed. This improves the operability of the solution.
Optionally, on the basis of the third embodiment corresponding to Fig. 2, in a fourth optional embodiment of the information processing method provided by the embodiments of the present application, calculating the feature representation result corresponding to the at least one video frame feature sequence according to the first representation result and the second representation result may comprise:
calculating the feature representation result corresponding to the at least one video frame feature sequence as

h = [h→_{T/2}, h←_{T/2}]

where h denotes the feature representation result of a video frame feature sequence, h→ denotes the first representation result, h← denotes the second representation result, x_t denotes the video frame feature at time t, GRU(·) denotes processing with a gated recurrent unit (GRU) neural network, T denotes the total duration of the video to be processed, and t is an integer from 1 to T.
In this embodiment, the forward-backward recurrent neural network can compress and represent the features from the front and from the back, respectively, towards the temporal centre of the video. Specifically, for the video frame feature sequence x_t, t ∈ [1, T], of a given scale,
the forward recurrent neural network is:

h→_t = GRU(x_t, h→_{t-1}), t ∈ [1, T/2]

and the backward recurrent neural network is:

h←_t = GRU(x_t, h←_{t+1}), t ∈ [T, T/2]

where h→_t is the intermediate feature representation of the forward recurrent neural network, so the first representation result can also be written as h→_{T/2}; h←_t is the intermediate feature representation of the backward recurrent neural network, so the second representation result can also be written as h←_{T/2}; and GRU(·) is the recurrent gate unit function, whose concrete form is:

z_t = σ_g(W_z·x_t + U_z·h_{t-1} + b_z)
r_t = σ_g(W_r·x_t + U_r·h_{t-1} + b_r)
h_t = (1 - z_t) ∘ h_{t-1} + z_t ∘ σ_h(W_h·x_t + U_h·(r_t ∘ h_{t-1}) + b_h)

where σ_g denotes the sigmoid function, σ_h denotes the hyperbolic tangent function, W_z, W_r, W_h, U_z, U_r, and U_h are linear transformation parameter matrices whose different subscripts indicate the different "gates", b_z, b_r, and b_h are bias parameter vectors, and ∘ denotes the element-wise (Hadamard) product.
The first representation result and the second representation result can then be concatenated to obtain the feature representation result at a given scale, i.e. h = [h→_{T/2}, h←_{T/2}].
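A minimal PyTorch sketch of the forward-backward compression defined by these formulas: one GRU reads the sequence from the start to the centre, another from the end to the centre, and the two hidden states at the temporal centre are concatenated (the feature and hidden sizes are illustrative assumptions):

    import torch
    import torch.nn as nn

    dim, hidden = 1024, 512
    fwd = nn.GRUCell(dim, hidden)             # forward recurrent neural network
    bwd = nn.GRUCell(dim, hidden)             # backward recurrent neural network

    x = torch.randn(100, dim)                 # one video frame feature sequence, T = 100
    mid = len(x) // 2

    h_f = torch.zeros(1, hidden)              # forward pass: t = 1 .. T/2
    for t in range(mid):
        h_f = fwd(x[t].unsqueeze(0), h_f)

    h_b = torch.zeros(1, hidden)              # backward pass: t = T .. T/2
    for t in range(len(x) - 1, mid - 1, -1):
        h_b = bwd(x[t].unsqueeze(0), h_b)

    h = torch.cat([h_f, h_b], dim=1)          # feature representation result [h_fwd, h_bwd]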
In this embodiment, how the feature representation result corresponding to each video frame feature sequence is calculated from the first representation result and the second representation result has thus been specified. Using the above formulas to compute the result provides a feasible way of realising the solution, improving its feasibility and operability.
Optionally, on the basis of the embodiment corresponding to Fig. 2, in a fifth optional embodiment of the information processing method provided by the embodiments of the present application, processing the feature representation results corresponding to the at least one video frame feature sequence with the second neural network model and obtaining the prediction results corresponding to the at least one video frame feature sequence may comprise:
feeding the feature representation result corresponding to the at least one video frame feature sequence into a first submodel of the second neural network model to obtain a third representation result;
feeding the feature representation result corresponding to the at least one video frame feature sequence into a second submodel of the second neural network model to obtain a fourth representation result;
calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result.
This embodiment introduces how the prediction result corresponding to each video frame feature sequence is obtained with the second neural network model.
Specifically, referring to Fig. 6, a structural diagram of the second neural network model in an embodiment of the present application, the whole second neural network model consists of two parts, a first submodel and a second submodel; the first submodel can also be called the "gate representation" and the second submodel the "activation representation". The feature representation result corresponding to each video frame feature sequence is fed into the gate representation, which outputs the corresponding third representation result; at the same time, it is fed into the activation representation, which outputs the corresponding fourth representation result.
Multiplying each third representation result by the corresponding fourth representation result and summing the products yields the prediction result of the video frame feature sequence.
In the embodiments of the present application, after the feature representation result is obtained with the first neural network model, it can thus be further classified with the second neural network model. The feature representation result is passed through two nonlinear transformations to obtain the gate representation and the activation representation; the two are then multiplied and summed to obtain the final feature representation used for classification, which helps improve classification accuracy.
Optionally, on the basis of the fifth embodiment corresponding to Fig. 2, in a sixth optional embodiment of the information processing method provided by the embodiments of the present application, calculating the prediction result corresponding to the at least one video frame feature sequence according to the third representation result and the fourth representation result may comprise:
calculating the prediction result corresponding to the at least one video frame feature sequence using the following formulas:

label = Σ_{n=1..N} g_n·a_n
g_n = σ_g(W_g·h + b_g), n ∈ [1, N]
a_n = σ_a(W_a·h + b_a), n ∈ [1, N]

where label denotes the prediction result of a video frame feature sequence, g_n denotes the third representation result, a_n denotes the fourth representation result, σ_g denotes the softmax function, σ_a denotes the sigmoid function, h denotes the feature representation result of the video frame feature sequence, W_g and b_g are the parameters of the first submodel, W_a and b_a are the parameters of the second submodel, N denotes the total number of representations obtained by applying nonlinear transformations to the feature representation result, and n is an integer from 1 to N.
This embodiment specifies how the prediction result corresponding to each video frame feature sequence is calculated with the corresponding formulas.
First, the feature representation result is passed through nonlinear transformations to obtain the N-way gate representation and activation representation; the third representation result g_n corresponds to the gate representation, and the fourth representation result a_n corresponds to the activation representation. Note that g_n and a_n can be calculated in either order.
After the two representations are obtained, they are multiplied element by element and then summed, which yields the prediction result of a video frame feature sequence.
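A minimal PyTorch sketch of this gate/activation mixture, following the formulas of this embodiment under one common reading in which each of C categories has its own mixture over N expert outputs; C, N, and the feature size are illustrative assumptions:

    import torch
    import torch.nn as nn

    C, N, D = 5, 8, 1024                      # categories, experts, feature size
    gate = nn.Linear(D, C * N)                # first submodel parameters (W_g, b_g)
    act = nn.Linear(D, C * N)                 # second submodel parameters (W_a, b_a)

    h = torch.randn(1, D)                     # feature representation result
    g = torch.softmax(gate(h).view(C, N), dim=1)   # g_n: softmax gate representation
    a = torch.sigmoid(act(h).view(C, N))           # a_n: sigmoid activation representation
    label = (g * a).sum(dim=1)                # label = sum_n g_n * a_n, one score per category
    print(label)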
In this embodiment, how the prediction result corresponding to each video frame feature sequence is calculated from the third representation result and the fourth representation result has thus been specified. Using the above formulas to compute the prediction result provides a feasible way of realising the solution, improving its feasibility and operability.
Optionally, on the basis of Fig. 2 and any one of the first to sixth embodiments corresponding to Fig. 2, in a seventh optional embodiment of the information processing method provided by the embodiments of the present application, after processing the feature representation results corresponding to the at least one video frame feature sequence with the second neural network model and obtaining the prediction results corresponding to the at least one video frame feature sequence, the method may further comprise:
calculating the category of the video to be processed according to the prediction result corresponding to the at least one video frame feature sequence and the weight value corresponding to the at least one video frame feature sequence;
classifying the video to be processed according to its category.
In this embodiment, the server can also calculate the category of the video to be processed according to the prediction result corresponding to each video frame feature sequence and the weight value corresponding to each video frame feature sequence, and classify the video to be processed accordingly.
Specifically, suppose there are at most 5 prediction categories, represented by a "0/1" code of length 5: category 1 is encoded as 00001, category 3 as 00100, and so on; a video to be processed that belongs to both category 1 and category 3 is represented as 00101.
For the whole video to be processed, however, what is obtained is the prediction result corresponding to each video frame feature sequence, so each element of a prediction result is at most 1 and represents the probability that the video to be processed belongs to the corresponding category. For example, {0.01, 0.02, 0.9, 0.005, 1.0} is a valid prediction result, meaning that the video belongs to the first category with probability 1.0, i.e. 100%; to the second category with probability 0.005, i.e. 0.5%; to the third category with probability 0.9, i.e. 90%; to the fourth category with probability 0.02, i.e. 2%; and to the fifth category with probability 0.01, i.e. 1%.
The prediction results are then combined using preset weight values, for example by a weighted sum. Each weight value is learned by linear regression; it is a number that represents the importance of the corresponding video frame feature sequence, and the weight values sum to 1, e.g. {0.1, 0.4, 0.5}. How the category of the video to be processed is calculated is described in detail below.
Suppose the weight values are {0.2, 0.3, 0.5}, the prediction result of video frame feature sequence 1 is {0.01, 0.02, 0.9, 0.005, 1.0}, the prediction result of video frame feature sequence 2 is {0.02, 0.01, 0.9, 0.005, 0.9}, and the prediction result of video frame feature sequence 3 is {0.2, 0.3, 0.8, 0.01, 0.7}. The category of the video to be processed is then expressed as:

{0.2 × 0.01 + 0.3 × 0.02 + 0.5 × 0.2,
 0.2 × 0.02 + 0.3 × 0.01 + 0.5 × 0.3,
 0.2 × 0.9 + 0.3 × 0.9 + 0.5 × 0.8,
 0.2 × 0.005 + 0.3 × 0.005 + 0.5 × 0.01,
 0.2 × 1.0 + 0.3 × 0.9 + 0.5 × 0.7}
= {0.108, 0.157, 0.85, 0.0075, 0.82}

As can be seen from this result, the probability that the video to be processed belongs to the third category is the largest, followed by the first category; the video to be processed can therefore be shown preferentially in the video list of the third category.
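The weighted fusion above can be checked in a few lines of numpy, with the numbers exactly as in the example:

    import numpy as np

    weights = np.array([0.2, 0.3, 0.5])       # weight value of each video frame feature sequence
    preds = np.array([
        [0.01, 0.02, 0.9, 0.005, 1.0],        # prediction of video frame feature sequence 1
        [0.02, 0.01, 0.9, 0.005, 0.9],        # prediction of video frame feature sequence 2
        [0.2,  0.3,  0.8, 0.01,  0.7],        # prediction of video frame feature sequence 3
    ])
    category = weights @ preds                 # weighted sum over the three sequences
    print(category)                            # [0.108  0.157  0.85  0.0075 0.82]
    print(int(category.argmax()))              # index 2: the third category has the largest score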
Further, in the embodiments of the present application, after obtaining the prediction result corresponding to each video frame feature sequence, the server can also calculate the category of the video to be processed according to the prediction result and weight value corresponding to each video frame feature sequence, and finally classify the video to be processed according to its category. Because the prediction results take the temporal features into account when the video to be processed is analysed, this improves video classification ability and has good practicability for personalised recommendation.
The server in the application is described in detail below, referring to Fig. 7, Fig. 7 is to service in the embodiment of the present application Device one embodiment schematic diagram, server 20 include:
a first acquisition module 201, configured to obtain a video to be processed, wherein the video to be processed includes multiple video frames, and each video frame corresponds to a temporal characteristic;
a second acquisition module 202, configured to sample, according to a temporal characteristic sampling rule, the video to be processed obtained by the first acquisition module 201, and obtain at least one video frame characteristic sequence, wherein the temporal characteristic sampling rule is the correspondence between temporal characteristics and video frame characteristic sequences;
a first input module 203, configured to process, through a first neural network model, the at least one video frame characteristic sequence obtained by the second acquisition module 202, and obtain a feature representation result corresponding to each video frame characteristic sequence;
a second input module 204, configured to process, through a second neural network model, the feature representation result corresponding to the at least one video frame characteristic sequence obtained by the first input module 203, and obtain a prediction result corresponding to the at least one video frame characteristic sequence, wherein the prediction result is used to determine the class of the video to be processed.
In the present embodiment, the first acquisition module 201 obtains a video to be processed, wherein the video to be processed includes multiple video frames and each video frame corresponds to a temporal characteristic. The second acquisition module 202 samples, according to the temporal characteristic sampling rule, the video to be processed obtained by the first acquisition module 201 and obtains at least one video frame characteristic sequence, wherein the temporal characteristic sampling rule is the correspondence between temporal characteristics and video frame characteristic sequences. The first input module 203 processes, through the first neural network model, the at least one video frame characteristic sequence obtained by the second acquisition module 202 and obtains the feature representation result corresponding to each video frame characteristic sequence. The second input module 204 processes, through the second neural network model, the feature representation result corresponding to the at least one video frame characteristic sequence obtained by the first input module 203 and obtains the prediction result corresponding to the at least one video frame characteristic sequence, wherein the prediction result is used to determine the class of the video to be processed.
In the embodiment of the present application, a server is provided. The server first obtains a video to be processed, wherein the video to be processed includes multiple video frames and each video frame corresponds to a temporal characteristic. It then samples the video to be processed according to the temporal characteristic sampling rule and obtains at least one video frame characteristic sequence, wherein the temporal characteristic sampling rule is the correspondence between temporal characteristics and video frame characteristic sequences. The server then inputs the at least one video frame characteristic sequence into the first neural network model and obtains the feature representation result corresponding to each video frame characteristic sequence. Finally, the server inputs the feature representation result corresponding to each video frame characteristic sequence into the second neural network model and obtains the prediction result corresponding to each video frame characteristic sequence, the prediction result being used to determine the class of the video to be processed. In this way, in the process of classifying a video, the feature changes of the video in the time dimension are also considered, so that the video content is better expressed, the accuracy of video classification is improved, and the effect of video classification is enhanced.
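The cooperation of modules 201 to 204 can be sketched as follows; the class name, the window-size sampling rule, and the stubbed models are hypothetical stand-ins rather than the embodiment's actual implementation:

    import numpy as np

    class VideoClassificationServer:
        """Illustrative sketch of modules 201-204; model internals are stubbed.
        `first_model` and `second_model` stand in for the first and second
        neural network models; both names are hypothetical."""

        def __init__(self, first_model, second_model, window_sizes=(8, 16, 32)):
            self.first_model = first_model      # recurrent feature compressor
            self.second_model = second_model    # gated classifier
            self.window_sizes = window_sizes    # assumed temporal sampling rule

        def classify(self, temporal_features, weights):
            # Module 202: sample the temporal characteristic sequence at several scales.
            sequences = [temporal_features[:w] for w in self.window_sizes]
            # Module 203: one feature representation result per sequence.
            representations = [self.first_model(s) for s in sequences]
            # Module 204: one prediction vector per sequence.
            predictions = np.stack([self.second_model(r) for r in representations])
            # Weighted fusion with weights learned offline (summing to 1).
            return (np.asarray(weights) @ predictions).argmax()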
Optionally, on the basis of the embodiment corresponding to Fig. 7 above, referring to Fig. 8, in another embodiment of the server 20 provided by the embodiments of the present application, the server 20 further includes:
a processing module 205, configured to, after the first acquisition module 201 obtains the video to be processed, process each video frame in the video to be processed using a convolutional neural network (CNN) and obtain the temporal characteristic corresponding to each video frame;
a determining module 206, configured to determine, according to the temporal characteristic corresponding to each video frame obtained by the processing module 205, the temporal characteristic sequence of the video to be processed, wherein the temporal characteristic sequence is used for sampling.
Secondly, in the embodiment of the present application, after the server obtains the video to be processed, a convolutional neural network may also be used to process each video frame in the video to be processed and obtain the temporal characteristic corresponding to each video frame; these temporal characteristics constitute the temporal characteristic sequence of the entire video to be processed. In this way, a convolutional neural network is trained on and applied to each video frame, which helps improve the accuracy and effect of temporal characteristic extraction.
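As an illustration, an image backbone can play the role of the CNN; the sketch below uses a torchvision ResNet-18 as a stand-in, which the embodiment does not prescribe, and random tensors in place of decoded frames:

    import torch
    import torchvision.models as models

    # Stand-in CNN: a ResNet-18 with its classification head removed, so the
    # network outputs a 512-dim feature per frame. Any image backbone could
    # play this role; the choice here is an assumption.
    backbone = models.resnet18(weights=None)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    # `frames` stands for the decoded video: T frames of 3x224x224 RGB.
    frames = torch.randn(30, 3, 224, 224)

    with torch.no_grad():
        temporal_features = backbone(frames)   # shape (T, 512), one feature per frame

    print(temporal_features.shape)             # torch.Size([30, 512])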
Optionally, on the basis of the embodiment corresponding to Fig. 8 above, referring to Fig. 9, in another embodiment of the server 20 provided by the embodiments of the present application, the second acquisition module 202 includes:
a determination unit 2021, configured to determine at least one time window according to the temporal characteristic sampling rule, wherein each time window includes at least one video frame of the video to be processed;
an extraction unit 2022, configured to extract, from the temporal characteristic sequence, the video frame characteristic sequence corresponding to each time window determined by the determination unit 2021.
Again, the embodiment of the present application illustrates a method of extracting video frame characteristic sequences at different scales: at least one time window is first determined according to the temporal characteristic sampling rule, wherein each time window includes at least one video frame of the video to be processed, and the video frame characteristic sequence corresponding to each time window is then extracted from the temporal characteristic sequence. In this way, video frame characteristic sequences at different scales can be obtained, yielding multiple different samples for feature training, which helps improve the accuracy of the video classification result.
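A minimal sketch of multi-scale time-window sampling is given below; the concrete window sizes and the consecutive-window rule are assumptions, since the embodiment does not fix a specific sampling rule:

    import numpy as np

    def sample_windows(temporal_features, window_size):
        """Split a (T, D) temporal characteristic sequence into consecutive
        windows of `window_size` frames; a shorter tail window is kept, since
        each time window only needs to contain at least one video frame."""
        return [temporal_features[i:i + window_size]
                for i in range(0, len(temporal_features), window_size)]

    features = np.random.rand(30, 512)          # 30 frames, 512-dim features

    # Three scales of the temporal characteristic sampling rule (illustrative).
    multi_scale = {w: sample_windows(features, w) for w in (5, 10, 15)}
    for w, seqs in multi_scale.items():
        print(w, len(seqs), seqs[0].shape)      # e.g. 5 -> 6 windows of (5, 512)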
Optionally, on the basis of the embodiment corresponding to Fig. 9 above, referring to Fig. 10, in another embodiment of the server 20 provided by the embodiments of the present application, the first input module 203 includes:
a first acquisition unit 2031, configured to input the at least one video frame characteristic sequence into the forward recurrent neural network in the first neural network model and obtain a first expression result;
a second acquisition unit 2032, configured to input each video frame characteristic sequence into the backward recurrent neural network in the first neural network model and obtain a second expression result;
a first computing unit 2033, configured to calculate the feature representation result corresponding to the at least one video frame characteristic sequence according to the first expression result obtained by the first acquisition unit 2031 and the second expression result obtained by the second acquisition unit 2032.
Secondly, in the embodiment of the present application, on the basis of the extracted video frame characteristic sequences, a recurrent neural network based on gated recurrent units may be used to model the video frame characteristic sequences in time. Further, in order to better express information at different time scales, the first neural network model may also be used in this scheme to compress the video features. Because the main content of most videos occurs in the middle part of the video in time, forward and backward recurrent neural networks are used to compress and express the features from the front and from the back, respectively, toward the temporal midpoint of the video to be processed. This improves the operability of the scheme.
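The forward/backward compression toward the temporal midpoint can be sketched with two GRUs as follows; the hidden size and the use of torch.flip to realize the backward pass are implementation assumptions:

    import torch
    import torch.nn as nn

    T, D, H = 30, 512, 256                     # frames, feature dim, hidden dim
    x = torch.randn(1, T, D)                   # one video frame characteristic sequence

    forward_gru = nn.GRU(D, H, batch_first=True)
    backward_gru = nn.GRU(D, H, batch_first=True)

    mid = T // 2
    # Forward pass over the first half: t = 1 .. T/2.
    _, h_fwd = forward_gru(x[:, :mid])
    # Backward pass over the second half, reading from t = T down to T/2,
    # implemented here by reversing that half along the time axis.
    _, h_bwd = backward_gru(x[:, mid:].flip(dims=[1]))

    # Both final hidden states describe the video's temporal midpoint; their
    # concatenation is the feature representation h of the sequence.
    h = torch.cat([h_fwd.squeeze(0), h_bwd.squeeze(0)], dim=-1)
    print(h.shape)                             # torch.Size([1, 512])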
Optionally, on the basis of the embodiment corresponding to Fig. 10 above, referring to Fig. 11, in another embodiment of the server 20 provided by the embodiments of the present application, the first computing unit 2033 includes:
a first computation subunit 20331, configured to calculate the feature representation result corresponding to the at least one video frame characteristic sequence using the following formula (reconstructed here from the definitions that follow):
h = [h^f_{T/2}, h^b_{T/2}], with h^f_t = GRU(x_t, h^f_{t−1}), t ∈ [1, T/2], and h^b_t = GRU(x_t, h^b_{t+1}), t ∈ [T, T/2];
wherein h denotes the feature representation result of a video frame characteristic sequence, h^f denotes the first expression result, h^b denotes the second expression result, x_t denotes the video frame characteristic sequence at time t, GRU(·) denotes processing with a gated recurrent unit (GRU) neural network, T denotes the total time of the video to be processed, and t is an integer from 1 to T.
Again, the embodiment of the present application specifically describes how the feature representation result corresponding to each video frame characteristic sequence is calculated according to the first expression result and the second expression result. In this way, the result can be calculated with the relevant formula, providing a feasible way to realize the scheme and improving its feasibility and operability.
Optionally, on the basis of the embodiment corresponding to Fig. 7 above, referring to Fig. 12, in another embodiment of the server 20 provided by the embodiments of the present application, the second input module 204 includes:
a third acquisition unit 2041, configured to input the feature representation result corresponding to each video frame characteristic sequence into the first submodel in the second neural network model to obtain a third expression result;
a fourth acquisition unit 2042, configured to input the feature representation result corresponding to each video frame characteristic sequence into the second submodel in the second neural network model to obtain a fourth expression result;
a second computing unit 2043, configured to calculate the prediction result corresponding to each video frame characteristic sequence according to the third expression result obtained by the third acquisition unit 2041 and the fourth expression result obtained by the fourth acquisition unit 2042.
Secondly, in the embodiment of the present application, after the feature representation result is obtained using the first neural network model, the feature representation result can be further classified using the second neural network model. In this way, the feature representation result can be transformed nonlinearly into a gate expression and an activation expression, the two expressions can then be multiplied elementwise and summed, and the final representation obtained in this way is used for classification, which helps improve the accuracy of classification.
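A sketch of such a gate-plus-activation classifier follows; the two linear submodels and the number of branches N are assumptions consistent with the formulas in the next embodiment:

    import torch
    import torch.nn as nn

    class GatedExpertClassifier(nn.Module):
        """Sketch of the second neural network model: N branches, each with a
        softmax gate g_n and a sigmoid activation a_n, combined by elementwise
        product and summed into the final prediction."""

        def __init__(self, feat_dim, num_classes, num_experts=2):
            super().__init__()
            self.num_experts = num_experts
            # First submodel: gate. Second submodel: activation.
            self.gate = nn.Linear(feat_dim, num_classes * num_experts)
            self.act = nn.Linear(feat_dim, num_classes * num_experts)

        def forward(self, h):
            B = h.shape[0]
            g = self.gate(h).view(B, -1, self.num_experts).softmax(dim=-1)
            a = self.act(h).view(B, -1, self.num_experts).sigmoid()
            return (g * a).sum(dim=-1)          # one score per class

    model = GatedExpertClassifier(feat_dim=512, num_classes=5)
    pred = model(torch.randn(1, 512))
    print(pred.shape)                           # torch.Size([1, 5])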
Optionally, on the basis of the embodiment corresponding to Fig. 12 above, referring to Fig. 13, in another embodiment of the server 20 provided by the embodiments of the present application, the second computing unit 2043 includes:
a second computation subunit 20431, configured to calculate the prediction result corresponding to each video frame characteristic sequence using the following formulas (the first, combining formula is reconstructed here from the definitions that follow):
lable = Σ_{n=1}^{N} g_n · a_n;
g_n = σ_g(W_g h + b_g), n ∈ [1, N];
a_n = σ_a(W_a h + b_a), n ∈ [1, N];
wherein lable denotes the prediction result of a video frame characteristic sequence, g_n denotes the third expression result, a_n denotes the fourth expression result, σ_g denotes the softmax function, σ_a denotes the sigmoid function, h denotes the feature representation result of the video frame characteristic sequence, W_g and b_g denote the parameters of the first submodel, W_a and b_a denote the parameters of the second submodel, N denotes the total number obtained after performing the nonlinear transformation on the feature representation result, and n is an integer from 1 to N.
Again, the embodiment of the present application specifically describes how the prediction result corresponding to each video frame characteristic sequence is calculated according to the third expression result and the fourth expression result. In this way, the prediction result can be calculated with the relevant formula, providing a feasible way to realize the scheme and improving its feasibility and operability.
Optionally, on the basis of the embodiment corresponding to any one of Fig. 7 to Fig. 13 above, referring to Fig. 14, in another embodiment of the server 20 provided by the embodiments of the present application, the server 20 further includes:
a computing module 207, configured to, after the second input module 204 processes, through the second neural network model, the feature representation result corresponding to the at least one video frame characteristic sequence and obtains the prediction result corresponding to the at least one video frame characteristic sequence, calculate the class of the video to be processed according to the prediction result corresponding to the at least one video frame characteristic sequence and the weighted value corresponding to the at least one video frame characteristic sequence;
a categorization module 208, configured to classify the video to be processed according to the class of the video to be processed calculated by the computing module 207.
Further, in the embodiment of the present application, after the server obtains the prediction result corresponding to each video frame characteristic sequence, it can also calculate the class of the video to be processed according to the prediction result corresponding to each video frame characteristic sequence and the weighted value corresponding to each video frame characteristic sequence, and finally classify the video to be processed according to that class. In this way, because prediction results that take temporal characteristics into account are referenced when the video to be processed is analyzed, the video classification capability is improved, so that personalized recommendation can be realized with good practicability.
Fig. 15 is a schematic structural diagram of a server provided by the embodiments of the present application. The server 300 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 322 (for example, one or more processors), memory 332, and one or more storage media 330 (such as one or more mass storage devices) storing application programs 342 or data 344. The memory 332 and the storage medium 330 may provide transient or persistent storage. The program stored in the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processing unit 322 may be configured to communicate with the storage medium 330 and execute, on the server 300, the series of instruction operations in the storage medium 330.
The server 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in Fig. 15.
In the embodiment of the present application, the CPU 322 included in the server has the following functions:
obtain a video to be processed, wherein the video to be processed includes multiple video frames, and each video frame corresponds to a temporal characteristic;
sample the video to be processed according to the temporal characteristic sampling rule and obtain at least one video frame characteristic sequence, wherein the temporal characteristic sampling rule is the correspondence between temporal characteristics and video frame characteristic sequences;
process the at least one video frame characteristic sequence through the first neural network model and obtain the feature representation result corresponding to each video frame characteristic sequence;
process, through the second neural network model, the feature representation result corresponding to the at least one video frame characteristic sequence and obtain the prediction result corresponding to the at least one video frame characteristic sequence, wherein the prediction result is used to determine the class of the video to be processed.
Optionally, the CPU 322 is further configured to perform the following steps:
process each video frame in the video to be processed using a convolutional neural network (CNN) and obtain the temporal characteristic corresponding to each video frame;
determine, according to the temporal characteristic corresponding to each video frame, the temporal characteristic sequence of the video to be processed, wherein the temporal characteristic sequence is used for sampling.
Optionally, the CPU 322 is specifically configured to perform the following steps:
determine at least one time window according to the temporal characteristic sampling rule, wherein each time window includes at least one video frame of the video to be processed;
extract, from the temporal characteristic sequence, the video frame characteristic sequence corresponding to each time window.
Optionally, the CPU 322 is specifically configured to perform the following steps:
input the at least one video frame characteristic sequence into the forward recurrent neural network in the first neural network model and obtain the first expression result;
input the at least one video frame characteristic sequence into the backward recurrent neural network in the first neural network model and obtain the second expression result;
calculate, according to the first expression result and the second expression result, the feature representation result corresponding to the at least one video frame characteristic sequence.
Optionally, the CPU 322 is specifically configured to perform the following steps:
calculate the feature representation result corresponding to the at least one video frame characteristic sequence using the following formula (reconstructed here from the definitions that follow):
h = [h^f_{T/2}, h^b_{T/2}], with h^f_t = GRU(x_t, h^f_{t−1}), t ∈ [1, T/2], and h^b_t = GRU(x_t, h^b_{t+1}), t ∈ [T, T/2];
wherein h denotes the feature representation result of a video frame characteristic sequence, h^f denotes the first expression result, h^b denotes the second expression result, x_t denotes the video frame characteristic sequence at time t, GRU(·) denotes processing with a gated recurrent unit (GRU) neural network, T denotes the total time of the video to be processed, and t is an integer from 1 to T.
Optionally, the CPU 322 is specifically configured to perform the following steps:
input the feature representation result corresponding to the at least one video frame characteristic sequence into the first submodel in the second neural network model and obtain the third expression result;
input the feature representation result corresponding to the at least one video frame characteristic sequence into the second submodel in the second neural network model and obtain the fourth expression result;
calculate, according to the third expression result and the fourth expression result, the prediction result corresponding to the at least one video frame characteristic sequence.
Optionally, the CPU 322 is specifically configured to perform the following steps:
calculate the prediction result corresponding to the at least one video frame characteristic sequence using the following formulas (the first, combining formula is reconstructed here from the definitions that follow):
lable = Σ_{n=1}^{N} g_n · a_n;
g_n = σ_g(W_g h + b_g), n ∈ [1, N];
a_n = σ_a(W_a h + b_a), n ∈ [1, N];
wherein lable denotes the prediction result of a video frame characteristic sequence, g_n denotes the third expression result, a_n denotes the fourth expression result, σ_g denotes the softmax function, σ_a denotes the sigmoid function, h denotes the feature representation result of the video frame characteristic sequence, W_g and b_g denote the parameters of the first submodel, W_a and b_a denote the parameters of the second submodel, N denotes the total number obtained after performing the nonlinear transformation on the feature representation result, and n is an integer from 1 to N.
Optionally, the CPU 322 is further configured to perform the following steps:
calculate the class of the video to be processed according to the prediction result corresponding to the at least one video frame characteristic sequence and the weighted value corresponding to the at least one video frame characteristic sequence;
classify the video to be processed according to the class of the video to be processed.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that the computer can store, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a digital versatile disc (DVD)), or a semiconductor medium (for example, a solid state disk (SSD)), etc.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as smart homes, intelligent wearable devices, virtual assistants, smart speakers, intelligent marketing, unmanned and autonomous driving, drones, robots, intelligent medical care, and intelligent customer service. It is believed that, with the development of the technology, artificial intelligence will be applied in more fields and play an increasingly important role.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and details are not described here again.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary; the division of the units is only a logical functional division, and there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features; such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (12)

1. A method of video classification, characterized by comprising:
obtaining a video to be processed, wherein the video to be processed includes multiple video frames, and each video frame corresponds to a temporal characteristic;
sampling the video to be processed according to a temporal characteristic sampling rule and obtaining at least one video frame characteristic sequence, wherein the temporal characteristic sampling rule is the correspondence between temporal characteristics and video frame characteristic sequences;
processing the at least one video frame characteristic sequence through a first neural network model to obtain a feature representation result corresponding to each video frame characteristic sequence, wherein the first neural network model includes a forward recursive neural network and a backward recursive neural network;
processing, through a second neural network model, the feature representation result corresponding to the at least one video frame characteristic sequence to obtain a prediction result corresponding to the at least one video frame characteristic sequence, wherein the second neural network model includes a first submodel and a second submodel;
determining the class of the video to be processed according to the prediction result corresponding to the at least one video frame characteristic sequence.
2. A method of information processing, characterized by comprising:
obtaining a video to be processed, wherein the video to be processed includes multiple video frames, and each video frame corresponds to a temporal characteristic;
sampling the video to be processed according to a temporal characteristic sampling rule and obtaining at least one video frame characteristic sequence, wherein the temporal characteristic sampling rule is the correspondence between temporal characteristics and video frame characteristic sequences;
processing the at least one video frame characteristic sequence through a first neural network model to obtain a feature representation result corresponding to each video frame characteristic sequence, wherein the first neural network model includes a forward recursive neural network and a backward recursive neural network;
processing, through a second neural network model, the feature representation result corresponding to the at least one video frame characteristic sequence to obtain a prediction result corresponding to the at least one video frame characteristic sequence, wherein the prediction result is used to determine the class of the video to be processed, and the second neural network model includes a first submodel and a second submodel.
3. The method according to claim 2, characterized in that, after the obtaining of the video to be processed, the method further comprises:
processing each video frame in the video to be processed using a convolutional neural network (CNN) to obtain the temporal characteristic corresponding to each video frame;
determining, according to the temporal characteristic corresponding to each video frame, the temporal characteristic sequence of the video to be processed, wherein the temporal characteristic sequence is used for sampling.
4. The method according to claim 3, characterized in that the sampling of the video to be processed according to the temporal characteristic sampling rule and the obtaining of the at least one video frame characteristic sequence comprise:
determining at least one time window according to the temporal characteristic sampling rule, wherein each time window includes at least one video frame of the video to be processed;
extracting, from the temporal characteristic sequence, the video frame characteristic sequence corresponding to each time window.
5. The method according to claim 2, characterized in that the processing of the at least one video frame characteristic sequence through the first neural network model to obtain the feature representation result corresponding to each video frame characteristic sequence comprises:
inputting the at least one video frame characteristic sequence into the forward recursive neural network in the first neural network model to obtain a first expression result;
inputting the at least one video frame characteristic sequence into the backward recursive neural network in the first neural network model to obtain a second expression result;
calculating, according to the first expression result and the second expression result, the feature representation result corresponding to the at least one video frame characteristic sequence.
6. The method according to claim 5, characterized in that the calculating, according to the first expression result and the second expression result, of the feature representation result corresponding to the at least one video frame characteristic sequence comprises:
calculating the feature representation result corresponding to the at least one video frame characteristic sequence using the following formula (reconstructed here from the definitions that follow):
h = [h^f_{T/2}, h^b_{T/2}], with h^f_t = GRU(x_t, h^f_{t−1}), t ∈ [1, T/2], and h^b_t = GRU(x_t, h^b_{t+1}), t ∈ [T, T/2];
wherein h denotes the feature representation result of a video frame characteristic sequence, h^f denotes the first expression result, h^b denotes the second expression result, x_t denotes the video frame characteristic sequence at time t, GRU(·) denotes processing with a gated recurrent unit (GRU) neural network, T denotes the total time of the video to be processed, and t is an integer from 1 to T.
7. The method according to claim 2, characterized in that the processing, through the second neural network model, of the feature representation result corresponding to the at least one video frame characteristic sequence to obtain the prediction result corresponding to the at least one video frame characteristic sequence comprises:
inputting the feature representation result corresponding to the at least one video frame characteristic sequence into the first submodel in the second neural network model to obtain a third expression result;
inputting the feature representation result corresponding to the at least one video frame characteristic sequence into the second submodel in the second neural network model to obtain a fourth expression result;
calculating, according to the third expression result and the fourth expression result, the prediction result corresponding to the at least one video frame characteristic sequence.
8. The method according to claim 7, characterized in that the calculating, according to the third expression result and the fourth expression result, of the prediction result corresponding to the at least one video frame characteristic sequence comprises:
calculating the prediction result corresponding to the at least one video frame characteristic sequence using the following formulas (the first, combining formula is reconstructed here from the definitions that follow):
lable = Σ_{n=1}^{N} g_n · a_n;
g_n = σ_g(W_g h + b_g), n ∈ [1, N];
a_n = σ_a(W_a h + b_a), n ∈ [1, N];
wherein lable denotes the prediction result of a video frame characteristic sequence, g_n denotes the third expression result, a_n denotes the fourth expression result, σ_g denotes the softmax function, σ_a denotes the sigmoid function, h denotes the feature representation result of the video frame characteristic sequence, W_g and b_g denote the parameters of the first submodel, W_a and b_a denote the parameters of the second submodel, N denotes the total number obtained after performing the nonlinear transformation on the feature representation result, and n is an integer from 1 to N.
9. The method according to any one of claims 1 to 8, characterized in that, after the processing, through the second neural network model, of the feature representation result corresponding to the at least one video frame characteristic sequence and the obtaining of the prediction result corresponding to the at least one video frame characteristic sequence, the method further comprises:
calculating the class of the video to be processed according to the prediction result corresponding to the at least one video frame characteristic sequence and the weighted value corresponding to the at least one video frame characteristic sequence;
classifying the video to be processed according to the class of the video to be processed.
10. A server, characterized by comprising:
a first acquisition module, configured to obtain a video to be processed, wherein the video to be processed includes multiple video frames, and each video frame corresponds to a temporal characteristic;
a second acquisition module, configured to sample, according to a temporal characteristic sampling rule, the video to be processed obtained by the first acquisition module, and obtain at least one video frame characteristic sequence, wherein the temporal characteristic sampling rule is the correspondence between temporal characteristics and video frame characteristic sequences;
a first input module, configured to process, through a first neural network model, the at least one video frame characteristic sequence obtained by the second acquisition module, and obtain a feature representation result corresponding to each video frame characteristic sequence, wherein the first neural network model includes a forward recursive neural network and a backward recursive neural network;
a second input module, configured to process, through a second neural network model, the feature representation result corresponding to the at least one video frame characteristic sequence obtained by the first input module, and obtain a prediction result corresponding to the at least one video frame characteristic sequence, wherein the prediction result is used to determine the class of the video to be processed, and the second neural network model includes a first submodel and a second submodel.
11. A server, characterized by comprising: a memory, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, specifically comprising the following steps:
obtaining a video to be processed, wherein the video to be processed includes multiple video frames, and each video frame corresponds to a temporal characteristic;
sampling the video to be processed according to a temporal characteristic sampling rule and obtaining at least one video frame characteristic sequence, wherein the temporal characteristic sampling rule is the correspondence between temporal characteristics and video frame characteristic sequences;
processing the at least one video frame characteristic sequence through a first neural network model to obtain a feature representation result corresponding to each video frame characteristic sequence, wherein the first neural network model includes a forward recursive neural network and a backward recursive neural network;
processing, through a second neural network model, the feature representation result corresponding to the at least one video frame characteristic sequence to obtain a prediction result corresponding to the at least one video frame characteristic sequence, wherein the prediction result is used to determine the class of the video to be processed, and the second neural network model includes a first submodel and a second submodel;
and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate.
12. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of claim 1 or the method of any one of claims 2 to 9.
CN201910834142.0A 2017-09-15 2017-09-15 Video classification method, information processing method and server Active CN110532996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910834142.0A CN110532996B (en) 2017-09-15 2017-09-15 Video classification method, information processing method and server

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710833668.8A CN109508584B (en) 2017-09-15 2017-09-15 Video classification method, information processing method and server
CN201910834142.0A CN110532996B (en) 2017-09-15 2017-09-15 Video classification method, information processing method and server

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201710833668.8A Division CN109508584B (en) 2017-09-15 2017-09-15 Video classification method, information processing method and server

Publications (2)

Publication Number Publication Date
CN110532996A true CN110532996A (en) 2019-12-03
CN110532996B CN110532996B (en) 2021-01-22

Family

ID=65723493

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201710833668.8A Active CN109508584B (en) 2017-09-15 2017-09-15 Video classification method, information processing method and server
CN201910834142.0A Active CN110532996B (en) 2017-09-15 2017-09-15 Video classification method, information processing method and server

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201710833668.8A Active CN109508584B (en) 2017-09-15 2017-09-15 Video classification method, information processing method and server

Country Status (7)

Country Link
US (1) US10956748B2 (en)
EP (1) EP3683723A4 (en)
JP (1) JP7127120B2 (en)
KR (1) KR102392943B1 (en)
CN (2) CN109508584B (en)
MA (1) MA50252A (en)
WO (1) WO2019052301A1 (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11636681B2 (en) * 2018-11-21 2023-04-25 Meta Platforms, Inc. Anticipating future video based on present video
JP7352369B2 (en) * 2019-03-29 2023-09-28 株式会社日立システムズ Predictive model evaluation system, predictive model evaluation method
CN111782734B (en) * 2019-04-04 2024-04-12 华为技术服务有限公司 Data compression and decompression method and device
CN110162669B (en) * 2019-04-04 2021-07-02 腾讯科技(深圳)有限公司 Video classification processing method and device, computer equipment and storage medium
KR102255312B1 (en) * 2019-06-07 2021-05-25 국방과학연구소 Codec classification system using recurrent neural network and methods thereof
CN110263216B (en) * 2019-06-13 2022-01-28 腾讯科技(深圳)有限公司 Video classification method, video classification model training method and device
CN113010735B (en) * 2019-12-20 2024-03-08 北京金山云网络技术有限公司 Video classification method and device, electronic equipment and storage medium
CN111144508A (en) * 2019-12-30 2020-05-12 中国矿业大学(北京) Automatic control system and control method for coal mine auxiliary shaft rail transportation
CN111190600B (en) * 2019-12-31 2023-09-19 中国银行股份有限公司 Method and system for automatically generating front-end codes based on GRU attention model
CN111104930B (en) * 2019-12-31 2023-07-11 腾讯科技(深圳)有限公司 Video processing method, device, electronic equipment and storage medium
CN111209439B (en) * 2020-01-10 2023-11-21 北京百度网讯科技有限公司 Video clip retrieval method, device, electronic equipment and storage medium
CN111259779B (en) * 2020-01-13 2023-08-01 南京大学 Video motion detection method based on center point track prediction
CN111209883B (en) * 2020-01-13 2023-08-04 南京大学 Sequential self-adaptive video classification method based on multi-source motion feature fusion
CN111428660B (en) * 2020-03-27 2023-04-07 腾讯科技(深圳)有限公司 Video editing method and device, storage medium and electronic device
US11354906B2 (en) * 2020-04-13 2022-06-07 Adobe Inc. Temporally distributed neural networks for video semantic segmentation
CN111489378B (en) * 2020-06-28 2020-10-16 腾讯科技(深圳)有限公司 Video frame feature extraction method and device, computer equipment and storage medium
CN111737521B (en) * 2020-08-04 2020-11-24 北京微播易科技股份有限公司 Video classification method and device
CN113204992B (en) * 2021-03-26 2023-10-27 北京达佳互联信息技术有限公司 Video quality determining method and device, storage medium and electronic equipment
CN113349791A (en) * 2021-05-31 2021-09-07 平安科技(深圳)有限公司 Abnormal electrocardiosignal detection method, device, equipment and medium
CN113204655B (en) * 2021-07-02 2021-11-23 北京搜狐新媒体信息技术有限公司 Multimedia information recommendation method, related device and computer storage medium
CN113779472A (en) * 2021-07-30 2021-12-10 阿里巴巴(中国)有限公司 Content auditing method and device and electronic equipment
KR102430989B1 (en) 2021-10-19 2022-08-11 주식회사 노티플러스 Method, device and system for predicting content category based on artificial intelligence
CN114443896B (en) * 2022-01-25 2023-09-15 百度在线网络技术(北京)有限公司 Data processing method and method for training predictive model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544498A (en) * 2013-09-25 2014-01-29 华中科技大学 Video content detection method and video content detection system based on self-adaption sampling
CN104951965A (en) * 2015-06-26 2015-09-30 深圳市腾讯计算机***有限公司 Advertisement delivery method and device
CN106131627A (en) * 2016-07-07 2016-11-16 腾讯科技(深圳)有限公司 A kind of method for processing video frequency, Apparatus and system
CN106779467A (en) * 2016-12-31 2017-05-31 成都数联铭品科技有限公司 Enterprises ' industry categorizing system based on automatic information screening

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100656373B1 (en) * 2005-12-09 2006-12-11 한국전자통신연구원 Method for discriminating obscene video using priority and classification-policy in time interval and apparatus thereof
US8990132B2 (en) * 2010-01-19 2015-03-24 James Ting-Ho Lo Artificial neural networks based on a low-order model of biological neural networks
CN104331442A (en) * 2014-10-24 2015-02-04 华为技术有限公司 Video classification method and device
US10762894B2 (en) * 2015-03-27 2020-09-01 Google Llc Convolutional neural networks
JP6556509B2 (en) * 2015-06-16 2019-08-07 Cyberdyne株式会社 Photoacoustic imaging apparatus and light source unit
CN104966104B (en) * 2015-06-30 2018-05-11 山东管理学院 A kind of video classification methods based on Three dimensional convolution neutral net
US9697833B2 (en) * 2015-08-25 2017-07-04 Nuance Communications, Inc. Audio-visual speech recognition with scattering operators
CN106503723A (en) * 2015-09-06 2017-03-15 华为技术有限公司 A kind of video classification methods and device
CN105550699B (en) * 2015-12-08 2019-02-12 北京工业大学 A kind of video identification classification method based on CNN fusion space-time remarkable information
JP6517681B2 (en) * 2015-12-17 2019-05-22 日本電信電話株式会社 Image pattern learning apparatus, method and program
US11055537B2 (en) * 2016-04-26 2021-07-06 Disney Enterprises, Inc. Systems and methods for determining actions depicted in media contents based on attention weights of media content frames
US10402697B2 (en) * 2016-08-01 2019-09-03 Nvidia Corporation Fusing multilayer and multimodal deep neural networks for video classification
US11263525B2 (en) * 2017-10-26 2022-03-01 Nvidia Corporation Progressive modification of neural networks
US10334202B1 (en) * 2018-02-28 2019-06-25 Adobe Inc. Ambient audio generation based on visual information
US20190286990A1 (en) * 2018-03-19 2019-09-19 AI Certain, Inc. Deep Learning Apparatus and Method for Predictive Analysis, Classification, and Feature Detection
US10860858B2 (en) * 2018-06-15 2020-12-08 Adobe Inc. Utilizing a trained multi-modal combination model for content and text-based evaluation and distribution of digital video content to client devices
US10418957B1 (en) * 2018-06-29 2019-09-17 Amazon Technologies, Inc. Audio event detection
US10699129B1 (en) * 2019-11-15 2020-06-30 Fudan University System and method for video captioning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103544498A (en) * 2013-09-25 2014-01-29 华中科技大学 Video content detection method and video content detection system based on self-adaption sampling
CN104951965A (en) * 2015-06-26 2015-09-30 深圳市腾讯计算机***有限公司 Advertisement delivery method and device
CN106131627A (en) * 2016-07-07 2016-11-16 腾讯科技(深圳)有限公司 A kind of method for processing video frequency, Apparatus and system
CN106779467A (en) * 2016-12-31 2017-05-31 成都数联铭品科技有限公司 Enterprises ' industry categorizing system based on automatic information screening

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SHAOXIANG CHEN 等: "Aggregating Frame-level Features for Large-Scale Video classification", 《ARXIV:1707.00803V1 [CS.CV]》 *

Also Published As

Publication number Publication date
KR20190133040A (en) 2019-11-29
WO2019052301A1 (en) 2019-03-21
MA50252A (en) 2020-07-22
JP2020533709A (en) 2020-11-19
CN110532996B (en) 2021-01-22
EP3683723A4 (en) 2021-06-23
US20190384985A1 (en) 2019-12-19
US10956748B2 (en) 2021-03-23
JP7127120B2 (en) 2022-08-29
CN109508584A (en) 2019-03-22
EP3683723A1 (en) 2020-07-22
KR102392943B1 (en) 2022-04-29
CN109508584B (en) 2022-12-02

Similar Documents

Publication Publication Date Title
CN110532996A (en) The method of visual classification, the method for information processing and server
US20220239988A1 (en) Display method and apparatus for item information, device, and computer-readable storage medium
CN109902546A (en) Face identification method, device and computer-readable medium
CN110263324A (en) Text handling method, model training method and device
CN109522450A (en) A kind of method and server of visual classification
Das et al. Where to focus on for human action recognition?
EP4002161A1 (en) Image retrieval method and apparatus, storage medium, and device
CN110222717A (en) Image processing method and device
CN110503076A (en) Video classification methods, device, equipment and medium based on artificial intelligence
CN111582342A (en) Image identification method, device, equipment and readable storage medium
CN110245310B (en) Object behavior analysis method, device and storage medium
CN109871736A (en) The generation method and device of natural language description information
CN115131698B (en) Video attribute determining method, device, equipment and storage medium
Dai et al. Video scene segmentation using tensor-train faster-RCNN for multimedia IoT systems
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN112668486A (en) Method, device and carrier for identifying facial expressions of pre-activated residual depth separable convolutional network
CN114358109A (en) Feature extraction model training method, feature extraction model training device, sample retrieval method, sample retrieval device and computer equipment
CN116310318A (en) Interactive image segmentation method, device, computer equipment and storage medium
CN114360018B (en) Rendering method and device of three-dimensional facial expression, storage medium and electronic device
CN113657272B (en) Micro video classification method and system based on missing data completion
CN112836755B (en) Sample image generation method and system based on deep learning
Liu et al. Classification of fashion article images based on improved random forest and VGG-IE algorithm
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN113705301A (en) Image processing method and device
CN112330362A (en) Rapid data intelligent analysis method for internet mall user behavior habits

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant