CN110347873A - Video classification method, apparatus, electronic device and storage medium - Google Patents
- Publication number: CN110347873A (application CN201910562350.XA)
- Authority
- CN
- China
- Prior art keywords
- feature
- network
- video
- trained
- key frame
- Prior art date
- Legal status
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
Present disclose provides a kind of video classification methods, device, electronic equipment and computer readable storage mediums, are related to technical field of image processing, and the video classification methods include: to carry out sparse sampling to video to be processed to obtain multiple key frames;The multiple key frame is handled by the feature extraction network in preset model, to extract the feature of the multiple key frame;The feature of the multiple key frame is merged by attention network trained in the preset model, and fused feature is handled to obtain the classification results of the video to be processed.The disclosure can reduce calculation amount, improve visual classification speed and efficiency.
Description
Technical field
The present disclosure relates to the technical field of image processing, and in particular to a video classification method, a video classification apparatus, an electronic device, and a computer-readable storage medium.
Background technique
With the development of video technology, users can obtain a wide variety of videos through multiple channels. Because the number of videos is extremely large, classifying them makes it easier for users to find and use the videos they need, improving the user experience.
In the related art, video classification methods include methods based on long short-term memory (LSTM) networks, methods based on 3D convolution, and methods based on two-stream networks.
In these approaches, the network structures are large and the number of parameters to compute is large, so processing is slow. In addition, when handling inter-frame information, these approaches perform global operations on individual frames, which wastes computing resources; and because inter-frame information cannot be exploited, the classification results may be inaccurate.
It should be noted that the information disclosed in the Background section above is only intended to enhance understanding of the background of the present disclosure, and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Summary of the invention
An object of the present disclosure is to provide a video classification method, apparatus, electronic device, and computer-readable storage medium, thereby overcoming, at least to some extent, the problem of slow video classification caused by the limitations and defects of the related art.
Other features and advantages of the disclosure will become apparent from the following detailed description, or may be learned in part through practice of the disclosure.
According to one aspect of the present disclosure, a video classification method is provided, including: performing sparse sampling on a video to be processed to obtain a plurality of key frames; processing the plurality of key frames through a feature extraction network in a preset model to extract features of the plurality of key frames; and fusing the features of the plurality of key frames through a trained attention network in the preset model, and processing the fused features to obtain a classification result for the video to be processed.
In an exemplary embodiment of the present disclosure, the feature extraction network includes a residual network, and processing the plurality of key frames through the feature extraction network in the preset model to extract the features of the plurality of key frames includes: taking the plurality of key frames as one batch and inputting the batch into the residual network to extract the features of the plurality of key frames.
In an exemplary embodiment of the present disclosure, fusing the features of the plurality of key frames through the trained attention network in the preset model, and processing the fused features to obtain the classification result for the video to be processed, includes: inputting the features of the plurality of key frames into the trained attention network to obtain fused features; and determining, according to the fused features, the probability that the video to be processed belongs to each category, so as to determine the classification result according to the probabilities.
In an exemplary embodiment of the present disclosure, before inputting the features of the plurality of key frames into the trained attention network to obtain the fused features, the method further includes: fixing the residual network and training the attention network to obtain the trained attention network.
In an exemplary embodiment of the present disclosure, the method further includes: after obtaining the trained attention network, training the preset model to obtain a trained preset model.
In an exemplary embodiment of the present disclosure, training the preset model to obtain the trained preset model includes: training the preset model end to end to obtain the trained preset model.
In an exemplary embodiment of the present disclosure, the method further includes: compressing the trained preset model based on a regression loss; and/or adjusting the parameter types of the trained preset model.
According to one aspect of the present disclosure, a video classification apparatus is provided, including: a key frame acquisition module, configured to perform sparse sampling on a video to be processed to obtain a plurality of key frames; a feature extraction module, configured to process the plurality of key frames through a feature extraction network in a preset model to extract features of the plurality of key frames; and a classification result determination module, configured to fuse the features of the plurality of key frames through a trained attention network in the preset model, and process the fused features to obtain a classification result for the video to be processed.
According to one aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to execute, via the executable instructions, the video classification method described in any one of the above.
According to one aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the computer program, when executed by a processor, implementing the video classification method described in any one of the above.
In the video classification method, apparatus, electronic device, and computer-readable storage medium provided by the present exemplary embodiments, the features of the key frames of the video to be processed are extracted and fused by an attention network in order to classify the video. On the one hand, extracting the features of the plurality of key frames of the video through the feature extraction network in the preset model reduces the number of parameters input to the feature extraction network, and because the network structure of the feature extraction network is small, the number of parameters to process is reduced; this avoids the time wasted in the related art by extracting features from all frames of the video, improves the speed of feature extraction, and improves processing efficiency. On the other hand, the features of the plurality of key frames are fused by the attention network to obtain the classification result, so that the information between different frames is processed jointly; this avoids the step, in the related art, of performing a global operation on each individual key frame, reducing the waste of computing resources and the resource consumption. Moreover, the inter-frame information can be used effectively, so the video to be processed can be classified accurately, improving the precision of the classification result.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Detailed description of the invention
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure. It is apparent that the drawings in the following description are only some embodiments of the disclosure; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 schematically shows a diagram of the video classification method in an exemplary embodiment of the disclosure.
Fig. 2 schematically shows a structural diagram of the preset model in an exemplary embodiment of the disclosure.
Fig. 3 schematically shows a flowchart for determining the classification result in an exemplary embodiment of the disclosure.
Fig. 4 schematically shows the overall flowchart of classifying a video in an exemplary embodiment of the disclosure.
Fig. 5 schematically shows a block diagram of the video classification apparatus in an exemplary embodiment of the disclosure.
Fig. 6 schematically shows a diagram of the electronic device in an exemplary embodiment of the disclosure.
Specific embodiment
Example embodiments will now be described more fully with reference to the accompanying drawings. However, example embodiments can be implemented in a variety of forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the disclosure will be more thorough and complete, and will fully convey the concepts of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of the embodiments of the disclosure. Those skilled in the art will recognize, however, that the technical solutions of the disclosure may be practiced while omitting one or more of the specific details, or other methods, components, devices, steps, and so on may be employed. In other instances, well-known solutions are not shown or described in detail to avoid obscuring aspects of the disclosure.
In addition, the drawings are merely schematic illustrations of the disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, so their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software, in one or more hardware modules or integrated circuits, or in different network and/or processor devices and/or microcontroller devices.
In the present exemplary embodiment, a video classification method is first provided, which can be applied to any scenario in which photos, videos, or pictures are classified. Next, with reference to Fig. 1, the video classification method in the present exemplary embodiment is described in detail.
In step S110, sparse sampling is performed on the video to be processed to obtain a plurality of key frames.
In the present exemplary embodiment, the video to be processed may include a large number of videos stored in a file on a terminal (for example, videos in a smart terminal's photo album) or a large number of videos uploaded to and stored on an information exchange platform. The specific type of the video to be processed can be determined according to the actual functional requirements; for example, when classification is needed, the video to be processed refers to the video to be classified.
Because the differences between consecutive frames of the video to be processed are small, it is not necessary in the present exemplary embodiment to use every frame of the video as input to the subsequent processing. To select a subset of frames for processing, the video to be processed can be sampled. Sampling refers to the process of sampling the video at intervals in the time domain. Different sampling rates yield sampling results of different sparsity. For example, given a video to be processed, such as a video sequence V of length T, the video sequence can be divided evenly into T+1 segments, each containing the same number of video frames, and then one frame is randomly selected from each segment as a sample. In this way, a plurality of key frames of the video to be processed is obtained from the T+1 segments. In the present exemplary embodiment, obtaining the plurality of key frames by sparse sampling reduces the number of samples while keeping the data within a fidelity range, and also reduces the number of parameters input to the feature extraction network, thereby reducing the amount of computation.
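The segment-and-sample strategy above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name and parameters are hypothetical, and one frame index is drawn uniformly at random from each equal-length segment.

```python
import random

def sparse_sample(num_frames, num_segments):
    """Divide a video's frame indices into equal segments and randomly
    pick one key-frame index per segment (segment-based sparse sampling)."""
    seg_len = num_frames // num_segments
    indices = []
    for s in range(num_segments):
        start = s * seg_len
        # sample uniformly inside this segment
        indices.append(start + random.randrange(seg_len))
    return indices

# e.g. a 300-frame video reduced to 8 key frames
key_frames = sparse_sample(num_frames=300, num_segments=8)
```

Because one index comes from each successive segment, the sampled indices are strictly increasing, and only `num_segments` frames (rather than all frames) are passed to the feature extraction network.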
With continued reference to Fig. 1, in step S120, the plurality of key frames is processed through the feature extraction network in the preset model to extract the features of the plurality of key frames.
In the present exemplary embodiment, the preset model refers to the entire model used to process the plurality of key frames and obtain the classification result for the video to be processed. The preset model mainly includes two parts: the first part is the feature extraction network, and the second part is the attention network. The features of the plurality of key frames can be represented concretely as feature vectors.
The feature extraction network is described first. The feature extraction network is mainly used to extract the features of the plurality of key frames of each video input to it. The feature extraction network may be any network model capable of extracting features, such as a suitable machine learning model, which may include but is not limited to a convolutional neural network, a recurrent neural network, a residual network model, and the like. If the feature extraction network is a convolutional neural network, it may include multiple convolutional layers and pooling layers: each convolutional layer extracts different features, and the pooling layers reduce dimensionality to extract the main features, so that subsequent processing is performed on the main features as the final features.
In the present exemplary embodiment, the feature extraction network uses a network backbone designed for the PC side, but performing feature extraction with a mobile-oriented network such as MobileNet or ThunderNet also falls within the scope of protection of this application.
In the present exemplary embodiment, if the feature extraction network is a residual network, the specific process of processing the plurality of key frames through the feature extraction network to extract their features includes: taking the plurality of key frames as one batch and inputting the batch into the residual network to extract the features of the plurality of key frames. The residual network can be any of a variety of residual networks, such as an 18-layer residual network or a 34-layer residual network; the 18-layer residual network ResNet18 is used here as an example.
A residual network is composed of residual blocks (the difference between output and input); it uses identity mappings to pass the output of an earlier layer directly to a later layer. Suppose the input of a section of the neural network is x and the desired output is H(x). In a residual network, the input x can be passed directly to the output as a starting point, so the target to be learned is the residual H(x) - x rather than the complete output.
Constructing a ResNet network means stacking many such residual blocks: an ordinary convolutional network is turned into a residual network by adding skip connections, with one shortcut added every two layers to form a residual block. For example, 5 residual blocks linked together constitute a residual network. Each of the sequentially connected residual blocks contains an identity mapping and at least two convolutional layers, with the identity mapping going from the input end of the block to its output end. The specific network structure, number of layers, and so on of the residual network can be configured according to requirements such as computing resource consumption and recognition performance, and are not particularly limited here. It should be noted that the ResNet18 used for the encoding part in this step is a pre-trained model, and therefore does not need to be trained or optimized in the present exemplary embodiment.
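The residual computation H(x) = F(x) + x described above can be illustrated with a toy block. This is a sketch under simplifying assumptions: dense matrix multiplications stand in for the block's convolutional layers, and the weights and dimensions are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: the layers learn only the residual
    F(x) = H(x) - x, and the identity shortcut adds x back.
    Dense layers stand in for the two convolutions of a real block."""
    f = relu(w1 @ x)      # first "convolution" + activation
    f = w2 @ f            # second "convolution"
    return relu(f + x)    # identity shortcut, then activation

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w1 = rng.standard_normal((16, 16)) * 0.1
w2 = rng.standard_normal((16, 16)) * 0.1
y = residual_block(x, w1, w2)
```

Note that with all-zero weights the block reduces to the identity (up to the final activation), which is why stacking many such blocks does not suffer the vanishing-gradient problem of equally deep plain networks.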
In the present exemplary embodiment, after the plurality of key frames of the videos to be processed is obtained in step S110, these key frames can be taken as one batch. The batch size is a hyperparameter that defines the number of samples to process before the internal model parameters are updated. Batch processing can be thought of as iterating over one or more samples and making predictions. At the end of a batch, the predictions are compared with the expected output variables and an error is computed. From this error, an update algorithm improves the model, for example by moving down the error gradient. When all samples are used to form a single batch, the learning algorithm is called batch gradient descent. Since all the key frames form one batch, the frequency and number of updates to the network can be reduced.
Specifically, the plurality of key frames is input as one batch into the first residual block of the residual network; each residual block receives the output of the previous residual block and performs feature extraction on it based on a first convolutional layer, a second convolutional layer, and a third convolutional layer; the output of the third convolutional layer is obtained and, together with the output of the previous residual block, is passed to the next residual block; finally, the output of the last residual block of the residual network is obtained as the features of the plurality of key frames.
In the present exemplary embodiment, because an 18-layer residual network is used for feature extraction, the network has a strong ability to extract features from images, while its small number of layers keeps the number of network parameters low. The residual structure solves the vanishing-gradient problem caused by overly deep networks, so feature extraction can be performed with a deeper network structure, ensuring the accuracy of feature extraction while reducing the amount of computation.
Through the methods in step S110 and step S120, a plurality of key frames is obtained by sparse sampling of the video to be processed, so that not every frame of the video is used as input to the next step, reducing the number of input parameters. Moreover, the residual network is capable of extracting features from images, and its small number of layers further reduces the number of parameters. In this way, extracting the features of the key frames through sparse sampling and a feature extraction network with few layers reduces the number of parameters that need to be transmitted and computed, saving computing resources.
With continued reference to Fig. 1, in step S130, the features of the plurality of key frames are fused through the trained attention network in the preset model, and the fused features are processed to obtain the classification result for the video to be processed.
In the present exemplary embodiment, the preset model refers to the trained preset model. Fig. 2 schematically shows the concrete structure of the preset model. With reference to Fig. 2, in addition to the feature extraction network and the attention network, the preset model may further include a BN layer, a fully connected layer, and a softmax layer, so that a multi-label classification result can be obtained from the vector output by the softmax layer. The feature extraction network is ResNet18 with its softmax removed; it takes one batch of frames as input and outputs the feature vectors of the plurality of key frames. The attention network is connected to the feature extraction network; its input is the feature vectors of the plurality of key frames, and its output is the fused vector. The BN layer is connected to the attention network and normalizes each neuron to accelerate training and improve model accuracy. The fully connected layer (fully connected layers, FC) is connected to the BN layer and plays the role of a classifier in the overall convolutional neural network. The softmax layer is connected to the fully connected layer and outputs the final prediction vector, where each dimension of the prediction vector represents the probability of the corresponding category.
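The head of the model described above (BN layer, fully connected layer, softmax) can be sketched in numpy. This is a simplified, hypothetical illustration: it takes an already-fused vector `c` as input, and the normalization is shown per-vector for brevity, whereas a real BN layer at inference uses running batch statistics.

```python
import numpy as np

def softmax(z):
    z = z - z.max()           # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify_head(c, gamma, beta, W, b, eps=1e-5):
    """Head of the preset model after attention fusion: BN-style
    normalization of the fused vector c, a fully connected layer
    acting as the classifier, then softmax per-class probabilities."""
    h = (c - c.mean()) / np.sqrt(c.var() + eps)  # normalization
    h = gamma * h + beta                         # learned scale/shift
    logits = W @ h + b                           # fully connected layer
    return softmax(logits)

rng = np.random.default_rng(1)
c = rng.standard_normal(128)                 # fused feature vector
gamma, beta = np.ones(128), np.zeros(128)    # BN parameters
W, b = rng.standard_normal((5, 128)) * 0.1, np.zeros(5)  # 5 categories
probs = classify_head(c, gamma, beta, W, b)
```

Each entry of `probs` is the model's probability for one category, matching the prediction vector Fig. 2 describes.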
Because a convolutional neural network by itself cannot fuse inter-frame information, an attention network can be used to fuse the extracted features of multiple different key frames in order to obtain the classification result for the video to be processed. The attention network can be an inter-frame attention network; its input can be the features of the batch formed by the plurality of key frames obtained in step S120, and its output is the fused vector.
Fig. 3 schematically shows the flowchart for determining the classification result. With reference to Fig. 3, it mainly includes step S310 and step S320, in which:
In step S310, the features of the plurality of key frames are input into the trained attention network to obtain the fused features.
In this step, the attention network refers to a network based on an attention mechanism. An attention mechanism allows a neural network to focus on only a subset of its input, that is, to select specific inputs. The attention mechanism can be applied to any type of input regardless of its shape, such as a matrix-form input like an image, or a vector.
To guarantee the accuracy of the fused features, the attention network can first be trained before the fused features are computed, so that the features of the plurality of key frames of the video to be processed are fused by the trained attention network. The specific process of training the attention network may include: fixing the residual network and training the attention network to obtain the trained attention network. That is, during the training of the entire model, because the ResNet18 used for the encoding part is a pre-trained model, its parameters are fixed first and only the subsequent attention network is trained; when the loss function of the attention network stabilizes, training of the attention network stops, yielding the trained attention network. Specifically, the attention network in the present exemplary embodiment can be expressed as in Formula (1):

c = Σ_i α_i · a_i    Formula (1)

where a_i is the i-th vector input to the attention network, i.e., the feature vector of the i-th key frame, and c is the computed fusion vector of the features of the plurality of key frames. The weights of the input vectors are computed as shown in Formula (2) and Formula (3):

e_i = w^T a_i    Formula (2)

α_i = exp(e_i) / Σ_j exp(e_j)    Formula (3)

where w is the parameter learned during training; with the learned parameter, the trained attention network can compute the fused vector c of the features of the plurality of key frames.
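The fusion defined by Formulas (1) to (3) amounts to a softmax-weighted average of the per-frame feature vectors. A minimal numpy sketch, with hypothetical dimensions (8 key frames, 128-dimensional features):

```python
import numpy as np

def attention_fuse(A, w):
    """Inter-frame attention fusion per Formulas (1)-(3):
    e_i = w^T a_i, alpha_i = softmax(e)_i, c = sum_i alpha_i * a_i.
    A has shape (T, D): one D-dim feature vector per key frame."""
    e = A @ w                             # Formula (2): one score per frame
    e = e - e.max()                       # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()   # Formula (3): softmax weights
    c = alpha @ A                         # Formula (1): weighted fusion
    return c, alpha

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 128))   # features of 8 key frames
w = rng.standard_normal(128)        # learned attention parameter w
c, alpha = attention_fuse(A, w)
```

The fused vector `c` keeps the feature dimensionality of a single frame while weighting informative frames more heavily, which is what lets the model use inter-frame information without a global operation per frame.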
When training the attention network, first, the image data of the plurality of key frames can be obtained, and the category to which the video to be processed belongs can be labeled manually; then, the attention network is trained using the labels and the image data, continually adjusting the weight of each convolution kernel in the attention network until the predicted category matches the manually set category, thereby obtaining the trained attention network.
The specific steps of fusion by the attention network may include: taking the information of the entire convolutional layer as input, the initial point of focus is obtained to represent attention to different locations. After the attention vector is obtained, the product of the previous attention vector and the convolutional-layer vectors can be taken; the resulting vector represents the location information of the attended point. After the location information and timing information are combined and fed into the network, the new location vector and the predicted output probability information are computed at the current time step. The output is continually combined with the convolutional layer to generate new location-point information, thereby obtaining new attention, and the new attention combined with the input yields new output information. In the present exemplary embodiment, ResNet18 with its softmax removed serves as the feature extraction network; the batch formed from the plurality of key frames is input to it, the corresponding feature vectors of the plurality of key frames are output, and the inter-frame attention network is then connected to obtain the fusion vector corresponding to the feature vectors.
On this basis, inter-frame information can be used effectively by the attention network, avoiding the step in the related art of performing a global operation on each individual key frame, reducing the waste of computing resources and the resource consumption. The fused vector can represent the features of the video to be processed more accurately, allowing more accurate classification. In addition, because the attention network can make effective use of inter-frame information, the video to be processed can be classified precisely based on that information.
After the trained attention network is obtained, the entire preset model can be trained to obtain the trained preset model. For example, the feature extraction network and the attention network are fine-tuned until the predicted class of a video to be processed matches the manually set category, so as to obtain a well-performing trained preset model and thereby improve the precision of video classification through the preset model. When the preset model is trained, end-to-end training can be used. End-to-end training may include: a prediction result is obtained from the input end to the output end, and comparing it with the ground truth yields an error; this error is propagated back through each layer of the model (backpropagation), and the representation of each layer is adjusted according to the error until the model converges or achieves the desired effect. End-to-end training involves no extra separate processing: from raw data input to task result output, the entire training and prediction process is completed inside the model. For example, there are no separate sub-models in the overall model; instead, a single neural network connects the input end to the output end and takes over the functions of all the original modules. End-to-end training reduces the number of operating steps and improves training efficiency.
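The two-stage schedule described above (fix the residual network and train only the attention network, then fine-tune the whole model end to end) can be sketched with a parameter dictionary and a set of trainable names. This is purely illustrative: the "gradient" is a stand-in that decays parameters toward zero, so only the freezing/unfreezing logic is meaningful.

```python
import numpy as np

def train(params, trainable, steps, lr=0.1):
    """One training stage: update only parameters whose name is in
    `trainable`. The gradient here is a placeholder (the parameter
    itself), so each step just shrinks trainable parameters."""
    for _ in range(steps):
        for name in trainable:
            grad = params[name]                  # placeholder gradient
            params[name] = params[name] - lr * grad
    return params

params = {"backbone": np.ones(4), "attention": np.ones(4)}

# Stage 1: fix the residual network, train only the attention network.
train(params, trainable={"attention"}, steps=10)
backbone_after_stage1 = params["backbone"].copy()

# Stage 2: fine-tune the whole preset model end to end.
train(params, trainable={"backbone", "attention"}, steps=10)
```

In a real framework the same effect is usually achieved by disabling gradient computation on the frozen backbone during stage 1 and re-enabling it for the end-to-end fine-tuning stage.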
It should be added that the entire trained preset model can be adjusted further to optimize performance, specifically by the following adjustment methods. First, the trained preset model is compressed based on a regression loss; that is, model pruning can be applied to each layer of the preset model. A neural network has numerous parameters, but some of them contribute little to the final output and are redundant, so these redundant parameters need to be cut. The model pruning method can, for example, prune according to weight values. In the present exemplary embodiment, the number of channels of the preset model can be adjusted based on a LASSO regression loss, removing channels whose regression loss is small and which have little influence on the classification result, so as to reduce the amount of computation. Pruning the trained preset model improves the running speed and reduces the model file size.
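The idea of removing low-contribution channels can be sketched as follows. Note the simplification: the patent selects channels via a LASSO regression loss, whereas this sketch ranks channels by a simpler L1-magnitude criterion purely to illustrate channel removal; the function name and shapes are hypothetical.

```python
import numpy as np

def prune_channels(W, keep_ratio=0.5):
    """Magnitude-based channel pruning: rank the output channels of a
    weight matrix by L1 norm and keep only the strongest fraction.
    (A stand-in for the LASSO-based channel selection in the text.)"""
    norms = np.abs(W).sum(axis=1)              # per-channel L1 norm
    k = max(1, int(W.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(norms)[-k:])     # indices of kept channels
    return W[keep], keep

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 16))               # 8 output channels
W_pruned, kept = prune_channels(W, keep_ratio=0.5)
```

After pruning, downstream layers must also drop the corresponding input channels so that shapes stay consistent, which is where the model-size and speed gains come from.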
Second, the parameter types of the trained preset model are adjusted. Specifically, the parameter type in the preset model is generally float32; in the present exemplary embodiment, the parameters can be truncated from float32 to float16, thereby reducing the model size and the consumption of computing resources without affecting the computation results.
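The float32-to-float16 truncation can be shown in a few lines of numpy; the weights here are random stand-ins for real model parameters.

```python
import numpy as np

rng = np.random.default_rng(4)
weights32 = rng.standard_normal(1000).astype(np.float32)

# Truncate parameters from float32 to float16: half the storage,
# at the cost of roughly three decimal digits of precision.
weights16 = weights32.astype(np.float16)

storage_saved = weights32.nbytes - weights16.nbytes  # 2000 bytes here
```

For typical weight magnitudes the rounding error is on the order of 1e-3, which is usually negligible for inference, matching the claim that the computation results are not materially affected.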
It should be noted that, in the present exemplary embodiment, model compression alone may be performed, the parameter type adjustment alone may be performed, or model compression and parameter type adjustment may be performed together, in order to increase the running speed and reduce the consumption of computing resources.
Next, in step S320, the probability that the video to be processed belongs to each category is determined according to the fused feature, and the classification result is determined according to that probability.
In this step, the classification result can be expressed in terms of the probabilities that the video to be processed belongs to the respective categories. Specifically, a probability threshold may be set in advance; when a probability value is greater than or equal to the threshold, it can be determined that the video to be processed belongs to the corresponding category.
After the fused feature is obtained, it can be input into a BN (batch normalization) layer to be normalized, then into a fully connected layer for classification, and further into a softmax layer to obtain a prediction vector. Each dimension of the prediction vector gives the probability that the video to be processed belongs to one category, so that the classification result can be determined from the probability values.
For example, if the probability threshold is 0.7, and the probability that video 1 to be processed belongs to category 1 is 0.9 while its probability of belonging to category 2 is 0.1, the classification result of video 1 is determined to be category 1.
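The normalize → fully connected → softmax → threshold chain can be sketched as below. The feature dimension, weights, and category count are toy values chosen for illustration, not taken from the embodiment, and the per-vector normalization is a simplified stand-in for a trained BN layer (which would use learned scale and shift parameters at inference):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
fused = rng.normal(size=512)                  # fused feature vector (toy)
fused = (fused - fused.mean()) / fused.std()  # simplified BN-style normalization
W = rng.normal(size=(3, 512)) * 0.05          # hypothetical FC weights, 3 categories
b = np.zeros(3)

probs = softmax(W @ fused + b)                # probability per category

threshold = 0.7                               # the example threshold from the text
labels = [i for i, p in enumerate(probs) if p >= threshold]
print("probabilities:", np.round(probs, 3), "labels above threshold:", labels)
```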
In the present exemplary embodiment, the video to be processed is classified by a preset model made up of a residual network and an attention network. Compared with the related art, this reduces the number of parameters and the time consumed, without losing much precision. Meanwhile, the attention network makes effective use of the information shared among the multiple different key frames and saves computing resources.
Fig. 4 schematically shows the overall flow of video classification. Referring to Fig. 4, the flow mainly includes the following steps:
In step S401, the video to be processed is subjected to frame-cutting processing; specifically, sparse sampling may be used to extract multiple key frames of the video to be processed.
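One common way to realize such sparse sampling, shown here as a hypothetical sketch (the embodiment does not fix a particular sampling rule), is to split the frame index range into equal segments and take the center frame of each segment:

```python
def sparse_sample(num_frames: int, num_keyframes: int) -> list[int]:
    """Pick num_keyframes indices spread evenly over [0, num_frames)."""
    step = num_frames / num_keyframes
    # Take the middle of each of the num_keyframes equal segments.
    return [int(step * i + step / 2) for i in range(num_keyframes)]

# 8 key-frame indices out of a 300-frame video.
print(sparse_sample(300, 8))  # → [18, 56, 93, 131, 168, 206, 243, 281]
```

Sampling a fixed, small number of frames in this way is what keeps the input to the feature extraction network small regardless of the video's length.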
In step S402, the multiple key frames are input into a basic feature extraction network. The feature extraction network here may be the residual network ResNet18, which yields vectors representing the features.
In step S403, the vector representing the high-dimensional feature corresponding to each key frame is obtained.
In step S404, the high-dimensional features are input into the attention network to obtain a fused vector.
In step S405, the classification result is obtained according to the fused vector. Specifically, the fused vector is input into the BN layer, the fully connected layer, and the softmax layer to obtain the probabilities that the video to be processed belongs to the respective categories, and the classification result is then determined according to these probabilities.
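Steps S403 and S404 can be sketched with a minimal attention-style fusion. The scoring rule and the dimensions below are assumptions made for illustration; the embodiment does not disclose the attention network's exact formulation:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(3)
num_keyframes, feat_dim = 8, 512
features = rng.normal(size=(num_keyframes, feat_dim))  # one row per key frame (S403)
w = rng.normal(size=feat_dim) * 0.1                    # hypothetical learned scoring vector

# S404: score each key-frame feature, normalize the scores with softmax,
# and take the weighted sum as the fused vector.
scores = features @ w          # one scalar score per key frame
alpha = softmax(scores)        # attention weights, sum to 1
fused = alpha @ features       # fused vector, shape (feat_dim,)
print(fused.shape, float(alpha.sum()))
```

The fused vector produced here is what step S405 then passes to the BN, fully connected, and softmax layers.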
In conclusion the technical solution in the present exemplary embodiment, first carries out sparse sampling to video to be processed, is closed
Key frame simultaneously carries out feature extraction by residual error network.For the feature extracted, carried out using attention network further
Fusion Features obtain the fusion feature between different key frames, final output prediction result.By this method, reduce defeated
Enter to the parameter of feature extraction network, and since the network structure of feature extraction network is smaller, reduces the parameter of processing
Quantity avoids the waste of time caused by the feature for extracting all frames of video to be processed in the related technology, it is special to improve extraction
The efficiency and speed of sign.In addition, can merge to the feature of multiple key frames, avoiding in the related technology can be to each list
Only key frame carries out the step of global operation, reduces the waste to computing resource, reduces resource consumption.In addition to this,
It further uses model pruning method to be handled, compact model parameter amount and speed can be promoted.
In the present exemplary embodiment, a video classification apparatus is further provided. Referring to Fig. 5, the apparatus 500 may include:

a key frame obtaining module 501, configured to perform sparse sampling on a video to be processed to obtain multiple key frames;

a feature extraction module 502, configured to process the multiple key frames through a feature extraction network in a preset model, so as to extract the features of the multiple key frames; and

a classification result determining module 503, configured to fuse the features of the multiple key frames through a trained attention network in the preset model, and to process the fused feature to obtain the classification result of the video to be processed.
In an exemplary embodiment of the present disclosure, the feature extraction network includes a residual network, and the feature extraction module is configured to: take the multiple key frames as one batch, and input the batch into the residual network to extract the features of the multiple key frames.
In an exemplary embodiment of the present disclosure, the classification result determining module includes: a feature fusion module, configured to input the features of the multiple key frames into the trained attention network to obtain the fused feature; and a probability calculation module, configured to determine, according to the fused feature, the probability that the video to be processed belongs to each category, so as to determine the classification result according to the probability.
In an exemplary embodiment of the present disclosure, before the features of the multiple key frames are input into the trained attention network to obtain the fused feature, the apparatus further includes: a network training module, configured to fix the residual network and train the attention network, so as to obtain the trained attention network.
In an exemplary embodiment of the present disclosure, the apparatus further includes: a preset model training module, configured to train the preset model after the trained attention network is obtained, so as to obtain a trained preset model.
In an exemplary embodiment of the present disclosure, the preset model training module includes: a training control module, configured to train the preset model end to end, so as to obtain the trained preset model.
In an exemplary embodiment of the present disclosure, the apparatus further includes: a model compression module, configured to compress the trained preset model based on a regression loss; and/or a parameter adjustment module, configured to adjust the parameter type of the trained preset model.
It should be noted that the details of each module in the above video classification apparatus have already been described in detail in the corresponding method, and are therefore not repeated here.
It should be noted that although several modules or units of the device for performing actions are mentioned in the above detailed description, this division is not mandatory. In fact, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
In addition, although the steps of the method of the present disclosure are depicted in the accompanying drawings in a particular order, this does not require or imply that these steps must be executed in that particular order, or that all the illustrated steps must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step, and/or one step may be decomposed into multiple steps.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those of ordinary skill in the art will appreciate that various aspects of the present invention may be implemented as a system, a method, or a program product. Therefore, various aspects of the present invention may take the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software, which may be collectively referred to herein as a "circuit", a "module", or a "system".
An electronic device 600 according to this embodiment of the present invention is described below with reference to Fig. 6. The electronic device 600 shown in Fig. 6 is merely an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 6, the electronic device 600 takes the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: the above-mentioned at least one processing unit 610, the above-mentioned at least one storage unit 620, and a bus 630 connecting the different system components (including the storage unit 620 and the processing unit 610).
The storage unit stores program code that can be executed by the processing unit 610, so that the processing unit 610 performs the steps of the various exemplary embodiments of the present invention described in the above "Exemplary Methods" section of this specification. For example, the processing unit 610 may perform the steps shown in Fig. 1.
The storage unit 620 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 6201 and/or a cache storage unit 6202, and may further include a read-only storage unit (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set of (at least one) program modules 6205. Such program modules 6205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The bus 630 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, and a processing unit or local bus using any of a variety of bus structures.
The display unit 640 may be a display having a display function, which shows the processing results obtained by the processing unit 610 executing the method in the present exemplary embodiment. The display includes, but is not limited to, a liquid crystal display or other displays.
The electronic device 600 may also communicate with one or more external devices 700 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 600, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may be carried out through an input/output (I/O) interface 650. Moreover, the electronic device 600 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 660. As shown, the network adapter 660 communicates with the other modules of the electronic device 600 through the bus 630. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions causing a computing device (which may be a personal computer, a server, a terminal apparatus, a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium is further provided, on which a program product capable of implementing the above method of this specification is stored. In some possible embodiments, various aspects of the present invention may also be implemented in the form of a program product including program code; when the program product runs on a terminal device, the program code causes the terminal device to perform the steps of the various exemplary embodiments of the present invention described in the above "Exemplary Methods" section of this specification.
The program product for implementing the above method according to the embodiments of the present invention may adopt a portable compact disc read-only memory (CD-ROM) including program code, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program, which may be used by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which readable program code is carried. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
The program code contained on a readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
Program code for carrying out the operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In addition, the above-mentioned figures are merely schematic illustrations of the processing included in the method according to the exemplary embodiments of the present invention, and are not intended to be limiting. It is easy to understand that the processing shown in the figures neither indicates nor limits the chronological order of these processes. It is also easy to understand that these processes may be executed, for example, synchronously or asynchronously in multiple modules.
Other embodiments of the present disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or conventional techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of the present disclosure are indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (10)
1. A video classification method, comprising:
performing sparse sampling on a video to be processed to obtain multiple key frames;
processing the multiple key frames through a feature extraction network in a preset model, so as to extract the features of the multiple key frames; and
fusing the features of the multiple key frames through a trained attention network in the preset model, and processing the fused feature to obtain a classification result of the video to be processed.
2. The video classification method according to claim 1, wherein the feature extraction network comprises a residual network, and processing the multiple key frames through the feature extraction network in the preset model to extract the features of the multiple key frames comprises:
taking the multiple key frames as one batch, and inputting the batch into the residual network, so as to extract the features of the multiple key frames.
3. The video classification method according to claim 1, wherein fusing the features of the multiple key frames through the trained attention network in the preset model and processing the fused feature to obtain the classification result of the video to be processed comprises:
inputting the features of the multiple key frames into the trained attention network to obtain a fused feature; and
determining, according to the fused feature, the probability that the video to be processed belongs to each category, so as to determine the classification result according to the probability.
4. The video classification method according to claim 1, wherein before the features of the multiple key frames are input into the trained attention network to obtain the fused feature, the method further comprises:
fixing the residual network and training the attention network, so as to obtain the trained attention network.
5. The video classification method according to claim 1, further comprising:
after the trained attention network is obtained, training the preset model to obtain a trained preset model.
6. The video classification method according to claim 5, wherein training the preset model to obtain the trained preset model comprises:
training the preset model end to end, so as to obtain the trained preset model.
7. The video classification method according to claim 5, further comprising:
compressing the trained preset model based on a regression loss; and/or
adjusting the parameter type of the trained preset model.
8. A video classification apparatus, comprising:
a key frame obtaining module, configured to perform sparse sampling on a video to be processed to obtain multiple key frames;
a feature extraction module, configured to process the multiple key frames through a feature extraction network in a preset model, so as to extract the features of the multiple key frames; and
a classification result determining module, configured to fuse the features of the multiple key frames through a trained attention network in the preset model, and to process the fused feature to obtain a classification result of the video to be processed.
9. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform, via execution of the executable instructions, the video classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the video classification method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910562350.XA CN110347873B (en) | 2019-06-26 | 2019-06-26 | Video classification method and device, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110347873A true CN110347873A (en) | 2019-10-18 |
CN110347873B CN110347873B (en) | 2023-04-07 |
Family
ID=68183260
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910562350.XA Active CN110347873B (en) | 2019-06-26 | 2019-06-26 | Video classification method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110347873B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190163981A1 (en) * | 2017-11-28 | 2019-05-30 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for extracting video preview, device and computer storage medium |
CN109359592A (en) * | 2018-10-16 | 2019-02-19 | 北京达佳互联信息技术有限公司 | Processing method, device, electronic equipment and the storage medium of video frame |
CN109862391A (en) * | 2019-03-18 | 2019-06-07 | 网易(杭州)网络有限公司 | Video classification methods, medium, device and calculating equipment |
Non-Patent Citations (1)
Title |
---|
吴昌等 (Wu Chang et al.): "一种新的基于RVM的视频关键帧语义提取算法" (A new RVM-based semantic extraction algorithm for video key frames), 《计算机应用研究》 (Application Research of Computers) * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026915A (en) * | 2019-11-25 | 2020-04-17 | Oppo广东移动通信有限公司 | Video classification method, video classification device, storage medium and electronic equipment |
CN111026915B (en) * | 2019-11-25 | 2023-09-15 | Oppo广东移动通信有限公司 | Video classification method, video classification device, storage medium and electronic equipment |
CN111177460A (en) * | 2019-12-20 | 2020-05-19 | 腾讯科技(深圳)有限公司 | Method and device for extracting key frame |
CN111160191B (en) * | 2019-12-23 | 2024-05-14 | 腾讯科技(深圳)有限公司 | Video key frame extraction method, device and storage medium |
CN111160191A (en) * | 2019-12-23 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Video key frame extraction method and device and storage medium |
CN111246124A (en) * | 2020-03-09 | 2020-06-05 | 三亚至途科技有限公司 | Multimedia digital fusion method and device |
CN111611435A (en) * | 2020-04-01 | 2020-09-01 | 中国科学院深圳先进技术研究院 | Video classification method and device and storage medium |
CN111626251A (en) * | 2020-06-02 | 2020-09-04 | Oppo广东移动通信有限公司 | Video classification method, video classification device and electronic equipment |
CN111680624A (en) * | 2020-06-08 | 2020-09-18 | 上海眼控科技股份有限公司 | Behavior detection method, electronic device, and storage medium |
CN111737520A (en) * | 2020-06-22 | 2020-10-02 | Oppo广东移动通信有限公司 | Video classification method, video classification device, electronic equipment and storage medium |
CN111737520B (en) * | 2020-06-22 | 2023-07-25 | Oppo广东移动通信有限公司 | Video classification method, video classification device, electronic equipment and storage medium |
CN111553169B (en) * | 2020-06-25 | 2023-08-25 | 北京百度网讯科技有限公司 | Pruning method and device of semantic understanding model, electronic equipment and storage medium |
CN111553169A (en) * | 2020-06-25 | 2020-08-18 | 北京百度网讯科技有限公司 | Pruning method and device of semantic understanding model, electronic equipment and storage medium |
CN112000842A (en) * | 2020-08-31 | 2020-11-27 | 北京字节跳动网络技术有限公司 | Video processing method and device |
CN112232164A (en) * | 2020-10-10 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Video classification method and device |
CN112613308B (en) * | 2020-12-17 | 2023-07-25 | 中国平安人寿保险股份有限公司 | User intention recognition method, device, terminal equipment and storage medium |
CN112613308A (en) * | 2020-12-17 | 2021-04-06 | 中国平安人寿保险股份有限公司 | User intention identification method and device, terminal equipment and storage medium |
CN112863650A (en) * | 2021-01-06 | 2021-05-28 | 中国人民解放军陆军军医大学第二附属医院 | Cardiomyopathy identification system based on convolution and long-short term memory neural network |
CN113191401A (en) * | 2021-04-14 | 2021-07-30 | 中国海洋大学 | Method and device for three-dimensional model recognition based on visual saliency sharing |
CN113065533A (en) * | 2021-06-01 | 2021-07-02 | 北京达佳互联信息技术有限公司 | Feature extraction model generation method and device, electronic equipment and storage medium |
CN115311584A (en) * | 2022-08-15 | 2022-11-08 | 贵州电网有限责任公司 | Unmanned aerial vehicle high-voltage power grid video inspection floating hanging method based on deep learning |
CN115376052A (en) * | 2022-10-26 | 2022-11-22 | 山东百盟信息技术有限公司 | Long video classification method based on key frame sampling and multi-scale dense network |
CN116824641A (en) * | 2023-08-29 | 2023-09-29 | 卡奥斯工业智能研究院(青岛)有限公司 | Gesture classification method, device, equipment and computer storage medium |
CN116824641B (en) * | 2023-08-29 | 2024-01-09 | 卡奥斯工业智能研究院(青岛)有限公司 | Gesture classification method, device, equipment and computer storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN110347873B (en) | 2023-04-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110347873A (en) | Video classification methods, device, electronic equipment and storage medium | |
EP4024232A1 (en) | Text processing model training method, and text processing method and apparatus | |
CN107423376B (en) | Supervised deep hash rapid picture retrieval method and system | |
JP2022058915A (en) | Method and device for training image recognition model, method and device for recognizing image, electronic device, storage medium, and computer program | |
CN109783655A (en) | A kind of cross-module state search method, device, computer equipment and storage medium | |
CN110599492A (en) | Training method and device for image segmentation model, electronic equipment and storage medium | |
CN110569359B (en) | Training and application method and device of recognition model, computing equipment and storage medium | |
US11423307B2 (en) | Taxonomy construction via graph-based cross-domain knowledge transfer | |
KR101828215B1 (en) | A method and apparatus for learning cyclic state transition model on long short term memory network | |
CN114743196B (en) | Text recognition method and device and neural network training method | |
US20220374678A1 (en) | Method for determining pre-training model, electronic device and storage medium | |
CN111178036B (en) | Text similarity matching model compression method and system for knowledge distillation | |
CN109697724A (en) | Video Image Segmentation method and device, storage medium, electronic equipment | |
US20230009547A1 (en) | Method and apparatus for detecting object based on video, electronic device and storage medium | |
CN113723378B (en) | Model training method and device, computer equipment and storage medium | |
CN116152833B (en) | Training method of form restoration model based on image and form restoration method | |
CN114715145B (en) | Trajectory prediction method, device and equipment and automatic driving vehicle | |
CN115455171A (en) | Method, device, equipment and medium for mutual retrieval and model training of text videos | |
CN112560499A (en) | Pre-training method and device of semantic representation model, electronic equipment and storage medium | |
CN110728359B (en) | Method, device, equipment and storage medium for searching model structure | |
JP7390442B2 (en) | Training method, device, device, storage medium and program for document processing model | |
CN114419327B (en) | Image detection method and training method and device of image detection model | |
US20240038223A1 (en) | Speech recognition method and apparatus | |
CN116310925A (en) | Video counting method, device and equipment for building materials and storage medium | |
CN113592074A (en) | Training method, generating method and device, and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||