CN111382616B - Video classification method and device, storage medium and computer equipment - Google Patents

Video classification method and device, storage medium and computer equipment

Info

Publication number
CN111382616B
CN111382616B CN201811626861.5A
Authority
CN
China
Prior art keywords
neural network
frame image
video
network model
convolution
Prior art date
Legal status
Active
Application number
CN201811626861.5A
Other languages
Chinese (zh)
Other versions
CN111382616A (en)
Inventor
刘汇川
Current Assignee
Bigo Technology Pte Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd filed Critical Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN201811626861.5A priority Critical patent/CN111382616B/en
Publication of CN111382616A publication Critical patent/CN111382616A/en
Application granted granted Critical
Publication of CN111382616B publication Critical patent/CN111382616B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video classification method, device, storage medium and computer equipment. The method comprises: acquiring the time sequence information, the length and width of each frame image of a target video, and the color feature information of each frame image; generating a feature matrix from the time sequence information, the length and width, and the color feature information of each frame image; and inputting the feature matrix into an asymmetric convolutional neural network model to obtain an output matrix containing confidence values. The asymmetric convolutional neural network model extracts the time domain features of the feature matrix by an independent convolution, extracts the space domain features of the feature matrix by another independent convolution, analyzes the time domain and space domain features, and outputs an output matrix representing the confidence values of the categories to which the video belongs. The attribution category of the target video is then determined according to the confidence values in the output matrix. The method improves the accuracy of video classification while performing the classification automatically.

Description

Video classification method and device, storage medium and computer equipment
Technical Field
The invention relates to the technical field of video classification processing, in particular to a video classification method and device based on a neural network, a video convolution network architecture, a storage medium and computer equipment.
Background
In a video classification task, many different types of video elements need to be identified. Besides tasks closely tied to temporal activity, such as common action recognition, person tracking and event recognition, there are tasks that depend little or not at all on time domain information, such as various kinds of object recognition, blur detection and overexposure detection, as well as tasks that depend little on spatial information, such as static-frame and stutter detection.
Existing video classification algorithms are mainly optimized around various deep networks. For tasks that rely more heavily on spatial features, the spatial feature extraction capability of the model has to be strengthened by manually adjusting the convolution kernel structure. Similarly, for tasks that rely more heavily on temporal features, manual adjustment is required at the convolution design stage to emphasize the abstraction of time domain features. Relying on manual adjustment of the algorithm not only wastes human resources, but the accuracy of the manually tuned output is also limited.
Disclosure of Invention
The invention provides a video classification method and device based on a neural network, a video convolution network architecture, a storage medium and computer equipment, so as to improve the accuracy of video classification on the basis of automatically realizing video classification.
The invention provides the following scheme:
a video classification method based on a neural network, comprising: acquiring frame images of a preset quantity value of a target video; acquiring time sequence information of each frame image, length and width of each frame image and color characteristic information of each frame image; generating a feature matrix according to the time sequence information of each frame image, the length and the width of each frame image and the color feature information of each frame image; inputting the feature matrix into an asymmetric convolutional neural network model to obtain an output matrix containing confidence values; the asymmetric convolution neural network model is used for extracting time domain features in the feature matrix in an independent convolution mode, analyzing the time domain features and the space domain features after extracting the space domain features in the feature matrix in an independent convolution mode, and outputting an output matrix for representing confidence values of the attribution categories of videos; and determining the attribution category of the target video according to the confidence value in the output matrix.
In an embodiment, the acquiring the frame image of the preset number value of the target video includes: and acquiring frame images of a preset quantity value of the target video according to the preset transmission frame number per second.
In one embodiment, the generating the feature matrix according to the time sequence information of each frame image, the length and the width of each frame image and the color feature information of each frame image includes: and storing the time sequence information, the length and the width of each frame image and the color data corresponding to the color characteristic information into a data matrix according to the sequence of acquiring the frame images of the target video to obtain the characteristic matrix.
In an embodiment, the asymmetric convolutional neural network model is an Inception neural network model; the Inception neural network model comprises a first convolution module for extracting time domain features in the feature matrix in an independent convolution mode and a second convolution module for extracting space domain features in the feature matrix in an independent convolution mode; the Inception neural network model is used for extracting the time domain features through the first convolution module, extracting the space domain features through the second convolution module, inputting the time domain features and the space domain features into a full-connection layer of the Inception neural network model for connection, and outputting the output matrix containing the confidence values.
In one embodiment, the Inception neural network model is trained by: acquiring a preset number of videos as training samples; the video in the training sample is marked with a category label according to the category to which the video belongs; inputting the frame image of the training sample into the Inception neural network model for model training to obtain each network parameter of the Inception neural network model; and determining the Inception neural network model according to the network parameters.
In an embodiment, inputting the frame image of the training sample into the Inception neural network model for model training, to obtain each network parameter of the Inception neural network model, includes: inputting the frame image of the training sample into the Inception neural network model, calculating the loss between the score vector output by the Inception neural network model and the label vector by using a Sigmoid function and a binary_crossentropy function, carrying out back propagation training on the Inception neural network model by using a stochastic gradient descent method, and obtaining each network parameter of the Inception neural network model when the loss function of the Inception neural network model is converged.
In an embodiment, the determining the attribution category of the target video according to the confidence value in the output matrix includes: comparing the confidence value with a plurality of numerical values preset by a system; and if the confidence value is larger than a certain value in the plurality of values, reading a video category corresponding to the certain value, and taking the video category as the attribution category of the target video.
A neural network-based video classification device, comprising: the first acquisition module is used for acquiring frame images of a preset quantity value of a target video; the second acquisition module is used for acquiring time sequence information of each frame image, length and width of each frame image and color characteristic information of each frame image; the generating module is used for generating a feature matrix according to the time sequence information of each frame image, the length and the width of each frame image and the color feature information of each frame image; the third acquisition module is used for inputting the feature matrix into an asymmetric convolutional neural network model to obtain an output matrix containing confidence values; the asymmetric convolution neural network model is used for extracting time domain features in the feature matrix in an independent convolution mode, analyzing the time domain features and the space domain features after extracting the space domain features in the feature matrix in an independent convolution mode, and outputting an output matrix for representing confidence values of the attribution categories of videos; and the determining module is used for determining the attribution category of the target video according to the confidence value in the output matrix.
A video convolutional network architecture comprising a convolutional module of an I3D network and an asymmetric convolutional neural network model; the asymmetric convolution neural network model is used for analyzing the time domain features and the space domain features after extracting the time domain features in the feature matrix in an independent convolution mode and extracting the space domain features in the feature matrix in an independent convolution mode, and outputting an output matrix for representing the confidence coefficient value of the video attribution category; the convolution module of the I3D network is used for carrying out convolution operation on time sequence information, length and width of an input video frame image and color characteristic information of the frame image, and then inputting a convolution operation result into the asymmetric convolution neural network model to obtain an output matrix of confidence coefficient values of video attribution categories.
A storage medium having a computer program stored thereon; the computer program is adapted to be loaded by a processor and to perform the neural network based video classification method of any of the embodiments described above.
A computer apparatus, comprising: one or more processors; a memory; one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the neural network-based video classification method according to any of the embodiments described above.
According to the video classification method based on the neural network, after acquiring a preset number of frame images of the target video, the server collects the time sequence information, the length and width of each frame image and the color characteristic information of each frame image to generate a feature matrix that is input into an asymmetric convolutional neural network model. The asymmetric convolutional neural network model performs independent time domain feature extraction and independent space domain feature extraction on the input feature matrix, so that after the time domain features and space domain features of the target video are analyzed independently, it outputs an output matrix representing the confidence values of the attribution categories of the video. The attribution category of the target video can then be judged according to the confidence values in the output matrix. In this way, the model's ability to analyze the space domain features and time domain features in the feature matrix of the target video is enhanced automatically, and the accuracy of classifying the target video is improved.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of a method for classifying video based on a neural network according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an embodiment of an InceptionV1 model according to the present invention;
FIG. 3 is a schematic diagram of an asymmetric convolutional neural network model according to an embodiment of the present invention;
FIG. 4 is a table of relevant parameters of convolution models in the InceptionV1 model and the asymmetric convolution neural network model provided by the invention;
FIG. 5 is a schematic diagram of a 3D neural network structure within an asymmetric convolutional neural network model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an embodiment of the TSBlock building block shown in FIG. 5;
FIG. 7 is a schematic diagram of an embodiment of the 3D convolution unit of FIG. 5;
FIG. 8 is a schematic diagram of a structure of an embodiment of the multi-layer perceptron of FIG. 5;
fig. 9 is a schematic structural diagram of an embodiment of a video classification device based on a neural network according to the present invention;
fig. 10 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
It will be understood by those skilled in the art that, unless expressly stated otherwise, the singular forms "a," "an" and "the" are intended to include the plural forms as well, and that the terms "first" and "second" are used herein merely to distinguish technical features of the same kind and do not limit the order, quantity, etc. of those technical features. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Abbreviation and key term definitions
CNN: convolutional neural networks.
RNN: a recurrent neural network.
FC: and (5) fully connecting the neural network.
MLP: a multi-layer perceptron.
AP: average accuracy.
AUC: area under the subject's working characteristics.
The invention provides a video classification method based on a neural network. The method adopts a new convolutional network structure in which 2D-style convolutions are introduced into the 3D convolution layers, so that the model's ability to abstract spatial features is further improved without reducing its original ability to abstract 3D features, thereby improving video classification accuracy. The related technology of the neural-network-based video classification method of the present invention is described in detail below:
A video algorithm flow implemented with a 2D neural network typically includes several stages: video frame extraction, frame feature extraction, frame feature fusion, and classification prediction on the fused features. Frame feature extraction often uses a 2D convolutional network to extract spatial features; representative models include ResNets, Inceptions and R-CNNs, as well as models whose feature layers are prominent in image classification, such as ResNeXt, Xception and SSD. In the frame feature fusion and classification prediction stage, one approach is to first perform temporal pooling and then attach a model such as an SVM, GBDT or MLP; another approach is to feed the 2D feature maps frame by frame into an RNN and predict using the average, maximum or final value, as in the sketch below.
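As an illustration only (not part of the original disclosure), the following minimal PyTorch sketch shows the 2D pipeline just described with temporal average pooling: a small 2D CNN extracts per-frame spatial features, the features are pooled over time, and an MLP predicts the class. The tiny backbone and all layer sizes are assumptions; in practice a ResNet or Inception backbone would be used.

```python
import torch
import torch.nn as nn

class Frame2DPipeline(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.backbone = nn.Sequential(                   # per-frame 2D feature extractor
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),       # -> [N*T, 64]
        )
        self.classifier = nn.Sequential(                 # MLP on the fused feature
            nn.Linear(64, 128), nn.ReLU(inplace=True), nn.Linear(128, num_classes))

    def forward(self, x):                                # x: [N, 3, T, H, W]
        n, c, t, h, w = x.shape
        frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        feats = self.backbone(frames).reshape(n, t, -1)  # [N, T, 64]
        return self.classifier(feats.mean(dim=1))        # temporal average pooling
```

The sketch makes the limitation discussed next concrete: the temporal fusion only ever sees highly abstract per-frame features, never low-level spatial information.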
The idea of 2D-convolution-based algorithms is to use two completely independent networks to handle feature processing in the time domain and feature extraction in the space domain. Such methods therefore have strong expressive power for spatial features, but the later classification layers can only accept highly abstract spatial features and give the final prediction after combining them nonlinearly. Clearly, this type of method lacks low-level spatial features when processing temporal information, which is a notable limitation.
A 3D convolutional network is an extension of a 2D convolutional network. By using 3D convolution kernels, it performs abstraction over temporal and spatial information simultaneously in every convolution layer, so the two kinds of features can be combined effectively from the shallow layers to the deep layers, avoiding the problem that a 2D convolutional network cannot effectively exploit low-level spatial information in the time domain. 3D convolutional networks are not yet widely used; common base networks are models evolved from 2D convolutional networks, such as Inception3D and Conv3D. Such a network first uses several general-purpose cubic convolution kernels and spatio-temporal pooling layers to extract features from consecutive frame blocks of the whole video, and then feeds them into a classification layer. The advantage of this approach is that it does not require much attention to how differently the various types of useful features depend on spatial and temporal information; the hope is that the model can reach a globally optimal distribution through gradient-based optimization. However, this emphasis on generality also increases training difficulty. In practical tasks it has been found that a complex I3D network can be less accurate at identifying blurred video than a ResNet-18 network with far fewer parameters. This is because characterizing blurred video does not require information from the time dimension; extracting such features with 7x7x7 or 3x3x3 3D convolution kernels involves redundant computation, which increases computational complexity and easily introduces irrelevant or noisy information from the time domain. Therefore, for tasks that rely more on spatial features, the spatial feature extraction capability of the model needs to be enhanced by adjusting the convolution kernel structure; similarly, for tasks that rely more on temporal features, the abstraction of time domain features should be emphasized at the convolution design stage. By designing the convolution kernel shapes in a refined way and abstracting time domain and space domain information asymmetrically, the effectiveness of the model's features on specific tasks is improved. At the same time, adding reasonable prior constraints at the convolution kernel design stage reduces the difficulty of parameter optimization and speeds up training.
In one embodiment, as shown in fig. 1, the present invention provides a video classification method based on a neural network, which includes the following steps:
s100, acquiring frame images of a preset quantity value of a target video.
In this embodiment, the server acquires a target video to be identified, and divides the target video into a plurality of frame images. Specifically, the server acquires frame images of a preset number of values of the target video according to the time sequence of video playing. Wherein each frame image can be marked with a corresponding time stamp.
In one embodiment, step S100 includes: acquiring a preset number of frame images of the target video according to a preset number of frames transmitted per second. Specifically, for the target video file, after video decoding, the target video is frame-sampled according to a specified FPS (frames per second), and the RGB data of each frame image is extracted and stored frame by frame into a data matrix of the system in the order [Channel, Time, High, Width], where Channel denotes the three RGB input channels of the frame image, Time is the time of the frame image, High is the height (length) of the frame image, and Width is the width of the frame image.
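For illustration only, the following sketch shows one way to implement this sampling step, assuming OpenCV (cv2) and NumPy; the function name, the keep-every-step-th-frame strategy and the max_frames parameter are assumptions, not details taken from the patent.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, target_fps: float, max_frames: int) -> np.ndarray:
    """Decode a video, sample frames at `target_fps`, and return an RGB data
    matrix laid out as [Channel, Time, Height, Width]."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(int(round(native_fps / target_fps)), 1)   # keep every `step`-th frame

    frames = []
    index = 0
    while len(frames) < max_frames:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))  # H x W x 3
        index += 1
    cap.release()

    data = np.stack(frames, axis=0)            # [Time, Height, Width, Channel]
    return data.transpose(3, 0, 1, 2)          # [Channel, Time, Height, Width]
```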
S200, acquiring time sequence information of each frame image, length and width of each frame image and color characteristic information of each frame image.
In this embodiment, after the server acquires the preset number of frame images, it further acquires the timing information of each frame image, and at the same time acquires the color characteristic information of each frame image and the length and width of each frame image. The color characteristic information of each frame image includes the RGB data of each pixel. The frame images of the same target video all have the same length and width.
In one embodiment, when the server acquires the frame images of the preset number of values of the target video according to the preset transmission frame number per second, step S200 includes: and storing the time sequence information, the length and the width of each frame image and the color data corresponding to the color characteristic information into a data matrix according to the sequence of acquiring the frame images of the target video to obtain the characteristic matrix. Specifically, the feature matrix generated by the server is generated after sequencing the time sequence information, the length and the width and the color feature information of each frame image according to the time sequence of the acquired frame image.
S300, generating a feature matrix according to the time sequence information of each frame image, the length and the width of each frame image and the color feature information of each frame image.
In this embodiment, the server generates the feature matrix from the time sequence information, the length and width, and the color feature information of each frame image. Specifically, each frame image corresponds to a time point on the time axis, and its time sequence information corresponds to a time value. The length and width of each image each correspond to a value, and the color feature information corresponds to the RGB values of the pixels in the frame image. Therefore, the feature matrix of the target video can be generated by arranging, in order of the time sequence information, the time point value, the values corresponding to the length and width of each frame image, and the RGB values.
S400, inputting the feature matrix into an asymmetric convolutional neural network model to obtain an output matrix containing confidence values; the asymmetric convolution neural network model is used for extracting time domain features in the feature matrix in an independent convolution mode, extracting space domain features in the feature matrix in an independent convolution mode, analyzing the time domain features and the space domain features, and outputting an output matrix for representing confidence values of the attribution categories of videos.
In this embodiment, the server takes the feature matrix of the target video as an input matrix, and inputs the feature matrix to the pre-trained asymmetric convolutional neural network model. The asymmetric convolution neural network model is used for extracting time domain features in the feature matrix in an independent convolution mode, extracting space domain features in the feature matrix in an independent convolution mode, and further analyzing the time domain features and the space domain features and outputting an output matrix for representing confidence values of the attribution categories of the videos.
In an embodiment, the asymmetric convolutional neural network model is an Inception neural network model; the Inception neural network model comprises a first convolution module for extracting time domain features in the feature matrix in an independent convolution mode and a second convolution module for extracting space domain features in the feature matrix in an independent convolution mode; the Inception neural network model is used for extracting the time domain features through the first convolution module, extracting the space domain features through the second convolution module, inputting the time domain features and the space domain features into a full-connection layer of the Inception neural network model for connection, and outputting the output matrix containing the confidence values.
In this embodiment, the asymmetric convolutional neural network model is an Inception neural network model, namely a 3D Inception model based on a CNN model. Specifically, a first convolution unit for extracting the time domain features of the feature matrix by an independent convolution and a second convolution unit for extracting the space domain features of the feature matrix by an independent convolution are added on the basis of the InceptionV1 model. The structure of the InceptionV1 model is shown in FIG. 2, and the asymmetric convolutional neural network model according to the invention is shown in FIG. 3. As shown in FIG. 2 and FIG. 3, the input parameters are N, C, T, H, W, where N is the number of videos input into the model (the model can classify several videos at the same time), C is the color feature information of the frame images, T is the time sequence information of the frame images, H is the length (height) of the frame images, and W is the width of the frame images. Branch 0, Branch 1, Branch 2, Branch 3, Branch 4 and Branch 5 denote the convolution units in each branch; 3D Conv Unit 0, 3D Conv Unit 1, 3D Conv Unit 2 and 3D Conv Unit 3 denote the 3D convolution units in the model; MaxPooling denotes max pooling. As shown in FIG. 2, convolution unit 10 outputs [N, C1, T, H, W], convolution unit 20 outputs [N, C2, T, H, W], convolution unit 30 outputs [N, C3, T, H, W], and convolution unit 40 outputs [N, C4, T, H, W]; the outputs of the convolution units are concatenated in the channel dimension to obtain the model output [N, C1+C2+C3+C4, T, H, W]. As shown in FIG. 3, Branch 0, 1, 2 and 3 represent convolution modules containing the individual convolution units of FIG. 2, drawn here in simplified form. The asymmetric convolutional neural network model described in the present invention ultimately outputs a matrix of size [N, C0+C1+C2+C3+C4+C5, T, H, W]. The relevant parameters of the convolution models in FIG. 2 and FIG. 3 are listed in FIG. 4, where Unit Name is the unit name, Kernel Size is the convolution kernel size, Stride is the stride, Padding is the padding, BN denotes batch normalization, Activation denotes the activation function, and ReLU is the ReLU activation function. It should be noted that the parameter notation in FIG. 2, FIG. 3 and FIG. 4 is the notation commonly used in the art.
Specifically, the asymmetric convolutional neural network model of the present invention shown in FIG. 3 may be referred to as the TSBlock structure. The TSBlock structure adds two separate branches, Branch 4 and Branch 5, on the basis of the InceptionV1 model. Branch 4 is a time domain branch that extracts the temporal features of the feature map of the input matrix separately through an independent 3x1x1 convolution operation. Branch 5 is a space domain branch that uses an independent 1x3x3 convolution structure to abstract the spatial information of the features. When video data is input and passes separately through Branches 0, 1, 2, 3, 4 and 5, output vectors with the same T, H and W dimensions as the outputs of Branches 0, 1, 2 and 3 are obtained, and these are finally concatenated in the Channel dimension as the output of the TSBlock structure.
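The following PyTorch sketch illustrates the TSBlock idea described above: an Inception-style block extended with an independent temporal branch (3x1x1 convolution) and an independent spatial branch (1x3x3 convolution), with all branch outputs concatenated along the channel dimension. The channel counts and the internal layout of Branches 0-3 are assumptions for illustration, not the exact parameters of FIG. 3 and FIG. 4.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, kernel, padding=0):
    """3D convolution followed by batch normalization and ReLU (cf. FIG. 4)."""
    return nn.Sequential(
        nn.Conv3d(cin, cout, kernel_size=kernel, padding=padding, bias=False),
        nn.BatchNorm3d(cout),
        nn.ReLU(inplace=True),
    )

class TSBlock(nn.Module):
    def __init__(self, cin, cout_per_branch=32):
        super().__init__()
        c = cout_per_branch
        # Branches 0-3: simplified InceptionV1-style branches (see FIG. 2).
        self.branch0 = conv_bn_relu(cin, c, (1, 1, 1))
        self.branch1 = nn.Sequential(conv_bn_relu(cin, c, (1, 1, 1)),
                                     conv_bn_relu(c, c, (3, 3, 3), padding=1))
        self.branch2 = nn.Sequential(conv_bn_relu(cin, c, (1, 1, 1)),
                                     conv_bn_relu(c, c, (3, 3, 3), padding=1))
        self.branch3 = nn.Sequential(nn.MaxPool3d(kernel_size=3, stride=1, padding=1),
                                     conv_bn_relu(cin, c, (1, 1, 1)))
        # Branch 4: independent time domain branch, 3x1x1 convolution.
        self.branch4 = conv_bn_relu(cin, c, (3, 1, 1), padding=(1, 0, 0))
        # Branch 5: independent space domain branch, 1x3x3 convolution.
        self.branch5 = conv_bn_relu(cin, c, (1, 3, 3), padding=(0, 1, 1))

    def forward(self, x):                      # x: [N, C, T, H, W]
        outs = [self.branch0(x), self.branch1(x), self.branch2(x),
                self.branch3(x), self.branch4(x), self.branch5(x)]
        return torch.cat(outs, dim=1)          # [N, C0+...+C5, T, H, W]
```

All branches preserve the T, H and W dimensions, so the channel concatenation matches the description of the TSBlock output above.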
The invention also provides a video convolution network architecture. The video convolution network architecture comprises a convolution module of the I3D network and the asymmetric convolution neural network model in the embodiment. The asymmetric convolution neural network model is used for analyzing the time domain features and the spatial domain features after extracting the time domain features in the feature matrix in an independent convolution mode and extracting the spatial domain features in the feature matrix in an independent convolution mode, and outputting an output matrix for representing the confidence value of the video attribution category. Wherein, the above embodiments have detailed the asymmetric convolutional neural network model, and the description thereof will not be repeated here.
In the video convolution network architecture, the convolution module of the I3D network is configured to perform a convolution operation on the time sequence information, length and width of the input video frame images and the color feature information of the frame images, and then input the convolution result into the asymmetric convolutional neural network model to obtain an output matrix of confidence values of the video attribution categories. The convolution modules of the I3D (Inflated 3D ConvNet) network are the 3D convolution unit 0, 3D convolution unit 1 and 3D convolution unit 2 shown in FIG. 5; these belong to the convolution modules of the original deep neural network structure. The asymmetric convolutional neural network model corresponds to TSBlock structural unit 0, TSBlock structural unit 1 and TSBlock structural unit 2 in FIG. 5, all of which have the TSBlock structure described above and are not described again here. The internal structure of each TSBlock structural unit is shown in FIG. 6, the internal structure of each 3D convolution unit is shown in FIG. 7, and the internal structure of the multi-layer perceptron is shown in FIG. 8.
Compared with the conventional I3D network structure, the video convolution network architecture reduces the number of Block Stages in the I3D structure from 4 to 3 (see the 3 TSBlock structural units shown in FIG. 5), which reduces the amount of computation and therefore the GPU memory consumed by the network. Specifically, the I3D network contains 4 stages that in turn contain 2, 5 and 2 InceptionV1 modules (see FIG. 2), whereas the video convolution network architecture uses only 3 stages, each containing a single TSBlock (see FIG. 3); this also improves the AP and AUC indexes of the network. By using the TSBlock structure, the invention can effectively improve the AP and AUC indexes of the 3D network model on static, blurred, cartoon and other videos while maintaining the network's performance on categories whose spatio-temporal information is relatively balanced. Likewise, using the 3D neural network structure shown in FIG. 5 improves the AP and AUC indexes.
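As a rough illustration of the overall layout of FIG. 5 (not the exact disclosed configuration), the sketch below stacks a few I3D-style 3D convolution units as the stem, followed by 3 TSBlock stages and a small multi-layer perceptron head that outputs per-category scores. It reuses the conv_bn_relu helper and TSBlock class from the previous sketch; all channel counts, kernel sizes, strides and the pooling layout are assumptions.

```python
import torch
import torch.nn as nn

class VideoClassifier(nn.Module):
    def __init__(self, num_classes, in_channels=3):
        super().__init__()
        self.stem = nn.Sequential(                       # 3D Conv Units 0-2 (assumed sizes)
            conv_bn_relu(in_channels, 64, (7, 7, 7), padding=3),
            nn.MaxPool3d((1, 2, 2)),
            conv_bn_relu(64, 128, (3, 3, 3), padding=1),
            conv_bn_relu(128, 192, (3, 3, 3), padding=1),
        )
        self.stages = nn.Sequential(                     # 3 TSBlock stages, one block each
            TSBlock(192, 48), nn.MaxPool3d((2, 2, 2)),
            TSBlock(288, 64), nn.MaxPool3d((2, 2, 2)),
            TSBlock(384, 96),
        )
        self.head = nn.Sequential(                       # multi-layer perceptron (cf. FIG. 8)
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(576, 256), nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),                 # logits; Sigmoid applied in the loss
        )

    def forward(self, x):                                # x: [N, 3, T, 224, 224]
        return self.head(self.stages(self.stem(x)))
```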
In one implementation of this embodiment, the Inception neural network model is trained by:
acquiring a preset number of videos as training samples; the video in the training sample is marked with a category label according to the category to which the video belongs; inputting the frame image of the training sample into the Inception neural network model for model training to obtain each network parameter of the Inception neural network model; and determining the Inception neural network model according to the network parameters. Further, inputting the frame image of the training sample into the Inception neural network model for model training to obtain each network parameter of the Inception neural network model includes: inputting the frame image of the training sample into the Inception neural network model, calculating the loss between the score vector output by the Inception neural network model and the label vector by using a Sigmoid function and a binary_crossentropy function, carrying out back propagation training on the Inception neural network model by using a stochastic gradient descent method, and obtaining each network parameter of the Inception neural network model when the loss function of the Inception neural network model is converged.
Specifically, in the training stage, after the video data that has been preprocessed with random slicing is input into the network, the loss between the score vector output by the network and the label vector is calculated using a Sigmoid function and a binary_crossentropy function, and back propagation is then performed with mini-batch stochastic gradient descent so that the network parameters are continuously optimized. Training is stopped when the loss function of the network converges. In the prediction stage, video data from a central slice is fed into the network; for the vector output by the network, the floating-point value of each dimension is the confidence that the input video belongs to the corresponding category. The confidence is thresholded according to empirical values to judge whether the video belongs to a certain category.
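A minimal training-loop sketch for the stage described above, assuming PyTorch and a train_loader that yields (clip, label) pairs with clip shaped [N, 3, T, 224, 224] and label a multi-hot vector; the model comes from the earlier sketch and all hyperparameters are illustrative. BCEWithLogitsLoss combines the Sigmoid with binary cross-entropy.

```python
import torch
import torch.nn as nn

model = VideoClassifier(num_classes=10)                  # from the sketch above
criterion = nn.BCEWithLogitsLoss()                       # Sigmoid + binary cross-entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def train_one_epoch(train_loader):
    model.train()
    for clip, label in train_loader:
        optimizer.zero_grad()
        scores = model(clip)                             # score vector, [N, num_classes]
        loss = criterion(scores, label.float())          # loss vs. the label vector
        loss.backward()                                  # back propagation
        optimizer.step()                                 # stochastic gradient descent update

@torch.no_grad()
def predict(clip, thresholds):
    model.eval()
    confidence = torch.sigmoid(model(clip))              # per-category confidence values
    return confidence > torch.as_tensor(thresholds)      # thresholded category decisions
```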
The data preprocessing here is as follows. Each video file is first decoded; after the video is sampled at the specified FPS, the RGB data of the extracted frames are stored frame by frame into a data matrix in the order [Channel, Time, High, Width], and the data matrix is resized so that its short edge becomes 256. In the model training stage, a spatial random slicing operation is applied to the resized 4-dimensional matrix. Suppose a video block contains N frames, each of height H and width W. In the model training stage, for each frame image, x and y are chosen at random such that 0 <= x <= W-224 and 0 <= y <= H-224, and a new image I' = I(x, y, x+224, y+224) is cropped. In the model prediction or validation stage, x = W/2-112 and y = H/2-112. For the matrices corresponding to the frame images of the same video, x and y are kept identical. After the slicing operation, a matrix of size [3, T, 224, 224] is obtained. Finally, the element values of the input matrix are normalized from the range 0-255 to [-1, 1] using the normalization parameters std = [127.5, 127.5, 127.5] and mean = [127.5, 127.5, 127.5]; this matrix is the input matrix of the neural network. Here std denotes the standard deviation and mean denotes the mean.
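For illustration, the following NumPy sketch applies the slicing and normalization just described to the [Channel, Time, Height, Width] matrix produced by sample_frames(); the short-edge resize to 256 is omitted for brevity, and the function name and use of np.random.randint are assumptions.

```python
import numpy as np

def preprocess(data: np.ndarray, training: bool, crop: int = 224) -> np.ndarray:
    c, t, h, w = data.shape
    if training:
        # spatial random slice: the same (x, y) for every frame of the same video
        x = np.random.randint(0, w - crop + 1)
        y = np.random.randint(0, h - crop + 1)
    else:
        # central slice for prediction / validation
        x, y = w // 2 - crop // 2, h // 2 - crop // 2
    sliced = data[:, :, y:y + crop, x:x + crop]          # [3, T, 224, 224]
    # normalize 0..255 -> [-1, 1] with mean = std = 127.5 per channel
    return (sliced.astype(np.float32) - 127.5) / 127.5
```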
S500, determining the attribution category of the target video according to the confidence value in the output matrix.
In this embodiment, after the server inputs the feature matrix of the target video into the pre-trained asymmetric convolutional neural network model, the asymmetric convolutional neural network model outputs an output matrix containing confidence values that characterize the attribution category of the video. Each confidence value in the output matrix represents the confidence that the video belongs to a certain video category defined in the system. By comparing the confidence values in the output matrix with the preset values in the system, the attribution category of the video can be determined. The preset values in the system correspond to the video categories stored in the system in advance, so the category of the video can be determined by comparing the values.
In one embodiment, step S500 includes: comparing the confidence value with a plurality of numerical values preset by a system; and if the confidence value is larger than a certain value in the plurality of values, reading a video category corresponding to the certain value, and taking the video category as the attribution category of the target video. Specifically, each confidence value is compared with a numerical value preset by the system, and the classification attribution of the target video is judged according to the comparison result. In other embodiments, the confidence value may be thresholded according to an empirical value, so as to determine whether the target video belongs to a certain category.
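A small illustrative helper for this comparison step is sketched below: each confidence value is compared with a per-category threshold preset in the system, and every category whose threshold is exceeded is returned. The category names and threshold values are hypothetical examples, not values from the patent.

```python
from typing import Dict, List

THRESHOLDS = {"static": 0.6, "blurred": 0.5, "cartoon": 0.7}   # hypothetical preset values

def assign_categories(confidences: Dict[str, float]) -> List[str]:
    """Return every category whose confidence exceeds its preset value."""
    return [name for name, value in confidences.items()
            if value > THRESHOLDS.get(name, 0.5)]

# Example: confidences {"static": 0.82, "blurred": 0.31, "cartoon": 0.74}
# yield the attribution categories ["static", "cartoon"].
```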
According to the video classification method based on the neural network, after acquiring a preset number of frame images of the target video, the server collects the time sequence information, the length and width of each frame image and the color characteristic information of each frame image to generate a feature matrix that is input into an asymmetric convolutional neural network model. The asymmetric convolutional neural network model performs independent time domain feature extraction and independent space domain feature extraction on the input feature matrix, so that after the time domain features and space domain features of the target video are analyzed independently, it outputs an output matrix representing the confidence values of the attribution categories of the video. The attribution category of the target video can then be judged according to the confidence values in the output matrix. In this way, the model's ability to analyze the space domain features and time domain features in the feature matrix of the target video is enhanced automatically, and the accuracy of classifying the target video is improved.
The invention also provides a video classification device based on the neural network. In one embodiment, as shown in fig. 9, the video classification apparatus based on the neural network includes a first acquisition module 10, a second acquisition module 20, a generation module 30, a third acquisition module 40, and a determination module 50.
The first acquisition module 10 is configured to acquire a frame image of a preset number of values of a target video. In this embodiment, the server acquires a target video to be identified, and divides the target video into a plurality of frame images. Specifically, the server acquires frame images of a preset number of values of the target video according to the time sequence of video playing. Wherein each frame image can be marked with a corresponding time stamp.
The second acquisition module 20 is configured to acquire timing information of each of the frame images, the length and width of each of the frame images, and the color feature information of each of the frame images. In this embodiment, after the server acquires the preset number of frame images, it further acquires the timing information of each frame image, and at the same time acquires the color characteristic information of each frame image and the length and width of each frame image. The color characteristic information of each frame image includes the RGB data of each pixel. The frame images of the same target video all have the same length and width.
The generating module 30 is configured to generate a feature matrix according to the timing information of each frame image, the length and width of each frame image, and the color feature information of each frame image. In this embodiment, the server generates the feature matrix from the time sequence information, the length and width, and the color feature information of each frame image. Specifically, each frame image corresponds to a time point on the time axis, and its time sequence information corresponds to a time value. The length and width of each image each correspond to a value, and the color feature information corresponds to the RGB values of the pixels in the frame image. Therefore, the feature matrix of the target video can be generated by arranging, in order of the time sequence information, the time point value, the values corresponding to the length and width of each frame image, and the RGB values.
The third obtaining module 40 is configured to input the feature matrix into an asymmetric convolutional neural network model to obtain an output matrix containing confidence values; the asymmetric convolution neural network model is used for extracting time domain features in the feature matrix in an independent convolution mode, extracting space domain features in the feature matrix in an independent convolution mode, analyzing the time domain features and the space domain features, and outputting an output matrix for representing confidence values of the attribution categories of videos. In this embodiment, the server takes the feature matrix of the target video as an input matrix, and inputs the feature matrix to the pre-trained asymmetric convolutional neural network model. The asymmetric convolution neural network model is used for extracting time domain features in the feature matrix in an independent convolution mode, extracting space domain features in the feature matrix in an independent convolution mode, and further analyzing the time domain features and the space domain features and outputting an output matrix for representing confidence values of the attribution categories of the videos.
The determining module 50 determines the attribution category of the target video according to the confidence values in the output matrix. In this embodiment, after the server inputs the feature matrix of the target video into the pre-trained asymmetric convolutional neural network model, the asymmetric convolutional neural network model outputs an output matrix containing confidence values that characterize the attribution category of the video. Each confidence value in the output matrix represents the confidence that the video belongs to a certain video category defined in the system. By comparing the confidence values in the output matrix with the preset values in the system, the attribution category of the video can be determined. The preset values in the system correspond to the video categories stored in the system in advance, so the category of the video can be determined by comparing the values.
In other embodiments, each module in the video classification device based on the neural network provided by the present invention is further configured to execute operations corresponding to each step in the video classification method based on the neural network provided by the present invention, which are not described in detail herein.
The invention also provides a storage medium. The storage medium has a computer program stored thereon; the computer program, when executed by a processor, implements the neural network-based video classification method according to any of the above embodiments. The storage medium may be a memory. Such as internal memory or external memory, or both. The internal memory may include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), flash memory, or random access memory. The external memory may include a hard disk, floppy disk, ZIP disk, U-disk, tape, etc. The storage media disclosed herein include, but are not limited to, these types of memory. The memory disclosed herein is by way of example only and not by way of limitation.
The invention also provides a computer device. The computer device comprises: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications being configured to perform the neural network-based video classification method of any of the embodiments described above.
FIG. 10 is a schematic diagram of a computer device in an embodiment of the invention. The computer device in this embodiment may be a server, a personal computer, or a network device. As shown in fig. 10, the apparatus includes a processor 1003, a memory 1005, an input unit 1007, and a display unit 1009. It will be appreciated by those skilled in the art that the device architecture shown in fig. 10 does not constitute a limitation on all devices, and may include more or fewer components than shown, or may combine certain components. The memory 1005 may be used to store an application 1001 and various functional modules, and the processor 1003 runs the application 1001 stored in the memory 1005, thereby executing various functional applications of the device and data processing. The memory may be internal memory or external memory, or include both internal memory and external memory. The internal memory may include read-only memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), flash memory, or random access memory. The external memory may include a hard disk, floppy disk, ZIP disk, U-disk, tape, etc. The disclosed memory includes, but is not limited to, these types of memory. The memory disclosed herein is by way of example only and not by way of limitation.
The input unit 1007 is used to receive input signals and keywords entered by the user. The input unit 1007 may include a touch panel and other input devices. The touch panel can collect the user's touch operations on or near it (for example, operations performed on or near the touch panel with a finger, a stylus or any other suitable object or accessory) and drive the corresponding connection device according to a preset program; other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as play control keys and switch keys), a trackball, a mouse and a joystick. The display unit 1009 may be used to display information input by the user or provided to the user, as well as the various menus of the computer device, and may take the form of a liquid crystal display, an organic light-emitting diode display, or the like. The processor 1003 is the control center of the computer device: it connects the various parts of the whole computer through various interfaces and lines, and performs the various functions and data processing of the device by running or executing the software programs and/or modules stored in the memory 1005 and calling the data stored in the memory.
In one embodiment, the device includes one or more processors 1003, one or more memories 1005 and one or more applications 1001, wherein the one or more application programs 1001 are stored in the memory 1005 and configured to be executed by the one or more processors 1003, the one or more application programs 1001 being configured to perform the neural network-based video classification method described in the above embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
Those of ordinary skill in the art will appreciate that all or a portion of the steps implementing the above embodiments may be implemented by hardware, or may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and the storage medium may include a memory, a magnetic disk, an optical disk, or the like.
The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (11)

1. A video classification method based on a neural network, comprising:
acquiring frame images of a preset quantity value of a target video;
acquiring time sequence information of each frame image, length and width of each frame image and color characteristic information of each frame image; the timing information includes a time point value;
generating a feature matrix according to the time sequence information of each frame image, the length and the width of each frame image and the color feature information of each frame image, comprising: generating the feature matrix of the target video from the time point value, the values corresponding to the length and the width of each frame image and the RGB values according to the time sequence information;
inputting the feature matrix into an asymmetric convolutional neural network model to obtain an output matrix containing confidence values; the asymmetric convolution neural network model is used for extracting time domain features in the feature matrix in an independent convolution mode, analyzing the time domain features and the space domain features after extracting the space domain features in the feature matrix in an independent convolution mode, and outputting an output matrix for representing confidence values of the attribution categories of videos;
and determining the attribution category of the target video according to the confidence value in the output matrix.
2. The method of claim 1, wherein the acquiring the frame image of the preset number of values of the target video comprises: and acquiring frame images of a preset quantity value of the target video according to the preset transmission frame number per second.
3. The method of claim 2, wherein generating the feature matrix based on the timing information of each of the frame images, the length and width of each of the frame images, and the color feature information of each of the frame images, comprises:
And storing the time sequence information, the length and the width of each frame image and the color data corresponding to the color characteristic information into a data matrix according to the sequence of acquiring the frame images of the target video to obtain the characteristic matrix.
4. The method of claim 1, wherein the asymmetric convolutional neural network model is an Inception neural network model; the Inception neural network model comprises a first convolution module for extracting the time domain features from the feature matrix by means of independent convolutions and a second convolution module for extracting the space domain features from the feature matrix by means of independent convolutions;
the Inception neural network model is configured to extract the time domain features through the first convolution module, extract the space domain features through the second convolution module, input the time domain features and the space domain features into a fully-connected layer of the Inception neural network model to be joined, and output the output matrix containing the confidence values.
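As a loose illustration of a first convolution module operating only along time and a second operating only in space, joined by a fully-connected layer, the PyTorch sketch below may help; the kernel sizes, channel widths, and pooling are assumptions, and the class name AsymmetricConvClassifier is hypothetical.

```python
# Hypothetical sketch of an asymmetric (factorized) 3D convolution classifier:
# one branch convolves only along time, the other only over space, and the
# pooled features are joined in a fully-connected layer that emits per-category scores.
import torch
import torch.nn as nn

class AsymmetricConvClassifier(nn.Module):
    def __init__(self, in_channels=3, channels=64, num_classes=10):
        super().__init__()
        # First convolution module: temporal-only kernel (T x 1 x 1).
        self.temporal = nn.Sequential(
            nn.Conv3d(in_channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(inplace=True),
        )
        # Second convolution module: spatial-only kernel (1 x H x W).
        self.spatial = nn.Sequential(
            nn.Conv3d(in_channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1)),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)
        # Fully-connected layer joining both feature sets.
        self.fc = nn.Linear(channels * 2, num_classes)

    def forward(self, x):                  # x: (batch, 3, T, H, W)
        t = self.pool(self.temporal(x)).flatten(1)   # time domain features
        s = self.pool(self.spatial(x)).flatten(1)    # space domain features
        return self.fc(torch.cat([t, s], dim=1))     # raw scores per category
```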
5. The method of claim 4, wherein the Inception neural network model is trained by:
acquiring a preset number of videos as training samples, each video in the training samples being labelled with a category label according to the category to which it belongs;
inputting the frame images of the training samples into the Inception neural network model for model training to obtain each network parameter of the Inception neural network model;
and determining the Inception neural network model according to the network parameters.
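Purely as an illustration of assembling such labelled training samples, the sketch below defines a small PyTorch dataset yielding (frames, multi-hot label) pairs; the sample list format and the load_frames callable are hypothetical.

```python
# Hypothetical sketch: wrap labelled training videos as a dataset that yields
# (frame tensor, multi-hot category label) pairs for model training.
import torch
from torch.utils.data import Dataset

class LabelledVideoDataset(Dataset):
    def __init__(self, samples, num_classes, load_frames):
        """samples: list of (video_path, iterable of category indices);
        load_frames: callable returning a (3, T, H, W) float tensor for a path."""
        self.samples = samples
        self.num_classes = num_classes
        self.load_frames = load_frames

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, categories = self.samples[idx]
        frames = self.load_frames(path)
        label = torch.zeros(self.num_classes)   # multi-hot category label vector
        label[list(categories)] = 1.0
        return frames, label
```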
6. The method of claim 5, wherein inputting the frame images of the training samples into the Inception neural network model for model training to obtain each network parameter of the Inception neural network model comprises:
inputting the frame images of the training samples into the Inception neural network model, calculating the loss between the score vector output by the Inception neural network model and the label vector using a Sigmoid function and a binary_crossentropy function, performing back-propagation training on the Inception neural network model by stochastic gradient descent, and obtaining each network parameter of the Inception neural network model when the loss function of the Inception neural network model converges.
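A training loop along these lines might look like the following PyTorch sketch; BCEWithLogitsLoss is used here because it combines the Sigmoid and binary cross-entropy computation in one call, and the convergence test, learning rate, and momentum are assumptions of the sketch.

```python
# Hypothetical sketch: back-propagation training with a sigmoid + binary
# cross-entropy loss and stochastic gradient descent, stopping on convergence.
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=0.01, tol=1e-4):
    # Applies the sigmoid and binary cross-entropy in a single, numerically stable step.
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)

    previous = float("inf")
    for epoch in range(epochs):
        running = 0.0
        for frames, labels in loader:         # labels: multi-hot category vectors
            optimizer.zero_grad()
            scores = model(frames)            # score vector per video
            loss = criterion(scores, labels)  # loss between score vector and label vector
            loss.backward()                   # back propagation
            optimizer.step()
            running += loss.item()
        running /= max(len(loader), 1)
        if abs(previous - running) < tol:     # crude convergence check
            break
        previous = running
    return model.state_dict()                 # the learned network parameters
```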
7. The method of claim 1, wherein determining the category to which the target video belongs according to the confidence values in the output matrix comprises:
comparing the confidence values with a plurality of values preset by the system;
and if a confidence value is greater than one of the plurality of values, reading the video category corresponding to that value and taking the video category as the category to which the target video belongs.
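The comparison step can be as simple as the following sketch; the category names and threshold values are made-up placeholders.

```python
# Hypothetical sketch: map confidence values to a category by comparing each
# value against a preset threshold for that category.
def pick_category(confidences, thresholds):
    """confidences and thresholds are dicts keyed by category name."""
    best_category, best_margin = None, 0.0
    for category, confidence in confidences.items():
        threshold = thresholds.get(category)
        if threshold is not None and confidence > threshold:
            margin = confidence - threshold
            if margin > best_margin:          # keep the clearest exceedance
                best_category, best_margin = category, margin
    return best_category

# Example with made-up numbers:
print(pick_category({"sports": 0.83, "music": 0.41}, {"sports": 0.6, "music": 0.7}))
# -> "sports"
```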
8. A video classification device based on a neural network, comprising:
a first acquisition module, configured to acquire a preset number of frame images of a target video;
a second acquisition module, configured to acquire time sequence information of each frame image, the length and width of each frame image, and color feature information of each frame image;
a generating module, configured to generate a feature matrix according to the time sequence information of each frame image, the length and width of each frame image, and the color feature information of each frame image, wherein the time sequence information comprises a time point value, and the feature matrix of the target video is generated from the time point values, the values corresponding to the length and width of each frame image, and the RGB values according to the time sequence information;
a third acquisition module, configured to input the feature matrix into an asymmetric convolutional neural network model to obtain an output matrix containing confidence values; the asymmetric convolutional neural network model is configured to extract time domain features from the feature matrix by means of independent convolutions, extract space domain features from the feature matrix by means of independent convolutions, analyze the time domain features together with the space domain features, and output an output matrix representing confidence values of the categories to which a video may belong;
and a determining module, configured to determine the category to which the target video belongs according to the confidence values in the output matrix.
9. A video convolutional network architecture, comprising a convolution module of an I3D network and an asymmetric convolutional neural network model; the asymmetric convolutional neural network model is configured to extract time domain features from a feature matrix by means of independent convolutions, extract space domain features from the feature matrix by means of independent convolutions, analyze the time domain features together with the space domain features, and output an output matrix representing confidence values of the categories to which a video may belong; the convolution module of the I3D network is configured to perform a convolution operation on the time sequence information, the length and width of input video frame images, and the color feature information of the frame images, and then input the convolution result into the asymmetric convolutional neural network model to obtain the output matrix of confidence values of the video categories, wherein the time sequence information comprises a time point value, and the feature matrix of the target video is generated from the time point values, the values corresponding to the length and width of each frame image, and the RGB values according to the time sequence information.
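To illustrate how an I3D-style convolution module might feed an asymmetric temporal/spatial branch pair, here is a loose PyTorch sketch; the stem configuration, channel counts, and sigmoid output are assumptions, and no pretrained I3D weights are involved.

```python
# Hypothetical sketch: a 3D convolution stem (standing in for an I3D convolution
# module) whose output is passed to separate time-only and space-only convolutions.
import torch
import torch.nn as nn

class VideoConvNet(nn.Module):
    def __init__(self, num_classes=10, stem_channels=64, branch_channels=128):
        super().__init__()
        # Stand-in for the I3D convolution module: a full 3D convolution over
        # the stacked frame images (time, height, width, RGB).
        self.stem = nn.Sequential(
            nn.Conv3d(3, stem_channels, kernel_size=(3, 7, 7),
                      stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.BatchNorm3d(stem_channels),
            nn.ReLU(inplace=True),
        )
        # Asymmetric part: independent temporal and spatial convolutions.
        self.temporal = nn.Conv3d(stem_channels, branch_channels, (3, 1, 1), padding=(1, 0, 0))
        self.spatial = nn.Conv3d(stem_channels, branch_channels, (1, 3, 3), padding=(0, 1, 1))
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(branch_channels * 2, num_classes)

    def forward(self, x):                      # x: (batch, 3, T, H, W)
        x = self.stem(x)
        t = self.pool(torch.relu(self.temporal(x))).flatten(1)
        s = self.pool(torch.relu(self.spatial(x))).flatten(1)
        return torch.sigmoid(self.fc(torch.cat([t, s], dim=1)))  # confidence values
```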
10. A storage medium having a computer program stored thereon, the computer program being adapted to be loaded by a processor to perform the neural network-based video classification method of any one of claims 1 to 7.
11. A computer device, comprising:
one or more processors;
a memory;
one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors, the one or more applications configured to perform the neural network-based video classification method of any of claims 1-7.
CN201811626861.5A 2018-12-28 2018-12-28 Video classification method and device, storage medium and computer equipment Active CN111382616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811626861.5A CN111382616B (en) 2018-12-28 2018-12-28 Video classification method and device, storage medium and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811626861.5A CN111382616B (en) 2018-12-28 2018-12-28 Video classification method and device, storage medium and computer equipment

Publications (2)

Publication Number Publication Date
CN111382616A CN111382616A (en) 2020-07-07
CN111382616B (en) 2023-08-18

Family

ID=71222147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811626861.5A Active CN111382616B (en) 2018-12-28 2018-12-28 Video classification method and device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN111382616B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132175A (en) * 2020-08-14 2020-12-25 深圳云天励飞技术股份有限公司 Object classification method and device, electronic equipment and storage medium
CN111967522B (en) * 2020-08-19 2022-02-25 南京图格医疗科技有限公司 Image sequence classification method based on funnel convolution structure
CN112464831B (en) * 2020-12-01 2021-07-30 马上消费金融股份有限公司 Video classification method, training method of video classification model and related equipment
CN112862837B (en) * 2021-01-27 2023-06-23 南京信息工程大学 Image processing method and system based on convolutional neural network
CN113569092B (en) * 2021-07-29 2023-09-05 北京达佳互联信息技术有限公司 Video classification method and device, electronic equipment and storage medium
CN114241223B (en) * 2021-12-17 2023-03-24 北京达佳互联信息技术有限公司 Video similarity determination method and device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105426883A (en) * 2015-12-25 2016-03-23 中国科学院深圳先进技术研究院 Video classified rapid identification method and device
CN108229527A (en) * 2017-06-29 2018-06-29 北京市商汤科技开发有限公司 Training and video analysis method and apparatus, electronic equipment, storage medium, program
CN108229548A (en) * 2017-12-27 2018-06-29 华为技术有限公司 A kind of object detecting method and device
CN108629326A (en) * 2018-05-14 2018-10-09 中国科学院自动化研究所 The action behavior recognition methods of objective body and device
CN108764084A (en) * 2018-05-17 2018-11-06 西安电子科技大学 Video classification methods based on spatial domain sorter network and the time domain network integration
CN108764142A (en) * 2018-05-25 2018-11-06 北京工业大学 Unmanned plane image forest Smoke Detection based on 3DCNN and sorting technique

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Xing. Object Color Recognition Based on Convolutional Neural Networks. China Masters' Theses Full-text Database, Information Science and Technology Series, 2016, Chapters 4 and 5. *

Also Published As

Publication number Publication date
CN111382616A (en) 2020-07-07

Similar Documents

Publication Publication Date Title
CN111382616B (en) Video classification method and device, storage medium and computer equipment
US7756296B2 (en) Method for tracking objects in videos using forward and backward tracking
WO2019100724A1 (en) Method and device for training multi-label classification model
CN110765860B (en) Tumble judging method, tumble judging device, computer equipment and storage medium
CN111291809B (en) Processing device, method and storage medium
CN111523621A (en) Image recognition method and device, computer equipment and storage medium
WO2019023500A1 (en) Computer-implemented perceptual apparatus
CN112613581A (en) Image recognition method, system, computer equipment and storage medium
CN112529146B (en) Neural network model training method and device
CN113592896A (en) Fish feeding method, system, equipment and storage medium based on image processing
CN113179421B (en) Video cover selection method and device, computer equipment and storage medium
CN115862091A (en) Facial expression recognition method, device, equipment and medium based on Emo-ResNet
CN108537820B (en) Dynamic prediction method, system and applicable equipment
CN116168329A (en) Video motion detection method, equipment and medium based on key frame screening pixel block
CN115578770A (en) Small sample facial expression recognition method and system based on self-supervision
CN113536970A (en) Training method of video classification model and related device
Singh et al. Performance enhancement of salient object detection using superpixel based Gaussian mixture model
Xiao et al. Self-explanatory deep salient object detection
CN112465847A (en) Edge detection method, device and equipment based on clear boundary prediction
An et al. Overlap training to mitigate inconsistencies caused by image tiling in CNNs
CN111814884A (en) Target detection network model upgrading method based on deformable convolution
CN112070853A (en) Image generation method and device
Huo et al. Local graph regularized coding for salient object detection
CN113822871A (en) Target detection method and device based on dynamic detection head, storage medium and equipment
Wang et al. GAN-based adaptive cost learning for enhanced image steganography security

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230926

Address after: 31A, 15/F, Building 30, Mapletree Business City, Pasir Panjang Road, Singapore

Patentee after: Baiguoyuan Technology (Singapore) Co.,Ltd.

Address before: Building B-1, North District, Wanda Commercial Plaza, Wanbo business district, No. 79, Wanbo 2nd Road, Nancun Town, Panyu District, Guangzhou City, Guangdong Province

Patentee before: GUANGZHOU BAIGUOYUAN INFORMATION TECHNOLOGY Co.,Ltd.