CN113033677A - Video classification method and device, electronic equipment and storage medium - Google Patents

Video classification method and device, electronic equipment and storage medium

Info

Publication number
CN113033677A
Authority
CN
China
Prior art keywords
image
video
classification model
neural network
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110341382.4A
Other languages
Chinese (zh)
Inventor
佘琪
沈铮阳
王长虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110341382.4A
Publication of CN113033677A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content

Abstract

The present disclosure provides a video classification method and apparatus, an electronic device, and a storage medium that use a group-equivariant three-dimensional convolutional neural network classification model. First, an image to be recognized is determined from the video frame images of a video to be classified; then the image to be recognized is input into a pre-trained classification model to obtain the image category corresponding to the image to be recognized, wherein the classification model is used for characterizing the correspondence between an image and its image category, the obtained image category belongs to at least two preset image categories, and the classification model is a group-equivariant three-dimensional convolutional neural network classification model; finally, the video category of the video to be classified is determined based on the image category corresponding to the image to be recognized. Because the group-equivariant three-dimensional convolutional neural network classification model is rotation-equivariant, video classification becomes insensitive to whether the camera was held upright and to whether the objects in the video are upright, which improves the robustness of video classification.

Description

Video classification method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the technical field of computer vision, in particular to a video classification method, a video classification device, electronic equipment and a storage medium.
Background
With the development of the internet, video applications of all kinds are widely used, and videos need to be classified in various ways: for example, deciding during content review whether a video meets the requirements for public release, or determining which category, such as sports, entertainment, food, or clothing, a video's content belongs to.
The prevailing approach to video classification is currently implemented based on Convolutional Neural Network (CNN) technology.
Disclosure of Invention
The embodiment of the disclosure provides a video classification method, a video classification device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a video classification method, including:
determining an image to be recognized from the video frame images of a video to be classified;
inputting the image to be recognized into a pre-trained classification model to obtain an image category corresponding to the image to be recognized, wherein the classification model is used for characterizing the correspondence between an image and its image category, the obtained image category belongs to at least two preset image categories, and the classification model is a group-equivariant three-dimensional convolutional neural network classification model;
and determining the video category of the video to be classified based on the image category corresponding to the image to be recognized.
In some optional embodiments, the classification model is obtained by pre-training through the following training steps:
acquiring a training sample set, wherein each training sample comprises a training sample image and sample labeling information used for characterizing the image category to which the training sample image belongs;
inputting the training sample images in the training sample set into an initial group-equivariant three-dimensional convolutional neural network classification model to obtain corresponding image categories, and adjusting the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model based on the difference between the obtained image categories and the sample labeling information in the training samples, until a preset training end condition is met;
and determining the initial group-equivariant three-dimensional convolutional neural network classification model as the classification model.
In some optional embodiments, the preset training end condition comprises at least one of:
the accuracy of the initial group-equivariant three-dimensional convolutional neural network classification model, tested with a test sample set, is greater than a preset accuracy threshold, wherein each test sample comprises a test sample image and sample labeling information used for characterizing the image category to which the test sample image belongs;
the difference between the obtained image categories and the sample labeling information in the training samples is smaller than a preset difference threshold;
and the number of times the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model have been adjusted is greater than or equal to a preset adjustment-count threshold.
In some alternative embodiments, the group-equivariant three-dimensional convolutional neural network classification model is translation-equivariant and rotation-equivariant.
In some optional embodiments, the group-equivariant three-dimensional convolutional neural network classification model comprises a group-equivariant three-dimensional convolutional neural network and a classifier, and the convolution operation in the group-equivariant three-dimensional convolutional neural network is defined as follows:
$$[f \star \psi^k](t, s) = \sum_{x \in X} f(x)\,\bigl[L_t L_s \psi^k\bigr](x)$$

where $f$ is the input feature image, $\star$ denotes the cross-correlation operation, $t$ is the translation parameter, $s$ is the rotation parameter, $\psi^k$ is the $k$-th convolution kernel ($k$ is a positive integer indexing the convolution kernels), $L_s$ performs a rotation according to the rotation parameter $s$, $L_t$ performs a translation according to the translation parameter $t$, and $X$ is the common domain of $f$ and $\psi^k$. For the input layer of the group-equivariant three-dimensional convolutional neural network classification model, $X = \mathbb{Z}^3$, where $\mathbb{Z}^3$ represents three-dimensional space; for the intermediate layers, $X = G$, where $G = \mathbb{Z}^3 \rtimes C_N$. Here $C_N$ is the equivariance group, $C_N = \{A_0, A_1, \ldots, A_{N-1}\}$, where $A_i$ denotes the in-plane rotation by the angle $2\pi i/N$, $N$ is a preset positive integer greater than or equal to 2, and $i$ is an integer between 0 and $N-1$.
In some alternative embodiments, the training sample images in the training samples of the training sample set include forward sample images, in which the presented image objects are placed upright relative to the image frame, and/or non-forward sample images, in which the presented image objects are placed rotated relative to the image frame.
In some alternative embodiments, the classifier is a linear classifier.
In a second aspect, an embodiment of the present disclosure provides a video classification apparatus, including: an image determining unit configured to determine an image to be recognized from the video frame images of a video to be classified; a classification unit configured to input the image to be recognized into a pre-trained classification model to obtain an image category corresponding to the image to be recognized, wherein the classification model is used for characterizing the correspondence between an image and its image category, the obtained image category belongs to at least two preset image categories, and the classification model is a group-equivariant three-dimensional convolutional neural network classification model; and a video category determining unit configured to determine the video category of the video to be classified based on the image category corresponding to the image to be recognized.
In some optional embodiments, the classification model is obtained by pre-training through the following training steps:
acquiring a training sample set, wherein each training sample comprises a training sample image and sample labeling information used for characterizing the image category to which the training sample image belongs;
inputting the training sample images in the training sample set into an initial group-equivariant three-dimensional convolutional neural network classification model to obtain corresponding image categories, and adjusting the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model based on the difference between the obtained image categories and the sample labeling information in the training samples, until a preset training end condition is met;
and determining the initial group-equivariant three-dimensional convolutional neural network classification model as the classification model.
In some optional embodiments, the preset training end condition comprises at least one of:
the accuracy of the initial group-equivariant three-dimensional convolutional neural network classification model, tested with a test sample set, is greater than a preset accuracy threshold, wherein each test sample comprises a test sample image and sample labeling information used for characterizing the image category to which the test sample image belongs;
the difference between the obtained image categories and the sample labeling information in the training samples is smaller than a preset difference threshold;
and the number of times the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model have been adjusted is greater than or equal to a preset adjustment-count threshold.
In some alternative embodiments, the group-equivariant three-dimensional convolutional neural network classification model is translation-equivariant and rotation-equivariant.
In some optional embodiments, the group-equivariant three-dimensional convolutional neural network classification model comprises a group-equivariant three-dimensional convolutional neural network and a classifier, and the convolution operation in the group-equivariant three-dimensional convolutional neural network is defined as follows:
$$[f \star \psi^k](t, s) = \sum_{x \in X} f(x)\,\bigl[L_t L_s \psi^k\bigr](x)$$

where $f$ is the input feature image, $\star$ denotes the cross-correlation operation, $t$ is the translation parameter, $s$ is the rotation parameter, $\psi^k$ is the $k$-th convolution kernel ($k$ is a positive integer indexing the convolution kernels), $L_s$ performs a rotation according to the rotation parameter $s$, $L_t$ performs a translation according to the translation parameter $t$, and $X$ is the common domain of $f$ and $\psi^k$. For the input layer of the group-equivariant three-dimensional convolutional neural network classification model, $X = \mathbb{Z}^3$, where $\mathbb{Z}^3$ represents three-dimensional space; for the intermediate layers, $X = G$, where $G = \mathbb{Z}^3 \rtimes C_N$. Here $C_N$ is the equivariance group, $C_N = \{A_0, A_1, \ldots, A_{N-1}\}$, where $A_i$ denotes the in-plane rotation by the angle $2\pi i/N$, $N$ is a preset positive integer greater than or equal to 2, and $i$ is an integer between 0 and $N-1$.
In some alternative embodiments, the training sample images in the training samples of the training sample set include forward sample images, in which the presented image objects are placed upright relative to the image frame, and/or non-forward sample images, in which the presented image objects are placed rotated relative to the image frame.
In some alternative embodiments, the classifier is a linear classifier.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device, on which one or more programs are stored, which, when executed by the one or more processors, cause the one or more processors to implement the method as described in any implementation manner of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method as described in any of the implementations of the first aspect.
In current video classification based on convolutional neural networks, the camera is not always held upright, so the captured video may be deflected by some angle. Moreover, even when the camera is upright, objects in the scene are not necessarily upright. As a result, a CNN trained on such raw video data has poor robustness: it often generalizes badly to videos with rotation transformations, leading to low video classification accuracy.
In order to classify videos normally regardless of whether the image objects in them are placed upright or rotated, the video classification method and apparatus, electronic device, and storage medium provided by the embodiments of the present disclosure use a group-equivariant three-dimensional convolutional neural network classification model. First, an image to be recognized is determined from the video frame images of a video to be classified; then the image to be recognized is input into a pre-trained classification model to obtain the image category corresponding to the image to be recognized, wherein the classification model is used for characterizing the correspondence between an image and its image category, the obtained image category belongs to at least two preset image categories, and the classification model is a group-equivariant three-dimensional convolutional neural network classification model; finally, the video category of the video to be classified is determined based on the image category corresponding to the image to be recognized. Because the group-equivariant three-dimensional convolutional neural network classification model is rotation-equivariant, video classification becomes insensitive to whether the camera was held upright and to whether the objects in the video are upright, which improves the robustness of video classification.
Drawings
Other features, objects, and advantages of the disclosure will become apparent from a reading of the following detailed description of non-limiting embodiments which proceeds with reference to the accompanying drawings. The drawings are only for purposes of illustrating the particular embodiments and are not to be construed as limiting the invention. In the drawings:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a video classification method according to the present disclosure;
FIG. 3 is a flow chart of one embodiment of training steps according to the present disclosure;
FIG. 4 is a schematic block diagram of one embodiment of a video classification apparatus according to the present disclosure;
FIG. 5 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the video classification method, apparatus, electronic device, and storage medium of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a video-on-demand application, a video editing application, a video shooting application, a short video social application, a web conference application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a video capture device (e.g., a camera) and a display screen, including but not limited to a smartphone, a tablet computer, an e-book reader, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer, a desktop computer, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., to provide video recording, editing, and playing services), or as a single piece of software or software module. No specific limitation is imposed herein.
In some cases, the video classification method provided by the present disclosure may be performed by the terminal devices 101, 102, 103, and accordingly, the video classification apparatus may be provided in the terminal devices 101, 102, 103. In this case, the system architecture 100 may not include the server 105.
In some cases, the video classification method provided by the present disclosure may be performed by the terminal devices 101, 102, 103 and the server 105 together, for example, the step of "determining an image to be recognized from a video frame image of a video to be classified" may be performed by the terminal devices 101, 102, 103, the step of "inputting the image to be recognized into a classification model trained in advance, and obtaining an image class corresponding to the image to be recognized" may be performed by the server 105. The present disclosure is not limited thereto. Accordingly, the video classification means may be provided in the terminal devices 101, 102, and 103 and the server 105, respectively.
In some cases, the video classification method provided by the present disclosure may be executed by the server 105, and accordingly, the video classification apparatus may also be disposed in the server 105, and in this case, the system architecture 100 may also not include the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continuing reference to fig. 2, a flow 200 of one embodiment of a video classification method according to the present disclosure is shown, the video classification method comprising the steps of:
step 201, determining an image to be identified from a video frame image of a video to be classified.
In this embodiment, an executing subject (e.g., the server 105 shown in fig. 1) of the video classification method may locally or remotely acquire a video to be classified from other electronic devices (e.g., the terminal devices 101, 102, 103 shown in fig. 1) connected to the executing subject through a network, and determine an image to be recognized from video frame images of the video to be classified by using various implementations.
Here, the video to be classified may be a short video (generally, a video of no more than five minutes disseminated on new internet media) or a regular video; the present disclosure does not limit this.
As an example, the image to be recognized may be randomly selected from the video frame images of the video to be classified. As another example, the image to be recognized may be the middle video frame image among the video frame images of the video to be classified. Alternatively, the image to be recognized may be the first or the last video frame image of the video to be classified.
It should be noted that the image to be recognized may be one frame of image in the video to be classified, or may also be at least two frames of image in the video to be classified, which is not specifically limited in this disclosure.
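As a non-limiting illustration of the selection strategies described above, the following Python sketch shows how an executing subject might pick the image(s) to be recognized. The function name, strategy labels, and data representation are illustrative assumptions, not part of the present disclosure.

```python
import random

def select_images_to_recognize(video_frames, strategy="middle", num_random=1):
    """Pick the image(s) to be recognized from a video's frame images.

    `video_frames` is assumed to be a non-empty sequence of decoded
    frame images (e.g., numpy arrays); `strategy` selects one of the
    example strategies described above.
    """
    if strategy == "random":
        # Randomly select one or more frames from the video to be classified.
        return random.sample(list(video_frames), k=num_random)
    if strategy == "middle":
        # The middle video frame image.
        return [video_frames[len(video_frames) // 2]]
    if strategy == "first":
        # The first frame video image.
        return [video_frames[0]]
    if strategy == "last":
        # The last frame video image.
        return [video_frames[-1]]
    raise ValueError(f"unknown strategy: {strategy}")
```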
Step 202, inputting the image to be recognized into a classification model trained in advance, and obtaining the image category corresponding to the image to be recognized.
In this embodiment, the executing subject may input the to-be-recognized image determined in step 201 into a classification model trained in advance, so as to obtain an image category corresponding to the to-be-recognized image.
It should be noted that, when the image to be recognized is one frame of image in the video to be classified, the one frame of image may be input into a classification model trained in advance to obtain one image category as the image category corresponding to the image to be recognized. When the image to be recognized is at least two frames of images in the video to be classified, each frame of image can be respectively input into a classification model trained in advance to respectively obtain an image category, and each obtained image category is used as the image category corresponding to the image to be recognized.
Here, the classification model is used to characterize the correspondence between the image and the image class to which the image corresponds. And, here, the image categories to which the images correspond belong to at least two preset image categories. For example, the at least two preset image categories may be an "unsatisfactory category" and a "satisfactory category". For another example, the at least two preset image categories may be "sports event category", "food category", "clothing category", "nature scene category", "network lesson category", and the like.
Here, the classification model may be a group-equivariant three-dimensional convolutional neural network classification model. That is, the classification model may include a group-equivariant three-dimensional convolutional neural network for extracting equivariant features and a classifier for classifying based on the extracted features. A group-equivariant three-dimensional convolutional neural network is a three-dimensional convolutional neural network with the group-equivariance property. Here, different equivariant features can be extracted by designing different equivariance groups. For example, if the equivariance group is a group of translations, the group-equivariant three-dimensional convolutional neural network can extract features that are equivariant or invariant for translated video frame images. For another example, if the equivariance group is a group of rotations, the group-equivariant three-dimensional convolutional neural network can extract features that are equivariant or invariant for rotated video frame images.
Step 203, determining the video category of the video to be classified based on the image category corresponding to the image to be recognized.
In this embodiment, the executing entity may adopt various embodiments to determine the video category of the video to be classified based on the image category corresponding to the image to be recognized obtained in step 202.
When the image to be recognized is one frame of image, or is at least two frames of images that all correspond to a single image category, the video category of the video to be classified may be the same as, or correspond to, that image category. For example, the image category corresponding to the image to be recognized may be used directly as the video category of the video to be classified. As another example, the video category to which the image category of the image to be recognized is mapped may be determined as the video category of the video to be classified, according to a preset correspondence between image categories and video categories.
When the image to be recognized is at least two frames of images that correspond to at least two image categories, the video category of the video to be classified may be all of the image categories corresponding to the images to be recognized, or the video categories corresponding to all of those image categories. Alternatively, the video category of the video to be classified may be the most frequent image category among all the image categories of the frames to be recognized, or the video category corresponding to that most frequent image category. The present disclosure does not specifically limit this.
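The aggregation options above can be made concrete with a short sketch. The majority-vote rule and the optional mapping dictionary below are only one possible reading of "the most frequent image category"; the names are illustrative.

```python
from collections import Counter

def determine_video_category(frame_categories, image_to_video_map=None):
    """Aggregate per-frame image categories into one video category.

    `frame_categories` holds the image category obtained for each
    recognized frame; `image_to_video_map` is an optional preset
    correspondence between image categories and video categories.
    """
    # Majority vote: take the most frequent image category across frames.
    dominant, _count = Counter(frame_categories).most_common(1)[0]
    if image_to_video_map is not None:
        return image_to_video_map[dominant]
    # Otherwise the image category is used directly as the video category.
    return dominant
```

For instance, `determine_video_category(["sports", "sports", "food"])` would return `"sports"`.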
The video classification method provided by the above embodiment of the present disclosure determines an image to be recognized from the video to be classified, inputs the image to be recognized into the classification model to obtain an image category, and then determines the video category of the video to be classified according to the obtained image category. This can achieve technical effects including, but not limited to, the following: first, the model receives the image to be recognized rather than the entire video to be classified, which reduces the amount of computation and can increase the speed of video classification; second, the classification model is a group-equivariant three-dimensional convolutional neural network classification model, which has the property of extracting equivariant or invariant features, so the robustness of the classification model can be improved, making it better suited to accurately classifying videos rotated by some angle.
In some alternative embodiments, the classification model used in step 202 may be pre-trained by a training step 300 as shown in fig. 3, where the training step 300 includes the following steps:
here, the main body of execution of the training step 300 may be the same as that of the video classification method described above. In this way, the executing agent in the training step may store the model parameters of the classification model in the local executing agent after the classification model is obtained by training, and read the model structure information and the model parameter information of the classification model obtained by training in the process of executing the video classification method.
Here, the execution subject of the training step may also be different from the execution subject of the above-described video classification method. In this way, the executing agent of the training step may send the model parameters of the classification model to the executing agent of the video classification method after the classification model is obtained by training. In this way, the executing entity of the video classification method may read the model structure information and the model parameter information of the classification model received from the executing entity of the training step in the process of executing the video classification method.
Step 301, a training sample set is obtained.
Here, the training sample may include a training sample image and sample labeling information for characterizing an image class to which the training sample image belongs.
Here, the training sample image may be various images acquired by an image acquisition device, and the sample labeling information may be obtained by manually performing image type labeling on the training sample image.
Here, the training sample image may also be a video frame image in a sample video captured by a camera. Correspondingly, the video category to which the sample video belongs can be labeled manually, and the labeling result of the sample video is used as the sample labeling information of the training sample image, so that the workload of manual data labeling required in the training process can be reduced.
In some alternative embodiments, the training sample images in the training samples of the training sample set include forward sample images, in which the presented image objects are placed upright relative to the image frame, and/or non-forward sample images, in which the presented image objects are placed rotated relative to the image frame. That is, the image objects in the training sample images may be in either a forward or a non-forward orientation with respect to the training sample image. For a traditional video classification model, the sample images would have to be manually corrected to the upright orientation before being used as training sample images, so this optional implementation can reduce the workload of manual correction during training, reduce labor and time costs, and speed up training.
Step 302, inputting the training sample images in the training sample set into an initial group-equivariant three-dimensional convolutional neural network classification model to obtain corresponding image categories, and adjusting the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model based on the difference between the obtained image categories and the sample labeling information in the training samples, until a preset training end condition is met.
Here, the model structure of the initial group-equivariant three-dimensional convolutional neural network classification model can first be determined.
The initial group-equivariant three-dimensional convolutional neural network classification model can comprise a group-equivariant three-dimensional convolutional neural network and a classifier. The classifier may be a linear classifier or a nonlinear classifier according to actual needs, which the present disclosure does not specifically limit.
The group-equivariant three-dimensional convolutional neural network is a three-dimensional convolutional neural network with the group-equivariance property. Here, the equivariance group of the network may be determined, as well as which layers the network specifically includes, for example which convolutional layers, pooling layers, and fully connected layers, and the order of the layers. If convolutional layers are included, the kernel size and convolution stride of each convolutional layer can be determined. If pooling layers are included, the pooling method can be determined, and so on. For example, the equivariance group may be a group of translations, a group of rotations, or a group of translations and rotations.
In some alternative embodiments, the group-equivariant three-dimensional convolutional neural network classification model is both translation-equivariant and rotation-equivariant.
The model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model can then be initialized.
Finally, the training sample images in the training sample set are input into the initial group-equivariant three-dimensional convolutional neural network classification model to obtain corresponding image categories, and the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model are adjusted based on the difference between the obtained image categories and the sample labeling information in the training samples, until the preset training end condition is met.
In some optional embodiments, the preset training end condition may include at least one of:
firstly, testing the accuracy of the initial group equal-variation three-dimensional convolution neural network classification model by using a test sample set, wherein the accuracy is greater than a preset accuracy threshold. The test sample comprises a test sample image and sample labeling information used for representing the image category to which the test sample image belongs.
Optionally, the test sample image in the test sample of the set of test samples comprises a forward test sample image with the presented image object placed forward relative to the test sample image and/or a non-forward test sample image with the presented image object placed rotated relative to the test sample image.
Secondly, the difference between the obtained image category and the sample labeling information in the training sample is smaller than a preset difference threshold value.
And thirdly, adjusting the model parameters of the initial group equal-variation three-dimensional convolutional neural network classification model for more than or equal to a preset parameter adjusting threshold.
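A minimal PyTorch-style training loop combining the three end conditions might look as follows. The thresholds, optimizer, and loss function are illustrative assumptions, since the disclosure does not fix them, and the function names are not from the patent.

```python
import torch
import torch.nn.functional as F

def evaluate_accuracy(model, test_loader):
    # Accuracy of the classification model on the test sample set.
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in test_loader:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    model.train()
    return correct / max(total, 1)

def train_classification_model(model, train_loader, test_loader,
                               acc_threshold=0.95,       # preset accuracy threshold
                               loss_threshold=1e-3,      # preset difference threshold
                               max_adjustments=100_000,  # preset adjustment-count threshold
                               lr=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    adjustments = 0
    while True:
        for images, labels in train_loader:
            # Difference between the obtained image categories and the labels.
            loss = F.cross_entropy(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()  # one adjustment of the model parameters
            adjustments += 1
            # End conditions 2 and 3: small difference, or adjustment count reached.
            if loss.item() < loss_threshold or adjustments >= max_adjustments:
                return model
        # End condition 1: test-set accuracy above the preset threshold.
        if evaluate_accuracy(model, test_loader) > acc_threshold:
            return model
```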
In some alternative embodiments, the convolution operation in the group-equivariant three-dimensional convolutional neural network can be defined as follows:
$$[f \star \psi^k](t, s) = \sum_{x \in X} f(x)\,\bigl[L_t L_s \psi^k\bigr](x)$$

where $f$ is the input feature image, $\star$ denotes the cross-correlation operation, $t$ is the translation parameter, $s$ is the rotation parameter, $\psi^k$ is the $k$-th convolution kernel ($k$ is a positive integer indexing the convolution kernels), $L_s$ performs a rotation according to the rotation parameter $s$, $L_t$ performs a translation according to the translation parameter $t$, and $X$ is the common domain of $f$ and $\psi^k$.
For the input layer of the group-equivariant three-dimensional convolutional neural network classification model, $X = \mathbb{Z}^3$, where $\mathbb{Z}^3$ represents three-dimensional space; for the intermediate layers, $X = G$, where $G = \mathbb{Z}^3 \rtimes C_N$. Here $C_N$ is the equivariance group, $C_N = \{A_0, A_1, \ldots, A_{N-1}\}$, where $A_i$ denotes the in-plane rotation by the angle $2\pi i/N$, $N$ is a preset positive integer greater than or equal to 2, and $i$ is an integer between 0 and $N-1$.
The group equivariant three-dimensional convolutional neural network defined above is translation and rotation equivariant, and can extract translation and rotation invariant features and improve the robustness of the classification model.
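Purely as an illustration of the mechanics, and not as the patent's reference implementation, the following PyTorch sketch implements a lifting group-equivariant 3D convolution for the cyclic group $C_4$ (N = 4), where the four in-plane rotations are exact 90-degree kernel rotations, together with a toy classification head that pools over the group axis. All class and variable names are illustrative assumptions, and the intermediate layers on $X = G$ (which must also transform the group axis of their input) are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class C4LiftingConv3d(nn.Module):
    """Lifting layer of a group-equivariant 3D convolution for C_4.

    Input:  (batch, in_ch, T, H, W)      feature image on Z^3
    Output: (batch, out_ch, 4, T, H, W)  feature maps on G = Z^3 x C_4
    The translation part L_t is supplied by the sliding-window
    convolution itself; the rotation part L_s is realized by applying
    the same kernel psi^k in its four 90-degree in-plane rotations.
    """

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, k, k, k))

    def forward(self, f):
        outputs = []
        for s in range(4):
            # L_s psi: rotate the kernel by s * 90 degrees in the H-W plane
            # (i.e., about the temporal axis).
            psi_s = torch.rot90(self.weight, s, dims=(3, 4))
            outputs.append(F.conv3d(f, psi_s, padding=self.weight.shape[-1] // 2))
        # Stack over the new group dimension C_4.
        return torch.stack(outputs, dim=2)

class EquivariantVideoClassifier(nn.Module):
    """Toy model: lifting conv, then max over the group axis for
    rotation-invariant features, global average pooling, and a
    linear classifier."""

    def __init__(self, num_classes, in_ch=3, feat=8):
        super().__init__()
        self.lift = C4LiftingConv3d(in_ch, feat)
        self.fc = nn.Linear(feat, num_classes)

    def forward(self, clip):
        g = self.lift(clip)               # (B, C, 4, T, H, W)
        inv = g.max(dim=2).values         # pool over C_4
        pooled = inv.mean(dim=(2, 3, 4))  # global average pool over T, H, W
        return self.fc(pooled)

# Toy usage on a small clip: T=8, H=W=16.
clip = torch.randn(1, 3, 8, 16, 16)
logits = EquivariantVideoClassifier(num_classes=5)(clip)
```

Rotating the input clip by 90 degrees in the H-W plane rotates the lifted feature maps and cyclically permutes the group axis; after the max over the group axis and the global pooling, the classifier's output is unchanged, which is the insensitivity to camera rotation described above.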
Step 303, determining the initial group-equivariant three-dimensional convolutional neural network classification model as the classification model.
With further reference to fig. 4, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a video classification apparatus, which corresponds to the method embodiment shown in fig. 2 and is particularly applicable to various electronic devices.
As shown in fig. 4, the video classification apparatus 400 of this embodiment may include: an image determining unit 401, a classification unit 402, and a video category determining unit 403. The image determining unit 401 is configured to determine an image to be recognized from the video frame images of a video to be classified; the classification unit 402 is configured to input the image to be recognized into a pre-trained classification model to obtain an image category corresponding to the image to be recognized, wherein the classification model is used for characterizing the correspondence between an image and its image category, the obtained image category belongs to at least two preset image categories, and the classification model is a group-equivariant three-dimensional convolutional neural network classification model; the video category determining unit 403 is configured to determine the video category of the video to be classified based on the image category corresponding to the image to be recognized.
In this embodiment, specific processes of the image determining unit 401, the classifying unit 402, and the video category determining unit 403 of the video classifying device 400 and technical effects thereof can refer to the related descriptions of step 201, step 202, and step 203 in the corresponding embodiment of fig. 2, respectively, and are not described herein again.
In some alternative embodiments, the classification model may be obtained by pre-training through the following training steps:
acquiring a training sample set, wherein each training sample comprises a training sample image and sample labeling information used for characterizing the image category to which the training sample image belongs;
inputting the training sample images in the training sample set into an initial group-equivariant three-dimensional convolutional neural network classification model to obtain corresponding image categories, and adjusting the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model based on the difference between the obtained image categories and the sample labeling information in the training samples, until a preset training end condition is met;
and determining the initial group-equivariant three-dimensional convolutional neural network classification model as the classification model.
In some optional embodiments, the preset training end condition may include at least one of:
the accuracy of the initial group-equivariant three-dimensional convolutional neural network classification model, tested with a test sample set, is greater than a preset accuracy threshold, wherein each test sample comprises a test sample image and sample labeling information used for characterizing the image category to which the test sample image belongs;
the difference between the obtained image categories and the sample labeling information in the training samples may be smaller than a preset difference threshold;
and the number of times the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model have been adjusted may be greater than or equal to a preset adjustment-count threshold.
In some alternative embodiments, the group-equivariant three-dimensional convolutional neural network classification model may be translation-equivariant and rotation-equivariant.
In some optional embodiments, the group-equivariant three-dimensional convolutional neural network classification model may include a group-equivariant three-dimensional convolutional neural network and a classifier, and the convolution operation in the group-equivariant three-dimensional convolutional neural network is defined as follows:
$$[f \star \psi^k](t, s) = \sum_{x \in X} f(x)\,\bigl[L_t L_s \psi^k\bigr](x)$$

where $f$ is the input feature image, $\star$ denotes the cross-correlation operation, $t$ is the translation parameter, $s$ is the rotation parameter, $\psi^k$ is the $k$-th convolution kernel ($k$ is a positive integer indexing the convolution kernels), $L_s$ performs a rotation according to the rotation parameter $s$, $L_t$ performs a translation according to the translation parameter $t$, and $X$ is the common domain of $f$ and $\psi^k$. For the input layer of the group-equivariant three-dimensional convolutional neural network classification model, $X = \mathbb{Z}^3$, where $\mathbb{Z}^3$ represents three-dimensional space; for the intermediate layers, $X = G$, where $G = \mathbb{Z}^3 \rtimes C_N$. Here $C_N$ is the equivariance group, $C_N = \{A_0, A_1, \ldots, A_{N-1}\}$, where $A_i$ denotes the in-plane rotation by the angle $2\pi i/N$, $N$ is a preset positive integer greater than or equal to 2, and $i$ is an integer between 0 and $N-1$.
In some alternative embodiments, the training sample images in the training samples of the training sample set may include forward sample images, in which the presented image objects are placed upright relative to the image frame, and/or non-forward sample images, in which the presented image objects are placed rotated relative to the image frame.
In some alternative embodiments, the classifier may be a linear classifier.
It should be noted that, for details of implementation and technical effects of each unit in the video classification apparatus provided in the embodiment of the present disclosure, reference may be made to descriptions of other embodiments in the present disclosure, and details are not repeated herein.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing the electronic device of the present disclosure is shown. The computer system 500 shown in fig. 5 is only an example and should not bring any limitations to the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 5, computer system 500 may include a processing device (e.g., central processing unit, graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage device 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the computer system 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, and the like; output devices 507 including, for example, a liquid crystal display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, and the like; and a communication device 509. The communication device 509 may allow the computer system 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates a computer system 500 with various devices, it is to be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program, when executed by the processing device 501, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the video classification method as shown in the embodiment shown in fig. 2 and its alternative embodiments.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Here, the name of the unit does not constitute a limitation to the unit itself in some cases, and for example, the video category determination unit may also be described as a "unit that determines a video category of a video to be classified based on an image category corresponding to an image to be recognized".
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to the particular combination of features described above, but also encompasses other embodiments in which any combination of the features described above or their equivalents does not depart from the spirit of the disclosure. For example, the above features and (but not limited to) the features disclosed in this disclosure having similar functions are replaced with each other to form the technical solution.

Claims (10)

1. A video classification method, comprising:
determining an image to be recognized from the video frame images of a video to be classified;
inputting the image to be recognized into a pre-trained classification model to obtain an image category corresponding to the image to be recognized, wherein the classification model is used for characterizing the correspondence between an image and its image category, the obtained image category belongs to at least two preset image categories, and the classification model is a group-equivariant three-dimensional convolutional neural network classification model;
and determining the video category of the video to be classified based on the image category corresponding to the image to be recognized.
2. The method of claim 1, wherein the classification model is pre-trained by the training steps of:
acquiring a training sample set, wherein each training sample comprises a training sample image and sample labeling information used for characterizing the image category to which the training sample image belongs;
inputting the training sample images in the training sample set into an initial group-equivariant three-dimensional convolutional neural network classification model to obtain corresponding image categories, and adjusting the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model based on the difference between the obtained image categories and the sample labeling information in the training samples, until a preset training end condition is met;
and determining the initial group-equivariant three-dimensional convolutional neural network classification model as the classification model.
3. The method of claim 2, wherein the preset training end condition comprises at least one of:
the accuracy of the initial group-equivariant three-dimensional convolutional neural network classification model, tested with a test sample set, is greater than a preset accuracy threshold, wherein each test sample comprises a test sample image and sample labeling information used for characterizing the image category to which the test sample image belongs;
the difference between the obtained image categories and the sample labeling information in the training samples is smaller than a preset difference threshold;
and the number of times the model parameters of the initial group-equivariant three-dimensional convolutional neural network classification model have been adjusted is greater than or equal to a preset adjustment-count threshold.
4. The method of claim 3, wherein the group-equivariant three-dimensional convolutional neural network classification model is translation-equivariant and rotation-equivariant.
5. The method of claim 4, wherein the group-equivariant three-dimensional convolutional neural network classification model comprises a group-equivariant three-dimensional convolutional neural network and a classifier, and the convolution operation in the group-equivariant three-dimensional convolutional neural network is defined as follows:
$$[f \star \psi^k](t, s) = \sum_{x \in X} f(x)\,\bigl[L_t L_s \psi^k\bigr](x)$$

where $f$ is the input feature image, $\star$ denotes the cross-correlation operation, $t$ is the translation parameter, $s$ is the rotation parameter, $\psi^k$ is the $k$-th convolution kernel ($k$ is a positive integer indexing the convolution kernels), $L_s$ performs a rotation according to the rotation parameter $s$, $L_t$ performs a translation according to the translation parameter $t$, and $X$ is the common domain of $f$ and $\psi^k$. For the input layer of the group-equivariant three-dimensional convolutional neural network classification model, $X = \mathbb{Z}^3$, where $\mathbb{Z}^3$ represents three-dimensional space; for the intermediate layers, $X = G$, where $G = \mathbb{Z}^3 \rtimes C_N$. Here $C_N$ is the equivariance group, $C_N = \{A_0, A_1, \ldots, A_{N-1}\}$, where $A_i$ denotes the in-plane rotation by the angle $2\pi i/N$, $N$ is a preset positive integer greater than or equal to 2, and $i$ is an integer between 0 and $N-1$.
6. The method of claim 5, wherein the training sample images in the training samples of the training sample set comprise forward sample images, in which the presented image objects are placed upright relative to the image frame, and/or non-forward sample images, in which the presented image objects are placed rotated relative to the image frame.
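(Illustrative annotation, not part of the claims.) A non-forward sample image of the kind described in claim 6 could, under the assumption that rotations are restricted to multiples of 90 degrees, be generated from a forward one as follows:

```python
import torch

def make_non_forward_sample(image, n_rotations=4):
    """Create a non-forward (rotated) training sample image from a
    forward one by rotating it a random multiple of 360/n_rotations
    degrees in the image plane.

    image: square image tensor of shape (C, H, W) with H == W
    """
    k = int(torch.randint(1, n_rotations, (1,)))  # skip k = 0 (forward)
    return torch.rot90(image, k=k, dims=(1, 2))
```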
7. The method of claim 5 or 6, wherein the classifier is a linear classifier.
8. A video classification apparatus comprising:
an image determining unit configured to determine an image to be recognized from video frame images of a video to be classified;
a classification unit configured to input the image to be recognized into a pre-trained classification model to obtain an image category corresponding to the image to be recognized, wherein the classification model is used for representing the correspondence between an image and the image category to which the image belongs, the obtained image category is one of at least two preset image categories, and the classification model is a group equivariant three-dimensional convolutional neural network classification model;
and a video category determining unit configured to determine the video category of the video to be classified based on the image category corresponding to the image to be recognized.
9. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more processors, implements the method of any one of claims 1-7.
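(Illustrative annotation, not part of the claims.) A hedged end-to-end sketch of the claimed flow: the uniform clip sampling, the clip length, and the majority-vote aggregation are all assumptions, and `model` stands for any pre-trained group equivariant three-dimensional convolutional neural network classification model:

```python
import torch

def classify_video(frames, model, clip_len=16, num_clips=4):
    """Determine the images to be recognized (here: num_clips evenly
    spaced clips of clip_len frames each), obtain an image category for
    each clip from the pre-trained classification model, and take the
    video category as the most frequent image category.

    frames: decoded video frames, tensor of shape (T, C, H, W)
    model:  pre-trained group equivariant 3D CNN classification model
    """
    T = frames.shape[0]
    starts = torch.linspace(0, max(T - clip_len, 0), num_clips).long()
    categories = []
    with torch.no_grad():
        for s in starts.tolist():
            clip = frames[s:s + clip_len]                 # (clip_len, C, H, W)
            clip = clip.permute(1, 0, 2, 3).unsqueeze(0)  # (1, C, clip_len, H, W)
            categories.append(model(clip).argmax(dim=1).item())
    return max(set(categories), key=categories.count)     # majority vote
```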
CN202110341382.4A 2021-03-30 2021-03-30 Video classification method and device, electronic equipment and storage medium Pending CN113033677A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110341382.4A CN113033677A (en) 2021-03-30 2021-03-30 Video classification method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN113033677A (en)

Family

ID=76453446

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110341382.4A Pending CN113033677A (en) 2021-03-30 2021-03-30 Video classification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113033677A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NL2016285A (en) * 2016-02-19 2017-08-24 Scyfer B V Device and method for generating a group equivariant convolutional neural network.
WO2021048145A1 (en) * 2019-09-11 2021-03-18 Robert Bosch Gmbh Physical environment interaction with an equivariant policy
CN111401452A (en) * 2020-03-17 2020-07-10 北京大学 Image classification method of equal-variation convolution network model based on partial differential operator
CN112163120A (en) * 2020-09-04 2021-01-01 Oppo(重庆)智能科技有限公司 Classification method, terminal and computer storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CARLOS ESTEVES et al.: "Equivariant Multi-View Networks", 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1568-1577 *
TACO S. COHEN et al.: "Group Equivariant Convolutional Networks", arXiv, pages 1-12 *
WU Pengxiang et al.: "A Metric Meta-Learning Algorithm Based on Group Equivariant Convolution" (in Chinese), Computer Engineering, pages 1-8 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705386A (en) * 2021-08-12 2021-11-26 北京有竹居网络技术有限公司 Video classification method and device, readable medium and electronic equipment
WO2023016290A1 (en) * 2021-08-12 2023-02-16 北京有竹居网络技术有限公司 Video classification method and apparatus, readable medium and electronic device
CN113901965A (en) * 2021-12-07 2022-01-07 广东省科学院智能制造研究所 Liquid state identification method in liquid separation and liquid separation system
CN114528976A (en) * 2022-01-24 2022-05-24 北京智源人工智能研究院 Equal variable network training method and device, electronic equipment and storage medium
CN114528976B (en) * 2022-01-24 2023-01-03 北京智源人工智能研究院 Equal transformation network training method and device, electronic equipment and storage medium

Similar Documents

Publication Title
CN108830235B (en) Method and apparatus for generating information
CN107578017B (en) Method and apparatus for generating image
CN109816589B (en) Method and apparatus for generating cartoon style conversion model
CN109145784B (en) Method and apparatus for processing video
WO2020000879A1 (en) Image recognition method and apparatus
CN109919244B (en) Method and apparatus for generating a scene recognition model
CN111476871B (en) Method and device for generating video
CN110188719B (en) Target tracking method and device
CN109740018B (en) Method and device for generating video label model
CN112954450B (en) Video processing method and device, electronic equipment and storage medium
CN109993150B (en) Method and device for identifying age
CN113033677A (en) Video classification method and device, electronic equipment and storage medium
WO2020000876A1 (en) Model generating method and device
CN109961032B (en) Method and apparatus for generating classification model
CN109583389B (en) Drawing recognition method and device
CN110502665B (en) Video processing method and device
CN108235004B (en) Video playing performance test method, device and system
CN110059624B (en) Method and apparatus for detecting living body
CN110209658B (en) Data cleaning method and device
CN108241855B (en) Image generation method and device
CN110349161B (en) Image segmentation method, image segmentation device, electronic equipment and storage medium
CN110991373A (en) Image processing method, image processing apparatus, electronic device, and medium
CN110046571B (en) Method and device for identifying age
CN111985281A (en) Image generation model generation method and device and image generation method and device
CN109271929B (en) Detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination