CN111242019A - Video content detection method and device, electronic equipment and storage medium


Info

Publication number
CN111242019A
Authority
CN
China
Prior art keywords
video
detected
content
distance
video frames
Prior art date
Legal status
Granted
Application number
CN202010027419.1A
Other languages
Chinese (zh)
Other versions
CN111242019B (en)
Inventor
彭健腾
王兴华
康斌
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010027419.1A
Publication of CN111242019A
Application granted
Publication of CN111242019B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a video content detection method and device, electronic equipment, and a storage medium. The video content detection method comprises: acquiring a video to be detected; selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected; extracting the image features corresponding to the video frames to be detected; acquiring the clustering center to which each content category belongs; calculating the distances between each image feature and the clustering centers to obtain a distance set corresponding to each image feature; and determining the content category of the video to be detected according to the distance sets and the clustering centers. The scheme can improve the accuracy of video content detection.

Description

Video content detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a video content detection method and device, electronic equipment and a storage medium.
Background
In the age of rapid development of the internet, many websites support and allow users to upload videos and display them to the public. As the threshold of content production decreases, the volume of uploaded video grows at an exponential rate. To ensure the security of distributed content, verification of video content needs to be completed in a short time, for example, identifying and processing whether the content involves sensitive information, and checking content quality, security, and the like.
At present, video content detection schemes mainly count the distribution of a video's color features and divide the content categories of the video based on that distribution. Because the semantics of the video are not considered, the detection results of current schemes are inaccurate.
Disclosure of Invention
The embodiment of the invention provides a method and a device for detecting video content, electronic equipment and a storage medium, which can improve the accuracy of detecting the video content.
The embodiment of the invention provides a method for detecting video content, which comprises the following steps:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of video frames;
selecting video frames for video content detection from videos to be detected to obtain a plurality of video frames to be detected;
extracting image characteristics corresponding to each video frame to be detected, and acquiring a clustering center to which each content category belongs;
respectively calculating the distance between each image feature and a plurality of clustering centers to obtain a distance set corresponding to each image feature;
and determining the content category of the video to be detected according to the distance set and the plurality of clustering centers.
Correspondingly, an embodiment of the present invention further provides a device for detecting video content, including:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a video to be detected, and the video to be detected comprises a plurality of video frames;
the selection module is used for selecting video frames for video content detection from the videos to be detected to obtain a plurality of video frames to be detected;
the extraction module is used for extracting image characteristics corresponding to each video frame to be detected;
the second acquisition module is used for acquiring the clustering center to which each content category belongs;
the computing module is used for respectively computing the distances between the image features and the clustering centers to obtain a distance set corresponding to the image features;
and the determining module is used for determining the content category of the video to be detected according to the distance set and the plurality of clustering centers.
Optionally, in some embodiments of the present invention, the second obtaining module includes:
the device comprises an acquisition unit, a classification unit and a classification unit, wherein the acquisition unit is used for acquiring a trained video content classification model and a plurality of sample video frames marked with video content types, and the video content classification model is formed by training the plurality of sample video frames;
and the construction unit is used for constructing the clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames.
Optionally, in some embodiments of the present invention, the building unit includes:
the extraction subunit is used for respectively extracting the characteristics of each sample video frame by utilizing the trained video content classification model;
and the constructing subunit is used for constructing the clustering center to which each content category belongs based on the extracted features.
Optionally, in some embodiments of the present invention, the building subunit is specifically configured to:
acquiring a plurality of preset content tags;
determining the number of the classes to be clustered according to a plurality of preset content labels;
and clustering the extracted features based on a preset clustering algorithm and the classification number to obtain a clustering center to which each content category belongs.
Optionally, in some embodiments of the present invention, the apparatus further includes a training module, where the training module is specifically configured to:
collecting a plurality of sample video frames marked with video content types;
determining a sample video frame needing training currently from a plurality of collected sample video frames to obtain a current processing object;
importing the current processing object into a preset initial classification model for training to obtain a predicted value of video content corresponding to the current processing object;
converging a predicted value corresponding to the current processing object and the marked video content type of the current processing object so as to adjust the parameters of the preset initial classification model;
and returning to the step of determining the sample video frame which needs to be trained currently from the collected multiple sample video frames until the multiple sample video frames are trained.
Optionally, in some embodiments of the present invention, the determining module includes:
the selecting unit is used for selecting a preset number of distances from the distance set to obtain at least one target distance;
the first determining unit is used for determining a clustering center corresponding to the target distance to obtain at least one target clustering center;
and the second determining unit is used for determining the content category of the video to be detected according to the at least one target clustering center.
Optionally, in some embodiments of the present invention, the second determining unit is specifically configured to:
respectively acquiring content categories corresponding to a plurality of target clustering centers;
and determining the content category of the video to be detected based on the determined content category.
Optionally, in some embodiments of the present invention, the selecting unit is specifically configured to: and selecting the distance with the minimum distance from the distance set as the target distance.
Optionally, in some embodiments of the present invention, the selecting module is specifically configured to:
detecting the number of video frames in a video to be detected;
judging whether the number is larger than a preset number or not;
when the number is larger than the preset number, removing corresponding video frames from the video to be detected based on a preset strategy to obtain a reserved video frame set;
and selecting a plurality of video frames at intervals from the reserved video frame set to obtain the video frame to be detected.
Optionally, in some embodiments of the present invention, the selecting module is further specifically configured to:
and when the number is less than or equal to the preset number, selecting a plurality of video frames at intervals from the video to be detected to obtain the video frames to be detected.
In the embodiment of the invention, after a video to be detected is obtained, where the video to be detected comprises a plurality of video frames, video frames for video content detection are selected from it to obtain a plurality of video frames to be detected. The image features corresponding to the video frames to be detected are then extracted, and the clustering centers to which the content categories belong are obtained. Next, the distances between each image feature and the clustering centers are calculated to obtain the distance set corresponding to each image feature, and finally the content category of the video to be detected is determined according to the distance sets and the clustering centers. Therefore, the scheme can effectively improve the accuracy of detecting video content.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1a is a scene schematic diagram of a method for detecting video content according to an embodiment of the present invention;
fig. 1b is a schematic flowchart of a method for detecting video content according to an embodiment of the present invention;
FIG. 1c is a schematic diagram of cluster center distribution in the method for detecting video content according to the embodiment of the present invention;
Fig. 2a is another schematic flow chart of a method for detecting video content according to an embodiment of the present invention;
fig. 2b is a schematic view of another scene of a detection method of video content according to an embodiment of the present invention;
FIG. 2c is a schematic diagram of sample video processing performed by a server in the detection of video content according to an embodiment of the present invention;
fig. 3a is a schematic structural diagram of an apparatus for detecting video content according to an embodiment of the present invention;
fig. 3b is a schematic structural diagram of an apparatus for detecting video content according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, covering both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) technology is the science of making machines "see": using cameras and computers in place of human eyes to identify, track, and measure targets, and to further process the resulting images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision research attempts to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
The embodiment of the invention provides a video content detection method and device, electronic equipment and a storage medium.
The video content detection device (hereinafter, referred to as a detection device) may be specifically integrated in a server or a terminal, where the server may include an independently operating server or a distributed server, or may include a server cluster composed of a plurality of servers, and the terminal may include a mobile phone, a tablet computer, or a Personal Computer (PC).
Referring to fig. 1a, taking the example in which the detection device is integrated in the server: the server can receive a video uploaded by a user through the network (i.e., the video to be detected). For convenience of explanation, consider a user uploading a video A. After the server acquires video A, which can comprise a plurality of video frames, the server selects the video frames for video content detection from video A to obtain a plurality of video frames to be detected. Then, the server extracts the image features corresponding to each video frame to be detected and obtains the clustering center to which each content category belongs. Next, the server can calculate the distances between each image feature and the plurality of clustering centers to obtain a distance set corresponding to each image feature. Finally, the server determines the content category of video A according to the distance sets and the plurality of clustering centers, for example, detecting that the content category of video A is cartoon.
According to the scheme, the distance between each image feature and the plurality of clustering centers is calculated, and then the content category of the video to be detected is determined based on the distance set and the plurality of clustering centers, namely, the content category of each frame of image frame is considered in the actual detection process, so that the accuracy of detecting the video content can be improved.
The following are detailed below. It should be noted that the description sequence of the following embodiments is not intended to limit the priority sequence of the embodiments.
A method of detecting video content, comprising: the method comprises the steps of obtaining a video to be detected, selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected, extracting image features corresponding to the video frames to be detected, obtaining a clustering center to which each content category belongs, calculating distances between each image feature and the clustering centers respectively to obtain a distance set corresponding to each image feature, and determining the content category of the video to be detected according to the distance set and the clustering centers.
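For illustration, the following is a minimal Python sketch of this flow; extract_image_feature is a hypothetical stand-in for the trained classification model described below, and the sampling interval is an assumption.

```python
import numpy as np

def extract_image_feature(frame):
    # Hypothetical stand-in for the trained model's feature extractor:
    # flatten the frame and truncate to a fixed-length vector.
    return np.asarray(frame, dtype=np.float32).reshape(-1)[:128]

def detect_video_content(video_frames, cluster_centers, center_labels):
    """Sketch of the claimed flow; cluster_centers is a (K, D) array whose
    feature dimension D must match the extractor's output."""
    # Select video frames for content detection (e.g. every 10th frame).
    selected = video_frames[::10]
    # Extract one image feature per selected frame.
    features = np.stack([extract_image_feature(f) for f in selected])
    # Distance set: distance from each feature to every cluster center.
    dists = np.linalg.norm(
        features[:, None, :] - np.asarray(cluster_centers)[None, :, :], axis=2)
    # Per-frame category = label of the nearest cluster center.
    frame_categories = [center_labels[i] for i in dists.argmin(axis=1)]
    # Video category = the most frequent per-frame category.
    return max(set(frame_categories), key=frame_categories.count)
```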
Referring to fig. 1b, fig. 1b is a schematic flowchart illustrating a method for detecting video content according to an embodiment of the invention. The specific flow of the video content detection method may be as follows:
101. and acquiring the video to be detected.
The video to be detected may include a plurality of video frames. It may be acquired in various ways, for example, from the internet and/or a designated database, as determined by the requirements of the actual application, and may include a TV series, a movie, a video recorded by a user, and the like.
102. And selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected.
For example, according to the arrangement order of video frames in the video to be detected, video frames for video content detection may be selected at intervals from the video to be detected to obtain a plurality of video frames to be detected. To improve computational efficiency, the number of video frames may first be compressed, that is, some video frames are deleted and the frames for video content detection are then selected from the remaining frames. That is, optionally, in some embodiments, the step of "selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected" may specifically include:
(11) detecting the number of video frames in a video to be detected;
(12) judging whether the number is larger than a preset number or not;
(13) when the number is larger than the preset number, removing corresponding video frames from the video to be detected based on a preset strategy to obtain a reserved video frame set;
(14) and selecting a plurality of video frames at intervals from the reserved video frame set to obtain the video frame to be detected.
For example, it is detected that the number of video frames in the video to be detected is 200 frames, and the preset number is 100 frames, so that corresponding video frames can be removed from the video to be detected based on a preset policy to obtain a reserved video frame set, and then, a plurality of video frames are selected at intervals from the reserved video frame set to obtain the video frames to be detected.
In addition, when the number of the video frames is less than or equal to the preset number, a plurality of video frames may be selected at intervals from the video to be detected to obtain the video frames to be detected, that is, optionally, in some embodiments, the method may further include: and when the number is less than or equal to the preset number, selecting a plurality of video frames at intervals from the video to be detected to obtain the video frames to be detected.
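A minimal sketch of this selection logic follows; the preset number, the interval, and the evenly spaced removal strategy are assumptions, since the patent leaves the "preset strategy" open.

```python
def select_frames(frames, preset_number=100, interval=5):
    """Sketch of steps (11)-(14): compress the frame count when it exceeds
    the preset number, then sample at intervals."""
    if len(frames) > preset_number:
        # Remove frames per the (assumed) preset strategy:
        # keep an evenly spaced subset of at most preset_number frames.
        stride = len(frames) // preset_number
        retained = frames[::stride][:preset_number]
    else:
        retained = frames
    # Select video frames at intervals from the retained set.
    return retained[::interval]
```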
It should be further noted that, to reduce the influence of low-information frames on subsequent video content detection, such frames may be deleted before the video frames used for video content detection are selected. A low-information frame is a video frame with overly simple color and texture features, a video title frame, and so on; specifically, an image-frame-complexity algorithm may be used to detect all video frames. That is, optionally, in some embodiments, before the step of "selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected", the method may further include:
(21) detecting all video frames in a video to be detected respectively by adopting a preset algorithm;
(22) and processing the video to be detected based on the detection result to obtain the processed video to be detected.
For example, when 3 video frames are detected to be black-and-white frames and pure color blocks exist at the image edges of 2 other video frames, these 5 video frames can be deleted to obtain the processed video to be detected; then, video frames for video content detection can be selected from the preprocessed video to obtain a plurality of video frames to be detected.
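A minimal sketch of such a filter follows; the variance test is an assumed stand-in for the image-frame-complexity algorithm, which the patent does not specify.

```python
import numpy as np

def is_low_information(frame, std_threshold=10.0):
    # Assumed complexity test: near-zero pixel variance indicates a black,
    # white, or solid-color frame; threshold value is an assumption.
    return float(np.asarray(frame, dtype=np.float32).std()) < std_threshold

def remove_low_information_frames(frames):
    # Steps (21)-(22): detect every frame, keep only the informative ones.
    return [f for f in frames if not is_low_information(f)]
```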
103. And extracting image characteristics corresponding to each video frame to be detected, and acquiring a clustering center to which each content category belongs.
For example, feature extraction may be performed on each video frame to be detected based on a trained video content classification model to obtain the image feature corresponding to each frame, where the video content classification model is trained on a plurality of sample video frames labeled with video content types. The clustering center to which each content type belongs may be constructed based on the trained video content classification model and the plurality of sample video frames. That is, optionally, in some embodiments, the step of "obtaining the clustering center to which each content category belongs" may specifically include:
(31) acquiring a trained video content classification model;
(32) and constructing a clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames.
For example, each sample video frame may be fed into the trained video content classification model to obtain the sample image feature corresponding to each sample video frame, and the clustering center to which each content category belongs is then constructed from the sample image features. That is, optionally, the step of "constructing the clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames" may specifically include:
(41) respectively extracting the characteristics of each sample video frame by using the trained video content classification model;
(42) and constructing a clustering center to which each content category belongs based on the extracted features.
For example, labels such as "comedy", "horror", and "inference drama" may be preset, or labels such as "animation" and "non-animation" may be set. Taking the "animation" and "non-animation" labels as an example, the number of classes to be clustered is 2; the extracted features are then clustered based on a preset clustering algorithm and this class number to obtain the clustering center to which each content class belongs. That is, optionally, the step of "constructing the clustering center to which each content class belongs based on the extracted features" may specifically include:
(51) acquiring a plurality of preset content tags;
(52) determining the number of the classes to be clustered according to a plurality of preset content labels;
(53) clustering the extracted features based on a preset clustering algorithm and the classification number to obtain a clustering center to which each content category belongs.
The content tags may be set according to actual needs. For example, first, "animation" and "non-animation" tags may be set; then, tags such as "fun", "hot blood", and "different world" may be set under the "animation" tag, and tags such as "romance" may be set under the "non-animation" tag. Alternatively, animation name tags may be set under the "animation" tag, such as "X picture X", "dog XX", and "X god", while such tags are not set under the "non-animation" tag. This may be configured according to the actual situation and is not described here again.
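A minimal sketch of steps (51) to (53) follows, assuming scikit-learn's K-means as the "preset clustering algorithm" (K-means is the algorithm named in the detection-stage example later in this description).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_cluster_centers(sample_features, content_labels):
    """Steps (51)-(53): the class count comes from the preset content
    labels; KMeans stands in for the unspecified clustering algorithm."""
    n_clusters = len(content_labels)  # e.g. 2 for "animation" / "non-animation"
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    km.fit(np.asarray(sample_features))
    return km.cluster_centers_        # one clustering center per content category
```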
It should be noted that the video content classification model may be pre-established, that is, in some embodiments, before the step "obtaining a trained video content classification model", the method may further include:
(61) collecting a plurality of sample video frames marked with video content types;
(62) determining a sample video frame needing training currently from a plurality of collected sample video frames to obtain a current processing object;
(63) importing the current processing object into a preset initial classification model for training to obtain a predicted value of video content corresponding to the current processing object;
(64) converging a predicted value corresponding to the current processing object and the marked video content type of the current processing object so as to adjust the parameters of the preset initial classification model;
(65) and returning to the step of determining the sample video frame which needs to be trained currently from the collected multiple sample video frames until the multiple sample video frames are trained.
Convolutional layers: mainly used for feature extraction from an input image (such as a training sample or an image frame to be identified). The size of the convolution kernel can be determined according to the practical application; for example, the kernel sizes of the first to fourth convolutional layers may be (7, 7), (5, 5), (3, 3), and (3, 3). Optionally, to reduce computational complexity and improve efficiency, in this embodiment the kernel sizes of all four convolutional layers may be set to (3, 3), the activation functions all use ReLU (Rectified Linear Unit), and the padding mode (padding refers to the space between the attribute-defined element border and the element content) is set to "same" throughout; the "same" padding mode can be simply understood as padding the edges with 0, where the number of 0s padded on the left (top) is the same as or less than the number padded on the right (bottom). Optionally, the convolutional layers may be directly connected to each other to accelerate network convergence; to further reduce the amount of computation, downsampling may be performed on all, or any one or two, of the second to fourth convolutional layers. The downsampling operation is essentially the same as convolution, except that the downsampling kernel only takes the maximum value (max pooling) or the average value (average pooling) of the corresponding positions.
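In PyTorch, the convolutional stack described above might be sketched as follows; the channel widths are assumptions, while the (3, 3) kernels, ReLU activations, "same" padding, and max-pooling downsampling follow the text.

```python
import torch.nn as nn

# Sketch of the four-convolution feature extractor described above.
conv_stack = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding="same"), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding="same"), nn.ReLU(),
    nn.MaxPool2d(2),  # optional downsampling after the second layer
    nn.Conv2d(64, 128, kernel_size=3, padding="same"), nn.ReLU(),
    nn.MaxPool2d(2),  # optional downsampling after the third layer
    nn.Conv2d(128, 256, kernel_size=3, padding="same"), nn.ReLU(),
)
```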
It should be noted that, for convenience of description, in the embodiment of the present invention, both the layer where the activation function is located and the down-sampling layer (also referred to as a pooling layer) are included in the convolution layer, and it should be understood that the structure may also be considered to include the convolution layer, the layer where the activation function is located, the down-sampling layer (i.e., a pooling layer), and a full-connection layer, and of course, the structure may also include an input layer for inputting data and an output layer for outputting data, which are not described herein again.
Fully-connected layer: maps the learned features to the sample label space and mainly functions as a "classifier" within the whole convolutional neural network. Each node of the fully-connected layer is connected to all nodes output by the previous layer (e.g., the downsampling layer within the convolutional block); one node of the fully-connected layer is referred to as a neuron, and the number of neurons can be determined by the requirements of the practical application, for example, set to 512, or to 128, and so on. As with the convolutional layers, optionally, a non-linear factor may be added in the fully-connected layer via an activation function, for example the sigmoid function.
Specifically, for example, a sample video frame set can be acquired through multiple channels, the set including multiple sample video frames labeled with video content types. Then, the sample video frame that currently needs training is determined from the collected frames to obtain the current processing object. The current processing object is fed into a preset initial classification model for training to obtain a predicted value of the video content corresponding to that object. Next, the predicted value and the labeled video content type of the current processing object are converged to adjust the parameters of the preset initial classification model, and the process returns to the step of determining the sample video frame that currently needs training until all sample video frames are trained, finally yielding the video content classification model.
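A minimal sketch of this training loop follows; the Adam optimizer and cross-entropy loss are assumptions, since the text only says the predicted value and the labeled type are converged to adjust the model's parameters.

```python
import torch
import torch.nn as nn

def train_classification_model(model, sample_frames, labels, epochs=1, lr=1e-3):
    """Sketch of steps (61)-(65); sample_frames are (C, H, W) tensors and
    labels are 0-dim long tensors with the labeled content type."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frame, label in zip(sample_frames, labels):   # current processing object
            optimizer.zero_grad()
            pred = model(frame.unsqueeze(0))              # predicted value, step (63)
            loss = criterion(pred, label.unsqueeze(0))    # converge, step (64)
            loss.backward()
            optimizer.step()                              # adjust the parameters
    return model
```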
104. And respectively calculating the distance between each image feature and the plurality of clustering centers to obtain a distance set corresponding to each image feature.
For example, suppose there are 10 image features and 6 clustering centers; each image feature then corresponds to a distance set containing the distances between that image feature and each of the 6 clustering centers. The distance may be a Euclidean distance or, of course, a Mahalanobis distance, selected according to the actual situation; details are not repeated here.
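A minimal sketch of this distance-set computation, using the Euclidean distance mentioned above:

```python
import numpy as np

def distance_sets(image_features, cluster_centers):
    """For the 10-feature / 6-center example this returns a (10, 6) matrix
    whose i-th row is the distance set of image feature i. Euclidean
    distance is shown; Mahalanobis is the alternative mentioned above."""
    f = np.asarray(image_features)[:, None, :]   # shape (10, 1, D)
    c = np.asarray(cluster_centers)[None, :, :]  # shape (1, 6, D)
    return np.linalg.norm(f - c, axis=2)         # shape (10, 6)
```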
105. And determining the content category of the video to be detected according to the distance set and the plurality of clustering centers.
For example, specifically, 3 distances may be randomly selected from the distance set, and the content category of the video to be detected is determined according to the clustering centers corresponding to the 3 distances, that is, optionally, in some embodiments, the step "determining the content category of the video to be detected according to the distance set and the multiple clustering centers" may specifically include:
(71) selecting a preset number of distances from the distance set to obtain at least one target distance;
(72) determining a clustering center corresponding to the target distance to obtain at least one target clustering center;
(73) and determining the content category of the video to be detected according to at least one target clustering center.
A preset number of distances may be selected from the distance set to obtain at least one target distance in two ways:
the first mode is as follows: taking a video feature a and 6 cluster centers as an example for illustration, a distance set B corresponding to the video feature a includes a first distance B1, a second distance B2, a third distance B3, a fourth distance B4, a fifth distance B5 and a sixth distance B6, three distances may be randomly selected from the distance set B, for example, the second distance B2, the fifth distance B5 and the sixth distance B6 are selected, then a cluster center corresponding to the second distance B2, a cluster center corresponding to the fifth distance B5 and a cluster center corresponding to the sixth distance B6 are respectively obtained, and finally, based on content categories corresponding to the cluster centers, a content category of the video to be detected is determined, that is, optionally, in some embodiments, the step "determining the content category of the video to be detected according to at least one target cluster center" includes:
(81) respectively acquiring content categories corresponding to a plurality of target clustering centers;
(82) and determining the content category of the video to be detected based on the determined content category.
For example, after the content categories of the 3 target clustering centers corresponding to an image frame D to be detected are obtained as "X-position X person", "X-shadow X person", and "pseudo X person", distributed as shown in fig. 1c, it may be determined that the content category of image frame D is "cartoon". The same processing is performed on the other image frames, and finally the content category of the video to be detected is determined from the content categories of all the image frames to be detected. For example, if there are 100 image frames to be detected and 60 of them have the content category "cartoon", it may be determined that the content category of the video to be detected is "cartoon". That is, in some embodiments, the step of "determining the content category of the video to be detected based on the determined content category" may specifically include:
(91) calculating the proportion of the image frames to be detected corresponding to each content category in all the image frames to be detected;
(92) and when the proportion is larger than the preset proportion, determining the content type of the image frame to be detected corresponding to the proportion larger than the preset proportion as the content type of the video to be detected.
The preset proportion can be set according to actual requirements, and is not described in detail herein.
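A minimal sketch of steps (91) and (92) follows; the preset proportion value is an assumption.

```python
from collections import Counter

def video_category(frame_categories, preset_ratio=0.5):
    """Steps (91)-(92): compute each content category's proportion among
    all detected image frames and return the one exceeding the ratio."""
    counts = Counter(frame_categories)   # e.g. Counter({"cartoon": 60, ...}) of 100
    total = len(frame_categories)
    for category, count in counts.items():
        if count / total > preset_ratio:
            return category              # proportion exceeds the preset ratio
    return None                          # no category dominates
```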
To further improve the accuracy of detecting the video content, the three smallest distances may be selected from distance set B. Preferably, in some embodiments, N distances may be selected from the distance set to obtain N target distances, where N is a positive odd number.
The second mode is as follows: taking one video feature a and 6 cluster centers as an example, the distance set B corresponding to video feature a includes the first through sixth distances B1 to B6. The minimum distance in distance set B may be selected as the target distance; if the first distance B1 is the minimum, B1 is used as the target distance. That is, optionally, in some embodiments, the step of "selecting a preset number of distances from the distance set to obtain at least one target distance" may specifically include: selecting the minimum distance in the distance set as the target distance.
After a video to be detected is obtained, selecting a video frame for video content detection from the video to be detected to obtain a plurality of video frames to be detected, then extracting image features corresponding to the video frames to be detected, obtaining a clustering center to which each content category belongs, then respectively calculating distances between each image feature and the clustering centers to obtain a distance set corresponding to each image feature, and finally determining the content category of the video to be detected according to the distance set and the clustering centers. Compared with the existing video content detection scheme, the method has the advantages that the distance between each image feature and the plurality of clustering centers is calculated, and then the content category of the video to be detected is determined based on the distance set and the plurality of clustering centers, namely, the content category of each frame of image frame is considered in the actual detection process, so that the content category of the video to be detected is determined, and therefore, the accuracy of detecting the video content can be improved.
The method according to the examples is further described in detail below by way of example.
In this embodiment, the detection device of the video content is specifically integrated in the server as an example.
Referring to fig. 2a, a specific process of detecting video content may be as follows:
201. the server acquires a video to be detected.
The video to be detected may include a plurality of video frames, and the server may obtain the video to be detected in a plurality of ways, for example, the server may obtain the video from a designated database, which may be determined according to the requirements of the actual application, and the video to be detected may include a tv drama, a movie, a video recorded by a user, and the like.
202. The server selects video frames for video content detection from the videos to be detected to obtain a plurality of video frames to be detected.
For example, specifically, the server may select video frames for video content detection at intervals from the video to be detected according to an arrangement order of the video frames in the video to be detected to obtain a plurality of video frames to be detected, and in order to improve the calculation efficiency, the server may compress the number of the video frames, that is, delete a part of the video frames, and then select video frames for video content detection from the remaining video frames.
It should be noted that, in order to reduce the influence of the low information content frame on the subsequent video content detection, the server may further delete the low information content frame before selecting a video frame for video content detection from the video to be detected, where the low information content frame refers to a video frame with too simple color texture features, a video title frame, and the like, and specifically, an algorithm of image frame complexity may be used to detect all video frames.
203. And the server extracts the image characteristics corresponding to the video frames to be detected and acquires the clustering centers to which the content categories belong.
For example, the server may perform feature extraction on each video frame to be detected based on a trained video content classification model to obtain the image feature corresponding to each frame, where the video content classification model is trained on a plurality of sample video frames labeled with video content types. The server may feed each sample video frame into the trained model to obtain the sample image feature corresponding to each sample video frame, and then construct the clustering center to which each content type belongs from the sample image features.
It should be noted that the video content classification model may be pre-established, and please refer to the foregoing embodiment specifically, which is not described herein again.
204. And the server respectively calculates the distance between each image feature and the plurality of clustering centers to obtain a distance set corresponding to each image feature.
For example, there are 10 image features and 6 clustering centers, and each image feature corresponds to a distance set, where the distance set includes distances between the image features and the 6 clustering centers, respectively, where the distance may be represented by a euclidean distance, or, of course, may also be represented by a mahalanobis distance, which is specifically selected according to actual situations and is not described herein again.
205. And the server determines the content category of the video to be detected according to the distance set and the plurality of clustering centers.
The server determines the content category of the video to be detected according to the distance set and the plurality of clustering centers in two modes.
The first mode is as follows: taking one video feature a and 6 cluster centers as an example, the distance set B corresponding to video feature a includes a first distance B1, a second distance B2, a third distance B3, a fourth distance B4, a fifth distance B5, and a sixth distance B6. The server may randomly select three distances from distance set B, for example the second distance B2, the fifth distance B5, and the sixth distance B6; the server then respectively obtains the cluster centers corresponding to B2, B5, and B6; finally, the server determines the content category of the video to be detected based on the content categories corresponding to these cluster centers.
The second mode is as follows: taking one video feature a and 6 cluster centers as an example, the distance set B corresponding to video feature a includes the first through sixth distances B1 to B6. The server may select the minimum distance in distance set B as the target distance; if the first distance B1 is the minimum, B1 is used as the target distance. The server then obtains the cluster center corresponding to B1 and finally determines the content category of the video to be detected based on the content category corresponding to that cluster center.
To facilitate understanding of the method for detecting video content provided by the embodiment of the present invention, detecting which animation a video belongs to is taken as an example for further description. Referring to fig. 2b, the detection device for video content is integrated in a server, and the server can detect a video X to be detected according to a trained video content classification model. The method is divided into three stages: a training data acquisition stage, a video frame preprocessing stage, and a detection stage.
In the training data collection stage, the server may collect a sample video set through multiple channels, the set including multiple sample video frames labeled with video content types, for example, 100 animation videos and 1 non-animation video. Each training image frame has two tags: a first tag i indicating whether the video frame is animation, and a second tag k indicating which animation the video frame belongs to. A convolutional neural network is then used as the base network and trained on the sample video frame set to obtain the trained video content classification model. It should be noted that three loss functions may be used when training the convolutional neural network: a loss function L1 for determining whether the video is animation, a loss function L2 for determining which animation the video belongs to, and a loss function L3 for determining whether the video is animation, as shown in fig. 2c. In use, the loss functions are removed, yielding the video content classification model.
In the video frame preprocessing stage, the server may compress the number of frames, delete frames with low information content, and/or clip video frames, as described in the foregoing embodiments; details are not repeated here. It should be noted that the server may or may not preprocess the video frames, as chosen according to the actual situation. Then, the server extracts features from each sample video frame using the trained video content classification model, and finally constructs the clustering center to which each content category belongs based on the extracted features.
In the detection phase, when the server acquires an image feature Q of a frame, it calculates the Euclidean distances d between the image feature Q and all cluster centers. The cluster centers may be divided into a first cluster center set C1 and a second cluster center set C2, where C1 = {xi | i = 1, 2, ..., c1} and C2 = {yi | i = 1, 2, ..., c2}, i being a positive integer. The labels corresponding to the cluster centers in the first set may be labels such as "nice", "hot blood", and "heterogeneous"; the labels corresponding to the cluster centers in the second set may be labels such as "drama", "comedy", and "comedy inference". The clustering method may employ the K-means clustering algorithm. The server then calculates the Euclidean distance between image feature Q and each cluster center, obtaining the Euclidean distances L between image feature Q and every cluster center. Next, the shortest n Euclidean distances are selected, where n is a positive odd number, and the cluster centers corresponding to the selected distances are obtained as target cluster centers. For example, the shortest 3 Euclidean distances are selected and their cluster centers obtained: cluster center T1, cluster center T2, and cluster center T3. The content categories of T1, T2, and T3 are then determined respectively. If cluster center T1 is "Kaiguan", cluster center T2 is "different world Fair", and cluster center T3 is "comedy", it is determined that image feature Q belongs to cartoon; if cluster center T1 is "comedy", cluster center T2 is "different world Fair", and cluster center T3 is "comedy", it can be determined that image feature Q belongs to non-cartoon. In the invention, through these three steps it can be judged, via the image feature corresponding to each video frame to be detected, whether each frame of a video is cartoon. Preferably, if the cartoon frames of the video account for more than 30%, the video is considered an animation video; otherwise, it is judged a non-animation video.
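A minimal sketch of this detection phase follows; representing the two cluster-center sets as a single array with per-center animation flags is an assumption of convenience.

```python
import numpy as np

def classify_frame(feature_q, centers, center_is_animation, n=3):
    """Sketch of the detection steps above: Euclidean distances from image
    feature Q to all cluster centers, the n (odd) shortest distances, and a
    majority vote between animation and non-animation centers."""
    d = np.linalg.norm(np.asarray(centers) - feature_q, axis=1)
    nearest = np.argsort(d)[:n]                 # the n shortest Euclidean distances
    votes = sum(center_is_animation[i] for i in nearest)
    return votes > n // 2                       # True: Q is judged a cartoon frame

def classify_video(frame_features, centers, center_is_animation, ratio=0.30):
    flags = [classify_frame(f, centers, center_is_animation) for f in frame_features]
    # Per the text: more than 30% cartoon frames means an animation video.
    return sum(flags) / len(flags) > ratio
```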
as can be seen from the above, after the server acquires the video to be detected, the server selects the video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected, then the server extracts the image features corresponding to the video frames to be detected and acquires the clustering centers to which the content categories belong, then the server calculates the distances between the image features and the clustering centers respectively to obtain the distance sets corresponding to the image features, and finally the server determines the content categories of the video to be detected according to the distance sets and the clustering centers. Compared with the existing video content detection scheme, the server of the scheme calculates the distance between each image feature and the plurality of clustering centers, and then determines the content category of the video to be detected based on the distance set and the plurality of clustering centers, namely, in the actual detection process, the content category of each frame of image frame is considered, and then the content category of the video to be detected is determined, so that the accuracy of detecting the video content can be improved.
In order to better implement the method for detecting video content according to the embodiment of the present invention, an embodiment of the present invention further provides a detection apparatus (referred to as a detection apparatus for short) based on the video content. The terms are the same as those in the above-mentioned video content detection method, and specific implementation details can refer to the description in the method embodiment.
Referring to fig. 3a, fig. 3a is a schematic structural diagram of a detection apparatus for video content according to an embodiment of the present invention, where the detection apparatus may include a first obtaining module 301, a selecting module 302, an extracting module 303, a second obtaining module 304, a calculating module 305, and a determining module 306, and specifically may be as follows:
the first obtaining module 301 is configured to obtain a video to be detected.
For example, the first obtaining module 301 may obtain the video to be detected from the internet and/or a designated database, which may be determined according to requirements of actual applications, and the video to be detected may include a tv drama, a movie, a video recorded by a user, and the like.
The selecting module 302 is configured to select a video frame for video content detection from videos to be detected, so as to obtain a plurality of video frames to be detected.
For example, the selection module 302 may select video frames for video content detection at intervals from a video to be detected according to the arrangement order of the video frames in the video to be detected, obtaining a plurality of video frames to be detected. To improve computational efficiency, the selection module 302 may compress the number of video frames, that is, delete a part of the video frames, and then select video frames for video content detection from the remaining frames. That is, optionally, in some embodiments, the selection module 302 may specifically be configured to: detect the number of video frames in a video to be detected; judge whether the number is larger than a preset number; when the number is larger than the preset number, remove corresponding video frames from the video to be detected based on a preset strategy to obtain a reserved video frame set; and select a plurality of video frames at intervals from the reserved video frame set to obtain the video frames to be detected.
Optionally, in some embodiments, the selection module 302 may be further specifically configured to: and when the number is less than or equal to the preset number, selecting a plurality of video frames at intervals from the video to be detected to obtain the video frames to be detected.
The extracting module 303 is configured to extract image features corresponding to the video frames to be detected.
For example, the extracting module 303 may specifically perform feature extraction on each to-be-detected video frame based on a trained video content classification model to obtain an image feature corresponding to each to-be-detected video frame, where the video content classification model is trained by a plurality of sample video frames labeled with video content types.
The second obtaining module 304 is configured to obtain a clustering center to which each content category belongs.
The second obtaining module 304 may construct a clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames, that is, optionally, in some embodiments, the second obtaining module 304 includes:
the acquisition unit is used for acquiring the trained video content classification model and a plurality of sample video frames marked with video content types;
and the construction unit is used for constructing the clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames.
Optionally, in some embodiments, the building unit may specifically include:
the extraction subunit is used for respectively extracting the characteristics of each sample video frame by utilizing the trained video content classification model;
and the constructing subunit is used for constructing the clustering center to which each content category belongs based on the extracted features.
Optionally, in some embodiments, the building subunit may specifically be configured to: the method comprises the steps of obtaining a plurality of preset content labels, determining the number of classification to be clustered according to the preset content labels, and clustering extracted features based on a preset clustering algorithm and the number of classification to obtain a clustering center to which each content category belongs.
Optionally, in some embodiments, referring to fig. 3b, the detection apparatus further includes a training module 307. The training module 307 may specifically be configured to: collect a plurality of sample video frames marked with video content types; determine, from the collected sample video frames, the sample video frame that currently needs to be trained, to obtain a current processing object; import the current processing object into a preset initial classification model for training, to obtain a predicted value of the video content corresponding to the current processing object; converge the predicted value corresponding to the current processing object and the marked video content type of the current processing object, so as to adjust the parameters of the preset initial classification model; and return to the step of determining the sample video frame that currently needs to be trained from the collected sample video frames, until all the sample video frames have been trained.
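The training loop of module 307 might be sketched as below. Cross-entropy is assumed as the criterion for converging the predicted value with the marked content type, and Adam as the parameter-adjustment rule; the patent names neither.

```python
import torch
from torch import nn

def train_classifier(model: nn.Module, sample_frames, labels,
                     epochs: int = 1, lr: float = 1e-3):
    """Sketch of the per-sample training loop described above."""
    criterion = nn.CrossEntropyLoss()  # assumed convergence criterion
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        # Each (frame, label) pair plays the role of the "current processing
        # object"; iterating realizes the "return to the determining step".
        for frame, label in zip(sample_frames, labels):
            optimizer.zero_grad()
            prediction = model(frame.unsqueeze(0))        # predicted value
            loss = criterion(prediction, label.unsqueeze(0))
            loss.backward()
            optimizer.step()                              # adjust parameters
```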
The calculating module 305 is configured to calculate distances between each image feature and a plurality of clustering centers respectively, so as to obtain a distance set corresponding to each image feature.
For example, suppose there are 10 image features and 6 clustering centers; each image feature then corresponds to a distance set containing its distances to each of the 6 clustering centers. The distance may be represented by a Euclidean distance or, of course, by a Mahalanobis distance, which is selected according to the actual situation and is not described herein again.
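A vectorized sketch of this distance computation is given below, using the Euclidean distance as in the example above; a Mahalanobis variant would first whiten the features with the inverse covariance of the sample features. The function name is illustrative.

```python
import numpy as np

def distance_sets(image_features: np.ndarray,
                  cluster_centers: np.ndarray) -> np.ndarray:
    """Euclidean distance from every image feature to every cluster center.

    For, e.g., 10 features and 6 centers this returns a (10, 6) array:
    row i is the distance set corresponding to image feature i.
    """
    diff = image_features[:, None, :] - cluster_centers[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```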
The determining module 306 is configured to determine the content category of the video to be detected according to the distance sets and the plurality of cluster centers.
For example, the determining module 306 may randomly select 3 distances from the distance set and determine the content category of the video to be detected according to the cluster centers corresponding to these 3 distances.
Optionally, in some embodiments, the determining module 306 may specifically include:
the selection unit is used for selecting a preset number of distances from the distance set to obtain at least one target distance;
the first determining unit is used for determining a clustering center corresponding to the target distance to obtain at least one target clustering center;
and the second determining unit is used for determining the content category of the video to be detected according to the at least one target clustering center.
Optionally, in some embodiments, the selection unit may specifically be configured to: select the smallest distance in the distance set as the target distance.
Optionally, in some embodiments, the second determining unit may specifically be configured to: respectively acquire the content categories corresponding to the target clustering centers, and determine the content category of the video to be detected based on the determined content categories.
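Putting the selection unit and the two determining units together, one possible end-to-end decision rule is sketched below. It assumes the preset number of distances is 1 per frame (the smallest one, per the selection unit above) and that the per-frame categories are combined by majority vote; the vote rule is an illustrative assumption, as the patent only says the video's category is determined based on the determined categories.

```python
import numpy as np
from collections import Counter

def detect_video_category(distance_matrix: np.ndarray,
                          center_labels: list) -> str:
    """Decide the video's content category from the per-frame distance sets."""
    # Target distance per frame: the smallest distance selects the
    # target clustering center for that frame.
    nearest = distance_matrix.argmin(axis=1)
    # Content category corresponding to each target clustering center.
    frame_categories = [center_labels[i] for i in nearest]
    # Video-level category: majority vote over the per-frame categories.
    return Counter(frame_categories).most_common(1)[0][0]
```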
As can be seen, after the first obtaining module 301 obtains the video to be detected, the selecting module 302 selects video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected; the extracting module 303 then extracts the image features corresponding to the video frames to be detected, and the second obtaining module 304 obtains the clustering center to which each content category belongs; the calculating module 305 then calculates the distances between each image feature and the plurality of clustering centers to obtain a distance set corresponding to each image feature; and finally the determining module 306 determines the content category of the video to be detected according to the distance sets and the plurality of clustering centers. Compared with the existing video content detection scheme, this apparatus calculates the distance between each image feature and the plurality of clustering centers and then determines the content category of the video to be detected based on the distance sets and the plurality of clustering centers; that is, the content category of each selected video frame is taken into account in the actual detection process before the content category of the whole video is determined, so the accuracy of detecting the video content can be improved.
In addition, an embodiment of the present invention further provides an electronic device, as shown in fig. 4, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 4 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like, and the data storage area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components. Preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that charging, discharging, and power-consumption management functions are realized through the power management system. The power supply 403 may also include one or more DC or AC power sources, a recharging system, a power-failure detection circuit, a power converter or inverter, a power status indicator, and other such components.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
the method comprises the steps of obtaining a video to be detected, selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected, extracting image features corresponding to the video frames to be detected, obtaining a clustering center to which each content category belongs, calculating distances between each image feature and the clustering centers respectively to obtain a distance set corresponding to each image feature, and determining the content category of the video to be detected according to the distance set and the clustering centers.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
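Chained together, those instructions correspond to a pipeline along the following lines, reusing the illustrative helpers sketched earlier; all names are assumptions rather than the patent's API.

```python
import numpy as np

def detect_video_content(frames, cluster_centers, center_labels):
    """End-to-end sketch: frame selection -> features -> distances -> category."""
    to_detect = select_frames(frames)                    # video frames to be detected
    feats = extract_features(to_detect).numpy()          # image features per frame
    dists = distance_sets(feats, cluster_centers)        # one distance set per feature
    return detect_video_category(dists, center_labels)  # content category
```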
As can be seen from the above, after the server acquires the video to be detected, it selects video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected; it then extracts the image features corresponding to the video frames to be detected and obtains the clustering center to which each content category belongs; next, the server calculates the distances between each image feature and the plurality of clustering centers to obtain a distance set corresponding to each image feature; and finally, the server determines the content category of the video to be detected according to the distance sets and the plurality of clustering centers. Compared with the existing video content detection scheme, the server in this scheme calculates the distance between each image feature and the plurality of clustering centers and then determines the content category of the video to be detected based on the distance sets and the plurality of clustering centers; that is, the content category of each selected video frame is taken into account in the actual detection process before the content category of the whole video is determined, so the accuracy of detecting the video content can be improved.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present invention provides a storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the methods for detecting video content provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
the method comprises the steps of obtaining a video to be detected, selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected, extracting image features corresponding to the video frames to be detected, obtaining a clustering center to which each content category belongs, calculating distances between each image feature and the clustering centers respectively to obtain a distance set corresponding to each image feature, and determining the content category of the video to be detected according to the distance set and the clustering centers.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
Since the instructions stored in the storage medium can execute the steps in any video content detection method provided in the embodiment of the present invention, beneficial effects that can be achieved by any video content detection method provided in the embodiment of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
The foregoing describes in detail a video content detection method and apparatus, an electronic device, and a storage medium according to embodiments of the present invention. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is only intended to help understand the method and core idea of the present invention. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (13)

1. A method for detecting video content, comprising:
acquiring a video to be detected, wherein the video to be detected comprises a plurality of video frames;
selecting video frames for video content detection from the video to be detected to obtain a plurality of video frames to be detected;
extracting image characteristics corresponding to each video frame to be detected, and acquiring a clustering center to which each content category belongs;
respectively calculating the distance between each image feature and a plurality of clustering centers to obtain a distance set corresponding to each image feature;
and determining the content category of the video to be detected according to the distance set and the plurality of clustering centers.
2. The method according to claim 1, wherein the obtaining the clustering center to which each content category belongs comprises:
acquiring a trained video content classification model and a plurality of sample video frames marked with video content types, wherein the video content classification model is formed by training the plurality of sample video frames;
and constructing a clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames.
3. The method of claim 2, wherein constructing the clustering center to which each content category belongs based on the trained video content classification model and the plurality of sample video frames comprises:
respectively extracting the characteristics of each sample video frame by using the trained video content classification model;
and constructing a clustering center to which each content category belongs based on the extracted features.
4. The method of claim 3, wherein the constructing the cluster center to which each content category belongs based on the extracted features comprises:
acquiring a plurality of preset content tags;
determining the number of the classes to be clustered according to a plurality of preset content labels;
and clustering the extracted features based on a preset clustering algorithm and the classification number to obtain a clustering center to which each content category belongs.
5. The method of claim 2, wherein before obtaining the trained video content classification model, further comprising:
collecting a plurality of sample video frames marked with video content types;
determining a sample video frame needing training currently from a plurality of collected sample video frames to obtain a current processing object;
importing the current processing object into a preset initial classification model for training to obtain a predicted value of video content corresponding to the current processing object;
converging a predicted value corresponding to the current processing object and the marked video content type of the current processing object so as to adjust the parameters of the preset initial classification model;
and returning to the step of determining the sample video frame which needs to be trained currently from the collected multiple sample video frames until the multiple sample video frames are trained.
6. The method according to any one of claims 1 to 5, wherein determining the content category of the video to be detected according to the distance set and the plurality of cluster centers comprises:
selecting a preset number of distances from the distance set to obtain at least one target distance;
determining a clustering center corresponding to the target distance to obtain at least one target clustering center;
and determining the content category of the video to be detected according to at least one target clustering center.
7. The method according to claim 6, wherein determining the content category of the video to be detected according to the at least one target cluster center comprises:
respectively acquiring content categories corresponding to a plurality of target clustering centers;
and determining the content category of the video to be detected based on the determined content category.
8. The method of claim 6, wherein selecting a preset number of distances from the set of distances to obtain at least one target distance comprises:
and selecting the smallest distance from the distance set as the target distance.
9. The method according to any one of claims 1 to 5, wherein selecting a video frame for video content detection from the video to be detected to obtain a plurality of video frames to be detected comprises:
detecting the number of video frames in a video to be detected;
judging whether the number is larger than a preset number or not;
when the number is larger than the preset number, removing corresponding video frames from the video to be detected based on a preset strategy to obtain a reserved video frame set;
and selecting a plurality of video frames at intervals from the reserved video frame set to obtain the video frame to be detected.
10. The method of claim 9, further comprising:
and when the number is less than or equal to the preset number, selecting a plurality of video frames at intervals from the video to be detected to obtain the video frames to be detected.
11. An apparatus for detecting video content, comprising:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a video to be detected, and the video to be detected comprises a plurality of video frames;
the selection module is used for selecting video frames for video content detection from the videos to be detected to obtain a plurality of video frames to be detected;
the extraction module is used for extracting image characteristics corresponding to each video frame to be detected;
the second acquisition module is used for acquiring the clustering center to which each content category belongs;
the computing module is used for respectively computing the distances between the image features and the clustering centers to obtain a distance set corresponding to the image features;
and the determining module is used for determining the content category of the video to be detected according to the distance set and the plurality of clustering centers.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method for detecting video content according to any one of claims 1-10 are implemented when the program is executed by the processor.
13. A storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the method for detecting video content according to any one of claims 1-10.
CN202010027419.1A 2020-01-10 2020-01-10 Video content detection method and device, electronic equipment and storage medium Active CN111242019B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010027419.1A CN111242019B (en) 2020-01-10 2020-01-10 Video content detection method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111242019A true CN111242019A (en) 2020-06-05
CN111242019B CN111242019B (en) 2023-11-14

Family

ID=70874441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010027419.1A Active CN111242019B (en) 2020-01-10 2020-01-10 Video content detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111242019B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104303193A (en) * 2011-12-28 2015-01-21 派尔高公司 Clustering-based object classification
US20130251340A1 (en) * 2012-03-21 2013-09-26 Wei Jiang Video concept classification using temporally-correlated grouplets
CN105468781A (en) * 2015-12-21 2016-04-06 小米科技有限责任公司 Video query method and device
CN110119757A (en) * 2019-03-28 2019-08-13 北京奇艺世纪科技有限公司 Model training method, video category detection method, device, electronic equipment and computer-readable medium
CN110334753A (en) * 2019-06-26 2019-10-15 Oppo广东移动通信有限公司 Video classification methods, device, electronic equipment and storage medium
CN110427825A (en) * 2019-07-01 2019-11-08 上海宝钢工业技术服务有限公司 The video flame recognition methods merged based on key frame with quick support vector machines
CN110399890A (en) * 2019-07-29 2019-11-01 厦门美图之家科技有限公司 Image-recognizing method, device, electronic equipment and readable storage medium storing program for executing

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723719A (en) * 2020-06-12 2020-09-29 中国科学院自动化研究所 Video target detection method, system and device based on category external memory
CN111738171A (en) * 2020-06-24 2020-10-02 北京奇艺世纪科技有限公司 Video clip detection method and device, electronic equipment and storage medium
CN111738171B (en) * 2020-06-24 2023-12-08 北京奇艺世纪科技有限公司 Video clip detection method and device, electronic equipment and storage medium
CN113038142A (en) * 2021-03-25 2021-06-25 北京金山云网络技术有限公司 Video data screening method and device and electronic equipment
CN113038142B (en) * 2021-03-25 2022-11-01 北京金山云网络技术有限公司 Video data screening method and device and electronic equipment
WO2023088029A1 (en) * 2021-11-17 2023-05-25 北京字跳网络技术有限公司 Cover generation method and apparatus, device, and medium
CN114241367A (en) * 2021-12-02 2022-03-25 北京智美互联科技有限公司 Visual semantic detection method and system

Also Published As

Publication number Publication date
CN111242019B (en) 2023-11-14

Similar Documents

Publication Publication Date Title
CN111242019B (en) Video content detection method and device, electronic equipment and storage medium
CN112131978B (en) Video classification method and device, electronic equipment and storage medium
CN112232425B (en) Image processing method, device, storage medium and electronic equipment
CN110929622B (en) Video classification method, model training method, device, equipment and storage medium
CN111709497B (en) Information processing method and device and computer readable storage medium
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
CN111242844B (en) Image processing method, device, server and storage medium
US10032091B2 (en) Spatial organization of images based on emotion face clouds
CN111079833B (en) Image recognition method, image recognition device and computer-readable storage medium
CN111444826B (en) Video detection method, device, storage medium and computer equipment
CN111241345A (en) Video retrieval method and device, electronic equipment and storage medium
CN112818251B (en) Video recommendation method and device, electronic equipment and storage medium
US20230326173A1 (en) Image processing method and apparatus, and computer-readable storage medium
CN112906730B (en) Information processing method, device and computer readable storage medium
CN111046655B (en) Data processing method and device and computer readable storage medium
CN113128526B (en) Image recognition method and device, electronic equipment and computer-readable storage medium
CN113657272B (en) Micro video classification method and system based on missing data completion
CN112633425B (en) Image classification method and device
CN112883827B (en) Method and device for identifying specified target in image, electronic equipment and storage medium
CN115909336A (en) Text recognition method and device, computer equipment and computer-readable storage medium
CN113610953A (en) Information processing method and device and computer readable storage medium
CN110674716A (en) Image recognition method, device and storage medium
CN113705307A (en) Image processing method, device, equipment and storage medium
CN113824989A (en) Video processing method and device and computer readable storage medium
CN113704544A (en) Video classification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40024733
Country of ref document: HK
SE01 Entry into force of request for substantive examination
GR01 Patent grant