CN112818888A - Video auditing model training method, video auditing method and related device - Google Patents

Video auditing model training method, video auditing method and related device

Info

Publication number
CN112818888A
CN112818888A
Authority
CN
China
Prior art keywords
model
sample image
video
image
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110181850.6A
Other languages
Chinese (zh)
Inventor
丘林
眭哲豪
Current Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202110181850.6A
Publication of CN112818888A
Priority to PCT/CN2022/074703 (WO2022171011A1)
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148: Generating training patterns characterised by the process organisation or structure, e.g. boosting cascade
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a video audit model training method, a video audit method and a related device. The video audit model training method comprises the following steps: acquiring a first sample image and a classification label of the first sample image; initializing a video audit model, wherein the video audit model comprises a primary sub-model and a secondary sub-model; training the primary sub-model with the first sample image and calculating, according to the classification label, the classification loss rate of the primary sub-model in classifying the first sample image; and, when the classification loss rate is greater than a preset value, training the secondary sub-model with the first sample image. A first sample image whose classification loss rate exceeds the preset value is a hard sample image, one that is difficult to classify as a positive or negative sample, so the secondary sub-model is trained on such hard sample images and learns to distinguish them. As a result, the whole video audit model can accurately separate positive from negative samples and accurately identify illegal images in a video, improving the accuracy of video auditing.

Description

Video auditing model training method, video auditing method and related device
Technical Field
The embodiment of the invention relates to the technical field of video auditing, in particular to a video auditing model training method, a video auditing method and a related device.
Background
With the explosive growth of the mobile internet and the enforcement of network security law, content platform operators face increasingly strict scrutiny: malicious users are on the rise, and supervision of illegal content in videos has been strengthened. Video content auditing helps enterprises screen illegal images, videos, text and other content on their platforms; by filtering and deleting such content, a safe and healthy network environment can be established for users.
With the application of machine learning technology, the prior art typically audits videos with a trained video audit model. Live-broadcast scenes, however, are complex and particular: the scenes are varied and contain many objects; live screenshots are affected by lighting, camera equipment and the like, so image quality is often poor or blurred; and objects such as mobile phones, walkie-talkies and microphones are visually similar to illegal objects, so the precision of videos sent for manual review is low. Finally, in real online data the ratio of positive to negative samples is highly imbalanced. Together these factors cause false positives (FP) when a video audit model is used for auditing: the model cannot accurately distinguish negative samples from positive samples, and auditing accuracy is low.
Disclosure of Invention
The embodiment of the invention provides a video audit model training method, a video auditing method, corresponding apparatuses, an electronic device and a storage medium, aiming to solve the problem of low auditing accuracy caused by the difficulty prior-art video audit models have in distinguishing positive samples from negative samples.
In a first aspect, an embodiment of the present invention provides a method for training a video audit model, including:
acquiring a first sample image and a classification label of the first sample image;
initializing a video auditing model, wherein the video auditing model comprises a primary sub-model and a secondary sub-model;
training the first-level sub-model by using the first sample image and calculating the classification loss rate of the first-level sub-model for classifying the first sample image according to the classification label;
and when the classification loss rate is greater than a preset value, training the secondary sub-model by using the first sample image.
In a second aspect, an embodiment of the present invention provides a video auditing method, including:
acquiring a video image from a video to be audited;
inputting the video image into a pre-trained video auditing model to obtain a score of the video image belonging to an illegal image, wherein the video auditing model comprises a primary submodel and a secondary submodel, the primary submodel is used for predicting a first score of the video image belonging to the illegal image and outputting the first score when the first score is smaller than a preset value, and the secondary submodel is used for predicting a second score of the video image belonging to the illegal image and outputting the second score when the first score is larger than the preset value;
when the score is larger than a preset threshold value, auditing the video to be audited;
wherein the video audit model is trained by the video audit model training method of the first aspect.
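As a rough illustration of the cascaded scoring described in the second aspect, the sketch below routes an image through the primary sub-model and falls back to the secondary sub-model only when the first score reaches the preset value. The function names and the toy stand-in models are assumptions for illustration, not part of the patent:

```python
def cascade_score(image, primary, secondary, preset=0.5):
    """Cascaded scoring: the primary sub-model's score is final when it is
    below the preset value; otherwise the secondary sub-model re-scores."""
    first_score = primary(image)
    if first_score < preset:
        return first_score      # easy sample: the primary sub-model decides
    return secondary(image)     # hard sample: the secondary sub-model re-scores

# toy stand-ins for the two sub-models (assumed, for demonstration only)
primary = lambda img: 0.3 if img == "easy" else 0.8
secondary = lambda img: 0.9

print(cascade_score("easy", primary, secondary))  # 0.3
print(cascade_score("hard", primary, secondary))  # 0.9
```

The design point is that most video images are easy negatives and never reach the secondary sub-model, which is reserved for the hard cases it was trained on.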
In a third aspect, an embodiment of the present invention provides a video audit model training apparatus, including:
the system comprises a sample acquisition module, a classification module and a classification module, wherein the sample acquisition module is used for acquiring a first sample image and a classification label of the first sample image;
the model initialization module is used for initializing a video audit model, and the video audit model comprises a primary sub-model and a secondary sub-model;
the first-level sub-model training module is used for training the first-level sub-model by adopting the first sample image and calculating the classification loss rate of the first-level sub-model for classifying the first sample image according to the classification label;
and the secondary sub-model training module is used for training the secondary sub-model by adopting the first sample image when the classification loss rate is greater than a preset value.
In a fourth aspect, an embodiment of the present invention provides a video auditing apparatus, including:
the video image acquisition module is used for acquiring video images from videos to be audited;
the model prediction module is used for inputting the video image into a pre-trained video auditing model to obtain a score of the video image belonging to an illegal image, wherein the video auditing model comprises a primary submodel and a secondary submodel, the primary submodel is used for predicting a first score of the video image belonging to the illegal image and outputting the first score when the first score is smaller than a preset value, and the secondary submodel is used for predicting a second score of the video image belonging to the illegal image and outputting the second score when the first score is larger than the preset value;
the auditing module is used for auditing the video to be audited when the score is greater than a preset threshold value;
wherein the video audit model is trained by the video audit model training method of the first aspect.
In a fifth aspect, an embodiment of the present invention provides an electronic device, where the electronic device includes:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, causing the one or more processors to implement the video audit model training method of the first aspect of the invention and/or the video auditing method of the second aspect of the invention.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the video review model training method according to the first aspect of the present invention and/or the video review method according to the second aspect of the present invention.
The video audit model of the embodiment of the invention comprises a primary sub-model and a secondary sub-model. After the video audit model is initialized, the primary sub-model is trained with a first sample image, the classification loss rate of the primary sub-model in classifying the first sample image is calculated according to the classification label, and, when the classification loss rate is greater than a preset value, the secondary sub-model is trained with that first sample image. Because a first sample image whose classification loss rate (predicted and calculated by the primary sub-model) exceeds the preset value is a hard sample image that is difficult to classify as a positive or negative sample, the secondary sub-model is trained on hard sample images and learns to distinguish them. As a result, the whole video audit model can accurately separate positive from negative samples and accurately identify illegal images in a video, improving the accuracy of video auditing.
Drawings
Fig. 1 is a flowchart illustrating steps of a video audit model training method according to an embodiment of the present invention;
fig. 2A is a flowchart illustrating steps of a video audit model training method according to a second embodiment of the present invention;
FIG. 2B is a schematic structural diagram of a video audit model according to an embodiment of the present invention;
FIG. 2C is a schematic diagram of Densenet in an embodiment of the present invention;
FIG. 2D is a diagram of a residual module in an embodiment of the invention;
FIG. 2E is a schematic diagram of a primary submodel and a secondary submodel in an embodiment of the invention;
FIG. 2F is a schematic diagram of an attention mechanism module in an embodiment of the invention;
fig. 3 is a flowchart illustrating steps of a video auditing method according to a third embodiment of the present invention;
fig. 4 is a block diagram of a video audit model training apparatus according to a fourth embodiment of the present invention;
fig. 5 is a block diagram of a video auditing apparatus according to a fifth embodiment of the present invention;
fig. 6 is a schematic structural diagram of an electronic device according to a sixth embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures. The embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Example one
Fig. 1 is a flowchart of the steps of a video audit model training method according to the first embodiment of the present invention. The embodiment is applicable to training a video audit model for auditing videos. The method may be executed by the video audit model training apparatus of the embodiment, which may be implemented in hardware or software and integrated in the electronic device of the embodiment. Specifically, as shown in fig. 1, the video audit model training method may include the following steps:
s101, obtaining a first sample image and a classification label of the first sample image.
In the embodiment of the present invention, a sample image is an image used to train the video audit model. A sample image may contain an illegal object, such as a gun, a knife or riot-related content, and its classification label expresses whether the sample image is a normal image or an illegal image. In one example, the classification label is 0 when the sample image is a normal image and 1 when it is an illegal image.
In an alternative embodiment of the present invention, a plurality of original images may first be obtained, each original image may be subjected to image enhancement and normalization to obtain a plurality of sample images, and the classification label of each sample image is determined by an annotation operation. Illustratively, a plurality of video images may be captured from live videos as the original images; brightness, contrast and sharpness are adjusted to enhance each original image; the images are resized to a uniform size, for example 224 x 224 pixels; and finally the pixel values are normalized to obtain the sample images. The classification label is annotated by manually determining whether the sample image contains an illegal object: if it does, the label is 1; otherwise, the label is 0.
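The normalization step above can be sketched in a few lines. The patent does not specify the exact normalization scheme, so the simple scale-to-[0, 1] mapping below is an illustrative assumption, with the resizing and enhancement steps elided:

```python
def normalize(image):
    """Scale 8-bit pixel values into [0, 1].

    `image` is a nested list of shape H x W x 3 (rows of RGB pixels).
    Resizing to 224 x 224 and the brightness/contrast/sharpness
    enhancement described above are omitted for brevity."""
    return [[[p / 255.0 for p in px] for px in row] for row in image]

img = [[[0, 128, 255]]]      # a 1x1 RGB "image"
norm = normalize(img)        # values now lie in [0, 1]
```

In practice the same transform would be applied identically at training and inference time so the model sees consistently scaled inputs.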
S102, initializing a video auditing model, wherein the video auditing model comprises a primary sub-model and a secondary sub-model.
In the embodiment of the invention, the video audit model comprises a cascaded primary sub-model and secondary sub-model, wherein the primary sub-model is used for predicting a first score that the sample image belongs to an illegal image, and the secondary sub-model is used for predicting a second score that the sample image belongs to an illegal image when the first score is greater than a preset value. Specifically, the primary and secondary sub-models may be classification neural networks, for example VGG, ResNet or DenseNet. Before training the video audit model, the primary and secondary sub-models can be constructed and their model parameters initialized.
S103, training the primary sub-model by adopting the first sample image and calculating the classification loss rate of the primary sub-model for classifying the first sample image according to the classification label.
Specifically, a first sample image may be randomly drawn from the plurality of first sample images and input into the initialized primary sub-model to obtain a score that the image belongs to an illegal image, and the classification loss rate of the primary sub-model in classifying that image may be calculated from the score and the image's classification label.
After a first sample image is input to train the primary sub-model and the classification loss rate is calculated, the model parameters of the primary sub-model can be adjusted according to the classification loss rate. Illustratively, gradients can be computed from the classification loss rate and the model parameters updated by gradient descent, and the primary sub-model is trained iteratively in this way until a preset number of iterations is reached or the classification loss rate falls below a preset threshold, yielding the trained primary sub-model.
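The "classification loss rate" is not pinned to a particular loss function in the patent; binary cross-entropy is one common choice for a two-class (normal vs. illegal) score, shown here as an assumed example:

```python
import math

def bce_loss(score, label):
    """Binary cross-entropy between a predicted violation score in (0, 1)
    and a 0/1 classification label -- one common choice for the
    'classification loss rate'; the patent does not fix the loss function."""
    eps = 1e-7                              # guard against log(0)
    score = min(max(score, eps), 1 - eps)
    return -(label * math.log(score) + (1 - label) * math.log(1 - score))

print(bce_loss(0.9, 1))   # small loss: confident and correct
print(bce_loss(0.9, 0))   # large loss: confident but wrong -- a hard sample
```

A confidently wrong prediction yields a large loss, which is exactly the signal used in S104 to flag hard samples for the secondary sub-model.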
And S104, when the classification loss rate is larger than a preset value, training the secondary sub-model by using the first sample image.
After each iteration of training the primary sub-model, if the classification loss rate of the primary sub-model for the first sample image is greater than the preset value, that image is a hard sample image whose positive or negative status is difficult to determine, and it can be used to train the secondary sub-model so that the secondary sub-model learns to classify hard sample images as positive or negative. Specifically, first sample images whose classification loss rate exceeds the preset value are input into the secondary sub-model, the classification loss rate of the secondary sub-model is obtained, and the model parameters of the secondary sub-model are adjusted according to that loss rate until a preset number of iterations is reached or the secondary sub-model's classification loss rate falls below a preset threshold.
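The hard-sample routing of S103-S104 can be sketched as follows. Here `primary_step` and `secondary_step` stand in for one training iteration of each sub-model and are illustrative assumptions:

```python
def train_epoch(samples, primary_step, secondary_step, preset=0.7):
    """Every sample trains the primary sub-model; only samples whose primary
    classification loss exceeds the preset value (the 'hard' samples) also
    train the secondary sub-model. The preset value is an assumed example."""
    hard = []
    for image, label in samples:
        loss = primary_step(image, label)   # train primary, get its loss
        if loss > preset:                   # hard positive/negative sample
            secondary_step(image, label)    # route it to the secondary model
            hard.append(image)
    return hard

losses = {"a": 0.2, "b": 0.9}               # pretend per-sample losses
trained = []
hard = train_epoch([("a", 0), ("b", 1)],
                   primary_step=lambda x, y: losses[x],
                   secondary_step=lambda x, y: trained.append(x))
print(hard)   # ['b'] -- only the hard sample reaches the secondary model
```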
The video audit model of the embodiment of the invention comprises a primary sub-model and a secondary sub-model. After the video audit model is initialized, the primary sub-model is trained with a first sample image, the classification loss rate of the primary sub-model in classifying the first sample image is calculated according to the classification label, and, when the classification loss rate is greater than a preset value, the secondary sub-model is trained with that first sample image. Because a first sample image whose classification loss rate (predicted and calculated by the primary sub-model) exceeds the preset value is a hard sample image that is difficult to classify as a positive or negative sample, the secondary sub-model is trained on hard sample images and learns to distinguish them. As a result, the whole video audit model can accurately separate positive from negative samples and accurately identify illegal images in a video, improving the accuracy of video auditing.
Example two
Fig. 2A is a flowchart of steps of a video audit model training method according to a second embodiment of the present invention, where the embodiment of the present invention is optimized based on the first embodiment, specifically, as shown in fig. 2A, the video audit model training method according to the embodiment of the present invention may include the following steps:
s201, obtaining a first sample image and a classification label of the first sample image.
In an optional embodiment of the present invention, a plurality of video images may be captured from a video and subjected to image enhancement and normalization to obtain a plurality of first sample images, whose classification labels are obtained by manual annotation. In one example, the classification label is 0 when the first sample image does not contain an illegal object and 1 when it does. Of course, a certain number of images may also be randomly drawn from a network image library as sample images rather than captured from video; the embodiment of the present invention places no limitation on how the first sample images are obtained.
S202, initializing a video auditing model, wherein the video auditing model comprises a primary sub-model and a secondary sub-model.
As shown in fig. 2B, the video review model according to the embodiment of the present invention includes a first-level sub model and a second-level sub model, which are cascaded, where the first-level sub model is used to predict a first score of a sample image belonging to an illegal image, and the second-level sub model is used to predict a second score of the sample image belonging to the illegal image when the first score is greater than a preset value. Before training the video audit model, model parameters of the primary sub-model and the secondary sub-model may be initialized.
Optionally, the primary sub-model may be a DenseNet; fig. 2C is a schematic diagram of DenseNet. In DenseNet all network layers are connected, i.e., each layer takes the outputs of all preceding layers as additional inputs, so each layer can reuse the output features of every layer before it; this feature reuse improves efficiency. The secondary sub-model may be a ResNet, which reduces the difficulty of training deep networks through residual learning by introducing residual modules into a fully convolutional network. Fig. 2D is a schematic diagram of a residual module. Each residual module has two paths: one passes the input features straight through, the other applies two or three convolution operations to the input features to obtain their residual, and the features on the two paths are finally added. The residual module thus eases the training of deep networks and makes features easier to extract. Of course, those skilled in the art may set the network types of the primary and secondary sub-models according to actual needs when implementing the embodiment of the present invention, which is not limited in this respect.
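The two-path residual module described above reduces to "output = input + transform(input)". A minimal sketch, with scalars standing in for feature maps and `transform` standing in for the two-to-three convolution operations (both simplifications are assumptions for illustration):

```python
def residual_block(x, transform):
    """A residual module: one path carries the input straight through, the
    other applies the convolution stack (here just `transform`), and the
    two paths are added."""
    return x + transform(x)

# a zero transform passes the input through unchanged -- the identity
# shortcut that makes deep networks easier to train
print(residual_block(5.0, lambda v: 0.0))      # 5.0
print(residual_block(5.0, lambda v: 0.5 * v))  # 7.5
```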
S203, roughly training the primary sub-model with the specified number of first sample images to obtain a rough primary sub-model and, for each first sample image, a first score that it belongs to an illegal image.
In the embodiment of the invention, when the video audit model is trained, the primary and secondary sub-models may first be roughly trained, i.e., trained on a specified number of first sample images for a certain number of iterations to obtain a rough primary sub-model and a rough secondary sub-model.
Fig. 2E shows the network structure of the primary and secondary sub-models. Each sub-model contains five groups of convolutional layers, with a pooling layer between adjacent groups for spatial down-sampling. Within a group, several consecutive 3 x 3 convolution operations are applied; the number of convolution kernels grows from 64 in the first group to 512 in the last group and is the same within a group. Two fully connected layers follow the last convolutional group, and a classification layer follows the fully connected layers.
In an alternative embodiment of the present invention, the primary and secondary sub-models may be convolutional neural networks with an attention mechanism module added, i.e., an attention mechanism module is inserted after some of the convolutional layers of the primary and secondary sub-models to replace the pooling layer. As shown in fig. 2F, a schematic diagram of the attention mechanism module, the module comprises a channel attention sub-module and a spatial attention sub-module.
When the primary sub-model is roughly trained with the specified number of first sample images, a first sample image is input into the primary sub-model; for each convolutional layer followed by an attention mechanism module, the layer's output features are fed into the module, and the module's final output features are fed into the next convolutional layer. The output features of the last convolutional layer pass through the fully connected layers and the classification layer in turn to obtain the first score that the first sample image belongs to an illegal image. The process then returns to inputting the next first sample image into the primary sub-model until the specified number of first sample images have been input, training the primary sub-model a certain number of times and yielding the rough primary sub-model.
As shown in fig. 2F, in the attention mechanism module the convolutional layer's output features are input to the channel attention sub-module to obtain channel features; the channel features are multiplied with the convolutional layer's output features to obtain intermediate features; the intermediate features are input to the spatial attention sub-module to obtain spatial features; and the spatial features are multiplied with the intermediate features to obtain the module's final output features, which are input into the next convolutional layer.
Specifically, as shown in fig. 2F, in the channel attention sub-module the convolutional layer's output features pass through a max pooling layer and an average pooling layer, then through a shared perceptron, producing channel feature 1 and channel feature 2; the two are added and passed through a sigmoid activation to obtain the sub-module's final channel features. The channel features are multiplied with the convolutional layer's output features to obtain the intermediate features, which serve as the input of the spatial attention sub-module. There, the intermediate features pass through a max pooling layer and an average pooling layer respectively, then a convolution operation, and finally a sigmoid activation to obtain the sub-module's final spatial features. The spatial features are multiplied with the intermediate features to obtain the final output features of the whole attention mechanism module, which are input into the next convolutional layer; the classification layer of the primary sub-model finally outputs the first score that the first sample image belongs to an illegal image.
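A minimal sketch of the channel attention sub-module described above. For brevity the shared perceptron is replaced by an identity mapping (an assumption, not the patent's design), and the spatial attention sub-module, which pools across channels instead of spatial positions, is analogous and omitted:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_attention(fmap):
    """One weight per channel: max-pooled and average-pooled descriptors
    are added and passed through a sigmoid (the shared perceptron between
    pooling and addition is omitted here)."""
    weights = []
    for ch in fmap:                                  # ch is an H x W grid
        flat = [v for row in ch for v in row]
        weights.append(sigmoid(max(flat) + sum(flat) / len(flat)))
    return weights

def apply_channel_attention(fmap):
    """Multiply each channel of the feature map by its attention weight."""
    w = channel_attention(fmap)
    return [[[v * w[c] for v in row] for row in ch]
            for c, ch in enumerate(fmap)]

fmap = [[[1.0, 2.0], [3.0, 4.0]],                    # 2-channel 2x2 feature map
        [[0.0, 0.0], [0.0, 0.0]]]
out = apply_channel_attention(fmap)
# the all-zero channel stays zero; the active channel is re-weighted
```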
And S204, calculating the classification loss rate of the sample image by using the first score of the first sample image and the classification label.
In the embodiment of the present invention, the classification layer of the primary sub-model outputs a first score that the first sample image belongs to the violation image, where the first score may be a probability value, and then the classification loss rate of the primary sub-model classifying the first sample image may be calculated by the first score and the classification label of the first sample image.
It should be noted that, after the primary submodel is trained once per iteration, the model parameters of the primary submodel are adjusted according to the classification loss rate.
S205, after roughly training the primary sub-model with a first sample image, if the classification loss rate is greater than the preset value, roughly training the secondary sub-model with that first sample image to obtain a rough secondary sub-model, until the specified number of first sample images have been used to roughly train the primary sub-model.
In the embodiment of the invention, after each iteration of training the primary sub-model, if the classification loss rate L1 of the primary sub-model for the first sample image is greater than the preset value, that image can be determined to be a hard sample that is difficult to classify as positive or negative, and it can be used to roughly train the secondary sub-model. After that iteration of training the secondary sub-model, training returns to the primary sub-model, until all of the specified number of first sample images have been used to roughly train the primary sub-model, yielding the rough primary and secondary sub-models. Rough training of the secondary sub-model follows the same process described in S203-S204 for the primary sub-model and is not repeated here.
And S206, acquiring a thermodynamic diagram of the first sample image.
In the embodiment of the invention, the thermodynamic diagram (heat map) expresses the mapping relationship between the first score, predicted by the primary submodel, that the first sample image belongs to the illegal image and the sensitive areas in the first sample image; that is, it shows which areas of the first sample image the predicted first score is most sensitive to.
In one example, all of the first sample images may be input into the trained rough first-level sub-model to obtain a second score that each first sample image belongs to the violation image, and a thermodynamic diagram of the first sample image may be generated based on the Grad-CAM and the second score.
Specifically, the partial derivatives of the second score that the first sample image belongs to the violation image may be calculated with respect to all pixels Aij of the feature map output by the fully connected layer of the primary sub-model; the partial derivatives are then globally averaged over the width and height dimensions of the feature map to obtain the sensitivity of the violation object in the first sample image with respect to the kth channel of that feature map; finally, the per-channel sensitivities at each pixel point are combined in a weighted linear combination to obtain the thermodynamic diagram.
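A minimal NumPy sketch of the Grad-CAM combination step just described — gradients globally averaged over the width and height dimensions give per-channel sensitivities, which are then combined in a weighted linear fashion — assuming the feature map and the partial derivatives of the score are already available as arrays:

```python
import numpy as np

def grad_cam_heatmap(feature_map, gradients):
    """feature_map, gradients: (K, H, W) arrays holding the feature map A
    and the partial derivatives of the violation score w.r.t. each Aij."""
    weights = gradients.mean(axis=(1, 2))             # global average over width/height
    cam = np.tensordot(weights, feature_map, axes=1)  # weighted linear combination over channels
    cam = np.maximum(cam, 0.0)                        # keep positively sensitive regions only
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam
```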
And S207, splicing the thermodynamic diagram and the first sample image to obtain a second sample image.
In one example, the first sample image may be represented as H × W × 3, where H and W are the numbers of pixels of the first sample image in the height and width directions, and 3 is the number of RGB channels of the first sample image. On this basis, a fourth channel with a value of 0 is added to the first sample image, that is, the first sample image is represented as H × W × 4 with the fourth channel filled with 0. After the thermodynamic diagram of the first sample image is generated, the pixel values of the thermodynamic diagram can be taken as the values of the fourth channel of the first sample image, so that the thermodynamic diagram and the first sample image are spliced to obtain a second sample image H × W × 4, where the fourth channel holds the pixel values of the thermodynamic diagram.
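The splicing in S207 amounts to concatenating the thermodynamic diagram onto the RGB image as a fourth channel; a sketch (the function name is assumed):

```python
import numpy as np

def splice_heatmap(first_sample, heatmap):
    """Splice an (H, W) thermodynamic diagram onto an (H, W, 3) RGB image
    as a fourth channel, yielding an (H, W, 4) second sample image."""
    assert first_sample.shape[:2] == heatmap.shape
    return np.concatenate([first_sample, heatmap[..., None]], axis=-1)
```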
And S208, training the rough first-level sub-model by adopting the second sample image to obtain a finally trained first-level sub-model.
In an alternative embodiment, the fourth-channel values of a specified number of second sample images may be randomly set to 0 to obtain third sample images, and the second sample images and the third sample images are used to train the coarse first-level sub-model to obtain a finally trained first-level sub-model. Specifically, the pixel values of the highlighted portion in a portion of the second sample images may be set to 0, that is, the fourth-channel values that are greater than a preset threshold are set to 0 to obtain a third sample image; the second sample images and the third sample images are then randomly selected to iteratively train the rough first-level sub-model until the number of training iterations reaches a preset number or the loss rate falls below a preset threshold, so as to obtain the trained first-level sub-model.
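Zeroing the highlighted fourth-channel values to derive a third sample image can be sketched as follows (the function name and the exact threshold semantics are assumptions):

```python
import numpy as np

def make_third_sample(second_sample, threshold):
    """Zero the fourth-channel (heat-map) values above `threshold`, i.e.
    the highlighted regions, producing a third sample image. The original
    second sample image is left unmodified."""
    third = second_sample.copy()
    channel4 = third[..., 3]            # view into the copy, not the original
    channel4[channel4 > threshold] = 0.0
    return third
```

Randomly mixing such masked images into training acts as an occlusion-style augmentation, which is why the text links it to avoiding overfitting.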
S209, determining a fourth sample image with the classification loss rate larger than a preset value from the first sample image.
After the first sample images are input into the rough first-level submodel, the scores of all the first sample images belonging to the illegal image can be obtained, and the classification loss rate of each first sample image can be calculated from these scores. The first sample images whose classification loss rate is greater than a preset value can then be used as fourth sample images; the fourth sample images are difficult samples that the first-level submodel can hardly distinguish as positive or negative.
And S210, acquiring a thermodynamic diagram of the fourth sample image.
Specifically, the fourth sample image may be input into the trained rough secondary sub-model to obtain a third score that the fourth sample image belongs to the violation image, and a thermodynamic diagram of the fourth sample image is generated based on the Grad-CAM and the third score; reference may be made to S206 for details, which are not repeated here.
And S211, splicing the thermodynamic diagram and the fourth sample image to obtain a fifth sample image.
Specifically, the pixel value of the pixel point in the thermodynamic diagram may be used as the channel value of the fourth channel of the fourth sample image to splice the thermodynamic diagram and the fourth sample image, and the specific details may refer to S207, which is not described in detail herein.
S212, training the rough secondary submodel by using the fifth sample image to obtain a finally trained secondary submodel.
Training the rough secondary sub-model using the fifth sample image may refer to training the rough primary sub-model in S208, which is not described in detail herein.
In an optional embodiment of the invention, the last convolution layer of the rough secondary submodel adopts a deformable convolution kernel, so that the receptive field of the secondary submodel is variable; the secondary submodel can thus better learn the characteristics of the illegal object, and its capability of identifying the illegal object is enhanced.
The video auditing model of the embodiment of the invention comprises a first-level sub-model and a second-level sub-model. After the video auditing model is initialized, the first-level sub-model is trained by adopting a first sample image, and the classification loss rate of the first-level sub-model for classifying the first sample image is calculated according to the classification label; the second-level sub-model is trained by adopting the first sample image when the classification loss rate is greater than a preset value. The classification loss rate of the first sample image is obtained through the prediction and calculation of the first-level sub-model, and a first sample image whose classification loss rate is greater than the preset value is a difficult sample image for which positive and negative samples are hard to distinguish, so the second-level sub-model can be trained with such difficult sample images and thereby learn the capability of distinguishing difficult samples. Finally, the whole video auditing model can accurately distinguish positive and negative samples and accurately determine illegal images in a video, so that the accuracy of video auditing is improved.
Further, the first sample image is adopted to conduct rough training on the first-level sub-model and the second-level sub-model, the sample image spliced with the thermodynamic diagram is adopted to conduct training on the first-level sub-model and the second-level sub-model after rough training, on one hand, rough training can accelerate model convergence, on the other hand, the thermodynamic diagram is added into the sample image, weak supervision data is provided for model training, and the classification accuracy rate of the video auditing model on the images is improved.
Furthermore, an attention mechanism module is added in the first-level sub-model and the second-level sub-model, so that the models focus on the local area of the illegal object in the image, and the capability of the video auditing model for detecting the illegal object is improved.
Furthermore, the last convolution layer of the secondary submodel adopts a variable convolution kernel, so that the secondary submodel can better learn the characteristics of the illegal object, and the capability of the secondary submodel for identifying the illegal object is improved.
Furthermore, the pixel values of the highlighted areas in the thermodynamic diagram are randomly set to 0, which can avoid overfitting of the model and improve the capability and robustness of the model in identifying occluded illegal objects.
EXAMPLE III
Fig. 3 is a flowchart of steps of a video auditing method according to a third embodiment of the present invention, where the third embodiment of the present invention is applicable to a situation where a trained video auditing model is used to audit a video, and the method may be implemented by a video auditing apparatus according to an embodiment of the present invention, where the video auditing apparatus may be implemented by hardware or software, and is integrated in an electronic device according to an embodiment of the present invention, specifically, as shown in fig. 3, the video auditing method according to an embodiment of the present invention may include the following steps:
S301, obtaining a video image from a video to be audited.
In the embodiment of the present invention, the video to be audited may be, for example, a live video on a live streaming platform, a short video on a short video platform, or a long video, and the like. After the video to be audited is determined, a certain number of video images can be captured from it; for example, video images can be obtained from the video to be audited at a certain sampling rate, or at certain time intervals.
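One hedged way to realize sampling at fixed time intervals (the patent leaves the sampling scheme open; the names and parameters below are illustrative):

```python
def sample_frame_indices(total_frames, fps, interval_seconds):
    """Indices of the frames captured from the video to be audited when
    one image is taken every `interval_seconds` of playback."""
    step = max(1, int(round(fps * interval_seconds)))  # frames per interval, at least 1
    return list(range(0, total_frames, step))
```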
S302, inputting the video image into a pre-trained video auditing model to obtain a score of the video image belonging to the illegal image, wherein the video auditing model comprises a primary submodel and a secondary submodel, the primary submodel is used for predicting a first score of the video image belonging to the illegal image and outputting the first score when the first score is smaller than a preset value, and the secondary submodel is used for predicting a second score of the video image belonging to the illegal image and outputting the second score when the first score is larger than the preset value.
The video audit model of the embodiment of the present invention can be trained by the video audit model training method of the first embodiment or the second embodiment, where the video audit model includes a first-level submodel and a second-level submodel that are cascaded, a video image is first input into the first-level submodel to obtain a first score that the video image belongs to an illegal image, if the first score is smaller than a preset value, the video audit model outputs the first score, and if the first score is larger than the preset value, the video image is input into the second-level submodel to obtain a second score that the video image belongs to the illegal image and outputs the second score.
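The cascaded scoring described above can be sketched as a simple two-stage dispatch (names are illustrative; `primary` and `secondary` stand for the trained sub-models' score functions):

```python
def cascade_score(video_image, primary, secondary, preset_value):
    """Cascaded scoring: return the primary sub-model's first score when it
    is below the preset value; otherwise re-score the borderline image with
    the secondary sub-model and return its second score."""
    first_score = primary(video_image)
    if first_score < preset_value:
        return first_score
    return secondary(video_image)
```

The design keeps the cheap primary model on the common path and only spends the secondary model's capacity on images the primary model flags as potentially violating.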
And S303, when the score is larger than a preset threshold value, auditing the video to be audited.
If the score of a video image is larger than the preset threshold value, it is indicated that the video image likely contains an illegal object; the user ID of the video to be audited and the video image can be sent to the back end, where the video is audited manually.
The video auditing model comprises a first-level submodel and a second-level submodel. A video image of a video to be audited is first input into the first-level submodel to obtain a first score that the video image belongs to the illegal image; if the first score is smaller than a preset value, the video auditing model outputs the first score, and if the first score is larger than the preset value, the video image is input into the second-level submodel to obtain and output a second score that the video image belongs to the illegal image. The video auditing model adopts two cascaded sub-models. During training, the classification loss rate of the first sample image is obtained through the prediction and calculation of the first-level sub-model, and a first sample image whose classification loss rate is larger than the preset value is a difficult sample image for which positive and negative samples are hard to distinguish, so the second-level sub-model can be trained with the difficult sample images and learn the capability of distinguishing difficult samples. Finally, the whole video auditing model can accurately distinguish positive and negative samples and accurately determine the illegal images in the video, and the accuracy of video auditing is improved.
EXAMPLE IV
Fig. 4 is a block diagram of a video audit model training apparatus according to a fourth embodiment of the present invention, and as shown in fig. 4, the video audit model training apparatus according to the fourth embodiment of the present invention includes:
a sample obtaining module 401, configured to obtain a first sample image and a classification label of the first sample image;
a model initialization module 402, configured to initialize a video audit model, where the video audit model includes a primary sub-model and a secondary sub-model;
a first-level sub-model training module 403, configured to train the first-level sub-model with the first sample image and calculate a classification loss rate of the first-level sub-model for classifying the first sample image according to the classification label;
and a secondary sub-model training module 404, configured to train the secondary sub-model by using the first sample image when the classification loss rate is greater than a preset value.
The video audit model training device provided by the embodiment of the invention can execute the video audit model training method provided by the first embodiment and the second embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE V
Fig. 5 is a block diagram of a video auditing apparatus according to a fifth embodiment of the present invention, and as shown in fig. 5, the video auditing apparatus according to the fifth embodiment of the present invention may specifically include the following modules:
a video image obtaining module 501, configured to obtain a video image from a video to be audited;
the model prediction module 502 is configured to input the video image into a pre-trained video review model to obtain a score that the video image belongs to an illegal image, where the video review model includes a primary submodel and a secondary submodel, the primary submodel is configured to predict a first score that the video image belongs to the illegal image and output the first score when the first score is smaller than a preset value, and the secondary submodel is configured to predict a second score that the video image belongs to the illegal image and output the second score when the first score is larger than the preset value;
the auditing module 503 is configured to audit the video to be audited when the score is greater than a preset threshold;
the video audit model is trained by the video audit model training method described in the first embodiment or the second embodiment.
The video auditing device provided by the embodiment of the invention can execute the video auditing method provided by the third embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
EXAMPLE VI
Referring to fig. 6, a schematic structural diagram of an electronic device in one example of the invention is shown. As shown in fig. 6, the electronic device may specifically include: a processor 601, a storage device 602, a display screen 603 with touch functionality, an input device 604, an output device 605, and a communication device 606. The number of the processors 601 in the electronic device may be one or more, and one processor 601 is taken as an example in fig. 6. The processor 601, the storage device 602, the display 603, the input device 604, the output device 605, and the communication device 606 of the electronic apparatus may be connected by a bus or other means, and fig. 6 illustrates an example of connection by a bus. The electronic device is used for executing the video auditing model training method and/or the video auditing method provided by any embodiment of the invention.
Embodiments of the present invention further provide a computer-readable storage medium, where instructions in the storage medium, when executed by a processor of a device, enable the device to perform a video audit model training method and/or a video audit method as described in the above method embodiments.
It should be noted that, as for the embodiments of the apparatus, the electronic device, and the storage medium, since they are basically similar to the embodiments of the method, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the embodiments of the method.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious modifications, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (16)

1. A video audit model training method is characterized by comprising the following steps:
acquiring a first sample image and a classification label of the first sample image;
initializing a video auditing model, wherein the video auditing model comprises a primary sub-model and a secondary sub-model;
training the first-level sub-model by using the first sample image and calculating the classification loss rate of the first-level sub-model for classifying the first sample image according to the classification label;
and when the classification loss rate is greater than a preset value, training the secondary sub-model by using the first sample image.
2. The video audit model training method of claim 1, wherein the obtaining a first sample image and a classification label of the first sample image comprises:
acquiring an original image;
carrying out image enhancement processing and normalization processing on the original image to obtain a first sample image;
and determining a classification label of the first sample image based on an annotation operation, wherein the classification label indicates that the first sample image is a normal image or an illegal image.
3. The method of claim 1, wherein the training the primary sub-model with the first sample image and calculating a classification loss rate of the primary sub-model for classifying the first sample image according to the classification label comprises:
carrying out rough training on the primary sub-model by using a specified number of first sample images to obtain a rough primary sub-model and a first score of each first sample image belonging to an illegal image;
calculating a classification loss rate of the first sample image using the first score of the first sample image and the classification label;
acquiring a thermodynamic diagram of the first sample image;
splicing the thermodynamic diagram and the first sample image to obtain a second sample image;
and training the rough first-level sub-model by adopting the second sample image to obtain a finally trained first-level sub-model.
4. The video audit model training method of claim 3 wherein the primary submodel includes a convolutional layer, an attention mechanism module, a full link layer, and a classification layer, and the performing coarse training on the primary submodel using a specified number of the first sample images to obtain a coarse primary submodel and a first score for each of the first sample images belonging to an illegal image includes:
inputting the first sample image into the primary sub-model, and inputting the output characteristics of the convolution layer into an attention mechanism module to obtain the final output characteristics of the attention mechanism module for inputting the next convolution layer for the convolution layer connected with the attention mechanism module;
and sequentially passing the output characteristics of the last convolution layer through the full-connection layer and the classification layer to obtain a first score of the first sample image belonging to the illegal image, and returning to the step of inputting the first sample image into the first-level sub-model until a specified number of first sample images are input into the first-level sub-model.
5. The method for training a video audit model according to claim 4, wherein for a convolutional layer connected to an attention mechanism module, inputting the convolutional layer output characteristics into the attention mechanism module to obtain final output characteristics of the attention mechanism module for inputting a next convolutional layer, includes:
inputting the convolutional layer output characteristics into a channel attention submodule of the attention mechanism module to obtain channel characteristics;
multiplying the channel characteristic and the convolutional layer output characteristic to obtain an intermediate characteristic;
inputting the intermediate features into a spatial attention submodule of the attention mechanism module to obtain spatial features;
and multiplying the spatial feature and the intermediate feature to obtain the final output feature of the attention mechanism module to be input into the next convolutional layer.
6. The video audit model training method of claim 4 wherein the obtaining a thermodynamic diagram of the first sample image comprises:
inputting all the first sample images into the rough first-level sub-model to obtain a second score of the first sample image belonging to the violation image;
generating a thermodynamic diagram of the first sample image based on the Grad-CAM and the second score.
7. The method for training a video audit model according to claim 4, wherein the stitching the thermodynamic diagram and the first sample image to obtain a second sample image comprises:
and splicing the pixel values of the thermodynamic diagram to a fourth channel of the first sample image to obtain a second sample image, wherein the first channel, the second channel and the third channel of the second sample image are respectively RGB values of the second sample image.
8. The method of claim 7, wherein the training the coarse first-level submodel using the second sample image to obtain a final trained first-level submodel comprises:
randomly setting the fourth channel value of the second sample images of the designated number to 0 to obtain a third sample image;
and training the rough first-level sub-model by adopting the second sample image and the third sample image to obtain a finally trained first-level sub-model.
9. The video audit model training method according to any one of claims 3 to 8, wherein when the classification loss rate is greater than a preset value, training the secondary sub-model using the first sample image includes:
after the first sample image is adopted to carry out rough training on the first-level sub-model, if the classification loss rate is greater than a preset value, the first sample image is adopted to carry out rough training on the second-level sub-model to obtain a rough second-level sub-model until the first sample image with the specified number is adopted to carry out rough training on the first-level sub-model;
determining a fourth sample image with the classification loss rate larger than a preset value from the first sample image;
acquiring a thermodynamic diagram of the fourth sample image;
splicing the thermodynamic diagram and the fourth sample image to obtain a fifth sample image;
and training the rough secondary submodel by adopting the fifth sample image to obtain a finally trained secondary submodel.
10. The method of claim 9, wherein the obtaining the thermodynamic diagram of the fourth sample image comprises:
inputting the fourth sample image into a trained rough secondary sub-model to obtain a third score of the fourth sample image belonging to the violation image;
generating a thermodynamic diagram of the fourth sample image based on the Grad-CAM and the third score.
11. The video audit model training method of claim 9 wherein the convolution kernel of the last convolutional layer of the secondary submodel is a deformable convolution kernel.
12. A video auditing method, comprising:
acquiring a video image from a video to be audited;
inputting the video image into a pre-trained video auditing model to obtain a score of the video image belonging to an illegal image, wherein the video auditing model comprises a primary submodel and a secondary submodel, the primary submodel is used for predicting a first score of the video image belonging to the illegal image and outputting the first score when the first score is smaller than a preset value, and the secondary submodel is used for predicting a second score of the video image belonging to the illegal image and outputting the second score when the first score is larger than the preset value;
when the score is larger than a preset threshold value, auditing the video to be audited;
wherein the video audit model is trained by the video audit model training method of any one of claims 1-11.
13. A video audit model training device, comprising:
the system comprises a sample acquisition module, a classification module and a classification module, wherein the sample acquisition module is used for acquiring a first sample image and a classification label of the first sample image;
the model initialization module is used for initializing a video audit model, and the video audit model comprises a primary sub-model and a secondary sub-model;
the first-level sub-model training module is used for training the first-level sub-model by adopting the first sample image and calculating the classification loss rate of the first-level sub-model for classifying the first sample image according to the classification label;
and the secondary sub-model training module is used for training the secondary sub-model by adopting the first sample image when the classification loss rate is greater than a preset value.
14. A video review apparatus, comprising:
the video image acquisition module is used for acquiring video images from videos to be audited;
the model prediction module is used for inputting the video image into a pre-trained video auditing model to obtain a score of the video image belonging to an illegal image, wherein the video auditing model comprises a primary submodel and a secondary submodel, the primary submodel is used for predicting a first score of the video image belonging to the illegal image and outputting the first score when the first score is smaller than a preset value, and the secondary submodel is used for predicting a second score of the video image belonging to the illegal image and outputting the second score when the first score is larger than the preset value;
the auditing module is used for auditing the video to be audited when the score is greater than a preset threshold value;
wherein the video audit model is trained by the video audit model training method of any one of claims 1-11.
15. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement a video review model training method as claimed in any one of claims 1-11, and/or a video review method as claimed in claim 12.
16. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a video review model training method according to any one of claims 1 to 11 and/or a video review method according to claim 12.
CN202110181850.6A 2021-02-09 2021-02-09 Video auditing model training method, video auditing method and related device Pending CN112818888A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110181850.6A CN112818888A (en) 2021-02-09 2021-02-09 Video auditing model training method, video auditing method and related device
PCT/CN2022/074703 WO2022171011A1 (en) 2021-02-09 2022-01-28 Video auditing model training method, video auditing method, and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110181850.6A CN112818888A (en) 2021-02-09 2021-02-09 Video auditing model training method, video auditing method and related device

Publications (1)

Publication Number Publication Date
CN112818888A true CN112818888A (en) 2021-05-18

Family

ID=75864970

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110181850.6A Pending CN112818888A (en) 2021-02-09 2021-02-09 Video auditing model training method, video auditing method and related device

Country Status (2)

Country Link
CN (1) CN112818888A (en)
WO (1) WO2022171011A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590944A (en) * 2021-07-23 2021-11-02 北京达佳互联信息技术有限公司 Content searching method and device
WO2022171011A1 (en) * 2021-02-09 2022-08-18 百果园技术(新加坡)有限公司 Video auditing model training method, video auditing method, and related device

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN115471698B (en) * 2022-09-06 2023-06-30 湖南经研电力设计有限公司 Power transmission and transformation project remote sensing image classification method and system based on deep learning network

Citations (12)

Publication number Priority date Publication date Assignee Title
CN107403198A (en) * 2017-07-31 2017-11-28 广州探迹科技有限公司 A kind of official website recognition methods based on cascade classifier
CN109145766A (en) * 2018-07-27 2019-01-04 北京旷视科技有限公司 Model training method, device, recognition methods, electronic equipment and storage medium
CN109784293A (en) * 2019-01-24 2019-05-21 苏州科达科技股份有限公司 Multi-class targets method for checking object, device, electronic equipment, storage medium
CN109934226A (en) * 2019-03-13 2019-06-25 厦门美图之家科技有限公司 Key area determines method, apparatus and computer readable storage medium
CN110191356A (en) * 2019-06-06 2019-08-30 北京字节跳动网络技术有限公司 Video reviewing method, device and electronic equipment
CN111090776A (en) * 2019-12-20 2020-05-01 广州市百果园信息技术有限公司 Video auditing method, device, auditing server and storage medium
CN111143612A (en) * 2019-12-27 2020-05-12 广州市百果园信息技术有限公司 Video auditing model training method, video auditing method and related device
CN111225234A (en) * 2019-12-23 2020-06-02 广州市百果园信息技术有限公司 Video auditing method, video auditing device, equipment and storage medium
CN111385602A (en) * 2018-12-29 2020-07-07 广州市百果园信息技术有限公司 Video auditing method, medium and computer equipment based on multi-level and multi-model
CN112052877A (en) * 2020-08-06 2020-12-08 杭州电子科技大学 Image fine-grained classification method based on cascade enhanced network
CN112131978A (en) * 2020-09-09 2020-12-25 腾讯科技(深圳)有限公司 Video classification method and device, electronic equipment and storage medium
CN112132196A (en) * 2020-09-14 2020-12-25 中山大学 Cigarette case defect identification method combining deep learning and image processing

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818888A (en) * 2021-02-09 2021-05-18 广州市百果园信息技术有限公司 Video auditing model training method, video auditing method and related device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
OU PAN et al.: "Research on Hand Gesture Recognition Based on Heatmaps", Application Research of Computers, vol. 37, pages 326-328 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022171011A1 (en) * 2021-02-09 2022-08-18 百果园技术(新加坡)有限公司 Video auditing model training method, video auditing method, and related device
CN113590944A (en) * 2021-07-23 2021-11-02 北京达佳互联信息技术有限公司 Content searching method and device
CN113590944B (en) * 2021-07-23 2024-01-19 北京达佳互联信息技术有限公司 Content searching method and device

Also Published As

Publication number Publication date
WO2022171011A1 (en) 2022-08-18

Similar Documents

Publication Publication Date Title
CN112818888A (en) Video auditing model training method, video auditing method and related device
JP6994588B2 (en) Face feature extraction model training method, face feature extraction method, equipment, equipment and storage medium
US8792722B2 (en) Hand gesture detection
US8750573B2 (en) Hand gesture detection
CN111814902A (en) Target detection model training method, target identification method, device and medium
CN111209970B (en) Video classification method, device, storage medium and server
CN112183166B (en) Method and device for determining training samples and electronic equipment
CN114187311A (en) Image semantic segmentation method, device, equipment and storage medium
CN112381104A (en) Image identification method and device, computer equipment and storage medium
US11704543B2 (en) Neural network hardware acceleration with stochastic adaptive resource allocation
CN112200057A (en) Face living body detection method and device, electronic equipment and storage medium
CN113159200B (en) Object analysis method, device and storage medium
CN110610123A (en) Multi-target vehicle detection method and device, electronic equipment and storage medium
CN115797735A (en) Target detection method, device, equipment and storage medium
CN114330565A (en) Face recognition method and device
CN112529846A (en) Image processing method and device, electronic equipment and storage medium
CN113627504B (en) Multi-mode multi-scale feature fusion target detection method based on generation of countermeasure network
CN111783812A (en) Method and device for identifying forbidden images and computer readable storage medium
CN116977859A (en) Weak supervision target detection method based on multi-scale image cutting and instance difficulty
CN115620054A (en) Defect classification method and device, electronic equipment and storage medium
WO2022141094A1 (en) Model generation method and apparatus, image processing method and apparatus, and readable storage medium
CN113870210A (en) Image quality evaluation method, device, equipment and storage medium
CN112819079A (en) Model sampling algorithm matching method and device and electronic equipment
CN117953589B (en) Interactive action detection method, system, equipment and medium
CN114281947A (en) Test question searching method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination