CN113011254A - Video data processing method, computer equipment and readable storage medium


Info

Publication number
CN113011254A
Authority
CN
China
Prior art keywords
character
frame image
video
image
data
Prior art date
Legal status
Granted
Application number
CN202110159590.2A
Other languages
Chinese (zh)
Other versions
CN113011254B (en)
Inventor
尚焱
李松南
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110159590.2A
Publication of CN113011254A
Application granted
Publication of CN113011254B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition


Abstract

The embodiments of the application disclose a video data processing method, a computer device, and a readable storage medium, relating to blockchain technology and to video processing technology in artificial intelligence. The method includes the following steps: acquiring a key frame image from video data; identifying key image features of the key frame image based on a character detection model, performing character region feature matching on the key image features, and determining a character region in the key frame image; performing feature extraction on the character region based on an image recognition model, recognizing the character data of the key frame image from the character region according to the extracted features, and matching the character data against a character database to obtain a character detection result for the character region; and, if the character detection result is that the character data matches the character database, acquiring a target character string matching the character data and determining the data category of the target character string as the video category of the video data. The embodiments of the application can improve both the efficiency and the accuracy of data detection.

Description

Video data processing method, computer equipment and readable storage medium
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video data processing method, a computer device, and a readable storage medium.
Background
Users can record video data and upload it to a social platform so that other users can view it and interact with it. Some social platforms add watermarks to video data uploaded by users to protect the copyright of the original creators and to prevent illegitimate users from maliciously re-uploading other people's original videos. It is therefore a pressing problem to detect the video data uploaded by a user and determine whether it contains a watermark, in order to decide whether the upload infringes the copyright of others.
Existing video data detection methods generally inspect only specific positions in the video data, such as the upper left, lower left, upper right, and lower right corners, to determine whether the video data contains a watermark. Because such methods inspect only those specific positions, their detection accuracy is low.
Disclosure of Invention
The embodiment of the application provides a video data processing method, a computer device and a readable storage medium, which can improve the accuracy and efficiency of data detection.
An embodiment of the present application provides a video data processing method, including:
acquiring a key frame image from at least two video frame images constituting video data;
identifying key image features of the key frame image based on a character detection model, performing character region feature matching on the key image features, and determining a character region in the key frame image;
extracting the characteristics of the character area based on an image recognition model, recognizing the character data of the key frame image from the character area according to the extracted characteristics, and performing character matching on the character data of the key frame image and a character database to obtain a character detection result of the character area in the key frame image;
and if the character detection result is the result of matching the character data with the character database, acquiring a target character string matched with the character data from the character database, and determining the data type corresponding to the target character string as the video type of the video data.
Another aspect of the embodiments of the present application provides a video data processing method, including:
acquiring a sample key frame image from at least two sample video frame images forming sample video data, and acquiring a sample area label in the sample key frame image;
identifying sample key image features of the sample key frame image based on an initial character detection model, performing character region feature matching on the sample key image features, and determining a sample character region in the sample key frame image;
and generating a first loss function based on the sample character region and the sample region label, and training the initial character detection model based on the first loss function to generate a character detection model.
An aspect of an embodiment of the present application provides a video data processing apparatus, including:
the image acquisition module is used for acquiring a key frame image from at least two video frame images forming video data;
the character recognition module is used for recognizing key image characteristics of the key frame image based on the character detection model, performing character region characteristic matching on the key image characteristics and determining a character region in the key frame image;
the character matching module is used for extracting the characteristics of the character area based on the image recognition model, recognizing the character data of the key frame image from the character area according to the extracted characteristics, and performing character matching on the character data of the key frame image and a character database to obtain a character detection result of the character area in the key frame image;
and the category determining module is used for acquiring a target character string matched with the character data from the character database if the character detection result is the result of matching the character data with the character database, and determining the data category corresponding to the target character string as the video category to which the video data belongs.
Optionally, the image obtaining module includes:
the image matching unit is used for performing image feature matching on the i-th video frame image and the (i+1)-th video frame image among the at least two video frame images to obtain the similarity between the i-th video frame image and the (i+1)-th video frame image, i being a positive integer;
a first image determining unit, configured to determine the (i+1)-th video frame image as a key frame image of the video data if the similarity between the i-th and (i+1)-th video frame images is smaller than a video similarity threshold, and to perform image feature matching on the (i+1)-th and (i+2)-th video frame images to obtain the similarity between them;
a second image determining unit, configured to perform image feature matching on the (i+1)-th and (i+2)-th video frame images if the similarity between the i-th and (i+1)-th video frame images is greater than or equal to the video similarity threshold, so as to obtain the similarity between them; this continues until the (i+2)-th video frame image is the last of the at least two video frame images, whereby the key frame images of the video data are obtained.
The character recognition module includes:
the feature extraction unit is used for extracting features of the key frame image based on the convolution layer in the character detection model to obtain key image features of the key frame image;
the feature stitching unit is used for performing feature stitching on the key image features to obtain a stitched feature image corresponding to the key frame image; the pixel values of the pixels in the stitched feature image are used for representing the probability that the corresponding pixels in the key frame image are characters;
the image determining unit is used for acquiring the probability range to which each pixel value in the stitched feature image belongs, and generating a probability image and a character frame image according to the probability range to which each pixel value belongs;
and the region determining unit is used for performing feature fusion on the probability image and the character frame image to generate a fusion character image, and determining a character region in the key frame image based on the fusion character image.
The character matching module comprises:
the sequence acquisition unit is used for extracting the characteristics of the character area based on the convolution layer in the image recognition model to obtain the convolution characteristics corresponding to the character area, and performing serialization processing on the convolution characteristics corresponding to the character area to obtain a characteristic sequence corresponding to the character area;
the recurrent processing unit is used for performing recognition processing on the feature sequence based on a recurrent layer in the image recognition model and determining the sequence character features corresponding to the feature sequence;
and the feature conversion unit is used for performing feature conversion on the sequence character features based on a transcription layer in the image recognition model to obtain character data of the key frame image.
Optionally, the number of key frame images in the video data is N; n is a positive integer; the character matching module comprises:
the character combination unit is used for combining the character data of the N key frame images in the video data to obtain combined character data;
a word segmentation determining unit, configured to perform word segmentation processing on the combined character data, and determine M word segmentation character data corresponding to the video data; m is a positive integer;
the character matching unit is used for respectively carrying out character matching on the M word segmentation character data corresponding to the video data and the character database to obtain k matched character strings and the matching number respectively corresponding to the k matched character strings; the matching number is used for representing the number of character data matched with the matching character string in the M word segmentation character data; k is a positive integer;
the result determining unit is used for determining the character detection result as the result of matching the character data with the character database if the matching character strings of which the matching number is greater than the matching threshold exist;
and the character determining unit is used for determining the matching character strings of which the matching number is greater than the matching threshold as the target character strings matched with the character data.
Optionally, the apparatus further comprises:
the data response module is used for responding to an uploading request of the user terminal for the video data;
the data prompt module is used for sending a data upload exception prompt to the user terminal if the video category to which the video data belongs is a marked video category; the data upload exception prompt includes the video category to which the video data belongs;
and the data uploading module is used for uploading the video data to the application program if the video category to which the video data belongs does not belong to the marked video category.
An aspect of an embodiment of the present application provides another video data processing apparatus, including:
the region label acquiring module is used for acquiring a sample key frame image from at least two sample video frame images forming sample video data and acquiring a sample region label in the sample key frame image;
the sample region determining module is used for identifying the sample key image features of the sample key frame image based on the initial character detection model, performing character region feature matching on the sample key image features and determining a sample character region in the sample key frame image;
and the detection model generation module is used for generating a first loss function based on the sample character region and the sample region label, training the initial character detection model based on the first loss function and generating a character detection model.
Optionally, the apparatus further comprises:
the character label obtaining module is used for obtaining a sample character label in the sample key frame image;
the sample character acquisition module is used for extracting the characteristics of the sample character area based on an initial image recognition model and recognizing the sample character data of the sample key frame image from the sample character area according to the extracted sample characteristics;
and the recognition model generation module is used for generating a second loss function based on the sample character data and the sample character label, training the initial image recognition model based on the second loss function and generating the image recognition model.
One aspect of the present application provides a computer device, comprising: a processor, a memory, a network interface;
The processor is connected to the memory and the network interface, where the network interface is used to provide data communication functions, the memory is used to store a computer program, and the processor is used to call the computer program so that the computer device including the processor executes the above method.
An aspect of the embodiments of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program being adapted to be loaded and executed by a processor, so as to make a computer device having the processor execute the above method.
An aspect of the embodiments of the present application provides a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided by the various optional implementations of the above aspects of the embodiments of the application.
In the embodiments of the application, a key frame image is acquired from the at least two video frame images constituting the video data. Because the key frame image is a representative image among the at least two video frame images included in the video data, acquiring key frame images from the video data for recognition processing improves the efficiency of data detection. The character region in the key frame image is determined by identifying the image features of the key frame image, so that when the character data in the character region is subsequently recognized, only the character region needs to be processed rather than the whole key frame image, which improves recognition efficiency. Furthermore, the video frame image is first detected to determine the character region in the key frame image, and the character region is then recognized to determine the character data it contains; this amounts to recognizing the key frame image twice, which improves the accuracy of data detection.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a network architecture diagram of a video data processing system according to an embodiment of the present application;
fig. 2 is a schematic view of an application scenario of a video data processing method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a video data processing method according to an embodiment of the present application;
FIG. 4 is a scene diagram illustrating a determination of a character region in a key frame image based on a character detection model according to an embodiment of the present application;
FIG. 5 is a scene diagram illustrating a determination of character data of a key frame image based on an image recognition model according to an embodiment of the present application;
FIG. 6 is a flowchart illustrating a method for determining a key frame image according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a scenario of extracting a sequence of key frames according to an embodiment of the present application;
fig. 8 is a schematic flowchart of another video data processing method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of another video data processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
Artificial intelligence is a comprehensive discipline that spans a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to identify, track, and measure targets, and to further process the resulting images so that they become images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
A blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer (P2P) transmission, consensus mechanisms, and encryption algorithms. It is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may comprise a blockchain underlying platform, a platform product service layer, and an application service layer. A blockchain can be composed of multiple serial transaction records (also called blocks) that are linked by cryptography and whose contents are thereby protected; the distributed ledgers linked by the blockchain allow multiple parties to effectively record transactions and verify them permanently (they cannot be tampered with). The consensus mechanism is a mathematical algorithm for establishing trust and acquiring rights and interests among different nodes in a blockchain network; that is, it is a mathematical algorithm commonly recognized by the network nodes of the blockchain.
The application relates to blockchain technology and to video processing technology in artificial intelligence. Blockchain technology can be used to store the video data in a blockchain network, while video processing technology is used to detect the images of the video data, determine the character regions in the images, perform feature extraction on the character regions, determine the character data of the images, perform character matching against a character database based on the recognized character data, and determine the video category to which the video data belongs based on the character matching result. The application may also use blockchain technology to store the character data of the character database, the video category to which the video data belongs, and so on in the blockchain network. The character detection result is obtained by detecting the key frame image, determining the character region in the key frame image, and recognizing that character region, so as to determine the video category to which the video data belongs; this improves both the efficiency and the accuracy of data detection.
Referring to fig. 1, fig. 1 is a network architecture diagram of a video data processing system according to an embodiment of the present application, as shown in fig. 1, a computer device 101 may perform data interaction with a user terminal, where the number of the user terminals may be one or more, for example, when the number of the user terminals is multiple, the user terminals may include the user terminal 102a, the user terminal 102b, the user terminal 102c, and the like in fig. 1. Taking the user terminal 102a as an example, the computer device 101 may respond to an upload request of the user terminal 102a for video data, and obtain a key frame image from at least two video frame images constituting the video data based on the upload request. Further, the computer device 101 may identify key image features of the key frame image based on the character detection model, perform character region feature matching on the key image features, and determine a character region in the key frame image; and performing feature extraction on the character region based on the image recognition model, recognizing character data of the key frame image from the character region according to the extracted features, and performing character matching on the character data of the key frame image and the character database to obtain a character detection result of the character region in the key frame image. Further, if the character detection result is a result of matching the character data with the character database, the computer device 101 may obtain a target character string matching the character data from the character database, and determine a data category corresponding to the target character string as a video category to which the video data belongs.
The character region in the key frame image is determined by identifying the image features of the key frame image, so that when the character data in the character region is recognized, only the character region needs to be processed rather than the whole key frame image, which improves recognition efficiency. Furthermore, the video frame image is first detected to determine the character region in the key frame image, and the character region is then recognized to determine the character data it contains; this amounts to recognizing the key frame image twice, which improves the accuracy of data detection.
It can be understood that the computer device mentioned in the embodiments of the present application includes, but is not limited to, a terminal device or a server. In other words, the computer device or the user terminal may be a server, a terminal device, or a system composed of a server and a terminal device. The terminal device may be an electronic device, including but not limited to a mobile phone, a tablet computer, a desktop computer, a notebook computer, a palmtop computer, a vehicle-mounted device, an augmented reality/virtual reality (AR/VR) device, a head-mounted display, a wearable device, a smart speaker, a digital camera, a camera, and other mobile internet devices (MID) with network access capability. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, vehicle-road cooperation, content delivery networks (CDN), and big data and artificial intelligence platforms.
Further, please refer to fig. 2, which is a schematic view of an application scenario of a video data processing method according to an embodiment of the present application. As shown in fig. 2, the user terminal 20 sends an upload request for video data to the computer device 22, where the upload request carries the video data. The computer device 22 obtains a key frame image 21 from at least two video frame images constituting the video data, identifies key image features of the key frame image 21 based on a character detection model, performs character region feature matching on the key image features, and determines a character region 23 in the key frame image 21. The computer device 22 then performs feature extraction on the character region 23 based on the image recognition model and recognizes character data 24 of the key frame image from the character region 23 according to the extracted features. For example, if the recognized character data 24 of the key frame image is "Tencent video", the character data "Tencent video" is character-matched against the character database to obtain the character detection result for the character region in the key frame image 21. If the character detection result is that the character data matches the character database, a target character string matching the character data is obtained from the character database, and the data category corresponding to the target character string is determined as the video category to which the video data belongs. Optionally, if the video category is a marked video category, the computer device 22 may further send a data upload exception prompt to the user terminal 20; for example, the prompt may state that the uploaded video contains a "Tencent video" mark and that uploading is prohibited to avoid risk, so that the user can view the prompt on the user terminal and modify the video accordingly.
Further, please refer to fig. 3, wherein fig. 3 is a schematic flowchart of a video data processing method according to an embodiment of the present application; as shown in fig. 3, the method includes:
s101, acquiring a key frame image from at least two video frame images forming video data.
In this embodiment of the application, the computer device may acquire video data from a local database, from another storage medium, or from the user terminal. The computer device splits the acquired video data into at least two video frame images and obtains the key frame image by performing frame extraction on the at least two video frame images. Taking the case where the computer device acquires video data from the user terminal: when a user sends an upload request for video data through the user terminal, the computer device acquires the video data based on the upload request. If the video data consists of a single video frame image, that video frame image is determined as the key frame image. If the video data consists of at least two video frame images, the computer device may split the video data into the at least two video frame images and perform frame extraction on them to obtain the key frame image. Key frame images can reflect a large amount of the image information in the video data. Because the key frame images are representative images among the at least two video frame images contained in the video data, and their number is smaller than the total number of video frame images in the video data, processing the key frame images improves data processing efficiency while the detection result still accurately reflects the content of the video data. In this embodiment of the application, if the video data includes one key frame image, the processing of steps S102 to S104 is performed on that key frame image; if the video data includes a plurality of key frame images, the processing of steps S102 to S104 is performed on each of them.
Optionally, the computer device may determine the key frame images based on the similarity between adjacent video frame images among the at least two video frame images. Alternatively, the key frame images may be determined based on the number of video frame images in the video data; for example, the number of key frames is determined from the number of video frame images, and the key frame images are extracted from the at least two video frame images based on that number. Alternatively, the key frame images may be determined based on the duration of the video data; for example, the number of key frames is determined from the duration, the key frame positions are determined from that number, and the video frame images located at those positions among the at least two video frame images constituting the video data are determined as key frame images. This is not limited here.
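As an illustration of the first of these options, the following is a minimal Python sketch of similarity-based key frame selection. The patent does not fix a similarity metric or a threshold value, so the gray-level histogram intersection and the 0.8 threshold used here are assumptions for illustration only.

```python
import numpy as np

def frame_similarity(frame_a: np.ndarray, frame_b: np.ndarray, bins: int = 64) -> float:
    """Similarity in [0, 1] between two equally sized frames via gray-level
    histogram intersection (an assumed metric; the patent does not specify one)."""
    hist_a, _ = np.histogram(frame_a, bins=bins, range=(0, 255))
    hist_b, _ = np.histogram(frame_b, bins=bins, range=(0, 255))
    return float(np.minimum(hist_a, hist_b).sum() / max(hist_a.sum(), 1))

def extract_key_frames(frames: list, threshold: float = 0.8) -> list:
    """Keep the (i+1)-th frame whenever its similarity to the i-th frame is
    below the video similarity threshold; the first frame is always kept."""
    if not frames:
        return []
    key_frames = [frames[0]]
    for prev, curr in zip(frames, frames[1:]):
        if frame_similarity(prev, curr) < threshold:
            key_frames.append(curr)
    return key_frames
```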
S102, identifying key image characteristics of the key frame image based on the character detection model, performing character region characteristic matching on the key image characteristics, and determining a character region in the key frame image.
In this embodiment of the application, the computer device identifies the key image features of the key frame image based on the character detection model, performs character region feature matching on the key image features, and determines the character region in the key frame image. Here, the key image features may refer to the image features mentioned in step S101 above. The computer device may extract features from the key frame image as key image features reflecting the image information in the key frame image, for example the objects contained in the key frame image, such as characters and object information other than characters. The computer device determines the character region in the key frame image by performing character region feature matching on the key image features.
Character region feature matching is used to determine the probability that the key image features indicate characters, and to determine, based on that probability, the regions of the key frame image in which characters are likely to appear; that is, character region feature matching specifically refers to the process of performing feature matching to determine the character region. Specifically, the computer device may perform feature extraction on the key frame image based on the convolution layers in the character detection model to obtain the key image features of the key frame image. Character region feature matching is then performed on the key image features to determine the character region in the key frame image. Specifically, feature stitching may be performed on the key image features to obtain a stitched feature image corresponding to the key frame image, where the pixel values of the pixels in the stitched feature image represent the probability that the corresponding pixels in the key frame image are characters; the probability range to which each pixel value in the stitched feature image belongs is acquired, and a probability image and a character frame image are generated according to those probability ranges; and feature fusion is performed on the probability image and the character frame image to generate a fused character image, based on which the character region in the key frame image is determined.
The character detection model may contain multiple convolution layers, each with a different convolution kernel; a convolution kernel is in essence an a × a matrix (e.g. 1 × 1, 3 × 3, etc.). In a specific implementation, the key frame image may be quantized to obtain a pixel matrix corresponding to the key frame image, where the pixel matrix is an m × n matrix, m × n equals the pixel dimensions of the key frame image, and the values in the pixel matrix are quantized values obtained by jointly quantizing the luminance, chrominance, and so on of the key frame image. For example, if the key frame image is a 1920 × 2040 picture, the pixel matrix corresponding to the key frame image is a 1920 × 2040 matrix in which each value is the quantized value of the corresponding pixel. The pixel matrix of the key frame image is then convolved with the matrix corresponding to the convolution kernel to obtain a feature matrix of the key frame image, i.e. the key image features. Because the convolution kernels of the convolution layers differ, the key image features obtained after feature extraction by different convolution layers differ, as does their number; by feature-stitching the obtained key image features, the stitched features reflect the image information in the key frame image more completely.
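The multiply-and-sum operation just described is an ordinary 2-D convolution. The short sketch below illustrates it on a toy pixel matrix; the matrix values and the 3 × 3 kernel are illustrative only and do not come from the patent.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 2-D convolution: slide the kernel over the pixel matrix and take
    the elementwise product-and-sum at every position to build a feature map."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A toy 5x5 "pixel matrix" and a 3x3 kernel produce a 3x3 feature map.
pixels = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)
print(conv2d(pixels, kernel))
```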
Referring to fig. 4, fig. 4 is a schematic view of a scene in which the character region in a key frame image is determined based on the character detection model according to an embodiment of the present application. The computer device inputs a key frame image 41 into the character detection model and performs feature extraction on the key frame image based on the convolution layers 42 in the character detection model to obtain the key image features of the key frame image, where the convolution layers 42 may comprise h convolution layers, h being a positive integer. For example, if h is 5, the key image features extracted by the 5 convolution layers 42, comprising the first convolution layer f1, the second convolution layer f2, the third convolution layer f3, the fourth convolution layer f4, and the fifth convolution layer f5, all differ. The first convolution layer f1 performs feature extraction on the key frame image 41 to obtain first key image features; the second convolution layer f2 performs feature extraction on the first key image features to obtain second key image features; the third convolution layer f3 performs feature extraction on the second key image features to obtain third key image features; the fourth convolution layer f4 performs feature extraction on the third key image features to obtain fourth key image features; and the fifth convolution layer f5 performs feature extraction on the fourth key image features to obtain fifth key image features. The fifth key image features are double-upsampled (up x 2) to obtain sampled fifth key image features, which are fused with the fourth key image features to obtain fused fourth key image features. The fused fourth key image features are double-upsampled (up x 2) to obtain sampled fourth key image features, which are fused with the third key image features to obtain fused third key image features. The fused third key image features are double-upsampled (up x 2) to obtain sampled third key image features, which are fused with the second key image features to obtain fused second key image features. The fifth key image features are upsampled eight times (up x 8) and convolved to obtain a first sampled image; the fused fourth key image features are upsampled four times (up x 4) and convolved to obtain a second sampled image; the fused third key image features are double-upsampled (up x 2) and convolved to obtain a third sampled image; and the fused second key image features are convolved to obtain a fourth sampled image. Feature stitching is performed on the first, second, third, and fourth sampled images to obtain the stitched feature image 43 corresponding to the key frame image 41, where the pixel values of the pixels in the stitched feature image represent the probability that the corresponding pixels in the key frame image are characters. The computer device obtains the probability range to which each pixel value in the stitched feature image 43 belongs and generates a probability image 44 and a character frame image 45 according to those probability ranges.
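The upsample-fuse-stitch flow of fig. 4 can be sketched as follows in PyTorch. The elementwise-add fusion, the nearest-neighbour upsampling, and the assumption that all feature maps share one channel count are illustrative choices; the patent does not specify them.

```python
import torch
import torch.nn.functional as F

def stitch_features(f2, f3, f4, f5):
    """Fig. 4 sketch: each deeper feature map is 2x-upsampled and fused with the
    next shallower one; all branches are then brought to a common resolution
    and concatenated into the stitched feature image."""
    up = lambda t, s: F.interpolate(t, scale_factor=s, mode="nearest")
    fused4 = f4 + up(f5, 2)      # fused fourth key image features
    fused3 = f3 + up(fused4, 2)  # fused third key image features
    fused2 = f2 + up(fused3, 2)  # fused second key image features
    branches = [up(f5, 8), up(fused4, 4), up(fused3, 2), fused2]
    return torch.cat(branches, dim=1)  # stitched feature image (N, 4*C, H, W)
```

In an actual model each branch would also pass through the convolution step mentioned in the text before concatenation; that step is omitted here for brevity.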
Different probability ranges can be represented by different colors: the computer device may determine the color corresponding to each pixel value based on the probability range to which each pixel value in the stitched feature image 43 belongs, and generate the probability image 44 according to those colors. For example, the probability image may be represented as a heat map, i.e. a diagram that shows, in a specially highlighted form, the page areas a visitor favors or the geographical area in which the visitor is located; in this embodiment of the application, the character region and its position in the key frame image are shown in different highlighted forms (i.e. different colors). Feature fusion is performed on the probability image 44 and the character frame image 45 to generate a fused character image 46, which can represent the positions of the characters in the key frame image, and the character region 47 in the key frame image is determined based on the fused character image 46.
By using the convolution layers in the character detection model to perform feature extraction on the key frame image, a plurality of key image features corresponding to the key frame image can be extracted. By stitching the plurality of key image features corresponding to the key frame image, the stitched features reflect the image information in the key frame image more completely. By generating the probability image and the character frame image corresponding to the key frame image, the probability image can represent the probability that pixels in the key frame image are characters, and the character frame image can represent the positions of the character frames in the key frame image; determining the character region in the key frame image from the characters and the character frame positions therefore reflects the character region in the video frame image more accurately, which facilitates the subsequent recognition of the character region and improves the accuracy of the determined video category to which the video data belongs.
Optionally, when the character detection model is used to identify the key frame image, the computer device may detect a plurality of candidate regions containing character features, together with the confidence of each candidate region, whereas the number of finally determined character regions is fixed. For example, the computer device may detect a plurality of candidate regions, which may include "Tencent", "Tencent video", "video", and the like. Specifically, the computer device obtains the candidate region with the highest confidence among the plurality of candidate regions, calculates the region overlap (Intersection over Union, IoU) between that candidate region and each other candidate region using the Non-Maximum Suppression (NMS) algorithm, and compares the region overlap with an overlap threshold, thereby determining the final character region "Tencent video". The overlap threshold may be set to 0.7, 0.8, 0.9, or another value, which is not limited in this embodiment of the application. Duplicated candidate regions in the key frame image can be removed by the non-maximum suppression algorithm, so that the final character region is determined.
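A minimal sketch of the non-maximum suppression step, assuming candidate regions are given as (box, confidence) pairs with boxes in (x1, y1, x2, y2) form:

```python
def iou(box_a, box_b):
    """Region overlap (IoU) of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def non_max_suppression(candidates, overlap_threshold=0.8):
    """Greedily keep the most confident candidate region and drop every
    remaining candidate whose overlap with it reaches the threshold."""
    remaining = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [c for c in remaining
                     if iou(best[0], c[0]) < overlap_threshold]
    return kept
```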
S103, extracting the characteristics of the character region based on the image recognition model, recognizing the character data of the key frame image from the character region according to the extracted characteristics, and performing character matching on the character data of the key frame image and the character database to obtain the character detection result of the character region in the key frame image.
In this embodiment of the application, the computer device performs feature extraction on the character region based on the image recognition model, recognizes the character data of the key frame image from the character region according to the extracted features, and performs character matching between the character data of the key frame image and the character database to obtain the character detection result for the character region in the key frame image. The character data may be specific characters, such as "Tencent video"; the computer device character-matches "Tencent video" against the character database to obtain the character detection result for the character region in the key frame image. If the character database contains "Tencent video", the character detection result is determined to be that the character data matches the character database. If the character database does not contain "Tencent video", the character detection result is determined to be that the character data does not match the character database, and the computer device may output information indicating that the video data is legitimate data. Legitimate data here means that the video data contains no watermark and does not infringe the copyright of others; the video data can then be uploaded to the application program so that other users can view it and interact with it.
Optionally, the method by which the computer device performs feature extraction on the character region based on the image recognition model and recognizes the character data of the key frame image from the character region according to the extracted features may include: the computer device performs feature extraction on the character region based on the convolution layers in the image recognition model to obtain the convolution features corresponding to the character region, and serializes the convolution features corresponding to the character region to obtain the feature sequence corresponding to the character region; it performs recognition processing on the feature sequence based on the recurrent layer in the image recognition model and determines the sequence character features corresponding to the feature sequence; and it performs feature conversion on the sequence character features based on the transcription layer in the image recognition model to obtain the character data of the key frame image.
In a specific implementation, the image recognition model may include a feature extraction network containing convolution layers, and the computer device may perform feature extraction on the character region based on the convolution layers in the feature extraction network to obtain a plurality of convolution features corresponding to the character region. The feature extraction network may be a deep learning network such as a Convolutional Neural Network (CNN), You Only Look Once (YOLO), or the Single Shot MultiBox Detector (SSD). The computer device serializes the plurality of convolution features corresponding to the character region to obtain the feature sequence corresponding to the character region, and performs recognition processing on the feature sequence based on the recurrent layer in the image recognition model to determine the sequence character features corresponding to the feature sequence, where the recurrent layer is used to recognize the feature sequences corresponding to the convolution features and determine the character features corresponding to each feature sequence. The computer device then performs feature conversion on the sequence character features based on the transcription layer in the image recognition model, and integrates the blank characters and repeated characters in the character features obtained after the feature conversion to obtain the character data of the key frame image. The recurrent layer may include, but is not limited to, a Long Short-Term Memory (LSTM) network or another deep learning network, and the transcription layer may include, but is not limited to, Connectionist Temporal Classification (CTC) or another algorithm.
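The convolution / recurrent / transcription pipeline described above matches the widely used CRNN architecture. The sketch below shows the data flow in PyTorch; the layer sizes are assumptions for illustration, not values taken from the patent.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Convolution layers -> serialized feature sequence -> recurrent layer
    (bidirectional LSTM) -> per-step class scores for the CTC transcription."""
    def __init__(self, num_classes: int, img_height: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        feat_h = img_height // 4                 # height after two 2x poolings
        self.rnn = nn.LSTM(128 * feat_h, 256, bidirectional=True,
                           batch_first=True)     # recurrent layer
        self.fc = nn.Linear(512, num_classes)    # feeds the CTC transcription

    def forward(self, x):                        # x: (N, 1, H, W) character region
        feats = self.conv(x)                     # (N, 128, H/4, W/4)
        n, c, h, w = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(n, w, c * h)  # one step per column
        out, _ = self.rnn(seq)                   # sequence character features
        return self.fc(out)                      # (N, W/4, num_classes)
```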
Referring to fig. 5, fig. 5 is a schematic view of a scene in which the character data of a key frame image is determined based on the image recognition model according to an embodiment of the present application. The computer device inputs a character region 51 into the image recognition model and performs feature extraction on the character region 51 based on the convolution layers in the image recognition model, extracting the features corresponding to the characters in the character region to obtain the convolution features 52 corresponding to the character region 51, where the convolution features 52 are used to indicate the character information of the character region. The computer device serializes the convolution features 52 corresponding to the character region: the convolution features 52 comprise multiple groups of data features, and the computer device combines each group of data features into a sequence sub-feature and combines the sequence sub-features corresponding to the groups of data features to generate the feature sequence 53. Further, the computer device performs recognition processing on the feature sequence 53 based on the recurrent layer in the image recognition model, converts the feature sequence into character form, and determines the sequence character features 54 corresponding to the feature sequence 53. For example, having obtained the feature sequence 53 corresponding to the character region 51 shown in fig. 5, the computer device may perform feature fusion on the feature sequence 53 based on the recurrent layer, performing feature migration between the different sequence sub-features, and generate the sequence character features 54, which here are "-S-T-AATTE". The computer device performs feature conversion on the sequence character features 54 based on the transcription layer in the image recognition model and integrates the blank characters and repeated characters in the sequence character features 54 to obtain the character data 55 of the key frame image: as shown in fig. 5, the blank characters "-" and the repeated characters "AA" and "TT" are integrated to obtain the character data 55 of the key frame image, which is "STATE".
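The integration of blank and repeated characters performed by the transcription layer corresponds to greedy CTC decoding, a minimal sketch of which reproduces the "-S-T-AATTE" example:

```python
def ctc_collapse(sequence: str, blank: str = "-") -> str:
    """Greedy CTC decoding: merge consecutive repeated characters
    (e.g. "AA" -> "A"), then drop the blank "-" symbols."""
    decoded, prev = [], None
    for ch in sequence:
        if ch != prev and ch != blank:
            decoded.append(ch)
        prev = ch
    return "".join(decoded)

print(ctc_collapse("-S-T-AATTE"))  # -> "STATE"
```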
Optionally, if the number of the key frame images in the video data is 1 and the number of the character data of the key frame images is 1, the computer device may perform word segmentation processing on the character data of the key frame images, determine one or more word segmentation character data corresponding to the video data, perform character matching on the one or more word segmentation character data and the character database respectively, and obtain one or more matching character strings and a matching number corresponding to each matching character string; and if the matched character strings with the matching number larger than the matching threshold exist, determining that the character detection result is the result of matching the character data with the character database. Or, if the number of the key frame images in the video data is 1 and the number of the character data of the key frame images is multiple, the computer device may combine the multiple character data of the key frame images to obtain combined character data; performing word segmentation processing on the combined character data, determining one or more word segmentation character data corresponding to the video data, and performing character matching on the one or more word segmentation character data and a character database respectively to obtain one or more matching character strings and the matching number corresponding to each matching character string; and if the matched character strings with the matching number larger than the matching threshold exist, determining that the character detection result is the result of matching the character data with the character database.
Optionally, if the number of the key frame images in the video data is N, N is a positive integer; the computer device can perform character matching on the character data of the key frame image and the character database to obtain a character detection result of the character area in the key frame image. Specifically, the computer device combines character data of N key frame images in the video data to obtain combined character data; performing word segmentation processing on the combined character data, and determining M word segmentation character data corresponding to the video data; performing character matching on the M word segmentation character data corresponding to the video data and a character database respectively to obtain k matched character strings and matching numbers corresponding to the k matched character strings respectively; and if the matched character strings with the matching number larger than the matching threshold exist, determining that the character detection result is the result of matching the character data with the character database. The matching number is used for representing the number of character data matched with the matching character string in the M word segmentation character data, M is a positive integer, and k is a positive integer. The combined character data comprises one or more character data, and the word segmentation character data refers to data obtained after word segmentation processing is carried out on the combined character data. The character database may include a plurality of character strings, where a character string may refer to an enterprise name, a product name corresponding to an enterprise, or a website name corresponding to an enterprise, and is used to indicate that multimedia data carrying the character string may have infringement and other problems. For example, the character string may include "Tencent video" or "XX website", etc.
In a specific implementation, since the number of key frame images in the video data is N, the computer device may combine the character data of the N key frame images in the video data to obtain the combined character data. For example, if N is 3, the character data of the 1st key frame image is "Tencent", the character data of the 2nd key frame image is "video", and the character data of the 3rd key frame image is "Tencent video", then the combined character data obtained by combining the character data of the 3 key frame images may be "Tencent video Tencent video". Further, the computer device may perform word segmentation on the combined character data using a word segmentation tool, such as a Chinese word segmentation tool or another word segmentation tool, to determine the M segmented character data corresponding to the video data. For example, after word segmentation of the combined character data "Tencent video Tencent video", 2 segmented character data are obtained, namely "Tencent video" and "Tencent video". By first combining the character data determined from the video data and then segmenting the combined character data, incomplete watermarks in individual key frame images are prevented from affecting the accuracy of the final detection result, thereby improving the accuracy of data detection.
Further, the computer device performs character matching between the 2 word segmentation character data and the character database to obtain the k matching character strings and the matching number corresponding to each. For example, if the number of word segmentation character data is 2, both are "Tencent video", and the character database includes the character string "Tencent video", then the matching character string obtained after matching is "Tencent video" and its matching number is 2. If the number of word segmentation character data is 3, namely "Tencent video", "Tencent news", and "Tencent video", and the character database includes the character strings "Tencent video" and "Tencent news", then the matching character strings obtained after matching are "Tencent video" and "Tencent news", the matching number corresponding to "Tencent video" is 2, and the matching number corresponding to "Tencent news" is 1.
Further, if there is a matching character string whose matching number is greater than the matching threshold, the computer device determines the character detection result to be a result that the character data matches the character database. The matching threshold may be a default value, for example 1, 2, or 3, or may be determined empirically or from historical matching results, which is not limited in this embodiment of the application. For example, if the matching threshold is 1, the matching number corresponding to the matching character string "Tencent video" is greater than the matching threshold, and the computer device may determine the matching character string whose matching number exceeds the threshold as the target character string matching the character data, that is, the target character string is "Tencent video". It is understood that, if there is no matching character string whose matching number is greater than the matching threshold, the computer device determines the character detection result to be a result that the character data does not match the character database.
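As a non-limiting illustration, the combine-segment-match procedure described above can be sketched as follows. This is a minimal sketch rather than the patent's concrete implementation: the database contents, threshold value, segmentation function, and helper names are all assumptions.

```python
from collections import Counter

# Assumed example entries; in practice this is the character database of
# enterprise, product, and website names.
CHARACTER_DATABASE = {"Tencent video", "Tencent news"}
MATCH_THRESHOLD = 1  # assumed default matching threshold

def match_character_data(frame_texts, segment):
    """frame_texts: character data recognized from the N key frame images.
    segment: a word-segmentation function (e.g. jieba.lcut for Chinese text)."""
    combined = "".join(frame_texts)        # combined character data
    tokens = segment(combined)             # M word segmentation character data
    # matching number per matching character string (the k matches)
    counts = Counter(t for t in tokens if t in CHARACTER_DATABASE)
    above = {s: n for s, n in counts.items() if n > MATCH_THRESHOLD}
    if above:
        # target character string: here, the match with the largest count
        return True, max(above, key=above.get)
    return False, None
```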
S104: if the character detection result is a result that the character data matches the character database, acquire a target character string matching the character data from the character database, and determine the data category corresponding to the target character string as the video category to which the video data belongs.
In the embodiment of the application, the computer device obtains the character detection result; if it is a result that the character data matches the character database, the computer device may acquire the target character string matching the character data from the character database, and determine the data category corresponding to the target character string as the video category to which the video data belongs. For example, if the target character string acquired from the character database is "Tencent video", the computer device determines the data category corresponding to "Tencent video" as the video category to which the video data belongs, that is, the video data belongs to the "Tencent video" category. Since a watermark in video data has temporal stability and category invariance, whether the video data contains a watermark, and which specific watermark category it contains, can be determined by setting a matching threshold. When the watermark in the video data matches a watermark category in the character database and the matching number is greater than the matching threshold, the video data is determined to contain that watermark, and its category determines the video category to which the video data belongs; this reduces the probability of watermark misjudgment and improves the accuracy of data detection.
Optionally, when receiving an upload request for video data sent by a user terminal, the computer device may, after identifying the video category to which the video data belongs, respond to the upload request; if the video category to which the video data belongs falls within a tagged video category, the computer device may send a data upload exception prompt to the user terminal. The data upload exception prompt includes the video category to which the video data belongs; the tagged video categories may indicate the data category corresponding to each enterprise, for example, the data category corresponding to Tencent may be "Tencent video". The data upload exception prompt may read, for example, "The uploaded video contains a Tencent video watermark; the upload is prohibited to avoid infringement risk." After the computer device sends the data upload exception prompt to the user terminal, the user who uploaded the video data can view the prompt through the user terminal, modify the video data promptly, and thereby avoid infringing the copyright of others.
Optionally, if the video category to which the video data belongs does not fall within the tagged video categories, the computer device uploads the video data to the application program. That the video category does not belong to the tagged video categories may mean that the video data contains a watermark but the watermark is not of a tagged category, so the video data uploaded by the user can be considered not to infringe the copyright of others, and the computer device may upload the video data to the application program corresponding to the video data, for example a social, educational, sports, or other application program. For instance, when the application is a social application, the user sends an upload request for the video data to the social application; the computer device determines the category to which the video data belongs and, upon determining that it is not a tagged video category, uploads the video data to the social application, after which other users can view the video data and interact with the user.
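A minimal sketch of this upload-gating flow is given below; the tagged category set, prompt text, and function names are assumptions for illustration rather than the patent's concrete interface.

```python
# Assumed set of tagged video categories.
TAGGED_VIDEO_CATEGORIES = {"Tencent video"}

def handle_upload_request(video_data, detect_category, publish):
    """detect_category: the watermark-based classifier described above,
    returning a video category or None; publish: uploads to the application."""
    category = detect_category(video_data)
    if category in TAGGED_VIDEO_CATEGORIES:
        # data upload exception prompt, including the detected category
        return {"status": "rejected",
                "prompt": f"Upload prohibited: the video carries a '{category}' watermark."}
    publish(video_data)  # untagged category (or no watermark match): upload proceeds
    return {"status": "uploaded"}
```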
The process by which the computer device determines a key frame image in step S101, based on the similarity between adjacent video frame images among the at least two video frame images, is illustrated in fig. 6, which is a schematic flowchart of a method for determining a key frame image according to an embodiment of the present application. As shown in fig. 6, the method includes:
S201: perform image feature matching on the i-th video frame image and the (i+1)-th video frame image among the at least two video frame images, to obtain the similarity between the i-th video frame image and the (i+1)-th video frame image.
In this embodiment of the application, the computer device may perform image feature extraction on each of the at least two video frame images in the video data to obtain an image feature corresponding to each video frame image; the image feature reflects the image information, image details, and the like of the video frame image. The at least two video frame images include the i-th video frame image, where i is a positive integer. The computer device may then calculate, based on the image features, the similarity between the i-th video frame image and the (i+1)-th video frame image. Optionally, the similarity may be obtained by calculating the Euclidean distance between the image feature of the i-th video frame image and that of the (i+1)-th video frame image; other similarity measures include, but are not limited to, the Pearson correlation coefficient and cosine similarity.
S202: determine whether the similarity between the i-th video frame image and the (i+1)-th video frame image is smaller than a video similarity threshold.
In this embodiment of the application, if the similarity between the i-th video frame image and the (i+1)-th video frame image is smaller than the video similarity threshold, the computer device executes step S203 to determine the (i+1)-th video frame image as a key frame image of the video data; otherwise, that is, if the similarity is greater than or equal to the video similarity threshold, the computer device executes step S204. The video similarity threshold may be 0.7, 0.8, 0.9, or another value, which is not limited in this embodiment.
S203: determine the (i+1)-th video frame image as a key frame image of the video data.
S204: perform image feature matching on the (i+1)-th video frame image and the (i+2)-th video frame image to obtain the similarity between them, and continue in this way until the last of the at least two video frame images has been processed, thereby obtaining the key frame images of the video data.
In this embodiment of the application, if the similarity between the i-th video frame image and the (i+1)-th video frame image is greater than or equal to the video similarity threshold, the computer device performs image feature matching on the (i+1)-th video frame image and the (i+2)-th video frame image to obtain their similarity, and repeats the procedure until the last video frame image is reached, thereby obtaining the key frame images of the video data. That is, the computer device calculates, for each video frame image, the similarity with the preceding video frame image; if the similarity is smaller than the video similarity threshold, that video frame image is determined as a key frame image, and if it is greater than or equal to the threshold, the computer device moves on to the next adjacent pair, determining a new key frame image each time the similarity falls below the threshold.
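The scan of steps S201–S204 can be sketched as below, assuming image features are flat vectors and cosine similarity as the measure; the feature extractor is an assumed stand-in for whatever embedding a deployment actually uses.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_key_frames(frames, extract_feature, sim_threshold=0.8):
    """Scan adjacent frame pairs; a drop in similarity marks a new key frame."""
    feats = [extract_feature(f) for f in frames]
    key_frames = []
    for i in range(len(frames) - 1):
        # similarity between the i-th and (i+1)-th video frame images
        if cosine_similarity(feats[i], feats[i + 1]) < sim_threshold:
            key_frames.append(frames[i + 1])  # scene changed: key frame
    return key_frames
```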
Since the higher the similarity between two video frame images, the more alike their image information and image details, two video frame images whose similarity exceeds the video similarity threshold can be regarded as belonging to the same Group of Pictures (GOP), where a GOP is a sequence of consecutive video frame images. By applying the above similarity calculation to adjacent video frame images throughout the video data, one or more GOPs contained in the video data can be determined. Within each GOP, the first video frame image, the (j/2)-th video frame image, and the j-th video frame image may be determined as key frame images, where j is the number of video frame images in the GOP and a positive integer. The key frame images carry the complete video information of the GOP, and their picture quality is higher than that of the other video frame images in the GOP.
Optionally, as shown in fig. 7, which is a schematic view of a scene for extracting a key frame sequence according to an embodiment of the present application, the computer device decodes the video data through a video processing tool to obtain the video frame data stream contained in the video data, and then extracts key frame images from the GOPs in that stream; for example, the first, the (j/2)-th, and the j-th video frame image of each GOP may be extracted to form a key frame image sequence comprising a plurality of key frame images. Since the watermark position in video data is generally fixed, for example at the top-left, bottom-left, top-right, or bottom-right corner, and key frame images have high image quality and complete image information, performing subsequent detection on key frame images extracted from the video data reduces detection redundancy, improves detection efficiency, and improves the accuracy of the detection result.
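A sketch of this GOP-based extraction is given below; it assumes frames have already been decoded and grouped into GOPs (for example, via the similarity scan above or the codec's own GOP boundaries), and the helper name is an assumption.

```python
def extract_key_frame_sequence(gops):
    """gops: list of GOPs, each a list of j consecutive decoded frames.
    Returns the key frame image sequence: the first, middle ((j/2)-th), and
    last (j-th) frame of every GOP, deduplicated for very short GOPs."""
    key_frame_sequence = []
    for gop in gops:
        j = len(gop)
        for idx in sorted({0, j // 2, j - 1}):
            key_frame_sequence.append(gop[idx])
    return key_frame_sequence
```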
In the embodiment of the application, a key frame image is acquired from the at least two video frame images constituting the video data; because the key frame image is representative of the image data contained in the video data, performing recognition on key frame images acquired from the video data improves the efficiency of data detection. The character region in the key frame image is determined by recognizing image features in the key frame image, so that when character data is subsequently recognized, only the character region, rather than the whole key frame image, needs to be processed, which improves recognition efficiency. Furthermore, the key frame image is first detected to locate the character region and then recognized to extract the character data from that region, which amounts to recognizing the key frame image twice and therefore improves the accuracy of data detection.
Optionally, to improve the accuracy with which the character detection model recognizes key image features and the accuracy with which the image recognition model extracts features of the character region, and thereby the accuracy of determining the video category to which the video data belongs, the computer device may train and tune both models on a large amount of sample video data before they are used. The trained models can then recognize key image features and extract character region features more accurately. Referring to fig. 8, fig. 8 is a schematic flowchart of another video data processing method according to an embodiment of the present disclosure; the method may be applied to a computer device and, as shown in fig. 8, includes:
S301: acquire a sample key frame image from at least two sample video frame images constituting sample video data, and acquire a sample region label in the sample key frame image.
In the embodiment of the application, the computer device may obtain the sample video data from a local database or from another storage medium. Sample video data refers to video data prepared for training the initial character detection model. If the sample video data consists of a single sample video frame image, that image is determined as the sample key frame image. If the sample video data consists of at least two sample video frame images, the computer device splits the sample video data into the at least two sample video frame images and performs frame extraction on them to obtain the sample key frame image; the specific method may follow the method for obtaining a key frame image from video data in step S101 and is not repeated here. The sample region label is a preset label: the goal of training the character detection model is to make the sample character region it identifies in the sample key frame image agree as closely as possible with the preset sample region label, so that the resulting character detection model is more accurate.
S302: identify sample key image features of the sample key frame image based on the initial character detection model, perform character region feature matching on the sample key image features, and determine a sample character region in the sample key frame image.
In this embodiment of the application, the computer device identifies the sample key image features of the sample key frame image based on the initial character detection model, performs character region feature matching on the sample key image features, determines the probability that the sample key image features indicate sample characters, and, based on that probability, determines the region of the sample key frame image likely to display sample characters, thereby determining the sample character region. The specific method for determining the sample character region may follow the method for determining a character region in a key frame image in step S102 and is not repeated here.
S303: generate a first loss function based on the sample character region and the sample region label, and train the initial character detection model based on the first loss function to generate the character detection model.
In the embodiment of the application, the sample character region in the sample key frame image is determined by using the initial character detection model, and the first loss function can be determined from the degree of coincidence between the sample character region and the preset sample region label. While the loss value corresponding to the first loss function is greater than a first loss threshold, the initial character detection model continues to be trained and its parameters are adjusted, until the loss value is less than or equal to the first loss threshold; at that point the trained initial character detection model is saved, yielding the character detection model. Training the character detection model on a large amount of sample video data improves its accuracy, so that the sample character region it determines reflects the information in the key frame image more precisely.
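As an illustration, the threshold-gated training of S303 might look like the PyTorch-style sketch below; the overlap-based loss, threshold value, and epoch cap are assumptions, since the patent only requires a first loss function derived from the coincidence of the predicted region and the region label.

```python
import torch

def train_detector(model, loader, optimizer, region_loss,
                   loss_threshold=0.05, max_epochs=100):
    """region_loss: first loss function comparing predicted sample character
    regions with sample region labels (e.g., a Dice/IoU-style overlap loss)."""
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for images, region_labels in loader:
            pred_regions = model(images)                 # sample character regions
            loss = region_loss(pred_regions, region_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                             # adjust model parameters
            epoch_loss += loss.item()
        if epoch_loss / len(loader) <= loss_threshold:
            break                                        # loss small enough: stop training
    return model                                         # trained character detection model
```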
S304: obtain a sample character label in the sample key frame image.
In the embodiment of the application, the sample character label is a preset label: the goal of training the image recognition model is to make the sample character data it recognizes agree as closely as possible with the preset sample character label, so that the resulting image recognition model is more accurate.
S305: extract features of the sample character region based on the initial image recognition model, and recognize sample character data of the sample key frame image from the sample character region according to the extracted sample features.
In this embodiment of the application, the computer device extracts features of the sample character region based on the initial image recognition model and recognizes the sample character data of the sample key frame image from the sample character region according to the extracted sample features; the method may follow the method for recognizing character data of a key frame image based on the image recognition model in step S103 and is not repeated here.
S306: generate a second loss function based on the sample character data and the sample character label, and train the initial image recognition model based on the second loss function to generate the image recognition model.
In the embodiment of the application, the sample character data of the sample key frame image is determined by using the initial image recognition model, and the second loss function can be determined from the degree of coincidence between the sample character data and the preset sample character label. While the loss value corresponding to the second loss function is greater than a second loss threshold, the initial image recognition model continues to be trained and its parameters are adjusted, until the loss value is less than or equal to the second loss threshold; at that point the trained initial image recognition model is saved, yielding the image recognition model. Training the image recognition model on a large amount of sample video data improves its accuracy, so that the sample character data it determines reflects the character information in the video data more precisely.
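The second loss function could, for example, be a CTC-style recognition loss over the transcription-layer output, a common choice for sequence recognizers of this kind; this is an assumption for illustration, since the patent only requires a loss derived from comparing the sample character data with the sample character labels.

```python
import torch.nn as nn

# CTC loss compares the per-timestep character scores emitted by the
# transcription layer with the (unaligned) sample character labels.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def second_loss(log_probs, targets, input_lengths, target_lengths):
    """log_probs: (T, batch, num_classes) log-softmax output of the
    transcription layer; targets: concatenated label index sequences."""
    return ctc(log_probs, targets, input_lengths, target_lengths)
```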
In the embodiment of the application, the computer device trains and tunes the models on a large amount of sample video data, so that the trained character detection model recognizes key image features more accurately and the trained image recognition model extracts character region features more accurately, thereby improving the accuracy of determining the video category to which the video data belongs.
The method of the embodiments of the present application is described above, and the apparatus of the embodiments of the present application is described below.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating a component structure of a video data processing apparatus according to an embodiment of the present application, where the video data processing apparatus may be a computer program (including program code) running in a computer device, for example, the video data processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. The apparatus 90 comprises:
an image obtaining module 91, configured to obtain a key frame image from at least two video frame images constituting video data;
a character recognition module 92, configured to recognize a key image feature of the key frame image based on a character detection model, perform character region feature matching on the key image feature, and determine a character region in the key frame image;
a character matching module 93, configured to perform feature extraction on the character region based on an image recognition model, recognize character data of the key frame image from the character region according to the extracted features, perform character matching on the character data of the key frame image and a character database, and obtain a character detection result of the character region in the key frame image;
a category determining module 94, configured to, if the character detection result is a result that the character data matches the character database, obtain a target character string matching the character data from the character database, and determine a data category corresponding to the target character string as a video category to which the video data belongs.
Optionally, the image obtaining module 91 includes:
an image matching unit 911, configured to perform image feature matching on the i-th video frame image and the (i+1)-th video frame image among the at least two video frame images, to obtain the similarity between the i-th video frame image and the (i+1)-th video frame image; i is a positive integer;

a first image determining unit 912, configured to determine the (i+1)-th video frame image as a key frame image of the video data if the similarity between the i-th video frame image and the (i+1)-th video frame image is smaller than a video similarity threshold, and perform image feature matching on the (i+1)-th video frame image and the (i+2)-th video frame image to obtain the similarity between the (i+1)-th video frame image and the (i+2)-th video frame image;

a second image determining unit 913, configured to perform image feature matching on the (i+1)-th video frame image and the (i+2)-th video frame image if the similarity between the i-th video frame image and the (i+1)-th video frame image is greater than or equal to the video similarity threshold, so as to obtain the similarity between the (i+1)-th video frame image and the (i+2)-th video frame image, until the (i+2)-th video frame image is the last of the at least two video frame images, thereby obtaining the key frame images of the video data.
The character recognition module 92 includes:
a feature extraction unit 921, configured to perform feature extraction on the key frame image based on the convolution layer in the character detection model, so as to obtain the key image features of the key frame image;

a feature splicing unit 922, configured to perform feature splicing on the key image features to obtain a spliced feature image corresponding to the key frame image, where the pixel value of each pixel in the spliced feature image represents the probability that the corresponding pixel in the key frame image belongs to a character;

an image determining unit 923, configured to obtain the probability range to which each pixel value in the spliced feature image belongs, and generate a probability image and a character frame image according to those probability ranges;

a region determining unit 924, configured to perform feature fusion on the probability image and the character frame image to generate a fused character image, and determine the character region in the key frame image based on the fused character image (a simplified sketch of this post-processing follows below).
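A minimal sketch of that probability-map post-processing is given here under simplifying assumptions: the spliced feature image is taken to be a per-pixel text-probability map, and the probability image, character frame image, and their fusion are collapsed into a single thresholded enclosing region; real segmentation-based detectors are more elaborate.

```python
import numpy as np

def character_region(prob_map, prob_threshold=0.5):
    """prob_map: 2-D array of per-pixel character probabilities."""
    probability_image = prob_map >= prob_threshold     # probability image (binary)
    ys, xs = np.nonzero(probability_image)
    if xs.size == 0:
        return None                                    # no character region found
    # fused character image collapsed to one enclosing character region box
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
```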
The character matching module 93 includes:
a sequence obtaining unit 931, configured to perform feature extraction on the character region based on the convolution layer in the image recognition model to obtain the convolution features corresponding to the character region, and perform serialization processing on the convolution features to obtain the feature sequence corresponding to the character region;

a loop processing unit 932, configured to perform recognition processing on the feature sequence based on the recurrent layer in the image recognition model, and determine the sequence character features corresponding to the feature sequence;

a feature conversion unit 933, configured to perform feature conversion on the sequence character features based on the transcription layer in the image recognition model, to obtain the character data of the key frame image (a sketch of this convolution–recurrent–transcription pipeline follows below).
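The three units describe a CRNN-style recognizer. The following PyTorch sketch shows one plausible shape of that pipeline; the layer sizes, input height, and character set size are illustrative assumptions, not values fixed by the patent.

```python
import torch.nn as nn

class CRNN(nn.Module):
    """Convolution layer -> serialization -> recurrent layer -> transcription layer."""
    def __init__(self, num_classes=100):
        super().__init__()
        self.conv = nn.Sequential(                     # convolution features
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)))
        self.rnn = nn.LSTM(128 * 8, 256, bidirectional=True,
                           batch_first=True)           # recurrent layer
        self.fc = nn.Linear(512, num_classes)          # transcription layer

    def forward(self, x):                              # x: (batch, 1, 32, width) region crop
        f = self.conv(x)                               # (batch, 128, 8, w)
        f = f.permute(0, 3, 1, 2).flatten(2)           # serialization: (batch, w, 1024)
        seq, _ = self.rnn(f)                           # sequence character features
        return self.fc(seq)                            # per-step character scores
```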
Optionally, the number of key frame images in the video data is N; n is a positive integer; the character matching module 93 includes:
a character combining unit 934, configured to combine character data of the N key frame images in the video data to obtain combined character data;
a word segmentation determining unit 935, configured to perform word segmentation processing on the combined character data, and determine M word segmentation character data corresponding to the video data; m is a positive integer;
a character matching unit 936, configured to perform character matching on the M word segmentation character data corresponding to the video data and the character database, respectively, to obtain k matching character strings and matching numbers corresponding to the k matching character strings, respectively; the matching number is used for representing the number of character data matched with the matching character string in the M word segmentation character data; k is a positive integer;
a result determining unit 937, configured to determine that the character detection result is a result of matching the character data with the character database if there is a matching character string whose matching number is greater than a matching threshold;
a character determination unit 938 configured to determine a matching character string whose matching number is greater than the matching threshold as a target character string matching the character data.
Optionally, the apparatus 90 further comprises:
a data response module 95, configured to respond to an upload request of the user terminal for the video data;
the data prompting module 96 is configured to send a data upload exception prompt to the user terminal if the video category to which the video data belongs is a tagged video category; the data upload exception prompt comprises the video category to which the video data belongs;
and the data uploading module 97 is configured to upload the video data to an application program if the video category to which the video data belongs does not belong to the tagged video category.
It should be noted that, for the content that is not mentioned in the embodiment corresponding to fig. 9, reference may be made to the description of the method embodiment, and details are not described here again.
In the embodiment of the application, a key frame image is acquired from the at least two video frame images constituting the video data; because the key frame image is representative of the image data contained in the video data, performing recognition on key frame images acquired from the video data improves the efficiency of data detection. The character region in the key frame image is determined by recognizing image features in the key frame image, so that when character data is subsequently recognized, only the character region, rather than the whole key frame image, needs to be processed, which improves recognition efficiency. Furthermore, the key frame image is first detected to locate the character region and then recognized to extract the character data from that region, which amounts to recognizing the key frame image twice and therefore improves the accuracy of data detection.
Referring to fig. 10, fig. 10 is a schematic diagram illustrating a structure of another video data processing apparatus according to an embodiment of the present application, where the video data processing apparatus may be a computer program (including program code) running in a computer device, for example, the video data processing apparatus is an application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. The apparatus 100 comprises:
a region label obtaining module 1001, configured to obtain a sample key frame image from at least two sample video frame images constituting sample video data, and obtain a sample region label in the sample key frame image;
a sample region determining module 1002, configured to identify sample key image features of the sample key frame image based on an initial character detection model, perform character region feature matching on the sample key image features, and determine a sample character region in the sample key frame image;
a detection model generating module 1003, configured to generate a first loss function based on the sample character region and the sample region label, train the initial character detection model based on the first loss function, and generate a character detection model.
Optionally, the apparatus 100 further includes:
a character tag obtaining module 1004, configured to obtain a sample character tag in the sample key frame image;
a sample character acquisition module 1005, configured to perform feature extraction on the sample character region based on the initial image recognition model, and recognize sample character data of the sample key frame image from the sample character region according to the extracted sample feature;
the recognition model generation module 1006 is configured to generate a second loss function based on the sample character data and the sample character tag, train the initial image recognition model based on the second loss function, and generate an image recognition model.
It should be noted that, for the content that is not mentioned in the embodiment corresponding to fig. 10, reference may be made to the description of the method embodiment, and details are not described here again.
In the embodiment of the application, the computer device trains and tunes the models on a large amount of sample video data, so that the trained character detection model recognizes key image features more accurately and the trained image recognition model extracts character region features more accurately, thereby improving the accuracy of determining the video category to which the video data belongs.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 11, the computer device 110 may include a processor 1101, a network interface 1104, and a memory 1105, and may further include a user interface 1103 and at least one communication bus 1102, where the communication bus 1102 is used to enable connective communication between these components. The user interface 1103 may include a display and a keyboard, and optionally a standard wired interface and a standard wireless interface. The network interface 1104 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1105 may be a high-speed RAM memory or a non-volatile memory, for example at least one disk memory; optionally, it may also be at least one storage device located remotely from the processor 1101. As shown in fig. 11, the memory 1105, as a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 110 of fig. 11, the network interface 1104 may provide network communication functions, the user interface 1103 is primarily used to provide an interface for user input, and the processor 1101 may be configured to invoke the device control application stored in the memory 1105 to implement:
acquiring a key frame image from at least two video frame images constituting video data;
identifying key image features of the key frame image based on a character detection model, performing character region feature matching on the key image features, and determining a character region in the key frame image;
extracting the characteristics of the character area based on an image recognition model, recognizing the character data of the key frame image from the character area according to the extracted characteristics, and performing character matching on the character data of the key frame image and a character database to obtain a character detection result of the character area in the key frame image;
and if the character detection result is the result of matching the character data with the character database, acquiring a target character string matched with the character data from the character database, and determining the data type corresponding to the target character string as the video type of the video data.
It should be understood that the computer device 110 described in this embodiment of the present application may perform the video data processing methods described in the embodiments corresponding to fig. 3, fig. 6, and fig. 8, and may also perform the functions of the video data processing apparatuses described in the embodiments corresponding to fig. 9 and fig. 10, which are not repeated here. In addition, the beneficial effects of the same methods are not described again.
In the embodiment of the application, a key frame image is acquired from the at least two video frame images constituting the video data; because the key frame image is representative of the image data contained in the video data, performing recognition on key frame images acquired from the video data improves the efficiency of data detection. The character region in the key frame image is determined by recognizing image features in the key frame image, so that when character data is subsequently recognized, only the character region, rather than the whole key frame image, needs to be processed, which improves recognition efficiency. Furthermore, the key frame image is first detected to locate the character region and then recognized to extract the character data from that region, which amounts to recognizing the key frame image twice and therefore improves the accuracy of data detection.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the method according to the foregoing embodiments; the computer may be part of the above-mentioned computer device, such as the processor 1101. By way of example, the program instructions may be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, which may comprise a blockchain network.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the claims of the present application; equivalent variations and modifications made in accordance with the claims of the present application shall still fall within the scope of the present application.

Claims (10)

1. A method of processing video data, comprising:
acquiring a key frame image from at least two video frame images constituting video data;
identifying key image features of the key frame image based on a character detection model, performing character region feature matching on the key image features, and determining a character region in the key frame image;
extracting the characteristics of the character region based on an image recognition model, recognizing the character data of the key frame image from the character region according to the extracted characteristics, and performing character matching on the character data of the key frame image and a character database to obtain a character detection result of the character region in the key frame image;
and if the character detection result is the result of matching the character data with the character database, acquiring a target character string matched with the character data from the character database, and determining the data type corresponding to the target character string as the video type of the video data.
2. The method of claim 1, wherein said obtaining a key frame image from at least two video frame images comprising video data comprises:
performing image feature matching on the i-th video frame image and the (i+1)-th video frame image in the at least two video frame images to obtain the similarity between the i-th video frame image and the (i+1)-th video frame image; i is a positive integer;

if the similarity between the i-th video frame image and the (i+1)-th video frame image is smaller than a video similarity threshold, determining the (i+1)-th video frame image as a key frame image of the video data, and performing image feature matching on the (i+1)-th video frame image and the (i+2)-th video frame image to obtain the similarity between the (i+1)-th video frame image and the (i+2)-th video frame image;

if the similarity between the i-th video frame image and the (i+1)-th video frame image is greater than or equal to the video similarity threshold, performing image feature matching on the (i+1)-th video frame image and the (i+2)-th video frame image to obtain the similarity between the (i+1)-th video frame image and the (i+2)-th video frame image;

and obtaining a key frame image of the video data until the (i+2)-th video frame image is the last video frame image of the at least two video frame images.
3. The method of claim 1, wherein the identifying key image features of the key frame image based on the character detection model, performing character region feature matching on the key image features, and determining character regions in the key frame image comprises:
extracting the characteristics of the key frame image based on the convolution layer in the character detection model to obtain the key image characteristics of the key frame image;
performing feature splicing on the key image features to obtain a spliced feature image corresponding to the key frame image; the pixel values of the pixels in the spliced feature image are used for representing the probability that the corresponding pixels in the key frame image are characters;
acquiring the probability range to which each pixel value in the spliced characteristic image belongs, and generating a probability image and a character frame image according to the probability range to which each pixel value belongs;
and performing feature fusion on the probability image and the character frame image to generate a fusion character image, and determining a character area in the key frame image based on the fusion character image.
4. The method according to claim 1, wherein the extracting features of the character region based on the image recognition model, and recognizing the character data of the key frame image from the character region according to the extracted features comprises:
performing feature extraction on the character region based on the convolution layer in the image recognition model to obtain convolution features corresponding to the character region, and performing serialization processing on the convolution features corresponding to the character region to obtain a feature sequence corresponding to the character region;
recognizing the feature sequence based on a recurrent layer in the image recognition model, and determining sequence character features corresponding to the feature sequence;
and performing feature conversion on the sequence character features based on a transcription layer in the image recognition model to obtain character data of the key frame image.
5. The method according to any of claims 1-4, wherein the number of key frame images in the video data is N; n is a positive integer;
the character matching of the character data of the key frame image and a character database to obtain the character detection result of the character area in the key frame image comprises the following steps:
combining character data of N key frame images in the video data to obtain combined character data;
performing word segmentation processing on the combined character data, and determining M word segmentation character data corresponding to the video data; m is a positive integer;
performing character matching on the M word segmentation character data corresponding to the video data and the character database respectively to obtain k matched character strings and the matching number respectively corresponding to the k matched character strings; the matching number is used for representing the number of character data matched with the matching character string in the M word segmentation character data; k is a positive integer;
if the matched character strings with the matching number larger than the matching threshold exist, determining that the character detection result is the result of matching the character data with the character database;
the obtaining of the target character string matched with the character data from the character database includes:
and determining the matched character strings with the matching number larger than the matching threshold value as target character strings matched with the character data.
6. The method of claim 1, further comprising:
responding to an uploading request of the user terminal for the video data;
if the video category to which the video data belongs is a tagged video category, sending a data upload exception prompt to the user terminal; the data upload exception prompt comprises the video category to which the video data belongs;

and if the video category to which the video data belongs is not a tagged video category, uploading the video data to an application program.
7. A method of processing video data, comprising:
acquiring a sample key frame image from at least two sample video frame images forming sample video data, and acquiring a sample area label in the sample key frame image;
identifying sample key image features of the sample key frame image based on an initial character detection model, performing character region feature matching on the sample key image features, and determining a sample character region in the sample key frame image;
and generating a first loss function based on the sample character region and the sample region label, and training the initial character detection model based on the first loss function to generate a character detection model.
8. The method of claim 7, further comprising:
obtaining a sample character label in the sample key frame image;
performing feature extraction on the sample character region based on an initial image recognition model, and recognizing sample character data of the sample key frame image from the sample character region according to the extracted sample features;
and generating a second loss function based on the sample character data and the sample character label, and training the initial image recognition model based on the second loss function to generate an image recognition model.
9. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide data communication functions, the memory is configured to store program code, and the processor is configured to call the program code to cause the computer device to perform the method of any one of claims 1 to 6 or to perform the method of any one of claims 7 to 8.
10. A computer-readable storage medium, characterized in that it stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-6 or to perform the method of any of claims 7-8.
CN202110159590.2A 2021-02-04 2021-02-04 Video data processing method, computer equipment and readable storage medium Active CN113011254B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110159590.2A CN113011254B (en) 2021-02-04 2021-02-04 Video data processing method, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110159590.2A CN113011254B (en) 2021-02-04 2021-02-04 Video data processing method, computer equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113011254A true CN113011254A (en) 2021-06-22
CN113011254B CN113011254B (en) 2023-11-07

Family

ID=76383920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110159590.2A Active CN113011254B (en) 2021-02-04 2021-02-04 Video data processing method, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113011254B (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110781881A (en) * 2019-09-10 2020-02-11 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for identifying match scores in video
CN110837579A (en) * 2019-11-05 2020-02-25 腾讯科技(深圳)有限公司 Video classification method, device, computer and readable storage medium
CN111160335A (en) * 2020-01-02 2020-05-15 腾讯科技(深圳)有限公司 Image watermarking processing method and device based on artificial intelligence and electronic equipment
CN111209909A (en) * 2020-01-13 2020-05-29 百度在线网络技术(北京)有限公司 Qualification identification template construction method, device, equipment and storage medium
CN111695439A (en) * 2020-05-20 2020-09-22 平安科技(深圳)有限公司 Image structured data extraction method, electronic device and storage medium
CN111582241A (en) * 2020-06-01 2020-08-25 腾讯科技(深圳)有限公司 Video subtitle recognition method, device, equipment and storage medium
CN111680688A (en) * 2020-06-10 2020-09-18 创新奇智(成都)科技有限公司 Character recognition method and device, electronic equipment and storage medium
CN111967302A (en) * 2020-06-30 2020-11-20 北京百度网讯科技有限公司 Video tag generation method and device and electronic equipment
CN111950424A (en) * 2020-08-06 2020-11-17 腾讯科技(深圳)有限公司 Video data processing method and device, computer and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114037909A (en) * 2021-11-16 2022-02-11 中国电子科技集团公司第二十八研究所 Automatic video labeling method and system for ship name identification characters
CN117132936A (en) * 2023-08-31 2023-11-28 北京中电拓方科技股份有限公司 Data carding and data access system of coal plate self-building system

Also Published As

Publication number Publication date
CN113011254B (en) 2023-11-07

Similar Documents

Publication Publication Date Title
EP4123503A1 (en) Image authenticity detection method and apparatus, computer device and storage medium
CN111950424B (en) Video data processing method and device, computer and readable storage medium
CN110853033B (en) Video detection method and device based on inter-frame similarity
KR101710050B1 (en) Image identification systems and method
CN111275784B (en) Method and device for generating image
US20190294900A1 (en) Remote user identity validation with threshold-based matching
CN113011254B (en) Video data processing method, computer equipment and readable storage medium
CN111741329B (en) Video processing method, device, equipment and storage medium
CN112132030A (en) Video processing method and device, storage medium and electronic equipment
CN112488072A (en) Method, system and equipment for acquiring face sample set
US20240153041A1 (en) Image processing method and apparatus, computer, readable storage medium, and program product
CN113573044B (en) Video data processing method and device, computer equipment and readable storage medium
US20210099772A1 (en) System and method for verification of video integrity based on blockchain
CN112822539A (en) Information display method, device, server and storage medium
US20210209256A1 (en) Peceptual video fingerprinting
CN111783734A (en) Original edition video identification method and device
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
US20230325961A1 (en) Zoom agnostic watermark extraction
US20230325959A1 (en) Zoom agnostic watermark extraction
CN117009577A (en) Video data processing method, device, equipment and readable storage medium
CN114140427A (en) Object detection method and device
CN114299105A (en) Image processing method, image processing device, computer equipment and storage medium
CN115082873A (en) Image recognition method and device based on path fusion and storage medium
CN117093664A (en) Data processing method, data identification device, medium, and program product
CN116912223A (en) Image tampering detection method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40046828

Country of ref document: HK

GR01 Patent grant