CN111813998A - Video data processing method, device, equipment and storage medium

Video data processing method, device, equipment and storage medium

Info

Publication number
CN111813998A
Authority
CN
China
Prior art keywords
video
information
video frame
subtitle content
frame sequence
Prior art date
Legal status
Granted
Application number
CN202010943940.XA
Other languages
Chinese (zh)
Other versions
CN111813998B (en)
Inventor
秦勇
李兵
Current Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202010943940.XA
Publication of CN111813998A
Application granted
Publication of CN111813998B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06F16/7834: Information retrieval of video data; retrieval characterised by metadata automatically derived from the content, using audio features
    • G06F16/75: Information retrieval of video data; clustering; classification
    • G06F16/7844: Information retrieval of video data; retrieval using original textual content or text extracted from visual content or a transcript of audio data
    • G06F16/7867: Information retrieval of video data; retrieval using manually generated information, e.g. tags, keywords, comments, title and artist information
    • G06F16/951: Retrieval from the web; indexing; web crawling techniques
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06V30/413: Analysis of document content; classification of content, e.g. text, photographs or tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Studio Circuits (AREA)

Abstract

The application provides a video data processing method, apparatus, device, and storage medium. The method includes: determining video information, where the video information is selected from preset video information and the preset video information is obtained by separating the video and audio of video data; determining subtitle content displayed by video frames in the video information; classifying the video frames in the video information at least based on the subtitle content to obtain a video frame sequence, where the subtitle content displayed by the video frames in the video frame sequence is associated; and determining time information corresponding to the video frame sequence to obtain time information of the subtitle content corresponding to the video frame sequence, so as to determine, from the audio information of the video data, target audio information matched with the subtitle content corresponding to the video frame sequence.

Description

Video data processing method, device, equipment and storage medium
Technical Field
The present application relates to data processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing video data.
Background
Existing technologies dominated by deep learning neural network models require training data, but at present most training data has to be produced through manual annotation. For certain specific models, manual recording and playback are even needed to generate training data in which video, audio, and subtitle content are matched, which greatly increases cost.
Disclosure of Invention
The embodiments of the application provide a video data processing method, apparatus, device, and storage medium to solve the problems in the related art. The technical solution is as follows:
in a first aspect, an embodiment of the present application provides a video data processing method, including:
determining video information, wherein the video information is selected from preset video information, and the preset video information is obtained by separating video and audio of video data;
determining subtitle content displayed by a video frame in the video information;
classifying video frames in the video information at least based on subtitle content to obtain a video frame sequence, wherein the subtitle content displayed by each video frame in the video frame sequence is associated;
and determining time information corresponding to the video frame sequence to obtain time information of the subtitle content corresponding to the video frame sequence, so as to determine target audio information matched with the subtitle content corresponding to the video frame sequence from the audio information of the video data.
In one embodiment, the method further comprises:
generating a video segment based on the video frame sequence, the subtitle content corresponding to the video frame sequence and the determined target audio information corresponding to the video frame sequence; wherein the target audio information in the video segment matches the subtitle content presented by the video segment.
In one embodiment, the method further comprises:
taking the video frame sequence and the determined target audio information corresponding to the video frame sequence as training data; or, using a video segment generated based on the video frame sequence and the target audio information as training data;
and at least inputting the training data into a preset model so as to train the preset model by utilizing the corresponding relation between the key point characteristics of the face image in the video frame of the training data and the audio characteristics of the target audio information.
In one embodiment, the method further comprises:
acquiring video data, wherein subtitle content is displayed in the video data;
separating the video and the audio in the video data to obtain video information and audio information;
and taking the video information obtained by separation as preset video information.
In one embodiment, the determining the subtitle content represented by the video frame in the video information includes:
detecting the position of the subtitle content in the video frame of the video information;
and performing text recognition on the position of the subtitle content in the video frame to obtain the subtitle content displayed by the video frame in the video information.
In one embodiment, the detecting the position of the subtitle content in the video frame of the video information includes:
acquiring a text detection model;
and inputting the video frame of the video information into the text detection model to obtain the position of the subtitle content in the video frame of the video information.
In an embodiment, the performing text recognition on the position of the subtitle content in the video frame to obtain the subtitle content shown by the video frame in the video information includes:
acquiring a text recognition model;
and inputting the picture corresponding to the position of the subtitle content in the video frame to the text recognition model to obtain the subtitle content displayed by the video frame.
In a second aspect, an embodiment of the present application provides a video data processing apparatus, including:
the video information determining unit is used for determining video information, wherein the video information is selected from preset video information, and the preset video information is obtained by separating video and audio of video data;
the caption content determining unit is used for determining the caption content displayed by the video frame in the video information;
the classification processing unit is used for classifying the video frames in the video information at least based on the subtitle content to obtain a video frame sequence, wherein the subtitle content displayed by each video frame in the video frame sequence is associated;
and the audio information determining unit is used for determining the time information corresponding to the video frame sequence to obtain the time information of the subtitle content corresponding to the video frame sequence, so as to determine the target audio information matched with the subtitle content corresponding to the video frame sequence from the audio information of the video data.
In one embodiment, the method further comprises: a video segment generating unit, configured to generate a video segment based on the video frame sequence, the subtitle content corresponding to the video frame sequence, and the determined target audio information corresponding to the video frame sequence; wherein the target audio information in the video segment matches the subtitle content presented by the video segment.
In one embodiment, the method further comprises: a data transmission unit further configured to:
taking the video frame sequence and the determined target audio information corresponding to the video frame sequence as training data; or, using a video segment generated based on the video frame sequence and the target audio information as training data;
and at least inputting the training data into a preset model so as to train the preset model by utilizing the corresponding relation between the key point characteristics of the face image in the video frame of the training data and the audio characteristics of the target audio information.
In one embodiment, the video information determination unit is further configured to:
acquiring video data, wherein subtitle content is displayed in the video data;
separating the video and the audio in the video data to obtain video information and audio information;
and taking the video information obtained by separation as preset video information.
In one embodiment, the subtitle content determining unit is further configured to:
detecting the position of the subtitle content in the video frame of the video information;
and performing text recognition on the position of the subtitle content in the video frame to obtain the subtitle content displayed by the video frame in the video information.
In one embodiment, the subtitle content determining unit is further configured to:
acquiring a text detection model;
and inputting the video frame of the video information into the text detection model to obtain the position of the subtitle content in the video frame of the video information.
In one embodiment, the subtitle content determining unit is further configured to:
acquiring a text recognition model;
and inputting the picture corresponding to the position of the subtitle content in the video frame to the text recognition model to obtain the subtitle content displayed by the video frame.
In a third aspect, an embodiment of the present application provides a video data processing apparatus, including a memory and a processor, where the memory and the processor communicate with each other via an internal connection path, the memory is configured to store instructions, and the processor is configured to execute the instructions stored in the memory so as to perform the method of any one of the above aspects.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the method in any one of the above-mentioned aspects is executed.
The advantages or beneficial effects in the above technical solution at least include:
by using the scheme of the application, the target audio information corresponding to the video frame sequence can be determined from any video data, and the target audio information is matched with the subtitle content presented by the video frame sequence, so that manual marking is not needed in the process, and the processing efficiency of the video data is improved.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
Fig. 1 shows a flow chart of a video data processing method according to an embodiment of the present application;
FIG. 2 illustrates a flow diagram of a video data processing method in a specific example according to an embodiment of the present application;
fig. 3 is a block diagram showing the configuration of a video data processing apparatus according to an embodiment of the present invention;
fig. 4 shows a block diagram of a video data processing apparatus according to an embodiment of the present invention.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Fig. 1 shows a flow chart of a video data processing method according to an embodiment of the present application. As shown in fig. 1, the method may include:
step S101: and determining video information, wherein the video information is selected from preset video information, and the preset video information is obtained by separating video and audio of video data.
Step S102: and determining the subtitle content displayed by the video frame in the video information.
Step S103: classifying the video frames in the video information at least based on the subtitle content to obtain a video frame sequence, wherein the subtitle content displayed by each video frame in the video frame sequence is associated.
Step S104: and determining time information corresponding to the video frame sequence to obtain time information of the subtitle content corresponding to the video frame sequence, so as to determine target audio information matched with the subtitle content corresponding to the video frame sequence from the audio information of the video data.
The video data can be any video data with subtitle content collected from the Internet. With the scheme of the application, the target audio information corresponding to the video frame sequence can be determined from any such video data, and the target audio information matches the subtitle content presented by the video frame sequence.
Moreover, the scheme can efficiently obtain the target audio information corresponding to the video frame sequence, and the target audio information matches the subtitle content presented by the video frame sequence, thereby providing training data for subsequent model training.
Here, in a specific example, the video frames in the video information are classified so that frames with the same subtitle content fall into one group, yielding a video frame sequence; the subtitle content of every video frame in the sequence is the same, and the sequence is arranged according to the time order of the video information. A minimal sketch of this grouping is given below.
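By way of illustration and not limitation, the grouping described above can be sketched as follows; this is a minimal sketch rather than the claimed implementation, and the (frame_index, subtitle_text) input structure is an assumption made for the example.

from itertools import groupby

def group_frames_by_subtitle(frames):
    # frames: list of (frame_index, subtitle_text) tuples already ordered by time;
    # each contiguous run of identical subtitle text becomes one video frame sequence.
    sequences = []
    for text, run in groupby(frames, key=lambda f: f[1]):
        run = list(run)
        if not text:  # skip frames for which no subtitle was recognized
            continue
        sequences.append({
            "subtitle": text,
            "start_frame": run[0][0],
            "end_frame": run[-1][0],
            "frame_indices": [idx for idx, _ in run],
        })
    return sequences

# Example: frames 0-2 share one subtitle, frames 3-4 another.
demo = [(0, "hello"), (1, "hello"), (2, "hello"), (3, "world"), (4, "world")]
print(group_frames_by_subtitle(demo))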
In a specific example of the scheme of the present application, a video segment may be further generated based on the video frame sequence, the subtitle content corresponding to the video frame sequence, and the determined target audio information corresponding to the video frame sequence; wherein the target audio information in the video segment matches the subtitle content presented by the video segment. Therefore, the purpose of cutting any video data into video segments is achieved, and the subtitle content presented by the video frames in the obtained video segments is matched with the audio information, so that training data can be conveniently provided for subsequent model training.
In a specific example of the scheme of the application, the sequence of video frames and the determined target audio information corresponding to the sequence of video frames may be used as training data, and then at least the training data is input to a preset model, so as to train the preset model by using a correspondence between key point features of face images in the video frames of the training data and audio features of the target audio information.
Or, in another specific example, a video clip generated based on the video frame sequence and the target audio information is used as training data, and then at least the training data is input to a preset model, so as to train the preset model by using a correspondence between key point features of a face image in a video frame of the training data and audio features of the target audio information.
Therefore, the scheme of the application can efficiently obtain the target audio information corresponding to the video frame sequence, with the target audio information matching the subtitle content presented by the video frame sequence; compared with obtaining training data through manual annotation, this can greatly reduce the cost of training data.
In a specific example of the scheme of the application, preset video information may be obtained in the following manner, specifically, video data is obtained, where subtitle content is shown in the video data; separating the video and the audio in the video data to obtain video information and audio information; and the video information obtained by separation is used as preset video information, so that a data base is laid for the subsequent processing of video data.
In a specific example of the scheme of the present application, the subtitle content shown by a video frame in the video information may be obtained as follows: the position of the subtitle content in the video frame of the video information is detected, and text recognition is then performed on that position to obtain the subtitle content displayed by the video frame. On the one hand, this reduces recognition cost and avoids unnecessary recognition tasks; on the other hand, it lays a foundation for accurately recognizing the subtitle content and for improving the accuracy of the recognition result, namely the subtitle content.
In a specific example of the scheme of the application, a model may be used to obtain a position where subtitle content is located in a video frame of the video information, specifically, a text detection model is obtained; and inputting the video frame of the video information into the text detection model to obtain the position of the subtitle content in the video frame of the video information. Therefore, the accuracy of the position of the identified subtitle content is improved by using the model, and a foundation is laid for effectively avoiding unnecessary identification processing tasks.
In a specific example of the solution of the present application, obtaining the subtitle content displayed by a video frame in the video information by using a model includes: acquiring a text recognition model; and inputting the picture corresponding to the position of the subtitle content in the video frame to the text recognition model to obtain the subtitle content displayed by the video frame. Therefore, the recognition accuracy of the subtitle content is improved by using the model.
Certainly, in a specific example, the text detection model may be used to obtain the position of the subtitle content in the video frame of the video information, and then the text recognition model is used to obtain the subtitle content displayed by the video frame, so that on one hand, the recognition cost can be reduced, unnecessary recognition processing tasks can be avoided, and on the other hand, the accuracy of the recognized subtitle content can be effectively improved.
Therefore, the target audio information corresponding to the video frame sequence can be determined from any video data by using the scheme of the application, and the target audio information is matched with the subtitle content presented by the video frame sequence.
The present solution is further explained in detail below with reference to a specific example, as shown in Fig. 2.
The scheme of the application aims to apply a Differentiable Binarization (DB) text detection method, a Convolutional Recurrent Neural Network (CRNN), and a cross-platform general-purpose library, such as the face detection, recognition, and matching technology provided by the DLIB library, to subtitled videos such as TV dramas, animation films, and variety shows, so as to obtain the large amount of training data required by an audio-driven talking-face video generation model for a specific person, thereby reducing the cost of manually recording training data. Here, the audio-driven model for generating talking-face videos of a specific person (i.e., the above preset model) uses the correspondence between key point features of the face image (e.g., key point features of the mouth shape) and audio features of the audio information, so as to obtain video data in which the mouth shape and the audio (mouth shape changes and audio changes) are matched in conversation scenes such as speeches and chats of a specific person. Specifically, the training data required by this model comprises two parts: the first part is a large amount of audio information and the corresponding mouth key point coordinate information, and the second part is a large number of face images in which all regions other than the lips are masked, together with the corresponding complete face images. This information can be separated from subtitled conversation videos of a specific person, so the training data required to train the model can be obtained simply by collecting specific audio clips and corresponding video clips from videos such as TV dramas, animation films, and variety shows.
Based on this, the core technical solution of the application includes three parts: a first part, which obtains a large number of audio clips and corresponding video clips using a DB model (i.e., a text detection model) and a CRNN model (i.e., a text recognition model); a second part, which processes each video clip using the face detection, recognition, and matching models provided by the DLIB library to obtain a large number of qualifying specific-person video clips and their corresponding audio clips; and a third part, which processes and organizes the data to obtain the training data to be collected. The specific steps are as follows:
a first part:
Collect a large amount of video data with subtitle content, including TV dramas, animation films, variety shows, and the like; then randomly select a small amount of the video data, split each piece of video data into pictures by frame, manually mark the subtitle position in each picture, and train a DB model with the manually marked pictures to obtain a text detection model capable of detecting text positions in images. Specifically:
Randomly select a large amount of video data and split each piece into pictures by frame; crop off the lower half of each picture (i.e., remove the text or subtitle portion) and keep only the upper half. Crawl a large number of novels from the network, and paste randomly selected text sentences from the novels (with randomly varied font, size, and content) onto the lower edge of the retained upper half of each picture. (Here, for each cropped picture, several new synthesized text pictures with different text sentences, fonts, and sizes can be generated at random; because of the translational invariance of images, adding new text content after the original text portion has been cropped off does not significantly affect the recognizability of the whole picture.) Then crop each picture carrying a newly synthesized text sentence, taking the portion containing the text sentence as input data for the CRNN model; since the text sentences are crawled from the network and are known data, they can serve as output data to train the CRNN model, yielding a text recognition model capable of recognizing the text content in images. Obtaining a large amount of sample data by crawling novels, extracting text sentences, and pasting them onto images avoids manually labeling each image and improves model training efficiency.
Split each piece of collected video data (i.e., the above large amount of subtitled video data, including TV dramas, animation films, and variety shows) into frames and feed the frames into the DB model to obtain the subtitle position in each frame image; then crop out the subtitle portion of each image and feed it into the CRNN model for text recognition to obtain the subtitle content. Arrange the recognized subtitle content in time order and put it into a set in sequence; merge identical subtitle content and record the merged video frames; then obtain the start time and end time of each text sentence from the total frame count and total duration of each piece of video data; finally, cut the audio clip out of the audio corresponding to the video data according to the start and end times, and cut out the video clip corresponding to that audio clip in time order, so that the video clip can serve as training data. Here, in practical application, if the collected video data are Chinese videos, a large number of Chinese novels are crawled; if the collected video data are English videos, a large number of English novels are crawled.
A second part:
The face detection model provided by the DLIB library is used to perform face detection on the video clips obtained in the first part in turn; video clips in which two or more faces appear are deleted, and videos of the same face are grouped together. This lays a foundation for subsequently obtaining the training data of the audio-driven talking-face video generation model for a specific person, avoids interference from multiple faces in the training process, and lays a foundation for improving the accuracy of the model.
And a third part:
Using the face key point detection model provided by the DLIB library, 68 facial key points are extracted from each frame image of the face videos obtained in the second part, and the 20 lip key points are then retained. The audio clips and the lip key point sequences form the training data of a model that converts audio information into mouth key point information. Then, according to the lip key points, a mask is applied to the corresponding face image and a lip line is drawn, yielding a lip-masked face image; the lip-masked face images and the corresponding complete face images form the training data set of the generative model (i.e., the audio-driven talking-face video generation model for a specific person).
The method comprises the following specific steps:
In the first step, a large amount of subtitled video data is collected from the Internet using a web crawler.
In the second step, the audio information and the video information in each piece of video data collected in the first step are extracted separately using the FFmpeg tool.
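By way of illustration and not limitation, a minimal sketch of this separation step is given below; it assumes the FFmpeg command-line tool is available, and the output file names, codec, and sample rate are illustrative choices rather than values fixed by the present application.

import subprocess
from pathlib import Path

def separate_av(video_path, out_dir):
    # Split one video file into a silent video stream and a mono 16 kHz audio track.
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(video_path).stem
    video_only = str(out / (stem + "_video.mp4"))
    audio_only = str(out / (stem + "_audio.wav"))
    # -an drops the audio stream, -vn drops the video stream; codec choices are illustrative.
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-an", "-c:v", "copy", video_only], check=True)
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_only], check=True)
    return video_only, audio_only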
In the third step, a small amount of video information is randomly selected from the video information obtained in the second step, and each piece of video information is split into pictures by frame.
In the fourth step, the subtitle positions in the pictures obtained in the third step are marked manually.
In the fifth step, a DB model is trained using the pictures with subtitle positions marked in the fourth step, to obtain a text detection model capable of detecting text positions in images.
In the sixth step, similarly to the third step, a large amount of video information is randomly selected from the video information obtained in the second step, and each piece of video information is split into pictures by frame.
In the seventh step, the lower half of each picture obtained in the sixth step, which contains the subtitle content, is cropped off using a built-in OpenCV function, and only the upper half of each picture is kept; the upper half contains almost no subtitle content and is referred to as a picture with the same texture structure.
In the eighth step, a large number of novels are crawled from the network using a web crawler.
In the ninth step, text content from the novels obtained in the eighth step, such as text sentences, is pasted onto the lower edge of the pictures with the same texture structure obtained in the seventh step using a built-in OpenCV function; the font, color, and size of the text sentences vary randomly within a specified range. The pictures obtained in this step are referred to as pasted-text pictures, and the lower portion of each pasted-text picture is a text sentence from a novel.
In the tenth step, the region where the text sentence is located in each pasted-text picture from the ninth step is cropped out to serve as a picture for recognition.
Here, since the text sentences in the pictures for recognition are crawled from the network and are known content, no manual labeling is needed; training the CRNN model with these pictures therefore avoids manual labeling and reduces labor cost.
In the eleventh step, a CRNN model is trained using the pictures for recognition obtained in the tenth step together with the text sentences they contain, to obtain a text recognition model capable of recognizing text content. A rough sketch of the data synthesis in the seventh through tenth steps is given below.
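By way of illustration and not limitation, the seventh through tenth steps might be sketched roughly as follows. OpenCV is used for cropping as described above, while the sentence is drawn with Pillow so that Chinese fonts can be rendered; the font file, geometry, and colors are assumptions of the example. Each returned pair of a cropped text image and its known sentence is one training sample for the CRNN model.

import cv2
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def make_recognition_sample(frame_bgr, sentence, font_path="font.ttf", font_size=32):
    # Seventh step: crop off the lower (subtitle) half and keep the upper half.
    h = frame_bgr.shape[0]
    upper = frame_bgr[: h // 2]
    # Ninth step: paste a known sentence from a crawled novel onto the lower edge.
    img = Image.fromarray(cv2.cvtColor(upper, cv2.COLOR_BGR2RGB))
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, font_size)  # font file is an assumed asset
    x, y = 10, img.height - font_size - 5
    draw.text((x, y), sentence, fill=(255, 255, 255), font=font)
    pasted = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
    # Tenth step: crop the region where the text sentence sits as the picture for recognition.
    text_w = int(draw.textlength(sentence, font=font))
    crop = pasted[y : y + font_size + 5, x : x + text_w]
    return crop, sentence  # image plus its known label, one CRNN training sample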
The above steps are the training steps of the text detection model and the text recognition model, which yield the trained text detection model and the trained text recognition model. In practical applications, the training procedures of the two models are not limited to the steps described in the present application; the scheme described here is only an example and does not limit a specific execution flow. The following steps are the application process of the trained text detection model and the trained text recognition model.
In the twelfth step, each piece of video information in the video data obtained in the second step continues to be used and is split into pictures by frame; taking each piece of video information as a unit, the pictures obtained by splitting it are fed into the text detection model to obtain the subtitle position information corresponding to each picture in that video information.
Here, it should be noted that in practical applications the video data used in the model training process and the video data used in this step need not be related; they may be the same or different. For simplicity, this example continues the subsequent processing with the video data obtained in the second step.
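By way of illustration and not limitation, the frame extraction of the twelfth step can be sketched with OpenCV as below; the frame rate and total frame count read here are also what the sixteenth step later needs to convert frame indices into timestamps. The call to the text detection model is shown only as a hypothetical placeholder, since the present application does not fix a particular model interface.

import cv2

def split_into_frames(video_path):
    # Read a video file frame by frame; also return fps and total frame count,
    # which are later used to turn frame indices into timestamps.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames, fps, len(frames)

# Each frame would then be fed to the trained text detection model, e.g.
# boxes = text_detection_model.predict(frame)  # hypothetical interface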
In the thirteenth step, each picture corresponding to each piece of video information obtained in the twelfth step (whose subtitle position has been identified, i.e., which carries subtitle position information) is cropped, and the region containing the subtitle content is cut out, yielding a picture sequence arranged in time order (i.e., the time order within the video information).
In the fourteenth step, the picture sequence arranged in time order obtained in the thirteenth step is fed into the text recognition model (i.e., the model obtained by training the CRNN model) to recognize the subtitle content, obtaining the text sentence corresponding to each picture in the picture sequence.
In the fifteenth step, identical text sentences obtained in the fourteenth step are merged according to the time order of each piece of video information (that is, the positions of the pictures within the whole video are recorded), and the number of occurrences of each identical text sentence is recorded.
In the sixteenth step, based on the merging result of the fifteenth step and combined with the total duration and total frame count of the corresponding video information, the start time point and end time point of each text sentence are obtained.
In the seventeenth step, based on the start and end time points of a text sentence in the video information obtained in the sixteenth step, a cutting operation is performed on the audio information corresponding to that video information extracted in the second step, and the target audio information corresponding to the start and end time points of the text sentence is cut out, thereby obtaining the text sentence and its corresponding target audio information (i.e., the audio clip).
In the eighteenth step, according to the start and end time points of a text sentence in the video information obtained in the sixteenth step, a cutting operation is performed on the video information extracted in the second step, and the target video information corresponding to the start and end time points of the text sentence is cut out, thereby obtaining the text sentence and its corresponding target video information (i.e., the video clip).
In summary, the text sentence (i.e., subtitle content) corresponding to the target video information and the target audio information corresponding to the target video information are obtained, and a video clip is further generated, where the video clip includes video information presenting the subtitle content and audio information matched with the video information and the subtitle content.
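By way of illustration and not limitation, the fifteenth through eighteenth steps can be sketched as below: identical consecutive sentences are merged, the start and end times are derived by scaling frame indices by the total duration, and the audio and video clips are cut with the FFmpeg command line. The output naming scheme and the use of FFmpeg are assumptions of the example.

import subprocess
from itertools import groupby

def merge_and_cut(sentences, total_frames, total_duration, audio_path, video_path, out_prefix):
    # Merge runs of identical sentences, derive start/end times from frame indices,
    # and cut the matching audio and video clips for each non-empty sentence.
    clips, frame_idx = [], 0
    for text, run in groupby(sentences):
        n = len(list(run))
        start = frame_idx / total_frames * total_duration  # sixteenth step
        end = (frame_idx + n) / total_frames * total_duration
        frame_idx += n
        if not text:
            continue
        audio_clip = f"{out_prefix}_{len(clips)}.wav"
        video_clip = f"{out_prefix}_{len(clips)}.mp4"
        # Seventeenth and eighteenth steps: cut the audio clip and the video clip.
        for src, dst in [(audio_path, audio_clip), (video_path, video_clip)]:
            subprocess.run(["ffmpeg", "-y", "-i", src, "-ss", f"{start:.3f}",
                            "-to", f"{end:.3f}", dst], check=True)
        clips.append({"text": text, "start": start, "end": end,
                      "audio": audio_clip, "video": video_clip})
    return clips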
In the nineteenth step, face detection is performed in turn on the video clips obtained in the eighteenth step using the face detection model provided by the DLIB library; video clips in which two or more faces appear are deleted, and videos of the same face are grouped together.
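By way of illustration and not limitation, one possible sketch of the nineteenth step using the dlib library is given below; the model file names are those commonly distributed with dlib, and the 0.6 descriptor-distance threshold is a common heuristic for judging that two faces belong to the same person, both being assumptions rather than values fixed by the present application.

import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
shape_predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed model file
face_encoder = dlib.face_recognition_model_v1("dlib_face_recognition_resnet_model_v1.dat")  # assumed model file

def single_face_descriptor(frame_rgb):
    # Return a 128-dimensional face descriptor if exactly one face is present;
    # a frame with two or more faces marks its clip for deletion (returns None).
    faces = detector(frame_rgb)
    if len(faces) != 1:
        return None
    shape = shape_predictor(frame_rgb, faces[0])
    return np.array(face_encoder.compute_face_descriptor(frame_rgb, shape))

def group_by_identity(descriptors, threshold=0.6):
    # Greedily group clip-level descriptors so that clips of the same face end up together.
    groups = []
    for i, d in enumerate(descriptors):
        for g in groups:
            if np.linalg.norm(d - g["centroid"]) < threshold:
                g["members"].append(i)
                break
        else:
            groups.append({"centroid": d, "members": [i]})
    return groups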
In the twentieth step, the face key point detection model provided by the DLIB library is used to extract 68 facial key points from each frame image of the face videos obtained in the nineteenth step, and the 20 lip key points are then retained.
In the twenty-first step, the audio clips obtained in the seventeenth step and the lip key point sequences obtained in the twentieth step form the training data of a model that converts audio information into lip key point coordinate information.
In the twenty-second step, according to the lip key point coordinate information from the twentieth step, a mask is applied to the corresponding face image and a lip line is drawn, yielding a lip-masked face image; the lip-masked face images and the corresponding complete face images form the training data set of the generative model.
In the twenty-third step, the data sets obtained in the twenty-first step and the twenty-second step together constitute the training data of the audio-driven framework for generating talking-face videos of a specific person.
Fig. 3 shows a block diagram of a video data processing apparatus according to an embodiment of the present invention. As shown in fig. 3, the apparatus may include:
a video information determining unit 301, configured to determine video information, where the video information is selected from preset video information, and the preset video information is obtained by performing video and audio separation on video data;
a caption content determining unit 302, configured to determine caption content shown in a video frame in the video information;
a classification processing unit 303, configured to classify video frames in the video information at least based on subtitle content to obtain a video frame sequence, where subtitle content displayed by each video frame in the video frame sequence is associated with each other;
an audio information determining unit 304, configured to determine time information corresponding to the video frame sequence, to obtain time information of subtitle content corresponding to the video frame sequence, so as to determine, from the audio information of the video data, target audio information that matches the subtitle content corresponding to the video frame sequence.
In a specific example of the scheme of the present application, the method further includes: a video segment generating unit, configured to generate a video segment based on the video frame sequence, the subtitle content corresponding to the video frame sequence, and the determined target audio information corresponding to the video frame sequence; wherein the target audio information in the video segment matches the subtitle content presented by the video segment.
In a specific example of the scheme of the present application, the method further includes: a data transmission unit further configured to:
taking the video frame sequence and the determined target audio information corresponding to the video frame sequence as training data; or, using a video segment generated based on the video frame sequence and the target audio information as training data;
and at least inputting the training data into a preset model so as to train the preset model by utilizing the corresponding relation between the key point characteristics of the face image in the video frame of the training data and the audio characteristics of the target audio information.
In a specific example of the scheme of the present application, the video information determining unit is further configured to:
acquiring video data, wherein subtitle content is displayed in the video data;
separating the video and the audio in the video data to obtain video information and audio information;
and taking the video information obtained by separation as preset video information.
In a specific example of the scheme of the present application, the subtitle content determining unit is further configured to:
detecting the position of the subtitle content in the video frame of the video information;
and performing text recognition on the position of the subtitle content in the video frame to obtain the subtitle content displayed by the video frame in the video information.
In a specific example of the scheme of the present application, the subtitle content determining unit is further configured to:
acquiring a text detection model;
and inputting the video frame of the video information into the text detection model to obtain the position of the subtitle content in the video frame of the video information.
In a specific example of the scheme of the present application, the subtitle content determining unit is further configured to:
acquiring a text recognition model;
and inputting the picture corresponding to the position of the subtitle content in the video frame to the text recognition model to obtain the subtitle content displayed by the video frame.
The functions of each module in each apparatus in the embodiments of the present invention may refer to the corresponding description in the above method, and are not described herein again.
Fig. 4 shows a block diagram of a video data processing apparatus according to an embodiment of the present invention. As shown in fig. 4, the video data processing apparatus includes: a memory 410 and a processor 420, the memory 410 having stored therein a computer program operable on the processor 420. The processor 420, when executing the computer program, implements the video data processing method in the above-described embodiment. The number of the memory 410 and the processor 420 may be one or more.
The video data processing apparatus further includes:
and a communication interface 430, configured to communicate with an external device, and perform data interactive transmission.
If the memory 410, the processor 420 and the communication interface 430 are implemented independently, the memory 410, the processor 420 and the communication interface 430 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (enhanced Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
Optionally, in an implementation, if the memory 410, the processor 420, and the communication interface 430 are integrated on a chip, the memory 410, the processor 420, and the communication interface 430 may complete communication with each other through an internal interface.
Embodiments of the present invention provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and execute the instruction stored in the memory from the memory, so that the communication device in which the chip is installed executes the method provided in the embodiment of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be an advanced reduced instruction set machine (ARM) architecture supported processor.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may include a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available. For example, Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the methods in the above embodiments may be implemented by instructing the relevant hardware through a program, which may be stored in a computer-readable storage medium and which, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. A method of video data processing, the method comprising:
determining video information, wherein the video information is selected from preset video information, and the preset video information is obtained by separating video and audio of video data;
determining subtitle content displayed by a video frame in the video information;
classifying video frames in the video information at least based on subtitle content to obtain a video frame sequence, wherein the subtitle content displayed by each video frame in the video frame sequence is associated;
and determining time information corresponding to the video frame sequence to obtain time information of the subtitle content corresponding to the video frame sequence, so as to determine target audio information matched with the subtitle content corresponding to the video frame sequence from the audio information of the video data.
2. The method of claim 1, further comprising:
generating a video segment based on the video frame sequence, the subtitle content corresponding to the video frame sequence and the determined target audio information corresponding to the video frame sequence; wherein the target audio information in the video segment matches the subtitle content presented by the video segment.
3. The method of claim 1, further comprising:
taking the video frame sequence and the determined target audio information corresponding to the video frame sequence as training data; or, using a video segment generated based on the video frame sequence and the target audio information as training data;
and at least inputting the training data into a preset model so as to train the preset model by utilizing the corresponding relation between the key point characteristics of the face image in the video frame of the training data and the audio characteristics of the target audio information.
4. The method of claim 1, further comprising:
acquiring video data, wherein subtitle content is displayed in the video data;
separating the video and the audio in the video data to obtain video information and audio information;
and taking the video information obtained by separation as preset video information.
5. The method of claim 1, wherein the determining the caption content presented by the video frame in the video information comprises:
detecting the position of the subtitle content in the video frame of the video information;
and performing text recognition on the position of the subtitle content in the video frame to obtain the subtitle content displayed by the video frame in the video information.
6. The method of claim 5, wherein the detecting a position of the subtitle content in the video frame of the video information comprises:
acquiring a text detection model;
and inputting the video frame of the video information into the text detection model to obtain the position of the subtitle content in the video frame of the video information.
7. The method according to claim 5 or 6, wherein the performing text recognition on the position of the subtitle content in the video frame to obtain the subtitle content shown by the video frame in the video information comprises:
acquiring a text recognition model;
and inputting the picture corresponding to the position of the subtitle content in the video frame to the text recognition model to obtain the subtitle content displayed by the video frame.
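For claims 6 and 7 together, an off-the-shelf stack such as PaddleOCR could play the role of both the text detection model and the text recognition model; this is only one possible choice, not something the claims require, and the exact result structure varies between PaddleOCR versions.

```python
# Illustrative use of an off-the-shelf detection + recognition stack (assumed dependency).
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="ch")              # loads a text detection and a text recognition model
result = ocr.ocr("frame_0001.png")      # detected boxes plus recognised subtitle text
# Each entry contains the box coordinates and a (text, confidence) pair;
# the nesting of the returned list depends on the PaddleOCR version.
print(result)
```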
8. A video data processing apparatus, comprising:
the video information determining unit is used for determining video information, wherein the video information is selected from preset video information, and the preset video information is obtained by separating video and audio of video data;
the caption content determining unit is used for determining the caption content displayed by the video frame in the video information;
the classification processing unit is used for classifying the video frames in the video information at least based on the subtitle content to obtain a video frame sequence, wherein the subtitle content displayed by each video frame in the video frame sequence is associated;
and the audio information determining unit is used for determining the time information corresponding to the video frame sequence to obtain the time information of the subtitle content corresponding to the video frame sequence, so as to determine the target audio information matched with the subtitle content corresponding to the video frame sequence from the audio information of the video data.
9. The apparatus of claim 8, further comprising: a video segment generating unit, configured to generate a video segment based on the video frame sequence, the subtitle content corresponding to the video frame sequence, and the determined target audio information corresponding to the video frame sequence; wherein the target audio information in the video segment matches the subtitle content presented by the video segment.
10. The apparatus of claim 8, further comprising a data transmission unit configured to:
taking the video frame sequence and the determined target audio information corresponding to the video frame sequence as training data; or, using a video segment generated based on the video frame sequence and the target audio information as training data;
and at least inputting the training data into a preset model so as to train the preset model by utilizing the corresponding relation between the key point characteristics of the face image in the video frame of the training data and the audio characteristics of the target audio information.
11. The apparatus of claim 8, wherein the video information determination unit is further configured to:
acquiring video data, wherein subtitle content is displayed in the video data;
separating the video and the audio in the video data to obtain video information and audio information;
and taking the video information obtained by separation as preset video information.
12. The apparatus of claim 8, wherein the subtitle content determining unit is further configured to:
detecting the position of the subtitle content in the video frame of the video information;
and performing text recognition on the position of the subtitle content in the video frame to obtain the subtitle content displayed by the video frame in the video information.
13. The apparatus of claim 12, wherein the subtitle content determining unit is further configured to:
acquiring a text detection model;
and inputting the video frame of the video information into the text detection model to obtain the position of the subtitle content in the video frame of the video information.
14. The apparatus according to claim 12 or 13, wherein the subtitle content determining unit is further configured to:
acquiring a text recognition model;
and inputting the picture corresponding to the position of the subtitle content in the video frame to the text recognition model to obtain the subtitle content displayed by the video frame.
15. A video data processing device, comprising a processor and a memory, wherein the memory stores instructions that are loaded and executed by the processor to implement the method of any one of claims 1 to 6.
16. A computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202010943940.XA 2020-09-10 2020-09-10 Video data processing method, device, equipment and storage medium Active CN111813998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010943940.XA CN111813998B (en) 2020-09-10 2020-09-10 Video data processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010943940.XA CN111813998B (en) 2020-09-10 2020-09-10 Video data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111813998A (en) 2020-10-23
CN111813998B (en) 2020-12-11

Family

ID=72860790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010943940.XA Active CN111813998B (en) 2020-09-10 2020-09-10 Video data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111813998B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101159817A (en) * 2007-11-22 2008-04-09 无敌科技(西安)有限公司 Edit system and method of music video
CN101616264A (en) * 2008-06-27 2009-12-30 中国科学院自动化研究所 News video categorization and system
US20110246172A1 (en) * 2010-03-30 2011-10-06 Polycom, Inc. Method and System for Adding Translation in a Videoconference

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381091A (en) * 2020-11-23 2021-02-19 北京达佳互联信息技术有限公司 Video content identification method and device, electronic equipment and storage medium
CN112866776A (en) * 2020-12-29 2021-05-28 北京金堤科技有限公司 Video generation method and device
CN112866776B (en) * 2020-12-29 2022-09-20 北京金堤科技有限公司 Video generation method and device
CN113076932A (en) * 2021-04-28 2021-07-06 百度在线网络技术(北京)有限公司 Method for training audio language recognition model, video detection method and device thereof
CN113254712A (en) * 2021-05-12 2021-08-13 北京百度网讯科技有限公司 Video matching method, video processing device, electronic equipment and medium
CN113254712B (en) * 2021-05-12 2024-04-26 北京百度网讯科技有限公司 Video matching method, video processing device, electronic equipment and medium
CN113490058A (en) * 2021-08-20 2021-10-08 云知声(上海)智能科技有限公司 Intelligent subtitle matching system applied to later stage of movie and television

Also Published As

Publication number Publication date
CN111813998B (en) 2020-12-11

Similar Documents

Publication Publication Date Title
CN111813998B (en) Video data processing method, device, equipment and storage medium
CN111798543B (en) Model training method, data processing method, device, equipment and storage medium
CN110020437B (en) Emotion analysis and visualization method combining video and barrage
CN111582241B (en) Video subtitle recognition method, device, equipment and storage medium
CN109803180B (en) Video preview generation method and device, computer equipment and storage medium
CN103488764B (en) Individualized video content recommendation method and system
WO2019223361A1 (en) Video analysis method and apparatus
CN112287914B (en) PPT video segment extraction method, device, equipment and medium
CN112995749B (en) Video subtitle processing method, device, equipment and storage medium
CN110781328A (en) Video generation method, system, device and storage medium based on voice recognition
CN111860389A (en) Data processing method, electronic device and computer readable medium
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN111798542B (en) Model training method, data processing device, model training apparatus, and storage medium
CN115269884A (en) Method, device and related equipment for generating video corpus
CN113435438A (en) Video screen board extraction and video segmentation method for image and subtitle fusion
CN115661846A (en) Data processing method and device, electronic equipment and storage medium
CN111079777B (en) Page positioning-based click-to-read method and electronic equipment
CN116017088A (en) Video subtitle processing method, device, electronic equipment and storage medium
CN115665508A (en) Video abstract generation method and device, electronic equipment and storage medium
CN114821062A (en) Commodity identification method and device based on image segmentation
CN112001380B (en) Recognition method and system for Chinese meaning phrase based on artificial intelligence reality scene
CN113194333A (en) Video clipping method, device, equipment and computer readable storage medium
CN115438223B (en) Video processing method, device, electronic equipment and storage medium
CN113722513B (en) Multimedia data processing method and equipment
CN111159433A (en) Content positioning method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant