CN112802469A - Method and device for acquiring training data of voice recognition model

Method and device for acquiring training data of voice recognition model

Info

Publication number
CN112802469A
CN112802469A
Authority
CN
China
Prior art keywords
frame image
text
area
frame
caption area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011576869.2A
Other languages
Chinese (zh)
Inventor
张彬彬
杨超
陈晓宇
曾晨晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask Wuhan Information Technology Co ltd
Original Assignee
Go Out And Ask Wuhan Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask Wuhan Information Technology Co ltd filed Critical Go Out And Ask Wuhan Information Technology Co ltd
Priority to CN202011576869.2A
Publication of CN112802469A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/278 Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Studio Circuits (AREA)

Abstract

The invention discloses a method and a device for acquiring training data for a speech recognition model. The method comprises: acquiring the t-th frame image in a video stream; when the t-th frame image contains a first caption area, acquiring the (t+1)-th frame image in the video stream and determining the area of the (t+1)-th frame image with the same position coordinates as the first caption area as a second caption area; when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially acquiring the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area, corresponding to the (t+n)-th frame image, and the n-th caption area, corresponding to the (t+n-1)-th frame image, is less than the preset threshold, and calculating the time period from the t-th frame image to the (t+n-1)-th frame image; extracting the speech within that time period from the video stream to obtain speech data; and performing text recognition on any one of the first to n-th caption areas to obtain the labeled text data corresponding to the speech data.

Description

Method and device for acquiring training data of voice recognition model
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method and an apparatus for obtaining speech recognition model training data.
Background
At present, deep-learning-based speech recognition models are the mainstream approach to speech recognition and have achieved good results across speech recognition tasks and application scenarios. The training data of a deep-learning-based speech recognition model consists of a large amount of speech data and the corresponding labeled text data. When the amount of speech data and corresponding labeled text data is insufficient, the speech recognition model is not robust enough and its recognition rate is not accurate enough. Obtaining a large amount of speech data and corresponding labeled text data is therefore important for improving the recognition rate of speech recognition. However, when the labeled text data corresponding to speech data is acquired, a human annotator must transcribe the text word by word from the audio, so acquiring a large amount of speech data and corresponding labeled text data consumes particularly large amounts of manpower, material, and financial resources.
Therefore, obtaining speech recognition model training data in a low-cost, automated way is of great significance.
Summary of the application
The embodiments of the invention provide a method and a device for acquiring speech recognition model training data, so as to solve the technical problem in the prior art that acquiring a large amount of speech data and the corresponding labeled text data consumes particularly large amounts of manpower, material, and financial resources.
To solve the above problem, in a first aspect, an embodiment of the present invention provides a method for acquiring speech recognition model training data, comprising: acquiring the t-th frame image in a video stream; when the t-th frame image contains a first caption area, acquiring the (t+1)-th frame image in the video stream and determining the area of the (t+1)-th frame image with the same position coordinates as the first caption area as a second caption area; when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially acquiring the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and calculating the time period from the t-th frame image to the (t+n-1)-th frame image; extracting the speech within that time period from the video stream to obtain speech data; and performing text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data.
Optionally, determining that the t-th frame image contains the first caption area includes: recognizing each text area in the t-th frame image, and calculating the position coordinates of each text area; determining the position of each text area in the t-th frame image according to the position coordinates of each text area and the size of the t-th frame image; and when the position of at least one text area in the t-th frame image is a specific position, determining that the t-th frame image contains the first caption area.
Optionally, the similarity between the second subtitle region and the first subtitle region is calculated by a structural similarity measurement method.
Optionally, performing text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data includes: cropping the corresponding caption area from any one of the t-th to (t+n-1)-th frame images to obtain a sub-image; and inputting the sub-image into a text recognition model for text recognition to obtain the labeled text data corresponding to the speech data.
In a second aspect, an embodiment of the present invention provides an apparatus for acquiring speech recognition model training data, comprising: a first obtaining unit, configured to obtain the t-th frame image in a video stream; a second obtaining unit, configured to obtain the (t+1)-th frame image in the video stream when the t-th frame image contains the first caption area, and determine the area of the (t+1)-th frame image with the same position coordinates as the first caption area as a second caption area; a calculating unit, configured to, when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially obtain the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and calculate the time period from the t-th frame image to the (t+n-1)-th frame image; an extracting unit, configured to extract the speech within the time period from the video stream to obtain speech data; and a recognition unit, configured to perform text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data.
Optionally, determining that the t-th frame image contains the first caption area includes: recognizing each text area in the t-th frame image, and calculating the position coordinates of each text area; determining the position of each text area in the t-th frame image according to the position coordinates of each text area and the size of the t-th frame image; and when the position of at least one text area in the t-th frame image is a specific position, determining that the t-th frame image contains the first caption area.
Optionally, the similarity between the second subtitle region and the first subtitle region is calculated by a structural similarity measurement method.
Optionally, the recognition unit includes: a cropping subunit, configured to crop the corresponding caption area from any one of the t-th to (t+n-1)-th frame images to obtain a sub-image; and a recognition subunit, configured to input the sub-image into a text recognition model for text recognition to obtain the labeled text data corresponding to the speech data.
In a third aspect, an embodiment of the present invention provides a computer, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method of obtaining speech recognition model training data as in the first aspect or any of the embodiments of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause a computer to execute a method for acquiring training data of a speech recognition model as in the first aspect or any implementation manner of the first aspect.
The embodiments of the invention provide a method and a device for acquiring speech recognition model training data. The method comprises: acquiring the t-th frame image in a video stream; when the t-th frame image contains a first caption area, acquiring the (t+1)-th frame image in the video stream and determining the area of the (t+1)-th frame image with the same position coordinates as the first caption area as a second caption area; when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially acquiring the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and calculating the time period from the t-th frame image to the (t+n-1)-th frame image; extracting the speech within that time period from the video stream to obtain speech data; and performing text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data. In this way, speech data and the corresponding labeled text data can be obtained automatically from the video stream, training data for the speech recognition model is obtained, and the manpower, material, and financial resources required to obtain a large amount of speech data and corresponding labeled text data are reduced.
The foregoing is only an overview of the technical solutions of the present application. To make the technical means of the present application clearer, so that they can be implemented according to the contents of the specification, and to make the above and other objects, features, and advantages of the present application easier to understand, the detailed description of the present application is given below.
Drawings
FIG. 1 is a schematic flow chart illustrating a method for obtaining training data of a speech recognition model according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an apparatus for obtaining training data of a speech recognition model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a hardware structure of a computer according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, a large number of television programs (television dramas, variety shows, etc.) are available on the network. The inventors found that in many television programs, when someone speaks, corresponding embedded (hard-coded) subtitles are displayed, and these subtitles are accurate and aligned with the boundaries of the speech. The inventors therefore realized that if the temporal boundaries of a subtitle can be obtained automatically, the sound within those boundaries extracted, and the characters of the subtitle recognized by an algorithm, then speech data and the corresponding labeled text data are obtained, which can be used as training data for a speech recognition model to improve its performance.
To this end, an embodiment of the present invention provides a method for obtaining training data of a speech recognition model, as shown in fig. 1, including:
s101, acquiring a t frame image in a video stream; specifically, the tth frame image can be read from the video stream, t is greater than or equal to 1, and t is a natural number. The tth frame image comprises at least one text area, and the tth frame image is a starting frame image of a section of subtitles in the video stream. When speech recognition model training data is first obtained from a segment of video stream, the 1 st frame of image may be obtained from the video stream. The text area is used to display characters in the image.
S102, when the t-th frame image contains a first caption area, acquiring the (t+1)-th frame image in the video stream, and determining the area of the (t+1)-th frame image with the same position coordinates as the first caption area as a second caption area;
Specifically, the first caption area displays the characters corresponding to a portion of the speech in the video stream. The caption area in a video stream is generally placed in the lower part of the image, either centered or left-aligned. The first caption area may therefore be predefined as a text area that lies in the lower part of the t-th frame image and is centered or left-aligned. Accordingly, by performing text detection on the t-th frame image, when the t-th frame image contains at least one text area and one of these text areas lies in the lower part of the image and is centered or left-aligned, that text area may be determined to be the first caption area, and the t-th frame image is determined to contain the first caption area. When the t-th frame image contains the first caption area, the (t+1)-th frame image in the video stream may be read, and the area of the (t+1)-th frame image with the same position coordinates as the first caption area may be determined as the second caption area.
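A minimal sketch of S102 follows, assuming a text detector is available: detect_text_boxes is a hypothetical helper standing in for any text detection model, and the bottom/center thresholds are illustrative values, not values given in the description.

def is_caption_box(box, frame_shape, bottom_ratio=0.7, center_tol=0.1):
    # box = (x, y, w, h); a caption is expected near the bottom of the frame,
    # roughly centered or left-aligned. Thresholds are illustrative only.
    img_h, img_w = frame_shape[:2]
    x, y, w, h = box
    near_bottom = y > bottom_ratio * img_h
    centered = abs((x + w / 2) - img_w / 2) < center_tol * img_w
    left_aligned = x < 0.05 * img_w
    return near_bottom and (centered or left_aligned)

def find_caption_region(frame, detect_text_boxes):
    # Return the first text box that sits at the preset caption position, if any.
    for box in detect_text_boxes(frame):
        if is_caption_box(box, frame.shape):
            return box
    return None

def crop(frame, box):
    # Crop the area with the given position coordinates from a frame
    # (used to take the second caption area from the (t+1)-th frame).
    x, y, w, h = box
    return frame[y:y + h, x:x + w]
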
When the t-th frame image does not contain the first caption area, the (t+1)-th, (t+2)-th, (t+3)-th, ... frame images in the video stream may be acquired one by one and checked in turn for the first caption area, until some (t+n)-th frame image is found to contain it. The method then proceeds with acquiring the (t+n+1)-th frame image in the video stream and determining the area of the (t+n+1)-th frame image with the same position coordinates as the first caption area as the second caption area.
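The fallback described above can be sketched as a simple forward scan; find_caption_region is the assumed helper from the previous sketch, and frames is assumed to be an indexable sequence of decoded frames.

def find_start_frame(frames, t, detect_text_boxes):
    # Scan forward from frame t (1-based) until a frame containing a caption
    # region is found; return that frame index and the caption box.
    for i in range(t, len(frames) + 1):
        box = find_caption_region(frames[i - 1], detect_text_boxes)
        if box is not None:
            return i, box
    return None, None
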
S103, when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially acquiring the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and calculating the time period from the t-th frame image to the (t+n-1)-th frame image;
Specifically, image similarity may be computed between the first caption area and the second caption area. If the similarity between the second caption area and the first caption area is greater than or equal to the preset threshold, the caption has not changed between the two frames, and the (t+2)-th, (t+3)-th, ..., (t+n)-th frame images in the video stream are acquired one after another with a step size of 1. When the similarity between the (n+1)-th caption area, corresponding to the (t+n)-th frame image, and the n-th caption area, corresponding to the (t+n-1)-th frame image, falls below the preset threshold, the caption is considered to have changed, and the time period of this caption segment is the time between the t-th frame and the (t+n-1)-th frame.
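Putting S103 together, a hedged sketch is shown below; it reuses the crop helper above, assumes the caption_similarity function sketched under the structural-similarity embodiment further down, converts frame indices to seconds with the stream's frame rate, and uses an illustrative threshold of 0.9.

def find_caption_span(frames, fps, t, box, threshold=0.9):
    # Walk frames with step 1 starting at the t-th frame (1-based) until the
    # caption region changes, then return the covered time period in seconds.
    prev_region = crop(frames[t - 1], box)
    last = t
    for i in range(t + 1, len(frames) + 1):
        cur_region = crop(frames[i - 1], box)
        if caption_similarity(prev_region, cur_region) < threshold:
            break  # caption changed between frame i-1 and frame i
        prev_region, last = cur_region, i
    start_sec = (t - 1) / fps
    end_sec = last / fps
    return start_sec, end_sec, last  # last corresponds to t + n - 1 in the description
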
S104, extracting the speech within the time period from the video stream to obtain speech data. Specifically, the audio of the video stream is extracted over the time period determined in S103, which yields the speech data.
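One possible way to realize S104 (the description does not prescribe a tool) is to cut the audio for the computed time period with ffmpeg; the 16 kHz mono WAV settings are a common speech recognition convention, not a requirement of the patent.

import subprocess

def extract_speech(video_path, start_sec, end_sec, out_wav):
    # Extract the audio between start_sec and end_sec as 16 kHz mono WAV.
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        "-ss", str(start_sec), "-to", str(end_sec),
        "-vn", "-ac", "1", "-ar", "16000",
        out_wav,
    ], check=True)
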
S105, performing text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data. Specifically, because the caption shown in the first to n-th caption areas of the t-th to (t+n-1)-th frame images does not change, text recognition can be performed on any one of these caption areas to obtain the labeled text data corresponding to the speech data. The speech data together with its labeled text data then constitute one training sample for the speech recognition model.
The embodiment of the invention provides a method for acquiring speech recognition model training data. The method acquires the t-th frame image in a video stream; when the t-th frame image contains a first caption area, acquires the (t+1)-th frame image in the video stream and determines the area of the (t+1)-th frame image with the same position coordinates as the first caption area as a second caption area; when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially acquires the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and calculates the time period from the t-th frame image to the (t+n-1)-th frame image; extracts the speech within that time period from the video stream to obtain speech data; and performs text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data. In this way, speech data and the corresponding labeled text data can be obtained automatically from the video stream, training data for the speech recognition model is obtained, and the manpower, material, and financial resources required to obtain a large amount of speech data and corresponding labeled text data are reduced.
In an alternative embodiment, determining that the t-th frame image contains the first caption area includes: recognizing each text area in the t-th frame image, and calculating the position coordinates of each text area; determining the position of each text area in the t-th frame image according to the position coordinates of each text area and the size of the t-th frame image; and when the position of at least one text area in the t-th frame image is a specific position, determining that the t-th frame image contains the first caption area.
Specifically, the specific position is one or more positions preset from experience, for example, the lower part of the image and centered, or the lower part of the image and left-aligned. When judging whether the t-th frame image contains the first caption area, each text area can be identified in the t-th frame image by a text detection algorithm and its position coordinates calculated. The position of each text area in the t-th frame image can then be determined from its position coordinates and the size of the t-th frame image; for example, one text area may lie in the upper-left part of the t-th frame image while another lies in the lower part of the image and is centered. When a text area located at a specific position exists, it can be determined that the t-th frame image contains the first caption area.
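For illustration, the coarse position of a text box relative to the frame can be labeled and compared against the preset caption positions; the labels, thresholds, and preset set below are assumptions of this sketch, not values from the description.

def coarse_position(box, frame_shape):
    # Map a text box (x, y, w, h) to a coarse position label such as
    # "lower-center" or "upper-left", based on the box centre and frame size.
    img_h, img_w = frame_shape[:2]
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    vertical = "lower" if cy > 2 * img_h / 3 else ("upper" if cy < img_h / 3 else "middle")
    horizontal = "left" if cx < img_w / 3 else ("right" if cx > 2 * img_w / 3 else "center")
    return f"{vertical}-{horizontal}"

# Preset "specific positions" for captions (an empirical choice, per the text).
CAPTION_POSITIONS = {"lower-center", "lower-left"}

def contains_caption(text_boxes, frame_shape):
    return any(coarse_position(b, frame_shape) in CAPTION_POSITIONS for b in text_boxes)
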
By identifying the text areas in the t-th frame image, calculating their position coordinates, and determining the position of each text area from its coordinates and the size of the t-th frame image, it can be quickly judged whether any of the text areas is the first caption area, and hence whether the t-th frame image contains the first caption area.
In an alternative embodiment, the similarity between the second subtitle region and the first subtitle region is calculated by a structural similarity measurement method.
Specifically, calculating the similarity between the second subtitle region and the first subtitle region with the structural similarity (SSIM) measure allows the similarity to be computed quickly and accurately.
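A hedged sketch of the structural-similarity comparison, assuming scikit-image is available; the two caption regions are converted to grayscale before the SSIM score is computed.

import cv2
from skimage.metrics import structural_similarity

def caption_similarity(region_a, region_b):
    # SSIM between two caption regions of identical size (the same position
    # coordinates guarantee this); the score is close to 1 when the caption
    # has not changed.
    gray_a = cv2.cvtColor(region_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(region_b, cv2.COLOR_BGR2GRAY)
    return structural_similarity(gray_a, gray_b)
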
In an optional embodiment, performing text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data includes: cropping the corresponding caption area from any one of the t-th to (t+n-1)-th frame images to obtain a sub-image; and inputting the sub-image into a text recognition model for text recognition to obtain the labeled text data corresponding to the speech data.
Specifically, since text recognition is performed on any one of the first to n-th caption areas, that caption area is cropped from its corresponding frame image, giving a sub-image that contains only the caption; the sub-image is then input into a text recognition model for text recognition, so that accurate labeled text data corresponding to the speech data can be obtained.
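The cropping-and-recognition step could, for example, be realized with an off-the-shelf OCR engine; pytesseract is used here purely as a stand-in for the text recognition model mentioned in the description, and the language setting is an assumption.

import pytesseract

def recognize_caption(frame, box, lang="chi_sim"):
    # Crop the caption area from the frame and run text recognition on the
    # resulting sub-image to obtain the labeled text for the speech data.
    sub_image = crop(frame, box)  # crop helper from the earlier sketch
    text = pytesseract.image_to_string(sub_image, lang=lang)
    return text.strip()
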
An embodiment of the present invention further provides a device for obtaining training data of a speech recognition model, as shown in fig. 2, including:
a first acquiring unit 201, configured to acquire the t-th frame image in a video stream; the detailed description of the specific implementation is given in step S101 of the above method embodiment and is not repeated here.
A second obtaining unit 202, configured to obtain the (t+1)-th frame image in the video stream when the t-th frame image contains the first caption area, and determine the area of the (t+1)-th frame image with the same position coordinates as the first caption area as a second caption area; the detailed description of the specific implementation is given in step S102 of the above method embodiment and is not repeated here.
A calculating unit 203, configured to, when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially obtain the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and calculate the time period from the t-th frame image to the (t+n-1)-th frame image; the detailed description of the specific implementation is given in step S103 of the above method embodiment and is not repeated here.
An extracting unit 204, configured to extract the speech within the time period from the video stream to obtain speech data; the detailed description of the specific implementation is given in step S104 of the above method embodiment and is not repeated here.
A recognition unit 205, configured to perform text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data. The detailed description of the specific implementation is given in step S105 of the above method embodiment and is not repeated here.
According to the apparatus for acquiring speech recognition model training data provided by the embodiment of the invention, the t-th frame image in a video stream is acquired; when the t-th frame image contains a first caption area, the (t+1)-th frame image in the video stream is acquired, and the area of the (t+1)-th frame image with the same position coordinates as the first caption area is determined as a second caption area; when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, the (t+2)-th, (t+3)-th, ... frame images in the video stream are acquired in sequence with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and the time period from the t-th frame image to the (t+n-1)-th frame image is calculated; the speech within that time period is extracted from the video stream to obtain speech data; and text recognition is performed on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data. In this way, speech data and the corresponding labeled text data can be obtained automatically from the video stream, training data for the speech recognition model is obtained, and the manpower, material, and financial resources required to obtain a large amount of speech data and corresponding labeled text data are reduced.
In an alternative embodiment, in the second obtaining unit 202, determining that the t-th frame image contains the first caption area includes: recognizing each text area in the t-th frame image, and calculating the position coordinates of each text area; determining the position of each text area in the t-th frame image according to the position coordinates of each text area and the size of the t-th frame image; and when the position of at least one text area in the t-th frame image is a specific position, determining that the t-th frame image contains the first caption area.
Specifically, the specific position is one or more positions preset from experience, for example, the lower part of the image and centered, or the lower part of the image and left-aligned. When judging whether the t-th frame image contains the first caption area, each text area can be identified in the t-th frame image by a text detection algorithm and its position coordinates calculated. The position of each text area in the t-th frame image can then be determined from its position coordinates and the size of the t-th frame image; for example, one text area may lie in the upper-left part of the t-th frame image while another lies in the lower part of the image and is centered. When a text area located at a specific position exists, it can be determined that the t-th frame image contains the first caption area.
By identifying the text areas in the t-th frame image, calculating their position coordinates, and determining the position of each text area from its coordinates and the size of the t-th frame image, it can be quickly judged whether any of the text areas is the first caption area, and hence whether the t-th frame image contains the first caption area.
In an alternative embodiment, in the calculating unit 203, the similarity between the second subtitle region and the first subtitle region may be calculated by a structural similarity measurement method.
Specifically, calculating the similarity between the second subtitle region and the first subtitle region with the structural similarity (SSIM) measure allows the similarity to be computed quickly and accurately.
In an alternative embodiment, the recognition unit 205 includes: a cropping subunit, configured to crop the corresponding caption area from any one of the t-th to (t+n-1)-th frame images to obtain a sub-image; and a recognition subunit, configured to input the sub-image into a text recognition model for text recognition to obtain the labeled text data corresponding to the speech data.
Specifically, since text recognition is performed on any one of the first to n-th caption areas, that caption area is cropped from its corresponding frame image, giving a sub-image that contains only the caption; the sub-image is then input into a text recognition model for text recognition, so that accurate labeled text data corresponding to the speech data can be obtained.
Based on the same inventive concept as the method for acquiring the training data of the speech recognition model in the foregoing embodiment, the present invention further provides a computer, as shown in fig. 3, including: a processor 31 and a memory 32, wherein the processor 31 and the memory 32 may be connected by a bus or other means, and the connection by the bus is illustrated in fig. 3 as an example.
The processor 31 may be a Central Processing Unit (CPU). The processor 31 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.
The memory 32 is a non-transitory computer readable storage medium, and can be used for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for obtaining training data of a speech recognition model in the embodiment of the present invention. The processor 31 executes various functional applications and data processing of the processor by executing the non-transitory software programs, instructions and modules stored in the memory 32, namely, implements the method for acquiring the training data of the speech recognition model in the above method embodiment.
The memory 32 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor 31, and the like. Further, the memory 32 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 32 may optionally include memory located remotely from the processor 31, and these remote memories may be connected to the processor 31 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more of the modules described above are stored in the memory 32 and, when executed by the processor 31, perform the method of obtaining speech recognition model training data as in the embodiment shown in FIG. 1.
The details of the computer can be understood with reference to the corresponding related descriptions and effects in the embodiment shown in fig. 1, and are not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable information processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable information processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable information processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable information processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method of obtaining speech recognition model training data, comprising:
acquiring a t-th frame image in a video stream;
when the t-th frame image comprises a first caption area, acquiring a (t+1)-th frame image in the video stream, and determining an area with the same position coordinates as the first caption area from the (t+1)-th frame image as a second caption area;
when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially acquiring the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and calculating a time period from the t-th frame image to the (t+n-1)-th frame image;
extracting speech within the time period from the video stream to obtain speech data;
and performing text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain labeled text data corresponding to the speech data.
2. The method of claim 1, wherein determining that the t-th frame image comprises the first caption area comprises:
recognizing each text area in the t-th frame image, and calculating the position coordinates of each text area;
determining the position of each text area in the t-th frame image according to the position coordinates of each text area and the size of the t-th frame image;
and when the position of at least one text area in the t-th frame image is a specific position, determining that the t-th frame image comprises the first caption area.
3. The method of claim 1, wherein the similarity between the second caption region and the first caption region is calculated by a structural similarity measurement method.
4. The method according to claim 1, wherein performing text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain the labeled text data corresponding to the speech data comprises:
cropping the corresponding caption area from any one of the t-th to (t+n-1)-th frame images to obtain a sub-image;
and inputting the sub-image into a text recognition model for text recognition to obtain the labeled text data corresponding to the speech data.
5. An apparatus for obtaining training data for a speech recognition model, comprising:
a first obtaining unit, configured to obtain a t-th frame image in a video stream;
a second obtaining unit, configured to obtain a (t+1)-th frame image in the video stream when the t-th frame image comprises a first caption area, and determine an area with the same position coordinates as the first caption area from the (t+1)-th frame image as a second caption area;
a calculating unit, configured to, when the similarity between the second caption area and the first caption area is greater than or equal to a preset threshold, sequentially acquire the (t+2)-th, (t+3)-th, ... frame images in the video stream with a step size of 1 until the similarity between the (n+1)-th caption area corresponding to the (t+n)-th frame image and the n-th caption area corresponding to the (t+n-1)-th frame image is less than the preset threshold, and calculate a time period from the t-th frame image to the (t+n-1)-th frame image;
an extracting unit, configured to extract speech within the time period from the video stream to obtain speech data;
and a recognition unit, configured to perform text recognition on any one of the first to n-th caption areas corresponding to the t-th to (t+n-1)-th frame images to obtain labeled text data corresponding to the speech data.
6. The apparatus for obtaining training data of a speech recognition model according to claim 5, wherein determining that the t-th frame image comprises the first caption area comprises:
recognizing each text area in the t-th frame image, and calculating the position coordinates of each text area;
determining the position of each text area in the t-th frame image according to the position coordinates of each text area and the size of the t-th frame image;
and when the position of at least one text area in the t-th frame image is a specific position, determining that the t-th frame image comprises the first caption area.
7. The apparatus of claim 5, wherein the similarity between the second caption region and the first caption region is calculated by a structural similarity measurement method.
8. The apparatus for obtaining training data of a speech recognition model according to claim 5, wherein the recognition unit comprises:
a cropping subunit, configured to crop the corresponding caption area from any one of the t-th to (t+n-1)-th frame images to obtain a sub-image;
and a recognition subunit, configured to input the sub-image into a text recognition model for text recognition to obtain labeled text data corresponding to the speech data.
9. A computer, comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method of obtaining speech recognition model training data of any of claims 1-4.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the method of obtaining speech recognition model training data according to any one of claims 1 to 4.
CN202011576869.2A 2020-12-28 2020-12-28 Method and device for acquiring training data of voice recognition model Pending CN112802469A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011576869.2A CN112802469A (en) 2020-12-28 2020-12-28 Method and device for acquiring training data of voice recognition model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011576869.2A CN112802469A (en) 2020-12-28 2020-12-28 Method and device for acquiring training data of voice recognition model

Publications (1)

Publication Number Publication Date
CN112802469A 2021-05-14

Family

ID=75806004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011576869.2A Pending CN112802469A (en) 2020-12-28 2020-12-28 Method and device for acquiring training data of voice recognition model

Country Status (1)

Country Link
CN (1) CN112802469A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510260A (en) * 2008-02-14 2009-08-19 富士通株式会社 Caption staying time determining apparatus and method
CN111445902A (en) * 2020-03-27 2020-07-24 北京字节跳动网络技术有限公司 Data collection method and device, storage medium and electronic equipment

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113450774A (en) * 2021-06-23 2021-09-28 网易(杭州)网络有限公司 Training data acquisition method and device
CN113450774B (en) * 2021-06-23 2024-05-31 网易(杭州)网络有限公司 Training data acquisition method and device
CN114554285A (en) * 2022-02-25 2022-05-27 京东方科技集团股份有限公司 Video frame insertion processing method, video frame insertion processing device and readable storage medium
CN114554285B (en) * 2022-02-25 2024-08-02 京东方科技集团股份有限公司 Video interpolation processing method, video interpolation processing device and readable storage medium

Similar Documents

Publication Publication Date Title
CN110363252B (en) End-to-end trend scene character detection and identification method and system
CN112738640B (en) Method and device for determining subtitles of video stream and readable storage medium
CN109583443B (en) Video content judgment method based on character recognition
CN113496208B (en) Video scene classification method and device, storage medium and terminal
CN111046971A (en) Image recognition method, device, equipment and computer readable storage medium
CN112200218B (en) Model training method and device and electronic equipment
CN111541939B (en) Video splitting method and device, electronic equipment and storage medium
CN113705300A (en) Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium
CN113205047A (en) Drug name identification method and device, computer equipment and storage medium
CN114005019B (en) Method for identifying flip image and related equipment thereof
CN110728193B (en) Method and device for detecting richness characteristics of face image
CN112802469A (en) Method and device for acquiring training data of voice recognition model
CN107909054B (en) Similarity evaluation method and device for picture texts
CN111798542B (en) Model training method, data processing device, model training apparatus, and storage medium
CN111292333A (en) Method and apparatus for segmenting an image
CN112434582A (en) Lane line color identification method and system, electronic device and storage medium
CN104754248B (en) A kind of method and device for obtaining target snapshot
CN111814508A (en) Character recognition method, system and equipment
CN112818984B (en) Title generation method, device, electronic equipment and storage medium
CN114170604A (en) Character recognition method and system based on Internet of things
CN114821062A (en) Commodity identification method and device based on image segmentation
US20170171644A1 (en) Method and electronic device for creating video image hyperlink
CN112036252A (en) Method and device for constructing action labeling model and video action labeling
CN108875770B (en) Pedestrian detection false alarm data labeling method, device, system and storage medium
CN108959287B (en) Webpage content processing method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210514