CN115396690A - Audio and text combination method and device, electronic equipment and storage medium - Google Patents

Audio and text combination method and device, electronic equipment and storage medium

Info

Publication number
CN115396690A
CN115396690A (application CN202211049871.3A)
Authority
CN
China
Prior art keywords
text
subtitle
audio
target
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211049871.3A
Other languages
Chinese (zh)
Inventor
王炳乾
宿绍勋
孙晴晴
黄光伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202211049871.3A priority Critical patent/CN115396690A/en
Publication of CN115396690A publication Critical patent/CN115396690A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Studio Circuits (AREA)

Abstract

Embodiments of the present disclosure provide an audio and text combination method and apparatus, an electronic device, and a storage medium. The audio and text combination method includes the following steps: performing text recognition on key frames in a video file to obtain a first subtitle set; performing subtitle switching detection on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle is switched; intercepting audio from an audio file of the video file based on the timestamps to obtain target audio; and constructing a combined pair of text and audio based on the target subtitle and the target audio. According to the embodiments of the present disclosure, the degree of matching between the audio and the text can be improved.

Description

Audio and text combination method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to an audio and text combining method and apparatus, an electronic device, and a storage medium.
Background
Speech recognition is one of the core artificial intelligence technologies and is now widely applied in fields such as voice input, voice interaction, voice search, voice control, and voice subtitling. In particular, under the influence of the recent epidemic, contactless voice interaction and control have further highlighted the value of speech recognition technology. Because speech recognition is based on deep learning algorithms, improving the accuracy of those algorithms relies on massive amounts of high-quality training data, so that the recognition quality can reach a satisfactory level.
Existing methods for generating training data for a speech recognition algorithm generally include the following steps: acquiring video resources with subtitles, such as audio books, scene commentary, documentaries, dramas, interviews, news, variety shows, self-media videos, and other video resources from many fields; extracting the subtitles and their timestamps from the video using OCR (optical character recognition); and matching an audio segment corresponding to each subtitle from the video by means of the timestamps, so that the subtitle and the audio segment form a text-audio training sample pair.
However, existing subtitle extraction schemes use OCR character recognition to extract the subtitles in the video. This approach relies only on one-way information from the video frames, so the accuracy of subtitle extraction depends entirely on the accuracy of the OCR technology. When the video has a complex background or the subtitle position is not fixed, OCR performs poorly; as a result, when the <text, audio> pairs are matched, the time information of the subtitle and the audio is easily misaligned and the subtitle content does not align with the audio content, which in turn degrades the recognition accuracy of the speech recognition algorithm.
Disclosure of Invention
The embodiment of the disclosure provides an audio and text combination method, an audio and text combination device, an electronic device and a storage medium, so as to solve or alleviate one or more technical problems in the prior art.
As a first aspect of the embodiments of the present disclosure, the embodiments of the present disclosure provide an audio and text combining method, including:
performing text recognition on key frames in a video file to obtain a first subtitle set;
performing caption switching detection on the first caption set to obtain a target caption and a timestamp for caption switching of the target caption;
intercepting audio from an audio file of the video file based on the timestamp to obtain target audio;
and constructing a combined pair of text and audio based on the target caption and the target audio.
As a second aspect of the embodiments of the present disclosure, an embodiment of the present disclosure provides an audio and text combining apparatus, including:
the text recognition module is used for performing text recognition on key frames in the video file to obtain a first subtitle set;
the subtitle switching detection module is used for carrying out subtitle switching detection on the first subtitle set to obtain a target subtitle and a timestamp of subtitle switching of the target subtitle;
the target audio intercepting module is used for intercepting audio from the audio file of the video file based on the timestamp to obtain target audio;
and the audio text combination module is used for constructing a text and audio combination pair based on the target caption and the target audio.
As a third aspect of the embodiments of the present disclosure, an embodiment of the present disclosure provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the audio and text combining method provided by the embodiments of the present disclosure.
As a fourth aspect of the embodiments of the present disclosure, the embodiments of the present disclosure provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the audio and text combining method provided by the embodiments of the present disclosure.
As a fifth aspect of the embodiments of the present disclosure, the embodiments of the present disclosure provide a computer program product comprising a computer program which, when executed by a processor, implements the audio and text combining method provided by the embodiments of the present disclosure.
According to the technical solution provided by the embodiments of the present disclosure, text recognition is performed on key frames in a video file to obtain a first subtitle set, and subtitle switching detection is performed on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle is switched. Audio is then intercepted from the audio file of the video file based on these timestamps, so that accurate target audio can be obtained. As a result, the text and audio combined pair constructed from the target subtitle and the target audio is well matched, avoiding misaligned subtitle/audio time information and misaligned subtitle and audio content.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present disclosure will be readily apparent by reference to the drawings and the following detailed description.
Drawings
In the drawings, like reference characters designate like or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are not to be considered limiting of its scope.
FIG. 1 is a flow chart of an audio and text combining method of an embodiment of the present disclosure;
fig. 2 is a flowchart of a subtitle processing method according to an embodiment of the present disclosure;
FIG. 3A is a schematic diagram of a key frame according to an embodiment of the present disclosure;
FIG. 3B is a schematic diagram of a key frame of another embodiment of the present disclosure;
FIG. 4 is a schematic illustration of a key frame of another embodiment of the present disclosure;
FIG. 5 is a flow diagram of a process for constructing training samples for a speech recognition model according to an embodiment of the present disclosure;
fig. 6 is a flowchart of a subtitle file generation process according to an embodiment of the present disclosure;
FIG. 7 is a flowchart of subtitle verification, handover detection, and error correction according to an embodiment of the present disclosure;
FIG. 8 is a schematic illustration of pictograph dissection of an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a subtitle file according to an embodiment of the present disclosure;
FIG. 10 is a schematic diagram of an audio and text combining apparatus according to an embodiment of the present disclosure;
FIG. 11 is a schematic view of an electronic device of an embodiment of the present disclosure;
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art can appreciate, the described embodiments can be modified in various different ways, without departing from the spirit or scope of the present disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Fig. 1 is a flowchart of an audio and text combining method according to an embodiment of the present disclosure. As shown in fig. 1, the audio and text combining method may include the following steps:
s110, performing text recognition on key frames in a video file to obtain a first subtitle set;
s120, performing subtitle switching detection on the first subtitle set to obtain a target subtitle and a timestamp for subtitle switching of the target subtitle;
s130, intercepting audio from an audio file of the video file based on the time stamp of subtitle switching of the target subtitle to obtain target audio;
and S140, constructing a text and audio combined pair based on the target subtitles and the target audio.
In this example, text recognition is performed on key frames in a video file to obtain a first subtitle set, and subtitle switching detection is performed on the first subtitle set to obtain a target subtitle and the timestamps at which the target subtitle is switched. Audio is then intercepted from the audio file of the video file based on these timestamps, so that accurate target audio can be obtained. As a result, the text and audio combined pair constructed from the target subtitle and the target audio is well matched, avoiding misaligned subtitle/audio time information and misaligned subtitle and audio content.
Illustratively, a video file includes a plurality of image frames and corresponding audio played in a time sequence. The video files may include audio books, commentary, documentaries, drama, interviews, news, art, self media, and other video sources. The video file may be a short video or a longer video.
For example, in step S110, text recognition may be performed on all key frames in the video file, or text recognition may be performed on part of the key frames. The key frame may be any image frame in a video file, or may be an image frame with text.
Illustratively, text recognition may employ OCR recognition techniques.
Illustratively, the first subtitle set includes a plurality of subtitles, each of which may be from a different key frame. The subtitles may have the same content or different contents.
Illustratively, a given subtitle may appear in several consecutive key frames; when it differs from the subtitle of the previous frame or the next frame, a subtitle switch is considered to have occurred. Subtitle switching detection may therefore use the degree of similarity between the subtitles of two adjacent frames to detect whether a switch occurs: if the subtitle similarity between two adjacent frames is greater than the set subtitle similarity threshold, it is determined that no subtitle switch occurs; if it is smaller than the set subtitle similarity threshold, it is determined that a subtitle switch occurs. When a switch occurs, the switching timestamp may be determined based on the timestamps of the two adjacent frames, for example by taking the timestamp of the first frame or of the second frame of the pair as the timestamp of the subtitle switch.
Illustratively, the timestamp of the subtitle occurrence subtitle switching includes a start timestamp and an end timestamp. The subtitles in each key frame between the starting timestamp and the ending timestamp have the same subtitle content, namely the similarity degree of the subtitles of two adjacent frames is greater than a set subtitle similarity threshold value.
Illustratively, the constructed text and audio combined pairs may be applied to training of a speech recognition model.
In some embodiments, in step S110, the process of performing text recognition on the key frames in the video file to obtain the first subtitle set includes operations of checking whether the text is a subtitle, filtering non-subtitle text, and the like.
Illustratively, fig. 2 is a flowchart of a subtitle processing method according to an embodiment of the present disclosure. As shown in fig. 2, the subtitle processing method may include the steps of:
s210, performing text recognition on key frames in the video file to obtain candidate subtitle texts;
s220, determining a check audio corresponding to the candidate subtitle text in the video file;
s230, under the condition that the relation between the candidate subtitle text and the verification audio accords with the set condition, determining the candidate subtitle text as the subtitle text of the video file;
s240, acquiring a subtitle text of the video file to obtain a first subtitle set.
In this embodiment, the candidate subtitle text obtained by the text recognition mode is verified by using the verification text of the voice recognition, and whether the candidate subtitle text is a real subtitle is determined, so that the precision of subtitle recognition is improved, and the situations that the text and the audio time information are misaligned and the text content and the audio content are not aligned can be avoided when a combination consisting of the text and the audio is constructed in the following.
For example, in step S210, text recognition may be performed on all key frames in the video file, or text recognition may be performed on part of the key frames. The key frame may be any image frame in a video file, or may be an image frame with text.
Exemplarily, the step S210 of text recognition of the key frame may be to adopt an OCR text recognition technology. During text recognition of the key frame, one or more texts in the key frame, such as text blocks and text lines, can be obtained. In this case, some texts are not necessarily subtitles, and these texts can be used as candidate subtitles. Further, an audio check is performed for the candidate subtitle to accurately determine whether the candidate subtitle is a real subtitle.
Exemplarily, text recognition is performed on a key frame in a video file to obtain at least one text, and candidate subtitle texts are determined in a plurality of texts in the key frame according to a subtitle region of the video file and position information of each text in the key frame to further determine whether the candidate subtitle texts are subtitles. For example, if the position information of the text matches the subtitle region, the text is considered to be a subtitle, and if the position information of the text does not match the subtitle region, the text is not necessarily a subtitle, and it is necessary to further verify whether the text is a subtitle. The position information may include a position area of the text in the key frame, for example, a rectangular box is used to frame the text, and coordinates of four corners of the rectangular box are the position areas of the text in the key frame.
Illustratively, the candidate subtitle text is text for which it is not determined whether it is a subtitle, for example, possibly a subtitle, and possibly background text or watermark text in a key frame.
Illustratively, fig. 3A and 3B show a scene in which a teacher (Actor1) is giving a lecture. The text in the text area pointed to by the teacher Actor1 is background text, while the two text blocks below that area, lines from a poem by the Tang dynasty poet Li Bai that uses hyperbole to compare the thousand-foot depth of a pool with his friendship for Wang Lun, are subtitles. If OCR alone is used, the background text and the subtitles are recognized at the same time, and it cannot be determined which text is the subtitle and which is the background text. With the method provided by the embodiment of the present disclosure, the subtitles can be accurately identified and the background text filtered out.
For example, in step S220, the corresponding audio segment may be intercepted from the audio of the video file based on the timestamps corresponding to the candidate subtitle text, resulting in the check audio. The timestamps may be the text-switching timestamps of the candidate subtitle text in the video file, that is, the first time and the last time the candidate subtitle text appears in time-sequentially consecutive image frames. In this way, the audio corresponding to the candidate subtitle text can be intercepted accurately, which improves the accuracy of the subsequent subtitle check.
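As a minimal sketch of this clipping step (assuming the audio track has already been extracted to a WAV file, that the timestamps are expressed in seconds, and that the pydub library is available; the function name and paths are illustrative only, not part of the disclosure):

from pydub import AudioSegment  # assumed third-party dependency

def clip_check_audio(audio_path, start_sec, end_sec, out_path):
    # Cut [start_sec, end_sec] out of the video's extracted audio track.
    audio = AudioSegment.from_file(audio_path)                # load full track
    clip = audio[int(start_sec * 1000):int(end_sec * 1000)]   # pydub slices in milliseconds
    clip.export(out_path, format="wav")                       # save the check audio
    return out_path

# Example: a candidate subtitle first appears at 12.4 s and last appears at 15.1 s
# clip_check_audio("video_audio.wav", 12.4, 15.1, "check_0001.wav")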
In some embodiments, the degree of overlap or confidence between the candidate subtitle text and its corresponding check text, etc. may be used to determine whether it is a subtitle.
Exemplarily, the step S230 may include: carrying out voice recognition on the verification audio to obtain a verification text; and under the condition that the coincidence degree between the candidate subtitle text and the check text is greater than a set coincidence degree threshold value, determining that the candidate subtitle text is the subtitle text of the video file.
The Speech Recognition may employ ASR (Automatic Speech Recognition) technology, for example.
Illustratively, the degree of overlap refers to a ratio between the same character and all characters between two character strings, or a ratio between the same character and one of the character strings. The ratio may be a ratio between the number of characters or a ratio between the lengths of the characters.
Illustratively, the overlap ratio m between a candidate subtitle text and its corresponding check text is m = len(string(ocr) ∩ string(asr)) / len(string(ocr)), where string(ocr) is the candidate subtitle text, string(asr) is the check text corresponding to the candidate subtitle text, len(string(ocr) ∩ string(asr)) is the length of the character intersection between the candidate subtitle text and the check text, and len(string(ocr)) is the length of the candidate subtitle text.
In this example, the degree of overlap between the candidate subtitle text and the check text identified by its corresponding audio may be used to determine whether the candidate subtitle text is a subtitle. Thus, the accuracy of subtitle recognition is improved.
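A small sketch of this overlap check; the character-multiset intersection and the 0.5 default threshold are illustrative choices, not values fixed by the disclosure:

def overlap_ratio(ocr_text, asr_text):
    # m = len(string(ocr) ∩ string(asr)) / len(string(ocr)), computed on characters
    if not ocr_text:
        return 0.0
    asr_chars = list(asr_text)
    common = 0
    for ch in ocr_text:              # count the character intersection with multiplicity
        if ch in asr_chars:
            asr_chars.remove(ch)
            common += 1
    return common / len(ocr_text)

def is_subtitle(ocr_text, asr_text, threshold=0.5):
    # accept the candidate subtitle when the overlap exceeds the set threshold
    return overlap_ratio(ocr_text, asr_text) > threshold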
In some embodiments, when subtitle recognition is performed on a large amount of video, obvious non-subtitle text such as watermarks may be filtered out in advance.
For example, in step S210, performing text recognition on the key frames in the video file to obtain candidate subtitle texts, including:
performing text recognition on key frames in a video file to obtain a first text set;
determining non-subtitle text based on the first text set;
filtering non-subtitle texts in the first text set to obtain a second text set;
based on the second text set, candidate subtitle texts are determined.
In this example, the non-subtitle text is filtered in advance, which is beneficial to improving the efficiency of subsequent subtitle recognition.
Illustratively, the key frames are extracted from the video file before text recognition is performed on each key frame. Text detection is performed on each image frame of the video file, a frame-difference method is used to compare whether two frames are repeated, and redundant frames are deleted if they are, finally yielding a set of key frames each of which contains text. Whether two successive image frames are repeated can be determined by comparing the similarity, such as the edit distance, between the texts in the two frames. The model for detecting text in the image frames can be implemented with an RCNN deep learning model. In addition, to speed up key frame extraction, key frames may be sampled at a set interval, for example 3 frames per second, or one frame every 8 seconds.
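A sketch of interval sampling combined with frame-difference de-duplication, under the assumption that OpenCV is used for decoding and that an external ocr_text(frame) function returns the recognized text of a frame (both assumptions are for illustration and not stated in the disclosure):

import cv2  # assumed dependency for video decoding

def similarity(a, b):
    # cheap similarity proxy; an edit-distance ratio could be used instead
    import difflib
    return difflib.SequenceMatcher(None, a, b).ratio()

def extract_key_frames(video_path, frame_step=8, sim_threshold=0.9, ocr_text=None):
    # Sample one frame every `frame_step` frames and drop frames whose OCR text
    # is nearly identical to the previously kept frame (a simple frame-difference idea).
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_text, idx = [], "", 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:
            text = ocr_text(frame) if ocr_text else ""
            # keep the frame only if its text differs enough from the previous kept frame
            if not prev_text or similarity(text, prev_text) < sim_threshold:
                key_frames.append((idx, frame, text))
                prev_text = text
        idx += 1
    cap.release()
    return key_frames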
For example, text recognition is performed on a key frame, and text information and position information of each text of the key frame can be obtained. Each text may be in the form of a block or line of text. For example, the text recognition result for a key frame may be labeled as:
[(text1,(xmin,xmax,ymin,ymax)),(text2,(xmin,xmax,ymin,ymax)),……];
wherein, text1 is a text, and text2 is another text; xmin, xmax represents the minimum abscissa and the maximum abscissa of the text block in the current key frame; ymin, ymax represents the minimum and maximum ordinate of a text block in the current key frame. As shown in fig. 4, it shows the position information of the text block "good learning, day-to-day" in the current frame.
In some embodiments, the non-subtitle text includes background text such as a watermark, station logo, or advertisement. For example, background information such as station logos or advertisements may exist in some television dramas and variety programs. As shown in fig. 4, the "XXTV drama channel" in fig. 4 is a station logo. Such background information tends to appear at a fixed location in the video and to exist for the entire duration of the video, i.e., in all key frames of the video. Therefore, the occurrence frequency of texts at the same position can be counted based on the position information of each text in the first text set, and texts whose occurrence frequency is higher than a set threshold can be determined as non-subtitle text.
Illustratively, determining the non-subtitle text based on the position information of each text in the first text set may include: determining the occurrence frequency of texts at the same position based on the position information of each text in the first text set; and determining texts at the same position as non-subtitle text when their occurrence frequency is greater than a set frequency threshold.
It should be noted that there may be slight differences in the position coordinate information between previous and subsequent frames or different frames of the same text. For example, the position coordinate information recognized in two frames before and after "good learning, day-to-day" in fig. 4 may be (125, 512, 678, 780) and (126, 513, 679, 781). Therefore, the two pieces of position coordinate information can be normalized and regarded as the same position.
Illustratively, when it is determined that the positional distance between two texts belonging to different key frames is less than a predetermined distance threshold based on the positional information of the two texts, the two texts are determined to be texts having the same position.
Illustratively, the position of the text block 1 in the key frame a is (x 1, y 1); the position of the text block 2 in the key frame b is (x 2, y 2). If | | | x1-x2| | < a, | | | y1-y2| | < b, then text block 1 and text block 2 are the same location text.
Illustratively, the number of occurrences of texts at the same position is counted to obtain their occurrence frequency. The text with the highest occurrence frequency may be determined as non-subtitle text, or texts whose number of occurrences exceeds 80% or 70% of the total number of key frames of the video file may be determined as non-subtitle text.
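A sketch of this position-frequency filtering, assuming OCR results are given as (text, (xmin, xmax, ymin, ymax)) tuples per key frame; the pixel tolerances and the 70% ratio are illustrative values only:

from collections import defaultdict

def filter_non_subtitle(frames_ocr, tol_x=50, tol_y=100, ratio=0.7):
    # frames_ocr: list (one entry per key frame) of [(text, (xmin, xmax, ymin, ymax)), ...].
    # Texts whose (normalized) position appears in more than `ratio` of the key frames
    # are treated as watermarks / station logos / background and removed.
    def norm(box):
        # quantize coordinates so small per-frame jitter maps to the same position key
        return tuple(round(v / tol) * tol for v, tol in zip(box, (tol_x, tol_x, tol_y, tol_y)))

    counts = defaultdict(int)
    for items in frames_ocr:
        for _, box in items:
            counts[norm(box)] += 1

    limit = ratio * len(frames_ocr)
    filtered = []
    for items in frames_ocr:
        kept = [(t, box) for t, box in items if counts[norm(box)] <= limit]
        filtered.append(kept)
    return filtered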
In some embodiments, after the non-subtitle text is filtered in the first set of text, a second set of text is obtained. At this time, all texts in the second text set may be used as candidate subtitle texts to perform audio verification, so as to obtain subtitle texts of the video file. Or screening the texts in the second text set, and taking the screened texts as candidate subtitle texts.
In some embodiments, the candidate subtitle text may be filtered out of the second set of text using the subtitle region.
In some embodiments, the subtitle region is determined as follows: after the non-subtitle text is filtered out, the remaining texts are sorted by the occurrence frequency of their positions, and the position area whose occurrence frequency is greater than a set threshold, for example the position of the most frequent text, is determined as the subtitle region of the video file.
In some embodiments, the subtitle region may be utilized to determine whether text in the second set of text after filtering non-subtitle text is subtitle. For example, the position of the text matches the subtitle region, the text may be determined as a subtitle. If the position of the text does not match the subtitle region, the text cannot be determined as the subtitle, and further verification is required.
Illustratively, the determining candidate subtitle texts based on the second text set includes:
determining a first subtitle area based on the position information of each text in the second text set;
and for each text in the second text set, determining the text as a candidate subtitle text when the matching degree of the position information of the text and the first subtitle area is smaller than a set matching threshold.
In this example, the candidate subtitle text is a text for which it cannot be determined whether the candidate subtitle text is a subtitle, and further verification using a verification text obtained by speech recognition is required, so that the accuracy of subtitle recognition is improved.
For example, the determining the first subtitle region based on the position information of each text in the second text set may include: and sequencing the occurrence frequencies of the texts with the same positions in the second text set, and determining the position area where the text with the highest occurrence frequency is located as a first subtitle area.
Illustratively, fig. 3A and 3B show two key frames from a video of a teacher giving a lecture. The key frames contain background text as well as multiple lines of subtitles. The text in the text area pointed to by the teacher Actor1 is background text. When that text area is checked against the determined subtitle region, it is found that the subtitle region does not match the text area pointed to by the teacher Actor1. Assuming the subtitle region is (Xmin, Xmax, Ymin, Ymax) and the text region pointed to by the teacher Actor1 is (Xminc, Xmaxc, Yminc, Ymaxc), then if Yminc < Ymin or Ymaxc > Ymax, the text corresponding to that text region lies outside the subtitle region and cannot be determined to be a subtitle. However, if such a text area were simply treated as not being a subtitle, subtitles might be lost. Therefore, such a text is determined to be a candidate subtitle text, and it is checked using the check text obtained by speech recognition of the corresponding audio to determine whether it is a subtitle. In this way, subtitle loss can be avoided on the one hand, and the accuracy of subtitle identification can be improved on the other.
Since the caption can be identified by the caption area, the first caption set can be quickly obtained by identifying the first text set or the second text set by the caption area.
Illustratively, the determining candidate subtitle texts based on the second text set further includes:
and for each text in the second text set, determining the text as the subtitle text of the video file when the matching degree of the position information of the text and the first subtitle area is greater than a set matching threshold.
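A minimal sketch of the region-matching decision described above: a text whose box lies inside the determined subtitle region is accepted directly as a subtitle, and the rest become candidate subtitle texts to be verified against the audio (the containment test shown is an illustrative interpretation of the matching degree):

def classify_by_region(text_box, subtitle_region):
    # text_box / subtitle_region: (xmin, xmax, ymin, ymax).
    # Returns "subtitle" when the box matches the subtitle region,
    # otherwise "candidate" (to be checked against the ASR check text).
    xmin, xmax, ymin, ymax = text_box
    Xmin, Xmax, Ymin, Ymax = subtitle_region
    # containment test on both axes, mirroring the Yminc < Ymin / Ymaxc > Ymax check above
    if Xmin <= xmin and xmax <= Xmax and Ymin <= ymin and ymax <= Ymax:
        return "subtitle"
    return "candidate"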
For example, in step S210, performing text recognition on a key frame of a video file to obtain a first subtitle set may include:
performing text recognition on each key frame in the video file to obtain a third text set;
determining a second subtitle region based on the position information of each text in the third text set;
and for each text in the third text set, determining the text as the subtitle text of the video file when the matching degree of the position information of the text and the second subtitle area is greater than the set matching threshold.
In the example, the caption is determined by utilizing the caption area, so that the caption can be determined quickly and accurately, and the caption identification efficiency is improved.
For example, each key frame in the video file is subjected to text recognition to obtain a third text set, and the non-subtitle text in the third text set may be filtered by using the method for filtering non-subtitle text described above to obtain a filtered third text set. Reference may be made to the foregoing filtering process of the non-caption text of the second text set and the recognition process of the non-caption text, which are not described in detail herein.
For example, the process of determining the second subtitle region may be the same as or similar to the process of determining the first subtitle region, and the foregoing may be referred to specifically, and will not be described in detail herein.
Illustratively, the first subtitle region and the second subtitle region may be the same region or different regions.
In some embodiments, after obtaining the first caption set, caption switching detection, duplicate removal and the like can be performed on the first caption set, and repeated text and audio combination pairs can be avoided.
Exemplarily, in the step S120, it may include:
determining a timestamp for switching subtitles based on the similarity between any two subtitle texts belonging to adjacent key frames in the first subtitle set, and dividing the first subtitle set by using the timestamp to obtain a plurality of second subtitle sets;
removing the duplicate of the second caption set to obtain a target caption;
and determining the timestamp of the target subtitle for subtitle switching based on the timestamp of the second subtitle set.
In this example, the caption switching detection on the first caption set results in a plurality of second caption sets, each of which includes a plurality of captions with the same content, so that each of the second caption sets can be deduplicated to obtain a target caption for combining the text and audio pairs, thereby avoiding the occurrence of repeated text and audio combined pairs.
Illustratively, subtitles belonging to the same key frame are merged before subtitle switching detection is performed on the first subtitle set. For example, a key frame may have two lines of subtitles, which may be combined into one subtitle. Thus, the subsequent caption switching detection can be facilitated.
Exemplarily, in the first subtitle set, the similarity of any two subtitle texts belonging to adjacent key frames is calculated. When the similarity is greater than the set first similarity threshold, the subtitle texts of the adjacent key frames are determined to be subtitles with the same content; when the similarity is smaller than the first similarity threshold, the two subtitle texts of the adjacent key frames are determined to be subtitles with different contents, i.e., a subtitle switch has occurred, and the timestamp corresponding to one of the two key frames is determined as the timestamp of the subtitle switch.
After all subtitle switching timestamps in the first subtitle set have been obtained, these timestamps are used as dividing lines, and the subtitles of the key frames whose timestamps fall between two adjacent dividing lines are grouped into the same subtitle set, yielding a plurality of second subtitle sets.
Illustratively, the second subtitle set includes a plurality of subtitles. The plurality of subtitles are subtitles having the same content.
Illustratively, the start timestamp and the end timestamp corresponding to the second subtitle set are timestamps when subtitle switching occurs for the target subtitle.
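A sketch of grouping the first subtitle set into second subtitle sets at the switching timestamps and de-duplicating each group; similarity() stands for any string similarity such as an edit-distance ratio, and the 0.9 threshold is only an example value:

def split_and_dedup(subtitles, similarity, threshold=0.9):
    # subtitles: list of (frame_timestamp, text) ordered by time.
    # Returns (target_text, start_ts, end_ts) triples, one per second subtitle set.
    if not subtitles:
        return []
    groups, current = [], [subtitles[0]]
    for prev, cur in zip(subtitles, subtitles[1:]):
        if similarity(prev[1], cur[1]) > threshold:   # same subtitle content
            current.append(cur)
        else:                                         # subtitle switch detected here
            groups.append(current)
            current = [cur]
    groups.append(current)

    results = []
    for group in groups:
        start_ts, end_ts = group[0][0], group[-1][0]  # switching timestamps of this subtitle
        target_text = group[0][1]                     # de-duplicate: keep one representative
        results.append((target_text, start_ts, end_ts))
    return results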
In some embodiments, although the similarity between subtitle texts belonging to adjacent key frames in the second subtitle set is greater than the first similarity threshold, the subtitles are not necessarily identical, which indicates that text recognition may have gone wrong for one of the two subtitles, i.e., a character may have been recognized incorrectly. Therefore, before the second subtitle set is deduplicated, the subtitles containing wrongly recognized characters in the second subtitle set can be corrected.
In order to improve the efficiency of subtitle error correction, the similarity between any two subtitles in the second subtitle set can be used to determine the subtitle with the wrong word, and the error correction can be performed on the subtitle with the wrong word.
Illustratively, before the deduplication is performed on the second subtitle set, the method further includes:
and under the condition that the similarity between the first subtitle text and the second subtitle text in the second subtitle set is greater than the first similarity threshold and smaller than the second similarity threshold, performing text error correction on the first subtitle text and the second subtitle text.
In the example, the subtitle sets each belonging to one subtitle content are obtained by dividing at the subtitle switching timestamps, and the subtitles needing text error correction are determined based on the similarity between subtitles within each subtitle set, so that the efficiency of subtitle error correction is improved.
Illustratively, the similarity may be calculated using an edit distance. The edit distance, also called the Levenshtein distance, quantifies how different two strings (e.g., English words) are: it is the minimum number of single-character edit operations required to transform one string into the other.
Illustratively, the first similarity threshold may be 90% and the second similarity threshold may be 100%.
Illustratively, the first similarity threshold may be 95% and the second similarity threshold may be 99.9%.
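A minimal Levenshtein-based similarity in the spirit of the edit distance described above; the first, second, and third similarity thresholds would then be compared against this value (a generic sketch, not the exact implementation of the disclosure):

def levenshtein(a, b):
    # minimum number of single-character insertions, deletions, or substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_similarity(a, b):
    # similarity in [0, 1]: 1.0 means identical strings
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))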
In some embodiments, text error correction may be performed by masking the suspect characters in the subtitle text, determining target predicted words based on the masked subtitle text and the mask words, and then using the target predicted words to modify the mask words in the subtitle text.
Illustratively, the text error correction of the first subtitle text and the second subtitle text includes:
determining characters that occupy the same position but differ between the first subtitle text and the second subtitle text as mask words;
masking the mask words in the first subtitle text and the second subtitle text to obtain a third subtitle text;
determining a target predicted word based on the third subtitle text and the mask words;
and modifying the mask words in the first subtitle text and the second subtitle text based on the target predicted word.
In this example, the correct word can be accurately recovered by determining the target predicted word from the subtitle text obtained by masking the suspect word together with the mask word itself.
Illustratively, the first subtitle text and the second subtitle text may be subtitle texts of two adjacent frames.
Illustratively, assume the first subtitle text and the second subtitle text are the same sentence (roughly, "the traffic at the current intersection is rather heavy, so be a little careful") except that, at one position, one text contains the character 人 ("person") and the other contains the visually similar character 入 ("enter"). These two characters occupy the same position but differ, and are therefore taken as the mask words.
Illustratively, the target predictor may include one or more.
For example, semantic analysis may be performed on the third subtitle text to obtain the predicted word sets and the confidence of each predicted word in each predicted word set, and then the target predicted word is determined by using the confidence of each predicted word and the relationship between each predicted word and the mask word.
For example, the third subtitle text may be semantically analyzed using the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) from natural language processing to obtain the predicted word set and the confidence of each predicted word in the predicted word set.
In practical applications, the BERT model is a masked language model, i.e., its main task in the pre-training phase is to predict masked tokens. Characters suspected of being wrong are therefore masked before the text is fed into the BERT model. For example, masking the 人/入 position in the example above yields a third subtitle text of the form "the [MASK] traffic at the current intersection is rather heavy, so be a little careful". The BERT model is then used to predict the character at the [MASK] position, producing a set of candidate words for the masked position together with the confidence of each word.
In some embodiments, the target predicted word may be determined in combination with the confidence of each predicted word and the edit distance between each predicted word and the mask word.
Illustratively, the determining the target predicted word based on the third caption text and the mask word includes:
performing semantic analysis on the third subtitle text to obtain a predicted word set and a confidence coefficient of each predicted word in the predicted word set;
determining the quality score of each predicted word based on the confidence of each predicted word and the edit distance between each predicted word and the mask word;
and determining the target predicted word in the predicted word set based on the quality score of each predicted word.
In this example, the confidence of each predicted word and the edit distance between each predicted word and the mask word are combined to determine the target predicted word in the plurality of predicted words, so that the prediction accuracy can be further improved.
For example, the predicted word and the confidence of the predicted word may be predicted by the model. The confidence level refers to the confidence level that the predicted word is the true text of the masked position in the third subtitle text.
Illustratively, the semantic analysis of the third caption text may employ a BERT model.
Illustratively, assume the candidates predicted by the BERT model for the MASK position are: 车 ("car") with confidence p = 0.8123 and an IDS-based edit distance score s = 0.012 with respect to the mask word 入, and 人 ("person") with confidence p = 0.2313 and an IDS-based edit distance score s = 0.9342 with respect to 入. Summing the confidence and the edit distance score gives a quality score of 0.8243 for "car" and 1.1655 for "person". Thus, the predicted word "person" can be selected as the target predicted word.
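A sketch of this masked-prediction and scoring step using a generic fill-mask pipeline; the model name, the ids_similarity function standing in for the IDS edit distance score, and the thresholds are all illustrative assumptions:

from transformers import pipeline  # assumed third-party dependency

fill_mask = pipeline("fill-mask", model="bert-base-chinese")  # illustrative model choice

def correct_masked_char(masked_text, mask_char, ids_similarity,
                        score_threshold=1.0, ids_threshold=0.5):
    # masked_text contains one [MASK]; mask_char is the suspect OCR character.
    # quality score = model confidence + IDS-based edit distance score to the suspect character
    candidates = fill_mask(masked_text)  # list of dicts with "token_str" and "score"
    best_word, best_quality = None, -1.0
    for cand in candidates:
        word = cand["token_str"]
        quality = cand["score"] + ids_similarity(word, mask_char)
        if quality > best_quality:
            best_word, best_quality = word, quality
    if best_word is None:
        return mask_char
    # modify only when the prediction is both confident and visually close to the suspect character
    if best_quality > score_threshold and ids_similarity(best_word, mask_char) > ids_threshold:
        return best_word
    return mask_char  # otherwise keep the original character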
In some embodiments, the decision whether to modify may be based on the quality score of the target predicted word and the edit distance between the target predicted word and the mask word to avoid a mis-modification.
Illustratively, modifying the mask word in the first caption text and the second caption text based on the target predicted word comprises:
and modifying the mask words in the first subtitle text and the second subtitle text into the target predicted word when the quality score of the target predicted word is greater than a set quality score threshold and the edit distance between the target predicted word and the mask word is greater than a set edit distance threshold.
Illustratively, in the case that the quality score of the target predicted word is less than the set quality score threshold or the editing distance between the target predicted word and the mask word is less than the set editing distance threshold, no modification is made to the mask word in the first subtitle text and the second subtitle text.
In this example, whether to modify or not is decided based on the quality score of the target predicted word and the edit distance between the target predicted word and the mask word, and erroneous modification can be avoided.
Illustratively, the mask word is corrected when the quality score of the target predicted word is greater than 1 and the edit distance between the target predicted word and the mask word is greater than 0.5; otherwise, the subtitle text is not modified and remains unchanged.
In some embodiments, after performing text error correction on the second subtitle set, the second subtitle set may be deduplicated to obtain the target subtitle.
Illustratively, the foregoing de-duplicating the second subtitle set to obtain the target subtitle may include:
and under the condition that the similarity between any two subtitle texts in the second subtitle set is greater than a third similarity threshold value, carrying out duplicate removal on the second subtitle set to obtain the target subtitle.
In this example, if the similarity between any two subtitle texts in the second subtitle set is greater than the third similarity threshold, it indicates that the subtitles in the second subtitle set are all the same subtitle and that none of them contains a wrongly recognized character. In this case, the second subtitle set can be deduplicated, which improves the accuracy of the obtained target subtitle.
For example, the third similarity threshold may be greater than or equal to the second similarity threshold.
Illustratively, the third similarity threshold may be 99% or 100%, etc.
In some embodiments, after the combined pair of text and audio is obtained through the above steps, the audio in the combined pair may be used to check the text to determine whether the combined pair can be used as a training sample of the speech recognition model, so as to improve the recognition accuracy of the speech recognition model trained by using the training sample.
After obtaining the combined pair of text and audio, the method further includes:
performing voice recognition on the target audio based on the voice recognition model to obtain a check text corresponding to the target subtitle;
determining the confidence of the combined pair based on the target caption and the corresponding check text thereof;
and determining the combined pair as a training sample of the speech recognition model under the condition that the confidence coefficient of the combined pair meets the set confidence coefficient threshold value.
In the present exemplary embodiment, a pre-trained speech recognition model is used to perform speech recognition on a target audio, and a verification text obtained by recognition is used to verify a target subtitle, so as to obtain a confidence of a combined pair consisting of the target audio and the target subtitle, and whether the combined pair is determined as a training sample is determined based on the confidence, so that the training precision of the speech recognition model can be improved.
Illustratively, the confidence of the combined pair may be determined using the edit distance between the target subtitle and its check text, the character length of the target subtitle, and the character length of the check text.
For example, assuming that the ASR recognition result is string1 and the OCR caption result is string2, the confidence C of the training sample can be expressed as:
C=1-EditDistance(string1,string2)/max(len(string1),len(string2))
where EditDistance (string 1, string 2) represents an editing distance between string1 and string2, len (string 1) is the length of string1, len (string 2) is the length of string2, and max () function is a maximum value-taking function.
When the ASR recognition result and the OCR recognition result are highly consistent, i.e., string1 and string2 agree closely, the confidence C of the training sample is high; conversely, when their consistency is low, the value of C is low. Samples with low labeling quality can therefore be filtered out using the confidence C.
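A sketch of this confidence computation used to filter low-quality samples; the 0.9 cut-off shown is an illustrative value only:

def sample_confidence(asr_text, ocr_text):
    # C = 1 - EditDistance(string1, string2) / max(len(string1), len(string2))
    def edit_distance(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]
    if not asr_text and not ocr_text:
        return 1.0
    return 1.0 - edit_distance(asr_text, ocr_text) / max(len(asr_text), len(ocr_text))

def keep_sample(asr_text, ocr_text, threshold=0.9):
    # keep the <text, audio> pair as a training sample only when C meets the threshold
    return sample_confidence(asr_text, ocr_text) >= threshold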
Illustratively, the training samples described above may be applied to the training of any speech recognition model: the target audio is used as the input of the speech recognition model, the target subtitle is used as the label of the target audio and is compared with the output of the speech recognition model, and the parameters of the speech recognition model are adjusted according to the comparison result.
In some embodiments, the generated training samples cannot be applied directly to model training because the text in the training samples may contain numbers or special symbols and therefore needs to be regularized.
Illustratively, after obtaining the training sample, the method further comprises:
and when the target characters exist in the target subtitles, regularizing the target subtitles.
In this example, regularization processing is performed on a target caption in which a target character exists, and the obtained training sample may be applied to model training.
Illustratively, the target characters may include numbers, function symbols, formulas, time-date, units, and the like. And converting the target character into a corresponding Chinese character. Generally, the method can be realized by methods such as dictionary mapping and regular matching.
For example, "this piece of gold weighs 324.75 grams" becomes, after regularization, "this piece of gold weighs three hundred and twenty-four point seven five grams"; "she was born on August 18, '86, and her brother was born on March 1, 1995" becomes "she was born on August eighteenth, nineteen eighty-six, and her brother was born on March first, nineteen ninety-five"; "there is a 62% probability of rainfall tomorrow" becomes "there is a sixty-two percent probability of rainfall tomorrow"; and a phone number such as 8618544139121 is read out digit by digit.
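A minimal sketch of this regularization step for digit strings, assuming the target language is Chinese; it only covers reading digits out character by character and would need dictionary mapping and regular-expression rules for decimals, dates, percentages, and units as described above:

import re

CN_DIGITS = "零一二三四五六七八九"

def digits_to_chars(digits):
    # read a digit string out character by character, e.g. for phone numbers
    return "".join(CN_DIGITS[int(d)] for d in digits)

def normalize_numbers(text):
    # replace every run of ASCII digits with its spoken Chinese-character form;
    # a real system would add rules for decimals, dates, percentages, and units
    return re.sub(r"\d+", lambda m: digits_to_chars(m.group()), text)

# Example: normalize_numbers("这是手机8618544139121")
# -> "这是手机八六一八五四四一三九一二一"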
An example of an application of a training sample set for constructing a speech recognition model based on OCR recognition combined with ASR recognition is described below, in particular as follows:
as shown in fig. 5, the process of constructing the training sample set of the speech recognition model mainly includes: extracting a video key frame, recognizing an OCR text, filtering a watermark and a non-subtitle text of a scene, switching and detecting a subtitle, correcting and generating a text, regularizing the text and marking and checking a sample.
(1) Video key frame extraction
Each frame in the video is traversed for text detection, a frame-difference method compares whether two frames are repeated, and redundant frames are deleted if they are, finally yielding the key frames containing subtitles. Whether two frames are repeated is judged by comparing the similarity of the text strings extracted from the two frames; the similarity may be calculated, for example, as an edit distance. The text detection and recognition model for the image frames can be implemented with an RCNN deep learning model. Alternatively, key frames can be extracted quickly based on the FPS (frames per second): for example, when the FPS is 24, 3 frames per second can be extracted as key frames, i.e., one key frame every 8 frames, yielding the key frame set. This extraction method is fast. The specific key frame extraction method may be chosen according to the characteristics of the video data or the available resources.
(2) OCR text recognition
After obtaining the key frames of the video file, the text information of the text blocks and the position information of each text block in the key frames can be identified from the key frames by using an OCR technology. In general, OCR gives results in the form [(text1, (xmin, xmax, ymin, ymax)), (text2, (xmin, xmax, ymin, ymax)), ...]. Here, text1 and text2 are the recognized text strings, xmin and xmax represent the minimum and maximum abscissa of the text block in the key frame, and ymin and ymax represent the minimum and maximum ordinate of the text block in the key frame, as shown in fig. 4.
(3) Non-subtitle text filtering for watermarks and backgrounds
Background information such as a station logo or an advertisement may exist in some dramas and variety programs; for example, the "XXTV drama channel" in fig. 4 is a station logo. This background information is often at a fixed location in the video and is present throughout the playback time of the video file. Therefore, the frequency of the coordinate information of all texts in the OCR recognition results can be counted, and such background content can be filtered out with a frequency threshold, preventing it from being mistaken for subtitle information. It should be noted that the position coordinates recognized by OCR for the same text differ slightly between frames; for example, the position coordinates of "good learning, day-to-day" recognized in two successive frames of fig. 4 may be (125, 512, 678, 780) and (126, 513, 679, 781), so the coordinates need to be "normalized". That is, when the coordinate difference between two text blocks in two different frames is smaller than a predetermined difference threshold, they can be regarded as having the same coordinates. For example, when the coordinate difference between text block (x1, y1) and text block (x2, y2) satisfies ||x1-x2|| < a and ||y1-y2|| < b, the positions of the two text blocks are considered the same, where a and b are predetermined thresholds that can be set according to the actual situation, for example a is 50 pixels and b is 100 pixels. After the watermark and background content are filtered out, the remaining texts are sorted by their occurrence frequency, and the coordinate region of the text with the highest frequency is determined as the subtitle region, which can be recorded as (Xmin, Xmax, Ymin, Ymax). A text located in this region, or whose region matches the subtitle region, is determined to be a subtitle.
(4) Caption switch detection, text error correction and generation
After the subtitles of the video are obtained, the text blocks of the same frame need to be merged, for example in the case of two-line subtitles, as shown in fig. 6. In addition, when key frames are redundant, subtitles are repeated, and the repeated subtitles need to be deleted.
As shown in fig. 7, in the subtitle position checking module, subtitles obtained by text recognition may suffer from matching problems caused by multi-line subtitles or insufficient background filtering, which often occurs in self-produced (self-media) videos. Figs. 3A and 3B show a lecture scene: background text exists in the video and the subtitles are displayed in multiple lines. If the position of the current text is checked against the obtained subtitle position, the two may not match, i.e. the position region (Xminc, Xmaxc, Yminc, Ymaxc) where the text is located is larger than the subtitle region, for example Yminc < Ymin and Ymaxc > Ymax. In this case, determining the subtitle based on the subtitle position alone would lose subtitle content, but text outside the subtitle region cannot simply be taken as subtitle by default either. Therefore, the position region of the current text can be marked, the corresponding audio can be obtained from the timestamp at which the current text switches, the audio can then be recognized with ASR to obtain a verification text, and the current OCR-recognized text can be recalled with that verification text. The specific recall method, sketched below, comprises the following steps:
calculating the coincidence ratio m = len(string(ocr) ∩ string(asr)) / len(string(ocr)) between the OCR-recognized character string string(ocr) and the ASR recognition result string(asr);
wherein len() returns the length of a character string and string(ocr) ∩ string(asr) denotes the characters common to string(ocr) and string(asr);
and when the coincidence ratio is greater than the set coincidence threshold, the OCR-recognized text line is determined to be subtitle content.
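A minimal sketch of this recall check follows, assuming the character overlap is computed as a multiset intersection (a set intersection or an alignment would serve equally well); the 0.8 threshold is illustrative.

```python
from collections import Counter


def coincidence_ratio(ocr_text: str, asr_text: str) -> float:
    """m = |characters common to OCR and ASR| / |OCR characters|."""
    overlap = Counter(ocr_text) & Counter(asr_text)
    return sum(overlap.values()) / max(len(ocr_text), 1)


def recall_ocr_line(ocr_text: str, asr_text: str, threshold: float = 0.8) -> bool:
    # keep the OCR line as subtitle content when the overlap is high enough
    return coincidence_ratio(ocr_text, asr_text) > threshold
```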
As shown in fig. 6, after all subtitles of a video file are obtained, all subtitles of the same frame are merged. Then, the caption switching detection shown in fig. 7 is performed.
This application example detects subtitle switching by means of the similarity of subtitle information between different frames. As shown in fig. 7, the subtitle lines are traversed in order: the current frame number is recorded as the start frame frame_s, and the subtitle of the current frame is compared with the subtitle of the next frame. When the subtitle similarity of adjacent frames is greater than a preset threshold (e.g. 0.9), the subtitle contents of the two frames are considered the same, the next frame is skipped, and the comparison continues with the frame after it, until a frame with a different subtitle is found; the current frame number is then recorded as the end frame frame_e of this subtitle, and so on until all subtitle lines have been traversed. The timestamp at which the current subtitle switches is thus obtained as the pair frame_s, frame_e.
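The loop below is a sketch of this traversal: it groups consecutive frames whose subtitles are considered the same and records the start and end frame numbers of each group. `similarity` is any string similarity in [0, 1], for instance the edit-distance-based similarity sketched further below; the input format is an assumption.

```python
def detect_switches(subtitles, similarity, threshold=0.9):
    """subtitles: list of (frame_number, subtitle_text) in frame order.
    Returns (text, frame_s, frame_e) for each run of identical subtitles."""
    segments, i = [], 0
    while i < len(subtitles):
        frame_s, text = subtitles[i]
        j = i
        # skip following frames whose subtitle is considered the same
        while j + 1 < len(subtitles) and similarity(text, subtitles[j + 1][1]) > threshold:
            j += 1
        frame_e = subtitles[j][0]
        segments.append((text, frame_s, frame_e))
        i = j + 1
    return segments
```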
The above similarity can be computed with the edit distance of the character strings. For example, the comparison can be based on the edit distance of the IDS (Ideographic Description Sequence). An IDS simply characterizes an ideograph as a sequence, similar to the way English words are represented by letter sequences: a Chinese character can be represented by its strokes and structure, as shown in fig. 8 for the IDS of the character "差" ("poor"). This characterization is more expressive and robust than one based purely on glyph similarity.
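The following sketch shows a plain edit-distance similarity; for the IDS variant, each character would first be expanded into its Ideographic Description Sequence through a lookup table, represented here by the hypothetical `ids_of` function.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a single rolling row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] derived from the edit distance."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def ids_similarity(a: str, b: str) -> float:
    # expand each character into its IDS string before comparing
    # (ids_of is a hypothetical character -> IDS lookup table)
    return similarity("".join(ids_of(c) for c in a), "".join(ids_of(c) for c in b))
```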
If, during switching detection, the subtitle similarity of adjacent frames is greater than the preset threshold but the subtitles are not exactly identical, the current subtitle may contain an OCR recognition error. For example, two adjacent frames are recognized as "the flow of people (人) at the current intersection is heavy, please take care" and "the flow of 入 at the current intersection is heavy, please take care"; that is, OCR has mis-recognized the character for "person" (人) as the visually similar character for "enter" (入). When performing text recognition, OCR outputs the recognized text, its position information and its confidence, so the confidence in the OCR result can also be used to decide whether the current character is erroneous and needs to be modified.
Therefore, this application example applies text error correction to subtitle content that contains errors. The correction can be performed with BERT, a pre-trained language model from natural language processing. BERT is a masked language model, i.e. its main task in the training phase is to predict masked elements. When text is fed into the BERT model, the character suspected of being wrong is replaced by [MASK], e.g. "the [MASK] flow at the current intersection is heavy, please take care", and BERT is then used to predict the character behind the [MASK]. The final character for the masked position is determined from the set of candidate characters BERT predicts for the [MASK] position, the confidence p of each candidate, and the IDS-based similarity s between the candidate and the masked character. For example, the characters BERT predicts for the [MASK] position are:
1) "car", the confidence p is 0.8123, edit distance s with IDS "enter" is 0.012;
2) The confidence p of the "person" is 0.2313, and the editing distance s between the "person" and the IDS of the "in" is 0.9342;
the final scoring rule is Score = p + s value highest as the final candidate.
In some embodiments, additional constraints may be added to improve accuracy and avoid erroneous corrections. For example, the original character is replaced only when the similarity s is greater than 0.5 and the Score is greater than 1; otherwise the original character is kept unchanged.
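As an illustrative sketch only, the correction step could be realized with the Hugging Face `transformers` fill-mask pipeline and a Chinese BERT checkpoint, reusing the `ids_similarity` helper sketched above; the checkpoint name, top_k value and thresholds are assumptions consistent with the rule Score = p + s and the constraints s > 0.5, Score > 1.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-chinese")  # assumed checkpoint


def correct_char(sentence_with_mask: str, suspect_char: str, top_k: int = 5) -> str:
    """sentence_with_mask contains one [MASK] at the suspected error position,
    e.g. (illustrative) correct_char("当前路口[MASK]流量较大，请注意", "入")."""
    best_score, best_s, best_word = -1.0, 0.0, suspect_char
    for cand in fill_mask(sentence_with_mask, top_k=top_k):
        p = cand["score"]                                   # BERT confidence of the candidate
        s = ids_similarity(cand["token_str"], suspect_char)  # glyph (IDS) similarity
        if p + s > best_score:
            best_score, best_s, best_word = p + s, s, cand["token_str"]
    # apply the extra constraints to avoid false corrections
    return best_word if (best_s > 0.5 and best_score > 1) else suspect_char
```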
(5) Text regularization and sample annotation verification
The subtitle information finally generated for the video file is a subtitle file in srt format, which contains the time codes plus the subtitle text, as shown in fig. 9.
The original audio track can be extracted from the original video with an audio/video processing tool such as ffmpeg, the audio can then be cut according to the subtitle file, and each audio segment can be paired with its subtitle to form a text-audio pair, i.e. the text and speech training samples required for speech recognition training.
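A minimal sketch of this step with ffmpeg driven from Python follows; the options used (-vn, -ar, -ac, -ss, -to) are standard ffmpeg flags, while the file names, the 16 kHz mono format and the (start, end, text) segment layout are assumptions.

```python
import subprocess


def extract_audio(video_path: str, wav_path: str, sample_rate: int = 16000) -> None:
    """Strip the video stream and write a mono wav at the given sample rate."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn",
                    "-ar", str(sample_rate), "-ac", "1", wav_path], check=True)


def cut_segments(wav_path: str, segments):
    """segments: iterable of (start_sec, end_sec, text) taken from the srt file.
    Returns a list of (clip_path, text) pairs, i.e. text-audio training pairs."""
    pairs = []
    for k, (start, end, text) in enumerate(segments):
        clip = f"clip_{k:05d}.wav"
        subprocess.run(["ffmpeg", "-y", "-i", wav_path,
                        "-ss", str(start), "-to", str(end), clip], check=True)
        pairs.append((clip, text))
    return pairs
```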
However, the data generated by the above process cannot be used directly for training the ASR model; the text must first be regularized. Text regularization means expressing the Arabic numerals, times, dates, units and the like appearing in the text as Chinese characters, and can generally be implemented by methods such as dictionary mapping and regular-expression matching, as in the following examples (a minimal code sketch follows the examples):
1) The weight of the gold reaches 324.75 g -> regularization -> the weight of the gold reaches three hundred twenty-four point seven five grams;
2) She was born on August 18 of '86, and her cousin was born on March 1, 1995 -> regularization -> she was born on August eighteenth of eighty-six, and her cousin was born on March first of nineteen ninety-five;
3) Tomorrow has a probability of rainfall of 62% -> regularization -> tomorrow has a probability of rainfall of sixty-two percent;
4) This is the mobile number 8618544139121 -> regularization -> this is the mobile number eight six one eight five four four one three nine one two one.
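A minimal sketch of dictionary-plus-regex regularization is shown below. It covers only the digit-by-digit case (e.g. phone numbers); dates, decimals, percentages and units would be handled by further rules written in the same style. The 5-digit cutoff is an assumption.

```python
import re

DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}


def spell_digits(number: str) -> str:
    """Read a digit string out one digit at a time, as for a phone number."""
    return "".join(DIGITS[d] for d in number)


def regularize(text: str) -> str:
    # long digit strings (assumed: 5+ digits) are read digit by digit;
    # other patterns (dates, percentages, decimals) need their own rules
    return re.sub(r"\d{5,}", lambda m: spell_digits(m.group()), text)
```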
In addition, to ensure the quality of the training samples, a pre-trained ASR model can be used to verify the OCR-labeled text in the training samples. Assuming the ASR result is string1 and the OCR-recognized subtitle is string2, the labeling confidence of the data sample can be expressed as:
C=1-EditDistance(string1,string2)/max(len(string1),len(string2))
When the ASR recognition result and the OCR recognition result are highly consistent, string1 and string2 agree closely and the confidence C is high; conversely, when the two results are inconsistent, the confidence C is low. Training samples with poor labeling quality can therefore be filtered out by thresholding the confidence C.
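A minimal sketch of this filtering step, reusing the `edit_distance` helper sketched earlier; the 0.8 threshold is illustrative, the application only requiring that a set confidence threshold be met.

```python
def label_confidence(asr_text: str, ocr_text: str) -> float:
    """C = 1 - EditDistance(string1, string2) / max(len(string1), len(string2))."""
    return 1.0 - edit_distance(asr_text, ocr_text) / max(len(asr_text), len(ocr_text), 1)


def keep_sample(asr_text: str, ocr_text: str, threshold: float = 0.8) -> bool:
    # discard training samples whose OCR label disagrees too much with the ASR output
    return label_confidence(asr_text, ocr_text) >= threshold
```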
Fig. 10 is a block diagram of an audio and text combining apparatus according to an embodiment of the disclosure.
As shown in fig. 10, an embodiment of the present disclosure provides an audio and text combining apparatus, including:
the text recognition module 110 is configured to perform text recognition on a key frame in a video file to obtain a first subtitle set;
a caption switching detection module 120, configured to perform caption switching detection on the first caption set to obtain a target caption and a timestamp of caption switching of the target caption;
a target audio intercepting module 130, configured to intercept an audio from an audio file of the video file based on the timestamp, to obtain a target audio;
and an audio text combination module 140, configured to construct a combined pair of text and audio based on the target subtitle and the target audio.
In some possible implementations, the text recognition module 110 may include:
the candidate subtitle recognition unit is used for performing text recognition on key frames in the video file to obtain candidate subtitle texts;
the verification audio acquiring unit is used for determining verification audio corresponding to the candidate subtitle text in the video file;
the subtitle checking unit is used for determining the candidate subtitle text as the subtitle text of the video file under the condition that the relation between the candidate subtitle text and the checking audio accords with a set condition;
and the first subtitle set acquisition unit is used for acquiring the subtitle text of the video file to obtain a first subtitle set.
Illustratively, the subtitle checking unit is specifically configured to:
performing voice recognition on the verification audio to obtain a verification text;
and under the condition that the coincidence degree between the candidate subtitle text and the check text is greater than a set coincidence degree threshold value, determining that the candidate subtitle text is the subtitle text of the video file.
In some possible implementations, the candidate subtitle identifying unit is specifically configured to:
performing text recognition on each key frame in a video file to obtain a first text set;
determining non-subtitle text based on the first text set;
filtering the non-subtitle texts in the first text set to obtain a second text set;
and determining candidate subtitle texts based on the second text set.
In some possible implementations, the determining non-caption text based on the first set of text includes:
determining the occurrence frequency of texts having the same position based on the position information of each text in the first text set;
and determining the text with the same position as the non-subtitle text when the frequency of occurrence is greater than a set frequency threshold.
In some possible implementations, the determining candidate subtitle texts based on the second text set includes:
determining a first subtitle area based on the position information of each text in the second text set;
and for each text in the second text set, determining the text as a candidate subtitle text when the matching degree of the position information of the text and the first subtitle region is smaller than a set matching threshold.
In some possible implementations, the determining candidate subtitle texts based on the second text set further includes:
and for each text in the second text set, determining the text as the subtitle text of the video file when the matching degree of the position information of the text and the first subtitle region is greater than a set matching threshold.
In some possible implementations, the candidate subtitle identifying unit is specifically configured to:
performing text recognition on each key frame in the video file to obtain a third text set;
determining a second subtitle region based on the position information of each text in the third text set;
for each text in the third text set, determining that the text is a subtitle text of the video file when the matching degree of the position information of the text and the second subtitle region is greater than a set matching threshold;
and acquiring the subtitle text of the video file to obtain a first subtitle set.
In some possible implementations, the subtitle switch detecting module 120 includes:
the caption dividing unit is used for determining a timestamp for switching the captions based on the similarity between any two caption texts belonging to adjacent key frames in the first caption set, and dividing the first caption set by using the timestamp to obtain a plurality of second caption sets;
the caption duplicate removal unit is used for removing the duplicate of the second caption set to obtain a target caption;
and the time stamp determining unit is used for determining the time stamp of the subtitle switching of the target subtitle based on the time stamp of the second subtitle set.
In some possible implementations, the subtitle switching detection module 120 further includes:
and a subtitle error correction unit, configured to, before the second subtitle set is de-duplicated, perform text error correction on a first subtitle text and a second subtitle text in the second subtitle set when the similarity between the first subtitle text and the second subtitle text is greater than a first similarity threshold and smaller than a second similarity threshold.
In some possible implementations, the subtitle error correction unit is specifically configured to:
determining words with the same position and different words in the first subtitle text and the second subtitle text as shielding words;
shielding the shielding words in the first subtitle text and the second subtitle text to obtain a third subtitle text;
determining a target predicted word based on the third caption text and the shielding word;
and modifying the shielding words in the first subtitle text and the second subtitle text based on the target predicted words.
In some possible implementations, the subtitle deduplication unit is specifically configured to:
and under the condition that the similarity between any two caption texts in the second caption set is greater than a third similarity threshold value, performing duplicate removal on the second caption set to obtain a target caption.
In some possible implementations, the apparatus further includes:
the verification text acquisition module is used for carrying out voice recognition on the target audio based on the voice recognition model to obtain a verification text corresponding to the target subtitle;
a confidence determining module, configured to determine a confidence of the combined pair based on the target subtitle and the check text corresponding to the target subtitle;
and the training sample determining module is used for determining the combined pair as a training sample of the voice recognition model under the condition that the confidence coefficient of the combined pair meets a set confidence coefficient threshold value.
In some possible implementations, the apparatus further includes:
and the regularization processing module is used for regularizing the target subtitles under the condition that the target subtitles have target characters.
The functions of each unit, module or sub-module in each apparatus in the embodiments of the present disclosure may refer to the corresponding description in the above method embodiments, and are not described herein again.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 11 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. The RAM 803 can also store various programs and data required for the operation of the electronic device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the methods and processes described above, such as the audio and text combining method. For example, in some embodiments, the audio and text combining method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the audio and text combining method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the audio and text combining method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. An audio and text combining method, comprising:
performing text recognition on key frames in a video file to obtain a first subtitle set;
performing subtitle switching detection on the first subtitle set to obtain a target subtitle and a timestamp of subtitle switching of the target subtitle;
intercepting audio from an audio file of the video file based on the timestamp to obtain target audio;
and constructing a combined pair of text and audio based on the target caption and the target audio.
2. The method of claim 1, wherein the text recognition of the key frames in the video file to obtain the first caption set comprises:
performing text recognition on key frames in a video file to obtain candidate subtitle texts;
in the video file, determining a check audio corresponding to the candidate subtitle text;
determining the candidate subtitle text as the subtitle text of the video file under the condition that the relation between the candidate subtitle text and the check audio accords with a set condition;
and acquiring a subtitle text of the video file to obtain a first subtitle set.
3. The method according to claim 2, wherein the determining that the candidate subtitle text is the subtitle text of the video file in the case that the relationship between the candidate subtitle text and the verification audio meets a set condition comprises:
performing voice recognition on the verification audio to obtain a verification text;
and under the condition that the coincidence degree between the candidate subtitle text and the check text is greater than a set coincidence degree threshold value, determining that the candidate subtitle text is the subtitle text of the video file.
4. The method of claim 2, wherein the performing text recognition on the key frames in the video file to obtain candidate subtitle texts comprises:
performing text recognition on key frames in a video file to obtain a first text set;
determining non-subtitle text based on the first text set;
filtering the non-subtitle texts in the first text set to obtain a second text set;
and determining candidate subtitle texts based on the second text set.
5. The method of claim 4, wherein determining non-caption text based on the first set of text comprises:
determining the occurrence frequency of texts having the same position based on the position information of each text in the first text set;
and determining the text with the same position as the non-subtitle text when the frequency of occurrence is greater than a set frequency threshold.
6. The method of claim 4, wherein determining candidate subtitle text based on the second set of text comprises:
determining a first subtitle area based on the position information of the text in the second text set;
and for the text in the second text set, determining the text as a candidate subtitle text when the matching degree of the position information of the text and the first subtitle region is smaller than a set matching threshold.
7. The method of claim 6, further comprising:
and for the text in the second text set, determining that the text is the subtitle text of the video file when the matching degree of the position information of the text and the first subtitle region is greater than a set matching threshold.
8. The method of claim 1, wherein performing text recognition on the key frames of the video file to obtain the first subtitle set comprises:
performing text recognition on key frames in the video file to obtain a third text set;
determining a second subtitle area based on position information of the text in the third text set;
for the text in the third text set, determining that the text is the subtitle text of the video file when the matching degree of the position information of the text and the second subtitle region is greater than a set matching threshold;
and acquiring the subtitle text of the video file to obtain a first subtitle set.
9. The method of claim 1, wherein the performing caption switching detection on the first caption set to obtain a target caption and a timestamp of the target caption when caption switching occurs comprises:
determining a timestamp for switching subtitles based on the similarity between any two subtitle texts belonging to adjacent key frames in the first subtitle set, and dividing the first subtitle set by using the timestamp to obtain a plurality of second subtitle sets;
removing the duplicate of the second caption set to obtain a target caption;
and determining the timestamp of subtitle switching of the target subtitle based on the timestamp of the second subtitle set.
10. The method of claim 9, wherein prior to de-duplicating the second subtitle set, the method further comprises:
and under the condition that the similarity between the first subtitle text and the second subtitle text in the second subtitle set is greater than a first similarity threshold and smaller than a second similarity threshold, performing text error correction on the first subtitle text and the second subtitle text.
11. The method of claim 10, wherein the text error correcting the first subtitle text and the second subtitle text comprises:
determining words with the same position and different words in the first subtitle text and the second subtitle text as shielding words;
shielding the shielding words in the first subtitle text and the second subtitle text to obtain a third subtitle text;
determining a target predicted word based on the third caption text and the shielding word;
and modifying the shielding words in the first subtitle text and the second subtitle text based on the target predicted words.
12. The method of claim 9, wherein the de-duplicating the second caption set to obtain the target caption comprises:
and under the condition that the similarity between any two subtitle texts in the second subtitle set is greater than a third similarity threshold, carrying out duplicate removal on the second subtitle set to obtain a target subtitle.
13. The method of claim 1, further comprising:
performing voice recognition on the target audio based on a voice recognition model to obtain a check text corresponding to the target subtitle;
determining the confidence of the combined pair based on the target caption and the corresponding check text thereof;
and determining the combined pair as a training sample of the speech recognition model under the condition that the confidence coefficient of the combined pair meets a set confidence coefficient threshold value.
14. The method of claim 1, further comprising:
and under the condition that target characters exist in the target subtitles, regularizing the target subtitles.
15. An audio and text combining apparatus, comprising:
the text recognition module is used for performing text recognition on key frames in the video file to obtain a first subtitle set;
the subtitle switching detection module is used for carrying out subtitle switching detection on the first subtitle set to obtain a target subtitle and a timestamp of subtitle switching of the target subtitle;
the target audio intercepting module is used for intercepting audio from the audio file of the video file based on the timestamp to obtain target audio;
and the audio text combination module is used for constructing a text and audio combination pair based on the target subtitle and the target audio.
16. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-14.
17. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-14.
18. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-14.
CN202211049871.3A 2022-08-30 2022-08-30 Audio and text combination method and device, electronic equipment and storage medium Pending CN115396690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211049871.3A CN115396690A (en) 2022-08-30 2022-08-30 Audio and text combination method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211049871.3A CN115396690A (en) 2022-08-30 2022-08-30 Audio and text combination method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115396690A true CN115396690A (en) 2022-11-25

Family

ID=84125167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211049871.3A Pending CN115396690A (en) 2022-08-30 2022-08-30 Audio and text combination method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115396690A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797943A (en) * 2023-02-08 2023-03-14 广州数说故事信息科技有限公司 Multimode-based video text content extraction method, system and storage medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150088508A1 (en) * 2013-09-25 2015-03-26 Verizon Patent And Licensing Inc. Training speech recognition using captions
CN106604125A (en) * 2016-12-29 2017-04-26 北京奇艺世纪科技有限公司 Video subtitle determining method and video subtitle determining device
CN110210299A (en) * 2019-04-26 2019-09-06 平安科技(深圳)有限公司 Voice training data creation method, device, equipment and readable storage medium storing program for executing
CN110427930A (en) * 2019-07-29 2019-11-08 中国工商银行股份有限公司 Multimedia data processing method and device, electronic equipment and readable storage medium storing program for executing
CN111445902A (en) * 2020-03-27 2020-07-24 北京字节跳动网络技术有限公司 Data collection method and device, storage medium and electronic equipment
CN111723790A (en) * 2020-06-11 2020-09-29 腾讯科技(深圳)有限公司 Method, device and equipment for screening video subtitles and storage medium
CN112232260A (en) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Subtitle region identification method, device, equipment and storage medium
CN112735476A (en) * 2020-12-29 2021-04-30 北京声智科技有限公司 Audio data labeling method and device
CN112738640A (en) * 2020-12-28 2021-04-30 出门问问(武汉)信息科技有限公司 Method and device for determining subtitles of video stream and readable storage medium
CN113408301A (en) * 2021-07-12 2021-09-17 北京沃东天骏信息技术有限公司 Sample processing method, device, equipment and medium
CN113435443A (en) * 2021-06-28 2021-09-24 中国兵器装备集团自动化研究所有限公司 Method for automatically identifying landmark from video
CN113450774A (en) * 2021-06-23 2021-09-28 网易(杭州)网络有限公司 Training data acquisition method and device
CN113705300A (en) * 2021-03-16 2021-11-26 腾讯科技(深圳)有限公司 Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium
CN114067807A (en) * 2021-11-15 2022-02-18 海信视像科技股份有限公司 Audio data processing method and device and electronic equipment
CN114387589A (en) * 2021-12-14 2022-04-22 北京达佳互联信息技术有限公司 Voice supervision data acquisition method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110363194B (en) NLP-based intelligent examination paper reading method, device, equipment and storage medium
CN113313022B (en) Training method of character recognition model and method for recognizing characters in image
US10304458B1 (en) Systems and methods for transcribing videos using speaker identification
CN111723791A (en) Character error correction method, device, equipment and storage medium
CN110717470B (en) Scene recognition method and device, computer equipment and storage medium
CN110399526B (en) Video title generation method and device and computer readable storage medium
CN110232340B (en) Method and device for establishing video classification model and video classification
CN113850162B (en) Video auditing method and device and electronic equipment
CN113254654B (en) Model training method, text recognition method, device, equipment and medium
CN108229481B (en) Screen content analysis method and device, computing equipment and storage medium
CN115994230A (en) Intelligent archive construction method integrating artificial intelligence and knowledge graph technology
CN110796140B (en) Subtitle detection method and device
CN112559800A (en) Method, apparatus, electronic device, medium, and product for processing video
CN113205160B (en) Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium
CN112257437A (en) Voice recognition error correction method and device, electronic equipment and storage medium
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN115396690A (en) Audio and text combination method and device, electronic equipment and storage medium
CN113301382B (en) Video processing method, device, medium, and program product
CN114611625A (en) Language model training method, language model training device, language model data processing method, language model data processing device, language model data processing equipment, language model data processing medium and language model data processing product
CN114973229A (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN114419636A (en) Text recognition method, device, equipment and storage medium
CN116110066A (en) Information extraction method, device and equipment of bill text and storage medium
CN113361522B (en) Method and device for determining character sequence and electronic equipment
CN115393870A (en) Text information processing method, device, equipment and storage medium
CN115098729A (en) Video processing method, sample generation method, model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination