CN111445902B - Data collection method, device, storage medium and electronic equipment - Google Patents

Data collection method, device, storage medium and electronic equipment

Info

Publication number
CN111445902B
CN111445902B (application CN202010230599.3A)
Authority
CN
China
Prior art keywords
text
target video
target
determining
position information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010230599.3A
Other languages
Chinese (zh)
Other versions
CN111445902A (en)
Inventor
王康
何怡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202010230599.3A
Publication of CN111445902A
Application granted
Publication of CN111445902B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Television Signal Processing For Recording (AREA)
  • Character Input (AREA)

Abstract

The disclosure relates to a data collection method, a data collection device, a storage medium and electronic equipment, so as to quickly obtain a high-quality text corpus. The method comprises the following steps: acquiring a plurality of target video frame images from a target video; performing OCR (optical character recognition) on the plurality of target video frame images to determine a first text contained in each target video frame image and text position information of the first text, wherein the text position information of the first text is used for indicating the position where the first text appears in the target video frame image; determining a caption area of the target video according to the text position information; determining a second text according to the caption area, the first text and the text position information of the first text, wherein the second text is taken from the first text whose position indicated by the text position information is in the caption area; and determining the second text as a text corpus of the target video.

Description

Data collection method, device, storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of computers, and in particular relates to a data collection method, a data collection device, a storage medium and electronic equipment.
Background
In the field of speech processing, the effectiveness of speech recognition techniques depends on the collection of training data; for example, the training data of a language model is a text corpus. At present, when the text corpus required by a language model is collected, the original data is generally obtained from open-source channels. However, because the data is acquired indiscriminately and erroneous data is collected along with it, a large amount of noise data exists in the obtained data, and if the obtained data is directly used as the text corpus for training the language model, the training effect of the resulting language model is poor.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a data collection method, the method comprising:
acquiring a plurality of target video frame images from a target video;
performing OCR (optical character recognition) on the plurality of target video frame images to determine a first text contained in each target video frame image and text position information of the first text, wherein the text position information of the first text is used for indicating the position of the first text in the target video frame image;
determining a caption area of the target video according to the text position information;
determining a second text according to the caption area, the first text and the text position information of the first text, wherein the second text is taken from the first text of which the position indicated by the text position information is in the caption area;
and determining the second text as a text corpus of the target video.
In a second aspect, the present disclosure provides a data collection device, the device comprising:
the first acquisition module is used for acquiring a plurality of target video frame images from the target video;
the recognition module is used for performing OCR (optical character recognition) on the plurality of target video frame images so as to determine a first text contained in each target video frame image and text position information of the first text, wherein the text position information of the first text is used for indicating the position of the first text in the target video frame image;
the first determining module is used for determining a caption area of the target video according to the text position information;
a second determining module, configured to determine a second text according to the subtitle region, the first text, and text position information of the first text, where the second text is taken from the first text whose position indicated by the text position information is in the subtitle region;
and the third determining module is used for determining the second text as the text corpus of the target video.
In a third aspect, the present disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program which when executed by a processing device performs the steps of the method of the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method of the first aspect of the disclosure.
According to the above technical solution, a plurality of target video frame images are obtained from the target video, and OCR (optical character recognition) is performed on the plurality of target video frame images, so that the first text contained in each target video frame image and the text position information of the first text are determined; the caption area of the target video is determined according to the text position information, a second text is determined according to the caption area, the first text and the text position information of the first text, and the second text is determined to be the text corpus of the target video. In this way, by performing OCR recognition on images in the video, the most likely position of the subtitles, namely the caption area, is determined from the recognition result, and the usable text corpus is extracted from that area, so that the text corpus can be obtained from the video automatically and efficiently. Moreover, because the text content appearing in the same video is strongly correlated, there is little noise data in the collected text corpus, the quality of the text corpus is higher, and a language model trained with this text corpus also has higher quality.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
In the drawings:
FIG. 1 is a flow chart of a data collection method provided in accordance with one embodiment of the present disclosure;
FIG. 2 is a flow chart of a data collection method provided in accordance with another embodiment of the present disclosure;
FIG. 3 is a block diagram of a data collection device provided in accordance with one embodiment of the present disclosure;
FIG. 4 is a block diagram of an apparatus provided in accordance with an embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the accompanying drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one" or "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
In the field of speech processing, the effectiveness of speech recognition techniques depends on the collection of training data. For example, the training data of a language model is a text corpus, and the training data of speech recognition is audio and the text corresponding to it.
As for the collection of the text corpus required by a language model, in the prior art the original data is generally obtained from open-source channels. However, because the data is acquired indiscriminately and erroneous data is collected along with it, a large amount of noise data exists in the obtained data, and if this data is directly used as the text corpus for training the language model, the training effect of the resulting language model is poor.
As for the collection of the training data required by speech recognition, in the prior art manual labeling is generally required to make the collected audio and text correspond to each other before they can be used as training data for speech recognition; however, manual labeling consumes resources and is inefficient.
In order to solve the above-mentioned problems in the prior art, the present disclosure provides a data collection method, apparatus, storage medium and electronic device, so as to quickly obtain high quality training data.
Fig. 1 is a flow chart of a data collection method provided in accordance with one embodiment of the present disclosure. As shown in fig. 1, the method may include the following steps.
In step 11, a plurality of target video frame images are acquired from a target video.
The method provided by the disclosure can extract a text corpus from a target video. The target video can be any video: whenever a text corpus needs to be extracted from a video, that video can be used as the target video and the steps of the data collection method provided by the disclosure are executed for it.
A plurality of target video frame images are first acquired from the target video, where each target video frame image is one frame of the target video.
In step 12, OCR recognition is performed on the plurality of target video frame images to determine a first text and text position information of the first text contained in each target video frame image.
OCR (Optical Character Recognition) can recognize text in an image as editable text; therefore, the text portion of an image can be extracted by OCR recognition.
OCR recognition is performed on each of the target video frame images obtained in step 11, so that the text contained in each target video frame image and the position where that text appears in the image can be obtained, i.e., the first text contained in each target video frame image and the text position information of the first text. The first text may consist of one or more words, and the text position information of the first text is used to indicate the position where the first text appears in the target video frame image; this position may be represented by the coordinates of the pixel points corresponding to the first text in the target video frame image (for example, the set of coordinates of all pixel points corresponding to the first text, or the set of coordinates of the pixel points corresponding to the first text that lie on its edge, etc.).
In addition, text may appear at multiple locations in the same target video frame image, for example at the bottom of the picture, the top of the picture, the middle of the picture, the top left corner, the top right corner, the bottom left corner, and so on. OCR recognition of such a target video frame image yields a plurality of first texts, each corresponding to a different position in the image. For example, if OCR recognition of target video frame image A yields a text B1 located at the top of the picture and a text B2 located at the bottom of the picture, the target video frame image contains two first texts, namely text B1 and text B2; the text position information of first text B1 indicates the top of the picture, and the text position information of first text B2 indicates the bottom of the picture.
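As an illustrative sketch only (the patent does not prescribe a particular OCR engine), step 12 could be realized with the open-source pytesseract wrapper roughly as follows; the function name, the language setting and the grouping of recognized words into text lines are assumptions rather than part of the described method.
```python
import pytesseract
from pytesseract import Output

def recognize_frame(frame_image, lang="chi_sim+eng"):
    """Run OCR on one target video frame image and return (text, box) pairs,
    where box is (left, top, width, height) in pixels."""
    data = pytesseract.image_to_data(frame_image, lang=lang, output_type=Output.DICT)
    lines = {}
    for i, word in enumerate(data["text"]):
        if not word.strip():
            continue  # skip empty OCR cells
        key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
        box = (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
        lines.setdefault(key, []).append((word, box))
    results = []
    for words in lines.values():
        text = "".join(w for w, _ in words)  # simplistic join of the words of one text line
        left = min(b[0] for _, b in words)
        top = min(b[1] for _, b in words)
        right = max(b[0] + b[2] for _, b in words)
        bottom = max(b[1] + b[3] for _, b in words)
        results.append((text, (left, top, right - left, bottom - top)))
    return results
```
Each returned pair corresponds to one first text and its text position information (here, a bounding box).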
In step 13, a subtitle region of the target video is determined based on each text position information.
The subtitle region of the target video is determined according to the positions indicated by the respective pieces of text position information. For example, the subtitle region may be a bar-shaped region at the bottom of the target video frame images.
In step 14, a second text is determined based on the subtitle region, the first text, and text position information of the first text.
Wherein the second text is taken from the first text whose position indicated by the text position information is within the subtitle region. That is, the text that appears in the caption area is extracted from the target video and used as the text corpus.
In step 15, the second text is determined as a text corpus of the target video.
According to the above technical solution, a plurality of target video frame images are obtained from the target video, and OCR (optical character recognition) is performed on the plurality of target video frame images, so that the first text contained in each target video frame image and the text position information of the first text are determined; the caption area of the target video is determined according to the text position information, a second text is determined according to the caption area, the first text and the text position information of the first text, and the second text is determined to be the text corpus of the target video. In this way, by performing OCR recognition on images in the video, the most likely position of the subtitles, namely the caption area, is determined from the recognition result, and the usable text corpus is extracted from that area, so that the text corpus can be obtained from the video automatically and efficiently. Moreover, because the text content appearing in the same video is strongly correlated, there is little noise data in the collected text corpus, the quality of the text corpus is higher, and a language model trained with this text corpus also has higher quality.
In order to enable those skilled in the art to better understand the technical solutions provided by the embodiments of the present disclosure, the corresponding steps are described in detail below.
First, in step 11, a plurality of target video frame images are acquired from a target video.
In one possible implementation, all video frame images in the target video may be taken as target video frame images, i.e., data collection may be performed based on all video frame images of the target video. In this way, the text corpus in the target video can be collected more comprehensively.
In another possible embodiment, step 11 may comprise the steps of:
and performing frame extraction processing on the target video according to a preset time interval to obtain a plurality of target video frame images.
For example, the preset time interval may be 0.5s, that is, a video frame image is extracted from the target video every 0.5s as the target video frame image.
In this way, frame extraction is performed on the target video at the preset time interval to obtain the plurality of target video frame images, so that little of the data in the target video is missed while the computational load of the subsequent data collection process is reduced.
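A minimal sketch of this frame-extraction step, assuming OpenCV is used to read the target video and a 0.5 s preset time interval as in the example above (neither is prescribed by the patent):
```python
import cv2

def extract_target_frames(video_path, interval_s=0.5):
    """Extract one frame from the target video every `interval_s` seconds."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back to a nominal rate if unknown
    step = max(int(round(fps * interval_s)), 1)
    frames = []  # list of (timestamp_in_seconds, frame_image)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append((index / fps, frame))
        index += 1
    cap.release()
    return frames
```
The timestamp recorded for each extracted frame is what later serves as the appearance time of the texts recognized in that frame.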
In step 13, the subtitle region of the target video is determined based on the text position information.
In one possible embodiment, step 13 may comprise the steps of:
determining the subtitle position according to the text position information;
and determining the subtitle region of the target video according to the subtitle position.
The subtitle position is the position that occurs most frequently among the positions indicated by the pieces of text position information, or the position whose proportion of occurrences among all positions exceeds a preset proportion threshold. For example, the preset proportion threshold may be 0.6.
If there is content in the target video that can be converted into text, for example the speech of a person in the video, it will be displayed in the target video in the form of subtitles, and in general the subtitle position within the same video is relatively fixed, that is, it is a position in the video where text frequently appears (for example, the bottom of the video picture). Since the text position information of each first text indicates the position where that first text appears in the target video frame image, the position with the largest number of occurrences among the positions indicated by the text position information of the first texts, or the position whose proportion of occurrences exceeds the preset proportion threshold, can be determined as the subtitle position.
After the subtitle position is determined, the subtitle region of the target video can be determined according to the subtitle position.
In one possible embodiment, the region corresponding to the subtitle position may be directly determined as the subtitle region of the target video.
In another possible embodiment, determining the subtitle region of the target video according to the subtitle position may include the steps of:
and determining an area corresponding to a preset range around the subtitle position as a subtitle area.
That is, after the subtitle position is determined, the range may be appropriately enlarged around the subtitle position, and the enlarged region used as the subtitle region, so as to cover more comprehensively the positions where the subtitle text may appear in the image. For example, starting from the subtitle position, the region is extended upward and downward by a certain multiple of the text height (for example, 1.5 times the text height), and the region corresponding to the extended position range is determined as the subtitle region.
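A hedged sketch of step 13 as just described, combining the proportion-threshold variant of the subtitle position with the 1.5-times-text-height enlargement; the vertical bucketing, the use of the box centre and the helper name are assumptions, not part of the patent.
```python
from collections import Counter

def determine_subtitle_region(boxes, ratio_threshold=0.6, bucket=10, expand=1.5):
    """boxes: (left, top, width, height) of every first text found by OCR.
    Returns (y_min, y_max) of the estimated subtitle region, or None."""
    if not boxes:
        return None
    # Quantize the vertical centre of each text box so that nearby positions
    # are counted as the same position.
    centers = [round((top + height / 2) / bucket) * bucket
               for _, top, _, height in boxes]
    counts = Counter(centers)
    subtitle_pos, count = counts.most_common(1)[0]  # most frequent position
    if count / len(centers) < ratio_threshold:
        return None  # no position is dominant enough to be treated as the subtitle position
    # Enlarge the region up and down by a multiple of the typical (median) text height.
    typical_height = sorted(box[3] for box in boxes)[len(boxes) // 2]
    return (subtitle_pos - expand * typical_height,
            subtitle_pos + expand * typical_height)
```
First texts whose boxes fall inside the returned vertical range are then treated as lying in the subtitle region.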
Step 14, in which the second text is determined according to the subtitle region, the first text and the text position information of the first text, is described in detail below.
In one possible embodiment, step 14 may comprise the steps of:
and determining the first text with the position indicated by the text position information being in the subtitle region as the second text.
In this way, all texts in the subtitle region are determined as the second text, so the text corpus obtained based on the second text has richer content.
In another possible embodiment, step 14 may comprise the steps of:
determining a first text with the position indicated by the text position information being in the subtitle region as a third text;
if there are a plurality of third texts whose first appearance times are adjacent and whose text similarity is greater than a preset similarity threshold, determining the third text with the longest text length as the second text;
if the preset characters exist in the third text, deleting the preset characters in the third text, and determining the text obtained after deleting the preset characters as the second text.
Each first text corresponds to a first appearance time in the target video. When the target video frame images are acquired from the target video, each target video frame image corresponds to the time at which it appears in the target video; therefore, the first text obtained from a target video frame image has a first appearance time equal to the time at which that image appears in the target video. For example, if target video frame image C appears in the target video at 0.5 s, the first appearance time of the first text D obtained from target video frame image C is also 0.5 s.
The first text whose position indicated by the text position information is within the subtitle region is determined as a third text; that is, the first texts are first screened by position against the subtitle region to obtain the third texts, and further screening is then performed on the third texts.
If there are a plurality of third texts whose first appearance times are adjacent and whose text similarity is greater than the preset similarity threshold, their contents are similar and they come from adjacent target video frames, i.e., they partly contain identical content, so they can be de-duplicated to a certain extent. The third text with the longest text length is therefore determined as the second text, which retains as much content as possible.
A preset character is a character of low usefulness for the current data collection. For example, when data collection is performed for a particular language, characters that do not belong to that language can be set as preset characters; if data collection is performed for Chinese and English, characters other than Chinese characters, English letters and Arabic numerals may be set as preset characters. If preset characters exist in a third text, characters of low usefulness for the current data collection exist in that text, so the preset characters in the third text can be deleted, and the text obtained after deleting them is determined as the second text.
In this way, after the position-based screening using the subtitle region, de-duplication and character deletion are further performed on the screening result, so that second texts of better quality are obtained, and the text corpus obtained from these second texts has higher quality.
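The further screening described above could look roughly like the following sketch, assuming data collection targets Chinese and English (so all other characters are preset characters), a difflib similarity ratio with a 0.8 threshold as the text-similarity measure, and that the third texts are supplied in order of first appearance time; all of these specifics are assumptions rather than requirements of the patent.
```python
import re
from difflib import SequenceMatcher

# Everything except Chinese characters, English letters, Arabic numerals and spaces.
PRESET_CHARS = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9 ]")

def screen_third_texts(third_texts, sim_threshold=0.8):
    """third_texts: list of (first_appearance_time, text), sorted by time.
    Returns the second texts after de-duplication and preset-character deletion."""
    second_texts = []
    group = []
    for t, text in third_texts:
        if group and SequenceMatcher(None, group[-1][1], text).ratio() > sim_threshold:
            group.append((t, text))  # similar to the previous, adjacent text: same group
        else:
            if group:
                second_texts.append(max(group, key=lambda item: len(item[1])))
            group = [(t, text)]
    if group:
        second_texts.append(max(group, key=lambda item: len(item[1])))
    # Delete preset characters from each retained text.
    return [(t, PRESET_CHARS.sub("", text)) for t, text in second_texts]
```
Keeping the longest member of each group of similar adjacent texts mirrors the de-duplication rule above.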
Fig. 2 is a flow chart of a data collection method provided in accordance with another embodiment of the present disclosure. As shown in fig. 2, the method of the present disclosure may further include the following steps on the basis of the steps shown in fig. 1, wherein the relevant descriptions about steps 11 to 15 are already given above and are not repeated here.
In step 21, the text corpus is divided, and a target corpus fragment is obtained.
From the foregoing, it can be seen that the text corpus is composed of a plurality of second texts, and each second text corresponds to a second occurrence time in the target video. When the target video frame images are acquired from the target video, each target video frame image corresponds to the time at which it appears in the target video, so the first appearance time of a first text obtained from a target video frame image is the same as the time at which that image appears in the target video; correspondingly, a second text obtained from a first text has a second occurrence time equal to the first appearance time of that first text. For example, if target video frame image C appears in the target video at 0.5 s, the first appearance time of the first text D obtained from image C is also 0.5 s, and the second occurrence time of the second text E obtained from the first text D is also 0.5 s.
In one possible embodiment, every N adjacent second texts in the text corpus may be taken as a group, so as to obtain multiple groups of second texts. Each group of second texts can be used as a target corpus fragment for which the subsequent steps provided by the disclosure are executed, as sketched below. N is a positive integer greater than or equal to 1; illustratively, N may be 2. In this way, the amount of calculation in the data processing can be reduced.
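As a sketch of the grouping just described (N = 2 and a simple list-of-tuples representation of the second texts, both assumptions):
```python
def split_into_fragments(second_texts, n=2):
    """second_texts: list of (second_occurrence_time, text) in time order.
    Groups every n adjacent second texts into one target corpus fragment."""
    return [second_texts[i:i + n] for i in range(0, len(second_texts), n)]
```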
In step 22, according to the second occurrence time of the second text included in the target corpus fragment in the target video, the initial start time and the initial end time corresponding to the target corpus fragment in the target video are determined.
In one possible implementation, if the target corpus fragment contains only one second text, then, based on the second occurrence time of that second text in the target video, the time that is a first preset duration before the second occurrence time may be taken as the initial start time, and the time that is a second preset duration after the second occurrence time may be taken as the initial end time. The first preset duration and the second preset duration are each greater than or equal to 0, and typically they are not both 0. For example, the first preset duration may be 0.5 s, and the second preset duration may likewise be 0.5 s.
In one possible embodiment, if the target corpus segment includes a plurality of second texts, a second occurrence time of a second text that occurs earliest in the target video may be taken as an initial start time, and a second occurrence time of a second text that occurs latest in the target video may be taken as an initial end time.
In step 23, a target audio clip is obtained from the target video with the initial start time as a time start point and the initial end time as a time end point.
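Steps 22 and 23 might be sketched as follows, assuming the 0.5 s preset durations from the example above and pydub (backed by ffmpeg) for cutting the audio out of the video; the helper names and data layout are illustrative assumptions.
```python
from pydub import AudioSegment

def initial_times(fragment, pre_s=0.5, post_s=0.5):
    """fragment: list of (second_occurrence_time, text).
    Returns the (initial_start, initial_end) times in seconds for step 22."""
    times = [t for t, _ in fragment]
    if len(times) == 1:
        return max(times[0] - pre_s, 0.0), times[0] + post_s
    return min(times), max(times)

def extract_audio_clip(video_path, start_s, end_s):
    """Step 23: cut the target audio clip out of the target video's audio track."""
    audio = AudioSegment.from_file(video_path)            # needs ffmpeg for video containers
    return audio[int(start_s * 1000):int(end_s * 1000)]   # pydub slices by milliseconds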
In step 24, a preset time alignment tool is used to time align the target corpus segment and the target audio segment, so as to obtain a target starting time and a target ending time corresponding to the target corpus segment in the target video.
Illustratively, the preset time alignment tool is an aligner (forced-alignment) tool. Accordingly, step 24 may include the following step:
setting a beam parameter of the alignment tool as a first value, aligning a target corpus fragment and a target audio fragment for the first time, setting the beam parameter of the alignment tool as a second value, and aligning the target corpus fragment and the target audio fragment for the second time to obtain a target starting time and a target ending time corresponding to the target corpus fragment in a target video.
Wherein the first value is less than the second value. The first value may be, for example, 3. For another example, the second value may be 30.
In the alignment process, the first alignment sets the beam parameter to the smaller first value, which ensures the accuracy of the one-to-one correspondence between the characters in the target corpus fragment and the audio in the target audio fragment, i.e., the correspondence between individual characters and the audio. The second alignment sets the beam parameter to the larger second value, which ensures the accuracy of the head and tail alignment of the target corpus fragment and the target audio fragment, i.e., the accuracy of the time points at which the target corpus fragment appears and disappears in the target audio fragment.
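Since the patent does not name a concrete aligner, the two-pass procedure is sketched below against a hypothetical forced-alignment callable align(text, audio, beam=...) that returns per-character (character, start, end) timestamps; that interface and the handling of its output are assumptions, while the example beam values 3 and 30 come from the description above.
```python
def two_pass_alignment(corpus_fragment_text, audio_clip, align, first_beam=3, second_beam=30):
    """Time-align a target corpus fragment with its target audio clip in two passes.

    `align` is a hypothetical forced-alignment callable returning a list of
    (character, start_s, end_s) tuples for the given text and audio.
    """
    # Pass 1: a small beam tightens the one-to-one correspondence between
    # individual characters and the audio.
    char_alignment = align(corpus_fragment_text, audio_clip, beam=first_beam)
    # Pass 2: a larger beam improves the accuracy of the fragment's overall
    # start and end points within the audio clip.
    boundary_alignment = align(corpus_fragment_text, audio_clip, beam=second_beam)
    target_start = boundary_alignment[0][1]
    target_end = boundary_alignment[-1][2]
    return char_alignment, (target_start, target_end)
```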
In this way, after the text corpus of the target video is obtained, the content in the text corpus and the corresponding audio in the target video can be aligned automatically, and the tedious step of manual labeling is omitted, so that mutually corresponding audio and text can be obtained more efficiently and quickly put to use in speech recognition.
Fig. 3 is a block diagram of a data collection device provided in accordance with one embodiment of the present disclosure, as shown in fig. 3, the device 30 may include:
a first obtaining module 31, configured to obtain a plurality of target video frame images from a target video;
a recognition module 32, configured to perform OCR recognition on the plurality of target video frame images to determine a first text included in each of the target video frame images and text position information of the first text, where the text position information of the first text is used to indicate the position where the first text appears in the target video frame image;
a first determining module 33, configured to determine a subtitle region of the target video according to each text position information;
a second determining module 34, configured to determine a second text according to the caption area, the first text, and text position information of the first text, where the second text is taken from the first text whose position indicated by the text position information is in the caption area;
a third determining module 35, configured to determine the second text as a text corpus of the target video.
Optionally, the first obtaining module 31 is configured to perform frame extraction processing on the target video at a preset time interval, so as to obtain the plurality of target video frame images.
Optionally, the first determining module 33 includes:
the first determining submodule is used for determining a subtitle position according to each piece of text position information, wherein the subtitle position is the position with the largest occurrence number among the positions indicated by each piece of text position information, or the subtitle position is the position with the occurrence number exceeding a preset proportion threshold value among the positions indicated by each piece of text position information;
and the second determining submodule is used for determining the subtitle region of the target video according to the subtitle position.
Optionally, the second determining submodule is configured to determine an area corresponding to the preset range around the subtitle position as the subtitle area.
Optionally, the second determining module 34 includes:
and a third determining sub-module, configured to determine, as the second text, a first text whose position indicated by text position information is in the subtitle region.
Optionally, each first text corresponds to a first appearance time in the target video;
the second determining module 34 includes:
a fourth determining sub-module, configured to determine, as a third text, a first text whose position indicated by text position information is within the subtitle region;
a fifth determining submodule, configured to determine, if there are multiple third texts adjacent to the first occurrence time and having text similarity greater than a preset similarity threshold, a third text with a longest text length as the second text;
and a sixth determining submodule, configured to delete a preset character in a third text if the preset character exists in the third text, and determine a text obtained after deleting the preset character as the second text.
Optionally, the text corpus is composed of a plurality of second texts, and each second text corresponds to a second appearance time in the target video;
the device 30 further comprises:
the dividing module is used for dividing the text corpus to obtain a target corpus fragment;
a fourth determining module, configured to determine an initial start time and an initial end time corresponding to a target corpus fragment in the target video according to a second occurrence time of a second text included in the target corpus fragment in the target video;
the second acquisition module is used for acquiring a target audio fragment from the target video by taking the initial starting moment as a time starting point and the initial ending moment as a time ending point;
the alignment module is used for carrying out time alignment on the target corpus fragment and the target audio fragment by utilizing a preset time alignment tool so as to obtain a target starting time and a target ending time corresponding to the target corpus fragment in the target video.
Optionally, the preset time alignment tool is an aligner tool;
the alignment module is configured to set a beam parameter of the aligner tool to a first value and perform a first alignment of the target corpus fragment and the target audio fragment, and then set the beam parameter of the aligner tool to a second value and perform a second alignment of the target corpus fragment and the target audio fragment, so as to obtain a target start time and a target end time corresponding to the target corpus fragment in the target video, where the first value is smaller than the second value.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Referring now to fig. 4, a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 4 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 4, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
In general, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, magnetic tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 shows an electronic device 600 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via communication means 609, or from storage means 608, or from ROM 602. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 601.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a plurality of target video frame images from a target video; performing OCR (optical character recognition) on the plurality of target video frame images to determine a first text contained in each target video frame image and text position information of the first text, wherein the text position information of the first text is used for indicating the position of the first text in the target video frame image; determining a caption area of the target video according to the text position information; determining a second text according to the caption area, the first text and the text position information of the first text, wherein the second text is taken from the first text of which the position indicated by the text position information is in the caption area; and determining the second text as a text corpus of the target video.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or hardware. The name of the module is not limited to the module itself in some cases, and for example, the first acquisition module may also be described as "a module that acquires a plurality of target video frame images from a target video".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a data collection method, the method comprising:
acquiring a plurality of target video frame images from a target video;
performing OCR (optical character recognition) on the plurality of target video frame images to determine a first text contained in each target video frame image and text position information of the first text, wherein the text position information of the first text is used for indicating the position of the first text in the target video frame image;
determining a caption area of the target video according to the text position information;
determining a second text according to the caption area, the first text and the text position information of the first text, wherein the second text is taken from the first text of which the position indicated by the text position information is in the caption area;
and determining the second text as a text corpus of the target video.
According to one or more embodiments of the present disclosure, there is provided a data collection method, wherein the acquiring a plurality of target video frame images from a target video includes:
and performing frame extraction processing on the target video according to a preset time interval to obtain a plurality of target video frame images.
According to one or more embodiments of the present disclosure, there is provided a data collection method, wherein the determining a subtitle region of the target video according to each of the text position information includes:
determining a caption position according to each piece of text position information, wherein the caption position is the position with the largest occurrence frequency among the positions indicated by each piece of text position information, or the caption position is the position with the proportion of occurrence frequency in all positions exceeding a preset proportion threshold value among the positions indicated by each piece of text position information;
and determining a caption area of the target video according to the caption position.
According to one or more embodiments of the present disclosure, there is provided a data collection method, wherein the determining a subtitle region of the target video according to the subtitle position includes:
and determining an area corresponding to the preset range around the subtitle position as the subtitle area.
According to one or more embodiments of the present disclosure, there is provided a data collection method, wherein the determining a second text according to the subtitle region, the first text, and text position information of the first text includes:
and determining the first text whose position indicated by the text position information is in the subtitle region as the second text.
According to one or more embodiments of the present disclosure, there is provided a data collection method, wherein each of the first texts corresponds to a first occurrence time in the target video;
the determining a second text according to the subtitle region, the first text and the text position information of the first text includes:
determining a first text, the position of which is indicated by the text position information, in the subtitle region as a third text;
if a plurality of third texts which are adjacent to the first appearance time and have the text similarity larger than a preset similarity threshold exist, determining the third text with the longest text length as the second text;
if the preset characters exist in the third text, deleting the preset characters in the third text, and determining the text obtained after deleting the preset characters as the second text.
According to one or more embodiments of the present disclosure, there is provided a data collection method, wherein the text corpus is composed of a plurality of the second texts, and each of the second texts corresponds to a second occurrence time in the target video;
The method further comprises the steps of:
dividing the text corpus to obtain a target corpus fragment;
determining initial starting time and initial ending time corresponding to the target corpus fragment in the target video according to second occurrence time of a second text contained in the target corpus fragment in the target video;
taking the initial starting time as a time starting point and the initial ending time as a time ending point, and acquiring a target audio fragment from the target video;
and performing time alignment on the target corpus fragment and the target audio fragment by using a preset time alignment tool to obtain a target starting time and a target ending time of the target corpus fragment corresponding to the target video.
According to one or more embodiments of the present disclosure, there is provided a data collection method, wherein the preset time alignment tool is an aligner tool;
the step of performing time alignment on the target corpus segment and the target audio segment by using a preset time alignment tool to obtain a target starting time and a target ending time corresponding to the target corpus segment in the target video, including:
setting a beam parameter of the aligner tool to a first value and performing a first alignment of the target corpus fragment and the target audio fragment, and then setting the beam parameter of the aligner tool to a second value and performing a second alignment of the target corpus fragment and the target audio fragment, so as to obtain a target start time and a target end time corresponding to the target corpus fragment in the target video, wherein the first value is smaller than the second value.
According to one or more embodiments of the present disclosure, there is provided a data collection apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a plurality of target video frame images from the target video;
the recognition module is used for performing OCR (optical character recognition) on the plurality of target video frame images so as to determine a first text contained in each target video frame image and text position information of the first text, wherein the text position information of the first text is used for indicating the position of the first text in the target video frame image;
the first determining module is used for determining a caption area of the target video according to the text position information;
a second determining module, configured to determine a second text according to the subtitle region, the first text, and text position information of the first text, where the second text is taken from the first text whose position indicated by the text position information is in the subtitle region;
and the third determining module is used for determining the second text as the text corpus of the target video.
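As a rough illustration of how the first acquisition module and the first determining module could be realized, the sketch below extracts frames at a preset time interval with OpenCV and takes the most frequent vertical band of detected text as the subtitle region. OpenCV, the one-second interval, the 40-pixel band height, and the detection tuple layout are all assumptions made for illustration; the recognition module (OCR) is represented only by an injected list of detections.

```python
from collections import Counter

import cv2  # assumed available; the disclosure does not mandate OpenCV


def acquire_frames(video_path: str, interval_s: float = 1.0):
    """First acquisition module: frame extraction at a preset time interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(int(round(fps * interval_s)), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append((index / fps, frame))  # (first appearance time, image)
        index += 1
    cap.release()
    return frames


def subtitle_region(detections, band_height: int = 40):
    """First determining module: pick the most frequent vertical text position.

    detections: list of (text, (x, y, w, h)) produced by the recognition
    module over all frames. The y coordinate is bucketed into bands and the
    most frequent band is returned as (y_min, y_max).
    """
    if not detections:
        raise ValueError("no text detections to derive a subtitle region from")
    bands = Counter(int(y) // band_height for _, (_, y, _, _) in detections)
    top_band, _ = bands.most_common(1)[0]
    return top_band * band_height, (top_band + 1) * band_height
```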
According to one or more embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method of any embodiment of the present disclosure.
According to one or more embodiments of the present disclosure, there is provided an electronic device including:
a storage device having a computer program stored thereon;
a processing device, configured to execute the computer program in the storage device to carry out the steps of the method according to any embodiment of the present disclosure.
The foregoing description is merely a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to the specific combinations of features described above, but also covers other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, embodiments formed by replacing the features described above with technical features having similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims. The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be repeated here.

Claims (9)

1. A method of data collection, the method comprising:
acquiring a plurality of target video frame images from a target video;
performing OCR (optical character recognition) on the plurality of target video frame images to determine a first text contained in each target video frame image and text position information of the first text, wherein the text position information of the first text is used for indicating the position of the first text in the target video frame image, and each first text corresponds to a first appearance time in the target video;
determining a subtitle region of the target video according to each piece of the text position information;
determining a second text according to the subtitle region, the first text and the text position information of the first text, wherein the second text is taken from the first text whose position indicated by the text position information is within the subtitle region;
determining the second text as a text corpus of the target video;
the determining a second text according to the subtitle region, the first text and the text position information of the first text comprises: determining a first text whose position indicated by the text position information is within the subtitle region as the second text; or,
the determining a second text according to the subtitle region, the first text and the text position information of the first text comprises: determining a first text whose position indicated by the text position information is within the subtitle region as a third text; if there are a plurality of third texts that are adjacent in first appearance time and whose text similarity is greater than a preset similarity threshold, determining the third text with the longest text length as the second text; and if a preset character exists in the third text, deleting the preset character from the third text and determining the text obtained after the deletion as the second text.
2. The method of claim 1, wherein the obtaining a plurality of target video frame images from the target video comprises:
and performing frame extraction processing on the target video according to a preset time interval to obtain a plurality of target video frame images.
3. The method of claim 1, wherein the determining a subtitle region of the target video according to each piece of the text position information comprises:
determining a subtitle position according to each piece of the text position information, wherein the subtitle position is the position that occurs most frequently among the positions indicated by the pieces of text position information, or the subtitle position is a position whose proportion of occurrences among all the positions indicated by the pieces of text position information exceeds a preset proportion threshold;
and determining the subtitle region of the target video according to the subtitle position.
4. The method of claim 3, wherein the determining the subtitle region of the target video according to the subtitle position comprises:
determining an area within a preset range around the subtitle position as the subtitle region.
5. The method of claim 1, wherein the text corpus is composed of a plurality of the second texts, and each of the second texts corresponds to a second appearance time in the target video;
the method further comprises:
dividing the text corpus to obtain a target corpus fragment;
determining an initial starting time and an initial ending time of the target corpus fragment in the target video according to the second appearance time, in the target video, of the second text contained in the target corpus fragment;
acquiring a target audio fragment from the target video, with the initial starting time as the time starting point and the initial ending time as the time ending point;
and performing time alignment on the target corpus fragment and the target audio fragment by using a preset time alignment tool to obtain a target starting time and a target ending time of the target corpus fragment in the target video.
6. The method of claim 5, wherein the preset time alignment tool is an alignment tool;
the performing time alignment on the target corpus fragment and the target audio fragment by using the preset time alignment tool to obtain a target starting time and a target ending time of the target corpus fragment in the target video comprises:
setting a beam parameter of the alignment tool to a first value and performing a first alignment of the target corpus fragment and the target audio fragment, and setting the beam parameter of the alignment tool to a second value and performing a second alignment of the target corpus fragment and the target audio fragment, to obtain the target starting time and the target ending time of the target corpus fragment in the target video, wherein the first value is smaller than the second value.
7. A data collection device, the device comprising:
a first acquisition module, configured to acquire a plurality of target video frame images from a target video;
a recognition module, configured to perform OCR (optical character recognition) on the plurality of target video frame images to determine a first text contained in each target video frame image and text position information of the first text, wherein the text position information of the first text is used for indicating the position of the first text in the target video frame image, and each first text corresponds to a first appearance time in the target video;
a first determining module, configured to determine a subtitle region of the target video according to the text position information;
a second determining module, configured to determine a second text according to the subtitle region, the first text, and the text position information of the first text, wherein the second text is taken from the first text whose position indicated by the text position information is within the subtitle region;
and a third determining module, configured to determine the second text as the text corpus of the target video;
the second determining module comprises: a third determining sub-module, configured to determine a first text whose position indicated by the text position information is within the subtitle region as the second text; or,
the second determining module comprises: a fourth determining sub-module, configured to determine a first text whose position indicated by the text position information is within the subtitle region as a third text; a fifth determining sub-module, configured to determine, if there are a plurality of third texts that are adjacent in first appearance time and whose text similarity is greater than a preset similarity threshold, the third text with the longest text length as the second text; and a sixth determining sub-module, configured to delete, if a preset character exists in a third text, the preset character from the third text and determine the text obtained after the deletion as the second text.
8. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processing device, implements the steps of the method of any one of claims 1-6.
9. An electronic device, comprising:
a storage device having a computer program stored thereon;
a processing device, configured to execute the computer program in the storage device to carry out the steps of the method according to any one of claims 1-6.
CN202010230599.3A 2020-03-27 2020-03-27 Data collection method, device, storage medium and electronic equipment Active CN111445902B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010230599.3A CN111445902B (en) 2020-03-27 2020-03-27 Data collection method, device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010230599.3A CN111445902B (en) 2020-03-27 2020-03-27 Data collection method, device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN111445902A CN111445902A (en) 2020-07-24
CN111445902B true CN111445902B (en) 2023-05-30

Family

ID=71651230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010230599.3A Active CN111445902B (en) 2020-03-27 2020-03-27 Data collection method, device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111445902B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860389A (en) * 2020-07-27 2020-10-30 北京易真学思教育科技有限公司 Data processing method, electronic device and computer readable medium
CN112802469A (en) * 2020-12-28 2021-05-14 出门问问(武汉)信息科技有限公司 Method and device for acquiring training data of voice recognition model
CN112735476A (en) * 2020-12-29 2021-04-30 北京声智科技有限公司 Audio data labeling method and device
CN112822506A (en) * 2021-01-22 2021-05-18 百度在线网络技术(北京)有限公司 Method and apparatus for analyzing video stream
CN113450774B (en) * 2021-06-23 2024-05-31 网易(杭州)网络有限公司 Training data acquisition method and device
CN114598923A (en) * 2022-03-08 2022-06-07 北京达佳互联信息技术有限公司 Video character removing method and device, electronic equipment and storage medium
CN114627036B (en) * 2022-03-14 2023-10-27 北京有竹居网络技术有限公司 Processing method and device of multimedia resources, readable medium and electronic equipment
CN115396690A (en) * 2022-08-30 2022-11-25 京东方科技集团股份有限公司 Audio and text combination method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102331990A (en) * 2010-12-22 2012-01-25 四川大学 News video retrieving method based on subtitle extraction
CN102789451A (en) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 Individualized machine translation system, method and translation model training method
CN102802074A (en) * 2012-08-14 2012-11-28 海信集团有限公司 Method for extracting and displaying text messages from television signal and television
CN109858427A (en) * 2019-01-24 2019-06-07 广州大学 A kind of corpus extraction method, device and terminal device
CN110008378A (en) * 2019-01-28 2019-07-12 平安科技(深圳)有限公司 Corpus collection method, device, equipment and storage medium based on artificial intelligence
CN110427930A (en) * 2019-07-29 2019-11-08 中国工商银行股份有限公司 Multimedia data processing method and device, electronic equipment and readable storage medium storing program for executing

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5922255B2 (en) * 2011-12-28 2016-05-24 インテル コーポレイション Real-time natural language processing of data streams
US9418650B2 (en) * 2013-09-25 2016-08-16 Verizon Patent And Licensing Inc. Training speech recognition using captions
CN110557678B (en) * 2018-05-31 2022-05-03 北京百度网讯科技有限公司 Video processing method, device and equipment

Also Published As

Publication number Publication date
CN111445902A (en) 2020-07-24

Similar Documents

Publication Publication Date Title
CN111445902B (en) Data collection method, device, storage medium and electronic equipment
US11023716B2 (en) Method and device for generating stickers
CN110177295B (en) Subtitle out-of-range processing method and device and electronic equipment
CN113313064A (en) Character recognition method and device, readable medium and electronic equipment
US20230334880A1 (en) Hot word extraction method and apparatus, electronic device, and medium
CN110349161B (en) Image segmentation method, image segmentation device, electronic equipment and storage medium
CN113889113A (en) Sentence dividing method and device, storage medium and electronic equipment
CN112949430A (en) Video processing method and device, storage medium and electronic equipment
CN112954453B (en) Video dubbing method and device, storage medium and electronic equipment
CN111783632B (en) Face detection method and device for video stream, electronic equipment and storage medium
CN111626922B (en) Picture generation method and device, electronic equipment and computer readable storage medium
CN113407665A (en) Text comparison method, device, medium and electronic equipment
CN110674813B (en) Chinese character recognition method and device, computer readable medium and electronic equipment
CN110414625B (en) Method and device for determining similar data, electronic equipment and storage medium
CN113033552B (en) Text recognition method and device and electronic equipment
CN113033682B (en) Video classification method, device, readable medium and electronic equipment
CN111709342B (en) Subtitle segmentation method, device, equipment and storage medium
CN113936271A (en) Text recognition method and device, readable medium and electronic equipment
CN114697761B (en) Processing method, processing device, terminal equipment and medium
CN112307245B (en) Method and apparatus for processing image
CN112163433B (en) Key vocabulary matching method and device, electronic equipment and storage medium
CN114697763B (en) Video processing method, device, electronic equipment and medium
CN115209215B (en) Video processing method, device and equipment
US11792494B1 (en) Processing method and apparatus, electronic device and medium
CN114596529A (en) Video frame identification method and device, readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant