CN113066498A - Information processing method, apparatus and medium - Google Patents

Information processing method, apparatus and medium

Info

Publication number
CN113066498A
Authority
CN
China
Prior art keywords
acoustic
text
time information
content unit
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110307569.2A
Other languages
Chinese (zh)
Other versions
CN113066498B (en)
Inventor
胡其斌
董宗保
单益峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhangmen Science and Technology Co Ltd
Original Assignee
Shanghai Zhangmen Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhangmen Science and Technology Co Ltd filed Critical Shanghai Zhangmen Science and Technology Co Ltd
Priority to CN202110307569.2A priority Critical patent/CN113066498B/en
Publication of CN113066498A publication Critical patent/CN113066498A/en
Application granted granted Critical
Publication of CN113066498B publication Critical patent/CN113066498B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides an information processing method, device, and medium, relating to the technical field of data processing. The method includes the following steps: converting the speech corresponding to a subtitle into acoustic primitives, and determining the time information of the speech; determining the content units corresponding to the acoustic primitives according to a pre-established mapping relationship, wherein the content units are content units in the text corresponding to the subtitle, and the mapping relationship includes the mapping between acoustic primitives and content units; and constructing a text containing time information based on the time information and the content units, and constructing a subtitle file based on the text containing the time information. The generation efficiency of the subtitle file can be improved.

Description

Information processing method, apparatus and medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to an information processing method, device, and medium.
Background
When watching a video, a user can understand the video content through the subtitles on the video display interface. The video provider can record the video, make a subtitle file corresponding to the video, and synthesize the video and the subtitle file to obtain the video added with the subtitle.
In the related art, the subtitle file is produced by offline non-linear editing. However, this approach is inefficient.
Disclosure of Invention
The present disclosure provides an information processing method, apparatus, and medium.
According to an aspect of the present disclosure, there is provided an information processing method including: converting the speech corresponding to a subtitle into an acoustic primitive, and determining the time information of the speech; determining a content unit corresponding to the acoustic primitive according to a pre-established mapping relationship, wherein the content unit is a content unit in the text corresponding to the subtitle, and the mapping relationship includes the mapping between the acoustic primitive and the content unit; and constructing a text containing time information based on the time information and the content unit, and constructing a subtitle file based on the text containing the time information.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the above aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the above aspects.
According to the technical solution of the present disclosure, the speech corresponding to the subtitle is processed, and the subtitle file is constructed based on the acoustic primitives corresponding to the speech and the time information, which improves the generation efficiency of the subtitle file.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of an information processing method according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of an information processing method according to another embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating the establishment of a correspondence relationship between Chinese character information and associated information in an embodiment of the present disclosure;
FIG. 4 is a flow chart illustrating the conversion of speech into pinyin in an embodiment of the present disclosure;
fig. 5 is a schematic diagram of an electronic device for implementing any one of the information processing methods of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic flowchart of an information processing method according to an embodiment of the present disclosure, and as shown in fig. 1, the method includes:
101. Convert the speech corresponding to the subtitle into acoustic primitives, and determine the time information of the speech.
102. Determine the content unit corresponding to each acoustic primitive according to a pre-established mapping relationship, where the content unit is a content unit in the text corresponding to the subtitle, and the mapping relationship includes the mapping between acoustic primitives and content units.
103. Construct a text containing time information based on the time information and the content units, and construct a subtitle file based on the text containing the time information.
The execution subject of this embodiment may be a single-side entity, such as a terminal or a server. In particular, it may be implemented in the form of an application (APP) or a plug-in.
Taking the APP form as an example, the user installs an APP client on a mobile terminal and records the speech corresponding to the subtitle in the APP client; the APP client transmits the speech to the deployed APP server, and after receiving the speech, the APP server processes it to obtain the subtitle file.
Further, to make it easier for the user to record speech, the APP may also provide a prompting function, i.e. it may be called a teleprompter APP. If the text corresponding to the subtitles is called the lines text, the teleprompter APP has the functions of displaying the lines text and converting speech into a subtitle file.
In the embodiment of the disclosure, voice is converted into a subtitle file based on acoustic primitives.
Acoustic primitives are pronunciation units and may differ between language systems; for example, for Chinese the acoustic primitive may be the pinyin, and for Japanese it may be the kana (including katakana and hiragana).
A content unit is a unit of the text corresponding to the subtitle. Taking Chinese as an example, the content unit may be a Chinese character; for Japanese, the content unit may be the Japanese text corresponding to a kana.
For example, a mapping relationship between pinyin and Chinese characters may be pre-established; after the speech is acquired, it is converted into pinyin, and the pinyin is mapped to the corresponding Chinese character according to the mapping relationship. For example, if the speech corresponds to the character "啊", the corresponding pinyin is "a"; if a mapping exists between "a" and "啊" in the pinyin-to-character mapping, the Chinese character "啊" can be obtained from "a". It can be understood that keying the mapping by pinyin is only an example; for convenience of lookup, the mapping may instead be keyed by a numeric index corresponding to the pinyin. For example, if the current pinyin is "a", the numeric index corresponding to "a", e.g. "28", can be obtained first, and if "28" maps to "啊" in the mapping relationship, the corresponding Chinese character can be determined based on the numeric index of the pinyin.
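As a minimal sketch of the lookup described above (the dictionary contents beyond the "a" → "28" → "啊" example and the function name are illustrative assumptions, not taken from the patent), the pinyin can first be mapped to its numeric index and the index then mapped to the Chinese character from the lines text:

```python
# Minimal sketch of the two-step lookup: pinyin -> numeric index -> Chinese character.
PINYIN_TO_INDEX = {"a": "28"}
INDEX_TO_HANZI = {"28": "啊"}

def pinyin_to_hanzi(pinyin: str) -> str:
    index = PINYIN_TO_INDEX[pinyin]   # e.g. "a" -> "28"
    return INDEX_TO_HANZI[index]      # e.g. "28" -> "啊"

print(pinyin_to_hanzi("a"))  # 啊
```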
When converting speech into pinyin, various related techniques can be used, for example processing the speech with an existing acoustic model to obtain the corresponding pinyin. An existing acoustic model is, for example, a Recurrent Neural Network (RNN) + Connectionist Temporal Classification (CTC) structure.
Speech recognition, by contrast, recognizes speech directly as characters: the speech is input into a speech recognition model and characters are output. However, speech recognition models use more complex algorithms and are less accurate; if speech recognition were adopted, either a more complex algorithm would be needed or the accuracy would be low.
In the embodiment of the disclosure, processing based on acoustic primitives is relatively simple and accurate, so that implementation complexity can be reduced and accuracy can be improved.
Further, the content unit is determined from the acoustic primitive according to the pre-established mapping relationship between acoustic primitives and content units. For example, a mapping relationship between pinyin and Chinese characters may be pre-established, and the pinyin is mapped to Chinese characters based on it.
Determining the content unit corresponding to the acoustic primitive through the mapping relationship can further improve accuracy. In speech recognition, one pinyin may correspond to several Chinese characters; even if the pinyin is determined accurately, the wrong character may still be recognized because of this one-sound-many-characters problem. In the embodiment of the present disclosure, the Chinese character is determined based on the mapping relationship; since the characters in the mapping relationship are the characters of the text corresponding to the subtitle and the pinyin is the pinyin of those characters, the pinyin can be mapped directly to a character of the subtitle text, so the character determination is accurate. Speech recognition, in contrast, generates characters from pinyin with a complex algorithm.
When the user records the speech, the corresponding time can also be recorded. For example, when the user speaks "the weather is very good today", the time corresponding to "today" is recorded as the 1st second, the time corresponding to "good" as the 3rd second, and so on.
After the content units and the time information of the speech are acquired, a text containing time information can be constructed from them. The text can be constructed sentence by sentence: for each sentence composed of content units, the time information corresponding to that sentence is added to obtain a sentence containing time information, and the sentences containing time information are then arranged in time order to form the text containing time information.
For example, if the text corresponding to the subtitle is called the lines text and "the weather is very good today" is one sentence in it, the time information "1st to 3rd second" can be added to it. Whether the content units form a sentence can be determined from the punctuation in the lines text: if a comma follows "the weather is very good today" in the lines text, a sentence break can be placed after "good", so that the characters up to "good" form one sentence. Likewise, based on the comma, the character after "good" can be taken as the start of the next sentence; for example, if the character after "the weather is very good today" is "I", then "I" starts the next sentence. Similarly, "today" can be determined to be the start of this sentence from the end of the preceding sentence. It can be understood that sentence boundaries are not limited to punctuation in the lines text; silence detection on the speech can also be used, for example treating the text corresponding to the speech between two adjacent silence points as one sentence. Time information is added sentence by sentence.
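A minimal sketch of the punctuation-based segmentation described above (the function name and the punctuation set are assumptions; silence-based segmentation of the audio would be the alternative):

```python
# Split the lines text into sentences at Chinese/Latin sentence punctuation.
import re

def split_lines_text(lines_text):
    parts = re.split(r"[，。！？；,.!?;]", lines_text)
    return [p.strip() for p in parts if p.strip()]

print(split_lines_text("今天天气很好，我去打羽毛球。"))
# ['今天天气很好', '我去打羽毛球']
```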
By analogy, the user records speech until recording ends, and time information is added to each sentence. The sentences containing time information are then arranged in time order to form the text containing time information; when composing it, the sequence number of each sentence may also be included. For example, if the lines text includes "the weather is very good today" and "I go to play badminton", with time information "1st to 3rd second" and "3rd to 5th second", the following text containing time information may be composed:
1, 1-3 seconds
The weather is very good today
2, 3-5 seconds
I go to play badminton
...
And so on, until recording is finished.
After the text containing the time information is obtained, it can be stored as a file in a standard subtitle format, and that file is used as the subtitle file. For example, the text containing the time information is saved and the file's suffix is changed to a standard subtitle format, such as the srt format.
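As a minimal sketch (the helper names are assumptions, not the patent's routine) of assembling the time-stamped sentences into an srt file, each sentence becomes a numbered block with its time range, matching the example above:

```python
# Build srt text from (start_second, end_second, sentence) triples and save it.
def to_srt_time(seconds):
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    ms = int(round((seconds - int(seconds)) * 1000))
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def build_srt(entries):
    blocks = []
    for i, (start, end, text) in enumerate(entries, 1):
        blocks.append(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    return "\n".join(blocks)

entries = [(1, 3, "今天天气很好"), (3, 5, "我去打羽毛球")]
with open("subtitle.srt", "w", encoding="utf-8") as f:
    f.write(build_srt(entries))
```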
In this embodiment, the speech corresponding to the subtitle is converted into acoustic primitives, and the subtitle file is constructed from the acoustic primitives and the time information of the speech. No manual editing is needed; the subtitle file is generated automatically from the user's speech, so the subtitle file is generated efficiently.
In some embodiments, after the subtitle file is generated, it may further be synthesized with the corresponding video to obtain a video containing subtitles. Various related technologies may be used for the synthesis, for example rendering the characters of each sentence onto the image frames that correspond to that sentence's time information.
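One possible way to do this synthesis (ffmpeg is an illustrative choice, not a tool named above) is to burn the srt file into the video with ffmpeg's subtitles filter, which draws each sentence onto the frames within its time range:

```python
# Burn subtitle.srt into input.mp4; requires ffmpeg to be installed on the system.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vf", "subtitles=subtitle.srt", "output.mp4"],
    check=True,
)
```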
In the embodiment, the video containing the subtitle can be obtained by synthesizing the subtitle file and the video, so that a user can watch the video conveniently.
Fig. 2 is a schematic flow chart of an information processing method according to another embodiment of the present disclosure; this embodiment takes Chinese as an example. As shown in fig. 2, the method includes:
201. Acquire the lines text.
In this embodiment, the text corresponding to the subtitle to be added to a video is called the lines text. Specifically, this embodiment may provide an APP, which may be called a teleprompter APP. After the user opens the teleprompter APP and clicks the lines-text import button, an import interface is displayed, through which the user imports the lines text. Alternatively, the user may enter the lines text into the APP with an input method; for example, the APP provides an entry button, and after clicking it the user can type in the lines text. The difference between importing and entering is that importing applies when the lines text already exists, for example the user imports an existing lines text from local storage, whereas entering applies when no lines text exists yet and the user needs to enter the lines online.
202. Determine the correspondence between the Chinese character information and the associated information of each Chinese character in the lines text.
The Chinese character information may be a Chinese character code, such as a unicode code.
The associated information may be pinyin or a numeric index corresponding to the pinyin. In this embodiment, the associated information is taken as a numerical index as an example.
For example, if the lines text includes "the weather is very good today", the corresponding numeric index can be determined for each Chinese character ("today", etc.) to establish the correspondence between the Chinese character code and the numeric index.
The process of establishing the corresponding relationship between the Chinese character codes and the numerical indexes corresponding to each Chinese character can be seen in the following description of fig. 3.
203. Collect the speech, corresponding to the subtitle, entered by the user.
For example, the APP includes a record button, and after the user clicks it, the APP starts collecting the speech entered by the user.
Further, so that the subtitle file can be generated while shooting, the speech in the video may be collected while the user shoots the video.
Specifically, for example, after the user opens a video-shooting APP, the teleprompter APP of the embodiment of the present disclosure may also be opened and floated over the interface of the video-shooting APP; the floating window may be implemented with various related technologies. After clicking the record button of the video-shooting APP, the user can then click the record button of the teleprompter APP, so that while the video is shot, the speech in the video is collected by the teleprompter APP, which processes the speech to generate the subtitle file. In addition, the teleprompter APP can display the lines text, so that the user can speak according to the lines text while recording the video.
In this embodiment, the speech is collected while the video is shot, so the video can be shot and the speech in it converted into a subtitle file synchronously.
204. Record the time information of the speech.
205. Convert the speech into pinyin, and determine the associated information corresponding to the pinyin.
For example, if the text corresponding to the speech is "啊", the pinyin after conversion is "a".
Taking a numeric index as the associated information, and assuming that the numeric index corresponding to "a" is "28", the associated information corresponding to the pinyin can be determined to be "28".
In some embodiments, the speech may be converted to pinyin based on a deep neural network model.
The specific flow can be seen in the following description of fig. 4.
206. Look up the correspondence between the Chinese character information and the associated information to determine the Chinese character corresponding to the associated information.
For example, if the associated information is "28" and, in the correspondence between the Chinese character information and the numeric index, "28" corresponds to the unicode code of "啊", then the Chinese character corresponding to the associated information is "啊".
207. Add the time information to the sentences composed of the Chinese characters to obtain the text containing time information.
The time information can be added sentence by sentence, for example adding "1st to 3rd second" to "the weather is very good today".
208. Determine whether recording has ended; if so, execute 209, otherwise repeat 203 and the subsequent steps.
Recording can be judged to have ended when the user is detected to click the stop-recording button; otherwise, recording has not ended.
209. The text containing the time information is saved as a subtitle file in srt format.
After each sentence containing time information is obtained, the sentences are arranged in time order to form the text containing time information, and the format of the text is changed to srt to obtain a subtitle file in srt format.
In this embodiment, a subtitle file in srt format is taken as an example; the text may also be stored in other standard subtitle formats.
In the embodiment, the Chinese character is queried based on the pinyin numeric index, so that the query efficiency can be improved.
Fig. 3 shows a flow of establishing the correspondence between the Chinese character information and the associated information. In this embodiment, the Chinese character information is the unicode code of a Chinese character and the associated information is a numeric index. As shown in fig. 3, establishing the correspondence includes:
301. Convert the Chinese characters into unicode codes.
Here the Chinese characters are the characters of the lines text.
302. Determine the pinyin of each Chinese character, and establish the correspondence between the unicode code of the character and its pinyin.
303. Determine the correspondence between the pinyin of the Chinese character and the numeric index, and establish the correspondence between the unicode code of the character and the numeric index from the correspondence between the unicode code and the pinyin together with the correspondence between the pinyin and the numeric index.
A unique numeric index can be defined for each pinyin; for example, if the numeric index corresponding to "a" is defined as 28, a correspondence between the unicode code of "啊" and 28 can be established.
Further, if the pinyin of a Chinese character contains several letters, the numeric indexes of the letters can be concatenated. For example, the pinyin of "我" consists of w and o; if the numeric index of w is 12 and that of o is 18, the numeric index corresponding to "我" is "1218". In addition, the numeric index is not limited to the pinyin letters and may also include the tone: "我" carries the third tone, a unique numeric index can also be defined for each tone, and the index of the character's pinyin letters concatenated with the index of its tone serves as the numeric index of the character.
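A minimal sketch of this concatenation scheme (the letter indexes for w, o and a are the ones used in the examples above; the tone handling and all names are assumptions):

```python
# Concatenate the per-letter numeric indexes, optionally followed by a tone index.
LETTER_INDEX = {"w": "12", "o": "18", "a": "28"}

def numeric_index(pinyin_letters, tone_index=""):
    return "".join(LETTER_INDEX[ch] for ch in pinyin_letters) + tone_index

print(numeric_index("wo"))  # "1218", the index given for 我 in the example above
```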
In this embodiment, the correspondence between the Chinese character information and the associated information is established through the pinyin, which makes it easy to later find the text corresponding to the recognized pinyin.
Fig. 4 shows a process of converting speech into pinyin. As shown in fig. 4, the process may include:
401. Obtain a deep neural network model, where the deep neural network model is used to convert speech into pinyin.
For example, a deep neural network model may be trained in advance, with the model having speech as input and pinyin as output, to convert speech to pinyin based on the model.
Specifically, the model may be implemented with a Recurrent Neural Network (RNN) and Connectionist Temporal Classification (CTC), built and run with a deep learning framework such as Google's TensorFlow.
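A minimal sketch of such an acoustic model (vocabulary size, feature dimension, and layer widths are illustrative assumptions, not values from the patent): audio feature frames pass through bidirectional RNN layers, a per-frame distribution over pinyin tokens plus the CTC blank is produced, and a greedy CTC decode collapses repeats and blanks into the pinyin sequence.

```python
# Minimal RNN + CTC acoustic model sketch in TensorFlow/Keras (illustrative values).
import tensorflow as tf

NUM_PINYIN_TOKENS = 1300      # assumed size of the pinyin-with-tone vocabulary
FEATURE_DIM = 80              # assumed per-frame acoustic feature size (e.g. log-mel bins)
BLANK_ID = NUM_PINYIN_TOKENS  # last class is the CTC blank

def build_acoustic_model():
    frames = tf.keras.Input(shape=(None, FEATURE_DIM), name="audio_features")
    x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(256, return_sequences=True))(frames)
    x = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(256, return_sequences=True))(x)
    logits = tf.keras.layers.Dense(NUM_PINYIN_TOKENS + 1)(x)  # +1 for the CTC blank
    return tf.keras.Model(frames, logits)

def greedy_ctc_decode(logits):
    """Collapse repeated frame labels and drop blanks to get pinyin token ids."""
    ids = tf.argmax(logits, axis=-1).numpy()[0]
    out, prev = [], BLANK_ID
    for i in ids:
        if i != prev and i != BLANK_ID:
            out.append(int(i))
        prev = i
    return out

model = build_acoustic_model()  # would be trained with a CTC loss before use
```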
402. Input the speech into the deep neural network model and obtain the pinyin as output.
For example, after the speech for "今" is input into the deep neural network model, the output is "jin".
It can be understood that plain pinyin is only an example; in practical implementation, the pinyin may also include the tone. Correspondingly, when the recognized pinyin includes the tone, the pinyin in the pre-established correspondence also includes the tone.
In this embodiment, based on the deep neural network, the speech can be accurately converted into pinyin, which improves the accuracy of the subtitle file.
It is understood that the same or corresponding contents in different embodiments of the disclosure may be mutually referred to, and the contents which are not described in detail in one embodiment may be referred to the related contents in other embodiments.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the electronic device 500 includes a computing unit 501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 502 or a computer program loaded from a storage unit 508 into a Random Access Memory (RAM) 503. The RAM 503 can also store various programs and data required for the operation of the electronic device 500. The computing unit 501, the ROM 502, and the RAM 503 are connected to each other by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the electronic device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the electronic device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 501 executes the respective methods and processes described above, such as the information processing method. For example, in some embodiments, the information processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the information processing method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the information processing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that overcomes the defects of difficult management and weak service scalability of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (12)

1. An information processing method characterized by comprising:
converting the speech corresponding to the subtitle into an acoustic primitive, and determining time information of the speech;
determining a content unit corresponding to the acoustic primitive according to a pre-established mapping relation, wherein the content unit is a content unit in a text corresponding to the subtitle, and the mapping relation comprises the mapping relation between the acoustic primitive and the content unit;
and constructing a text containing time information based on the time information and the content unit, and constructing a subtitle file based on the text containing the time information.
2. The method of claim 1, wherein converting the speech corresponding to the subtitle into the acoustic primitive comprises:
converting the speech corresponding to the subtitle into the acoustic primitive by using a pre-established deep neural network model, wherein the deep neural network model is used for converting speech into acoustic primitives.
3. The method according to claim 1, wherein the mapping relationship is a corresponding relationship between text information and associated information, and the determining a content unit corresponding to an acoustic primitive according to a pre-established mapping relationship between the acoustic primitive and the content unit comprises:
determining associated information based on the acoustic primitive;
searching for content unit information corresponding to the associated information in a pre-established correspondence between text information and associated information;
and determining the content unit corresponding to the content unit information as the content unit corresponding to the acoustic primitive.
4. The method of claim 3, wherein the content unit information comprises a character code of the content unit, the method further comprising:
converting each character in the text corresponding to the subtitle into character codes;
acquiring an acoustic primitive of each character, and determining associated information corresponding to the acoustic primitive;
and establishing a corresponding relation between the character codes and the associated information.
5. The method of claim 3, wherein the associated information is: the acoustic primitive of the character, or the numeric index corresponding to the acoustic primitive of the character.
6. The method of claim 1, further comprising:
collecting, when a video is shot, the speech in the video as the speech corresponding to the subtitle.
7. The method of claim 1, wherein constructing the text containing the time information based on the time information and the content unit comprises:
for a sentence composed of the content units, adding the time information corresponding to the sentence to obtain a sentence containing time information;
and arranging the sentences containing time information in order according to their time sequence to form the text containing the time information.
8. The method of claim 1, further comprising:
and synthesizing the subtitle file and the video corresponding to the subtitle file to obtain the video containing the subtitle.
9. The method of any of claims 1-8, wherein the acoustic primitive is the pinyin of a Chinese character and the content unit is a Chinese character.
10. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-9.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-9.
CN202110307569.2A 2021-03-23 2021-03-23 Information processing method, apparatus and medium Active CN113066498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110307569.2A CN113066498B (en) 2021-03-23 2021-03-23 Information processing method, apparatus and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110307569.2A CN113066498B (en) 2021-03-23 2021-03-23 Information processing method, apparatus and medium

Publications (2)

Publication Number Publication Date
CN113066498A (en) 2021-07-02
CN113066498B (en) 2022-12-30

Family

ID=76563041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110307569.2A Active CN113066498B (en) 2021-03-23 2021-03-23 Information processing method, apparatus and medium

Country Status (1)

Country Link
CN (1) CN113066498B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010011988A (en) * 1999-07-31 2001-02-15 박규진 A learning method using a digital audio with caption data
CN102739985A (en) * 2011-05-06 2012-10-17 新奥特(北京)视频技术有限公司 Method for grouping and sequencing character string in user interface
US20140003792A1 (en) * 2012-06-29 2014-01-02 Kourosh Soroushian Systems, methods, and media for synchronizing and merging subtitles and media content
WO2015196115A1 (en) * 2014-06-20 2015-12-23 Google Inc. Clarifying audible verbal information in video content
JP2017040806A (en) * 2015-08-20 2017-02-23 株式会社フェイス Subtitle production device and subtitle production method
US20180061439A1 (en) * 2016-08-31 2018-03-01 Gregory Frederick Diamos Automatic audio captioning
CN106331893A (en) * 2016-08-31 2017-01-11 科大讯飞股份有限公司 Real-time subtitle display method and system
US20180143956A1 (en) * 2016-11-18 2018-05-24 Microsoft Technology Licensing, Llc Real-time caption correction by audience
CN109246472A (en) * 2018-08-01 2019-01-18 平安科技(深圳)有限公司 Video broadcasting method, device, terminal device and storage medium
US20210067836A1 (en) * 2019-08-29 2021-03-04 Snap Inc. Subtitle splitter
CN110798733A (en) * 2019-10-30 2020-02-14 中央电视台 Subtitle generating method and device, computer storage medium and electronic equipment
CN111901615A (en) * 2020-06-28 2020-11-06 北京百度网讯科技有限公司 Live video playing method and device
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑李磊 (Zheng Lilei) et al.: "全自动中文新闻字幕生成***的设计与实现" [Design and Implementation of a Fully Automatic Chinese News Subtitle Generation System], 《电子学报》 (Acta Electronica Sinica), 15 March 2011 (2011-03-15) *

Also Published As

Publication number Publication date
CN113066498B (en) 2022-12-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant