CN112653932B - Subtitle generating method, device, equipment and storage medium for mobile terminal - Google Patents

Subtitle generating method, device, equipment and storage medium for mobile terminal

Info

Publication number
CN112653932B
Authority
CN
China
Prior art keywords
input
voice
video
user
mobile terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011497650.3A
Other languages
Chinese (zh)
Other versions
CN112653932A (en)
Inventor
董晓飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011497650.3A priority Critical patent/CN112653932B/en
Publication of CN112653932A publication Critical patent/CN112653932A/en
Application granted granted Critical
Publication of CN112653932B publication Critical patent/CN112653932B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278Subtitling

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The disclosure provides a subtitle generating method, apparatus, device, and storage medium for a mobile terminal, relating to the field of artificial intelligence and, in particular, to the technical fields of speech recognition and natural language processing. The method includes the following steps: acquiring input speech collected by an audio data acquisition device; converting the input speech into input text using a speech recognition model; acquiring a time axis configured for an input video; and adding the input text to a timeline segment of the input video selected by the user. This provides a way to add subtitles to a video on the mobile terminal which, compared with adding subtitles on a personal computer using professional software tools, saves learning cost and simplifies the subtitle adding process.

Description

Subtitle generating method, device, equipment and storage medium for mobile terminal
Technical Field
The present disclosure relates to the field of computer technologies, in particular to artificial intelligence technologies such as speech recognition and natural language processing, and specifically to a subtitle generating method, apparatus, device, and storage medium for a mobile terminal.
Background
With the further development of the mobile internet, building and providing more high-quality content has become important in the current content-driven landscape. With users as major content producers, user-generated content ecosystems keep springing up like bamboo shoots after rain, and among them video carries content better than graphics, audio, and the like. However, professional video editing and audio processing suffer from high learning costs, a steep barrier to entry for individual users, and long production times, all of which can dampen the enthusiasm and ideas of user creators. Especially in the mobile internet era, most users operate nothing more than a mobile phone and have no more professional equipment for post-processing.
Disclosure of Invention
The present disclosure provides a subtitle generating method, apparatus, device, and storage medium for a mobile terminal.
According to a first aspect of the present disclosure, there is provided a subtitle generating method for a mobile terminal, including: acquiring input speech collected by an audio data acquisition device; converting the input speech into input text using a speech recognition model; acquiring a time axis configured for an input video; and adding the input text to a timeline segment of the input video selected by the user.
According to a second aspect of the present disclosure, there is provided a subtitle generating apparatus for a mobile terminal, including: a first acquisition module configured to acquire input speech collected by an audio data acquisition device; a conversion module configured to convert the input speech into input text using a speech recognition model; a second acquisition module configured to acquire a time axis configured for an input video; and an adding module configured to add the input text to a timeline segment of the input video selected by the user.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as described in any one of the implementations of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, performs a method as described in any of the implementations of the first aspect.
The subtitle generating method, apparatus, device, and storage medium for a mobile terminal provided by the present disclosure first acquire input speech collected by an audio data acquisition device; then convert the input speech into input text using a speech recognition model; then acquire a time axis configured for the input video; and finally add the input text to the timeline segment of the input video selected by the user. This provides a way to add subtitles to a video on the mobile terminal which, compared with adding subtitles on a personal computer using professional software tools, saves learning cost and simplifies the subtitle adding process.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present application and are not to be construed as limiting it. In the drawings:
Fig. 1 is an exemplary system architecture diagram to which the present application may be applied;
fig. 2 is a flow chart illustrating an embodiment of a subtitle generating method for a mobile terminal according to the present application;
fig. 3 is a flowchart illustrating another embodiment of a subtitle generating method for a mobile terminal according to the present application;
fig. 4 is an application scenario diagram of an embodiment of a subtitle generating method for a mobile terminal according to the present application;
fig. 5 is a schematic structural view of an embodiment of a subtitle generating apparatus for a mobile terminal according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a subtitle generating method for a mobile terminal according to an embodiment of the present application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the present application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which an embodiment of the subtitle generating method for a mobile terminal or the subtitle generating apparatus for a mobile terminal of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include a mobile terminal 101, a network 102, and a server 103. The network 102 is the medium used to provide communication links between the mobile terminal 101 and the server 103. The network 102 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
The mobile terminal 101 may interact with the server 103 through the network 102. Mobile terminals 101 include, but are not limited to, smartphones, tablets, and the like. The server 103 may provide various services; for example, it may perform processing such as online speech recognition on data such as user input speech acquired from the mobile terminal 101 and generate processing results (for example, converting the user's input speech into input text).
The server 103 may be hardware or software. When the server 103 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 103 is software, it may be implemented as a plurality of software or software modules (for example, to provide distributed services), or may be implemented as a single software or software module. The present application is not particularly limited herein.
It should be noted that the subtitle generating method for a mobile terminal according to the embodiment of the present application is generally executed by the mobile terminal 101; accordingly, the subtitle generating apparatus for a mobile terminal is generally disposed in the mobile terminal 101.
It should be understood that the number of mobile terminals, networks and servers in fig. 1 is merely illustrative. There may be any number of mobile terminals, networks and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of the subtitle generating method for a mobile terminal according to the present application is shown. The method comprises the following steps:
Step S201, input speech collected by an audio data acquisition device is acquired.
In this embodiment, the execution body of the subtitle generating method for a mobile terminal (e.g., the mobile terminal 101 shown in fig. 1) may acquire input speech collected by an audio data acquisition device.
The audio data acquisition device may be installed in the mobile terminal 101. The input speech may be sound captured or intercepted by audio processing software, for example by stripping the speech from a video or by intercepting a segment of sound from an audio file. Stripping the speech from a video may involve image-sound data separation processing of the video, thereby extracting continuous image data and continuous sound data respectively.
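As an illustration of the image-sound separation just described, the following is a minimal sketch that strips the sound data from a video using the ffmpeg command-line tool (the tool choice, file names, and audio format are assumptions; the disclosure does not prescribe a particular implementation):

```python
import subprocess

def strip_speech_from_video(video_path: str, audio_path: str) -> None:
    """Separate the sound data from a video file, keeping only the audio track."""
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vn",                       # drop the video (image) stream
         "-acodec", "pcm_s16le",      # decode the audio to 16-bit PCM
         "-ar", "16000", "-ac", "1",  # 16 kHz mono, a common input format for speech recognition
         audio_path],
        check=True,
    )

strip_speech_from_video("input_video.mp4", "input_speech.wav")
```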
Alternatively, the input speech may also be user speech recorded with a microphone. The microphone device may be integrated in the mobile terminal 101, and the execution body may collect the user's voice with the microphone device in response to the device being turned on. The user can then speak the content intended to become the subtitle; compared with simply typing the subtitle to be added by hand, this improves input speed and simplifies the operation flow, thereby improving the timeliness of user content.
Step S202, the input speech is converted into input text using a speech recognition model.
In this embodiment, the execution body may use a speech recognition model to convert the input speech into the input text.
The speech recognition model may be an offline speech recognition model or an online speech recognition model. A speech recognition model is mainly divided into three parts: a pronunciation dictionary, an acoustic model, and a language model. The pronunciation dictionary is constructed manually, while the acoustic model and the language model can be obtained by training with deep learning methods.
Automatic speech recognition (ASR) technology may be employed to convert the input speech into text. ASR is a technology for automatically converting speech into text; its pipeline consists of input, encoding, decoding, and output. Encoding converts the sound into a digital signal and extracts features from it; decoding takes the resulting feature vectors and converts them into text.
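As a concrete illustration of this speech-to-text step, here is a minimal sketch using the open-source SpeechRecognition package with an online recognition service (the package, service, and language setting are assumptions; the disclosure does not name a specific model):

```python
import speech_recognition as sr  # pip install SpeechRecognition

def speech_to_text(audio_path: str) -> str:
    """Convert input speech into input text via an online speech recognition service."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)  # encoding: read and digitize the sound
    # decoding: the service extracts features and converts them into text
    return recognizer.recognize_google(audio, language="zh-CN")

print(speech_to_text("input_speech.wav"))
```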
In step S203, a time axis configured for the input video is acquired.
In this embodiment, the execution body may acquire a time axis configured for the input video.
The video time axis can be used to connect video frames in chronological order to obtain the video. It can be understood that the time axis of a video is determined as the video is captured. In practice, the minimum unit on the time axis may be 1 second, that is, the interval between two adjacent time points on the time axis is 1 second. As an example, the starting point of the time axis may be 00:00 (0 minutes 0 seconds), the first time point 00:01 (0 minutes 1 second), the second time point 00:02 (0 minutes 2 seconds), the third time point 00:03 (0 minutes 3 seconds), and so on.
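For illustration, a timeline segment with the one-second granularity described above might be represented as follows (the class and field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class TimelineSegment:
    """A segment on the video time axis, bounded by two time points in whole seconds."""
    start: int  # e.g. 1 -> 00:01
    end: int    # e.g. 3 -> 00:03

    def as_timestamps(self):
        fmt = lambda seconds: f"{seconds // 60:02d}:{seconds % 60:02d}"
        return fmt(self.start), fmt(self.end)

print(TimelineSegment(start=1, end=3).as_timestamps())  # ('00:01', '00:03')
```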
Step S204, adding the input text to the timeline segment of the input video selected by the user.
In this embodiment, the execution body may add the input text to the timeline segment of the input video selected by the user.
The input video may be a video captured by the user using the mobile terminal 101, or video data acquired by the mobile terminal 101 through wireless transmission. Alternatively, after the user finishes capturing a video with the mobile terminal 101, the captured video may be processed and the processed video used as the input video. For example, the captured video may be edited, filtered, and so on.
The user can select a segment of the input video and drag the input text onto that segment, whereupon the execution body completes the operation of adding the input text to the segment.
Since each segment on the time axis of the input video has a start time and an end time, the input text corresponding to each segment correspondingly has a start display time and an end display time: the start time of the segment is taken as the start display time of the input text, and the end time of the segment as its end display time.
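One possible concrete form of this mapping is an SRT subtitle entry, in which the segment's start and end times become the text's display window (SRT is an assumption; the disclosure does not specify a subtitle format):

```python
def to_srt_entry(index: int, start_s: int, end_s: int, text: str) -> str:
    """Emit one SRT entry: the segment's start time is the text's start display
    time and the segment's end time is its end display time."""
    fmt = lambda s: f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d},000"
    return f"{index}\n{fmt(start_s)} --> {fmt(end_s)}\n{text}\n"

print(to_srt_entry(1, 1, 3, "hello"))
# 1
# 00:00:01,000 --> 00:00:03,000
# hello
```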
The subtitle generating method for a mobile terminal provided by this embodiment of the application offers a way to add subtitles to a video on the mobile terminal which, compared with adding subtitles on a personal computer using professional software tools, saves learning cost and simplifies the subtitle adding flow.
In some optional implementations of this embodiment, the input speech carries breakpoint identifiers made by the user at one or more time nodes of the input speech through a preset operation mode, and step S202 further includes: based on the breakpoint identifiers, intercepting the input speech to obtain a plurality of pieces of input text.
The preset operation mode refers to a behavior mode, such as clicking, set in the execution body for performing a breakpoint operation on the input speech. The user can perform a breakpoint operation at any time node of the input speech through this preset operation mode. Illustratively, in response to the microphone device being turned on, the user's voice is recorded through the microphone device, and the user performs breakpoint operations on the recording by clicking or the like; the recording itself is not stopped. It will be appreciated that a breakpoint in an audio recording is a tag on a point in time at which the user is inputting speech.
The execution body can intercept the input speech according to its breakpoint identifiers and divide the input speech into a plurality of speech sections. A breakpoint identifier may include a start identifier and an end identifier, where the start identifier corresponds to the start time of a given segment of input speech and the end identifier corresponds to its end time.
In this embodiment, through the breakpoint operation, the input speech may be divided into multiple speech segments, from which multiple segments of subtitle text can be recognized and generated, as sketched below.
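A minimal sketch of this interception, assuming breakpoints are recorded as time offsets in seconds into one continuous recording (the names are illustrative):

```python
def split_by_breakpoints(duration_s: float, breakpoints_s: list) -> list:
    """Divide one continuous recording into speech sections at the user's breakpoints.
    Each breakpoint ends the current section and starts the next; recording never stops."""
    bounds = [0.0] + sorted(breakpoints_s) + [duration_s]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]

# a 10-second recording with breakpoints tapped at 3.2 s and 7.5 s
print(split_by_breakpoints(10.0, [3.2, 7.5]))
# [(0.0, 3.2), (3.2, 7.5), (7.5, 10.0)]
```

Each resulting (start, end) pair corresponds to a start identifier and an end identifier as described above, and each section can then be recognized into its own piece of subtitle text.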
In some optional implementations of this embodiment, the method further includes: decorating the input text based on a preset artistic effect selected by the user.
The preset artistic effects include, but are not limited to, static effects and/or dynamic effects. Static effects applied to the input text include fonts, colors, and the like; dynamic effects on the input text include, but are not limited to, fading, floating, and blinking.
The user can select any artistic effect to apply to the input text, completing its decoration. Applying artistic effects to the input text enriches the form of the subtitles and increases the appeal of the user content.
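For illustration, a decoration module might record the user's selection as a style attached to the input text (all names and values here are hypothetical):

```python
# hypothetical style record attached to a piece of input text
decorated_text = {
    "text": "hello",
    "static": {"font": "SimHei", "color": "#FFD700"},  # static effects: font, color
    "dynamic": "fade_in",                              # dynamic effect: fading
}
```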
Referring further to fig. 3, there is shown a flowchart of another embodiment of the subtitle generating method for a mobile terminal, the method comprising the following steps:
Step S301, input speech collected by the audio data acquisition device is acquired.
Step S301 is substantially the same as step S201, and thus will not be described in detail.
Step S302, the input speech is converted into input text using a speech recognition model.
Step S302 is substantially the same as step S202, and thus will not be described in detail.
Step S303, a time axis configured for the input video is acquired.
Step S303 is substantially the same as step S203, and thus will not be described in detail.
Step S304, adding the input text to the timeline segment of the input video selected by the user.
Step S304 is substantially the same as step S204, and thus will not be described in detail.
In step S305, the input text and the input video are combined to generate video data with text.
The input text and the input video can be made into the final product using video subtitle pressing (burning) technology. Subtitle pressing is accomplished by decoding and re-encoding the video; it belongs to the prior art and is not repeated here.
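A minimal sketch of the pressing step using ffmpeg's subtitles filter, which decodes the video, renders each subtitle entry onto the frames, and re-encodes the result (the tool choice and file names are assumptions):

```python
import subprocess

def press_subtitles(video_path: str, srt_path: str, out_path: str) -> None:
    """Press the subtitle text into the video by decoding and re-encoding it."""
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"subtitles={srt_path}",  # render the subtitle entries onto each frame
         "-c:a", "copy",                  # leave the audio track untouched
         out_path],
        check=True,
    )

press_subtitles("input_video.mp4", "subtitles.srt", "final_video.mp4")
```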
In some optional implementations of this embodiment, the audio data acquisition device is a microphone device of the mobile terminal, for example a mobile phone microphone. Compared with manually typing plain text subtitles, this improves the subtitle input speed; compared with using professional software tools on a personal computer, it saves learning cost and simplifies the operation flow.
For ease of understanding, fig. 4 shows an application scenario diagram of an embodiment of a subtitle generating method for a mobile terminal according to the present application.
As shown in fig. 4, the user first performs a video recording operation for content production; after recording the video, beautifying it with filters, and simply editing and cropping it, the complete video is produced.
The user then starts the voice subtitle function, which turns on the phone's microphone to collect sound. The user speaks the subtitle content to be added to the video and, when the voice input is finished, taps to finish, yielding a complete audio file. If not satisfied, the user can replay the audio, delete it, and re-record. While the audio is being recorded, a breakpoint operation is available: when a sentence ends, the user can add a breakpoint with a manual tap, marking the current time node. The recording itself does not stop, and the user can continue recording.
After the voice recording is finished, the complete audio is analyzed and the corresponding subtitle text is generated through the speech-to-text capability. When the subtitle text is generated, the breakpoints made during recording are taken into account, automatic interception is performed, and multiple pieces of subtitle text are produced.
Finally, against the video time axis, the user selects a video clip, drags a piece of subtitle text onto it, chooses the adding position, and fine-tunes the subtitle color, special effects, and the like. These steps are repeated, and after all subtitles have been added, the video and the subtitles are pressed into the final product.
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of a subtitle generating apparatus for a mobile terminal. This apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied in various electronic devices.
As shown in fig. 5, the subtitle generating apparatus 500 for a mobile terminal of this embodiment may include: a first acquisition module 501, a conversion module 502, a second acquisition module 503, and an adding module 504. The first acquisition module 501 is configured to acquire input speech collected by an audio data acquisition device; the conversion module 502 is configured to convert the input speech into input text using a speech recognition model; the second acquisition module 503 is configured to acquire a time axis configured for the input video; and the adding module 504 is configured to add the input text to the timeline segment of the input video selected by the user.
In the present embodiment, in the subtitle generating apparatus 500 for a mobile terminal: the specific processing and technical effects of the first obtaining module 501, the converting module 502, the second obtaining module 503, and the adding module 504 may refer to the description of steps S201 to S204 in the corresponding embodiment of fig. 2, and are not repeated herein.
In some optional implementations of this embodiment, the input speech carries breakpoint identifiers made by the user at one or more time nodes of the input speech through a preset operation mode, and the conversion module 502 is further configured to: intercept the input speech based on the breakpoint identifiers to obtain a plurality of pieces of input text.
In some optional implementations of this embodiment, the apparatus further includes: a decoration module configured to decorate the input text based on a preset artistic effect selected by the user.
In some optional implementations of this embodiment, the apparatus further includes: a merging module configured to merge the input text and the input video to generate video data with attached text.
In some optional implementations of this embodiment, the audio data acquisition device is a mobile terminal microphone device.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard or a mouse; an output unit 607 such as various types of displays and speakers; a storage unit 608 such as a magnetic disk or an optical disk; and a communication unit 609 such as a network card, modem, or wireless communication transceiver. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 601 performs the methods and processes described above, for example the subtitle generating method for a mobile terminal. For example, in some embodiments, the subtitle generating method for a mobile terminal may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the subtitle generating method for a mobile terminal described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the subtitle generating method for a mobile terminal in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the present application, input speech collected by an audio data acquisition device is first acquired; the input speech is then converted into input text using a speech recognition model; a time axis configured for the input video is then acquired; and finally the input text is added to the timeline segment of the input video selected by the user. This provides a way to add subtitles to a video on the mobile terminal which, compared with adding subtitles on a personal computer using professional software tools, saves learning cost and simplifies the subtitle adding process.
Artificial intelligence is the discipline that studies how to make computers mimic certain human thought processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technologies, and the like.
It should be appreciated that steps may be reordered, added, or deleted in the various flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the disclosed embodiments are achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (10)

1. A subtitle generating method for a mobile terminal, comprising:
acquiring input speech collected by an audio data acquisition device, wherein the input speech carries breakpoint identifiers made by a user at one or more time nodes of the input speech through a preset operation mode, the preset operation mode being a behavior mode of performing a breakpoint operation at any time node of the input speech, a breakpoint being a certain time point of the input speech, the breakpoint operation not stopping the input speech, and the breakpoint identifiers comprising a start identifier and an end identifier;
converting the input speech into input text using a speech recognition model;
acquiring a time axis configured for an input video, wherein the time axis connects video frames in chronological order to form the input video;
adding the input text to a timeline segment of the input video selected by the user;
wherein converting the input speech into input text using the speech recognition model further comprises:
intercepting the input speech based on the breakpoint identifiers to obtain multiple speech segments;
recognizing the multiple speech segments to obtain multiple segments of subtitle text;
wherein acquiring the input speech collected by the audio data acquisition device comprises: in response to the user starting a voice subtitle function, turning on a microphone function of the mobile terminal and collecting the subtitle content input by the user's voice;
wherein adding the input text to the timeline segment of the input video selected by the user comprises:
taking the start time of the timeline segment as the start display time of the input text; and taking the end time of the timeline segment as the end display time of the input text.
2. The method of claim 1, further comprising:
decorating the input text based on a preset artistic effect selected by the user.
3. The method of claim 1, further comprising:
combining the input text with the input video to generate video data with attached text.
4. A method according to any one of claims 1-3, wherein the audio data acquisition device is a microphone device of the mobile terminal.
5. A subtitle generating apparatus for a mobile terminal, comprising:
a first acquisition module configured to acquire input speech collected by an audio data acquisition device, wherein the input speech carries breakpoint identifiers made by a user at one or more time nodes of the input speech through a preset operation mode, the breakpoint identifiers comprising a start identifier and an end identifier, the preset operation mode being a behavior mode of performing a breakpoint operation at any time node of the input speech, a breakpoint being a certain time point of the input speech, and the breakpoint operation not stopping the input speech;
a conversion module configured to convert the input speech into input text using a speech recognition model;
a second acquisition module configured to acquire a time axis configured for an input video, wherein the time axis connects video frames in chronological order to form the input video;
an adding module configured to add the input text to a timeline segment of the input video selected by the user;
wherein the conversion module is further configured to:
intercept the input speech based on the breakpoint identifiers to obtain multiple speech segments, and recognize the multiple speech segments to obtain multiple segments of subtitle text;
wherein the first acquisition module is further configured to:
in response to the user starting a voice subtitle function, turn on a microphone function of the mobile terminal and collect the subtitle content input by the user's voice;
wherein the adding module is further configured to:
take the start time of the timeline segment as the start display time of the input text, and take the end time of the timeline segment as the end display time of the input text.
6. The apparatus of claim 5, wherein the apparatus further comprises:
a decoration module configured to decorate the input text based on a preset artistic effect selected by the user.
7. The apparatus of claim 5, wherein the apparatus further comprises:
a merging module configured to merge the input text with the input video to generate video data with attached text.
8. The apparatus of any one of claims 5-7, wherein the audio data acquisition device is a microphone device of the mobile terminal.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-4.
CN202011497650.3A 2020-12-17 2020-12-17 Subtitle generating method, device, equipment and storage medium for mobile terminal Active CN112653932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011497650.3A CN112653932B (en) 2020-12-17 2020-12-17 Subtitle generating method, device, equipment and storage medium for mobile terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011497650.3A CN112653932B (en) 2020-12-17 2020-12-17 Subtitle generating method, device, equipment and storage medium for mobile terminal

Publications (2)

Publication Number Publication Date
CN112653932A CN112653932A (en) 2021-04-13
CN112653932B (en) 2023-09-26

Family

ID=75354753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011497650.3A Active CN112653932B (en) 2020-12-17 2020-12-17 Subtitle generating method, device, equipment and storage medium for mobile terminal

Country Status (1)

Country Link
CN (1) CN112653932B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105338419A (en) * 2015-10-29 2016-02-17 网易传媒科技(北京)有限公司 Subtitle collection generating method and apparatus
CN105704538A (en) * 2016-03-17 2016-06-22 广东小天才科技有限公司 Method and system for generating audio and video subtitles
CN107690089A (en) * 2016-08-05 2018-02-13 阿里巴巴集团控股有限公司 Data processing method, live broadcasting method and device
CN109413478A (en) * 2018-09-26 2019-03-01 北京达佳互联信息技术有限公司 Video editing method, device, electronic equipment and storage medium
CN109495792A (en) * 2018-11-30 2019-03-19 北京字节跳动网络技术有限公司 A kind of subtitle adding method, device, electronic equipment and the readable medium of video
CN111246127A (en) * 2020-01-14 2020-06-05 安徽咪鼠科技有限公司 Cross-platform-based real-time subtitle display method and management system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9390169B2 (en) * 2008-06-28 2016-07-12 Apple Inc. Annotation of movies
CN108401192B (en) * 2018-04-25 2022-02-22 腾讯科技(深圳)有限公司 Video stream processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112653932A (en) 2021-04-13

Similar Documents

Publication Publication Date Title
CN109688463B (en) Clip video generation method and device, terminal equipment and storage medium
CN103456314B (en) A kind of emotion identification method and device
JP6718828B2 (en) Information input method and device
CN108630193B (en) Voice recognition method and device
US20210241498A1 (en) Method and device for processing image, related electronic device and storage medium
CN110602516A (en) Information interaction method and device based on live video and electronic equipment
JP2022088304A (en) Method for processing video, device, electronic device, medium, and computer program
CN112434139A (en) Information interaction method and device, electronic equipment and storage medium
CN109782997B (en) Data processing method, device and storage medium
CN111541938A (en) Video generation method and device and electronic equipment
CN115225829A (en) Video generation method and device and computer readable storage medium
CN111031373A (en) Video playing method and device, electronic equipment and computer readable storage medium
CN111344717A (en) Interactive behavior prediction method, intelligent device and computer-readable storage medium
CN114222196A (en) Method and device for generating short video of plot commentary and electronic equipment
CN111540370A (en) Audio processing method and device, computer equipment and computer readable storage medium
CN113705300A (en) Method, device and equipment for acquiring phonetic-to-text training corpus and storage medium
CN106550268B (en) Video processing method and video processing device
JP2024506495A (en) Methods, devices, equipment and media for processing minutes
CN112653932B (en) Subtitle generating method, device, equipment and storage medium for mobile terminal
CN113806570A (en) Image generation method and generation device, electronic device and storage medium
CN112037781B (en) Voice data acquisition method and device
CN113923477A (en) Video processing method, video processing device, electronic equipment and storage medium
CN112578965A (en) Processing method and device and electronic equipment
CN112542157A (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN113360712B (en) Video representation generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant