CN113923479A - Audio and video editing method and device

Info

Publication number
CN113923479A
Authority
CN
China
Prior art keywords
sentence
audio
video
script
subtitle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111340475.1A
Other languages
Chinese (zh)
Inventor
曹溪语
吴悦
奉伟
郑程
单文睿
陈进生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111340475.1A
Publication of CN113923479A
Legal status: Pending


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/2222Prompting

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The disclosure provides an audio/video editing method and device, relating to the field of multimedia technology and in particular to editing technology. The specific implementation scheme is as follows: acquiring an original audio/video segment and a corresponding script sentence set; recognizing a subtitle sentence set by speech recognition from the original audio/video segment; for each script sentence in the script sentence set, recalling from the subtitle sentence set a target sentence whose similarity to the script sentence is higher than a first threshold; sequentially clipping and splicing the original audio/video segment by using the text alignment times identified for the subtitles of each target sentence to generate an intermediate audio/video segment; and replacing each target sentence with the corresponding script sentence according to the text alignment time identified for its subtitles, and combining the script sentences with the intermediate audio/video segment according to the alignment times to obtain a clipped audio/video segment. This embodiment enables fast and accurate audio/video editing.

Description

Audio and video editing method and device
Technical Field
The present disclosure relates to the field of multimedia technologies, and in particular, to a method and an apparatus for audio and video editing.
Background
With the growth of user demand and media technology, the number of videos has increased explosively, and video editing has become a video processing technique of wide interest. Video editing combines the objects to be edited into a single edited video and is often applied in scenarios such as short-video production and video compilations.
In a typical video editing workflow, the person recording a video first writes a video script (i.e., the text of the video content) to read from during recording. While recording, the speaker often stutters, misreads or omits characters and sentences, makes slips of the tongue, or repeats lines of the script; such flawed takes are usually deleted manually in post-production.
Disclosure of Invention
The present disclosure provides an audio/video clipping method, apparatus, device, storage medium and computer program product.
According to a first aspect of the present disclosure, there is provided an audio/video clipping method comprising: acquiring an original audio/video segment and a corresponding script sentence set; recognizing a subtitle sentence set by speech recognition from the original audio/video segment; for each script sentence in the script sentence set, recalling from the subtitle sentence set a target sentence whose similarity to the script sentence is higher than a first threshold; sequentially clipping and splicing the original audio/video segment by using the text alignment times identified for the subtitles of each target sentence to generate an intermediate audio/video segment; and replacing each target sentence with the corresponding script sentence according to the text alignment time identified for its subtitles, and combining the script sentences with the intermediate audio/video segment according to the alignment times to obtain the clipped audio/video segment.
According to a second aspect of the present disclosure, there is provided an audio/video clipping device comprising: an acquisition unit configured to acquire an original audio/video segment and a corresponding script sentence set; a recognition unit configured to recognize a subtitle sentence set by speech recognition from the original audio/video segment; a recall unit configured to recall, for each script sentence in the script sentence set, a target sentence from the subtitle sentence set whose similarity to the script sentence is higher than a first threshold; a splicing unit configured to sequentially clip and splice the original audio/video segment by using the text alignment times identified for the subtitles of each target sentence to generate an intermediate audio/video segment; and a replacement unit configured to replace each target sentence with the corresponding script sentence according to the text alignment time identified for its subtitles, and combine the script sentences with the intermediate audio/video segment according to the alignment times to obtain a clipped audio/video segment.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the first aspect.
According to a fifth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of the first aspect.
According to the audio/video editing method and device provided by the embodiments of the disclosure, given a script and the raw audio/video footage recorded from it, the speech-recognized subtitle result is matched to the script sentence by sentence using a recall strategy, the clipping is completed automatically, and the script is used to generate accurate subtitles. No manual screening is required, so editing efficiency is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of an audio-video clipping method according to the present disclosure;
FIG. 3 is a schematic diagram of one application scenario of an audio-video clipping method according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of an audio-video clipping method according to the present disclosure;
FIG. 5 is a schematic block diagram of one embodiment of an audio/video clipping device according to the present disclosure;
FIG. 6 is a schematic block diagram of a computer system suitable for use with an electronic device implementing embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the audio-video clipping method or audio-video clipping device of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a camera application, an audio/video clip application, a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like, may be installed on the terminal devices 101, 102, and 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting multimedia playing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (MPEG Audio Layer III), MP4 players (MPEG Audio Layer IV), laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server providing various services, such as a background clip server providing support for audio and video displayed on the terminal devices 101, 102, 103. The background clipping server can analyze and process the received data such as the clipping request and feed back the processing result (such as clipped audio and video) to the terminal equipment.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein. The server may also be a server of a distributed system, or a server incorporating a blockchain. The server may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host with artificial intelligence technology.
It should be noted that the audio/video clipping method provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, the audio/video clipping device is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of an audio-video clipping method according to the present disclosure is shown. The audio and video clipping method comprises the following steps:
Step 201, acquiring an original audio/video segment and a corresponding script sentence set.
In this embodiment, the executing body of the audio/video clipping method (for example, the server shown in fig. 1) may receive, through a wired or wireless connection, an audio/video clipping request from a terminal on which a user performs audio/video clipping, where the request includes an original audio/video segment and a corresponding script sentence set. A script sentence is the script content spoken by a person while recording the video, and may also be referred to as lines (dialogue). The script content is required to contain markers (e.g., punctuation) from which sentence boundaries can be determined, so that it can be split into script sentences.
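By way of a non-limiting illustration, the following Python sketch shows how script content might be split into script sentences at its punctuation markers. The set of sentence-ending punctuation and the function name are assumptions introduced for illustration; they are not prescribed by the disclosure.

```python
import re

# Assumed sentence-ending punctuation: Chinese and Western full stops,
# exclamation marks, and question marks.
SENTENCE_END = r"[。！？.!?]"

def split_script(script_text: str) -> list:
    """Split raw script content into script sentences at punctuation markers."""
    parts = re.split(f"({SENTENCE_END})", script_text)
    bodies, ends = parts[0::2], parts[1::2] + [""]
    sentences = []
    for body, end in zip(bodies, ends):
        sentence = (body + end).strip()  # re-attach the terminator
        if sentence:
            sentences.append(sentence)
    return sentences

print(split_script("大家好。今天介绍音视频剪辑！"))
# -> ['大家好。', '今天介绍音视频剪辑！']
```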
Step 202, recognizing a subtitle sentence set by speech recognition from the original audio/video segment.
In this embodiment, the subtitle sentence set can be recognized from the original audio/video segment by speech recognition. During recognition, the recognized subtitle content is segmented into sentences at speech pauses, forming the subtitle sentences. The text alignment time of each sentence is also recorded during speech recognition.
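As a sketch of the data this step produces, each recognized subtitle sentence can be represented by its text plus the recorded alignment times. The class and field names below are illustrative assumptions; any speech recognizer that reports sentence-level timestamps could populate such a structure.

```python
from dataclasses import dataclass

@dataclass
class SubtitleSentence:
    """A speech-recognized subtitle sentence with its text alignment time."""
    text: str     # recognized sentence text, cut at a speech pause
    start: float  # alignment start time in the original footage, in seconds
    end: float    # alignment end time, in seconds

# Example of what recognition of the original footage might yield:
subtitles = [
    SubtitleSentence("大家好", 0.0, 1.2),
    SubtitleSentence("今天介绍音视频剪辑", 1.8, 4.5),
]
```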
Step 203, recalling, for each script sentence in the script sentence set, a target sentence from the subtitle sentence set whose similarity to the script sentence is higher than the first threshold.
In this embodiment, the script sentence set and the subtitle sentence set are matched against each other using multiple similarity determination strategies, and the subtitle sentences that satisfy the conditions according to the comprehensive similarity are finally taken as recall candidates. For each script sentence, the recalled subtitle sentences whose comprehensive similarity is higher than the first threshold are kept as target sentences. Speech recognition may produce errors, especially with homophones, so besides character-string matching, pronunciation similarity may also be considered in the similarity calculation. For example, if no target sentence with similarity above the threshold can be recalled by string matching, recall can be performed by pronunciation, i.e., the Chinese text is converted to pinyin before the matching search.
Generally, the subtitle sentences contain the content of the script sentences plus extra material, because the speaker in the video utters filler words (e.g., "then"), makes slips of the tongue, or repeats lines of the script. A script sentence can therefore be matched against the subtitle sentence set in several ways. For example, with script sentences 1, 2, 3 and subtitle sentences a, b, c: script sentence 1 may align with subtitle sentence a, script sentences 1+2 may align with a single subtitle sentence, and script sentence 1 may align with subtitle sentences a+b. If multiple subtitle sentences contain the same script sentence, the similarity between each subtitle sentence and the script sentence is calculated separately; the similarity of two sentences can be computed with cosine similarity or another similarity measure, and the subtitle sentences whose similarity is higher than the first threshold are taken as target sentences. If more than one subtitle sentence exceeds the first threshold, the one with the latest text alignment time may be selected, because repeated recording usually means the earlier takes were flawed and the later take is better than the earlier ones.
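A minimal sketch of the recall strategy described above, reusing the SubtitleSentence structure sketched earlier: string similarity first, a pinyin (pronunciation) fallback for homophone misrecognitions, and a preference for the latest take. difflib is from the Python standard library and pypinyin is an existing third-party package, but their use here, the threshold value, and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher
from pypinyin import lazy_pinyin  # third-party: pip install pypinyin

def similarity(a: str, b: str) -> float:
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def recall_target(script_sentence, subtitles, first_threshold=0.8):
    """Recall the subtitle sentence that matches a script sentence."""
    candidates = [s for s in subtitles
                  if similarity(script_sentence, s.text) > first_threshold]
    if not candidates:
        # Homophone misrecognition: fall back to pronunciation matching
        # by converting the Chinese text to pinyin before comparing.
        script_py = " ".join(lazy_pinyin(script_sentence))
        candidates = [
            s for s in subtitles
            if similarity(script_py, " ".join(lazy_pinyin(s.text))) > first_threshold
        ]
    # Repeated takes usually mean earlier recordings were flawed,
    # so prefer the candidate with the latest alignment time.
    return max(candidates, key=lambda s: s.start) if candidates else None
```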
Step 204, sequentially clipping and splicing the original audio/video segment by using the text alignment times identified for the subtitles of each target sentence to generate an intermediate audio/video segment.
In this embodiment, the audio/video is sequentially clipped and spliced using the text alignment times identified for the subtitles of each target sentence to generate a new audio/video segment. This segment is automatically clipped according to the script content and is referred to as the intermediate audio/video segment. Because speech-recognized subtitles may contain wrongly written characters and similar errors, the subtitles still need subsequent processing.
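A hedged sketch of this clip-and-splice step, driving the ffmpeg command-line tool from Python. The disclosure does not name a particular tool; ffmpeg, the stream-copy flags (which make cuts only keyframe-accurate), and all file names are assumptions for illustration.

```python
import os
import subprocess
import tempfile

def splice(source: str, spans: list, output: str) -> None:
    """Cut each (start, end) span from the source footage in order and
    concatenate the cuts into one intermediate audio/video segment."""
    tmp = tempfile.mkdtemp()
    parts = []
    for i, (start, end) in enumerate(spans):
        part = os.path.join(tmp, f"part_{i}.mp4")
        subprocess.run(["ffmpeg", "-y", "-i", source,
                        "-ss", str(start), "-to", str(end),
                        "-c", "copy", part], check=True)
        parts.append(part)
    # The ffmpeg concat demuxer reads the list of parts from a text file.
    list_path = os.path.join(tmp, "parts.txt")
    with open(list_path, "w") as f:
        f.writelines(f"file '{p}'\n" for p in parts)
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_path, "-c", "copy", output], check=True)

# e.g. splice("original.mp4",
#             [(s.start, s.end) for s in target_sentences], "intermediate.mp4")
```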
Step 205, replacing each target sentence with the corresponding script sentence according to the text alignment time identified for its subtitles, and combining the script sentences with the intermediate audio/video segment according to the alignment times to obtain a clipped audio/video segment.
In this embodiment, each target sentence is replaced, at the text alignment time identified for its subtitles, with the corresponding script sentence, and the script text is combined with the audio/video segment produced in step 204 according to the alignment times. The result is an audio/video segment that is almost entirely consistent with the script, carrying subtitle content that matches the script content.
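A minimal sketch of the replacement step: the script text is written out as subtitles at the alignment times, here in SRT format as one common choice (the disclosure does not prescribe a subtitle format). The times are assumed to have already been remapped into the spliced segment's timeline.

```python
def to_srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    h, rem = divmod(int(t), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d},{int(t * 1000) % 1000:03d}"

def write_srt(entries: list, path: str) -> None:
    """entries: (script_text, start, end) tuples measured in the intermediate
    segment's timeline; the script text replaces the recognized subtitle
    text at the same alignment times."""
    with open(path, "w", encoding="utf-8") as f:
        for i, (text, start, end) in enumerate(entries, 1):
            f.write(f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n\n")
```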
The method provided by this embodiment of the disclosure effectively lowers the barrier to video editing, provides a novel, convenient and accurate automatic editing mode for video creators who record from a script, and greatly improves the efficiency of video editing; compared with a common editing method, it can increase editing speed by more than 50 times in scripted recording scenarios.
In some optional implementations of this embodiment, the method further includes: for each script sentence in the script sentence set, recalling from the subtitle sentence set candidate sentences whose similarity is higher than a second threshold, where the second threshold is lower than the first threshold, and clipping the candidate sentences other than the target sentence into candidate audio/video segments according to the text alignment times identified for their subtitles, for the user to select. Some script sentences may recall more than one subtitle sentence; apart from the sentence whose comprehensive similarity is above the first threshold, the remaining sentences above the second threshold are kept as candidates (clipped into candidate audio/video segments using their text alignment times), so that the user can freely substitute them in the clip.
In some optional implementations of this embodiment, the method further includes: detecting a stutter segment in the clipped audio/video segment according to the time intervals between the subtitle sentences in the subtitle sentence set, and deleting the stutter segment from the clipped audio/video segment. Content where the interval between adjacent subtitle sentences exceeds a preset threshold can be treated as a stutter segment, which is deleted from the audio/video before the remainder is spliced again. This automatically detects and clips out stutter segments, improving editing efficiency.
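A small sketch of the stutter detection just described, again assuming the SubtitleSentence structure from earlier; the gap threshold is an illustrative assumption.

```python
def find_stutter_gaps(subtitles, max_gap: float = 1.5):
    """Return (start, end) spans where the silence between adjacent subtitle
    sentences exceeds max_gap seconds; these stutter spans can be cut out
    of the clipped segment before it is spliced again."""
    gaps = []
    for prev, nxt in zip(subtitles, subtitles[1:]):
        if nxt.start - prev.end > max_gap:
            gaps.append((prev.end, nxt.start))
    return gaps
```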
In some optional implementations of this embodiment, recalling from the subtitle sentence set a target sentence whose similarity to the script sentence is higher than the first threshold includes: calculating the edit distance between the script sentence and each subtitle sentence in the subtitle sentence set, and determining the subtitle sentences whose edit distance is smaller than a preset value as target sentences. The edit distance quantifies the difference between two strings as the minimum number of single-character edits needed to turn one string into the other; it directly reflects the literal difference between two texts, so the more similar the texts, the smaller the edit distance. This method is fast for short texts, making it simple and quick for application scenarios such as subtitles.
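For concreteness, a standard dynamic-programming implementation of the edit (Levenshtein) distance; the disclosure does not mandate a particular algorithm, so this is one common choice rather than the method's required form.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

assert edit_distance("kitten", "sitting") == 3
```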
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the audio/video clipping method according to this embodiment. In the application scenario of fig. 3, the user clicks "script clip" to enter a page for uploading the video and the script, and imports the raw footage to be clipped together with the script text, punctuated into sentences. Clicking "smart clip" automatically generates a preview of the clipped video. For repeated takes that are found, clicking the prompt button beside a sentence displays all the content that can replace that video segment, and clicking a video segment previews the replaced effect. When the operation is finished, clicking "done" enters the general clip page, where other secondary editing and optimization can be performed, and finally the finished video is exported.
With further reference to fig. 4, a flow 400 of yet another embodiment of the audio/video clipping method is shown. The flow 400 of the audio/video clipping method comprises the following steps:
Step 401, acquiring an original audio/video segment and a corresponding script sentence set.
Step 402, recognizing a subtitle sentence set by speech recognition from the original audio/video segment.
Step 403, recalling, for each script sentence in the script sentence set, a target sentence from the subtitle sentence set whose similarity to the script sentence is higher than the first threshold.
Step 404, sequentially clipping and splicing the original audio/video segment by using the text alignment times identified for the subtitles of each target sentence to generate an intermediate audio/video segment.
Step 405, replacing each target sentence with the corresponding script sentence according to the text alignment time identified for its subtitles, and combining the script sentences with the intermediate audio/video segment according to the alignment times to obtain a clipped audio/video segment.
Steps 401 to 405 are substantially the same as steps 201 to 205 in the embodiment corresponding to fig. 2, and are not described again here.
Step 406, for each script sentence in the script sentence set, if no target sentence whose similarity to the script sentence is higher than the first threshold can be recalled from the subtitle sentence set, outputting prompt information that the script sentence is omitted.
In this embodiment, if the provided audio/video footage differs too much from the script content, some script sentences may fail to recall any similar subtitle sentence, which degrades the automatic clipping result. The user can then be prompted to supplement the recordings whose difference is too large, i.e., to provide the missed audio/video segments. The method can detect the specific time periods in which the audio/video is missing script sentences, which serve as the text alignment times for the missed audio/video segments subsequently submitted by the user.
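A hedged sketch of how the missing time periods might be located: each omitted script sentence is bracketed by the alignment times of its nearest recalled neighbors. The data shapes (an index-to-match dictionary) are assumptions for illustration.

```python
def omitted_windows(script_sentences, matches):
    """matches: {script sentence index -> matched SubtitleSentence};
    indexes absent from it failed recall. For each omitted script sentence,
    report the time window between its nearest matched neighbors, where the
    missed segment submitted by the user should be inserted."""
    windows = {}
    for i, sent in enumerate(script_sentences):
        if i in matches:
            continue
        before = max((matches[j].end for j in matches if j < i), default=0.0)
        after = min((matches[j].start for j in matches if j > i), default=None)
        windows[i] = (sent, before, after)  # after=None: append at the end
    return windows
```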
Step 407, in response to receiving the missed audio/video segments submitted by the user, inserting the missed audio/video segments into the clipped audio/video segments.
In this embodiment, the missed audio/video segments submitted by the user may be inserted into the clipped audio/video segment according to the text alignment times identified by the subtitles. Optionally, the missed audio/video segments may be verified: after speech recognition, they are matched against the script sentences to confirm that they cover the omitted script sentences detected in step 406. If the verification passes, the segments are inserted and spliced into the clipped audio/video segment.
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the flow 400 of the audio/video clipping method in this embodiment adds a step of checking the completeness of the audio/video. The scheme described in this embodiment can therefore automatically detect missing content, achieving more complete audio/video clipping.
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of an audio/video clipping device, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 5, the audio/video clipping device 500 of the present embodiment includes: an acquisition unit 501, a recognition unit 502, a recall unit 503, a splicing unit 504 and a replacement unit 505. The acquisition unit 501 is configured to acquire an original audio/video segment and a corresponding script sentence set. The recognition unit 502 is configured to recognize a subtitle sentence set by speech recognition from the original audio/video segment. The recall unit 503 is configured to recall, for each script sentence in the script sentence set, a target sentence from the subtitle sentence set whose similarity to the script sentence is higher than a first threshold. The splicing unit 504 is configured to sequentially clip and splice the original audio/video segment by using the text alignment times identified for the subtitles of each target sentence to generate an intermediate audio/video segment. The replacement unit 505 is configured to replace each target sentence with the corresponding script sentence according to the text alignment time identified for its subtitles, and combine the script sentences with the intermediate audio/video segment according to the alignment times to obtain a clipped audio/video segment.
In this embodiment, the specific processing of the acquiring unit 501, the identifying unit 502, the recalling unit 503, the splicing unit 504 and the replacing unit 505 of the audio/video clip device 500 may refer to step 201, step 202, step 203, step 204 and step 205 in the corresponding embodiment of fig. 2.
In some alternative implementations of the present embodiment, the apparatus 500 further comprises a candidate unit (not shown in the drawings) configured to: and for each script sentence in the script sentence set, recalling candidate sentences with the similarity higher than a second threshold value from the subtitle sentence set, wherein the second threshold value is lower than the first threshold value, and clipping the candidate sentences except the target sentence according to the text alignment time identified by the subtitle to form candidate audio and video segments for the user to select.
In some optional implementations of this embodiment, the apparatus 500 further comprises a deletion unit (not shown in the drawings) configured to: detect a stutter segment in the clipped audio/video segment according to the time intervals between the subtitle sentences in the subtitle sentence set, and delete the stutter segment from the clipped audio/video segment.
In some optional implementations of this embodiment, the apparatus 500 further comprises a prompting unit (not shown in the drawings) configured to: and for each script sentence in the script sentence set, if a target sentence with the similarity higher than a first threshold value with the script sentence can not be recalled from the subtitle sentence set, outputting the prompt information that the script sentence is omitted.
In some optional implementations of this embodiment, the apparatus 500 further comprises an insertion unit (not shown in the drawings) configured to: and in response to receiving the missed audio and video segments submitted by the user, inserting the missed audio and video segments into the clipped audio and video segments.
In some optional implementations of the present embodiment, the recall unit 503 is further configured to: and calculating the editing distance between the script sentence and each subtitle sentence in the subtitle sentence set. And determining the subtitle sentences with the editing distance smaller than a preset value as target sentences.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other processing of the personal information of the related user are all in accordance with the regulations of related laws and regulations and do not violate the good customs of the public order.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of flows 200 or 400.
A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of flow 200 or 400.
A computer program product comprising a computer program which, when executed by a processor, implements the method of flow 200 or 400.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. The RAM 603 can also store the various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the methods and processes described above, such as the audio/video clipping method. For example, in some embodiments, the audio/video clipping method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the audio/video clipping method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the audio/video clipping method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. An audio-video clipping method comprising:
acquiring an original audio and video clip and a corresponding script sentence set;
recognizing a subtitle sentence set from the speech of the original audio/video segment;
for each script sentence in the script sentence set, recalling a target sentence with similarity higher than a first threshold value with the script sentence from the subtitle sentence set;
sequentially clipping and splicing the original audio/video segment by using the text alignment times identified for the subtitles of each target sentence to generate an intermediate audio/video segment;
and replacing each target sentence with a corresponding script sentence according to the text alignment time identified by the subtitle of each target sentence, and combining the script sentence with the intermediate audio and video fragment according to the alignment time to obtain the clipped audio and video fragment.
2. The method of claim 1, wherein the method further comprises:
and for each script sentence in the script sentence set, recalling from the subtitle sentence set candidate sentences whose similarity is higher than a second threshold, wherein the second threshold is lower than the first threshold, and clipping the candidate sentences other than the target sentence according to the text alignment times identified for their subtitles to form candidate audio/video segments for the user to select.
3. The method of claim 1, wherein the method further comprises:
detecting a stutter segment in the clipped audio/video segment according to the time intervals between the subtitle sentences in the subtitle sentence set;
and deleting the stutter segment from the clipped audio/video segment.
4. The method of claim 1, wherein the method further comprises:
and for each script sentence in the script sentence set, if a target sentence whose similarity to the script sentence is higher than the first threshold cannot be recalled from the subtitle sentence set, outputting prompt information that the script sentence is omitted.
5. The method of claim 4, wherein the method further comprises:
and in response to receiving the missed audio and video segments submitted by the user, inserting the missed audio and video segments into the clipped audio and video segments.
6. The method of any of claims 1-5, wherein the recalling from the set of subtitle sentences a target sentence having a similarity to the script sentence above a first threshold comprises:
calculating the editing distance between the script sentence and each subtitle sentence in the subtitle sentence set;
and determining the subtitle sentences with the editing distance smaller than a preset value as target sentences.
7. An audio-video clipping device comprising:
the acquisition unit is configured to acquire an original audio and video clip and a corresponding script sentence set;
a recognition unit configured to recognize a subtitle sentence set by speech recognition from the original audio/video segment;
a recall unit configured to recall, for each script sentence in the script sentence set, a target sentence from the subtitle sentence set, the target sentence having a similarity to the script sentence higher than a first threshold;
the splicing unit is configured to sequentially clip and splice the original audio/video segment by using the text alignment times identified for the subtitles of each target sentence to generate an intermediate audio/video segment;
and the replacing unit is configured to replace each target sentence with a corresponding script sentence according to the text alignment time identified by the subtitle of each target sentence, and combine the script sentence with the intermediate audio and video clip according to the alignment time to obtain a clipped audio and video clip.
8. The apparatus of claim 7, wherein the apparatus further comprises a candidate unit configured to:
and for each script sentence in the script sentence set, recalling from the subtitle sentence set candidate sentences whose similarity is higher than a second threshold, wherein the second threshold is lower than the first threshold, and clipping the candidate sentences other than the target sentence according to the text alignment times identified for their subtitles to form candidate audio/video segments for the user to select.
9. The apparatus of claim 7, wherein the apparatus further comprises a deletion unit configured to:
detecting a stutter segment in the clipped audio/video segment according to the time intervals between the subtitle sentences in the subtitle sentence set;
and deleting the stutter segment from the clipped audio/video segment.
10. The apparatus of claim 7, wherein the apparatus further comprises a prompting unit configured to:
and for each script sentence in the script sentence set, if a target sentence whose similarity to the script sentence is higher than the first threshold cannot be recalled from the subtitle sentence set, outputting prompt information that the script sentence is omitted.
11. The apparatus of claim 10, wherein the apparatus further comprises an insertion unit configured to:
and in response to receiving the missed audio and video segments submitted by the user, inserting the missed audio and video segments into the clipped audio and video segments.
12. The apparatus of any one of claims 7-11, wherein the recall unit is further configured to:
calculating the editing distance between the script sentence and each subtitle sentence in the subtitle sentence set;
and determining the subtitle sentences with the editing distance smaller than a preset value as target sentences.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202111340475.1A 2021-11-12 2021-11-12 Audio and video editing method and device Pending CN113923479A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111340475.1A CN113923479A (en) 2021-11-12 2021-11-12 Audio and video editing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111340475.1A CN113923479A (en) 2021-11-12 2021-11-12 Audio and video editing method and device

Publications (1)

Publication Number Publication Date
CN113923479A (en) 2022-01-11

Family

ID=79246271

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111340475.1A Pending CN113923479A (en) 2021-11-12 2021-11-12 Audio and video editing method and device

Country Status (1)

Country Link
CN (1) CN113923479A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065031A (en) * 2018-08-02 2018-12-21 阿里巴巴集团控股有限公司 Voice annotation method, device and equipment
CN110166816A (en) * 2019-05-29 2019-08-23 上海乂学教育科技有限公司 The video editing method and system based on speech recognition for artificial intelligence education
CN111274434A (en) * 2020-01-16 2020-06-12 上海携程国际旅行社有限公司 Audio corpus automatic labeling method, system, medium and electronic equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860994A (en) * 2022-04-21 2022-08-05 北京奇艺世纪科技有限公司 Alignment method, device and equipment of video and scenario text and storage medium
CN115460455A (en) * 2022-09-06 2022-12-09 上海硬通网络科技有限公司 Video editing method, device, equipment and storage medium
CN115460455B (en) * 2022-09-06 2024-02-09 上海硬通网络科技有限公司 Video editing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US9569428B2 (en) Providing an electronic summary of source content
US9066049B2 (en) Method and apparatus for processing scripts
US8782536B2 (en) Image-based instant messaging system for providing expressions of emotions
CN113613065B (en) Video editing method and device, electronic equipment and storage medium
CN109754783B (en) Method and apparatus for determining boundaries of audio sentences
US20180157657A1 (en) Method, apparatus, client terminal, and server for associating videos with e-books
CN109275047B (en) Video information processing method and device, electronic equipment and storage medium
CN108509611B (en) Method and device for pushing information
US20150339616A1 (en) System for real-time suggestion of a subject matter expert in an authoring environment
CN112559800A (en) Method, apparatus, electronic device, medium, and product for processing video
CN113923479A (en) Audio and video editing method and device
US9525896B2 (en) Automatic summarizing of media content
CN104994404A (en) Method and device for obtaining keywords for video
CN104349173A (en) Video repeating method and device
US10595098B2 (en) Derivative media content systems and methods
US10499121B2 (en) Derivative media content systems and methods
CN112954453B (en) Video dubbing method and device, storage medium and electronic equipment
CN114374885A (en) Video key segment determination method and device, electronic equipment and readable storage medium
CN113361462A (en) Method and device for video processing and caption detection model
CN113011169A (en) Conference summary processing method, device, equipment and medium
KR102411095B1 (en) System and method for searching contents in accordance with advertisements
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium
US11490170B2 (en) Method for processing video, electronic device, and storage medium
CN115098729A (en) Video processing method, sample generation method, model training method and device
CN113761865A (en) Sound and text realignment and information presentation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination