CN115942043A - Video clipping method and device based on AI voice recognition - Google Patents

Video clipping method and device based on AI voice recognition Download PDF

Info

Publication number
CN115942043A
CN115942043A (application CN202310195644.XA)
Authority
CN
China
Prior art keywords
voice
original
video data
caption
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310195644.XA
Other languages
Chinese (zh)
Inventor
张文和
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Aizhao Feida Imaging Technology Co ltd
Original Assignee
Nanjing Aizhao Feida Imaging Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Aizhao Feida Imaging Technology Co ltd filed Critical Nanjing Aizhao Feida Imaging Technology Co ltd
Priority to CN202310195644.XA priority Critical patent/CN115942043A/en
Publication of CN115942043A publication Critical patent/CN115942043A/en

Landscapes

  • Studio Circuits (AREA)

Abstract

The invention discloses a video clipping method and device based on AI voice recognition. The method comprises: acquiring original video data; performing voice recognition on the original video data to obtain a transcript of the original subtitle speech; presenting the transcript so that a user can edit it to generate a new subtitle-speech transcript; and reverse-clipping the original video data according to the new transcript to generate the final video data. The invention completes video fragment-cutting work quickly and saves operating time; the cutting work covers deletion, filling cold-field gaps, reordering sentences, and post-adjustment optimization, meeting users' needs.

Description

Video clipping method and device based on AI voice recognition
Technical Field
The invention relates to the technical field of video fragment cutting, and in particular to a video fragment-cutting method and device based on AI voice recognition.
Background
At present, voice recognition is widely applied to video, mainly to recognize speech and convert it into text for use as subtitles. In routine editing work, "fragment cutting" is an important task: unwanted parts are removed from within a speech segment, such as verbal stumbles and filler sounds, or cold-field moments (pausing too long without speaking), and superfluous words are cut away when distilling key sentences. This process is known as fragment cutting.
Fragment cutting in the prior art requires listening and clipping word by word; the footage must often be reviewed and edited frame by frame, which is very time-consuming. Speech-to-text in the prior art is used for whole-passage transcription to automatically generate subtitles, rather than for driving edits.
Disclosure of Invention
The invention aims to provide a video clipping method and device based on AI voice recognition that addresses the above defects in the prior art.
To achieve the above object, in a first aspect, the present invention provides a video clipping method based on AI speech recognition, including:
acquiring original video data;
performing voice recognition on the original video data to obtain a transcript of the original subtitle speech;
presenting the transcript of the original subtitle speech, and allowing a user to edit it to generate a new subtitle-speech transcript;
and reverse-clipping the original video data according to the new transcript to generate final video data.
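As a rough sketch, the four steps above can be expressed as a pipeline. The `recognize`, `edit`, and `render` callbacks are placeholders (the patent does not prescribe a specific ASR engine or renderer), and the word-level keep test is deliberately naive:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Word:
    text: str
    start: float  # seconds into the original video
    end: float

def clip_video(video_path: str,
               recognize: Callable[[str], List[Word]],
               edit: Callable[[str], str],
               render: Callable[[str, List[Word]], object]) -> object:
    """Acquire -> recognize -> user edit -> reverse-clip."""
    words = recognize(video_path)                       # step 2: ASR with word timings
    original_text = " ".join(w.text for w in words)
    new_text = edit(original_text)                      # step 3: user edits the transcript
    kept_tokens = new_text.split()
    kept = [w for w in words if w.text in kept_tokens]  # step 4: keep surviving words
    return render(video_path, kept)                     # splice the matching segments
```

In practice `render` would cut the video at the kept words' time spans and concatenate the pieces.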
Further, the editing includes deleting part of the text in the transcript of the original subtitle speech, and the reverse clipping includes cutting out the video segments in the original video data that do not correspond to the new transcript and splicing the remainder.
Further, the editing includes adding text to the transcript of the original subtitle speech. Before reverse clipping, a corresponding voice signal is generated from the voice data in the original video using an AI voice-imitation technique, or is obtained by having the speaker record it, and this voice signal is spliced with the voice signal corresponding to the new transcript into a new voice signal. The reverse clipping includes, according to the length of the new voice signal, extending the original video data by frame interpolation or splicing in external audio-free video segment data of the same length, and finally synthesizing the new voice signal with the extended video data into the final video data.
Further, the editing also includes swapping the positions of several words, and the reverse clipping includes swapping the positions of the corresponding parts of the original video data and splicing them.
Further, the method also includes the following step: optimizing the splice points by adding transitions or AI processing.
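For the text-adding case, the number of interpolated frames needed to stretch the footage to the new voice signal's length can be computed directly. This is a sketch; the 30 fps default is an assumed frame rate, not one the patent specifies:

```python
def frames_to_insert(new_audio_len: float, video_len: float, fps: float = 30.0) -> int:
    """Number of frames to interpolate so the extended video matches
    the (longer) new voice signal; zero if no extension is needed."""
    extra = max(0.0, new_audio_len - video_len)
    return round(extra * fps)
```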
In a second aspect, the present invention provides a video clipping device based on AI speech recognition, comprising:
an acquisition module for acquiring original video data;
a recognition module for performing voice recognition on the original video data to obtain a transcript of the original subtitle speech;
a human-computer interaction module for presenting the transcript of the original subtitle speech and allowing a user to edit it to generate a new subtitle-speech transcript;
and a processing module for reverse-clipping the original video data according to the new transcript to generate final video data.
Further, the editing includes deleting part of the text in the transcript of the original subtitle speech, and the reverse clipping includes cutting out the video segments in the original video data that do not correspond to the new transcript and splicing the remainder.
Further, the editing includes adding text to the transcript of the original subtitle speech. Before reverse clipping, a corresponding voice signal is generated from the voice data in the original video using an AI voice-imitation technique, or is obtained by having the speaker record it, and this voice signal is spliced with the voice signal corresponding to the new transcript into a new voice signal. The reverse clipping includes splicing video segments of the corresponding length into the original video data by frame interpolation according to the length of the new voice signal, and finally synthesizing the new voice signal with the extended video data into the final video data.
Further, the editing also includes swapping the positions of several words, and the reverse clipping includes swapping the positions of the corresponding parts of the original video data and splicing them.
Furthermore, the processing module is also used to optimize the splice points by adding transitions or AI processing.
Beneficial effects: voice recognition is performed on the original video data to obtain a transcript of the original subtitle speech; the transcript is presented so that the user can edit it to generate a new transcript; and the original video data is reverse-clipped according to the new transcript to generate the final video data. Video fragment-cutting work can thus be completed quickly, saving operating time, and the cutting work covers deletion, filling cold-field gaps, reordering sentences, and post-adjustment optimization, meeting users' needs.
Drawings
FIG. 1 is a flowchart illustrating a video cropping method based on AI speech recognition according to an embodiment of the invention;
FIG. 2 is a schematic diagram of a video clipping device based on AI speech recognition according to an embodiment of the present invention.
Description of the preferred embodiment
The present invention will be further illustrated with reference to the accompanying drawings and specific examples, which are carried out on the premise of the technical solution of the present invention. It should be understood that these examples are intended only to illustrate the invention and not to limit its scope.
As shown in FIG. 1, an embodiment of the present invention provides a video clipping method based on AI speech recognition, including:
Original video data is acquired. The original video data refers to a pre-recorded speech video whose spoken content has certain defects, such as verbal stumbles or filler sounds, or cold-field moments (pausing too long without speaking).
Voice recognition is performed on the original video data to obtain a transcript of the original subtitle speech. Voice recognition of video data is prior art, commonly used to convert speech into text for subtitles; its principle is not repeated here.
The transcript of the original subtitle speech is presented, and the user edits it to generate a new subtitle-speech transcript.
The original video data is then reverse-clipped according to the new transcript to generate the final video data.
Specifically, the editing includes deleting part of the text in the transcript of the original subtitle speech, and the reverse clipping includes cutting out the video segments in the original video data that do not correspond to the new transcript and splicing the remainder. For example, the transcript recognized from the speech is: "This is an innovative era, and we must match this era, the trend of the era; as a technology developer, we must see the users' needs clearly and observe the market's reaction." The new transcript generated after editing is: "This is an innovative era. We must match the trend of the era. As a technology developer, we must see the users' needs clearly, and must do the following four things to observe the market's reaction." Finally, the video segments in the original video data that do not correspond to the new transcript are cut out and the remainder spliced, yielding the final video data. In addition, when performing speech recognition on the original video data, the time at which each utterance occurs can be recorded, and frame precision (e.g., 1/30 second or 1/60 second, depending on the frame rate) can be used as the basis for clipping.
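One way to realize this deletion-based reverse clipping is to diff the old and new transcripts at word level and keep only the time spans whose words survive. This is a sketch using Python's standard `difflib`, assuming the recognizer recorded a (start, end) timing for every word:

```python
import difflib
from typing import List, Tuple

def segments_to_keep(old_words: List[str],
                     new_words: List[str],
                     timings: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (start, end) spans of the original video whose speech
    survives in the edited transcript; everything else is cut out."""
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":  # these words were kept by the editor
            spans.append((timings[i1][0], timings[i2 - 1][1]))
    return spans
```

Snapping each span boundary to the nearest frame (a multiple of 1/30 s or 1/60 s) then gives the frame-precision cut points mentioned above.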
The editing also includes adding text to the transcript of the original subtitle speech. Before reverse clipping, a corresponding voice signal is generated from the voice data in the original video using an AI voice-imitation technique, or is obtained by having the speaker record it, and this voice signal is spliced with the voice signal corresponding to the new transcript into a new voice signal. During reverse clipping, video segments of the corresponding length are spliced into the original video data by frame interpolation according to the length of the new voice signal, and finally the new voice signal is synthesized with the extended video data into the final video data. In addition, the lips of the people in the interpolated and spliced segments can be processed by AI so that the lip shapes match the new script. Alternatively, the new voice signal can be matched with external audio-free video segment data of the same length, inserted with a command such as ####directory/video name####Default (assuming the configurable insertion command "####" inserts the "video name" stored under "directory" for the "Default" duration, where Default equals the length of the aforementioned new voice signal; substituting ####10.3 for ####Default inserts 10.3 seconds of video). The external video plus the new voice signal is then presented when playback reaches that position.
Sentence order can likewise be changed by an AI voice-recognition process or by marker commands. Assuming the configurable move command places "#--" before the segment to move and "++N" at its destination, the transcript "This is an innovative era. We must match the trend of the era. As a technology developer, we must see the users' needs clearly, and must do the following four things to observe the market's reaction." can be annotated as "This is an innovative era. #--We must match the trend of the era. As a technology developer, we must see the users' needs clearly, and ++10 must do the following four things to observe the market's reaction." The software then moves the segment after "#--" to the place marked "++10", changing the order of the video, as follows: "This is an innovative era. As a technology developer, we must see the users' needs clearly. We must match the trend of the era, and must do the following four things to observe the market's reaction." Again, for example, for a cold field the command may be #-5.3S#-, meaning this segment is shortened by 5.3 seconds. In addition, when the subtitle display is generated, the marker commands are removed and are not displayed as subtitles.
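The marker commands described above can be parsed mechanically before subtitles are rendered. The exact syntax is configurable; the regular expressions below assume the forms used in the example ("#--" … "++N" for moves, "#-<seconds>S#-" for trims) and also strip the markers so they never appear as subtitles:

```python
import re

TRIM_RE = re.compile(r"#-(\d+(?:\.\d+)?)S#-")  # "#-5.3S#-": cut 5.3 seconds here
MOVE_RE = re.compile(r"#--(.+?)\+\+(\d+)")     # "#--<text> ... ++N": move text to slot N

def parse_trims(script: str):
    """Seconds to cut, one entry per trim command found."""
    return [float(s) for s in TRIM_RE.findall(script)]

def parse_moves(script: str):
    """(moved_text, target_slot) pairs for every move command found."""
    return [(text.strip(" ,"), int(n)) for text, n in MOVE_RE.findall(script)]

def strip_markers(script: str) -> str:
    """Remove all marker syntax so it is not displayed as a subtitle."""
    script = TRIM_RE.sub("", script)
    script = script.replace("#--", "")
    return re.sub(r"\+\+\d+", "", script)
```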
The splice points can also be optimized by adding transitions or AI processing, so that the person in the video does not appear to move too sharply. For example, a newly inserted video segment can be zoomed in to reduce visual discomfort at the jump. Besides the AI approach, zooming in or out can be specified manually with the commands #&&ZoomIN or #&&ZoomOUT; for example, #&&ZoomIN10S zooms in for 10 seconds (here, after the segment has been shortened by 5.3 seconds) and then resets, while #&&ZoomIN+Default means the AI zooms in on the following complete video segment and automatically resets when switching to another segment (or to another sentence, bounded by a carriage return in the transcript).
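A similar sketch applies to the zoom commands, again assuming the "#&&ZoomIN"/"#&&ZoomOUT" syntax is configurable: a fixed duration like "10S" means zoom for that many seconds, while "+Default" means hold the zoom until the segment ends.

```python
import re
from typing import Optional, Tuple

ZOOM_RE = re.compile(r"#&&(ZoomIN|ZoomOUT)(?:(\d+(?:\.\d+)?)S|\+Default)")

def parse_zoom(command: str) -> Optional[Tuple[str, Optional[float]]]:
    """Return (direction, duration_seconds); a duration of None means
    'until the current segment ends' (the +Default form)."""
    m = ZOOM_RE.search(command)
    if not m:
        return None
    direction, secs = m.groups()
    return direction, (float(secs) if secs else None)
```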
Referring to FIG. 2, based on the above embodiment, the video clipping device based on AI speech recognition of the present invention includes an acquisition module 1, a recognition module 2, a human-computer interaction module 3, and a processing module 4.
The acquisition module 1 is used for acquiring original video data. The original video data refers to a pre-recorded speech video whose spoken content has certain defects, such as verbal stumbles or filler sounds, or cold-field moments (pausing too long without speaking).
The recognition module 2 is used for performing voice recognition on the original video data to obtain a transcript of the original subtitle speech. Voice recognition of video data is prior art, commonly used to convert speech into text for subtitles; its principle is not repeated here.
The human-computer interaction module 3 is used for presenting the transcript of the original subtitle speech and allowing the user to edit it to generate a new subtitle-speech transcript.
The processing module 4 is configured to reverse-clip the original video data according to the new transcript and generate the final video data.
Specifically, the editing includes deleting part of the text in the transcript of the original subtitle speech, and the reverse clipping includes cutting out the video segments in the original video data that do not correspond to the new transcript and splicing the remainder. For example, the transcript recognized from the speech is: "This is an innovative era, and we must match this era, the trend of the era; as a technology developer, we must see the users' needs clearly and observe the market's reaction." The new transcript generated after editing is: "This is an innovative era. We must match the trend of the era. As a technology developer, we must see the users' needs clearly, and must do the following four things to observe the market's reaction." Finally, the video segments in the original video data that do not correspond to the new transcript are cut out and the remainder spliced to obtain the final video data. In addition, when performing speech recognition on the original video data, the time at which each utterance occurs can be recorded, and frame precision (e.g., 1/30 second or 1/60 second, depending on the frame rate) can be used as the basis for clipping.
The editing also includes adding text to the transcript of the original subtitle speech. Before reverse clipping, a corresponding voice signal is generated from the voice data in the original video using an AI voice-imitation technique, or is obtained by having the speaker record it, and this voice signal is spliced with the voice signal corresponding to the new transcript into a new voice signal. The reverse clipping splices video segments of the corresponding length into the original video data by frame interpolation according to the length of the new voice signal, and finally synthesizes the new voice signal with the extended video data into the final video data. In addition, the lips of the people in the interpolated and spliced segments can be processed by AI so that the lip shapes match the new script, or the new voice signal can be matched with external audio-free video segment data of the same length, inserted with a command such as ####directory/video name####Default (assuming the configurable insertion command "####" inserts the "video name" stored under "directory" for the "Default" duration, where Default equals the length of the aforementioned new voice signal; substituting ####10.3 for ####Default inserts 10.3 seconds of video). The external video plus the new voice signal is then presented when playback reaches that position.
Sentence order can likewise be changed by an AI voice-recognition process or by marker commands. As in the method embodiment, annotating the transcript with "#--" before the segment to move and "++10" at the destination causes the software to move the segment after "#--" to the place marked "++10", changing the order of the video, as follows: "This is an innovative era. As a technology developer, we must see the users' needs clearly. We must match the trend of the era, and must do the following four things to observe the market's reaction." Again, for example, for a cold field the command may be #-5.3S#-, meaning this segment is shortened by 5.3 seconds. In addition, when the subtitle display is generated, the marker commands are removed and are not displayed as subtitles.
The processing module 4 can also optimize the splice points by adding transitions or AI processing, so that the person in the video does not appear to move too sharply; for example, a newly inserted video segment can be zoomed in to reduce visual discomfort at the jump. Besides the AI approach, zooming in or out can be specified manually with the commands #&&ZoomIN or #&&ZoomOUT; for example, #&&ZoomIN10S zooms in for 10 seconds (here, after the current segment has been shortened by 5.3 seconds) and then resets, while #&&ZoomIN+Default means the AI zooms in on the following complete video segment and automatically resets when switching to another segment (or to another sentence, bounded by a carriage return in the transcript).
The foregoing is merely a preferred embodiment of the present invention; other parts not specifically described fall within the skill or common general knowledge of one of ordinary skill in the art. Several improvements and modifications can be made without departing from the principle of the invention, and these should also be construed as falling within the scope of the invention.

Claims (10)

1. A video clipping method based on AI voice recognition, characterized by comprising the following steps:
acquiring original video data;
performing voice recognition on the original video data to obtain a transcript of the original subtitle speech;
presenting the transcript of the original subtitle speech, and allowing a user to edit it to generate a new subtitle-speech transcript;
and reverse-clipping the original video data according to the new transcript to generate final video data.
2. The AI-speech-recognition-based video clipping method of claim 1, wherein the editing comprises deleting a portion of the text in the transcript of the original subtitle speech, and the reverse clipping comprises cutting out the video segments of the original video data that do not correspond to the new transcript and splicing the remainder.
3. The AI-speech-recognition-based video clipping method of claim 1, wherein the editing comprises adding text to the transcript of the original subtitle speech; before reverse clipping, a corresponding voice signal is generated from the voice data in the original video using an AI voice-imitation technique, or is obtained by having the speaker record it, and this voice signal is spliced with the voice signal corresponding to the new transcript into a new voice signal; the reverse clipping comprises, according to the length of the new voice signal, splicing video segments of the corresponding length into the original video data by frame interpolation, or splicing in external video segment data of the same length, and finally synthesizing the new voice signal with the extended video data into final video data.
4. The AI-speech-recognition-based video clipping method of claim 1, wherein the editing further comprises swapping the positions of several words, and the reverse clipping comprises swapping the positions of the corresponding portions of the original video data and then splicing.
5. The AI-speech-recognition-based video clipping method according to any one of claims 2 to 4, further comprising: optimizing the splice points by adding transitions or AI processing.
6. A video clipping device based on AI voice recognition, characterized by comprising:
an acquisition module for acquiring original video data;
a recognition module for performing voice recognition on the original video data to obtain a transcript of the original subtitle speech;
a human-computer interaction module for presenting the transcript of the original subtitle speech and allowing a user to edit it to generate a new subtitle-speech transcript;
and a processing module for reverse-clipping the original video data according to the new transcript to generate final video data.
7. The AI-speech-recognition-based video clipping device of claim 6, wherein the editing comprises deleting a portion of the text in the transcript of the original subtitle speech, and the reverse clipping comprises cutting out the video segments of the original video data that do not correspond to the new transcript and splicing the remainder.
8. The AI-speech-recognition-based video clipping device of claim 6, wherein the editing comprises adding text to the transcript of the original subtitle speech; before reverse clipping, a corresponding voice signal is generated from the voice data in the original video using an AI voice-imitation technique, or is obtained by having the speaker record it, and this voice signal is spliced with the voice signal corresponding to the new transcript into a new voice signal; the reverse clipping comprises splicing video segments of the corresponding length into the original video data by frame interpolation according to the length of the new voice signal, and finally synthesizing the new voice signal with the extended video data into final video data.
9. The AI-speech-recognition-based video clipping device of claim 6, wherein the editing further comprises swapping the positions of several words, and the reverse clipping comprises swapping the positions of the corresponding portions of the original video data and then splicing.
10. The AI-speech-recognition-based video clipping device according to any one of claims 7 to 9, wherein the processing module is further configured to optimize the splice points by adding transitions or AI processing.
CN202310195644.XA 2023-03-03 2023-03-03 Video clipping method and device based on AI voice recognition Pending CN115942043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310195644.XA CN115942043A (en) 2023-03-03 2023-03-03 Video clipping method and device based on AI voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310195644.XA CN115942043A (en) 2023-03-03 2023-03-03 Video clipping method and device based on AI voice recognition

Publications (1)

Publication Number Publication Date
CN115942043A true CN115942043A (en) 2023-04-07

Family

ID=86701029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310195644.XA Pending CN115942043A (en) 2023-03-03 2023-03-03 Video clipping method and device based on AI voice recognition

Country Status (1)

Country Link
CN (1) CN115942043A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109949390A (en) * 2017-12-21 2019-06-28 腾讯科技(深圳)有限公司 Image generating method, dynamic expression image generating method and device
CN110166816A (en) * 2019-05-29 2019-08-23 上海乂学教育科技有限公司 The video editing method and system based on speech recognition for artificial intelligence education
CN112001323A (en) * 2020-08-25 2020-11-27 成都威爱新经济技术研究院有限公司 Digital virtual human mouth shape driving method based on pinyin or English phonetic symbol reading method
CN112002301A (en) * 2020-06-05 2020-11-27 四川纵横六合科技股份有限公司 Text-based automatic video generation method
US20200404386A1 (en) * 2018-02-26 2020-12-24 Google Llc Automated voice translation dubbing for prerecorded video
CN113225618A (en) * 2021-05-06 2021-08-06 阿里巴巴新加坡控股有限公司 Video editing method and device
CN115150660A (en) * 2022-06-09 2022-10-04 深圳市大头兄弟科技有限公司 Video editing method based on subtitles and related equipment
CN115278356A (en) * 2022-06-23 2022-11-01 上海高顿教育科技有限公司 Intelligent course video clip control method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination