CN114760493A - Method, device and storage medium for adding lyric progress image - Google Patents
- Publication number
- CN114760493A (application number CN202210305241.1A)
- Authority
- CN
- China
- Prior art keywords
- lyric
- target
- video
- playing time
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/233—Processing of audio elementary streams
- H04N21/2335—Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T11/00—2D [Two Dimensional] image generation
- G06T11/60—Editing figures and text; Combining figures or text
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/235—Processing of additional data, e.g. scrambling of additional data or processing content descriptors
- H04N21/2355—Processing of additional data, e.g. scrambling of additional data or processing content descriptors involving reformatting operations of additional data, e.g. HTML pages
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/435—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
- H04N21/4355—Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream involving reformatting operations of additional data, e.g. HTML pages on a television screen
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
- H04N21/4398—Processing of audio elementary streams involving reformatting operations of audio signals
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/488—Data services, e.g. news ticker
- H04N21/4884—Data services, e.g. news ticker for displaying subtitles
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
Abstract
The application discloses a method, an apparatus, and a storage medium for adding a lyric progress image, and belongs to the field of internet technologies. The method comprises the following steps: determining the playing time period of each lyric line of a target song in the target song video to which the lyric belongs; for each playing time period, acquiring, from the target song video, a target video frame whose playing time falls within the playing time period; performing character recognition processing on the target video frame corresponding to each playing time period to obtain the lyric text of the target song; determining a lyric timestamp corresponding to the lyric text, wherein the lyric timestamp comprises the starting playing time and the ending playing time of each word in the lyric text; and generating a lyric image corresponding to the lyric text, and adding a lyric progress image to video frames in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text, to obtain the target song video with the lyric progress image added. By means of the method and the apparatus, a lyric progress image can be added to a song video.
Description
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, and a storage medium for adding a lyric progress image.
Background
A lyric progress image is generally added to the song videos provided by a karaoke platform, and each word in the lyric progress image is displayed dynamically following the singing progress of the song in the song video. For example, the lyrics already sung and the lyrics not yet sung are displayed in different colors in the lyric progress image. In this way, the user can sing along with the song played in the song video, following the prompt of the lyric progress image.
However, the song videos on a karaoke platform are produced and provided by song video publishers, and if a publisher did not add a lyric progress image when producing a song video, the user cannot sing along with the song played in that video under the prompt of a lyric progress image. Therefore, how to add corresponding lyric progress images to existing song videos that lack them has become a technical problem to be solved urgently.
Disclosure of Invention
The embodiments of the application provide a method, an apparatus, and a storage medium for adding a lyric progress image, which can be used to add a corresponding lyric progress image to a song video that lacks one. The technical scheme is as follows:
in a first aspect, a method for adding a lyric progress image is provided, the method comprising:
Determining the playing time period of each lyric of the target song in the video of the target song to which the lyric belongs;
for each playing time period, acquiring a target video frame with playing time within the playing time period from the target song video;
performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song;
determining a lyric time stamp corresponding to the lyric text, wherein the lyric time stamp comprises the starting playing time and the ending playing time of each word in the lyric text;
and generating a lyric image corresponding to the lyric text, and adding a lyric progress image in a video frame in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video added with the lyric progress image.
Optionally, the determining a playing time period of each lyric of the target song in the target song video includes:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
determining a signal energy value corresponding to each audio frame of the human voice audio;
determining each human voice audio segment in the human voice audio based on the signal energy value corresponding to each audio frame of the human voice audio and a signal energy threshold;
and determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
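The energy-threshold segmentation described above can be sketched as follows. This is a purely illustrative sketch, not the patent's implementation: the function name, the frame representation (a list of sample sequences), and the use of mean squared amplitude as the energy measure are all assumptions.

```python
def voiced_segments(frames, energy_threshold):
    """Return (start_frame, end_frame) index pairs of contiguous voiced frames.

    `frames` is a list of audio frames (each a sequence of samples); a frame
    is treated as containing a human voice when its mean squared amplitude
    reaches the threshold (an assumed energy definition).
    """
    def energy(frame):
        return sum(s * s for s in frame) / len(frame)

    segments = []
    start = None
    for i, frame in enumerate(frames):
        if energy(frame) >= energy_threshold:
            if start is None:
                start = i  # a voiced run begins
        elif start is not None:
            segments.append((start, i - 1))  # the voiced run just ended
            start = None
    if start is not None:
        segments.append((start, len(frames) - 1))
    return segments
```

Each returned index pair maps to a playing time period once multiplied by the frame duration.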
Optionally, the obtaining, in the target song video, a target video frame whose playing time is within the playing time period includes:
selecting a specified number of video frames with playing time within the playing time period from the target song video, and determining the specified number of video frames as target video frames;
the performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song comprises:
for each playing time period, respectively carrying out character recognition processing on the specified number of target video frames corresponding to the playing time period to obtain a recognition result corresponding to each target video frame;
and determining a lyric text corresponding to the playing time period according to the identification result corresponding to each target video frame, and forming the lyric text corresponding to each playing time period into a lyric text of the target song.
Optionally, the recognition result includes at least one region displaying text in the corresponding target video frame, and text displayed in each region;
the determining the lyric text corresponding to the playing time period according to the identification result corresponding to each target video frame comprises the following steps:
and determining a target area with the maximum occurrence frequency in a plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
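The majority vote over per-frame recognition results could be sketched as below. The data format is hypothetical (the patent does not fix one): each result is assumed to map a region, represented as a hashable (x, y, w, h) tuple, to the text recognized in it.

```python
from collections import Counter

def pick_lyric_text(results):
    """Pick the text from the region occurring most often across OCR results.

    `results` is a list of per-frame OCR results; each result is a dict
    mapping a region tuple (x, y, w, h) to the text recognized in it.
    """
    region_counts = Counter(region for result in results for region in result)
    target_region, _ = region_counts.most_common(1)[0]
    # Take the recognized text from the first frame that contains the region.
    for result in results:
        if target_region in result:
            return result[target_region]
```

Regions that appear only transiently (logos, watermarks, scene text) occur in fewer frames than the fixed lyric area and are voted out.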
Optionally, before the target song video to which the lyric progress image is added is obtained, the method further includes:
and performing blurring processing on a specified area of each video frame whose playing time is within any playing time period, wherein the specified area is the target area of the lyric text of that playing time period in the target video frame.
Optionally, the performing word recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song includes:
cropping the image at the top of the target video frame according to a preset cropping ratio to obtain a video frame to be subjected to character recognition;
and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
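A minimal sketch of the cropping step, under assumptions: the frame is represented as a list of pixel rows, and the preset ratio is taken to be the proportion of the frame height retained (which portion of the frame actually holds the lyrics depends on the video layout and is not fixed here).

```python
def crop_for_ocr(frame, crop_ratio):
    """Keep only the top portion of a frame before character recognition.

    `frame` is a list of pixel rows; `crop_ratio` (0..1) is the preset
    proportion of the frame height retained.
    """
    rows_to_keep = int(len(frame) * crop_ratio)
    return frame[:rows_to_keep]
```

Restricting recognition to a fixed band reduces the OCR workload and avoids picking up unrelated on-screen text.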
Optionally, the determining the lyric timestamp corresponding to the lyric text includes:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
and inputting the lyric text and the human voice audio into an acoustic model to obtain the lyric timestamp corresponding to the lyric text.
Optionally, the generating a lyric image corresponding to the lyric text includes:
and generating a plurality of lyric images corresponding to the lyric text based on preset lyric display information and the lyric text, wherein each lyric image displays at least one lyric, and the lyric display information comprises the line number of the lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
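The grouping of lyric lines into images implied by the display information could be sketched as below; the function name and the list-of-lines representation are assumptions, and font attribute handling is omitted.

```python
def group_lines(lyric_lines, lines_per_image):
    """Split lyric text into groups, one group per lyric image.

    `lines_per_image` corresponds to the number of lyric text lines
    displayed in each lyric image, from the preset display information.
    """
    return [lyric_lines[i:i + lines_per_image]
            for i in range(0, len(lyric_lines), lines_per_image)]
```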
Optionally, adding a lyric progress image to a video frame in the target song video based on a lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video with the lyric progress image added thereto, including:
determining position information of the lyric image in a video frame of the target song video, wherein the position information comprises the position of each word in the lyric image in the corresponding video frame;
And inputting the lyric timestamp, the position information, the lyric image and a video frame in the target song video into a shader rendering module, and adding the lyric image to a corresponding video frame and performing progress special effect rendering on each word in the lyric image by the shader rendering module based on the lyric timestamp and the position information to obtain the target song video added with the lyric progress image.
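The per-word progress quantity such a shader could consume can be sketched as follows. This is an illustrative computation only, not the patent's rendering module; the (start, end) pair per word comes from the lyric timestamp.

```python
def word_progress(word_timestamps, now):
    """Return, per word, the fraction already sung at playback time `now`.

    `word_timestamps` is a list of (start_time, end_time) pairs, one per
    word. A shader could use each fraction to split a word's pixels into
    the 'sung' colour and the 'unsung' colour.
    """
    fractions = []
    for start, end in word_timestamps:
        if now <= start:
            fractions.append(0.0)   # word not yet reached
        elif now >= end:
            fractions.append(1.0)   # word fully sung
        else:
            fractions.append((now - start) / (end - start))
    return fractions
```

Recomputing these fractions for every rendered video frame yields the character-by-character colour sweep of a karaoke progress effect.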
In a second aspect, there is provided an apparatus for adding a lyric progress image, the apparatus comprising:
the determining unit is used for determining the playing time period of each lyric of the target song in the target song video to which the lyric belongs;
the acquisition unit is used for acquiring a target video frame with the playing time within the playing time period from the target song video for each playing time period;
the identification unit is used for carrying out character identification processing on the target video frame corresponding to each playing time slot to obtain a lyric text of the target song;
the determining unit is further configured to determine a lyric timestamp corresponding to the lyric text, where the lyric timestamp includes a start playing time and an end playing time of each word in the lyric text;
And the generation unit is used for generating a lyric image corresponding to the lyric text, and adding a lyric progress image into a video frame in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video with the lyric progress image added.
Optionally, the determining unit is configured to:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
determining a signal energy value corresponding to each audio frame of the human voice audio;
determining each human voice audio segment in the human voice audio based on the signal energy value corresponding to each audio frame of the human voice audio and a signal energy threshold;
and determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
Optionally, the obtaining unit is configured to:
selecting a specified number of video frames with playing time within the playing time period from the target song video, and determining the specified number of video frames as target video frames;
The identification unit is configured to:
for each playing time period, respectively carrying out character recognition processing on the specified number of target video frames corresponding to the playing time period to obtain a recognition result corresponding to each target video frame;
and determining the lyric text corresponding to each playing time period according to the identification result corresponding to each target video frame, and forming the lyric texts corresponding to the playing time periods into the lyric text of the target song.
Optionally, the recognition result includes at least one region displaying text in the corresponding target video frame, and text displayed in each region;
the acquisition unit is configured to:
and determining a target area with the largest occurrence frequency in a plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
Optionally, before the lyric timestamp corresponding to the lyric text, the lyric image, and the target song video are input to the rendering module, the method further includes:
and performing blurring processing on a designated area of each video frame whose playing time is within any playing time period, wherein the designated area is the target area of the lyric text of that playing time period in the target video frame.
Optionally, the identifying unit is configured to:
cropping the image at the top of the target video frame according to a preset cropping ratio to obtain a video frame to be subjected to character recognition;
and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
Optionally, the determining a lyric timestamp corresponding to the lyric text includes:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
and inputting the lyric text and the human voice audio into an acoustic model to obtain the lyric timestamp corresponding to the lyric text.
Optionally, the generating unit is configured to:
and generating a plurality of lyric images corresponding to the lyric text based on preset lyric display information and the lyric text, wherein each lyric image displays at least one lyric, and the lyric display information comprises the line number of the lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
Optionally, the generating unit is configured to:
Determining position information of the lyric image in a video frame of the target song video, wherein the position information comprises the position of each word in the lyric image in the corresponding video frame;
and inputting the lyric timestamp, the position information, the lyric image and a video frame in the target song video into a shader rendering module, and adding the lyric image to a corresponding video frame and performing progress special effect rendering on each word in the lyric image by the shader rendering module based on the lyric timestamp and the position information to obtain the target song video added with the lyric progress image.
In a third aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded by the processor and executed to implement the operations performed by the method for adding a lyric progress image according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the operations performed by the method for adding a lyric progress image according to the first aspect.
In a fifth aspect, a computer program product is provided, and the computer program product includes at least one instruction, which is loaded and executed by a processor to implement the operations performed by the method for adding a lyric progress image according to the first aspect.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
the method and the device determine the lyric text corresponding to the target song according to the video frame of the playing time in the playing time period and determine the corresponding lyric time stamp by determining the playing time period of each sentence of lyrics in the target song video. And further generating a target song video added with a lyric progress image by using a lyric text corresponding to the target song and a corresponding lyric timestamp. By the method and the device, the corresponding lyric progress image can be added into the existing song video without the lyric progress image.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an apparatus for adding a lyric progress image according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a computer device provided by an embodiment of the present application;
fig. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The method for adding a lyric progress image provided by the application may be implemented by a computer device. The computer device may have a processor and a memory, wherein the memory may be configured to store the program code implementing the method for adding a lyric progress image, the data to be processed (e.g., song videos), and the like, and the processor may be configured to execute the program code stored in the memory and process the data to be processed, thereby obtaining the song video with the lyric progress image added. The computer device may be a terminal or a server. When the computer device is a terminal, a karaoke application provided by a karaoke platform may run on the terminal, and the terminal may be a mobile phone, a tablet computer, a smart wearable device, a desktop computer, a notebook computer, or the like. When the computer device is a server, the server may be a background server of the karaoke platform.
Fig. 1 is a schematic diagram of a possible implementation environment provided by an embodiment of the present application. Referring to fig. 1, the background server of the karaoke platform may obtain, from a song video storage server, song videos to which a lyric progress image needs to be added. The song videos to which a lyric progress image is to be added are stored in a music video library on the song video storage server. After the background server of the karaoke platform obtains such a song video, it can add the lyric progress image to the corresponding song video according to the method for adding a lyric progress image provided by the application, thereby obtaining the song video with the lyric progress image added. In this way, when a terminal runs the karaoke application (client) provided by the karaoke platform, it can obtain the song video with the lyric progress image added from the background server and then play it. The user may sing along according to the progress of the lyrics displayed in the song video. The terminal can record the audio and video of the user singing the song and send the recording to the background server.
According to the method for adding a lyric progress image provided by the embodiments of the application, a corresponding lyric progress image can be added to song videos that lack one. Fig. 2 shows a method for adding a lyric progress image according to an embodiment of the present application; referring to fig. 2, the method includes:
In an implementation, the computer device may obtain, from the song video storage server, a target song video to which a lyric progress image needs to be added. A target song is played in the target song video, which may be the music video (MV) corresponding to the target song, a video made for the target song, or the like. In the target song video, the lyrics of the target song may be displayed sentence by sentence following the playing progress of the target song, but each word within a lyric line is not displayed dynamically following the singing progress of the song.
In an implementation, after the target song video is obtained, a playing time period of each lyric of the target song in the target song video, that is, a starting playing time point and an ending playing time point of each lyric of the target song in the target song video, may be determined according to audio data included in the target song video. In one possible scenario, the process of determining the playing time period of each lyric of the target song in the target song video may be as shown in fig. 3:
After the target song video is obtained, the target audio, that is, the audio played in the target song video (the audio of the target song), may be extracted from the video file corresponding to the target song video. After the target audio is obtained, human voice separation processing may be performed on the target audio to extract the human voice audio included in the target audio. For example, the human voice audio and the accompaniment audio in the target audio may be separated by an end-to-end residual attention network. The obtained human voice audio is the singing audio of the target song.
It should be understood that the playing duration of the target audio is the same as that of the corresponding human voice audio, and the playing time period of any lyric, or of any character within a lyric, is the same in the target audio and the corresponding human voice audio.
After the human voice audio is obtained, the segmentation information of the target song, that is, the playing time period of each lyric of the target song in the target song video, can be obtained through Voice Activity Detection (VAD). In the human voice audio, audio frames containing a human voice have higher signal energy, while audio frames without a human voice have lower signal energy. The signal energy value σ²(x) of an audio frame can be defined as σ²(x) = (1/N)·Σₜ xₜ², where xₜ is the audio sample at time t within the frame and N is the number of samples in the frame.
After the signal energy value corresponding to each audio frame in the human voice audio is calculated, the audio frames including the human voice and the audio frames not including the human voice can be determined according to the preset signal energy threshold value. The audio frames with the signal energy value greater than or equal to the signal energy threshold value can be determined as the audio frames including the human voice, and the audio frames with the signal energy value smaller than the signal energy threshold value can be determined as the audio frames not including the human voice. Thus, after the audio frames including the human voice and the audio frames not including the human voice are determined, the human voice audio segment in the human voice audio can be further determined. For example, a time period corresponding to a plurality of audio frames including a human voice that are consecutive in time may be determined as a human voice audio segment.
Since there may be pauses or intervals between the words of a lyric as the song is sung, the skilled person may also set an omission time interval. That is, for at least one audio frame without a human voice occurring between two audio segments with a human voice, if the duration corresponding to the at least one audio frame is less than the preset omission time interval, it may be ignored, that is, the two audio segments with a human voice are treated as one continuous segment with a human voice.
After determining the plurality of human voice audio segments included in the human voice audio, the playing time period of the human voice audio segments in the human voice audio can be determined, and the playing time period of each lyric of the target song in the target song video is determined.
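The energy-based segmentation described above can be sketched as follows (a minimal illustration assuming a mono waveform at a known sample rate; the frame size, energy threshold, and omission interval are hypothetical values, not taken from the embodiment):

```python
def vocal_segments(samples, sample_rate, frame_len=0.02,
                   energy_threshold=1e-3, omission_interval=0.3):
    """Return (start, end) time periods, in seconds, that contain a human voice."""
    n = int(frame_len * sample_rate)          # samples per audio frame
    frames = [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]
    # signal energy of each frame: mean of the squared sample values
    voiced = [sum(x * x for x in f) / len(f) >= energy_threshold for f in frames]

    segments = []
    start = None
    for i, v in enumerate(voiced):
        t = i * frame_len
        if v and start is None:
            start = t                          # a voiced run begins
        elif not v and start is not None:
            segments.append((start, t))        # a voiced run ends
            start = None
    if start is not None:
        segments.append((start, len(frames) * frame_len))

    # merge segments separated by a silent gap shorter than the omission interval
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < omission_interval:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    return merged
```

Each merged segment then corresponds to the playing time period of one lyric.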
And determining a target video frame of the playing time in the corresponding playing time period in the target song video after obtaining the playing time period of each lyric in the target song video. For example, for a playing time period of 1 minute 45 seconds to 1 minute 55 seconds, at least one video frame with a playing time of 1 minute 45 seconds to 1 minute 55 seconds may be acquired from the target song video as the target video frame corresponding to the playing time period.
And step 204, performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song.
For a playing time period, the target video frame in the playing time period is a video frame displayed in the target song video when the audio of the playing time period is played. Although the lyric progress image is not displayed in the target song video, the lyrics of the target song are still displayed in the target song video sentence by sentence along with the playing progress of the target song. That is, when the audio of the target song is played, the lyrics corresponding to the current audio are displayed in the played target video frame.
Therefore, the lyrics of the corresponding playing time period are displayed in the target video frame acquired for each playing time period, and character recognition processing can be performed on the target video frame of each playing time period to obtain the lyrics corresponding to that period, and thereby the lyric text of the target song.
In order to improve the accuracy of text recognition, for each playing time period, multiple target video frames may be obtained in the corresponding playing time period, so as to determine lyrics in the corresponding playing time period in the multiple target video frames, and the corresponding process is as shown in fig. 4, and includes:
For any playing time period, a specified number of video frames can be selected from the target song video, and the selected video frames are taken as target video frames. Wherein the specified number may be preset by a technician. When selecting the video frames, a specified number of video frames may be randomly selected from the plurality of video frames in the playing time period, or the specified number of video frames may also be selected from the plurality of video frames in the playing time period at equal intervals.
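Equal-interval selection can be sketched as follows (an illustrative helper; the function name and the rounding choice are assumptions):

```python
def pick_frames_equally(frame_indices, count):
    """Select `count` frames at approximately equal intervals from the
    frames whose playing time falls within one playing time period."""
    if count >= len(frame_indices):
        return list(frame_indices)
    if count == 1:
        return [frame_indices[0]]
    step = (len(frame_indices) - 1) / (count - 1)
    return [frame_indices[round(i * step)] for i in range(count)]
```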
Thus, one lyric corresponds to one playing time period, one playing time period corresponds to a plurality of target video frames, that is, one lyric corresponds to a plurality of target video frames, and the plurality of target video frames can all display the corresponding lyrics.
After obtaining the multiple target video frames corresponding to each playing time period, character recognition processing may be performed, for any playing time period, on the multiple target video frames corresponding to it, so as to obtain a recognition result for each target video frame. The recognition result includes the areas in the target video frame where text is displayed and the text in each such area. For example, the text in the target video frame may be recognized by Optical Character Recognition (OCR) techniques.
In one possible case, after the character recognition processing is performed on the multiple target video frames corresponding to a playing time period, the multiple recognition results obtained are consistent and each recognition result includes only one area displaying text; the text in that area may then be determined as the lyrics corresponding to the playing time period.
In another possible case, after performing the word recognition processing on a plurality of target video frames corresponding to a playing time period, and obtaining a plurality of recognition results that are inconsistent, the lyrics corresponding to the playing time period may be determined according to the plurality of recognition results, and the corresponding processing is as follows: and determining a target area with the largest occurrence frequency in the plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
In addition to the lyrics, other text may be displayed in the target song video. For example, a billboard, a couplet, or a calligraphy work appearing in the picture of the target song video may be recognized when the target video frame is subjected to character recognition processing. However, unlike the lyrics, the display area of such other text typically changes as the target song video plays, i.e., its display area may differ between target video frames, whereas the same lyrics are typically displayed in a fixed area, for example, at the bottom of the target video frame. Therefore, after the recognition results corresponding to the plurality of target video frames are obtained, the number of occurrences of each text-displaying area in the recognition results can be counted, and the area with the largest number of occurrences, that is, the area displaying the lyrics, is determined as the target area. The text displayed in the target area may then be determined as the lyric text corresponding to the playing time period.
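The majority vote over recognized text areas can be sketched as follows (a minimal illustration in which areas are modeled as bounding-box tuples and exact box equality is assumed; a real system would cluster near-identical boxes):

```python
from collections import Counter

def lyrics_from_ocr(results):
    """results: one list per target video frame of (region, text) pairs,
    where region is an (x, y, w, h) tuple. Returns the text recognized
    in the region that occurs most often across all frames."""
    region_counts = Counter(region for frame in results for region, _ in frame)
    target_region, _ = region_counts.most_common(1)[0]
    # take the text recognized in the target region of the first frame containing it
    for frame in results:
        for region, text in frame:
            if region == target_region:
                return text
```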
In addition, in the flow of the method, before the character recognition processing is performed on the target video frame, the target video frame may be cut, and the corresponding processing is as follows: cutting out an image at the top of the target video frame according to a preset cutting proportion to obtain a video frame to be subjected to character recognition; and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
Watermarks of the song's publishers and distributors, such as "XX music" or "XXTV", may be displayed in the target song video. These watermarks typically appear in the upper left or upper right corner of the target song video, while the lyrics typically appear at the bottom, left, or right. To avoid recognizing the characters in the watermark during character recognition, the image at the top of the target video frame can be cut off before the character recognition processing is performed. The technician may preset a corresponding cropping ratio, which may be, for example, 10:1 or 5:1, and the image at the top of each target video frame is then cut off according to the preset cropping ratio. Character recognition processing is then performed on the cropped target image. In this way, the influence of the watermark in the target video frame on the recognition result can be avoided, and the accuracy of lyric recognition improved.
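Cropping the top of a frame by a preset ratio can be sketched as follows (illustrative only; the frame is modeled as a nested list of pixel rows, and a 10:1 ratio is interpreted here, as an assumption, to mean discarding the top tenth of the image):

```python
def crop_top(frame_rows, ratio=10):
    """Remove the top 1/ratio of an image before character recognition."""
    cut = len(frame_rows) // ratio
    return frame_rows[cut:]
```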
In addition, in the flow of the method, after the character recognition processing is performed on the target video frames, blur processing may be applied to a specified area of each video frame whose playing time falls within any playing time period, the specified area being the target area in which the lyric text of that playing time period appears. That is, the area of the target song video where the original lyrics appear can be blurred, so that when the lyric progress image is added to the target song video, only the lyrics corresponding to the lyric progress image are displayed, and the lyrics are not displayed twice.
And step 205, determining a lyric time stamp corresponding to the lyric text.
The lyric timestamp comprises the starting playing time and the ending playing time of each word in the lyric text.
In an implementation, after the lyric text of the target song is determined, the starting playing time and ending playing time of each word of the lyric text in the target song video may be determined according to the audio of the target song and the lyric text. For example, the audio and the lyric text of the target song may be input into an acoustic model, which outputs the starting playing time and ending playing time of each word, thereby yielding the lyric timestamp corresponding to the lyric text. The acoustic model may be a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) model.
In order to obtain the lyric timestamp corresponding to the lyric text more accurately from the audio of the target song, human voice separation may first be performed on the audio of the target song to obtain its human voice audio, and the lyric timestamp corresponding to the lyric text is then determined from the obtained human voice audio. In this way, the influence of the accompaniment audio in the target song can be avoided, and the accuracy of the lyric timestamp is improved.
In one possible case, the audio of the target song and the lyric text may be input together into the GMM-HMM model, which outputs the starting playing time and ending playing time of each word in the lyric text. In another possible case, the singing audio corresponding to each lyric may be cut out of the audio of the target song according to the playing time period corresponding to that lyric; each singing audio segment and its corresponding lyric are then input into the GMM-HMM model to obtain the lyric timestamp corresponding to each lyric, and the lyric timestamp corresponding to the target song is determined from the playing time period and the lyric timestamp corresponding to each lyric. The process of identifying the lyric timestamps based on the GMM-HMM model may be as shown in fig. 5: S1, the input audio is cut into audio frames of equal length, and Mel Frequency Cepstrum Coefficient (MFCC) features are extracted for each audio frame. S2, the MFCC feature vector [c1, c2, ... c39] is input into a pre-trained GMM model to obtain the probability P(xi) that each audio frame xi belongs to a given phoneme. S3, the input lyric text is converted into phonemes, and the phonemes are converted into states through a triphone model (triphones). S4, combining the states obtained from the input lyric text with the probabilities P(xi) obtained from the GMM model, the probability that each state Oi generates an audio frame is calculated using the HMM state transition probabilities; the word sequence with the highest probability for the HMM sequence is determined, and the correspondence between the audio and each word in the text is thus obtained.
The starting playing time point and ending playing time point corresponding to each word are then determined according to the audio corresponding to each word, thereby obtaining the lyric timestamp corresponding to the target song.
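Given a frame-level alignment (each audio frame labeled with the word it belongs to), deriving per-word timestamps can be sketched as follows (a simplified illustration of this last step only; the frame-to-word labels would come from the GMM-HMM decoding, which is not reproduced here):

```python
def word_timestamps(frame_words, frame_len=0.025):
    """frame_words: one word label per audio frame (None for silence).
    Consecutive frames with the same label are treated as one word
    occurrence. Returns a list of (word, start_time, end_time) in seconds."""
    stamps = []
    current = None                                 # (word, start_frame_index)
    for i, w in enumerate(frame_words + [None]):   # sentinel flushes the last word
        if current and w != current[0]:
            word, start = current
            stamps.append((word, start * frame_len, i * frame_len))
            current = None
        if w is not None and current is None:
            current = (w, i)
    return stamps
```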
And step 206, generating a lyric image corresponding to the lyric text, and adding a lyric progress image in a video frame of the target song video based on the lyric time stamp and the lyric image corresponding to the lyric text to obtain the target song video added with the lyric progress image.
In implementation, after the lyric text of the target song is obtained, a corresponding lyric image can be generated for each sentence of lyrics in the lyric text, then the obtained lyric time stamp, the lyric image and the target song video can be input into the shader rendering module, the shader rendering module can add different display effects to the lyric image according to the lyric time stamp, render the lyric image into a target video frame, and further obtain the target song video added with the lyric progress image. Further processing, as shown in fig. 6, includes:
The lyric display information comprises the line number of the lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
The lyric display information may be a font configuration file pre-configured by a technician, and may be a JS Object Notation (json) text in data format. The method comprises the information of font, size, color, word spacing, stroke effect (stroke size and color), shadow effect (shadow radius, offset and color), single-line maximum length (if the length of the file information exceeds the width of the picture, the file needs to be disassembled into multiple lines for processing), and the like, wherein the font, the size, the color, the word spacing, the stroke effect (stroke size and color), the shadow effect (shadow radius, offset and color), the single-line maximum length and the like are added to the lyrics in the target song video. Therefore, a plurality of corresponding lyric images can be generated from the lyric text of the target song according to the preset font configuration file and the lyric text. For example, one lyric image may include two words of lyrics, where the first word is the lyrics being played and the second word is the next lyric to be played.
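Such a font configuration file might look like the following (an invented example for illustration; the field names are assumptions, not taken from the embodiment):

```json
{
  "font": "SourceHanSans",
  "size": 48,
  "color": "#FFFFFF",
  "letter_spacing": 2,
  "stroke": { "size": 2, "color": "#000000" },
  "shadow": { "radius": 4, "offset": [2, 2], "color": "#00000080" },
  "max_line_width": 960
}
```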
The position information includes the position of each lyric image in the corresponding video frame and the position of each word of each lyric image in the corresponding video frame. The position of the lyric image in the video frame may be preset by a technician; for example, the lyric image may be placed at the bottom of the video frame. The position of each word in a lyric image within the corresponding video frame may then be calculated from the position of the lyric image in that frame, for example from the number of words in the lyric image and the word spacing. In addition, a plurality of lyric images, that is, a plurality of lines of lyrics, can be included in one video frame; the specific number of lines may be preset by a technician and is not limited herein.
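Computing per-word positions from the image position, word count, and word spacing can be sketched as follows (a simplified horizontal layout assuming fixed-width glyphs; a real layout would use per-glyph metrics from the font):

```python
def word_positions(image_x, image_y, n_words, glyph_width, spacing):
    """Return the (x, y) position of each word in a lyric image, laid out
    left to right from the image's top-left corner in the video frame."""
    return [(image_x + i * (glyph_width + spacing), image_y)
            for i in range(n_words)]
```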
After the lyric time stamp, the position information and the lyric image are obtained, the lyric time stamp, the position information, the lyric image and a video frame in the target song video can be input into the shader rendering module.
The shader rendering module can determine the video frames into which each lyric image needs to be rendered according to the playing time corresponding to each lyric in the lyric timestamp, and determine the time period and position for rendering the progress special effect of each word in the lyric image according to the playing time corresponding to each word in the lyric timestamp and the position of each word in the position information in the corresponding video frame. The lyric image and the corresponding video frame may then be rendered with a progress special effect by the shader rendering module; for example, the progress special effect may include fade-in and fade-out, scrolling play, font shaking, and the like. After the shader rendering module finishes rendering all video frames in the target song video, the target song video with the lyric progress image added can be obtained.
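The per-word progress special effect relies on knowing, at each video frame, how far into its playing period a word is; that fraction can be sketched as follows (illustrative; feeding this value to the shader as a uniform is an assumption, not stated in the embodiment):

```python
def word_progress(frame_time, word_start, word_end):
    """Fraction in [0, 1] of a word's highlight to render at frame_time,
    given the word's starting and ending playing times from the timestamp."""
    if frame_time <= word_start:
        return 0.0
    if frame_time >= word_end:
        return 1.0
    return (frame_time - word_start) / (word_end - word_start)
```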
Fig. 7 illustrates the process of lyric rendering, taking the rendering of "but remember your smile" as an example: S1, a lyric image corresponding to "but remember your smile" is generated through the font configuration file; S2, the position information is obtained by a space-time analyzer performing the processing of step 2062, and the lyric timestamps (time information) are obtained from the lyrics recognized by the OCR technology; S3, the lyric image is rendered into the corresponding video frame of the song video through the shader rendering module, and a character-by-character display effect is rendered and added; S4, it is determined whether all video frames corresponding to "but remember your smile" have been rendered. If so, the next lyric is updated and prepared for rendering; if not, the next frame is prepared for rendering.
In the present application, only one lyric picture needs to be generated for each lyric. Once generated, the lyric picture can be displayed until the lyric corresponding to it has been fully played. Correspondingly, the space-time analysis module only needs to analyze the spatial information once. In addition, to prevent rendering stalls, multiple threads can be started to generate the lyric pictures corresponding to subsequent lyrics in advance.
The method and the device determine the lyric text corresponding to the target song according to the video frame of the playing time in the playing time period and determine the corresponding lyric time stamp by determining the playing time period of each sentence of lyrics in the target song video. And further generating a target song video added with a lyric progress image by using a lyric text corresponding to the target song and a corresponding lyric timestamp. By the method and the device, the corresponding lyric progress image can be added into the existing song video without the lyric progress image.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present disclosure, and are not described in detail herein.
Fig. 8 is a device for adding a lyric progress image according to an embodiment of the present application, where the device may be a computer device according to the foregoing embodiment, and referring to fig. 8, the device includes:
A determining unit 810, configured to determine a playing time period of each lyric of the target song in the target song video to which the lyric belongs;
an obtaining unit 820, configured to obtain, for each playing time period, a target video frame with a playing time within the playing time period in the target song video;
the identifying unit 830 is configured to perform word identification processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song;
the determining unit 810 is further configured to determine a lyric timestamp corresponding to the lyric text, where the lyric timestamp includes a start playing time and an end playing time of each word in the lyric text;
the generating unit 840 is configured to generate a lyric image corresponding to the lyric text, and add a lyric progress image to a video frame in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video with the lyric progress image added.
Optionally, the determining unit 810 is configured to:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
Determining a signal energy value corresponding to each audio frame of the human voice audio;
determining each human voice frequency segment in the human voice frequency based on a signal energy value corresponding to each audio frame of the human voice frequency and a signal energy threshold value;
and determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
Optionally, the obtaining unit 820 is configured to:
selecting a specified number of video frames with playing time within the playing time period from the target song video, and determining the specified number of video frames as target video frames;
the identifying unit 830 is configured to:
for each playing time period, respectively carrying out character recognition processing on the specified number of target video frames corresponding to the playing time period to obtain a recognition result corresponding to each target video frame;
and determining a lyric text corresponding to the playing time period according to the identification result corresponding to each target video frame, and forming the lyric text corresponding to each playing time period into a lyric text of the target song.
Optionally, the recognition result includes at least one region displaying text in the corresponding target video frame, and text displayed in each region;
The obtaining unit 820 is configured to:
and determining a target area with the maximum occurrence frequency in a plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
Optionally, before the lyric timestamp, the lyric image and the target song video corresponding to the lyric text are input to the rendering module, the method further includes:
and carrying out fuzzy processing on a specified area of the video frame with the playing time in any playing time period, wherein the specified area is a target area of the lyric text of any playing time period in the target video frame.
Optionally, the identifying unit 830 is configured to:
cutting out the image at the top of the target video frame according to a preset cutting proportion to obtain a video frame to be subjected to character recognition;
and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
Optionally, the determining the lyric timestamp corresponding to the lyric text includes:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
And inputting the lyric text and the human voice audio frequency into an acoustic model to obtain a lyric time stamp corresponding to the lyric text.
Optionally, the generating unit 840 is configured to:
and generating a plurality of lyric images corresponding to the lyric text based on preset lyric display information and the lyric text, wherein each lyric image displays at least one lyric, and the lyric display information comprises the line number of the lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
Optionally, the generating unit 840 is configured to:
determining position information of the lyric image in a video frame of the target song video, wherein the position information comprises the position of each word in the lyric image in the corresponding video frame;
and inputting the lyric time stamp, the position information, the lyric image and the video frame in the target song video into a shader rendering module, and adding the lyric image to the corresponding video frame and performing progress special effect rendering on each word in the lyric image by the shader rendering module based on the lyric time stamp and the position information to obtain the target song video added with the lyric progress image.
The method and the device determine the lyric text corresponding to the target song according to the video frame of the playing time in the playing time period and determine the corresponding lyric time stamp by determining the playing time period of each sentence of lyrics in the target song video. And further generating a target song video added with a lyric progress image by using a lyric text corresponding to the target song and a corresponding lyric timestamp. By the method and the device, the corresponding lyric progress image can be added into the existing song video without the lyric progress image.
It should be noted that: in the apparatus for adding an image of lyric progress according to the above embodiment, when adding an image of lyric progress, the division of each function module is merely used for illustration, and in practical applications, the function distribution may be completed by different function modules according to needs, that is, the internal structure of the device may be divided into different function modules, so as to complete all or part of the functions described above. In addition, the apparatus for adding an image of a lyric progress and the method embodiment for adding an image of a lyric progress provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment, and are not described herein again.
Fig. 9 shows a block diagram of a computer device according to an exemplary embodiment of the present application. The computer device may be the terminal in the above-described embodiments (which may be referred to as terminal 900 hereinafter). The terminal 900 may be a portable mobile terminal such as: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
In general, terminal 900 includes: a processor 901 and a memory 902.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (input/output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The radio frequency circuit 904 is used for receiving and transmitting RF (radio frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (wireless fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (near field communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 905 may be one, disposed on the front panel of the terminal 900; in other embodiments, the number of the display panels 905 may be at least two, and each of the display panels is disposed on a different surface of the terminal 900 or is in a foldable design; in other embodiments, the display 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. Even more, the display screen 905 may be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The display panel 905 may be made of LCD (liquid crystal display), OLED (organic light-emitting diode), or other materials.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of a terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each of the rear cameras is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to implement a background blurring function, the main camera and the wide-angle camera are fused to implement panoramic shooting and a VR (virtual reality) shooting function or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp and can be used for light compensation under different color temperatures.
The positioning component 908 is used to determine the current geographic location of the terminal 900 for navigation or LBS (location based service). The positioning component 908 may be based on the GPS (global positioning system) of the United States, the BeiDou system of China, or the GLONASS system of Russia.
In some embodiments, the terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: an acceleration sensor 911, a gyro sensor 912, a pressure sensor 913, a fingerprint sensor 914, an optical sensor 915, and a proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration on the three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used to collect game or user motion data.
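The landscape/portrait selection described above can be sketched as follows; a minimal illustration in which the function name and axis convention are assumptions, not drawn from the patent:

```python
def choose_orientation(gx: float, gy: float) -> str:
    """Pick a UI orientation from gravity components along the device's
    x (short edge) and y (long edge) axes, as an acceleration sensor
    such as 911 might report them."""
    # Gravity dominating the y axis means the device is held upright.
    return "portrait" if abs(gy) >= abs(gx) else "landscape"

# Device upright: gravity lies mostly along the long edge.
print(choose_orientation(0.5, 9.7))   # portrait
# Device on its side: gravity lies mostly along the short edge.
print(choose_orientation(9.7, 0.5))   # landscape
```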
The gyro sensor 912 can detect the body orientation and rotation angle of the terminal 900, and can cooperate with the acceleration sensor 911 to capture the user's 3D motion on the terminal 900. Based on the data collected by the gyro sensor 912, the processor 901 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization while shooting, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side frame of the terminal 900 and/or underneath the display screen 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, a grip signal of the user on the terminal 900 can be detected, and the processor 901 performs left/right-hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed underneath the display screen 905, the processor 901 controls an operable control on the UI according to the user's pressure operation on the display screen 905. The operable control includes at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 914 is used to collect the user's fingerprint, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 itself identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 901 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or a vendor logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
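The brightness adjustment described above can be sketched as a simple mapping from ambient light intensity to display brightness; the lux thresholds and output range below are illustrative assumptions, not values from the patent:

```python
def display_brightness(ambient_lux: float, lo: float = 50.0, hi: float = 500.0) -> float:
    """Map ambient light intensity to a relative display brightness in
    [0.2, 1.0]: dim environments get the floor, bright environments the
    ceiling, and everything between is interpolated linearly."""
    if ambient_lux <= lo:
        return 0.2
    if ambient_lux >= hi:
        return 1.0
    return 0.2 + 0.8 * (ambient_lux - lo) / (hi - lo)

print(display_brightness(10.0))    # 0.2  (dark room: minimum brightness)
print(display_brightness(1000.0))  # 1.0  (sunlight: maximum brightness)
```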
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 10 is a schematic structural diagram of a computer device, which may be the server (referred to as server 1000) in the foregoing embodiments. The server 1000 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the memory 1002 stores at least one instruction that is loaded and executed by the processor 1001 to implement the methods provided by the foregoing method embodiments. Certainly, the server may further have a wired or wireless network interface, a keyboard, an input/output interface, and other components to facilitate input and output, and may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor in a terminal to perform the method of adding a lyric progress image in the above embodiments is also provided. The computer readable storage medium may be non-transitory. For example, the computer readable storage medium may be a ROM (read-only memory), a RAM (random access memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which includes at least one instruction loaded and executed by a processor to implement the operations performed by the method of adding a lyric progress image as in the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
It should be noted that the information (including but not limited to user equipment information, user personal information, and the like), data (including but not limited to data for analysis, stored data, displayed data, and the like), and signals (including but not limited to signals transmitted between a user terminal and other equipment, and the like) referred to in the present application are all authorized by the user or fully authorized by the parties involved, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the target song videos referred to in this application are all obtained with sufficient authorization.
The above description is only a preferred embodiment of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (12)
1. A method of adding a lyric progress image, the method comprising:
determining the playing time period of each lyric of the target song in the video of the target song to which the lyric belongs;
acquiring a target video frame with playing time within the playing time period from the target song video;
performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song;
determining a lyric time stamp corresponding to the lyric text, wherein the lyric time stamp comprises the starting playing time and the ending playing time of each word in the lyric text;
and generating a lyric image corresponding to the lyric text, and adding a lyric progress image in a video frame in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video added with the lyric progress image.
2. The method of claim 1, wherein determining the playing time period of each lyric of the target song in the target song video comprises:
acquiring a target audio corresponding to the target song video, and performing human voice separation processing on the target audio to obtain a human voice audio in the target audio;
determining a signal energy value corresponding to each audio frame of the human voice audio;
determining each voice audio segment in the voice audio based on a signal energy value corresponding to each audio frame of the voice audio and a signal energy threshold;
and determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
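The energy-based segmentation of claim 2 above can be sketched as follows; a minimal Python illustration in which the frame duration, threshold value, and function names are assumptions, not part of the claims:

```python
def frame_energy(frame):
    """Signal energy of one audio frame: the sum of squared samples."""
    return sum(s * s for s in frame)

def vocal_segments(frames, threshold, frame_dur=20):
    """Merge consecutive frames whose energy exceeds the threshold into
    (start, end) segments in milliseconds (frame_dur is the assumed
    frame length in ms). Each segment stands in for the playing time
    period of one lyric line."""
    segments, start = [], None
    for i, frame in enumerate(frames):
        if frame_energy(frame) > threshold:
            if start is None:
                start = i  # a human-voice segment begins
        elif start is not None:
            segments.append((start * frame_dur, i * frame_dur))
            start = None
    if start is not None:  # audio ends mid-segment
        segments.append((start * frame_dur, len(frames) * frame_dur))
    return segments

# Two loud stretches separated by near-silence.
frames = [[0.0]] * 5 + [[1.0]] * 10 + [[0.0]] * 5 + [[1.0]] * 5
print(vocal_segments(frames, threshold=0.5))  # [(100, 300), (400, 500)]
```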
3. The method according to claim 1, wherein the acquiring a target video frame with a playing time within the playing time period from the target song video comprises:
selecting a specified number of video frames with playing time within the playing time period from the target song video, and determining the specified number of video frames as target video frames;
the performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song comprises:
for each playing time period, respectively carrying out character recognition processing on a specified number of target video frames corresponding to the playing time period to obtain a recognition result corresponding to each target video frame;
and determining a lyric text corresponding to the playing time period according to the recognition result corresponding to each target video frame, and combining the lyric texts corresponding to the playing time periods into the lyric text of the target song.
4. The method according to claim 3, wherein the recognition result comprises at least one region displaying text in the corresponding target video frame, and the text displayed in each region;
the determining the lyric text corresponding to the playing time period according to the recognition result corresponding to each target video frame comprises:
and determining a target area that appears most frequently in the plurality of recognition results, and determining the characters displayed in the target area as the lyric text corresponding to the playing time period.
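The majority-vote selection of the target area in claims 3 and 4 above can be sketched as follows; an illustrative Python fragment in which the representation of a region as a box tuple, and all names, are assumptions:

```python
from collections import Counter

def pick_lyric_region(results):
    """Each recognition result maps a text region (here a hashable box
    tuple) to the text recognized inside it. The region that appears in
    the most results is taken as the target area, and the text most
    often seen there becomes the lyric line for the playing time period."""
    region_counts = Counter(r for result in results for r in result)
    target_region, _ = region_counts.most_common(1)[0]
    texts = Counter(result[target_region]
                    for result in results if target_region in result)
    return target_region, texts.most_common(1)[0][0]

# Three frames of one period: a title box appears once, the lyric box thrice.
results = [
    {(0, 560, 720, 600): "some lyric line", (0, 0, 200, 40): "Song Title"},
    {(0, 560, 720, 600): "some lyric line"},
    {(0, 560, 720, 600): "some lyric line"},
]
region, text = pick_lyric_region(results)
print(region, text)  # (0, 560, 720, 600) some lyric line
```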
5. The method of claim 4, wherein before obtaining the target song video with the lyric progress image added thereto, the method further comprises:
and performing blurring processing on a specified area of each video frame whose playing time is within any playing time period, wherein the specified area is the target area of the lyric text of that playing time period in the target video frame.
6. The method of claim 1, wherein the performing a word recognition process on the target video frame corresponding to each playing time period to obtain a lyric text of the target song comprises:
cutting out the image at the top of the target video frame according to a preset cutting proportion to obtain a video frame to be subjected to character recognition;
and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
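The cropping step of claim 6 above can be sketched as follows; a minimal illustration that treats a frame as a list of pixel rows and reads the claim as keeping the top portion of the frame for recognition (the 0.2 cutting proportion and the names are assumptions):

```python
def crop_top(frame, cut_ratio=0.2):
    """Keep only the top cut_ratio of a frame (a list of pixel rows),
    yielding the smaller image on which character recognition is run.
    The 0.2 cutting proportion is an illustrative assumption."""
    return frame[:int(len(frame) * cut_ratio)]

frame = [[r] * 4 for r in range(10)]  # a toy 10-row "frame"
print(len(crop_top(frame)))  # 2
```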
7. The method of claim 1, wherein the determining the lyric time stamp corresponding to the lyric text comprises:
acquiring a target audio corresponding to the target song video, and performing human voice separation processing on the target audio to obtain a human voice audio in the target audio;
and inputting the lyric text and the human voice audio into an acoustic model to obtain a lyric time stamp corresponding to the lyric text.
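Claim 7 above obtains the lyric time stamp by inputting the lyric text and the human voice audio into an acoustic model. As a self-contained stand-in for such a model, the following sketch simply spreads a line's playing time period uniformly over its characters; this uniform allocation is a naive assumption for illustration only, where a real acoustic model would align the audio against the text:

```python
def naive_char_timestamps(line, start, end):
    """Toy stand-in for the acoustic model of claim 7: spread the
    line's playing time period [start, end] uniformly over its
    characters, yielding one (char, start, end) time stamp each."""
    if not line:
        return []
    step = (end - start) / len(line)
    return [(ch, start + i * step, start + (i + 1) * step)
            for i, ch in enumerate(line)]

stamps = naive_char_timestamps("la la", 10.0, 15.0)
print(stamps[0])   # ('l', 10.0, 11.0)
print(stamps[-1])  # ('a', 14.0, 15.0)
```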
8. The method of claim 1, wherein generating the lyric image corresponding to the lyric text comprises:
and generating a plurality of lyric images corresponding to the lyric text based on preset lyric display information and the lyric text, wherein each lyric image displays at least one lyric, and the lyric display information comprises the line number of the lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
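The lyric display information of claim 8 above can be sketched as a paging step that groups lyric lines into per-image plans before any drawing takes place; the dictionary keys and default font attributes below are assumptions for illustration:

```python
def plan_lyric_images(lines, lines_per_image=2, font=None):
    """Group the lyric text into one plan per lyric image: each plan
    records the lines to draw and the font attribute information to
    draw them with, mirroring the lyric display information of claim 8."""
    font = font or {"family": "sans-serif", "size": 36, "color": "#FFFFFF"}
    return [
        {"lines": lines[i:i + lines_per_image], "font": font}
        for i in range(0, len(lines), lines_per_image)
    ]

plans = plan_lyric_images(["line 1", "line 2", "line 3"])
print(len(plans), plans[1]["lines"])  # 2 ['line 3']
```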
9. The method of claim 1, wherein the adding a lyric progress image to a video frame in the target song video based on a lyric timestamp corresponding to the lyric text and the lyric image to obtain the target song video with the lyric progress image added comprises:
determining position information of the lyric image in a video frame of the target song video, wherein the position information comprises the position of each word in the lyric image in the corresponding video frame;
and inputting the lyric timestamp, the position information, the lyric image, and video frames in the target song video into a shader rendering module, where the shader rendering module adds the lyric image to the corresponding video frames and performs progress special-effect rendering on each word in the lyric image based on the lyric timestamp and the position information, to obtain the target song video with the lyric progress image added.
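The per-word progress special effect of claim 9 above reduces to computing, for each video frame time, how much of each word should be drawn in the highlight color; a minimal helper (names are assumptions, and a real shader would evaluate this per fragment) might look like:

```python
def word_fill_fraction(now, start, end):
    """Fraction of one word to render highlighted at video time `now`,
    given the word's start and end playing times from the lyric
    timestamp: 0 before it is sung, 1 after, linear in between."""
    if now <= start:
        return 0.0
    if now >= end:
        return 1.0
    return (now - start) / (end - start)

# A word sung from 12.0 s to 14.0 s, queried halfway through.
print(word_fill_fraction(13.0, 12.0, 14.0))  # 0.5
```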
10. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to perform operations performed by the method of adding a lyric progress image according to any one of claims 1 to 9.
11. A computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to perform operations performed by the method of adding a lyric progress image according to any one of claims 1 to 9.
12. A computer program product comprising at least one instruction which is loaded and executed by a processor to perform the operations performed by the method of adding a lyric progress image according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210305241.1A CN114760493A (en) | 2022-03-25 | 2022-03-25 | Method, device and storage medium for adding lyric progress image |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210305241.1A CN114760493A (en) | 2022-03-25 | 2022-03-25 | Method, device and storage medium for adding lyric progress image |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114760493A true CN114760493A (en) | 2022-07-15 |
Family
ID=82326432
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210305241.1A Pending CN114760493A (en) | 2022-03-25 | 2022-03-25 | Method, device and storage medium for adding lyric progress image |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114760493A (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH10222184A (en) * | 1997-02-04 | 1998-08-21 | Brother Ind Ltd | Musical sound reproducing device |
JP2000125199A (en) * | 1999-08-17 | 2000-04-28 | Daiichikosho Co Ltd | Method and system for displaying song caption on screen and for changing color of the caption in matching with music |
CN101577811A (en) * | 2009-06-10 | 2009-11-11 | 深圳市茁壮网络股份有限公司 | Digital television Kara OK system and method for realizing function of Kara OK thereof |
CN108829751A (en) * | 2018-05-25 | 2018-11-16 | 腾讯音乐娱乐科技(深圳)有限公司 | Method, apparatus, electronic equipment and the storage medium for generating the lyrics, showing the lyrics |
CN110113677A (en) * | 2018-02-01 | 2019-08-09 | 阿里巴巴集团控股有限公司 | The generation method and device of video subject |
CN110246475A (en) * | 2018-03-07 | 2019-09-17 | 富泰华工业(深圳)有限公司 | Mobile terminal, KTV playing device and song-requesting service device |
CN112380378A (en) * | 2020-11-17 | 2021-02-19 | 北京字跳网络技术有限公司 | Lyric special effect display method and device, electronic equipment and computer readable medium |
CN112380379A (en) * | 2020-11-18 | 2021-02-19 | 北京字节跳动网络技术有限公司 | Lyric special effect display method and device, electronic equipment and computer readable medium |
CN112423107A (en) * | 2020-11-18 | 2021-02-26 | 北京字跳网络技术有限公司 | Lyric video display method and device, electronic equipment and computer readable medium |
CN112714355A (en) * | 2021-03-29 | 2021-04-27 | 深圳市火乐科技发展有限公司 | Audio visualization method and device, projection equipment and storage medium |
CN113537127A (en) * | 2021-07-28 | 2021-10-22 | 深圳创维-Rgb电子有限公司 | Film matching method, device, equipment and storage medium |
CN113626598A (en) * | 2021-08-11 | 2021-11-09 | 平安国际智慧城市科技股份有限公司 | Video text generation method, device, equipment and storage medium |
CN113676772A (en) * | 2021-08-16 | 2021-11-19 | 上海哔哩哔哩科技有限公司 | Video generation method and device |
Non-Patent Citations (2)
Title |
---|
XU YANFEI; WU TIEFENG: "Research and Design of an Android-Based Audio and Video Player", Microprocessors, no. 06, 15 December 2017 (2017-12-15) *
CHEN YI; LI YANJUN; SUN XIAOWEI: "Extraction of Text in Video Using OCR Recognition Technology", Computer Engineering and Applications, no. 10, 1 April 2010 (2010-04-01) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110379430B (en) | Animation display method and device based on voice, computer equipment and storage medium | |
CN110933330A (en) | Video dubbing method and device, computer equipment and computer-readable storage medium | |
CN110322760B (en) | Voice data generation method, device, terminal and storage medium | |
CN110956971B (en) | Audio processing method, device, terminal and storage medium | |
CN112735429B (en) | Method for determining lyric timestamp information and training method of acoustic model | |
CN110992927B (en) | Audio generation method, device, computer readable storage medium and computing equipment | |
CN112487940B (en) | Video classification method and device | |
CN111524501A (en) | Voice playing method and device, computer equipment and computer readable storage medium | |
CN111105788B (en) | Sensitive word score detection method and device, electronic equipment and storage medium | |
CN111428079B (en) | Text content processing method, device, computer equipment and storage medium | |
CN111081277B (en) | Audio evaluation method, device, equipment and storage medium | |
CN111415650A (en) | Text-to-speech method, device, equipment and storage medium | |
CN110867194B (en) | Audio scoring method, device, equipment and storage medium | |
CN113220590A (en) | Automatic testing method, device, equipment and medium for voice interaction application | |
CN108831423B (en) | Method, device, terminal and storage medium for extracting main melody tracks from audio data | |
CN113920979B (en) | Voice data acquisition method, device, equipment and computer readable storage medium | |
CN113593521B (en) | Speech synthesis method, device, equipment and readable storage medium | |
CN112786025B (en) | Method for determining lyric timestamp information and training method of acoustic model | |
CN113362836B (en) | Vocoder training method, terminal and storage medium | |
CN111028823B (en) | Audio generation method, device, computer readable storage medium and computing equipment | |
CN111063372B (en) | Method, device and equipment for determining pitch characteristics and storage medium | |
CN111091807B (en) | Speech synthesis method, device, computer equipment and storage medium | |
CN111125424B (en) | Method, device, equipment and storage medium for extracting core lyrics of song | |
CN114760493A (en) | Method, device and storage medium for adding lyric progress image | |
CN111212323A (en) | Audio and video synthesis method and device, electronic equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||