CN114760493A - Method, device and storage medium for adding lyric progress image - Google Patents

Method, device and storage medium for adding lyric progress image

Info

Publication number
CN114760493A
Authority
CN
China
Prior art keywords
lyric
target
video
playing time
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210305241.1A
Other languages
Chinese (zh)
Inventor
张悦
王武城
黄均昕
董治
赵伟峰
姜涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202210305241.1A
Publication of CN114760493A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233 Processing of audio elementary streams
    • H04N 21/2335 Processing of audio elementary streams involving reformatting operations of audio signals, e.g. by converting from one coding standard to another
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 2D [Two Dimensional] image generation
    • G06T 11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0272 Voice signal separating
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L 25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/87 Detection of discrete points within a voice signal
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/235 Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N 21/2355 Processing of additional data, e.g. scrambling of additional data or processing content descriptors involving reformatting operations of additional data, e.g. HTML pages
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/435 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream
    • H04N 21/4355 Processing of additional data, e.g. decrypting of additional data, reconstructing software from modules extracted from the transport stream involving reformatting operations of additional data, e.g. HTML pages on a television screen
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4398 Processing of audio elementary streams involving reformatting operations of audio signals
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/488 Data services, e.g. news ticker
    • H04N 21/4884 Data services, e.g. news ticker for displaying subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The application discloses a method, a device, and a storage medium for adding a lyric progress image, and belongs to the field of internet technologies. The method includes: determining the playing time period of each lyric of a target song in the target song video to which the lyric belongs; for each playing time period, acquiring, from the target song video, a target video frame whose playing time falls within the playing time period; performing character recognition processing on the target video frames corresponding to each playing time period to obtain the lyric text of the target song; determining a lyric timestamp corresponding to the lyric text, where the lyric timestamp includes the starting playing time and the ending playing time of each word in the lyric text; and generating a lyric image corresponding to the lyric text, and adding a lyric progress image to video frames in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video with the lyric progress image added. With the method and the device, a lyric progress image can be added to a song video.

Description

Method, device and storage medium for adding lyric progress image
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, and a storage medium for adding a lyric progress image.
Background
A lyric progress image is generally added to the song videos provided by a karaoke platform, and each word in the lyric progress image is displayed dynamically along with the singing progress of the song in the song video. For example, the lyrics that have been sung and the lyrics that have not yet been sung are displayed in different colors in the lyric progress image. In this way, the user can sing along with the song played in the song video by following the prompt of the lyric progress image.
However, the song videos on a karaoke platform are produced and provided by song video publishers, and if a publisher does not add a lyric progress image when producing a song video, the user cannot follow the prompt of a lyric progress image to sing the song played in that video. How to add corresponding lyric progress images to existing song videos that lack them has therefore become an urgent technical problem.
Disclosure of Invention
The embodiments of the application provide a method, a device, and a storage medium for adding a lyric progress image, which can be used to add a corresponding lyric progress image to a song video that lacks one. The technical solution is as follows:
in a first aspect, a method for adding a lyric progress image is provided, the method comprising:
Determining the playing time period of each lyric of the target song in the video of the target song to which the lyric belongs;
for each playing time period, acquiring a target video frame with playing time within the playing time period from the target song video;
performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song;
determining a lyric time stamp corresponding to the lyric text, wherein the lyric time stamp comprises the starting playing time and the ending playing time of each word in the lyric text;
and generating a lyric image corresponding to the lyric text, and adding a lyric progress image in a video frame in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video added with the lyric progress image.
Optionally, the determining a playing time period of each lyric of the target song in the target song video includes:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
determining a signal energy value corresponding to each audio frame of the human voice audio;
determining each human voice audio segment in the human voice audio based on the signal energy value corresponding to each audio frame of the human voice audio and a signal energy threshold value;
and determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
Optionally, the obtaining, in the target song video, a target video frame whose playing time is within the playing time period includes:
selecting a specified number of video frames with playing time within the playing time period from the target song video, and determining the specified number of video frames as target video frames;
the performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song comprises:
for each playing time period, respectively carrying out character recognition processing on the specified number of target video frames corresponding to the playing time period to obtain a recognition result corresponding to each target video frame;
and determining a lyric text corresponding to the playing time period according to the identification result corresponding to each target video frame, and forming the lyric text corresponding to each playing time period into a lyric text of the target song.
Optionally, the recognition result includes at least one region displaying text in the corresponding target video frame, and text displayed in each region;
the determining the lyric text corresponding to the playing time period according to the identification result corresponding to each target video frame comprises the following steps:
and determining a target area with the maximum occurrence frequency in a plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
Optionally, before the target song video to which the lyric progress image is added is obtained, the method further includes:
and performing blurring processing on a specified area of the video frames whose playing time lies within any playing time period, where the specified area is the target area of the lyric text of that playing time period in the target video frame.
Optionally, the performing word recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song includes:
cutting out the image at the top of the target video frame according to a preset cutting proportion to obtain a video frame to be subjected to character recognition;
and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
Optionally, the determining the lyric timestamp corresponding to the lyric text includes:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
and inputting the lyric text and the human voice audio into an acoustic model to obtain a lyric timestamp corresponding to the lyric text.
Optionally, the generating a lyric image corresponding to the lyric text includes:
and generating a plurality of lyric images corresponding to the lyric text based on preset lyric display information and the lyric text, wherein each lyric image displays at least one lyric, and the lyric display information comprises the line number of the lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
Optionally, adding a lyric progress image to a video frame in the target song video based on a lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video with the lyric progress image added thereto, including:
determining position information of the lyric image in a video frame of the target song video, wherein the position information comprises the position of each word in the lyric image in the corresponding video frame;
And inputting the lyric timestamp, the position information, the lyric image and a video frame in the target song video into a shader rendering module, and adding the lyric image to a corresponding video frame and performing progress special effect rendering on each word in the lyric image by the shader rendering module based on the lyric timestamp and the position information to obtain the target song video added with the lyric progress image.
In a second aspect, there is provided an apparatus for adding a lyric progress image, the apparatus comprising:
the determining unit is used for determining the playing time period of each lyric of the target song in the target song video to which the lyric belongs;
the acquisition unit is used for acquiring a target video frame with the playing time within the playing time period from the target song video for each playing time period;
the identification unit is used for carrying out character identification processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song;
the determining unit is further configured to determine a lyric timestamp corresponding to the lyric text, where the lyric timestamp includes a start playing time and an end playing time of each word in the lyric text;
And the generation unit is used for generating a lyric image corresponding to the lyric text, and adding a lyric progress image into a video frame in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video with the lyric progress image added.
Optionally, the determining unit is configured to:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
determining a signal energy value corresponding to each audio frame of the human voice audio;
determining each human voice audio segment in the human voice audio based on the signal energy value corresponding to each audio frame of the human voice audio and a signal energy threshold value;
and determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
Optionally, the obtaining unit is configured to:
selecting a specified number of video frames with playing time within the playing time period from the target song video, and determining the specified number of video frames as target video frames;
The identification unit is configured to:
for each playing time period, respectively carrying out character recognition processing on the specified number of target video frames corresponding to the playing time period to obtain a recognition result corresponding to each target video frame;
and determining the lyric text corresponding to each playing time period according to the identification result corresponding to each target video frame, and forming the lyric texts corresponding to the playing time periods into the lyric text of the target song.
Optionally, the recognition result includes at least one region displaying text in the corresponding target video frame, and text displayed in each region;
the acquisition unit is configured to:
and determining a target area with the largest occurrence frequency in a plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
Optionally, before the lyric timestamp corresponding to the lyric text, the lyric image, and the target song video are input to the rendering module, the method further includes:
and performing blurring processing on a designated area of the video frames whose playing time lies within any playing time period, where the designated area is the target area of the lyric text of that playing time period in a target video frame.
Optionally, the identifying unit is configured to:
cutting out the image at the top of the target video frame according to a preset cutting proportion to obtain a video frame to be subjected to character recognition;
and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
Optionally, the determining a lyric timestamp corresponding to the lyric text includes:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
and inputting the lyric text and the human voice audio into an acoustic model to obtain a lyric timestamp corresponding to the lyric text.
Optionally, the generating unit is configured to:
and generating a plurality of lyric images corresponding to the lyric text based on preset lyric display information and the lyric text, wherein each lyric image displays at least one lyric, and the lyric display information comprises the line number of the lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
Optionally, the generating unit is configured to:
Determining position information of the lyric image in a video frame of the target song video, wherein the position information comprises the position of each word in the lyric image in the corresponding video frame;
and inputting the lyric timestamp, the position information, the lyric image and a video frame in the target song video into a shader rendering module, and adding the lyric image to a corresponding video frame and performing progress special effect rendering on each word in the lyric image by the shader rendering module based on the lyric timestamp and the position information to obtain the target song video added with the lyric progress image.
In a third aspect, a computer device is provided, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded by the processor and executed to implement the operations performed by the method for adding a lyric progress image according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the operations performed by the method for adding a lyric progress image according to the first aspect.
In a fifth aspect, a computer program product is provided, and the computer program product includes at least one instruction, which is loaded and executed by a processor to implement the operations performed by the method for adding a lyric progress image according to the first aspect.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
the method and the device determine the lyric text corresponding to the target song according to the video frame of the playing time in the playing time period and determine the corresponding lyric time stamp by determining the playing time period of each sentence of lyrics in the target song video. And further generating a target song video added with a lyric progress image by using a lyric text corresponding to the target song and a corresponding lyric timestamp. By the method and the device, the corresponding lyric progress image can be added into the existing song video without the lyric progress image.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application;
FIG. 2 is a flowchart of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 7 is a schematic diagram illustrating a method for adding a lyric progress image according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of an apparatus for adding a lyric progress image according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a computer device provided by an embodiment of the present application;
fig. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The method for adding a lyric progress image provided by the present application may be implemented by a computer device. The computer device may have a processor and a memory, where the memory may be configured to store the program code implementing the method for adding a lyric progress image, the data to be processed (for example, song videos), and the like, and the processor may be configured to execute the program code stored in the memory and process the data to be processed, so as to obtain a song video to which the lyric progress image has been added. The computer device may be a terminal or a server. When the computer device is a terminal, a karaoke application program provided by a karaoke platform may run on the terminal, and the terminal may be a mobile phone, a tablet computer, a smart wearable device, a desktop computer, a notebook computer, or the like. When the computer device is a server, the server may be a background server of the karaoke platform.
Fig. 1 is a schematic diagram of a possible implementation environment provided by an embodiment of the present application. Referring to fig. 1, the background server of the karaoke platform may obtain, from a song video storage server, a song video to which a lyric progress image needs to be added. The song videos to which lyric progress images are to be added are stored in a music video library in the song video storage server. After obtaining a song video to which a lyric progress image is to be added, the background server of the karaoke platform may add the lyric progress image to the corresponding song video according to the method for adding a lyric progress image provided by the present application, thereby obtaining the song video with the lyric progress image added. In this way, when the terminal runs the karaoke application program (client) provided by the karaoke platform, it can obtain the song video with the lyric progress image added from the background server and then play the song video. The user may sing along according to the progress of the lyrics displayed in the song video. The terminal may record the audio and video of the user singing the song and send the recorded audio and video to the background server.
With the method for adding lyric progress images provided by the present application, corresponding lyric progress images can be added to song videos that do not have them. Fig. 2 shows a method for adding a lyric progress image according to an embodiment of the present application; referring to fig. 2, the method includes:
Step 201, obtaining a target song video to be added with the lyric progress image.
In an implementation, the computer device may obtain a target song video to which a lyric progress image needs to be added from a song video storage server. A target song is played in the target song video, which may be a music video (MV) corresponding to the target song, a video made for the target song, or the like. In the target song video, the lyrics of the target song may be displayed sentence by sentence along with the playing progress of the target song, but the individual words in each lyric line are not displayed dynamically along with the singing progress of the song in the song video.
Step 202, determining the playing time period of each lyric of the target song in the target song video.
In an implementation, after the target song video is obtained, a playing time period of each lyric of the target song in the target song video, that is, a starting playing time point and an ending playing time point of each lyric of the target song in the target song video, may be determined according to audio data included in the target song video. In one possible scenario, the process of determining the playing time period of each lyric of the target song in the target song video may be as shown in fig. 3:
Step 2021, obtaining a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio.
After the target song video is obtained, the target audio, that is, the audio played in the target song video (the audio of the target song), may be extracted from the video file corresponding to the target song video. After the target audio is obtained, human voice separation processing may be performed on the target audio to extract the human voice audio included in the target audio. For example, the human voice audio and the accompaniment audio in the target audio may be separated by an end-to-end residual attention network. The obtained human voice audio is the singing audio of the target song.
It should be understood that the playing time duration of the target audio is the same as the playing time duration of the corresponding human voice audio. The playing time period of any one lyric or any one character in any one lyric is the same between the target audio and the corresponding voice audio.
Step 2022, determining a signal energy value corresponding to each audio frame of the human voice audio.
After the human voice audio is obtained, the segmentation information of the target song, that is, the playing time period of each lyric of the target song in the target song video, can be obtained through Voice Activity Detection (VAD) technology. In the human voice audio, the signal energy corresponding to audio frames containing a human voice is higher, and the signal energy corresponding to audio frames without a human voice is lower. The signal energy value σ²(x) of a frame can be defined, for example, as its mean squared amplitude, σ²(x) = (1/T) Σ_t x_t², where x_t is the audio frame at time t.
Step 2023, determining each voice audio segment in the voice audio based on the signal energy value corresponding to each audio frame of the voice audio and the signal energy threshold.
After the signal energy value corresponding to each audio frame in the human voice audio is calculated, the audio frames including the human voice and the audio frames not including the human voice can be determined according to the preset signal energy threshold value. The audio frames with the signal energy value greater than or equal to the signal energy threshold value can be determined as the audio frames including the human voice, and the audio frames with the signal energy value smaller than the signal energy threshold value can be determined as the audio frames not including the human voice. Thus, after the audio frames including the human voice and the audio frames not including the human voice are determined, the human voice audio segment in the human voice audio can be further determined. For example, a time period corresponding to a plurality of audio frames including a human voice that are consecutive in time may be determined as a human voice audio segment.
A skilled person may also set an omission time interval, since there may be pauses or gaps between the words of a lyric line as the song is sung. That is, for at least one audio frame without a human voice occurring between two audio frames with a human voice, if the duration corresponding to the at least one audio frame without a human voice is less than the preset omission time interval, the at least one audio frame without a human voice may be ignored, that is, the surrounding voiced audio frames are treated as one continuous voiced segment.
Step 2024, determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
After the plurality of human voice audio segments included in the human voice audio are determined, the playing time period of each human voice audio segment in the human voice audio can be determined and used as the playing time period of each lyric of the target song in the target song video.
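The energy-based segmentation described in steps 2022 to 2024 can be sketched as follows. This is an illustrative Python sketch, not part of the application: it assumes the separated human voice audio is already available as a mono array, and the frame length, energy threshold, and omission interval values are placeholder choices.

```python
import numpy as np

def lyric_time_periods(vocals, sr, frame_len=0.025, energy_threshold=1e-4, min_gap=0.3):
    """Return (start_sec, end_sec) playing time periods in which singing is detected."""
    hop = int(frame_len * sr)
    n_frames = len(vocals) // hop
    # Signal energy value per audio frame (mean squared amplitude).
    energy = np.array([np.mean(vocals[i * hop:(i + 1) * hop] ** 2) for i in range(n_frames)])
    voiced = energy >= energy_threshold  # frames assumed to contain a human voice

    periods, start = [], None
    gap_frames = int(min_gap / frame_len)  # the "omission time interval" in frames
    for i, v in enumerate(voiced):
        t = i * frame_len
        if v and start is None:
            start = t
        elif not v and start is not None:
            # Only close the segment if no voiced frame follows within the omission interval.
            if not voiced[i:i + gap_frames].any():
                periods.append((start, t))
                start = None
    if start is not None:
        periods.append((start, n_frames * frame_len))
    return periods
```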
Step 203, for each playing time segment, acquiring a target video frame with the playing time within the playing time segment from the target song video.
After the playing time period of each lyric in the target song video is obtained, a target video frame whose playing time falls within the corresponding playing time period is determined in the target song video. For example, for a playing time period from 1 minute 45 seconds to 1 minute 55 seconds, at least one video frame whose playing time is between 1 minute 45 seconds and 1 minute 55 seconds may be acquired from the target song video as the target video frame corresponding to that playing time period.
Step 204, performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song.
For a playing time period, the target video frame in the playing time period is a video frame displayed in the target song video when the audio of the playing time period is played. Although the lyric progress image is not displayed in the target song video, the lyrics of the target song are still displayed in the target song video sentence by sentence along with the playing progress of the target song. That is, when the audio of the target song is played, the lyrics corresponding to the current audio are displayed in the played target video frame.
Therefore, the lyrics of the corresponding playing time period are displayed in the target video frame acquired for each playing time period. Character recognition processing can thus be performed on the target video frame of each playing time period to obtain the lyrics corresponding to that playing time period, and further obtain the lyric text of the target song.
In order to improve the accuracy of text recognition, multiple target video frames may be obtained in each playing time period, and the lyrics of that playing time period are determined from the multiple target video frames. The corresponding process, shown in fig. 4, includes:
Step 2041, selecting a specified number of video frames with playing time within a playing time period from the target song video, and determining the specified number of video frames as target video frames.
For any playing time period, a specified number of video frames can be selected from the target song video, and the selected video frames are taken as target video frames. Wherein the specified number may be preset by a technician. When selecting the video frames, a specified number of video frames may be randomly selected from the plurality of video frames in the playing time period, or the specified number of video frames may also be selected from the plurality of video frames in the playing time period at equal intervals.
Thus, one lyric corresponds to one playing time period, one playing time period corresponds to a plurality of target video frames, that is, one lyric corresponds to a plurality of target video frames, and the plurality of target video frames can all display the corresponding lyrics.
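The equal-interval selection described in step 2041 can be sketched as follows; this is an illustrative OpenCV sketch, with the video path and the specified number being placeholder values rather than values from the application.

```python
import cv2

def sample_target_frames(video_path, start_sec, end_sec, count=5):
    """Pick `count` target video frames spread evenly across one playing time period."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for i in range(count):
        t_ms = (start_sec + (end_sec - start_sec) * (i + 0.5) / count) * 1000
        cap.set(cv2.CAP_PROP_POS_MSEC, t_ms)  # seek to the sampled timestamp
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```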
Step 2042, for each playing time segment, respectively performing character recognition processing on the specified number of target video frames corresponding to the playing time segment to obtain a recognition result corresponding to each target video frame.
After the multiple target video frames corresponding to each playing time period are obtained, for any playing time period, text recognition processing may be performed on the multiple target video frames corresponding to that playing time period to obtain a recognition result for each target video frame. The recognition result includes the areas of the target video frame in which text is displayed and the text in each corresponding area. For example, the text in the target video frame may be recognized by Optical Character Recognition (OCR) techniques.
Step 2043, according to the identification result corresponding to each target video frame, determining the lyric text corresponding to each playing time period, and forming the lyric texts corresponding to the playing time periods into the lyric text of the target song.
In one possible case, after the word recognition processing is performed on the plurality of target video frames corresponding to a playing time period, the obtained recognition results are consistent and each recognition result includes only one area displaying text; the text in that area may then be determined as the lyrics corresponding to the playing time period.
In another possible case, after performing the word recognition processing on a plurality of target video frames corresponding to a playing time period, and obtaining a plurality of recognition results that are inconsistent, the lyrics corresponding to the playing time period may be determined according to the plurality of recognition results, and the corresponding processing is as follows: and determining a target area with the largest occurrence frequency in the plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
Text other than the lyrics may also be displayed in the target song video. For example, a billboard, a couplet, or a calligraphy work appearing in the picture of the target song video may be recognized when the target video frames are subjected to character recognition processing. However, unlike the lyrics displayed in the target song video, such other text usually changes its display area as the target song video plays, that is, its display area may differ between target video frames, whereas the lyrics are typically displayed in a fixed area, for example, at the bottom of the target video frame. Therefore, after the recognition results corresponding to the plurality of target video frames are obtained, the number of occurrences of each text-displaying region across the recognition results can be determined, and the region with the largest number of occurrences is the region displaying the lyrics. The text displayed in that target area may then be determined as the lyric text of the playing time period.
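The selection of the most frequent text region can be sketched as follows. In this illustrative Python sketch, each OCR result is assumed to be a list of (box, text) pairs for one target video frame, and boxes are quantised so that nearly identical regions are counted together; the tolerance value is an assumption made for the example.

```python
from collections import Counter

def lyric_text_for_period(ocr_results, tol=10):
    """ocr_results: one list of (box, text) pairs per target video frame of the period."""
    region_counter, region_text = Counter(), {}
    for result in ocr_results:
        for box, text in result:
            # Quantise the box coordinates so slightly shifted boxes map to the same region.
            key = tuple(round(c / tol) * tol for c in box)
            region_counter[key] += 1
            region_text[key] = text
    if not region_counter:
        return ""
    target_region, _ = region_counter.most_common(1)[0]  # region with the most occurrences
    return region_text[target_region]
```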
In addition, in the method flow, before the character recognition processing is performed on the target video frames, the target video frames may be cropped, and the corresponding processing is as follows: cropping away the image at the top of the target video frame according to a preset cropping proportion to obtain a video frame to be subjected to character recognition; and performing character recognition processing on the video frame to be subjected to character recognition to obtain the lyric text of the target song.
Watermarks of the song's publisher or distributor may be displayed in the target song video, such as "XX music", "XXTV", and the like, and these watermarks typically appear in the upper left or upper right corner of the target song video, while the lyrics typically appear at the bottom, lower left, or lower right of the target song video. In order to avoid recognizing the characters in the watermark during character recognition, the image at the top of the target video frame can be cropped away before the character recognition processing is performed. A technician may preset the corresponding cropping proportion, which may be, for example, 10:1 or 5:1; the image at the top of each target video frame is then cropped according to the preset cropping proportion, and character recognition processing is performed on the cropped target image. In this way, the influence of the watermark on the recognition result of the target video frame can be avoided, and the accuracy of lyric recognition can be improved.
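A minimal sketch of the top-cropping step is shown below. It is illustrative only and interprets a 5:1 cropping proportion as removing the top fifth of the frame, which is one possible reading of the proportion mentioned above.

```python
def crop_top(frame, ratio=5):
    """Remove the top 1/ratio of the frame and return the remainder for character recognition."""
    h = frame.shape[0]
    return frame[h // ratio:]
```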
In addition, in the method flow, after the character recognition processing is performed on the target video frames, blurring may be applied to a specified area of every video frame whose playing time lies within any playing time period, where the specified area is the target area of the lyric text of that playing time period in the target video frame. That is, the area of the target song video in which the original lyrics appear can be blurred, so that when the lyric progress image is added to the target song video, only the lyrics of the lyric progress image are displayed and the lyrics are not shown twice.
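The blurring of the specified area can be sketched as follows; this illustrative OpenCV sketch assumes the area is given as a rectangle, and the Gaussian kernel size is a placeholder.

```python
import cv2

def blur_lyric_region(frame, region):
    """region = (x, y, w, h): the target area of the original lyric text in the frame."""
    x, y, w, h = region
    roi = frame[y:y + h, x:x + w]
    frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (31, 31), 0)
    return frame
```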
Step 205, determining a lyric time stamp corresponding to the lyric text.
The lyric timestamp comprises the starting playing time and the ending playing time of each word in the lyric text.
In an implementation, after the lyric text of the target song is determined, the starting playing time and the ending playing time of each word in the lyric text in the target song video may be determined according to the audio of the target song and the lyric text. For example, the audio and the lyric text of the target song may be input into an acoustic model, and the acoustic model outputs the starting playing time and the ending playing time of each word, so as to obtain the lyric timestamp corresponding to the lyric text. The acoustic model may be a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM).
In order to obtain the lyric timestamp corresponding to the lyric text more accurately from the audio of the target song, voice separation can be performed on the audio of the target song to obtain the human voice audio of the target song, and the lyric timestamp corresponding to the lyric text is then determined from the obtained human voice audio. In this way, the influence of the accompaniment audio in the target song can be avoided, and the accuracy of the lyric timestamp is improved.
In one possible case, the audio of the target song and the lyric text may be input together into the GMM-HMM model, and the GMM-HMM model outputs the starting playing time and the ending playing time of each word in the lyric text. In another possible case, the singing audio corresponding to each lyric line may be intercepted from the audio of the target song according to the playing time period corresponding to that lyric line, the singing audio and the corresponding lyric line are input into the GMM-HMM model to obtain the lyric timestamp corresponding to each lyric line, and the lyric timestamp corresponding to the target song is then determined from the playing time period and the lyric timestamp corresponding to each lyric line. The process of identifying the lyric timestamp based on the GMM-HMM model may be as shown in FIG. 5: S1, the input audio is cut into audio frames of equal length, and Mel Frequency Cepstrum Coefficient (MFCC) features are extracted for each audio frame. S2, the MFCC feature vector [c1, c2, ..., c39] is input into a pre-trained GMM model to obtain the probability P(xi) that each audio frame xi belongs to a phoneme. S3, the input lyric text is converted into phonemes, and the phonemes are converted into states through a triphone model. S4, the states obtained from the input lyric text are combined with the probabilities P(xi) obtained from the GMM model, the probability that each state Oi generates the audio frame is calculated using the HMM state transition probabilities, and the word sequence with the highest probability for the HMM sequence is determined, thereby establishing the correspondence between the audio and each word in the text. The starting playing time point and the ending playing time point corresponding to each word are then determined from the audio corresponding to that word, and the lyric timestamp corresponding to the target song is obtained.
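The feature-extraction part of this alignment can be sketched as follows. In this illustrative Python sketch, only the MFCC extraction uses real library calls; the GMM-HMM decoding of steps S2 to S4 is represented by a hypothetical forced_align function, since the application does not prescribe a particular implementation.

```python
import numpy as np
import librosa

def lyric_timestamps(vocals, sr, lyric_text, forced_align):
    # 13 static MFCCs plus deltas and delta-deltas give the 39-dimensional
    # feature vector [c1, c2, ..., c39] mentioned above.
    mfcc = librosa.feature.mfcc(y=vocals, sr=sr, n_mfcc=13)
    delta = librosa.feature.delta(mfcc)
    delta2 = librosa.feature.delta(mfcc, order=2)
    features = np.concatenate([mfcc, delta, delta2], axis=0).T  # shape: (frames, 39)
    # forced_align stands in for the GMM-HMM decoding and is assumed to return
    # a (start_sec, end_sec) pair for every word of the lyric text.
    return forced_align(features, lyric_text)
```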
Step 206, generating a lyric image corresponding to the lyric text, and adding a lyric progress image in a video frame of the target song video based on the lyric time stamp and the lyric image corresponding to the lyric text to obtain the target song video added with the lyric progress image.
In an implementation, after the lyric text of the target song is obtained, a corresponding lyric image can be generated for each lyric line in the lyric text. The obtained lyric timestamp, the lyric images, and the target song video can then be input into the shader rendering module, which adds display effects to the lyric images according to the lyric timestamp and renders the lyric images into the target video frames, thereby obtaining the target song video with the lyric progress image added. The further processing, as shown in fig. 6, includes:
Step 2061, generating a plurality of lyric images corresponding to the lyric text based on the preset lyric display information and the lyric text.
The lyric display information comprises the line number of the lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
The lyric display information may be a font configuration file pre-configured by a technician, for example a text in JSON (JavaScript Object Notation) data format. The configuration file includes information such as the font, size, color, and word spacing to be used for the lyrics added to the target song video, the stroke effect (stroke size and color), the shadow effect (shadow radius, offset, and color), and the maximum length of a single line (if the text exceeds the width of the picture, it needs to be split into multiple lines for processing). A plurality of corresponding lyric images can therefore be generated from the lyric text of the target song according to the preset font configuration file and the lyric text. For example, one lyric image may include two lyric lines, where the first line is the lyric being played and the second line is the next lyric line to be played.
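Generating one lyric image from such a configuration can be sketched as follows; this illustrative Pillow sketch uses assumed configuration keys, since the application only lists the kinds of attributes the configuration file contains.

```python
from PIL import Image, ImageDraw, ImageFont

def render_lyric_image(line, config):
    """Render one lyric line onto a transparent image according to a font configuration."""
    font = ImageFont.truetype(config["font_path"], config["font_size"])
    left, top, right, bottom = font.getbbox(line)  # measure the text
    img = Image.new("RGBA", (right - left + 20, bottom - top + 20), (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    draw.text((10, 10), line, font=font, fill=config["color"],
              stroke_width=config.get("stroke_width", 0),
              stroke_fill=config.get("stroke_color"))
    return img
```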
Step 2062, determining the position information of the lyric image in the video frame of the target song video.
The position information includes the position of each word of each lyric image in the corresponding video frame and the position of each lyric image in the corresponding video frame. The position of the lyric image in the video frame may be preset by a technician; for example, the lyric image may be placed at the bottom of the video frame. The position of each word of a lyric image in the corresponding video frame may be calculated from the position of the lyric image in that video frame, for example according to the number of words in the lyric image and the word spacing. In addition, a plurality of lyric images, that is, a plurality of lyric lines, can be included in one video frame, and the specific number of lyric lines can be preset by a technician, which is not limited herein.
Step 2063, the lyric time stamp, the position information, the lyric image and the video frame in the target song video are input to the shader rendering module.
Step 2064, the shader rendering module adds the lyric image to the corresponding video frame and performs progress special effect rendering on each word in the lyric image based on the lyric timestamp and the position information to obtain the target song video added with the lyric progress image.
After the lyric time stamp, the position information and the lyric image are obtained, the lyric time stamp, the position information, the lyric image and a video frame in the target song video can be input into the shader rendering module.
The shader rendering module can determine the video frames into which each lyric image needs to be rendered according to the playing time corresponding to each lyric line in the lyric timestamp, and determine the time period and position for progress special effect rendering of each word in the lyric image according to the playing time corresponding to each word in the lyric timestamp and the position of each word in the position information in the corresponding video frame. The lyric image and the corresponding video frame may then be rendered with a progress special effect by the shader rendering module; for example, the progress special effect may include fade-in and fade-out, scrolling playback, font bounce, and the like. After the shader rendering module finishes rendering all video frames in the target song video, the target song video with the lyric progress image added can be obtained.
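The per-word progress that drives such a special effect can be sketched as follows. This is an illustrative CPU-side sketch of the idea rather than the shader code itself: for the frame at time t it computes, from the lyric timestamp, how much of each word has already been sung.

```python
def word_progress(t, word_times):
    """word_times: list of (start_sec, end_sec) per word, taken from the lyric timestamp."""
    progress = []
    for start, end in word_times:
        if t >= end:
            progress.append(1.0)  # word already sung
        elif t <= start:
            progress.append(0.0)  # word not yet sung
        else:
            progress.append((t - start) / (end - start))  # word partially sung
    return progress
```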
Fig. 7 illustrates the lyric rendering process, taking the rendering of the lyric line "but remember your smile" as an example, and includes: S1, generating the lyric image corresponding to "but remember your smile" from the font configuration file; S2, obtaining the position information by executing the processing of step 2062 through a space-time analyzer, and obtaining the lyric timestamp (time information) for the lyrics recognized by the OCR technology; S3, rendering the lyric image into the video frames of the corresponding song video through the shader rendering module, and rendering the character-by-character display effect; S4, determining whether all video frames corresponding to the lyric line have finished rendering. If they have, the next lyric line is fetched and prepared for rendering; if not, rendering of the next frame is prepared.
In the present application, only one lyric picture needs to be generated for each lyric line. Once generated, the lyric picture can be displayed until the lyric line corresponding to that picture has been fully used. Correspondingly, the space-time analysis module only needs to analyze the spatial information once. In addition, to prevent rendering stalls, multithreading can be started to generate the lyric pictures corresponding to subsequent lyric lines in advance.
By determining the playing time period of each lyric line in the target song video, the method and the device determine the lyric text of the target song from the video frames whose playing time falls within those playing time periods, and determine the corresponding lyric timestamp. The lyric text of the target song and the corresponding lyric timestamp are then used to generate the target song video to which the lyric progress image has been added. In this way, a corresponding lyric progress image can be added to an existing song video that lacks one.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present disclosure, and are not described in detail herein.
Fig. 8 is a device for adding a lyric progress image according to an embodiment of the present application, where the device may be a computer device according to the foregoing embodiment, and referring to fig. 8, the device includes:
A determining unit 810, configured to determine a playing time period of each lyric of the target song in the target song video to which the lyric belongs;
an obtaining unit 820, configured to obtain, for each playing time period, a target video frame with a playing time within the playing time period in the target song video;
the identifying unit 830 is configured to perform word identification processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song;
the determining unit 810 is further configured to determine a lyric timestamp corresponding to the lyric text, where the lyric timestamp includes a start playing time and an end playing time of each word in the lyric text;
the generating unit 840 is configured to generate a lyric image corresponding to the lyric text, and add a lyric progress image to a video frame in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video with the lyric progress image added.
Optionally, the determining unit 810 is configured to:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
Determining a signal energy value corresponding to each audio frame of the human voice audio;
determining each human voice audio segment in the human voice audio based on the signal energy value corresponding to each audio frame of the human voice audio and a signal energy threshold value;
and determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
Optionally, the obtaining unit 820 is configured to:
selecting a specified number of video frames with playing time within the playing time period from the target song video, and determining the specified number of video frames as target video frames;
the identifying unit 830 is configured to:
for each playing time period, respectively carrying out character recognition processing on the specified number of target video frames corresponding to the playing time period to obtain a recognition result corresponding to each target video frame;
and determining a lyric text corresponding to the playing time period according to the identification result corresponding to each target video frame, and forming the lyric text corresponding to each playing time period into a lyric text of the target song.
Optionally, the recognition result includes at least one region displaying text in the corresponding target video frame, and text displayed in each region;
The obtaining unit 820 is configured to:
and determining a target area with the maximum occurrence frequency in a plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
Optionally, before the lyric timestamp corresponding to the lyric text, the lyric image, and the target song video are input to the rendering module, the method further includes:
performing blurring processing on a specified area of the video frames whose playing time lies within any playing time period, where the specified area is the target area of the lyric text of that playing time period in the target video frame.
Optionally, the identifying unit 830 is configured to:
cropping the image at the top of the target video frame according to a preset cropping ratio to obtain a video frame to be subjected to character recognition;
and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
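A minimal sketch of the cropping step, assuming the frame is a NumPy image array and using a 20% ratio (the actual preset cropping ratio is not fixed by this application):

    def crop_top(frame, crop_ratio=0.2):
        # Keep only the top portion of the target video frame for character
        # recognition; crop_ratio stands in for the preset cropping ratio.
        height = frame.shape[0]
        return frame[: int(height * crop_ratio), :]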
Optionally, the determining the lyric timestamp corresponding to the lyric text includes:
acquiring a target audio corresponding to the target song video, and performing human voice separation processing on the target audio to obtain the human voice audio in the target audio;
and inputting the lyric text and the human voice audio into an acoustic model to obtain a lyric timestamp corresponding to the lyric text.
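The application does not specify a concrete acoustic model, so the sketch below assumes a hypothetical forced-alignment callable align(line, vocal_audio_path) that returns per-character (character, start, end) triples; it only illustrates the shape of the resulting lyric timestamp.

    from typing import Callable, List, Tuple

    CharTime = Tuple[str, float, float]  # (character, start_s, end_s)

    def build_lyric_timestamps(lyric_lines: List[str],
                               vocal_audio_path: str,
                               align: Callable[[str, str], List[CharTime]]
                               ) -> List[List[CharTime]]:
        # `align` is a hypothetical interface backed by an acoustic model; it
        # is assumed to return the start playing time and end playing time of
        # every word in one lyric line.
        return [list(align(line, vocal_audio_path)) for line in lyric_lines]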
Optionally, the generating unit 840 is configured to:
and generating a plurality of lyric images corresponding to the lyric text based on preset lyric display information and the lyric text, wherein each lyric image displays at least one lyric, and the lyric display information includes the number of lines of lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
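For illustration only, the Pillow sketch below renders one lyric image per group of lines on a transparent background; the font file, font size, colour and margins stand in for the preset lyric display information and are assumptions, not values taken from this application.

    from PIL import Image, ImageDraw, ImageFont

    def render_lyric_image(lines, width=1280, line_height=60,
                           font_path="NotoSansSC-Regular.otf", font_size=40,
                           fg=(255, 255, 255, 255)):
        # Draw the given lyric lines on a transparent RGBA image so the image
        # can later be composited onto video frames of the target song video.
        img = Image.new("RGBA", (width, line_height * len(lines)), (0, 0, 0, 0))
        draw = ImageDraw.Draw(img)
        font = ImageFont.truetype(font_path, font_size)
        for i, line in enumerate(lines):
            draw.text((10, i * line_height), line, font=font, fill=fg)
        return img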
Optionally, the generating unit 840 is configured to:
determining position information of the lyric image in a video frame of the target song video, wherein the position information comprises the position of each word in the lyric image in the corresponding video frame;
and inputting the lyric time stamp, the position information, the lyric image and the video frame in the target song video into a shader rendering module, and adding the lyric image to the corresponding video frame and performing progress special effect rendering on each word in the lyric image by the shader rendering module based on the lyric time stamp and the position information to obtain the target song video added with the lyric progress image.
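The progress effect itself is produced by the shader rendering module; purely as a CPU-side approximation for illustration (not the shader implementation), the sketch below composites the lyric image onto a frame and tints the characters whose start playing time has already passed. The per-character rectangles and the tint colour are assumptions.

    import numpy as np

    def composite_with_progress(frame, lyric_rgba, top_left, char_entries, t_now,
                                sung_tint=(0, 215, 255)):
        # frame: BGR video frame (uint8 numpy array).
        # lyric_rgba: RGBA lyric image as a numpy array.
        # top_left: (x, y) position of the lyric image inside the frame.
        # char_entries: list of (start_s, end_s, (cx, cy, cw, ch_h)) giving each
        # character's timestamp and its rectangle inside the lyric image.
        x0, y0 = top_left
        h, w = lyric_rgba.shape[:2]
        overlay = lyric_rgba[..., :3].astype(np.float32)[..., ::-1]  # RGB -> BGR
        alpha = lyric_rgba[..., 3:4].astype(np.float32) / 255.0

        # Tint the characters that have already started playing ("sung" part).
        for start_s, end_s, (cx, cy, cw, ch_h) in char_entries:
            if t_now >= start_s:
                overlay[cy:cy + ch_h, cx:cx + cw] = np.array(sung_tint, np.float32)

        # Alpha-composite the (possibly tinted) lyric image onto the video frame.
        roi = frame[y0:y0 + h, x0:x0 + w].astype(np.float32)
        frame[y0:y0 + h, x0:x0 + w] = (alpha * overlay + (1 - alpha) * roi).astype(np.uint8)
        return frame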
By determining the playing time period of each lyric in the target song video, the method and device provided herein determine the lyric text of the target song from the video frames whose playing time falls within those playing time periods, and determine the corresponding lyric timestamps. A target song video with a lyric progress image added is then generated from the lyric text of the target song and the corresponding lyric timestamps. In this way, a corresponding lyric progress image can be added to an existing song video that does not have one.
It should be noted that: in the apparatus for adding an image of lyric progress according to the above embodiment, when adding an image of lyric progress, the division of each function module is merely used for illustration, and in practical applications, the function distribution may be completed by different function modules according to needs, that is, the internal structure of the device may be divided into different function modules, so as to complete all or part of the functions described above. In addition, the apparatus for adding an image of a lyric progress and the method embodiment for adding an image of a lyric progress provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment, and are not described herein again.
Fig. 9 shows a block diagram of a computer device according to an exemplary embodiment of the present application. The computer device may be the terminal in the above-described embodiments (hereinafter referred to as terminal 900). The terminal 900 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 900 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
In general, terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 901 may be implemented in at least one hardware form of a DSP (digital signal processor), an FPGA (field-programmable gate array), or a PLA (programmable logic array). The processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a Central Processing Unit (CPU), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (graphics processing unit), which is responsible for rendering and drawing the content to be displayed by the display screen. In some embodiments, the processor 901 may further include an AI (artificial intelligence) processor for processing computing operations related to machine learning.
Memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 902 is used to store at least one instruction for execution by processor 901 to implement the method of adding a lyric progress image provided by the method embodiments herein.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (input/output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The radio frequency circuit 904 is used for receiving and transmitting RF (radio frequency) signals, also called electromagnetic signals. The radio frequency circuitry 904 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 904 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 904 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (wireless fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (near field communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, the display screen 905 also has the ability to capture touch signals on or over the surface of the display screen 905. The touch signal may be input to the processor 901 as a control signal for processing. At this point, the display 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 905 may be one, disposed on the front panel of the terminal 900; in other embodiments, the number of the display panels 905 may be at least two, and each of the display panels is disposed on a different surface of the terminal 900 or is in a foldable design; in other embodiments, the display 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. Even more, the display screen 905 may be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The display panel 905 may be made of LCD (liquid crystal display), OLED (organic light-emitting diode), or other materials.
The camera assembly 906 is used to capture images or video. Optionally, camera assembly 906 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of a terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each of the rear cameras is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to implement a background blurring function, the main camera and the wide-angle camera are fused to implement panoramic shooting and a VR (virtual reality) shooting function or other fusion shooting functions. In some embodiments, camera assembly 906 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp and can be used for light compensation under different color temperatures.
Audio circuitry 907 may include a microphone and speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 901 for processing or inputting the electric signals to the radio frequency circuit 904 for realizing voice communication. For stereo sound acquisition or noise reduction purposes, the microphones may be multiple and disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omni-directional acquisition microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic location of the terminal 900 for navigation or LBS (location based service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the GLONASS system of Russia.
Power supply 909 is used to provide power to the various components in terminal 900. The power source 909 may be alternating current, direct current, disposable or rechargeable. When the power source 909 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: an acceleration sensor 911, a gyro sensor 912, a pressure sensor 913, a fingerprint sensor 914, an optical sensor 915, and a proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 912 can detect the body direction and the rotation angle of the terminal 900, and the gyro sensor 912 can cooperate with the acceleration sensor 911 to acquire the 3D motion of the user on the terminal 900. Based on the data collected by gyroscope sensor 912, processor 901 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization while shooting, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side frame of the terminal 900 and/or underneath the display 905. When the pressure sensor 913 is disposed on the side frame of the terminal 900, the holding signal of the user to the terminal 900 may be detected, and the processor 901 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 913. When the pressure sensor 913 is disposed at the lower layer of the display screen 905, the processor 901 controls the operable control on the UI interface according to the pressure operation of the user on the display screen 905. The operability control comprises at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 914 is used for collecting a fingerprint of the user, and the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, processor 901 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor Logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor Logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 can also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
Proximity sensor 916, also known as a distance sensor, is typically disposed on the front panel of terminal 900. The proximity sensor 916 is used to collect the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually decreases, the processor 901 controls the display 905 to switch from the bright screen state to the dark screen state; when the proximity sensor 916 detects that the distance between the user and the front face of the terminal 900 gradually increases, the processor 901 controls the display 905 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 9 does not constitute a limitation of terminal 900, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 10 is a schematic structural diagram of a computer device, which may be the server in the foregoing embodiments (hereinafter referred to as server 1000). The server 1000 may vary considerably in configuration and performance, and may include one or more processors (CPUs) 1001 and one or more memories 1002, where the memory 1002 stores at least one instruction that is loaded and executed by the processor 1001 to implement the methods provided by the foregoing method embodiments. Certainly, the server may further have a wired or wireless network interface, a keyboard, an input/output interface, and other components to facilitate input and output, and the server may further include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor in a terminal to perform the method of adding a lyric progress image in the above embodiments is also provided. The computer readable storage medium may be non-transitory. For example, the computer readable storage medium may be a ROM (read-only memory), a RAM (random access memory), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product is also provided, which includes at least one instruction loaded and executed by a processor to implement the operations performed by the method of adding a lyric progress image as in the above embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
It should be noted that information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, displayed data, etc.), and signals (including but not limited to signals transmitted between a user terminal and other equipment, etc.) referred to in the present application are authorized by a user or are sufficiently authorized by various parties, and the collection, use, and processing of the relevant data need to comply with relevant laws and regulations and standards in relevant countries and regions. For example, the target song videos referred to in this application are all obtained with sufficient authorization.
The above description is only a preferred embodiment of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (12)

1. A method of adding a lyric progress image, the method comprising:
determining the playing time period of each lyric of the target song in the video of the target song to which the lyric belongs;
acquiring a target video frame with playing time within the playing time period from the target song video;
performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song;
determining a lyric time stamp corresponding to the lyric text, wherein the lyric time stamp comprises the starting playing time and the ending playing time of each word in the lyric text;
and generating a lyric image corresponding to the lyric text, and adding a lyric progress image in a video frame in the target song video based on the lyric timestamp and the lyric image corresponding to the lyric text to obtain the target song video added with the lyric progress image.
2. The method of claim 1, wherein determining the playing time period of each lyric of the target song in the target song video comprises:
Acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
determining a signal energy value corresponding to each audio frame of the human voice audio;
determining each voice audio segment in the voice audio based on a signal energy value corresponding to each audio frame of the voice audio and a signal energy threshold;
and determining the playing time period of the human voice audio segment in the human voice audio as the playing time period of each lyric of the target song in the target song video.
3. The method according to claim 1, wherein the obtaining a target video frame with a playing time within the playing time period in the target song video comprises:
selecting a specified number of video frames with playing time within the playing time period from the target song video, and determining the specified number of video frames as target video frames;
the performing character recognition processing on the target video frame corresponding to each playing time period to obtain a lyric text of the target song comprises:
for each playing time period, respectively carrying out character recognition processing on a specified number of target video frames corresponding to the playing time period to obtain a recognition result corresponding to each target video frame;
And determining a lyric text corresponding to the playing time period according to the identification result corresponding to each target video frame, and forming the lyric text corresponding to each playing time period into a lyric text of the target song.
4. The method according to claim 3, wherein the recognition result comprises at least one region displaying text in the corresponding target video frame, and the text displayed in each region;
the determining the lyric text corresponding to the playing time period according to the identification result corresponding to each target video frame comprises the following steps:
and determining a target area with the maximum occurrence frequency in a plurality of recognition results, and determining characters displayed in the target area as corresponding lyric texts in the playing time period.
5. The method of claim 4, wherein before obtaining the target song video with the lyric progress image added thereto, the method further comprises:
and performing blur processing on a specified area of the video frame whose playing time is within any playing time period, wherein the specified area is a target area of the lyric text of any playing time period in the target video frame.
6. The method of claim 1, wherein the performing a word recognition process on the target video frame corresponding to each playing time period to obtain a lyric text of the target song comprises:
Cutting out the image at the top of the target video frame according to a preset cutting proportion to obtain a video frame to be subjected to character recognition;
and performing character recognition processing on the video frame to be subjected to character recognition to obtain a lyric text of the target song.
7. The method of claim 1, wherein the determining the lyric timestamp corresponding to the lyric text comprises:
acquiring a target audio corresponding to the target song video, and performing voice separation processing on the target audio to obtain a voice audio in the target audio;
and inputting the lyric text and the human voice audio into an acoustic model to obtain a lyric timestamp corresponding to the lyric text.
8. The method of claim 1, wherein generating the lyric image corresponding to the lyric text comprises:
and generating a plurality of lyric images corresponding to the lyric text based on preset lyric display information and the lyric text, wherein each lyric image displays at least one lyric, and the lyric display information comprises the number of lines of lyric text displayed in each lyric image and font attribute information corresponding to the words displayed in each lyric image.
9. The method of claim 1, wherein the adding a lyric progress image to a video frame in the target song video based on a lyric timestamp corresponding to the lyric text and the lyric image to obtain the target song video with the lyric progress image added comprises:
determining position information of the lyric image in a video frame of the target song video, wherein the position information comprises the position of each word in the lyric image in the corresponding video frame;
and inputting the lyric timestamp, the position information, the lyric image and a video frame in the target song video into a shader rendering module, and adding the lyric image to a corresponding video frame and performing progress special effect rendering on each word in the lyric image by the shader rendering module based on the lyric timestamp and the position information to obtain the target song video added with the lyric progress image.
10. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, the at least one instruction being loaded and executed by the processor to perform operations performed by the method of adding a lyric progress image according to any one of claims 1 to 9.
11. A computer-readable storage medium having stored therein at least one instruction which is loaded and executed by a processor to perform operations performed by the method of adding a lyric progress image according to any one of claims 1 to 9.
12. A computer program product comprising at least one instruction which is loaded and executed by a processor to perform the operations performed by the method of adding a lyric progress image according to any one of claims 1 to 9.
CN202210305241.1A 2022-03-25 2022-03-25 Method, device and storage medium for adding lyric progress image Pending CN114760493A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210305241.1A CN114760493A (en) 2022-03-25 2022-03-25 Method, device and storage medium for adding lyric progress image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210305241.1A CN114760493A (en) 2022-03-25 2022-03-25 Method, device and storage medium for adding lyric progress image

Publications (1)

Publication Number Publication Date
CN114760493A true CN114760493A (en) 2022-07-15

Family

ID=82326432

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210305241.1A Pending CN114760493A (en) 2022-03-25 2022-03-25 Method, device and storage medium for adding lyric progress image

Country Status (1)

Country Link
CN (1) CN114760493A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10222184A (en) * 1997-02-04 1998-08-21 Brother Ind Ltd Musical sound reproducing device
JP2000125199A (en) * 1999-08-17 2000-04-28 Daiichikosho Co Ltd Method and system for displaying song caption on screen and for changing color of the caption in matching with music
CN101577811A (en) * 2009-06-10 2009-11-11 深圳市茁壮网络股份有限公司 Digital television Kara OK system and method for realizing function of Kara OK thereof
CN110113677A (en) * 2018-02-01 2019-08-09 阿里巴巴集团控股有限公司 The generation method and device of video subject
CN110246475A (en) * 2018-03-07 2019-09-17 富泰华工业(深圳)有限公司 Mobile terminal, KTV playing device and song-requesting service device
CN108829751A (en) * 2018-05-25 2018-11-16 腾讯音乐娱乐科技(深圳)有限公司 Method, apparatus, electronic equipment and the storage medium for generating the lyrics, showing the lyrics
CN112380378A (en) * 2020-11-17 2021-02-19 北京字跳网络技术有限公司 Lyric special effect display method and device, electronic equipment and computer readable medium
CN112380379A (en) * 2020-11-18 2021-02-19 北京字节跳动网络技术有限公司 Lyric special effect display method and device, electronic equipment and computer readable medium
CN112423107A (en) * 2020-11-18 2021-02-26 北京字跳网络技术有限公司 Lyric video display method and device, electronic equipment and computer readable medium
CN112714355A (en) * 2021-03-29 2021-04-27 深圳市火乐科技发展有限公司 Audio visualization method and device, projection equipment and storage medium
CN113537127A (en) * 2021-07-28 2021-10-22 深圳创维-Rgb电子有限公司 Film matching method, device, equipment and storage medium
CN113626598A (en) * 2021-08-11 2021-11-09 平安国际智慧城市科技股份有限公司 Video text generation method, device, equipment and storage medium
CN113676772A (en) * 2021-08-16 2021-11-19 上海哔哩哔哩科技有限公司 Video generation method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XU YANFEI; WU TIEFENG: "Research and Design of Audio and Video Player Based on Android", Microprocessors, no. 06, 15 December 2017 (2017-12-15) *
CHEN YI; LI YANJUN; SUN XIAOWEI: "Extraction of Text in Video Using OCR Recognition Technology", Computer Engineering and Applications, no. 10, 1 April 2010 (2010-04-01) *

Similar Documents

Publication Publication Date Title
CN110379430B (en) Animation display method and device based on voice, computer equipment and storage medium
CN110933330A (en) Video dubbing method and device, computer equipment and computer-readable storage medium
CN110322760B (en) Voice data generation method, device, terminal and storage medium
CN110956971B (en) Audio processing method, device, terminal and storage medium
CN112735429B (en) Method for determining lyric timestamp information and training method of acoustic model
CN110992927B (en) Audio generation method, device, computer readable storage medium and computing equipment
CN112487940B (en) Video classification method and device
CN111524501A (en) Voice playing method and device, computer equipment and computer readable storage medium
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN111081277B (en) Audio evaluation method, device, equipment and storage medium
CN111415650A (en) Text-to-speech method, device, equipment and storage medium
CN110867194B (en) Audio scoring method, device, equipment and storage medium
CN113220590A (en) Automatic testing method, device, equipment and medium for voice interaction application
CN108831423B (en) Method, device, terminal and storage medium for extracting main melody tracks from audio data
CN113920979B (en) Voice data acquisition method, device, equipment and computer readable storage medium
CN113593521B (en) Speech synthesis method, device, equipment and readable storage medium
CN112786025B (en) Method for determining lyric timestamp information and training method of acoustic model
CN113362836B (en) Vocoder training method, terminal and storage medium
CN111028823B (en) Audio generation method, device, computer readable storage medium and computing equipment
CN111063372B (en) Method, device and equipment for determining pitch characteristics and storage medium
CN111091807B (en) Speech synthesis method, device, computer equipment and storage medium
CN111125424B (en) Method, device, equipment and storage medium for extracting core lyrics of song
CN114760493A (en) Method, device and storage medium for adding lyric progress image
CN111212323A (en) Audio and video synthesis method and device, electronic equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination