CN113516971A - Lyric conversion point detection method, device, computer equipment and storage medium - Google Patents

Lyric conversion point detection method, device, computer equipment and storage medium

Info

Publication number
CN113516971A
CN113516971A (application number CN202110775920.0A)
Authority
CN
China
Prior art keywords
audio data
waveform
target audio
target
human voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110775920.0A
Other languages
Chinese (zh)
Other versions
CN113516971B (en)
Inventor
萧博耀
高旋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Wondershare Software Co Ltd
Original Assignee
Shenzhen Sibo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sibo Technology Co ltd filed Critical Shenzhen Sibo Technology Co ltd
Priority to CN202110775920.0A priority Critical patent/CN113516971B/en
Publication of CN113516971A publication Critical patent/CN113516971A/en
Application granted granted Critical
Publication of CN113516971B publication Critical patent/CN113516971B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/81 Detection of presence or absence of voice signals for discriminating voice from music
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The embodiment of the invention discloses a lyric conversion point detection method and apparatus, a computer device, and a storage medium, relating to the technical field of audio processing. The method comprises the following steps: acquiring target audio data; detecting the target audio data to obtain the beat of the target audio data; performing human voice separation processing on the target audio data to obtain human voice data; calculating the amplitude of the human voice data to obtain a human voice energy waveform; preprocessing the human voice energy waveform to obtain a target waveform; and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine the lyric conversion point. The method enables machine equipment to effectively distinguish music from human voice: by checking the processed human voice data against the beat of the target audio data and the preset conversion condition, the lyric conversion point is determined accurately, greatly improving both the precision and the efficiency of locating lyric conversion points.

Description

Lyric conversion point detection method, device, computer equipment and storage medium
Technical Field
The invention relates to the technical field of audio processing, in particular to a method and a device for detecting a lyric conversion point, computer equipment and a storage medium.
Background
Beat-synced ("card point") video editing is now common. The typical workflow is that a user adds dynamic and static images and selects a piece of music, and the software automatically generates a video whose transition and rendering time points are deliberately tied to the selected music; for example, transitions can land on the drum points, downbeats, and special-effect points of the music, so that the automatically generated video does not clash with the music and instead looks like the result of lengthy, careful manual editing.
Based on the demand for such beat-synced videos, common music video editing results can be generalized: besides traditional musical feature points such as downbeats and drum points, the lyric conversion point (the time point at which an instrumental passage in a song ends and the human voice begins singing) is also very suitable as a transition or rendering time point.
However, detecting the human voice in music has long been a difficult and challenging problem in the field of MIR (Music Information Retrieval). A song contains both music and human voice, and the frequency spectra of the two overlap and interfere with each other. Although the human ear can clearly pick out the human voice in music, computers and other machine equipment cannot effectively distinguish music from human voice. In the prior art, lyric conversion points are mainly located manually, and the precision and efficiency of that approach are low.
Disclosure of Invention
The embodiment of the invention provides a lyric conversion point detection method and apparatus, a computer device, and a storage medium, aiming to solve the low precision and low efficiency of locating lyric conversion points manually in the prior art.
In a first aspect, an embodiment of the present invention provides a method for detecting a lyric conversion point, where the method for detecting a lyric conversion point includes:
acquiring target audio data; detecting the target audio data to obtain the beat of the target audio data; carrying out voice separation processing on the target audio data to obtain voice data; calculating the amplitude of the human voice data to obtain a human voice energy waveform; preprocessing the human voice energy waveform to obtain a target waveform; and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine a conversion point of the lyrics.
In a second aspect, an embodiment of the present invention further provides an apparatus for detecting a lyric conversion point, where the apparatus includes:
an acquisition unit configured to acquire target audio data;
the detection unit is used for detecting the target audio data to obtain the beat of the target audio data;
the separation unit is used for carrying out voice separation processing on the target audio data to obtain voice data;
the computing unit is used for computing the amplitude of the human voice data to obtain a human voice energy waveform;
the preprocessing unit is used for preprocessing the human voice energy waveform to obtain a target waveform;
and the determining unit is used for detecting the target waveform according to the beat of the target audio data and a preset conversion condition so as to determine the conversion point of the lyrics.
In a third aspect, an embodiment of the present invention further provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and the processor implements the method when executing the computer program.
In a fourth aspect, the present invention also provides a computer-readable storage medium, which stores a computer program, and the computer program realizes the above method when being executed by a processor.
The embodiment of the invention provides a method and an apparatus for detecting a lyric conversion point, a computer device, and a storage medium, wherein the method includes: acquiring target audio data; detecting the target audio data to obtain the beat of the target audio data; performing human voice separation processing on the target audio data to obtain human voice data; calculating the amplitude of the human voice data to obtain a human voice energy waveform; preprocessing the human voice energy waveform to obtain a target waveform; and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine the lyric conversion point. The method enables machine equipment to effectively distinguish music from human voice: by checking the processed human voice data against the beat of the target audio data and the preset conversion condition, the lyric conversion point is determined accurately, greatly improving both the precision and the efficiency of locating lyric conversion points.
Drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flow chart of a method for detecting a lyric conversion point according to an embodiment of the present invention;
FIG. 2 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present invention;
FIG. 3 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present invention;
FIG. 4 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present invention;
FIG. 5 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present invention;
FIG. 6 is a schematic sub-flowchart of a lyric conversion point detection method according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a lyric conversion point detection apparatus according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of a computer device provided by an embodiment of the present invention;
FIG. 9 is a waveform of a target audio and a waveform of human voice data isolated therefrom in one embodiment;
FIG. 10 is a human voice data waveform and a human voice energy waveform in one embodiment;
FIG. 11 is a waveform diagram of the intermediate waveforms obtained during processing and the lyric conversion point waveform in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It is to be understood that the terms "includes" and "including," when used in this specification and the appended claims, specify the presence of the stated features but do not preclude the presence or addition of other features. It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when," "upon," "in response to a determination," or "in response to a detection." Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining," "in response to determining," "upon detecting [the described condition or event]," or "in response to detecting [the described condition or event]."
Referring to fig. 1, fig. 1 is a schematic flow chart of a lyric conversion point detection method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps S1-S6.
S1, target audio data is acquired.
In a specific implementation, target audio data is acquired. The target audio data includes human voice data and background music data. In one embodiment, the target audio data may be in a common audio format such as MP3, WAV, or OGG; the format of the target audio data is not particularly limited herein.
It is understood that, in order to detect the lyric conversion point, the human voice data and the background music data in the target audio data need to be separated, i.e., step S3.
And S2, detecting the target audio data to obtain the beat of the target audio data.
In specific implementation, the target audio data is detected to obtain the beat of the target audio data. In an embodiment, the target audio data is input into a beat detection model for beat detection to obtain a beat of the target audio data.
The beat (meter) is the organizational form in music that expresses a fixed unit of duration together with a regular pattern of strong and weak accents. The beat has two characteristics: periodicity and continuity. Periodicity shows up as the beat structure, a periodically recurring rhythmic sequence in a musical piece. Common meters are 1/4, 2/4, 3/4, 4/4, 3/8, 6/8, 7/8, 9/8, 12/8, and so on, with the length of each bar fixed. The meter of a piece of music is fixed at composition time and does not change. Accurate detection of the beat therefore helps improve the precision of lyric conversion point detection.
The duration of one beat differs between target audio data, and may even differ between music pieces within the same target audio data. It must be calculated together with the tempo in BPM (Beats Per Minute): if the tempo is 120 beats per minute, one beat lasts 60/120 = 0.5 seconds; if the tempo is 80 beats per minute, one beat lasts 60/80 = 0.75 seconds; and so on.
It should be noted that, in an embodiment, the tempo (BPM) of the target audio data is also estimated: the target audio data is input into a music analysis module to obtain its tempo, and from the tempo the duration of one beat can be calculated for use in determining the lyric conversion point. In this embodiment, the target audio data was detected to have a tempo of 108 beats per minute.
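For illustration only, the tempo estimation step can be sketched with librosa's built-in beat tracker standing in for the music analysis module; this is not the beat detection model the embodiment trains, and the file name song.mp3 is a placeholder:

```python
import librosa

# Load the target audio data as a mono signal ("song.mp3" is a hypothetical path).
y, sr = librosa.load("song.mp3", sr=22050, mono=True)

# Estimate the tempo (BPM) and the beat positions.
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)

# Duration of one beat in seconds, as computed above: 60 / BPM.
beat_duration = 60.0 / float(tempo)
print(f"tempo = {float(tempo):.1f} BPM, one beat = {beat_duration:.2f} s")
```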
Referring to fig. 2, in an embodiment, the step S2 specifically includes: steps S201-S202.
S201, performing audio feature extraction on the target audio data to obtain audio features of the target audio data.
In a specific implementation, audio feature extraction is performed on the target audio data to obtain the audio features of the target audio data. In one embodiment, this may be implemented as follows: low-pass filtering the target audio data to obtain a low-pass audio signal; framing the low-pass audio signal according to a preset frame shift and at least one frame-length threshold to obtain at least one framed audio signal set, where different framed audio signal sets correspond to different frame-length thresholds, each framed audio signal set includes at least two sub audio signals, and the frame length of each sub audio signal equals the frame-length threshold corresponding to its set; extracting features from each framed audio signal set to obtain the sub audio features corresponding to that set; and concatenating the sub audio features corresponding to each framed audio signal set to obtain the audio features of the target audio data.
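A minimal sketch of the low-pass filtering and multi-resolution framing described above, continuing from the loaded signal y, sr; the cutoff frequency and the concrete frame lengths and frame shift are assumptions, since the embodiment does not fix them:

```python
import numpy as np
from scipy.signal import butter, sosfilt

def lowpass(signal, sr, cutoff_hz=4000.0):
    # 4th-order Butterworth low-pass; the 4 kHz cutoff is an assumed value.
    sos = butter(4, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfilt(sos, signal)

def frame_signal(signal, frame_len, hop_len):
    # Split a 1-D signal into overlapping frames of frame_len samples.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx]

low = lowpass(y, sr)
# Two frame-length thresholds sharing one frame shift (sizes are placeholders);
# per-set features would be extracted from each set and then concatenated.
frames_a = frame_signal(low, frame_len=1024, hop_len=512)
frames_b = frame_signal(low, frame_len=2048, hop_len=512)
```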
S202, performing beat detection on the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
In a specific implementation, a beat detection model is used to perform beat detection on the audio features of the target audio data to obtain the beat of the target audio data. In an embodiment, the beat detection model is trained on training samples and the beat labels corresponding to those samples, and the trained model then performs beat detection on the audio features of the audio under detection to obtain the beat of the target audio data.
Referring to fig. 3, the step S202 specifically includes: steps S2021-S2022.
S2021, performing stacking processing on the audio features of the target audio data to obtain output features.
In a specific implementation, the audio features of the target audio data are stacked to obtain the output features. After the processing unit stacks the audio features of the target audio data, the output features are obtained; the output features are time-series data of the same length as the audio features of the target audio data.
S2022, inputting the output features into a classifier to obtain the beat of the target audio data.
In a specific implementation, the output features are input into a classifier to obtain the beat of the target audio data. The output features are fed into the classifier so that each frame's output feature is mapped to a time point along the time axis, producing a beat detection result for each time point; these results constitute the beat of the target audio data.
In an embodiment, the beat detection model is obtained by training on training samples and the beat labels corresponding to those samples. In a specific implementation: a training sample with a corresponding beat label is obtained; audio feature extraction is performed on the training sample to obtain its audio features; the beat detection model is called to detect those audio features and produce a prediction result; and the beat detection model is optimized based on the beat label and the prediction result to obtain the optimized beat detection model.
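The embodiment does not disclose the model architecture; purely as a stand-in, a frame-level logistic regression illustrates the train-then-predict loop over stacked features and beat labels (all arrays here are synthetic placeholders):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
train_feats = rng.normal(size=(5000, 40))            # placeholder stacked features
beat_labels = (rng.random(5000) < 0.1).astype(int)   # placeholder per-frame beat labels

clf = LogisticRegression(max_iter=1000)
clf.fit(train_feats, beat_labels)

# Per-frame beat probability for unseen audio; thresholding yields beat time points.
test_feats = rng.normal(size=(100, 40))
beat_prob = clf.predict_proba(test_feats)[:, 1]
```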
And S3, carrying out voice separation processing on the target audio data to obtain voice data.
In a specific implementation, human voice separation processing is performed on the target audio data to obtain human voice data. In one embodiment, the target audio data is input into an audio separation tool to extract the human voice data from the target audio data. A vocal-track separator built with artificial intelligence techniques may serve as the audio separation tool; for example, the interface provided by the open-source project Spleeter (a vocal-track AI separation tool released under the MIT license) can be used to perform track separation on the target audio data and obtain the human voice data in it. The track separator above is only one implementation of the audio separation tool of the present application; the audio separation tool used for human voice separation processing is not particularly limited.
As shown in fig. 9, a curve W1 is a waveform of the target audio data, and a curve W2 is a waveform of the human voice data separated from the target audio data.
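A minimal sketch of the Spleeter-based separation using its documented two-stem model; the input and output paths are placeholders:

```python
from spleeter.separator import Separator

# "spleeter:2stems" splits a track into vocals and accompaniment.
separator = Separator("spleeter:2stems")

# Writes output/song/vocals.wav and output/song/accompaniment.wav
# ("song.mp3" and "output" are hypothetical paths).
separator.separate_to_file("song.mp3", "output")
```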
And S4, calculating the amplitude of the human voice data to obtain a human voice energy waveform.
In a specific implementation, the amplitude of the human voice data is calculated to obtain a human voice energy waveform. In one embodiment, the dBFS (decibels relative to full scale) value of the human voice data is calculated as the amplitude of the human voice energy waveform. The calculation formula is:
value_dBFS = 20 * log10(rms(signal) * sqrt(2)) = 20 * log10(rms(signal)) + 3.0103
where signal is the human voice data.
As shown in fig. 10, the curve W2 is a human voice data waveform, and the curve W3 is a human voice energy waveform.
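A minimal sketch of step S4 under the formula above, computing a frame-wise dBFS envelope of the separated vocals; the 50 ms frame size is an assumption, and the vocals path follows the Spleeter sketch:

```python
import librosa
import numpy as np

# Load the separated human voice data (path from the Spleeter sketch).
vocals, sr = librosa.load("output/song/vocals.wav", sr=22050, mono=True)

def dbfs_envelope(signal, sr, frame_sec=0.05):
    # Frame-wise RMS converted to dBFS: 20*log10(rms*sqrt(2)).
    n = int(sr * frame_sec)
    n_frames = len(signal) // n
    frames = signal[: n_frames * n].reshape(n_frames, n)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    rms = np.maximum(rms, 1e-10)  # avoid log10(0) on silent frames
    return 20.0 * np.log10(rms * np.sqrt(2.0))

energy_db = dbfs_envelope(vocals, sr)  # the human voice energy waveform (W3)
```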
And S5, preprocessing the human voice energy waveform to obtain a target waveform.
In a specific implementation, the human voice energy waveform is preprocessed to obtain a target waveform. Preprocessing converts the human voice energy into a square wave of fixed amplitude, which facilitates the detection of the lyric conversion point.
Referring to fig. 4, in an embodiment, the step S5 specifically includes: steps S501-S503.
S501, smoothing the human voice energy waveform to obtain a smooth energy waveform.
In a specific implementation, the human voice energy waveform is smoothed to obtain a smoothed energy waveform. In actual processing, the human voice energy waveform obtained in step S4 tends to contain high-frequency glitches that would interfere with the subsequent detection of the lyric conversion point; the waveform is therefore smoothed to remove the high-frequency glitches and stabilize its amplitude.
Referring to fig. 5, in an embodiment, the step S501 specifically includes: steps S5011-S5012.
S5011, calling a window function to calculate the weight.
In a specific implementation, a window function is called to calculate the weights. Different window functions affect the signal spectrum differently because they produce different amounts of spectral leakage and different frequency resolution. Truncating a signal causes energy leakage, and computing the spectrum with a Fourier algorithm causes the picket-fence effect; neither error can be eliminated in principle, but their influence can be suppressed by choosing a suitable window function. In one embodiment, a 0.8-second Hanning window is selected as the window function for calculating the weights. The user may select the window function according to the actual situation; this is not specifically limited in the present application.
S5012, performing convolution operation on the human voice energy waveform according to the weight to obtain the smooth energy waveform.
In a specific implementation, the human voice energy waveform is convolved according to the weights to obtain the smoothed energy waveform. In one embodiment, the smoothed energy waveform is obtained by computing the convolution of the human voice energy waveform with the normalized window weights.
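Continuing the pipeline, a sketch of steps S5011-S5012 with the 0.8-second Hanning window over the 50 ms frames assumed above:

```python
import numpy as np

frame_sec = 0.05                               # frame step assumed in the dBFS sketch
win_len = max(3, int(round(0.8 / frame_sec)))  # 0.8 s Hanning window, in frames
weights = np.hanning(win_len)
weights /= weights.sum()                       # normalize the weights (S5011)

# Convolve the energy waveform with the weights to smooth it (S5012).
smoothed_db = np.convolve(energy_db, weights, mode="same")
```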
And S502, performing threshold limiting processing on the smooth energy waveform according to a preset threshold value to obtain a threshold-limited waveform.
In a specific implementation, the smoothed energy waveform is threshold-limited according to a preset threshold to obtain a threshold-limited waveform. In one embodiment, threshold-limiting converts the irregular smoothed energy waveform into a simple square wave, the threshold-limited waveform, which is convenient to judge.
It should be noted that, in one embodiment, the preset threshold is -34 dBFS. The user may set the preset threshold according to the actual situation; this is not specifically limited in the present application.
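The threshold-limiting step then reduces to a comparison against the -34 dBFS preset, continuing the sketch above:

```python
# Step S502: convert the smoothed energy waveform into a 0/1 square wave.
THRESHOLD_DBFS = -34.0
square = (smoothed_db > THRESHOLD_DBFS).astype(float)
```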
S503, carrying out holding processing on the threshold-limited waveform to obtain a target waveform.
In a specific implementation, the threshold-limited waveform is subjected to hold processing to obtain the target waveform. Hold processing reduces the number of time points that must be detected and judged, thereby improving the efficiency of lyric conversion point detection.
Referring to fig. 6, in an embodiment, the step S503 specifically includes: steps S5031-S5032.
S5031, identifying peaks in the threshold-limited waveform whose time interval is smaller than a preset time interval as target peaks.
In a specific implementation, peaks of the threshold-limited waveform whose time interval is smaller than a preset time interval are identified as target peaks. In one embodiment, the time interval between the rising edges of two adjacent peaks is taken as the time interval between the peaks. Holding the threshold-limited waveform improves the precision of lyric conversion point detection. The preset time interval may be set to 2 s and is generally less than the duration of one beat of the target audio data. The user may set the hold time according to the actual situation; this is not specifically limited in the present application.
S5032, connecting all the target peaks to obtain a target waveform.
In a specific implementation, all the target peaks are connected to obtain the target waveform. If the time interval between two adjacent peaks in the threshold-limited waveform is identified as smaller than the preset time interval, the two adjacent peaks are connected into one, which avoids detecting time points whose spacing is below the preset time interval and thus improves detection efficiency.
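A sketch of the hold processing in steps S5031-S5032: gaps between adjacent peaks shorter than the 2 s preset interval are filled so the peaks merge into one (frame timing as assumed earlier):

```python
import numpy as np

def hold_square_wave(square, frame_sec=0.05, min_gap_sec=2.0):
    # Merge adjacent peaks whose separating gap is shorter than min_gap_sec.
    out = square.copy()
    min_gap = int(round(min_gap_sec / frame_sec))
    i, n = 0, len(out)
    while i < n:
        if out[i] < 0.5:                      # inside a gap (run of zeros)
            j = i
            while j < n and out[j] < 0.5:
                j += 1
            if i > 0 and j < n and (j - i) < min_gap:
                out[i:j] = 1.0                # interior gap is short: connect the peaks
            i = j
        else:
            i += 1
    return out

target_wave = hold_square_wave(square)        # the target waveform (W6)
```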
And S6, detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine the conversion point of the lyrics.
In specific implementation, the target waveform is detected according to the beat of the target audio data and a preset conversion condition to determine a conversion point of the lyrics.
It should be noted that the preset conversion condition is received before the target audio data is acquired. In one embodiment, the preset conversion condition is:
1) no human voice at the previous time point;
2) human voice at the current time point;
3) no human voice during the past one beat;
4) continuous human voice lasting at least one beat into the future.
The target waveform is detected according to the beat of the target audio data; a time point that satisfies all four conditions simultaneously is a lyric conversion time point.
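A sketch of the four-condition check over the target waveform; beat_duration comes from the tempo step, and the frame timing matches the earlier sketches:

```python
import numpy as np

def lyric_transition_points(target_wave, beat_duration, frame_sec=0.05):
    # Return frame indices that satisfy all four preset conversion conditions.
    beat = int(round(beat_duration / frame_sec))          # one beat, in frames
    voiced = np.asarray(target_wave) > 0.5
    points = []
    for t in range(1, len(voiced) - beat):
        if (not voiced[t - 1]                             # 1) no voice at previous point
                and voiced[t]                             # 2) voice at current point
                and not voiced[max(0, t - beat):t].any()  # 3) no voice in the past beat
                and voiced[t:t + beat].all()):            # 4) voice for a full beat ahead
            points.append(t)
    return points

# E.g., with the 4.44 s duration from the worked example below:
transitions = lyric_transition_points(target_wave, beat_duration=4.44)
print([round(t * 0.05, 2) for t in transitions])          # conversion times in seconds
```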
As shown in fig. 11, curve W3 is the human voice energy waveform, curve W4 is the smoothed energy waveform, curve W5 is the threshold-limited waveform, curve W6 is the target waveform, and curve W7 is the lyric conversion point waveform; the lyric conversion point can be read from the lyric conversion point waveform.
Detection found the target audio to be in 8/8 meter at 108 beats per minute, so the duration of one bar is 8 × 60/108 ≈ 4.44 s. As can be seen from fig. 11, the human voice at two time points satisfies conditions 1) and 2); these lie at the rising edges of the first and second peaks of the target waveform, respectively. It is then judged whether conditions 3) and 4) are satisfied. Because the duration of the first peak is less than 4.44 s, its rising edge is judged not to be a lyric conversion point; the duration of the second peak exceeds 4.44 s and no human voice appears within one beat before its rising edge, so the rising edge of the second peak is the lyric conversion point.
The lyric conversion point detection method provided by the embodiment of the invention includes: acquiring target audio data; detecting the target audio data to obtain the beat of the target audio data; performing human voice separation processing on the target audio data to obtain human voice data; calculating the amplitude of the human voice data to obtain a human voice energy waveform; preprocessing the human voice energy waveform to obtain a target waveform; and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine the lyric conversion point. The method enables machine equipment to effectively distinguish music from human voice: by checking the processed human voice data against the beat of the target audio data and the preset conversion condition, the lyric conversion point is determined accurately, greatly improving both the precision and the efficiency of locating lyric conversion points.
Fig. 7 is a schematic block diagram of a lyric conversion point detection apparatus according to an embodiment of the present invention. As shown in fig. 7, the present invention also provides a lyric conversion point detection apparatus 100 corresponding to the above lyric conversion point detection method. The lyric conversion point detecting apparatus 100 includes a unit for performing the above lyric conversion point detecting method, and may be configured in a desktop computer, a tablet computer, a laptop computer, or the like. Specifically, referring to fig. 7, the lyric conversion point detection apparatus 100 includes an acquisition unit 101, a detection unit 102, a separation unit 103, a calculation unit 104, a preprocessing unit 105, and a determination unit 106.
An acquisition unit 101 configured to acquire target audio data;
a detection unit 102, configured to detect the target audio data to obtain a beat of the target audio data;
a separation unit 103, configured to perform voice separation processing on the target audio data to obtain voice data;
a calculating unit 104, configured to calculate an amplitude of the voice data to obtain a voice energy waveform;
the preprocessing unit 105 is used for preprocessing the human voice energy waveform to obtain a target waveform;
a determining unit 106, configured to detect the target waveform according to the beat of the target audio data and a preset transition condition to determine a transition point of the lyric.
In an embodiment, the detecting the target audio data to obtain the beat of the target audio data includes:
performing audio characteristic extraction on the target audio data to obtain audio characteristics of the target audio data;
and carrying out beat detection on the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
In an embodiment, the performing beat detection on the audio feature of the target audio data by using a beat detection model to obtain a beat of the target audio data includes:
stacking the audio features of the target audio data to obtain output features, wherein the output features are time sequence data with the same length as the audio features of the target audio data;
inputting the output features into a classifier to obtain a tempo of the target audio data.
In one embodiment, the preprocessing the human voice energy waveform to obtain a target waveform includes:
smoothing the human voice energy waveform to obtain a smooth energy waveform;
performing threshold limiting processing on the smooth energy waveform according to a preset threshold value to obtain a threshold-limited waveform;
and carrying out holding processing on the threshold-limited waveform to obtain a target waveform.
In an embodiment, the smoothing the human voice energy waveform to obtain a smoothed energy waveform includes:
calling a window function to calculate the weight;
and carrying out convolution operation on the human voice energy waveform according to the weight to obtain the smooth energy waveform.
In an embodiment, the holding the threshold-limited waveform to obtain the target waveform includes:
identifying a peak with a time interval smaller than a preset time interval in the threshold-limited waveform as a target peak;
and connecting all the target wave crests to obtain a target waveform.
In an embodiment, the performing voice separation processing on the target audio data to obtain voice data includes:
inputting the target audio data into an audio separation tool to extract human voice data from the target audio data.
It should be noted that, as can be clearly understood by those skilled in the art, the specific implementation processes of the lyric conversion point detecting device and each unit may refer to the corresponding descriptions in the foregoing method embodiments, and for convenience and brevity of description, no further description is provided herein.
The above-described lyric conversion point detecting apparatus may be implemented in the form of a computer program that can be run on a computer device as shown in fig. 8.
Referring to fig. 8, fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 300 is a host computer, which may be a tablet computer, a notebook computer, a desktop computer, or other electronic equipment.
Referring to fig. 8, the computer device 300 includes a processor 302, memory, and a network interface 305 connected by a system bus 301, where the memory may include a non-volatile storage medium 303 and an internal memory 304.
The nonvolatile storage medium 303 may store an operating system 3031 and a computer program 3032. The computer program 3032, when executed, causes the processor 302 to perform a lyric conversion point detection method.
The processor 302 is used to provide computing and control capabilities to support the operation of the overall computer device 300.
The internal memory 304 provides an environment for the execution of a computer program 3032 in the non-volatile storage medium 303, which computer program 3032, when executed by the processor 302, causes the processor 302 to perform a lyric conversion point detection method.
The network interface 305 is used for network communication with other devices. Those skilled in the art will appreciate that the configuration shown in fig. 8 is a block diagram of only the portion of the configuration relevant to the present application and does not limit the computer device 300 to which the present application is applied; a particular computer device 300 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Wherein the processor 302 is configured to run a computer program 3032 stored in the memory to implement the following steps:
acquiring target audio data;
detecting the target audio data to obtain the beat of the target audio data;
carrying out voice separation processing on the target audio data to obtain voice data;
calculating the amplitude of the human voice data to obtain a human voice energy waveform;
preprocessing the human voice energy waveform to obtain a target waveform;
and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine a conversion point of the lyrics.
In an embodiment, the detecting the target audio data to obtain the beat of the target audio data includes:
performing audio characteristic extraction on the target audio data to obtain audio characteristics of the target audio data;
and carrying out beat detection on the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
In an embodiment, the performing beat detection on the audio feature of the target audio data by using a beat detection model to obtain a beat of the target audio data includes:
stacking the audio features of the target audio data to obtain output features, wherein the output features are time sequence data with the same length as the audio features of the target audio data;
inputting the output features into a classifier to obtain a tempo of the target audio data.
In one embodiment, the preprocessing the human voice energy waveform to obtain a target waveform includes:
smoothing the human voice energy waveform to obtain a smooth energy waveform;
performing threshold limiting processing on the smooth energy waveform according to a preset threshold value to obtain a threshold-limited waveform;
and carrying out holding processing on the threshold-limited waveform to obtain a target waveform.
In an embodiment, the smoothing the human voice energy waveform to obtain a smoothed energy waveform includes:
calling a window function to calculate the weight;
and carrying out convolution operation on the human voice energy waveform according to the weight to obtain the smooth energy waveform.
In an embodiment, the holding the threshold-limited waveform to obtain the target waveform includes:
identifying a peak with a time interval smaller than a preset time interval in the threshold-limited waveform as a target peak;
and connecting all the target wave crests to obtain a target waveform.
In an embodiment, the performing voice separation processing on the target audio data to obtain voice data includes:
inputting the target audio data into an audio separation tool to extract human voice data from the target audio data.
It should be understood that, in the embodiment of the present application, the processor 302 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
It will be understood by those skilled in the art that all or part of the flow of the method implementing the above embodiments may be implemented by a computer program instructing associated hardware. The computer program may be stored in a storage medium, which is a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program. The computer program, when executed by a processor, causes the processor to perform the steps of:
acquiring target audio data;
detecting the target audio data to obtain the beat of the target audio data;
carrying out voice separation processing on the target audio data to obtain voice data;
calculating the amplitude of the human voice data to obtain a human voice energy waveform;
preprocessing the human voice energy waveform to obtain a target waveform;
and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine a conversion point of the lyrics.
In an embodiment, the detecting the target audio data to obtain the beat of the target audio data includes:
performing audio characteristic extraction on the target audio data to obtain audio characteristics of the target audio data;
and carrying out beat detection on the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
In an embodiment, the performing beat detection on the audio feature of the target audio data by using a beat detection model to obtain a beat of the target audio data includes:
stacking the audio features of the target audio data to obtain output features, wherein the output features are time sequence data with the same length as the audio features of the target audio data;
inputting the output features into a classifier to obtain a tempo of the target audio data.
In one embodiment, the preprocessing the human voice energy waveform to obtain a target waveform includes:
smoothing the human voice energy waveform to obtain a smooth energy waveform;
performing threshold limiting processing on the smooth energy waveform according to a preset threshold value to obtain a threshold-limited waveform;
and carrying out holding processing on the threshold-limited waveform to obtain a target waveform.
In an embodiment, the smoothing the human voice energy waveform to obtain a smoothed energy waveform includes:
calling a window function to calculate the weight;
and carrying out convolution operation on the human voice energy waveform according to the weight to obtain the smooth energy waveform.
In an embodiment, the holding the threshold-limited waveform to obtain the target waveform includes:
identifying a peak with a time interval smaller than a preset time interval in the threshold-limited waveform as a target peak;
and connecting all the target wave crests to obtain a target waveform.
In an embodiment, the performing voice separation processing on the target audio data to obtain voice data includes:
inputting the target audio data into an audio separation tool to extract human voice data from the target audio data.
The storage medium is a physical, non-transitory storage medium, and may be any physical storage medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of the examples have been described above in general terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the technical solution. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, various elements or components may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be merged, divided and deleted according to actual needs. In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a terminal, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, while the invention has been described with respect to the above-described embodiments, it will be understood that the invention is not limited thereto but may be embodied with various modifications and changes.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for detecting a lyric conversion point, comprising:
acquiring target audio data;
detecting the target audio data to obtain the beat of the target audio data;
carrying out voice separation processing on the target audio data to obtain voice data;
calculating the amplitude of the human voice data to obtain a human voice energy waveform;
preprocessing the human voice energy waveform to obtain a target waveform;
and detecting the target waveform according to the beat of the target audio data and a preset conversion condition to determine a conversion point of the lyrics.
2. The lyric conversion point detection method according to claim 1, wherein the detecting the target audio data to obtain the beat of the target audio data comprises:
performing audio feature extraction on the target audio data to obtain audio features of the target audio data;
and carrying out beat detection on the audio features of the target audio data by using a beat detection model to obtain the beat of the target audio data.
3. The method of claim 2, wherein the obtaining the tempo of the target audio data by performing tempo detection on the audio features of the target audio data using a tempo detection model comprises:
stacking the audio features of the target audio data to obtain output features, wherein the output features are time sequence data with the same length as the audio features of the target audio data;
inputting the output features into a classifier to obtain a tempo of the target audio data.
4. The lyric transition point detection method of claim 1, wherein the pre-processing the human voice energy waveform to obtain a target waveform comprises:
smoothing the human voice energy waveform to obtain a smooth energy waveform;
performing threshold limiting processing on the smooth energy waveform according to a preset threshold value to obtain a threshold-limited waveform;
and carrying out holding processing on the threshold-limited waveform to obtain a target waveform.
5. The lyric transition point detecting method of claim 4, wherein smoothing the human voice energy waveform to obtain a smoothed energy waveform comprises:
calling a window function to calculate the weight;
and carrying out convolution operation on the human voice energy waveform according to the weight to obtain the smooth energy waveform.
6. The method of claim 4, wherein the holding the threshold-limited waveform to obtain the target waveform comprises:
identifying a peak with a time interval smaller than a preset time interval in the threshold-limited waveform as a target peak;
and connecting all the target wave crests to obtain a target waveform.
7. The method of claim 1, wherein the performing the human voice separation process on the target audio data to obtain human voice data comprises:
inputting the target audio data into an audio separation tool to extract human voice data from the target audio data.
8. A lyric conversion point detecting apparatus, comprising:
an acquisition unit configured to acquire target audio data;
the detection unit is used for detecting the target audio data to obtain the beat of the target audio data;
the separation unit is used for carrying out voice separation processing on the target audio data to obtain voice data;
the computing unit is used for computing the amplitude of the human voice data to obtain a human voice energy waveform;
the preprocessing unit is used for preprocessing the human voice energy waveform to obtain a target waveform;
and the determining unit is used for detecting the target waveform according to the beat of the target audio data and a preset conversion condition so as to determine the conversion point of the lyrics.
9. A computer arrangement, characterized in that the computer arrangement comprises a memory having stored thereon a computer program and a processor implementing the method according to any of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202110775920.0A 2021-07-09 2021-07-09 Lyric conversion point detection method, device, computer equipment and storage medium Active CN113516971B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110775920.0A CN113516971B (en) 2021-07-09 2021-07-09 Lyric conversion point detection method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110775920.0A CN113516971B (en) 2021-07-09 2021-07-09 Lyric conversion point detection method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113516971A true CN113516971A (en) 2021-10-19
CN113516971B CN113516971B (en) 2023-09-29

Family

ID=78066502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110775920.0A Active CN113516971B (en) 2021-07-09 2021-07-09 Lyric conversion point detection method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113516971B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10187147A (en) * 1996-10-23 1998-07-14 Yamaha Corp Device and method for voice inputting, and storage medium
JP2001125582A (en) * 1999-10-26 2001-05-11 Victor Co Of Japan Ltd Method and device for voice data conversion and voice data recording medium
JP2002082665A (en) * 2000-09-11 2002-03-22 Toshiba Corp Device, method and processing program for assigning lyrics
JP2006048808A (en) * 2004-08-03 2006-02-16 Fujitsu Ten Ltd Audio apparatus
US20120210844A1 (en) * 2007-10-24 2012-08-23 Funk Machine Inc. Personalized music remixing
CN101751914A (en) * 2008-12-04 2010-06-23 江亮都 Lyric display system and method
JP2016050974A (en) * 2014-08-29 2016-04-11 株式会社第一興商 Karaoke scoring system
CN104252872A (en) * 2014-09-23 2014-12-31 深圳市中兴移动通信有限公司 Lyric generating method and intelligent terminal
CN105096987A (en) * 2015-06-01 2015-11-25 努比亚技术有限公司 Audio data processing method and terminal
US20170092247A1 (en) * 2015-09-29 2017-03-30 Amper Music, Inc. Machines, systems, processes for automated music composition and generation employing linguistic and/or graphical icon based musical experience descriptors
CN108206029A (en) * 2016-12-16 2018-06-26 北京酷我科技有限公司 A kind of method and system for realizing the word for word lyrics
CN112399247A (en) * 2020-11-18 2021-02-23 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, audio processing device and readable storage medium
CN112669811A (en) * 2020-12-23 2021-04-16 腾讯音乐娱乐科技(深圳)有限公司 Song processing method and device, electronic equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘晓曦 (Liu Xiaoxi): "人工智能语音技术在广电媒体的应用" [Application of artificial intelligence speech technology in radio and television media], 广播电视信息 (Radio and Television Information), no. 03

Also Published As

Publication number Publication date
CN113516971B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
US6721699B2 (en) Method and system of Chinese speech pitch extraction
US8193436B2 (en) Segmenting a humming signal into musical notes
Holzapfel et al. Three dimensions of pitched instrument onset detection
CN110599987A (en) Piano note recognition algorithm based on convolutional neural network
US8543387B2 (en) Estimating pitch by modeling audio as a weighted mixture of tone models for harmonic structures
US9070370B2 (en) Technique for suppressing particular audio component
CN104978962A (en) Query by humming method and system
JP2004538525A (en) Pitch determination method and apparatus by frequency analysis
CN110265064A (en) Audio sonic boom detection method, device and storage medium
EP2962299B1 (en) Audio signal analysis
JP5569228B2 (en) Tempo detection device, tempo detection method and program
Holzapfel et al. Beat tracking using group delay based onset detection
CN110111811A (en) Audio signal detection method, device and storage medium
Bouzid et al. Voice source parameter measurement based on multi-scale analysis of electroglottographic signal
JP4217616B2 (en) Two-stage pitch judgment method and apparatus
CN113516971A (en) Lyric conversion point detection method, device, computer equipment and storage medium
JPWO2003107326A1 (en) Speech recognition method and apparatus
JP4128848B2 (en) Pitch pitch determination method and apparatus, pitch pitch determination program and recording medium recording the program
CN109817205B (en) Text confirmation method and device based on semantic analysis and terminal equipment
JP2002287744A (en) Method and device for waveform data analysis and program
JPH01219627A (en) Automatic score taking method and apparatus
Joysingh et al. Chirp Group Delay-Based Onset Detection in Instruments with Fast Attack
Zien et al. Monophonic piano music transcription
Bapat et al. Pitch tracking of voice in tabla background by the two-way mismatch method
CN116403613A (en) Music main melody recognition method and device based on BP neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211118

Address after: 518000 1001, block D, building 5, software industry base, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: Shenzhen Wanxing Software Co.,Ltd.

Address before: 518000 1002, block D, building 5, software industry base, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Applicant before: SHENZHEN SIBO TECHNOLOGY Co.,Ltd.

GR01 Patent grant