CN112233691A - Singing evaluation method and system - Google Patents

Singing evaluation method and system

Info

Publication number
CN112233691A
Authority
CN
China
Prior art keywords
note
pitch
segment
value
singing voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010969451.1A
Other languages
Chinese (zh)
Other versions
CN112233691B (en)
Inventor
李伟
王沐书
赵天歌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202010969451.1A priority Critical patent/CN112233691B/en
Publication of CN112233691A publication Critical patent/CN112233691A/en
Application granted granted Critical
Publication of CN112233691B publication Critical patent/CN112233691B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/21 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique, using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The invention belongs to the technical field of music evaluation, and in particular relates to a singing evaluation method and system. The method comprises the following steps: calculating the voiced-region pitch timelines of the singing voice segment to be evaluated and of the target singing voice segment, respectively; extracting notes from the voiced-region pitch timelines through transcription to obtain note timelines; solving the DTW optimal path and the difference penalty matrix between the note timeline of the singing voice segment to be evaluated and that of the target singing voice segment; traversing the whole DTW optimal path, incrementing an indication value by 1 whenever a transverse segment appears in the path and decrementing it by 1 whenever a longitudinal segment appears, the indication value serving as the beat accuracy evaluation index of the singing voice segment to be evaluated; and aligning the note sequence of the singing voice segment to be evaluated with that of the target singing voice segment according to the difference penalty matrix and the DTW optimal path, and evaluating the note accuracy of the singing voice segment to be evaluated after alignment. The invention thereby achieves accurate evaluation of both beat accuracy and note accuracy.

Description

Singing evaluation method and system
Technical Field
The invention belongs to the technical field of music evaluation, and particularly relates to a singing evaluation method and system.
Background
Singing evaluation, i.e. giving positive or negative descriptions of various aspects of a singing voice segment performed by a singer, has long been a topic of interest in the music community. A singing evaluation system for professional singers can help select excellent singers and give singers a reference for improving their own level; a singing evaluation system for the singing competitions popular in recent years can evaluate contestants reasonably, greatly improving the watchability and authority of the competition; and a singing evaluation system for amateur singers can satisfy people's desire for friendly competition, greatly enhancing the fun of amateur singing activities. A common singing evaluation method recognized by singers at large can therefore achieve good results on many occasions and meet the needs of many groups of people.
Automatic singing evaluation systems based on note accuracy are already used by a large number of users on platforms such as computers and smartphone applications. First, most designers implement such systems by directly comparing the fundamental frequency sung by the user, or the notes directly quantized from that fundamental frequency, with the frequency sequence or note sequence of the original score. Second, many systems attempt beat accuracy evaluation with reference to information, such as the accompaniment, that is unstable or unavailable in most cases. Third, when evaluating the singer's note accuracy against a target performance of the same song, most systems simply compute, at every moment, the difference between the singer's pitch or note and the target pitch or note at the same moment to produce a score.

Such systems, however, still have unavoidable defects:

First, most designers implement such systems by directly comparing the fundamental frequency sung by the user, or the notes directly quantized from it, with the frequency sequence or note sequence of the original score, without considering that the fundamental frequency of a singer who is not out of tune still fluctuates around the intended notes. Treating such normal fluctuation as being out of tune leads to the conclusion that, to obtain a high score in such a system, the singer must sing the original melody with a flat, mechanical voice, without any ornamentation or emotion, which is obviously unreasonable.

Second, many systems give the singer no evaluation of beat accuracy at all, or only attempt it with reference to information, such as the accompaniment, that is unstable or unavailable in many cases. For popular songs with a distinct accompaniment beat, such evaluation can still give relatively reliable results; but for the increasingly diverse accompaniments of recent years, beat accuracy evaluation that depends on accompaniment information can hardly give convincing results when the accompaniment contains no obvious drumbeat, instrument or other timing reference.

Third, when evaluating the singer's note accuracy against a target performance of the same song, most systems only compute the difference between the singer's pitch or note at each moment and the target pitch or note at the same moment, without considering how possible beat errors of the singer affect the algorithm. A singer who makes only beat errors, such as rushing or dragging the beat, and sings every note correctly can therefore still receive an unreasonably low note accuracy score, because points on the note sequence or timeline become misaligned.
Disclosure of Invention
The invention aims to provide a singing evaluation method and system that can accurately evaluate beat accuracy and note accuracy.
The singing evaluation method provided by the invention specifically comprises the following steps:
(1) calculating the voiced-region pitch timelines of the singing voice segment to be evaluated and of the target singing voice segment, respectively;
(2) extracting notes from the pitch time line of the voiced region through transcription to obtain a note time line;
(3) solving a DTW optimal path and a difference penalty matrix between the note time line of the singing voice segment to be evaluated and the note time line of the target singing voice segment by adopting a dynamic time warping algorithm (DTW algorithm);
(4) traversing the whole DTW optimal path, wherein the indication value is incremented by 1 when a transverse segment appears in the DTW optimal path and decremented by 1 when a longitudinal segment appears; the indication value is the beat accuracy evaluation index of the singing voice segment to be evaluated;
(5) and aligning the note sequence of the singing voice segment to be evaluated with the note sequence of the target singing voice segment according to the difference penalty matrix and the DTW optimal path, and evaluating the note accuracy of the singing voice segment to be evaluated after aligning.
In step (2) of the present invention, said extracting notes from said voiced region pitch timeline by transcription to obtain a note timeline, and the specific process includes:
(2.1) determining a pitch value time sequence of said voiced region pitch timeline;
(2.2) quantizing the pitch value time sequence by taking the first set proportion of the semitone as resolution to obtain a quantized pitch value time sequence;
(2.3) continuously carrying out three times of dynamic average filtering processing on the quantized time sequence of the pitch values; wherein the dynamic average filtering process specifically includes: calculating the difference value between the pitch value of the next pitch point m +1 and the mean value of the notes of the current note paragraph i at the pitch point m of the current note paragraph i, and taking the pitch point m +1 as the start of a new note paragraph i +1 when the difference value is larger than a set amplitude, otherwise, incorporating the pitch point m +1 into the current note paragraph i;
the set amplitude in the first dynamic average filtering is a second set proportion of a semitone, the set amplitude in the second dynamic average filtering is a third set proportion of a semitone, and the set amplitude in the third dynamic average filtering is a fourth set proportion of a semitone.
Optionally, the first set proportion amount is 1/5, the second set proportion amount is 2/5, the third set proportion amount is 3/5, and the fourth set proportion amount is 1.
Further, after the three-time dynamic average filtering processing is continuously performed on the quantized time series of pitch values, the method further includes:
and assigning the note values of value-fluctuation regions whose number of pitch points is smaller than a first set number, and of no-value regions whose number of pitch points is smaller than a second set number, to the note value of the nearest preceding note paragraph.
Optionally, the first set number is 2, and the second set number is 5.
Further, after assigning the note values of value-fluctuation regions whose number of pitch points is smaller than the first set number and of no-value regions whose number of pitch points is smaller than the second set number to the note value of the nearest preceding note paragraph, the method further comprises:
assigning to a note segment whose length is within a third set number of pitch points, and whose note frequency differs from that of the next note segment by less than a set frequency, the note value of that next note segment.
Optionally, the third set number is 15.
Furthermore, the invention can further evaluate the emotion fullness and the vocal range fitness of the singing voice segment to be evaluated. The method specifically comprises the following steps:
(1) selecting a segment which has stable note value on a note time line corresponding to the segment of the singing voice to be evaluated and has the duration time longer than a set time value, and recording as a target segment;
(2) extracting the frequency spectrum shape characteristic, harmonic energy characteristic, vibrato characteristic and sound intensity characteristic of the target segment;
(3) and inputting the frequency spectrum shape characteristic, the harmonic energy characteristic, the vibrato characteristic and the sound intensity characteristic into a trained neural network model to obtain the emotion fullness evaluation and the vocal range fitness evaluation.
The invention also provides a singing evaluation system, which comprises:
the vocal zone pitch timeline determining module is used for respectively calculating the vocal zone pitch timelines of the vocal segment to be evaluated and the target vocal segment;
the note extraction module is used for extracting notes from the pitch time line of the voiced region through transcription to obtain a note time line;
the DTW algorithm solving module is used for solving a DTW optimal path and a difference penalty matrix between the note time line of the singing voice segment to be evaluated and the note time line of the target singing voice segment;
the beat accuracy evaluation module is used for traversing the whole DTW optimal path, where the indication value is incremented by 1 when a transverse segment appears in the DTW optimal path and decremented by 1 when a longitudinal segment appears; the indication value is the beat accuracy evaluation index of the singing voice segment to be evaluated;
and the note accuracy evaluation module is used for aligning the note sequence of the singing voice segment to be evaluated with the note sequence of the target singing voice segment according to the difference penalty matrix and the DTW optimal path, and carrying out note accuracy evaluation on the singing voice segment to be evaluated after alignment.
Namely, the above 5 modules respectively execute the operations of 5 steps in the singing evaluation method.
Optionally, the system further includes:
the target segment selection module is used for selecting segments which have stable note values on the note time lines corresponding to the singing voice segments to be evaluated and have the duration time longer than the set time value, and recording the segments as target segments;
the characteristic extraction module is used for extracting the frequency spectrum shape characteristic, the harmonic energy characteristic, the vibrato characteristic and the sound intensity characteristic of the target segment;
and the advanced evaluation module is used for inputting the frequency spectrum shape characteristic, the harmonic energy characteristic, the vibrato characteristic and the sound intensity characteristic into the trained neural network model to obtain the emotion fullness evaluation and the vocal range fitness evaluation.
Namely, these 3 modules respectively execute the operations of the 3 steps of the emotion fullness evaluation and vocal range fitness evaluation of the singing voice segment to be evaluated.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects: the singing evaluation method and the singing evaluation system provided by the invention realize accurate evaluation on the aspect of the beat accuracy of the singing voice segment to be evaluated through three links of fundamental tone extraction, transcription and time sequence comparison of the singing voice segment to be evaluated and the target singing voice segment. Meanwhile, note alignment is carried out on the note time sequence of the singing voice segment to be evaluated and the note time sequence of the target singing voice segment based on the DTW algorithm, note accuracy evaluation is carried out after alignment, inaccuracy of note accuracy evaluation caused by inaccurate beats is avoided, and accuracy of note accuracy evaluation is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a flowchart of a singing evaluation method provided in embodiment 1 of the present invention.
Fig. 2 shows the effect of the three-layer dynamic average filtering in embodiment 1 of the present invention, where (a), (b), (c) and (d) are the sequences, converted back to frequency values, before filtering and after the first, second and third rounds, respectively.
Fig. 3 is a schematic structural diagram of a singing evaluation system provided in embodiment 3 of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments and the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1
Referring to fig. 1, the present embodiment provides a singing evaluation method, including the steps of:
step 101: calculating the voiced-region pitch timelines of the singing voice segment to be evaluated and of the target singing voice segment, respectively;
step 102: extracting notes from the pitch time line of the voiced region through transcription to obtain a note time line;
step 103: solving a DTW optimal path and a difference penalty matrix between the note time line of the singing voice segment to be evaluated and the note time line of the target singing voice segment;
step 104: traversing the whole DTW optimal path, wherein the indication value is incremented by 1 when a transverse segment appears in the DTW optimal path and decremented by 1 when a longitudinal segment appears; the indication value is the beat accuracy evaluation index of the singing voice segment to be evaluated;
step 105: and aligning the note sequence of the singing voice segment to be evaluated with the note sequence of the target singing voice segment according to the difference penalty matrix and the DTW optimal path, and evaluating the note accuracy of the singing voice segment to be evaluated after aligning.
The specific steps of step 101 may be: respectively carrying out windowing and framing operation on the singing voice segment to be evaluated and the target singing voice segment; respectively carrying out pitch calculation based on a YIN algorithm on the singing voice segment to be evaluated and the target singing voice segment frame by frame; connecting the fundamental tones of each adjacent frame together to obtain a fundamental tone vector; combining the pitch vector with the time corresponding to each frame to determine a pitch timeline; and detecting the sound zone at each time point on the pitch timeline, and extracting the sound zone to obtain the pitch timeline of the sound zone.
In this embodiment, the user first inputs the singing segment, and the system performs windowing and framing on the two segments of singing audio for the subsequent operations. Then, the system performs YIN-based pitch calculation on the two singing segments frame by frame, connects the pitch values of adjacent frames into a vector, and combines this vector with the specific time corresponding to each frame to form a pitch-time curve, i.e. a "pitch timeline"; it should be noted that every "timeline" mentioned later in the invention refers to such a vector formed by connecting frame values one by one and associating them with specific time points. Next, Voice Activity Detection (VAD) is performed at each specific time point on the pitch timeline, i.e. it is determined whether singing voice appears at each time point; the singing-voice regions are retained and the rest is discarded, yielding a new pitch timeline that has values only in the singing-voice regions. Based on the pitch timeline extracted in the previous step, the transcription step extracts, from the continuously fluctuating pitch of the singing audio, the "note" vector around which the pitch fluctuates, forming a "note timeline". After the note timelines are obtained, a Dynamic Time Warping algorithm (DTW for short) is used to compare the similarity of the two note timelines in time, giving the similarity between the user and the target in terms of tempo and thereby evaluating the user's beat accuracy. Meanwhile, the result of the DTW algorithm is innovatively used to mutually correct the two note timelines, so that the user's singing with beat errors can be aligned with the target singing, and a more reasonable evaluation of the user's note accuracy, unaffected by beat errors, can then be given.
The present embodiment evaluates the user's note accuracy and beat accuracy by comparing the similarity between the user's singing and a target performance of the same singing voice segment imitated by the user. Before the evaluation, the "dry" voices of both the singer and the target (a cappella, singing voice without accompaniment) need to be collected; the target may be the original artist's performance of the song, or musical tones restored from the note time series of the song (e.g. a MIDI audio file). The audio is then framed so that pitch calculation and sound intensity calculation can be performed subsequently. The invention follows the approach of mainstream singing voice processing systems and uses rectangular windows, which may be 25 ms long, with a window overlap of half the window length.
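As an illustration of this front end, the following sketch frames the audio, runs the YIN pitch estimator frame by frame, and applies a simple energy-based voice activity detection. It assumes the librosa library; the function name, the 80-1000 Hz pitch search range and the RMS threshold are illustrative choices, not values prescribed by the invention.

```python
# A minimal sketch of the voiced-region pitch-timeline step, assuming librosa.
import numpy as np
import librosa

def voiced_pitch_timeline(path, sr=44100, win_s=0.025):
    y, sr = librosa.load(path, sr=sr, mono=True)
    frame_length = int(win_s * sr)            # ~25 ms analysis window
    hop_length = frame_length // 2            # overlap of half the window length

    # Frame-by-frame fundamental frequency estimate (YIN algorithm).  A longer
    # 2048-sample YIN window is used here purely as a convenience of this sketch.
    f0 = librosa.yin(y, fmin=80, fmax=1000, sr=sr,
                     frame_length=2048, hop_length=hop_length)

    # Simple energy-based voice activity detection: low-energy frames are
    # treated as non-singing and given no value (NaN); the threshold is illustrative.
    rms = librosa.feature.rms(y=y, frame_length=frame_length,
                              hop_length=hop_length)[0]
    n = min(len(f0), len(rms))
    f0 = np.where(rms[:n] > 0.02 * rms.max(), f0[:n], np.nan)

    times = librosa.frames_to_time(np.arange(n), sr=sr, hop_length=hop_length)
    return times, f0                          # the voiced-region pitch timeline
```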
The transcription in step 102 refers to transcription of the fundamental pitch, i.e. the process of calculating the note sequence of a musical passage on the basis of the extracted pitch. Analogously to the definition of the pitch timeline, a note timeline is obtained by associating the note values with the start time of each frame.
The transcription segment of the present invention is based on two important assumptions: firstly, when the pitch sequence is divided into different note segments, if two segments of pitch actually belong to two tones, the difference value of the two segments of pitch must be larger than a certain amplitude; secondly, in any case, in the singing voice performed by a human being, the difference between two consecutive pitch values cannot be greater than a certain magnitude, otherwise one of them is an erroneous point in the pitch calculation. Based on these two assumptions, the present invention constructs the entire note sequence by processing the pitch values of each frame on a frame-by-frame basis.
The input of the transcription step is the pitch sequence obtained above. This sequence has no value at some positions; owing to the strict thresholds of the preceding YIN and VAD steps, these no-value positions cover most non-singing regions, but in a few cases some singing regions are also misjudged as non-singing. Therefore, before note calculation, the pitch value sequence is preferably quantized, for example with 1/5 of a semitone as the resolution:
q_pitch = round(60 * log2(p / 440))

where p denotes the input pitch value in Hz, q_pitch is the quantized pitch value, round() returns the integer nearest to its argument, and the quantization is referenced to 440 Hz.
After the pitch sequence is quantized, it is subjected to dynamic average filtering three times. The first "dynamic average filtering" pass works as follows: at pitch point m of the current note paragraph i, if the next pitch point m+1 has a value, the difference between that value and the note mean n of the current note paragraph i is calculated; if the difference is greater than a certain amplitude A1, point m+1 is regarded as the beginning of a new note paragraph i+1; otherwise point m+1 is incorporated into the current note paragraph i and the new note mean of the paragraph is calculated. Every point in the same note paragraph is finally assigned the final note mean of that paragraph. If the current pitch point has no value, it is temporarily assigned the note value of the previous point. Preferably, A1 takes the value 2/5 of a semitone. After the first "dynamic average filtering" pass, the resulting sequence is subjected to the same "dynamic average filtering" twice more; in these passes no-value points no longer need special handling, and the note-change thresholds A2 and A3 of the second and third rounds are set to 3/5 of a semitone and 1 semitone, respectively. Applying the dynamic averaging to the pitch sequence in three layers prevents both the destruction of small note-change information that a single pass with too low a note-change threshold would cause, and the residual pitch fluctuation that too high a threshold would leave. The effect obtained after the three-layer "dynamic average filtering" is shown in Fig. 2, where the ordinate is frequency (Hz) and the abscissa is time (s); the sequence in Fig. 2(d) has also undergone the no-value point restoration step.
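The quantization and the three-pass dynamic average filtering can be sketched as follows. This is an illustrative NumPy reconstruction, assuming a 1/5-semitone grid (so the 2/5-semitone, 3/5-semitone and 1-semitone thresholds become 2, 3 and 5 quantization units) and using NaN for no-value points; the helper names are not part of the invention.

```python
import numpy as np

def quantize_pitch(f0_hz):
    # 1/5-semitone resolution referenced to 440 Hz: q = round(60 * log2(p / 440))
    return np.round(60.0 * np.log2(np.asarray(f0_hz, dtype=float) / 440.0))

def dynamic_average_filter(q, threshold, fill_no_value=False):
    """One pass of the 'dynamic average filtering' described above.

    A new note paragraph starts whenever the next pitch point differs from the
    running mean of the current paragraph by more than `threshold` (in
    1/5-semitone units); otherwise the point joins the current paragraph.
    Every point of a paragraph is finally assigned the paragraph mean.
    """
    q = np.asarray(q, dtype=float)
    out = q.copy()
    start, total, count = 0, 0.0, 0
    for m in range(len(q)):
        if np.isnan(q[m]):
            if fill_no_value and m > 0:      # first pass: borrow the previous value
                out[m] = out[m - 1]
            continue
        if count and abs(q[m] - total / count) > threshold:
            out[start:m][~np.isnan(q[start:m])] = total / count   # close paragraph
            start, total, count = m, 0.0, 0
        elif count == 0:
            start = m
        total += q[m]
        count += 1
    if count:
        out[start:][~np.isnan(q[start:])] = total / count
    return out

def three_pass_filter(q):
    # Thresholds of 2/5, 3/5 and 1 semitone = 2, 3 and 5 quantization units.
    q = dynamic_average_filter(q, threshold=2, fill_no_value=True)
    q = dynamic_average_filter(q, threshold=3)
    q = dynamic_average_filter(q, threshold=5)
    return q
```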
Then, abrupt-value screening is performed on the obtained note sequence: note-value points whose amplitude differs from the previous note-value point by more than B are regarded as error points and assigned the note value of the previous point. Next, no-value point restoration is performed, i.e. the points that had no value before the first round of "dynamic average filtering" are restored to having no value, and all note values are converted back to frequency values in hertz (Hz). Preferably, B takes the value of 9 semitones.
After the above steps, the resulting sequence still needs processing of value-fluctuation regions of very short duration and of no-value regions of very short duration, because the former are mostly caused by ornaments and pitch spikes of each note near its start and end points, while the latter are mostly caused by the aforementioned misjudgment of singing regions as non-singing. The processing is as follows: whenever a value-fluctuation region with fewer than t_s points or a no-value region with fewer than t_n points appears, the note value of that region is assigned to the note value of the nearest preceding note paragraph. Preferably, t_s and t_n take the values 2 and 5, respectively.
In addition, the singing of many skilled singers often contains "ornamental notes" (grace notes) in the note timeline. Here an ornament is defined as a short segment, at the start of a note, whose pitch is slightly lower than the note it decorates; on the pitch timeline it appears as a small downward bulge at the beginning of the note. Ornaments are a very common phenomenon. Since ornamentation is a highly personalized artistic treatment, in singing-evaluation applications the presence of an ornament, although it makes the singer deviate in pitch from the target passage, should still not be counted negatively in terms of note accuracy.
The ornament phenomenon is detected and handled as follows: when a note segment in the note timeline processed by the previous steps has a length within l_th points and its note frequency differs from that of the next note segment by no more than c_th, the segment is set to the note value of the next note segment, i.e. the ornament is directly treated as part of the note it decorates. Preferably, l_th is taken as 15 points and c_th as 30 Hz.
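The post-processing described in the preceding paragraphs (absorbing very short fluctuation and no-value regions into the preceding note paragraph, and merging ornamental segments into the following note) might be sketched as below. The run-splitting helper and the NaN convention for no-value points are assumptions of this sketch; abrupt-change screening and no-value restoration are omitted for brevity. Thresholds follow the preferred values given above (t_s = 2, t_n = 5, l_th = 15 points, c_th = 30 Hz).

```python
import numpy as np

def runs(values):
    """Split a note timeline into runs (start, end, value); no-value runs get value None."""
    out, start = [], 0
    for i in range(1, len(values) + 1):
        same = i < len(values) and (values[i] == values[start]
                                    or (np.isnan(values[i]) and np.isnan(values[start])))
        if not same:
            v = None if np.isnan(values[start]) else float(values[start])
            out.append((start, i, v))
            start = i
    return out

def postprocess_notes(notes_hz, t_s=2, t_n=5, l_th=15, c_th=30.0):
    out = np.asarray(notes_hz, dtype=float).copy()

    # 1. Very short valued fluctuation regions (< t_s points) and very short
    #    no-value regions (< t_n points) take the preceding note's value.
    segs = runs(out)
    for k, (a, b, v) in enumerate(segs):
        short = (v is not None and b - a < t_s) or (v is None and b - a < t_n)
        if k > 0 and short and not np.isnan(out[a - 1]):
            out[a:b] = out[a - 1]

    # 2. Ornament merging: a short segment (<= l_th points) whose frequency is
    #    within c_th Hz of the next segment takes the next segment's note value.
    segs = runs(out)
    for k, (a, b, v) in enumerate(segs[:-1]):
        nxt = segs[k + 1][2]
        if v is not None and nxt is not None and b - a <= l_th and abs(v - nxt) <= c_th:
            out[a:b] = nxt
    return out
```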
Steps 103 to 105 compare and align the note timelines using the dynamic time warping algorithm (DTW algorithm). Based on the note timelines obtained in the transcription step, the similarity in time between the user and the target singer is compared, giving an indication of whether the user rushed or dragged the beat in the sung passage and finally an evaluation of beat accuracy. In addition, the two note timelines are aligned with each other according to the result of the timeline similarity comparison, and an objective evaluation of note accuracy is then given.
The specific process is as follows: first, a "difference penalty matrix" (cost matrix) between two note sequences of the user and the target is calculated:
M(i, j) = min{ (n0T(i) - n0U(j))^2, α }

where M(i, j) is the difference penalty between the i-th point of the target note sequence and the j-th point of the user note sequence, and is also the element in row i, column j of the difference penalty matrix; n0T is the target note sequence and n0U is the user note sequence. The difference penalty is simply the square of the difference between the two values, capped at a maximum penalty of α. Preferably, α is 0.5.
After this matrix is obtained, every point of the penalty matrix is traversed. Let the current point be (i, j): among the three points (i, j), (i+1, j) and (i, j+1), the one with the minimum accumulated value is found and its value is added to point (i+1, j+1), and the direction of that minimum is recorded at point (i+1, j+1) as the selected direction. This represents, for a path moving back from the lower right to the upper left, the direction of each step and the penalty borne when moving toward the upper left from each possible point. The new matrix accumulated in this way is the cumulative difference penalty matrix. In the cumulative penalty matrix, each element can be calculated using the following formula:
Ms(i, j) = M(i, j) + min{ Ms(i-1, j), Ms(i, j-1), Ms(i-1, j-1) }
where Ms(i, j) is the element in row i, column j of the cumulative penalty matrix, and M is the difference penalty matrix defined above.
Then, starting from the lower-right element of the cumulative difference penalty matrix, the best path through the matrix is found by stepwise backtracking. Since at every point the direction toward the upper left with the minimum penalty for the next step was selected and recorded, and the corresponding penalty was added to the current point, the direction information of every point passed through can be recorded in a horizontal direction vector and a vertical direction vector, and the best path composed of these two direction vectors is finally obtained.
When the "best path" is plotted with the "vertical" vector in the "best path" as the vertical coordinate and the "horizontal" vector as the horizontal coordinate, the following is found: in the two-segment note sequence, if the time points of occurrence of the respective notes (even with small amplitude deviation) of the two are similar, that is, the user singing voice has no beat error, the "best path" will be reflected as a diagonal straight line; on the contrary, if the time points of certain notes of the two are different, that is, the singing voice of the user has the phenomenon of "shooting in a rush" or "shooting in a slow way" at the position of certain notes, the "best path" will be reflected as a horizontal or vertical straight line. In the prior art, the specific time point and degree of the beat error of the singing voice of the user can be found out by linearly fitting the optimal path and according to the difference between the optimal path and the fitting straight line, so that the beat accuracy of the singer can be visually evaluated. The method is a rough beat accuracy evaluation method, which is a feasible method for short-time song segments, but for longer-time songs, even for the whole song, the beat error of a singer at a certain time point may cause the objective difference between the whole linear fitting path and the DTW optimal path at other time points, so that the beat accuracy evaluation at different time points has objectivity. As an implementation manner of this embodiment, this embodiment sets a "beat error indication value", and when evaluating beat accuracy, instead of performing linear fitting on the optimal path, the indication value is set to an initial 0 state, and the entire path is traversed, and when a horizontal segment of the path occurs, the indication value is +1, and when a vertical segment of the path occurs, the value is setIs-1, otherwise its value is unchanged. Finally, the numerical variation of the indicated value in the whole singing voice time period is observed, and the time period which is more than 0 or less than 0 means possible beat-up or beat-down. Since the human ear is not perceptive to a range of beat errors and since in the deduction of the same song by different singers, an artistic expression may be intentionally created by offsetting a small number of tones, a threshold value D for the amplitude of beat errors is settAnd merely shifting an indication value of the shift amount outside the value is regarded as the occurrence of a beat error. Preferably, DtIs 20 points.
In addition, since the best path actually records, through the order in which the indices of the two sequences advance, the specific way in which the user's singing differs from the target singing in tempo, the indices of the two sequences are adjusted according to the order reflected by the best path to obtain two aligned note sequences (note that the aligned sequences are no longer "note timelines"). The significance of these sequences is that a basic evaluation of note accuracy can be given reasonably, essentially unaffected by the user's beat errors.
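The whole DTW comparison described above might be reconstructed as in the following sketch: the capped squared-difference penalty matrix, the cumulative matrix, backtracking of the best path, the beat error indication value, and the alignment of the two note sequences. It assumes that the note sequences contain no no-value points and are expressed in units consistent with the preferred cap α = 0.5; whether +1 corresponds to rushing or dragging depends on which sequence is placed on which axis, so the sign convention here is illustrative.

```python
import numpy as np

def dtw_compare(target, user, alpha=0.5):
    """Difference penalty matrix, cumulative matrix and best path for two note sequences."""
    target = np.asarray(target, dtype=float)
    user = np.asarray(user, dtype=float)
    T, U = len(target), len(user)
    M = np.minimum((target[:, None] - user[None, :]) ** 2, alpha)   # difference penalty matrix

    Ms = np.zeros((T, U))                                           # cumulative penalty matrix
    for i in range(T):
        for j in range(U):
            prev = [Ms[i - 1, j] if i else np.inf,
                    Ms[i, j - 1] if j else np.inf,
                    Ms[i - 1, j - 1] if i and j else np.inf]
            Ms[i, j] = M[i, j] + (0.0 if i == 0 and j == 0 else min(prev))

    # Backtrack from the lower-right corner to the upper-left corner.
    i, j, path = T - 1, U - 1, [(T - 1, U - 1)]
    while i > 0 or j > 0:
        cands = []
        if i and j:
            cands.append((Ms[i - 1, j - 1], (i - 1, j - 1)))
        if i:
            cands.append((Ms[i - 1, j], (i - 1, j)))
        if j:
            cands.append((Ms[i, j - 1], (i, j - 1)))
        _, (i, j) = min(cands)
        path.append((i, j))
    return M, Ms, path[::-1]

def beat_indicator(path, d_t=20):
    """+1 for each horizontal step, -1 for each vertical step along the best path.
    Sustained excursions beyond d_t points suggest rushing (>0) or dragging (<0)."""
    ind, trace = 0, []
    for (i0, j0), (i1, j1) in zip(path[:-1], path[1:]):
        if i1 == i0 and j1 == j0 + 1:        # horizontal segment of the path
            ind += 1
        elif j1 == j0 and i1 == i0 + 1:      # vertical segment of the path
            ind -= 1
        trace.append(ind)
    trace = np.array(trace)
    return trace, np.abs(trace) > d_t        # indication values and beat-error flags

def align_sequences(target, user, path):
    """Index both note sequences along the best path to obtain two aligned sequences."""
    idx = np.array(path)
    target = np.asarray(target, dtype=float)
    user = np.asarray(user, dtype=float)
    return target[idx[:, 0]], user[idx[:, 1]]
```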
Since in this embodiment the user's and the target's singing are considered to be performed over the same accompaniment, phenomena such as singing an octave higher or an octave lower are not treated specially in the note accuracy evaluation; any difference between the two note sequences is simply counted as a note accuracy error. If different pitch references need to be considered, the mean of each sequence can be computed and subtracted from that sequence, after which the similarity of the two sequences of relative notes can be compared.
It should be noted that note errors and beat errors below a certain magnitude are not perceptible to human listeners. This is already taken into account when giving the beat error index; when giving the note error, it is preferable either to apply a threshold and ignore note sequence differences within a certain range, or to use a nonlinear mapping from difference to score, so that small differences have a weakened influence on the score.
Example 2
On the basis of embodiment 1, this embodiment also performs emotion fullness evaluation and vocal range fitness evaluation on the singing voice segment to be evaluated. The method specifically comprises the following steps: selecting segments whose note value is stable on the note timeline corresponding to the singing voice segment to be evaluated and whose duration is longer than a set time value, and recording them as target segments; extracting the frequency spectrum shape characteristic, harmonic energy characteristic, vibrato characteristic and sound intensity characteristic of the target segments; and inputting these characteristics into a trained neural network model to obtain the emotion fullness evaluation and the vocal range fitness evaluation.
In this embodiment, first, segments with a stable note value and long duration are located on the extracted note timeline, because in such segments the singer has more room to display singing technique and timbre, and a greater chance of exposing problems in his or her own voice such as insufficient breath or an unsuitable vocal range; the frequency spectrum shape, harmonic energy and vibrato characteristic computations performed later are carried out only in these regions. Then, on the basis of the framing, Short-Time Fourier Transform (STFT) and Root Mean Square (RMS) calculations are carried out on the user's singing signal frame by frame; the spectrum vectors of the frames are juxtaposed to form a spectrum-time matrix, and the RMS values are connected frame by frame to form a sound intensity timeline. From the spectrum-time matrix, the frequency spectrum shape characteristic and harmonic energy characteristic of the singing signal can be computed and extracted; the harmonic energy characteristic requires knowing in advance the fundamental frequency of the singing voice and the frequency positions of several harmonics (overtones), so the pitch timeline is used as a reference. Meanwhile, the vibrato characteristic of the singing voice is obtained from the relation between the pitch timeline and the note timeline at the same time positions. Finally, the frequency spectrum shape characteristic, harmonic energy characteristic, vibrato characteristic and sound intensity characteristic are input into the trained neural network model to obtain the emotion fullness evaluation and the vocal range fitness evaluation.
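A sketch of this characteristic extraction is given below, assuming librosa for the STFT and RMS computations. The concrete definitions used (spectral centroid as the spectrum shape characteristic, magnitude at the first few harmonics of the fundamental as the harmonic energy characteristic, and the standard deviation of the pitch around the note as a vibrato descriptor) are illustrative choices of this sketch rather than formulas fixed by the embodiment; the neural network itself is not sketched.

```python
import numpy as np
import librosa

def segment_features(y, sr, f0, note_hz, n_fft=2048, hop_length=512, n_harmonics=5):
    """Illustrative spectrum-shape, harmonic-energy, vibrato and intensity features
    for a target segment with a stable note value."""
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))   # spectrum-time matrix
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

    # Sound intensity timeline: frame-by-frame root mean square.
    intensity = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop_length)[0]

    # Spectrum shape characteristic: spectral centroid per frame (one simple descriptor).
    centroid = (freqs[:, None] * S).sum(axis=0) / (S.sum(axis=0) + 1e-9)

    # Harmonic energy characteristic: magnitude near integer multiples of the fundamental.
    n = min(S.shape[1], len(f0), len(note_hz), len(intensity))
    harmonic = np.zeros((n_harmonics, n))
    for t in range(n):
        if np.isnan(f0[t]):
            continue
        for h in range(1, n_harmonics + 1):
            k = int(np.argmin(np.abs(freqs - h * f0[t])))
            harmonic[h - 1, t] = S[k, t]

    # Vibrato descriptor: spread of the pitch timeline around the note timeline.
    vibrato_extent = float(np.nanstd(np.asarray(f0[:n]) - np.asarray(note_hz[:n])))

    return {"spectrum_shape": centroid[:n],
            "harmonic_energy": harmonic,
            "vibrato_extent": vibrato_extent,
            "intensity": intensity[:n]}
```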
Example 3
Referring to fig. 3, the present embodiment provides a singing evaluation system including:
a pitch timeline determining module 301 for the vocal region, configured to calculate pitch timelines of the vocal segment to be evaluated and the vocal segment of the target vocal segment respectively;
a note extracting module 302, configured to extract notes from the voiced region pitch timeline through transcription to obtain a note timeline;
a DTW algorithm solving module 303, configured to solve a DTW optimal path and a difference penalty matrix between the note time line of the singing voice segment to be evaluated and the note time line of the target singing voice segment;
the beat accuracy evaluation module 304 is configured to traverse the entire DTW optimal path, and when a horizontal segment appears in the DTW optimal path, indicate a value +1, and when a longitudinal segment appears in the DTW optimal path, indicate a value-1, where the indicated value is a beat accuracy evaluation index of the song title to be evaluated;
a note accuracy evaluation module 305, configured to align the note sequence of the singing voice segment to be evaluated with the note sequence of the target singing voice segment according to the difference penalty matrix and the DTW optimal path, and perform note accuracy evaluation on the singing voice segment to be evaluated after alignment
The voiced region pitch timeline determining module 301 may specifically include:
the windowing and framing operation unit is used for respectively carrying out windowing and framing operation on the singing voice segment to be evaluated and the target singing voice segment;
the pitch calculation unit is used for respectively carrying out pitch calculation based on a YIN algorithm on the singing voice segment to be evaluated and the target singing voice segment frame by frame;
a fundamental tone vector determining unit, which is used for connecting the fundamental tones of each adjacent frame together to obtain a fundamental tone vector;
a pitch time line determining unit, configured to combine the pitch vector with the time corresponding to each frame to determine a pitch time line;
and the voiced region detection unit is used for detecting the voiced regions at all time points on the pitch timeline and extracting the voiced regions to obtain the pitch timeline of the voiced regions.
As an alternative embodiment, the singing evaluation system further comprises:
the target segment selection module is used for selecting segments which have stable note values on the note time lines corresponding to the singing voice segments to be evaluated and have the duration time longer than the set time value, and recording the segments as target segments;
the characteristic extraction module is used for extracting the frequency spectrum shape characteristic, the harmonic energy characteristic, the vibrato characteristic and the sound intensity characteristic of the target segment;
and the advanced evaluation module is used for inputting the frequency spectrum shape characteristic, the harmonic energy characteristic, the vibrato characteristic and the sound intensity characteristic into the trained neural network model to obtain the emotion fullness evaluation and the vocal range fitness evaluation.
The accuracy and reliability of the singing evaluation method and system provided by the invention were tested in the two links of pitch extraction and transcription.
1. Data set extraction and correct result calibration
In testing the system of the present invention, the data set published with "Molina, Emilio, Evaluation framework for automatic singing transcription, 2014" was used. In this data set, each singing audio file has a sampling rate of 44100 Hz, durations range from 15 s to 86 s, and the singing is recorded without accompaniment. The data set is divided into three parts: the first part consists of 14 singing segments recorded under the authors' own organization and sung by 8 different amateur-level young singers; the second part consists of 13 singing segments by 8 different higher-level male singers selected by the authors from the MTG-QBH public data set "Salamon J., Serrà J., Gómez E., Tonal representations for music retrieval: from version identification to query-by-humming, International Journal of Multimedia Information Retrieval, 2013, 2(1): 45-58"; the third part consists of 11 singing segments by 5 different higher-level female singers selected from the same MTG-QBH data set. The above singing segments were annotated by two senior professional researchers in musicology invited by the authors (one with 12 years and the other with 9 years of experience) to obtain the onset time, pitch value and offset time of the correct notes, used as the ground truth for evaluating the accuracy of the system. Time is in seconds, pitch is in MIDI units, and the formula converting frequency to MIDI-standard pitch is as follows.
n_MIDI = 69 + 12 * log2(f / 440)

where n_MIDI denotes the MIDI-standard pitch, with a resolution of 1 semitone.
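For reference, this conversion is a one-liner; the helper name below is illustrative.

```python
import math

def hz_to_midi(f_hz: float) -> float:
    # MIDI-standard pitch referenced to A4 = 440 Hz (MIDI note 69).
    return 69.0 + 12.0 * math.log2(f_hz / 440.0)
```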
2. System performance evaluation index
In order to compare the relevant links of the method provided by the invention with several mainstream algorithms of the same function, the evaluation indices of transcription accuracy and the description of transcription errors in "Molina, Emilio, Evaluation framework for automatic singing transcription, 2014" are followed. First, the i-th note extracted by the method provided by the invention is defined as n_TR_i, the j-th note in the standard answer as n_GT_j, the total number of notes extracted by the system as N_TR, and the total number of notes in the standard answer as N_GT. Since each note represents a paragraph carrying a note value, it contains three pieces of information: onset time, offset time and note value. The evaluation framework stipulates that if the onset time or the offset time of a note transcribed by the system is within 50 ms of the standard answer, the note is considered to have a correct onset or a correct offset, denoted Con (correct onset) and Coff (correct offset), respectively; if the MIDI pitch value of a transcribed note is within 0.5 semitones of the standard answer, it is considered a note with a correct pitch value, denoted Cp (correct pitch).
Different correct conditions may occur when the system transcribes notes, and the following three kinds of correct notes are of interest: first, notes that satisfy the correct-onset condition, Con; second, notes that satisfy the correct onset and pitch value conditions, Con and Cp, denoted Conp; third, notes that satisfy the correct onset, pitch value and offset conditions, Con, Cp and Coff, denoted Conpoff. When the notes in the standard answer are taken as the reference, let the number of notes in the transcribed note sequence satisfying condition X (one of Con, Conp and Conpoff) be N_GT^X; the precision of the system under condition X, CX_precision, is defined as follows:

CX_precision = N_GT^X / N_GT

When the notes transcribed by the system are taken as the reference, let the number of notes in the standard-answer note sequence satisfying condition X be N_TR^X; the recall of the system under condition X, CX_recall, is defined as follows:

CX_recall = N_TR^X / N_TR

Combining precision and recall gives an overall evaluation of the system, the F-measure (F value):

CX_F-measure = 2 * CX_precision * CX_recall / (CX_precision + CX_recall)
the above scores are collectively referred to herein as "correct note performance".
3. Results of the experiment
The evaluation of the system of the present invention in terms of performance of the correct note is shown in tables 1, 2, 3 and 4, where table 1 shows the test results on the first part of the data set, tables 2 and 3 show the test results on the second and third parts, respectively, and table 4 shows the combined test results of all data sets.
TABLE 1 (presented as an image in the original publication)

TABLE 2 (presented as an image in the original publication)

TABLE 3 (presented as an image in the original publication)

TABLE 4 (presented as an image in the original publication)
And "Molina, emulsion, evaluation frame for automatic segmentation transformation [ J ]]2014 "3 other mainstream Algorithms mentioned herein [" Mez E, Bonada J.Towards Computer-Assisted Flamenco transfer: An Experimental Computer company of Automatic transfer Algorithms as Applied to A Cappella Single [ J].Computer Music Journal,2013,37(37):73-90”|,“
Figure BDA0002683559440000144
&Nen M P,Klapuri A P.Automatic transcription of melody,bass line,and chords in polyphonic music[J].Computer Music Journal,2008,32(3):72-86”,“De Mulder T,Martens J P,Lesaffre M,et al.Recent improvements of an auditory model based front-end for the transcription of vocal queries[C]//2004:257-260”]The correct note performance scores on the same dataset and standard were compared, thisThe method provided by the invention has similar levels with the algorithms in Con and Conpoff in the aspects of pitch extraction and transcription, and can be lower than 10% higher than the algorithms in Conp.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the implementation mode of the invention are explained by applying a specific example, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (12)

1. A singing evaluation method is characterized by comprising the following specific steps:
(1) calculating the voiced-region pitch timelines of the singing voice segment to be evaluated and of the target singing voice segment, respectively;
(2) extracting notes from the pitch time line of the voiced region through transcription to obtain a note time line;
(3) solving a DTW optimal path and a difference penalty matrix between the note time line of the singing voice segment to be evaluated and the note time line of the target singing voice segment by adopting a dynamic time warping algorithm, namely a DTW algorithm;
(4) traversing the whole DTW optimal path, wherein the indication value is incremented by 1 when a transverse segment appears in the DTW optimal path and decremented by 1 when a longitudinal segment appears; the indication value is the beat accuracy evaluation index of the singing voice segment to be evaluated;
(5) and aligning the note sequence of the singing voice segment to be evaluated with the note sequence of the target singing voice segment according to the difference penalty matrix and the DTW optimal path, and evaluating the note accuracy of the singing voice segment to be evaluated after aligning.
2. The singing evaluation method according to claim 1, wherein in step (2), said extracting notes from said vocal tract pitch timeline by transcription to obtain a note timeline comprises the following steps:
(2.1) determining a pitch value time sequence of said voiced region pitch timeline;
(2.2) quantizing the pitch value time sequence by taking the first set proportion of the semitone as resolution to obtain a quantized pitch value time sequence;
(2.3) continuously carrying out three times of dynamic average filtering processing on the quantized time sequence of the pitch values; wherein the dynamic average filtering process specifically includes: calculating the difference value between the pitch value of the next pitch point m +1 and the mean value of the notes of the current note paragraph i at the pitch point m of the current note paragraph i, and taking the pitch point m +1 as the start of a new note paragraph i +1 when the difference value is larger than a set amplitude, otherwise, incorporating the pitch point m +1 into the current note paragraph i;
the set amplitude in the first dynamic average filtering is a second set proportion of a semitone, the set amplitude in the second dynamic average filtering is a third set proportion of a semitone, and the set amplitude in the third dynamic average filtering is a fourth set proportion of a semitone.
3. The singing evaluation method according to claim 2, wherein the first set proportion amount is 1/5, the second set proportion amount is 2/5, the third set proportion amount is 3/5, and the fourth set proportion amount is 1.
4. The singing evaluation method according to claim 2, further comprising, after said three successive dynamic average filtering processes on the quantized pitch value time series:
and assigning the note values of value-fluctuation regions whose number of pitch points is smaller than the first set number, and of no-value regions whose number of pitch points is smaller than the second set number, to the note value of the nearest preceding note paragraph.
5. The singing evaluation method according to claim 4, wherein said first set number is 2 and said second set number is 5.
6. The singing evaluation method according to claim 4, wherein after assigning the note values of value-fluctuation regions whose number of pitch points is smaller than the first set number and of no-value regions whose number of pitch points is smaller than the second set number to the note value of the nearest preceding note segment, the method further comprises:
assigning to a note segment whose length is within a third set number of pitch points, and whose note frequency differs from that of the next note segment by less than a set frequency, the note value of that next note segment; the third set number is 15.
7. The singing evaluation method according to claim 1, further comprising performing emotional fullness evaluation and vocal range suitability evaluation on the singing voice segment to be evaluated, which specifically comprises:
selecting a segment whose note value is stable on the note timeline corresponding to the singing voice segment to be evaluated and whose duration is longer than a set time value, and recording it as a target segment;
extracting the spectral shape feature, harmonic energy feature, vibrato feature and sound intensity feature of the target segment;
inputting the spectral shape feature, the harmonic energy feature, the vibrato feature and the sound intensity feature into a trained neural network model to obtain the emotional fullness evaluation and the vocal range suitability evaluation.
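The claim only states that the four features feed a trained neural network; no architecture is disclosed. A minimal assumed PyTorch sketch, with illustrative layer sizes and an untrained model used only to show the input/output shape, could look like this:

import torch
import torch.nn as nn

class HighLevelScorer(nn.Module):
    """Assumed architecture: four scalar features in, two scores out
    (emotional fullness and vocal range suitability)."""
    def __init__(self, n_features=4, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, x):
        return self.net(x)

# usage: one feature vector per stable target segment (values illustrative)
scorer = HighLevelScorer()
features = torch.tensor([[0.42, 0.18, 5.6, 0.07]])
emotion_score, range_score = scorer(features)[0]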
8. The singing evaluation method according to claim 1, wherein steps (3)-(5) compare and align the note timelines by means of the dynamic time warping (DTW) algorithm: based on the note timelines obtained in the transcription step, the timing of the user is compared with that of the target singer for similarity, so as to indicate whether the user rushes or drags the beat in a given singing passage and to give the final beat accuracy evaluation; in addition, according to the timeline similarity comparison result, the two pitch timelines are aligned with each other and an objective note accuracy evaluation is given; the specific process is as follows:
first, a "difference penalty matrix" is calculated between two segments of the note sequence for the user and the target:
M(i,j) = \min\{(n_{0T}(i) - n_{0U}(j))^2,\ \alpha\}
wherein M(i, j) is the difference penalty between the ith point of the target note sequence and the jth point of the user note sequence, i.e. the element in the ith row and jth column of the difference penalty matrix; n_{0T} is the target note sequence and n_{0U} is the user note sequence, so the difference penalty is expressed by the square of the difference between the two values; α is the upper limit of the penalty value;
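A direct NumPy rendering of this difference penalty matrix might look as follows; the cap value alpha is illustrative, since its numerical value is not given in the text.

import numpy as np

def difference_penalty_matrix(target_notes, user_notes, alpha=4.0):
    """M(i, j) = min((n0T(i) - n0U(j))**2, alpha); alpha caps the penalty."""
    t = np.asarray(target_notes, dtype=float)[:, None]   # column: target index i
    u = np.asarray(user_notes, dtype=float)[None, :]     # row: user index j
    return np.minimum((t - u) ** 2, alpha)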
after the difference penalty matrix is obtained, each point of the matrix is traversed; letting the currently traversed point be (i, j), the point with the minimum accumulated value is found among the current point (i, j), the next-row point (i+1, j) and the next-column point (i, j+1), its value is added to the point (i+1, j+1), and the direction of that minimum is recorded as the selected direction at that point; in this way, the direction of each step and the penalty incurred along it are known when moving back from the lower right toward the upper left; the new matrix accumulated in this step is the cumulative difference penalty matrix, in which each element is calculated by the following formula:
M_s(i,j) = M(i,j) + \min\{M_s(i-1,j),\ M_s(i,j-1),\ M_s(i-1,j-1)\}
wherein M_s(i, j) is the element in the ith row and jth column of the cumulative penalty matrix, and M is the difference penalty matrix defined above;
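A straightforward (unoptimized) NumPy sketch of this accumulation, treating out-of-range neighbours as infinite penalty, is:

import numpy as np

def cumulative_penalty_matrix(M):
    """Ms(i, j) = M(i, j) + min(Ms(i-1, j), Ms(i, j-1), Ms(i-1, j-1)),
    with Ms(0, 0) = M(0, 0)."""
    rows, cols = M.shape
    Ms = np.zeros_like(M, dtype=float)
    for i in range(rows):
        for j in range(cols):
            if i == 0 and j == 0:
                Ms[i, j] = M[i, j]
                continue
            up   = Ms[i - 1, j]     if i > 0 else np.inf
            left = Ms[i, j - 1]     if j > 0 else np.inf
            diag = Ms[i - 1, j - 1] if i > 0 and j > 0 else np.inf
            Ms[i, j] = M[i, j] + min(up, left, diag)
    return Ms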
then, starting from the lower-right corner element of the cumulative difference penalty matrix, the optimal path in the matrix is found by stepwise backtracking; since at every point of the matrix the direction with the minimum penalty for the next step toward the upper-left corner was selected and recorded, and the corresponding penalty was added to the current point, the direction information of every point passed through is recorded in a horizontal direction vector and a vertical direction vector, and the optimal path consisting of these two direction vectors is finally obtained;
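Backtracking can then be sketched as follows: starting at the lower-right corner, the path repeatedly steps to the reachable neighbour with the smallest accumulated penalty and records the row (vertical) and column (horizontal) index of every visited point. Recomputing the minimum during backtracking, rather than reusing stored directions, is an implementation shortcut of this sketch, not something prescribed by the text.

import numpy as np

def backtrack_best_path(Ms):
    """Trace the best path from the lower-right corner of the cumulative
    penalty matrix back to the upper-left corner."""
    i, j = Ms.shape[0] - 1, Ms.shape[1] - 1
    vertical, horizontal = [i], [j]
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((Ms[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((Ms[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((Ms[i, j - 1], i, j - 1))
        _, i, j = min(candidates)          # step with the smallest penalty
        vertical.append(i)
        horizontal.append(j)
    vertical.reverse()
    horizontal.reverse()
    return np.array(vertical), np.array(horizontal)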
when the "best path" is drawn with the "vertical" vector in the "best path" as the vertical coordinate and the "horizontal" vector as the horizontal coordinate, the following results: in the two note sequences, if the time points of the two notes are similar, namely the singing voice of the user has no beat error, the 'best path' is reflected as a section of oblique straight line; on the contrary, if the time points of certain notes of the two notes are different, namely the singing voice of the user has the phenomenon of 'quick shooting' or 'slow shooting' at the position of certain notes, the 'best path' is reflected as a horizontal or vertical straight line;
a beat error indicator value is defined and initialized to 0 when evaluating beat accuracy; the whole path is traversed, the indicator value being increased by 1 where the path contains a horizontal segment, decreased by 1 where it contains a vertical segment, and left unchanged otherwise; finally, the variation of the indicator value over the whole singing period is observed, periods in which it is greater than 0 or less than 0 indicating possible rushing or dragging of the beat, respectively; since the human ear does not perceive beat errors within a certain range, and since different singers rendering the same song may deliberately shift the timing of a few notes for artistic effect, a threshold D_t for the beat error amplitude is set, and only offsets whose magnitude exceeds this value are regarded as beat errors; D_t is 20 points;
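The indicator traversal with the D_t = 20 threshold can be sketched as below; the mapping of positive values to rushing and negative values to dragging follows the description above.

import numpy as np

def beat_error_indicator(vertical, horizontal, threshold=20):
    """+1 for a horizontal step, -1 for a vertical step, 0 for a diagonal step;
    values beyond +threshold suggest rushing, beyond -threshold dragging."""
    indicator = [0]
    for k in range(1, len(vertical)):
        dv = vertical[k] - vertical[k - 1]
        dh = horizontal[k] - horizontal[k - 1]
        step = 0
        if dh == 1 and dv == 0:       # horizontal segment
            step = +1
        elif dv == 1 and dh == 0:     # vertical segment
            step = -1
        indicator.append(indicator[-1] + step)
    indicator = np.array(indicator)
    rushed  = indicator >  threshold
    dragged = indicator < -threshold
    return indicator, rushed, dragged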
since the resulting optimal path is in fact the order in which the indices of the two sequences appear, it records exactly how the tempo of the user's singing differs from that of the target singing; by re-ordering the indices of the user note sequence and the target note sequence according to the index order reflected in the optimal path, two aligned note sequences are obtained; these aligned sequences are unaffected by the user's beat errors, so a reasonable and objective basic evaluation of note accuracy can be given from them.
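Alignment by path indices can be sketched as follows; the mean absolute semitone error used as the note accuracy score is only an illustrative choice, since the exact scoring formula is not specified here.

import numpy as np

def align_and_score(target_notes, user_notes, vertical, horizontal):
    """Reorder both note sequences by the index order of the best path, then
    give a simple note accuracy score (mean absolute semitone error)."""
    aligned_target = np.asarray(target_notes, dtype=float)[vertical]
    aligned_user   = np.asarray(user_notes,   dtype=float)[horizontal]
    mean_abs_error = float(np.mean(np.abs(aligned_target - aligned_user)))
    return aligned_target, aligned_user, mean_abs_error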
9. The singing evaluation method according to claim 7, wherein the extraction of the spectral shape feature, harmonic energy feature, vibrato feature and sound intensity feature of the target segment is carried out as follows:
after framing the target segment, short-time Fourier transform processing and root-mean-square calculation are performed on the singing voice signal frame by frame; the spectrum vectors obtained for the individual frames are placed side by side to form a spectrum-time matrix, and the root-mean-square values are connected frame by frame to form a sound intensity timeline; the spectral shape feature and the harmonic energy feature of the singing voice signal are calculated and extracted from the spectrum-time matrix, where calculating the harmonic energy feature requires knowing the fundamental frequency of the singing voice and the frequency positions of several harmonic multiples, so the pitch timeline is used as a reference; meanwhile, the vibrato feature of the singing voice is derived from the relationship between the pitch timeline and the note timeline at the same time positions.
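A possible Python sketch of this feature extraction, using librosa for the STFT and RMS, is shown below; the specific choices of spectral centroid as the spectral shape feature, the first five harmonics for harmonic energy, and pitch deviation around the note value as the vibrato measure are assumptions, since these definitions are not fixed in the text.

import numpy as np
import librosa

def segment_features(y, sr, f0_hz, note_hz):
    """Illustrative features for one stable target segment: spectral centroid,
    relative harmonic energy at multiples of the pitch, vibrato extent, and
    frame-wise RMS intensity. f0_hz and note_hz are the pitch and note
    timelines of the segment (same length, in Hz)."""
    n_fft, hop = 2048, 512
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))      # spectrum-time matrix
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)[0]

    centroid = float(librosa.feature.spectral_centroid(S=S, sr=sr)[0].mean())

    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    f0 = float(np.nanmedian(f0_hz))
    harmonic_energy = sum(
        S[np.argmin(np.abs(freqs - k * f0)), :].mean()
        for k in range(1, 6) if k * f0 < sr / 2
    ) / (S.mean() + 1e-9)

    vibrato_extent = float(np.nanstd(
        12.0 * np.log2(np.asarray(f0_hz) / np.asarray(note_hz))))  # semitone deviation

    intensity = float(rms.mean())
    return np.array([centroid, harmonic_energy, vibrato_extent, intensity])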
10. A singing evaluation system based on the method of any one of claims 1-9, comprising:
the voiced region pitch timeline determination module is used for respectively calculating the voiced region pitch timelines of the singing voice segment to be evaluated and the target singing voice segment;
the note extraction module is used for extracting notes from the pitch time line of the voiced region through transcription to obtain a note time line;
the DTW algorithm solving module is used for solving a DTW optimal path and a difference penalty matrix between the note time line of the singing voice segment to be evaluated and the note time line of the target singing voice segment;
the beat accuracy evaluation module is used for traversing the whole DTW optimal path, wherein the indicator value is increased by 1 when a horizontal segment appears in the DTW optimal path and decreased by 1 when a vertical segment appears in the DTW optimal path, the indicator value being the beat accuracy evaluation index of the singing voice segment to be evaluated;
a note accuracy evaluation module, configured to align a note sequence of the singing voice segment to be evaluated with a note sequence of the target singing voice segment according to the difference penalty matrix and the DTW optimal path, and perform note accuracy evaluation on the singing voice segment to be evaluated after alignment;
that is, the above five modules respectively execute the five steps of the singing evaluation method.
11. The singing evaluation system according to claim 10, wherein said voiced region pitch timeline determination module specifically comprises:
the windowing and framing operation unit is used for respectively carrying out windowing and framing operation on the singing voice segment to be evaluated and the target singing voice segment;
the pitch calculation unit is used for respectively carrying out pitch calculation based on a YIN algorithm on the singing voice segment to be evaluated and the target singing voice segment frame by frame;
the pitch vector determining unit is used for connecting the pitch values of adjacent frames together to obtain a pitch vector;
a pitch time line determining unit, configured to combine the pitch vector with the time corresponding to each frame to determine a pitch time line;
and the voiced region detection unit is used for detecting the voiced regions at all time points on the pitch timeline and extracting the voiced regions to obtain the pitch timeline of the voiced regions.
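A sketch of this module chain in Python could look as follows; librosa.pyin (a probabilistic YIN variant) is used here as a readily available stand-in for the YIN pitch tracker plus voiced-region detection, and the frame length, hop size and pitch range are assumed values.

import librosa

def voiced_region_pitch_timeline(path, fmin=80.0, fmax=1000.0):
    """Load a singing voice segment, compute a frame-by-frame pitch timeline,
    and keep only the voiced regions."""
    y, sr = librosa.load(path, sr=None, mono=True)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=fmin, fmax=fmax, sr=sr,
        frame_length=2048, hop_length=512)           # windowed, frame by frame
    times = librosa.times_like(f0, sr=sr, hop_length=512)
    # keep only the voiced regions to obtain the voiced region pitch timeline
    return times[voiced_flag], f0[voiced_flag]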
12. The singing evaluation system of claim 10, further comprising:
the target segment selection module is used for selecting a segment whose note value is stable on the note timeline corresponding to the singing voice segment to be evaluated and whose duration is longer than the set time value, and recording it as a target segment;
the characteristic extraction module is used for extracting the frequency spectrum shape characteristic, the harmonic energy characteristic, the vibrato characteristic and the sound intensity characteristic of the target segment;
the high-level evaluation module is used for inputting the spectral shape feature, the harmonic energy feature, the vibrato feature and the sound intensity feature into a trained neural network model to obtain the emotional fullness evaluation and the vocal range suitability evaluation;
that is, the above three modules respectively execute the three steps of the emotional fullness evaluation and vocal range suitability evaluation of the singing voice segment to be evaluated.
CN202010969451.1A 2020-09-15 2020-09-15 Singing evaluation method and system Active CN112233691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010969451.1A CN112233691B (en) 2020-09-15 2020-09-15 Singing evaluation method and system

Publications (2)

Publication Number Publication Date
CN112233691A true CN112233691A (en) 2021-01-15
CN112233691B (en) 2022-07-22

Family

ID=74117251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010969451.1A Active CN112233691B (en) 2020-09-15 2020-09-15 Singing evaluation method and system

Country Status (1)

Country Link
CN (1) CN112233691B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007232750A (en) * 2006-02-27 2007-09-13 Yamaha Corp Karaoke device, control method and program
CN106776664A (en) * 2015-11-25 2017-05-31 北京搜狗科技发展有限公司 A kind of fundamental frequency series processing method and device
CN106935236A (en) * 2017-02-14 2017-07-07 复旦大学 A kind of piano performance appraisal procedure and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
EMILIO MOLINA ET AL.: "Fundamental frequency alignment vs. note-based melodic similarity for singing voice assessment", 2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING *
WANG JIADI: "Research on Robust Music Scoring Methods", CHINA MASTER'S THESES FULL-TEXT DATABASE, PHILOSOPHY AND HUMANITIES *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113674723A (en) * 2021-08-16 2021-11-19 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, computer equipment and readable storage medium
CN113674723B (en) * 2021-08-16 2024-05-14 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN112233691B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
Ryynänen et al. Transcription of the Singing Melody in Polyphonic Music.
Marolt A connectionist approach to automatic transcription of polyphonic piano music
Rigaud et al. Singing Voice Melody Transcription Using Deep Neural Networks.
McNab et al. Tune retrieval in the multimedia library
Gupta et al. Perceptual evaluation of singing quality
CN101093661B (en) Pitch tracking and playing method and system
Toh et al. Multiple-Feature Fusion Based Onset Detection for Solo Singing Voice.
CN101093660B (en) Musical note syncopation method and device based on detection of double peak values
CN105825868A (en) Singer effective range extraction method
Dai et al. Singing together: Pitch accuracy and interaction in unaccompanied unison and duet singing
CN115050387A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
CN112233691B (en) Singing evaluation method and system
Lerch Software-based extraction of objective parameters from music performances
Wong et al. Automatic lyrics alignment for Cantonese popular music
CN111681674B (en) Musical instrument type identification method and system based on naive Bayesian model
CN105244021B (en) Conversion method of the humming melody to MIDI melody
Jie et al. A violin music transcriber for personalized learning
JP3934556B2 (en) Method and apparatus for extracting signal identifier, method and apparatus for creating database from signal identifier, and method and apparatus for referring to search time domain signal
CN113129923A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
Ali-MacLachlan Computational analysis of style in Irish traditional flute playing
Chuan et al. The KUSC classical music dataset for audio key finding
CN111368129A (en) Humming retrieval method based on deep neural network
Ng et al. Automatic detection of tonality using note distribution
Schuller et al. HMM-based music retrieval using stereophonic feature information and framelength adaptation
Panteli et al. Pitch Patterns of Cypriot Folk Music between Byzantine and Ottoman Influence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant