CN110265051A - Sight-singing audio intelligent scoring modeling method applied to sing-along education - Google Patents
- Publication number
- CN110265051A (application CN201910480919.8A)
- Authority
- CN
- China
- Prior art keywords
- audio
- data
- rhythm
- pitch
- sightsinging
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/04—Real-time or near real-time messaging, e.g. instant messaging [IM]
Abstract
The present invention relates to a sight-singing audio intelligent scoring modeling method applied to sing-along education. Step 1: the system divides pre-collected solfège data containing expert scores in a 2:1 ratio, two parts serving as training data and one part as test data. Step 2: the audio data is denoised, silent segments without audio are cut out, and speech-enhancement data preprocessing is performed. Step 3: after preprocessing, audio features are extracted with the mel-frequency cepstral coefficient (MFCC) method, and pitch information is extracted. Step 4: frequency-domain features are extracted with the short-time Fourier transform (STFT), and the beat information they contain is extracted to form rhythm-based features. Step 5: scoring is modeled on the pitch, rhythm, and related feature information. The invention helps users improve their music sight-singing ability.
Description
Technical field
The present invention relates to a sight-singing audio intelligent scoring modeling method applied to sing-along education.
Background technique
This system lets users record audio and upload the audio files to the system's back-end server, where the solfège audio is scored intelligently and the assessment result is fed back to the client. The intelligent scoring module applies machine-learning modeling: by comparing the difference between the voice in the audio and the standard performance, it judges the singing from the two angles of rhythm and intonation, thereby achieving an accurate assessment, returns the result to the user, and helps users improve their music sight-singing ability.
Summary of the invention
The object of the present invention is to provide a sight-singing audio intelligent scoring modeling method applied to sing-along education that helps users improve their music sight-singing ability.
The above object is achieved through the following technical scheme:
A sight-singing audio intelligent scoring modeling method applied to sing-along education, in which data acquisition and preprocessing comprise the following steps:
Step 1: the system divides the pre-collected solfège data, which contain expert scores, in a 2:1 ratio; two parts serve as training data and one part as test data, and modeling is performed on the training data;
Step 2: the audio data is denoised, silent segments without audio are cut out, and speech-enhancement data preprocessing is performed;
Step 3: audio features are extracted from the audio data with the mel-frequency cepstral coefficient (MFCC) method, and pitch information is extracted;
Step 4: frequency-domain features are extracted from the audio data with the short-time Fourier transform (STFT), and the beat information they contain is extracted to form rhythm-based features;
Step 5: intonation and rhythm features are extracted from the standard audio according to steps 2 to 4;
Step 6: the intonation features obtained with the MFCC method are compared between the standard audio and the solfège audio using a dynamic time warping algorithm;
Step 7: the rhythm features obtained with the short-time Fourier method are compared between the standard audio and the solfège audio using a linear hash scaling algorithm;
Step 8: the matching vectors of pitch and rhythm serve as training data for training a neural network; when the error rate on the test data set is below 1%, the verification process ends;
Step 9: through the client interface of a WeChat mini-program, the user uploads sight-singing audio from individual practice; the uploaded audio goes through steps 2 to 4 and the processing of steps 6 and 7 and is then fed into the trained neural-network model, which outputs the corresponding rhythm and intonation scores; the scoring results output by the neural network are sent to the interface of the WeChat mini-program and displayed on the client;
Step 10: the corresponding intonation vector and rhythm vector are returned to the user client interface.
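The 2:1 division of step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the file names and the expert-score field are hypothetical:

```python
import random

def split_2_to_1(samples, seed=42):
    """Shuffle the expert-scored recordings and split them 2:1
    into training and test sets, as in step 1."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3  # two thirds for training
    return shuffled[:cut], shuffled[cut:]

# Hypothetical expert-scored solfege recordings.
dataset = [{"wav": f"take_{i}.wav", "expert_score": 60 + i} for i in range(30)]
train, test = split_2_to_1(dataset)
print(len(train), len(test))  # 20 10
```

Shuffling before the split avoids any ordering bias in the collected recordings leaking into the train/test partition.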
Beneficial effects:
1. The scoring of the invention reaches the level of professional grading; the error against the mean score of multiple human experts is small.
2. The scoring runs efficiently: the multi-angle scoring process completes within 5 seconds, meeting industrial application requirements.
3. The method is robust to noise and still scores well in the presence of some background noise.
4. The scoring process fuses multiple features and can judge the score from multiple angles such as rhythm and intonation.
Detailed description of the invention
Attached drawing 1 is training process schematic diagram of the invention.
Attached drawing 2 is scoring process schematic diagram of the invention.
Attached drawing 3 is the dimensional variation schematic diagram of the filter group of melscale of the invention.
Attached drawing 4 is of the invention by signal decomposition, and the convolution of two signals is converted into the addition schematic diagram of two signals.
Attached drawing 5 is the similitude schematic diagram between the present invention two time serieses of calculating.
Attached drawing 6 is cost matrix schematic diagram of the invention.
Attached drawing 7 is that audio of the invention does after Fourier transformation overfrequency to separate unlike signal schematic diagram.
Specific embodiment
A sight-singing audio intelligent scoring modeling method applied to sing-along education, characterized in that data acquisition and preprocessing comprise the following steps:
Step 1: the system divides the pre-collected solfège data, which contain expert scores, in a 2:1 ratio; two parts serve as training data and one part as test data, and modeling is performed on the training data;
Step 2: the audio data is denoised, silent segments without audio are cut out, and speech-enhancement data preprocessing is performed;
Step 3: audio features are extracted from the audio data with the mel-frequency cepstral coefficient (MFCC) method, and pitch information is extracted;
Step 4: frequency-domain features are extracted from the audio data with the short-time Fourier transform, and the beat information they contain is extracted to form rhythm-based features;
Step 5: intonation and rhythm features are extracted from the standard audio according to steps 2 to 4;
Step 6: the intonation features obtained with the MFCC method are compared between the standard audio and the solfège audio using a dynamic time warping algorithm;
Step 7: the rhythm features obtained with the short-time Fourier method are compared between the standard audio and the solfège audio using a linear hash scaling algorithm;
Step 8: the matching vectors of pitch and rhythm serve as training data for training a neural network; when the error rate on the test data set is below 1%, the verification process ends;
The neural-network training process comprises: (1) selecting the important parameters according to the characteristics of the data, including the activation function, the number of hidden layers of the neural network, the number of nodes in each hidden layer, the learning rate, and so on; (2) taking the features extracted from the training data and the comparison results against the standard-audio features as two vectors, and taking the professional scores given by the experts as the prediction target, training the neural network. The network approximates the target values with the back-propagation algorithm; after the training iterations, the error between the network output and the expert scores falls below a certain threshold, and when the error rate on the test data set is below 1%, the verification process ends. If the target's error range cannot be reached within 10,000 iterations, return to (1) and readjust the settings of the important parameters;
Step 9: through the client interface of a WeChat mini-program, the user uploads sight-singing audio from individual practice; the uploaded audio goes through steps 2 to 4 and the processing of steps 6 and 7 and is then fed into the trained neural-network model, which outputs the corresponding rhythm and intonation scores; the scoring results output by the neural network are sent to the interface of the WeChat mini-program and displayed on the client;
Step 10: the corresponding intonation vector and rhythm vector are returned to the user client interface.
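The training loop of step 8 can be sketched as follows. This is a minimal illustration with synthetic data, not the patent's actual network: the 8-dimensional matching vectors, the hidden-layer size of 16, and the learning rate are all assumed for the example; only the back-propagation training, the 1% stopping threshold, and the 10,000-iteration cap follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical matching vectors (pitch + rhythm) with synthetic
# "expert scores" in (0, 1) as the prediction target.
X = rng.uniform(-1.0, 1.0, size=(300, 8))
w_true = rng.uniform(-1.0, 1.0, size=(8, 1))
y = 1.0 / (1.0 + np.exp(-(X @ w_true)))

X_train, y_train = X[:200], y[:200]
X_test, y_test = X[200:], y[200:]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# (1) Important parameters: one hidden layer, 16 tanh units, learning rate 0.5.
W1 = rng.normal(0.0, 0.5, (8, 16)); b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, (16, 1)); b2 = np.zeros(1)
lr = 0.5

def test_error():
    pred = sigmoid(np.tanh(X_test @ W1 + b1) @ W2 + b2)
    return float(np.abs(pred - y_test).mean())

initial_error = test_error()

# (2) Back-propagation training, stopping once the held-out error drops
# below the 1% threshold or after at most 10,000 iterations.
for _ in range(10_000):
    h = np.tanh(X_train @ W1 + b1)           # forward pass
    out = sigmoid(h @ W2 + b2)
    delta = (out - y_train) * out * (1.0 - out) / len(X_train)
    dW2 = h.T @ delta                        # back-propagated gradients
    db2 = delta.sum(axis=0)
    dh = delta @ W2.T * (1.0 - h ** 2)
    dW1 = X_train.T @ dh
    db1 = dh.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
    if test_error() < 0.01:
        break

final_error = test_error()
print(final_error < initial_error)  # did training reduce the held-out error?
```

In the patent's retry rule, failure to reach the error range within 10,000 iterations would send the process back to parameter selection (1); the sketch simply stops.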
Further, based on step 6, what is mainly compared is the matching degree of the pitch rise and fall between the piano pitch in the standard audio and the sight-singing audio. A linear pitch calibration method is used: the pitches of the voice and the piano are first linearly scaled so that their mean energies are identical, and on that basis the pitch-change matching vectors of the audio sequences are compared.
Further, based on step 7, what is mainly compared is the matching degree of the tempo variation between the piano rhythm in the standard audio and the sight-singing audio. A linear rhythm calibration method is used: the rhythm of the voice is linearly scaled so that its tempo change rate is identical to that of the piano, and on that basis the tempo-variation matching vectors of the audio sequences are compared.
Further, based on step 10, the interface parses the result, marks the positions in the score corresponding to the sight-sung song, and annotates in red the positions where the user's matching degree is poor.
Mel-frequency cepstral coefficients are the coefficients that make up the mel-frequency cepstrum. MFCC feature extraction comprises two key steps: transformation to the mel frequency scale, followed by cepstral analysis.
Further, the mel frequency is a nonlinear frequency scale based on the human ear's perception of equal pitch intervals. Its relationship with frequency in hertz is the standard mel-hertz relation:
Mel(f) = 2595 · log10(1 + f / 700)
So if the spacing is uniform on the mel scale, the corresponding spacing in hertz grows larger and larger; hence the size variation of the mel-scale filter bank shown in Fig. 3.
The mel-scale filter bank has high resolution in the low-frequency part, which is consistent with the auditory characteristics of the human ear; this is the physical meaning of the mel scale.
The meaning of this step is: first apply the Fourier transform to the time-domain signal to move it into the frequency domain, then use the mel-scale filter bank to slice the frequency-domain signal, and finally obtain one value for each frequency band.
Cepstral analysis applies a Fourier transform to the time-domain signal, takes the logarithm, and then applies an inverse Fourier transform. It can be divided into complex cepstrum, real cepstrum, and power cepstrum; ours is the power cepstrum.
What cepstral analysis achieves is: by decomposing the signal, the convolution of two signals is converted into the addition of two signals. An example follows:
Assume the spectrum above is X(k) and the time-domain signal is x(n), so that:
X(k) = DFT(x(n))
Consider splitting the frequency-domain X(k) into a product of two parts:
X(k) = H(k) E(k)
Assume the time-domain signals corresponding to the two parts are h(n) and e(n) respectively; then:
x(n) = h(n) * e(n)
At this point h(n) and e(n) cannot be separated from each other.
Take the logarithm of both sides in the frequency domain:
log(X(k)) = log(H(k)) + log(E(k))
Then apply the inverse Fourier transform:
IDFT(log(X(k))) = IDFT(log(H(k))) + IDFT(log(E(k)))
Denote the time-domain signal obtained at this point as:
x'(n) = h'(n) + e'(n)
Although the time-domain signal x'(n) obtained here, the cepstrum, differs from the original time-domain signal x(n), the convolution relation in the time domain has been converted into a linear additive relation.
The frequency-domain signal in the figure above can be split into a product of two parts: the envelope of the spectrum and the details of the spectrum. The peaks of the spectrum are the formants; they determine the envelope of the signal in the frequency domain and are the important information for distinguishing sounds, so the purpose of cepstral analysis is precisely to obtain the envelope information of the spectrum. The envelope part corresponds to the low-frequency information of the spectrum, and the detail part corresponds to the high-frequency information. Cepstral analysis converts the convolution relation of the two corresponding time-domain signals into a linear additive relation, so passing the cepstrum through a low-pass filter is enough to obtain the time-domain signal h'(n) corresponding to the envelope part.
The intonation comparison based on the dynamic time warping (DTW) algorithm is a method of measuring the similarity between two time series; it is mainly used in speech recognition to decide whether two utterances represent the same word.
The lengths of the two time series whose similarity is to be compared may be unequal; in speech recognition this appears as different people speaking at different speeds. Moreover, the articulation speed of different phonemes within the same word also differs: someone may drag the sound "A" very long, or pronounce "i" very short. In addition, two time series may differ only by a shift on the time axis, i.e. the series coincide once the shift is removed. In these complex situations, the distance or similarity between two time series cannot be computed effectively with the traditional Euclidean distance.
DTW computes the similarity between two time series by stretching and shortening them, as in Fig. 5.
Let the two time series whose similarity is to be computed be X and Y, with lengths |X| and |Y| respectively.
The warping path has the form W = w1, w2, ..., wK, where max(|X|, |Y|) ≤ K ≤ |X| + |Y|.
Each wk has the form (i, j), where i is an index into X and j is an index into Y.
The warping path W must start at w1 = (1, 1) and end at wK = (|X|, |Y|), which guarantees that every index of X and Y appears in W.
In addition, the i and j of wk = (i, j) must increase monotonically along W, which guarantees that the dotted lines in Fig. 5 do not cross; monotonically increasing here means:
wk = (i, j), wk+1 = (i', j')
i ≤ i' ≤ i + 1, j ≤ j' ≤ j + 1
The warping path we seek is the one with the shortest distance:
D(i, j) = Dist(i, j) + min[D(i-1, j), D(i, j-1), D(i-1, j-1)]
The final warping-path distance is D(|X|, |Y|).
It is solved with dynamic programming as in Fig. 6, which shows the cost matrix D; D(i, j) is the warping-path distance between the two prefix series of lengths i and j.
Extracting frequency-domain features with the Fourier transform allows different signals to be separated by frequency after the transform. The core idea of Fourier analysis is that any wave can be represented as a superposition of multiple sine waves; "wave" here covers everything from sound to light. Therefore, applying a Fourier series to a recorded sound separates the signals of its several frequencies. The Fourier transform is a method of analyzing a signal: it can analyze the components of a signal, and it can also synthesize a signal from those components. It shows that a function meeting certain conditions can be represented as a linear combination (or integral) of trigonometric functions (sines and/or cosines). Many waveforms could serve as the components of a signal, such as sine waves, square waves, and sawtooth waves; the Fourier transform uses sine waves as the components.
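The claim that the Fourier transform separates the frequency components of a sound can be demonstrated with a toy signal; the 50 Hz and 120 Hz components and the peak threshold are arbitrary choices for the example:

```python
import numpy as np

fs = 1000                      # sampling rate in Hz
t = np.arange(fs) / fs         # one second of signal

# A "recorded sound" made of two sine components, 50 Hz and 120 Hz.
signal = np.sin(2 * np.pi * 50 * t) + 0.5 * np.sin(2 * np.pi * 120 * t)

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / fs)

# The two components appear as two separate peaks on the frequency axis.
peaks = freqs[spectrum > 100]
print(peaks)  # [ 50. 120.]
```

Both frequencies fall on exact DFT bins here, so the peaks are sharp; with non-integer frequencies a window function would be applied first to limit spectral leakage.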
The frequency domain is encountered above all in radio-frequency and communication systems, and also in high-speed digital applications. Its most important property is that it is not real: it is a mathematical construct. The time domain is the only domain with objective existence; the frequency domain is a mathematical framework that follows specific rules.
The sine wave is the only waveform that exists in the frequency domain; this is the most important rule of the frequency domain. That is, the frequency domain is described in terms of sine waves, because any waveform in the time domain can be synthesized from sine waves. This is a very important property of sine waves, though not exclusive to them; many other families of waveforms have it as well. Still, using sine waves as the functional form in the frequency domain has its special advantages: some problems related to the electrical effects of interconnects become much clearer and easier to solve with sine waves, and transforming to the frequency domain and describing a problem with sine waves sometimes yields an answer faster than working in the time domain alone.
In practice, if one builds a circuit containing resistance, inductance, and capacitance and feeds it an arbitrary waveform, the output is usually a waveform resembling a sine wave; moreover, such waveforms are easily described as combinations of a few sine waves.
The rhythm comparison based on linear hash scaling compares two audio segments from the rhythm angle and must account for differences in the user's humming tempo. A linear scaling algorithm can quickly compute the linear distance between two sequences of different lengths. The main reason for using linear scaling in humming rhythm scoring is that the user's humming tempo is inconsistent with the tempo of the original performance; through linear scaling, the hummed segment can be stretched or compressed to keep its tempo consistent with the original. The key of the algorithm is to stretch the fundamental-frequency sequence extracted from the hummed segment to different degrees with uniform scaling factors and then compute the rhythm comparison against the corresponding original.
The linear scaling algorithm solves the mismatch between the humming tempo and the original tempo, but the premise of its reliability is that the humming tempo is exactly proportional to the original tempo, i.e. there is no occasional speeding up or slowing down; if the humming tempo varies, the linear scaling algorithm goes wrong.
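Linear scaling can be sketched as follows. This is an illustration under assumptions, not the patent's exact algorithm: the set of candidate scaling factors and the mean-absolute-difference comparison are hypothetical choices made for the example:

```python
import numpy as np

def linear_scale(sequence, factor):
    """Stretch or compress a fundamental-frequency sequence by a uniform
    factor using linear interpolation (the linear-scaling step)."""
    sequence = np.asarray(sequence, dtype=float)
    n_out = max(2, int(round(len(sequence) * factor)))
    old_x = np.linspace(0.0, 1.0, len(sequence))
    new_x = np.linspace(0.0, 1.0, n_out)
    return np.interp(new_x, old_x, sequence)

def best_match_distance(hummed, reference, factors=(0.5, 0.75, 1.0, 1.5, 2.0)):
    """Try several uniform scaling factors and keep the closest alignment
    with the reference, resampling to the reference length for comparison."""
    best = np.inf
    for f in factors:
        scaled = linear_scale(hummed, f)
        common = linear_scale(scaled, len(reference) / len(scaled))
        best = min(best, float(np.abs(common - reference).mean()))
    return best

reference = np.sin(np.linspace(0, 2 * np.pi, 100))  # original-song contour
hummed = np.sin(np.linspace(0, 2 * np.pi, 50))      # same tune, twice as fast
print(best_match_distance(hummed, reference) < 0.01)  # True
```

Because the hummed contour is a uniformly sped-up copy of the reference, one of the candidate factors aligns it almost exactly; a performance that speeds up and slows down mid-phrase would defeat this, which is precisely the limitation stated above.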
Of course, the above description is not a limitation of the present invention, nor is the invention limited to the examples above; variations, modifications, additions, or substitutions made by those skilled in the art within the essential scope of the present invention also belong to the protection scope of the invention.
Claims (4)
- 1. A sight-singing audio intelligent scoring modeling method applied to sing-along education, characterized in that data acquisition and preprocessing comprise the following steps: Step 1: the system divides the pre-collected solfège data, which contain expert scores, in a 2:1 ratio, two parts serving as training data and one part as test data, and performs modeling on the training data; Step 2: the audio data is denoised, silent segments without audio are cut out, and speech-enhancement data preprocessing is performed; Step 3: audio features are extracted from the audio data with the mel-frequency cepstral coefficient method, and pitch information is extracted; Step 4: frequency-domain features are extracted from the audio data with the short-time Fourier transform, and the beat information they contain is extracted to form rhythm-based features; Step 5: intonation and rhythm features are extracted from the standard audio according to steps 2 to 4; Step 6: the intonation features obtained with the mel cepstral coefficient method are compared between the standard audio and the solfège audio using a dynamic time warping algorithm; Step 7: the rhythm features obtained with the short-time Fourier method are compared between the standard audio and the solfège audio using a linear hash scaling algorithm; Step 8: the matching vectors of pitch and rhythm serve as training data for training a neural network, and when the error rate on the test data set is below 1% the verification process ends; Step 9: through the client interface of a WeChat mini-program, the user uploads sight-singing audio from individual practice; the uploaded audio goes through steps 2 to 4 and the processing of steps 6 and 7 and is then fed into the trained neural-network model, which outputs the corresponding rhythm and intonation scores, and the scoring results output by the neural network are sent to the interface of the WeChat mini-program and displayed on the client; Step 10: the corresponding intonation vector and rhythm vector are returned to the user client interface.
- 2. The sight-singing audio intelligent scoring modeling method applied to sing-along education according to claim 1, characterized in that, based on step 6, what is mainly compared is the matching degree of the pitch rise and fall between the piano pitch in the standard audio and the sight-singing audio; a linear pitch calibration method is used, in which the pitches of the voice and the piano are first linearly scaled so that their mean energies are identical, and on that basis the pitch-change matching vectors of the audio sequences are compared.
- 3. The sight-singing audio intelligent scoring modeling method applied to sing-along education according to claim 1, characterized in that, based on step 7, what is mainly compared is the matching degree of the tempo variation between the piano rhythm in the standard audio and the sight-singing audio; a linear rhythm calibration method is used, in which the rhythm of the voice is linearly scaled so that its tempo change rate is identical to that of the piano, and on that basis the tempo-variation matching vectors of the audio sequences are compared.
- 4. The sight-singing audio intelligent scoring modeling method applied to sing-along education according to claim 1, characterized in that, based on step 10, the interface parses the result, marks the positions in the score corresponding to the sight-sung song, and annotates in red the positions where the user's matching degree is poor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910480919.8A CN110265051A (en) | 2019-06-04 | 2019-06-04 | Sight-singing audio intelligent scoring modeling method applied to sing-along education |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910480919.8A CN110265051A (en) | 2019-06-04 | 2019-06-04 | Sight-singing audio intelligent scoring modeling method applied to sing-along education |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110265051A true CN110265051A (en) | 2019-09-20 |
Family
ID=67916665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910480919.8A Pending CN110265051A (en) | 2019-06-04 | 2019-06-04 | Sight-singing audio intelligent scoring modeling method applied to sing-along education |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110265051A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111508526A (en) * | 2020-04-10 | 2020-08-07 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for detecting audio beat information and storage medium |
CN111653152A (en) * | 2020-05-18 | 2020-09-11 | 河南财政金融学院 | Using method of music education and exercise system |
CN113657184A (en) * | 2021-07-26 | 2021-11-16 | 广东科学技术职业学院 | Evaluation method and device for piano playing fingering |
CN113744721A (en) * | 2021-09-07 | 2021-12-03 | 腾讯音乐娱乐科技(深圳)有限公司 | Model training method, audio processing method, device and readable storage medium |
CN114093386A (en) * | 2021-11-10 | 2022-02-25 | 厦门大学 | Education-oriented multi-dimensional singing evaluation method |
CN115796653A (en) * | 2022-11-16 | 2023-03-14 | 中南大学 | Interview speech evaluation method and system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1737796A (en) * | 2005-09-08 | 2006-02-22 | Shanghai Jiao Tong University | Cross-type rapid matching method for digital music rhythm |
CN103514866A (en) * | 2012-06-28 | 2014-01-15 | Zeng Pingwei | Method and device for grading instrumental performance |
CN104143340A (en) * | 2014-07-28 | 2014-11-12 | Tencent Technology (Shenzhen) Co., Ltd. | Audio evaluation method and device |
CN106250400A (en) * | 2016-07-19 | 2016-12-21 | Tencent Technology (Shenzhen) Co., Ltd. | Audio data processing method, device and system |
CN106445964A (en) * | 2015-08-11 | 2017-02-22 | Tencent Technology (Shenzhen) Co., Ltd. | Audio information processing method and apparatus |
CN107767847A (en) * | 2017-09-29 | 2018-03-06 | Xiaoyezi (Beijing) Technology Co., Ltd. | Intelligent piano performance assessment method and system |
CN107967827A (en) * | 2017-12-29 | 2018-04-27 | Chongqing Normal University | Music education exercise system and method |
CN109461431A (en) * | 2018-12-24 | 2019-03-12 | Xiamen University | Sight-singing error score annotation method for music education |
CN109584904A (en) * | 2018-12-24 | 2019-04-05 | Xiamen University | Sight-singing audio solfège syllable recognition modeling method for music education |
History
- 2019-06-04: Application CN201910480919.8A filed in China; published as CN110265051A; legal status: Pending
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111508526A (en) * | 2020-04-10 | 2020-08-07 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Method and device for detecting audio beat information and storage medium |
CN111508526B (en) * | 2020-04-10 | 2022-07-01 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Method and device for detecting audio beat information and storage medium |
CN111653152A (en) * | 2020-05-18 | 2020-09-11 | Henan Finance University | Method for using a music education and training system |
CN113657184A (en) * | 2021-07-26 | 2021-11-16 | Guangdong Polytechnic of Science and Technology | Evaluation method and device for piano playing fingering |
CN113657184B (en) * | 2021-07-26 | 2023-11-07 | Guangdong Polytechnic of Science and Technology | Piano playing fingering evaluation method and device |
CN113744721A (en) * | 2021-09-07 | 2021-12-03 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Model training method, audio processing method, device and readable storage medium |
CN113744721B (en) * | 2021-09-07 | 2024-05-14 | Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. | Model training method, audio processing method, device and readable storage medium |
CN114093386A (en) * | 2021-11-10 | 2022-02-25 | Xiamen University | Education-oriented multi-dimensional singing evaluation method |
CN115796653A (en) * | 2022-11-16 | 2023-03-14 | Central South University | Interview speech evaluation method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110265051A (en) | Sight-singing audio intelligent scoring modeling method for music education | |
Dhingra et al. | Isolated speech recognition using MFCC and DTW | |
Tiwari | MFCC and its applications in speaker recognition | |
Patel et al. | Speech recognition and verification using MFCC & VQ | |
JP2020524308A (en) | Method, apparatus, computer device, program and storage medium for constructing voiceprint model | |
WO2017088364A1 (en) | Speech recognition method and device for dynamically selecting speech model | |
Prasomphan | Improvement of speech emotion recognition with neural network classifier by using speech spectrogram | |
Jancovic et al. | Bird species recognition using unsupervised modeling of individual vocalization elements | |
CN103996155A (en) | Intelligent interaction and psychological comfort robot service system | |
Li et al. | Speech emotion recognition using 1d cnn with no attention | |
Mansour et al. | Voice recognition using dynamic time warping and mel-frequency cepstral coefficients algorithms | |
CN102521281A (en) | Humming computer music searching method based on longest matching subsequence algorithm | |
WO2020248388A1 (en) | Method and device for training singing voice synthesis model, computer apparatus, and storage medium | |
Sefara | The effects of normalisation methods on speech emotion recognition | |
CN102411932B (en) | Methods for extracting and modeling Chinese speech emotion in combination with glottis excitation and sound channel modulation information | |
CN101178897A (en) | Speaking man recognizing method using base frequency envelope to eliminate emotion voice | |
CN103456302B (en) | A kind of emotional speaker recognition method based on the synthesis of emotion GMM Model Weight | |
CN107767881B (en) | Method and device for acquiring satisfaction degree of voice information | |
CN112002348B (en) | Method and system for recognizing speech anger emotion of patient | |
Tyagi et al. | Automatic identification of bird calls using spectral ensemble average voice prints | |
Wang | Speech recognition of oral English teaching based on deep belief network | |
CN109065073A (en) | Speech emotion recognition method based on deep SVM network model |
Piotrowska et al. | Machine learning-based analysis of English lateral allophones | |
CN109452932A (en) | Voice-based constitution identification method and apparatus |
Chien et al. | Evaluation of glottal inverse filtering algorithms using a physiologically based articulatory speech synthesizer |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 2019-09-20