CN118314919A - Voice repair method, device, audio equipment and storage medium

Info

Publication number
CN118314919A
Authority
CN
China
Prior art keywords
voice
model
frequency
amplitude
frequency band
Prior art date
Legal status
Pending
Application number
CN202410733009.7A
Other languages
Chinese (zh)
Inventor
刘兵兵
袁斌
姜龙
侯天峰
蒋超
李晶晶
毛婷婷
Current Assignee
Goertek Inc
Original Assignee
Goertek Inc
Application filed by Goertek Inc
Priority to CN202410733009.7A
Publication of CN118314919A

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses a voice repair method, device, audio equipment and storage medium, relating to the technical field of voice noise reduction. The disclosed voice repair method comprises the following steps: for a noise-reduced voice signal frame, determining a voice frequency band in the voice signal frame; determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model, wherein each voice model is a frequency spectrum corresponding to one phoneme, constructed from the voice sample data set corresponding to each of a plurality of phonemes; and determining, among the voice models, a target voice model with the highest voice similarity to the voice signal frame, and repairing the voice signal frame according to the target voice model. The application can repair voice signals that are damaged or even partially lost after noise reduction, improving communication quality.

Description

Voice repair method, device, audio equipment and storage medium
Technical Field
The present application relates to the field of speech noise reduction technologies, and in particular, to a speech repair method, apparatus, audio device, and storage medium.
Background
With the development of technology, people's requirements for communication quality keep rising. Noise interference is a common problem in the field of voice communication, as it degrades the clarity and intelligibility of speech. To solve this problem, ENC (Environmental Noise Cancellation) noise reduction techniques have been developed.
ENC noise reduction is a technique that uses advanced algorithms to eliminate background noise. It analyzes the input voice signal to identify its noise component and removes that component from the original signal, thereby obtaining a cleaner voice signal.
However, in complex and changeable real application scenarios, such as under various types of non-stationary noise or the limited computing power and memory of hardware devices, current noise reduction methods may damage the original voice signal while removing noise, or even cause information loss, thereby degrading communication quality.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main purpose of the application is to provide a voice repair method, device, audio equipment and storage medium, aiming to solve the technical problem that a voice signal is damaged, or even loses information, after noise reduction.
In order to achieve the above object, the present application provides a voice repair method, which includes:
for a noise-reduced voice signal frame, determining a voice frequency band in the voice signal frame;
determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model, wherein each voice model is a frequency spectrum corresponding to one phoneme, constructed from the voice sample data set corresponding to each of a plurality of phonemes;
and determining, among the voice models, a target voice model with the highest voice similarity to the voice signal frame, and repairing the voice signal frame according to the target voice model.
In one embodiment, the step of repairing the speech signal frame according to the target speech model comprises:
determining a lossy frequency point outside the voice frequency band in the voice signal frame, and acquiring a frequency point amplitude of the voice signal frame at the lossy frequency point;
calculating prior amplitude average values of a plurality of prior voice signal frames in front of the voice signal frames at the lossy frequency points;
obtaining a model amplitude of the target voice model at the lossy frequency point;
Multiplying the model amplitude value by the prior amplitude value average value to obtain a reference amplitude value;
And carrying out weighted summation on the frequency point amplitude and the reference amplitude to obtain a new amplitude of the voice signal frame at a lossy frequency point so as to repair the lossy frequency point, wherein the sum of the weight of the frequency point amplitude and the weight of the reference amplitude is one.
In an embodiment, the step of performing weighted summation on the frequency point amplitude and the reference amplitude to obtain a new amplitude of the speech signal frame at the lossy frequency point includes:
Calculating the absolute value of the difference between the frequency point amplitude and the model amplitude;
and carrying out weighted summation on the reference amplitude and the frequency point amplitude based on the difference absolute value to obtain a new amplitude of the voice signal frame at the lossy frequency point, wherein the weight of the reference amplitude and the difference absolute value are positively correlated.
In one embodiment, the step of determining the speech frequency band in the speech signal frame comprises:
acquiring a plurality of formants in the voice signal frame;
determining two target formants with the distance between frequency points smaller than a preset threshold value from the formants as a target formant group;
for each target formant group, determining a first frequency band between two target formants in the target formant group;
and taking the first frequency band and the frequency bands within preset ranges at two sides of the first frequency band as a voice frequency band.
In an embodiment, after the step of determining the speech frequency band in the speech signal frame, the method further comprises:
according to the frequency interval of the voice frequency band, model signals in the frequency interval in each voice model are intercepted respectively;
Calculating interval similarity between the voice frequency band and each model signal;
And taking the interval similarity as the frequency band similarity between the voice frequency band and each voice model.
In one embodiment, when there are a plurality of the voice frequency bands, the step of determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model includes:
For each voice model, calculating a weighted average value of the frequency band similarity between the voice model and each voice frequency band, wherein the lower the frequency is, the larger the corresponding weight of the voice frequency band is;
the weighted average is taken as the voice similarity between the voice model and the voice signal frame.
In an embodiment, before the step of determining the speech frequency band in the speech signal frame for the noise-reduced speech signal frame, the method further includes:
for each phoneme in the plurality of phonemes, acquiring a respective amplitude-frequency curve of each voice sample data in the voice sample data set corresponding to the phoneme;
calculating the amplitude mean value of each amplitude-frequency curve at each frequency point;
normalizing each amplitude mean value, and taking each normalized amplitude mean value as the amplitude value at each frequency point in the voice model to construct the voice model corresponding to the phonemes.
In addition, in order to achieve the above object, the present application also proposes a speech restoration apparatus including:
the voice frequency band determining module is used for determining a plurality of voice frequency bands in each voice signal frame of the voice signal after noise reduction;
The similarity determining module is used for determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model, wherein each voice model is a frequency spectrum corresponding to a phoneme constructed by a voice sample data set corresponding to each phoneme;
And the voice repair module is used for determining a target voice model with highest voice similarity with the voice signal frame in each voice model and repairing the voice signal frame according to the target voice model.
In addition, to achieve the above object, the present application also proposes an audio apparatus, the apparatus comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the speech repair method as described above.
In addition, in order to achieve the above object, the present application also proposes a storage medium, which is a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, implements the steps of the speech restoration method as described above.
The application provides a voice repair method. First, a voice model corresponding to each phoneme is constructed according to the phoneme characteristics of human speech. Then, a plurality of voice frequency bands are detected in each frame of the noise-reduced voice signal, the voice similarity between the voice signal frame and each voice model is determined according to the similarity between each voice frequency band and each voice model, and the target voice model with the highest similarity, i.e. the voice model most similar to the voice signal frame, is determined among the voice models. The voice signal frame is then repaired according to the target voice model, until every frame has been repaired, thereby repairing the noise-reduced voice signal.
In summary, the application matches each voice signal frame of the voice signal against a plurality of voice models divided by phoneme, pairing the frame with the single most similar target voice model; that is, the frame is considered most similar to the phoneme of the target voice model. Repairing the frame according to the target voice model can then restore the phoneme of the frame. By matching and repairing the whole voice signal in this way, voice signals that are damaged or even partially lost after noise reduction can be restored, the information loss or even data loss that noise reduction processing may introduce is suppressed, and communication quality is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings which are used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to a person skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic flow chart of a speech repair method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a formant acquisition flow provided in a second embodiment of a speech repair method according to the present application;
FIG. 3 is a schematic flow chart of a third embodiment of a speech repair method according to the present application;
FIG. 4 is a schematic block diagram of a speech repairing apparatus according to an embodiment of the present application;
Fig. 5 is a schematic device structure diagram of a hardware operating environment related to a speech repair method in an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the technical solution of the present application and are not intended to limit the present application.
For a better understanding of the technical solution of the present application, the following detailed description will be given with reference to the drawings and the specific embodiments.
The main solution of the embodiments of the present application is: for a noise-reduced voice signal frame, determining a voice frequency band in the voice signal frame; determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model, wherein each voice model is a frequency spectrum corresponding to one phoneme, constructed from the voice sample data set corresponding to each of a plurality of phonemes; and determining, among the voice models, a target voice model with the highest voice similarity to the voice signal frame, and repairing the voice signal frame according to the target voice model.
In this embodiment, for convenience of description, the following description will be made with the audio device as the execution subject.
With the development of technology, people's requirements for communication quality keep rising. Noise interference is a common problem in the field of voice communication, as it degrades the clarity and intelligibility of speech. To solve this problem, ENC (Environmental Noise Cancellation) noise reduction techniques have been developed.
ENC noise reduction is a technique that uses advanced algorithms to eliminate background noise. It analyzes the input voice signal to identify its noise component and removes that component from the original signal, thereby obtaining a cleaner voice signal.
However, in complex and changeable real application scenarios, such as under various types of non-stationary noise or the limited computing power and memory of hardware devices, current noise reduction methods may damage the original voice signal while removing noise, or even cause information loss, thereby degrading communication quality.
In view of the above problems, the present application provides a voice repair method. First, a voice model corresponding to each phoneme is constructed according to the phoneme characteristics of human speech. Then, a plurality of voice frequency bands are detected in each frame of the noise-reduced voice signal, the voice similarity between the voice signal frame and each voice model is determined according to the similarity between each voice frequency band and each voice model, and the target voice model with the highest similarity, i.e. the voice model most similar to the voice signal frame, is determined among the voice models. The voice signal frame is then repaired according to the target voice model, until every frame has been repaired, thereby repairing the noise-reduced voice signal.
In summary, the application matches each voice signal frame of the voice signal against a plurality of voice models divided by phoneme, pairing the frame with the single most similar target voice model; that is, the frame is considered most similar to the phoneme of the target voice model. Repairing the frame according to the target voice model can then restore the phoneme of the frame. By matching and repairing the whole voice signal in this way, voice signals that are damaged or even partially lost after noise reduction can be restored, the information loss or even data loss that noise reduction processing may introduce is suppressed, and communication quality is improved.
Based on this, an embodiment of the present application provides a speech repair method, referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of the speech repair method of the present application.
In this embodiment, the voice repair method includes steps S10 to S30:
Step S10, for a noise-reduced voice signal frame, determining a voice frequency band in the voice signal frame;
it should be noted that the voice frequency band is a frequency band containing human voice. It can be understood that the portions identified as voice frequency bands suffered no loss, or only minor loss, during noise reduction, and may be left unrepaired.
In this embodiment, in order to match the phoneme corresponding to a voice signal frame, the voice frequency bands that may exist in the frame, i.e. the undamaged voice signal segments, are first identified. There are various ways to obtain the voice frequency bands in a frame of signal; for example, they may be screened out through adaptive filtering, which is not limited herein.
In addition, before the voice signal is repaired, it is subjected to noise reduction. During noise reduction processing, the voice signal is divided into a plurality of voice signal frames to facilitate subsequent processing. In order to reduce the spectral leakage caused by window truncation and to improve the frequency resolution of the Fourier transform, each voice signal frame is windowed, i.e. the signal in the frame is multiplied by a window function, to obtain a windowed voice signal frame.
Preferably, a Hamming window can be selected as the window function. The Hamming window reduces the spectral leakage caused by the window edges and lowers the side-lobe height, giving good frequency resolution capability; it balances window smoothness against frequency resolution, at a low computational cost.
After windowing, the voice signal frame is transformed into the frequency domain. The time-domain to frequency-domain transformation is done simply and directly with a Fourier transform, of which several variants are available; in a preferred scheme, the short-time fast Fourier transform is selected to process the voice signal, making the subsequent processing more intuitive.
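As a minimal sketch of the framing, windowing and transform steps just described (Python is used for illustration; frame_len and hop are assumed values, not parameters fixed by the patent):

```python
import numpy as np

def frames_to_spectra(signal, frame_len=512, hop=256):
    """Split a signal into frames, apply a Hamming window, and FFT each frame.

    A sketch of the pre-processing described above; frame_len and hop are
    illustrative choices, not values fixed by the patent.
    """
    window = np.hamming(frame_len)                # Hamming window: low side lobes
    n_frames = 1 + (len(signal) - frame_len) // hop
    spectra = []
    for t in range(n_frames):
        frame = signal[t * hop : t * hop + frame_len]
        frame = frame * window                    # windowing = pointwise multiplication
        spectra.append(np.fft.rfft(frame))        # one-sided spectrum: frame_len/2 + 1 bins
    return np.array(spectra)
```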
After each voice signal frame in the frequency domain is obtained, noise reduction can be performed on each frame. A traditional noise reduction approach may be used: for example, estimating the power spectral density of the background noise and subtracting the estimated noise power spectrum from the power spectrum containing the voice signal, or applying other filtering methods based on mathematical and physical principles. Such methods are computationally light and can run in real time, but they inevitably rest on idealized prior assumptions drawn from human cognition, so traditional noise reduction does not perform well on the various non-stationary noises that frequently occur in real scenes.
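The spectral-subtraction idea mentioned above can be sketched as follows; the noise_psd input and the spectral floor are illustrative assumptions, since the text only names the general technique:

```python
import numpy as np

def spectral_subtraction(frame_spectrum, noise_psd, floor=0.05):
    """Classic spectral subtraction as outlined above: subtract an estimated
    noise power spectrum from the noisy power spectrum. noise_psd would be
    estimated from noise-only frames; `floor` is an illustrative spectral
    floor that avoids negative power, not a value from the patent."""
    power = np.abs(frame_spectrum) ** 2
    clean_power = np.maximum(power - noise_psd, floor * power)
    # keep the noisy phase, rescale the magnitude
    return np.sqrt(clean_power) * np.exp(1j * np.angle(frame_spectrum))
```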
Preferably, AI (Artificial Intelligence) noise reduction can be adopted, taking a large amount of audio data collected in actual user scenarios as the data source and building an appropriate data model as the basis for adaptive noise reduction. Although AI noise reduction is computationally heavier, its noise reduction effect is superior to that of traditional approaches.
In a possible implementation manner, the method further comprises steps A10-A30 before the step S10:
Step A10, for each phoneme in the plurality of phonemes, acquiring a respective amplitude-frequency curve of each voice sample data in the voice sample data set corresponding to the phoneme;
Note that phonemes are the smallest units from which a person composes a sound when speaking. For example, in Mandarin Chinese, each of the 23 initials and 24 finals is a phoneme; other languages have analogous phonemes, and the number and types of phonemes are not further limited here.
In this embodiment, in order to repair the voice signal of a person speaking, the human voice is divided by phoneme, which improves the accuracy and precision of the repair. To restore lost voice signal, a voice model corresponding to each phoneme is first constructed as the restoration reference. When constructing the voice models, a plurality of different samples are selected for each phoneme, and the model for that phoneme is then built from those samples. To facilitate similarity detection against the voice signal, each voice model is a frequency spectrum.
Preferably, the samples should cover speakers of different genders and ages. Although different people have different timbres, their amplitude curves are similar when producing the same sound, so selecting a rich set of samples does not cause recognition errors; rather, it improves the applicability of the model and thus its precision.
As an example, since human voice energy is concentrated between 300 Hz and 4000 Hz, the sound signals of multiple speakers pronouncing each initial and each final can be collected as samples in a quiet environment over the 300 Hz to 4000 Hz band, and the voice model of each phoneme can then be obtained from these samples.
In this embodiment, for each phoneme, each voice sample is first converted from a time-domain signal into a frequency-domain signal, thereby obtaining the amplitude-frequency curve of each voice sample.
Specifically, the human voice samples may be fourier transformed to convert the signal from the time domain to the frequency domain, respectively.
Step A20, calculating the amplitude mean value of each amplitude-frequency curve at each frequency point;
in this embodiment, the amplitude-frequency curve may be divided into a plurality of discrete amplitudes according to the frequency points, and the average value of the amplitudes at each frequency point may be calculated according to the amplitudes at each frequency point in each amplitude-frequency curve.
And step A30, normalizing each amplitude mean value, and taking each normalized amplitude mean value as the amplitude of each frequency point in the voice model to construct the voice model corresponding to the phonemes.
In this embodiment, after the amplitude mean at each frequency point is calculated, the means are further normalized, so that the model represents only the amplitude-frequency characteristics of the phoneme; after normalization, the voice model corresponding to the phoneme is obtained.
As an example, where the sample set includes N different voice samples and there are multiple different phonemes, the construction formula of the voice model for one phoneme is as follows:

\[ \mathrm{model\_magn}_k = \frac{1}{N}\sum_{i=1}^{N} \mathrm{magn}_{i,k}, \qquad k = 0, 1, \dots, \frac{\mathrm{frame\_len}}{2} \]

wherein k is the frequency point index and frame_len is the frame length (so the number of frequency point indexes is one half of the frame length plus one); magn_{i,k} is the amplitude of the ith sample at frequency point k; N is the number of samples; and model_magn_k is the amplitude of the voice model at frequency point k, i.e. the amplitude mean of the samples at that frequency point, which is then normalized as in step A30. Combining the amplitudes at all frequency points yields the voice model of the phoneme.
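A minimal sketch of this model construction (steps A10 to A30), assuming each sample is a time-domain recording at least one frame long and normalizing by the peak amplitude (the patent does not fix the normalization):

```python
import numpy as np

def build_phoneme_model(samples, frame_len=512):
    """Build the spectral model of one phoneme per steps A10-A30: average the
    per-sample magnitude spectra bin by bin, then normalize. `samples` is a
    list of time-domain recordings of the same phoneme (illustrative layout)."""
    window = np.hamming(frame_len)
    mags = [np.abs(np.fft.rfft(s[:frame_len] * window)) for s in samples]
    model = np.mean(mags, axis=0)        # amplitude mean at each frequency bin
    return model / np.max(model)         # normalize so the model keeps only shape
```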
In the embodiment, the applicability and the robustness of the model can be improved by collecting different human voice samples to construct the model, and the voice model is constructed by only adopting the amplitude mean value, so that the required storage space is small, the calculated amount is small, and the practicability of the algorithm can be improved.
In addition, in another possible embodiment, after the signal of each voice sample is preprocessed, feature extraction is performed on it to obtain a feature sequence corresponding to the phoneme. During similarity matching, the feature sequence of the voice frequency band is likewise computed, and similarity is judged from the similarity between the feature sequences.
It will be appreciated that the first modeling implementation above requires less processing and has lower complexity than the second, while the second builds a more accurate model, because it highlights the phoneme features better than the first.
The above are merely two possible implementations for constructing a speech model provided in this embodiment, and the embodiment is not particularly limited to a specific implementation for constructing a speech model.
Step S20, determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model, wherein each voice model is a frequency spectrum corresponding to one phoneme, constructed from the voice sample data set corresponding to each of a plurality of phonemes;
Note that each voice model employed in this embodiment is a spectrum corresponding to one phoneme, constructed from the voice sample data set corresponding to each of a plurality of phonemes; the voice models are constructed in the manner described in the model construction steps above.
In this embodiment, the frequency band similarity between the voice frequency band and each voice model needs to be calculated first, and then the frequency band similarities calculated by the same voice model are integrated, so that the voice similarity between the whole voice signal frame and each voice model can be obtained.
And step S30, determining a target voice model with highest voice similarity with the voice signal frame in each voice model, and repairing the voice signal frame according to the target voice model.
In this embodiment, after the voice similarity between the voice signal frame and each voice model is calculated, the voice model with the highest voice similarity is selected as the target voice model; that is, the signal characteristics of the voice signal frame are most similar to those of the target voice model, and the frame's phoneme is taken to match that of the target model. The target voice model can then be used to repair the voice signal frame. Various repair methods exist; for example, the target voice model may be mapped onto the voice signal frame according to the amplitude ratio between them, repairing the whole frame, which is not limited herein.
This embodiment can be applied to voice repair after conventional noise reduction and also after AI (Artificial Intelligence) noise reduction. Because AI noise reduction is more effective, this embodiment performs even better with it: after noise reduction, noise is suppressed to a great extent and the voice spectrum information is more prominent, so formant information can be obtained more accurately. Damaged voice bands, lost voice bands, and voice bands severely polluted by noise can thus be repaired effectively, improving voice fidelity and the overall noise reduction effect. Moreover, the computational complexity is low, making real-time processing easy.
According to the application, each voice signal frame of the voice signal is matched against a plurality of voice models divided by phoneme, pairing the frame with the single most similar target voice model; that is, the frame is considered most similar to the phoneme of the target voice model. Repairing the frame according to the target voice model can then restore the phoneme of the frame. By matching and repairing the whole voice signal in this way, voice signals that are damaged or even partially lost after noise reduction can be restored, the information loss or even data loss that noise reduction processing may introduce is suppressed, and communication quality is improved.
In the second embodiment of the present application, the same or similar content as in the first embodiment of the present application may be referred to the above description, and will not be repeated. On this basis, in step S20, the step of determining a plurality of voice frequency bands in the voice signal frame includes steps B10 to B40:
step B10, obtaining a plurality of formants in the voice signal frame;
Formants are regions of the sound spectrum where energy is relatively concentrated; they reflect the physical characteristics of the vocal tract (resonant cavity) and are a determining factor of sound quality.
In this embodiment, in order to detect a recognizable voice frequency band in a voice signal frame, the voice frequency band may be determined by using a formant as a detection reference, and therefore, it is necessary to detect a plurality of formants in the voice signal frame first. There are many methods of formant estimation, such as a band-pass filter bank method, a cepstrum method, a linear prediction method, and the like. And are not further limited herein.
Specifically, in one possible embodiment, step B10 may include the steps of:
step B11, preprocessing the voice signal frame to obtain a processed intermediate frame;
in this embodiment, the preprocessing of the voice signal frame can vary; most commonly, the frame signal is windowed to reduce edge effects.
Step B12, carrying out Fourier transform on the intermediate frame to obtain a frame signal in a frequency domain;
in this embodiment, after the frame signal is preprocessed to obtain the intermediate frame, fourier transformation is required to be performed on the intermediate frame, and the intermediate frame in the time domain is converted into the frequency domain, so as to obtain the frame signal in the frequency domain, so that the spectral characteristics of the signal can be observed and analyzed conveniently.
Step B13, after the amplitude of the frame signal is obtained through calculation, taking the logarithm of the amplitude to obtain a logarithmic amplitude spectrum of the frame signal;
in this embodiment, after obtaining the frame signal, the amplitude of the frame signal is calculated first, and then the logarithm is taken, so as to obtain the amplitude in the logarithmic domain, i.e. the logarithmic amplitude spectrum.
Step B14, performing inverse Fourier transform on the logarithmic magnitude spectrum to obtain a cepstrum sequence;
in this embodiment, after the log-amplitude spectrum is obtained, the log-amplitude spectrum is subjected to inverse fourier transform to obtain a cepstrum sequence.
Step B15, superposing the cepstrum sequence with a preset window function, and carrying out Fourier transformation to obtain an envelope curve of the frame signal;
In this embodiment, to facilitate formant detection, the cepstrum sequence is also windowed and then Fourier transformed to obtain the envelope of the frame signal. The window function is typically a rectangular window whose width is related to the quefrency resolution, i.e. to the sampling frequency and the FFT (Fast Fourier Transform) length.
And step B16, taking a plurality of extreme values of the envelope line as a plurality of formants of the voice signal frame.
In this embodiment, a plurality of corresponding formants are obtained by searching for a plurality of maxima in the envelope.
As an example, referring to fig. 2, the formant acquisition flow is as follows: the frame signal is preprocessed, FFT is performed, the logarithmic magnitude spectrum is taken, IFFT (Inverse Fast Fourier Transform) is performed, a window function is applied, and after an FFT yields the envelope, the extrema of the envelope are located, giving n formants.
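A minimal sketch of this cepstrum-based flow; the lifter width and the number of returned formants are illustrative assumptions, not patent values:

```python
import numpy as np

def cepstral_formants(frame, lifter_len=30, n_formants=4):
    """Cepstrum-based formant estimation following the flow of steps B11-B16.
    lifter_len (rectangular-window width in the cepstral domain) and
    n_formants are illustrative parameters."""
    log_mag = np.log(np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)              # IFFT of the log-magnitude spectrum
    cepstrum[lifter_len:-lifter_len] = 0.0        # rectangular lifter keeps the envelope part
    envelope = np.fft.rfft(cepstrum).real         # FFT back: smoothed spectral envelope
    # local maxima of the envelope are taken as formant candidates
    peaks = [k for k in range(1, len(envelope) - 1)
             if envelope[k - 1] < envelope[k] > envelope[k + 1]]
    return sorted(peaks, key=lambda k: envelope[k], reverse=True)[:n_formants]
```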
In this embodiment, because noise is well suppressed after noise reduction, the voice spectrum becomes clearer and fuller, so the formant estimation is closer to that of a high signal-to-noise-ratio scene. At the same time, the cepstrum method has low computational complexity and is easy to implement; it reduces the computation while guaranteeing the accuracy of the algorithm, improving its practicality.
In other feasible embodiments, a deep learning method can be used to obtain the formants, which is more robust to noise but, compared with the cepstrum method, consumes more time and resources and has higher complexity. A linear prediction method may also be used, which has the lowest computational complexity but may track fast-changing formants poorly compared with the cepstrum method.
The above are merely two possible implementations of the step B10 provided in this embodiment, and the specific implementation of the step B10 in this embodiment is not specifically limited.
Step B20, determining, among the formants, two target formants whose frequency-point distance is smaller than a preset threshold, as a target formant group;
Step B30, determining a first frequency band between two target formants in each target formant group;
And step B40, taking the first frequency band and the frequency bands within preset ranges at two sides of the first frequency band as a voice frequency band.
In this embodiment, the preset threshold may be a threshold set according to practical experience, which is not limited herein. The preset range may be a certain range set according to practical experience, and is not limited herein.
In this embodiment, after the formants are determined, the voice frequency bands of undamaged voice in the voice signal frame can be determined from them. First, a preset threshold is set empirically; if the frequency-point distance between two formants is smaller than the preset threshold, the frequency band between them is considered an undamaged voice frequency band. Moreover, since the peaks of a voice segment generally do not lie at its beginning and end, the band is extended beyond the two formants: a frequency band within a preset range on each side is selected and, together with the band between the two formants, forms the voice frequency band.
As an example, after a plurality of formants fmt_i are determined (i being the formant index within the same voice signal frame), a local voice sub-band is determined from the formants:

\[ \mathrm{spk\_band}_j = \left[ \mathrm{freq}_{\mathrm{start\_index}}, \dots, \mathrm{freq}_{\mathrm{end\_index}} \right] \]

wherein j is the sequence number of the voice frequency band within the same voice signal frame, freq_{start_index} is the amplitude at the start frequency index and freq_{end_index} is the amplitude at the end frequency index; the voice frequency band is the spectrum consisting of the amplitudes from the start frequency index through the end frequency index. The start and end frequency indexes are calculated as follows:

\[ \text{if } \mathrm{fmt}_n - \mathrm{fmt}_m < \delta: \qquad \mathrm{start\_index} = \mathrm{fmt}_m - \sigma, \qquad \mathrm{end\_index} = \mathrm{fmt}_n + \sigma \]

wherein start_index is the start frequency index, end_index is the end frequency index, fmt_m and fmt_n are the frequency indexes of the mth and nth formants (n greater than m), σ is the preset range and δ is the preset threshold. That is, if the frequency index of the nth formant minus that of the mth formant is smaller than the preset threshold, the start frequency index of the voice frequency band is the mth formant's frequency index minus the preset range, the end frequency index is the nth formant's frequency index plus the preset range, and the spectrum between them is the voice frequency band.
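A minimal sketch of this band-selection rule, which for simplicity only examines adjacent formant pairs; delta, sigma and n_bins are illustrative values:

```python
def voice_bands(formant_bins, delta=20, sigma=5, n_bins=257):
    """Group formants whose bin distance is below `delta` (steps B20-B40) and
    extend the band by `sigma` bins on each side; all parameter values are
    illustrative, not taken from the patent."""
    bands = []
    fmts = sorted(formant_bins)
    for m in range(len(fmts) - 1):
        n = m + 1
        if fmts[n] - fmts[m] < delta:                      # close enough: undamaged speech
            start = max(fmts[m] - sigma, 0)
            end = min(fmts[n] + sigma, n_bins - 1)
            bands.append((start, end))                     # [start_index, end_index]
    return bands
```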
In the third embodiment of the present application, the same or similar content as the first and/or second embodiments of the present application may be referred to the above description, and will not be repeated. On the basis, the voice repair method further comprises the steps of C11-C13:
step C11, according to the frequency interval of the voice frequency band, model signals in the frequency interval in each voice model are respectively intercepted;
step C12, calculating the interval similarity between the voice frequency band and each model signal;
and step C13, taking the interval similarity as the frequency band similarity between the voice frequency band and each voice model.
In this embodiment, when calculating the frequency band similarity between a voice frequency band and each voice model, their lengths are first unified: for each voice frequency band, the signal within its frequency interval is intercepted from each voice model as the model signal. The similarity between each model signal and the voice frequency band is then calculated. Various similarity measures exist; for example, the similarity between two vectors can be calculated from their Euclidean distance, or from their cosine similarity. The computed similarities are the frequency band similarities between the voice frequency band and the respective voice models.
Specifically, as an example, denote the jth voice frequency band by spk_band_j, where j ranges from 1 to N_spk, the number of voice frequency bands, and denote by model_band_{i,j} the model signal intercepted from the ith voice model over the same frequency interval. The similarity can then be calculated, for example as the cosine similarity between the two spectra:

\[ \rho_j = \frac{\sum_k \mathrm{spk\_band}_j[k] \cdot \mathrm{model\_band}_{i,j}[k]}{\lVert \mathrm{spk\_band}_j \rVert \, \lVert \mathrm{model\_band}_{i,j} \rVert} \]

wherein ρ_j is the similarity between the jth voice frequency band and the ith voice model.
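A minimal sketch of steps C11 to C13, using cosine similarity as one of the measures the text names (Euclidean distance would serve equally):

```python
import numpy as np

def band_similarity(spectrum, model, band):
    """Steps C11-C13: cut the model to the band's frequency interval and score
    similarity; cosine similarity is used here as one of the options named
    in the text."""
    start, end = band
    seg = np.abs(spectrum[start:end + 1])      # the voice frequency band
    ref = model[start:end + 1]                 # model signal over the same interval
    denom = np.linalg.norm(seg) * np.linalg.norm(ref) + 1e-12
    return float(np.dot(seg, ref) / denom)
```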
In a possible implementation, step S30 may include steps S31 to S32:
step S31, for each voice model, calculating a weighted average of the frequency band similarities between the voice model and each voice frequency band, wherein the lower the frequency, the larger the weight of the corresponding voice frequency band;
And step S32, taking the weighted average value as the voice similarity between the voice model and the voice signal frame.
In this embodiment, after the frequency band similarity between each voice frequency band and each voice model is calculated, the voice similarity between the whole voice signal frame and each voice model can be computed from the band similarities. For each voice model, the weighted average of its corresponding band similarities is calculated. Because the effective information of voice is mainly distributed in the lower-frequency region, the weights are set inversely related to the frequency of the voice band; the weighted average then gives the voice similarity between the voice signal frame and the voice model.
As an example, the voice similarity calculation formula is:

\[ \rho_{\mathrm{cur}} = \sum_{j=1}^{N_{\mathrm{spk}}} w_j \, \rho_j, \qquad \sum_{j=1}^{N_{\mathrm{spk}}} w_j = 1 \]

wherein ρ_cur is the similarity between the voice signal frame and a given voice model and w_j is the weight of the jth frequency band similarity.
In addition, as one implementation, the weights can be set according to the distribution region of human voice: the frequency bands within the voice distribution share a weight A, the frequency bands outside it share a weight B, and weight A is much greater than weight B, thereby raising the importance of the voice region.
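A minimal sketch of the weighted-average similarity of steps S31 and S32; the 1/(1 + start_bin) weighting is merely one assumed scheme satisfying "lower frequency, larger weight", not the patent's formula:

```python
import numpy as np

def frame_similarity(band_sims, band_starts):
    """Steps S31-S32: weighted average of per-band similarities, with larger
    weights for lower-frequency bands; the 1/(1 + start_bin) weighting is an
    illustrative choice satisfying that constraint."""
    weights = np.array([1.0 / (1 + s) for s in band_starts])
    weights /= weights.sum()                   # weights sum to one
    return float(np.dot(weights, band_sims))
```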
In a possible implementation manner, referring to fig. 3, in step S40, the step of repairing the speech signal according to the target speech model may include steps S41 to S45:
Step S41, determining a lossy frequency point outside the voice frequency band in the voice signal frame, and acquiring a frequency point amplitude of the voice signal frame at the lossy frequency point;
step S42, calculating prior amplitude average values of a plurality of prior voice signal frames before the voice signal frames at the lossy frequency points;
step S43, obtaining a model amplitude of the target voice model at the lossy frequency point;
Step S44, multiplying the model amplitude by the prior amplitude mean value to obtain a reference amplitude;
And step S45, carrying out weighted summation on the frequency point amplitude and the reference amplitude to obtain a new amplitude of the voice signal frame at a lossy frequency point so as to repair the lossy frequency point, wherein the sum of the weight of the frequency point amplitude and the weight of the reference amplitude is one.
In this embodiment, when repairing the voice of a voice signal frame, it is mainly the regions not identified as voice frequency bands that are repaired. Since the similarity between the voice signal frame and the target voice model is high, the phoneme voiced in the frame can be considered the same as that of the target voice model, i.e. the lost regions can be restored using the target voice model. Also, because the same phoneme may last for multiple frames while a person speaks, the frame is additionally repaired with the prior information of the preceding frames, so that the restored signal fits the human voice better.
First, the lossy frequency points in the voice signal frame, i.e. those not recognized as belonging to a voice frequency band, are determined. The prior amplitude mean of the corresponding lossy frequency points over several frames preceding the voice signal frame is acquired, along with the frequency point amplitude of the lossy points in the current frame and the model amplitude of the lossy points in the target voice model. The product of the model amplitude and the prior amplitude mean is taken as the reference amplitude, and the reference amplitude and the frequency point amplitude are weighted and summed to obtain the new amplitude at each frequency point of the voice signal frame, where the weight of the frequency point amplitude and the weight of the reference amplitude sum to one.
In a possible implementation manner, in step S45, the step of performing weighted summation on the frequency bin amplitude and the reference amplitude to obtain a new amplitude of the speech signal frame at the lossy frequency bin may include:
Calculating the absolute value of the difference between the frequency point amplitude and the model amplitude;
and carrying out weighted summation on the reference amplitude and the frequency point amplitude based on the difference absolute value to obtain a new amplitude of the voice signal frame at the lossy frequency point, wherein the weight of the reference amplitude and the difference absolute value are positively correlated.
In this embodiment, the absolute difference between the frequency point amplitude and the model amplitude reflects the degree of loss at that frequency point: the greater the absolute difference, the more serious the loss, and the larger the weight of the reference amplitude should be. In addition, when the reference amplitude is calculated, it can be multiplied by an empirical value obtained from past experience to improve the performance of the algorithm.
As an example, when it is determined that residual noise exists in a certain sub-band of the current frame, the residual noise in that sub-band is suppressed according to the amplitude prior information of the previous n−1 frames, with the specific formula:

\[ \mathrm{magn\_new}_k = \eta \cdot \mathrm{magn}_k + (1 - \eta) \cdot \beta \cdot \mathrm{model\_magn}_k \cdot \mathrm{prior\_magn}_k, \qquad \mathrm{prior\_magn}_k = \frac{1}{n-1} \sum_{t=1}^{n-1} \mathrm{magn}_{t,k} \]

wherein magn_k is the frequency point amplitude, model_magn_k is the model amplitude, prior_magn_k is the prior amplitude mean of the previous frames' signal at frequency point k, η is the weight of the frequency point amplitude (the weight of the frequency point amplitude and the weight of the reference amplitude sum to one), magn_new_k is the new amplitude of the sub-band in the voice signal frame, and β is the empirical value.
In addition, as an example, the degree of loss at a lossy frequency point may also be graded by setting an amplitude threshold: if the absolute difference is greater than the amplitude threshold, the loss is larger and a first weight η1 is adopted as the weight of the frequency point amplitude; if the absolute difference is less than or equal to the amplitude threshold, the loss is smaller and a second weight η2 is adopted. Specifically:

\[ \eta = \begin{cases} \eta_1, & \left| \mathrm{magn}_k - \mathrm{model\_magn}_k \right| > \phi \\ \eta_2, & \left| \mathrm{magn}_k - \mathrm{model\_magn}_k \right| \le \phi \end{cases} \]

wherein η1 is less than η2 and φ is the amplitude threshold.
Compared with linearly varying the weight, the threshold-based division requires less computation but is less accurate.
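A minimal sketch combining the repair formula with the threshold-based weight grading; all numeric values are illustrative, the patent only requiring η1 < η2 and the two weights summing to one:

```python
import numpy as np

def repair_bins(magn, model, prior_mean, lossy_bins, eta1=0.3, eta2=0.7,
                phi=0.5, beta=1.0):
    """Steps S41-S45 with threshold-based weight grading: blend each lossy
    bin's amplitude with a reference amplitude (model amplitude x prior mean
    x empirical factor). eta1, eta2, phi and beta are illustrative values."""
    out = magn.copy()
    for k in lossy_bins:
        reference = beta * model[k] * prior_mean[k]
        # larger model deviation -> more damage -> smaller weight on the bin itself
        eta = eta1 if abs(magn[k] - model[k]) > phi else eta2
        out[k] = eta * magn[k] + (1.0 - eta) * reference
    return out
```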
The above are merely two possible implementations of the speech repair provided by the present embodiment, and the specific implementation of the speech repair by the present embodiment is not specifically limited.
It should be noted that the foregoing examples are only for understanding the present application, and are not to be construed as limiting the speech restoration method of the present application, and that many simple modifications based on the technical idea are within the scope of the present application.
The present application also provides a voice repairing device, referring to fig. 4, the voice repairing device includes:
A voice frequency band determining module 10, configured to determine, for each voice signal frame of the voice signal after noise reduction, a plurality of voice frequency bands in the voice signal frame;
A similarity determining module 20, configured to determine a voice similarity between the voice signal frame and each voice model according to a frequency band similarity between the voice frequency band and each voice model, where each voice model is a frequency spectrum corresponding to a phoneme constructed with a voice sample dataset corresponding to each of a plurality of phonemes;
the speech restoration module 30 is configured to determine a target speech model with the highest speech similarity with the speech signal frame from the speech models, and restore the speech signal frame according to the target speech model.
Optionally, the speech repair module 30 is further configured to:
determining a lossy frequency point outside the voice frequency band in the voice signal frame, and acquiring a frequency point amplitude of the voice signal frame at the lossy frequency point;
calculating prior amplitude average values of a plurality of prior voice signal frames in front of the voice signal frames at the lossy frequency points;
obtaining a model amplitude of the target voice model at the lossy frequency point;
Multiplying the model amplitude value by the prior amplitude value average value to obtain a reference amplitude value;
And carrying out weighted summation on the frequency point amplitude and the reference amplitude to obtain a new amplitude of the voice signal frame at a lossy frequency point so as to repair the lossy frequency point, wherein the sum of the weight of the frequency point amplitude and the weight of the reference amplitude is one.
Optionally, the speech repair module 30 is further configured to:
Calculating the absolute value of the difference between the frequency point amplitude and the model amplitude;
and carrying out weighted summation on the reference amplitude and the frequency point amplitude based on the difference absolute value to obtain a new amplitude of the voice signal frame at the lossy frequency point, wherein the weight of the reference amplitude and the difference absolute value are positively correlated.
Optionally, the voice frequency band determining module 10 is further configured to:
acquiring a plurality of formants in the voice signal frame;
determining two target formants with the distance between frequency points smaller than a preset threshold value from the formants as a target formant group;
for each target formant group, determining a first frequency band between two target formants in the target formant group;
and taking the first frequency band and the frequency bands within preset ranges at two sides of the first frequency band as a voice frequency band.
Optionally, the voice repairing device further includes:
The frequency band similarity calculation module is used for respectively intercepting model signals in the frequency intervals in each voice model according to the frequency intervals of the voice frequency band; calculating interval similarity between the voice frequency band and each model signal; and taking the interval similarity as the frequency band similarity between the voice frequency band and each voice model.
Optionally, the similarity determining module 20 is further configured to:
For each voice model, calculating a weighted average value of the frequency band similarity between the voice model and each voice frequency band, wherein the lower the frequency is, the larger the corresponding weight of the voice frequency band is;
the weighted average is taken as the voice similarity between the voice model and the voice signal frame.
Optionally, the voice repairing device further includes:
The modeling module is used for acquiring a respective amplitude-frequency curve of each voice sample data in the voice sample data set corresponding to each phoneme aiming at each phoneme in the plurality of phonemes; calculating the amplitude mean value of each amplitude-frequency curve at each frequency point; normalizing each amplitude mean value, and taking each normalized amplitude mean value as the amplitude value at each frequency point in the voice model to construct the voice model corresponding to the phonemes.
The voice repairing device provided by the application can solve the technical problems that the voice signal is damaged and even information is lost after noise reduction by adopting the voice repairing method in the embodiment. Compared with the prior art, the beneficial effects of the voice repairing device provided by the application are the same as those of the voice repairing method provided by the embodiment, and other technical features in the voice repairing device are the same as those disclosed by the method of the embodiment, so that the description is omitted.
The present application provides an audio device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the speech repair method in the first embodiment.
Referring now to fig. 5, a schematic diagram of an audio device suitable for implementing embodiments of the present application is shown. The audio device in the embodiments of the present application may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (tablet computers), PMPs (Portable Media Players), vehicle-mounted terminals (e.g., car navigation terminals), headphones and VR devices, as well as stationary terminals such as digital TVs, desktop computers and audio appliances. The audio device shown in fig. 5 is only an example and should not impose any limitation on the functionality or scope of use of the embodiments of the present application.
As shown in fig. 5, the audio device may include a processing device 1001 (e.g., a central processing unit or a graphics processor) that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1003 into a random access memory (RAM) 1004. The RAM 1004 also stores the various programs and data required for the operation of the audio device. The processing device 1001, the ROM 1002 and the RAM 1004 are connected to one another by a bus 1005, to which an input/output (I/O) interface 1006 is also connected. In general, the following may be connected to the I/O interface 1006: input devices 1007 including, for example, a touch screen, touchpad, keyboard, mouse, image sensor, microphone, accelerometer and gyroscope; output devices 1008 including, for example, a liquid crystal display (LCD, Liquid Crystal Display), speaker and vibrator; the storage device 1003 including, for example, magnetic tape and hard disk; and a communication device 1009, which may allow the audio device to communicate wirelessly or by wire with other devices to exchange data. Although audio devices with various systems are shown in the figure, it should be understood that not all of the illustrated systems must be implemented or provided; more or fewer systems may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication device, or installed from the storage device 1003, or installed from the ROM 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.
By adopting the voice repair method of the above embodiment, the audio device provided by the application can solve the technical problem that a voice signal is impaired, or even loses information, after noise reduction. Compared with the prior art, the beneficial effects of the audio device provided by the application are the same as those of the voice repair method provided by the above embodiment, and the other technical features of the audio device are the same as those disclosed by the method of the previous embodiment, so they are not repeated here.
It is to be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing is merely illustrative of the present application and does not limit it; any variation or substitution that a person skilled in the art can readily conceive within the technical scope disclosed herein shall fall within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The present application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon for performing the voice repair method of the above embodiments.
The computer-readable storage medium provided by the present application may be, for example, a USB flash drive, but is not limited thereto; it may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer-readable storage medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical-fiber cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The above computer-readable storage medium may be contained in the above audio device, or may exist alone without being assembled into the audio device.
The computer-readable storage medium carries one or more programs that, when executed by an audio device, cause the audio device to perform the following steps (a code sketch of this pipeline is given after the list):
determining, for a noise-reduced voice signal frame, a voice frequency band in the voice signal frame;
determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model, wherein each voice model is a spectrum, corresponding to one of a plurality of phonemes, constructed from the voice sample data set of that phoneme; and
determining, among the voice models, a target voice model with the highest voice similarity to the voice signal frame, and repairing the voice signal frame according to the target voice model.
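For concreteness, the following is a minimal Python sketch of the model-selection part of this pipeline. Everything in it is illustrative rather than the patent's reference implementation: the cosine measure stands in for the unspecified similarity function, and the (lo, hi) band representation and helper names are assumptions. The claims below refine each step.

```python
import numpy as np

def cosine_similarity(a, b):
    """Stand-in for the frequency-band similarity measure (assumed choice)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_target_model(frame_spectrum, phoneme_models, band):
    """frame_spectrum: magnitude spectrum of one noise-reduced frame.
    phoneme_models: dict of phoneme -> model spectrum over the same bins.
    band: (lo, hi) bin indices of the detected voice frequency band."""
    lo, hi = band
    # Voice similarity: compare frame and model inside the voice band only.
    scores = {ph: cosine_similarity(frame_spectrum[lo:hi], m[lo:hi])
              for ph, m in phoneme_models.items()}
    # The model with the highest similarity becomes the target model.
    target = max(scores, key=scores.get)
    return target, phoneme_models[target]
```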
Computer program code for carrying out operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules involved in the embodiments of the present application may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
The readable storage medium provided by the application is a computer-readable storage medium storing computer-readable program instructions (i.e., a computer program) for executing the above voice repair method, and can therefore solve the technical problem that a voice signal is impaired, or even loses information, after noise reduction. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided by the application are the same as those of the voice repair method provided by the above embodiment, and are not repeated here.
The foregoing describes only some embodiments of the present application and does not limit its patent scope; any equivalent structural change made on the basis of the description and drawings under the technical concept of the present application, and any direct or indirect application in other related technical fields, are likewise included in the patent protection scope of the present application.

Claims (10)

1. A voice repair method, the method comprising:
determining, for a noise-reduced voice signal frame, a voice frequency band in the voice signal frame;
determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model, wherein each voice model is a spectrum, corresponding to one of a plurality of phonemes, constructed from the voice sample data set of that phoneme; and
determining, among the voice models, a target voice model with the highest voice similarity to the voice signal frame, and repairing the voice signal frame according to the target voice model.
2. The method of claim 1, wherein the step of repairing the voice signal frame according to the target voice model comprises:
determining a lossy frequency point outside the voice frequency band in the voice signal frame, and acquiring the frequency-point amplitude of the voice signal frame at the lossy frequency point;
calculating the prior amplitude mean, at the lossy frequency point, of a plurality of prior voice signal frames preceding the voice signal frame;
acquiring the model amplitude of the target voice model at the lossy frequency point;
multiplying the model amplitude by the prior amplitude mean to obtain a reference amplitude; and
performing a weighted summation of the frequency-point amplitude and the reference amplitude to obtain a new amplitude of the voice signal frame at the lossy frequency point, so as to repair the lossy frequency point, wherein the sum of the weight of the frequency-point amplitude and the weight of the reference amplitude is one.
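A minimal sketch of this blending step, assuming NumPy arrays of per-bin magnitudes; the fixed weight `alpha` is an illustrative assumption (claim 3 below replaces it with an adaptive weight):

```python
import numpy as np

def repair_bin(prior_frames, frame, model, k, alpha=0.5):
    """Repair lossy bin k of `frame` per the steps above.
    prior_frames: magnitude spectra of the preceding frames.
    model: spectrum of the target voice model.
    alpha: weight of the original bin amplitude; the two weights sum to one."""
    bin_amp = frame[k]                                    # frequency-point amplitude
    prior_mean = np.mean([f[k] for f in prior_frames])    # prior amplitude mean
    ref_amp = model[k] * prior_mean                       # reference amplitude
    frame[k] = alpha * bin_amp + (1.0 - alpha) * ref_amp  # weighted summation
    return frame
```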
3. The method of claim 2, wherein the step of performing a weighted summation of the frequency-point amplitude and the reference amplitude to obtain a new amplitude of the voice signal frame at the lossy frequency point comprises:
calculating the absolute value of the difference between the frequency-point amplitude and the model amplitude; and
performing a weighted summation of the reference amplitude and the frequency-point amplitude based on the absolute difference to obtain the new amplitude of the voice signal frame at the lossy frequency point, wherein the weight of the reference amplitude is positively correlated with the absolute difference.
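The claim only requires that the reference-amplitude weight grow with the absolute difference; the bounded monotone mapping below, and its `scale` parameter, are assumed choices among many that satisfy it:

```python
def adaptive_weight(bin_amp, model_amp, scale=1.0):
    """Weight of the reference amplitude, increasing in |bin - model|
    and bounded in [0, 1) so the two weights can still sum to one."""
    diff = abs(bin_amp - model_amp)
    return diff / (diff + scale)

# Usage: the farther the bin strays from the model amplitude, the harder
# it is pulled toward the reference amplitude computed as in claim 2.
# w = adaptive_weight(frame[k], model[k])
# frame[k] = (1.0 - w) * frame[k] + w * ref_amp
```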
4. The method of claim 1, wherein the step of determining the voice frequency band in the voice signal frame comprises:
acquiring a plurality of formants in the voice signal frame;
determining, from among the formants, two target formants whose frequency-point distance is smaller than a preset threshold as a target formant group;
determining, for each target formant group, a first frequency band between the two target formants in the target formant group; and
taking the first frequency band, together with the frequency bands within preset ranges on both sides of the first frequency band, as a voice frequency band.
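A sketch of this band detection, assuming the formants are approximated by spectral peaks; `max_gap_bins` and `margin_bins` stand in for the claim's preset threshold and preset ranges, and the prominence-based peak picking is an assumption:

```python
import numpy as np
from scipy.signal import find_peaks

def voice_bands(spectrum, max_gap_bins=40, margin_bins=8):
    """Pair nearby formant peaks and widen each pair into a voice band."""
    peaks, _ = find_peaks(spectrum, prominence=np.max(spectrum) * 0.1)
    bands = []
    for p, q in zip(peaks, peaks[1:]):
        if q - p < max_gap_bins:                     # two target formants
            lo = max(0, p - margin_bins)             # extend on both sides
            hi = min(len(spectrum) - 1, q + margin_bins)
            bands.append((lo, hi))                   # one voice frequency band
    return bands
```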
5. The method of claim 1, wherein after the step of determining the voice frequency band in the voice signal frame, the method further comprises:
intercepting, from each voice model, the model signal within the frequency interval of the voice frequency band;
calculating the interval similarity between the voice frequency band and each model signal; and
taking the interval similarity as the frequency band similarity between the voice frequency band and each voice model.
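A sketch of the interval comparison; the cosine measure is an assumed stand-in, since the claim does not fix the similarity function:

```python
import numpy as np

def band_similarity(frame, model, band):
    """Compare the frame and one voice model only inside the band's
    frequency interval; the slice of the model is the model signal."""
    lo, hi = band
    a, b = frame[lo:hi], model[lo:hi]
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-12
    return float(np.dot(a, b) / denom)   # interval similarity
```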
6. The method of claim 1, wherein, when there are a plurality of voice frequency bands, the step of determining the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model comprises:
calculating, for each voice model, a weighted average of the frequency band similarities between the voice model and the respective voice frequency bands, wherein the lower the frequency of a voice frequency band, the larger its corresponding weight; and
taking the weighted average as the voice similarity between the voice model and the voice signal frame.
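Any weighting that decreases with frequency satisfies the claim; the 1/(1 + lo) choice below is an illustrative assumption:

```python
def voice_similarity(band_sims, bands):
    """Weighted average of per-band similarities for one voice model.
    band_sims: similarity of the model to each voice frequency band.
    bands: (lo, hi) bin indices; lower-frequency bands get larger weights."""
    weights = [1.0 / (1 + lo) for lo, _ in bands]
    return sum(w * s for w, s in zip(weights, band_sims)) / sum(weights)
```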
7. The method of claim 1, wherein, before the step of determining, for a noise-reduced voice signal frame, the voice frequency band in the voice signal frame, the method further comprises:
acquiring, for each phoneme in the plurality of phonemes, the amplitude-frequency curve of each piece of voice sample data in the voice sample data set corresponding to that phoneme;
calculating the amplitude mean of the amplitude-frequency curves at each frequency point; and
normalizing each amplitude mean, and taking each normalized amplitude mean as the amplitude at the corresponding frequency point of the voice model, so as to construct the voice model corresponding to the phoneme.
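A sketch of this offline model construction; peak normalization is an assumed choice, since the claim only requires that the per-bin means be normalized:

```python
import numpy as np

def build_voice_model(sample_spectra):
    """sample_spectra: amplitude-frequency curves of one phoneme's
    samples, each an array over the same frequency bins."""
    curves = np.stack(sample_spectra)            # (num_samples, num_bins)
    mean_amp = curves.mean(axis=0)               # amplitude mean per bin
    return mean_amp / (mean_amp.max() + 1e-12)   # normalized voice model
```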
8. A voice repair apparatus, the apparatus comprising:
a voice frequency band determining module, configured to determine a plurality of voice frequency bands in each voice signal frame of a noise-reduced voice signal;
a similarity determining module, configured to determine the voice similarity between the voice signal frame and each voice model according to the frequency band similarity between the voice frequency band and each voice model, wherein each voice model is a spectrum, corresponding to one of a plurality of phonemes, constructed from the voice sample data set of that phoneme; and
a voice repair module, configured to determine, among the voice models, a target voice model with the highest voice similarity to the voice signal frame, and to repair the voice signal frame according to the target voice model.
9. An audio device, the device comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the voice repair method according to any one of claims 1 to 7.
10. A storage medium, wherein the storage medium is a computer-readable storage medium having a computer program stored thereon, and the computer program, when executed by a processor, implements the steps of the voice repair method according to any one of claims 1 to 7.
CN202410733009.7A 2024-06-07 2024-06-07 Voice repair method, device, audio equipment and storage medium Pending CN118314919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410733009.7A CN118314919A (en) 2024-06-07 2024-06-07 Voice repair method, device, audio equipment and storage medium

Publications (1)

Publication Number Publication Date
CN118314919A 2024-07-09

Family

ID=91730205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410733009.7A Pending CN118314919A (en) 2024-06-07 2024-06-07 Voice repair method, device, audio equipment and storage medium

Country Status (1)

Country Link
CN (1) CN118314919A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6073094A (en) * 1998-06-02 2000-06-06 Motorola Voice compression by phoneme recognition and communication of phoneme indexes and voice features
CN101894565A (en) * 2009-05-19 2010-11-24 华为技术有限公司 Voice signal restoration method and device
CN105845148A (en) * 2016-03-16 2016-08-10 重庆邮电大学 Convolution blind source separation method based on frequency point correction
EP3790000A1 (en) * 2019-09-05 2021-03-10 SoundHound, Inc. System and method for detection and correction of a speech query
CN115762576A (en) * 2022-11-09 2023-03-07 西安讯飞超脑信息科技有限公司 Directional microphone array voice distortion detection and repair method, device and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination