CN113593604A - Method, device and storage medium for detecting audio quality


Info

Publication number: CN113593604A
Application number: CN202110831738.2A
Authority: CN (China)
Prior art keywords: audio, power, detected, audio frame, human voice
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张超鹏, 汪璐璐, 姜涛, 胡鹏
Current and original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110831738.2A; publication of CN113593604A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a method, a device, and a storage medium for detecting audio quality, belonging to the field of computer technology. The method comprises the following steps: determining a human voice fundamental frequency estimate corresponding to each audio frame to be detected according to a power spectrum corresponding to each audio frame to be detected of the target dry audio; for each audio frame to be detected, weighting the power value of each frequency point in the power spectrum of the audio frame to be detected; determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum; detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected; and determining the audio quality information of the target dry audio according to the power spectra corresponding to the human voice audio frames and to the non-human voice audio frames. With this method, device, and storage medium, the audio quality of dry audio can be judged more accurately.

Description

Method, device and storage medium for detecting audio quality
Technical Field
The present application relates to the field of audio data processing, and in particular, to a method and an apparatus for detecting audio quality, and a storage medium.
Background
Karaoke applications are a common form of entertainment. A user sings through a karaoke application, and during the singing the terminal records the audio captured by the microphone; this audio is generally called dry audio. The user can then upload the dry audio to a server for storage, so that when the user's performance is played back, the dry audio can be downloaded and played together with the accompaniment audio.
While a karaoke application is in operation, the server may store hundreds of millions of dry audio recordings, and the amount of stored dry audio grows over time, which places great demands on the server's storage capacity. A deletion mechanism is therefore generally configured on the server: for example, the dry audio of users who have not logged in for a long time is deleted, and lower-quality audio is deleted as well. Generally, when the audio quality of dry audio is evaluated, only the total power of the dry audio is detected and the audio quality information is determined from that total power; if the power is too low (possibly because the user did not sing into the microphone), the dry audio is judged to be of low quality and may be deleted.
In the course of implementing the present application, the inventors found that the related art has at least the following problem:
the above scheme judges audio quality only from the total power; however, the total power cannot fully and accurately reflect the audio quality, so the accuracy of the resulting audio quality information is poor.
Disclosure of Invention
The embodiment of the application provides a method and a device for detecting audio quality and a storage medium, which can solve the problem of poor accuracy of audio quality information. The technical scheme is as follows:
in a first aspect, a method for detecting audio quality is provided, the method comprising:
determining a human voice fundamental frequency estimate corresponding to each audio frame to be detected according to a power spectrum corresponding to each audio frame to be detected of the target dry audio, wherein the power spectrum comprises power values of all frequency points;
for each audio frame to be detected, weighting the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a weighted power spectrum, wherein the weights of frequency points at positive integer multiples of the human voice fundamental frequency estimate corresponding to the audio frame to be detected are greater than the weights of other frequency points;
determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum;
detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
and determining audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames.
In a possible implementation, determining the human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected of the target dry audio comprises:
determining the human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information.
In a possible implementation, determining the human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the preset human voice frequency characteristic information comprises:
smoothing the power spectrum corresponding to each audio frame to be detected with a preset window length;
and determining, for each audio frame to be detected, the lowest peak frequency of the smoothed power spectrum within the human voice fundamental frequency range as the human voice fundamental frequency estimate corresponding to that audio frame.
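As an illustrative sketch (not the patent's reference implementation), the smoothing and lowest-peak search described above can be realized as follows; the moving-average window length and the 80-400 Hz vocal fundamental frequency range are assumed values, not taken from the patent:

```python
import numpy as np

def estimate_f0(power_spectrum, freqs, f0_range=(80.0, 400.0), win=5):
    """Smooth the power spectrum with a moving-average window of length `win`,
    then return the lowest-frequency local peak inside the vocal F0 range."""
    kernel = np.ones(win) / win
    smoothed = np.convolve(power_spectrum, kernel, mode="same")
    lo, hi = f0_range
    for i in np.where((freqs >= lo) & (freqs <= hi))[0]:
        # Local peak: strictly above the left neighbour, at least the right one.
        if 0 < i < len(smoothed) - 1 and \
                smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]:
            return float(freqs[i])
    return None  # no peak found in the vocal range
```

Searching from the low end of the range and stopping at the first local peak is what makes the returned value the minimum peak frequency.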
In a possible implementation, for each audio frame to be detected, weighting the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a weighted power spectrum comprises:
constructing, according to the human voice fundamental frequency estimate corresponding to each audio frame to be detected, a weight coefficient function corresponding to that audio frame, wherein the weight coefficient function represents the weights corresponding to different frequency points, and its waveform has a plurality of peaks respectively located at positive integer multiples of the human voice fundamental frequency estimate;
and for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame by the weight coefficient function to obtain the weighted power spectrum.
In one possible implementation, the weight coefficient function is a trigonometric function.
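The patent only states that the weight coefficient function is a trigonometric function with peaks at positive integer multiples of the fundamental frequency estimate; a raised cosine is one natural choice that satisfies this description (a sketch, not the claimed formula):

```python
import numpy as np

def harmonic_weights(freqs, f0):
    """Raised-cosine weight function: equals 1 at positive integer multiples
    of f0 and falls to 0 midway between consecutive harmonics."""
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * freqs / f0))

def weight_spectrum(power_spectrum, freqs, f0):
    # Element-wise multiplication of the power spectrum by the weights.
    return power_spectrum * harmonic_weights(freqs, f0)
```

With this choice, power at harmonic frequency points passes through unchanged while power between harmonics is suppressed, which is exactly the behaviour the weighting step requires.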
In a possible implementation, determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum comprises:
for each audio frame to be detected, determining the ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum, and normalizing the ratio according to a preset ratio upper limit and a preset ratio lower limit; the normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
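One way to read this step: a strongly voiced frame keeps most of its power under harmonic weighting, so the weighted-to-unweighted power ratio is high; clipping and scaling that ratio between preset bounds yields a probability in [0, 1]. An illustrative sketch, where the bounds `r_lo` and `r_hi` are assumptions rather than values from the patent:

```python
def voice_presence_prob(power_spectrum, weighted_spectrum, r_lo=0.2, r_hi=0.6):
    """Normalized ratio of weighted to unweighted total power."""
    ratio = sum(weighted_spectrum) / sum(power_spectrum)
    # Min-max normalization against the preset lower/upper ratio bounds.
    p = (ratio - r_lo) / (r_hi - r_lo)
    return min(1.0, max(0.0, p))
```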
In a possible implementation, detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected comprises:
detecting human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a human voice detection probability threshold;
and detecting non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a non-human voice detection probability threshold.
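Because two separate thresholds are used, frames with intermediate probabilities can be left unclassified rather than forced into either class. A sketch with assumed threshold values:

```python
def classify_frames(probs, voice_thresh=0.7, nonvoice_thresh=0.3):
    """Return indices of human voice frames and non-human voice frames.
    Frames whose probability falls between the two thresholds are ambiguous
    and belong to neither class."""
    voiced = [i for i, p in enumerate(probs) if p >= voice_thresh]
    unvoiced = [i for i, p in enumerate(probs) if p <= nonvoice_thresh]
    return voiced, unvoiced
```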
In one possible implementation, the method further includes: determining a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames; and determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected;
determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames then comprises:
determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
and determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter, and the power penalty parameter.
In a possible implementation, determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames;
and determining the product of the signal-to-noise ratio estimate and the human voice clarity as the human voice quality information of the target dry audio.
In a possible implementation, determining the signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power mean corresponding to each human voice audio frame and a power mean corresponding to each non-human voice audio frame, wherein the power mean is determined as the average of the power values of all frequency points;
determining a first median of the power means corresponding to all human voice audio frames and a second median of the power means corresponding to all non-human voice audio frames;
and calculating the signal-to-noise ratio estimate according to the ratio of the first median to the second median.
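Using medians of the per-frame power means makes the estimate robust to a few outlier frames. A sketch of the computation follows; the conversion of the median ratio to decibels is an assumption, since the patent only specifies a ratio of the two medians:

```python
import math

def snr_estimate(voice_frames, nonvoice_frames):
    """Each argument is a list of per-frame power spectra (lists of
    power values, one per frequency point)."""
    def median(xs):
        s = sorted(xs)
        n = len(s)
        return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    voice_med = median([sum(f) / len(f) for f in voice_frames])     # first median
    noise_med = median([sum(f) / len(f) for f in nonvoice_frames])  # second median
    return 10.0 * math.log10(voice_med / noise_med)
```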
In a possible implementation, determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames comprises:
determining the human voice clarity of the target dry audio according to the median of the human voice existence probabilities corresponding to all human voice audio frames.
In one possible implementation, the method further includes:
if no human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the mean of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the mean of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the method further includes:
if no non-human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the median of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the median of the human voice existence probabilities and the power penalty parameter.
In a possible implementation, determining the noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power mean corresponding to each non-human voice audio frame, wherein the power mean is determined as the average of the power values of all frequency points;
determining the median of the power means corresponding to all non-human voice audio frames;
and determining the noise penalty parameter of the target dry audio according to the median of the power means, wherein the noise penalty parameter is negatively correlated with the median of the power means.
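A hypothetical shape for this mapping (all constants are assumptions; the patent only requires a negative correlation): quiet non-voice frames incur no penalty, and the factor decreases as the median noise power rises.

```python
def noise_penalty(noise_median_db, quiet_floor_db=-60.0, slope=0.02):
    """Penalty factor in [0, 1], negatively correlated with the median
    noise power once it exceeds a quiet floor."""
    if noise_median_db <= quiet_floor_db:
        return 1.0
    return max(0.0, 1.0 - slope * (noise_median_db - quiet_floor_db))
```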
In a possible implementation, the audio frames to be detected are the audio frames in the target dry audio whose power mean is greater than a silence power threshold, wherein the power mean is determined as the average of the power values of all frequency points;
determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected then comprises:
determining a power mean corresponding to each audio frame to be detected, wherein the power mean is determined as the average of the power values of all frequency points;
determining the average of the power means corresponding to all audio frames to be detected to obtain the total power mean of the target dry audio;
and determining the power penalty parameter of the target dry audio according to the total power mean and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio.
In a possible implementation, determining the power penalty parameter of the target dry audio according to the total power mean and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio comprises:
determining a first power penalty sub-parameter according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold, wherein when the ratio is less than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is greater than the ratio threshold, the first power penalty sub-parameter is a fixed value;
determining a second power penalty sub-parameter and a third power penalty sub-parameter according to the total power mean and preset power upper and lower limits, wherein when the total power mean is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean minus the power upper limit; when the total power mean is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean; and when the total power mean is less than the power upper limit and greater than the power lower limit, the second and third power penalty sub-parameters are both fixed values;
and determining the product of the first, second and third power penalty sub-parameters as the power penalty parameter of the target dry audio.
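The three sub-parameters described above can be sketched as follows; every constant (the ratio threshold, the power limits in dB, the penalty slope) is an assumed illustration, since the patent specifies only the correlations, not concrete formulas:

```python
def power_penalty(frame_ratio, total_power_db, ratio_thresh=0.3,
                  power_hi_db=-10.0, power_lo_db=-50.0, slope=0.05):
    """Product of three sub-parameters, each in [0, 1]."""
    # Sub-parameter 1: penalize recordings where few frames pass the
    # silence gate (ratio <= threshold); otherwise a fixed value of 1.
    if frame_ratio > ratio_thresh:
        q1 = 1.0
    else:
        q1 = max(0.0, 1.0 - (ratio_thresh - frame_ratio) / ratio_thresh)
    # Sub-parameter 2: penalize excessive loudness above the upper limit.
    q2 = 1.0 if total_power_db < power_hi_db else \
        max(0.0, 1.0 - slope * (total_power_db - power_hi_db))
    # Sub-parameter 3: penalize near-silence below the lower limit.
    q3 = 1.0 if total_power_db > power_lo_db else \
        max(0.0, 1.0 - slope * (power_lo_db - total_power_db))
    return q1 * q2 * q3
```

Within the normal operating band (enough non-silent frames, power between the two limits) all three factors equal 1 and no penalty is applied.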
In a second aspect, an apparatus for detecting audio quality is provided, the apparatus comprising:
the determining module is used for determining a human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected of the target dry audio, wherein the power spectrum comprises power values of all frequency points;
the weighting module is used for weighting the power value of each frequency point in the power spectrum of each audio frame to be detected to obtain a weighted power spectrum, wherein the weights of frequency points at positive integer multiples of the human voice fundamental frequency estimate corresponding to the audio frame to be detected are greater than the weights of other frequency points;
the probability module is used for determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum;
the detection module is used for detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
and the quality module is used for determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames.
In one possible implementation manner, the determining module is configured to:
determining a human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information.
In a possible implementation, the human voice frequency characteristic information is a human voice fundamental frequency range, and the determining module is configured to:
smoothing the power spectrum corresponding to each audio frame to be detected with a preset window length;
and determining, for each audio frame to be detected, the lowest peak frequency of the smoothed power spectrum within the human voice fundamental frequency range as the human voice fundamental frequency estimate corresponding to that audio frame.
In one possible implementation manner, the weighting module is configured to:
constructing, according to the human voice fundamental frequency estimate corresponding to each audio frame to be detected, a weight coefficient function corresponding to that audio frame, wherein the weight coefficient function represents the weights corresponding to different frequency points, and its waveform has a plurality of peaks respectively located at positive integer multiples of the human voice fundamental frequency estimate;
and for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame by the weight coefficient function to obtain the weighted power spectrum.
In one possible implementation, the weight coefficient function is a trigonometric function.
In one possible implementation, the probability module is configured to:
for each audio frame to be detected, determining the ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum, and normalizing the ratio according to a preset ratio upper limit and a preset ratio lower limit; the normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
In a possible implementation manner, the detection module is configured to:
detecting human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a human voice detection probability threshold;
and detecting non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a non-human voice detection probability threshold.
In one possible implementation, the quality module is further configured to: determining a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames; and determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected;
the quality module is configured to:
determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame;
and determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter and the power penalty parameter.
In one possible implementation, the quality module is configured to:
determining a signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames;
and determining the product of the signal-to-noise ratio estimate and the human voice clarity as the human voice quality information of the target dry audio.
In one possible implementation, the quality module is configured to:
determining a power mean corresponding to each human voice audio frame and a power mean corresponding to each non-human voice audio frame, wherein the power mean is determined as the average of the power values of all frequency points;
determining a first median of the power means corresponding to all human voice audio frames and a second median of the power means corresponding to all non-human voice audio frames;
and calculating the signal-to-noise ratio estimate according to the ratio of the first median to the second median.
In one possible implementation, the quality module is configured to:
determining the human voice clarity of the target dry audio according to the median of the human voice existence probabilities corresponding to all human voice audio frames.
In one possible implementation, the quality module is further configured to:
if no human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the mean of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the mean of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the quality module is further configured to:
if no non-human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the median of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the median of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the quality module is configured to:
determining a power mean corresponding to each non-human voice audio frame, wherein the power mean is determined as the average of the power values of all frequency points;
determining the median of the power means corresponding to all non-human voice audio frames;
and determining the noise penalty parameter of the target dry audio according to the median of the power means, wherein the noise penalty parameter is negatively correlated with the median of the power means.
In a possible implementation, the audio frames to be detected are the audio frames in the target dry audio whose power mean is greater than a silence power threshold, wherein the power mean is determined as the average of the power values of all frequency points;
the quality module is configured to:
determining a power mean corresponding to each audio frame to be detected, wherein the power mean is determined as the average of the power values of all frequency points;
determining the average of the power means corresponding to all audio frames to be detected to obtain the total power mean of the target dry audio;
and determining the power penalty parameter of the target dry audio according to the total power mean and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio.
In one possible implementation, the quality module is configured to:
determining a first power penalty sub-parameter according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold, wherein when the ratio is less than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is greater than the ratio threshold, the first power penalty sub-parameter is a fixed value;
determining a second power penalty sub-parameter and a third power penalty sub-parameter according to the total power mean and preset power upper and lower limits, wherein when the total power mean is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean minus the power upper limit; when the total power mean is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean; and when the total power mean is less than the power upper limit and greater than the power lower limit, the second and third power penalty sub-parameters are both fixed values;
and determining the product of the first, second and third power penalty sub-parameters as the power penalty parameter of the target dry audio.
In a third aspect, a computer device is provided, and the computer device includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the operations performed by the method for detecting audio quality according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the operations performed by the method for detecting audio quality according to the first aspect.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
According to the embodiments of the application, the human voice existence probability of each audio frame in the dry audio is identified from the power spectrum of the audio frame, and the human voice audio frames and non-human voice audio frames are identified accordingly. The audio quality information of the dry audio is then determined based on the power spectra of the human voice audio frames and of the non-human voice audio frames. Because high-quality dry audio is close to silence in its non-human voice parts, the audio quality of the dry audio can be judged more accurately based on the power of the human voice audio frames and of the non-human voice audio frames.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a flowchart of a method for detecting audio quality according to an embodiment of the present application;
Fig. 2 is a flowchart of a method for detecting audio quality according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for determining a human voice existence probability according to an embodiment of the present application;
Fig. 4 is a waveform diagram of a power spectrum provided by an embodiment of the present application;
Fig. 5 is a waveform diagram of a power spectrum provided by an embodiment of the present application;
Fig. 6 is a waveform diagram of a power spectrum provided by an embodiment of the present application;
Fig. 7 is a waveform diagram of a power spectrum provided by an embodiment of the present application;
Fig. 8 is a flowchart of a method for detecting audio quality according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an apparatus for detecting audio quality according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.
In the method for detecting audio quality provided by the present application, the execution subject may be a server. The server may be a background server of an application program with an audio recording function, such as a karaoke application, a video application, or a recording application. The server may be a single server or a server group. If it is a single server, that server may be responsible for all of the processing in the following scheme; if it is a server group, different servers in the group may be responsible for different parts of the processing, and the specific allocation may be set by a technician according to actual needs.
The server may include components such as a processor, memory, and communication components. The processor is respectively connected with the memory and the communication component.
The processor may be a Central Processing Unit (CPU).
The memory may include ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), a magnetic disk, an optical data storage device, and the like. The memory may be used to store data needed in the process of detecting audio quality, including pre-stored data, generated intermediate data, and generated result data, such as dry audio, the various penalty parameters, and audio quality information.
The communication component may be a wired network connector, a WiFi (Wireless Fidelity) module, a Bluetooth module, a cellular network communication module, or the like. The communication component may be used for data transmission with other devices, such as other servers and terminals. For example, the communication component may receive dry audio transmitted by a terminal.
Fig. 1 is a flowchart of a method for detecting audio quality according to an embodiment of the present application. Referring to fig. 1, the embodiment includes:
101, determining a human voice fundamental frequency estimation value corresponding to each audio frame to be detected according to a power spectrum corresponding to each audio frame to be detected of the target dry audio.
The power spectrum comprises power values of all frequency points.
And 102, performing weighting processing on the power value of each frequency point in the power spectrum of each audio frame to be detected to obtain the power spectrum after weighting processing.
And the weight of the frequency points which are positive integral multiples of the human voice fundamental frequency estimated value corresponding to the audio frame to be detected is greater than the weight of other frequency points.
And 103, determining the existence probability of the human voice of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after the weighting processing.
And 104, detecting a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected.
And 105, determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame.
According to the embodiments of the application, the human voice existence probability of each audio frame in the dry audio is identified from the power spectrum of the audio frame, and the human voice audio frames and non-human voice audio frames are identified accordingly. The audio quality information of the dry audio is then determined based on the power spectra of the human voice audio frames and of the non-human voice audio frames. Because high-quality dry audio is close to silence in its non-human voice parts, the audio quality of the dry audio can be judged more accurately based on the power of the human voice audio frames and of the non-human voice audio frames.
Fig. 2 is a flowchart of a method for detecting audio quality according to an embodiment of the present application. Referring to fig. 2, the embodiment includes:
and 201, acquiring a power spectrum corresponding to the audio frame to be detected of the target dry audio.
The power spectrum comprises the power values of all frequency points. The audio frames to be detected can be selected in various ways, for example at a fixed interval, or by selecting audio frames that meet a certain power requirement.
In implementation, during audio recording on the terminal, a common audio data sampling rate is 44.1 kHz (Android) or 48 kHz (iOS). Down-sampling to 16 kHz is generally performed on the collected audio data to reduce the computation of subsequent processing. An open-source resampling tool such as libresample can be used for the down-sampling. The corresponding dry audio is obtained after down-sampling.
The server stores a large number of dry sound audios, and for any dry sound audio (i.e. target dry sound audio), the power spectrum of each audio frame thereof can be calculated as follows:
(1) framing
The audio data of each audio frame may be represented as x_n(i) = x(n·M + i).
Here n is the frame index, M is the frame shift, i.e. the number of samples the next frame is shifted relative to the previous frame, i is the index of the sample within the n-th frame, with value range 0, 1, 2, …, L − 1, and L is the frame length, i.e. the number of samples in an audio frame. M may correspond to a duration of t_frmhop = 0.01 s (the frame interval duration), and L may correspond to a duration of 0.03 s.
(2) Windowing:
The windowing calculation formula may be xw_n(i) = x_n(i)·w(i).
Here w(i) represents a window function; a Hanning window may be used, with the expression:
w(i) = 0.5·(1 − cos(2πi/(L − 1))), i = 0, 1, …, L − 1.
(3) discrete Fourier transform:
The Fourier transform result of the n-th frame audio data xw_n(i) is:
X(n, k) = Σ_{i=0}^{L−1} xw_n(i)·e^{−j2πki/N}, k = 0, 1, …, N − 1,
where N represents the number of points of the Fourier transform; L and N may be set equal.
(4) Calculating a power spectrum:
P(n, k) = ||X(n, k)||^2, n = 0, 1, …, N_raw − 1, where N_raw represents the total number of frames of the current signal after the STFT (Short-Time Fourier Transform).
Here k denotes the frequency point index, and P(n, k) represents the power at the k-th frequency point of the n-th frame.
After determining the power spectrum of each audio frame, the audio frames to be detected may be screened based on the power spectrum of each audio frame. An audio frame to be detected may be an audio frame of the target dry audio whose power mean value is greater than the mute power threshold, where the power mean value is determined from the mean of the power values at all frequency points. The selection of the audio frames to be detected is described in detail below.
First, the average power sequence is calculated:
Pa(n) = (1/N_bins)·Σ_{k=0}^{N_bins−1} P(n, k),
where N_bins indicates the number of frequency points. The 1/N_bins term in the formula may also be removed.
A minimum power P_min = 1e−10 is set as the mute decision threshold, and the effective power sequence is found:
Pwr(n) = {Pa(n) | Pa(n) > P_min}.
The audio frame corresponding to each effective power value in Pwr(n) is an audio frame to be detected.
And 202, determining the existence probability of the human voice corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected.
A human voice fundamental frequency estimate corresponding to each audio frame to be detected is determined according to the power spectrum corresponding to that frame. For each audio frame to be detected, the power value of each frequency point in its power spectrum is multiplied by a weight to obtain the weighted power spectrum. The weight of the frequency points that are positive integer multiples of the human voice fundamental frequency estimate corresponding to the audio frame to be detected is greater than the weight of other frequency points. The human voice existence probability of each audio frame to be detected is then determined according to its power spectrum and the weighted power spectrum.
The process of determining the existence probability of the human voice may be performed as follows according to the steps shown in fig. 3.
2021, determining the human voice fundamental frequency estimation value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the preset human voice frequency characteristic information.
Wherein the human voice frequency characteristic information is the human voice fundamental frequency range. The frequency range of the normal human voice fundamental frequency is 40 Hz to 1500 Hz, so the minimum value of the human voice fundamental frequency can be set as f_min = 40 Hz and the maximum value as f_max = 1500 Hz. The corresponding frequency points are represented as:
k_fmin = ⌈f_min·N/f_s⌉, k_fmax = ⌊f_max·N/f_s⌋,
where f_s is the sampling rate.
Firstly, smoothing processing with preset window length is carried out on the power spectrum corresponding to each audio frame to be detected.
The subsequent step detects the peak of the fundamental frequency; the smoothing aims to filter out the small spurious peaks distributed around the main peaks at the fundamental frequency and its frequency multiples.
The power spectrum may be smoothed using a triangular window convolution operation with a triangular window of length M + 1. A smoothing kernel of M + 1 points is computed as:
h(m) = ⌈(M + 1)/2⌉ − |m − M/2|, m = 0, 1, …, M,
where ⌈·⌉ expresses rounding up. Further normalization processing is carried out, namely:
h̄(m) = h(m) / Σ_{m=0}^{M} h(m).
The smoothed power spectrum can then be expressed as:
Ps(n, k) = Σ_{m=0}^{M} h̄(m)·P(n, k − ⌊M/2⌋ + m).
The smoothing sequence length M here can be chosen according to the frequency resolution, for example in proportion to the frequency point index k_fmin corresponding to the minimum fundamental frequency.
and then, respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the human voice fundamental frequency range as the human voice fundamental frequency estimated value corresponding to each audio frame to be detected.
For the smoothed power spectrum corresponding to each audio frame to be detected, the peak positions are found as:
k_peak = arg localmax_k Ps(n, k),
where the arg function looks up the frequency points k at which Ps(n, k) is a local maximum. Among all k_peak falling within the fundamental frequency range [k_fmin, k_fmax], if there is more than one, the minimum k_peak is taken and denoted k_f0; the corresponding frequency value is used as the human voice fundamental frequency estimate. Using the frequency resolution parameter, the fundamental frequency f0 can be expressed as:
f0 = k_f0·f_s/N.
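The smoothing-then-lowest-peak estimate above can be sketched as follows. This is an illustrative sketch: the triangular-kernel width `M_smooth`, the strict local-maximum test, and the function name `estimate_f0` are assumptions, since the source gives the exact kernel only as an image.

```python
import numpy as np

def estimate_f0(P_frame, fs=16000, N=480, f_min=40.0, f_max=1500.0, M_smooth=8):
    """Smooth one frame's power spectrum with a triangular kernel, then take
    the lowest-frequency local peak inside [f_min, f_max] as the f0 estimate.
    Returns the estimate in Hz, or None if no peak lies in the range."""
    m = np.arange(M_smooth + 1)
    tri = (M_smooth / 2 + 1) - np.abs(m - M_smooth / 2)  # triangular kernel
    tri = tri / tri.sum()                                # normalize to sum 1
    Ps = np.convolve(P_frame, tri, mode="same")          # smoothed spectrum
    k_lo = int(np.ceil(f_min * N / fs))
    k_hi = min(int(np.floor(f_max * N / fs)), len(Ps) - 2)
    peaks = [k for k in range(max(k_lo, 1), k_hi + 1)
             if Ps[k] > Ps[k - 1] and Ps[k] >= Ps[k + 1]]  # local maxima
    if not peaks:
        return None
    k_f0 = min(peaks)               # smallest peak frequency point
    return k_f0 * fs / N            # f0 = k_f0 * fs / N
```

For instance, a frame whose power sits only at frequency points 12, 24, and 36 (harmonics of 400 Hz at a 16 kHz rate with N = 480) yields an estimate of 400 Hz.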
2022, according to the human voice fundamental frequency estimated value corresponding to each audio frame to be detected, constructing a weight coefficient function corresponding to each audio frame to be detected.
The weight coefficient function is used for representing weights corresponding to different frequency points, the waveform of the weight coefficient function has a plurality of wave crests, and the plurality of wave crests respectively correspond to positive integral multiples of the human voice fundamental frequency estimated value.
Alternatively, the weight coefficient function may be a trigonometric function.
One possible form is given here:
W(f) = 0.5·(1 + cos(2πf/f0)).
The corresponding discrete expression is as follows:
W_sin(k) = 0.5·(1 + cos(2πk/k_f0)),
whose peaks fall at frequency points that are positive integer multiples of k_f0.
2023, determining the existence probability of the human voice of each audio frame to be detected according to the power spectrum and the weight coefficient function corresponding to each audio frame to be detected.
In implementation, for each audio frame to be detected, the power spectrum corresponding to the audio frame is multiplied by the weight coefficient function to obtain the weighted power spectrum. The ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum is determined, and the ratio is normalized according to a preset ratio upper limit and ratio lower limit. The normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
The raw power spectrum is defined as P0(n, k) = Ps(n, k), and the weighted power spectrum is represented as P1(n, k) = Ps(n, k)·W_sin(k). An initial human voice existence parameter is calculated:
probv(n) = Σ_{k=0}^{K−1} P1(n, k) / Σ_{k=0}^{K−1} P0(n, k),
where K represents the frequency point index corresponding to the maximum frequency width. The ratio lower limit may be set to p_L = 0.2 and the ratio upper limit to p_U = 0.8. Normalizing the human voice existence parameter gives the human voice existence probability:
prob(n) = (p(n) − p_L)/(p_U − p_L),
where p(n) = max(p_L, min(p_U, probv(n))).
Based on the weight coefficient function constructed above, the following can be observed. If the audio frame is a human voice audio frame, Ps(n, k) and W_sin(k) are as shown in fig. 4. In a human voice audio frame the peaks appear at the fundamental frequency and its multiples (integer multiples of the fundamental frequency), so the peaks are uniformly spaced; with the above construction, the peaks of W_sin(k) also appear at the fundamental frequency and its multiples. The peaks of Ps(n, k) therefore align with the peaks of W_sin(k), and its valleys align with the valleys. P1(n, k), obtained after multiplication, is shown in fig. 5: W_sin(k) enlarges the peak values of Ps(n, k) and reduces the valley values, so although the total power is reduced, the reduction is small. If the audio frame is a non-human voice audio frame, Ps(n, k) and W_sin(k) are as shown in fig. 6. In a non-human voice audio frame the peaks are not uniformly spaced, whereas the peaks of W_sin(k) are, so many peaks and valleys of Ps(n, k) do not align with those of W_sin(k). P1(n, k), obtained after multiplication, is shown in fig. 7: here the reduction of the total power of Ps(n, k) by W_sin(k) can be relatively large. Based on these characteristics, the normalized ratio reflects the human voice existence probability of the audio frame. The four function graphs are illustrated using continuous function images for ease of viewing rather than discrete ones.
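The weighted-to-raw power ratio and its normalization can be sketched as follows, with p_L = 0.2 and p_U = 0.8 as in the text. The raised-cosine weight peaking at integer multiples of the f0 frequency point is an assumed form, since the source gives the weight function only as an image.

```python
import numpy as np

def voice_presence_prob(P_frame, k_f0, p_lo=0.2, p_hi=0.8):
    """Weighted-to-raw total power ratio, clipped to [p_lo, p_hi] and
    normalized to [0, 1]. The raised-cosine weight is an assumed form."""
    k = np.arange(len(P_frame))
    W = 0.5 * (1.0 + np.cos(2.0 * np.pi * k / k_f0))   # peaks at m * k_f0
    probv = float((P_frame * W).sum() / P_frame.sum()) # sum P1 / sum P0
    p = max(p_lo, min(p_hi, probv))
    return (p - p_lo) / (p_hi - p_lo)

# A harmonic frame (power only at multiples of k_f0) scores high;
# a flat, noise-like frame scores near the middle.
harm = np.zeros(241)
harm[[12, 24, 36]] = 1.0
p_harm = voice_presence_prob(harm, k_f0=12)
p_flat = voice_presence_prob(np.ones(241), k_f0=12)
```

This mirrors the figure discussion above: aligned peaks preserve most of the weighted power, while a flat spectrum keeps only about the mean weight.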
And 203, detecting a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected.
There are many methods for detecting the human voice audio frame and the non-human voice audio frame based on the human voice existence probability, for example, a threshold may be set, and if the human voice existence probability of the audio frame to be detected is greater than the threshold, it is determined as the human voice audio frame, and if the human voice existence probability is less than the threshold, it is determined as the non-human voice audio frame.
Alternatively, it can be detected as follows: detecting a voice audio frame in the audio frames to be detected of the target dry audio according to the voice existence probability and the voice detection probability threshold value corresponding to each audio frame to be detected; and detecting the non-human voice audio frames in the audio frames to be detected of the target dry audio according to the human voice existence probability and the non-human voice detection probability threshold value corresponding to each audio frame to be detected.
The method divides the detection of human voice audio frames and the detection of non-human voice audio frames into two processes, where each detection process may use either one threshold or two. A detection method using two thresholds is given below.
The process of detecting the human voice audio frame comprises the following steps: (wherein, the voice detection probability threshold value comprises a first threshold value and a second threshold value, and the first threshold value is larger than the second threshold value)
And acquiring the existence probability of the human voice corresponding to the audio frames to be detected one by one according to the sequence of time from first to last.
When the acquired first human voice existence probability is greater than the first threshold, and no human voice existence probability smaller than the second threshold or greater than the first threshold has been acquired before it, the audio frame to be detected corresponding to the first human voice existence probability is determined as a human voice starting audio frame.
When a first preset number of consecutively acquired human voice existence probabilities are all greater than the first threshold, the audio frame to be detected corresponding to a second human voice existence probability is determined as a human voice starting audio frame, where the second human voice existence probability is the first-acquired one among the first preset number of human voice existence probabilities.
After a human voice starting audio frame is determined, when a second preset number of consecutively acquired human voice existence probabilities are all smaller than the second threshold, the audio frame to be detected corresponding to a third human voice existence probability is determined as a human voice ending audio frame, where the third human voice existence probability is the one acquired immediately before the first-acquired probability among the second preset number of human voice existence probabilities.
After a human voice ending audio frame is determined, when a first preset number of consecutively acquired human voice existence probabilities are all greater than the first threshold, the audio frame to be detected corresponding to a fourth human voice existence probability is determined as a human voice starting audio frame, where the fourth human voice existence probability is the first-acquired one among the first preset number of human voice existence probabilities.
And determining the voice audio frames in all the audio frames to be detected according to the determined voice starting audio frame and the determined voice ending audio frame.
The process of detecting the non-human voice audio frames comprises the following steps (wherein the non-human voice detection probability threshold comprises a third threshold and a fourth threshold, the third threshold being smaller than the fourth threshold):
The human voice existence probabilities corresponding to the audio frames to be detected are acquired one by one in chronological order.
When the acquired fifth human voice existence probability is smaller than the third threshold, and no human voice existence probability greater than the fourth threshold or smaller than the third threshold has been acquired before it, the audio frame to be detected corresponding to the fifth human voice existence probability is determined as a non-human voice starting audio frame.
When a third preset number of consecutively acquired human voice existence probabilities are all smaller than the third threshold, the audio frame to be detected corresponding to a sixth human voice existence probability is determined as a non-human voice starting audio frame, where the sixth human voice existence probability is the first-acquired one among the third preset number of human voice existence probabilities.
After a non-human voice starting audio frame is determined, when a fourth preset number of consecutively acquired human voice existence probabilities are all greater than the fourth threshold, the audio frame to be detected corresponding to a seventh human voice existence probability is determined as a non-human voice ending audio frame, where the seventh human voice existence probability is the one acquired immediately before the first-acquired probability among the fourth preset number of human voice existence probabilities.
After a non-human voice ending audio frame is determined, when a third preset number of consecutively acquired human voice existence probabilities are all smaller than the third threshold, the audio frame to be detected corresponding to an eighth human voice existence probability is determined as a non-human voice starting audio frame, where the eighth human voice existence probability is the first-acquired one among the third preset number of human voice existence probabilities.
And determining the non-human voice audio frames in all the audio frames to be detected according to the determined non-human voice starting audio frames and the determined non-human voice ending audio frames.
The first threshold and the fourth threshold may be equal and may be set to 0.6, and the second threshold and the third threshold may be equal and may be set to 0.5.
The shortest mute duration within a human voice segment may be set as T_silmin, with corresponding frame number N_silmin = ⌈T_silmin/t_frmhop⌉; this is the second preset number. The shortest human voice duration (less than which is regarded as short-time noise) may be set as T_vocmin, with corresponding frame number N_vocmin = ⌈T_vocmin/t_frmhop⌉; this is the first preset number. The maximum length of an abnormal sound occurring in silence may be set as T_absmax, with corresponding frame number N_absmax = ⌈T_absmax/t_frmhop⌉; this is the fourth preset number. The shortest mute duration (more than which is considered to enter the mute region) may be set as T_mutemin, with corresponding frame number N_mutemin = ⌈T_mutemin/t_frmhop⌉; this is the third preset number.
Before the detection processing, the human voice existence probability sequence may be smoothed to implement denoising, and the detection is then performed. A spline curve S_b(m) with a smoothing window length M may be used for the smoothing, yielding the smoothed probability sequence probs(n); M may be 30.
The set of detected human voice audio frames above can be denoted Seg_voc, and the set of detected non-human voice audio frames can be denoted Seg_sil.
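The dual-threshold rules above amount to a hysteresis state machine with minimum run lengths. The following is a simplified sketch, not the full patented procedure: it keeps only the consecutive-run start/end rules, and the threshold and run-length defaults are illustrative stand-ins for the values discussed above.

```python
def detect_voice_frames(probs, thr_hi=0.6, thr_lo=0.5, n_on=3, n_off=5):
    """Hysteresis sketch of the dual-threshold rules above: n_on consecutive
    probabilities above thr_hi open a human voice segment (starting at the
    first frame of the run); n_off consecutive probabilities below thr_lo
    close it (the segment ends just before the run). One bool per frame."""
    voiced = [False] * len(probs)
    in_voice = False
    run_start, run_len = 0, 0
    for i, p in enumerate(probs):
        trigger = (p > thr_hi) if not in_voice else (p < thr_lo)
        if trigger:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= (n_on if not in_voice else n_off):
                if not in_voice:
                    for j in range(run_start, i + 1):  # run starts the segment
                        voiced[j] = True
                in_voice = not in_voice
                run_len = 0
        else:
            if in_voice:
                for j in range(run_start, i):  # an aborted low run stays voice
                    voiced[j] = True
                voiced[i] = True
            run_len = 0
    if in_voice:  # trailing frames of an open segment count as voice
        for j in range(run_start, len(probs)):
            voiced[j] = True
    return voiced
```

The voiced frames then form Seg_voc and the remaining frames form Seg_sil for the steps that follow.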
And 204, determining the signal-to-noise ratio estimation value of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame.
Firstly, determining a power average value corresponding to each voice audio frame and a power average value corresponding to each non-human audio frame, wherein the power average value is determined according to the average value of the power values of all frequency points.
Then, a first median of the power means corresponding to all the human voice audio frames is determined, and a second median of the power means corresponding to all the non-human voice audio frames is determined.
And finally, calculating the signal-to-noise ratio estimation value according to the ratio of the first median to the second median.
Specifically, the log power spectrum statistic of the human voice segments can be calculated as:
Pv = Q_1/2({log_x Pwr(n) | n ∈ Seg_voc}),
and the log power spectrum statistic of the non-human voice segments as:
Pn = Q_1/2({log_x Pwr(n) | n ∈ Seg_sil}),
where the base x can be set as needed and may be 10 or e, and Q_1/2(·) denotes taking the median, i.e. after sorting the current sequence, the value arranged in the middle is taken as the final output. The signal-to-noise ratio parameter is then calculated as:
snr = Pv − Pn.
A signal-to-noise ratio upper limit snr_U and lower limit snr_L are set, and the normalized signal-to-noise ratio parameter is further calculated as:
r_snr = g1((min(max(snr, snr_L), snr_U) − snr_L) / (snr_U − snr_L)),
where g1(·) is a correction function with a sharpening effect, and r_snr can be considered the signal-to-noise ratio estimate.
Since the sound in the non-human audio frame in the dry audio can be considered as noise, the signal-to-noise ratio of the dry audio can be approximately represented by the ratio of the power information of the human audio frame and the non-human audio frame.
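The median-based SNR estimate above can be sketched as follows. The median-of-log structure comes from the text; the dB bounds (10 and 50) and the squaring used for the sharpening correction g1 are illustrative assumptions, since the source gives them only as images.

```python
import numpy as np

def snr_estimate(pwr_voc, pwr_sil, snr_lo=10.0, snr_hi=50.0):
    """SNR estimate from the medians of the log frame powers of the human
    voice and non-human voice segments, clipped and normalized to [0, 1]."""
    Pv = float(np.median(np.log10(pwr_voc)))   # human voice segment statistic
    Pn = float(np.median(np.log10(pwr_sil)))   # non-human voice statistic
    snr = 10.0 * (Pv - Pn)                     # ratio of the medians, in dB
    r = (min(max(snr, snr_lo), snr_hi) - snr_lo) / (snr_hi - snr_lo)
    return r * r                               # assumed sharpening correction
```

For example, voice frames at unit power against residual frames at 1e−4 give a 40 dB ratio, which lands well inside the assumed bounds.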
And 205, determining the human voice clarity of the target dry audio according to the human voice existence probabilities corresponding to the human voice audio frames.
Specifically, the human voice clarity of the target dry audio can be determined from the median of the human voice existence probabilities corresponding to all human voice audio frames. That is, the greater the human voice existence probabilities of the human voice audio frames, the higher the human voice clarity.
The corresponding human voice clarity can be expressed as:
clarity = Q_1/2({probs(n) | n ∈ Seg_voc}).
and 206, determining the product of the signal-to-noise ratio estimation value and the human sound definition as the human sound quality information of the target dry audio.
Wherein the human voice quality information can be regarded as a body score in the audio quality information.
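Steps 205 and 206 above reduce to a median and a product; a minimal sketch follows (the function name `voice_quality` is illustrative):

```python
import numpy as np

def voice_quality(snr_est, probs_voc):
    """Human voice quality = SNR estimate times clarity, where clarity is
    the median voice existence probability over the human voice frames."""
    clarity = float(np.median(probs_voc))
    return snr_est * clarity
```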
207, determining a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frame.
The noise penalty parameter is determined by the noise intensity: the greater the noise, the larger the noise penalty parameter, which can be a value in the range 0 to 1. The noise here can refer to various noises, such as common environmental noise.
The penalty due to background noise may be calculated as follows. First, the power mean value corresponding to each non-human voice audio frame is determined, where the power mean value is determined from the mean of the power values at all frequency points. Then, the median of the power mean values corresponding to all the non-human voice audio frames is determined. Finally, the noise penalty parameter of the target dry audio is determined from this median of the power mean values; the noise penalty parameter increases with the median of the power mean value.
The log power upper limit P_nU and lower limit P_nL of the ambient noise can be defined. When the non-human voice energy is too large (considered non-negligible when greater than the lower limit), the noise penalty parameter is calculated as:
β_n = g((min(max(Pn, P_nL), P_nU) − P_nL) / (P_nU − P_nL)),
where g(·) is a correction function with a sharpening effect.
When calculating the noise penalty parameter, all the non-human voice audio frames may also be divided into a plurality of segments, the noise penalty parameter calculated for each segment according to the above method, and the maximum among the segment noise penalty parameters selected as the noise penalty parameter of the target dry audio.
The segmentation can be based on a preset number of frames, or consecutive non-human voice audio frames can be grouped into one segment.
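The per-segment noise penalty with a maximum over segments can be sketched as follows. The log-power bounds and the squaring used for the correction g are illustrative assumptions (the source gives them only as images); the penalty grows with the noise level and lies in [0, 1], matching the description above.

```python
import numpy as np

def noise_penalty(sil_power_segments, logp_lo=-6.0, logp_hi=-2.0):
    """Per non-human voice segment, map the median log frame power to
    [0, 1] (louder noise -> larger penalty); take the maximum over segments."""
    best = 0.0
    for seg in sil_power_segments:
        med = float(np.median(np.log10(seg)))      # median frame power (log)
        t = (med - logp_lo) / (logp_hi - logp_lo)  # map to [0, 1]
        t = min(max(t, 0.0), 1.0)
        best = max(best, t * t)                    # assumed sharpening g
    return best
```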
208, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected.
The specific processing may be as follows:
First, a first power penalty sub-parameter is determined.
The audio frames to be detected may be the audio frames in the target dry audio whose power mean exceeds the mute power threshold. The first power penalty sub-parameter is determined according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold: when the ratio is smaller than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio; when the ratio is larger than the ratio threshold, the first power penalty sub-parameter is a fixed value.
The minimum effective audio length can be defined as T_min = 1 s, and the corresponding frame number N_frmmin is calculated as:
Figure BDA0003175820660000211
The frame number N_a of the effective power sequence Pwr(n) is obtained. If N_a < N_frmmin, the entire audio is regarded as having too-low input energy (insufficient effective high-energy audio data), and the audio quality information can be directly set to 0.
If N_a ≥ N_frmmin, the effective power frame ratio can be further calculated as:
Figure BDA0003175820660000212
If the ratio is too low, i.e. below the ratio threshold (e.g. 0.1), the first power penalty sub-parameter is determined as β_a = r_a + 0.9; otherwise there is no penalty here, i.e. the first power penalty sub-parameter is β_a = 1.
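This sub-parameter reduces to a short function; a minimal sketch following the formula in the text (β_a = r_a + 0.9 below the ratio threshold, otherwise 1), where the threshold value 0.1 is the example given above:

```python
R_THRESH = 0.1  # example ratio threshold from the text

def first_power_penalty(valid_frames, total_frames):
    """beta_a from the effective-frame ratio r_a = valid/total.

    If the ratio of frames whose power mean exceeds the mute threshold
    is at most R_THRESH, beta_a = r_a + 0.9, so a lower ratio gives a
    stronger penalty; otherwise there is no penalty and beta_a = 1.
    """
    r_a = valid_frames / total_frames
    return r_a + 0.9 if r_a <= R_THRESH else 1.0
```

At r_a = 0.1 both branches give 1.0, so the mapping is continuous at the threshold.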
Then, a second power penalty sub-parameter and a third power penalty sub-parameter are determined.
The power mean corresponding to each audio frame to be detected is determined, where the power mean is the average of the power values of all frequency points. The average of the power means corresponding to all the audio frames to be detected is then determined to obtain the total power mean of the target dry audio. The second and third power penalty sub-parameters are determined according to the total power mean and preset upper and lower power limits: when the total power mean is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean minus the power upper limit; when the total power mean is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean; and when the total power mean lies between the power lower limit and the power upper limit, both the second and the third power penalty sub-parameters are fixed values.
The calculation process may be as follows:
the average power value of the entire audio is calculated as:
Figure BDA0003175820660000213
where the logarithm base x can be set as required, for example 10 or e.
Setting a lower power limit
Figure BDA0003175820660000214
and an upper power limit
Figure BDA0003175820660000215
(1) Determination of excessive power
If
Figure BDA0003175820660000216
then the overall energy is considered too large, and the following processing is performed: a maximum threshold is set
Figure BDA0003175820660000221
The probability of excessive power is expressed as
Figure BDA0003175820660000222
The second power penalty sub-parameter is calculated as β_U = 1 − r_Uextrem.
If
Figure BDA0003175820660000223
then the second power penalty sub-parameter is β_U = 1.
(2) Determination of insufficient power
If
Figure BDA0003175820660000224
then the overall energy is considered too small, and the following processing is performed: a minimum threshold is set
Figure BDA0003175820660000225
The probability of insufficient power is expressed as
Figure BDA0003175820660000226
The third power penalty sub-parameter is calculated as β_L = 1 − r_Lextrem.
If
Figure BDA0003175820660000227
then the third power penalty sub-parameter is β_L = 1.
Finally, the product of the first, second and third power penalty sub-parameters is determined as the power penalty parameter of the target dry audio: β_W = β_a·β_U·β_L.
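The over/under-power determination and the combined penalty β_W = β_a·β_U·β_L can be sketched as follows. The concrete limits and extreme thresholds are assumptions, since the patent's values appear only as equation images:

```python
import numpy as np

# Assumed log-power limits and extreme thresholds (illustrative only).
PWR_HI, PWR_LO = -10.0, -50.0            # overall upper / lower limits
PWR_HI_EXTREME, PWR_LO_EXTREME = -5.0, -60.0

def power_penalties(frame_log_powers):
    """Return (beta_U, beta_L) from per-frame mean log-powers.

    If the overall mean is at or above the upper limit, beta_U is
    1 minus the fraction of frames above the maximum threshold
    (r_Uextrem); the symmetric rule gives beta_L for low power.
    """
    p = np.asarray(frame_log_powers, dtype=float)
    mean_pwr = p.mean()
    beta_u = beta_l = 1.0
    if mean_pwr >= PWR_HI:                   # overall energy too large
        r_u = np.mean(p > PWR_HI_EXTREME)    # excessive-power probability
        beta_u = 1.0 - r_u
    if mean_pwr <= PWR_LO:                   # overall energy too small
        r_l = np.mean(p < PWR_LO_EXTREME)    # insufficient-power probability
        beta_l = 1.0 - r_l
    return beta_u, beta_l

def power_penalty(beta_a, beta_u, beta_l):
    """beta_W is the product of the three sub-parameters."""
    return beta_a * beta_u * beta_l
```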
When calculating the power penalty parameter, only one of the first, second and third power penalty sub-parameters, or the product of any two of them, may also be selected. The power penalty parameter is a value in the range 0-1, and the first, second and third power penalty sub-parameters are likewise values in the range 0-1.
Optionally, when determining the power penalty parameter, only one or two of the three power penalty sub-parameters may be used, and power penalty sub-parameters other than these three may also be used.
209, determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter and the power penalty parameter.
After the noise penalty parameter and the power penalty parameter are determined, they may be multiplied to obtain the final penalty parameter: β = β_W·β_bkn.
The audio quality information of the target dry audio may then be expressed as: p_clean = β·r_snr·r_c.
Optionally, the scheme may also directly use the human voice quality information as the audio quality information of the dry audio without considering the noise penalty parameter and the power penalty parameter.
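The combination β = β_W·β_bkn and p_clean = β·r_snr·r_c above reduces to a product of factors; a trivial sketch, assuming each factor has already been normalized into [0, 1]:

```python
def audio_quality(beta_w, beta_bkn, r_snr, r_c):
    """p_clean = (beta_W * beta_bkn) * r_snr * r_c.

    beta_w: power penalty, beta_bkn: noise penalty,
    r_snr: SNR estimate, r_c: human voice clarity; the SNR and
    clarity product is the human voice quality (main body score).
    """
    beta = beta_w * beta_bkn     # final penalty parameter
    return beta * r_snr * r_c
```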
Fig. 8 is a schematic diagram of the detection process of the audio quality information described above.
In addition, in the detection process of the human voice audio frame and the non-human voice audio frame, there are two possible situations, and the calculation process of the corresponding audio quality information may be as follows:
Case one: no human voice audio frame is detected.
And determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected. And determining the average value of the existence probabilities of the voices corresponding to all the audio frames to be detected. And determining the audio quality information of the target dry audio according to the average value of the existence probability of the human voice and the power penalty parameter. Since the human voice audio distribution is relatively stable (has a short-time stationary characteristic), an average value with higher processing efficiency can be adopted. Of course, median values may also be used.
In this case, the overall sound quality is considered to be poor, the user may not sing, and the obtained audio data is accompaniment or other noise. In this case, the power penalty parameter may be calculated according to the above method, and in addition, a main body score may be calculated based on the human voice existence probability of the non-human voice audio frames (all the audio frames to be detected are non-human voice frames), and the power penalty parameter and the main body score may be multiplied to obtain the audio quality information. The audio quality information calculation formula may be as follows:
Figure BDA0003175820660000231
Case two: no non-human voice audio frame is detected.
The power penalty parameter of the target dry audio is determined according to the power spectrum corresponding to each audio frame to be detected. The median of the human voice existence probabilities corresponding to all the audio frames to be detected is determined. The audio quality information of the target dry audio is then determined according to this median and the power penalty parameter. Some probability values among the audio frames may change abnormally, and the median is used here to prevent individual extreme probability values from affecting the final result too much.
In this case, the audio frames to be detected can all be considered human voice audio frames; however, non-voice components often exist in actual singing, so false detections may exist. The power penalty parameter may be calculated according to the method above, a main body score may be calculated based on the human voice existence probabilities of the human voice audio frames (all the audio frames to be detected are human voice frames), and the power penalty parameter and the main body score may be multiplied to obtain the audio quality information. The audio quality information calculation formula may be as follows:
p_clean = β_W · 0.9 · Q_{1/2}(probs(n)), where Q_{1/2}(·) denotes the median.
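The two edge cases can be sketched as follows; the exact scaling in case one is given only as an equation image, so a plain mean is assumed there:

```python
import numpy as np

def quality_no_voice(beta_w, probs):
    """Case one: no voice frames detected.

    The body score is based on the mean voice-presence probability of
    the frames (the exact scaling in the patent is an equation image;
    a direct mean is assumed here).
    """
    return beta_w * float(np.mean(probs))

def quality_all_voice(beta_w, probs):
    """Case two: no non-voice frames detected.

    p_clean = beta_W * 0.9 * median(probs); the median guards against
    a few extreme probability values among the frames.
    """
    return beta_w * 0.9 * float(np.median(probs))
```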
Based on the above process, audio quality information can be calculated for the user-uploaded dry audio stored in the database, and whether each dry audio needs to be deleted can then be determined from its audio quality information. The specific deletion mechanism may be set arbitrarily as required. For example, dry audio whose audio quality information is lower than a preset threshold is deleted. As another example, for an account that has not logged in for more than a first preset time period, dry audio that has not been accessed for more than a second preset time period is obtained, and if its audio quality information is lower than the preset threshold, that dry audio is deleted. As yet another example, each dry audio is scored by weighting information of multiple dimensions such as the audio quality information, access volume and account activity, and dry audio whose score is lower than a preset score threshold is deleted.
In addition, the audio quality information may be stored and used as a reference attribute when recommending dry audio. Specifically, the audio quality information and other attribute information of a dry audio can be input into a first feature extraction model to obtain feature information of the dry audio; the account attributes of a target user account are input into a second feature extraction model to obtain feature information of the user account; the two pieces of feature information are then input into a scoring model to obtain a matching-degree score between the dry audio and the user account; and the dry audio recommended to the user account is determined based on the matching-degree score of each dry audio.
According to the embodiment of the application, the human voice existence probability of each audio frame in the dry audio is identified through its power spectrum, human voice audio frames and non-human voice audio frames are identified accordingly, and the audio quality information of the dry audio is determined based on the power spectra of the human voice audio frames and of the non-human voice audio frames. Because high-quality dry audio is close to silence in its non-voice parts, the audio quality of the dry audio can be judged more accurately based on the power conditions of the human voice audio frames and the non-human voice audio frames.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
An embodiment of the present application further provides an apparatus for detecting audio quality, where the apparatus may be applied to the server in the foregoing embodiment, and as shown in fig. 9, the apparatus includes:
a determining module 910, configured to determine, according to a power spectrum corresponding to each audio frame to be detected of the target dry audio, a human voice fundamental frequency estimated value corresponding to each audio frame to be detected, where the power spectrum includes power values of each frequency point;
the weighting module 920 is configured to perform weighting processing on the power value of each frequency point in the power spectrum of each audio frame to be detected, so as to obtain a power spectrum after weighting processing, where a weight of a frequency point which is a positive integer multiple of a human voice fundamental frequency estimated value corresponding to the audio frame to be detected is greater than weights of other frequency points;
a probability module 930, configured to determine a voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after weighting processing;
the detecting module 940 is configured to detect a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
a quality module 950, configured to determine the audio quality information of the target dry audio according to the power spectrum corresponding to the human audio frame and the power spectrum corresponding to the non-human audio frame.
In a possible implementation manner, the determining module 910 is configured to:
and determining a human voice base frequency estimation value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information.
In a possible implementation manner, the human voice frequency characteristic information is a human voice fundamental frequency range, and the determining module 910 is configured to:
performing smoothing treatment of a preset window length on a power spectrum corresponding to each audio frame to be detected;
and respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the human voice fundamental frequency range as the human voice fundamental frequency estimated value corresponding to each audio frame to be detected.
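The fundamental-frequency estimation step above (smoothing the power spectrum, then taking the minimum peak frequency within the human voice fundamental range) can be sketched as follows; the window length and the 80-400 Hz range are illustrative assumptions, not values from the patent:

```python
import numpy as np

F0_LO, F0_HI = 80.0, 400.0   # assumed human-voice fundamental range (Hz)

def estimate_f0(power_spectrum, freqs, win=5):
    """Smooth the power spectrum with a moving average of length `win`,
    then return the lowest-frequency local peak of the smoothed
    spectrum inside [F0_LO, F0_HI] as the fundamental-frequency
    estimate, or None if no peak is found in the range."""
    kernel = np.ones(win) / win
    smooth = np.convolve(power_spectrum, kernel, mode="same")
    for i in range(1, len(smooth) - 1):
        if F0_LO <= freqs[i] <= F0_HI and smooth[i - 1] < smooth[i] > smooth[i + 1]:
            return float(freqs[i])   # minimum peak frequency in range
    return None
```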
In a possible implementation manner, the weighting module 920 is configured to:
according to the voice base frequency estimated value corresponding to each audio frame to be detected, constructing a weight coefficient function corresponding to each audio frame to be detected, wherein the weight coefficient function is used for representing weights corresponding to different frequency points, the waveform of the weight coefficient function has a plurality of wave crests, and the plurality of wave crests respectively correspond to positive integral multiples of the voice base frequency estimated value;
and for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame to be detected by the weight coefficient function to obtain the power spectrum after weighting processing.
In one possible implementation, the weight coefficient function is a trigonometric function.
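One natural trigonometric choice for the weight coefficient function is a raised cosine whose peaks fall on integer multiples of the fundamental-frequency estimate; the exact function in the patent is given only as an equation image, so this is an assumed instance:

```python
import numpy as np

def harmonic_weights(freqs, f0):
    """Trigonometric weight coefficient function: a raised cosine whose
    peaks (weight 1) fall on positive integer multiples of the
    fundamental-frequency estimate f0, and whose troughs (weight 0)
    fall halfway between harmonics."""
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * np.asarray(freqs) / f0))

def weighted_power(power_spectrum, freqs, f0):
    """Multiply the power spectrum by the weights, giving harmonic
    frequency points larger weights than the other points."""
    return np.asarray(power_spectrum) * harmonic_weights(freqs, f0)
```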
In one possible implementation, the probability module 930 is configured to:
for each audio frame to be detected, determining the ratio of the total power of the power spectrum subjected to weighting processing to the total power of the power spectrum not subjected to weighting processing, and performing normalization processing on the ratio according to a preset ratio upper limit and a preset ratio lower limit to obtain a normalized ratio which is used as the existence probability of the human voice corresponding to the audio frame to be detected.
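The normalized weighted-to-unweighted power ratio described above can be sketched as follows; the normalization bounds are assumptions, since the patent does not state their values here:

```python
import numpy as np

RATIO_LO, RATIO_HI = 0.2, 0.6   # assumed ratio lower / upper bounds

def voice_presence_prob(power_spectrum, weighted_spectrum):
    """Ratio of the weighted total power to the unweighted total power,
    linearly normalized into [0, 1] using the preset bounds. A voiced
    frame concentrates power at the harmonic weight peaks, so its
    ratio, and hence its probability, is high."""
    ratio = np.sum(weighted_spectrum) / np.sum(power_spectrum)
    return float(np.clip((ratio - RATIO_LO) / (RATIO_HI - RATIO_LO), 0.0, 1.0))
```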
In a possible implementation manner, the detecting module 940 is configured to:
detecting a voice audio frame in the audio frames to be detected of the target dry audio according to the voice existence probability and the voice detection probability threshold value corresponding to each audio frame to be detected;
and detecting the non-human voice audio frames in the audio frames to be detected of the target dry audio according to the human voice existence probability and the non-human voice detection probability threshold value corresponding to each audio frame to be detected.
In a possible implementation manner, the quality module 950 is further configured to: determine a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames; and determine a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected;
the quality module 950 is configured to:
determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame;
and determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter and the power penalty parameter.
In one possible implementation, the quality module 950 is configured to:
determining a signal-to-noise ratio estimation value of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame;
determining the human sound definition of the target dry sound according to the human sound existence probability corresponding to the human sound audio frame;
and determining the product of the signal-to-noise ratio estimation value and the human voice definition as the human voice quality information of the target dry audio.
In one possible implementation, the quality module 950 is configured to:
determining a power average value corresponding to each voice audio frame and a power average value corresponding to each non-human audio frame, wherein the power average value is determined according to an average value of power values of all frequency points;
determining a first median of the power means corresponding to all the human voice audio frames and determining a second median of the power means corresponding to all the non-human voice audio frames;
and calculating a signal-to-noise ratio estimation value according to the ratio of the first median to the second median.
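The median-based signal-to-noise estimate above can be sketched as follows; expressing it in dB, and any subsequent normalization into a score, are assumptions:

```python
import numpy as np

def snr_estimate(voice_frame_powers, nonvoice_frame_powers):
    """SNR estimate from the ratio of the median per-frame power of the
    voice frames (first median) to that of the non-voice frames
    (second median), expressed in dB. Medians keep the estimate
    robust to a few outlier frames."""
    p_voice = float(np.median(voice_frame_powers))
    p_noise = float(np.median(nonvoice_frame_powers))
    return 10.0 * np.log10(p_voice / p_noise)
```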
In one possible implementation, the quality module 950 is configured to:
and determining the human sound definition of the target dry sound frequency according to the median of the human sound existence probability corresponding to all human sound audio frames.
In a possible implementation manner, the quality module 950 is further configured to:
if no human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected; determining the average value of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry sound audio according to the average value of the existence probability of the human voice and the power penalty parameter.
In a possible implementation manner, the quality module 950 is further configured to:
if the non-human voice audio frames are not detected, determining a power penalty parameter of the target dry audio according to a power spectrum corresponding to each audio frame to be detected; determining the median of the existence probabilities of the voices corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry sound audio according to the median of the existence probability of the human voice and the power penalty parameter.
In one possible implementation, the quality module 950 is configured to:
determining a power mean value corresponding to each non-human audio frame, wherein the power mean value is determined according to the mean value of the power values of all frequency points;
determining the median of the power mean values corresponding to all the non-human voice audio frames;
and determining a noise penalty parameter of the target dry sound audio according to the median of the power mean, wherein the noise penalty parameter is inversely related to the median of the power mean.
In a possible implementation manner, the audio frame to be detected is an audio frame in which a power mean value in the target dry audio is greater than a mute power threshold, wherein the power mean value is determined according to an average value of power values of each frequency point;
the quality module 950 is configured to:
determining a power mean value corresponding to each audio frame to be detected, wherein the power mean value is determined according to the mean value of the power values of all frequency points;
determining the average value of the power average values corresponding to all audio frames to be detected to obtain the total power average value of the target trunk audio;
and determining the power penalty parameter of the target dry audio according to the total power mean value and the ratio of the number of the audio frames to be detected to the total number of the audio frames of the target dry audio.
In one possible implementation, the quality module 950 is configured to:
determining a first power penalty sub-parameter according to the ratio of the number of the audio frames to be detected to the total number of the audio frames of the target dry audio and a preset ratio threshold, wherein when the ratio is smaller than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is larger than the ratio threshold, the first power penalty sub-parameter is a fixed value;
determining a second power penalty sub-parameter and a third power penalty sub-parameter according to the total power mean value and preset power upper and lower limits, wherein when the total power mean value is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean value minus the power upper limit; when the total power mean value is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean value; and when the total power mean value is less than the power upper limit and greater than the power lower limit, the second power penalty sub-parameter and the third power penalty sub-parameter are both fixed values;
and determining the product of the first power penalty sub-parameter, the second power penalty sub-parameter and the third power penalty sub-parameter as the power penalty parameter of the target dry audio.
According to the embodiment of the application, the human voice existence probability of each audio frame in the dry audio is identified through its power spectrum, human voice audio frames and non-human voice audio frames are identified accordingly, and the audio quality information of the dry audio is determined based on the power spectra of the human voice audio frames and of the non-human voice audio frames. Because high-quality dry audio is close to silence in its non-voice parts, the audio quality of the dry audio can be judged more accurately based on the power conditions of the human voice audio frames and the non-human voice audio frames.
It should be noted that: in the apparatus for detecting audio quality according to the foregoing embodiment, when detecting audio quality, only the division of the functional modules is described as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the above described functions. In addition, the apparatus for detecting audio quality and the method for detecting audio quality provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1000 may vary considerably in configuration and performance, and may include one or more processors 1001 and one or more memories 1002, where the memory 1002 stores at least one instruction that is loaded and executed by the processor 1001 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for performing input and output, and may include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor in a terminal to perform the method of detecting audio quality in the above embodiments is also provided. The computer readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (18)

1. A method of detecting audio quality, the method comprising:
determining a human voice base frequency estimation value corresponding to each audio frame to be detected according to a power spectrum corresponding to each audio frame to be detected of the target dry sound audio, wherein the power spectrum comprises power values of all frequency points;
for each audio frame to be detected, performing weight multiplication processing on the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a power spectrum after weight multiplication processing, wherein the weight of the frequency point of positive integral multiple of the human voice base frequency estimated value corresponding to the audio frame to be detected is greater than the weight of other frequency points;
determining the existence probability of the human voice of each audio frame to be detected according to the corresponding power spectrum of each audio frame to be detected and the power spectrum after weighting processing;
detecting a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
and determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame.
2. The method according to claim 1, wherein determining the human voice fundamental frequency estimation value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected of the target dry audio comprises:
and determining a human voice base frequency estimation value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information.
3. The method according to claim 2, wherein the human voice frequency characteristic information is a human voice fundamental frequency range, and the determining the human voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information comprises:
performing smoothing treatment of a preset window length on a power spectrum corresponding to each audio frame to be detected;
and respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the human voice fundamental frequency range as the human voice fundamental frequency estimated value corresponding to each audio frame to be detected.
4. The method according to claim 1, wherein for each audio frame to be detected, performing weighting processing on a power value of each frequency point in a power spectrum of the audio frame to be detected to obtain a power spectrum after weighting processing, and the method comprises:
according to the voice base frequency estimated value corresponding to each audio frame to be detected, constructing a weight coefficient function corresponding to each audio frame to be detected, wherein the weight coefficient function is used for representing weights corresponding to different frequency points, the waveform of the weight coefficient function has a plurality of wave crests, and the plurality of wave crests respectively correspond to positive integral multiples of the voice base frequency estimated value;
and for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame to be detected by the weight coefficient function to obtain the power spectrum after weighting processing.
5. The method of claim 4, wherein the weight coefficient function is a trigonometric function.
6. The method according to claim 1, wherein determining the existence probability of the human voice of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after the weighting processing comprises:
for each audio frame to be detected, determining the ratio of the total power of the power spectrum subjected to weighting processing to the total power of the power spectrum not subjected to weighting processing, and performing normalization processing on the ratio according to a preset ratio upper limit and a preset ratio lower limit to obtain a normalized ratio which is used as the existence probability of the human voice corresponding to the audio frame to be detected.
7. The method according to claim 1, wherein detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected comprises:
detecting the human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a human voice detection probability threshold;
and detecting the non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a non-human voice detection probability threshold.
8. The method of claim 1, further comprising: determining a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames; and determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected;
wherein determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
and determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter and the power penalty parameter.
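Claim 8 does not fix how the three quantities are combined into a final score; a simple multiplicative combination is one plausible reading, sketched here purely as an assumption.

```python
def audio_quality(voice_quality, noise_penalty, power_penalty):
    """Combine the human-voice quality score with the two penalty
    factors of claim 8. Multiplication is an illustrative choice:
    either penalty falling toward 0 drags the overall score down."""
    return voice_quality * noise_penalty * power_penalty
```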
9. The method according to claim 8, wherein determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a signal-to-noise ratio estimation value of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames;
and determining the product of the signal-to-noise ratio estimation value and the human voice clarity as the human voice quality information of the target dry audio.
10. The method of claim 9, wherein determining the signal-to-noise ratio estimation value of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power mean value corresponding to each human voice audio frame and a power mean value corresponding to each non-human voice audio frame, wherein the power mean value is determined according to the average of the power values of all frequency points;
determining a first median of the power mean values corresponding to all human voice audio frames, and determining a second median of the power mean values corresponding to all non-human voice audio frames;
and calculating the signal-to-noise ratio estimation value according to the ratio of the first median to the second median.
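The median-of-frame-power SNR estimate of claim 10 can be sketched as below. Each frame is taken to be an array of per-bin power values; the claim only requires a ratio of the two medians, so returning the raw ratio (rather than a dB value) is itself an assumption.

```python
import numpy as np

def snr_estimate(voice_frames, non_voice_frames):
    """First median (per-frame power means of the human voice frames)
    divided by the second median (per-frame power means of the
    non-human-voice frames), per claim 10. Medians make the estimate
    robust to a few outlier frames, unlike plain averages."""
    first_median = np.median([np.mean(f) for f in voice_frames])
    second_median = np.median([np.mean(f) for f in non_voice_frames])
    return first_median / max(second_median, 1e-12)
```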
11. The method of claim 9, wherein determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames comprises:
determining the human voice clarity of the target dry audio according to the median of the human voice existence probabilities corresponding to all human voice audio frames.
12. The method of claim 1, further comprising:
if no human voice audio frame is detected, determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected; determining the average of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the average of the human voice existence probabilities and the power penalty parameter.
13. The method of claim 1, further comprising:
if no non-human voice audio frame is detected, determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected; determining the median of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the median of the human voice existence probabilities and the power penalty parameter.
14. The method according to any one of claims 8 to 13, wherein determining the noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power mean value corresponding to each non-human voice audio frame, wherein the power mean value is determined according to the average of the power values of all frequency points;
determining the median of the power mean values corresponding to all non-human voice audio frames;
and determining the noise penalty parameter of the target dry audio according to the median of the power mean values, wherein the noise penalty parameter is negatively correlated with the median of the power mean values.
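One way to realize the negative correlation required by claim 14 is sketched below; the reciprocal mapping and the reference power are placeholder assumptions, since the claim fixes only the direction of the relationship.

```python
import numpy as np

def noise_penalty(non_voice_frames, ref_power=1e-4):
    """Penalty factor in (0, 1] that shrinks as the median per-frame
    power mean of the non-human-voice frames grows. Claim 14 only
    requires a negative correlation; this particular curve and the
    reference power ref_power are hypothetical."""
    median_power = np.median([np.mean(f) for f in non_voice_frames])
    return 1.0 / (1.0 + median_power / ref_power)
```

Quiet background frames leave the penalty near 1 (no deduction), while loud background noise drives it toward 0.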
15. The method according to any one of claims 8 to 13, wherein the audio frames to be detected are the audio frames of the target dry audio whose power mean value is greater than a mute power threshold, wherein the power mean value is determined according to the average of the power values of all frequency points;
and wherein determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected comprises:
determining the power mean value corresponding to each audio frame to be detected, wherein the power mean value is determined according to the average of the power values of all frequency points;
determining the average of the power mean values corresponding to all audio frames to be detected to obtain the total power mean value of the target dry audio;
and determining the power penalty parameter of the target dry audio according to the total power mean value and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio.
16. The method according to claim 15, wherein determining the power penalty parameter of the target dry audio according to the total power mean value and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio comprises:
determining a first power penalty sub-parameter according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold, wherein, when the ratio is smaller than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is greater than the ratio threshold, the first power penalty sub-parameter is a fixed value;
determining a second power penalty sub-parameter and a third power penalty sub-parameter according to the total power mean value and a preset power upper limit and power lower limit, wherein, when the total power mean value is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean value minus the power upper limit; when the total power mean value is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean value; and when the total power mean value is less than the power upper limit and greater than the power lower limit, the second power penalty sub-parameter and the third power penalty sub-parameter are both fixed values;
and determining the product of the first power penalty sub-parameter, the second power penalty sub-parameter and the third power penalty sub-parameter as the power penalty parameter of the target dry audio.
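The three sub-parameters of claim 16 might look like the sketch below. Every threshold and slope is a placeholder: the claim fixes only the monotonic relations in each region and the fixed-value regions between them.

```python
def power_penalty(total_power_mean_db, detected_ratio,
                  ratio_threshold=0.5,
                  power_upper_db=-10.0, power_lower_db=-40.0):
    """Product of the three penalty sub-parameters of claim 16.
    All thresholds and the linear 0.05/dB slopes are hypothetical."""
    # First sub-parameter: shrinks as the share of non-silent frames
    # falls below the ratio threshold; fixed at 1 above it.
    if detected_ratio <= ratio_threshold:
        p1 = max(0.0, 1.0 - (ratio_threshold - detected_ratio))
    else:
        p1 = 1.0
    # Second sub-parameter: shrinks as total power exceeds the upper limit
    # (recording too hot); fixed at 1 below it.
    if total_power_mean_db >= power_upper_db:
        p2 = max(0.0, 1.0 - 0.05 * (total_power_mean_db - power_upper_db))
    else:
        p2 = 1.0
    # Third sub-parameter: shrinks as total power falls below the lower
    # limit (recording too quiet); fixed at 1 above it.
    if total_power_mean_db <= power_lower_db:
        p3 = max(0.0, 1.0 - 0.05 * (power_lower_db - total_power_mean_db))
    else:
        p3 = 1.0
    return p1 * p2 * p3
```

A recording with enough non-silent frames and a total power inside the preset band incurs no deduction; clipping-level loudness, near-silence, or a mostly empty take each pulls its own factor below 1.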
17. A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to perform the operations performed by the method for detecting audio quality according to any one of claims 1 to 16.
18. A computer-readable storage medium, wherein the storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to perform the operations performed by the method for detecting audio quality according to any one of claims 1 to 16.
CN202110831738.2A 2021-07-22 2021-07-22 Method, device and storage medium for detecting audio quality Pending CN113593604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110831738.2A CN113593604A (en) 2021-07-22 2021-07-22 Method, device and storage medium for detecting audio quality


Publications (1)

Publication Number Publication Date
CN113593604A true CN113593604A (en) 2021-11-02

Family

ID=78249051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110831738.2A Pending CN113593604A (en) 2021-07-22 2021-07-22 Method, device and storage medium for detecting audio quality

Country Status (1)

Country Link
CN (1) CN113593604A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08221092A (en) * 1995-02-17 1996-08-30 Hitachi Ltd Noise eliminating system using spectral subtraction
CN104269180A (en) * 2014-09-29 2015-01-07 华南理工大学 Quasi-clean voice construction method for voice quality objective evaluation
EP2830064A1 (en) * 2013-07-22 2015-01-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for decoding and encoding an audio signal using adaptive spectral tile selection
CN106653048A (en) * 2016-12-28 2017-05-10 上海语知义信息技术有限公司 Method for separating sound of single channels on basis of human sound models
WO2017147951A1 (en) * 2016-03-01 2017-09-08 邦彦技术股份有限公司 Method and device for objective voice quality assessment processing of internet phone calls
US20190313187A1 (en) * 2018-04-05 2019-10-10 Holger Stoltze Controlling the direction of a microphone array beam in a video conferencing system
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN110867194A (en) * 2019-11-05 2020-03-06 腾讯音乐娱乐科技(深圳)有限公司 Audio scoring method, device, equipment and storage medium
CN112233689A (en) * 2020-09-24 2021-01-15 北京声智科技有限公司 Audio noise reduction method, device, equipment and medium
CN112967738A (en) * 2021-02-01 2021-06-15 腾讯音乐娱乐科技(深圳)有限公司 Human voice detection method and device, electronic equipment and computer readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG WENYI; YI XUE: "Adaptive noise tracking algorithm based on improved speech presence probability", Signal Processing, no. 01, 25 January 2020 (2020-01-25) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117476040A (en) * 2023-12-25 2024-01-30 Shenzhen Xinwenda Electronics Co., Ltd. Audio identification method and identification system
CN117476040B (en) * 2023-12-25 2024-03-29 Shenzhen Xinwenda Electronics Co., Ltd. Audio identification method and identification system

Similar Documents

Publication Publication Date Title
US10504539B2 (en) Voice activity detection systems and methods
KR101266894B1 (en) Apparatus and method for processing an audio signal for speech emhancement using a feature extraxtion
CN106486131B (en) A kind of method and device of speech de-noising
US7133826B2 (en) Method and apparatus using spectral addition for speaker recognition
CN111128213B (en) Noise suppression method and system for processing in different frequency bands
EP2828856B1 (en) Audio classification using harmonicity estimation
WO2022012195A1 (en) Audio signal processing method and related apparatus
CN110880329A (en) Audio identification method and equipment and storage medium
US20140177853A1 (en) Sound processing device, sound processing method, and program
CN108847253B (en) Vehicle model identification method, device, computer equipment and storage medium
Kumar Real-time performance evaluation of modified cascaded median-based noise estimation for speech enhancement system
CN112712816A (en) Training method and device of voice processing model and voice processing method and device
CN112151055B (en) Audio processing method and device
CN113593604A (en) Method, device and storage medium for detecting audio quality
CN111755025B (en) State detection method, device and equipment based on audio features
CN109741761B (en) Sound processing method and device
CN115223584B (en) Audio data processing method, device, equipment and storage medium
JP6724290B2 (en) Sound processing device, sound processing method, and program
CN113393852B (en) Method and system for constructing voice enhancement model and method and system for voice enhancement
CN112233693B (en) Sound quality evaluation method, device and equipment
CN116959486A (en) Customer satisfaction analysis method and device based on speech emotion recognition
CN115206345A (en) Music and human voice separation method, device, equipment and medium based on time-frequency combination
EP2760022B1 (en) Audio bandwidth dependent noise suppression
CN114512141A (en) Method, apparatus, device, storage medium and program product for audio separation
Andersen Wind noise reduction in single channel speech signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination