CN113593604A - Method, device and storage medium for detecting audio quality


Info

Publication number: CN113593604A
Application number: CN202110831738.2A
Authority: CN (China)
Prior art keywords: audio, power, detected, audio frame, human voice
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张超鹏, 汪璐璐, 姜涛, 胡鹏
Current and original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110831738.2A; publication of CN113593604A


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a method, a device, and a storage medium for detecting audio quality, belonging to the field of computer technology. The method comprises the following steps: determining a human voice fundamental frequency estimate corresponding to each audio frame to be detected according to a power spectrum corresponding to each audio frame to be detected of the target dry audio; for each audio frame to be detected, weighting the power value of each frequency point in the power spectrum of the audio frame to be detected; determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum; detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected; and determining the audio quality information of the target dry audio according to the power spectra corresponding to the human voice audio frames and to the non-human voice audio frames. With this method, device, and storage medium, the audio quality of dry audio can be judged more accurately.

Description

Method, device and storage medium for detecting audio quality
Technical Field
The present application relates to the field of audio data processing, and in particular, to a method and an apparatus for detecting audio quality, and a storage medium.
Background
Karaoke applications are a common form of entertainment. A user sings through a karaoke application, and during the singing the terminal records the audio captured by the microphone; this audio is generally called dry audio. The user can then upload the dry audio to a server for storage, so that when the user's performance is played back, the dry audio can be downloaded and played together with the accompaniment audio.
While a karaoke application is in operation, the server may store hundreds of millions of dry audio recordings, and the amount of stored dry audio grows over time, which places great demands on the server's storage capacity. A deletion mechanism is therefore generally configured on the server: for example, the dry audio of users who have not logged in for a long time is deleted, and lower-quality audio is deleted as well. Generally, when the audio quality of dry audio is evaluated, only the total power of the dry audio is detected and the audio quality information is determined from that total power; if the power is too low (possibly because the user did not sing into the microphone), the dry audio is judged to be of low quality and may be deleted.
In the course of implementing the present application, the inventors found that the related art has at least the following problem:
the above scheme judges audio quality only from the total power; however, the total power cannot fully and accurately reflect the audio quality, so the accuracy of the resulting audio quality information is poor.
Disclosure of Invention
The embodiment of the application provides a method and a device for detecting audio quality and a storage medium, which can solve the problem of poor accuracy of audio quality information. The technical scheme is as follows:
in a first aspect, a method for detecting audio quality is provided, the method comprising:
determining a human voice fundamental frequency estimate corresponding to each audio frame to be detected according to a power spectrum corresponding to each audio frame to be detected of the target dry audio, wherein the power spectrum comprises power values of all frequency points;
for each audio frame to be detected, weighting the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a weighted power spectrum, wherein the weights of frequency points at positive integer multiples of the human voice fundamental frequency estimate corresponding to the audio frame to be detected are greater than the weights of other frequency points;
determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum;
detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
and determining audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames.
In a possible implementation, determining the human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected of the target dry audio comprises:
determining the human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information.
In a possible implementation, determining the human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the preset human voice frequency characteristic information comprises:
smoothing the power spectrum corresponding to each audio frame to be detected with a preset window length;
and determining, for each audio frame to be detected, the lowest peak frequency of the smoothed power spectrum within the human voice fundamental frequency range as the human voice fundamental frequency estimate corresponding to that audio frame.
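As an illustrative sketch (not the patent's reference implementation), the smoothing and lowest-peak search described above can be realized as follows; the moving-average window length and the 80-400 Hz vocal fundamental frequency range are assumed values, not taken from the patent:

```python
import numpy as np

def estimate_f0(power_spectrum, freqs, f0_range=(80.0, 400.0), win=5):
    """Smooth the power spectrum with a moving-average window of length `win`,
    then return the lowest-frequency local peak inside the vocal F0 range."""
    kernel = np.ones(win) / win
    smoothed = np.convolve(power_spectrum, kernel, mode="same")
    lo, hi = f0_range
    for i in np.where((freqs >= lo) & (freqs <= hi))[0]:
        # Local peak: strictly above the left neighbour, at least the right one.
        if 0 < i < len(smoothed) - 1 and \
                smoothed[i] > smoothed[i - 1] and smoothed[i] >= smoothed[i + 1]:
            return float(freqs[i])
    return None  # no peak found in the vocal range
```

Searching from the low end of the range and stopping at the first local peak is what makes the returned value the minimum peak frequency.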
In a possible implementation, for each audio frame to be detected, weighting the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a weighted power spectrum comprises:
constructing, according to the human voice fundamental frequency estimate corresponding to each audio frame to be detected, a weight coefficient function corresponding to that audio frame, wherein the weight coefficient function represents the weights corresponding to different frequency points, and its waveform has a plurality of peaks respectively located at positive integer multiples of the human voice fundamental frequency estimate;
and for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame by the weight coefficient function to obtain the weighted power spectrum.
In one possible implementation, the weight coefficient function is a trigonometric function.
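The patent only states that the weight coefficient function is a trigonometric function with peaks at positive integer multiples of the fundamental frequency estimate; a raised cosine is one natural choice that satisfies this description (a sketch, not the claimed formula):

```python
import numpy as np

def harmonic_weights(freqs, f0):
    """Raised-cosine weight function: equals 1 at positive integer multiples
    of f0 and falls to 0 midway between consecutive harmonics."""
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * freqs / f0))

def weight_spectrum(power_spectrum, freqs, f0):
    # Element-wise multiplication of the power spectrum by the weights.
    return power_spectrum * harmonic_weights(freqs, f0)
```

With this choice, power at harmonic frequency points passes through unchanged while power between harmonics is suppressed, which is exactly the behaviour the weighting step requires.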
In a possible implementation, determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum comprises:
for each audio frame to be detected, determining the ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum, and normalizing the ratio according to a preset ratio upper limit and a preset ratio lower limit; the normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
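One way to read this step: a strongly voiced frame keeps most of its power under harmonic weighting, so the weighted-to-unweighted power ratio is high; clipping and scaling that ratio between preset bounds yields a probability in [0, 1]. An illustrative sketch, where the bounds `r_lo` and `r_hi` are assumptions rather than values from the patent:

```python
def voice_presence_prob(power_spectrum, weighted_spectrum, r_lo=0.2, r_hi=0.6):
    """Normalized ratio of weighted to unweighted total power."""
    ratio = sum(weighted_spectrum) / sum(power_spectrum)
    # Min-max normalization against the preset lower/upper ratio bounds.
    p = (ratio - r_lo) / (r_hi - r_lo)
    return min(1.0, max(0.0, p))
```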
In a possible implementation, detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected comprises:
detecting human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a human voice detection probability threshold;
and detecting non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a non-human voice detection probability threshold.
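Because two separate thresholds are used, frames with intermediate probabilities can be left unclassified rather than forced into either class. A sketch with assumed threshold values:

```python
def classify_frames(probs, voice_thresh=0.7, nonvoice_thresh=0.3):
    """Return indices of human voice frames and non-human voice frames.
    Frames whose probability falls between the two thresholds are ambiguous
    and belong to neither class."""
    voiced = [i for i, p in enumerate(probs) if p >= voice_thresh]
    unvoiced = [i for i, p in enumerate(probs) if p <= nonvoice_thresh]
    return voiced, unvoiced
```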
In one possible implementation, the method further includes: determining a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames; and determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected;
determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames then comprises:
determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
and determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter, and the power penalty parameter.
In a possible implementation, determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames;
and determining the product of the signal-to-noise ratio estimate and the human voice clarity as the human voice quality information of the target dry audio.
In a possible implementation, determining the signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power mean corresponding to each human voice audio frame and a power mean corresponding to each non-human voice audio frame, wherein the power mean is determined as the average of the power values of all frequency points;
determining a first median of the power means corresponding to all human voice audio frames and a second median of the power means corresponding to all non-human voice audio frames;
and calculating the signal-to-noise ratio estimate according to the ratio of the first median to the second median.
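Using medians of the per-frame power means makes the estimate robust to a few outlier frames. A sketch of the computation follows; the conversion of the median ratio to decibels is an assumption, since the patent only specifies a ratio of the two medians:

```python
import math

def snr_estimate(voice_frames, nonvoice_frames):
    """Each argument is a list of per-frame power spectra (lists of
    power values, one per frequency point)."""
    def median(xs):
        s = sorted(xs)
        n = len(s)
        return s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    voice_med = median([sum(f) / len(f) for f in voice_frames])     # first median
    noise_med = median([sum(f) / len(f) for f in nonvoice_frames])  # second median
    return 10.0 * math.log10(voice_med / noise_med)
```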
In a possible implementation, determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames comprises:
determining the human voice clarity of the target dry audio according to the median of the human voice existence probabilities corresponding to all human voice audio frames.
In one possible implementation, the method further includes:
if no human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the mean of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the mean of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the method further includes:
if no non-human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the median of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the median of the human voice existence probabilities and the power penalty parameter.
In a possible implementation, determining the noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power mean corresponding to each non-human voice audio frame, wherein the power mean is determined as the average of the power values of all frequency points;
determining the median of the power means corresponding to all non-human voice audio frames;
and determining the noise penalty parameter of the target dry audio according to the median of the power means, wherein the noise penalty parameter is negatively correlated with the median of the power means.
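A hypothetical shape for this mapping (all constants are assumptions; the patent only requires a negative correlation): quiet non-voice frames incur no penalty, and the factor decreases as the median noise power rises.

```python
def noise_penalty(noise_median_db, quiet_floor_db=-60.0, slope=0.02):
    """Penalty factor in [0, 1], negatively correlated with the median
    noise power once it exceeds a quiet floor."""
    if noise_median_db <= quiet_floor_db:
        return 1.0
    return max(0.0, 1.0 - slope * (noise_median_db - quiet_floor_db))
```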
In a possible implementation, the audio frames to be detected are the audio frames in the target dry audio whose power mean is greater than a silence power threshold, wherein the power mean is determined as the average of the power values of all frequency points;
determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected then comprises:
determining a power mean corresponding to each audio frame to be detected, wherein the power mean is determined as the average of the power values of all frequency points;
determining the average of the power means corresponding to all audio frames to be detected to obtain the total power mean of the target dry audio;
and determining the power penalty parameter of the target dry audio according to the total power mean and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio.
In a possible implementation, determining the power penalty parameter of the target dry audio according to the total power mean and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio comprises:
determining a first power penalty sub-parameter according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold, wherein when the ratio is less than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is greater than the ratio threshold, the first power penalty sub-parameter is a fixed value;
determining a second power penalty sub-parameter and a third power penalty sub-parameter according to the total power mean and preset power upper and lower limits, wherein when the total power mean is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean minus the power upper limit; when the total power mean is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean; and when the total power mean is less than the power upper limit and greater than the power lower limit, the second and third power penalty sub-parameters are both fixed values;
and determining the product of the first, second and third power penalty sub-parameters as the power penalty parameter of the target dry audio.
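The three sub-parameters described above can be sketched as follows; every constant (the ratio threshold, the power limits in dB, the penalty slope) is an assumed illustration, since the patent specifies only the correlations, not concrete formulas:

```python
def power_penalty(frame_ratio, total_power_db, ratio_thresh=0.3,
                  power_hi_db=-10.0, power_lo_db=-50.0, slope=0.05):
    """Product of three sub-parameters, each in [0, 1]."""
    # Sub-parameter 1: penalize recordings where few frames pass the
    # silence gate (ratio <= threshold); otherwise a fixed value of 1.
    if frame_ratio > ratio_thresh:
        q1 = 1.0
    else:
        q1 = max(0.0, 1.0 - (ratio_thresh - frame_ratio) / ratio_thresh)
    # Sub-parameter 2: penalize excessive loudness above the upper limit.
    q2 = 1.0 if total_power_db < power_hi_db else \
        max(0.0, 1.0 - slope * (total_power_db - power_hi_db))
    # Sub-parameter 3: penalize near-silence below the lower limit.
    q3 = 1.0 if total_power_db > power_lo_db else \
        max(0.0, 1.0 - slope * (power_lo_db - total_power_db))
    return q1 * q2 * q3
```

Within the normal operating band (enough non-silent frames, power between the two limits) all three factors equal 1 and no penalty is applied.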
In a second aspect, an apparatus for detecting audio quality is provided, the apparatus comprising:
the determining module is used for determining a human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected of the target dry audio, wherein the power spectrum comprises power values of all frequency points;
the weighting module is used for weighting the power value of each frequency point in the power spectrum of each audio frame to be detected to obtain a weighted power spectrum, wherein the weights of frequency points at positive integer multiples of the human voice fundamental frequency estimate corresponding to the audio frame to be detected are greater than the weights of other frequency points;
the probability module is used for determining the human voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the weighted power spectrum;
the detection module is used for detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
and the quality module is used for determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames.
In one possible implementation manner, the determining module is configured to:
determining a human voice fundamental frequency estimate corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information.
In a possible implementation, the human voice frequency characteristic information is a human voice fundamental frequency range, and the determining module is configured to:
smoothing the power spectrum corresponding to each audio frame to be detected with a preset window length;
and determining, for each audio frame to be detected, the lowest peak frequency of the smoothed power spectrum within the human voice fundamental frequency range as the human voice fundamental frequency estimate corresponding to that audio frame.
In one possible implementation manner, the weighting module is configured to:
constructing, according to the human voice fundamental frequency estimate corresponding to each audio frame to be detected, a weight coefficient function corresponding to that audio frame, wherein the weight coefficient function represents the weights corresponding to different frequency points, and its waveform has a plurality of peaks respectively located at positive integer multiples of the human voice fundamental frequency estimate;
and for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame by the weight coefficient function to obtain the weighted power spectrum.
In one possible implementation, the weight coefficient function is a trigonometric function.
In one possible implementation, the probability module is configured to:
for each audio frame to be detected, determining the ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum, and normalizing the ratio according to a preset ratio upper limit and a preset ratio lower limit; the normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
In a possible implementation manner, the detection module is configured to:
detecting human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a human voice detection probability threshold;
and detecting non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a non-human voice detection probability threshold.
In one possible implementation, the quality module is further configured to: determining a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames; and determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected;
the quality module is configured to:
determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame;
and determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter and the power penalty parameter.
In one possible implementation, the quality module is configured to:
determining a signal-to-noise ratio estimate of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames;
and determining the product of the signal-to-noise ratio estimate and the human voice clarity as the human voice quality information of the target dry audio.
In one possible implementation, the quality module is configured to:
determining a power mean corresponding to each human voice audio frame and a power mean corresponding to each non-human voice audio frame, wherein the power mean is determined as the average of the power values of all frequency points;
determining a first median of the power means corresponding to all human voice audio frames and a second median of the power means corresponding to all non-human voice audio frames;
and calculating the signal-to-noise ratio estimate according to the ratio of the first median to the second median.
In one possible implementation, the quality module is configured to:
determining the human voice clarity of the target dry audio according to the median of the human voice existence probabilities corresponding to all human voice audio frames.
In one possible implementation, the quality module is further configured to:
if no human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the mean of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the mean of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the quality module is further configured to:
if no non-human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected, and determining the median of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the median of the human voice existence probabilities and the power penalty parameter.
In one possible implementation, the quality module is configured to:
determining a power mean corresponding to each non-human voice audio frame, wherein the power mean is determined as the average of the power values of all frequency points;
determining the median of the power means corresponding to all non-human voice audio frames;
and determining the noise penalty parameter of the target dry audio according to the median of the power means, wherein the noise penalty parameter is negatively correlated with the median of the power means.
In a possible implementation, the audio frames to be detected are the audio frames in the target dry audio whose power mean is greater than a silence power threshold, wherein the power mean is determined as the average of the power values of all frequency points;
the quality module is configured to:
determining a power mean corresponding to each audio frame to be detected, wherein the power mean is determined as the average of the power values of all frequency points;
determining the average of the power means corresponding to all audio frames to be detected to obtain the total power mean of the target dry audio;
and determining the power penalty parameter of the target dry audio according to the total power mean and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio.
In one possible implementation, the quality module is configured to:
determining a first power penalty sub-parameter according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold, wherein when the ratio is less than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is greater than the ratio threshold, the first power penalty sub-parameter is a fixed value;
determining a second power penalty sub-parameter and a third power penalty sub-parameter according to the total power mean and preset power upper and lower limits, wherein when the total power mean is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean minus the power upper limit; when the total power mean is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean; and when the total power mean is less than the power upper limit and greater than the power lower limit, the second and third power penalty sub-parameters are both fixed values;
and determining the product of the first, second and third power penalty sub-parameters as the power penalty parameter of the target dry audio.
In a third aspect, a computer device is provided, and the computer device includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the operations performed by the method for detecting audio quality according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the operations performed by the method for detecting audio quality according to the first aspect.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
According to the embodiments of the application, the human voice existence probability of each audio frame in the dry audio is identified from the power spectrum of the audio frame, and the human voice audio frames and non-human voice audio frames are identified accordingly. The audio quality information of the dry audio is then determined based on the power spectra of the human voice audio frames and of the non-human voice audio frames. Because high-quality dry audio is close to silence in its non-human voice parts, the audio quality of the dry audio can be judged more accurately based on the power of the human voice audio frames and of the non-human voice audio frames.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a flowchart of a method for detecting audio quality according to an embodiment of the present application;
Fig. 2 is a flowchart of a method for detecting audio quality according to an embodiment of the present application;
Fig. 3 is a flowchart of a method for determining a human voice existence probability according to an embodiment of the present application;
Fig. 4 is a waveform diagram of a power spectrum provided by an embodiment of the present application;
Fig. 5 is a waveform diagram of a power spectrum provided by an embodiment of the present application;
Fig. 6 is a waveform diagram of a power spectrum provided by an embodiment of the present application;
Fig. 7 is a waveform diagram of a power spectrum provided by an embodiment of the present application;
Fig. 8 is a flowchart of a method for detecting audio quality according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an apparatus for detecting audio quality according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the present application clearer, embodiments of the present application are described in further detail below with reference to the accompanying drawings.
In the method for detecting audio quality provided by the present application, the execution subject may be a server. The server may be a background server of an application program with an audio recording function, such as a karaoke application, a video application, or a recording application. The server may be a single server or a server group. If it is a single server, that server may be responsible for all of the processing in the following scheme; if it is a server group, different servers in the group may be responsible for different parts of the processing, and the specific allocation may be set by a technician according to actual needs.
The server may include components such as a processor, memory, and communication components. The processor is respectively connected with the memory and the communication component.
The processor may be a Central Processing Unit (CPU).
The memory may include ROM (Read-Only Memory), RAM (Random Access Memory), CD-ROM (Compact Disc Read-Only Memory), a magnetic disk, an optical data storage device, and the like. The memory may be used to store data needed in the process of detecting audio quality, including pre-stored data, generated intermediate data, and generated result data, such as dry audio, the various penalty parameters, and audio quality information.
The communication component may be a wired network connector, a WiFi (Wireless Fidelity) module, a Bluetooth module, a cellular network communication module, or the like. The communication component may be used for data transmission with other devices, such as other servers and terminals. For example, the communication component may receive dry audio transmitted by a terminal.
Fig. 1 is a flowchart of a method for detecting audio quality according to an embodiment of the present application. Referring to fig. 1, the embodiment includes:
101, determining a human voice fundamental frequency estimation value corresponding to each audio frame to be detected according to a power spectrum corresponding to each audio frame to be detected of the target dry audio.
The power spectrum comprises power values of all frequency points.
And 102, performing weighting processing on the power value of each frequency point in the power spectrum of each audio frame to be detected to obtain the power spectrum after weighting processing.
And the weight of the frequency points which are positive integral multiples of the human voice fundamental frequency estimated value corresponding to the audio frame to be detected is greater than the weight of other frequency points.
And 103, determining the existence probability of the human voice of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after the weighting processing.
And 104, detecting a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected.
And 105, determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame.
According to the embodiments of the application, the human voice existence probability of each audio frame in the dry audio is identified from the power spectrum of the audio frame, and the human voice audio frames and non-human voice audio frames are identified accordingly. The audio quality information of the dry audio is then determined based on the power spectra of the human voice audio frames and of the non-human voice audio frames. Because high-quality dry audio is close to silence in its non-human voice parts, the audio quality of the dry audio can be judged more accurately based on the power of the human voice audio frames and of the non-human voice audio frames.
Fig. 2 is a flowchart of a method for detecting audio quality according to an embodiment of the present application. Referring to fig. 2, the embodiment includes:
and 201, acquiring a power spectrum corresponding to the audio frame to be detected of the target dry audio.
The power spectrum comprises the power values of all frequency points. The audio frames to be detected can be selected in various ways, for example at a fixed interval, or by selecting audio frames that meet a certain power requirement.
In implementation, during audio recording on the terminal, a common audio data sampling rate is 44.1 kHz (Android) or 48 kHz (iOS). Down-sampling to 16 kHz is generally performed on the collected audio data to reduce the computation of subsequent processing. An open-source resampling tool such as libresample can be used for the down-sampling. The corresponding dry audio is obtained after down-sampling.
The server stores a large number of dry sound audios, and for any dry sound audio (i.e. target dry sound audio), the power spectrum of each audio frame thereof can be calculated as follows:
(1) framing
The audio data of each audio frame may be represented as x_n(i) = x(n·M + i).
Here n is the frame index, M is the frame shift, i.e. the number of samples the next frame is shifted relative to the previous frame, i is the index of the sample within the n-th frame, with value range 0, 1, 2, …, L − 1, and L is the frame length, i.e. the number of samples in an audio frame. M may correspond to a duration of t_frmhop = 0.01 s (the frame interval duration), and L may correspond to a duration of 0.03 s.
(2) Windowing:
The windowing calculation formula may be xw_n(i) = x_n(i)·w(i).
Here w(i) represents a window function; a Hanning window may be used, with the expression:
w(i) = 0.5·(1 − cos(2πi/(L − 1))), i = 0, 1, …, L − 1.
(3) discrete Fourier transform:
The Fourier transform result of the n-th frame audio data xw_n(i) is:
X(n, k) = Σ_{i=0}^{L−1} xw_n(i)·e^{−j2πki/N}, k = 0, 1, …, N − 1,
where N represents the number of points of the Fourier transform; L and N may be set equal.
(4) Calculating a power spectrum:
P(n, k) = ||X(n, k)||^2, n = 0, 1, …, N_raw − 1, where N_raw represents the total number of frames of the current signal after the STFT (Short-Time Fourier Transform).
Here k denotes the frequency point index, and P(n, k) represents the power at the k-th frequency point of the n-th frame.
After determining the power spectrum of each audio frame, the audio frames to be detected may be screened based on the power spectrum of each audio frame. An audio frame to be detected may be an audio frame of the target dry audio whose power mean value is greater than the mute power threshold, where the power mean value is determined from the mean of the power values at all frequency points. The selection of the audio frames to be detected is described in detail below.
First, the average power sequence is calculated:
Pa(n) = (1/N_bins)·Σ_{k=0}^{N_bins−1} P(n, k),
where N_bins indicates the number of frequency points. The 1/N_bins term in the formula may also be removed.
A minimum power P_min = 1e−10 is set as the mute decision threshold, and the effective power sequence is found:
Pwr(n) = {Pa(n) | Pa(n) > P_min}.
The audio frame corresponding to each effective power value in Pwr(n) is an audio frame to be detected.
And 202, determining the existence probability of the human voice corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected.
A human voice fundamental frequency estimate corresponding to each audio frame to be detected is determined according to the power spectrum corresponding to that frame. For each audio frame to be detected, the power value of each frequency point in its power spectrum is multiplied by a weight to obtain the weighted power spectrum. The weight of the frequency points that are positive integer multiples of the human voice fundamental frequency estimate corresponding to the audio frame to be detected is greater than the weight of other frequency points. The human voice existence probability of each audio frame to be detected is then determined according to its power spectrum and the weighted power spectrum.
The process of determining the existence probability of the human voice may be performed as follows according to the steps shown in fig. 3.
2021, determining the human voice fundamental frequency estimation value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the preset human voice frequency characteristic information.
Wherein the human voice frequency characteristic information is the human voice fundamental frequency range. The frequency range of the normal human voice fundamental frequency is 40 Hz to 1500 Hz, so the minimum value of the human voice fundamental frequency can be set as f_min = 40 Hz and the maximum value as f_max = 1500 Hz. The corresponding frequency points are represented as:
k_fmin = ⌈f_min·N/f_s⌉, k_fmax = ⌊f_max·N/f_s⌋,
where f_s is the sampling rate.
Firstly, smoothing processing with preset window length is carried out on the power spectrum corresponding to each audio frame to be detected.
The subsequent step detects the peak of the fundamental frequency; the smoothing aims to filter out the small spurious peaks distributed around the main peaks at the fundamental frequency and its frequency multiples.
The power spectrum may be smoothed using a triangular window convolution operation with a triangular window of length M + 1. A smoothing kernel of M + 1 points is computed as:
h(m) = ⌈(M + 1)/2⌉ − |m − M/2|, m = 0, 1, …, M,
where ⌈·⌉ expresses rounding up. Further normalization processing is carried out, namely:
h̄(m) = h(m) / Σ_{m=0}^{M} h(m).
The smoothed power spectrum can then be expressed as:
Ps(n, k) = Σ_{m=0}^{M} h̄(m)·P(n, k − ⌊M/2⌋ + m).
The smoothing sequence length M here can be chosen according to the frequency resolution, for example in proportion to the frequency point index k_fmin corresponding to the minimum fundamental frequency.
and then, respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the human voice fundamental frequency range as the human voice fundamental frequency estimated value corresponding to each audio frame to be detected.
For the smoothed power spectrum corresponding to each audio frame to be detected, the peak positions are found as:
k_peak = arg localmax_k Ps(n, k),
where the arg function looks up the frequency points k at which Ps(n, k) is a local maximum. Among all k_peak falling within the fundamental frequency range [k_fmin, k_fmax], if there is more than one, the minimum k_peak is taken and denoted k_f0; the corresponding frequency value is used as the human voice fundamental frequency estimate. Using the frequency resolution parameter, the fundamental frequency f0 can be expressed as:
f0 = k_f0·f_s/N.
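The smoothing-then-lowest-peak estimate above can be sketched as follows. This is an illustrative sketch: the triangular-kernel width `M_smooth`, the strict local-maximum test, and the function name `estimate_f0` are assumptions, since the source gives the exact kernel only as an image.

```python
import numpy as np

def estimate_f0(P_frame, fs=16000, N=480, f_min=40.0, f_max=1500.0, M_smooth=8):
    """Smooth one frame's power spectrum with a triangular kernel, then take
    the lowest-frequency local peak inside [f_min, f_max] as the f0 estimate.
    Returns the estimate in Hz, or None if no peak lies in the range."""
    m = np.arange(M_smooth + 1)
    tri = (M_smooth / 2 + 1) - np.abs(m - M_smooth / 2)  # triangular kernel
    tri = tri / tri.sum()                                # normalize to sum 1
    Ps = np.convolve(P_frame, tri, mode="same")          # smoothed spectrum
    k_lo = int(np.ceil(f_min * N / fs))
    k_hi = min(int(np.floor(f_max * N / fs)), len(Ps) - 2)
    peaks = [k for k in range(max(k_lo, 1), k_hi + 1)
             if Ps[k] > Ps[k - 1] and Ps[k] >= Ps[k + 1]]  # local maxima
    if not peaks:
        return None
    k_f0 = min(peaks)               # smallest peak frequency point
    return k_f0 * fs / N            # f0 = k_f0 * fs / N
```

For instance, a frame whose power sits only at frequency points 12, 24, and 36 (harmonics of 400 Hz at a 16 kHz rate with N = 480) yields an estimate of 400 Hz.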
2022, according to the human voice fundamental frequency estimated value corresponding to each audio frame to be detected, constructing a weight coefficient function corresponding to each audio frame to be detected.
The weight coefficient function is used for representing weights corresponding to different frequency points, the waveform of the weight coefficient function has a plurality of wave crests, and the plurality of wave crests respectively correspond to positive integral multiples of the human voice fundamental frequency estimated value.
Alternatively, the weight coefficient function may be a trigonometric function.
One possible form is given here:
W(f) = 0.5·(1 + cos(2πf/f0)).
The corresponding discrete expression is as follows:
W_sin(k) = 0.5·(1 + cos(2πk/k_f0)),
whose peaks fall at frequency points that are positive integer multiples of k_f0.
2023, determining the existence probability of the human voice of each audio frame to be detected according to the power spectrum and the weight coefficient function corresponding to each audio frame to be detected.
In implementation, for each audio frame to be detected, the power spectrum corresponding to the audio frame is multiplied by the weight coefficient function to obtain the weighted power spectrum. The ratio of the total power of the weighted power spectrum to the total power of the unweighted power spectrum is determined, and the ratio is normalized according to a preset ratio upper limit and ratio lower limit. The normalized ratio is used as the human voice existence probability corresponding to the audio frame to be detected.
The raw power spectrum is defined as P0(n, k) = Ps(n, k), and the weighted power spectrum is represented as P1(n, k) = Ps(n, k)·W_sin(k). An initial human voice existence parameter is calculated:
probv(n) = Σ_{k=0}^{K−1} P1(n, k) / Σ_{k=0}^{K−1} P0(n, k),
where K represents the frequency point index corresponding to the maximum frequency width. The ratio lower limit may be set to p_L = 0.2 and the ratio upper limit to p_U = 0.8. Normalizing the human voice existence parameter gives the human voice existence probability:
prob(n) = (p(n) − p_L)/(p_U − p_L),
where p(n) = max(p_L, min(p_U, probv(n))).
Based on the weight coefficient function constructed above, the following can be observed. If the audio frame is a human voice audio frame, Ps(n, k) and W_sin(k) are as shown in fig. 4. In a human voice audio frame the peaks appear at the fundamental frequency and its multiples (integer multiples of the fundamental frequency), so the peaks are uniformly spaced; with the above construction, the peaks of W_sin(k) also appear at the fundamental frequency and its multiples. The peaks of Ps(n, k) therefore align with the peaks of W_sin(k), and its valleys align with the valleys. P1(n, k), obtained after multiplication, is shown in fig. 5: W_sin(k) enlarges the peak values of Ps(n, k) and reduces the valley values, so although the total power is reduced, the reduction is small. If the audio frame is a non-human voice audio frame, Ps(n, k) and W_sin(k) are as shown in fig. 6. In a non-human voice audio frame the peaks are not uniformly spaced, whereas the peaks of W_sin(k) are, so many peaks and valleys of Ps(n, k) do not align with those of W_sin(k). P1(n, k), obtained after multiplication, is shown in fig. 7: here the reduction of the total power of Ps(n, k) by W_sin(k) can be relatively large. Based on these characteristics, the normalized ratio reflects the human voice existence probability of the audio frame. The four function graphs are illustrated using continuous function images for ease of viewing rather than discrete ones.
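The weighted-to-raw power ratio and its normalization can be sketched as follows, with p_L = 0.2 and p_U = 0.8 as in the text. The raised-cosine weight peaking at integer multiples of the f0 frequency point is an assumed form, since the source gives the weight function only as an image.

```python
import numpy as np

def voice_presence_prob(P_frame, k_f0, p_lo=0.2, p_hi=0.8):
    """Weighted-to-raw total power ratio, clipped to [p_lo, p_hi] and
    normalized to [0, 1]. The raised-cosine weight is an assumed form."""
    k = np.arange(len(P_frame))
    W = 0.5 * (1.0 + np.cos(2.0 * np.pi * k / k_f0))   # peaks at m * k_f0
    probv = float((P_frame * W).sum() / P_frame.sum()) # sum P1 / sum P0
    p = max(p_lo, min(p_hi, probv))
    return (p - p_lo) / (p_hi - p_lo)

# A harmonic frame (power only at multiples of k_f0) scores high;
# a flat, noise-like frame scores near the middle.
harm = np.zeros(241)
harm[[12, 24, 36]] = 1.0
p_harm = voice_presence_prob(harm, k_f0=12)
p_flat = voice_presence_prob(np.ones(241), k_f0=12)
```

This mirrors the figure discussion above: aligned peaks preserve most of the weighted power, while a flat spectrum keeps only about the mean weight.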
And 203, detecting a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected.
There are many methods for detecting the human voice audio frame and the non-human voice audio frame based on the human voice existence probability, for example, a threshold may be set, and if the human voice existence probability of the audio frame to be detected is greater than the threshold, it is determined as the human voice audio frame, and if the human voice existence probability is less than the threshold, it is determined as the non-human voice audio frame.
Alternatively, it can be detected as follows: detecting a voice audio frame in the audio frames to be detected of the target dry audio according to the voice existence probability and the voice detection probability threshold value corresponding to each audio frame to be detected; and detecting the non-human voice audio frames in the audio frames to be detected of the target dry audio according to the human voice existence probability and the non-human voice detection probability threshold value corresponding to each audio frame to be detected.
The method divides the detection of human voice audio frames and the detection of non-human voice audio frames into two processes, where each detection process may use either one threshold or two. A detection method using two thresholds is given below.
The process of detecting the human voice audio frame comprises the following steps: (wherein, the voice detection probability threshold value comprises a first threshold value and a second threshold value, and the first threshold value is larger than the second threshold value)
And acquiring the existence probability of the human voice corresponding to the audio frames to be detected one by one according to the sequence of time from first to last.
When the acquired first human voice existence probability is greater than the first threshold, and no human voice existence probability smaller than the second threshold or greater than the first threshold has been acquired before it, the audio frame to be detected corresponding to the first human voice existence probability is determined as a human voice starting audio frame.
When a first preset number of consecutively acquired human voice existence probabilities are all greater than the first threshold, the audio frame to be detected corresponding to a second human voice existence probability is determined as a human voice starting audio frame, where the second human voice existence probability is the first-acquired one among the first preset number of human voice existence probabilities.
After a human voice starting audio frame is determined, when a second preset number of consecutively acquired human voice existence probabilities are all smaller than the second threshold, the audio frame to be detected corresponding to a third human voice existence probability is determined as a human voice ending audio frame, where the third human voice existence probability is the one acquired immediately before the first-acquired probability among the second preset number of human voice existence probabilities.
After a human voice ending audio frame is determined, when a first preset number of consecutively acquired human voice existence probabilities are all greater than the first threshold, the audio frame to be detected corresponding to a fourth human voice existence probability is determined as a human voice starting audio frame, where the fourth human voice existence probability is the first-acquired one among the first preset number of human voice existence probabilities.
And determining the voice audio frames in all the audio frames to be detected according to the determined voice starting audio frame and the determined voice ending audio frame.
The process of detecting the non-human voice audio frames comprises the following steps (wherein the non-human voice detection probability threshold comprises a third threshold and a fourth threshold, the third threshold being smaller than the fourth threshold):
The human voice existence probabilities corresponding to the audio frames to be detected are acquired one by one in chronological order.
When the acquired fifth human voice existence probability is smaller than the third threshold, and no human voice existence probability greater than the fourth threshold or smaller than the third threshold has been acquired before it, the audio frame to be detected corresponding to the fifth human voice existence probability is determined as a non-human voice starting audio frame.
When a third preset number of consecutively acquired human voice existence probabilities are all smaller than the third threshold, the audio frame to be detected corresponding to a sixth human voice existence probability is determined as a non-human voice starting audio frame, where the sixth human voice existence probability is the first-acquired one among the third preset number of human voice existence probabilities.
After a non-human voice starting audio frame is determined, when a fourth preset number of consecutively acquired human voice existence probabilities are all greater than the fourth threshold, the audio frame to be detected corresponding to a seventh human voice existence probability is determined as a non-human voice ending audio frame, where the seventh human voice existence probability is the one acquired immediately before the first-acquired probability among the fourth preset number of human voice existence probabilities.
After a non-human voice ending audio frame is determined, when a third preset number of consecutively acquired human voice existence probabilities are all smaller than the third threshold, the audio frame to be detected corresponding to an eighth human voice existence probability is determined as a non-human voice starting audio frame, where the eighth human voice existence probability is the first-acquired one among the third preset number of human voice existence probabilities.
And determining the non-human voice audio frames in all the audio frames to be detected according to the determined non-human voice starting audio frames and the determined non-human voice ending audio frames.
The first threshold and the fourth threshold may be equal and may be set to 0.6, and the second threshold and the third threshold may be equal and may be set to 0.5.
The shortest mute duration within a human voice segment may be set as T_silmin, with corresponding frame number N_silmin = ⌈T_silmin/t_frmhop⌉; this is the second preset number. The shortest human voice duration (less than which is regarded as short-time noise) may be set as T_vocmin, with corresponding frame number N_vocmin = ⌈T_vocmin/t_frmhop⌉; this is the first preset number. The maximum length of an abnormal sound occurring in silence may be set as T_absmax, with corresponding frame number N_absmax = ⌈T_absmax/t_frmhop⌉; this is the fourth preset number. The shortest mute duration (more than which is considered to enter the mute region) may be set as T_mutemin, with corresponding frame number N_mutemin = ⌈T_mutemin/t_frmhop⌉; this is the third preset number.
Before the detection processing, the human voice existence probability sequence may be smoothed to implement denoising, and the detection is then performed. A spline curve S_b(m) with a smoothing window length M may be used for the smoothing, yielding the smoothed probability sequence probs(n); M may be 30.
The set of detected human voice audio frames above can be denoted Seg_voc, and the set of detected non-human voice audio frames can be denoted Seg_sil.
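The dual-threshold rules above amount to a hysteresis state machine with minimum run lengths. The following is a simplified sketch, not the full patented procedure: it keeps only the consecutive-run start/end rules, and the threshold and run-length defaults are illustrative stand-ins for the values discussed above.

```python
def detect_voice_frames(probs, thr_hi=0.6, thr_lo=0.5, n_on=3, n_off=5):
    """Hysteresis sketch of the dual-threshold rules above: n_on consecutive
    probabilities above thr_hi open a human voice segment (starting at the
    first frame of the run); n_off consecutive probabilities below thr_lo
    close it (the segment ends just before the run). One bool per frame."""
    voiced = [False] * len(probs)
    in_voice = False
    run_start, run_len = 0, 0
    for i, p in enumerate(probs):
        trigger = (p > thr_hi) if not in_voice else (p < thr_lo)
        if trigger:
            if run_len == 0:
                run_start = i
            run_len += 1
            if run_len >= (n_on if not in_voice else n_off):
                if not in_voice:
                    for j in range(run_start, i + 1):  # run starts the segment
                        voiced[j] = True
                in_voice = not in_voice
                run_len = 0
        else:
            if in_voice:
                for j in range(run_start, i):  # an aborted low run stays voice
                    voiced[j] = True
                voiced[i] = True
            run_len = 0
    if in_voice:  # trailing frames of an open segment count as voice
        for j in range(run_start, len(probs)):
            voiced[j] = True
    return voiced
```

The voiced frames then form Seg_voc and the remaining frames form Seg_sil for the steps that follow.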
And 204, determining the signal-to-noise ratio estimation value of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame.
Firstly, determining a power average value corresponding to each voice audio frame and a power average value corresponding to each non-human audio frame, wherein the power average value is determined according to the average value of the power values of all frequency points.
Then, a first median of the power means corresponding to all the human voice audio frames is determined, and a second median of the power means corresponding to all the non-human voice audio frames is determined.
And finally, calculating the signal-to-noise ratio estimation value according to the ratio of the first median to the second median.
Specifically, the log power spectrum statistic of the human voice segments can be calculated as:
Pv = Q_1/2({log_x Pwr(n) | n ∈ Seg_voc}),
and the log power spectrum statistic of the non-human voice segments as:
Pn = Q_1/2({log_x Pwr(n) | n ∈ Seg_sil}),
where the base x can be set as needed and may be 10 or e, and Q_1/2(·) denotes taking the median, i.e. after sorting the current sequence, the value arranged in the middle is taken as the final output. The signal-to-noise ratio parameter is then calculated as:
snr = Pv − Pn.
A signal-to-noise ratio upper limit snr_U and lower limit snr_L are set, and the normalized signal-to-noise ratio parameter is further calculated as:
r_snr = g1((min(max(snr, snr_L), snr_U) − snr_L) / (snr_U − snr_L)),
where g1(·) is a correction function with a sharpening effect, and r_snr can be considered the signal-to-noise ratio estimate.
Since the sound in the non-human audio frame in the dry audio can be considered as noise, the signal-to-noise ratio of the dry audio can be approximately represented by the ratio of the power information of the human audio frame and the non-human audio frame.
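The median-based SNR estimate above can be sketched as follows. The median-of-log structure comes from the text; the dB bounds (10 and 50) and the squaring used for the sharpening correction g1 are illustrative assumptions, since the source gives them only as images.

```python
import numpy as np

def snr_estimate(pwr_voc, pwr_sil, snr_lo=10.0, snr_hi=50.0):
    """SNR estimate from the medians of the log frame powers of the human
    voice and non-human voice segments, clipped and normalized to [0, 1]."""
    Pv = float(np.median(np.log10(pwr_voc)))   # human voice segment statistic
    Pn = float(np.median(np.log10(pwr_sil)))   # non-human voice statistic
    snr = 10.0 * (Pv - Pn)                     # ratio of the medians, in dB
    r = (min(max(snr, snr_lo), snr_hi) - snr_lo) / (snr_hi - snr_lo)
    return r * r                               # assumed sharpening correction
```

For example, voice frames at unit power against residual frames at 1e−4 give a 40 dB ratio, which lands well inside the assumed bounds.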
And 205, determining the human voice clarity of the target dry audio according to the human voice existence probabilities corresponding to the human voice audio frames.
Specifically, the human voice clarity of the target dry audio can be determined from the median of the human voice existence probabilities corresponding to all human voice audio frames. That is, the greater the human voice existence probabilities of the human voice audio frames, the higher the human voice clarity.
The corresponding human voice clarity can be expressed as:
clarity = Q_1/2({probs(n) | n ∈ Seg_voc}).
and 206, determining the product of the signal-to-noise ratio estimation value and the human sound definition as the human sound quality information of the target dry audio.
Wherein the human voice quality information can be regarded as a body score in the audio quality information.
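Steps 205 and 206 above reduce to a median and a product; a minimal sketch follows (the function name `voice_quality` is illustrative):

```python
import numpy as np

def voice_quality(snr_est, probs_voc):
    """Human voice quality = SNR estimate times clarity, where clarity is
    the median voice existence probability over the human voice frames."""
    clarity = float(np.median(probs_voc))
    return snr_est * clarity
```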
207, determining a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frame.
The noise penalty parameter is determined by the noise intensity: the greater the noise, the larger the noise penalty parameter, which can be a value in the range 0 to 1. The noise here can refer to various noises, such as common environmental noise.
The penalty due to background noise may be calculated as follows. First, the power mean value corresponding to each non-human voice audio frame is determined, where the power mean value is determined from the mean of the power values at all frequency points. Then, the median of the power mean values corresponding to all the non-human voice audio frames is determined. Finally, the noise penalty parameter of the target dry audio is determined from this median of the power mean values; the noise penalty parameter increases with the median of the power mean value.
The log power upper limit P_nU and lower limit P_nL of the ambient noise can be defined. When the non-human voice energy is too large (considered non-negligible when greater than the lower limit), the noise penalty parameter is calculated as:
β_n = g((min(max(Pn, P_nL), P_nU) − P_nL) / (P_nU − P_nL)),
where g(·) is a correction function with a sharpening effect.
When calculating the noise penalty parameter, all the non-human voice audio frames may also be divided into a plurality of segments, the noise penalty parameter calculated for each segment according to the above method, and the maximum among the segment noise penalty parameters selected as the noise penalty parameter of the target dry audio.
The segmentation can be based on a preset number of frames, or consecutive non-human voice audio frames can be grouped into one segment.
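The per-segment noise penalty with a maximum over segments can be sketched as follows. The log-power bounds and the squaring used for the correction g are illustrative assumptions (the source gives them only as images); the penalty grows with the noise level and lies in [0, 1], matching the description above.

```python
import numpy as np

def noise_penalty(sil_power_segments, logp_lo=-6.0, logp_hi=-2.0):
    """Per non-human voice segment, map the median log frame power to
    [0, 1] (louder noise -> larger penalty); take the maximum over segments."""
    best = 0.0
    for seg in sil_power_segments:
        med = float(np.median(np.log10(seg)))      # median frame power (log)
        t = (med - logp_lo) / (logp_hi - logp_lo)  # map to [0, 1]
        t = min(max(t, 0.0), 1.0)
        best = max(best, t * t)                    # assumed sharpening g
    return best
```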
208, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected.
The specific processing may be as follows:
First, a first power penalty sub-parameter is determined.
The audio frames to be detected may be the audio frames in the target dry audio whose power mean exceeds the mute power threshold. The first power penalty sub-parameter is determined according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold: when the ratio is smaller than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio; when the ratio is larger than the ratio threshold, the first power penalty sub-parameter is a fixed value.
The minimum effective audio length can be defined as T_min = 1 s, and the corresponding frame number N_frmmin is calculated as:
Figure BDA0003175820660000211
The frame number N_a of the effective power sequence Pwr(n) is obtained. If N_a < N_frmmin, the entire audio is regarded as having too-low input energy (insufficient effective high-energy audio data), and the audio quality information can be directly set to 0.
If N_a ≥ N_frmmin, the effective power frame ratio can be further calculated as:
Figure BDA0003175820660000212
If the ratio is too low, i.e. below the ratio threshold (e.g. 0.1), the first power penalty sub-parameter is determined as β_a = r_a + 0.9; otherwise there is no penalty here, i.e. the first power penalty sub-parameter is β_a = 1.
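This sub-parameter reduces to a short function; a minimal sketch following the formula in the text (β_a = r_a + 0.9 below the ratio threshold, otherwise 1), where the threshold value 0.1 is the example given above:

```python
R_THRESH = 0.1  # example ratio threshold from the text

def first_power_penalty(valid_frames, total_frames):
    """beta_a from the effective-frame ratio r_a = valid/total.

    If the ratio of frames whose power mean exceeds the mute threshold
    is at most R_THRESH, beta_a = r_a + 0.9, so a lower ratio gives a
    stronger penalty; otherwise there is no penalty and beta_a = 1.
    """
    r_a = valid_frames / total_frames
    return r_a + 0.9 if r_a <= R_THRESH else 1.0
```

At r_a = 0.1 both branches give 1.0, so the mapping is continuous at the threshold.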
Then, a second power penalty sub-parameter and a third power penalty sub-parameter are determined.
The power mean corresponding to each audio frame to be detected is determined, where the power mean is the average of the power values of all frequency points. The average of the power means corresponding to all the audio frames to be detected is then determined to obtain the total power mean of the target dry audio. The second and third power penalty sub-parameters are determined according to the total power mean and preset upper and lower power limits: when the total power mean is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean minus the power upper limit; when the total power mean is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean; and when the total power mean lies between the power lower limit and the power upper limit, both the second and the third power penalty sub-parameters are fixed values.
The calculation process may be as follows:
the average power value of the entire audio is calculated as:
Figure BDA0003175820660000213
where the logarithm base x can be set as required, for example 10 or e.
Setting a lower power limit
Figure BDA0003175820660000214
and an upper power limit
Figure BDA0003175820660000215
(1) Determination of excessive power
If
Figure BDA0003175820660000216
then the overall energy is considered too large, and the following processing is performed: a maximum threshold is set
Figure BDA0003175820660000221
The probability of excessive power is expressed as
Figure BDA0003175820660000222
The second power penalty sub-parameter is calculated as β_U = 1 − r_Uextrem.
If
Figure BDA0003175820660000223
then the second power penalty sub-parameter is β_U = 1.
(2) Determination of insufficient power
If
Figure BDA0003175820660000224
then the overall energy is considered too small, and the following processing is performed: a minimum threshold is set
Figure BDA0003175820660000225
The probability of insufficient power is expressed as
Figure BDA0003175820660000226
The third power penalty sub-parameter is calculated as β_L = 1 − r_Lextrem.
If
Figure BDA0003175820660000227
then the third power penalty sub-parameter is β_L = 1.
Finally, the product of the first, second and third power penalty sub-parameters is determined as the power penalty parameter of the target dry audio: β_W = β_a·β_U·β_L.
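The over/under-power determination and the combined penalty β_W = β_a·β_U·β_L can be sketched as follows. The concrete limits and extreme thresholds are assumptions, since the patent's values appear only as equation images:

```python
import numpy as np

# Assumed log-power limits and extreme thresholds (illustrative only).
PWR_HI, PWR_LO = -10.0, -50.0            # overall upper / lower limits
PWR_HI_EXTREME, PWR_LO_EXTREME = -5.0, -60.0

def power_penalties(frame_log_powers):
    """Return (beta_U, beta_L) from per-frame mean log-powers.

    If the overall mean is at or above the upper limit, beta_U is
    1 minus the fraction of frames above the maximum threshold
    (r_Uextrem); the symmetric rule gives beta_L for low power.
    """
    p = np.asarray(frame_log_powers, dtype=float)
    mean_pwr = p.mean()
    beta_u = beta_l = 1.0
    if mean_pwr >= PWR_HI:                   # overall energy too large
        r_u = np.mean(p > PWR_HI_EXTREME)    # excessive-power probability
        beta_u = 1.0 - r_u
    if mean_pwr <= PWR_LO:                   # overall energy too small
        r_l = np.mean(p < PWR_LO_EXTREME)    # insufficient-power probability
        beta_l = 1.0 - r_l
    return beta_u, beta_l

def power_penalty(beta_a, beta_u, beta_l):
    """beta_W is the product of the three sub-parameters."""
    return beta_a * beta_u * beta_l
```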
When calculating the power penalty parameter, only one of the first, second and third power penalty sub-parameters, or the product of any two of them, may also be selected. The power penalty parameter is a value in the range 0-1, and the first, second and third power penalty sub-parameters are likewise values in the range 0-1.
Optionally, when determining the power penalty parameter, only one or two of the three power penalty sub-parameters may be used, and power penalty sub-parameters other than these three may also be used.
209, determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter and the power penalty parameter.
After the noise penalty parameter and the power penalty parameter are determined, they may be multiplied to obtain the final penalty parameter: β = β_W·β_bkn.
The audio quality information of the target dry audio may then be expressed as: p_clean = β·r_snr·r_c.
Optionally, the scheme may also directly use the human voice quality information as the audio quality information of the dry audio without considering the noise penalty parameter and the power penalty parameter.
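The combination β = β_W·β_bkn and p_clean = β·r_snr·r_c above reduces to a product of factors; a trivial sketch, assuming each factor has already been normalized into [0, 1]:

```python
def audio_quality(beta_w, beta_bkn, r_snr, r_c):
    """p_clean = (beta_W * beta_bkn) * r_snr * r_c.

    beta_w: power penalty, beta_bkn: noise penalty,
    r_snr: SNR estimate, r_c: human voice clarity; the SNR and
    clarity product is the human voice quality (main body score).
    """
    beta = beta_w * beta_bkn     # final penalty parameter
    return beta * r_snr * r_c
```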
Fig. 8 is a schematic diagram of the detection process of the audio quality information described above.
In addition, in the detection process of the human voice audio frame and the non-human voice audio frame, there are two possible situations, and the calculation process of the corresponding audio quality information may be as follows:
Case one: no human voice audio frame is detected.
And determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected. And determining the average value of the existence probabilities of the voices corresponding to all the audio frames to be detected. And determining the audio quality information of the target dry audio according to the average value of the existence probability of the human voice and the power penalty parameter. Since the human voice audio distribution is relatively stable (has a short-time stationary characteristic), an average value with higher processing efficiency can be adopted. Of course, median values may also be used.
In this case, the overall sound quality is considered to be poor, the user may not sing, and the obtained audio data is accompaniment or other noise. In this case, the power penalty parameter may be calculated according to the above method, and in addition, a main body score may be calculated based on the human voice existence probability of the non-human voice audio frames (all the audio frames to be detected are non-human voice frames), and the power penalty parameter and the main body score may be multiplied to obtain the audio quality information. The audio quality information calculation formula may be as follows:
Figure BDA0003175820660000231
Case two: no non-human voice audio frame is detected.
The power penalty parameter of the target dry audio is determined according to the power spectrum corresponding to each audio frame to be detected. The median of the human voice existence probabilities corresponding to all the audio frames to be detected is determined. The audio quality information of the target dry audio is then determined according to this median and the power penalty parameter. Some probability values among the audio frames may change abnormally, and the median is used here to prevent individual extreme probability values from affecting the final result too much.
In this case, the audio frames to be detected can all be considered human voice audio frames; however, non-voice components often exist in actual singing, so false detections may exist. The power penalty parameter may be calculated according to the method above, a main body score may be calculated based on the human voice existence probabilities of the human voice audio frames (all the audio frames to be detected are human voice frames), and the power penalty parameter and the main body score may be multiplied to obtain the audio quality information. The audio quality information calculation formula may be as follows:
p_clean = β_W · 0.9 · Q_{1/2}(probs(n)), where Q_{1/2}(·) denotes the median.
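The two edge cases can be sketched as follows; the exact scaling in case one is given only as an equation image, so a plain mean is assumed there:

```python
import numpy as np

def quality_no_voice(beta_w, probs):
    """Case one: no voice frames detected.

    The body score is based on the mean voice-presence probability of
    the frames (the exact scaling in the patent is an equation image;
    a direct mean is assumed here).
    """
    return beta_w * float(np.mean(probs))

def quality_all_voice(beta_w, probs):
    """Case two: no non-voice frames detected.

    p_clean = beta_W * 0.9 * median(probs); the median guards against
    a few extreme probability values among the frames.
    """
    return beta_w * 0.9 * float(np.median(probs))
```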
Based on the above process, audio quality information can be calculated for the user-uploaded dry audio stored in the database, and whether each dry audio needs to be deleted can then be determined from its audio quality information. The specific deletion mechanism may be set arbitrarily as required. For example, dry audio whose audio quality information is lower than a preset threshold is deleted. As another example, for an account that has not logged in for more than a first preset time period, dry audio that has not been accessed for more than a second preset time period is obtained, and if its audio quality information is lower than the preset threshold, that dry audio is deleted. As yet another example, each dry audio is scored by weighting information of multiple dimensions such as the audio quality information, access volume and account activity, and dry audio whose score is lower than a preset score threshold is deleted.
In addition, the audio quality information may be stored and used as a reference attribute when recommending dry audio. Specifically, the audio quality information and other attribute information of a dry audio can be input into a first feature extraction model to obtain feature information of the dry audio; the account attributes of a target user account are input into a second feature extraction model to obtain feature information of the user account; the two pieces of feature information are then input into a scoring model to obtain a matching-degree score between the dry audio and the user account; and the dry audio recommended to the user account is determined based on the matching-degree score of each dry audio.
According to the embodiment of the application, the human voice existence probability of each audio frame in the dry audio is identified through its power spectrum, human voice audio frames and non-human voice audio frames are identified accordingly, and the audio quality information of the dry audio is determined based on the power spectra of the human voice audio frames and of the non-human voice audio frames. Because high-quality dry audio is close to silence in its non-voice parts, the audio quality of the dry audio can be judged more accurately based on the power conditions of the human voice audio frames and the non-human voice audio frames.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
An embodiment of the present application further provides an apparatus for detecting audio quality, where the apparatus may be applied to the server in the foregoing embodiment, and as shown in fig. 9, the apparatus includes:
a determining module 910, configured to determine, according to a power spectrum corresponding to each audio frame to be detected of the target dry audio, a human voice fundamental frequency estimated value corresponding to each audio frame to be detected, where the power spectrum includes power values of each frequency point;
the weighting module 920 is configured to perform weighting processing on the power value of each frequency point in the power spectrum of each audio frame to be detected, so as to obtain a power spectrum after weighting processing, where a weight of a frequency point which is a positive integer multiple of a human voice fundamental frequency estimated value corresponding to the audio frame to be detected is greater than weights of other frequency points;
a probability module 930, configured to determine a voice existence probability of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after weighting processing;
the detecting module 940 is configured to detect a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
a quality module 950, configured to determine the audio quality information of the target dry audio according to the power spectrum corresponding to the human audio frame and the power spectrum corresponding to the non-human audio frame.
In a possible implementation manner, the determining module 910 is configured to:
and determining a human voice base frequency estimation value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information.
In a possible implementation manner, the human voice frequency characteristic information is a human voice fundamental frequency range, and the determining module 910 is configured to:
performing smoothing treatment of a preset window length on a power spectrum corresponding to each audio frame to be detected;
and respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the human voice fundamental frequency range as the human voice fundamental frequency estimated value corresponding to each audio frame to be detected.
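The fundamental-frequency estimation step above (smoothing the power spectrum, then taking the minimum peak frequency within the human voice fundamental range) can be sketched as follows; the window length and the 80-400 Hz range are illustrative assumptions, not values from the patent:

```python
import numpy as np

F0_LO, F0_HI = 80.0, 400.0   # assumed human-voice fundamental range (Hz)

def estimate_f0(power_spectrum, freqs, win=5):
    """Smooth the power spectrum with a moving average of length `win`,
    then return the lowest-frequency local peak of the smoothed
    spectrum inside [F0_LO, F0_HI] as the fundamental-frequency
    estimate, or None if no peak is found in the range."""
    kernel = np.ones(win) / win
    smooth = np.convolve(power_spectrum, kernel, mode="same")
    for i in range(1, len(smooth) - 1):
        if F0_LO <= freqs[i] <= F0_HI and smooth[i - 1] < smooth[i] > smooth[i + 1]:
            return float(freqs[i])   # minimum peak frequency in range
    return None
```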
In a possible implementation manner, the weighting module 920 is configured to:
according to the voice base frequency estimated value corresponding to each audio frame to be detected, constructing a weight coefficient function corresponding to each audio frame to be detected, wherein the weight coefficient function is used for representing weights corresponding to different frequency points, the waveform of the weight coefficient function has a plurality of wave crests, and the plurality of wave crests respectively correspond to positive integral multiples of the voice base frequency estimated value;
and for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame to be detected by the weight coefficient function to obtain the power spectrum after weighting processing.
In one possible implementation, the weight coefficient function is a trigonometric function.
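One natural trigonometric choice for the weight coefficient function is a raised cosine whose peaks fall on integer multiples of the fundamental-frequency estimate; the exact function in the patent is given only as an equation image, so this is an assumed instance:

```python
import numpy as np

def harmonic_weights(freqs, f0):
    """Trigonometric weight coefficient function: a raised cosine whose
    peaks (weight 1) fall on positive integer multiples of the
    fundamental-frequency estimate f0, and whose troughs (weight 0)
    fall halfway between harmonics."""
    return 0.5 * (1.0 + np.cos(2.0 * np.pi * np.asarray(freqs) / f0))

def weighted_power(power_spectrum, freqs, f0):
    """Multiply the power spectrum by the weights, giving harmonic
    frequency points larger weights than the other points."""
    return np.asarray(power_spectrum) * harmonic_weights(freqs, f0)
```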
In one possible implementation, the probability module 930 is configured to:
for each audio frame to be detected, determining the ratio of the total power of the power spectrum subjected to weighting processing to the total power of the power spectrum not subjected to weighting processing, and performing normalization processing on the ratio according to a preset ratio upper limit and a preset ratio lower limit to obtain a normalized ratio which is used as the existence probability of the human voice corresponding to the audio frame to be detected.
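The normalized weighted-to-unweighted power ratio described above can be sketched as follows; the normalization bounds are assumptions, since the patent does not state their values here:

```python
import numpy as np

RATIO_LO, RATIO_HI = 0.2, 0.6   # assumed ratio lower / upper bounds

def voice_presence_prob(power_spectrum, weighted_spectrum):
    """Ratio of the weighted total power to the unweighted total power,
    linearly normalized into [0, 1] using the preset bounds. A voiced
    frame concentrates power at the harmonic weight peaks, so its
    ratio, and hence its probability, is high."""
    ratio = np.sum(weighted_spectrum) / np.sum(power_spectrum)
    return float(np.clip((ratio - RATIO_LO) / (RATIO_HI - RATIO_LO), 0.0, 1.0))
```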
In a possible implementation manner, the detecting module 940 is configured to:
detecting a voice audio frame in the audio frames to be detected of the target dry audio according to the voice existence probability and the voice detection probability threshold value corresponding to each audio frame to be detected;
and detecting the non-human voice audio frames in the audio frames to be detected of the target dry audio according to the human voice existence probability and the non-human voice detection probability threshold value corresponding to each audio frame to be detected.
In a possible implementation manner, the quality module 950 is further configured to: determine a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames; and determine a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected;
the quality module 950 is configured to:
determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame;
and determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter and the power penalty parameter.
In one possible implementation, the quality module 950 is configured to:
determining a signal-to-noise ratio estimation value of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame;
determining the human sound definition of the target dry sound according to the human sound existence probability corresponding to the human sound audio frame;
and determining the product of the signal-to-noise ratio estimation value and the human voice definition as the human voice quality information of the target dry audio.
In one possible implementation, the quality module 950 is configured to:
determining a power average value corresponding to each voice audio frame and a power average value corresponding to each non-human audio frame, wherein the power average value is determined according to an average value of power values of all frequency points;
determining a first median of the power means corresponding to all the human voice audio frames and determining a second median of the power means corresponding to all the non-human voice audio frames;
and calculating a signal-to-noise ratio estimation value according to the ratio of the first median to the second median.
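The median-based signal-to-noise estimate above can be sketched as follows; expressing it in dB, and any subsequent normalization into a score, are assumptions:

```python
import numpy as np

def snr_estimate(voice_frame_powers, nonvoice_frame_powers):
    """SNR estimate from the ratio of the median per-frame power of the
    voice frames (first median) to that of the non-voice frames
    (second median), expressed in dB. Medians keep the estimate
    robust to a few outlier frames."""
    p_voice = float(np.median(voice_frame_powers))
    p_noise = float(np.median(nonvoice_frame_powers))
    return 10.0 * np.log10(p_voice / p_noise)
```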
In one possible implementation, the quality module 950 is configured to:
and determining the human sound definition of the target dry sound frequency according to the median of the human sound existence probability corresponding to all human sound audio frames.
In a possible implementation manner, the quality module 950 is further configured to:
if no human voice audio frame is detected, determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected; determining the average value of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry sound audio according to the average value of the existence probability of the human voice and the power penalty parameter.
In a possible implementation manner, the quality module 950 is further configured to:
if the non-human voice audio frames are not detected, determining a power penalty parameter of the target dry audio according to a power spectrum corresponding to each audio frame to be detected; determining the median of the existence probabilities of the voices corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry sound audio according to the median of the existence probability of the human voice and the power penalty parameter.
In one possible implementation, the quality module 950 is configured to:
determining a power mean value corresponding to each non-human audio frame, wherein the power mean value is determined according to the mean value of the power values of all frequency points;
determining the median of the power mean values corresponding to all the non-human voice audio frames;
and determining a noise penalty parameter of the target dry sound audio according to the median of the power mean, wherein the noise penalty parameter is inversely related to the median of the power mean.
In a possible implementation manner, the audio frame to be detected is an audio frame in which a power mean value in the target dry audio is greater than a mute power threshold, wherein the power mean value is determined according to an average value of power values of each frequency point;
the quality module 950 is configured to:
determining a power mean value corresponding to each audio frame to be detected, wherein the power mean value is determined according to the mean value of the power values of all frequency points;
determining the average value of the power average values corresponding to all audio frames to be detected to obtain the total power average value of the target trunk audio;
and determining the power penalty parameter of the target dry audio according to the total power mean value and the ratio of the number of the audio frames to be detected to the total number of the audio frames of the target dry audio.
In one possible implementation, the quality module 950 is configured to:
determining a first power penalty sub-parameter according to the ratio of the number of the audio frames to be detected to the total number of the audio frames of the target dry audio and a preset ratio threshold, wherein when the ratio is smaller than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is larger than the ratio threshold, the first power penalty sub-parameter is a fixed value;
determining a second power penalty sub-parameter and a third power penalty sub-parameter according to the total power mean value and preset power upper and lower limits, wherein when the total power mean value is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean value minus the power upper limit; when the total power mean value is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean value; and when the total power mean value is less than the power upper limit and greater than the power lower limit, the second power penalty sub-parameter and the third power penalty sub-parameter are both fixed values;
and determining the product of the first power penalty sub-parameter, the second power penalty sub-parameter and the third power penalty sub-parameter as the power penalty parameter of the target dry audio.
According to the embodiment of the application, the human voice existence probability of each audio frame in the dry audio is identified through its power spectrum, human voice audio frames and non-human voice audio frames are identified accordingly, and the audio quality information of the dry audio is determined based on the power spectra of the human voice audio frames and of the non-human voice audio frames. Because high-quality dry audio is close to silence in its non-voice parts, the audio quality of the dry audio can be judged more accurately based on the power conditions of the human voice audio frames and the non-human voice audio frames.
It should be noted that: in the apparatus for detecting audio quality according to the foregoing embodiment, when detecting audio quality, only the division of the functional modules is described as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules to complete all or part of the above described functions. In addition, the apparatus for detecting audio quality and the method for detecting audio quality provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1000 may vary considerably in configuration and performance, and may include one or more processors 1001 and one or more memories 1002, where the memory 1002 stores at least one instruction that is loaded and executed by the processor 1001 to implement the methods provided by the foregoing method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard and an input/output interface for performing input and output, and may include other components for implementing the functions of the device, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory, including instructions executable by a processor in a terminal to perform the method of detecting audio quality in the above embodiments is also provided. The computer readable storage medium may be non-transitory. For example, the computer-readable storage medium may be a ROM (Read-Only Memory), a RAM (Random Access Memory), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (18)

1. A method of detecting audio quality, the method comprising:
determining a human voice base frequency estimation value corresponding to each audio frame to be detected according to a power spectrum corresponding to each audio frame to be detected of the target dry sound audio, wherein the power spectrum comprises power values of all frequency points;
for each audio frame to be detected, performing weight multiplication processing on the power value of each frequency point in the power spectrum of the audio frame to be detected to obtain a power spectrum after weight multiplication processing, wherein the weight of the frequency point of positive integral multiple of the human voice base frequency estimated value corresponding to the audio frame to be detected is greater than the weight of other frequency points;
determining the existence probability of the human voice of each audio frame to be detected according to the corresponding power spectrum of each audio frame to be detected and the power spectrum after weighting processing;
detecting a human voice audio frame and a non-human voice audio frame in the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected;
and determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frame and the power spectrum corresponding to the non-human voice audio frame.
2. The method according to claim 1, wherein determining the human voice fundamental frequency estimation value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected of the target dry audio comprises:
and determining a human voice base frequency estimation value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information.
3. The method according to claim 2, wherein the human voice frequency characteristic information is a human voice fundamental frequency range, and the determining the human voice fundamental frequency estimated value corresponding to each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and preset human voice frequency characteristic information comprises:
performing smoothing treatment of a preset window length on a power spectrum corresponding to each audio frame to be detected;
and respectively determining the minimum peak frequency of the smoothed power spectrum corresponding to each audio frame to be detected in the human voice fundamental frequency range as the human voice fundamental frequency estimated value corresponding to each audio frame to be detected.
4. The method according to claim 1, wherein for each audio frame to be detected, performing weighting processing on a power value of each frequency point in a power spectrum of the audio frame to be detected to obtain a power spectrum after weighting processing, and the method comprises:
according to the voice base frequency estimated value corresponding to each audio frame to be detected, constructing a weight coefficient function corresponding to each audio frame to be detected, wherein the weight coefficient function is used for representing weights corresponding to different frequency points, the waveform of the weight coefficient function has a plurality of wave crests, and the plurality of wave crests respectively correspond to positive integral multiples of the voice base frequency estimated value;
and for each audio frame to be detected, multiplying the power spectrum corresponding to the audio frame to be detected by the weight coefficient function to obtain the power spectrum after weighting processing.
5. The method of claim 4, wherein the weight coefficient function is a trigonometric function.
6. The method according to claim 1, wherein determining the existence probability of the human voice of each audio frame to be detected according to the power spectrum corresponding to each audio frame to be detected and the power spectrum after the weighting processing comprises:
for each audio frame to be detected, determining the ratio of the total power of the power spectrum subjected to weighting processing to the total power of the power spectrum not subjected to weighting processing, and performing normalization processing on the ratio according to a preset ratio upper limit and a preset ratio lower limit to obtain a normalized ratio which is used as the existence probability of the human voice corresponding to the audio frame to be detected.
7. The method according to claim 1, wherein detecting human voice audio frames and non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected comprises:
detecting the human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a human voice detection probability threshold;
and detecting the non-human voice audio frames among the audio frames to be detected of the target dry audio according to the human voice existence probability corresponding to each audio frame to be detected and a non-human voice detection probability threshold.
8. The method of claim 1, further comprising: determining a noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames; and determining a power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected;
wherein determining the audio quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
and determining the audio quality information of the target dry audio according to the human voice quality information, the noise penalty parameter and the power penalty parameter.
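Claim 8 does not fix how the three quantities are combined into a final score; a simple multiplicative combination is one plausible reading, sketched here purely as an assumption.

```python
def audio_quality(voice_quality, noise_penalty, power_penalty):
    """Combine the human-voice quality score with the two penalty
    factors of claim 8. Multiplication is an illustrative choice:
    either penalty falling toward 0 drags the overall score down."""
    return voice_quality * noise_penalty * power_penalty
```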
9. The method according to claim 8, wherein determining the human voice quality information of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a signal-to-noise ratio estimation value of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames;
determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames;
and determining the product of the signal-to-noise ratio estimation value and the human voice clarity as the human voice quality information of the target dry audio.
10. The method of claim 9, wherein determining the signal-to-noise ratio estimation value of the target dry audio according to the power spectrum corresponding to the human voice audio frames and the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power mean value corresponding to each human voice audio frame and a power mean value corresponding to each non-human voice audio frame, wherein the power mean value is determined according to the average of the power values of all frequency points;
determining a first median of the power mean values corresponding to all human voice audio frames, and determining a second median of the power mean values corresponding to all non-human voice audio frames;
and calculating the signal-to-noise ratio estimation value according to the ratio of the first median to the second median.
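The median-of-frame-power SNR estimate of claim 10 can be sketched as below. Each frame is taken to be an array of per-bin power values; the claim only requires a ratio of the two medians, so returning the raw ratio (rather than a dB value) is itself an assumption.

```python
import numpy as np

def snr_estimate(voice_frames, non_voice_frames):
    """First median (per-frame power means of the human voice frames)
    divided by the second median (per-frame power means of the
    non-human-voice frames), per claim 10. Medians make the estimate
    robust to a few outlier frames, unlike plain averages."""
    first_median = np.median([np.mean(f) for f in voice_frames])
    second_median = np.median([np.mean(f) for f in non_voice_frames])
    return first_median / max(second_median, 1e-12)
```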
11. The method of claim 9, wherein determining the human voice clarity of the target dry audio according to the human voice existence probability corresponding to the human voice audio frames comprises:
determining the human voice clarity of the target dry audio according to the median of the human voice existence probabilities corresponding to all human voice audio frames.
12. The method of claim 1, further comprising:
if no human voice audio frame is detected, determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected; determining the average of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the average of the human voice existence probabilities and the power penalty parameter.
13. The method of claim 1, further comprising:
if no non-human voice audio frame is detected, determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected; determining the median of the human voice existence probabilities corresponding to all audio frames to be detected;
and determining the audio quality information of the target dry audio according to the median of the human voice existence probabilities and the power penalty parameter.
14. The method according to any one of claims 8 to 13, wherein determining the noise penalty parameter of the target dry audio according to the power spectrum corresponding to the non-human voice audio frames comprises:
determining a power mean value corresponding to each non-human voice audio frame, wherein the power mean value is determined according to the average of the power values of all frequency points;
determining the median of the power mean values corresponding to all non-human voice audio frames;
and determining the noise penalty parameter of the target dry audio according to the median of the power mean values, wherein the noise penalty parameter is negatively correlated with the median of the power mean values.
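One way to realize the negative correlation required by claim 14 is sketched below; the reciprocal mapping and the reference power are placeholder assumptions, since the claim fixes only the direction of the relationship.

```python
import numpy as np

def noise_penalty(non_voice_frames, ref_power=1e-4):
    """Penalty factor in (0, 1] that shrinks as the median per-frame
    power mean of the non-human-voice frames grows. Claim 14 only
    requires a negative correlation; this particular curve and the
    reference power ref_power are hypothetical."""
    median_power = np.median([np.mean(f) for f in non_voice_frames])
    return 1.0 / (1.0 + median_power / ref_power)
```

Quiet background frames leave the penalty near 1 (no deduction), while loud background noise drives it toward 0.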
15. The method according to any one of claims 8 to 13, wherein the audio frames to be detected are the audio frames of the target dry audio whose power mean value is greater than a mute power threshold, wherein the power mean value is determined according to the average of the power values of all frequency points;
and wherein determining the power penalty parameter of the target dry audio according to the power spectrum corresponding to each audio frame to be detected comprises:
determining the power mean value corresponding to each audio frame to be detected, wherein the power mean value is determined according to the average of the power values of all frequency points;
determining the average of the power mean values corresponding to all audio frames to be detected to obtain the total power mean value of the target dry audio;
and determining the power penalty parameter of the target dry audio according to the total power mean value and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio.
16. The method according to claim 15, wherein determining the power penalty parameter of the target dry audio according to the total power mean value and the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio comprises:
determining a first power penalty sub-parameter according to the ratio of the number of audio frames to be detected to the total number of audio frames of the target dry audio and a preset ratio threshold, wherein, when the ratio is smaller than or equal to the ratio threshold, the first power penalty sub-parameter is negatively correlated with the difference of the ratio threshold minus the ratio, and when the ratio is greater than the ratio threshold, the first power penalty sub-parameter is a fixed value;
determining a second power penalty sub-parameter and a third power penalty sub-parameter according to the total power mean value and a preset power upper limit and power lower limit, wherein, when the total power mean value is greater than or equal to the power upper limit, the second power penalty sub-parameter is negatively correlated with the difference of the total power mean value minus the power upper limit; when the total power mean value is less than or equal to the power lower limit, the third power penalty sub-parameter is negatively correlated with the difference of the power lower limit minus the total power mean value; and when the total power mean value is less than the power upper limit and greater than the power lower limit, the second power penalty sub-parameter and the third power penalty sub-parameter are both fixed values;
and determining the product of the first power penalty sub-parameter, the second power penalty sub-parameter and the third power penalty sub-parameter as the power penalty parameter of the target dry audio.
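The three sub-parameters of claim 16 might look like the sketch below. Every threshold and slope is a placeholder: the claim fixes only the monotonic relations in each region and the fixed-value regions between them.

```python
def power_penalty(total_power_mean_db, detected_ratio,
                  ratio_threshold=0.5,
                  power_upper_db=-10.0, power_lower_db=-40.0):
    """Product of the three penalty sub-parameters of claim 16.
    All thresholds and the linear 0.05/dB slopes are hypothetical."""
    # First sub-parameter: shrinks as the share of non-silent frames
    # falls below the ratio threshold; fixed at 1 above it.
    if detected_ratio <= ratio_threshold:
        p1 = max(0.0, 1.0 - (ratio_threshold - detected_ratio))
    else:
        p1 = 1.0
    # Second sub-parameter: shrinks as total power exceeds the upper limit
    # (recording too hot); fixed at 1 below it.
    if total_power_mean_db >= power_upper_db:
        p2 = max(0.0, 1.0 - 0.05 * (total_power_mean_db - power_upper_db))
    else:
        p2 = 1.0
    # Third sub-parameter: shrinks as total power falls below the lower
    # limit (recording too quiet); fixed at 1 above it.
    if total_power_mean_db <= power_lower_db:
        p3 = max(0.0, 1.0 - 0.05 * (power_lower_db - total_power_mean_db))
    else:
        p3 = 1.0
    return p1 * p2 * p3
```

A recording with enough non-silent frames and a total power inside the preset band incurs no deduction; clipping-level loudness, near-silence, or a mostly empty take each pulls its own factor below 1.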
17. A computer device, comprising a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to perform the operations performed by the method for detecting audio quality according to any one of claims 1 to 16.
18. A computer-readable storage medium, wherein the storage medium stores at least one instruction, and the instruction is loaded and executed by a processor to perform the operations performed by the method for detecting audio quality according to any one of claims 1 to 16.
CN202110831738.2A 2021-07-22 2021-07-22 Method, device and storage medium for detecting audio quality Pending CN113593604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110831738.2A CN113593604A (en) 2021-07-22 2021-07-22 Method, device and storage medium for detecting audio quality


Publications (1)

Publication Number Publication Date
CN113593604A true CN113593604A (en) 2021-11-02

Family

ID=78249051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110831738.2A Pending CN113593604A (en) 2021-07-22 2021-07-22 Method, device and storage medium for detecting audio quality

Country Status (1)

Country Link
CN (1) CN113593604A (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08221092A (en) * 1995-02-17 1996-08-30 Hitachi Ltd Noise eliminating system using spectral subtraction
CN104269180A (en) * 2014-09-29 2015-01-07 华南理工大学 Quasi-clean voice construction method for voice quality objective evaluation
EP2830064A1 (en) * 2013-07-22 2015-01-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for decoding and encoding an audio signal using adaptive spectral tile selection
CN106653048A (en) * 2016-12-28 2017-05-10 上海语知义信息技术有限公司 Method for separating sound of single channels on basis of human sound models
WO2017147951A1 (en) * 2016-03-01 2017-09-08 邦彦技术股份有限公司 Method and device for objective voice quality assessment processing of internet phone calls
US20190313187A1 (en) * 2018-04-05 2019-10-10 Holger Stoltze Controlling the direction of a microphone array beam in a video conferencing system
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network
CN110867194A (en) * 2019-11-05 2020-03-06 腾讯音乐娱乐科技(深圳)有限公司 Audio scoring method, device, equipment and storage medium
CN112233689A (en) * 2020-09-24 2021-01-15 北京声智科技有限公司 Audio noise reduction method, device, equipment and medium
CN112967738A (en) * 2021-02-01 2021-06-15 腾讯音乐娱乐科技(深圳)有限公司 Human voice detection method and device, electronic equipment and computer readable storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG WENYI; YI XUE: "Adaptive noise tracking algorithm based on improved speech presence probability", Signal Processing, no. 01, 25 January 2020 (2020-01-25) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117476040A (en) * 2023-12-25 2024-01-30 Shenzhen Xinwenda Electronics Co., Ltd. Audio identification method and identification system
CN117476040B (en) * 2023-12-25 2024-03-29 Shenzhen Xinwenda Electronics Co., Ltd. Audio identification method and identification system

Similar Documents

Publication Publication Date Title
US10504539B2 (en) Voice activity detection systems and methods
KR101266894B1 (en) Apparatus and method for processing an audio signal for speech emhancement using a feature extraxtion
CN106486131B (en) A kind of method and device of speech de-noising
US7133826B2 (en) Method and apparatus using spectral addition for speaker recognition
CN111128213B (en) Noise suppression method and system for processing in different frequency bands
EP2828856B1 (en) Audio classification using harmonicity estimation
WO2022012195A1 (en) Audio signal processing method and related apparatus
CN110880329A (en) Audio identification method and equipment and storage medium
US20140177853A1 (en) Sound processing device, sound processing method, and program
CN108847253B (en) Vehicle model identification method, device, computer equipment and storage medium
Kumar Real-time performance evaluation of modified cascaded median-based noise estimation for speech enhancement system
CN112712816A (en) Training method and device of voice processing model and voice processing method and device
CN112151055B (en) Audio processing method and device
CN113593604A (en) Method, device and storage medium for detecting audio quality
CN111755025B (en) State detection method, device and equipment based on audio features
CN109741761B (en) Sound processing method and device
CN115223584B (en) Audio data processing method, device, equipment and storage medium
JP6724290B2 (en) Sound processing device, sound processing method, and program
CN113393852B (en) Method and system for constructing voice enhancement model and method and system for voice enhancement
CN112233693B (en) Sound quality evaluation method, device and equipment
CN116959486A (en) Customer satisfaction analysis method and device based on speech emotion recognition
CN115206345A (en) Music and human voice separation method, device, equipment and medium based on time-frequency combination
EP2760022B1 (en) Audio bandwidth dependent noise suppression
CN114512141A (en) Method, apparatus, device, storage medium and program product for audio separation
Andersen Wind noise reduction in single channel speech signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination