WO2015000401A1 - Audio signal classification processing method, apparatus, and device - Google Patents

Audio signal classification processing method, apparatus, and device

Info

Publication number
WO2015000401A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
classified
audio signal
frames
energy distribution
Prior art date
Application number
PCT/CN2014/081400
Other languages
English (en)
French (fr)
Inventor
许丽净
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2015000401A1 publication Critical patent/WO2015000401A1/zh


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/81Detection of presence or absence of voice signals for discriminating voice from music

Definitions

  • Audio signal classification processing method, apparatus, and device
  • The embodiments of the present invention relate to the field of signal processing technologies, and in particular to an audio signal classification processing method, apparatus, and device. Background art:
  • In practical applications, the signal to be analyzed may contain a music signal, such as a ring tone.
  • In that case the speech quality assessment model treats the music as a speech signal and gives an incorrect quality assessment.
  • The signal to be analyzed should therefore be classified before being input to the speech quality assessment module: if a segment of the signal is recognized as a speech signal, it is sent to the speech quality assessment module for quality evaluation; if it is recognized as a music signal, it is not.
  • The prior art provides an audio signal classification method applied to a joint speech/music encoder, but that method is designed for a joint encoder operating at a high sampling rate.
  • In the voice quality assessment scenario the music signal generally lacks high-frequency information, so the existing classification method designed for the joint speech/music encoder can identify only a small fraction of music signals; its low classification accuracy cannot meet the requirements of voice quality assessment.
  • The present invention provides an audio signal classification processing method, apparatus, and device for improving the classification accuracy of an audio signal.
  • A first aspect of the present invention provides an audio signal classification processing method, including: acquiring at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region; and
  • determining, according to the acquired quantity or quantities, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal.
  • Acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified includes: acquiring the pitch distribution parameters of the frame to be classified and of the N1 frames before it, and obtaining the number of tonal components satisfying the continuity constraint from those pitch distribution parameters, where N1 is a positive integer.
  • Determining that the frame to be classified in the audio signal is a music signal, and otherwise determining that it is a voice signal, includes: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that it is a voice signal. An illustrative sketch of this decision rule is given below.
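  • The following is a minimal Python sketch of that decision rule; the function name, argument names, and numeric thresholds are illustrative assumptions, not values specified by the patent:

    def classify_frame(num_continuous_tones, low_freq_run, high_freq_run,
                       thr_tones=5, thr_low=20, thr_high=20):
        """Label a frame as music if any of the three quantities exceeds its
        threshold (the first/second/third thresholds in the text); otherwise
        label it as voice. The defaults are placeholders, not patent values."""
        if (num_continuous_tones > thr_tones
                or low_freq_run > thr_low
                or high_freq_run > thr_high):
            return "music"
        return "voice"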
  • Acquiring the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before it includes: using the frequency-domain distribution information of the tonal components of the frame to be classified as its pitch distribution parameter, and the frequency-domain distribution information of the tonal components of the N1 frames before it as their pitch distribution parameters.
  • Obtaining the number of tonal components satisfying the continuity constraint from these pitch distribution parameters includes: obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified and of the N1 frames before it, the number of tonal components in the frame to be classified whose persistence (number of consecutive frames in which they appear) is greater than a sixth threshold. A sketch of such a count appears below.
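  • One way to realize this count is sketched below, assuming the tonal components of each frame are represented as FFT bin indices; the list layout, bin tolerance, and threshold are assumptions for illustration only:

    import numpy as np

    def count_persistent_tones(tone_bins_history, persist_thr=6, bin_tol=1):
        """Count tonal components of the current frame (last entry) that have
        persisted for more than `persist_thr` consecutive frames.

        tone_bins_history: list of 1-D integer arrays, oldest frame first,
        current frame last; each array holds the FFT bins detected as tonal.
        A tone in the previous frame within +/- bin_tol bins counts as the
        same tone. All parameter values are illustrative assumptions."""
        current = np.asarray(tone_bins_history[-1])
        count = 0
        for b in current:
            run, target = 1, b
            for prev in reversed(tone_bins_history[:-1]):
                prev = np.asarray(prev)
                hits = prev[np.abs(prev - target) <= bin_tol]
                if hits.size == 0:
                    break                      # continuity broken
                target = hits[np.argmin(np.abs(hits - target))]
                run += 1
            if run > persist_thr:
                count += 1
        return count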
  • Acquiring the energy distribution parameter of the frame to be classified and the energy distribution parameters of the N1 frames before it includes: using the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified as its energy distribution parameter, and those of the N1 frames before it as their energy distribution parameters.
  • Obtaining the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold.
  • Obtaining the number of consecutive frames of the frame to be classified in the high-frequency region from these energy distribution parameters includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
  • When the classification result of the frame to be classified is obtained with an output delay of L1 frames, where L1 is a positive integer, acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified includes: acquiring the pitch distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it, and obtaining the number of tonal components satisfying the continuity constraint from those parameters, where N2 is a positive integer; and
  • acquiring the energy distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it, and obtaining from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • Determining that the frame to be classified in the audio signal is a music signal, and otherwise determining that it is a voice signal, includes: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that it is a voice signal.
  • Acquiring the pitch distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it includes: using the frequency-domain distribution information of the tonal components of the frame to be classified as its pitch distribution parameter, that of the N2 frames before it as their pitch distribution parameters, and that of the L1 frames after it as their pitch distribution parameters.
  • Obtaining the number of tonal components satisfying the continuity constraint from these pitch distribution parameters includes: obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 frames before it, and the L1 frames after it, the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold.
  • Acquiring the energy distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it includes: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • Obtaining the number of consecutive frames of the frame to be classified in the low-frequency region includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold; obtaining the number of consecutive frames of the frame to be classified in the high-frequency region includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold. A sketch of these run-length counts follows.
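  • A minimal sketch of the two run-length counts, assuming per-frame sequences of high-frequency energy ratio and sound pressure level ordered oldest to newest and ending at the frame being classified (in the delayed variants the sequences would simply extend past that frame); the numeric thresholds are placeholders:

    def low_freq_run_length(hf_ratio, thr_low_ratio=0.1):
        """Consecutive frames, counted backwards from the frame to be
        classified, whose high-frequency energy distribution ratio is below
        the 'eighth threshold'."""
        run = 0
        for r in reversed(hf_ratio):
            if r < thr_low_ratio:
                run += 1
            else:
                break
        return run

    def high_freq_run_length(hf_ratio, spl, thr_high_ratio=0.4, thr_spl=30.0):
        """Consecutive frames whose high-frequency energy ratio exceeds the
        'ninth threshold' and whose sound pressure level exceeds the 'tenth
        threshold'; the numeric values are illustrative assumptions."""
        run = 0
        for r, p in zip(reversed(hf_ratio), reversed(spl)):
            if r > thr_high_ratio and p > thr_spl:
                run += 1
            else:
                break
        return run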
  • When the classification result of the frame to be classified is obtained with an output delay of L2+L3 frames, where L2 and L3 are positive integers, acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified includes: acquiring the pitch distribution parameters of the frame to be classified, of the N3 frames before it, and of the L2 frames after it, and obtaining the number of tonal components satisfying the continuity constraint from those parameters, where N3 is a positive integer; and
  • acquiring the energy distribution parameters of the frame to be classified, of the N3 frames before it, and of the L2 frames after it, and obtaining from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • Determining that the frame to be classified in the audio signal is a music signal, and otherwise determining that it is a voice signal, includes: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that it is a voice signal.
  • If the frame to be classified is determined to be a music signal, it is further determined whether the number of frames determined to be voice signals among the N4 frames before the frame to be classified and the L3 frames after it is greater than a fourth threshold; if so, the frame to be classified is corrected to a voice signal, where N4 is a positive integer. If the frame to be classified is determined to be a voice signal, it is determined whether the number of frames determined to be music signals among the N4 frames before the frame to be classified and the L3 frames after it is greater than a fifth threshold; if so, the frame to be classified is corrected to a music signal. A sketch of this correction step is shown below.
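  • A possible realization of this post-correction, assuming preliminary labels are already available for the neighbouring frames; the window sizes and vote thresholds are illustrative assumptions:

    def correct_label(labels, idx, n_before=10, n_after=5,
                      voice_thr=7, music_thr=7):
        """Flip the preliminary label of frame `idx` when most of its
        neighbourhood (n_before previous + n_after following frames)
        disagrees with it; the thresholds correspond loosely to the fourth
        and fifth thresholds in the text and are placeholders."""
        window = labels[max(0, idx - n_before):idx] + labels[idx + 1:idx + 1 + n_after]
        if labels[idx] == "music" and window.count("voice") > voice_thr:
            return "voice"
        if labels[idx] == "voice" and window.count("music") > music_thr:
            return "music"
        return labels[idx]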
  • Acquiring the pitch distribution parameters of the frame to be classified, of the N3 frames before it, and of the L2 frames after it includes: using the frequency-domain distribution information of the tonal components of the frame to be classified as its pitch distribution parameter, that of the N3 frames before it as their pitch distribution parameters, and that of the L2 frames after it as their pitch distribution parameters.
  • Obtaining the number of tonal components satisfying the continuity constraint from these pitch distribution parameters includes: obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 frames before it, and the L2 frames after it, the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold.
  • Acquiring the energy distribution parameters of the frame to be classified, of the N3 frames before it, and of the L2 frames after it includes: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • Obtaining the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold.
  • Obtaining the number of consecutive frames of the frame to be classified in the high-frequency region includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • A second aspect of the present invention provides an audio signal classification processing apparatus, including: a first acquiring module, configured to acquire at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region; and
  • a classification determining module, configured to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal.
  • The first acquiring module is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it, and to obtain from those pitch distribution parameters the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; or
  • it is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it, and to obtain from those energy distribution parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • The classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that it is a voice signal.
  • The first acquiring module acquires the pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it by: using the frequency-domain distribution information of the tonal components of the frame to be classified as its pitch distribution parameter, and that of the N1 frames before it as their pitch distribution parameters.
  • The classification determining module obtains the number of tonal components satisfying the continuity constraint from these pitch distribution parameters by: obtaining the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold.
  • The first acquiring module acquires the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it by: using the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as its energy distribution parameter, and those of the N1 frames before it as their energy distribution parameters.
  • The classification determining module obtains the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters by: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold; and it obtains the number of consecutive frames of the frame to be classified in the high-frequency region by: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • When the classification result of the frame to be classified is obtained with an output delay of L1 frames, where L1 is a positive integer, the first acquiring module is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it, and to obtain from those parameters the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; or it is specifically configured to acquire the energy distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it, and to obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • The classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that it is a voice signal.
  • The first acquiring module acquires the pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it by: using the frequency-domain distribution information of the tonal components of the frame to be classified as its pitch distribution parameter, that of the N2 frames before it as their pitch distribution parameters, and that of the L1 frames after it as their pitch distribution parameters.
  • The classification determining module obtains the number of tonal components satisfying the continuity constraint from these pitch distribution parameters by: obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 frames before it, and the L1 frames after it, the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold.
  • The first acquiring module acquires the energy distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it by: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • The classification determining module obtains the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters by: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold; and it obtains the number of consecutive frames of the frame to be classified in the high-frequency region by: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • When the classification result of the frame to be classified is obtained with an output delay of L2+L3 frames, L2 and L3 are positive integers.
  • The first acquiring module is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal, of the N3 frames before it, and of the L2 frames after it, and to obtain from those parameters the number of tonal components satisfying the continuity constraint in the frame to be classified, where N3 is a positive integer; or
  • it is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames before it, and of the L2 frames after it, and to obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • The classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that it is a voice signal. If the frame to be classified is determined to be a music signal, it determines whether the number of frames determined to be voice signals among the N4 frames before the frame to be classified and the L3 frames after it is greater than the fourth threshold; if so, it corrects the frame to be classified to a voice signal. If the frame to be classified is determined to be a voice signal, it determines whether the number of frames determined to be music signals among the N4 frames before the frame to be classified and the L3 frames after it is greater than the fifth threshold; if so, it corrects the frame to be classified to a music signal, where N4 is a positive integer.
  • The first acquiring module acquires the pitch distribution parameters of the frame to be classified in the audio signal, of the N3 frames before it, and of the L2 frames after it by: using the frequency-domain distribution information of the tonal components of the frame to be classified as its pitch distribution parameter, that of the N3 frames before it as their pitch distribution parameters, and that of the L2 frames after it as their pitch distribution parameters.
  • The classification determining module obtains the number of tonal components satisfying the continuity constraint from these pitch distribution parameters by: obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 frames before it, and the L2 frames after it, the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold.
  • The first acquiring module acquires the energy distribution parameters of the frame to be classified, of the N3 frames before it, and of the L2 frames after it by: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • The classification determining module obtains the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters by: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold; and it obtains the number of consecutive frames of the frame to be classified in the high-frequency region by: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • The number of tonal components, acquired by the first acquiring module, whose persistence in the frame to be classified is greater than the sixth threshold is the number of tonal components that are greater than the seventh threshold in the frequency domain; that is, the number of tonal components satisfying the continuity constraint is the number of such tonal components that are greater than the seventh threshold in the frequency domain.
  • The first acquiring module is specifically configured to acquire the high-frequency energy distribution ratio and the sound pressure level of each frame in the received audio signal, and to derive the energy distribution parameters from the high-frequency energy distribution ratio and the sound pressure level of each frame. An illustrative per-frame feature computation is sketched below.
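  • A minimal per-frame feature sketch for the high-frequency energy distribution ratio and a sound-pressure-level-style measure; the analysis window, FFT, split frequency, and dB reference are assumptions for illustration, not values prescribed by the patent:

    import numpy as np

    def frame_features(frame, sample_rate=8000, split_hz=2000.0):
        """Return (high-frequency energy distribution ratio, sound pressure
        level) for one frame of samples. split_hz divides 'low' from 'high'
        frequencies; its value and the dB reference are assumptions."""
        windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
        power = np.abs(np.fft.rfft(windowed)) ** 2
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
        total = power.sum() + 1e-12
        hf_ratio = power[freqs >= split_hz].sum() / total   # share of HF energy
        spl = 10.0 * np.log10(total + 1e-12)                # dB-style level
        return hf_ratio, spl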
  • A third aspect of the present invention provides an audio signal classification processing device, including: a receiver, configured to receive an audio signal; and
  • a processor, configured to acquire at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in the audio signal received by the receiver, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, and to determine, according to at least one of those quantities, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal.
  • The processor is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it, and to obtain from those parameters the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; to acquire the energy distribution parameters of the frame to be classified and of the N1 frames before it, and to obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region; and, when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, to determine that the frame to be classified in the audio signal is a music signal; otherwise, to determine that it is a voice signal.
  • The processor acquires the pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it by: using the frequency-domain distribution information of the tonal components of the frame to be classified as its pitch distribution parameter, and that of the N1 frames before it as their pitch distribution parameters.
  • The processor obtains the number of tonal components satisfying the continuity constraint from these pitch distribution parameters by: obtaining the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold.
  • The processor acquires the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it by: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • The processor obtains the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters by: obtaining, from the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified and the N1 frames before it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold.
  • The processor obtains the number of consecutive frames of the frame to be classified in the high-frequency region from these energy distribution parameters by: obtaining, from the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified and the N1 frames before it, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • When the classification result of the frame to be classified is obtained with an output delay of L1 frames, where L1 is a positive integer, the processor is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it, and to obtain from those parameters the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; and to acquire the energy distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it, and to obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • The processor acquires the pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it by: using the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as its pitch distribution parameter, that of the N2 frames before it as their pitch distribution parameters, and that of the L1 frames after it as their pitch distribution parameters.
  • The processor obtains the number of tonal components satisfying the continuity constraint from these pitch distribution parameters by: obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 frames before it, and the L1 frames after it, the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold.
  • The processor acquires the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it by: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • The processor obtains the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters by: obtaining, from the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified, the N2 frames before it, and the L1 frames after it, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold.
  • The processor obtains the number of consecutive frames of the frame to be classified in the high-frequency region from these energy distribution parameters by: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • In the third aspect, when the classification result of the frame to be classified is obtained with an output delay of L2+L3 frames, L2 and L3 are positive integers.
  • The processor is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal, of the N3 frames before it, and of the L2 frames after it, and to obtain from those parameters the number of tonal components satisfying the continuity constraint in the frame to be classified, where N3 is a positive integer; and to acquire the energy distribution parameters of the frame to be classified, of the N3 frames before it, and of the L2 frames after it, and to obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • The processor acquires the pitch distribution parameters of the frame to be classified in the audio signal, of the N3 frames before it, and of the L2 frames after it by: using the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as its pitch distribution parameter, that of the N3 frames before it as their pitch distribution parameters, and that of the L2 frames after it as their pitch distribution parameters.
  • The processor obtains the number of tonal components satisfying the continuity constraint from these pitch distribution parameters by: obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 frames before it, and the L2 frames after it, the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold.
  • The processor acquires the energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames before it, and of the L2 frames after it by: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • The processor obtains the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters by: obtaining, from the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified, the N3 frames before it, and the L2 frames after it, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold.
  • The processor obtains the number of consecutive frames of the frame to be classified in the high-frequency region from these energy distribution parameters by: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • The number of tonal components acquired by the processor whose persistence in the frame to be classified is greater than the sixth threshold is the number of tonal components that are greater than the seventh threshold in the frequency domain; that is, the number of tonal components satisfying the continuity constraint is the number of such tonal components that are greater than the seventh threshold in the frequency domain.
  • In summary, the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal, and the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, are acquired, and the type of the frame to be classified is confirmed to be a music signal or a voice signal according to this information. The audio signal classification processing method provided by the above technical solution can therefore improve the classification accuracy of audio signals and meet the requirements of voice quality assessment.
  • FIG. 1 is a schematic flowchart 1 of an audio signal classification processing method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart 1 of a specific embodiment of the present invention
  • Figure 3a is a waveform diagram 1 of the input signal "French male + ⁇ ";
  • Figure 3b is a spectrogram corresponding to Figure 3a;
  • Figure 4a is a waveform diagram of an input signal of an audio signal "Jinghu + French male voice";
  • Figure 4b is a spectrum diagram corresponding to Figure 4a;
  • Figure 5a is a waveform diagram of the input signal "Korean male + ensemble"
  • Figure 5b is a spectrum diagram corresponding to Figure 5a;
  • Figure 6a is a waveform diagram 2 of the input signal "French male + ⁇ ";
  • Figure 6b is the initial tone detection result of the input signal shown in Figure 6a;
  • Figure 6c is the result of the tone detection after the input signal is filtered as shown in Figure 6a;
  • Figure 7a is a waveform diagram 3 of the input signal "French male + ⁇ ";
  • Figure 7b is a graph of the pitch characteristic corresponding to Figure 7a;
  • Figure 8a is a waveform diagram of the input signal "Jinghu + French male voice"
  • Figure 8b is a graph of the high-frequency energy distribution ratio corresponding to Figure 8a;
  • Figure 9a is a waveform diagram of the input signal "Korean male + ensemble";
  • Figure 9b is a graph of the high-frequency energy distribution ratio corresponding to Figure 9a;
  • Figure 10 is a schematic flow chart 1 of the audio signal classification rule in the embodiment of the present invention.
  • Figure 11a is a waveform diagram 1 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 11b is a schematic diagram of the classification result corresponding to Figure 11a;
  • Figure 12a is a waveform diagram 2 of the input signal "Chinese female voice + ensemble + English male voice + ⁇ + German male voice + castanets";
  • Figure 12b is a schematic diagram of the smoothed classification result corresponding to Figure 12a;
  • FIG. 13 is a second schematic diagram of an audio signal classification rule according to an embodiment of the present invention.
  • Figure 14a is a waveform diagram 3 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 14b is a schematic diagram of the real-time classification result corresponding to Figure 14a;
  • FIG. 15 is a flowchart of a classification method for the case where the output delay is not fixed, according to an embodiment of the present invention;
  • Figure 16a is a waveform diagram 4 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 16b is a schematic diagram showing the classification results of the three classification methods corresponding to Figure 16a;
  • FIG. 17 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention.
  • FIG. 18 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart 1 of an audio signal classification processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step 101: Acquire at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region.
  • Step 102: According to at least one of the acquired number of tonal components satisfying the continuity constraint, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, determine that the frame to be classified in the audio signal is a music signal, or determine that the frame to be classified in the audio signal is a voice signal.
  • When classifying the frames in an audio signal, the audio signal classification processing method provided by the embodiment of the present invention can output the classification result without output delay, that is, output the classification result in real time for each received audio signal frame; alternatively, there may be a certain output delay, that is, the classification result for a received audio signal frame is given after a delay.
  • The technical solution provided by the above embodiments mainly exploits characteristics of music signals: the tone duration of a music signal is long while the tone duration of a voice signal is short, and the energy of a music signal can be continuously distributed in the high-frequency region or the low-frequency region, whereas the energy of a speech signal is generally not continuously distributed in the high-frequency region or the low-frequency region.
  • Accordingly, the number of tonal components satisfying the continuity constraint in the frame to be classified, and the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, are first acquired, and the type of the frame to be classified, music signal or voice signal, is confirmed according to this information. The audio signal classification processing method provided by the above technical solution can improve the classification accuracy of audio signals and meet the requirements of voice quality assessment. An end-to-end sketch of this per-frame flow, for the real-time (no output delay) case, is given below.
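  • The sketch below strings the earlier feature sketches together for the no-delay case. The helper detect_tone_bins is hypothetical (any per-frame tone detector returning FFT bin indices would do), and all window sizes and thresholds are illustrative assumptions:

    def classify_audio(frames, history=20, thr_tones=5, thr_low=20, thr_high=20):
        """Real-time classification sketch: for each incoming frame, update
        the feature history (the frame plus the `history` frames before it),
        derive the three quantities, and emit a music/voice label at once.
        frame_features, count_persistent_tones, low_freq_run_length and
        high_freq_run_length follow the earlier sketches; detect_tone_bins is
        a hypothetical tone detector returning the tonal FFT bins of a frame."""
        tone_hist, ratio_hist, spl_hist, labels = [], [], [], []
        for frame in frames:
            hf_ratio, spl = frame_features(frame)
            tone_hist.append(detect_tone_bins(frame))   # hypothetical helper
            ratio_hist.append(hf_ratio)
            spl_hist.append(spl)
            # keep only the frame to be classified and the frames before it
            tone_hist = tone_hist[-(history + 1):]
            ratio_hist = ratio_hist[-(history + 1):]
            spl_hist = spl_hist[-(history + 1):]
            n_tones = count_persistent_tones(tone_hist)
            low_run = low_freq_run_length(ratio_hist)
            high_run = high_freq_run_length(ratio_hist, spl_hist)
            labels.append("music" if (n_tones > thr_tones or low_run > thr_low
                                      or high_run > thr_high) else "voice")
        return labels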
  • The classification may be divided into three cases according to different output delay requirements, as illustrated by the window-selection sketch below.
  • The first case allows no output delay for the classification result: the judgment is made according to the frame to be classified and the N1 frames before it. The second case allows a small output delay, that is, an output delay of L1 frames, where L1 is a positive integer: the judgment is made according to the frame to be classified, the frames before it, and the L1 frames after it.
  • The third case allows a larger output delay, that is, an output delay of L2+L3 frames, where L2 and L3 are positive integers: a preliminary judgment is first made according to the frame to be classified, the frames before it, and the L2 frames after it, and the resulting classification is then corrected according to the frames before the frame to be classified and the L3 frames after it.
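  • A trivial sketch of how the three cases differ, namely in which frames feed the feature windows: n_after is 0 in the first case, L1 in the second, and L2 in the third (the correction step then additionally inspects the L3 following labels, as in the correction sketch earlier). The function and its parameters are purely illustrative:

    def analysis_window(per_frame_features, idx, n_before, n_after=0):
        """Return the features of the frames that feed the classification of
        frame `idx`: the frame itself, up to `n_before` previous frames and,
        when an output delay is allowed, `n_after` following frames."""
        lo = max(0, idx - n_before)
        hi = min(len(per_frame_features), idx + 1 + n_after)
        return per_frame_features[lo:hi]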
  • The first frames received in the audio signal cannot be classified in this way, so the classification of the first received frames can be set to a default value, the default being a voice signal or a music signal.
  • In the first case, acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified in step 101 of the embodiment shown in FIG. 1 specifically includes:
  • acquiring the pitch distribution parameters of the frame to be classified and of the N1 frames before it, and obtaining from those parameters the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer.
  • Obtaining the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region in step 102 of the embodiment shown in FIG. 1 includes: acquiring the energy distribution parameters of the frame to be classified and of the N1 frames before it, and obtaining the consecutive frame counts from those parameters.
  • Determining in step 103 of the embodiment shown in FIG. 1, according to at least one of the number of tonal components satisfying the continuity constraint, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise that it is a voice signal, includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that it is a voice signal.
  • Acquiring the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before it includes: using the frequency-domain distribution information of the tonal components of these frames as their pitch distribution parameters; the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold is then obtained from this frequency-domain distribution information.
  • Acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before it includes: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • Obtaining the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold.
  • Obtaining the number of consecutive frames of the frame to be classified in the high-frequency region from these energy distribution parameters includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • In the second case, acquiring the number of tonal components satisfying the continuity constraint in step 101 includes: acquiring the pitch distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it, and obtaining the number of tonal components satisfying the continuity constraint from those parameters, where N2 is a positive integer.
  • Obtaining the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region in step 102 includes: acquiring the energy distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it, and obtaining the consecutive frame counts from those parameters.
  • Determining in step 103, according to at least one of the acquired quantities, that the frame to be classified in the audio signal is a music signal, and otherwise that it is a voice signal, includes: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that it is a voice signal.
  • Acquiring the pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it includes: using the frequency-domain distribution information of the tonal components of each of these frames as its pitch distribution parameter; the number of tonal components in the frame to be classified whose persistence is greater than the sixth threshold is then obtained from this frequency-domain distribution information.
  • Acquiring the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it includes: using the high-frequency energy distribution ratio and the sound pressure level of each of these frames as its energy distribution parameter.
  • Obtaining the number of consecutive frames of the frame to be classified in the low-frequency region from these energy distribution parameters includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold.
  • Obtaining the number of consecutive frames of the frame to be classified in the high-frequency region includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • the classification result output delay is allowed to be L2+L3 frames, that is, the delay L2+L3 frame is used to obtain the classification result of the to-be-classified frame
  • the step 101 of the embodiment shown in FIG. 1 acquires the to-be-classified frame in the audio signal.
  • the number of tonal components that satisfy the continuity constraint includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N3 is a positive integer;
  • The obtaining, in step 102 of the embodiment shown in FIG. 1, of the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region includes: acquiring the energy distribution parameters of the frame to be classified, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, and obtaining, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region.
  • In step 103 of the embodiment shown in FIG. 1, determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region, and the number of consecutive frames of the frame to be classified in the high frequency region, that the frame to be classified in the audio signal is a music signal, or determining that the frame to be classified in the audio signal is a voice signal, includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
  • The acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified includes: using the frequency domain distribution information of the tonal components of the frame to be classified, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified as the respective pitch distribution parameters. According to the frequency domain distribution information of the tonal components of the frame to be classified, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, the number of tonal components in the frame to be classified whose number of persistent frames is greater than a sixth threshold is acquired.
  • The obtaining the energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified includes: using the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, using the high frequency energy distribution ratios and the sound pressure levels of the N3 frames before the frame to be classified as the energy distribution parameters of the N3 frames before the frame to be classified, and using the high frequency energy distribution ratios and the sound pressure levels of the L2 frames after the frame to be classified as the energy distribution parameters of the L2 frames after the frame to be classified;
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region includes: obtaining, according to the high frequency energy distribution ratios and the sound pressure levels of the frame to be classified in the received audio signal, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified, the number of consecutive frames, including the frame to be classified, whose high frequency energy distribution ratio is less than the eighth threshold;
  • Optionally, the number of tonal components in the frame to be classified whose number of persistent frames is greater than the sixth threshold is the number of tonal components greater than a seventh threshold in the frequency domain.
  • Step 201: Perform an FFT transformation on the current frame, the ith frame, where the FFT transformation is performed on each received frame; Step 202: Obtain the pitch distribution parameter and the energy distribution parameter of the ith frame based on the FFT transformation result; Step 203: Determine whether i>L1 holds, that is, whether at least L1 frames exist before the current frame. If so, perform step 204; otherwise, perform the operations of steps 201 and 202 on subsequent frames; Step 204: When i>L1, the audio signal classification result of the (i-L1)th frame may be obtained. Specifically, according to the past information, that is, the pitch distribution parameters and the energy distribution parameters, obtained in steps 201 and 202, of the frames before the (i-L1)th frame, the current information, that is, the pitch distribution parameter and the energy distribution parameter of the (i-L1)th frame, and the future information, that is, the pitch distribution parameters and the energy distribution parameters of the L1 frames after the (i-L1)th frame, the audio signal classification result of the (i-L1)th frame is obtained; Step 205: Output the audio signal classification result of the (i-L1)th frame.
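  • The flow of steps 201 to 205 can be sketched as follows; extract_parameters and classify are hypothetical helpers standing in for the operations described above, and the buffering scheme shown is only one possible arrangement.

```python
import numpy as np

def classify_stream(frames, L1, extract_parameters, classify):
    """Sketch of steps 201-205: classify frame i-L1 once L1 future frames are available."""
    history = []            # cached (pitch_params, energy_params) per frame
    results = []
    for i, frame in enumerate(frames, start=1):
        spectrum = np.fft.fft(frame)                    # step 201: FFT of the current frame
        history.append(extract_parameters(spectrum))    # step 202: pitch/energy parameters
        if i > L1:                                      # step 203: L1 future frames now exist
            target = i - L1 - 1                         # 0-based index of frame i-L1
            past = history[:target]                     # past information
            current = history[target]                   # current information
            future = history[target + 1:]               # future information (L1 frames)
            results.append(classify(past, current, future))   # steps 204-205
    return results
```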
  • FIG. 3a is waveform diagram 1 of the input signal "French male voice + ⁇", and FIG. 3b is the spectrum diagram corresponding to FIG. 3a. In the waveform diagram of FIG. 3a, the sampling rate is 8 kHz, the horizontal axis is the sample point, and the vertical axis is the normalized amplitude. In the spectrum diagram of FIG. 3b, the corresponding sampling rate is also 8 kHz, the frequency analysis range is 0~4 kHz, the horizontal axis is the frame, corresponding to the sample points on the horizontal axis of FIG. 3a, and the vertical axis is the frequency (Hz).
  • The higher the brightness in a certain frequency range, the greater the energy of the signal in that band; if the signal continuously maintains a large amount of energy in a certain frequency band, a "bright band" is formed on the spectrum, which is the tone. The tone duration at the fundamental frequency is slightly longer, while the tone duration at higher frequencies is very short. In the voice signal, the places where tones can be detected are voiced segments; since the length of a voiced sound is usually short, the corresponding tone duration is also short. In the music signal in the latter half, the tone duration is significantly longer.
  • FIG. 4a is a waveform diagram of the input signal "Jinghu + French male voice", and FIG. 4b is the spectrum diagram corresponding to FIG. 4a. In the waveform diagram of FIG. 4a, the horizontal axis is the sample point and the vertical axis is the normalized amplitude; in the spectrum diagram of FIG. 4b, the horizontal axis is the frame and the vertical axis is the frequency (Hz). From the energy distribution of FIG. 4b it can be seen that, in the music signal of the first half, the energy is basically distributed above 1 kHz, that is, from 1 kHz to 4 kHz; in the speech signal of the latter half, most of the voiced energy is distributed below 1 kHz, while the unvoiced energy is distributed from the low frequency up to a higher frequency range. Therefore, the energy of the speech signal cannot be continuously distributed over a relatively high frequency range.
  • FIG. 5a is a waveform diagram of the input signal "Korean male voice + ensemble", wherein the horizontal axis is the sample point and the vertical axis is the normalized amplitude. FIG. 5b is the spectrum diagram corresponding to FIG. 5a, where the horizontal axis is the frame and the vertical axis is the frequency (Hz). From the energy distribution it can be seen that the energy distribution of the speech signal in the first half of FIG. 5b is similar to that of the speech signal described above: due to the different energy distribution characteristics of voiced and unvoiced sounds, the energy distribution of the speech signal fluctuates greatly. Therefore, the energy of the speech signal is neither continuously distributed in a relatively high frequency range nor continuously distributed in the low frequency range; in the music signal of the latter half, the energy is mainly distributed below 1 kHz.
  • The differences between the music signal and the speech signal mainly include the following. First, the tone duration of part of the music signal is long, while the tone duration of the speech signal is usually short. Second, the energy of part of the music signal can be continuously distributed in a relatively high frequency range, while the energy of the speech signal cannot. Third, the energy of part of the music signal can be continuously distributed in the low frequency region, while the energy of the speech signal cannot. The low frequency and high frequency division in the embodiments of the present invention may be determined according to the distribution area of the voice signal, with the area where the voice signal is mainly distributed defined as the low frequency region; for example, the range below 1 kHz is defined as the low frequency region and the range above 1 kHz is defined as the high frequency region.
  • the specific value may also be different according to the specific application scenario and the specific voice signal.
  • the features to be extracted mainly include pitch characteristics and energy characteristics. Specifically, extracting the tonal features can be divided into three steps:
  • A tonal component refers to a form of energy distribution in the frequency domain. Obtaining the initial pitch detection result may include: first, performing an FFT transformation on the data of each frame to obtain a power density spectrum; second, determining the local maximum points in the power density spectrum; and finally, analyzing a number of power density spectral coefficients centered on each local maximum point to determine whether the local maximum point is a true tonal component.
  • In this embodiment, the sampling rate of the input signal is 8 kHz, the effective bandwidth is 4 kHz, and the FFT length is 1024.
  • In this embodiment, how to select a plurality of power density spectral coefficients centered on a local maximum point of the power density spectrum is relatively flexible, and can be set according to the algorithm; for example, it can be implemented as follows.
  • tonal_flag_original[k][f] represents the initial pitch detection result: a value of 1 indicates that the kth frame data has a tonal component at frequency f, and a value of 0 indicates that the kth frame data does not have a tonal component at f.
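  • A minimal sketch of how such an initial tone detection could be implemented is given below; the neighbourhood width and the 7 dB margin are illustrative assumptions, since the exact coefficients and thresholds of the embodiment are not reproduced here.

```python
import numpy as np

def initial_tone_detection(frame, fft_len=1024, neighborhood=3, margin_db=7.0):
    """Return a 0/1 array marking frequency bins that look like tonal components."""
    spectrum = np.fft.fft(frame, fft_len)
    power_db = 10.0 * np.log10(np.abs(spectrum[:fft_len // 2]) ** 2 + 1e-12)
    tonal_flag = np.zeros(fft_len // 2, dtype=int)
    for f in range(neighborhood, fft_len // 2 - neighborhood):
        # local maximum of the power density spectrum
        if power_db[f] >= power_db[f - 1] and power_db[f] >= power_db[f + 1]:
            # compare against coefficients centred on the local maximum
            neighbours = np.concatenate((power_db[f - neighborhood:f - 1],
                                         power_db[f + 2:f + neighborhood + 1]))
            if np.all(power_db[f] - neighbours >= margin_db):
                tonal_flag[f] = 1
    return tonal_flag
```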
  • The L1 frames of data located before the kth frame are referred to as past frames, and the L1 frames of data located after the kth frame are referred to as future frames.
  • Suppose the kth frame data has a tonal component at f_x, that is, tonal_flag_original[k][f_x] = 1.
  • the steps of the tone continuity analysis are:
  • Step 1: Count the number of past tonal components with which the tonal component has continuity, expressed as num_left, by sequentially detecting whether there is continuity between the tonal components of the (k-1)th frame, the (k-2)th frame, and so on and the tonal component of the kth frame, and output num_left. Step 2: Count the number of future tonal components with which the tonal component has continuity, expressed as num_right. Similar to step 1 above, sequentially detect whether there is continuity between the tonal components of the (k+1)th frame, the (k+2)th frame, and so on and the tonal component of the kth frame, and output num_right. Step 3: According to num_left and num_right, filter the initial tone detection results: if one of the two continuity conditions is met, for example num_right ≥ a3, it indicates that the tonal component of the kth frame at f_x has a certain continuity, and the initial pitch detection result is retained; otherwise, it is not retained.
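  • A minimal sketch of this continuity filtering is given below; the tolerance of one frequency bin when matching tonal components across frames is an illustrative assumption, and left_threshold and right_threshold stand in for thresholds such as a3.

```python
def filter_tones(tonal_flag_original, k, fx, L1, left_threshold, right_threshold, tol=1):
    """Keep the tonal component of frame k at bin fx only if it persists in enough
    past or future frames (sketch of steps 1-3 of the continuity analysis)."""
    def has_match(frame_idx):
        lo, hi = max(0, fx - tol), fx + tol + 1
        return any(tonal_flag_original[frame_idx][lo:hi])

    num_left = 0                      # step 1: continuity with past frames
    for i in range(1, L1 + 1):
        if k - i < 0 or not has_match(k - i):
            break
        num_left += 1

    num_right = 0                     # step 2: continuity with future frames
    for i in range(1, L1 + 1):
        if k + i >= len(tonal_flag_original) or not has_match(k + i):
            break
        num_right += 1

    # step 3: retain the detection only if one of the continuity conditions holds
    return num_left >= left_threshold or num_right >= right_threshold
```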
  • FIG. 6a is waveform diagram 2 of the input signal "French male voice + ⁇", and FIG. 6b shows the initial tone detection result of the input signal shown in FIG. 6a, wherein the horizontal axis is the frame, corresponding to the sample points on the horizontal axis of FIG. 6a.
  • Tone feature extraction is then performed: for the filtered tone detection result, the number of tonal components per frame over the range from the lower frequency bound up to the high frequency limit (corresponding to frequencies up to F/2) is counted and expressed as num_tonal_flag.
  • FIG. 7a is waveform diagram 3 of the input signal "French male voice + ⁇", and FIG. 7b is a graph of the tonal feature corresponding to FIG. 7a, wherein the horizontal axis is the frame. In the graph, num_tonal_flag is always 0 in the speech portion, which is significantly different from the tonal feature of the second half.
  • The energy feature extraction method in the above embodiment of the present invention is as follows. Before extracting the energy features, the high frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level of each frame first need to be calculated, where k denotes the frame index and Im_k(f) denotes the imaginary part of the FFT transform of the kth frame. In the expression of the high frequency energy distribution ratio, the denominator represents the total energy of the kth frame and the numerator represents the energy of the kth frame in the higher frequency range. If ratio_energy_hf(k) is small, it indicates that the energy of the kth frame is mainly distributed at low frequencies; otherwise, it indicates that the energy of the kth frame is mainly distributed in a higher frequency range. Based on this ratio, the distribution characteristics of the energy at high frequencies and at low frequencies are further analyzed.
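  • A sketch of how the two per-frame quantities could be computed follows; the 1 kHz split point and the simple dB-based sound pressure level are illustrative assumptions rather than values fixed by the embodiment.

```python
import numpy as np

def frame_energy_features(frame, sample_rate=8000, fft_len=1024, split_hz=1000.0):
    """Return (ratio_energy_hf, sound pressure level in dB) for one frame."""
    spectrum = np.fft.fft(frame, fft_len)
    power = np.abs(spectrum[:fft_len // 2]) ** 2          # Re^2 + Im^2 per bin
    split_bin = int(split_hz * fft_len / sample_rate)
    total_energy = np.sum(power) + 1e-12                  # denominator: total frame energy
    high_energy = np.sum(power[split_bin:])               # numerator: energy above the split
    ratio_energy_hf = high_energy / total_energy
    spl = 10.0 * np.log10(np.mean(np.asarray(frame, dtype=float) ** 2) + 1e-12)
    return ratio_energy_hf, spl
```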
  • Fig. 8a is a waveform diagram of the input signal "Jinghu + French male voice”.
  • FIG. 8b is a graph of the high frequency energy distribution ratio ratio_energy_hf corresponding to FIG. 8a, wherein the horizontal axis is the frame, corresponding to the sample points on the horizontal axis of FIG. 8a, and the vertical axis is the high frequency energy distribution ratio. The following can be seen from the variation of the high frequency energy distribution ratio curve in FIG. 8b: in the Jinghu signal, the high frequency energy distribution ratio is substantially greater than 0.8, indicating that the energy of the Jinghu signal can be continuously distributed in the higher frequency range;
  • num_big_ratio_energy_left represents, among the L1 frames of data before the kth frame, the number of past frames whose energy can be continuously distributed in the higher frequency range; num_big_ratio_energy_right represents, among the L1 frames of data after the kth frame, the number of future frames whose energy can be continuously distributed in the higher frequency range.
  • Step 1: Initialize num_big_ratio_energy_left to 0. Step 2: Initialize the variable num_non_big_ratio to 0. Step 3: Check whether ratio_energy_hf(k-1) and the sound pressure level of the (k-1)th frame satisfy the corresponding conditions. Similar to step 3, it is then sequentially detected whether the energy of the data of the (k-2)th frame and earlier frames is continuously distributed in the higher frequency range. Before each detection, the size of num_non_big_ratio first needs to be judged: if num_non_big_ratio exceeds its preset threshold, it indicates that the number of frames whose energy cannot be continuously distributed in the higher frequency range has exceeded the preset range, detection is not continued, and num_big_ratio_energy_left is output; otherwise, detection continues until all of the past L1 frames of data have been detected, and num_big_ratio_energy_left is output. The steps to obtain num_big_ratio_energy_right are similar: it is detected whether the energy of the (k+1)th frame and subsequent frames is continuously distributed in the higher frequency range, and num_big_ratio_energy_right is output.
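  • A sketch of the backward count with the tolerance counter follows; ratio_threshold, spl_threshold and non_big_limit are hypothetical placeholders for the thresholds mentioned above.

```python
def count_big_ratio_left(ratio_energy_hf, spl, k, L1,
                         ratio_threshold, spl_threshold, non_big_limit):
    """Count past frames (within the L1 frames before frame k) whose energy stays
    in the higher frequency range, tolerating a limited number of exceptions."""
    num_big_ratio_energy_left = 0
    num_non_big_ratio = 0
    for i in range(1, L1 + 1):
        if k - i < 0 or num_non_big_ratio >= non_big_limit:
            break                       # exceeded the preset range: stop detecting
        if ratio_energy_hf[k - i] > ratio_threshold and spl[k - i] > spl_threshold:
            num_big_ratio_energy_left += 1
        else:
            num_non_big_ratio += 1      # record a frame that breaks the continuity
    return num_big_ratio_energy_left
```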
  • FIG. 9a is a waveform diagram of the input signal "Korean male voice + ensemble", and FIG. 9b is a graph of the high frequency energy distribution ratio ratio_energy_hf corresponding to FIG. 9a.
  • the horizontal axis is the frame; the vertical axis is the high frequency energy distribution ratio.
  • By observing the change of the high frequency energy distribution ratio curve shown in FIG. 9b, it can be seen that in the speech signal of the first half, the fluctuation of the high frequency energy distribution ratio curve is large, indicating that the energy of the speech signal cannot be continuously distributed at low frequencies; in the music signal of the latter half, the high frequency energy distribution ratio is substantially less than 0.1, indicating that the energy of the ensemble signal can be continuously distributed at low frequencies.
  • num_small_ratio_energy_left indicates the number of past frames whose energy can be continuously distributed at low frequencies, and num_small_ratio_energy_right indicates the number of future frames whose energy can be continuously distributed at low frequencies. num_small_ratio_energy_left is obtained by analyzing the past L1 frames of data, examining ratio_energy_hf(i) of each past frame; num_small_ratio_energy_right is obtained as follows.
  • Step 1: Initialize num_small_ratio_energy_right to 0. Step 2: Sequentially detect whether the high frequency energy distribution ratio ratio_energy_hf(i) of the (k+1)th frame, the (k+2)th frame, and so on satisfies the condition ratio_energy_hf(i) < a9. If the condition is not satisfied, detection does not need to continue and num_small_ratio_energy_right is output; if the condition is satisfied, num_small_ratio_energy_right = num_small_ratio_energy_right + 1, and checking continues until all of the future L1 frames of data have been detected, after which num_small_ratio_energy_right is output.
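  • The forward count for the low-frequency case can be sketched in the same way; here a9 is the threshold named above, and the stop-on-first-failure behaviour follows the description of step 2.

```python
def count_small_ratio_right(ratio_energy_hf, k, L1, a9):
    """Count future frames (within the L1 frames after frame k) whose high frequency
    energy distribution ratio stays below a9, stopping at the first frame that fails."""
    num_small_ratio_energy_right = 0
    for i in range(1, L1 + 1):
        if k + i >= len(ratio_energy_hf) or ratio_energy_hf[k + i] >= a9:
            break
        num_small_ratio_energy_right += 1
    return num_small_ratio_energy_right
```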
  • the classification rule may be as shown in FIG. 10.
  • the classification rule may include the following steps:
  • Step 301: Determine whether the number of tonal components is greater than 0, that is, whether num_tonal_flag > 0. If the condition is met, the initial classification result may be output as a music signal; otherwise, continue to step 302. Step 302: Analyze the distribution characteristics of the energy in the higher frequency range; first judge whether the high frequency energy distribution ratio of the frame is greater than a6 and the sound pressure level exceeds its threshold. If yes, perform step 303; otherwise, perform step 304. Step 303: Determine whether num_big_ratio_energy_right reaches its threshold (for example a11), or num_big_ratio_energy_left + num_big_ratio_energy_right ≥ a10; if one of these is satisfied, output the initial classification result as a music signal, otherwise perform step 304. Step 304: Determine whether the high frequency energy distribution ratio is less than a9; if yes, perform step 305, otherwise output the initial classification result as a voice signal. Step 305: Determine whether num_small_ratio_energy_left ≥ a13 is satisfied, or num_small_ratio_energy_left + num_small_ratio_energy_right ≥ a12, or num_small_ratio_energy_right exceeds its threshold; if one of these is satisfied, the initial classification result is output as a music signal, otherwise the initial classification result is output as a voice signal.
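  • Steps 301 to 305 can be sketched as a small decision tree; the threshold names follow the a6-a13 notation above where it is legible, and the remaining entries of the threshold dictionary are hypothetical placeholders.

```python
def initial_classification(num_tonal_flag, ratio_energy_hf_k, spl_k,
                           num_big_left, num_big_right,
                           num_small_left, num_small_right, thr):
    """thr is a dict of thresholds (a6, a9, a10, ...); values are not prescribed here.
    Returns 'music' or 'speech' for the current frame."""
    if num_tonal_flag > 0:                                     # step 301
        return "music"
    if ratio_energy_hf_k > thr["a6"] and spl_k > thr["spl"]:   # step 302
        if (num_big_right >= thr["a11"]                        # step 303
                or num_big_left + num_big_right >= thr["a10"]):
            return "music"
    if ratio_energy_hf_k < thr["a9"]:                          # step 304
        if (num_small_left >= thr["a13"]                       # step 305
                or num_small_left + num_small_right >= thr["a12"]
                or num_small_right >= thr["right"]):
            return "music"
    return "speech"
```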
  • FIG. 11a is a waveform diagram of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets". Three of the music segments, namely the ensemble, the cymbal and the castanets, are typical in terms of tonal features or energy features. FIG. 11b is a schematic diagram of the classification results corresponding to FIG. 11a, wherein the horizontal axis is the sample point and the vertical axis is the classification result: a value of 0 corresponds to a speech signal and a non-zero value corresponds to a music signal. From bottom to top, the vertical axis gives four classification results: MUSIC_tone feature, the classification result obtained using only the tonal feature, expressed as a solid line, from which it can be seen which signals in FIG. 11a are suitable for the classification rule based on the tonal feature; and MUSIC_energy feature_1, the classification result obtained using only "energy feature_1", denoted as a dashed line, where "energy feature_1" refers to whether the energy can be continuously distributed in the higher frequency range, from which it can be seen which signals in FIG. 11a are suitable for the classification rule based on the high frequency distribution characteristic of the energy. For the ensemble signal between 100000 and 300000 points, the energy fluctuation of this piece of music signal is very large, only a few frames of energy can be continuously distributed in the higher frequency range, and energy feature_1/2 does not take effect.
  • The tones of this segment of the signal have good persistence, and the segment can be detected by using the tonal feature; the energy of this segment of the signal is mainly distributed at low frequencies, and the segment can be detected by using energy feature_2. For the castanet signal after 600000 points, tonal components can hardly be detected in this segment of the signal, and the tonal feature does not work; the energy of this segment of the signal is mainly distributed at high frequencies, and the segment can be detected by energy feature_1.
  • the technical solution provided by the embodiment of the present invention can also be applied to an application scenario with a large output delay.
  • Assume the output delay is L2+L3 frames. The initial classification result may be obtained according to the foregoing embodiment: when i>L2, according to the past information, that is, the pitch distribution parameters and the energy distribution parameters of several frames before the (i-L2)th frame, the current information, that is, the pitch distribution parameter and the energy distribution parameter of the (i-L2)th frame, and the future information, that is, the pitch distribution parameters and the energy distribution parameters of the L2 frames after the (i-L2)th frame, the initial audio signal classification result of the (i-L2)th frame is obtained; for the specific implementation, reference can be made to the above embodiment. Further, when i>(L2+L3), smoothing is performed, that is, the initial classification result of the (i-L2-L3)th frame to be classified is corrected according to the initial classification results of the N4 frames before the frame to be classified and the L3 frames after the frame to be classified.
  • The foregoing N4 frames may be the L3 frames immediately before the frame to be classified; for the kth frame, the correction process is as follows:
  • The initial classification results of the L3 frames located before the kth frame and the L3 frames located after the kth frame are counted, and the number of frames classified as a music signal, num_music, and the number of frames classified as a voice signal, num_speech, are acquired. If the initial classification result of the kth frame is a voice signal and num_music reaches its threshold, the classification result of the kth frame is corrected to a music signal; if the initial classification result of the kth frame is a music signal and num_speech ≥ a14, the classification result of the kth frame is corrected to a speech signal.
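  • The correction can be sketched as follows; the window of L3 frames on each side follows the description above, while the two correction thresholds are placeholders (the text above names a14 for the speech count).

```python
def smooth_frame(initial_results, k, L3, music_threshold, speech_threshold):
    """Correct the initial result of frame k using the L3 frames before and after it."""
    window = initial_results[max(0, k - L3):k] + initial_results[k + 1:k + L3 + 1]
    num_music = sum(1 for r in window if r == "music")
    num_speech = sum(1 for r in window if r == "speech")
    result = initial_results[k]
    if result == "speech" and num_music >= music_threshold:
        return "music"                    # isolated speech frame inside music
    if result == "music" and num_speech >= speech_threshold:
        return "speech"                   # isolated music frame inside speech
    return result
```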
  • FIG. 12a is a waveform diagram of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets", the same as FIG. 11a. FIG. 12b shows the smoothed results; from bottom to top, the vertical axis gives two classification results, including MUSIC_smoothed result, obtained by smoothing the initial classification result and shown as a dotted line.
  • The initial classification result has a misjudgment between 250000 and 300000 points, where the music signal is misjudged as a voice signal; for the ⁇ signal between 400000 and 550000 points, the initial classification result has a misjudgment at the end of the signal, where the music signal is misjudged as a speech signal. The above misjudgments were corrected by smoothing.
  • The principle and the steps of acquiring the pitch distribution parameters and the energy distribution parameters are similar to those of the above technical solution, except that the reference information consists only of past information and the current information; because there is no output delay, the classification result needs to be obtained in real time and future information cannot be referred to.
  • the tone feature can be extracted by referring to the foregoing embodiment, and can be divided into three steps:
  • For step A, reference may be made to the above embodiment; the following mainly describes steps B and C.
  • tonal_flag_original[k][f] represents the initial tone detection result: a value of 1 indicates that the kth frame data has a tonal component at f, and a value of 0 indicates that the kth frame data does not have a tonal component at f.
  • the L1 frame data located before the k-th frame is referred to as a past frame.
  • the steps of the tone continuity analysis are:
  • Step 1: Count the number of past tonal components with which the tonal component has continuity, expressed as num_left; initialize the variable num_left to 0, and initialize the variable indicating discontinuity to 0. Step 2: Sequentially detect whether there is continuity between the tonal components of the (k-1)th frame, the (k-2)th frame, and so on and the tonal component of the current frame; before each detection, the size of the discontinuity counter first needs to be judged. Step 3: Filter the initial pitch detection results according to num_left.
  • The feature num_big_ratio_energy_left refers to, among the L1 frames of data before the kth frame, the number of past frames whose energy can be continuously distributed in the higher frequency range. Step 1: Initialize num_big_ratio_energy_left to 0. Step 2: Initialize the variable num_non_big_ratio to 0. Step 3: Check whether ratio_energy_hf(k-1) and the sound pressure level of the (k-1)th frame satisfy the corresponding conditions, for example ratio_energy_hf(k-1) greater than its threshold. If the conditions are not satisfied, the energy of the (k-1)th frame data is not continuously distributed in the higher frequency range, and this event is recorded: num_non_big_ratio = num_non_big_ratio + 1; if the conditions are satisfied, the energy of the (k-1)th frame data is continuously distributed in the higher frequency range and num_big_ratio_energy_left is incremented. Similar to step 3, it is then sequentially detected whether the energy of the data of the (k-2)th frame and earlier frames is continuously distributed in the higher frequency range; before each detection, the size of num_non_big_ratio first needs to be judged. The feature num_small_ratio_energy_left refers to the number of past frames whose energy can be continuously distributed at low frequencies.
  • the classification rule may be as shown in FIG. 13, and for the k-th frame data, it may include the following steps:
  • Step 401: Determine whether the number of tonal components is greater than 0, that is, whether num_tonal_flag > 0. If the condition is met, the classification result may be output as a music signal; otherwise, the energy features continue to be analyzed. Step 402: Analyze the distribution characteristics of the energy in the higher frequency range; first determine whether the high frequency energy distribution ratio and the sound pressure level of the frame exceed their respective thresholds; if yes, perform step 403, otherwise perform step 404. Step 403: Determine whether num_big_ratio_energy_left ≥ b8 is satisfied; if yes, output the classification result as a music signal, otherwise perform step 404. Step 404: Determine whether the high frequency energy distribution ratio is less than b7; if yes, perform step 405, otherwise output the classification result as a voice signal. Step 405: Determine whether num_small_ratio_energy_left ≥ b9 is satisfied; if it is satisfied, the classification result is output as a music signal, otherwise the classification result is output as a voice signal.
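  • The zero-delay variant of steps 401 to 405 uses only the _left counters and the b-thresholds; a sketch under the same naming assumptions as before follows, where thr["hf"] and thr["spl"] are hypothetical placeholders.

```python
def realtime_classification(num_tonal_flag, ratio_energy_hf_k, spl_k,
                            num_big_left, num_small_left, thr):
    """thr holds thresholds such as b7, b8, b9 (values not prescribed here)."""
    if num_tonal_flag > 0:                                   # step 401
        return "music"
    if ratio_energy_hf_k > thr["hf"] and spl_k > thr["spl"]: # step 402
        if num_big_left >= thr["b8"]:                        # step 403
            return "music"
    if ratio_energy_hf_k < thr["b7"]:                        # step 404
        if num_small_left >= thr["b9"]:                      # step 405
            return "music"
    return "speech"
```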
  • FIG. 14a is waveform diagram 3 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets", the same as FIG. 11a; three of the music segments, namely the ensemble, the cymbal and the castanets, are typical in terms of tonal features or energy features. FIG. 14b gives an example of the real-time classification results, where the horizontal axis is the sample point and the vertical axis is the classification result: a value of 0 corresponds to a speech signal and a non-zero value corresponds to a music signal. It can be seen from FIG. 14a and FIG. 14b that, since there is no future information for reference, a small amount of the music signal is misjudged as a voice signal.
  • FIG. 15 is a flowchart of a voice classification method in a case where an output delay is not fixed according to an embodiment of the present invention, and as shown in FIG. 15, the following steps are included:
  • Step 501: Perform an FFT transformation on the current frame, the ith frame. Step 502: Obtain the pitch distribution parameter of the ith frame according to the FFT transform result and cache it. Step 503: Obtain the energy distribution parameter of the ith frame according to the FFT transform result and cache it.
  • Step 504 Generate and cache a real-time classification result of the ith frame.
  • Specifically, the past information generated and cached in step 502 and step 503, that is, the pitch distribution parameters and the energy distribution parameters of the frames before the ith frame, is used to obtain the tonal features and the energy features of the ith frame, and the real-time classification result is generated and cached. Step 505: When i>L1, where L1 is a small allowed output delay, in addition to obtaining the real-time classification result of each received frame, the initial classification result of the (i-L1)th frame may also be generated and cached. Specifically, when generating the initial classification result of the (i-L1)th frame, reference may be made to the past information, that is, the pitch distribution parameters and the energy distribution parameters of several frames before the (i-L1)th frame, the current information, that is, the pitch distribution parameter and the energy distribution parameter of the (i-L1)th frame, and the future information of the L1 frames after it.
  • Step 506: When i>(L2+L3), generate and cache the corrected classification result of the (i-L2-L3)th frame. Specifically, with reference to the past information, that is, the initial classification results of several frames before the (i-L2-L3)th frame, and the future information, that is, the initial classification results of the L3 frames located after the (i-L2-L3)th frame, the initial classification result of the (i-L2-L3)th frame is corrected; for the specific implementation, reference can be made to the foregoing embodiment.
  • Step 507: Select, according to the allowed output delay, one of the classification results of the foregoing step 504, step 505, and step 506 as the classification result of the jth frame to be classified: if the full delay is allowed, the optimal result is output, that is, the corrected classification result of the jth frame; if a smaller delay is allowed, the suboptimal result is output, that is, the initial classification result of the jth frame; if no delay is allowed, the zero-delay result is output, that is, the real-time classification result of the jth frame.
  • the value of L2 can be set equal to L1.
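  • The selection of step 507 can be sketched as below; the three cached result streams correspond to steps 504, 505 and 506, and expressing the allowed output delay in frames is an assumption for illustration.

```python
def select_result(allowed_delay, realtime, initial, corrected, j, L1, L2, L3):
    """Pick the best available classification result for frame j given the delay budget."""
    if allowed_delay >= L2 + L3 and j < len(corrected):
        return corrected[j]     # optimal: corrected (smoothed) result
    if allowed_delay >= L1 and j < len(initial):
        return initial[j]       # suboptimal: initial result with a small delay
    return realtime[j]          # zero delay: real-time result
```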
  • FIG. 16a is waveform diagram 4 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets", the same as FIG. 11a; three of the music segments, namely the ensemble, the cymbal and the castanets, are typical in terms of tonal features or energy features. FIG. 16b shows the classification results obtained by the three classification methods, where the three classification results given on the vertical axis are the MUSIC_real-time classification results, indicated by solid lines, the MUSIC_initial classification results, indicated by dashed lines, and the MUSIC_corrected classification results, indicated by dotted lines.
  • The extracted features can reflect the more essential characteristics that distinguish the music signal from the voice signal, so that the classification accuracy at a low sampling rate is significantly improved. Since the feature extraction method of the technical solution of the embodiment of the present invention is not limited by the sampling rate, it is applicable not only to a low sampling rate but also to signal classification at a high sampling rate. Under the premise of ensuring low algorithm complexity, users can flexibly select the real-time classification result, the suboptimal classification result, or the optimal classification result according to their needs.
  • FIG. 17 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention. As shown in FIG. 17, the apparatus includes a first acquiring module 11 and a classification determining module 12, wherein the first acquiring module 11 is configured to acquire at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal, the number of consecutive frames of the frame to be classified in the low frequency region, and the number of consecutive frames of the frame to be classified in the high frequency region; the classification determining module 12 is configured to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region, and the number of consecutive frames of the frame to be classified in the high frequency region, that the frame to be classified in the audio signal is a music signal, or determine that the frame to be classified in the audio signal is a voice signal.
  • The technical solution provided by the above embodiments of the present invention mainly considers the characteristics of the music signal, for example, that the tone duration of the music signal is long while the tone duration of the voice signal is short, and that the energy of the music signal can be continuously distributed in the high frequency region or the low frequency region while the energy of the speech signal is generally not continuously distributed in the high frequency region or the low frequency region. Based on these characteristics, the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal, and the number of consecutive frames of the frame to be classified in the low frequency region of the audio signal and/or the number of consecutive frames of the frame to be classified in the high frequency region, are first obtained, and whether the frame to be classified is a music signal or a voice signal is confirmed according to the above information. The audio signal classification processing method provided by the above technical solution can improve the correct rate of audio signal classification and meet the requirements of voice quality assessment.
  • the execution steps of each module may be different according to the presence or absence of the output delay and the output delay length, and specifically include the following situations:
  • In the first case, the first acquiring module is specifically configured to acquire the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified, and to obtain, according to the pitch distribution parameters of the frame to be classified and of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; or it is specifically configured to acquire the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, and to obtain, according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region or the number of consecutive frames of the frame to be classified in the high frequency region;
  • the classification determining module 12 is specifically configured to determine, when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high frequency region is greater than a third threshold, that the frame to be classified in the audio signal is a music signal; otherwise, it is determined that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module obtains the pitch distribution parameter of the frame to be classified in the audio signal, and the pitch distribution parameters of the N1 frame before the frame to be classified include:
  • the frequency domain distribution information of the tonal component is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N1 frame is used as the pitch distribution parameter of the pre-frame N1 frame to be classified.
  • the classification determining module obtains, according to the pitch distribution parameter of the frame to be classified, and the pitch distribution parameter of the pre-frame N1 frame, the number of tonal components satisfying the continuity constraint in the frame to be classified, including:
  • the module obtains an energy distribution parameter of the frame to be classified in the audio signal, and an energy distribution parameter of the N1 frame before the frame to be classified includes:
  • the foregoing classification determining module acquires the continuous frame number of the frame to be classified in the low frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameter of the N1 frame before the frame to be classified, including:
  • the foregoing classification determining module acquires the continuous frame number of the to-be-classified frame in the high frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameter of the N1 frame before the frame to be classified Includes:
  • In the second case, the first acquiring module is specifically configured to obtain the frame to be classified in the audio signal and the pitch distribution parameters of the N2 frames before and the L1 frames after the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; or it is specifically configured to obtain the frame to be classified in the audio signal and the energy distribution parameters of the N2 frames before and the L1 frames after the frame to be classified, and to obtain, according to the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region or the number of consecutive frames of the frame to be classified in the high frequency region;
  • the classification determining module is specifically configured to determine, when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high frequency region is greater than a third threshold, that the frame to be classified in the audio signal is a music signal; otherwise, the frame to be classified in the audio signal is determined to be a voice signal.
  • the first acquiring module acquires a pitch distribution parameter of the frame to be classified in the audio signal, a pitch distribution parameter of the N2 frame before the frame to be classified, and a pitch distribution parameter of the L1 frame after the frame to be classified includes:
  • the frequency domain distribution information of the tonal component of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N2 frame is used as the pitch distribution parameter of the N2 frame before the frame to be classified.
  • and the frequency domain distribution information of the tonal components of the L1 frames after the frame to be classified is used as the pitch distribution parameters of the L1 frames after the frame to be classified.
  • The classification determining module obtaining, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, the number of tonal components in the frame to be classified whose number of persistent frames is greater than a sixth threshold.
  • The first acquiring module acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified includes: using the high frequency energy distribution ratio and the sound pressure level of the frame to be classified as the energy distribution parameter of the frame to be classified, using the high frequency energy distribution ratios and the sound pressure levels of the N2 frames before the frame to be classified as the energy distribution parameters of the N2 frames before the frame to be classified, and using the high frequency energy distribution ratios and the sound pressure levels of the L1 frames after the frame to be classified as the energy distribution parameters of the L1 frames after the frame to be classified.
  • The classification determining module acquiring, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region includes:
  • The classification determining module acquiring, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high frequency region includes: acquiring the number of consecutive frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • In the third case, when the classification result output delay of the frame to be classified is L2+L3 frames, where L2 and L3 are positive integers, the first acquiring module is specifically configured to acquire the frame to be classified in the audio signal and the pitch distribution parameters of the N3 frames before and the L2 frames after the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N3 is a positive integer; or it is specifically configured to acquire the frame to be classified in the audio signal and the energy distribution parameters of the N3 frames before and the L2 frames after the frame to be classified, and to obtain, according to the energy distribution parameters of the frame to be classified, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region or the number of consecutive frames of the frame to be classified in the high frequency region;
  • the classification processing module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal, otherwise determine that it is a voice signal; if it is determined that the frame to be classified in the audio signal is a music signal and the number of frames determined as voice signals in the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a corresponding threshold, correct the frame to be classified in the audio signal to a voice signal; if it is determined that the frame to be classified in the audio signal is a voice signal, determine whether the number of frames determined as music signals in the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fifth threshold, and if so, correct the frame to be classified in the audio signal to a music signal, where N4 is a positive integer.
  • the first acquiring module obtains the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified includes:
  • the frequency domain distribution information of the tonal components of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal components of the pre-frame N3 frame is used as the pitch distribution parameter of the N3 frame before the frame to be classified.
  • the frequency domain distribution information of the tonal components of the L2 frame after the frame to be classified is used as the pitch distribution parameter of the L2 frame after the frame to be classified.
  • the classification determining module obtains the number of tonal components satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified.
  • The first acquiring module acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified includes: using the high frequency energy distribution ratio and the sound pressure level of the frame to be classified as the energy distribution parameter of the frame to be classified, using the high frequency energy distribution ratios and the sound pressure levels of the N3 frames before the frame to be classified as the energy distribution parameters of the N3 frames before the frame to be classified, and using the high frequency energy distribution ratios and the sound pressure levels of the L2 frames after the frame to be classified as the energy distribution parameters of the L2 frames after the frame to be classified.
  • The foregoing classification determining module acquiring, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region includes: obtaining, according to the high frequency energy distribution ratios and the sound pressure levels of the frame to be classified in the received audio signal, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified, the number of consecutive frames, including the frame to be classified, whose high frequency energy distribution ratio is less than the eighth threshold;
  • The classification determining module acquiring, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high frequency region includes: acquiring the number of consecutive frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • the number of tonal components whose number of persistent frames in the frame to be classified acquired by the first acquiring module is greater than the sixth threshold is the number of tonal components greater than the seventh threshold in the frequency domain.
  • FIG. 18 is a schematic structural diagram of an audio signal classification processing device according to an embodiment of the present invention.
  • The device includes a receiver 21 and a processor 22, where the receiver 21 is configured to receive an audio signal; the processor 22 is connected to the receiver 21, and is configured to acquire at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal received by the receiver, the number of consecutive frames of the frame to be classified in the low frequency region of the audio signal, and the number of consecutive frames of the frame to be classified in the high frequency region, and to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region, and the number of consecutive frames of the frame to be classified in the high frequency region, that the frame to be classified in the audio signal is a music signal, or determine that the frame to be classified in the audio signal is a voice signal.
  • The technical solution provided by the above embodiments of the present invention mainly considers the characteristics of the music signal, for example, that the tone duration of the music signal is long while the tone duration of the voice signal is short, and that the energy of the music signal can be continuously distributed in the high frequency region or the low frequency region while the energy of the speech signal is usually not continuously distributed in the high frequency region or the low frequency region. Based on the above characteristics of the music signal, in the technical solution provided by the embodiment of the present invention, the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal, and the number of persistent frames of the frame to be classified in the low frequency region and/or the number of persistent frames of the frame to be classified in the high frequency region, are first obtained, and the type of the frame to be classified is confirmed according to the above information. The audio signal classification processing method provided by the above technical solution can improve the correct rate of audio signal classification and meet the requirements of voice quality assessment.
  • the processor may be implemented by a software flow, or may be implemented by using a hardware entity device such as a digital signal processing (DSP) chip.
  • the processor may include the following situations according to the real-time acquisition of the classification result of the to-be-classified frame or the length of the delay of the classification result output:
  • In the first case, the processor is specifically configured to acquire the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified, and to obtain, according to the pitch distribution parameters of the frame to be classified and of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; and to acquire the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, and to obtain, according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region.
  • the processor obtains a pitch distribution parameter of the frame to be classified in the audio signal, and a pitch distribution parameter of the N1 frame before the frame to be classified includes:
  • the frequency domain distribution information of the tonal component is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N1 frame is used as the pitch distribution parameter of the pre-frame N1 frame to be classified.
  • The processor obtaining, according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified in the received audio signal and of the N1 frames before the frame to be classified, the number of tonal components whose number of consecutive frames is greater than the sixth threshold.
  • the processor acquires the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameters of the N1 frame before the frame to be classified include:
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameter of the pre-frame N1 frame, the number of consecutive frames of the frame to be classified in the low frequency region includes:
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameter of the pre-frame N1 frame, the number of consecutive frames of the frame to be classified in the high frequency region includes:
  • In the second case, when the classification result output delay of the frame to be classified is L1 frames, where L1 is a positive integer, the processor is specifically configured to acquire the frame to be classified in the audio signal and the pitch distribution parameters of the N2 frames before and the L1 frames after the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; to acquire the energy distribution parameters of the frame to be classified, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, and to obtain, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region; and to determine, when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high frequency region is greater than a third threshold, that the frame to be classified in the audio signal is a music signal, otherwise determine that it is a voice signal.
  • The processor acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified includes:
  • the frequency domain distribution information of the tonal component of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N2 frame is used as the pitch distribution parameter of the N2 frame before the frame to be classified.
  • and the frequency domain distribution information of the tonal components of the L1 frames after the frame to be classified is used as the pitch distribution parameters of the L1 frames after the frame to be classified.
  • the processor obtains, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
  • obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified in the received audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, the number of tonal components whose number of consecutive frames in the frame to be classified is greater than a sixth threshold.
  • The processor acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified includes: using the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, using the high frequency energy distribution ratios and the sound pressure levels of the N2 frames before the frame to be classified as the energy distribution parameters of the N2 frames before the frame to be classified, and using the high frequency energy distribution ratios and the sound pressure levels of the L1 frames after the frame to be classified as the energy distribution parameters of the L1 frames after the frame to be classified.
  • the processor obtains, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N2 frame before the frame to be classified, and the energy distribution parameter of the L1 frame after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region includes:
  • The processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high frequency region includes: acquiring the number of consecutive frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • In the third case, when the classification result output delay is L2+L3 frames, where L2 and L3 are positive integers, the processor is specifically configured to acquire the frame to be classified in the audio signal and the pitch distribution parameters of the N3 frames before and the L2 frames after the frame to be classified, and to obtain, according to the pitch distribution parameters of the frame to be classified, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N3 is a positive integer; to acquire the energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, and to obtain, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region; to determine, when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high frequency region is greater than a third threshold, that the frame to be classified in the audio signal is a music signal, otherwise determine that it is a voice signal; and, if the frame to be classified is determined to be a music signal and the number of frames determined as voice signals in the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a corresponding threshold, correct the frame to be classified in the audio signal to a voice signal; if the frame to be classified is determined to be a voice signal and the number of frames determined as music signals in the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fifth threshold, correct the frame to be classified in the audio signal to a music signal, where N4 is a positive integer.
  • that the processor obtains the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameter of the N3 frames before the frame to be classified, and the pitch distribution parameter of the L2 frames after the frame to be classified includes: using the frequency domain distribution information of the tonal components of the frame to be classified as the pitch distribution parameter of the frame to be classified, the frequency domain distribution information of the tonal components of the N3 frames before the frame to be classified as the pitch distribution parameter of the N3 frames before the frame to be classified, and the frequency domain distribution information of the tonal components of the L2 frames after the frame to be classified as the pitch distribution parameter of the L2 frames after the frame to be classified.
  • that the processor obtains, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: obtaining the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
  • that the processor obtains the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frames before the frame to be classified, and the energy distribution parameter of the L2 frames after the frame to be classified includes: acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratio and the sound pressure level of the N3 frames before the frame to be classified as the energy distribution parameter of the N3 frames before the frame to be classified, and the high frequency energy distribution ratio and the sound pressure level of the L2 frames after the frame to be classified as the energy distribution parameter of the L2 frames after the frame to be classified.
  • that the processor obtains, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low frequency region includes: acquiring the number of consecutive frames, including the frame to be classified, whose high frequency energy distribution ratio is smaller than an eighth threshold.
  • that the processor obtains, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the high frequency region includes: acquiring the number of consecutive frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
  • the number of tonal components, acquired by the processor, whose number of consecutive frames in the frame to be classified is greater than the sixth threshold is the number of tonal components that are greater than a seventh threshold in the frequency domain.
  • the aforementioned program can be stored in a computer-readable storage medium. When the program is executed, the steps of the foregoing method embodiments are performed; the foregoing storage media include ROM, RAM, magnetic disks, optical discs, and other media that can store program code.


Abstract

提供一种音频信号分类处理方法、装置及设备,所述方法包括:获取音频信号中待分类帧中满足连续性约束条件的音调分量的数量、所述音频信号中待分类帧在低频区域的持续帧数和所述待分类帧在高频区域的持续帧数中的至少一项(101);根据所述待分类帧中满足连续性约束条件的音调分量的数量、所述待分类帧在低频区域的持续帧数和所述待分类帧在高频区域的持续帧数中的至少一项,确定所述音频信号中待分类帧是音乐信号或是语音信号(102)。

Description

音频信号分类处理方法、 装置及设备
技术领域
本发明实施例涉及信号处理技术领域, 尤其涉及一种音频信号分类处 理方法、 装置及设备。 背景技术
在移动通信***的语音质量评估中, 现有的语音质量评估模型不适用 于音乐信号。 但是, 实际应用中的待分析信号中可能会包括音乐信号, 比 如彩铃等。 语音质量评估模型会将其视为语音信号, 给出错误的质量评估 结果。 针对该问题, 在将待分析信号输入至语音质量评估模块之前, 应先 对其进行信号分类。 如果识别出该段信号为语音信号, 将其送入语音质量 评估模块进行质量评估; 如果识别出该段信号为音乐信号, 则不送入语音 质量评估模块。
现有技术提供有应用于语音音乐联合编码器的音频信号分类方法, 但 是该分类方法是针对具有高采样率的语音音乐联合编码器, 对于语音质量 评估模型而言, 其中存在的音乐信号普遍缺少高频信息, 利用现有的应用 于语音音乐联合编码器的音频信号分类方法, 仅能识别出少数的音乐信 号, 且分类正确率低, 不能够满足语音质量评估的要求。 发明内容
本发明提供一种音频信号分类处理方法、 装置及设备, 用于提高音频 信号的分类正确率。
本发明的第一个方面是提供一种音频信号分类处理方法, 包括: 获取音频信号中待分类帧中满足连续性约束条件的音调分量的数量、 所述音频信号中待分类帧在低频区域的持续帧数和所述待分类帧在高频 区域的持续帧数中的至少一项;
根据获取的所述待分类帧中满足连续性约束条件的音调分量的数量、 所述待分类帧在低频区域的持续帧数或所述待分类帧在高频区域的持续 帧数, 确定所述音频信号中待分类帧为音乐信号, 或确定所述音频信号中 待分类帧为语音信号。
在上述第一个方面的第一种可能中, 在所述获取音频信号中待分类帧 中满足连续性约束条件的音调分量的数量包括:
获取音频信号中待分类帧, 以及待分类帧前 N1帧的音调分布参数, 并根据所述待分类帧, 以及待分类帧前 N帧的音调分布参数获取待分类帧 中满足连续性约束条件的音调分量的数量, N1为正整数;
所述获取所述音频信号中待分类帧在低频区域的持续帧数和 /或所述 待分类帧在高频区域的持续帧数包括:
获取所述音频信号中待分类帧, 以及待分类帧前 N1帧的能量分布参 数, 并根据所述音频信号中待分类帧, 以及待分类帧前 N1帧的能量分布 参数获取所述待分类帧在低频区域的持续帧数和 /或所述待分类帧在高频 区域的持续帧数, N1为正整数;
所述根据所述待分类帧中满足连续性约束条件的音调分量的数量、所 述待分类帧在低频区域的持续帧数或所述待分类帧在高频区域的持续帧 数, 确定所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中 待分类帧为语音信号包括:
在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一 阈值、 所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧 在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号。
结合上述第一个方面的第一种可能的第二种可能中, 上述获取音频信 号中待分类帧的音调分布参数, 以及待分类帧前 N1帧的音调分布参数包 括:
对接收到的音频信号中的待分类帧和待分类帧前 N1帧进行快速傅里 叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧作为 待分类帧的音调分布参数, 以及待分类帧前 N1帧的音调分量的频域分布 信息作为待分类帧前 N1帧的音调分布参数;
所述根据待分类帧的音调分布参数, 以及待分类帧前 N1帧的音调分 布参数获取待分类帧中满足连续性约束条件的音调分量的数量包括: 根据接收到的音频信号中的待分类帧和待分类帧前 N1帧的音调分量 的频域分布信息获取待分类帧中持续帧数大于第六阈值的音调分量的数 结合上述第一个方面的第一种可能的第三种可能中, 上述获取所音频 信号中待分类帧的能量分布参数, 以及待分类帧前 N1帧的能量分布参数 包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 以及待分类帧前 N1帧的高频能量分布比和声 压级作为待分类帧前 N1帧的能量分布参数;
所述根据音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1 帧的能量分布参数获取所述待分类帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比小于第 八阈值的持续帧数;
所述根据音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1 帧的能量分布参数获取所述待分类帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比大于第 九阈值、 声压级大于第十阈值的持续帧数。
在结合上述第一个方面或第一个方面的任一种可能的第四种可能中, 在延时 L1帧获取所述待分类帧的分类结果时, L1为正整数, 所述获取音 频信号中待分类帧中满足连续性约束条件的音调分量的数量包括:
获取音频信号中待分类帧, 待分类帧前 N2帧, 以及待分类帧后 L1帧 的音调分布参数, 并根据所述待分类帧, 待分类帧前 N2帧以及待分类帧 后 L1帧的音调分布参数获取待分类帧中满足连续性约束条件的音调分量 的数量, N2为正整数;
所述获取所述音频信号中待分类帧在低频区域的持续帧数和 /或所述 待分类帧在高频区域的持续帧数包括:
获取所述音频信号中待分类帧, 以及待分类帧前 N2帧以及待分类帧 后 L1帧的能量分布参数, 并根据所述音频信号中待分类帧, 待分类帧前 N2帧以及待分类帧后 L1帧的能量分布参数获取所述待分类帧在低频区域 的持续帧数和 /或所述待分类帧在高频区域的持续帧数;
所述根据所述待分类帧中满足连续性约束条件的音调分量的数量、所 述待分类帧在低频区域的持续帧数或所述待分类帧在高频区域的持续帧 数, 确定所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中 待分类帧为语音信号包括:
在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一 阈值、 所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧 在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号。
在结合上述第一个方面的第四种可能的第五种可能中, 所述获取音频 信号中待分类帧的音调分布参数, 待分类帧前 N2帧的音调分布参数, 以 及待分类帧后 L1帧的音调分布参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧帧 后 L1帧进行快速傅里叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数, 待分类帧前 N2帧 的音调分量的频域分布信息作为待分类帧前 N2帧的音调分布参数, 以及 待分类帧帧后 L1帧的音调分量的频域分布信息作为待分类帧帧后 L1帧的 音调分布参数;
所述根据待分类帧的音调分布参数, 待分类帧前 N2帧的音调分布参 数, 以及待分类帧后 L1帧的音调分布参数获取待分类帧中满足连续性约 束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧 帧后 L1帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六 阈值的音调分量的数量。
在结合上述第一个方面的第四种可能的第六种可能中, 所述获取所音 频信号中待分类帧的能量分布参数, 待分类帧前 N2帧的能量分布参数以 及待分类帧后 L1帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N2帧的高频能量分布比和声压级 作为待分类帧前 N2帧的能量分布参数和待分类帧后 L 1帧的高频能量分布 比和声压级作为待分类帧后 L 1帧的能量分布参数;
所述根据音频信号中待分类帧的能量分布参数, 待分类帧前 N2帧的 能量分布参数以及待分类帧后 L 1帧的能量分布参数获取所述待分类帧在 低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
所述根据音频信号中待分类帧的能量分布参数, 待分类帧前 N2帧的 能量分布参数以及待分类帧后 L 1帧的能量分布参数获取所述待分类帧在 高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
在结合上述第一个方面、 第一个方面的上述任一种可能的第七种可能 中, 在延时 L2+L3帧获取所述待分类帧的分类结果时, L2和 L3为正整数, 所述获取音频信号中待分类帧中满足连续性约束条件的音调分量的数量 包括:
获取音频信号中待分类帧, 待分类帧前 N3帧, 以及待分类帧后 L2帧 的音调分布参数, 并根据所述待分类帧, 待分类帧前 N3帧以及待分类帧 后 L2帧的音调分布参数获取待分类帧中满足连续性约束条件的音调分量 的数量, N3为正整数;
所述获取所述音频信号中待分类帧在低频区域的持续帧数和 /或所述 待分类帧在高频区域的持续帧数包括:
获取所述音频信号中待分类帧, 以及待分类帧前 N3帧以及待分类帧 后 L2帧的能量分布参数, 并根据所述音频信号中待分类帧, 待分类帧前 N3帧以及待分类帧后 L2帧的能量分布参数获取所述待分类帧在低频区域 的持续帧数和 /或所述待分类帧在高频区域的持续帧数;
所述根据所述待分类帧中满足连续性约束条件的音调分量的数量、所 述待分类帧在低频区域的持续帧数或所述待分类帧在高频区域的持续帧 数, 确定所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中 待分类帧为语音信号包括:
在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一 阈值、 所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧 在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号;
若确定所述音频信号中待分类帧为音乐信号, 则确定所述待分类帧前 N4帧和待分类帧后 L3帧中确定为语音信号的帧数目是否大于第四阈值, 若超过, 则将所述音频信号中待分类帧修正为语音信号, N4为正整数; 若确定所述音频信号中待分类帧为语音信号, 则确定所述待分类帧前 N4帧和待分类帧后 L3帧中确定为音乐信号的帧数目是否大于第五阈值, 若大于, 则将所述音频信号中待分类帧修正为音乐信号。
在结合上述第一个方面的第七中可能的第八种可能中, 所述获取音频 信号中待分类帧的音调分布参数, 待分类帧前 N3帧的音调分布参数, 以 及待分类帧后 L2帧的音调分布参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧帧 后 L2帧进行快速傅里叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数, 待分类帧前 N3的 音调分量的频域分布信息作为待分类帧前 N3帧的音调分布参数帧和待分 类帧帧后 L2帧的音调分量的频域分布信息作为待分类帧帧后 L2帧的音调 分布参数;
所述根据待分类帧的音调分布参数, 待分类帧前 N3帧的音调分布参 数, 以及待分类帧后 L2帧的音调分布参数获取待分类帧中满足连续性约 束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧 帧后 L2帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六 阈值的音调分量的数量。
在结合上述第一个方面的第七中可能的第九种可能中, 所述获取所音 频信号中待分类帧的能量分布参数, 待分类帧前 N3帧的能量分布参数以 及待分类帧后 L2帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N3帧的高频能量分布比和声压级 作为待分类帧前 N3帧的能量分布参数, 以及待分类帧帧后 L2帧的高频能 量分布比和声压级作为待分类帧前 N3帧的能量分布参数;
所述根据音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧的 能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述待分类帧在 低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
所述根据音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧的 能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述待分类帧在 高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
在结合上述第一个方面的第二种可能、第五种可能或第八种可能的第 十种可能中, 所述待分类帧中持续帧数大于第六阈值的音调分量的数量为 在频域上大于第七阈值的音调分量的数量。 本发明的第二个方面是提供一种音频信号分类处理装置, 包括: 第一获取模块, 用于获取音频信号中待分类帧中满足连续性约束条件 的音调分量的数量、所述音频信号中待分类帧在低频区域的持续帧数和所 述待分类帧在高频区域的持续帧数中的至少一项;
分类确定模块, 用于根据所述待分类帧中满足连续性约束条件的音调 分量的数量、所述待分类帧在低频区域的持续帧数和所述待分类帧的高频 区域的持续帧数中的至少一项, 确定所述音频信号中待分类帧为音乐信 号, 或确定所述音频信号中待分类帧为语音信号。 在结合上述第二个方面的第一种可能中, 所述第一获取模块具体用于 获取音频信号中待分类帧, 以及待分类帧前 N1帧的音调分布参数, 并根 据所述待分类帧, 以及待分类帧前 N1帧的音调分布参数获取待分类帧中 满足连续性约束条件的音调分量的数量, N1为正整数; 或,
具体用于获取所述音频信号中待分类帧, 以及待分类帧前 N1帧的能 量分布参数, 并根据所述音频信号中待分类帧, 以及待分类帧前 N1帧的 能量分布参数获取所述待分类帧在低频区域的持续帧数或所述待分类帧 在高频区域的持续帧数;
所述分类确定模块具体用于在所述待分类帧中满足连续性约束条件 的音调分量的数量大于第一阈值、所述待分类帧在低频区域的持续帧数大 于第二阈值或所述待分类帧在高频区域的持续帧数大于第三阈值时, 确定 所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中待分类帧 为语音信号。
结合上述第二个方面第一种可能的第二种可能中, 所述第一获取模块 获取音频信号中待分类帧的音调分布参数, 以及待分类帧前 N1帧的音调 分布参数包括:
对接收到的音频信号中的待分类帧和待分类帧前 N1帧进行快速傅里 叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数,以及待分类帧前 N1 帧的音调分量的频域分布信息作为待分类帧前 N1帧的音调分布参数; 所述分类确定模块根据待分类帧的音调分布参数,以及待分类帧前 N1 帧的音调分布参数获取待分类帧中满足连续性约束条件的音调分量的数 量包括:
根据接收到的音频信号中的待分类帧和待分类帧前 N1帧的音调分量 的频域分布信息获取待分类帧中持续帧数大于第六阈值的音调分量的数 结合上述第二个方面第一种可能的第三种可能中, 所述第一获取模块 获取所音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1帧的能 量分布参数包括: 获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 以及待分类帧前 N1帧的高频能量分布比和声 压级作为待分类帧前 N1帧的能量分布参数;
所述分类确定模块根据音频信号中待分类帧的能量分布参数, 以及待 分类帧前 N1帧的能量分布参数获取所述待分类帧在低频区域的持续帧数 包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比小于第 八阈值的持续帧数;
所述分类确定模块根据音频信号中待分类帧的能量分布参数, 以及待 分类帧前 N1帧的能量分布参数获取所述待分类帧在高频区域的持续帧数 包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比大于第 九阈值、 声压级大于第十阈值的持续帧数。
结合上述第二个方面或第二个方面的任一种可能的第四种可能中, 在 延时 L1帧获取所述待分类帧的分类结果时, L1为正整数, 所述第一获取 模块具体用于获取音频信号中待分类帧, 待分类帧前 N2帧, 以及待分类 帧后 L1帧的音调分布参数, 并根据所述待分类帧, 待分类帧前 N2帧以及 待分类帧后 L1帧的音调分布参数获取待分类帧中满足连续性约束条件的 音调分量的数量, N2为正整数; 或, 具体用于获取所述音频信号中待分类 帧, 以及待分类帧前 N2帧以及待分类帧后 L1帧的能量分布参数, 并根据 所述音频信号中待分类帧, 待分类帧前 N2帧以及待分类帧后 L1帧的能量 分布参数获取所述待分类帧在低频区域的持续帧数和 /或所述待分类帧在 高频区域的持续帧数;
所述分类确定模块具体用于在所述待分类帧中满足连续性约束条件 的音调分量的数量大于第一阈值、所述待分类帧在低频区域的持续帧数大 于第二阈值或所述待分类帧在高频区域的持续帧数大于第三阈值时, 确定 所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中待分类帧 为语音信号。 结合上述第二个方面第四种可能的第五种可能中, 所述第一获取模块 获取音频信号中待分类帧的音调分布参数, 待分类帧前 N2帧的音调分布 参数, 以及待分类帧后 L1帧的音调分布参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧帧 后 L1帧进行快速傅里叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数, 待分类帧前 N2帧 的音调分量的频域分布信息作为待分类帧前 N2帧的音调分布参数, 以及 待分类帧帧后 L1帧的音调分量的频域分布信息作为待分类帧帧后 L1帧的 音调分布参数;
所述分类确定模块根据待分类帧的音调分布参数, 待分类帧前 N2帧 的音调分布参数, 以及待分类帧后 L1帧的音调分布参数获取待分类帧中 满足连续性约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧 帧后 L1帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六 阈值的音调分量的数量。
在结合上述第二个方面第四种可能的第六种可能中, 所述第一获取模 块获取所音频信号中待分类帧的能量分布参数, 待分类帧前 N2帧的能量 分布参数以及待分类帧后 L1帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N2帧的高频能量分布比和声压级 作为待分类帧前 N2帧的能量分布参数和待分类帧帧后 L1帧的高频能量分 布比和声压级作为待分类帧后 L1帧的能量分布参数;
所述分类确定模块根据音频信号中待分类帧的能量分布参数, 待分类 帧前 N2帧的能量分布参数以及待分类帧后 L1帧的能量分布参数获取所述 待分类帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
所述分类确定模块根据音频信号中待分类帧的能量分布参数, 待分类 帧前 N2帧的能量分布参数以及待分类帧后 L 1帧的能量分布参数获取所述 待分类帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
结合上述第二个方面和第二个方面的上述任一种可能的第七种可能 中, 在延时 L2+L3帧获取所述待分类帧的分类结果时, L2和 L3为正整数, 所述第一获取模块具体用于获取音频信号中待分类帧, 待分类帧前 N3帧, 以及待分类帧后 L2帧的音调分布参数, 并根据所述待分类帧, 待分类帧 前 N3帧以及待分类帧后 L2帧的音调分布参数获取待分类帧中满足连续性 约束条件的音调分量的数量, N3为正整数; 或,
具体用于获取所述音频信号中待分类帧, 以及待分类帧前 N3帧以及 待分类帧后 L3帧的能量分布参数, 并根据所述音频信号中待分类帧, 待 分类帧前 N3帧以及待分类帧后 L3帧的能量分布参数获取所述待分类帧在 低频区域的持续帧数或所述待分类帧在高频区域的持续帧数;
所述分类处理模块具体用于在所述待分类帧中满足连续性约束条件 的音调分量的数量大于第一阈值、所述待分类帧在低频区域的持续帧数大 于第二阈值或所述待分类帧在高频区域的持续帧数大于第三阈值时, 确定 所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中待分类帧 为语音信号; 若确定所述音频信号中待分类帧为音乐信号, 则确定所述待 分类帧前 N4帧和待分类帧中后 L3帧中确定为语音信号的帧数目是否大于 第四阈值, 若超过, 则将所述音频信号中待分类帧修正为语音信号; 若确 定所述音频信号中待分类帧为语音信号, 则确定所述待分类帧前 N4帧和 待分类帧中后 L3帧中确定为音乐信号的帧数目是否大于第五阈值, 若大 于, 则将所述音频信号中待分类帧修正为音乐信号, N4为正整数。
在结合上述第二个方面的第七种可能的第八种可能中, 所述第一获取 模块获取音频信号中待分类帧的音调分布参数, 待分类帧前 N3帧的音调 分布参数, 以及待分类帧后 L2帧的音调分布参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧帧 后 L2帧进行快速傅里叶变换, 获取功率密度谱; 根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数, 待分类帧前 N3帧 的音调分量的频域分布信息作为待分类帧前 N3帧的音调分布参数, 以及 待分类帧帧后 L2帧的音调分量的频域分布信息作为待分类帧后 L2帧的音 调分布参数;
所述分类确定模块根据待分类帧的音调分布参数, 待分类帧前 N3帧 的音调分布参数, 以及待分类帧后 L2帧的音调分布参数获取待分类帧中 满足连续性约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧 后 L2帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六阈 值的音调分量的数量。
在结合上述第二个方面的第七种可能的第九种可能中, 所述第一获取 模块获取所音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧的能 量分布参数以及待分类帧后 L2帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N3帧的高频能量分布比和声压级 作为待分类帧前 N3帧的能量分布参数, 以及待分类帧帧后 L2帧的高频能 量分布比和声压级作为待分类帧后 L2帧的能量分布参数;
所述分类确定模块根据音频信号中待分类帧的能量分布参数, 待分类 帧前 N3帧的能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述 待分类帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
所述分类确定模块根据音频信号中待分类帧的能量分布参数, 待分类 帧前 N3帧的能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述 待分类帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。 在结合上述第二个方面的第二种可能、第五种可能或第八种可能的第 十种可能中, 所述第一获取模块获取的待分类帧中持续帧数大于第六阈值 的音调分量的数量为在频域上大于第七阈值的音调分量的数量。 满足连续 性约束条件的音调分量的数量为在频域上大于第七阈值的音调分量的数 结合上述第二个方面的第一种可能、 第二种可能或第三中可能的第六 种可能中, 上述第一获取模块具体用于获取接收到的音频信号中的各帧的 高频能量分布比和声压级; 以及根据所述接收到的音频信号中的各帧的高 频能量分布比和声压级, 获取包括所述待分类帧在内的高频能量分布比小 于第八阈值的持续帧数, 或, 根据所述接收到的音频信号中的各帧的高频 能量分布比和声压级, 获取包括所述待分类帧在内的高频能量分布比大于 第九阈值、 声压级大于第十阈值的持续帧数。 本发明的第三个方面是提供一种音频信号分类处理设备, 包括: 接收器, 用于接收音频信号;
处理器, 与所述接收器连接, 用于获取接收器接收到的音频信号中待 分类帧中满足连续性约束条件的音调分量的数量、 所述音频信号中待分类 帧在低频区域的持续帧数和所述待分类帧在高频区域的持续帧数中的至 少一项, 根据所述待分类帧中满足连续性约束条件的音调分量的数量、 所 述待分类帧在低频区域的持续帧数和所述待分类帧在高频区域的持续帧 数中的至少一项, 确定所述音频信号中待分类帧为音乐信号, 或确定所述 音频信号中待分类帧为语音信号。
在第三个方面的第一种可能中, 所述处理器具体用于获取音频信号中 待分类帧, 以及待分类帧前 N1帧的音调分布参数, 并根据所述待分类帧, 以及待分类帧前 N帧的音调分布参数获取待分类帧中满足连续性约束条件 的音调分量的数量, N1为正整数; 获取所述音频信号中待分类帧, 以及待 分类帧前 N1帧的能量分布参数, 并根据所述音频信号中待分类帧, 以及 待分类帧前 N1帧的能量分布参数获取所述待分类帧在低频区域的持续帧 数和 /或所述待分类帧在高频区域的持续帧数, N1为正整数; 在所述待分 类帧中满足连续性约束条件的音调分量的数量大于第一阈值、 所述待分类 帧在低频区域的持续帧数大于第二阈值或所述待分类帧在高频区域的持 续帧数大于第三阈值时, 确定所述音频信号中待分类帧为音乐信号, 否则 确定所述音频信号中待分类帧为语音信号。
结合上述第第三个方面的第一种可能的第二种可能中, 所述处理器获 取音频信号中待分类帧的音调分布参数, 以及待分类帧前 N1帧的音调分 布参数包括:
对接收到的音频信号中的待分类帧和待分类帧前 N1帧进行快速傅里 叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数, 以及和待分类帧前 N1帧的音调分量的频域分布信息作为待分类帧前 N1帧的音调分布参数; 所述处理器根据待分类帧的音调分布参数, 以及待分类帧前 N1帧的 音调分布参数获取待分类帧中满足连续性约束条件的音调分量的数量包 括:
根据接收到的音频信号中的待分类帧和待分类帧前 N1帧的音调分量 的频域分布信息获取待分类帧中持续帧数大于第六阈值的音调分量的数 结合上述第第三个方面的第一种可能的第三种可能中, 所述处理器获 取所音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1帧的能量 分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 以及待分类帧前 N1帧的高频能量分布比和声 压级作为待分类帧前 N1帧的能量分布参数;
所述处理器根据音频信号中待分类帧的能量分布参数, 以及待分类帧 前 N1帧的能量分布参数获取所述待分类帧在低频区域的持续帧数包括: 根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比小于第 八阈值的持续帧数;
所述处理器根据音频信号中待分类帧的能量分布参数, 以及待分类帧 前 N1帧的能量分布参数获取所述待分类帧在高频区域的持续帧数包括: 根据所述接收到的音频信号中待分类帧和待分类帧前 Nl帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比大于第 九阈值、 声压级大于第十阈值的持续帧数。
结合第三个方面或第三个方面的上述任一种可能的第四种可能中, 在 延时 L1帧获取所述待分类帧的分类结果时, L1为正整数, 所述处理器具 体用于获取音频信号中待分类帧, 待分类帧前 N2帧, 以及待分类帧后 L1 帧的音调分布参数, 并根据所述待分类帧, 待分类帧前 N2帧以及待分类 帧后 L1帧的音调分布参数获取待分类帧中满足连续性约束条件的音调分 量的数量, N2为正整数; 获取所述音频信号中待分类帧, 以及待分类帧前 N2帧以及待分类帧后 L1帧的能量分布参数, 并根据所述音频信号中待分 类帧, 待分类帧前 N2帧以及待分类帧后 L1帧的能量分布参数获取所述待 分类帧在低频区域的持续帧数和 /或所述待分类帧在高频区域的持续帧 数; 在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一阈 值、所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧在 高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为音 乐信号, 否则确定所述音频信号中待分类帧为语音信号。
在结合第三个方面的第四种可能的第五种可能中, 所述处理器获取音 频信号中待分类帧的音调分布参数, 待分类帧前 N2帧的音调分布参数, 以及待分类帧后 L1帧的音调分布参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧帧 后 L1帧进行快速傅里叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧帧的 音调分量的频域分布信息作为待分类帧的音调分布参数, 待分类帧前 N2 帧的音调分量的频域分布信息作为待分类帧前 N2帧的音调分布参数, 以 及待分类帧帧后 L1帧的音调分量的频域分布信息作为待分类帧帧后 L1帧 的音调分布参数;
所述处理器根据待分类帧的音调分布参数, 待分类帧前 N2帧的音调 分布参数, 以及待分类帧后 L1帧的音调分布参数获取待分类帧中满足连 续性约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧 帧后 LI帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六 阈值的音调分量的数量。
在结合第三个方面的第四种可能的第六种可能中, 所述处理器获取所 音频信号中待分类帧的能量分布参数, 待分类帧前 N2帧的能量分布参数 以及待分类帧后 L 1帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N2帧的高频能量分布比和声压级 作为待分类帧前 N2帧的能量分布参数和待分类帧后 L 1帧的高频能量分布 比和声压级作为待分类帧后 L 1帧的能量分布参数;
所述处理器根据音频信号中待分类帧的能量分布参数,待分类帧前 N2 帧的能量分布参数以及待分类帧后 L1帧的能量分布参数获取所述待分类 帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
所述处理器根据音频信号中待分类帧的能量分布参数,待分类帧前 N2 帧的能量分布参数以及待分类帧后 L1帧的能量分布参数获取所述待分类 帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
结合第三个方面、 第三个方面的上述任一种可能的第七种可能中, 在 延时 L2+L3帧获取所述待分类帧的分类结果时, L2和 L3为正整数, 所述 处理器具体用于获取音频信号中待分类帧, 待分类帧前 N3帧, 以及待分 类帧后 L2帧的音调分布参数, 并根据所述待分类帧, 待分类帧前 N3帧以 及待分类帧后 L2帧的音调分布参数获取待分类帧中满足连续性约束条件 的音调分量的数量, N3为正整数; 获取所述音频信号中待分类帧, 以及待 分类帧前 N3帧以及待分类帧后 L2帧的能量分布参数, 并根据所述音频信 号中待分类帧, 待分类帧前 N3帧以及待分类帧后 L2帧的能量分布参数获 取所述待分类帧在低频区域的持续帧数和 /或所述待分类帧在高频区域的 持续帧数; 在所述待分类帧中满足连续性约束条件的音调分量的数量大于 第一阈值、所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分 类帧在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类 帧为音乐信号, 否则确定所述音频信号中待分类帧为语音信号; 若确定所 述音频信号中待分类帧为音乐信号, 则确定所述待分类帧前 N4帧和待分 类帧后 L3帧中确定为语音信号的帧数目是否大于第四阈值, 若超过, 则 将所述音频信号中待分类帧修正为语音信号, N4为正整数; 若确定所述音 频信号中待分类帧为语音信号, 则确定所述待分类帧前 N4帧和待分类帧 后 L3帧中确定为音乐信号的帧数目是否大于第五阈值, 若大于, 则将所 述音频信号中待分类帧修正为音乐信号。
结合上述第三个方面的第七种可能的第八种可能中, 所述处理器获取 音频信号中待分类帧的音调分布参数, 待分类帧前 N3帧的音调分布参数, 以及待分类帧后 L2帧的音调分布参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧帧 后 L2帧进行快速傅里叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数, 待分类帧前 N3帧 的音调分量的频域分布信息作为待分类帧前 N3帧的音调分布参数和待分 类帧帧后 L2帧的音调分量的频域分布信息作为待分类帧后 L2帧的音调分 布参数;
所述处理器根据待分类帧的音调分布参数, 待分类帧前 N3帧的音调 分布参数, 以及待分类帧后 L2帧的音调分布参数获取待分类帧中满足连 续性约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧 帧后 L2帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六 阈值的音调分量的数量。
结合上述第三个方面的第七种可能的第九种可能中, 所述处理器获取 所音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧的能量分布参 数以及待分类帧后 L2帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N3帧作为待分类帧前 N3帧的能量 分布参数, 以及待分类帧帧后 L2帧的高频能量分布比和声压级作为待分 类帧后 L2帧的能量分布参数;
所述处理器根据音频信号中待分类帧的能量分布参数,待分类帧前 N3 帧的能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述待分类 帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
所述处理器根据音频信号中待分类帧的能量分布参数,待分类帧前 N3 帧的能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述待分类 帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
结合上述第三个方面的第二种可能、 第五种可能或第八种可能的第十 种可能中, 所述处理器获取的待分类帧中持续帧数大于第六阈值的音调分 量的数量为在频域上大于第七阈值的音调分量的数量。 满足连续性约束条 件的音调分量的数量为在频域上大于第七阈值的音调分量的数量。 本发明提供的技术方案, 主要是考虑到音乐信号的特性, 例如音乐信 号的音调持续时间较长, 而语音信号的音调持续时间较短, 音乐信号的能 量可以持续分布在高频区域或低频区域, 而语音信号通常不能持续分布在 高频区域或低频区域, 在考虑音乐信号上述特点的基础上, 本发明实施例 提供的技术方案中, 首先获取音频信号中待分类帧中满足连续性约束条件 的音调分量的数量, 以及音频信号中待分类帧在低频区域的持续帧数和 / 或所述待分类帧在高频区域的持续帧数, 并根据上述信息确认待分类帧的 类型是音乐信号, 还是语音信号, 上述技术方案提供的音频信号分类处理 方法, 能够提高音频信号分类的正确率, 满足语音质量评估的要求。
附图说明 为了更清楚地说明本发明实施例中的技术方案, 下面将对实施例描述 中所需要使用的附图作一简单地介绍, 显而易见地, 下面描述中的附图是 本发明的一些实施例, 对于本领域普通技术人员来讲, 在不付出创造性劳 动性的前提下, 还可以根据这些附图获得其他的附图。
图 1为本发明实施例中音频信号分类处理方法的流程示意图一; 图 2为本发明具体实施例中的流程示意图一;
图 3a为输入信号 "法语男声 +笙" 的波形图一;
图 3b为与图 3a对应的语谱图;
图 4a为音频信号 "京胡 +法语男声的信号" 的输入信号的波形图; 图 4b为与图 4a对应的语谱图;
图 5a为输入信号 "韩语男声 +合奏" 的波形图;
图 5b为与图 5a对应的语谱图;
图 6a为输入信号 "法语男声 +笙" 的波形图二;
图 6b为图 6a所示输入信号的初始音调检测结果;
图 6c为图 6a所示输入信号筛选后的音调检测结果;
图 7a为输入信号 "法语男声 +笙" 的波形图三;
图 7b为图 7a对应的音调特征"" m-to∞z - ^的曲线图;
图 8a为输入信号 "京胡 +法语男声" 的波形图;
图 8b为与图 8a对应的高频能量分布比值^ - -^^的曲线图; 图 9a为输入信号 "韩语男声 +合奏" 的波形图;
图%为与图 9a对应的高频能量分布比值 -^ -^^)的曲线图; 图 10为本发明实施例中音频信号分类规则流程示意图一;
图 11a为输入信号 "中文女声 +合奏 +英语男声 +塡 +德语男声 +响板" 的波形图一;
图 l ib为图 11a对应的分类结果示意图;
图 12a为输入信号 "中文女声 +合奏 +英语男声 +塡 +德语男声 +响板" 的波形图二;
图 12b为图 12a对应的平滑后的分类结果示意图;
图 13为本发明实施例中音频信号分类规则流程示意图二;
图 14a为输入信号 "中文女声 +合奏 +英语男声 +塡 +德语男声 +响板" 的波形图三;
图 14b为图 14a对应的实时分类结果示意图;
图 15为本发明实施例中输出延时不固定的情况下语音分类方法流程 图;
图 16a为输入信号 "中文女声 +合奏 +英语男声 +塡 +德语男声 +响板" 的波形图四;
图 16b为图 16a对应的三种分类方式的分类结果示意图;
图 17为本发明实施例中音频信号分类处理装置的结构示意图; 图 18为本发明实施例中音频信号分类处理设备的结构示意图。 具体实施方式
为使本发明实施例的目的、 技术方案和优点更加清楚, 下面将结合本 发明实施例中的附图, 对本发明实施例中的技术方案进行清楚、 完整地描 述, 显然,所描述的实施例是本发明一部分实施例, 而不是全部的实施例。 基于本发明中的实施例, 本领域普通技术人员在没有作出创造性劳动前提 下所获得的所有其他实施例, 都属于本发明保护的范围。
针对现有技术中的缺陷, 本发明实施例提供了一种音频信号分类处理 方法, 图 1为本发明实施例中音频信号分类处理方法的流程示意图一, 如 图 1所示, 该方法包括如下歩骤:
歩骤 101、 获取音频信号中待分类帧中满足连续性约束条件的音调分 量的数量、所述音频信号中待分类帧在低频区域的持续帧数和所述待分类 帧在高频区域的持续帧数中的至少一项;
歩骤 102、 根据获取的所述待分类帧中满足连续性约束条件的音调分 量的数量、所述待分类帧在低频区域的持续帧数和所述待分类帧在高频区 域的持续帧数中的至少一项, 确定所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中待分类帧为语音信号。
本发明实施例提供的音频信号分类处理方法, 在进行音频信号中的各 帧进行分类时, 既可以无输出延时的输出分类结果, 即对于接收到的音频 信号帧, 实时输出分类结果, 也可以存在一定的输出延时, 即对于接收到 的音频信号帧, 延迟一段时间给出分类结果。 本发明上述实施例提供的技术方案, 主要是考虑到音乐信号的特性, 例如音乐信号的音调持续时间较长, 而语音信号的音调持续时间较短, 音 乐信号的能量可以持续分布在高频区域或低频区域, 而语音信号通常不能 持续分布在高频区域或低频区域, 在考虑音乐信号上述特点的基础上, 本 发明实施例提供的技术方案中, 首先获取音频信号中待分类帧中满足连续 性约束条件的音调分量的数量, 以及音频信号中待分类帧在低频区域的持 续帧数和 /或所述待分类帧在高频区域的持续帧数, 并根据上述信息确认 待分类帧的类型是音乐信号, 还是语音信号, 上述技术方案提供的音频信 号分类处理方法, 能够提高音频信号分类的正确率, 满足语音质量评估的 要求。
本发明上述实施例中, 其中根据输出延时要求的不同, 可以分为三种 情况, 一是在实时获取所述待分类帧的分类结果时, 需要根据待分类帧, 以及待分类帧之前的 N帧的信息进行判断, 二是在允许较小的分类结果输 出延时, 即输出延时为 L1帧时, L1为正整数, 可以根据待分类帧, 待分 类帧前 L1帧, 以及待分类帧后 L1帧进行判断; 三是允许较大分类结果输 出延时, 即输出延时为 L2+L3帧时, L2和 L3为正整数, 先根据待分类帧, 待分类帧前 L2帧, 以及待分类帧后 L2帧进行判断, 获取初歩的待分类帧 的分类结果,然后再根据待分类帧前 L3帧和待分类帧中后 L3帧进行修改。 其中,在无输出延时时,对于最先接收到的音频信号中的帧无法进行分类, 可以将最先接收到的帧设置默认值, 默认其为语音信号或音乐信号。
具体的, 在无输出延时, 即实时获取所述待分类帧的分类结果时, 图 1所示实施例中的歩骤 101获取音频信号中待分类帧中满足连续性约束条 件的音调分量的数量具体包括:
获取音频信号中待分类帧, 以及待分类帧前 N1帧的音调分布参数, 并根据所述待分类帧, 以及待分类帧前 N1帧的音调分布参数获取待分类 帧中满足连续性约束条件的音调分量的数量, N1为正整数;
图 1所示实施例的歩骤 102中获取所述音频信号中待分类帧在低频区 域的持续帧数和 /或所述待分类帧在高频区域的持续帧数包括:
获取所述音频信号中待分类帧, 以及待分类帧前 N1帧的能量分布参 数, 并根据所述音频信号中待分类帧, 以及待分类帧前 N1帧的能量分布 参数获取所述待分类帧在低频区域的持续帧数和 /或所述待分类帧在高频 区域的持续帧数, N1为正整数;
图 1所示实施例的歩骤 103中根据所述待分类帧中满足连续性约束条 件的音调分量的数量、所述待分类帧在低频区域的持续帧数和所述待分类 帧在高频区域的持续帧数中的至少一项, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号包括:
在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一 阈值、 所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧 在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号。
上述实施例中, 其中获取音频信号中待分类帧的音调分布参数, 以及 待分类帧前 N1帧的音调分布参数包括:
对接收到的音频信号中的待分类帧和待分类帧前 N1帧进行快速傅里 叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧作为 待分类帧的音调分布参数, 以及待分类帧前 N1帧的音调分量的频域分布 信息作为待分类帧前 N1帧的音调分布参数。
而上述的根据待分类帧的音调分布参数, 以及待分类帧前 N1帧的音 调分布参数获取待分类帧中满足连续性约束条件的音调分量的数量包括: 根据接收到的音频信号中的待分类帧和待分类帧前 N1帧的音调分量 的频域分布信息获取待分类帧中持续帧数大于第六阈值的音调分量的数 另外, 上述获取所音频信号中待分类帧的能量分布参数, 以及待分类 帧前 N1帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 以及待分类帧前 N1帧的高频能量分布比和声 压级作为待分类帧前 N1帧的能量分布参数。
而上述根据音频信号中待分类帧的能量分布参数,以及待分类帧前 N1 帧的能量分布参数获取所述待分类帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比小于第 八阈值的持续帧数;
上述根据音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1 帧的能量分布参数获取所述待分类帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比大于第 九阈值、 声压级大于第十阈值的持续帧数。
在允许 L1帧分类结果输出延时, 即延时 L1帧获取所述待分类帧的分 类结果时, 图 1所示实施例的歩骤 101中获取音频信号中待分类帧中满足 连续性约束条件的音调分量的数量包括:
获取音频信号中待分类帧, 待分类帧前 N2帧, 以及待分类帧后 L1帧 的音调分布参数, 并根据所述待分类帧, 待分类帧前 N2帧以及待分类帧 后 L1帧的音调分布参数获取待分类帧中满足连续性约束条件的音调分量 的数量, N2为正整数;
图 1所示实施例的歩骤 102中获取所述音频信号中待分类帧在低频区 域的持续帧数和 /或所述待分类帧在高频区域的持续帧数包括:
获取所述音频信号中待分类帧, 以及待分类帧前 N2帧以及待分类帧 后 L1帧的能量分布参数, 并根据所述音频信号中待分类帧, 待分类帧前 N2帧以及待分类帧后 L1帧的能量分布参数获取所述待分类帧在低频区域 的持续帧数和 /或所述待分类帧在高频区域的持续帧数;
图 1所示实施例的歩骤 103中根据所述待分类帧中满足连续性约束条 件的音调分量的数量、所述待分类帧在低频区域的持续帧数和所述待分类 帧在高频区域的持续帧数中的至少一项, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号包括:
在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一 阈值、 所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧 在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号。
在上述实施例中, 其中获取音频信号中待分类帧的音调分布参数, 待 分类帧前 N2帧的音调分布参数, 以及待分类帧后 L1帧的音调分布参数包 括:
对接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧帧 后 L1帧进行快速傅里叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数, 待分类帧前 N2帧 的音调分量的频域分布信息作为待分类帧前 N2帧的音调分布参数, 以及 待分类帧帧后 L1帧的音调分量的频域分布信息作为待分类帧帧后 L1帧的 音调分布参数;
所述根据待分类帧的音调分布参数, 待分类帧前 N2帧的音调分布参 数, 以及待分类帧后 L1帧的音调分布参数获取待分类帧中满足连续性约 束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧 帧后 L1帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六 阈值的音调分量的数量。
另外, 上述获取所音频信号中待分类帧的能量分布参数, 待分类帧前
N2帧的能量分布参数以及待分类帧后 L1帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N2帧的高频能量分布比和声压级 作为待分类帧前 N2帧的能量分布参数和待分类帧后 L1帧的高频能量分布 比和声压级作为待分类帧后 L1帧的能量分布参数;
所述根据音频信号中待分类帧的能量分布参数, 待分类帧前 N2帧的 能量分布参数以及待分类帧后 L1帧的能量分布参数获取所述待分类帧在 低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
所述根据音频信号中待分类帧的能量分布参数, 待分类帧前 N2帧的 能量分布参数以及待分类帧后 L1帧的能量分布参数获取所述待分类帧在 高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 LI帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
在允许分类结果输出延时为 L2+L3帧, 即延时 L2+L3帧获取所述待分 类帧的分类结果时, 图 1所示实施例的歩骤 101中获取音频信号中待分类 帧中满足连续性约束条件的音调分量的数量包括:
获取音频信号中待分类帧, 待分类帧前 N3帧, 以及待分类帧后 L2帧 的音调分布参数, 并根据所述待分类帧, 待分类帧前 N3帧以及待分类帧 后 L2帧的音调分布参数获取待分类帧中满足连续性约束条件的音调分量 的数量, N3为正整数;
图 1所示实施例的歩骤 102中获取所述音频信号中待分类帧在低频区 域的持续帧数和 /或所述待分类帧在高频区域的持续帧数包括:
获取所述音频信号中待分类帧, 以及待分类帧前 N3帧以及待分类帧 后 L2帧的能量分布参数, 并根据所述音频信号中待分类帧, 待分类帧前 N3帧以及待分类帧后 L2帧的能量分布参数获取所述待分类帧在低频区域 的持续帧数和 /或所述待分类帧在高频区域的持续帧数。
图 1所示实施例的歩骤 103中根据所述待分类帧中满足连续性约束条 件的音调分量的数量、所述待分类帧在低频区域的持续帧数和所述待分类 帧在高频区域的持续帧数中的至少一项, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号包括:
在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一 阈值、 所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧 在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号;
若确定所述音频信号中待分类帧为音乐信号, 则确定所述待分类帧前 L3帧和待分类帧中后 L3帧中确定为语音信号的帧数目是否大于第四阈 值, 若超过, 则将所述音频信号中待分类帧修正为语音信号;
若确定所述音频信号中待分类帧为语音信号, 则确定所述待分类帧前 L3帧和待分类帧中后 L3帧中确定为音乐信号的帧数目是否大于第五阈 值, 若大于, 则将所述音频信号中待分类帧修正为音乐信号。
在上述实施例中, 所述获取音频信号中待分类帧的音调分布参数, 待 分类帧前 N3帧的音调分布参数, 以及待分类帧后 L2帧的音调分布参数包 括:
对接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧帧 后 L2帧进行快速傅里叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数, 待分类帧前 N3帧 的音调分量的频域分布信息作为待分类帧前 N3帧的音调分布参数, 以及 待分类帧帧后 L2帧的音调分量的频域分布信息作为待分类帧帧后 L2帧的 音调分布参数;
所述根据待分类帧的音调分布参数, 待分类帧前 N3帧的音调分布参 数, 以及待分类帧后 L2帧的音调分布参数获取待分类帧中满足连续性约 束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧 帧后 L2帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六 阈值的音调分量的数量。
另外, 所述获取所音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧的能量分布参数以及待分类帧后 L2帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N3帧作为待分类帧前 N3帧的能量 分布参数, 以及待分类帧帧后 L2帧的高频能量分布比和声压级作为待分 类帧后 L2帧的能量分布参数;
所述根据音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧的 能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述待分类帧在 低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
所述根据音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧的 能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述待分类帧在 高频区域的持续帧数包括: 根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
上述针对是否允许输出延时的三种情形下, 其中待分类帧中持续帧数 大于第六阈值的音调分量的数量为在频域上大于第七阈值的音调分量的 数量。
以下分别针对上述允许分类结果输出延时等情况进行详细说明。 首 先, 以允许 L1帧的少量固定输出延时为例, 本实施例中 L1取值为 15。 图 2为本发明具体实施例中的流程示意图一, 如图 2所示, 包括如下的歩骤: 歩骤 201、 对当前帧第 i帧进行 FFT变换, 本歩骤中是针对接收到的 每帧都进行 FFT变换;
歩骤 202、 基于 FFT变换结果, 获取第 i帧的音调分布参数, 及其能 量分布参数;
歩骤 203、 判断 i〉Ll是否成立, 即当前帧之前是否已存在 L1个帧, 如果是执行歩骤 204, 否则结束本流程, 继续执行针对后续的各帧执行上 述歩骤 201和歩骤 202的操作;
歩骤 204、 在:[〉1^时, 则可以获取第 i-Ll帧的音频信号分类结果, 具体的可以过去的信息, 即按照上述歩骤 201和歩骤 202获取的第 i-Ll 帧之前的若干帧的音调分布参数和能量分布参数, 现在的信息, 即第 i-Ll 帧的音调分布参数和能量分布参数, 以及未来的信息, 即第 i-Ll帧之后 的 L1帧的音调分布参数和能量分布参数,获取第 i-Ll帧的音频信号分类 结果;
歩骤 205, 输出第 i-Ll帧的音频信号分类结果。
具体的, 对于音乐信号和语音信号的音调分布情况, 可以参照图 3a 和图 3b, 图 3a为输入信号 "法语男声 +笙" 的波形图一, 图 3b为与图 3 对应的语谱图。 在图 3a的输入信号波形中, 采样率为 8kHz, 其中, 横轴 为样本点, 纵轴为归一化幅值; 图 3b的语谱图, 对应的采样率也为 8kHz, 频率分析范围为 (T4kHz。 其中, 横轴为帧, 与图 3a横轴的样本点相对应; 纵轴为频率 (Hz)。 在语谱图中, 某个频率范围内的亮度越高, 表示信号在 该频段的能量越大。 如果信号在某频段持续保持较大的能量, 在语谱图上 就会形成一条 "亮带" , 也就是音调。 通过图 3b的音调分布情况可知, 在前半段的语音信号中, 除了基频处的音调持续时间稍长一些, 更高频率 处的音调持续时间都是很短的。 在语音信号中, 能够检测出音调的地方为 浊音。 由于浊音的长度通常较短, 与之相对应的音调持续时间也较短; 而 在后半段的音乐信号中, 音调持续时间明显较长。
对于音乐信号和语音信号的能量分布情况, 可以参照图 4a和图 4b, 图 4a为音频信号 "京胡 +法语男声的信号" 的输入信号的波形图, 图 4b 为与图 4a对应的语谱图。 在图 4a的波形图中, 其中, 横轴为样本点; 纵 轴为归一化幅值; 图 4b的语谱图中, 横轴为帧; 纵轴为频率 (Hz)。 通过 图 4b的能量分布情况可知:在前半段的音乐信号中,能量基本分布在 1kHz 以上, 在 1kHz至 4kHz均有分布; 在后半段的语音信号中, 大部分浊音的 能量主要分布在 1kHz以下; 清音的能量在低频至较高频率范围内均有分 布。 因此, 语音信号的能量不可能持续分布在相对较高的频率范围内。
另外, 部分音乐信号的能量能够持续分布在低频区域; 相比之下, 语 音信号的能量不可能持续分布在低频区域。 以图 5a和图 5b所示的 "韩语 男声 +合奏" 的音频信号为例说明, 图 5a为输入信号 "韩语男声 +合奏" 的波形图, 其中, 横轴为样本点; 纵轴为归一化幅值; 图 5b为与图 5a对 应的语谱图, 其中, 横轴为帧; 纵轴为频率 (Hz)。 通过可以看出如下的能 量分布情况: 图 5b前半段的语音信号的能量分布情况与图 4b的语音信号 类似。 由于浊音和清音的能量分布特性不同, 造成语音信号的能量分布具 有较大的波动。 因此, 语音信号的能量既不可能持续分布在相对较高的频 率范围内, 也不可能持续分布在低频范围内; 在后半段的音乐信号中, 能 量主要分布在 1kHz以下。
综上所述, 音乐信号与语音信号的不同之处主要有: 一是部分音乐信 号的音调持续时间较长, 语音信号的音调持续时间通常较短; 二是部分音 乐信号的能量能够持续分布在相对较高的频率范围内; 语音信号的能量不 能持续分布在相对较高的频率范围内; 三是部分音乐信号的能量能够持续 分布在低频区域; 语音信号的能量不能持续分布在低频区域。 本发明各实 施例中的低频和高频的划分, 可以根据语音信号的分布区域确定, 将语音 信号主要分布的区域定义为低频区域, 例如将 1kHz以下定义为低频区域, 而将 1kHz定义为高频区域, 当然其具体取值也可以根据具体的应用场景 的不同, 针对的具体语音信号的不同而有所区别。
基于上述分类原理, 需要提取的特征主要有音调特征及能量特征。 具体的, 提取音调特征可以分为三个歩骤:
A、 获取初始音调检测结果, 即各帧的音调分布参数;
B、 通过连续性分析, 对初始音调检测结果进行筛选, 确定待分类帧 中满足连续性约束条件的音调分量, 该音调分量是指能量在频域上的一种 分布形式;
C、 基于筛选后的音调检测结果, 提取音调特征, 即待分类帧的满足 连续性约束条件的音调分量的数量。
其中, 上述获取初始音调检测结果可以包括: 首先, 对各个帧的数据 进行 FFT变换, 获取功率密度谱; 其次, 确定功率密度谱中的局部极大点; 最后, 针对以局部极大点为中心的若干功率密度谱系数进行分析, 进一歩 确定局部极大点是否为真正的音调分量。
本实施例中, 设输入信号的采样率为 8kHz, 有效带宽为 4kHz, FFT 取值为 1024, 功率密度谱的局部极大点 f_c 满足: P(f_c) > P(f_c - 1) 且 P(f_c) > P(f_c + 1)。
本实施例中, 如何选取以局部极大点为中心的若干功率密度谱系数进行分析, 是比较灵活的, 可以根据算法需要设定。例如可以采用如下方式实现:
如果局部极大点 f_c 满足以下条件: P(f_c) - P(f_c ± i) ≥ 7dB, 其中 i = 2, 3, …, 10,
即局部极大点与相邻的其他点的数值差异较大(本实施例中差异为 7dB)时, 则说明该局部极大点是真正的音调分量。
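The tone detection just described can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the power density spectrum is approximated by magnitude-squared FFT bins without any specific window, the 7 dB margin and the i = 2..10 neighbourhood follow the text above, and the function name, the guard band of ±10 bins and the toy test signal are assumptions introduced for the example.

```python
import numpy as np

def detect_tonal_components(frame, fft_size=1024, margin_db=7.0):
    """Return the FFT bins of one frame that qualify as tonal components:
    local maxima of the power spectrum that exceed their neighbours at
    distances 2..10 by at least `margin_db` dB."""
    spectrum = np.fft.rfft(frame, n=fft_size)
    power_db = 10.0 * np.log10(np.abs(spectrum) ** 2 + 1e-12)
    half = fft_size // 2                       # 0..4 kHz at an 8 kHz sampling rate

    tonal_bins = []
    for f in range(11, half - 11):             # leave room for the +/-10 neighbourhood
        if power_db[f] <= power_db[f - 1] or power_db[f] <= power_db[f + 1]:
            continue                            # not a local maximum
        neighbours = [power_db[f - i] for i in range(2, 11)] + \
                     [power_db[f + i] for i in range(2, 11)]
        if all(power_db[f] - p >= margin_db for p in neighbours):
            tonal_bins.append(f)                # treated as a genuine tonal component
    return tonal_bins

# toy check: a 1 kHz sine at 8 kHz sampling should land near bin 1000/7.8125 = 128
frame = np.sin(2 * np.pi * 1000 * np.arange(1024) / 8000.0)
print(detect_tonal_components(frame))
```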
对于上述音调连续性分析的步骤, 可以设 tonal_flag_original[k][f] (0 ≤ f < F/2) 表示初始音调检测结果, 取值为 1 表示第 k 帧数据在 f 处存在音调分量, 取值为 0 表示第 k 帧数据在 f 处不存在音调分量。相对于第 k 帧, 位于第 k 帧之前的 L1 帧数据被称为过去帧, 位于第 k 帧之后的 L1 帧数据被称为未来帧。设第 k 帧数据在 f_x 处存在音调分量, 即 tonal_flag_original[k][f_x] = 1。针对位于第 k 帧 f_x 处的音调分量, 音调连续性分析的步骤为:
步骤 1、统计该音调分量与过去多少帧的音调分量具有连续性, 表示为 num_left, 初始化变量 num_left 为 0; 不具有连续性的帧数用 num_non_tonal 标识, 初始化变量 num_non_tonal 为 0; 并记录待分析音调分量所处的位置: pos_cur = f_x。
检查 tonal_flag_original[k-1][f] ((pos_cur - 3) ≤ f ≤ (pos_cur + 3)) 的取值:
如果取值全为 0, 说明第 (k-1) 帧数据在 (pos_cur - 3) ≤ f ≤ (pos_cur + 3) 区间不存在音调分量, 即位于第 k 帧 f_x 处的音调分量与第 (k-1) 帧的音调分量之间出现间断, 记录下本次不连续性事件: num_non_tonal = num_non_tonal + 1;
如果 tonal_flag_original[k-1][pos_cur + x] = 1 (-3 ≤ x ≤ 3), 说明第 (k-1) 帧数据在 (pos_cur - 3) ≤ f ≤ (pos_cur + 3) 区间存在音调分量, 即位于第 k 帧 f_x 处的音调分量与第 (k-1) 帧的音调分量之间具有连续性:
记录第 (k-1) 帧音调分量所处的位置: pos_cur = pos_cur + x;
统计出现连续性的帧数: num_left = num_left + 1;
设置变量 num_non_tonal 为 0。
依次检测第 (k-1) 帧、第 (k-2) 帧等与前一帧的音调分量之间是否存在连续性。在每次检测之前, 首先需要判断 num_non_tonal 的大小:
如果 num_non_tonal ≥ a1, 说明待分析音调分量与过去帧音调分量之间的间断已经超过预设的范围, 已不再具有连续性, 不必继续检测下去, 输出 num_left;
如果 num_non_tonal < a1, 说明待分析音调分量与过去帧音调分量之间的间断还在预设的范围内, 继续检测下去, 直到检测完过去 L1 帧数据, 输出 num_left。
步骤 2、统计该音调分量与未来多少帧的音调分量具有连续性, 表示为 num_right。类似于上述步骤 1, 依次检测第 k 帧、第 (k+1) 帧等与后一帧的音调分量之间是否存在连续性, 输出 num_right。
步骤 3、根据 num_left 及 num_right, 对初始音调检测结果进行筛选, 如果满足以下两个条件之一:
(num_left + num_right) ≥ a2;
num_right ≥ a3;
说明位于第 k 帧 f_x 处的音调分量具有一定的连续性, 保留初始音调检测结果, 否则不保留。在本实施例中, 可以设 a1 = 5; a2 = 10; a3 = 8。
以图 3a 和图 3b 给出的"法语男声+笙"的音频信号为例, 给出音调连续性分析的实例, 如图 6a 和图 6b 所示, 图 6a 为输入信号"法语男声+笙"的波形图二; 图 6b 为图 6a 所示输入信号的初始音调检测结果。其中, 横轴为帧, 与图 6a 横轴的样本点相对应; 纵轴取值为 0~511, 每点对应的频域分辨率为 4000Hz/512 = 7.8125Hz。如果某帧数据在纵轴某点对应的频率范围内存在音调分量, 将其标识为白色, 否则为黑色。如果连续若干帧信号在某个频率范围内存在音调分量, 会形成"白线"。该"白线"与图 3b 语谱图中的"亮带"是相对应的; 图 6c 为图 6a 所示输入信号筛选后的音调检测结果。与图 6b 的初始音调检测结果相比, 在前半段的语音信号中, 仅保留了基频及其附近的音调持续时间稍长的少量音调分量, 其余的音调分量均已去掉; 在后半段的音乐信号中, 绝大部分的音调分量均被保留下来。
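Steps 1 to 3 of the continuity analysis can be sketched as follows, assuming tonal_flag is a 2-D 0/1 NumPy array indexed as [frame][bin] that holds the initial detection results. The ±3-bin search window and the example thresholds a1 = 5, a2 = 10, a3 = 8 follow the text; the helper names, the choice of the first matching bin when several fall in the window, and the use of the same gap limit in both directions are illustrative assumptions.

```python
import numpy as np

def count_continuity(tonal_flag, k, fx, span, step, max_gap=5, window=3):
    """Count how many of the `span` frames towards the past (step=-1) or the
    future (step=+1) continue the tonal component found at frame k, bin fx.
    Counting stops once `max_gap` successive frames show no component within
    +/- `window` bins of the tracked position."""
    count, gap, pos = 0, 0, fx
    for n in range(1, span + 1):
        frame = k + step * n
        if frame < 0 or frame >= tonal_flag.shape[0]:
            break
        lo, hi = max(0, pos - window), min(tonal_flag.shape[1] - 1, pos + window)
        hits = np.flatnonzero(tonal_flag[frame, lo:hi + 1])
        if hits.size == 0:
            gap += 1                      # a discontinuity event
            if gap >= max_gap:
                break
        else:
            pos = lo + hits[0]            # follow the (first) matching component
            count += 1
            gap = 0
    return count

def keep_tonal_component(tonal_flag, k, fx, l1=15, a1=5, a2=10, a3=8):
    """Screening of step 3: keep the initial detection at (k, fx) only if it
    shows enough continuity over the past and future L1 frames."""
    num_left = count_continuity(tonal_flag, k, fx, span=l1, step=-1, max_gap=a1)
    num_right = count_continuity(tonal_flag, k, fx, span=l1, step=+1, max_gap=a1)
    return (num_left + num_right) >= a2 or num_right >= a3
```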
最后进行音调特征提取, 其中针对筛选后的音调检测结果, 统计较低频率至高频范围(对应于 a4 ≤ f < F/2)的每帧音调分量的数量, 表示为 num_tonal_flag。如果 num_tonal_flag 越大, 说明对应信号中音调分量持续时间越长, 该信号是音乐信号的可能性越大。
如上述图 6c 所示, 语音信号在基频及其附近频率范围内可能会存在少许音调持续时间稍长的音调分量。因此, 统计每帧音调分量的数量的范围不是从 f = 0 开始的, 而是从 f = a4 开始的, 这样可以避免将某些基频音调分量持续时间较长的语音信号误判为音乐信号。即上述统计的满足连续性约束条件的音调分量的数量为在频域上大于第七阈值的音调分量的数量。在本实施例中, 可以设 a4 = 40。
仍以图 3a和图 3b给出的 "法语男声 +笙" 的音频信号为例说明, 如 图 7a和图 7b所示, 图 7a为输入信号 "法语男声 +笙"的波形图三; 图 7b 为图 7a对应的音调特征" 的曲线图。 其中, 横轴为帧, 与图
7a横轴的样本点相对应; 纵轴为音调分量的数量。 由图 7a和图 7b可见, 在前半段的语音信号中, nwn j mal - flag始终为 0, 与后半段笙的音调特征 具有明显区别。
本发明上述实施例中的能量特征提取方式如下。在提取能量特征之前, 首先需要计算各帧的高频能量分布比值 ratio_energy_hf(k) 及声压级 spl(k), 其中 k 表示帧数。
ratio_energy_hf(k) = ( Σ_{f=a5}^{F/2-1} ( Re_k(f)^2 + Im_k(f)^2 ) ) / ( Σ_{f=0}^{F/2-1} ( Re_k(f)^2 + Im_k(f)^2 ) )
其中, Re_k(f) 表示第 k 帧的 FFT 变换的实部, Im_k(f) 表示第 k 帧的 FFT 变换的虚部。分母表示第 k 帧的总能量; 分子表示第 k 帧在 f = a5 ~ (F/2 - 1) 所对应的较高频率范围内的能量总和。如果 ratio_energy_hf(k) 较小, 说明第 k 帧能量主要分布在低频; 反之, 说明第 k 帧能量主要分布在较高频率范围内。
声压级 spl(k) 根据第 k 帧的功率密度谱 P_k(f) 计算得到。如果 spl(k) 较小, 说明第 k 帧总能量较小; 如果 spl(k) 较大, 则说明第 k 帧总能量较大。
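A small Python helper for the two per-frame energy quantities just defined. The boundary bin a5 and the exact sound pressure level expression are assumptions for illustration only (the text states merely that spl(k) is derived from the frame's power density spectrum), so the constants and names should be treated as placeholders.

```python
import numpy as np

def energy_features(frame, fft_size=1024, a5=150):
    """Return (ratio_energy_hf, spl) for one frame.

    ratio_energy_hf: energy in bins a5 .. F/2-1 divided by the total energy,
    computed from the squared real and imaginary FFT parts.
    spl: sketched here as 10*log10 of the total spectral power; the reference
    level used in the patent is not reproduced.
    """
    spectrum = np.fft.rfft(frame, n=fft_size)
    power = spectrum.real ** 2 + spectrum.imag ** 2
    half = fft_size // 2
    total = power[:half].sum() + 1e-12       # avoid division by zero / log(0)
    high = power[a5:half].sum()
    return high / total, 10.0 * np.log10(total)
```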
基于高频能量分布比值及声压级, 进一歩分析能量在高频的分布特性 及能量在低频的分布特性。
在获取能量在高频的分布特性时, 仍以图 4给出的 "京胡 +法语男声" 的音频信号为例, 其中图 8a为输入信号 "京胡 +法语男声" 的波形图, 图
8b为与图 8a对应的高频能量分布比值^^- -^^)的曲线图, 其中, 横轴为帧, 与图 8a横轴的样本点相对应; 纵轴为高频能量分布比值。 通 过图 8b可知高频能量分布比值曲线的变化情况:
在前半段的音乐信号中, 除了演奏间隙的短暂停顿处, 高频能量分布 比值基本上大于 0. 8, 说明该段京胡信号的能量能够持续分布在较高频率 范围内;
在后半段的语音信号中, 少量的浊音以及部分清音的高频能量分布比 值较大, 大部分浊音以及部分清音的高频能量分布比值都是比较小的, 导 致高频能量分布比值曲线的波动较大, 说明语音信号的能量是无法持续分 布在较高频率范围内的。
针对第 k 帧, 为了表示能量在高频的分布特性, 基于高频能量分布比值 ratio_energy_hf(k) 及声压级 spl(k), 提取以下特征:
num_big_ratio_energy_left: 表示位于第 k 帧之前的 L1 帧数据中, 能量能够持续分布在高频的过去帧的帧数;
num_big_ratio_energy_right: 表示位于第 k 帧之后的 L1 帧数据中, 能量能够持续分布在高频的未来帧的帧数。
在提取上述特征之前, 首先检查高频能量分布比值 ratio_energy_hf(k) 及声压级 spl(k) 是否满足以下条件: (ratio_energy_hf(k) > a6) && (spl(k) > a7)。如果满足该条件, 进一步分析第 k 帧能量是否能够持续分布在较高频率范围内。
获取 num_big_ratio_energy_left 的步骤为:
步骤 1、初始化 num_big_ratio_energy_left 为 0;
步骤 2、初始化变量 num_non_big_ratio 为 0;
步骤 3、检查 ratio_energy_hf(k-1) 及 spl(k-1) 是否满足以下条件: (ratio_energy_hf(k-1) > a6) && (spl(k-1) > a7)。
如果不满足上述条件, 说明第 (k-1) 帧数据的能量没有分布在较高频率范围内, 记录下本次事件: num_non_big_ratio = num_non_big_ratio + 1;
如果满足上述条件, 说明第 (k-1) 帧数据的能量持续分布在较高频率范围内, 统计能量能够持续分布在高频的过去帧的帧数: num_big_ratio_energy_left = num_big_ratio_energy_left + 1, 并设置变量 num_non_big_ratio 为 0。
类似于步骤 3, 依次检测第 (k-2) 帧、第 (k-3) 帧等数据的能量能否持续分布在较高频率范围内。在每次检测之前, 首先需要判断 num_non_big_ratio 的大小: 如果 num_non_big_ratio ≥ a8, 说明能量无法持续分布在较高频率范围内的状态已经超过预设的范围, 不必继续检测下去, 输出 num_big_ratio_energy_left; 如果 num_non_big_ratio < a8, 说明能量无法持续分布在较高频率范围内的状态还在预设的范围内, 继续检测下去, 直到检测完过去 L1 帧数据, 输出 num_big_ratio_energy_left。
获取 num_big_ratio_energy_right 的步骤是类似的: 依次检测第 (k+1) 帧、第 (k+2) 帧等数据的能量能否持续分布在较高频率范围内, 输出 num_big_ratio_energy_right。
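The high-frequency persistence counters can be sketched as below, assuming ratio_hf and spl are per-frame sequences computed as in the previous sketch. The example thresholds a6 = 0.4, a7 = 30 and a8 = 5 come from the text; the function name and the symmetric treatment of past and future frames are illustrative.

```python
def count_high_freq_persistence(ratio_hf, spl, k, span=15, step=-1,
                                a6=0.4, a7=30.0, a8=5):
    """Count past (step=-1) or future (step=+1) frames around frame k whose
    energy keeps sitting in the high-frequency range, i.e. ratio_hf > a6 and
    spl > a7.  Counting stops after a8 consecutive frames fail the test.
    In the text the counters are only evaluated when frame k itself already
    satisfies the same condition."""
    count, misses = 0, 0
    for n in range(1, span + 1):
        i = k + step * n
        if i < 0 or i >= len(ratio_hf):
            break
        if ratio_hf[i] > a6 and spl[i] > a7:
            count += 1
            misses = 0
        else:
            misses += 1
            if misses >= a8:
                break
    return count

# num_big_ratio_energy_left / _right for frame k over L1 = 15 frames:
# left  = count_high_freq_persistence(ratio_hf, spl, k, 15, -1)
# right = count_high_freq_persistence(ratio_hf, spl, k, 15, +1)
```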
对于低频能量的分布特性获取, 以图 5a给出的 "韩语男声 +合奏" 的 输入信号为例, 观察能量在低频的分布特性, 如图 9a和图 9b所示, 图 9a 为输入信号 "韩语男声 +合奏" 的波形图, 图%为与图 9a对应的高频能 量分布比值 ^- ^-^^的曲线图。 其中, 横轴为帧; 纵轴为高频能量 分布比值。 通过观察图%所示的在高频能量分布比值曲线的变化情况, 可知, 在前半段的语音信号中, 高频能量分布比值曲线的波动较大, 说明 语音信号的能量是无法持续分布在低频的; 在后半段的音乐信号中, 高频 能量分布比值基本上小于 0.1, 说明该段合奏信号的能量能够持续分布在 低频。
针对第 k 帧, 为了表示能量在低频的分布特性, 基于高频能量分布比值 ratio_energy_hf(k) 及声压级 spl(k), 提取以下特征:
num_small_ratio_energy_left: 表示能量能够持续分布在低频的过去帧的帧数;
num_small_ratio_energy_right: 表示位于第 k 帧之后的 L1 帧数据中, 能量能够持续分布在低频的未来帧的帧数。
与 num_big_ratio_energy_left 等参数的获取过程不同, num_small_ratio_energy_left 并不是仅仅针对过去 L1 帧数据分析得出的, 而是每计算出一帧 ratio_energy_hf(i) (i ≥ 0), 就会更新一次 num_small_ratio_energy_left。
在提取上述特征之前, 首先检查高频能量分布比值 ratio_energy_hf(k) 是否满足条件: ratio_energy_hf(k) < a9。如果满足该条件, 进一步分析第 k 帧能量是否能够持续分布在低频范围内。
其中, 获取 num_small_ratio_energy_right 的步骤为:
步骤 1、初始化 num_small_ratio_energy_right 为 0;
步骤 2、依次检测第 (k+1) 帧、第 (k+2) 帧等的高频能量分布比值 ratio_energy_hf(i) (k < i ≤ (k + L1)) 是否满足条件: ratio_energy_hf(i) < a9。如果不满足上述条件, 不必继续检测下去, 输出 num_small_ratio_energy_right; 如果满足上述条件, num_small_ratio_energy_right = num_small_ratio_energy_right + 1, 继续检测下去, 直到检测完未来 L1 帧数据, 输出 num_small_ratio_energy_right。
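For the low-frequency side, a comparable sketch follows. Per the text above, the past-looking counter is a running value updated for every new frame rather than recomputed over a fixed window, and a9 defaults to the example value 0.1; the function names and the reset-to-zero behaviour on a failing frame follow the description of the real-time variant later in the document and are otherwise illustrative.

```python
def update_small_ratio_left(prev_count, ratio_hf_k, a9=0.1):
    """Running num_small_ratio_energy_left: grows by one while successive
    frames keep ratio_energy_hf below a9 and resets to zero otherwise."""
    return prev_count + 1 if ratio_hf_k < a9 else 0

def count_small_ratio_right(ratio_hf, k, l1=15, a9=0.1):
    """num_small_ratio_energy_right: future frames after k whose high-frequency
    energy ratio stays below a9; stop at the first frame that does not."""
    count = 0
    for i in range(k + 1, min(k + l1, len(ratio_hf) - 1) + 1):
        if ratio_hf[i] < a9:
            count += 1
        else:
            break
    return count
```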
在本实施例中, 可以设置 a5 = 150; a6 = 0.4; a7 = 30; a8 = 5; a9 = 0.1。
如上述分类原理分析所述, 绝大多数音乐信号具有不同于语音信号的特性; 相比之下, 语音信号缺乏独有的特性, 很难 100%确定某段信号就是语音信号。因此, 在分类时将明显不同于语音信号的音乐信号识别出来, 其余则判为语音信号。
具体的, 分类规则可以如图 10 所示, 对于第 k 帧数据, 其可以包括如下的步骤:
步骤 301、判断音调分量的数量是否大于 0, 即 num_tonal_flag > 0。如果满足条件, 则可以输出初始分类结果为音乐信号; 否则继续分析能量特征;
步骤 302、分析能量在较高频率范围内的分布特性, 首先判断是否满足 (ratio_energy_hf(k) > a6) && (spl(k) > a7)。若是, 执行步骤 303, 否则执行步骤 304;
步骤 303、判断是否满足 num_big_ratio_energy_right ≥ a11, 或者满足 num_big_ratio_energy_left + num_big_ratio_energy_right ≥ a10, 或者 num_big_ratio_energy_left ≥ a11, 如果满足, 则输出初始分类结果为音乐信号, 否则, 执行步骤 304;
步骤 304、判断高频能量分布比值是否小于 a9, 即 ratio_energy_hf(k) ≤ a9, 如果是, 则执行步骤 305, 否则输出初始分类结果为语音信号;
步骤 305、判断是否满足 num_small_ratio_energy_left ≥ a13, 或者满足 num_small_ratio_energy_left + num_small_ratio_energy_right ≥ a12, 或者 num_small_ratio_energy_right ≥ a11, 如果满足, 则输出初始分类结果为音乐信号, 否则输出初始分类结果为语音信号。
在本实施例中, 可以设置 a10 = 15; a11 = 10; a12 = 30; a13 = 30。
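Putting the features together, the decision flow of steps 301-305 can be sketched as follows. The feature values are assumed to have been computed as in the earlier sketches; the default thresholds reproduce the example settings a6 = 0.4, a7 = 30, a9 = 0.1, a10 = 15, a11 = 10, a12 = 30, a13 = 30, and the exact pairing of thresholds with conditions is an assumed reading, so treat the sketch as illustrative rather than normative.

```python
def classify_frame(num_tonal, ratio_hf_k, spl_k,
                   big_left, big_right, small_left, small_right,
                   a6=0.4, a7=30.0, a9=0.1,
                   a10=15, a11=10, a12=30, a13=30):
    """Initial music/speech decision for one frame (steps 301-305)."""
    # Step 301: any sufficiently continuous tonal component -> music.
    if num_tonal > 0:
        return "music"
    # Steps 302-303: energy persistently located in the high-frequency range.
    if ratio_hf_k > a6 and spl_k > a7:
        if big_right >= a11 or big_left + big_right >= a10 or big_left >= a11:
            return "music"
    # Steps 304-305: energy persistently located in the low-frequency range.
    if ratio_hf_k <= a9:
        if small_left >= a13 or small_left + small_right >= a12 or small_right >= a11:
            return "music"
    return "speech"
```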
参见图 11a和图 lib所示的, 图 11a为输入信号 "中文女声 +合奏 +英 语男声 +塡 +德语男声 +响板" 的波形图, 其中的三种音乐信号: 合奏、 塡 及响板, 在音调特征或是能量特征方面, 均具有一定的典型性; 图 lib为 图 11a对应的分类结果示意图一, 其中, 横轴为样本点; 纵轴为分类结果, 取值为 0对应语音信号, 取值不为 0对应音乐信号。 由下至上, 纵轴给出 四类分类结果:
MUSIC_音调特征: 仅使用音调特征得到的分类结果, 表示为实线。由此可以看出, 图 11a 中的哪些信号是适用于有关音调特征的分类规则的;
MUSIC_能量特征_1: 仅使用"能量特征_1"得到的分类结果, 表示为虚线。这里的"能量特征_1"指的是能量是否能够持续分布在较高频率范围内。由此可以看出, 图 11a 中的哪些信号是适用于有关能量高频分布特性的分类规则的;
MUSIC_能量特征_2: 仅使用"能量特征_2"得到的分类结果, 表示为点划线。这里的"能量特征_2"指的是能量是否能够持续分布在低频。由此可以看出, 图 11a 中的哪些信号是适用于有关能量低频分布特性的分类规则的;
MUSIC_初始分类结果: 将 MUSIC_音调特征、MUSIC_能量特征_1 及 MUSIC_能量特征_2 的分类结果综合起来, 就可以得到初始分类结果, 表示为点线。
通过观察图 lib, 可以看出, 针对不同类型的音乐信号, 不同的分类 规则是如何发挥作用的:
位于 100000-300000点之间的合奏信号: 该段音乐信号在能量上的波 动是很大的, 仅有少数帧的能量能够持续分布在较高频率范围内, 能量特 征_1/2基本不起作用。 但是, 该段信号的音调具有较好的持续性, 可以利 用音调特征检测出来; 位于 400000-550000点之间的塡信号:音调特征能够起到一定的作用, 但是仅依靠音调特征是无法把完整的塡信号检测出来的, 如图断续分布的 实线所示。该段信号的能量主要分布在低频, 可以利用能量特征 _2检测出 来;
位于 600000点之后的响板信号: 该段信号几乎检测不出音调分量, 音调特征不起作用。 该段信号的能量主要分布在高频, 可以利用能量特征 _1检测出来。
本发明实施例提供的技术方案, 还可以适应于输出延时较大的应用场 景, 例如当输出延时为 L2+L3时, 设当前帧为第 i帧, 则可以首先按照上 述实施例提供的技术方案, 当 i〉L2时, 根据过去的信息, 第 i_L2帧之前 的若干帧的音调分布参数和能量分布参数, 现在的信息, 即第 i_L2帧的 音调分布参数和能量分布参数, 以及未来的信息, 即第 i_L2帧之后的 L2 帧的音调分布参数和能量分布参数, 获取第 i_L2帧的音频信号分类结果, 其具体的实现方式可以参见上述的实施例, 进一歩当 i〉(L2+L3)时, 可以 进行平滑处理, 即根据待分类帧第 i_L2-L3帧前 N4帧和待分类帧第
1-L2-L3帧后 L3帧的初始分类结果进行修正。
具体的, 上述的前 N4帧可以为前 L3帧, 针对第 k帧, 此时上述修正 处理的过程为:
首先, 对位于第 k 帧之前的 L3 帧及位于第 k 帧之后的 L3 帧的初始分类结果进行统计, 获取被分类为音乐信号的帧数 num_music, 以及被分类为语音信号的帧数 num_non_music。
其次, 如果第 k 帧的初始分类结果为语音信号, 并且 num_music ≥ a14, 将第 k 帧的分类结果修正为音乐信号; 如果第 k 帧的初始分类结果为音乐信号, 并且 num_non_music ≥ a14, 将第 k 帧的分类结果修正为语音信号。
在本实施例中, 可以设置 a14 = 16。
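A sketch of this correction step: the initial label of frame k is overturned when enough of the surrounding initial labels disagree with it (a14 = 16 in the example above). labels is assumed to be a list of per-frame initial results, the window length L3 is left as a parameter, and the string labels are illustrative.

```python
def smooth_label(labels, k, l3=15, a14=16):
    """Correct the initial label of frame k using the L3 frames before and
    the L3 frames after it (initial results only)."""
    lo, hi = max(0, k - l3), min(len(labels) - 1, k + l3)
    neighbours = labels[lo:k] + labels[k + 1:hi + 1]
    num_music = sum(1 for c in neighbours if c == "music")
    num_speech = sum(1 for c in neighbours if c == "speech")
    if labels[k] == "speech" and num_music >= a14:
        return "music"
    if labels[k] == "music" and num_speech >= a14:
        return "speech"
    return labels[k]
```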
图 12a为输入信号 "中文女声 +合奏 +英语男声 +塡 +德语男声 +响板" 的波形示意图, 同图 11a所示, 图 12进一歩给出平滑后的结果, 如图 12 所示, 由下至上, 纵轴给出两类分类结果:
MUSIC_初始分类结果: 表示为实线;
MUSIC_平滑后结果: 对初始分类结果进行平滑, 得到平滑后结果, 表示为虚线。
观察图 12可知, 位于 100000-300000点之间的合奏信号: 初始分类 结果在 250000-300000点之间存在一处误判,将音乐信号误判为语音信号; 位于 400000-550000点之间的塡信号, 初始分类结果在该信号结尾部分存 在一处误判, 将音乐信号误判为语音信号。 通过平滑处理, 对上述误判进 行了修正。
另外,对于不能够引入输出延时的应用场景,其中获取音调分布参数, 获取能量分布参数的原理和歩骤与上述技术方案类似, 不同之前仅在于, 在进行分类时参考的是过去的信息和现在的信息, 由于无输出延时, 需要 实时获取分类结果, 无法参考未来的信息。
具体的, 提取音调特征可以参照上述实施例, 可以分为三个歩骤:
A、 获取初始音调检测结果, 即各帧的音调分布参数;
B、 通过连续性分析, 对初始音调检测结果进行筛选;
C、 基于筛选后的音调检测结果, 提取音调特征, 即待分类帧的音调 分量的数量。
其中上述歩骤 A, 可以参照上述实施例, 以下主要对歩骤 B和歩骤 C 进行详细说明。
在进行连续性分析时, 设 tonal-flag -Original[k][f](0≤f < 表示初始音 调检测结果, 取值为 1表示第 k帧数据在 f 处存在音调分量, 取值为 0表 示第 k帧数据在 f 处不存在音调分量。 相对于第 k帧, 位于第 k帧之前的 L1帧数据被称为过去帧。
设第 k帧数据在 fx处存在音调分量, 即
Figure imgf000039_0001
i。 针 对位于第 k帧 fx处的音调分量, 音调连续性分析的歩骤为:
歩骤 1: 统计该音调分量与过去多少帧的音调分量具有连续性, 表示 为腿 m— Ιφ , 初始化变量" "^- 为 0, 初始化表示不连续的变量
n飄―画 丽 1为 Q, 并记录待分析音调分量所处的位置: poS_CUr = fx., 检杳 tonal flag
Figure imgf000039_0002
l][ ] ((pos _cur-3)≤ f ≤ (pos_cur + 3))的取值. 如果取值全为 o, 说明第(k-i)帧数据在 ^- ^-3^/ ^^"^3)区 间不存在音调分量, 即位于第 k帧 处的音调分量与第(k-l)帧的音调分 量之间出现间断, 记录下本次不连续性事件: ― ηοη _ tonal― num _ non _ tonal + 1.
如果 tonal - Aag - oriSinal[k - l][pos _cur + x]^l{-3≤ x≤3) ^ 说明第 (k- 1 )巾贞数 据在
Figure imgf000040_0001
即位于第 k帧 处的 音调分量与第(k-1)帧的音调分量之间具有连续性:
记录第(k-1)帧音调分量所处的位置: pos—c n c + x
统计出现连续性的巾贞数: n画-1 Φ =腿 m— left + 1
设置变量 num _ non _ tonal为。。
类似于歩骤 2, 依次检测第(k-1)帧、 第(k-2)帧等与前一帧的音调分 量之间是否存在连续性。 在每次检测之前, 首先需要判断 "" "^-^^的 大小:
如果" -m^-to^ W, 说明待分析音调分量与过去帧音调分量之间 的间断已经超过预设的范围, 已不再具有连续性。 不必继续检测下去, 输 出 num left ·
如果" -rn^ ^ W, 说明待分析音调分量与过去帧音调分量之间 的间断还在预设的范围内, 继续检测下去。 直到检测完过去 L1帧数据, 输出 numιΦ
歩骤 2: 根据" -^ 对初始音调检测结果进行筛选;
如果满足条件: 醒— left≥bl, 说明位于第 k帧 fx处的音调分量具有 一定的连续性, 保留初始音调检测结果, 否则不保留。
在本实施例中, 可以设置 W = 5 = 5
进一歩的, 类似上述实施例, 针对筛选后的音调检测结果, 统计较低 频率至高频范围(对应于½≤,< /2)的待分类帧的帧音调分量的数量, 表 示为醒 tonal jag。 如果 MMm_toM _/¾g越大, 说明对应信号中音调分量 持续时间越长, 该信号是音乐信号的可能性越大。 在本实施例中, 设置 ½ = 40
对于能量特征提取, 在提取能量特征之前, 首先需要计算每帧高频能 量分布比值 ^-^^^-^^及声压级^ 其中 k表示帧数。 计算每帧 高频能量分布比值 及声压级 的公式与上述是相同 的。
基于高频能量分布比值及声压级, 进一歩分析能量在高频及低频的分 布特性
Figure imgf000041_0001
量分布比值 ratio -energy _hf k)及 ^级 ^), 提取特征
m_big_mtiQ rgy— Ιφ 该特征是指, 位于第 k帧之前的 L1帧数据中, 能量能够持续分布在高频的过去帧的帧数。
在提取该特征之前,首先检査高频能量分布比值 ^- -^^及声 压级 是否满足以下条件: io— energy - hf b4、 & & (Μί > b5、 如果满 足该条件, 进一歩分析第 k帧能量是否能够持续分布在较高频率范围内。
获取聽 m _ big _ ratio _ energy _ left的歩骤为: 歩骤 1、
Figure imgf000041_0002
num - big - ratio - enersy - ι 0;
歩骤 2初始化变量" "m_M。"_b^_rari。为 0;
歩骤 3、 检査 raz '。- j/^-1)及 ^ -1)是否满足以下条件: {ratio energy _hf(k— l)> 如果不满足上述条件, 说明第(k-1)帧数据的能量没有分布在较高频 率范围内, i己录下本次事件- m non big ratio - num non big ratio + 1 如果满足上述条件, 说明第(k-i)帧数据的能量持续分布在较高频率 范围内:
统计能量能够持续分布在高频的过去帧的帧数:
num big ratio energy left― num big ratio energy left + 1
设置变量 num - non - - rati°为 0
类似于歩骤 3, 依次检测第(k-2)帧、 第(k-1)帧等数据的能量能否持 续分布在较高频率范围内。 在每次检测之前, 首先需要判断
num non big ratio的大小 ·
如果 " _" _^_ra^≥ ,说明能量无法持续分布在较高频率范围内 的状态已经超过预设的范围, 不必继续检测下去, 输出
num big ratio energy left .
如果" _" _^_ra^<^,说明能量无法持续分布在较高频率范围内 的状态还在预设的范围内, 继续检测下去, 直到检测完过去 L1帧数据, 输出 num big ratio energy left。 另外, 针对第 k帧, 为了表示能量在低频的分布特性, 基于高频能量 分布比值 ' -^ 及声压级 ^), 提取特征
醒―醒 II— ratio— energy— left。该特征是指能量能够持续分布在低频的过去帧 的帧数。
与聽 m _ big _ ratio _ 参数的获取过程不同,
" -^^-™^_ e/^_fe/t并不是仅仅针对过去 L1帧数据分析得出的, 而 是每计算出一帧 ratio -energy _hf{i){i≥0)f 就会更新一次
num small ratio energy left
获取 num smaU ratio energy left的歩骤为.
当 二 0时, 初始化腿 m small ratio energy left为 Q .
检查每一巾贞 - -^')^0)是否满足条件: ratio— energy— hf i、<b,; 如果满足上述条件,
num small ratio energy left― num small ratio energy left + 1.
如果不满足上述条件, num small ratio energy left - 0 ·
在本实施例中, 设置 Μ = 0·3; ½ = 30. 6 = 5 ; W = (U。
具体的, 分类规则可以如图 13所示, 对于第 k帧数据, 其可以包括 如下的歩骤:
歩骤 401、 判断音调分量的数量是否大于 0, g卩"目 -to?MZ-i¾g>0。 如 果满足条件, 则可以输出初始分类结果为音乐信号; 否则继续分析能量特 征;
歩骤 402、 分析能量在较高频率范围内的分布特性, 首先判断
Ό - / ^- )〉M)&& )〉 b5)。 若是, 执行歩骤 403, 否则执行歩骤
404;
歩骤 403、 判断是否满足 "画 -b^-rari0_i /^-fe/t≥b8, 如果满足, 则 输出初始分类结果为音乐信号, 否则, 执行歩骤 404;
歩骤 404、 判断高频能量分布比值是否小于 b7, 即
ratio _energy _hf{k)≤bl ^ 如果是, 则执行歩骤 405, 否则输出初始分类结果 为语音信号;
歩骤 405、 判断是否满足 "画
Figure imgf000042_0001
je/≥ 9, 如果满足, 则输出初始分类结果为音乐信号, 否则输出初始分类结果为语音信号。 在 本实施例中, 可以设置 ^ = 10, ^ = 30。 图 14a为输入信号 "中文女声 +合奏 +英语男声 +塡 +德语男声 +响板" 的波形图三, 同图 11a所示, 其中的三种音乐信号: 合奏、 埙及响板, 在 音调特征或是能量特征方面, 均具有一定的典型性, 图 b进一歩给出实 时分类结果的实例, 其中, 横轴为样本点; 纵轴为分类结果, 取值为 0对 应语音信号, 取值不为 0对应音乐信号, 由图 14a和图 14b可见, 由于没 有未来的信息可供参考, 会将少许音乐信号误判为语音信号。
本发明上述实施例提供的技术方案, 针对无输出延时、 少量输出延时 和大量输出延时三种情况进行了说明, 使得在对输出延时要求不固定的场 景中, 例如语音质量评估应用中, 可以根据实际需要提供上述三种情况下 的分类结果, 且随着输出延时时间的增长, 不仅可以参照待分类帧过去的 信息, 而且可以参照待分类帧未来的信息, 参考信息越多分类的正确率也 会随之提高。 具体的, 图 15为本发明实施例中输出延时不固定的情况下 语音分类方法流程图, 如图 15所示, 包括如下的歩骤:
歩骤 501、 对当前帧第 i帧进行 FFT变换;
歩骤 502、 基于 FFT变换结果, 获取第 i帧的音调分布参数并缓存; 歩骤 503、 基于 FFT变换结果, 获取第 i帧的能量分布参数并缓存; 上述的歩骤 501-503中, 不仅针对第 i帧, 而且针对第 i帧之前接收 到的各个帧的, 都进行了相应处理, 获取了其音调分布参数和能量分布参 数。
歩骤 504、 生成并缓存第 i帧的实时分类结果, 具体的, 本歩骤中基 于歩骤 502和歩骤 503中生成并缓存的过去的信息, 即第 i帧之前的各个 帧的音调分布参数和能量分布参数, 获取第 i帧的音调特征和能量特征, 生成并缓存实时分类结果, 具体实现方式可以参照上述的实施例;
歩骤 505、 当 1〉11时, 其中 L1为允许的少量输出延时, 除了获取接 收的各个帧的实时的分类结果, 还可以生成并缓存第 i-Ll帧的初始分类 结果, 具体的, 在生成第 i-Ll帧的初始分类结果时, 可以参考过去的信 息, 即第 i-Ll帧之前的若干帧的音调分布参数和能量分布参数, 现在的 信息, 即第 i-Ll帧的音调分布参数和能量分布参数, 未来的信息, 即第 i-Ll帧之后 L1帧帧音调分布参数和能量分布参数, 获取更为准确的第 i-Ll帧的初始分类结果, 具体实现方式可以参见上述实施例。 歩骤 506, 当 i〉(L2+L3)时, 生成并缓存第(i_L2-L3)帧修正后的分类 结果, 具体的, 即可以参照过去的信息, 即位于第(i_L2-L3)帧之前若干 帧的初始分类结果, 未来的信息, 即位于第(i_L2-L3)帧之后的 L3帧的初 始分类结果, 对第(i_L2-L3)帧的初始分类结果进行修正, 具体的实现方 式可以参见上述的实施例。
步骤 507、根据允许的输出延时的不同, 选择上述步骤 504、步骤 505 和步骤 506 的分类结果, 作为待分类帧第 j 帧的分类结果:
如果输出延时满足条件: (i - j) >= (L2 + L3), 输出最优结果, 即第 j 帧修正后的分类结果;
如果输出延时满足条件: (L2 + L3) > (i - j) >= L1, 输出次优结果, 即第 j 帧的初始分类结果;
如果输出延时满足条件: (i - j) < L1, 输出零延时结果, 即第 j 帧的实时分类结果。
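Step 507 can be sketched as follows, assuming realtime, initial and corrected are per-frame result buffers filled by steps 504 to 506, i is the index of the newest received frame and j the frame whose result is requested; L1 = L2 = 15 as in the text, while L3 and the buffer representation are assumptions made for the example.

```python
def select_result(realtime, initial, corrected, i, j, l1=15, l2=15, l3=15):
    """Step 507: pick the best classification available for frame j given
    that frames up to i have been received (corrected > initial > real-time)."""
    delay = i - j
    if delay >= l2 + l3 and corrected[j] is not None:
        return corrected[j]      # best result, needs L2+L3 frames of look-ahead
    if delay >= l1 and initial[j] is not None:
        return initial[j]        # next best, needs L1 frames of look-ahead
    return realtime[j]           # zero-delay result
```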
本发明上述实施例中可以将 L2的取值设为与 L1相等。
图 16a为输入信号 "中文女声 +合奏 +英语男声 +塡 +德语男声 +响板" 的波形图四, 同图 11a所示, 其中的三种音乐信号: 合奏、塡及响板, 在音调特征或是能量特征方面, 均具有一定的典型性。图 16b给出了三种分类方法得到的分类结果, 如图 16b所示, 其中纵轴上给出的三种分类结果, 依次是 MUSIC_实时分类结果, 用实线表示; MUSIC_初始分类结果, 用点线表示; MUSIC_修正后的分类结果, 用虚线表示。
如图 16b所示, 根据分类结果的正确率, 修正后的分类结果〉初始分 类结果〉实时分类结果。 因此, 在输出延时允许的情况下, 用户可以充分 利用尽可能多的未来信息, 输出当前条件下可以得到的最好的分类结果。
本发明实施例提供的技术方案, 其提取的特征能够反映出音乐信号不 同于语音信号的更为本质的特征, 使得在低采样率下的分类正确率明显提 高。 由于本发明实施例的技术方案提取特征的方法并不受限于采样率, 因 此其不仅适用于低采样率, 也适用于高采样率下的信号分类。 在确保较低 的算法复杂度的前提下, 用户可以根据需求灵活选择实时分类结果、 次优 分类结果或是最优分类结果。
本发明实施例还提供了一种与上述方法对应的音频信号分类处理装 置, 图 Π为本发明实施例中音频信号分类处理装置的结构示意图, 如图 17所示, 该装置包括第一获取模块 11和分类确定模块 12, 其中第一获取 模块 11用于获取音频信号中待分类帧中满足连续性约束条件的音调分量 的数量、 所述音频信号中待分类帧在低频区域的持续帧数和所述待分类帧 在高频区域的持续帧数中的至少一项; 分类确定模块 12用于根据所述待 分类帧中满足连续性约束条件的音调分量的数量、所述待分类帧在低频区 域的持续帧数和所述待分类帧的高频区域的持续帧数中的至少一项, 确定 所述音频信号中待分类帧为音乐信号, 或确定所述音频信号中待分类帧为 语音信号。
本发明上述实施例提供的技术方案, 主要是考虑到音乐信号的特性, 例如音乐信号的音调持续时间较长, 而语音信号的音调持续时间较短, 音 乐信号的能量可以持续分布在高频区域或低频区域, 而语音信号通常不能 持续分布在高频区域或低频区域, 在考虑音乐信号上述特点的基础上, 本 发明实施例提供的技术方案中, 首先获取音频信号中待分类帧中满足连续 性约束条件的音调分量的数量, 以及音频信号中待分类帧在低频区域的持 续帧数和 /或所述待分类帧在高频区域的持续帧数, 并根据上述信息确认 待分类帧的类型是音乐信号, 还是语音信号, 上述技术方案提供的音频信 号分类处理方法, 能够提高音频信号分类的正确率, 满足语音质量评估的 要求。
本发明上述实施例中, 其中根据有无输出延时和输出延时长度的不 同,其中的各个模块的执行的歩骤也会有所不同,具体包括如下几种情况: 一是在实时获取所述待分类帧的分类结果时, 所述第一获取模块具体 用于获取音频信号中待分类帧, 以及待分类帧前 N1帧的音调分布参数, 并根据所述待分类帧, 以及待分类帧前 N1帧的音调分布参数获取待分类 帧中满足连续性约束条件的音调分量的数量, N1为正整数; 或, 具体用于 获取所述音频信号中待分类帧, 以及待分类帧前 N1帧的能量分布参数, 并根据所述音频信号中待分类帧, 以及待分类帧前 N1帧的能量分布参数 获取所述待分类帧在低频区域的持续帧数或所述待分类帧在高频区域的 持续帧数;
所述分类确定模块 12具体用于在所述待分类帧中满足连续性约束条 件的音调分量的数量大于第一阈值、所述待分类帧在低频区域的持续帧数 大于第二阈值或所述待分类帧在高频区域的持续帧数大于第三阈值时, 确 定所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中待分类 帧为语音信号。
具体的, 上述的第一获取模块获取音频信号中待分类帧的音调分布参 数, 以及待分类帧前 N1帧的音调分布参数包括:
对接收到的音频信号中的待分类帧和待分类帧前 N1帧进行快速傅里 叶变换, 获取功率密度谱; 根据所述功率密度谱获取所述接收到的音频信 号中的待分类帧的音调分量的频域分布信息作为待分类帧的音调分布参 数, 以及待分类帧前 N1帧的音调分量的频域分布信息作为待分类帧前 N1 帧的音调分布参数。
上述分类确定模块根据待分类帧的音调分布参数,以及待分类帧前 N1 帧的音调分布参数获取待分类帧中满足连续性约束条件的音调分量的数 量包括:
根据接收到的音频信号中的待分类帧和待分类帧前 N1帧的音调分量 的频域分布信息获取待分类帧中持续帧数大于第六阈值的音调分量的数 另外, 上述的第一获取模块获取所音频信号中待分类帧的能量分布参 数, 以及待分类帧前 N1帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 以及待分类帧前 N1帧的高频能量分布比和声 压级作为待分类帧前 N1帧的能量分布参数。
上述分类确定模块根据音频信号中待分类帧的能量分布参数, 以及待 分类帧前 N1帧的能量分布参数获取所述待分类帧在低频区域的持续帧数 包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比小于第 八阈值的持续帧数。
上述分类确定模块根据音频信号中待分类帧的能量分布参数, 以及待 分类帧前 N1帧的能量分布参数获取所述待分类帧在高频区域的持续帧数 包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 Nl帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比大于第 九阈值、 声压级大于第十阈值的持续帧数。 二是在延时 L1帧获取所述待 分类帧的分类结果时, L1为正整数, 所述第一获取模块具体用于获取音频 信号中待分类帧, 待分类帧前 N2帧, 以及待分类帧后 L1帧的音调分布参 数, 并根据所述待分类帧, 待分类帧前 N2帧以及待分类帧后 L1帧的音调 分布参数获取待分类帧中满足连续性约束条件的音调分量的数量, N2为正 整数; 或, 具体用于获取所述音频信号中待分类帧, 以及待分类帧前 N2 帧以及待分类帧后 L1帧的能量分布参数, 并根据所述音频信号中待分类 帧, 待分类帧前 N2帧以及待分类帧后 L1帧的能量分布参数获取所述待分 类帧在低频区域的持续帧数或所述待分类帧在高频区域的持续帧数;
所述分类确定模块具体用于在所述待分类帧中满足连续性约束条件 的音调分量的数量大于第一阈值、所述待分类帧在低频区域的持续帧数大 于第二阈值或所述待分类帧在高频区域的持续帧数大于第三阈值时, 确定 所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中待分类帧 为语音信号。
其中, 上述第一获取模块获取音频信号中待分类帧的音调分布参数, 待分类帧前 N2帧的音调分布参数, 以及待分类帧后 L1帧的音调分布参数 包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧帧 后 L1帧进行快速傅里叶变换, 获取功率密度谱; 根据所述功率密度谱获 取所述接收到的音频信号中的待分类帧的音调分量的频域分布信息作为 待分类帧的音调分布参数, 待分类帧前 N2帧的音调分量的频域分布信息 作为待分类帧前 N2帧的音调分布参数, 以及待分类帧帧后 L1帧的音调分 量的频域分布信息作为待分类帧帧后 L1帧的音调分布参数。
上述分类确定模块根据待分类帧的音调分布参数, 待分类帧前 N2帧 的音调分布参数, 以及待分类帧后 L1帧的音调分布参数获取待分类帧中 满足连续性约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧 帧后 LI帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六 阈值的音调分量的数量。
另外, 上述第一获取模块获取所音频信号中待分类帧的能量分布参 数, 待分类帧前 N2帧的能量分布参数以及待分类帧后 L1帧的能量分布参 数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N2帧的高频能量分布比和声压级 作为待分类帧前 N2帧的能量分布参数和待分类帧帧后 L 1帧的高频能量分 布比和声压级作为待分类帧后 L 1帧的能量分布参数。
上述分类确定模块根据音频信号中待分类帧的能量分布参数, 待分类 帧前 N2帧的能量分布参数以及待分类帧后 L 1帧的能量分布参数获取所述 待分类帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数。
上述分类确定模块根据音频信号中待分类帧的能量分布参数, 待分类 帧前 N2帧的能量分布参数以及待分类帧后 L 1帧的能量分布参数获取所述 待分类帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
三是在延时 L2+L3帧获取所述待分类帧的分类结果时, L2和 L3为正 整数, 所述第一获取模块具体用于获取音频信号中待分类帧, 待分类帧前 N3帧, 以及待分类帧后 L2帧的音调分布参数, 并根据所述待分类帧, 待 分类帧前 N3帧以及待分类帧后 L2帧的音调分布参数获取待分类帧中满足 连续性约束条件的音调分量的数量, N3为正整数; 或, 具体用于获取所述 音频信号中待分类帧, 以及待分类帧前 N3帧以及待分类帧后 L2帧的能量 分布参数, 并根据所述音频信号中待分类帧, 待分类帧前 N3帧以及待分 类帧后 L2帧的能量分布参数获取所述待分类帧在低频区域的持续帧数或 所述待分类帧在高频区域的持续帧数; 所述分类处理模块具体用于在所述待分类帧中满足连续性约束条件 的音调分量的数量大于第一阈值、所述待分类帧在低频区域的持续帧数大 于第二阈值或所述待分类帧在高频区域的持续帧数大于第三阈值时, 确定 所述音频信号中待分类帧为音乐信号, 否则确定所述音频信号中待分类帧 为语音信号; 若确定所述音频信号中待分类帧为音乐信号, 则确定所述待 分类帧前 N4帧和待分类帧中后 L3帧中确定为语音信号的帧数目是否大于 第四阈值, 若超过, 则将所述音频信号中待分类帧修正为语音信号; 若确 定所述音频信号中待分类帧为语音信号, 则确定所述待分类帧前 N4帧和 待分类帧中后 L3帧中确定为音乐信号的帧数目是否大于第五阈值, 若大 于, 则将所述音频信号中待分类帧修正为音乐信号, N4为正整数。
其中, 上述的第一获取模块获取音频信号中待分类帧的音调分布参 数, 待分类帧前 N3帧的音调分布参数, 以及待分类帧后 L2帧的音调分布 参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧帧 后 L2帧进行快速傅里叶变换, 获取功率密度谱; 根据所述功率密度谱获 取所述接收到的音频信号中的待分类帧的音调分量的频域分布信息作为 待分类帧的音调分布参数, 待分类帧前 N3帧的音调分量的频域分布信息 作为待分类帧前 N3帧的音调分布参数, 以及待分类帧后 L2帧的音调分量 的频域分布信息作为待分类帧后 L2帧的音调分布参数。
上述分类确定模块根据待分类帧的音调分布参数, 待分类帧前 N3帧 的音调分布参数, 以及待分类帧后 L2帧的音调分布参数获取待分类帧中 满足连续性约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧 后 L2帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六阈 值的音调分量的数量。
另外, 上述第一获取模块获取所音频信号中待分类帧的能量分布参 数, 待分类帧前 N3帧的能量分布参数以及待分类帧后 L2帧的能量分布参 数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N3帧的高频能量分布比和声压级 作为待分类帧前 N3帧的能量分布参数, 以及待分类帧帧后 L2帧的高频能 量分布比和声压级作为待分类帧后 L2帧的能量分布参数。
上述分类确定模块根据音频信号中待分类帧的能量分布参数, 待分类 帧前 N3帧的能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述 待分类帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数;
上述分类确定模块根据音频信号中待分类帧的能量分布参数, 待分类 帧前 N3帧的能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述 待分类帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
上述三种情况下, 第一获取模块获取的待分类帧中持续帧数大于第六 阈值的音调分量的数量为在频域上大于第七阈值的音调分量的数量。
本发明实施例还提供了一种音频信号分类处理设备, 图 18为本发明 实施例中音频信号分类处理设备的结构示意图, 如图 18所示, 该设备包 括接收器 21和处理器 22, 其中的接收器 21用于接收音频信号; 处理器 22与所述接收器 21连接, 用于获取接收器接收到的音频信号中待分类帧 中满足连续性约束条件的音调分量的数量、所述音频信号中待分类帧在低 频区域的持续帧数和所述待分类帧在高频区域的持续帧数中的至少一项, 根据所述待分类帧中满足连续性约束条件的音调分量的数量、 所述待分类 帧在低频区域的持续帧数和所述待分类帧在高频区域的持续帧数中的至 少一项, 确定所述音频信号中待分类帧为音乐信号, 或确定所述音频信号 中待分类帧为语音信号。
本发明上述实施例提供的技术方案, 主要是考虑到音乐信号的特性, 例如音乐信号的音调持续时间较长, 而语音信号的音调持续时间较短, 音 乐信号的能量可以持续分布在高频区域或低频区域, 而语音信号通常不能 持续分布在高频区域或低频区域, 在考虑音乐信号上述特点的基础上, 本 发明实施例提供的技术方案中, 首先获取音频信号中待分类帧中满足连续 性约束条件的音调分量的数量, 以及音频信号中待分类帧在低频区域的持 续帧数和 /或所述待分类帧在高频区域的持续帧数, 并根据上述信息确认 待分类帧的类型是音乐信号, 还是语音信号, 上述技术方案提供的音频信 号分类处理方法, 能够提高音频信号分类的正确率, 满足语音质量评估的 要求。
本发明上述实施例中, 其中的处理器可以由软件流程实现, 也可以通 过使用数字信号处理 (Digital Signal Processing, 以下简称: DSP ) 芯 片等硬件实体设备实现。
本发明上述实施例中, 其中根据有实时获取所述待分类帧的分类结 果,或者是允许分类结果输出延时的长短,处理器可以包括如下几种情况: 一是在实时获取所述待分类帧的分类结果时, 所述处理器具体用于获 取音频信号中待分类帧, 以及待分类帧前 N1帧的音调分布参数, 并根据 所述待分类帧, 以及待分类帧前 N帧的音调分布参数获取待分类帧中满足 连续性约束条件的音调分量的数量, N1为正整数; 获取所述音频信号中待 分类帧, 以及待分类帧前 N1帧的能量分布参数, 并根据所述音频信号中 待分类帧, 以及待分类帧前 N1帧的能量分布参数获取所述待分类帧在低 频区域的持续帧数和 /或所述待分类帧在高频区域的持续帧数, N1为正整 数; 在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一阈 值、所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧在 高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为音 乐信号, 否则确定所述音频信号中待分类帧为语音信号。
其中, 处理器获取音频信号中待分类帧的音调分布参数, 以及待分类 帧前 N1帧的音调分布参数包括:
对接收到的音频信号中的待分类帧和待分类帧前 N1帧进行快速傅里 叶变换, 获取功率密度谱; 根据所述功率密度谱获取所述接收到的音频信 号中的待分类帧的音调分量的频域分布信息作为待分类帧的音调分布参 数, 以及待分类帧前 N1帧的音调分量的频域分布信息作为待分类帧前 N1 帧的音调分布参数。
处理器根据待分类帧的音调分布参数, 以及待分类帧前 N1帧的音调 分布参数获取待分类帧中满足连续性约束条件的音调分量的数量包括: 根据接收到的音频信号中的待分类帧和待分类帧前 N1帧的音调分量 的频域分布信息获取待分类帧中持续帧数大于第六阈值的音调分量的数 另外, 处理器获取所音频信号中待分类帧的能量分布参数, 以及待分 类帧前 N1帧的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 以及待分类帧前 N1帧的高频能量分布比和声 压级作为待分类帧前 N1帧的能量分布参数。
处理器根据音频信号中待分类帧的能量分布参数,以及待分类帧前 N1 帧的能量分布参数获取所述待分类帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比小于第 八阈值的持续帧数。
处理器根据音频信号中待分类帧的能量分布参数,以及待分类帧前 N1 帧的能量分布参数获取所述待分类帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比大于第 九阈值、 声压级大于第十阈值的持续帧数。
二是在延时 L1帧获取所述待分类帧的分类结果时, L1为正整数, 所 述处理器具体用于获取音频信号中待分类帧, 待分类帧前 N2帧, 以及待 分类帧后 L1帧的音调分布参数, 并根据所述待分类帧, 待分类帧前 N2帧 以及待分类帧后 L1帧的音调分布参数获取待分类帧中满足连续性约束条 件的音调分量的数量, N2为正整数; 获取所述音频信号中待分类帧, 以及 待分类帧前 N2帧以及待分类帧后 L1帧的能量分布参数, 并根据所述音频 信号中待分类帧, 待分类帧前 N2帧以及待分类帧后 L1帧的能量分布参数 获取所述待分类帧在低频区域的持续帧数和 /或所述待分类帧在高频区域 的持续帧数; 在所述待分类帧中满足连续性约束条件的音调分量的数量大 于第一阈值、所述待分类帧在低频区域的持续帧数大于第二阈值或所述待 分类帧在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分 类帧为音乐信号, 否则确定所述音频信号中待分类帧为语音信号。
其中, 处理器获取音频信号中待分类帧的音调分布参数, 待分类帧前
N2帧的音调分布参数, 以及待分类帧后 L1帧的音调分布参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N2帧和待分类帧帧 后 L1帧进行快速傅里叶变换, 获取功率密度谱; 根据所述功率密度谱获 取所述接收到的音频信号中的待分类帧的音调分量的频域分布信息作为 待分类帧的音调分布参数, 待分类帧前 N2帧的音调分量的频域分布信息 作为待分类帧前 N2帧的音调分布参数, 以及待分类帧帧后 L1帧的音调分 量的频域分布信息。
处理器根据待分类帧的音调分布参数, 待分类帧前 N2帧的音调分布 参数, 以及待分类帧后 L1帧的音调分布参数获取待分类帧中满足连续性 约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧的音调分量的频域分布信息作 为待分类帧的音调分布参数, 待分类帧前 N2帧的音调分量的频域分布信 息作为待分类帧前 N2帧的音调分布参数, 以及待分类帧帧后 L1帧的音调 分量的频域分布信息获取待分类帧中持续帧数大于第六阈值的音调分量 的数量。
另外, 处理器获取所音频信号中待分类帧的能量分布参数, 待分类帧 前 N2帧的能量分布参数以及待分类帧后 L1帧的能量分布参数包括: 获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N2帧的高频能量分布比和声压级 作为待分类帧前 N2帧的能量分布参数, 以及待分类帧帧后 L1帧的高频能 量分布比和声压级作为待分类帧后 L1帧的能量分布参数。
处理器根据音频信号中待分类帧的能量分布参数, 待分类帧前 N2帧 的能量分布参数以及待分类帧后 L1帧的能量分布参数获取所述待分类帧 在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数。
处理器根据音频信号中待分类帧的能量分布参数, 待分类帧前 N2帧 的能量分布参数以及待分类帧后 L1帧的能量分布参数获取所述待分类帧 在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N2帧和待分类 帧后 L1帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
三是在分类结果输出延时为 L2+L3帧时, L2和 L3为正整数, 所述处 理器具体用于获取音频信号中待分类帧, 待分类帧前 N3帧, 以及待分类 帧后 L2帧的音调分布参数, 并根据所述待分类帧, 待分类帧前 N3帧以及 待分类帧后 L2帧的音调分布参数获取待分类帧中满足连续性约束条件的 音调分量的数量, N3为正整数; 获取所述音频信号中待分类帧, 以及待分 类帧前 N3帧以及待分类帧后 L2帧的能量分布参数, 并根据所述音频信号 中待分类帧, 待分类帧前 N3帧以及待分类帧后 L2帧的能量分布参数获取 所述待分类帧在低频区域的持续帧数和 /或所述待分类帧在高频区域的持 续帧数; 在所述待分类帧中满足连续性约束条件的音调分量的数量大于第 一阈值、 所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类 帧在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧 为音乐信号, 否则确定所述音频信号中待分类帧为语音信号; 若确定所述 音频信号中待分类帧为音乐信号, 则确定所述待分类帧前 N4帧和待分类 帧后 L3帧中确定为语音信号的帧数目是否大于第四阈值, 若超过, 则将 所述音频信号中待分类帧修正为语音信号, N4为正整数; 若确定所述音频 信号中待分类帧为语音信号, 则确定所述待分类帧前 N4帧和待分类帧后 L3帧中确定为音乐信号的帧数目是否大于第五阈值, 若大于, 则将所述音 频信号中待分类帧修正为音乐信号。
其中, 处理器获取音频信号中待分类帧的音调分布参数, 待分类帧前 N3帧的音调分布参数, 以及待分类帧后 L2帧的音调分布参数包括:
对接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧帧 后 L2帧进行快速傅里叶变换, 获取功率密度谱; 根据所述功率密度谱获 取所述接收到的音频信号中的待分类帧的音调分量的频域分布信息作为 待分类帧的音调分布参数, 待分类帧前 N3帧的音调分量的频域分布信息 作为待分类帧前 N3帧的音调分布参数, 以及待分类帧后 L2帧的音调分量 的频域分布信息作为待分类帧后 L2帧的音调分布参数。
处理器根据待分类帧的音调分布参数, 待分类帧前 N3帧的音调分布 参数, 以及待分类帧后 L2帧的音调分布参数获取待分类帧中满足连续性 约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧 后 L2帧的音调分量的频域分布信息获取待分类帧中持续帧数大于第六阈 值的音调分量的数量。
另外, 处理器获取所音频信号中待分类帧的能量分布参数, 待分类帧 前 N3帧的能量分布参数以及待分类帧后 L2帧的能量分布参数包括: 获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 待分类帧前 N3帧的高频能量分布比和声压级 作为待分类帧前 N3帧的能量分布参数, 以及待分类帧帧后 L2帧的高频能 量分布比和声压级作为待分类帧后 L2帧的能量分布参数。
处理器根据音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧 的能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述待分类帧 在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比小于第八阈值的持续帧数。
处理器根据音频信号中待分类帧的能量分布参数, 待分类帧前 N3帧 的能量分布参数以及待分类帧后 L2帧的能量分布参数获取所述待分类帧 在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧、 待分类帧前 N3帧和待分类 帧后 L2帧的高频能量分布比和声压级获取包括所述待分类帧在内的高频 能量分布比大于第九阈值、 声压级大于第十阈值的持续帧数。
上述三种情况下, 处理器获取的待分类帧中持续帧数大于第六阈值的 音调分量的数量为在频域上大于第七阈值的音调分量的数量。 本领域普通 技术人员可以理解: 实现上述各方法实施例的全部或部分歩骤可以通过程 序指令相关的硬件来完成。 前述的程序可以存储于一计算机可读取存储介 质中。 该程序在执行时, 执行包括上述各方法实施例的歩骤; 而前述的存 储介质包括: R0M、 RAM, 磁碟或者光盘等各种可以存储程序代码的介质。 最后应说明的是: 以上各实施例仅用以说明本发明的技术方案, 而非 对其限制; 尽管参照前述各实施例对本发明进行了详细的说明, 本领域的 普通技术人员应当理解: 其依然可以对前述各实施例所记载的技术方案进 行修改, 或者对其中部分或者全部技术特征进行等同替换; 而这些修改或 者替换, 并不使相应技术方案的本质脱离本发明各实施例技术方案的范 围。

Claims

权 利 要 求 书
1、 一种音频信号分类处理方法, 其特征在于, 包括:
获取音频信号中待分类帧中满足连续性约束条件的音调分量的数量、 所述待分类帧在低频区域的持续帧数和所述待分类帧在高频区域的持续 帧数中的至少一项;
根据获取的所述待分类帧中满足连续性约束条件的音调分量的数量、 所述待分类帧在低频区域的持续帧数和所述待分类帧在高频区域的持续 帧数中的至少一项, 确定所述音频信号中待分类帧为音乐信号, 或确定所 述音频信号中待分类帧为语音信号。
2、 根据权利要求 1所述的音频信号分类处理方法, 其特征在于, 所 述获取音频信号中待分类帧中满足连续性约束条件的音调分量的数量包 括:
获取音频信号中待分类帧的音调分布参数, 以及待分类帧前 N1帧的 音调分布参数, 并根据所述待分类帧的音调分布参数, 以及待分类帧前 N1 帧的音调分布参数获取待分类帧中满足连续性约束条件的音调分量的数 量, N1为正整数;
所述获取所述音频信号中待分类帧在低频区域的持续帧数和 /或所述 待分类帧在高频区域的持续帧数包括:
获取所述音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1 帧的能量分布参数, 并根据所述音频信号中待分类帧的能量分布参数, 以 及待分类帧前 N1帧的能量分布参数获取所述待分类帧在低频区域的持续 帧数和 /或所述待分类帧在高频区域的持续帧数, N1为正整数;
所述根据所述待分类帧中满足连续性约束条件的音调分量的数量、所 述待分类帧在低频区域的持续帧数和所述待分类帧在高频区域的持续帧 数中的至少一项, 确定所述音频信号中待分类帧为音乐信号, 否则确定所 述音频信号中待分类帧为语音信号包括:
在所述待分类帧中满足连续性约束条件的音调分量的数量大于第一 阈值、 所述待分类帧在低频区域的持续帧数大于第二阈值或所述待分类帧 在高频区域的持续帧数大于第三阈值时, 确定所述音频信号中待分类帧为 音乐信号, 否则确定所述音频信号中待分类帧为语音信号。
3、 根据权利要求 2所述的音频信号分类处理方法, 其特征在于, 所 述获取音频信号中待分类帧的音调分布参数, 以及待分类帧前 N1帧的音 调分布参数包括:
对接收到的音频信号中的待分类帧和待分类帧前 N1帧进行快速傅里 叶变换, 获取功率密度谱;
根据所述功率密度谱获取所述接收到的音频信号中的待分类帧的音 调分量的频域分布信息作为待分类帧的音调分布参数,以及待分类帧前 N1 帧的音调分量的频域分布信息作为待分类帧前 N1帧的音调分布参数; 所述根据待分类帧的音调分布参数, 以及待分类帧前 N1帧的音调分 布参数获取待分类帧中满足连续性约束条件的音调分量的数量包括:
根据接收到的音频信号中的待分类帧和待分类帧前 N1帧的音调分量 的频域分布信息获取待分类帧中持续帧数大于第六阈值的音调分量的数
4、 根据权利要求 2所述的音频信号分类处理方法, 其特征在于, 所 述获取所述音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1帧 的能量分布参数包括:
获取接收到的音频信号中待分类帧的高频能量分布比和声压级作为 待分类帧的能量分布参数, 以及待分类帧前 N1帧的高频能量分布比和声 压级作为待分类帧前 N1帧的能量分布参数;
所述根据音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1 帧的能量分布参数获取所述待分类帧在低频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比小于第 八阈值的持续帧数;
所述根据音频信号中待分类帧的能量分布参数, 以及待分类帧前 N1 帧的能量分布参数获取所述待分类帧在高频区域的持续帧数包括:
根据所述接收到的音频信号中待分类帧和待分类帧前 N1帧的高频能 量分布比和声压级获取包括所述待分类帧在内的高频能量分布比大于第 九阈值、 声压级大于第十阈值的持续帧数。
5. The audio signal classification processing method according to any one of claims 1 to 4, wherein the acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal comprises:
acquiring a tonal distribution parameter of the frame to be classified in the audio signal, tonal distribution parameters of the N2 frames preceding the frame to be classified, and tonal distribution parameters of the L1 frames following the frame to be classified, and acquiring, according to the tonal distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where L1 is a positive integer and N2 is a positive integer;
the acquiring the number of continuous frames of the frame to be classified in the low frequency region and/or the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring an energy distribution parameter of the frame to be classified in the audio signal, energy distribution parameters of the N2 preceding frames, and energy distribution parameters of the L1 following frames, and acquiring, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low frequency region and/or the number of continuous frames of the frame to be classified in the high frequency region; and
the determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of continuous frames of the frame to be classified in the low frequency region, and the number of continuous frames of the frame to be classified in the high frequency region, that the frame to be classified in the audio signal is a music signal, or otherwise determining that the frame to be classified in the audio signal is a speech signal comprises:
when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal.
6. The audio signal classification processing method according to claim 5, wherein the acquiring the tonal distribution parameter of the frame to be classified in the audio signal, the tonal distribution parameters of the N2 preceding frames, and the tonal distribution parameters of the L1 following frames comprises:
performing a fast Fourier transform on the frame to be classified in the received audio signal, on the N2 frames preceding it, and on the L1 frames following it, to obtain power density spectra; and
obtaining from the power density spectra the frequency domain distribution information of the tonal components of the frame to be classified as the tonal distribution parameter of the frame to be classified, the frequency domain distribution information of the tonal components of the N2 preceding frames as the tonal distribution parameters of the N2 preceding frames, and the frequency domain distribution information of the tonal components of the L1 following frames as the tonal distribution parameters of the L1 following frames; and
the acquiring, according to the tonal distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
7. The audio signal classification processing method according to claim 5, wherein the acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 preceding frames, and the energy distribution parameters of the L1 following frames comprises:
acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of the N2 preceding frames, and the high frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of the L1 following frames;
the acquiring, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of continuous frames of the frame to be classified in the low frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is less than an eighth threshold; and
the acquiring, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
8. The audio signal classification processing method according to any one of claims 1 to 7, wherein the acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal comprises:
acquiring a tonal distribution parameter of the frame to be classified in the audio signal, tonal distribution parameters of the N3 frames preceding the frame to be classified, and tonal distribution parameters of the L2 frames following the frame to be classified, and acquiring, according to the tonal distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where L2 is a positive integer, L3 is a positive integer, and N3 is a positive integer;
the acquiring the number of continuous frames of the frame to be classified in the low frequency region and/or the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring an energy distribution parameter of the frame to be classified in the audio signal, energy distribution parameters of the N3 preceding frames, and energy distribution parameters of the L3 following frames, and acquiring, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low frequency region and/or the number of continuous frames of the frame to be classified in the high frequency region; and
the determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of continuous frames of the frame to be classified in the low frequency region, and the number of continuous frames of the frame to be classified in the high frequency region, that the frame to be classified in the audio signal is a music signal, or otherwise determining that the frame to be classified in the audio signal is a speech signal comprises:
when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal;
if the frame to be classified in the audio signal is determined to be a music signal, determining whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L3 frames following the frame to be classified is greater than a fourth threshold, and if so, correcting the frame to be classified in the audio signal to a speech signal, where N4 is a positive integer; and
if the frame to be classified in the audio signal is determined to be a speech signal, determining whether the number of frames determined to be music signals among the N4 preceding frames and the L3 following frames is greater than a fifth threshold, and if so, correcting the frame to be classified in the audio signal to a music signal.
9. The audio signal classification processing method according to claim 8, wherein the acquiring the tonal distribution parameter of the frame to be classified in the audio signal, the tonal distribution parameters of the N3 preceding frames, and the tonal distribution parameters of the L2 following frames comprises:
performing a fast Fourier transform on the frame to be classified in the received audio signal, on the N3 frames preceding it, and on the L2 frames following it, to obtain power density spectra; and
obtaining from the power density spectra the frequency domain distribution information of the tonal components of the frame to be classified as the tonal distribution parameter of the frame to be classified, the frequency domain distribution information of the tonal components of the N3 preceding frames as the tonal distribution parameters of the N3 preceding frames, and the frequency domain distribution information of the tonal components of the L2 following frames as the tonal distribution parameters of the L2 following frames; and
the acquiring, according to the tonal distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
10. The audio signal classification processing method according to claim 8, wherein the acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 preceding frames, and the energy distribution parameters of the L2 following frames comprises:
acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of the N3 preceding frames, and the high frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of the L2 following frames;
the acquiring, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of continuous frames of the frame to be classified in the low frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is less than an eighth threshold; and
the acquiring, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring, according to the frame to be classified, the energy distribution parameters of the N3 preceding frames, and the high frequency energy distribution ratios and sound pressure levels of the L2 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
11. The audio signal classification processing method according to claim 3, 6 or 9, wherein the number of tonal components in the frame to be classified whose number of continuous frames is greater than the sixth threshold is the number of those tonal components that are higher than a seventh threshold in the frequency domain.
12. An audio signal classification processing apparatus, comprising:
a first acquiring module, configured to acquire at least one of: the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of continuous frames of the frame to be classified in a low frequency region, and the number of continuous frames of the frame to be classified in a high frequency region; and
a classification determining module, configured to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of continuous frames of the frame to be classified in the low frequency region, and the number of continuous frames of the frame to be classified in the high frequency region, that the frame to be classified in the audio signal is a music signal, or determine that the frame to be classified in the audio signal is a speech signal.
13. The audio signal classification processing apparatus according to claim 12, wherein the first acquiring module is specifically configured to acquire the tonal distribution parameter of the frame to be classified in the audio signal and the tonal distribution parameters of the N1 frames preceding the frame to be classified, and acquire, according to the tonal distribution parameters of the frame to be classified and of the N1 preceding frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; or is specifically configured to acquire the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 preceding frames, and acquire, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames, the number of continuous frames of the frame to be classified in the low frequency region or the number of continuous frames of the frame to be classified in the high frequency region; and
the classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a speech signal.
14. The audio signal classification processing apparatus according to claim 13, wherein the first acquiring module acquiring the tonal distribution parameter of the frame to be classified in the audio signal and the tonal distribution parameters of the N1 preceding frames comprises: performing a fast Fourier transform on the frame to be classified in the received audio signal and on the N1 frames preceding it, to obtain power density spectra; and
obtaining from the power density spectra the frequency domain distribution information of the tonal components of the frame to be classified as the tonal distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal components of the N1 preceding frames as the tonal distribution parameters of the N1 preceding frames; and
the classification determining module acquiring, according to the tonal distribution parameters of the frame to be classified and of the N1 preceding frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified and of the N1 preceding frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
15. The audio signal classification processing apparatus according to claim 13, wherein the first acquiring module acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 preceding frames comprises:
acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and the high frequency energy distribution ratios and sound pressure levels of the N1 preceding frames as the energy distribution parameters of the N1 preceding frames;
the classification determining module acquiring, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames, the number of continuous frames of the frame to be classified in the low frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the N1 preceding frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is less than an eighth threshold; and
the classification determining module acquiring, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames, the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the N1 preceding frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
16. The audio signal classification processing apparatus according to any one of claims 12 to 15, wherein, when the classification result of the frame to be classified is acquired with a delay of L1 frames, where L1 is a positive integer, the first acquiring module is specifically configured to acquire the tonal distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding the frame to be classified, and of the L1 frames following the frame to be classified, and acquire, according to the tonal distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; or is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal, of the N2 preceding frames, and of the L1 following frames, and acquire, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low frequency region or the number of continuous frames of the frame to be classified in the high frequency region; and
the classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a speech signal.
17. The audio signal classification processing apparatus according to claim 16, wherein the first acquiring module acquiring the tonal distribution parameters of the frame to be classified in the audio signal, of the N2 preceding frames, and of the L1 following frames comprises: performing a fast Fourier transform on the frame to be classified in the received audio signal, on the N2 frames preceding it, and on the L1 frames following it, to obtain power density spectra; and
obtaining from the power density spectra the frequency domain distribution information of the tonal components of the frame to be classified as the tonal distribution parameter of the frame to be classified, the frequency domain distribution information of the tonal components of the N2 preceding frames as the tonal distribution parameters of the N2 preceding frames, and the frequency domain distribution information of the tonal components of the L1 following frames as the tonal distribution parameters of the L1 following frames; and
the classification determining module acquiring, according to the tonal distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
18. The audio signal classification processing apparatus according to claim 16, wherein the first acquiring module acquiring the energy distribution parameters of the frame to be classified in the audio signal, of the N2 preceding frames, and of the L1 following frames comprises: acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of the N2 preceding frames, and the high frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of the L1 following frames;
the classification determining module acquiring, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of continuous frames of the frame to be classified in the low frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is less than an eighth threshold; and
the classification determining module acquiring, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
19. The audio signal classification processing apparatus according to any one of claims 12 to 18, wherein,
when the classification result of the frame to be classified is acquired with a delay of L2+L3 frames, where L2 and L3 are positive integers, the first acquiring module is specifically configured to acquire the tonal distribution parameters of the frame to be classified in the audio signal, of the N3 frames preceding the frame to be classified, and of the L2 frames following the frame to be classified, and acquire, according to the tonal distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N3 is a positive integer; or
is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal, of the N3 preceding frames, and of the L2 following frames, and acquire, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low frequency region or the number of continuous frames of the frame to be classified in the high frequency region; and
the classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a speech signal; if the frame to be classified in the audio signal is determined to be a music signal, determine whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fourth threshold, and if so, correct the frame to be classified in the audio signal to a speech signal; and if the frame to be classified in the audio signal is determined to be a speech signal, determine whether the number of frames determined to be music signals among the N4 preceding frames and the L3 following frames is greater than a fifth threshold, and if so, correct the frame to be classified in the audio signal to a music signal, where N4 is a positive integer.
20. The audio signal classification processing apparatus according to claim 19, wherein the first acquiring module acquiring the tonal distribution parameters of the frame to be classified in the audio signal, of the N3 preceding frames, and of the L2 following frames comprises: performing a fast Fourier transform on the frame to be classified in the received audio signal, on the N3 frames preceding it, and on the L2 frames following it, to obtain power density spectra; and
obtaining from the power density spectra the frequency domain distribution information of the tonal components of the frame to be classified as the tonal distribution parameter of the frame to be classified, the frequency domain distribution information of the tonal components of the N3 preceding frames as the tonal distribution parameters of the N3 preceding frames, and the frequency domain distribution information of the tonal components of the L2 following frames as the tonal distribution parameters of the L2 following frames; and
the classification determining module acquiring, according to the tonal distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
21. The audio signal classification processing apparatus according to claim 19, wherein the first acquiring module acquiring the energy distribution parameters of the frame to be classified in the audio signal, of the N3 preceding frames, and of the L2 following frames comprises: acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of the N3 preceding frames, and the high frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of the L2 following frames;
the classification determining module acquiring, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of continuous frames of the frame to be classified in the low frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is less than an eighth threshold; and
the classification determining module acquiring, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
22. The audio signal classification processing apparatus according to claim 14, 17 or 20, wherein the number of tonal components, acquired by the first acquiring module, in the frame to be classified whose number of continuous frames is greater than the sixth threshold is the number of those tonal components that are higher than a seventh threshold in the frequency domain.
23. An audio signal classification processing device, comprising:
a receiver, configured to receive an audio signal; and
a processor, connected to the receiver and configured to acquire at least one of: the number of tonal components satisfying a continuity constraint in a frame to be classified in the audio signal received by the receiver, the number of continuous frames of the frame to be classified in a low frequency region, and the number of continuous frames of the frame to be classified in a high frequency region; and to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of continuous frames of the frame to be classified in the low frequency region, and the number of continuous frames of the frame to be classified in the high frequency region, that the frame to be classified in the audio signal is a music signal, or determine that the frame to be classified in the audio signal is a speech signal.
24. The audio signal classification processing device according to claim 23, wherein the processor is specifically configured to: acquire the tonal distribution parameter of the frame to be classified in the audio signal and the tonal distribution parameters of the N1 frames preceding the frame to be classified, and acquire, according to the tonal distribution parameters of the frame to be classified and of the N1 preceding frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; acquire the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 preceding frames, and acquire, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames, the number of continuous frames of the frame to be classified in the low frequency region and/or the number of continuous frames of the frame to be classified in the high frequency region, where N1 is a positive integer; and when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a speech signal.
25. The audio signal classification processing device according to claim 24, wherein the processor acquiring the tonal distribution parameter of the frame to be classified in the audio signal and the tonal distribution parameters of the N1 preceding frames comprises:
performing a fast Fourier transform on the frame to be classified in the received audio signal and on the N1 frames preceding it, to obtain power density spectra; and
obtaining from the power density spectra the frequency domain distribution information of the tonal components of the frame to be classified as the tonal distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal components of the N1 preceding frames as the tonal distribution parameters of the N1 preceding frames; and
the processor acquiring, according to the tonal distribution parameters of the frame to be classified and of the N1 preceding frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified and of the N1 preceding frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
26. The audio signal classification processing device according to claim 24, wherein the processor acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 preceding frames comprises:
acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and the high frequency energy distribution ratios and sound pressure levels of the N1 preceding frames as the energy distribution parameters of the N1 preceding frames;
the processor acquiring, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames, the number of continuous frames of the frame to be classified in the low frequency region comprises: acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the N1 preceding frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is less than an eighth threshold; and
the processor acquiring, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames, the number of continuous frames of the frame to be classified in the high frequency region comprises: acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the N1 preceding frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
27. The audio signal classification processing device according to any one of claims 23 to 26, wherein, when the classification result of the frame to be classified is acquired with a delay of L1 frames, where L1 is a positive integer, the processor is specifically configured to: acquire the tonal distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding the frame to be classified, and of the L1 frames following the frame to be classified, and acquire, according to the tonal distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; acquire the energy distribution parameters of the frame to be classified in the audio signal, of the N2 preceding frames, and of the L1 following frames, and acquire, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low frequency region and/or the number of continuous frames of the frame to be classified in the high frequency region; and when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a speech signal.
28. The audio signal classification processing device according to claim 27, wherein the processor acquiring the tonal distribution parameters of the frame to be classified in the audio signal, of the N2 preceding frames, and of the L1 following frames comprises:
performing a fast Fourier transform on the frame to be classified in the received audio signal, on the N2 frames preceding it, and on the L1 frames following it, to obtain power density spectra; and
obtaining from the power density spectra the frequency domain distribution information of the tonal components of the frame to be classified as the tonal distribution parameter of the frame to be classified, the frequency domain distribution information of the tonal components of the N2 preceding frames as the tonal distribution parameters of the N2 preceding frames, and the frequency domain distribution information of the tonal components of the L1 following frames as the tonal distribution parameters of the L1 following frames; and
the processor acquiring, according to the tonal distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
29. The audio signal classification processing device according to claim 27, wherein the processor acquiring the energy distribution parameters of the frame to be classified in the audio signal, of the N2 preceding frames, and of the L1 following frames comprises:
acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of the N2 preceding frames, and the high frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of the L1 following frames;
the processor acquiring, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of continuous frames of the frame to be classified in the low frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is less than an eighth threshold; and
the processor acquiring, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
30. The audio signal classification processing device according to any one of claims 23 to 29, wherein, when the classification result of the frame to be classified is acquired with a delay of L2+L3 frames, where L2 and L3 are positive integers, the processor is specifically configured to: acquire the tonal distribution parameters of the frame to be classified in the audio signal, of the N3 frames preceding the frame to be classified, and of the L2 frames following the frame to be classified, and acquire, according to the tonal distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N3 is a positive integer; acquire the energy distribution parameters of the frame to be classified in the audio signal, of the N3 preceding frames, and of the L2 following frames, and acquire, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low frequency region and/or the number of continuous frames of the frame to be classified in the high frequency region; when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a speech signal; if the frame to be classified in the audio signal is determined to be a music signal, determine whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L3 frames following the frame to be classified is greater than a fourth threshold, and if so, correct the frame to be classified in the audio signal to a speech signal, where N4 is a positive integer; and if the frame to be classified in the audio signal is determined to be a speech signal, determine whether the number of frames determined to be music signals among the N4 preceding frames and the L3 following frames is greater than a fifth threshold, and if so, correct the frame to be classified in the audio signal to a music signal.
31. The audio signal classification processing device according to claim 30, wherein the processor acquiring the tonal distribution parameters of the frame to be classified in the audio signal, of the N3 preceding frames, and of the L2 following frames comprises:
performing a fast Fourier transform on the frame to be classified in the received audio signal, on the N3 frames preceding it, and on the L2 frames following it, to obtain power density spectra; and
obtaining from the power density spectra the frequency domain distribution information of the tonal components of the frame to be classified as the tonal distribution parameter of the frame to be classified, the frequency domain distribution information of the tonal components of the N3 preceding frames as the tonal distribution parameters of the N3 preceding frames, and the frequency domain distribution information of the tonal components of the L2 following frames as the tonal distribution parameters of the L2 following frames; and
the processor acquiring, according to the tonal distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises: acquiring, according to the frequency domain distribution information of the tonal components of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
32. The audio signal classification processing device according to claim 30, wherein the processor acquiring the energy distribution parameters of the frame to be classified in the audio signal, of the N3 preceding frames, and of the L2 following frames comprises:
acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of the N3 preceding frames, and the high frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of the L2 following frames;
the processor acquiring, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of continuous frames of the frame to be classified in the low frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is less than an eighth threshold; and
the processor acquiring, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of continuous frames of the frame to be classified in the high frequency region comprises:
acquiring, according to the high frequency energy distribution ratios and sound pressure levels of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
33. The audio signal classification processing device according to claim 25, 28 or 31, wherein the number of tonal components, acquired by the processor, in the frame to be classified whose number of continuous frames is greater than the sixth threshold is the number of those tonal components that are higher than a seventh threshold in the frequency domain.
PCT/CN2014/081400 2013-07-02 2014-07-01 Audio signal classification processing method, apparatus and device WO2015000401A1 (zh)

Kind code of ref document: A1