WO2015000401A1 - Audio signal classification processing method, apparatus, and device

Info

Publication number
WO2015000401A1
WO2015000401A1 PCT/CN2014/081400 CN2014081400W WO2015000401A1 WO 2015000401 A1 WO2015000401 A1 WO 2015000401A1 CN 2014081400 W CN2014081400 W CN 2014081400W WO 2015000401 A1 WO2015000401 A1 WO 2015000401A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
classified
audio signal
frames
energy distribution
Application number
PCT/CN2014/081400
Inventor
许丽净
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2015000401A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/81 Detection of presence or absence of voice signals for discriminating voice from music

Definitions

  • Audio signal classification processing method, apparatus, and device
  • the embodiments of the present invention relate to the field of signal processing technologies, and in particular, to an audio signal classification processing method, apparatus, and device.
  • Background
  • the signal to be analyzed in an actual application may include a music signal, such as a ringtone.
  • the speech quality assessment model treats it as a speech signal and gives an incorrect quality assessment.
  • the signal to be analyzed should be classified before being input to the speech quality assessment module. If a signal segment is recognized as a speech signal, it is sent to the speech quality assessment module for quality assessment; if it is recognized as a music signal, it is not sent to the speech quality assessment module.
  • the prior art provides an audio signal classification method applied to a joint speech and music encoder, but that classification method is designed for a joint speech and music encoder with a high sampling rate.
  • existing music signals generally lack high-frequency information. Using the existing audio signal classification method for joint speech and music encoders, only a small number of music signals can be identified, and the classification accuracy is low, which cannot meet the requirements of voice quality assessment.
  • the invention provides an audio signal classification processing method, apparatus, and device for improving the classification accuracy of audio signals.
  • a first aspect of the present invention provides an audio signal classification processing method, including: acquiring at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region;
  • determining that the frame to be classified in the audio signal is a music signal, or determining that the frame to be classified in the audio signal is a voice signal.
  • acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal includes:
  • N1 is a positive integer
  • determining that the frame to be classified in the audio signal is a music signal, or otherwise determining that the frame to be classified in the audio signal is a voice signal, includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
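The threshold comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the three per-frame quantities are taken as already computed, the rule is read as an any-of test, and the names `FrameFeatures` and `classify_frame` as well as every threshold value are hypothetical (the source gives no numeric thresholds).

```python
from dataclasses import dataclass


@dataclass
class FrameFeatures:
    """Per-frame quantities named in the text (hypothetical container)."""
    tonal_count: int    # tonal components satisfying the continuity constraint
    low_band_run: int   # consecutive frames in the low-frequency region
    high_band_run: int  # consecutive frames in the high-frequency region


def classify_frame(f: FrameFeatures, first: int, second: int, third: int) -> str:
    """Return "music" if any quantity exceeds its threshold, else "voice".

    Assumption: the three comparisons are combined with a logical OR.
    """
    is_music = (
        f.tonal_count > first
        or f.low_band_run > second
        or f.high_band_run > third
    )
    return "music" if is_music else "voice"
```

For example, with illustrative thresholds `first=3, second=10, third=10`, a frame with five continuous tonal components would be labelled music even if both run lengths are zero.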
  • acquiring the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified includes:
  • obtaining, according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified and of the N1 frames before the frame to be classified in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
  • acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified includes:
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
  • L1 is a positive integer
  • the audio signal is acquired.
  • the number of tonal components in the to-be-classified frame that satisfy the continuity constraint includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N2 is a positive integer;
  • the N2 frame and the energy distribution parameter of the L1 frame after the frame to be classified acquire the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region;
  • determining that the frame to be classified in the audio signal is a music signal, or otherwise determining that the frame to be classified in the audio signal is a voice signal, includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
  • the pitch distribution parameters of the frame include:
  • the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified include:
  • acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified includes:
  • the energy distribution parameter of the frame to be classified; the high-frequency energy distribution ratio and the sound pressure level of the N2 frames before the frame to be classified are used as the energy distribution parameters of the N2 frames before the frame to be classified; and the high-frequency energy distribution ratio and the sound pressure level of the L1 frames after the frame to be classified are used as the energy distribution parameters of the L1 frames after the frame to be classified;
  • obtaining, according to the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, where L2 and L3 are positive integers.
  • acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N3 is a positive integer;
  • the energy distribution parameter of the L2 frame acquires the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region;
  • determining, according to the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, or the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or otherwise determining that the frame to be classified in the audio signal is a voice signal, includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal;
  • if it is determined that the frame to be classified in the audio signal is a music signal, determining whether the number of frames determined as voice signals among the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fourth threshold, and if so, correcting the frame to be classified in the audio signal to a voice signal, where N4 is a positive integer; if it is determined that the frame to be classified in the audio signal is a voice signal, determining whether the number of frames determined as music signals among the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fifth threshold, and if so, correcting the frame to be classified in the audio signal to a music signal.
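The neighbour-based correction described above can be sketched as a small smoothing step over preliminary labels. This is an illustrative sketch, not the patent's implementation: it assumes both branches inspect the N4 frames before and the L3 frames after the frame to be classified, and the function name `smooth_label`, the string labels, and the threshold values are hypothetical.

```python
from typing import List


def smooth_label(labels: List[str], idx: int, n4: int, l3: int,
                 fourth: int, fifth: int) -> str:
    """Correct the preliminary label of frame `idx` using its neighbours.

    `labels` holds preliminary "music"/"voice" decisions per frame.
    Assumption: the window is the n4 frames before and the l3 frames
    after `idx`, clipped at the ends of the signal.
    """
    start = max(0, idx - n4)
    end = min(len(labels), idx + l3 + 1)
    neighbours = labels[start:idx] + labels[idx + 1:end]
    current = labels[idx]
    # A music frame surrounded by enough voice frames is corrected to voice.
    if current == "music" and neighbours.count("voice") > fourth:
        return "voice"
    # A voice frame surrounded by enough music frames is corrected to music.
    if current == "voice" and neighbours.count("music") > fifth:
        return "music"
    return current
```

Because the correction looks at later frames, the final label for a frame can only be produced after those frames have received their preliminary labels, which is consistent with the delayed decision mentioned in the text.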
  • the pitch distribution parameters of the frame include:
  • using the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, the frequency-domain distribution information of the tonal components of the N3 frames before the frame to be classified as the pitch distribution parameters of the N3 frames before the frame to be classified, and the frequency-domain distribution information of the tonal components of the L2 frames after the frame to be classified as the pitch distribution parameters of the L2 frames after the frame to be classified;
  • the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified include:
  • the N3 frames before the frame to be classified, and the tonal components of the L2 frames after the frame to be classified, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold.
  • acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified includes:
  • the energy distribution parameters of the N3 frames, and the high-frequency energy distribution ratio and the sound pressure level of the L2 frames after the frame to be classified are used as the energy distribution parameters of the L2 frames after the frame to be classified;
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • obtaining, according to the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • a second aspect of the present invention provides an audio signal classification processing apparatus, including: a first acquiring module, configured to acquire at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region;
  • a classification determining module, configured to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and to obtain, according to the pitch distribution parameters of the frame to be classified and of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; or
  • is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and to obtain the foregoing according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified.
  • the classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module acquiring the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified includes:
  • the information is used as a pitch distribution parameter of the N1 frame before the frame to be classified;
  • the classification determining module obtaining, according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
  • the first acquiring module acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified includes: using the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and using the high-frequency energy distribution ratio and the sound pressure level of the N1 frames before the frame to be classified as the energy distribution parameters of the N1 frames before the frame to be classified;
  • the classification determining module obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • the classification determining module obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
  • L1 is a positive integer
  • the first acquiring module is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified, and to obtain, according to the pitch distribution parameters of the frame to be classified, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; or is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified, and to obtain the number of consecutive frames of the frame to be classified in the low-frequency region according to the frame to be classified in the audio signal, the N2 frame of the frame to be
  • the classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified includes:
  • the classification determining module obtaining, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
  • the first acquiring module acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified includes:
  • the energy distribution parameters of the N2 frames, and the high-frequency energy distribution ratio and the sound pressure level of the L1 frames after the frame to be classified are used as the energy distribution parameters of the L1 frames after the frame to be classified;
  • the classification determining module obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • obtaining, according to the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the classification determining module obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • L2 and L3 are positive integers.
  • the first acquiring module is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified, and to obtain, according to the pitch distribution parameters of the frame to be classified, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N3 is a positive integer; or
  • is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal, the N3 frames before the frame to be classified, and the L3 frames after the frame to be classified, and to obtain, according to the energy distribution parameters of the frame to be classified in the audio signal, the N3 frames before the frame to be classified, and the L3 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region or the number of consecutive frames of the frame to be classified in the high-frequency region;
  • the classification processing module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a voice signal; if it is determined that the frame to be classified in the audio signal is a music signal, determine whether the number of frames determined as voice signals among the N3 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fourth threshold, and if so, correct the frame to be classified in the audio signal to a voice signal; if it is determined that the frame to be classified in the audio signal is a voice signal, determine whether the number of frames determined as music signals among the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fifth threshold, and if so, correct the frame to be classified in the audio signal to a music signal, where N4 is a positive integer.
  • the first acquiring module acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified includes:
  • the pitch distribution parameters of the N3 frames before the frame to be classified, and the frequency-domain distribution information of the tonal components of the L2 frames after the frame to be classified is used as the pitch distribution parameters of the L2 frames after the frame to be classified;
  • the classification determining module obtaining, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
  • the first acquiring module acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified includes:
  • the energy distribution parameters of the N3 frames, and the high-frequency energy distribution ratio and the sound pressure level of the L2 frames after the frame to be classified are used as the energy distribution parameters of the L2 frames after the frame to be classified;
  • the classification determining module obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • obtaining, according to the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the classification determining module obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
  • the ratio is greater than the ninth threshold
  • the sound pressure level is greater than the tenth threshold.
  • the number of tonal components in the frame to be classified, acquired by the first acquiring module, whose number of consecutive frames is greater than a sixth threshold is the number of tonal components that are greater than the seventh threshold in the frequency domain.
  • the number of tonal components satisfying the continuity constraint is the number of tonal components greater than the seventh threshold in the frequency domain; with reference to the first, second, or third possible implementation of the second aspect
  • the first acquiring module is specifically configured to acquire a high frequency energy distribution ratio and a sound pressure level of each frame in the received audio signal, and according to a high frequency energy distribution ratio of each frame in the received audio signal.
  • a third aspect of the present invention provides an audio signal classification processing apparatus, including: a receiver, configured to receive an audio signal;
  • a processor, configured to acquire at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in the audio signal received by the receiver, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region; and to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal.
  • the processor is specifically configured to: acquire the pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and obtain, according to the pitch distribution parameters of the frame to be classified and of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; acquire the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and obtain, according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, where N1 is a positive integer; the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, and the to-be-classified
  • the processor acquiring the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified includes:
  • the distribution information is used as a pitch distribution parameter of the N1 frame before the frame to be classified;
  • the processor acquiring, according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
  • the processor acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified includes:
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region includes: obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the N1 frames before it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the high-frequency region includes: obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the N1 frames before it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • L1 is a positive integer
  • the processor is specifically used.
  • the tone distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N2 is a positive integer; acquiring the frame to be classified in the audio signal, and the energy of the L2 frame before the frame to be classified and the frame after the frame to be classified Distributing parameters, and obtaining, according to the to-be-classified frame in the audio signal, the N2 frame of the frame to be classified, and the energy distribution parameter of the L1 frame after the frame to be classified, the number of consecutive
  • the processor acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified includes:
  • using frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, frequency-domain distribution information of the tonal components of the N2 frames before the frame to be classified as the pitch distribution parameters of the N2 frames before the frame to be classified, and frequency-domain distribution information of the tonal components of the L1 frames after the frame to be classified as the pitch distribution parameters of the L1 frames after the frame to be classified;
  • the processor obtaining, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: acquiring, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 frames before it, and the L1 frames after it, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
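  • As an illustration only, counting the tonal components whose number of consecutive frames exceeds the sixth threshold could be sketched as follows. The representation of the frequency-domain distribution information as one set of tonal frequency-bin indices per frame, the function name `count_persistent_tones`, and the requirement that a tone stay in exactly the same bin are all assumptions made for this sketch and are not taken from the text.

```python
from typing import List, Set


def count_persistent_tones(tonal_bins: List[Set[int]], idx: int, sixth_threshold: int) -> int:
    """Count tonal components of frame `idx` whose run of consecutive frames
    is longer than `sixth_threshold`.

    tonal_bins[k] is the (assumed) set of frequency-bin indices flagged as
    tonal in frame k. The list may hold frames before and after `idx`.
    """
    count = 0
    for b in tonal_bins[idx]:
        run = 1  # the frame to be classified itself
        k = idx - 1
        while k >= 0 and b in tonal_bins[k]:  # extend into the preceding frames
            run += 1
            k -= 1
        k = idx + 1
        while k < len(tonal_bins) and b in tonal_bins[k]:  # extend into the following frames
            run += 1
            k += 1
        if run > sixth_threshold:
            count += 1
    return count
```

  • With two frames before and two frames after the frame to be classified, a bin that is tonal in all five frames has a run of 5 and is counted for any sixth threshold below 5. A practical detector would probably tolerate small frequency drift between frames, which this sketch does not.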
  • the processor acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified includes:
  • the energy distribution parameter of the N2 frame and the high frequency energy distribution ratio and the sound pressure level of the L1 frame after the frame to be classified are the energy distribution parameters of the L1 frame after the frame to be classified;
  • the processor obtaining the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before it, and the L1 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the processor acquiring, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • In the third aspect, when the classification result of the frame to be classified is obtained after a delay of L2+L3 frames, L2 and L3 being positive integers,
  • the processor is specifically configured to: acquire the pitch distribution parameters of the frame to be classified in the audio signal, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified, and obtain, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, N3 being a positive integer; acquire the energy distribution parameters of the frame to be classified in the audio signal, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified, and obtain, according to these energy distribution parameters, the number of consecutive frames of
  • the processor acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified includes:
  • using frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, frequency-domain distribution information of the tonal components of the N3 frames before the frame to be classified as the pitch distribution parameters of the N3 frames before the frame to be classified, and frequency-domain distribution information of the tonal components of the L2 frames after the frame to be classified as the pitch distribution parameters of the L2 frames after the frame to be classified;
  • the processor obtaining, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: acquiring, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 frames before it, and the L2 frames after it, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold.
  • the processor acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified includes:
  • the processor obtaining the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames before it, and the L2 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the numbers include:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the number, acquired by the processor, of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold is the number of such tonal components that are greater than the seventh threshold in the frequency domain.
  • the number of tonal components satisfying the continuity constraint is the number of tonal components greater than the seventh threshold in the frequency domain.
  • The number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal is obtained, together with the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, and the frame to be classified is determined to be a music signal or a voice signal according to this information. The audio signal classification processing method provided by the above technical solution can improve the accuracy of audio signal classification and meet the requirements of voice quality assessment.
  • FIG. 1 is a schematic flowchart 1 of an audio signal classification processing method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart 1 of a specific embodiment of the present invention
  • Figure 3a is a waveform diagram 1 of the input signal "French male + ⁇ ";
  • Figure 3b is a spectrogram corresponding to Figure 3a;
  • Figure 4a is a waveform diagram of an input signal of an audio signal "Jinghu + French male voice";
  • Figure 4b is a spectrum diagram corresponding to Figure 4a;
  • Figure 5a is a waveform diagram of the input signal "Korean male + ensemble"
  • Figure 5b is a spectrum diagram corresponding to Figure 5a;
  • Figure 6a is a waveform diagram 2 of the input signal "French male + ⁇ ";
  • Figure 6b is the initial tone detection result of the input signal shown in Figure 6a;
  • Figure 6c is the result of the tone detection after the input signal is filtered as shown in Figure 6a;
  • Figure 7a is a waveform diagram 3 of the input signal "French male + ⁇ ";
  • Figure 7b is a graph of the pitch characteristic corresponding to Figure 7a;
  • Figure 8a is a waveform diagram of the input signal "Jinghu + French male voice"
  • Figure 8b is a graph of the high-frequency energy distribution ratio corresponding to Figure 8a;
  • Figure 9a is a waveform diagram of the input signal "Korean male + ensemble";
  • Figure 9b is a graph of the high-frequency energy distribution ratio corresponding to Figure 9a;
  • Figure 10 is a schematic flow chart 1 of the audio signal classification rule in the embodiment of the present invention.
  • Figure 11a is a waveform diagram 1 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 11b is a schematic diagram of the classification result corresponding to Figure 11a;
  • Figure 12a is a waveform diagram 2 of the input signal "Chinese female voice + ensemble + English male voice + ⁇ + German male voice + castanets";
  • Figure 12b is a schematic diagram of the smoothed classification result corresponding to Figure 12a;
  • FIG. 13 is a second schematic diagram of an audio signal classification rule according to an embodiment of the present invention.
  • Figure 14a is waveform diagram 3 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 14b is a schematic diagram of the real-time classification result corresponding to Figure 14a;
  • FIG. 15 is a flowchart of a voice classification method in a case where the output delay is not fixed according to an embodiment of the present invention;
  • Figure 16a is a waveform diagram 4 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 16b is a schematic diagram showing the classification results of the three classification methods corresponding to Figure 16a;
  • FIG. 17 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention.
  • FIG. 18 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart 1 of an audio signal classification processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step 101: Acquire at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region;
  • Step 102: According to the acquired at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, determine that the frame to be classified in the audio signal is a music signal, or determine that the frame to be classified in the audio signal is a voice signal.
  • When classifying the frames in the audio signal, the audio signal classification processing method provided by the embodiment of the present invention can output the classification result without output delay, that is, output the classification result in real time for each received audio signal frame. A certain output delay may also be allowed, that is, the classification result for a received audio signal frame is given after a delay.
  • The technical solution provided by the above embodiments of the present invention mainly considers the characteristics of the music signal. For example, the tone duration of a music signal is long while that of a voice signal is short, and the energy of a music signal can be continuously distributed in the high-frequency region or the low-frequency region, whereas the energy of a voice signal is generally not continuously distributed in the high-frequency region or the low-frequency region.
  • The number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal is obtained first, together with the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, and the frame to be classified is determined to be a music signal or a voice signal according to this information. The audio signal classification processing method provided by the above technical solution can improve the accuracy of audio signal classification and meet the requirements of voice quality assessment.
  • Depending on the output delay requirement, the following three cases may be distinguished.
  • The first is real-time output with no delay, in which the judgment is made according to the frame to be classified and the information of the N frames before it;
  • The second is to allow a smaller output delay of the classification result, that is, an output delay of L1 frames, L1 being a positive integer; the judgment can then be made according to the frame to be classified, the L1 frames before the frame to be classified, and the L1 frames after the frame to be classified;
  • The third is to allow a larger output delay of the classification result, that is, an output delay of L2+L3 frames, L2 and L3 being positive integers; the judgment is first made according to the frame to be classified, the L2 frames before the frame to be classified, and the L2 frames after the frame to be classified, to obtain a classification result of the frame to be classified, which is then revised according to the L3 frames before the frame to be classified and the L3 frames after the frame to be classified.
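  • The revision in the third case is not spelled out at this point in the text. One plausible reading, given purely as an assumption, is a majority vote over the initial results of the L3 frames before and the L3 frames after each frame. The function name `smooth_labels` and the label strings "music" and "speech" are hypothetical.

```python
from typing import List


def smooth_labels(labels: List[str], l3: int) -> List[str]:
    """Revise each initial label by majority vote over the l3 frames before
    and the l3 frames after it (window clipped at the signal edges).

    Assumes every label is either "music" or "speech". A tie keeps the
    initial label. Majority voting is an assumption, not the stated method.
    """
    smoothed = []
    for i, label in enumerate(labels):
        lo = max(0, i - l3)
        hi = min(len(labels), i + l3 + 1)
        window = labels[lo:hi]
        music = window.count("music")
        speech = len(window) - music
        if music > speech:
            smoothed.append("music")
        elif speech > music:
            smoothed.append("speech")
        else:
            smoothed.append(label)
    return smoothed
```

  • Under this reading, an isolated "music" result surrounded by "speech" results is revised to "speech", which removes short spurious switches at the cost of the extra L3 frames of delay.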
  • The first frames received of the audio signal cannot be classified; these frames can be set to a default value, the default being either a voice signal or a music signal.
  • In step 101 of the embodiment shown in FIG. 1, acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal specifically includes:
  • N1 is a positive integer
  • the obtaining of the continuous frame number of the frame to be classified in the low frequency region and/or the continuous frame number of the frame to be classified in the high frequency region in the step 102 of the embodiment shown in FIG. 1 includes:
  • In step 103 of the embodiment shown in FIG. 1, determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal, includes:
  • if the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
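  • The decision rule above can be sketched as follows. The function name `classify_frame` and the use of `None` for a quantity that was not acquired are assumptions made for illustration; the threshold values are left to the caller because the text does not give them here.

```python
from typing import Optional


def classify_frame(
    num_tonal: Optional[int],
    low_run: Optional[int],
    high_run: Optional[int],
    first_threshold: int,
    second_threshold: int,
    third_threshold: int,
) -> str:
    """Return "music" if any acquired quantity exceeds its threshold,
    otherwise "speech". A quantity passed as None is treated as not acquired."""
    if num_tonal is not None and num_tonal > first_threshold:
        return "music"
    if low_run is not None and low_run > second_threshold:
        return "music"
    if high_run is not None and high_run > third_threshold:
        return "music"
    return "speech"
```

  • Because the three conditions are joined by "or", a single long-lasting cue is enough to label the frame as music, and the voice label is the fall-through case.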
  • The acquiring of the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified includes:
  • the frequency domain distribution information of the tonal component of the frame and the to-be-classified N1 frame acquires the number of the tonal components in the frame to be classified that is greater than the sixth threshold.
  • The acquiring of the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified includes:
  • The obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, of the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • the sound distribution level acquires the number of consecutive frames in which the high frequency energy distribution ratio including the to-be-classified frame is less than an eighth threshold
  • the obtaining the continuous frame number of the frame to be classified in the high frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameter of the N1 frame before the frame to be classified includes:
  • the number of tonal components includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N2 is a positive integer;
  • the obtaining of the continuous frame number of the frame to be classified in the low frequency region and/or the continuous frame number of the frame to be classified in the high frequency region in the step 102 of the embodiment shown in FIG. 1 includes:
  • the energy distribution parameter of the L1 frame acquires the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region;
  • In step 103 of the embodiment shown in FIG. 1, determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal, includes:
  • if the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
  • The acquiring of the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified includes:
  • the pitch distribution parameter of the frame to be classified the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified include:
  • The foregoing acquiring of the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified includes:
  • the energy distribution parameter of the N2 frame and the high-frequency energy distribution ratio and the sound pressure level of the L1 frame after the frame to be classified are the energy distribution parameters of the L1 frame after the frame to be classified;
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before it, and the L1 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before it, and the L1 frames after it, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
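  • A minimal sketch of the two counts follows. It assumes that each frame is summarised as a pair (high-frequency energy distribution ratio, sound pressure level), that the count is the length of the unbroken run of qualifying frames passing through the frame to be classified, and that the count is 0 when the frame to be classified does not itself qualify. The names `run_length_through`, `low_frequency_run` and `high_frequency_run` are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Frame = Tuple[float, float]  # (high-frequency energy distribution ratio, sound pressure level)


def run_length_through(frames: Sequence[Frame], idx: int, cond: Callable[[Frame], bool]) -> int:
    """Length of the unbroken run of frames satisfying `cond` that contains
    frame `idx`; 0 if frame `idx` itself does not satisfy it (an assumption)."""
    if not cond(frames[idx]):
        return 0
    start = idx
    while start > 0 and cond(frames[start - 1]):  # extend into preceding frames
        start -= 1
    end = idx
    while end + 1 < len(frames) and cond(frames[end + 1]):  # extend into following frames
        end += 1
    return end - start + 1


def low_frequency_run(frames: Sequence[Frame], idx: int, eighth_threshold: float) -> int:
    """Consecutive frames whose high-frequency ratio is below the eighth threshold."""
    return run_length_through(frames, idx, lambda f: f[0] < eighth_threshold)


def high_frequency_run(frames: Sequence[Frame], idx: int,
                       ninth_threshold: float, tenth_threshold: float) -> int:
    """Consecutive frames whose ratio exceeds the ninth threshold and whose
    sound pressure level exceeds the tenth threshold."""
    return run_length_through(
        frames, idx, lambda f: f[0] > ninth_threshold and f[1] > tenth_threshold
    )
```

  • Passing only the frame to be classified and the frames before it gives the no-delay variant; passing the following frames as well gives the delayed variants.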
  • When the output delay of the classification result is allowed to be L2+L3 frames, that is, the classification result of the frame to be classified is obtained after a delay of L2+L3 frames, acquiring, in step 101 of the embodiment shown in FIG. 1, the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N3 is a positive integer;
  • the obtaining of the continuous frame number of the frame to be classified in the low frequency region and/or the continuous frame number of the frame to be classified in the high frequency region in the step 102 of the embodiment shown in FIG. 1 includes:
  • the energy distribution parameter of the post L2 frame acquires the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region.
  • In step 103 of the embodiment shown in FIG. 1, determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal, includes:
  • if the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
  • The acquiring of the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified includes:
  • the pitch distribution parameter of the frame to be classified the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified include:
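  • acquiring, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 frames before it, and the L2 frames after it, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold.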
  • the obtaining the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified include:
  • the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal are used as energy distribution parameters of the L2 frame after the frame to be classified;
  • The obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, of the number of consecutive frames of the frame to be classified in the low-frequency region includes:
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames before it, and the L2 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • The number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold is the number of such tonal components that are greater than the seventh threshold in the frequency domain.
  • Step 201: Perform an FFT on the current frame, that is, the i-th frame; in this step, an FFT is performed on every received frame;
  • Step 202: Obtain the pitch distribution parameter and the energy distribution parameter of the i-th frame based on the FFT result;
  • Step 203: Determine whether i > L1 holds, that is, whether L1 frames exist before the current frame. If so, go to step 204; otherwise, continue to perform steps 201 and 202 for subsequent frames;
  • Step 204: When i > L1, the audio signal classification result of the (i-L1)-th frame can be obtained. Specifically, the audio signal classification result of the (i-L1)-th frame is obtained according to the past information, that is, the pitch distribution parameters and energy distribution parameters, obtained in steps 201 and 202, of the frames before the (i-L1)-th frame; the current information, that is, the pitch distribution parameter and energy distribution parameter of the (i-L1)-th frame; and the future information, that is, the pitch distribution parameters and energy distribution parameters of the L1 frames after the (i-L1)-th frame;
  • Step 205: Output the audio signal classification result of the (i-L1)-th frame.
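  • The buffering implied by steps 201 to 205 can be sketched as a generator. The names `classify_stream`, `extract` and `classify` are hypothetical: `extract` stands in for the FFT and parameter computation of steps 201 and 202, and `classify` for the rule applied in step 204. For simplicity the sketch keeps the whole feature history in memory.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple, TypeVar

F = TypeVar("F")  # per-frame pitch and energy distribution parameters


def classify_stream(
    frames: Iterable[Sequence[float]],
    extract: Callable[[Sequence[float]], F],
    classify: Callable[[List[F], F, List[F]], str],
    l1: int,
) -> Iterator[Tuple[int, str]]:
    """Yield (frame number, class) with an output delay of l1 frames."""
    feats: List[F] = []
    for i, frame in enumerate(frames, start=1):  # i is the 1-based frame number
        feats.append(extract(frame))             # steps 201 and 202
        if i > l1:                               # step 203
            t = i - l1                           # frame whose result is now available
            past = feats[: t - 1]                # frames before frame t
            current = feats[t - 1]               # frame t itself
            future = feats[t:]                   # the l1 frames after frame t
            yield t, classify(past, current, future)  # steps 204 and 205
```

  • With l1 = 2 and five input frames, results are produced for frames 1 to 3 only; the last two frames never accumulate enough future context in this sketch.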
  • FIG. 3a is waveform diagram 1 of the input signal "French male voice + ⁇", and FIG. 3b is the spectrogram corresponding to FIG. 3a.
  • In the waveform, the sampling rate is 8 kHz, the horizontal axis is the sample point, and the vertical axis is the normalized amplitude;
  • In the spectrogram of FIG. 3b, the corresponding sampling rate is also 8 kHz, and the frequency analysis range is 0 to 4 kHz.
  • The horizontal axis is the frame, corresponding to the sample point on the horizontal axis of FIG. 3a; the vertical axis is the frequency (Hz).
  • The higher the brightness in a certain frequency range, the greater the energy of the signal in that band. If the signal continuously maintains a large energy in a certain frequency band, a "bright band" is formed on the spectrogram, which is the tone.
  • The tone duration at the fundamental frequency is slightly longer, whereas the tone duration at higher frequencies is very short.
  • In the voice signal, tones can be detected where the sound is voiced. Since voiced segments are usually short, the corresponding tone duration is also short; in the music signal in the latter half, the tone duration is significantly longer.
  • FIG. 4a is a waveform diagram of an input signal of the audio signal "Jinghu + French male voice”
  • FIG. 4b is a spectrum diagram corresponding to FIG. 4a.
  • In the waveform of FIG. 4a, the horizontal axis is the sample point and the vertical axis is the normalized amplitude; in the spectrogram of FIG. 4b, the horizontal axis is the frame and the vertical axis is the frequency (Hz).
  • The energy distribution in FIG. 4b shows that in the music signal in the first half, the energy is basically distributed above 1 kHz, in the range of 1 kHz to 4 kHz. In the speech signal in the latter half, most of the voiced energy is distributed below 1 kHz, while the unvoiced energy is distributed from low frequencies up to a higher frequency range. Therefore, the energy of the speech signal cannot be continuously distributed in a relatively high frequency range.
  • Fig. 5a is a waveform diagram of the input signal "Korean male voice + ensemble", wherein the horizontal axis is the sample point; the vertical axis is the normalized Figure 5b is a spectrogram corresponding to Figure 5a, where the horizontal axis is the frame and the vertical axis is the frequency (Hz).
  • the energy distribution can be seen by the following: The energy distribution of the speech signal in the first half of Fig. 5b is similar to the speech signal of Fig.
  • due to the different energy distribution characteristics of voiced and unvoiced sounds, the energy distribution of the speech signal fluctuates greatly. Therefore, the energy of the speech signal is neither continuously distributed in a relatively high frequency range nor continuously distributed in the low frequency range; in the music signal of the latter half, the energy is mainly distributed below 1 kHz.
  • the differences between the music signal and the speech signal mainly include: first, the tone duration of some music signals is long, while the tone duration of the speech signal is usually short; second, the energy of some music signals can be continuously distributed in a relatively high frequency range, while the energy of the speech signal cannot; third, the energy of some music signals can be continuously distributed in the low frequency region, while the energy of the speech signal cannot.
  • the division between low frequency and high frequency in the embodiments of the present invention may be determined according to the distribution area of the voice signal: the area where the voice signal is mainly distributed is defined as the low frequency area. For example, 1 kHz and below is defined as the low frequency region, and above 1 kHz is defined as the high frequency region.
  • the specific value may also be different according to the specific application scenario and the specific voice signal.
  • the features to be extracted mainly include pitch characteristics and energy characteristics. Specifically, extracting the tonal features can be divided into three steps:
  • tone component refers to a distribution form of energy in the frequency domain
  • obtaining the initial tone detection result may include: first, performing an FFT on the data of each frame to obtain a power density spectrum; second, determining the local maximum points in the power density spectrum; and finally, analyzing a number of power density spectral coefficients centered on each local maximum point to determine whether that local maximum point is a true tonal component.
  • the sampling rate of the input signal is 8 kHz, the effective bandwidth is 4 kHz, and the FFT size is 1024.
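The three detection steps (FFT to power density spectrum, local maxima, neighbor analysis) can be illustrated with a minimal sketch. The source does not preserve the exact neighbor test, so the rule used here is an assumption: a local maximum is accepted when it exceeds the coefficients at offsets ±2 and ±3 by at least 7 dB. The function names, the offsets, and the margin are all hypothetical.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def power_spectrum_db(frame):
    """Power density spectrum in dB for bins 0 .. N/2 - 1."""
    spec = fft(frame)
    half = len(frame) // 2
    # The small floor avoids log10(0) for silent bins.
    return [10.0 * math.log10(abs(spec[f]) ** 2 + 1e-12) for f in range(half)]


def initial_tone_flags(frame, offsets=(2, 3), margin_db=7.0):
    """Return a 0/1 list; 1 marks a bin judged to hold a tonal component.

    Assumed rule (not from the source): a local maximum must exceed the
    coefficients at +/-offsets by at least margin_db.
    """
    p = power_spectrum_db(frame)
    flags = [0] * len(p)
    reach = max(offsets)
    for f in range(reach, len(p) - reach):
        is_local_max = p[f] > p[f - 1] and p[f] >= p[f + 1]
        if not is_local_max:
            continue
        if all(p[f] - p[f + d] >= margin_db and p[f] - p[f - d] >= margin_db
               for d in offsets):
            flags[f] = 1
    return flags
```

With an 8 kHz sampling rate and a 1024-point FFT, bin f corresponds to f × 8000 / 1024 Hz, so a 500 Hz sinusoid falls on bin 64.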
  • the local maximum point of the power density spectrum is In this embodiment, how to select a plurality of power density spectral coefficients centered on the local maximum point is relatively flexible, and can be set according to an algorithm. For example, it can be implemented as follows
  • tonal_flag_original[k][f] represents the initial tone detection result:
  • a value of 1 indicates that the k-th frame data has a tonal component at f;
  • a value of 0 indicates that the k-th frame data does not have a tonal component at f.
  • the L1 frames of data located before the kth frame are referred to as past frames;
  • the L1 frames of data located after the kth frame are referred to as future frames.
  • suppose the kth frame data has a tonal component at f, that is, tonal_flag_original[k][f] = 1.
  • the steps of the tone continuity analysis are:
  • Step 2: Count the number of future frames whose tonal components are continuous with this tonal component, expressed as num_right. Similar to step 1 above, sequentially detect whether there is continuity between the tonal components of the kth frame, the (k+1)th frame, and so on, and output num_right.
  • Step 3: Filter the initial tone detection results according to num_left and num_right. If one of the following conditions is met:
  • num_right ≥ a3, it indicates that the tonal component of the kth frame at f has a certain continuity, and the initial tone detection result is retained; otherwise it is not retained.
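The continuity analysis can be sketched as follows. Several details are lost in the source, so this is only an illustration under stated assumptions: "continuity" is taken to mean that a neighboring frame has a tonal component within ±1 bin, the walk stops after a small number of misses, and a detection is kept when num_left + num_right ≥ a1, num_left ≥ a2, or num_right ≥ a3. Only the num_right ≥ a3 condition is legible in the text; the other two conditions, the tolerance, the miss limit, and all function names are hypothetical.

```python
def count_continuous_frames(flags, k, f, step, max_frames, tol=1, max_misses=2):
    """Count frames, walking from frame k in direction step (-1 = past,
    +1 = future), that carry a tonal component within tol bins of the track."""
    count = 0
    misses = 0
    cur = f
    for i in range(1, max_frames + 1):
        j = k + step * i
        if j < 0 or j >= len(flags) or misses >= max_misses:
            break
        lo = max(0, cur - tol)
        hi = min(len(flags[j]) - 1, cur + tol)
        near = [g for g in range(lo, hi + 1) if flags[j][g] == 1]
        if near:
            count += 1
            # Follow the closest component so slow frequency drift is tracked.
            cur = min(near, key=lambda g: abs(g - cur))
        else:
            misses += 1
    return count


def filter_tone_flags(flags, l1, a1, a2, a3):
    """Keep only the initial detections that show enough continuity."""
    filtered = [[0] * len(row) for row in flags]
    for k, row in enumerate(flags):
        for f, flag in enumerate(row):
            if flag != 1:
                continue
            num_left = count_continuous_frames(flags, k, f, -1, l1)
            num_right = count_continuous_frames(flags, k, f, +1, l1)
            # Assumed retention rule; a1, a2, a3 are placeholder thresholds.
            if num_left + num_right >= a1 or num_left >= a2 or num_right >= a3:
                filtered[k][f] = 1
    return filtered
```

In this sketch a steady tone survives the filter, while an isolated detection that appears in a single frame is discarded.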
  • Fig. 6a is a waveform diagram 2 of the input signal "French male voice + ⁇ "
  • Fig. 6b is the initial tone detection result of the input signal shown in Fig. 6a, wherein the horizontal axis is a frame, and the horizontal axis of Fig.
  • the tone feature extraction is performed, wherein for the filtered tone detection result, the number of tonal components per frame from the lower frequency to the high frequency range (corresponding to fl4 ⁇ ⁇ F / 2 ) is expressed as
  • Fig. 7a is a waveform diagram 3 of the input signal "French male voice + ⁇ "
  • Fig. 7b is a graph of the tonal features corresponding to Fig. 7a.
  • the horizontal axis is a frame, and the graph
  • num_tonal_flag is always 0, which is significantly different from the tonal characteristics of the second half.
  • the energy feature extraction method in the above embodiment of the present invention is as follows. Before extracting the energy features, the high frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level of each frame must first be calculated, where k is the frame index.
  • Im_ (/) is the imaginary part of the FFT transform of the k-th frame.
  • the denominator represents the total energy of the kth frame; the numerator represents the kth frame at
  • a small ratio_energy_hf(k) indicates that the energy of the kth frame is mainly distributed at low frequencies; conversely, a large value indicates that the energy of the kth frame is mainly distributed in a higher frequency range.
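A minimal sketch of the high-frequency energy distribution ratio follows: the denominator is the total energy of the frame and the numerator is the energy above the low/high boundary, with each bin's energy taken as the sum of the squared real and imaginary FFT parts. The 1 kHz boundary follows the example given earlier for the low/high division; the function signature and the handling of an all-zero frame are assumptions.

```python
def ratio_energy_hf(spectrum, sample_rate_hz, fft_size, split_hz=1000.0):
    """Fraction of frame energy located above split_hz.

    spectrum holds the complex FFT coefficients for bins 0 .. N/2 - 1.
    """
    total = 0.0
    high = 0.0
    for f, c in enumerate(spectrum):
        energy = c.real ** 2 + c.imag ** 2
        total += energy
        # Bin f corresponds to f * sample_rate / fft_size Hz.
        if f * sample_rate_hz / fft_size > split_hz:
            high += energy
    # Assumed convention: a silent frame has ratio 0.
    return high / total if total > 0.0 else 0.0
```

A value near 1 means the frame's energy sits almost entirely above the boundary, and a value near 0 means it sits below it.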
  • the distribution characteristics of energy at high frequencies and the distribution characteristics of energy at low frequencies are further analyzed.
  • Fig. 8a is a waveform diagram of the input signal "Jinghu + French male voice”.
  • FIG. 8b is a graph of the high-frequency energy distribution ratio ratio_energy_hf(k) corresponding to Fig. 8a, wherein the horizontal axis is the frame, corresponding to the sample point on the horizontal axis of Fig. 8a, and the vertical axis is the high-frequency energy distribution ratio.
  • the variation of the high-frequency energy distribution ratio curve can be seen from Figure 8b:
  • the high-frequency energy distribution ratio is substantially greater than 0.8, indicating that the energy of the Jinghu signal can be continuously distributed in the higher frequency range;
  • num_big_ratio_energy_left represents the number of past frames, among the L1 frames before the kth frame, whose energy is continuously distributed at high frequencies;
  • num_big_ratio_energy_right represents the number of future frames, among the L1 frames after the kth frame, whose energy is continuously distributed at high frequencies.
  • Step 1: Initialize num_big_ratio_energy_left to 0;
  • Step 2: Initialize the variable num_non_big_ratio to 0;
  • Step 3: Check whether ratio_energy_hf(k-1) and the sound pressure level of the (k-1)th frame satisfy the following conditions:
  • similar to step 3, it is sequentially detected whether the energy of the (k-2)th frame, the (k-3)th frame, and so on is continuously distributed in a higher frequency range. Before each detection, the size of num_non_big_ratio must first be checked: if num_non_big_ratio ≥ a8, the number of frames whose energy is not continuously distributed in the higher frequency range has exceeded the preset range, so detection stops and num_big_ratio_energy_left is output. If num_non_big_ratio < a8, that number is still within the preset range, and detection continues until the past L1 frames have been checked, after which num_big_ratio_energy_left is output. The steps to obtain num_big_ratio_energy_right are similar: detect whether the energy of the (k+1)th frame and of the following frames is continuously distributed in a higher frequency range, and output num_big_ratio_energy_right.
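The counting procedure can be sketched in a few lines. The per-frame test ratio_energy_hf > a6 and sound pressure level > a7 is taken from the classification rule described later in the text; treating a6, a7 and a8 as plain function arguments, and the function name itself, are assumptions made for illustration.

```python
def count_big_ratio_energy(ratio, spl, k, step, l1, a6, a7, a8):
    """Count frames near frame k whose energy sits in the higher band.

    step = -1 walks the past L1 frames (num_big_ratio_energy_left);
    step = +1 walks the future L1 frames (num_big_ratio_energy_right).
    """
    num_big = 0
    num_non_big = 0  # frames that failed the test (num_non_big_ratio)
    for i in range(1, l1 + 1):
        # Stop once too many frames have failed the high-frequency test.
        if num_non_big >= a8:
            break
        j = k + step * i
        if j < 0 or j >= len(ratio):
            break
        if ratio[j] > a6 and spl[j] > a7:
            num_big += 1
        else:
            num_non_big += 1
    return num_big
```

Because failures are tolerated up to a8, a short dip in the ratio does not end the count, which matches the idea that music energy "basically" stays in the higher band.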
  • Figure 9a shows the waveform of the input signal "Korean male voice + ensemble".
  • Figure 9b is a graph of the high frequency energy distribution ratio ratio_energy_hf(k) corresponding to Fig. 9a.
  • the horizontal axis is the frame; the vertical axis is the high frequency energy distribution ratio.
  • by observing the change of the high-frequency energy distribution ratio curve shown in Figure 9b, it can be seen that in the speech signal of the first half, the curve fluctuates greatly, indicating that the energy of the speech signal cannot be continuously distributed at low frequencies. In the music signal of the latter half, the high-frequency energy distribution ratio is substantially less than 0.1, indicating that the energy of the ensemble signal can be continuously distributed at low frequencies.
  • num_small_ratio_energy_left indicates the number of past frames whose energy can be continuously distributed at low frequencies; num_small_ratio_energy_right indicates the corresponding number of future frames.
  • TM m_sm ⁇ _ ra ⁇ _ e / ⁇ _ fe / t is not only drawn for the last L1 frame data analysis, and a ratio -energy _hf ⁇ i) ⁇ i ⁇ 0 ) f will be updated
  • Step 1: Initialize num_small_ratio_energy_right to 0;
  • Step 2: Sequentially detect whether the high-frequency energy distribution ratio ratio_energy_hf(i) of the (k+1)th frame, the (k+2)th frame, and so on satisfies the condition ratio_energy_hf(i) < a9. If the condition is not met, detection stops and num_small_ratio_energy_right is output; if the condition is met,
  • num_small_ratio_energy_right = num_small_ratio_energy_right + 1, and detection continues with the following frames until the future L1 frames have been checked, after which num_small_ratio_energy_right is output.
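The two steps above amount to counting an unbroken run of low-ratio frames. A minimal sketch, assuming the per-frame ratios are held in a list indexed by frame number (the function name and argument layout are hypothetical):

```python
def count_small_ratio_energy_right(ratio, k, l1, a9):
    """Count consecutive future frames (at most l1) with ratio below a9."""
    count = 0
    last = min(k + l1, len(ratio) - 1)
    for i in range(k + 1, last + 1):
        # The run ends at the first frame that is not low-frequency dominated.
        if not ratio[i] < a9:
            break
        count += 1
    return count
```

Unlike the high-frequency count, this walk tolerates no failures: the first frame at or above a9 ends the run.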
  • the classification rule may be as shown in FIG. 10.
  • the classification rule may include the following steps:
  • Step 301: Determine whether the number of tonal components is greater than 0, that is, num_tonal_flag > 0. If the condition is met, the initial classification result may be output as a music signal; otherwise, continue to analyze the energy features. Step 302: Analyze the distribution characteristics of energy in the higher frequency range, first judging whether ratio_energy_hf(k) > a6 && spl(k) > a7. If yes, go to step 303, otherwise execute the step
  • Step 303: Determine whether num_big_ratio_energy_left ≥ a11 is satisfied, or whether num_big_ratio_energy_left + num_big_ratio_energy_right ≥ a10 is satisfied, or
  • Step 304 Determine whether the high frequency energy distribution ratio is less than a9, that is,
  • Step 305: Determine whether num_small_ratio_energy_left ≥ a13 is satisfied, or num_small_ratio_energy_left + num_small_ratio_energy_right ≥ a12, or num_small_ratio_energy_right exceeds a preset threshold. If so, the initial classification result is output as a music signal; otherwise the initial classification result is output as a voice signal.
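The decision flow of steps 301 to 305 can be sketched as a small function. Parts of the conditions are illegible in the source, so the exact comparisons, the grouping of thresholds into a `Thresholds` record, and the fall-through from step 303 to step 304 are assumptions; all class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FrameFeatures:
    num_tonal_flag: int      # tonal components that passed the continuity filter
    ratio_energy_hf: float   # high-frequency energy distribution ratio
    spl: float               # sound pressure level
    num_big_left: int        # num_big_ratio_energy_left
    num_big_right: int       # num_big_ratio_energy_right
    num_small_left: int      # num_small_ratio_energy_left
    num_small_right: int     # num_small_ratio_energy_right


@dataclass
class Thresholds:
    ratio_high: float  # plays the role of a6
    spl_min: float     # plays the role of a7
    ratio_low: float   # plays the role of a9
    big_sum: int       # limit on left + right high-frequency counts
    big_side: int      # limit on a single-side high-frequency count
    small_sum: int     # limit on left + right low-frequency counts
    small_side: int    # limit on a single-side low-frequency count


def classify_frame(x, t):
    """Return 'music' or 'speech' for one frame (initial classification)."""
    # Step 301: any persistent tonal component marks the frame as music.
    if x.num_tonal_flag > 0:
        return "music"
    # Steps 302-303: energy persistently in the higher frequency range.
    if x.ratio_energy_hf > t.ratio_high and x.spl > t.spl_min:
        if (x.num_big_left >= t.big_side or x.num_big_right >= t.big_side
                or x.num_big_left + x.num_big_right >= t.big_sum):
            return "music"
    # Steps 304-305: energy persistently in the low frequency range.
    if x.ratio_energy_hf < t.ratio_low:
        if (x.num_small_left >= t.small_side or x.num_small_right >= t.small_side
                or x.num_small_left + x.num_small_right >= t.small_sum):
            return "music"
    return "speech"
```

The ordering mirrors the text: the tonal feature is checked first, and the energy features are consulted only when no persistent tone is found.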
  • Figure 11a is a waveform diagram of the input signal "Chinese female + ensemble + English male + cymbal + German male + castanets". The three music segments (ensemble, cymbal and castanets) are each fairly typical in terms of tonal characteristics or energy characteristics;
  • Figure 11b is a schematic diagram of the classification result corresponding to Figure 11a, wherein the horizontal axis is the sample point and the vertical axis is the classification result: a value of 0 corresponds to the speech signal and a nonzero value corresponds to the music signal. From bottom to top, the vertical axis gives four classification results:
  • MUSIC_tone feature: the classification result obtained using only the tone feature, shown as a solid line. It shows which signals in Figure 11a are suited to the classification rule for tonal features; MUSIC_energy feature_1: the classification result obtained using only "energy feature_1", shown as a dotted line. Here "energy feature_1" refers to whether the energy can be continuously distributed in a higher frequency range. It shows which signals in Fig. 11a are suited to the classification rule for the high-frequency distribution characteristics of the energy;
  • for the ensemble signal between sample points 100000 and 300000: the energy of this music segment fluctuates greatly, and only a few frames have energy that is continuously distributed in a higher frequency range, so energy feature_1/2 has no effect.
  • the pitch of the segment signal has good persistence and can be detected by using the tonal feature;
  • the energy of the segment signal is mainly distributed in the low frequency, and can be detected by using the energy feature_2;
  • for the castanets signal after sample point 600000: almost no tonal component can be detected in this segment, so the tonal feature does not work.
  • the energy of this segment of the signal is mainly distributed at high frequencies and can be detected by the energy characteristic _1.
  • the technical solution provided by the embodiment of the present invention can also be applied to an application scenario with a large output delay.
  • the output delay is L2+L3
  • the first embodiment may be provided according to the foregoing embodiment.
  • in the technical solution, when i > L2, the audio signal classification result of the (i-L2)th frame is obtained according to the past information, that is, the tone distribution parameters and energy distribution parameters of several frames before the (i-L2)th frame; the current information, that is, the tone distribution parameter and energy distribution parameter of the (i-L2)th frame; and the future information, that is, the tone distribution parameters and energy distribution parameters of the L2 frames after the (i-L2)th frame. For the specific implementation, refer to the above embodiment. Further, when i > (L2+L3), smoothing is performed: according to the initial classification results of the N4 frames before the (i-L2-L3)th frame to be classified and of the L3 frames after it, the initial classification result of the (i-L2-L3)th frame is corrected.
  • the foregoing N4 frames may be the preceding L3 frames. For the kth frame, the correction process is:
  • the initial classification results of the L3 frames located before the kth frame and the L3 frames located after the kth frame are counted, and the number of frames classified as a music signal, num_music, and the number of frames classified as a voice signal, num_non_music, are acquired.
  • if the initial classification result of the kth frame is a voice signal and the corresponding condition on num_music is met, the classification result of the kth frame is corrected to a music signal; if the initial classification result of the kth frame is a music signal and num_non_music ≥ a14, the classification result of the kth frame is corrected to a speech signal.
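The smoothing step is essentially a majority-style vote over the neighboring initial results. A minimal sketch, assuming labels are stored as 1 for music and 0 for voice, and assuming the same threshold a14 is applied in both directions (the source only shows it legibly for one direction); the function name is hypothetical.

```python
def smooth_classification(initial, l3, a14):
    """Correct isolated labels using the L3 frames on each side.

    initial is a list of labels: 1 = music, 0 = voice.
    """
    corrected = list(initial)
    for k, label in enumerate(initial):
        # Neighbors only; the kth frame itself is excluded from the count.
        window = initial[max(0, k - l3):k] + initial[k + 1:k + 1 + l3]
        num_music = sum(1 for v in window if v == 1)
        num_non_music = len(window) - num_music
        if label == 0 and num_music >= a14:
            corrected[k] = 1
        elif label == 1 and num_non_music >= a14:
            corrected[k] = 0
    return corrected
```

The counts are always taken from the initial results rather than from already corrected labels, so the outcome does not depend on the order in which frames are visited.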
  • Figure 12a is a waveform diagram of the input signal "Chinese female + ensemble + English male + cymbal + German male + castanets", the same signal as in Figure 11a. Figure 12 shows the smoothed results; from bottom to top, the vertical axis gives two types of classification results:
  • MUSIC_smoothed result: the result obtained by smoothing the initial classification result, shown as a dotted line.
  • between sample points 250,000 and 300,000, the initial classification result has a misjudgment, and the music signal is misjudged as a voice signal;
  • for the cymbal signal between sample points 400,000 and 550,000, the initial classification result has a misjudgment at the end of the signal, and the music signal is misjudged as a speech signal.
  • the above misjudgments are corrected by smoothing.
  • the principle and the steps of acquiring the tone distribution parameter and the energy distribution parameter are similar to the above technical solution, except that only the past information and the current information are referenced: because there is no output delay, the classification result must be obtained in real time, and future information cannot be referenced.
  • the tone feature can be extracted by referring to the foregoing embodiment, and can be divided into three steps:
  • step A reference may be made to the above embodiment, and the following mainly describes the steps B and C.
  • tonal_flag_original[k][f] represents the initial tone detection result: a value of 1 indicates that the k-th frame data has a tonal component at f;
  • a value of 0 indicates that the k-th frame data does not have a tonal component at f.
  • the L1 frame data located before the k-th frame is referred to as a past frame.
  • the steps of the tone continuity analysis are:
  • Step 1: Count the number of past frames whose tonal components are continuous with this tonal component, expressed as num_left. Initialize the variable num_left to 0, and initialize the variable indicating discontinuity
  • similar to the above, it is sequentially detected whether there is continuity between the tonal components of the (k-1)th frame, the (k-2)th frame, and so on and the tonal component of the preceding frame. Before each detection, the size of the variable indicating discontinuity must first be checked:
  • Step 2: Filter the initial tone detection result according to num_left;
  • This feature refers to the number of frames of the past frame in which the energy can be continuously distributed in the L1 frame data before the kth frame.
  • Step 1: Initialize num_big_ratio_energy_left to 0;
  • Step 2: Initialize the variable num_non_big_ratio to 0;
  • Step 3: Check whether ratio_energy_hf(k-1) and the sound pressure level of the (k-1)th frame satisfy the preset conditions. If the conditions are not satisfied, the energy of the (k-1)th frame data is not distributed in a higher frequency range, and this event is recorded: num_non_big_ratio = num_non_big_ratio + 1. If the conditions are met, the energy of the (k-1)th frame data is continuously distributed in the higher frequency range:
  • similar to step 3, it is sequentially detected whether the energy of the (k-2)th frame, the (k-3)th frame, and so on is continuously distributed in a higher frequency range. Before each detection, it is first necessary to check
  • num_small_ratio_energy_left: this feature refers to the number of past frames whose energy can be continuously distributed at low frequencies.
  • the classification rule may be as shown in FIG. 13, and for the k-th frame data, it may include the following steps:
  • Step 401: Determine whether the number of tonal components is greater than 0, that is, num_tonal_flag > 0. If the condition is met, the initial classification result may be output as a music signal; otherwise, continue to analyze the energy features;
  • Step 402 Analyze the distribution characteristics of energy in a higher frequency range, first determine
  • step 403 If yes, execute step 403, otherwise execute the step
  • Step 403: Determine whether num_big_ratio_energy_left ≥ b8 is satisfied. If yes, output the initial classification result as a music signal; otherwise, perform step 404;
  • Step 404 determining whether the high frequency energy distribution ratio is less than b7, that is,
  • Step 405: Determine whether num_small_ratio_energy_left ≥ b9 is satisfied. If it is satisfied, the initial classification result is output as a music signal; otherwise the initial classification result is output as a voice signal.
  • Figure 14a is waveform diagram 3 of the input signal "Chinese female + ensemble + English male + cymbal + German male + castanets", the same signal as in Figure 11a. The three music segments (ensemble, cymbal and castanets) are each fairly typical in terms of tonal features or energy characteristics.
  • Figure 14b gives an example of real-time classification results, where the horizontal axis is the sample point and the vertical axis is the classification result: a value of 0 corresponds to the speech signal and a nonzero value corresponds to the music signal. It can be seen from Fig. 14a and Fig. 14b that, since there is no future information for reference, a small part of the music signal is misjudged as a voice signal.
  • FIG. 15 is a flowchart of a voice classification method in a case where an output delay is not fixed according to an embodiment of the present invention, and as shown in FIG. 15, the following steps are included:
  • Step 501 performing an FFT transformation on the i-th frame of the current frame
  • Step 502 Obtain a tone distribution parameter of the ith frame and cache according to the FFT transform result.
  • Step 503 Obtain an energy distribution parameter of the ith frame and cache according to the FFT transform result.
  • Step 504 Generate and cache a real-time classification result of the ith frame.
  • in this step, the past information generated and cached in step 502 and step 503, that is, the tone distribution parameters and energy distribution parameters of the frames before the ith frame, is used to obtain the tonal feature and the energy feature of the ith frame, and to generate and cache the real-time classification result.
  • Step 505: When i > L1, where L1 is a small allowed output delay, in addition to the real-time classification result of each received frame, the initial classification result of the (i-L1)th frame may also be generated and cached. Specifically, when generating the initial classification result of the (i-L1)th frame, reference may be made to the past information, that is, the tone distribution parameters and energy distribution parameters of several frames before the (i-L1)th frame, and the current information, that is, the tone distribution parameter and energy distribution parameter of the (i-L1)th frame.
  • Step 506: When i > (L2+L3), generate and cache the corrected classification result of the (i-L2-L3)th frame. Specifically, with reference to the past information, that is, the initial classification results of several frames before the (i-L2-L3)th frame, and the future information, that is, the initial classification results of the L3 frames located after the (i-L2-L3)th frame, the initial classification result of the (i-L2-L3)th frame is corrected; for the specific implementation, see the foregoing embodiment.
  • Step 507 Select, according to the allowed output delay, the classification result of the foregoing step 504, step 505, and step 506 as the classification result of the jth frame of the to-be-classified frame:
  • the suboptimal result is output, that is, the initial classification result of the jth frame;
  • the zero delay result is output, that is, the real time classification result of the jth frame.
  • the value of L2 can be set equal to L1.
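The selection in step 507 can be illustrated with a short sketch. The exact delay conditions are not legible in the source; this version assumes that the corrected result needs an allowed delay of at least L2 + L3 frames, the initial result needs at least L2 frames, and the real-time result is the fallback. The dictionary caches and the function name are hypothetical.

```python
def select_result(j, allowed_delay, l2, l3, realtime, initial, corrected):
    """Pick the best cached classification for frame j within the allowed delay.

    realtime, initial and corrected map frame index -> label and stand in
    for the caches filled in steps 504, 505 and 506.
    """
    # Optimal result: smoothed using L3 future initial results.
    if allowed_delay >= l2 + l3 and j in corrected:
        return corrected[j]
    # Suboptimal result: initial classification using L2 future frames.
    if allowed_delay >= l2 and j in initial:
        return initial[j]
    # Zero-delay result.
    return realtime[j]
```

Keeping all three caches lets the caller trade latency for accuracy per request without recomputing any features.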
  • Figure 16a is waveform diagram 4 of the input signal "Chinese female + ensemble + English male + cymbal + German male + castanets", the same signal as in Figure 11a. The three music segments (ensemble, cymbal and castanets) are each fairly typical in terms of tonal features or energy characteristics.
  • Figure 16b shows the classification results obtained by the three classification methods. The three classification results given on the vertical axis are: the MUSIC_real-time classification result, indicated by a solid line; the MUSIC_initial classification result, indicated by a dotted line; and the MUSIC_corrected classification result, indicated by a dotted line.
  • the extracted feature can reflect the more essential features of the music signal different from the voice signal, so that the classification accuracy rate at the low sampling rate is significantly improved. Since the method for extracting features of the technical solution of the embodiment of the present invention is not limited to the sampling rate, it is applicable not only to a low sampling rate but also to signal classification at a high sampling rate. Under the premise of ensuring low algorithm complexity, users can flexibly select real-time classification results, sub-optimal classification results or optimal classification results according to their needs.
  • FIG. 17 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention. As shown in FIG. 17, the apparatus includes a first obtaining module 11 and a classification determining module 12, wherein the first acquiring module 11 is configured to acquire an audio signal.
  • the classification determining module 12 is configured to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the to-be-classified frame, the number of persistent frames of the to-be-classified frame in the low frequency region, and the number of persistent frames of the to-be-classified frame in the high frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal.
  • the technical solution provided by the above embodiments of the present invention mainly considers the characteristics of the music signal, for example, the tone duration of the music signal is long, and the tone duration of the voice signal is short, and the energy of the music signal can be continuously distributed in the high frequency region. Or a low frequency region, and the speech signal is generally not continuously distributed in the high frequency region or the low frequency region.
  • the number of tonal components satisfying the continuity constraint in the frame to be classified of the audio signal is obtained first, together with the number of persistent frames of the frame to be classified in the low frequency region and/or the number of persistent frames of the frame to be classified in the high frequency region. The type of the frame to be classified, music signal or voice signal, is then determined according to this information. The audio signal classification processing method provided by the above technical solution can improve the accuracy of audio signal classification and meet the requirements of voice quality assessment.
  • the execution steps of each module may be different according to the presence or absence of the output delay and the output delay length, and specifically include the following situations:
  • the first acquisition module is specifically configured to acquire the tone distribution parameters of the to-be-classified frame in the audio signal and of the N1 frames before the frame to be classified, and to obtain, according to those tone distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; or is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and to obtain, according to those energy distribution parameters, the number of consecutive frames of the frame to be classified in the low frequency region or the number of consecutive frames of the frame to be classified in the high frequency region;
  • the classification determining module 12 is specifically configured to determine that the frame to be classified in the audio signal is a music signal when the number of tonal components satisfying the continuity constraint in the to-be-classified frame is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high frequency region is greater than a third threshold; otherwise, it is determined that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module obtains the pitch distribution parameter of the frame to be classified in the audio signal, and the pitch distribution parameters of the N1 frame before the frame to be classified include:
  • the frequency domain distribution information of the tonal component is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N1 frame is used as the pitch distribution parameter of the pre-frame N1 frame to be classified.
  • the classification determining module obtains, according to the pitch distribution parameter of the frame to be classified, and the pitch distribution parameter of the pre-frame N1 frame, the number of tonal components satisfying the continuity constraint in the frame to be classified, including:
  • the module obtains an energy distribution parameter of the frame to be classified in the audio signal, and an energy distribution parameter of the N1 frame before the frame to be classified includes:
  • the foregoing classification determining module acquires the continuous frame number of the frame to be classified in the low frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameter of the N1 frame before the frame to be classified, including:
  • the foregoing classification determining module acquires the continuous frame number of the to-be-classified frame in the high frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameter of the N1 frame before the frame to be classified Includes:
  • the second acquiring module is configured to obtain a to-be-classified frame in the audio signal, a pre-framed N2 frame, and a to-be-classified.
  • N2 is a positive integer; or, specifically, is used to obtain a frame to be classified in the audio signal, and an energy distribution parameter of the N2 frame before the frame to be classified and the L1 frame after the frame to be classified, and according to the frame to be classified in the audio signal, Obtaining, according to the energy distribution parameter of the pre-frame N2 frame and the L1 frame after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region or the number of consecutive frames of the frame to be classified in the high frequency region;
  • the classification determining module is specifically configured to determine that the frame to be classified in the audio signal is a music signal when the number of tonal components satisfying the continuity constraint in the to-be-classified frame is greater than a first threshold, the number of persistent frames of the to-be-classified frame in the low-frequency region is greater than a second threshold, or the number of persistent frames of the to-be-classified frame in the high frequency region is greater than a third threshold; otherwise, the frame to be classified in the audio signal is determined to be a voice signal.
  • the first acquiring module acquires a pitch distribution parameter of the frame to be classified in the audio signal, a pitch distribution parameter of the N2 frame before the frame to be classified, and a pitch distribution parameter of the L1 frame after the frame to be classified includes:
  • the frequency domain distribution information of the tonal component of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N2 frame is used as the pitch distribution parameter of the N2 frame before the frame to be classified.
  • the frequency domain distribution information of the tonal components of the L1 frame after the frame frame to be classified is used as the pitch distribution parameter of the L1 frame after the frame frame to be classified.
  • the classification determining module obtains the number of tonal components satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified.
  • the frequency domain distribution information of the tonal components of the N2 frames before the frame to be classified and the L1 frames after the frame to be classified is used to acquire the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
  • the first acquiring module acquires the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameter of the N2 frame before the frame to be classified and the energy distribution parameter of the L1 frame after the frame to be classified include:
  • the energy distribution parameter of the N2 frame and the high frequency energy distribution ratio and the sound pressure level of the L1 frame after the frame frame to be classified are used as the energy distribution parameters of the L1 frame after the frame to be classified.
  • the classification determining module acquiring the number of consecutive frames of the frame to be classified in the low frequency region, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, includes:
  • the classification determining module acquiring the number of consecutive frames of the frame to be classified in the high frequency region, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, includes:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the third is to obtain a classification result of the to-be-classified frame, and the L2 and L3 are positive integers, and the first acquiring module is specifically configured to acquire a frame to be classified in the audio signal, and the N3 frame to be classified before the frame.
  • N3 is a positive integer; or, specifically, for acquiring a frame to be classified in the audio signal, and an energy distribution parameter of the N3 frame before the frame to be classified and the L2 frame after the frame to be classified, and according to the audio signal
  • the number of consecutive frames in the low frequency region or the number of consecutive frames in the high frequency region of the frame to be classified is obtained by the energy distribution parameter of the frame to be classified, the N3 frame before the frame to be classified, and the L2 frame after the frame to be classified;
  • the classification processing module is specifically configured to: in the frame to be classified, the number of tonal components satisfying the continuity constraint is greater than a first threshold, and the number of consecutive frames in the low frequency region of the to-be-
  • the classification frame is modified to a voice signal; if it is determined that the frame to be classified in the audio signal is a voice signal, it is determined whether the number of frames determined to be music signals among the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fifth threshold, and if so, the frame to be classified in the audio signal is corrected to a music signal, where N4 is a positive integer.
  • the first acquiring module obtains the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified includes:
  • the frequency domain distribution information of the tonal components of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal components of the pre-frame N3 frame is used as the pitch distribution parameter of the N3 frame before the frame to be classified.
  • the frequency domain distribution information of the tonal components of the L2 frame after the frame to be classified is used as the pitch distribution parameter of the L2 frame after the frame to be classified.
  • the classification determining module obtains the number of tonal components satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified.
  • the first acquiring module acquires an energy distribution parameter of the frame to be classified in the audio signal, and an energy distribution parameter of the N3 frame before the frame to be classified and an energy distribution parameter of the L2 frame after the frame to be classified include:
  • the energy distribution parameter of the N3 frame before the frame to be classified, and the high-frequency energy distribution ratio and the sound pressure level of the L2 frame after the frame frame to be classified are used as energy distribution parameters of the L2 frame after the frame to be classified.
  • the foregoing classification determining module acquires the number of consecutive frames of the to-be-classified frame in the low-frequency region according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified.
  • Obtaining a high-frequency energy distribution including the to-be-classified frame according to a high-frequency energy distribution ratio and a sound pressure level of the frame to be classified in the received audio signal, the pre-frame N3 frame to be classified, and the L2 frame to be classified a number of consecutive frames that are less than the eighth threshold;
  • the classification determining module acquiring the number of consecutive frames of the frame to be classified in the high frequency region, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, includes:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the number of tonal components whose number of persistent frames in the frame to be classified acquired by the first acquiring module is greater than the sixth threshold is the number of tonal components greater than the seventh threshold in the frequency domain.
  • FIG. 18 is a schematic structural diagram of an audio signal classification processing device according to an embodiment of the present invention.
  • the device includes a receiver 21 and a processor 22, where the receiver 21 is configured to receive an audio signal; the processor 22 is connected to the receiver 21 and is configured to acquire at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal received by the receiver, the number of consecutive frames of the frame to be classified in the low frequency region, and the number of consecutive frames of the frame to be classified in the high frequency region, and to determine, according to the acquired at least one of these quantities, that the frame to be classified in the audio signal is a music signal or that the frame to be classified in the audio signal is a voice signal.
  • the technical solution provided by the above embodiments of the present invention mainly considers the characteristics of music signals: the tones of a music signal last a long time, whereas the tones of a voice signal last only briefly, and the energy of a music signal can stay distributed in the high frequency region or the low frequency region, whereas the energy of a voice signal usually does not. Based on these characteristics, the technical solution first obtains the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal, and the number of consecutive frames of the frame to be classified in the low frequency region and/or in the high frequency region.
  • the audio signal classification processing method provided by the above technical solution can improve the classification accuracy of the audio signal and satisfy the requirements of voice quality assessment.
  • the processor may be implemented as a software process, or by a hardware device such as a digital signal processing (DSP) chip.
  • the processor may include the following situations according to the real-time acquisition of the classification result of the to-be-classified frame or the length of the delay of the classification result output:
  • the processor is specifically configured to acquire pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and to obtain, according to those pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; and to acquire energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and to obtain, according to those energy distribution parameters, the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region
  • the processor obtains a pitch distribution parameter of the frame to be classified in the audio signal, and a pitch distribution parameter of the N1 frame before the frame to be classified includes:
  • the frequency domain distribution information of the tonal component is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N1 frame is used as the pitch distribution parameter of the pre-frame N1 frame to be classified.
  • the processor obtaining, according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified and of the N1 frames before the frame to be classified in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold.
  • the processor acquires the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameters of the N1 frame before the frame to be classified include:
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameter of the pre-frame N1 frame, the number of consecutive frames of the frame to be classified in the low frequency region includes:
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameter of the pre-frame N1 frame, the number of consecutive frames of the frame to be classified in the high frequency region includes:
  • the second is that when the classification result of the to-be-classified frame is obtained, the L1 is a positive integer, and the processor is specifically configured to acquire a frame to be classified in the audio signal, a N2 frame before the frame to be classified, and a frame to be classified.
  • N2 is a positive integer
  • the energy distribution parameter of the L1 frame after the frame to be classified acquires the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region; satisfies continuity in the frame to be classified
  • the number of tonal components of the constraint is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or
  • the processor acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified includes:
  • the frequency domain distribution information of the tonal component of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N2 frame is used as the pitch distribution parameter of the N2 frame before the frame to be classified.
  • the frequency domain distribution information of the tonal components of the L1 frames after the frame to be classified is used as the pitch distribution parameter of the L1 frames after the frame to be classified.
  • the processor obtains, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
  • according to the frequency domain distribution information of the tonal components of the frame to be classified in the received audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold is acquired.
  • the processor acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified includes: acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratios and sound pressure levels of the N2 frames before the frame to be classified as the energy distribution parameters of the N2 frames before the frame to be classified, and the high frequency energy distribution ratios and sound pressure levels of the L1 frames after the frame to be classified as the energy distribution parameters of the L1 frames after the frame to be classified.
  • the processor obtains, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N2 frame before the frame to be classified, and the energy distribution parameter of the L1 frame after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region includes:
  • the processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high frequency region includes:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the third is that when the classification result output delay is L2+L3 frame, L2 and L3 are positive integers, and the processor is specifically configured to acquire the to-be-classified frame in the audio signal, the N3 frame before the frame to be classified, and the L2 after the frame to be classified.
  • the to-be-classified frame N3 frame and the tone distribution parameter of the L2 frame after the frame to be classified obtain the number of tonal components satisfying the continuity constraint in the frame to be classified, and N3 is positive And obtaining an energy distribution parameter of the to-be-classified frame in the audio signal, and the L3 frame of the to-be-classified frame and the L2 frame to be classified, and according to the to-be-classified frame in the audio signal, the N3 frame to be classified and the to-be-classified frame
  • the energy distribution parameter of the L2 frame after the classification frame acquires the continuous frame number of the frame to be classified in the low frequency region and/or the continuous frame number of the frame to be classified in the high frequency region; the continuity constraint is satisfied in the frame to be classified
  • the number of conditional tonal components is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number
  • the frame to be classified in the audio signal is corrected to a voice signal
  • N4 is a positive integer
  • the processor obtains a pitch distribution parameter of the frame to be classified in the audio signal, a pitch distribution parameter of the N3 frame before the frame to be classified, and a pitch distribution parameter of the L2 frame after the frame to be classified includes:
  • the frequency domain distribution information of the tonal components of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal components of the pre-frame N3 frame is used as the pitch distribution parameter of the N3 frame before the frame to be classified.
  • the frequency domain distribution information of the tonal components of the L2 frames after the frame to be classified is used as the pitch distribution parameter of the L2 frames after the frame to be classified.
  • the processor obtains, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
  • the processor acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified includes: acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratios and sound pressure levels of the N3 frames before the frame to be classified as the energy distribution parameters of the N3 frames before the frame to be classified, and the high frequency energy distribution ratios and sound pressure levels of the L2 frames after the frame to be classified as the energy distribution parameters of the L2 frames after the frame to be classified.
  • the processor obtains, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region includes:
  • the processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high frequency region includes:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the number of tonal components in the frame to be classified that are acquired by the processor that are greater than the sixth threshold is the number of tonal components that are greater than the seventh threshold in the frequency domain.
  • the aforementioned program can be stored in a computer readable storage medium. When the program is executed, the steps of the foregoing method embodiments are performed; the foregoing storage medium includes media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
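
The delayed-output case described above corrects an initial decision by looking at neighbouring frames: a frame first classified as voice is changed to music when enough of the N4 frames before it and the L3 frames after it were classified as music. The following Python fragment is a minimal sketch of that one direction only, under stated assumptions: the label strings, the function name `correct_voice_to_music`, and the list-of-labels representation are hypothetical and do not come from the patent text.

```python
from typing import List


def correct_voice_to_music(labels: List[str], index: int,
                           n4: int, l3: int, fifth_threshold: int) -> str:
    """Return the corrected label for labels[index].

    labels holds the initial per-frame decisions ("music" or "voice").
    Only the voice-to-music correction is sketched here.
    """
    if labels[index] != "voice":
        return labels[index]
    # Up to N4 frames before and up to L3 frames after the frame to be classified.
    before = labels[max(0, index - n4):index]
    after = labels[index + 1:index + 1 + l3]
    music_neighbours = sum(1 for label in before + after if label == "music")
    # Correct to music only when the count exceeds the fifth threshold.
    return "music" if music_neighbours > fifth_threshold else "voice"
```

Because the correction needs L3 future frames, a real-time caller would have to buffer that many frames before emitting a final label, which is consistent with the output delay the text mentions.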

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Provided are an audio signal classification processing method, apparatus, and device. The method comprises: obtaining at least one of the number of tonal components meeting a continuity constraint condition in a to-be-classified frame in an audio signal, the number of contiguous frames of the to-be-classified frame in the audio signal in a low frequency area, and the number of contiguous frames of the to-be-classified frame in the audio signal in a high frequency area (101); and determining whether the to-be-classified frame in the audio signal is a music signal or a voice signal according to the at least one of the number of tonal components meeting the continuity constraint condition in the to-be-classified frame in the audio signal, the number of contiguous frames of the to-be-classified frame in the audio signal in the low frequency area, and the number of contiguous frames of the to-be-classified frame in the audio signal in the high frequency area (102).

Description

Audio signal classification processing method, apparatus, and device
Technical field
Embodiments of the present invention relate to the field of signal processing technologies, and in particular, to an audio signal classification processing method, apparatus, and device.
Background
In speech quality assessment for mobile communication systems, existing speech quality assessment models are not applicable to music signals. In practice, however, the signal to be analyzed may include music signals, such as ring-back tones. The speech quality assessment model treats such a signal as a speech signal and gives an incorrect quality assessment result. To address this problem, the signal to be analyzed should be classified before it is input to the speech quality assessment module. If the segment is identified as a speech signal, it is sent to the speech quality assessment module for quality assessment; if the segment is identified as a music signal, it is not sent to the speech quality assessment module.
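
The gating described in this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: `classify` and `assess_quality` are hypothetical callables standing in for the signal classifier and the speech quality assessment module, and the `"voice"` and `"music"` labels are illustrative names.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Segment = TypeVar("Segment")
Score = TypeVar("Score")


def route_segments(segments: Iterable[Segment],
                   classify: Callable[[Segment], str],
                   assess_quality: Callable[[Segment], Score]) -> List[Optional[Score]]:
    """Assess only the segments classified as voice; skip music segments."""
    scores: List[Optional[Score]] = []
    for segment in segments:
        if classify(segment) == "voice":
            scores.append(assess_quality(segment))
        else:
            # Music (for example a ring-back tone) is never sent to the assessor.
            scores.append(None)
    return scores
```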
The prior art provides audio signal classification methods applied to joint speech and music encoders, but these methods target joint speech and music encoders with a high sampling rate. The music signals encountered by a speech quality assessment model generally lack high-frequency information, so the existing classification methods for joint speech and music encoders can identify only a small portion of the music signals, with a low classification accuracy, and cannot meet the requirements of speech quality assessment.
Summary of the invention
The present invention provides an audio signal classification processing method, apparatus, and device for improving the classification accuracy of audio signals.
A first aspect of the present invention provides an audio signal classification processing method, including:
acquiring at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in a low-frequency region, and the number of consecutive frames of the frame to be classified in a high-frequency region;
determining, according to the acquired number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, or the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a voice signal.
In a first possible implementation of the first aspect, the acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal includes:
acquiring pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames preceding the frame to be classified, and acquiring, according to the pitch distribution parameters of the frame to be classified and of the N1 frames preceding it, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer;
the acquiring the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region includes:
acquiring energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames preceding the frame to be classified, and acquiring, according to the energy distribution parameters of the frame to be classified and of the N1 frames preceding it, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, where N1 is a positive integer;
the determining, according to the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, or the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise determining that the frame to be classified in the audio signal is a voice signal, includes:
when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
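
The threshold test in this implementation can be sketched in Python as below. This is a minimal sketch, not the patented implementation: it assumes the three conditions are alternatives (any one of them is enough to decide "music"), and the `Thresholds` class, its field names, and the label strings are hypothetical. The text does not give threshold values, so the ones in the usage note are arbitrary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    tonal_count: int       # the "first threshold"
    low_band_frames: int   # the "second threshold"
    high_band_frames: int  # the "third threshold"


def classify_frame(tonal_count: int, low_band_frames: int,
                   high_band_frames: int, th: Thresholds) -> str:
    """Return "music" if any measurement exceeds its threshold, else "voice"."""
    is_music = (
        tonal_count > th.tonal_count
        or low_band_frames > th.low_band_frames
        or high_band_frames > th.high_band_frames
    )
    return "music" if is_music else "voice"
```

For example, with the arbitrary setting `Thresholds(2, 10, 10)`, a frame with three persistent tonal components is labelled music even when both frame counts are zero. The comparisons are strict, so a measurement equal to its threshold does not trigger the music decision.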
In a second possible implementation, with reference to the first possible implementation of the first aspect, the acquiring the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames preceding the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified and the N1 frames preceding the frame to be classified in the received audio signal to obtain a power density spectrum;
acquiring, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, and frequency-domain distribution information of the tonal components of the N1 frames preceding the frame to be classified as the pitch distribution parameters of the N1 frames preceding the frame to be classified;
the acquiring, according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames preceding the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
acquiring, according to the frequency-domain distribution information of the tonal components of the frame to be classified and of the N1 frames preceding the frame to be classified in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
In a third possible implementation, with reference to the first possible implementation of the first aspect, the acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames preceding the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the preceding N1 frames as the energy distribution parameters of the preceding N1 frames;
the obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the preceding N1 frames, of the number of continuous frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the preceding N1 frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
the obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the preceding N1 frames, of the number of continuous frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the preceding N1 frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
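The two run counts described above can be sketched as follows. This is a minimal illustration, assuming the per-frame high-frequency energy distribution ratios and sound pressure levels are already available as lists ordered oldest first, with the frame to be classified last. The function names and the backward-scanning interpretation of "continuous frames including the frame to be classified" are assumptions made for the example, not taken from the source.

```python
from typing import Sequence


def low_band_run(hf_ratio: Sequence[float], eighth_threshold: float) -> int:
    """Count consecutive frames, ending at the last (current) frame,
    whose high-frequency energy ratio is below the eighth threshold."""
    count = 0
    for ratio in reversed(hf_ratio):
        if ratio >= eighth_threshold:
            break
        count += 1
    return count


def high_band_run(hf_ratio: Sequence[float], spl: Sequence[float],
                  ninth_threshold: float, tenth_threshold: float) -> int:
    """Count consecutive frames, ending at the last (current) frame,
    whose high-frequency energy ratio exceeds the ninth threshold and
    whose sound pressure level exceeds the tenth threshold."""
    count = 0
    for ratio, level in zip(reversed(hf_ratio), reversed(spl)):
        if not (ratio > ninth_threshold and level > tenth_threshold):
            break
        count += 1
    return count
```

Both functions return 0 when the frame to be classified itself fails the condition, so a non-zero result always describes a run that includes that frame.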
With reference to the first aspect or any one of the foregoing possible implementations of the first aspect, in a fourth possible implementation, when the classification result of the frame to be classified is obtained with a delay of L1 frames, where L1 is a positive integer, the obtaining of the number of tonal components in the frame to be classified in the audio signal that satisfy the continuity constraint includes:
obtaining pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding the frame to be classified, and of the L1 frames following the frame to be classified, and obtaining, according to these pitch distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N2 is a positive integer;
the obtaining of the number of continuous frames of the frame to be classified in the low-frequency region and/or the number of continuous frames of the frame to be classified in the high-frequency region includes:
obtaining energy distribution parameters of the frame to be classified in the audio signal, of the preceding N2 frames, and of the following L1 frames, and obtaining, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region and/or the number of continuous frames of the frame to be classified in the high-frequency region;
the determining, according to the number of tonal components in the frame to be classified that satisfy the continuity constraint, the number of continuous frames of the frame to be classified in the low-frequency region, or the number of continuous frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise that it is a speech signal, includes:
when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal.
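The decision rule in the preceding paragraph is a logical OR over three threshold tests. A minimal sketch follows; the function name, argument names, and the string labels are hypothetical, and the threshold values are left to the caller because the text above does not fix them.

```python
def classify_frame(tonal_count: int, low_run: int, high_run: int,
                   first_threshold: int, second_threshold: int,
                   third_threshold: int) -> str:
    """Return "music" if any one of the three features exceeds its
    threshold, otherwise "speech"."""
    is_music = (
        tonal_count > first_threshold      # persistent tonal components
        or low_run > second_threshold      # long run in the low-frequency region
        or high_run > third_threshold      # long run in the high-frequency region
    )
    return "music" if is_music else "speech"
```

Because the tests are combined with OR, a feature that was not computed can be passed as 0 without affecting the other two tests, provided the thresholds are non-negative.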
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation, the obtaining of the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N2 frames preceding the frame to be classified, and the pitch distribution parameters of the L1 frames following the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified, the preceding N2 frames, and the following L1 frames in the received audio signal, to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, frequency-domain distribution information of the tonal components of the preceding N2 frames as the pitch distribution parameters of the preceding N2 frames, and frequency-domain distribution information of the tonal components of the following L1 frames as the pitch distribution parameters of the following L1 frames;
the obtaining, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the preceding N2 frames, and the pitch distribution parameters of the following L1 frames, of the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the preceding N2 frames, and the following L1 frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
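The steps above (power spectrum, tonal components per frame, count of persistent components) can be sketched as follows. Several parts are assumptions for illustration only: a naive DFT stands in for the fast Fourier transform to keep the example dependency-free, the peak-picking rule in `tonal_bins` (a local maximum that exceeds a multiple of the mean power) is a hypothetical stand-in for whatever tonal-component criterion is actually used, and a component is treated as continuous only if the same frequency bin is tonal in adjacent frames.

```python
import cmath
from typing import List, Sequence, Set


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive DFT power spectrum for bins 0..n/2 (stand-in for an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def tonal_bins(psd: Sequence[float], ratio: float = 4.0) -> Set[int]:
    """Hypothetical peak picker: local maxima above ratio * mean power."""
    floor = ratio * sum(psd) / len(psd)
    return {k for k in range(1, len(psd) - 1)
            if psd[k] > psd[k - 1] and psd[k] > psd[k + 1] and psd[k] > floor}


def count_persistent_tones(tone_sets: Sequence[Set[int]], cur: int,
                           sixth_threshold: int) -> int:
    """Count tonal bins of frame `cur` whose run of consecutive frames
    (over past and look-ahead frames) is longer than the sixth threshold."""
    count = 0
    for k in tone_sets[cur]:
        left = cur
        while left > 0 and k in tone_sets[left - 1]:
            left -= 1
        right = cur
        while right < len(tone_sets) - 1 and k in tone_sets[right + 1]:
            right += 1
        if right - left + 1 > sixth_threshold:
            count += 1
    return count
```

Here `tone_sets` holds one set of tonal bins per frame for the window of N2 preceding frames, the frame to be classified at index `cur`, and L1 following frames.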
With reference to the fourth possible implementation of the first aspect, in a sixth possible implementation, the obtaining of the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames preceding the frame to be classified, and the energy distribution parameters of the L1 frames following the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the preceding N2 frames as the energy distribution parameters of the preceding N2 frames, and the high-frequency energy distribution ratios and sound pressure levels of the following L1 frames as the energy distribution parameters of the following L1 frames;
the obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the preceding N2 frames, and the energy distribution parameters of the following L1 frames, of the number of continuous frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the preceding N2 frames, and the following L1 frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
the obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the preceding N2 frames, and the energy distribution parameters of the following L1 frames, of the number of continuous frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the preceding N2 frames, and the following L1 frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
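When look-ahead frames are available, the run that includes the frame to be classified can extend in both directions. The sketch below assumes per-frame ratio and level lists covering the N2 preceding frames, the frame to be classified at index `cur`, and the L1 following frames. The helper names and the reading of "continuous frames including the frame to be classified" as the length of the run containing `cur` are assumptions for illustration.

```python
from typing import Sequence, Tuple


def run_through(flags: Sequence[bool], cur: int) -> int:
    """Length of the run of True values that contains index `cur`
    (0 if flags[cur] is False)."""
    if not flags[cur]:
        return 0
    left = cur
    while left > 0 and flags[left - 1]:
        left -= 1
    right = cur
    while right < len(flags) - 1 and flags[right + 1]:
        right += 1
    return right - left + 1


def band_runs(hf_ratio: Sequence[float], spl: Sequence[float], cur: int,
              eighth_threshold: float, ninth_threshold: float,
              tenth_threshold: float) -> Tuple[int, int]:
    """Return (low-frequency run, high-frequency run) for frame `cur`."""
    low_flags = [r < eighth_threshold for r in hf_ratio]
    high_flags = [r > ninth_threshold and s > tenth_threshold
                  for r, s in zip(hf_ratio, spl)]
    return run_through(low_flags, cur), run_through(high_flags, cur)
```

With `cur` set to the last index, this reduces to the case without look-ahead frames.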
With reference to the first aspect or any one of the foregoing possible implementations of the first aspect, in a seventh possible implementation, when the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, where L2 and L3 are positive integers, the obtaining of the number of tonal components in the frame to be classified in the audio signal that satisfy the continuity constraint includes:
obtaining pitch distribution parameters of the frame to be classified in the audio signal, of the N3 frames preceding the frame to be classified, and of the L2 frames following the frame to be classified, and obtaining, according to these pitch distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N3 is a positive integer;
the obtaining of the number of continuous frames of the frame to be classified in the low-frequency region and/or the number of continuous frames of the frame to be classified in the high-frequency region includes:
obtaining energy distribution parameters of the frame to be classified in the audio signal, of the preceding N3 frames, and of the following L2 frames, and obtaining, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region and/or the number of continuous frames of the frame to be classified in the high-frequency region;
the determining, according to the number of tonal components in the frame to be classified that satisfy the continuity constraint, the number of continuous frames of the frame to be classified in the low-frequency region, or the number of continuous frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise that it is a speech signal, includes:
when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal;
if the frame to be classified in the audio signal is determined to be a music signal, determining whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fourth threshold, and if so, correcting the frame to be classified in the audio signal to a speech signal, where N4 is a positive integer; and
if the frame to be classified in the audio signal is determined to be a speech signal, determining whether the number of frames determined to be music signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fifth threshold, and if so, correcting the frame to be classified in the audio signal to a music signal.
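The correction step at the end of this implementation acts as a majority-style smoothing of the initial labels. A minimal sketch follows, assuming the initial per-frame labels are already available as a list of the strings "music" and "speech"; the function name and the label encoding are hypothetical.

```python
from typing import List


def smooth_label(labels: List[str], cur: int, n4: int, l3: int,
                 fourth_threshold: int, fifth_threshold: int) -> str:
    """Correct the label of frame `cur` using the N4 preceding and
    L3 following frames, as described in the two rules above."""
    neighbours = labels[max(0, cur - n4):cur] + labels[cur + 1:cur + 1 + l3]
    speech_count = neighbours.count("speech")
    music_count = neighbours.count("music")
    if labels[cur] == "music" and speech_count > fourth_threshold:
        return "speech"   # isolated music frame among speech frames
    if labels[cur] == "speech" and music_count > fifth_threshold:
        return "music"    # isolated speech frame among music frames
    return labels[cur]
```

The sketch reads the neighbouring labels from the initial classification only, so correcting one frame does not influence the correction of another; whether corrected labels should feed back into later frames is not specified in the text above.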
With reference to the seventh possible implementation of the first aspect, in an eighth possible implementation, the obtaining of the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N3 frames preceding the frame to be classified, and the pitch distribution parameters of the L2 frames following the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified, the preceding N3 frames, and the following L2 frames in the received audio signal, to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, frequency-domain distribution information of the tonal components of the preceding N3 frames as the pitch distribution parameters of the preceding N3 frames, and frequency-domain distribution information of the tonal components of the following L2 frames as the pitch distribution parameters of the following L2 frames;
the obtaining, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the preceding N3 frames, and the pitch distribution parameters of the following L2 frames, of the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the preceding N3 frames, and the following L2 frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
With reference to the seventh possible implementation of the first aspect, in a ninth possible implementation, the obtaining of the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the energy distribution parameters of the L2 frames following the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the preceding N3 frames as the energy distribution parameters of the preceding N3 frames, and the high-frequency energy distribution ratios and sound pressure levels of the following L2 frames as the energy distribution parameters of the following L2 frames;
the obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the preceding N3 frames, and the energy distribution parameters of the following L2 frames, of the number of continuous frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the preceding N3 frames, and the following L2 frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
the obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the preceding N3 frames, and the energy distribution parameters of the following L2 frames, of the number of continuous frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the preceding N3 frames, and the following L2 frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
With reference to the second, fifth, or eighth possible implementation of the first aspect, in a tenth possible implementation, the number of tonal components in the frame to be classified whose number of continuous frames is greater than the sixth threshold is the number of tonal components that are greater than a seventh threshold in the frequency domain.
A second aspect of the present invention provides an audio signal classification processing apparatus, including:
a first acquisition module, configured to obtain at least one of the following: the number of tonal components in a frame to be classified in an audio signal that satisfy a continuity constraint, the number of continuous frames of the frame to be classified in the audio signal in a low-frequency region, and the number of continuous frames of the frame to be classified in a high-frequency region; and
a classification determining module, configured to determine, according to at least one of the number of tonal components in the frame to be classified that satisfy the continuity constraint, the number of continuous frames of the frame to be classified in the low-frequency region, and the number of continuous frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a speech signal.
With reference to the second aspect, in a first possible implementation, the first acquisition module is specifically configured to obtain pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames preceding the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N1 is a positive integer; or
is specifically configured to obtain energy distribution parameters of the frame to be classified in the audio signal and of the preceding N1 frames, and to obtain, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region or the number of continuous frames of the frame to be classified in the high-frequency region;
the classification determining module is specifically configured to: when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a speech signal.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the obtaining, by the first acquisition module, of the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames preceding the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified and on the preceding N1 frames in the received audio signal, to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, and frequency-domain distribution information of the tonal components of the preceding N1 frames as the pitch distribution parameters of the preceding N1 frames;
the obtaining, by the classification determining module according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the preceding N1 frames, of the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified and of the preceding N1 frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
With reference to the first possible implementation of the second aspect, in a third possible implementation, the obtaining, by the first acquisition module, of the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames preceding the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the preceding N1 frames as the energy distribution parameters of the preceding N1 frames;
the obtaining, by the classification determining module according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the preceding N1 frames, of the number of continuous frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the preceding N1 frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
the obtaining, by the classification determining module according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the preceding N1 frames, of the number of continuous frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the preceding N1 frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
With reference to the second aspect or any one of the possible implementations of the second aspect, in a fourth possible implementation, when the classification result of the frame to be classified is obtained with a delay of L1 frames, where L1 is a positive integer, the first acquisition module is specifically configured to obtain pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding the frame to be classified, and of the L1 frames following the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N2 is a positive integer; or is specifically configured to obtain energy distribution parameters of the frame to be classified in the audio signal, of the preceding N2 frames, and of the following L1 frames, and to obtain, according to these energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region and/or the number of continuous frames of the frame to be classified in the high-frequency region;
the classification determining module is specifically configured to: when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal; otherwise, determine that the frame to be classified in the audio signal is a speech signal.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation, the obtaining, by the first acquisition module, of the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N2 frames preceding the frame to be classified, and the pitch distribution parameters of the L1 frames following the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal, to obtain a power density spectrum;
obtaining, from the power density spectrum, the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, the frequency-domain distribution information of the tonal components of the N2 preceding frames as the tone distribution parameters of those N2 frames, and the frequency-domain distribution information of the tonal components of the L1 following frames as the tone distribution parameters of those L1 frames;
The obtaining, by the classification determining module, of the number of tonal components in the frame to be classified that satisfy the continuity constraint, according to the tone distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, includes:
obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of sustained frames is greater than a sixth threshold.
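The counting step above can be illustrated with a minimal Python sketch. It assumes that the tone distribution parameter of each frame has already been reduced to a set of FFT bin indices at which tonal components were found, and that a component "persists" into a neighbouring frame when that frame has a tonal bin within a small tolerance of it. The function name `persistent_tonal_count`, the `tolerance` parameter, and the matching rule are hypothetical illustrations; this passage does not specify how persistence is measured.

```python
from typing import Sequence, Set


def persistent_tonal_count(frames: Sequence[Set[int]], target: int,
                           min_frames: int, tolerance: int = 1) -> int:
    """Count tonal bins of frames[target] that persist over more than min_frames frames.

    frames     -- one set of tonal FFT bin indices per frame (assumed input format)
    target     -- index of the frame to be classified inside `frames`
    min_frames -- plays the role of the "sixth threshold"
    tolerance  -- hypothetical allowed drift, in bins, between adjacent frames
    """
    def has_near(frame: Set[int], bin_index: int) -> bool:
        return any(abs(bin_index - other) <= tolerance for other in frame)

    count = 0
    for bin_index in frames[target]:
        run = 1  # the frame to be classified itself
        i = target - 1
        while i >= 0 and has_near(frames[i], bin_index):  # walk back over preceding frames
            run += 1
            i -= 1
        i = target + 1
        while i < len(frames) and has_near(frames[i], bin_index):  # walk forward over look-ahead frames
            run += 1
            i += 1
        if run > min_frames:
            count += 1
    return count


# Two preceding frames, the frame to be classified (index 2), one following frame.
example = [{10, 40}, {10, 41}, {11, 90}, {10}]
print(persistent_tonal_count(example, target=2, min_frames=3))
```

In this example only the component near bin 10 spans all four frames, so it is the only one counted; the component at bin 90 appears in a single frame and is ignored.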
With reference to the fourth possible implementation of the second aspect, in a sixth possible implementation, the acquiring, by the first acquiring module, of the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames preceding the frame to be classified, and the energy distribution parameters of the L1 frames following the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of those N2 frames, and the high-frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of those L1 frames;
The obtaining, by the classification determining module, of the number of sustained frames of the frame to be classified in the low-frequency region, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the audio signal, includes:
obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
The obtaining, by the classification determining module, of the number of sustained frames of the frame to be classified in the high-frequency region, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the audio signal, includes:
obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
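The two run-length measurements above can be sketched as follows. The sketch assumes that the high-frequency energy distribution ratio and the sound pressure level are already available per frame, and it interprets "number of sustained frames including the frame to be classified" as the length of the unbroken run of qualifying frames that passes through that frame. The function names and all numeric values are hypothetical.

```python
from typing import Sequence, Tuple


def run_length_through(flags: Sequence[bool], target: int) -> int:
    """Length of the unbroken run of True flags that contains index `target` (0 if none)."""
    if not flags[target]:
        return 0
    run = 1
    i = target - 1
    while i >= 0 and flags[i]:
        run += 1
        i -= 1
    i = target + 1
    while i < len(flags) and flags[i]:
        run += 1
        i += 1
    return run


def energy_durations(hf_ratios: Sequence[float], spl_db: Sequence[float], target: int,
                     t8: float, t9: float, t10: float) -> Tuple[int, int]:
    """Return (low-frequency run, high-frequency run) for the frame at `target`.

    t8, t9, t10 stand in for the eighth, ninth and tenth thresholds.
    """
    low_flags = [ratio < t8 for ratio in hf_ratios]
    high_flags = [ratio > t9 and level > t10 for ratio, level in zip(hf_ratios, spl_db)]
    return run_length_through(low_flags, target), run_length_through(high_flags, target)


# Illustrative values only: three low-frequency frames followed by two high-ratio frames.
ratios = [0.1, 0.1, 0.05, 0.6, 0.7]
levels = [50.0, 50.0, 50.0, 70.0, 30.0]
print(energy_durations(ratios, levels, target=2, t8=0.2, t9=0.5, t10=40.0))
```

The last frame has a high ratio but a low sound pressure level, so it does not extend the high-frequency run; this is the role the tenth threshold plays in the sketch.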
With reference to the second aspect or any of the foregoing possible implementations of the second aspect, in a seventh possible implementation, when the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, where L2 and L3 are positive integers, the first acquiring module is specifically configured to: acquire tone distribution parameters of the frame to be classified in the audio signal, of the N3 frames preceding it, and of the L2 frames following it, and obtain, from those tone distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N3 is a positive integer; or
acquire energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames preceding it, and of the L3 frames following it, and obtain, from those energy distribution parameters, the number of sustained frames of the frame to be classified in the low-frequency region or the number of sustained frames of the frame to be classified in the high-frequency region;
The classification processing module is specifically configured to: determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of sustained frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of sustained frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified is a speech signal; if the frame to be classified is determined to be a music signal, determine whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fourth threshold, and if so, correct the frame to be classified to a speech signal; and if the frame to be classified is determined to be a speech signal, determine whether the number of frames determined to be music signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fifth threshold, and if so, correct the frame to be classified to a music signal, where N4 is a positive integer.
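The correction step described above amounts to a neighbourhood vote over preliminary labels. A minimal sketch, assuming the preliminary decisions are held in a list of the strings "music" and "speech" (the label encoding and the function name `correct_label` are hypothetical):

```python
from typing import Sequence


def correct_label(labels: Sequence[str], target: int, n4: int, l3: int,
                  t4: int, t5: int) -> str:
    """Re-examine the preliminary label of labels[target] using its neighbours.

    n4 / l3 -- number of preceding / following frames that are consulted
    t4 / t5 -- stand-ins for the fourth and fifth thresholds
    """
    before = list(labels[max(0, target - n4):target])
    after = list(labels[target + 1:target + 1 + l3])
    neighbours = before + after
    current = labels[target]
    if current == "music" and neighbours.count("speech") > t4:
        return "speech"  # isolated music decision surrounded by speech
    if current == "speech" and neighbours.count("music") > t5:
        return "music"   # isolated speech decision surrounded by music
    return current


preliminary = ["speech", "speech", "music", "speech", "speech"]
print(correct_label(preliminary, target=2, n4=2, l3=2, t4=3, t5=3))
```

Here the single "music" decision at index 2 has four "speech" neighbours, which exceeds the illustrative fourth threshold of 3, so it is corrected to "speech". Consulting the L3 following frames is what adds the extra L3 frames of delay on top of the L2 look-ahead frames.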
With reference to the seventh possible implementation of the second aspect, in an eighth possible implementation, the acquiring, by the first acquiring module, of the tone distribution parameter of the frame to be classified in the audio signal, the tone distribution parameters of the N3 frames preceding the frame to be classified, and the tone distribution parameters of the L2 frames following the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, to obtain a power density spectrum;
obtaining, from the power density spectrum, the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, the frequency-domain distribution information of the tonal components of the N3 preceding frames as the tone distribution parameters of those N3 frames, and the frequency-domain distribution information of the tonal components of the L2 following frames as the tone distribution parameters of those L2 frames;
The obtaining, by the classification determining module, of the number of tonal components in the frame to be classified that satisfy the continuity constraint, according to the tone distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, includes:
obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of sustained frames is greater than a sixth threshold.
With reference to the seventh possible implementation of the second aspect, in a ninth possible implementation, the acquiring, by the first acquiring module, of the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the energy distribution parameters of the L2 frames following the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of those N3 frames, and the high-frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of those L2 frames;
The obtaining, by the classification determining module, of the number of sustained frames of the frame to be classified in the low-frequency region, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the audio signal, includes:
obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
The obtaining, by the classification determining module, of the number of sustained frames of the frame to be classified in the high-frequency region, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames in the audio signal, includes:
obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
With reference to the second, fifth, or eighth possible implementation of the second aspect, in a tenth possible implementation, the number of tonal components in the frame to be classified whose number of sustained frames is greater than the sixth threshold, as obtained by the first acquiring module, is the number of such tonal components that are greater than a seventh threshold in the frequency domain. The number of tonal components satisfying the continuity constraint is the number of tonal components that are greater than the seventh threshold in the frequency domain.
With reference to the first, second, or third possible implementation of the second aspect, in a sixth possible implementation, the first acquiring module is specifically configured to: obtain the high-frequency energy distribution ratio and the sound pressure level of each frame in the received audio signal; and obtain, from the high-frequency energy distribution ratio and the sound pressure level of each frame in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold, or the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
A third aspect of the present invention provides an audio signal classification processing device, including:
a receiver, configured to receive an audio signal; and
a processor, connected to the receiver and configured to: obtain at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in the audio signal received by the receiver, the number of sustained frames of the frame to be classified in a low-frequency region, and the number of sustained frames of the frame to be classified in a high-frequency region; and determine, according to the at least one of these quantities, that the frame to be classified in the audio signal is a music signal or that it is a speech signal.
In a first possible implementation of the third aspect, the processor is specifically configured to: acquire tone distribution parameters of the frame to be classified in the audio signal and of the N1 frames preceding it, and obtain, from those tone distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N1 is a positive integer; acquire energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames preceding it, and obtain, from those energy distribution parameters, the number of sustained frames of the frame to be classified in the low-frequency region and/or the number of sustained frames of the frame to be classified in the high-frequency region; and determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of sustained frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of sustained frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified in the audio signal is a speech signal.
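The decision rule stated above is a logical OR over up to three threshold comparisons. A minimal sketch, assuming that any feature the processor did not compute is passed as `None` (reflecting the "at least one of" wording); the function name `classify_frame` and the threshold values in the example are hypothetical:

```python
from typing import Optional


def classify_frame(tonal_count: Optional[int] = None,
                   low_run: Optional[int] = None,
                   high_run: Optional[int] = None,
                   *, t1: int, t2: int, t3: int) -> str:
    """Return "music" if any available feature exceeds its threshold, else "speech".

    t1, t2, t3 stand in for the first, second and third thresholds.
    """
    checks = [(tonal_count, t1), (low_run, t2), (high_run, t3)]
    if all(value is None for value, _ in checks):
        raise ValueError("at least one feature is required")
    is_music = any(value is not None and value > threshold
                   for value, threshold in checks)
    return "music" if is_music else "speech"


# Illustrative thresholds only.
print(classify_frame(tonal_count=5, t1=3, t2=10, t3=10))
print(classify_frame(tonal_count=2, low_run=4, high_run=1, t1=3, t2=10, t3=10))
```

Because the comparisons are strict, a feature that merely equals its threshold does not trigger the music decision.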
With reference to the first possible implementation of the third aspect, in a second possible implementation, the acquiring, by the processor, of the tone distribution parameter of the frame to be classified in the audio signal and the tone distribution parameters of the N1 frames preceding the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified and the N1 frames preceding it in the received audio signal, to obtain a power density spectrum;
obtaining, from the power density spectrum, the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, and the frequency-domain distribution information of the tonal components of the N1 preceding frames as the tone distribution parameters of those N1 frames;
The obtaining, by the processor, of the number of tonal components in the frame to be classified that satisfy the continuity constraint, according to the tone distribution parameters of the frame to be classified and of the N1 preceding frames, includes:
obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified and the N1 preceding frames in the received audio signal, the number of tonal components in the frame to be classified whose number of sustained frames is greater than a sixth threshold.
With reference to the first possible implementation of the third aspect, in a third possible implementation, the acquiring, by the processor, of the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames preceding the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the N1 preceding frames as the energy distribution parameters of those N1 frames;
The obtaining, by the processor, of the number of sustained frames of the frame to be classified in the low-frequency region, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames in the audio signal, includes:
obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 preceding frames in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
The obtaining, by the processor, of the number of sustained frames of the frame to be classified in the high-frequency region, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames in the audio signal, includes:
obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 preceding frames in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
With reference to the third aspect or any of the foregoing possible implementations of the third aspect, in a fourth possible implementation, when the classification result of the frame to be classified is obtained with a delay of L1 frames, where L1 is a positive integer, the processor is specifically configured to: acquire tone distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding it, and of the L1 frames following it, and obtain, from those tone distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N2 is a positive integer; acquire energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding it, and of the L1 frames following it, and obtain, from those energy distribution parameters, the number of sustained frames of the frame to be classified in the low-frequency region and/or the number of sustained frames of the frame to be classified in the high-frequency region; and determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of sustained frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of sustained frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified in the audio signal is a speech signal.
With reference to the fourth possible implementation of the third aspect, in a fifth possible implementation, the acquiring, by the processor, of the tone distribution parameter of the frame to be classified in the audio signal, the tone distribution parameters of the N2 frames preceding the frame to be classified, and the tone distribution parameters of the L1 frames following the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal, to obtain a power density spectrum;
obtaining, from the power density spectrum, the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, the frequency-domain distribution information of the tonal components of the N2 preceding frames as the tone distribution parameters of those N2 frames, and the frequency-domain distribution information of the tonal components of the L1 following frames as the tone distribution parameters of those L1 frames;
The obtaining, by the processor, of the number of tonal components in the frame to be classified that satisfy the continuity constraint, according to the tone distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, includes:
obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of sustained frames is greater than a sixth threshold.
With reference to the fourth possible implementation of the third aspect, in a sixth possible implementation, the acquiring, by the processor, of the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames preceding the frame to be classified, and the energy distribution parameters of the L1 frames following the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of those N2 frames, and the high-frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of those L1 frames;
The obtaining, by the processor, of the number of sustained frames of the frame to be classified in the low-frequency region, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the audio signal, includes:
obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
The obtaining, by the processor, of the number of sustained frames of the frame to be classified in the high-frequency region, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames in the audio signal, includes:
obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
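The delayed operation described above, in which a frame is classified only after L1 further frames have arrived, can be sketched as a sliding window over a frame stream. This is only an illustration of the buffering idea under stated assumptions: the generator name `delayed_windows` is hypothetical, frames are represented by arbitrary Python objects, and the final L1 frames of the stream are never emitted because no flush step is included.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple, TypeVar

Frame = TypeVar("Frame")


def delayed_windows(frames: Iterable[Frame], n2: int,
                    l1: int) -> Iterator[Tuple[int, List[Frame], int]]:
    """Yield (target_index, window, position_of_target_in_window).

    The window holds up to n2 preceding frames, the frame to be classified,
    and exactly l1 following frames, so each result lags the input by l1 frames.
    """
    buffer: deque = deque(maxlen=n2 + 1 + l1)
    for index, frame in enumerate(frames):
        buffer.append(frame)
        target = index - l1
        if target < 0:
            continue  # not enough look-ahead frames yet
        window = list(buffer)
        yield target, window, len(window) - 1 - l1


for target, window, position in delayed_windows(range(5), n2=2, l1=1):
    print(target, window, position)
```

Each yielded window could then be handed to feature extraction routines for the tone and energy distribution parameters; near the start of the stream the window simply contains fewer than N2 preceding frames.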
With reference to the third aspect or any of the foregoing possible implementations of the third aspect, in a seventh possible implementation, when the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, where L2 and L3 are positive integers, the processor is specifically configured to: obtain the tone distribution parameters of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the audio signal, and obtain, according to these tone distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N3 is a positive integer; obtain the energy distribution parameters of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the audio signal, and obtain, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region; when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified is a speech signal; if the frame to be classified is determined to be a music signal, determine whether the number of frames determined to be speech signals among the N4 frames preceding and the L3 frames following the frame to be classified is greater than a fourth threshold, and if so, correct the frame to be classified to a speech signal, where N4 is a positive integer; and if the frame to be classified is determined to be a speech signal, determine whether the number of frames determined to be music signals among the N4 frames preceding and the L3 frames following the frame to be classified is greater than a fifth threshold, and if so, correct the frame to be classified to a music signal.
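The correction stage described above can be illustrated with a minimal sketch. It assumes that per-frame preliminary labels are already available as strings, that the neighbourhood is simply truncated at the edges of the sequence, and that the neighbours' preliminary (uncorrected) labels are counted; the function name `correct_label` and these choices are illustrative assumptions, not taken from the source.

```python
from typing import List

MUSIC, SPEECH = "music", "speech"  # hypothetical label values

def correct_label(prelim: List[str], i: int, n4: int, l3: int,
                  thr4: int, thr5: int) -> str:
    """Return the corrected label of frame i.

    prelim holds preliminary labels; the N4 frames before i and the
    L3 frames after i are inspected (fewer near the sequence edges).
    """
    before = prelim[max(0, i - n4):i]
    after = prelim[i + 1:i + 1 + l3]
    neighbours = before + after
    # A "music" frame surrounded by many speech frames is flipped to speech.
    if prelim[i] == MUSIC and neighbours.count(SPEECH) > thr4:
        return SPEECH
    # A "speech" frame surrounded by many music frames is flipped to music.
    if prelim[i] == SPEECH and neighbours.count(MUSIC) > thr5:
        return MUSIC
    return prelim[i]
```

For example, an isolated music label in a run of speech labels is corrected to speech once the speech count among its neighbours exceeds the fourth threshold.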
With reference to the seventh possible implementation of the third aspect, in an eighth possible implementation, the processor obtaining the tone distribution parameter of the frame to be classified in the audio signal, the tone distribution parameters of the N3 frames preceding the frame to be classified, and the tone distribution parameters of the L2 frames following the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal to obtain power density spectra;
obtaining, according to the power density spectra, the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, the frequency-domain distribution information of the tonal components of the N3 preceding frames as the tone distribution parameters of the N3 preceding frames, and the frequency-domain distribution information of the tonal components of the L2 following frames as the tone distribution parameters of the L2 following frames;
the processor obtaining, according to the tone distribution parameter of the frame to be classified, the tone distribution parameters of the N3 preceding frames, and the tone distribution parameters of the L2 following frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
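The counting of tonal components whose persistence exceeds the sixth threshold might look as follows. This is only a sketch under stated assumptions: each frame's tone distribution parameter is modelled as a set of FFT bin indices flagged as tonal, a component "continues" only if exactly the same bin is flagged in the adjacent frame, and the caller passes just the window of preceding and following frames. None of these modelling choices, nor the name `count_persistent_tones`, comes from the source.

```python
from typing import List, Set

def count_persistent_tones(tones: List[Set[int]], i: int, thr6: int) -> int:
    """Count tonal bins of frame i whose run of consecutive frames exceeds thr6.

    tones[k] is the set of bin indices flagged as tonal in frame k of the
    analysis window; i is the position of the frame to be classified.
    """
    count = 0
    for b in tones[i]:
        run = 1  # the frame to be classified itself
        k = i - 1
        while k >= 0 and b in tones[k]:          # extend into preceding frames
            run += 1
            k -= 1
        k = i + 1
        while k < len(tones) and b in tones[k]:  # extend into following frames
            run += 1
            k += 1
        if run > thr6:
            count += 1
    return count
```

A practical detector would probably tolerate small frequency drift between frames; the exact-bin match is used here only to keep the sketch short.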
With reference to the seventh possible implementation of the third aspect, in a ninth possible implementation, the processor obtaining the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the energy distribution parameters of the L2 frames following the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of the N3 preceding frames, and the high-frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of the L2 following frames;
the processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 preceding frames, and the energy distribution parameters of the L2 following frames, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of consecutive frames, including the frame to be classified, in which the high-frequency energy distribution ratio is less than an eighth threshold;
the processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 preceding frames, and the energy distribution parameters of the L2 following frames, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of consecutive frames, including the frame to be classified, in which the high-frequency energy distribution ratio is greater than a ninth threshold and the sound pressure level is greater than a tenth threshold.
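The two persistence counts can be sketched as run lengths around the frame to be classified. The representation of a frame as a `(ratio, spl)` pair, the helper names, and the rule that the count is zero when the frame itself fails the condition are assumptions made for illustration; the numeric values in the usage note are likewise invented.

```python
from typing import Callable, List, Tuple

Frame = Tuple[float, float]  # (high-frequency energy distribution ratio, sound pressure level)

def run_length(frames: List[Frame], i: int, pred: Callable[[Frame], bool]) -> int:
    """Length of the run of consecutive frames satisfying pred that contains frame i."""
    if not pred(frames[i]):
        return 0
    run = 1
    k = i - 1
    while k >= 0 and pred(frames[k]):
        run += 1
        k -= 1
    k = i + 1
    while k < len(frames) and pred(frames[k]):
        run += 1
        k += 1
    return run

def low_band_run(frames: List[Frame], i: int, thr8: float) -> int:
    """Consecutive frames whose ratio is below the eighth threshold."""
    return run_length(frames, i, lambda f: f[0] < thr8)

def high_band_run(frames: List[Frame], i: int, thr9: float, thr10: float) -> int:
    """Consecutive frames whose ratio exceeds thr9 and whose level exceeds thr10."""
    return run_length(frames, i, lambda f: f[0] > thr9 and f[1] > thr10)
```

With `frames = [(0.05, 60), (0.04, 62), (0.03, 61), (0.5, 70), (0.6, 72)]`, the call `low_band_run(frames, 1, 0.1)` covers the first three frames and `high_band_run(frames, 3, 0.4, 65)` covers the last two.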
With reference to the second, fifth, or eighth possible implementation of the third aspect, in a tenth possible implementation, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold, as obtained by the processor, is the number of such tonal components that are greater than a seventh threshold in the frequency domain. The number of tonal components satisfying the continuity constraint is the number of tonal components that are greater than the seventh threshold in the frequency domain.
The technical solutions provided by the present invention are based mainly on the characteristics of music signals: the tones of a music signal last relatively long, whereas the tones of a speech signal are short, and the energy of a music signal can remain distributed in the high-frequency region or the low-frequency region, whereas the energy of a speech signal usually cannot. On this basis, the technical solutions provided by the embodiments of the present invention first obtain the number of tonal components in the frame to be classified that satisfy the continuity constraint, together with the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, and then determine from this information whether the frame to be classified is a music signal or a speech signal. The audio signal classification processing method provided by the foregoing technical solutions can improve the accuracy of audio signal classification and meet the requirements of speech quality assessment.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts.
FIG. 1 is a first schematic flowchart of an audio signal classification processing method according to an embodiment of the present invention;
FIG. 2 is a first schematic flowchart of a specific embodiment of the present invention;
FIG. 3a is a first waveform diagram of the input signal "French male voice + sheng";
FIG. 3b is the spectrogram corresponding to FIG. 3a;
FIG. 4a is a waveform diagram of the input signal "jinghu + French male voice";
FIG. 4b is the spectrogram corresponding to FIG. 4a;
FIG. 5a is a waveform diagram of the input signal "Korean male voice + ensemble";
FIG. 5b is the spectrogram corresponding to FIG. 5a;
FIG. 6a is a second waveform diagram of the input signal "French male voice + sheng";
FIG. 6b shows the initial tone detection result for the input signal shown in FIG. 6a;
FIG. 6c shows the tone detection result after screening for the input signal shown in FIG. 6a;
FIG. 7a is a third waveform diagram of the input signal "French male voice + sheng";
FIG. 7b is a graph of the tonal feature corresponding to FIG. 7a;
FIG. 8a is a waveform diagram of the input signal "jinghu + French male voice";
FIG. 8b is a graph of the high-frequency energy distribution ratio corresponding to FIG. 8a;
FIG. 9a is a waveform diagram of the input signal "Korean male voice + ensemble";
FIG. 9b is a graph of the high-frequency energy distribution ratio corresponding to FIG. 9a;
FIG. 10 is a first schematic flowchart of an audio signal classification rule according to an embodiment of the present invention;
FIG. 11a is a first waveform diagram of the input signal "Chinese female voice + ensemble + English male voice + xun + German male voice + castanets";
FIG. 11b is a schematic diagram of the classification result corresponding to FIG. 11a;
FIG. 12a is a second waveform diagram of the input signal "Chinese female voice + ensemble + English male voice + xun + German male voice + castanets";
FIG. 12b is a schematic diagram of the smoothed classification result corresponding to FIG. 12a;
FIG. 13 is a second schematic flowchart of an audio signal classification rule according to an embodiment of the present invention;
FIG. 14a is a third waveform diagram of the input signal "Chinese female voice + ensemble + English male voice + xun + German male voice + castanets";
FIG. 14b is a schematic diagram of the real-time classification result corresponding to FIG. 14a;
FIG. 15 is a flowchart of a speech classification method in the case where the output delay is not fixed, according to an embodiment of the present invention;
FIG. 16a is a fourth waveform diagram of the input signal "Chinese female voice + ensemble + English male voice + xun + German male voice + castanets";
FIG. 16b is a schematic diagram of the classification results of the three classification modes corresponding to FIG. 16a;
FIG. 17 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention;
FIG. 18 is a schematic structural diagram of an audio signal classification processing device according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are some but not all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
To address the defects in the prior art, an embodiment of the present invention provides an audio signal classification processing method. FIG. 1 is a first schematic flowchart of the audio signal classification processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step 101: Obtain at least one of the following: the number of tonal components in the frame to be classified in an audio signal that satisfy a continuity constraint, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region.
Step 102: According to the obtained at least one of the number of tonal components in the frame to be classified that satisfy the continuity constraint, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a speech signal.
With the audio signal classification processing method provided by the embodiments of the present invention, when the frames of an audio signal are classified, the classification result may be output without output delay, that is, output in real time for each received audio signal frame, or with a certain output delay, that is, given some time after the audio signal frame is received.
The technical solutions provided by the foregoing embodiments of the present invention are based mainly on the characteristics of music signals: the tones of a music signal last relatively long, whereas the tones of a speech signal are short, and the energy of a music signal can remain distributed in the high-frequency region or the low-frequency region, whereas the energy of a speech signal usually cannot. On this basis, the technical solutions provided by the embodiments of the present invention first obtain the number of tonal components in the frame to be classified that satisfy the continuity constraint, together with the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, and then determine from this information whether the frame to be classified is a music signal or a speech signal. The audio signal classification processing method provided by the foregoing technical solutions can improve the accuracy of audio signal classification and meet the requirements of speech quality assessment.
In the foregoing embodiments of the present invention, three cases can be distinguished according to the output delay requirement. First, when the classification result of the frame to be classified is obtained in real time, the decision must be based on the information of the frame to be classified and the N frames preceding it. Second, when a small output delay of the classification result is allowed, that is, when the output delay is L1 frames, where L1 is a positive integer, the decision can be based on the frame to be classified, the L1 frames preceding it, and the L1 frames following it. Third, when a larger output delay of the classification result is allowed, that is, when the output delay is L2+L3 frames, where L2 and L3 are positive integers, a preliminary classification result of the frame to be classified is first obtained based on the frame to be classified, the L2 frames preceding it, and the L2 frames following it, and this result is then corrected according to the L3 frames preceding and the L3 frames following the frame to be classified. When there is no output delay, the earliest received frames of the audio signal cannot be classified; such frames can be assigned a default value, that is, treated by default as speech signals or as music signals.
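The relationship between the three cases and the output delay can be made concrete with a small arithmetic sketch. It computes the index of the latest input frame that must have arrived before frame `i` can be labelled; the mode names and the function `last_frame_needed` are hypothetical. In the third case the correction of frame `i` looks at preliminary labels up to frame `i + L3`, and each preliminary label itself needs L2 following frames, which is why the total delay is L2+L3.

```python
def last_frame_needed(i: int, mode: str, l1: int = 0, l2: int = 0, l3: int = 0) -> int:
    """Index of the latest input frame required to output the label of frame i."""
    if mode == "realtime":
        # Only the frame itself and preceding frames are used.
        return i
    if mode == "short_delay":
        # The decision also uses the L1 frames following frame i.
        return i + l1
    if mode == "long_delay":
        # Correction uses preliminary labels up to i + L3,
        # and the preliminary label of frame i + L3 needs L2 further frames.
        return i + l3 + l2
    raise ValueError(f"unknown mode: {mode}")
```

For instance, with L2 = 3 and L3 = 5, frame 10 can be output only after frame 18 has been received.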
Specifically, when there is no output delay, that is, when the classification result of the frame to be classified is obtained in real time, obtaining, in step 101 of the embodiment shown in FIG. 1, the number of tonal components in the frame to be classified in the audio signal that satisfy the continuity constraint specifically includes:
obtaining the tone distribution parameters of the frame to be classified and the N1 frames preceding it in the audio signal, and obtaining, according to the tone distribution parameters of the frame to be classified and the N1 preceding frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N1 is a positive integer;
obtaining, in step 102 of the embodiment shown in FIG. 1, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region includes:
obtaining the energy distribution parameters of the frame to be classified and the N1 frames preceding it in the audio signal, and obtaining, according to the energy distribution parameters of the frame to be classified and the N1 preceding frames, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, where N1 is a positive integer;
in step 103 of the embodiment shown in FIG. 1, determining, according to at least one of the number of tonal components in the frame to be classified that satisfy the continuity constraint, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise determining that the frame to be classified is a speech signal includes:
when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal.
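The decision rule above reduces to a logical OR over three threshold tests. The following sketch assumes that a feature which was not computed is passed as `None` and skipped, reflecting the "at least one of" wording; the function name, the `None` convention, and the label strings are illustrative assumptions.

```python
from typing import Optional

def classify_frame(num_tones: Optional[int], low_run: Optional[int],
                   high_run: Optional[int],
                   thr1: int, thr2: int, thr3: int) -> str:
    """Label a frame as "music" if any available feature exceeds its threshold."""
    checks = [(num_tones, thr1), (low_run, thr2), (high_run, thr3)]
    if all(value is None for value, _ in checks):
        raise ValueError("at least one feature must be provided")
    if any(value is not None and value > thr for value, thr in checks):
        return "music"
    return "speech"
```

Because the tests are combined with OR, a single strongly persistent feature is enough to label the frame as music.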
In the foregoing embodiment, obtaining the tone distribution parameter of the frame to be classified in the audio signal and the tone distribution parameters of the N1 frames preceding the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified and the N1 frames preceding it in the received audio signal to obtain power density spectra;
obtaining, according to the power density spectra, the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, and the frequency-domain distribution information of the tonal components of the N1 preceding frames as the tone distribution parameters of the N1 preceding frames.
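As an illustration of the first of these steps, the sketch below computes a one-sided power spectrum of a single frame with a plain radix-2 FFT written in standard-library Python. The normalisation `|X[k]|^2 / N`, the absence of a window function, and the power-of-two frame length are all assumptions; the source does not specify them, and the subsequent tonal-component detection is not shown.

```python
import cmath
from typing import List

def fft(x: List[complex]) -> List[complex]:
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("frame length must be a power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def power_spectrum(frame: List[float]) -> List[float]:
    """One-sided power spectrum |X[k]|^2 / N for k = 0 .. N/2."""
    n = len(frame)
    spec = fft([complex(s) for s in frame])
    return [abs(c) ** 2 / n for c in spec[: n // 2 + 1]]
```

For a frame holding exactly two cycles of a cosine, the spectrum returned by `power_spectrum` peaks at bin 2, which is the kind of narrow peak a tonal-component detector would look for.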
The foregoing obtaining, according to the tone distribution parameter of the frame to be classified and the tone distribution parameters of the N1 preceding frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified and the N1 preceding frames in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
In addition, the foregoing obtaining the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames preceding the frame to be classified includes:
obtaining the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the N1 preceding frames as the energy distribution parameters of the N1 preceding frames.
The foregoing obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 preceding frames, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 preceding frames in the received audio signal, the number of consecutive frames, including the frame to be classified, in which the high-frequency energy distribution ratio is less than an eighth threshold;
the foregoing obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 preceding frames, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 preceding frames in the received audio signal, the number of consecutive frames, including the frame to be classified, in which the high-frequency energy distribution ratio is greater than a ninth threshold and the sound pressure level is greater than a tenth threshold.
When an output delay of L1 frames is allowed for the classification result, that is, when the classification result of the frame to be classified is obtained with a delay of L1 frames, obtaining, in step 101 of the embodiment shown in FIG. 1, the number of tonal components in the frame to be classified in the audio signal that satisfy the continuity constraint includes:
obtaining the tone distribution parameters of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the audio signal, and obtaining, according to the tone distribution parameters of the frame to be classified, the N2 preceding frames, and the L1 following frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N2 is a positive integer;
obtaining, in step 102 of the embodiment shown in FIG. 1, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region includes:
obtaining the energy distribution parameters of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the audio signal, and obtaining, according to the energy distribution parameters of the frame to be classified, the N2 preceding frames, and the L1 following frames, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region;
in step 103 of the embodiment shown in FIG. 1, determining, according to at least one of the number of tonal components in the frame to be classified that satisfy the continuity constraint, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise determining that the frame to be classified is a speech signal includes:
when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal.
In the foregoing embodiment, obtaining the tone distribution parameter of the frame to be classified in the audio signal, the tone distribution parameters of the N2 frames preceding the frame to be classified, and the tone distribution parameters of the L1 frames following the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal to obtain power density spectra;
obtaining, according to the power density spectra, the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, the frequency-domain distribution information of the tonal components of the N2 preceding frames as the tone distribution parameters of the N2 preceding frames, and the frequency-domain distribution information of the tonal components of the L1 following frames as the tone distribution parameters of the L1 following frames;
the obtaining, according to the tone distribution parameter of the frame to be classified, the tone distribution parameters of the N2 preceding frames, and the tone distribution parameters of the L1 following frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
In addition, acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames preceding the frame to be classified, and the energy distribution parameters of the L1 frames following the frame to be classified includes:
acquiring the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of those frames, and the high-frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of those frames.
Acquiring, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold.
Acquiring, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
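The two persistence counts described above can be sketched as run lengths over per-frame conditions. This is a minimal illustration, not the patent's implementation: the function names, the threshold arguments `t8`, `t9`, `t10`, and the reading of "consecutive frames including the frame to be classified" as the length of the unbroken run passing through frame `k` are all assumptions.

```python
def run_length_through(cond, k):
    """Length of the unbroken run of True values in cond that contains index k."""
    if not cond[k]:
        return 0
    left = k
    while left > 0 and cond[left - 1]:
        left -= 1
    right = k
    while right < len(cond) - 1 and cond[right + 1]:
        right += 1
    return right - left + 1


def persistent_counts(ratio_hf, spl, k, t8, t9, t10):
    """Return (low-frequency run, high-frequency run) for frame k.

    ratio_hf and spl hold the high-frequency energy distribution ratio and
    the sound pressure level of the past frames, frame k, and the future
    frames. t8, t9, t10 stand in for the eighth, ninth and tenth thresholds
    (hypothetical values supplied by the caller).
    """
    low = [r < t8 for r in ratio_hf]
    high = [r > t9 and s > t10 for r, s in zip(ratio_hf, spl)]
    return run_length_through(low, k), run_length_through(high, k)
```

For example, with ratios `[0.1, 0.1, 0.1, 0.9, 0.9]` and an eighth threshold of 0.3, frame 1 lies in a low-frequency run of three frames.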
When an output delay of L2+L3 frames is allowed for the classification result, that is, when the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, acquiring, in step 101 of the embodiment shown in Fig. 1, the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
acquiring the tone distribution parameters of the frame to be classified in the audio signal, the N3 frames preceding it, and the L2 frames following it, and obtaining, according to these tone distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N3 is a positive integer.
Acquiring, in step 102 of the embodiment shown in Fig. 1, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region includes:
acquiring the energy distribution parameters of the frame to be classified in the audio signal, the N3 frames preceding it, and the L2 frames following it, and obtaining, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
Determining, in step 103 of the embodiment shown in Fig. 1, according to at least one of the number of tonal components in the frame to be classified that satisfy the continuity constraint, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified is a music signal, and otherwise that it is a speech signal, includes:
when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than the first threshold, or the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determining that the frame to be classified is a music signal, and otherwise determining that it is a speech signal;
if the frame to be classified is determined to be a music signal, determining whether the number of frames determined to be speech signals among the L3 frames preceding and the L3 frames following the frame to be classified is greater than a fourth threshold, and if so, correcting the frame to be classified to a speech signal;
if the frame to be classified is determined to be a speech signal, determining whether the number of frames determined to be music signals among the L3 frames preceding and the L3 frames following the frame to be classified is greater than a fifth threshold, and if so, correcting the frame to be classified to a music signal.
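The decision rule and the neighbor-based correction described above can be sketched as follows. This is an illustrative sketch only: the function names, the label strings, and all threshold values passed in are hypothetical, and the correction is assumed to use the initial (uncorrected) labels of the neighboring frames.

```python
MUSIC, SPEECH = "music", "speech"


def initial_decision(num_tonal, low_run, high_run, t1, t2, t3):
    """Music if any one of the three features exceeds its threshold."""
    if num_tonal > t1 or low_run > t2 or high_run > t3:
        return MUSIC
    return SPEECH


def corrected_decision(labels, k, l3, t4, t5):
    """Correct the label of frame k using the L3 frames on each side.

    labels holds the initial decisions of all frames seen so far;
    t4 and t5 stand in for the fourth and fifth thresholds.
    """
    neighbours = labels[max(k - l3, 0):k] + labels[k + 1:k + 1 + l3]
    if labels[k] == MUSIC and neighbours.count(SPEECH) > t4:
        return SPEECH
    if labels[k] == SPEECH and neighbours.count(MUSIC) > t5:
        return MUSIC
    return labels[k]
```

The correction acts as a majority-style smoothing step: an isolated music frame surrounded by speech frames is relabeled as speech, and vice versa.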
In the above embodiment, acquiring the tone distribution parameter of the frame to be classified in the audio signal, the tone distribution parameters of the N3 frames preceding the frame to be classified, and the tone distribution parameters of the L2 frames following the frame to be classified includes:
performing a fast Fourier transform on the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, to obtain a power density spectrum;
obtaining, according to the power density spectrum, the frequency-domain distribution information of the tonal components of the frame to be classified as the tone distribution parameter of the frame to be classified, the frequency-domain distribution information of the tonal components of the N3 preceding frames as the tone distribution parameters of those frames, and the frequency-domain distribution information of the tonal components of the L2 following frames as the tone distribution parameters of those frames.
Acquiring, according to the tone distribution parameter of the frame to be classified, the tone distribution parameters of the N3 preceding frames, and the tone distribution parameters of the L2 following frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold.
In addition, acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the energy distribution parameters of the L2 frames following the frame to be classified includes:
acquiring the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of those frames, and the high-frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of those frames.
Acquiring, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold.
Acquiring, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
In each of the three cases above, which differ in whether an output delay is allowed, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold refers to the number of such tonal components that lie above a seventh threshold in the frequency domain.
The cases above, in which an output delay of the classification result is or is not allowed, are described in detail below. First, take the case in which a small fixed output delay of L1 frames is allowed; in this embodiment, L1 is 15. Fig. 2 is a first schematic flowchart of a specific embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step 201: perform an FFT on the current frame, frame i. In this step, the FFT is performed on every received frame.
Step 202: based on the FFT result, obtain the tone distribution parameter and the energy distribution parameter of frame i.
Step 203: determine whether i > L1 holds, that is, whether L1 frames already exist before the current frame. If so, perform step 204; otherwise, end this pass and continue to perform steps 201 and 202 for the subsequent frames.
Step 204: when i > L1, the audio signal classification result of frame i-L1 can be obtained. Specifically, it is obtained from past information, namely the tone distribution parameters and energy distribution parameters of several frames before frame i-L1 obtained in steps 201 and 202; present information, namely the tone distribution parameter and energy distribution parameter of frame i-L1; and future information, namely the tone distribution parameters and energy distribution parameters of the L1 frames after frame i-L1.
Step 205: output the audio signal classification result of frame i-L1.
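The fixed-delay flow of steps 201 to 205 can be sketched as a generator that buffers per-frame features and emits the result for frame i-L1 once frame i has arrived. This is a sketch under assumptions: `delayed_classify`, `extract`, `classify`, and the history length `n_past` are hypothetical names and parameters, and `extract` stands in for the FFT and parameter extraction of steps 201 and 202.

```python
from collections import deque


def delayed_classify(frames, extract, classify, l1=15, n_past=15):
    """Yield (frame_number, result) for frame i - l1 when frame i arrives.

    extract(frame) returns the per-frame features (steps 201 and 202).
    classify(past, present, future) returns the result for one frame
    (step 204). Frame numbers start at 1, as in the description.
    """
    history = deque(maxlen=n_past + 1 + l1)
    for i, frame in enumerate(frames, start=1):
        history.append(extract(frame))
        if i > l1:                            # step 203
            feats = list(history)
            target = len(feats) - 1 - l1      # position of frame i - l1
            result = classify(feats[:target], feats[target], feats[target + 1:])
            yield i - l1, result              # step 205
```

With L1 = 15, the first result (for frame 1) is produced when frame 16 arrives, so the output always lags the input by 15 frames.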
Specifically, for the tone distribution of music signals and speech signals, refer to Fig. 3a and Fig. 3b. Fig. 3a is a first waveform diagram of the input signal "French male voice + sheng (笙)", and Fig. 3b is the spectrogram corresponding to Fig. 3a. In the input signal waveform of Fig. 3a, the sampling rate is 8 kHz, the horizontal axis is sample points, and the vertical axis is normalized amplitude. In the spectrogram of Fig. 3b, the sampling rate is also 8 kHz and the frequency analysis range is 0 to 4 kHz; the horizontal axis is frames, corresponding to the sample points on the horizontal axis of Fig. 3a, and the vertical axis is frequency (Hz). In a spectrogram, the brighter a frequency range is, the greater the energy of the signal in that band. If the signal maintains a large energy in some band over time, a "bright band" forms on the spectrogram; this is a tone. The tone distribution in Fig. 3b shows that, in the speech signal in the first half, the tones have very short durations at higher frequencies, and only the tone at the fundamental frequency lasts slightly longer. In a speech signal, tones can be detected where the sound is voiced. Because voiced sounds are usually short, the corresponding tone durations are also short. In the music signal in the second half, by contrast, the tone durations are markedly longer.
For the energy distribution of music signals and speech signals, refer to Fig. 4a and Fig. 4b. Fig. 4a is the waveform diagram of the input signal "jinghu (京胡) + French male voice", and Fig. 4b is the spectrogram corresponding to Fig. 4a. In the waveform diagram of Fig. 4a, the horizontal axis is sample points and the vertical axis is normalized amplitude; in the spectrogram of Fig. 4b, the horizontal axis is frames and the vertical axis is frequency (Hz). The energy distribution in Fig. 4b shows that, in the music signal in the first half, the energy lies essentially above 1 kHz and is spread across 1 kHz to 4 kHz. In the speech signal in the second half, the energy of most voiced sounds lies mainly below 1 kHz, while the energy of unvoiced sounds is spread from low to higher frequencies. The energy of a speech signal therefore cannot remain continuously in a relatively high frequency range.
In addition, the energy of some music signals can remain continuously in the low-frequency region, whereas the energy of a speech signal cannot. Take the audio signal "Korean male voice + ensemble" shown in Fig. 5a and Fig. 5b as an example. Fig. 5a is the waveform diagram of the input signal "Korean male voice + ensemble", where the horizontal axis is sample points and the vertical axis is normalized amplitude; Fig. 5b is the spectrogram corresponding to Fig. 5a, where the horizontal axis is frames and the vertical axis is frequency (Hz). Fig. 5b shows the following energy distribution. The energy distribution of the speech signal in the first half of Fig. 5b is similar to that of the speech signal in Fig. 4b. Because voiced and unvoiced sounds have different energy distribution characteristics, the energy distribution of a speech signal fluctuates considerably. The energy of a speech signal therefore can remain continuously neither in a relatively high frequency range nor in the low-frequency range. In the music signal in the second half, the energy lies mainly below 1 kHz.
In summary, music signals differ from speech signals mainly in three respects. First, the tones of some music signals last a long time, whereas the tones of speech signals are usually short. Second, the energy of some music signals can remain continuously in a relatively high frequency range, whereas the energy of speech signals cannot. Third, the energy of some music signals can remain continuously in the low-frequency region, whereas the energy of speech signals cannot. In the embodiments of the present invention, the division between low frequency and high frequency may be determined according to the region in which speech signals are distributed: the region in which speech signals are mainly distributed is defined as the low-frequency region. For example, frequencies below 1 kHz are defined as the low-frequency region and frequencies of 1 kHz and above as the high-frequency region. The specific value may of course differ depending on the application scenario and on the speech signals concerned.
Based on the above classification principle, the features to be extracted are mainly tonal features and energy features. Specifically, extracting the tonal features can be divided into three steps:
A. Obtain the initial tone detection result, that is, the tone distribution parameter of each frame.
B. Filter the initial tone detection result by continuity analysis, and determine the tonal components in the frame to be classified that satisfy the continuity constraint. A tonal component is a form of distribution of energy in the frequency domain.
C. Based on the filtered tone detection result, extract the tonal feature, that is, the number of tonal components in the frame to be classified that satisfy the continuity constraint.
Obtaining the initial tone detection result may include the following. First, an FFT is performed on the data of each frame to obtain a power density spectrum. Second, the local maximum points of the power density spectrum are determined. Finally, several power density spectrum coefficients centered on each local maximum point are analyzed to further determine whether the local maximum point is a true tonal component.
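The initial tone detection can be sketched as follows, using the example criterion this embodiment gives for a true tonal component (a local maximum at least 7 dB above the bins at offsets i = 2, ..., 10 on both sides). This is a sketch under assumptions: the function name is hypothetical, the input is taken to be a power density spectrum already expressed in dB, and neighbors that fall outside the spectrum are simply skipped, which the source does not specify.

```python
def detect_tonal_bins(psd_db, margin_db=7.0, offsets=range(2, 11)):
    """Return a 0/1 flag per bin marking local maxima that pass the margin test.

    psd_db: power density spectrum of one frame in dB (bins 0 .. F/2 - 1).
    """
    n = len(psd_db)
    flags = [0] * n
    for f in range(1, n - 1):
        # Local maximum of the power density spectrum.
        if not (psd_db[f] > psd_db[f - 1] and psd_db[f] >= psd_db[f + 1]):
            continue
        is_tonal = True
        for i in offsets:
            for g in (f - i, f + i):
                # Assumption: out-of-range neighbours are ignored.
                if 0 <= g < n and psd_db[f] - psd_db[g] < margin_db:
                    is_tonal = False
        if is_tonal:
            flags[f] = 1
    return flags
```

Applied to every frame, the resulting rows form the per-frame initial tone detection result on which the continuity analysis operates.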
In this embodiment, the sampling rate of the input signal is 8 kHz, the effective bandwidth is 4 kHz, the FFT size is 1024, and the local maximum points of the power density spectrum are
[Formula image imgf000030_0001]
In this embodiment, the choice of which power density spectrum coefficients centered on a local maximum point to analyze is flexible and can be set according to the needs of the algorithm. For example, it can be implemented as follows.
If a local maximum point $p_f$ satisfies the following condition:
$p_f - p_{f \pm i} \ge 7\ \mathrm{dB}$, where $i = 2, 3, \ldots, 10$,
that is, if the local maximum point differs substantially from the neighboring points (by 7 dB in this embodiment), the local maximum point is a true tonal component.
For the tone continuity analysis step, let tonal_flag_original[k][f], $0 \le f < F/2$ [formula image imgf000030_0002], denote the initial tone detection result. A value of 1 indicates that the data of frame k has a tonal component at f, and a value of 0 indicates that the data of frame k has no tonal component at f. Relative to frame k, the L1 frames of data before frame k are called past frames, and the L1 frames of data after frame k are called future frames. Suppose the data of frame k has a tonal component at $f_x$, that is, tonal_flag_original[k][$f_x$] = 1. For the tonal component at $f_x$ in frame k, the tone continuity analysis proceeds as follows:
Step 1: count the number of past frames with whose tonal components this tonal component is continuous, denoted num_left. Initialize the variable num_left to 0. The number of frames without continuity is denoted num_non_tonal; initialize this variable to 0. Record the position of the tonal component under analysis: pos_cur = $f_x$.
Check the values of tonal_flag_original[k-1][pos_cur + x] over the offset range x given in formula image imgf000031_0001 (the surviving text indicates a window extending 3 bins on either side of pos_cur).
If all the values are 0, the data of frame (k-1) has no tonal component in that interval, that is, there is a break between the tonal component at $f_x$ in frame k and the tonal components of frame (k-1). Record this discontinuity event:
num_non_tonal = num_non_tonal + 1.
Otherwise, the data of frame (k-1) has a tonal component in that interval [formula image imgf000031_0002], and the tonal component at $f_x$ in frame k is continuous with the tonal component of frame (k-1). In this case:
record the position of the tonal component in frame (k-1): pos_cur = pos_cur + x;
count the frame as continuous: num_left = num_left + 1;
set the variable num_non_tonal to 0.
Check in turn whether frame (k-1), frame (k-2), and so on are continuous with the tonal component of the preceding frame. Before each check, first examine the value of num_non_tonal:
If num_non_tonal $\ge a_1$, the break between the tonal component under analysis and the tonal components of the past frames has exceeded the preset range, and continuity no longer holds. There is no need to continue checking; output num_left.
If num_non_tonal $< a_1$, the break between the tonal component under analysis and the tonal components of the past frames is still within the preset range, and checking continues until the past L1 frames of data have all been checked; then output num_left.
Step 2: count the number of future frames with whose tonal components this tonal component is continuous, denoted num_right. As in step 1, check in turn whether frame k, frame (k+1), and so on are continuous with the tonal component of the following frame, and output num_right.
Step 3: filter the initial tone detection result according to num_left and num_right. If either of the following two conditions is satisfied:
(num_left + num_right) $\ge a_2$
num_right $\ge a_3$
the tonal component at $f_x$ in frame k has a certain continuity and the initial tone detection result is retained; otherwise it is not retained. In this embodiment, $a_1 = 5$, $a_2 = 10$, and $a_3 = 8$.
Taking the "French male voice + sheng" audio signal of Fig. 3a and Fig. 3b as an example, an instance of the tone continuity analysis is shown in Fig. 6a, Fig. 6b, and Fig. 6c. Fig. 6a is a second waveform diagram of the input signal "French male voice + sheng". Fig. 6b is the initial tone detection result for the input signal shown in Fig. 6a. Its horizontal axis is frames, corresponding to the sample points on the horizontal axis of Fig. 6a; its vertical axis takes values from 0 to 511, and each point corresponds to a frequency resolution of 4000 Hz / 512 = 7.8125 Hz. If the data of a frame has a tonal component in the frequency range corresponding to a point on the vertical axis, the point is marked white; otherwise it is black. If several consecutive frames have a tonal component in some frequency range, a "white line" forms. This "white line" corresponds to the "bright band" in the spectrogram of Fig. 3b. Fig. 6c is the filtered tone detection result for the input signal shown in Fig. 6a. Compared with the initial tone detection result of Fig. 6b, in the speech signal in the first half only a small number of tonal components with slightly longer durations at and near the fundamental frequency are retained, and the remaining tonal components have been removed. In the music signal in the second half, the great majority of the tonal components are retained.
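The continuity analysis of steps 1 to 3 can be sketched as below. This is a sketch under stated assumptions rather than the patent's implementation: the function names are hypothetical, the search window is taken to be 3 bins on either side of pos_cur, the nearest flagged bin is chosen when several lie in the window, and the first filter condition is taken to compare against a2. The defaults a1 = 5, a2 = 10, a3 = 8 and L1 = 15 are the example values of this embodiment.

```python
def count_continuity(flags, k, fx, direction, max_frames, a1=5, half_width=3):
    """Count frames on one side of frame k that continue the component at fx.

    flags[j][f] is the initial tone detection result (0 or 1).
    direction is -1 for past frames (num_left), +1 for future frames (num_right).
    """
    num_cont = 0
    num_non_tonal = 0
    pos_cur = fx
    for step in range(1, max_frames + 1):
        if num_non_tonal >= a1:          # break exceeds the preset range
            break
        j = k + direction * step
        if j < 0 or j >= len(flags):
            break
        row = flags[j]
        lo = max(pos_cur - half_width, 0)
        hi = min(pos_cur + half_width, len(row) - 1)
        hits = [g for g in range(lo, hi + 1) if row[g]]
        if not hits:
            num_non_tonal += 1           # record a discontinuity event
        else:
            # Assumption: follow the flagged bin closest to pos_cur.
            pos_cur = min(hits, key=lambda g: abs(g - pos_cur))
            num_cont += 1
            num_non_tonal = 0
    return num_cont


def keep_component(flags, k, fx, l1=15, a1=5, a2=10, a3=8):
    """Step 3: keep the component if it is continuous over enough frames."""
    num_left = count_continuity(flags, k, fx, -1, l1, a1)
    num_right = count_continuity(flags, k, fx, +1, l1, a1)
    return (num_left + num_right) >= a2 or num_right >= a3
```

A component that forms a long "white line" across neighboring frames is kept, while an isolated flag with no neighbors in the window is discarded after five consecutive misses on each side.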
Finally, the tonal feature is extracted. From the filtered tone detection result, the number of tonal components per frame in the range from lower frequencies to high frequencies (corresponding to $a_4 \le f < F/2$) is counted and denoted num_tonal_flag. The larger num_tonal_flag is, the longer the tonal components of the corresponding signal last and the more likely the signal is a music signal.
As shown in Fig. 6c above, a speech signal may have a few tonal components with slightly longer durations at and near the fundamental frequency. The range over which the number of tonal components per frame is counted therefore starts not from $f = 0$ but from $f = a_4$, which avoids misjudging as music a speech signal whose fundamental-frequency tonal components last relatively long. In other words, the counted number of tonal components satisfying the continuity constraint is the number of tonal components that lie above the seventh threshold in the frequency domain. In this embodiment, $a_4 = 40$.
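The feature count itself is a short sum over the filtered flags of one frame. A minimal sketch, assuming the filtered result for a frame is a list of 0/1 flags over bins 0 to F/2 - 1; the function name is hypothetical and the default a4 = 40 is the example value of this embodiment.

```python
def count_tonal_components(filtered_row, a4=40):
    """Number of retained tonal components in bins a4 <= f < F/2."""
    return sum(1 for f in range(a4, len(filtered_row)) if filtered_row[f])
```

With 512 bins of 7.8125 Hz each, a4 = 40 corresponds to ignoring components below 312.5 Hz, which is where the long-lived fundamental-frequency tones of speech tend to lie.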
Again taking the "French male voice + sheng" audio signal of Figures 3a and 3b as an example, Figure 7a is waveform diagram 3 of the input signal, and Figure 7b is the curve of the tone feature num_tonal_flag corresponding to Figure 7a. The horizontal axis is the frame index, corresponding to the sample points on the horizontal axis of Figure 7a; the vertical axis is the number of tonal components. Figures 7a and 7b show that num_tonal_flag is always 0 in the speech signal of the first half, which is clearly different from the tone feature of the sheng in the second half.
The energy features in the above embodiments of the present invention are extracted as follows. Before the energy features are extracted, the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level spl(k) of each frame are calculated, where k denotes the frame index.
[Formula image imgf000033_0001]
This formula uses the real part and the imaginary part of the FFT of the k-th frame. The denominator is the total energy of the k-th frame; the numerator is the sum of the energy in the higher frequency range corresponding to f = a5 ~ (F/2 - 1). If ratio_energy_hf(k) is small, the energy of the k-th frame is mainly distributed at low frequencies; otherwise, it is mainly distributed in the higher frequency range.
[Formula image imgf000033_0002]
This formula uses the power density spectrum of the k-th frame. If spl(k) is small, the total energy of the k-th frame is small; if spl(k) is large, the total energy of the k-th frame is large.
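The formula for ratio_energy_hf(k) is an image that is not reproduced here, so the following is only one reading consistent with the description: the energy of a bin is taken as the squared real part plus the squared imaginary part, the numerator sums bins a5 to F/2 - 1, and the denominator is assumed to sum bins 0 to F/2 - 1. The function name, the argument layout, and the handling of an all-zero frame are assumptions made for this sketch. No sketch is given for spl(k), because the text does not describe its formula.

```python
def ratio_energy_hf(re_fft, im_fft, a5):
    """High-frequency energy distribution ratio of one frame (one reading).

    re_fft, im_fft: real and imaginary FFT parts for bins 0 .. F/2 - 1.
    a5: first bin of the higher frequency range.
    Returns 0.0 for an all-zero frame to avoid dividing by zero; this
    convention is chosen here and is not taken from the text.
    """
    energy = [re * re + im * im for re, im in zip(re_fft, im_fft)]
    total = sum(energy)
    if total == 0:
        return 0.0
    return sum(energy[a5:]) / total


# Four bins with energies 9, 0, 0, 16; bins 2 and 3 form the high range.
print(ratio_energy_hf([3.0, 0.0, 0.0, 4.0], [0.0, 0.0, 0.0, 0.0], a5=2))  # 0.64
```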
Based on the high-frequency energy distribution ratio and the sound pressure level, the distribution characteristics of the energy at high frequencies and at low frequencies are further analyzed.
For the distribution of energy at high frequencies, the "jinghu (京胡) + French male voice" audio signal of Figure 4 is again taken as an example. Figure 8a is the waveform of the input signal, and Figure 8b is the curve of the high-frequency energy distribution ratio ratio_energy_hf(k) corresponding to Figure 8a, where the horizontal axis is the frame index, corresponding to the sample points on the horizontal axis of Figure 8a, and the vertical axis is the high-frequency energy distribution ratio. Figure 8b shows how the ratio curve varies:
In the music signal of the first half, except at the brief pauses between played notes, the high-frequency energy distribution ratio is mostly greater than 0.8, which indicates that the energy of this jinghu signal is continuously distributed in the higher frequency range;
In the speech signal of the second half, a small number of voiced sounds and some unvoiced sounds have a large ratio, while most voiced sounds and some unvoiced sounds have a small ratio, so the ratio curve fluctuates strongly, which indicates that the energy of a speech signal cannot remain continuously distributed in the higher frequency range.
For the k-th frame, to characterize the distribution of energy at high frequencies, the following features are extracted based on the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level spl(k):
num_big_ratio_energy_left: the number of past frames, among the L1 frames preceding the k-th frame, whose energy is continuously distributed at high frequencies;
num_big_ratio_energy_right: the number of future frames, among the L1 frames following the k-th frame, whose energy is continuously distributed at high frequencies.
Before these features are extracted, the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level spl(k) are checked against the following condition: (ratio_energy_hf(k) > a6) && (spl(k) > a7). If the condition is satisfied, it is further analyzed whether the energy of the k-th frame can be continuously distributed in the higher frequency range.
The steps for obtaining num_big_ratio_energy_left are:
Step 1: initialize num_big_ratio_energy_left to 0;
Step 2: initialize the variable num_non_big_ratio to 0;
Step 3: check whether ratio_energy_hf(k-1) and spl(k-1) satisfy the following condition: (ratio_energy_hf(k-1) > a6) && (spl(k-1) > a7).
If the condition is not satisfied, the energy of the (k-1)-th frame is not distributed in the higher frequency range, and this event is recorded: num_non_big_ratio = num_non_big_ratio + 1.
If the condition is satisfied, the energy of the (k-1)-th frame is continuously distributed in the higher frequency range. The number of past frames whose energy is continuously distributed at high frequencies is updated, num_big_ratio_energy_left = num_big_ratio_energy_left + 1, and the variable num_non_big_ratio is set to 0.
In the same way as in step 3, the (k-2)-th frame and the frames before it are checked in turn to determine whether their energy is continuously distributed in the higher frequency range. Before each check, the value of num_non_big_ratio is examined first. If num_non_big_ratio ≥ a8, the state in which the energy is not continuously distributed in the higher frequency range has lasted beyond the preset limit; checking stops and num_big_ratio_energy_left is output. If num_non_big_ratio < a8, that state is still within the preset limit and checking continues until the past L1 frames have all been examined, after which num_big_ratio_energy_left is output.
The steps for obtaining num_big_ratio_energy_right are similar: the (k+1)-th frame and the frames after it are checked in turn to determine whether their energy is continuously distributed in the higher frequency range, and num_big_ratio_energy_right is output.
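The counting procedure for num_big_ratio_energy_left and num_big_ratio_energy_right can be sketched as a single loop that walks away from frame k in one direction. This is an illustrative sketch under stated assumptions: the function name, the direction argument, and the per-frame lists are hypothetical; L1 has to be supplied by the caller; and the check on frame k itself is assumed to have been done beforehand.

```python
def count_big_ratio_frames(ratio_energy_hf, spl, k, L1, direction=-1,
                           a6=0.4, a7=30.0, a8=5):
    """Count frames near frame k whose energy stays in the high range.

    direction=-1 walks over past frames (num_big_ratio_energy_left),
    direction=+1 walks over future frames (num_big_ratio_energy_right).
    """
    num_big_ratio_energy = 0
    num_non_big_ratio = 0  # consecutive frames failing the condition
    for step in range(1, L1 + 1):
        # Stop once the run of failing frames reaches the limit a8.
        if num_non_big_ratio >= a8:
            break
        i = k + direction * step
        if i < 0 or i >= len(ratio_energy_hf):
            break  # ran out of available frames
        if ratio_energy_hf[i] > a6 and spl[i] > a7:
            num_big_ratio_energy += 1
            num_non_big_ratio = 0
        else:
            num_non_big_ratio += 1
    return num_big_ratio_energy


ratio = [0.9, 0.9, 0.1, 0.9, 0.9, 0.9]
spl = [40.0] * 6
print(count_big_ratio_frames(ratio, spl, k=5, L1=5, a8=2))  # 4
print(count_big_ratio_frames(ratio, spl, k=5, L1=5, a8=1))  # 2
```

With a8 = 2 the single low-ratio frame is tolerated and counting resumes behind it; with a8 = 1 the same frame ends the walk.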
For the distribution of energy at low frequencies, the "Korean male voice + ensemble" input signal of Figure 5a is taken as an example, as shown in Figures 9a and 9b. Figure 9a is the waveform of the input signal, and Figure 9b is the curve of the high-frequency energy distribution ratio ratio_energy_hf(k) corresponding to Figure 9a, where the horizontal axis is the frame index and the vertical axis is the high-frequency energy distribution ratio. The curve in Figure 9b shows that in the speech signal of the first half the ratio fluctuates strongly, which indicates that the energy of a speech signal cannot remain continuously distributed at low frequencies; in the music signal of the second half the ratio is mostly less than 0.1, which indicates that the energy of this ensemble signal is continuously distributed at low frequencies.
For the k-th frame, to characterize the distribution of energy at low frequencies, the following features are extracted based on the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level spl(k):
num_small_ratio_energy_left: the number of past frames whose energy is continuously distributed at low frequencies;
num_small_ratio_energy_right: the number of future frames, among the L1 frames following the k-th frame, whose energy is continuously distributed at low frequencies.
Unlike num_big_ratio_energy_left and the other parameters above, num_small_ratio_energy_left is not derived only from the past L1 frames; instead, it is updated each time the ratio_energy_hf(i) (i ≥ 0) of a frame is calculated.
[Figure imgf000035_0002]
It is checked whether ratio_energy_hf(k) satisfies the condition ratio_energy_hf(k) < a9. If the condition is satisfied, it is further analyzed whether the energy of the k-th frame can be continuously distributed in the low-frequency range.
The steps for obtaining num_small_ratio_energy_right are:
Step 1: initialize num_small_ratio_energy_right to 0;
Step 2: check in turn whether the high-frequency energy distribution ratio ratio_energy_hf(i) (k < i ≤ k + L1) of the (k+1)-th frame, the (k+2)-th frame, and so on satisfies the condition ratio_energy_hf(i) < a9. If the condition is not satisfied, checking stops and num_small_ratio_energy_right is output; if the condition is satisfied, num_small_ratio_energy_right = num_small_ratio_energy_right + 1, and checking continues until the L1 future frames have all been examined, after which num_small_ratio_energy_right is output.
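The two steps above amount to counting the unbroken run of low-ratio frames that follows frame k. A minimal sketch, assuming the per-frame ratios are held in a list; the function name is hypothetical and L1 has to be supplied by the caller.

```python
def count_small_ratio_right(ratio_energy_hf, k, L1, a9=0.1):
    """Count consecutive future frames (at most L1) with ratio < a9."""
    num_small_ratio_energy_right = 0
    last = min(k + L1, len(ratio_energy_hf) - 1)
    for i in range(k + 1, last + 1):
        if ratio_energy_hf[i] >= a9:
            break  # the first failing frame ends the count
        num_small_ratio_energy_right += 1
    return num_small_ratio_energy_right


ratio = [0.05, 0.02, 0.03, 0.5, 0.01]
print(count_small_ratio_right(ratio, k=0, L1=4))  # 2
```

Unlike the high-frequency counters, this count has no tolerance for isolated failing frames: the first frame with a ratio of a9 or more ends it.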
In this embodiment, the parameters may be set to a5 = 150, a6 = 0.4, a7 = 30, a8 = 5, and a9 = 0.1.
As described in the analysis of the classification principle above, the vast majority of music signals have characteristics that differ from those of speech signals. Speech signals, by contrast, lack characteristics unique to them, so it is difficult to be 100% certain that a given segment is a speech signal. Classification therefore identifies the music signals that differ clearly from speech signals, and everything else is judged to be a speech signal.
Specifically, the classification rule may be as shown in Figure 10. For the k-th frame, it may include the following steps:
Step 301: determine whether the number of tonal components is greater than 0, that is, num_tonal_flag > 0. If so, the initial classification result is output as a music signal; otherwise, the energy features are analyzed further;
Step 302: analyze the distribution of energy in the higher frequency range by first determining whether (ratio_energy_hf(k) > a6) && (spl(k) > a7). If so, perform step 303; otherwise, perform step 304;
Step 303: determine whether num_big_ratio_energy_right ≥ a11, or num_big_ratio_energy_left + num_big_ratio_energy_right ≥ a10, or num_big_ratio_energy_left ≥ a11. If so, the initial classification result is output as a music signal; otherwise, perform step 304;
Step 304: determine whether the high-frequency energy distribution ratio is less than a9, that is, ratio_energy_hf(k) < a9. If so, perform step 305; otherwise, the initial classification result is output as a speech signal;
Step 305: determine whether num_small_ratio_energy_left ≥ a13, or num_small_ratio_energy_left + num_small_ratio_energy_right ≥ a12, or num_small_ratio_energy_right > a11. If so, the initial classification result is output as a music signal; otherwise, it is output as a speech signal.
In this embodiment, the thresholds may be set to a10 = 15, a11 = 10, a12 = 30, and a13 = 30.
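Steps 301 to 305 form a short decision cascade, which the following sketch puts into code. It is illustrative only: the function name, the argument order, and the string labels are hypothetical, and the pairing of the thresholds a10 to a13 with the individual counters follows one reading of steps 303 and 305 that should be checked against the original filing.

```python
def classify_frame(num_tonal_flag, ratio_k, spl_k,
                   big_left, big_right, small_left, small_right,
                   a6=0.4, a7=30.0, a9=0.1, a10=15, a11=10, a12=30, a13=30):
    """Initial classification of one frame as "music" or "speech"."""
    # Step 301: any surviving tonal component means music.
    if num_tonal_flag > 0:
        return "music"
    # Steps 302 and 303: energy persistently in the higher frequency range.
    if ratio_k > a6 and spl_k > a7:
        if (big_right >= a11 or big_left + big_right >= a10
                or big_left >= a11):
            return "music"
    # Steps 304 and 305: energy persistently at low frequencies.
    if ratio_k < a9:
        if (small_left >= a13 or small_left + small_right >= a12
                or small_right > a11):
            return "music"
    # Everything that is not clearly music is treated as speech.
    return "speech"


print(classify_frame(3, 0.0, 0.0, 0, 0, 0, 0))      # music (step 301)
print(classify_frame(0, 0.9, 40.0, 10, 0, 0, 0))    # music (step 303)
print(classify_frame(0, 0.05, 40.0, 0, 0, 30, 0))   # music (step 305)
print(classify_frame(0, 0.2, 40.0, 0, 0, 0, 0))     # speech
```

The default return value reflects the asymmetry described in the text: only signals that differ clearly from speech are labelled music.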
Referring to Figures 11a and 11b, Figure 11a is the waveform of the input signal "Chinese female voice + ensemble + English male voice + xun (塡) + German male voice + castanets". The three music signals in it (ensemble, xun, and castanets) are each typical in terms of their tone features or energy features. Figure 11b is classification result diagram 1 corresponding to Figure 11a, where the horizontal axis is the sample point and the vertical axis is the classification result: a value of 0 corresponds to a speech signal and a non-zero value to a music signal. From bottom to top, the vertical axis shows four kinds of classification result:
MUSIC_tone feature: the classification result obtained using only the tone feature, drawn as a solid line. It shows which signals in Figure 11a are covered by the classification rule based on the tone feature;
MUSIC_energy feature_1: the classification result obtained using only "energy feature_1", drawn as a dashed line. Here "energy feature_1" refers to whether the energy can be continuously distributed in the higher frequency range. It shows which signals in Figure 11a are covered by the classification rule based on the high-frequency distribution of energy;
MUSIC_energy feature_2: the classification result obtained using only "energy feature_2", drawn as a dash-dot line. Here "energy feature_2" refers to whether the energy can be continuously distributed at low frequencies. It shows which signals in Figure 11a are covered by the classification rule based on the low-frequency distribution of energy;
MUSIC_initial classification result: combining the results of MUSIC_tone feature, MUSIC_energy feature_1, and MUSIC_energy feature_2 gives the initial classification result, drawn as a dotted line.
Figure 11b shows how the different classification rules take effect for different types of music signal:
The ensemble signal between sample points 100000 and 300000: the energy of this music segment fluctuates strongly, and only a few frames have energy continuously distributed in the higher frequency range, so energy features _1 and _2 have little effect. However, the tones of this segment are well sustained, so it can be detected using the tone feature;
The xun signal between sample points 400000 and 550000: the tone feature helps to some extent, but the tone feature alone cannot detect the whole xun signal, as the intermittent solid line in the figure shows. The energy of this segment is mainly distributed at low frequencies, so it can be detected using energy feature _2;
The castanets signal after sample point 600000: almost no tonal components can be detected in this segment, so the tone feature has no effect. The energy of this segment is mainly distributed at high frequencies, so it can be detected using energy feature _1.
The technical solution provided by the embodiments of the present invention can also be adapted to application scenarios that tolerate a larger output delay. For example, when the output delay is L2 + L3 and the current frame is the i-th frame, the solution of the above embodiments is applied first: when i > L2, the audio signal classification result of the (i-L2)-th frame is obtained from past information (the tone distribution parameters and energy distribution parameters of several frames before the (i-L2)-th frame), present information (the tone distribution parameters and energy distribution parameters of the (i-L2)-th frame), and future information (the tone distribution parameters and energy distribution parameters of the L2 frames after the (i-L2)-th frame). The specific implementation is as described in the above embodiments. Further, when i > (L2 + L3), smoothing may be performed: the classification result of the frame to be classified, the (i-L2-L3)-th frame, is corrected according to the initial classification results of the N4 frames before it and the L3 frames after it.
Specifically, the preceding N4 frames may be the preceding L3 frames. For the k-th frame, the correction proceeds as follows:
First, the initial classification results of the L3 frames before the k-th frame and the L3 frames after the k-th frame are counted to obtain the number of frames classified as music signals, num_music, and the number of frames classified as speech signals, num_non_music.
Second, if the initial classification result of the k-th frame is a speech signal and num_music ≥ a14, the classification result of the k-th frame is corrected to a music signal; if the initial classification result of the k-th frame is a music signal and num_non_music ≥ a14, the classification result of the k-th frame is corrected to a speech signal.
In this embodiment, a14 may be set to 16.
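The correction rule can be sketched as a vote over the neighbouring frames. This is a minimal illustration: the function name and the string labels are hypothetical, and the value L3 = 10 used in the example is an assumption, since the text gives a14 but not L3.

```python
def smooth_classification(initial, k, L3, a14=16):
    """Return the corrected label of frame k.

    initial: list of initial labels, each "music" or "speech".
    The L3 frames before and the L3 frames after frame k are counted;
    frame k itself is not included in either count.
    """
    lo = max(0, k - L3)
    hi = min(len(initial) - 1, k + L3)
    neighbours = initial[lo:k] + initial[k + 1:hi + 1]
    num_music = neighbours.count("music")
    num_non_music = neighbours.count("speech")
    if initial[k] == "speech" and num_music >= a14:
        return "music"
    if initial[k] == "music" and num_non_music >= a14:
        return "speech"
    return initial[k]


# One isolated "speech" frame inside a music passage is corrected.
labels = ["music"] * 10 + ["speech"] + ["music"] * 10
print(smooth_classification(labels, k=10, L3=10))  # music
```

With a14 = 16 and 2 × L3 = 20 neighbours, a label is flipped only when at least 80% of the surrounding frames disagree with it.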
Figure 12a is the waveform of the input signal "Chinese female voice + ensemble + English male voice + xun (塡) + German male voice + castanets", the same as in Figure 11a. Figure 12 further shows the result after smoothing. From bottom to top, the vertical axis shows two kinds of classification result:
MUSIC_initial classification result: drawn as a solid line;
MUSIC_smoothed result: the result of smoothing the initial classification result, drawn as a dashed line.
Figure 12 shows that, for the ensemble signal between sample points 100000 and 300000, the initial classification result contains one misclassification between points 250000 and 300000, where the music signal is judged to be a speech signal; for the xun signal between points 400000 and 550000, the initial classification result contains one misclassification near the end of the signal, where the music signal is again judged to be a speech signal. Smoothing corrects both misclassifications.
In addition, for application scenarios in which no output delay can be introduced, the principles and steps for obtaining the tone distribution parameters and the energy distribution parameters are similar to those of the above technical solution. The only difference is that classification refers to past and present information only: because there is no output delay, the classification result must be obtained in real time and future information cannot be consulted.
Specifically, tone feature extraction may follow the above embodiment and can be divided into three steps:
A. Obtain the initial tone detection result, that is, the tone distribution parameters of each frame;
B. Screen the initial tone detection result through continuity analysis;
C. Based on the screened tone detection result, extract the tone feature, that is, the number of tonal components of the frame to be classified.
Step A may follow the above embodiment; steps B and C are described in detail below.
In the continuity analysis, let tonal_flag_original[k][f] (0 ≤ f < F/2) denote the initial tone detection result: a value of 1 means that the k-th frame has a tonal component at f, and a value of 0 means that it does not. Relative to the k-th frame, the L1 frames preceding it are called past frames.
Suppose the k-th frame has a tonal component at fx, that is, tonal_flag_original[k][fx] = 1. For the tonal component at fx in the k-th frame, the steps of the tone continuity analysis are:
Step 1: count how many past frames have tonal components that are continuous with this tonal component; the count is denoted num_left. Initialize the variable num_left to 0, initialize the variable num_non_tonal, which records discontinuity, to 0, and record the position of the tonal component under analysis: pos_cur = fx. Then examine the values of tonal_flag_original[k-1][f] for (pos_cur - 3) ≤ f ≤ (pos_cur + 3).
If all the values are 0, the (k-1)-th frame has no tonal component in the interval (pos_cur - 3) ≤ f ≤ (pos_cur + 3); that is, there is a break between the tonal component at fx in the k-th frame and the tonal components of the (k-1)-th frame. This discontinuity event is recorded: num_non_tonal = num_non_tonal + 1.
If tonal_flag_original[k-1][pos_cur + x] = 1 for some x with -3 ≤ x ≤ 3, the (k-1)-th frame has a tonal component at pos_cur + x; that is, the tonal component at fx in the k-th frame is continuous with a tonal component of the (k-1)-th frame. In that case:
Record the position of the tonal component in the (k-1)-th frame: pos_cur = pos_cur + x;
Count the frame as continuous: num_left = num_left + 1;
Set the variable num_non_tonal to 0.
In the same way, the (k-1)-th frame, the (k-2)-th frame, and so on are checked in turn for continuity with the tonal components of the frame preceding each of them. Before each check, the value of num_non_tonal is examined first:
If num_non_tonal ≥ b2, the break between the tonal component under analysis and the tonal components of the past frames has exceeded the preset limit and continuity no longer holds. Checking stops and num_left is output;
If num_non_tonal < b2, the break is still within the preset limit and checking continues until the past L1 frames have all been examined, after which num_left is output.
Step 2: screen the initial tone detection result according to num_left.
If the condition num_left ≥ b1 is satisfied, the tonal component at fx in the k-th frame has a certain continuity and the initial tone detection result is retained; otherwise it is discarded.
In this embodiment, the thresholds may be set to b1 = 5 and b2 = 5.
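The two-step continuity analysis for the no-delay case can be sketched as follows. The sketch is illustrative and rests on several assumptions: the function names are hypothetical; the detection result is held as a list of per-frame 0/1 lists; L1 has to be supplied by the caller; and when several bins inside the ±3 window hold a tonal component, the bin nearest to pos_cur is followed, a choice the text does not specify.

```python
def count_continuous_past_frames(tonal_flag_original, k, fx, L1,
                                 b2=5, width=3):
    """Step 1: return num_left for the tonal component at bin fx of frame k."""
    num_left = 0
    num_non_tonal = 0  # consecutive past frames without a nearby component
    pos_cur = fx
    n_bins = len(tonal_flag_original[k])
    # Walk backwards over at most L1 past frames.
    for j in range(k - 1, max(k - L1, 0) - 1, -1):
        if num_non_tonal >= b2:
            break  # the gap has exceeded the allowed length
        lo = max(pos_cur - width, 0)
        hi = min(pos_cur + width, n_bins - 1)
        candidates = [f for f in range(lo, hi + 1)
                      if tonal_flag_original[j][f] == 1]
        if candidates:
            # Assumption: follow the candidate nearest to the current bin.
            pos_cur = min(candidates, key=lambda f: abs(f - pos_cur))
            num_left += 1
            num_non_tonal = 0
        else:
            num_non_tonal += 1
    return num_left


def keep_tonal_component(num_left, b1=5):
    """Step 2: retain the detection only if it is continuous enough."""
    return num_left >= b1


# Eight frames of 16 bins: a tone at bin 5 in frames 0..6, bin 6 in frame 7.
flags = [[0] * 16 for _ in range(8)]
for j in range(7):
    flags[j][5] = 1
flags[7][6] = 1
num_left = count_continuous_past_frames(flags, k=7, fx=6, L1=7)
print(num_left, keep_tonal_component(num_left))  # 7 True
```

Tracking pos_cur from frame to frame lets the analysis follow a tone whose frequency drifts slowly, while the b2 limit keeps short dropouts from breaking the chain.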
Further, as in the above embodiment, for the screened tone detection result, the number of tonal components of the frame to be classified is counted over the range from a lower frequency up to the high-frequency limit (corresponding to b3 ≤ f < F/2) and is denoted num_tonal_flag. The larger num_tonal_flag is, the longer the tonal components in the corresponding signal last, and the more likely the signal is a music signal. In this embodiment, b3 is set to 40.
For energy feature extraction, the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level spl(k) of each frame are calculated first, where k denotes the frame index. The formulas for both are the same as those given above.
Based on the high-frequency energy distribution ratio and the sound pressure level, the distribution characteristics of the energy at high and low frequencies are further analyzed.
For the k-th frame, to characterize the distribution of energy at high frequencies, the feature num_big_ratio_energy_left is extracted based on the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level spl(k). This feature is the number of past frames, among the L1 frames preceding the k-th frame, whose energy is continuously distributed at high frequencies.
Before this feature is extracted, the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level spl(k) are checked against the following condition: (ratio_energy_hf(k) > b4) && (spl(k) > b5). If the condition is satisfied, it is further analyzed whether the energy of the k-th frame can be continuously distributed in the higher frequency range.
The steps for obtaining num_big_ratio_energy_left are:
Step 1: initialize num_big_ratio_energy_left to 0;
Step 2: initialize the variable num_non_big_ratio to 0;
Step 3: check whether ratio_energy_hf(k-1) and spl(k-1) satisfy the following condition: (ratio_energy_hf(k-1) > b4) && (spl(k-1) > b5).
If the condition is not satisfied, the energy of the (k-1)-th frame is not distributed in the higher frequency range, and this event is recorded: num_non_big_ratio = num_non_big_ratio + 1.
If the condition is satisfied, the energy of the (k-1)-th frame is continuously distributed in the higher frequency range:
Count the number of past frames whose energy is continuously distributed at high frequencies: num_big_ratio_energy_left = num_big_ratio_energy_left + 1;
Set the variable num_non_big_ratio to 0.
In the same way as in step 3, the (k-2)-th frame and the frames before it are checked in turn to determine whether their energy is continuously distributed in the higher frequency range. Before each check, the value of num_non_big_ratio is examined first:
If num_non_big_ratio has reached the preset threshold, the state in which the energy is not continuously distributed in the higher frequency range has lasted beyond the preset limit; checking stops and num_big_ratio_energy_left is output.
If num_non_big_ratio is below the preset threshold, that state is still within the preset limit and checking continues until the past L1 frames have all been examined, after which num_big_ratio_energy_left is output.
In addition, for the k-th frame, to characterize the distribution of energy at low frequencies, the feature num_small_ratio_energy_left is extracted based on the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level spl(k). This feature is the number of past frames whose energy is continuously distributed at low frequencies.
Unlike num_big_ratio_energy_left, num_small_ratio_energy_left is not derived from the past L1 frames alone; it is updated every time a value ratio_energy_hf(i) (i ≥ 0) is computed.
The steps for obtaining num_small_ratio_energy_left are as follows:
When i = 0, initialize num_small_ratio_energy_left to 0.
For each frame, check whether ratio_energy_hf(i) (i ≥ 0) satisfies the condition ratio_energy_hf(i) < b7.
If the condition is satisfied, num_small_ratio_energy_left = num_small_ratio_energy_left + 1.
If the condition is not satisfied, num_small_ratio_energy_left = 0.
In this embodiment, the thresholds are set to b4 = 0.3, b5 = 30, b6 = 5, and b7 = 0.1.
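The two counters described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the function names, the list-based inputs ratio_hf and spl, and the choice to return 0 when frame k itself fails the precondition are hypothetical, and the threshold constants are the embodiment values as read from the text.

```python
from typing import Sequence

# Embodiment thresholds as read from the text (b4, b5, b6, b7).
B4, B5, B6, B7 = 0.3, 30.0, 5, 0.1


def count_high_freq_run(ratio_hf: Sequence[float], spl: Sequence[float],
                        k: int, l1: int) -> int:
    """Sketch of num_big_ratio_energy_left for frame k over at most l1 past frames."""
    # Precondition on frame k itself; returning 0 on failure is an assumption.
    if not (ratio_hf[k] > B4 and spl[k] > B5):
        return 0
    num_big = 0      # num_big_ratio_energy_left
    num_non_big = 0  # num_non_big_ratio: consecutive frames failing the test
    for j in range(k - 1, max(k - l1, 0) - 1, -1):
        if num_non_big >= B6:
            break  # too many consecutive misses, stop scanning
        if ratio_hf[j] > B4 and spl[j] > B5:
            num_big += 1
            num_non_big = 0
        else:
            num_non_big += 1
    return num_big


def update_low_freq_run(prev_count: int, ratio_hf_i: float) -> int:
    """Sketch of the per-frame update of num_small_ratio_energy_left."""
    return prev_count + 1 if ratio_hf_i < B7 else 0
```

The high-frequency counter tolerates short gaps (fewer than b6 consecutive misses), whereas the low-frequency counter resets on the first frame that fails its test.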
Specifically, the classification rule may be as shown in FIG. 13. For the k-th frame, it may include the following steps:
Step 401: Determine whether the number of tonal components is greater than 0, that is, num_tonal_flag > 0. If so, output music signal as the initial classification result; otherwise, go on to analyze the energy features;
Step 402: Analyze how energy is distributed in the higher frequency range. First determine whether (ratio_energy_hf(k) > b4) && (spl(k) > b5). If so, perform step 403; otherwise, perform step 404;
Step 403: Determine whether num_big_ratio_energy_left ≥ b8. If so, output music signal as the initial classification result; otherwise, perform step 404;
Step 404: Determine whether the high-frequency energy distribution ratio is below b7, that is, ratio_energy_hf(k) ≤ b7. If so, perform step 405; otherwise, output speech signal as the initial classification result;
Step 405: Determine whether num_small_ratio_energy_left ≥ b9. If so, output music signal as the initial classification result; otherwise, output speech signal as the initial classification result. In this embodiment, b8 = 10 and b9 = 30 may be used.
FIG. 14a is waveform diagram 3 of the input signal "Chinese female voice + ensemble + English male voice + xun + German male voice + castanets", identical to FIG. 11a. The three music signals in it (ensemble, xun, and castanets) are each fairly typical in terms of tonal features or energy features. FIG. 14b gives an example of the real-time classification result, where the horizontal axis is the sample point and the vertical axis is the classification result: a value of 0 corresponds to a speech signal and a nonzero value corresponds to a music signal. As FIG. 14a and FIG. 14b show, because no future information is available for reference, a small portion of the music signal is misjudged as a speech signal.
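The decision rule of steps 401 to 405 can be sketched as a single function. This is an illustrative reading of the flow, not a definitive implementation: the function name, the argument names, and the string labels are assumptions, and the default thresholds are the embodiment values as read from the text.

```python
def classify_frame(num_tonal_flag: int, ratio_hf_k: float, spl_k: float,
                   num_big: int, num_small: int,
                   b4: float = 0.3, b5: float = 30.0, b7: float = 0.1,
                   b8: int = 10, b9: int = 30) -> str:
    """Return "music" or "speech" for frame k following steps 401-405."""
    # Step 401: any tonal component satisfying the continuity constraint.
    if num_tonal_flag > 0:
        return "music"
    # Steps 402-403: energy in the high band now, and persistently in the past.
    if ratio_hf_k > b4 and spl_k > b5 and num_big >= b8:
        return "music"
    # Steps 404-405: energy in the low band now, and persistently in the past.
    if ratio_hf_k <= b7 and num_small >= b9:
        return "music"
    return "speech"
```

Here num_big and num_small stand for num_big_ratio_energy_left and num_small_ratio_energy_left. A failed check in step 402 or 403 falls through to step 404, which is why the two energy tests can be written as independent conditions.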
The technical solutions of the above embodiments cover three cases: no output delay, a small output delay, and a large output delay. In scenarios where the output delay requirement is not fixed, such as speech quality assessment, the classification result for any of the three cases can be provided as needed. As the allowed output delay grows, the classification can draw not only on information from before the frame to be classified but also on information from after it, and the more reference information is available, the higher the classification accuracy. Specifically, FIG. 15 is a flowchart of the speech classification method according to an embodiment of the present invention when the output delay is not fixed. As shown in FIG. 15, the method includes the following steps:
Step 501: Perform an FFT on the current frame, the i-th frame;
Step 502: Based on the FFT result, obtain and buffer the tonal distribution parameters of the i-th frame;
Step 503: Based on the FFT result, obtain and buffer the energy distribution parameters of the i-th frame;
In steps 501 to 503, this processing is applied not only to the i-th frame but also to every frame received before it, so the tonal distribution parameters and energy distribution parameters of those frames are available as well.
Step 504: Generate and buffer the real-time classification result of the i-th frame. Specifically, this step uses the past information generated and buffered in steps 502 and 503, that is, the tonal distribution parameters and energy distribution parameters of the frames before the i-th frame, to obtain the tonal features and energy features of the i-th frame, and then generates and buffers the real-time classification result. For the specific implementation, refer to the foregoing embodiments;
Step 505: When i > L1, where L1 is the small output delay that is allowed, in addition to the real-time classification result of each received frame, the initial classification result of the (i-L1)th frame may also be generated and buffered. Specifically, generating the initial classification result of the (i-L1)th frame can draw on past information (the tonal and energy distribution parameters of several frames before the (i-L1)th frame), present information (those of the (i-L1)th frame itself), and future information (those of the L1 frames after the (i-L1)th frame), which yields a more accurate initial classification result for the (i-L1)th frame. For the specific implementation, refer to the foregoing embodiments;
Step 506: When i > (L2+L3), generate and buffer the corrected classification result of the (i-L2-L3)th frame. Specifically, the initial classification result of the (i-L2-L3)th frame is corrected with reference to past information, that is, the initial classification results of several frames before the (i-L2-L3)th frame, and future information, that is, the initial classification results of the L3 frames after the (i-L2-L3)th frame. For the specific implementation, refer to the foregoing embodiments;
Step 507: Depending on the allowed output delay, select among the classification results of steps 504, 505, and 506 as the classification result of the j-th frame, the frame to be classified:
If the output delay satisfies (i-j) >= (L2+L3), output the optimal result, that is, the corrected classification result of the j-th frame;
If the output delay satisfies (L2+L3) > (i-j) >= L1, output the suboptimal result, that is, the initial classification result of the j-th frame;
If the output delay satisfies (i-j) < L1, output the zero-delay result, that is, the real-time classification result of the j-th frame.
In the above embodiments of the present invention, L2 may be set equal to L1.
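The selection in step 507 amounts to a lookup keyed on the delay i-j. The following Python sketch illustrates it under some assumptions: the function name and the three dictionaries (mapping frame index to a buffered result) are hypothetical, and the sketch assumes L1 ≤ L2+L3 so that the three delay ranges do not overlap.

```python
from typing import Dict


def select_result(i: int, j: int, l1: int, l2: int, l3: int,
                  realtime: Dict[int, str],
                  initial: Dict[int, str],
                  corrected: Dict[int, str]) -> str:
    """Pick the best buffered result for frame j once frame i has arrived."""
    delay = i - j
    if delay >= l2 + l3:
        return corrected[j]  # optimal: corrected classification result
    if delay >= l1:
        return initial[j]    # suboptimal: initial classification result
    return realtime[j]       # zero delay: real-time classification result
```

For example, with L1 = L2 = 2 and L3 = 3, frame 3 would be reported from the real-time buffer until frame 5 arrives, from the initial buffer from frame 5 through frame 7, and from the corrected buffer from frame 8 onward.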
FIG. 16a is waveform diagram 4 of the input signal "Chinese female voice + ensemble + English male voice + xun + German male voice + castanets", identical to FIG. 11a. The three music signals in it (ensemble, xun, and castanets) are each fairly typical in terms of tonal features or energy features. FIG. 16b shows the classification results obtained by the three classification methods. The three results plotted along the vertical axis are, in order: MUSIC_real-time classification result, drawn as a solid line; MUSIC_initial classification result, drawn as a dotted line; and MUSIC_corrected classification result, drawn as a dashed line.
As FIG. 16b shows, in terms of classification accuracy, corrected classification result > initial classification result > real-time classification result. Therefore, when the output delay permits, the user can exploit as much future information as possible and output the best classification result obtainable under the current conditions.
In the technical solutions provided by the embodiments of the present invention, the extracted features capture more fundamental differences between music signals and speech signals, so classification accuracy at low sampling rates improves markedly. Because the feature extraction method is not tied to a particular sampling rate, it applies to signal classification at high sampling rates as well as low ones. While keeping algorithm complexity low, the user can flexibly choose the real-time classification result, the suboptimal classification result, or the optimal classification result as required.
An embodiment of the present invention further provides an audio signal classification processing apparatus corresponding to the above method. FIG. 17 is a schematic structural diagram of the audio signal classification processing apparatus in an embodiment of the present invention. As shown in FIG. 17, the apparatus includes a first obtaining module 11 and a classification determining module 12. The first obtaining module 11 is configured to obtain at least one of: the number of tonal components in the frame to be classified of the audio signal that satisfy the continuity constraint, the number of persistent frames of the frame to be classified in the low-frequency region, and the number of persistent frames of the frame to be classified in the high-frequency region. The classification determining module 12 is configured to determine, according to at least one of these three quantities, that the frame to be classified in the audio signal is a music signal or that it is a speech signal.
The technical solutions provided by the above embodiments are built on the characteristics of music signals: the tones of a music signal last longer than those of a speech signal, and the energy of a music signal can stay in the high-frequency region or the low-frequency region, whereas the energy of a speech signal usually cannot. On that basis, the technical solutions first obtain the number of tonal components in the frame to be classified that satisfy the continuity constraint, together with the number of persistent frames of the frame to be classified in the low-frequency region and/or in the high-frequency region, and then use this information to determine whether the frame to be classified is a music signal or a speech signal. The audio signal classification processing method provided by these technical solutions can improve the accuracy of audio signal classification and meet the requirements of speech quality assessment.
In the above embodiments, the steps performed by each module differ depending on whether there is an output delay and how long it is. The cases are as follows:
First, when the classification result of the frame to be classified is obtained in real time, the first obtaining module is specifically configured to obtain the tonal distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it, and to obtain from them the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N1 is a positive integer; or it is specifically configured to obtain the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it, and to obtain from them the number of persistent frames of the frame to be classified in the low-frequency region or in the high-frequency region;
The classification determining module 12 is specifically configured to determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of persistent frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of persistent frames of the frame to be classified in the high-frequency region is greater than a third threshold; otherwise, it determines that the frame to be classified in the audio signal is a speech signal.
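Because the module may receive only some of the three quantities ("at least one of"), the determination can be sketched as an any-of test over whichever features are present. The function name, the use of None for a feature that was not obtained, and the parameter names are assumptions made for illustration; the text does not give numeric values for the first to third thresholds at this point, so they are left as arguments.

```python
from typing import Optional


def is_music(thr1: float, thr2: float, thr3: float,
             num_tonal: Optional[int] = None,
             low_run: Optional[int] = None,
             high_run: Optional[int] = None) -> bool:
    """True if any available feature exceeds its threshold, else False (speech)."""
    checks = [
        (num_tonal, thr1),  # tonal components satisfying the continuity constraint
        (low_run, thr2),    # persistent frames in the low-frequency region
        (high_run, thr3),   # persistent frames in the high-frequency region
    ]
    return any(value is not None and value > thr for value, thr in checks)
```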
Specifically, the first obtaining module obtains the tonal distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it as follows:
Perform a fast Fourier transform on the frame to be classified and on the N1 frames before it in the received audio signal to obtain the power density spectrum; from the power density spectrum, obtain the frequency-domain distribution information of the tonal components of the frame to be classified as its tonal distribution parameters, and the frequency-domain distribution information of the tonal components of the N1 preceding frames as their tonal distribution parameters.
The classification determining module obtains the number of tonal components in the frame to be classified that satisfy the continuity constraint, according to the tonal distribution parameters of the frame to be classified and of the N1 frames before it, as follows:
According to the frequency-domain distribution information of the tonal components of the frame to be classified and of the N1 frames before it in the received audio signal, obtain the number of tonal components in the frame to be classified whose number of persistent frames is greater than a sixth threshold.
In addition, the first obtaining module obtains the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it as follows:
Obtain the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as its energy distribution parameters, and the high-frequency energy distribution ratio and sound pressure level of the N1 preceding frames as their energy distribution parameters.
The classification determining module obtains the number of persistent frames of the frame to be classified in the low-frequency region, according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it, as follows:
According to the high-frequency energy distribution ratio and sound pressure level of the frame to be classified and of the N1 frames before it in the received audio signal, obtain the number of persistent frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold.
The classification determining module obtains the number of persistent frames of the frame to be classified in the high-frequency region, according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before it, as follows:
According to the high-frequency energy distribution ratio and sound pressure level of the frame to be classified and of the N1 frames before it in the received audio signal, obtain the number of persistent frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
Second, when the classification result of the frame to be classified is obtained with a delay of L1 frames, where L1 is a positive integer, the first obtaining module is specifically configured to obtain the tonal distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it, and to obtain from them the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N2 is a positive integer; or it is specifically configured to obtain the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it, and to obtain from them the number of persistent frames of the frame to be classified in the low-frequency region or in the high-frequency region;
The classification determining module is specifically configured to determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of persistent frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of persistent frames of the frame to be classified in the high-frequency region is greater than a third threshold; otherwise, it determines that the frame to be classified in the audio signal is a speech signal.
The first obtaining module obtains the tonal distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it as follows:
Perform a fast Fourier transform on the frame to be classified, the N2 frames before it, and the L1 frames after it in the received audio signal to obtain the power density spectrum; from the power density spectrum, obtain the frequency-domain distribution information of the tonal components of the frame to be classified as its tonal distribution parameters, the frequency-domain distribution information of the tonal components of the N2 preceding frames as their tonal distribution parameters, and the frequency-domain distribution information of the tonal components of the L1 following frames as their tonal distribution parameters.
The classification determining module obtains the number of tonal components in the frame to be classified that satisfy the continuity constraint, according to the tonal distribution parameters of the frame to be classified, of the N2 frames before it, and of the L1 frames after it, as follows:
According to the frequency-domain distribution information of the tonal components of the frame to be classified, of the N2 frames before it, and of the L1 frames after it in the received audio signal, obtain the number of tonal components in the frame to be classified whose number of persistent frames is greater than a sixth threshold.
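One way to picture this count is sketched below. The representation is an assumption made for illustration and is not taken from the text: each frame's tonal distribution parameters are modelled as a set of FFT bin indices, and a tonal component is treated as persistent across adjacent frames only when the same bin index appears in each of them. The function name and arguments are likewise hypothetical.

```python
from typing import List, Set


def count_persistent_tones(frames: List[Set[int]], j: int, thr6: int) -> int:
    """Count tonal bins of frame j whose run of consecutive frames exceeds thr6.

    frames holds the window: the preceding frames, frame j, and the following frames.
    """
    count = 0
    for bin_idx in frames[j]:
        run = 1  # frame j itself
        t = j - 1
        while t >= 0 and bin_idx in frames[t]:  # extend into past frames
            run += 1
            t -= 1
        t = j + 1
        while t < len(frames) and bin_idx in frames[t]:  # extend into future frames
            run += 1
            t += 1
        if run > thr6:
            count += 1
    return count
```

In the real-time case the window simply ends at frame j, so only the backward extension contributes.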
In addition, the first obtaining module obtains the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it as follows:
Obtain the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as its energy distribution parameters, the high-frequency energy distribution ratio and sound pressure level of the N2 preceding frames as their energy distribution parameters, and the high-frequency energy distribution ratio and sound pressure level of the L1 following frames as their energy distribution parameters.
The classification determining module obtains the number of persistent frames of the frame to be classified in the low-frequency region, according to the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it, as follows:
According to the high-frequency energy distribution ratio and sound pressure level of the frame to be classified, of the N2 frames before it, and of the L1 frames after it in the received audio signal, obtain the number of persistent frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold.
The classification determining module obtains the number of persistent frames of the frame to be classified in the high-frequency region, according to the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it, as follows:
According to the high-frequency energy distribution ratio and sound pressure level of the frame to be classified, of the N2 frames before it, and of the L1 frames after it in the received audio signal, obtain the number of persistent frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
Third, when the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, where L2 and L3 are positive integers, the first acquiring module is specifically configured to: acquire the tone distribution parameters of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the audio signal, and obtain from those parameters the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N3 is a positive integer; or acquire the energy distribution parameters of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the audio signal, and obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region or the number of consecutive frames of the frame to be classified in the high-frequency region.
The classification processing module is specifically configured to: determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified is a speech signal; if the frame to be classified is determined to be a music signal, determine whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fourth threshold, and if so, correct the frame to be classified to a speech signal; and if the frame to be classified is determined to be a speech signal, determine whether the number of frames determined to be music signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fifth threshold, and if so, correct the frame to be classified to a music signal, where N4 is a positive integer.
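The correction step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `corrected_label`, the string labels, and the choice to exclude the frame itself from the neighbour count are all assumptions made for the example.

```python
def corrected_label(labels, i, n4, l3, t4, t5):
    """Correct labels[i] using its neighbours (hypothetical helper).

    labels : list of "music" / "speech" initial decisions, one per frame
    i      : index of the frame to be classified
    n4, l3 : number of preceding / following frames to inspect
    t4, t5 : fourth and fifth thresholds from the text
    """
    before = labels[max(0, i - n4):i]      # up to N4 frames before frame i
    after = labels[i + 1:i + 1 + l3]       # up to L3 frames after frame i
    neighbours = before + after            # assumption: frame i itself is excluded
    if labels[i] == "music" and neighbours.count("speech") > t4:
        return "speech"                    # isolated music decision is overturned
    if labels[i] == "speech" and neighbours.count("music") > t5:
        return "music"                     # isolated speech decision is overturned
    return labels[i]
```

Because the L3 following frames must already have initial decisions, this step is what adds the extra L3 frames of delay on top of the L2 frames needed for the features.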
The first acquiring module acquiring the tone distribution parameters of the frame to be classified, of the N3 frames preceding it, and of the L2 frames following it in the audio signal includes:
Performing a fast Fourier transform on the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal to obtain a power density spectrum; and obtaining, from the power density spectrum, the frequency-domain distribution information of the tonal components of the frame to be classified as the tone distribution parameters of the frame to be classified, the frequency-domain distribution information of the tonal components of the N3 preceding frames as the tone distribution parameters of those frames, and the frequency-domain distribution information of the tonal components of the L2 following frames as the tone distribution parameters of those frames.
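As a rough illustration of this step, the sketch below computes a power spectrum for one frame and picks out spectral peaks as tonal components. Several things here are assumptions rather than details from the text: a plain O(n²) DFT stands in for the fast Fourier transform so that the example needs only the standard library, and the peak-picking rule (a local maximum that exceeds the bins two positions away by `margin_db`) is a simple stand-in for whatever tonal-component criterion the embodiment uses.

```python
import cmath
import math


def power_spectrum(frame):
    """Power of each non-negative frequency bin of a real frame (plain DFT)."""
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        power.append(abs(acc) ** 2 / n)
    return power


def tonal_bins(power, margin_db=7.0, floor=1e-12):
    """Bin indices treated as tonal components (assumed peak-picking rule)."""
    db = [10.0 * math.log10(max(p, floor)) for p in power]
    peaks = []
    for k in range(2, len(db) - 2):
        is_local_max = db[k] > db[k - 1] and db[k] > db[k + 1]
        stands_out = (db[k] - db[k - 2] >= margin_db
                      and db[k] - db[k + 2] >= margin_db)
        if is_local_max and stands_out:
            peaks.append(k)
    return peaks
```

The list of bin indices returned by `tonal_bins` plays the role of the "frequency-domain distribution information of the tonal components" for one frame; the same computation would be repeated for each preceding and following frame in the window.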
The classification determining module obtaining, from the tone distribution parameters of the frame to be classified, of the N3 frames preceding it, and of the L2 frames following it, the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
Obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
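One way to read the continuity constraint is that a tonal component of the frame to be classified must also appear, at the same frequency, in an unbroken run of neighbouring frames. The sketch below counts such components. It assumes each frame's tonal components are given as a set of bin indices and that "same component" means "exactly the same bin"; a real implementation might tolerate small frequency drift. The names are hypothetical.

```python
def tonal_run_length(frames_bins, cur, b):
    """Length of the unbroken run of frames around `cur` that contain bin `b`."""
    run = 1
    j = cur - 1
    while j >= 0 and b in frames_bins[j]:       # extend into preceding frames
        run += 1
        j -= 1
    j = cur + 1
    while j < len(frames_bins) and b in frames_bins[j]:  # extend into following frames
        run += 1
        j += 1
    return run


def count_persistent_tonals(frames_bins, cur, t6):
    """Number of tonal components of frame `cur` lasting more than t6 frames."""
    return sum(1 for b in frames_bins[cur]
               if tonal_run_length(frames_bins, cur, b) > t6)
```

Here `frames_bins` holds the N3 preceding frames, the frame to be classified at index `cur`, and the L2 following frames, in time order.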
In addition, the first acquiring module acquiring the energy distribution parameters of the frame to be classified, of the N3 frames preceding it, and of the L2 frames following it in the audio signal includes:
Acquiring the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of those frames, and the high-frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of those frames.
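The two energy distribution parameters could be computed per frame roughly as below. This passage does not give their formulas, so both definitions are assumptions: the ratio is taken as the share of spectral power at or above a cutoff bin, and the sound pressure level as the mean-square level of the samples in dB relative to an arbitrary reference amplitude.

```python
import math


def high_freq_energy_ratio(power, cutoff_bin):
    """Share of total spectral power in bins >= cutoff_bin (assumed definition)."""
    total = sum(power)
    if total == 0:
        return 0.0                      # silent frame: no high-frequency energy
    return sum(power[cutoff_bin:]) / total


def sound_pressure_level_db(frame, reference=1.0, floor=1e-12):
    """Mean-square level of the frame in dB re `reference` (assumed definition)."""
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(mean_sq, floor) / reference ** 2)
```

A calibrated system would map the digital level to a physical sound pressure level; the `reference` parameter is only a placeholder for that calibration.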
The classification determining module obtaining, from the energy distribution parameters of the frame to be classified, of the N3 frames preceding it, and of the L2 frames following it in the audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
Obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, the number of consecutive frames, including the frame to be classified, for which the high-frequency energy distribution ratio is less than an eighth threshold.
The classification determining module obtaining, from the energy distribution parameters of the frame to be classified, of the N3 frames preceding it, and of the L2 frames following it in the audio signal, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
Obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, the number of consecutive frames, including the frame to be classified, for which the high-frequency energy distribution ratio is greater than a ninth threshold and the sound pressure level is greater than a tenth threshold.
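The two run counts just described can be sketched with one shared helper. The function names are hypothetical, and returning 0 when the frame to be classified itself fails the condition is an assumption about how "including the frame to be classified" should be read.

```python
def run_including(flags, cur):
    """Length of the unbroken run of True flags that contains index `cur`."""
    if not flags[cur]:
        return 0                        # assumption: no run if the frame itself fails
    run = 1
    j = cur - 1
    while j >= 0 and flags[j]:          # extend into preceding frames
        run += 1
        j -= 1
    j = cur + 1
    while j < len(flags) and flags[j]:  # extend into following frames
        run += 1
        j += 1
    return run


def low_band_run(ratios, cur, t8):
    """Consecutive frames whose high-frequency ratio is below the eighth threshold."""
    return run_including([r < t8 for r in ratios], cur)


def high_band_run(ratios, spls, cur, t9, t10):
    """Consecutive frames with ratio above the ninth and level above the tenth threshold."""
    return run_including([r > t9 and s > t10 for r, s in zip(ratios, spls)], cur)
```

The level condition in `high_band_run` keeps quiet frames, whose spectrum may be dominated by high-frequency noise, from counting as high-band energy.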
In all three of the foregoing cases, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold, as obtained by the first acquiring module, is the number of such tonal components that are greater than a seventh threshold in the frequency domain.
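If the seventh threshold is read as a lower frequency bound (an interpretation, since the text says only "greater than the seventh threshold in the frequency domain"), the restriction amounts to a filter on the persistent components. The sketch assumes components are FFT bin indices and the threshold is given in hertz; both are illustrative choices.

```python
def bin_to_hz(b, sample_rate, fft_size):
    """Centre frequency in Hz of FFT bin `b`."""
    return b * sample_rate / fft_size


def count_above_frequency(persistent_bins, sample_rate, fft_size, t7_hz):
    """Number of persistent tonal components lying above the seventh threshold."""
    return sum(1 for b in persistent_bins
               if bin_to_hz(b, sample_rate, fft_size) > t7_hz)
```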
An embodiment of the present invention further provides an audio signal classification processing device. FIG. 18 is a schematic structural diagram of the audio signal classification processing device in an embodiment of the present invention. As shown in FIG. 18, the device includes a receiver 21 and a processor 22. The receiver 21 is configured to receive an audio signal. The processor 22 is connected to the receiver 21 and is configured to: obtain at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified of the audio signal received by the receiver, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region; and determine, according to at least one of these three quantities, that the frame to be classified in the audio signal is a music signal or that it is a speech signal.
The technical solutions provided by the foregoing embodiments of the present invention are based mainly on the characteristics of music signals. For example, the tones of a music signal last a relatively long time, whereas those of a speech signal last a relatively short time; and the energy of a music signal may remain concentrated in the high-frequency region or the low-frequency region, whereas that of a speech signal usually cannot. Building on these characteristics, the technical solutions provided by the embodiments of the present invention first obtain the number of tonal components in the frame to be classified that satisfy the continuity constraint, and the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, and then determine from this information whether the frame to be classified is a music signal or a speech signal. The audio signal classification processing method provided by these technical solutions can improve the accuracy of audio signal classification and meet the requirements of speech quality assessment.
In the foregoing embodiments of the present invention, the processor may be implemented as a software process, or by a hardware entity device such as a digital signal processing (Digital Signal Processing, DSP) chip.
In the foregoing embodiments of the present invention, depending on whether the classification result of the frame to be classified is obtained in real time, or on how long a delay is allowed for outputting the classification result, the processor may operate in the following cases:
First, when the classification result of the frame to be classified is obtained in real time, the processor is specifically configured to: acquire the tone distribution parameters of the frame to be classified and the N1 frames preceding it in the audio signal, and obtain from those parameters the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N1 is a positive integer; acquire the energy distribution parameters of the frame to be classified and the N1 frames preceding it in the audio signal, and obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region; and determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified is a speech signal.
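The decision rule in this case is a logical OR over three threshold tests. A minimal sketch follows; the function name and the use of `None` for a feature that was not computed (reflecting the "and/or" in the text) are assumptions for illustration.

```python
def classify_frame(tonal_count=None, low_run=None, high_run=None, *, t1, t2, t3):
    """Return "music" if any available feature exceeds its threshold, else "speech".

    tonal_count : tonal components satisfying the continuity constraint
    low_run     : consecutive frames in the low-frequency region
    high_run    : consecutive frames in the high-frequency region
    t1, t2, t3  : first, second, and third thresholds from the text
    """
    checks = [(tonal_count, t1), (low_run, t2), (high_run, t3)]
    is_music = any(value is not None and value > threshold
                   for value, threshold in checks)
    return "music" if is_music else "speech"
```

Note that each comparison is strict, matching the wording "greater than" in the text, so a feature equal to its threshold does not trigger a music decision.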
The processor acquiring the tone distribution parameters of the frame to be classified and of the N1 frames preceding it in the audio signal includes:
Performing a fast Fourier transform on the frame to be classified and the N1 frames preceding it in the received audio signal to obtain a power density spectrum; and obtaining, from the power density spectrum, the frequency-domain distribution information of the tonal components of the frame to be classified as the tone distribution parameters of the frame to be classified, and the frequency-domain distribution information of the tonal components of the N1 preceding frames as the tone distribution parameters of those frames.
The processor obtaining, from the tone distribution parameters of the frame to be classified and of the N1 frames preceding it, the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
Obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified and the N1 frames preceding it in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
In addition, the processor acquiring the energy distribution parameters of the frame to be classified and of the N1 frames preceding it in the audio signal includes:
Acquiring the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the N1 preceding frames as the energy distribution parameters of those frames.
The processor obtaining, from the energy distribution parameters of the frame to be classified and of the N1 frames preceding it in the audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
Obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 frames preceding it in the received audio signal, the number of consecutive frames, including the frame to be classified, for which the high-frequency energy distribution ratio is less than an eighth threshold.
The processor obtaining, from the energy distribution parameters of the frame to be classified and of the N1 frames preceding it in the audio signal, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
Obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 frames preceding it in the received audio signal, the number of consecutive frames, including the frame to be classified, for which the high-frequency energy distribution ratio is greater than a ninth threshold and the sound pressure level is greater than a tenth threshold.
Second, when the classification result of the frame to be classified is obtained with a delay of L1 frames, where L1 is a positive integer, the processor is specifically configured to: acquire the tone distribution parameters of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the audio signal, and obtain from those parameters the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N2 is a positive integer; acquire the energy distribution parameters of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the audio signal, and obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region; and determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified is a speech signal.
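The delayed case implies some buffering: a frame can only be classified once the L1 frames after it have arrived. The generator below is one possible way to organise that, shown as a sketch; the name `delayed_windows` and the handling of the stream edges (shorter windows at the start, no output for the final L1 frames) are assumptions, not details from the text.

```python
from collections import deque


def delayed_windows(frames, n2, l1):
    """Yield (window, cur) once each frame has its L1 look-ahead frames.

    window : list holding up to N2 preceding frames, the frame to be
             classified, and its L1 following frames, in time order
    cur    : index of the frame to be classified inside `window`
    """
    buf = deque(maxlen=n2 + 1 + l1)     # oldest frames fall out automatically
    for frame in frames:
        buf.append(frame)
        if len(buf) > l1:               # enough look-ahead has arrived
            cur = len(buf) - 1 - l1     # the frame L1 positions behind the newest
            yield list(buf), cur
```

Each yielded window can then be passed to the feature extraction and threshold tests; the output lags the input by exactly L1 frames, which is the delay named in this case.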
The processor acquiring the tone distribution parameters of the frame to be classified, of the N2 frames preceding it, and of the L1 frames following it in the audio signal includes:
Performing a fast Fourier transform on the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal to obtain a power density spectrum; and obtaining, from the power density spectrum, the frequency-domain distribution information of the tonal components of the frame to be classified as the tone distribution parameters of the frame to be classified, the frequency-domain distribution information of the tonal components of the N2 preceding frames as the tone distribution parameters of those frames, and the frequency-domain distribution information of the tonal components of the L1 following frames as the tone distribution parameters of those frames.
The processor obtaining, from the tone distribution parameters of the frame to be classified, of the N2 frames preceding it, and of the L1 frames following it, the number of tonal components in the frame to be classified that satisfy the continuity constraint includes:
Obtaining, from the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal, the number of tonal components in the frame to be classified whose number of consecutive frames is greater than a sixth threshold.
In addition, the processor acquiring the energy distribution parameters of the frame to be classified, of the N2 frames preceding it, and of the L1 frames following it in the audio signal includes:
Acquiring the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of those frames, and the high-frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of those frames.
The processor obtaining, from the energy distribution parameters of the frame to be classified, of the N2 frames preceding it, and of the L1 frames following it in the audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region includes:
Obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal, the number of consecutive frames, including the frame to be classified, for which the high-frequency energy distribution ratio is less than an eighth threshold.
The processor obtaining, from the energy distribution parameters of the frame to be classified, of the N2 frames preceding it, and of the L1 frames following it in the audio signal, the number of consecutive frames of the frame to be classified in the high-frequency region includes:
Obtaining, from the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal, the number of consecutive frames, including the frame to be classified, for which the high-frequency energy distribution ratio is greater than a ninth threshold and the sound pressure level is greater than a tenth threshold.
Third, when the output of the classification result is delayed by L2+L3 frames, where L2 and L3 are positive integers, the processor is specifically configured to: acquire the tone distribution parameters of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the audio signal, and obtain from those parameters the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N3 is a positive integer; acquire the energy distribution parameters of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the audio signal, and obtain from those parameters the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region; determine that the frame to be classified in the audio signal is a music signal when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified is a speech signal; if the frame to be classified is determined to be a music signal, determine whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fourth threshold, and if so, correct the frame to be classified to a speech signal, where N4 is a positive integer; and if the frame to be classified is determined to be a speech signal, determine whether the number of frames determined to be music signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fifth threshold, and if so, correct the frame to be classified to a music signal.
其中, 处理器获取音频信号中待分类帧的音调分布参数, 待分类帧前 N3帧的音调分布参数, 以及待分类帧后 L2帧的音调分布参数包括:  The processor obtains a pitch distribution parameter of the frame to be classified in the audio signal, a pitch distribution parameter of the N3 frame before the frame to be classified, and a pitch distribution parameter of the L2 frame after the frame to be classified includes:
对接收到的音频信号中的待分类帧、 待分类帧前 N3帧和待分类帧帧 后 L2帧进行快速傅里叶变换, 获取功率密度谱; 根据所述功率密度谱获 取所述接收到的音频信号中的待分类帧的音调分量的频域分布信息作为 待分类帧的音调分布参数, 待分类帧前 N3帧的音调分量的频域分布信息 作为待分类帧前 N3帧的音调分布参数, 以及待分类帧后 L2帧的音调分量 的频域分布信息作为待分类帧后 L2帧的音调分布参数。 Performing a fast Fourier transform on the frame to be classified in the received audio signal, the pre-frame N3 frame to be classified, and the L2 frame to be classified frame frame, to obtain a power density spectrum; and acquiring the received according to the power density spectrum The frequency domain distribution information of the tonal components of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal components of the pre-frame N3 frame is used as the pitch distribution parameter of the N3 frame before the frame to be classified. And the tonal component of the L2 frame after the frame to be classified The frequency domain distribution information is used as a pitch distribution parameter of the L2 frame after the frame to be classified.
处理器根据待分类帧的音调分布参数, 待分类帧前 N3帧的音调分布 参数, 以及待分类帧后 L2帧的音调分布参数获取待分类帧中满足连续性 约束条件的音调分量的数量包括:  The processor obtains, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
Obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, the number of tonal components in the frame to be classified whose number of sustained frames is greater than a sixth threshold.
In addition, the processor obtaining the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the energy distribution parameters of the L2 frames following the frame to be classified includes: obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of the N3 preceding frames, and the high-frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of the L2 following frames.
The processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the energy distribution parameters of the L2 frames following the frame to be classified, the number of sustained frames of the frame to be classified in the low-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold.
The processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the energy distribution parameters of the L2 frames following the frame to be classified, the number of sustained frames of the frame to be classified in the high-frequency region includes:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
In the above three cases, the number of tonal components in the frame to be classified whose number of sustained frames is greater than the sixth threshold, as obtained by the processor, is the number of such tonal components that lie above a seventh threshold in the frequency domain.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. An audio signal classification processing method, characterized in that it comprises:
obtaining at least one of: the number of tonal components satisfying a continuity constraint in a frame to be classified of an audio signal, the number of sustained frames of the frame to be classified in a low-frequency region, and the number of sustained frames of the frame to be classified in a high-frequency region; and
determining, according to the obtained at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of sustained frames of the frame to be classified in the low-frequency region, and the number of sustained frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or that the frame to be classified in the audio signal is a speech signal.
2. The audio signal classification processing method according to claim 1, characterized in that obtaining the number of tonal components satisfying the continuity constraint in the frame to be classified of the audio signal comprises:
obtaining a tone distribution parameter of the frame to be classified in the audio signal and tone distribution parameters of the N1 frames preceding the frame to be classified, and obtaining, according to the tone distribution parameter of the frame to be classified and the tone distribution parameters of the N1 preceding frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer;
obtaining the number of sustained frames of the frame to be classified in the low-frequency region and/or the number of sustained frames of the frame to be classified in the high-frequency region comprises:
obtaining an energy distribution parameter of the frame to be classified in the audio signal and energy distribution parameters of the N1 frames preceding the frame to be classified, and obtaining, according to the energy distribution parameter of the frame to be classified and the energy distribution parameters of the N1 preceding frames, the number of sustained frames of the frame to be classified in the low-frequency region and/or the number of sustained frames of the frame to be classified in the high-frequency region, where N1 is a positive integer; and
determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of sustained frames of the frame to be classified in the low-frequency region, and the number of sustained frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise determining that the frame to be classified in the audio signal is a speech signal, comprises:
when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of sustained frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of sustained frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal.
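The decision rule of claim 2 can be sketched as a simple OR over the three features. This is a minimal illustration, not the patented implementation: the `Thresholds` class, the function name, and the numeric values used in the usage note are hypothetical, and features that were not obtained are passed as `None` to reflect the "at least one of" wording.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Thresholds:
    """Hypothetical container; the claims give no numeric values."""
    tonal_count: int       # "first threshold"
    low_freq_frames: int   # "second threshold"
    high_freq_frames: int  # "third threshold"


def classify_frame(
    th: Thresholds,
    tonal_count: Optional[int] = None,
    low_freq_frames: Optional[int] = None,
    high_freq_frames: Optional[int] = None,
) -> str:
    """Return "music" if any supplied feature exceeds its threshold, else "speech"."""
    checks = [
        (tonal_count, th.tonal_count),
        (low_freq_frames, th.low_freq_frames),
        (high_freq_frames, th.high_freq_frames),
    ]
    # A feature that was not obtained (None) never triggers the music decision.
    is_music = any(value is not None and value > limit for value, limit in checks)
    return "music" if is_music else "speech"
```

For example, with illustrative thresholds `Thresholds(3, 10, 10)`, a frame with five persistent tonal components is labelled music, while a frame whose features only equal the thresholds is labelled speech, because the comparison is strictly "greater than".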
3. The audio signal classification processing method according to claim 2, characterized in that obtaining the tone distribution parameter of the frame to be classified in the audio signal and the tone distribution parameters of the N1 frames preceding the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified and the N1 frames preceding it in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, and frequency-domain distribution information of the tonal components of the N1 preceding frames as the tone distribution parameters of the N1 preceding frames;
obtaining, according to the tone distribution parameter of the frame to be classified and the tone distribution parameters of the N1 preceding frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified and of the N1 frames preceding it in the received audio signal, the number of tonal components in the frame to be classified whose number of sustained frames is greater than a sixth threshold.
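The counting step at the end of claim 3 can be sketched as follows, under two assumptions that the claim does not state: each frame's frequency-domain distribution information is represented as a set of FFT bin indices at which a tonal component was detected, and a component "persists" only if the exact same bin is tonal in every consecutive frame back from the current one. The function name and data layout are hypothetical.

```python
from typing import List, Set


def count_persistent_tonal_components(
    history: List[Set[int]], sixth_threshold: int
) -> int:
    """Count tonal bins of the current frame that have persisted long enough.

    history[-1] is the frame to be classified; history[:-1] are the N1
    preceding frames, oldest first. Each entry is a set of tonal bin indices.
    """
    current = history[-1]
    count = 0
    for bin_index in current:
        run = 0
        # Walk backwards from the current frame until the bin disappears.
        for frame_bins in reversed(history):
            if bin_index not in frame_bins:
                break
            run += 1
        if run > sixth_threshold:
            count += 1
    return count
```

A real detector would probably tolerate small frequency drift between frames; exact bin matching is used here only to keep the sketch short.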
4. The audio signal classification processing method according to claim 2, characterized in that obtaining the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames preceding the frame to be classified comprises:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the N1 preceding frames as the energy distribution parameters of the N1 preceding frames;
obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 preceding frames, the number of sustained frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the N1 frames preceding it in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold; and
obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 preceding frames, the number of sustained frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and of the N1 frames preceding it in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
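The two durations in claim 4 can be sketched as backward run-length counts. The sketch assumes that each frame is summarised as a pair (high-frequency energy distribution ratio, sound pressure level), that the frames are ordered oldest first with the frame to be classified last, and that "sustained" means an unbroken run ending at the frame to be classified. Function names and the numeric values in the test data are hypothetical.

```python
from typing import List, Tuple

# (high-frequency energy distribution ratio, sound pressure level)
Frame = Tuple[float, float]


def low_freq_duration(frames: List[Frame], eighth_threshold: float) -> int:
    """Length of the run, ending at frames[-1], with ratio < eighth_threshold."""
    run = 0
    for ratio, _spl in reversed(frames):
        if ratio >= eighth_threshold:
            break
        run += 1
    return run


def high_freq_duration(
    frames: List[Frame], ninth_threshold: float, tenth_threshold: float
) -> int:
    """Length of the run, ending at frames[-1], with ratio > ninth_threshold
    and sound pressure level > tenth_threshold."""
    run = 0
    for ratio, spl in reversed(frames):
        if not (ratio > ninth_threshold and spl > tenth_threshold):
            break
        run += 1
    return run
```

Both functions return 0 when the frame to be classified itself fails the condition, so such a frame can never exceed the second or third threshold on that feature.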
5. The audio signal classification processing method according to any one of claims 1 to 4, characterized in that obtaining the number of tonal components satisfying the continuity constraint in the frame to be classified of the audio signal comprises:
obtaining a tone distribution parameter of the frame to be classified in the audio signal, tone distribution parameters of the N2 frames preceding the frame to be classified, and tone distribution parameters of the L1 frames following the frame to be classified, and obtaining, according to the tone distribution parameter of the frame to be classified, the tone distribution parameters of the N2 preceding frames, and the tone distribution parameters of the L1 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where L1 is a positive integer and N2 is a positive integer;
obtaining the number of sustained frames of the frame to be classified in the low-frequency region and/or the number of sustained frames of the frame to be classified in the high-frequency region comprises:
obtaining an energy distribution parameter of the frame to be classified in the audio signal, energy distribution parameters of the N2 frames preceding the frame to be classified, and energy distribution parameters of the L1 frames following the frame to be classified, and obtaining, according to the energy distribution parameter of the frame to be classified, the energy distribution parameters of the N2 preceding frames, and the energy distribution parameters of the L1 following frames, the number of sustained frames of the frame to be classified in the low-frequency region and/or the number of sustained frames of the frame to be classified in the high-frequency region; and
determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of sustained frames of the frame to be classified in the low-frequency region, and the number of sustained frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise determining that the frame to be classified in the audio signal is a speech signal, comprises:
when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of sustained frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of sustained frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal.
6. The audio signal classification processing method according to claim 5, characterized in that obtaining the tone distribution parameter of the frame to be classified in the audio signal, the tone distribution parameters of the N2 frames preceding the frame to be classified, and the tone distribution parameters of the L1 frames following the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, frequency-domain distribution information of the tonal components of the N2 preceding frames as the tone distribution parameters of the N2 preceding frames, and frequency-domain distribution information of the tonal components of the L1 following frames as the tone distribution parameters of the L1 following frames;
obtaining, according to the tone distribution parameter of the frame to be classified, the tone distribution parameters of the N2 preceding frames, and the tone distribution parameters of the L1 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal, the number of tonal components in the frame to be classified whose number of sustained frames is greater than a sixth threshold.
7. The audio signal classification processing method according to claim 5, characterized in that obtaining the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames preceding the frame to be classified, and the energy distribution parameters of the L1 frames following the frame to be classified comprises:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of the N2 preceding frames, and the high-frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of the L1 following frames;
obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 preceding frames, and the energy distribution parameters of the L1 following frames, the number of sustained frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold; and
obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 preceding frames, and the energy distribution parameters of the L1 following frames, the number of sustained frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames preceding it, and the L1 frames following it in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
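When frames after the frame to be classified are also available, as in claims 5 to 7, the sustained run can extend in both directions. The sketch below assumes that "sustained frames including the frame to be classified" means the contiguous run of frames satisfying the condition that passes through the frame to be classified. The helper name `sustained_run` and the boolean-flag representation are hypothetical.

```python
from typing import Sequence


def sustained_run(flags: Sequence[bool], center: int) -> int:
    """Length of the contiguous run of True values that contains flags[center].

    flags covers the N2 preceding frames, the frame to be classified (at index
    `center`, normally N2), and the L1 following frames. Returns 0 if the
    frame to be classified does not itself satisfy the condition.
    """
    if not flags[center]:
        return 0
    left = center
    while left > 0 and flags[left - 1]:
        left -= 1
    right = center
    while right < len(flags) - 1 and flags[right + 1]:
        right += 1
    return right - left + 1
```

For the low-frequency duration, the flags would be built as `[ratio < eighth_threshold for ratio in ratios]`; for the high-frequency duration, as `[ratio > ninth_threshold and spl > tenth_threshold for ratio, spl in frames]`.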
8. The audio signal classification processing method according to any one of claims 1 to 7, characterized in that obtaining the number of tonal components satisfying the continuity constraint in the frame to be classified of the audio signal comprises:
obtaining a tone distribution parameter of the frame to be classified in the audio signal, tone distribution parameters of the N3 frames preceding the frame to be classified, and tone distribution parameters of the L2 frames following the frame to be classified, and obtaining, according to the tone distribution parameter of the frame to be classified, the tone distribution parameters of the N3 preceding frames, and the tone distribution parameters of the L2 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified, where L2 is a positive integer, L3 is a positive integer, and N3 is a positive integer;
obtaining the number of sustained frames of the frame to be classified in the low-frequency region and/or the number of sustained frames of the frame to be classified in the high-frequency region comprises:
obtaining an energy distribution parameter of the frame to be classified in the audio signal, energy distribution parameters of the N3 frames preceding the frame to be classified, and energy distribution parameters of the L3 frames following the frame to be classified, and obtaining, according to the energy distribution parameter of the frame to be classified, the energy distribution parameters of the N3 preceding frames, and the energy distribution parameters of the L3 following frames, the number of sustained frames of the frame to be classified in the low-frequency region and/or the number of sustained frames of the frame to be classified in the high-frequency region;
determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of sustained frames of the frame to be classified in the low-frequency region, and the number of sustained frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise determining that the frame to be classified in the audio signal is a speech signal, comprises:
when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of sustained frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of sustained frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a speech signal;
if the frame to be classified in the audio signal is determined to be a music signal, determining whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fourth threshold, and if so, correcting the frame to be classified in the audio signal to a speech signal, where N4 is a positive integer; and
if the frame to be classified in the audio signal is determined to be a speech signal, determining whether the number of frames determined to be music signals among the N4 frames preceding the frame to be classified and the L3 frames following it is greater than a fifth threshold, and if so, correcting the frame to be classified in the audio signal to a music signal.
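The correction step at the end of claim 8 amounts to overriding an isolated decision when enough neighbouring frames disagree with it. A minimal sketch, assuming that preliminary labels for all frames are already available as the strings "music" and "speech"; the function name and the clipping of the window at the start and end of the signal are assumptions, not part of the claim.

```python
from typing import List


def smooth_label(
    labels: List[str],
    index: int,
    n4: int,
    l3: int,
    fourth_threshold: int,
    fifth_threshold: int,
) -> str:
    """Return the corrected label of labels[index] based on its neighbours."""
    label = labels[index]
    # N4 frames before and L3 frames after, clipped at the signal boundaries.
    before = labels[max(0, index - n4):index]
    after = labels[index + 1:index + 1 + l3]
    neighbours = before + after
    if label == "music" and neighbours.count("speech") > fourth_threshold:
        return "speech"
    if label == "speech" and neighbours.count("music") > fifth_threshold:
        return "music"
    return label
```

The function reads only the preliminary labels and returns a new value, so corrected frames do not influence the correction of later frames. Whether the method intends that ordering is not stated in the claim.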
9. The audio signal classification processing method according to claim 8, characterized in that obtaining the tone distribution parameter of the frame to be classified in the audio signal, the tone distribution parameters of the N3 frames preceding the frame to be classified, and the tone distribution parameters of the L2 frames following the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tone distribution parameter of the frame to be classified, frequency-domain distribution information of the tonal components of the N3 preceding frames as the tone distribution parameters of the N3 preceding frames, and frequency-domain distribution information of the tonal components of the L2 following frames as the tone distribution parameters of the L2 following frames;
obtaining, according to the tone distribution parameter of the frame to be classified, the tone distribution parameters of the N3 preceding frames, and the tone distribution parameters of the L2 following frames, the number of tonal components satisfying the continuity constraint in the frame to be classified comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal, the frequency-domain distribution information of the tonal components of the N3 preceding frames, and the frequency-domain distribution information of the tonal components of the L2 following frames, the number of tonal components in the frame to be classified whose number of sustained frames is greater than a sixth threshold.
10. The audio signal classification processing method according to claim 8, characterized in that obtaining the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the energy distribution parameters of the L2 frames following the frame to be classified comprises:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of the N3 preceding frames, and the high-frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of the L2 following frames;
obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 preceding frames, and the energy distribution parameters of the L2 following frames, the number of sustained frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames preceding it, and the L2 frames following it in the received audio signal, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold; and
obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 preceding frames, and the energy distribution parameters of the L2 following frames, the number of sustained frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the frame to be classified in the received audio signal, the energy distribution parameters of the N3 frames preceding the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the L2 frames following the frame to be classified, the number of sustained frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
11. The audio signal classification processing method according to claim 3, 6, or 9, wherein the number of tonal components in the frame to be classified whose number of continuous frames is greater than the sixth threshold is the number of tonal components that are greater than a seventh threshold in the frequency domain.
12. An audio signal classification processing apparatus, comprising:
a first acquisition module, configured to acquire at least one of: the number of tonal components that satisfy a continuity constraint in a frame to be classified in an audio signal, the number of continuous frames of the frame to be classified in a low-frequency region, and the number of continuous frames of the frame to be classified in a high-frequency region; and
a classification determination module, configured to determine, according to at least one of the number of tonal components that satisfy the continuity constraint in the frame to be classified, the number of continuous frames of the frame to be classified in the low-frequency region, and the number of continuous frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal or that the frame to be classified in the audio signal is a speech signal.
13. The audio signal classification processing apparatus according to claim 12, wherein the first acquisition module is specifically configured to: acquire tonal distribution parameters of the frame to be classified in the audio signal and of N1 frames before the frame to be classified, and obtain, according to those tonal distribution parameters, the number of tonal components that satisfy the continuity constraint in the frame to be classified, N1 being a positive integer; or acquire energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and obtain, according to those energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region or the number of continuous frames of the frame to be classified in the high-frequency region; and
the classification determination module is specifically configured to determine that the frame to be classified in the audio signal is a music signal when the number of tonal components that satisfy the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise to determine that the frame to be classified in the audio signal is a speech signal.
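As an illustration only, the decision rule recited in claim 13 can be sketched as follows. The class name `FrameFeatures`, the function name `classify_frame`, and the numeric threshold values used in the usage note are hypothetical; the claims do not fix any threshold values.

```python
from dataclasses import dataclass


@dataclass
class FrameFeatures:
    """Hypothetical container for the three quantities named in claim 13."""
    tonal_count: int    # tonal components satisfying the continuity constraint
    low_freq_run: int   # continuous frames in the low-frequency region
    high_freq_run: int  # continuous frames in the high-frequency region


def classify_frame(f, first_threshold, second_threshold, third_threshold):
    """Return "music" if any of the three quantities exceeds its threshold,
    otherwise "speech" (the "otherwise" branch of the claim)."""
    is_music = (
        f.tonal_count > first_threshold
        or f.low_freq_run > second_threshold
        or f.high_freq_run > third_threshold
    )
    return "music" if is_music else "speech"
```

With assumed thresholds of 3, 10, and 10, a frame with five persistent tonal components would be labelled music even if both run lengths are zero, because the three conditions are joined by "or".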
14. The audio signal classification processing apparatus according to claim 13, wherein
the first acquisition module acquiring the tonal distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified and the N1 frames before the frame to be classified in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tonal distribution parameters of the frame to be classified, and frequency-domain distribution information of the tonal components of the N1 frames before the frame to be classified as the tonal distribution parameters of those N1 frames; and
the classification determination module obtaining, according to the tonal distribution parameters of the frame to be classified and of the N1 frames before the frame to be classified, the number of tonal components that satisfy the continuity constraint in the frame to be classified comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified and the N1 frames before the frame to be classified in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
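The counting step at the end of claim 14 can be sketched as below. This is a minimal sketch under several assumptions that are not stated in the claims: the tonal components of each frame are already available as a set of FFT bin indices (peak picking on the power density spectrum is not shown), a component "continues" when a bin within `tolerance` bins exists in the adjacent earlier frame, and the run length counts the frame to be classified itself. The function name and parameters are hypothetical.

```python
def count_persistent_tonals(tonal_bins_per_frame, sixth_threshold, tolerance=0):
    """Count tonal components of the last frame that have persisted for more
    than `sixth_threshold` consecutive frames.

    tonal_bins_per_frame: list of sets of bin indices, oldest frame first;
    the last entry is the frame to be classified.
    """
    *history, current = tonal_bins_per_frame
    count = 0
    for b in current:
        run = 1  # assumption: the frame to be classified counts as one frame
        for past in reversed(history):
            # Walk backwards while a matching bin is present in each frame.
            if any(abs(b - p) <= tolerance for p in past):
                run += 1
            else:
                break
        if run > sixth_threshold:
            count += 1
    return count
```

For example, with four frames in which bin 10 appears every time and every other bin changes from frame to frame, only bin 10 has a run longer than three frames.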
15. The audio signal classification processing apparatus according to claim 13, wherein
the first acquisition module acquiring the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified comprises:
acquiring the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the N1 frames before the frame to be classified as the energy distribution parameters of those N1 frames;
the classification determination module obtaining, according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, the number of continuous frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 frames before the frame to be classified in the received audio signal, the number of continuous frames, including the frame to be classified, in which the high-frequency energy distribution ratio is less than an eighth threshold; and
the classification determination module obtaining, according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, the number of continuous frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 frames before the frame to be classified in the received audio signal, the number of continuous frames, including the frame to be classified, in which the high-frequency energy distribution ratio is greater than a ninth threshold and the sound pressure level is greater than a tenth threshold.
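The two run-length counts in claim 15 might be computed as in the following sketch. It assumes the per-frame high-frequency energy distribution ratios and sound pressure levels are given as plain lists ordered oldest first, with the last element belonging to the frame to be classified. The function names and the sample values in the usage note are hypothetical.

```python
def low_freq_run(ratios, eighth_threshold):
    """Number of consecutive frames, ending at and including the frame to be
    classified, whose high-frequency energy ratio is below the threshold."""
    run = 0
    for r in reversed(ratios):
        if r < eighth_threshold:
            run += 1
        else:
            break
    return run


def high_freq_run(ratios, spls, ninth_threshold, tenth_threshold):
    """Number of consecutive frames, ending at and including the frame to be
    classified, whose high-frequency energy ratio and sound pressure level
    both exceed their thresholds."""
    run = 0
    for r, s in zip(reversed(ratios), reversed(spls)):
        if r > ninth_threshold and s > tenth_threshold:
            run += 1
        else:
            break
    return run
```

With ratios `[0.5, 0.1, 0.05, 0.08]` and an assumed eighth threshold of 0.2, the last three frames qualify, so the low-frequency run is 3. The high-frequency run stops as soon as either condition fails, which is why a loud but low-ratio frame, or a high-ratio but quiet frame, ends the run.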
16. The audio signal classification processing apparatus according to any one of claims 12 to 15, wherein, when the classification result of the frame to be classified is obtained with a delay of L1 frames, L1 being a positive integer, the first acquisition module is specifically configured to: acquire tonal distribution parameters of the frame to be classified in the audio signal, of N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, and obtain, according to those tonal distribution parameters, the number of tonal components that satisfy the continuity constraint in the frame to be classified, N2 being a positive integer; or acquire energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, and obtain, according to those energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region or the number of continuous frames of the frame to be classified in the high-frequency region; and
the classification determination module is specifically configured to determine that the frame to be classified in the audio signal is a music signal when the number of tonal components that satisfy the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise to determine that the frame to be classified in the audio signal is a speech signal.
17. The audio signal classification processing apparatus according to claim 16, wherein
the first acquisition module acquiring the tonal distribution parameters of the frame to be classified in the audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tonal distribution parameters of the frame to be classified, frequency-domain distribution information of the tonal components of the N2 frames before the frame to be classified as the tonal distribution parameters of those N2 frames, and frequency-domain distribution information of the tonal components of the L1 frames after the frame to be classified as the tonal distribution parameters of those L1 frames; and
the classification determination module obtaining, according to the tonal distribution parameters of the frame to be classified, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, the number of tonal components that satisfy the continuity constraint in the frame to be classified comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
18. The audio signal classification processing apparatus according to claim 16, wherein
the first acquisition module acquiring the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified comprises:
acquiring the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N2 frames before the frame to be classified as the energy distribution parameters of those N2 frames, and the high-frequency energy distribution ratios and sound pressure levels of the L1 frames after the frame to be classified as the energy distribution parameters of those L1 frames;
the classification determination module obtaining, according to the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, the number of continuous frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified in the received audio signal, the number of continuous frames, including the frame to be classified, in which the high-frequency energy distribution ratio is less than an eighth threshold; and
the classification determination module obtaining, according to the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, the number of continuous frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified in the received audio signal, the number of continuous frames, including the frame to be classified, in which the high-frequency energy distribution ratio is greater than a ninth threshold and the sound pressure level is greater than a tenth threshold.
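When later frames are also available, as in claims 16 to 18, one possible reading is that the run containing the frame to be classified may extend both backwards and forwards. The sketch below follows that reading, which is an assumption: the claims only require that the counted frames include the frame to be classified. The function name `run_through` and the boolean-flag input are hypothetical; each flag states whether a frame meets the relevant threshold condition.

```python
def run_through(flags, index):
    """Length of the contiguous run of True flags that contains position
    `index` (the frame to be classified); 0 if that frame itself fails."""
    if not flags[index]:
        return 0
    start = index
    while start > 0 and flags[start - 1]:
        start -= 1  # extend into the earlier frames
    end = index
    while end < len(flags) - 1 and flags[end + 1]:
        end += 1    # extend into the later (look-ahead) frames
    return end - start + 1
```

Here `flags` would cover the N2 earlier frames, the frame to be classified, and the L1 later frames, in time order.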
19. The audio signal classification processing apparatus according to any one of claims 12 to 18, wherein,
when the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, L2 and L3 being positive integers, the first acquisition module is specifically configured to acquire tonal distribution parameters of the frame to be classified in the audio signal, of N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, and obtain, according to those tonal distribution parameters, the number of tonal components that satisfy the continuity constraint in the frame to be classified, N3 being a positive integer; or
is specifically configured to acquire energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, and obtain, according to those energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region or the number of continuous frames of the frame to be classified in the high-frequency region; and
the classification processing module is specifically configured to: determine that the frame to be classified in the audio signal is a music signal when the number of tonal components that satisfy the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified in the audio signal is a speech signal; if the frame to be classified in the audio signal is determined to be a music signal, determine whether the number of frames determined to be speech signals among N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fourth threshold, and if so, correct the frame to be classified in the audio signal to a speech signal; and if the frame to be classified in the audio signal is determined to be a speech signal, determine whether the number of frames determined to be music signals among the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fifth threshold, and if so, correct the frame to be classified in the audio signal to a music signal, N4 being a positive integer.
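The correction step at the end of claim 19 can be illustrated with the sketch below. It assumes that preliminary labels for the neighbouring frames are already available as a list of the strings "music" and "speech" in time order. The function name `smooth_label` and the example values are hypothetical, and the handling of frames near the start or end of the list (fewer neighbours are simply used) is an assumption the claims do not address.

```python
def smooth_label(labels, index, n4, l3, fourth_threshold, fifth_threshold):
    """Return the corrected label for labels[index] using the N4 frames
    before it and the L3 frames after it."""
    before = labels[max(0, index - n4):index]
    after = labels[index + 1:index + 1 + l3]
    neighbours = before + after
    label = labels[index]
    # A music frame surrounded by enough speech frames is corrected to speech.
    if label == "music" and neighbours.count("speech") > fourth_threshold:
        return "speech"
    # A speech frame surrounded by enough music frames is corrected to music.
    if label == "speech" and neighbours.count("music") > fifth_threshold:
        return "music"
    return label
```

For instance, an isolated "music" frame with two speech frames on each side is corrected to "speech" when the fourth threshold is 3, since four neighbouring speech frames exceed it.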
20. The audio signal classification processing apparatus according to claim 19, wherein
the first acquisition module acquiring the tonal distribution parameters of the frame to be classified in the audio signal, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the tonal distribution parameters of the frame to be classified, frequency-domain distribution information of the tonal components of the N3 frames before the frame to be classified as the tonal distribution parameters of those N3 frames, and frequency-domain distribution information of the tonal components of the L2 frames after the frame to be classified as the tonal distribution parameters of those L2 frames; and
the classification determination module obtaining, according to the tonal distribution parameters of the frame to be classified, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, the number of tonal components that satisfy the continuity constraint in the frame to be classified comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
21. The audio signal classification processing apparatus according to claim 19, wherein
the first acquisition module acquiring the energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified comprises:
acquiring the high-frequency energy distribution ratio and sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N3 frames before the frame to be classified as the energy distribution parameters of those N3 frames, and the high-frequency energy distribution ratios and sound pressure levels of the L2 frames after the frame to be classified as the energy distribution parameters of those L2 frames;
the classification determination module obtaining, according to the energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames before the frame to be classified, and of the L2 frames after the frame to be classified, the number of continuous frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified in the received audio signal, the number of continuous frames, including the frame to be classified, in which the high-frequency energy distribution ratio is less than an eighth threshold; and
the classification determination module obtaining, according to the energy distribution parameters of the frame to be classified, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified in the audio signal, the number of continuous frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the energy distribution parameters of the frame to be classified in the received audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the L2 frames after the frame to be classified, the number of continuous frames, including the frame to be classified, in which the high-frequency energy distribution ratio is greater than a ninth threshold and the sound pressure level is greater than a tenth threshold.
22. The audio signal classification processing apparatus according to claim 14, 17, or 20, wherein the number, acquired by the first acquisition module, of tonal components in the frame to be classified whose number of continuous frames is greater than the sixth threshold is the number of tonal components that are greater than a seventh threshold in the frequency domain.
23. An audio signal classification processing device, comprising:
a receiver, configured to receive an audio signal; and
a processor, connected to the receiver and configured to: acquire at least one of the number of tonal components that satisfy a continuity constraint in a frame to be classified in the audio signal received by the receiver, the number of continuous frames of the frame to be classified in a low-frequency region, and the number of continuous frames of the frame to be classified in a high-frequency region; and determine, according to at least one of the number of tonal components that satisfy the continuity constraint in the frame to be classified, the number of continuous frames of the frame to be classified in the low-frequency region, and the number of continuous frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal or that the frame to be classified in the audio signal is a speech signal.
24. The audio signal classification processing device according to claim 23, wherein the processor is specifically configured to: acquire tonal distribution parameters of the frame to be classified in the audio signal and of N1 frames before the frame to be classified, and obtain, according to those tonal distribution parameters, the number of tonal components that satisfy the continuity constraint in the frame to be classified, N1 being a positive integer; acquire energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, and obtain, according to those energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region and/or the number of continuous frames of the frame to be classified in the high-frequency region, N1 being a positive integer; and determine that the frame to be classified in the audio signal is a music signal when the number of tonal components that satisfy the continuity constraint in the frame to be classified is greater than a first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than a third threshold, and otherwise determine that the frame to be classified in the audio signal is a speech signal.
25. The audio signal classification processing device according to claim 24, wherein the processor obtaining the pitch distribution parameters of the frame to be classified in the audio signal and of the N1 frames preceding the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified and the N1 preceding frames in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, and frequency-domain distribution information of the tonal components of the N1 preceding frames as the pitch distribution parameters of the N1 preceding frames;
and wherein the processor obtaining, according to the pitch distribution parameters of the frame to be classified and of the N1 preceding frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified and the N1 preceding frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than a sixth threshold.
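The steps of claim 25 can be sketched as follows. Several parts are assumptions rather than the claimed method: the peak-picking rule in `tonal_bins` (a local maximum that exceeds the bins two positions away by `margin_db`) is a hypothetical stand-in, since the claim does not say how tonal components are detected; continuity is tested by exact bin equality across frames, which is a simplification; and a naive DFT is used in place of an FFT only to keep the example free of dependencies (the resulting values are the same).

```python
import cmath
import math


def power_spectrum(frame):
    """Power spectrum (bins 0..n/2) of one frame via a naive DFT."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        spec.append(abs(acc) ** 2 / n)
    return spec


def tonal_bins(spec, margin_db=7.0):
    """Hypothetical peak picker: local maxima standing margin_db above bins k-2 and k+2."""
    eps = 1e-12  # avoids log10(0) on silent bins
    bins = set()
    for k in range(2, len(spec) - 2):
        if spec[k] > spec[k - 1] and spec[k] >= spec[k + 1]:
            level = 10 * math.log10(spec[k] + eps)
            if all(level - 10 * math.log10(spec[k + d] + eps) >= margin_db for d in (-2, 2)):
                bins.add(k)
    return bins


def count_persistent_tonals(history, current, sixth_threshold):
    """Count tonal bins of the current frame whose run of consecutive frames exceeds the threshold.

    history: tonal-bin sets of the N1 preceding frames, oldest first.
    current: tonal-bin set of the frame to be classified.
    """
    count = 0
    for k in current:
        run = 1  # the frame to be classified itself
        for past in reversed(history):
            if k not in past:
                break
            run += 1
        if run > sixth_threshold:
            count += 1
    return count
```

With a pure sinusoid centred on one bin, `tonal_bins(power_spectrum(frame))` returns that single bin; a practical implementation would likely tolerate small frequency drift between frames instead of requiring the same bin index.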
26. The audio signal classification processing device according to claim 24, wherein the processor obtaining the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames preceding the frame to be classified comprises:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the N1 preceding frames as the energy distribution parameters of the N1 preceding frames;
wherein the processor obtaining, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames, the number of continuous frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 preceding frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
and wherein the processor obtaining, according to the energy distribution parameters of the frame to be classified and of the N1 preceding frames, the number of continuous frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 preceding frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
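The two frame counts of claim 26 amount to run lengths that end at the frame to be classified. The sketch below assumes the per-frame ratios and sound pressure levels are already available as lists ordered oldest first, with the frame to be classified last. The function names are hypothetical, and defining the high-frequency energy distribution ratio as the share of spectral power at or above a cutoff bin is an assumption, since the claim does not define it.

```python
def high_freq_ratio(spec, cutoff_bin):
    """Assumed definition: share of total spectral power at or above cutoff_bin."""
    total = sum(spec)
    return sum(spec[cutoff_bin:]) / total if total > 0 else 0.0


def low_band_run(ratios, eighth_threshold):
    """Consecutive frames, ending at the last one, whose ratio is below the eighth threshold."""
    run = 0
    for r in reversed(ratios):
        if r >= eighth_threshold:
            break
        run += 1
    return run


def high_band_run(ratios, spls, ninth_threshold, tenth_threshold):
    """Consecutive frames, ending at the last one, with ratio above the ninth threshold
    and sound pressure level above the tenth threshold."""
    run = 0
    for r, s in zip(reversed(ratios), reversed(spls)):
        if not (r > ninth_threshold and s > tenth_threshold):
            break
        run += 1
    return run
```

If the frame to be classified itself fails the condition, both functions return 0, which matches the requirement that the counted run include that frame.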
27. The audio signal classification processing device according to any one of claims 23 to 26, wherein, when the classification result of the frame to be classified is obtained with a delay of L1 frames, L1 being a positive integer, the processor is specifically configured to: obtain pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding the frame to be classified, and of the L1 frames following the frame to be classified, and obtain, according to those pitch distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N2 is a positive integer; obtain energy distribution parameters of the frame to be classified, of the N2 frames preceding the frame to be classified, and of the L1 frames following the frame to be classified, and obtain, according to those energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region and/or the number of continuous frames of the frame to be classified in the high-frequency region; and when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than the first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than the third threshold, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a speech signal.
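When L1 look-ahead frames are available, as in claim 27, one plausible reading is that a run of continuous frames may extend both before and after the frame to be classified. The sketch below computes the length of the run that contains a given index; `run_through` is a hypothetical helper, and the two-sided interpretation is an assumption rather than something the claim states explicitly.

```python
def run_through(flags, center):
    """Length of the run of True values containing index `center` (0 if flags[center] is False).

    flags: per-frame condition results over the N2 preceding frames, the frame
           to be classified, and the L1 following frames, in time order.
    center: index of the frame to be classified within flags (N2 in this layout).
    """
    if not flags[center]:
        return 0
    left = center
    while left > 0 and flags[left - 1]:
        left -= 1
    right = center
    while right < len(flags) - 1 and flags[right + 1]:
        right += 1
    return right - left + 1
```

For example, with N2 = 2 and L1 = 3 the flags list has six entries and `center` is 2; the result can then be compared with the second or third threshold in the same way as in the causal case.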
28. The audio signal classification processing device according to claim 27, wherein the processor obtaining the pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding the frame to be classified, and of the L1 frames following the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, frequency-domain distribution information of the tonal components of the N2 preceding frames as the pitch distribution parameters of the N2 preceding frames, and frequency-domain distribution information of the tonal components of the L1 following frames as the pitch distribution parameters of the L1 following frames;
and wherein the processor obtaining, according to the pitch distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than the sixth threshold.
29. The audio signal classification processing device according to claim 27, wherein the processor obtaining the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames preceding the frame to be classified, and of the L1 frames following the frame to be classified comprises:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N2 preceding frames as the energy distribution parameters of the N2 preceding frames, and the high-frequency energy distribution ratios and sound pressure levels of the L1 following frames as the energy distribution parameters of the L1 following frames;
wherein the processor obtaining, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of continuous frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
and wherein the processor obtaining, according to the energy distribution parameters of the frame to be classified, of the N2 preceding frames, and of the L1 following frames, the number of continuous frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 preceding frames, and the L1 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
30. The audio signal classification processing device according to any one of claims 23 to 29, wherein, when the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, L2 and L3 being positive integers, the processor is specifically configured to: obtain pitch distribution parameters of the frame to be classified in the audio signal, of the N3 frames preceding the frame to be classified, and of the L2 frames following the frame to be classified, and obtain, according to those pitch distribution parameters, the number of tonal components in the frame to be classified that satisfy the continuity constraint, where N3 is a positive integer; obtain energy distribution parameters of the frame to be classified, of the N3 frames preceding the frame to be classified, and of the L2 frames following the frame to be classified, and obtain, according to those energy distribution parameters, the number of continuous frames of the frame to be classified in the low-frequency region and/or the number of continuous frames of the frame to be classified in the high-frequency region; when the number of tonal components in the frame to be classified that satisfy the continuity constraint is greater than the first threshold, the number of continuous frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of continuous frames of the frame to be classified in the high-frequency region is greater than the third threshold, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a speech signal; if the frame to be classified is determined to be a music signal, determine whether the number of frames determined to be speech signals among the N4 frames preceding the frame to be classified and the L4 frames following the frame to be classified is greater than a fourth threshold, and if so, correct the frame to be classified in the audio signal to a speech signal, where N4 is a positive integer; and if the frame to be classified is determined to be a speech signal, determine whether the number of frames determined to be music signals among the N4 frames preceding the frame to be classified and the L4 frames following the frame to be classified is greater than a fifth threshold, and if so, correct the frame to be classified in the audio signal to a music signal.
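The correction step at the end of claim 30 can be sketched as a vote over neighbouring decisions. The function name `smooth_label` and the string labels are hypothetical; the window sizes and thresholds are passed in because the claim gives no values, and clipping the window at the start and end of the label list is an added assumption.

```python
def smooth_label(labels, i, n4, l4, fourth_threshold, fifth_threshold):
    """Return the corrected label for frame i.

    labels: initial per-frame decisions, each "music" or "speech", in time order.
    n4, l4: number of preceding and following frames to inspect.
    """
    before = labels[max(0, i - n4):i]
    after = labels[i + 1:i + 1 + l4]
    neighbours = before + after
    if labels[i] == "music":
        # A music decision surrounded by enough speech frames is flipped to speech.
        if neighbours.count("speech") > fourth_threshold:
            return "speech"
    elif neighbours.count("music") > fifth_threshold:
        # A speech decision surrounded by enough music frames is flipped to music.
        return "music"
    return labels[i]
```

The function reads only the initial decisions and does not modify `labels`, so corrections of earlier frames do not feed into later ones; whether the claimed device behaves this way is not stated in the claim.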
31. The audio signal classification processing device according to claim 30, wherein the processor obtaining the pitch distribution parameters of the frame to be classified in the audio signal, of the N3 frames preceding the frame to be classified, and of the L2 frames following the frame to be classified comprises:
performing a fast Fourier transform on the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal to obtain a power density spectrum; and
obtaining, according to the power density spectrum, frequency-domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, frequency-domain distribution information of the tonal components of the N3 preceding frames as the pitch distribution parameters of the N3 preceding frames, and frequency-domain distribution information of the tonal components of the L2 following frames as the pitch distribution parameters of the L2 following frames;
and wherein the processor obtaining, according to the pitch distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of tonal components in the frame to be classified that satisfy the continuity constraint comprises:
obtaining, according to the frequency-domain distribution information of the tonal components of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of tonal components in the frame to be classified whose number of continuous frames is greater than the sixth threshold.
32. The audio signal classification processing device according to claim 30, wherein the processor obtaining the energy distribution parameters of the frame to be classified in the audio signal, of the N3 frames preceding the frame to be classified, and of the L2 frames following the frame to be classified comprises:
obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameters of the frame to be classified, the high-frequency energy distribution ratios and sound pressure levels of the N3 preceding frames as the energy distribution parameters of the N3 preceding frames, and the high-frequency energy distribution ratios and sound pressure levels of the L2 following frames as the energy distribution parameters of the L2 following frames;
wherein the processor obtaining, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of continuous frames of the frame to be classified in the low-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
and wherein the processor obtaining, according to the energy distribution parameters of the frame to be classified, of the N3 preceding frames, and of the L2 following frames, the number of continuous frames of the frame to be classified in the high-frequency region comprises:
obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 preceding frames, and the L2 following frames in the received audio signal, the number of continuous frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
33. The audio signal classification processing device according to claim 25, 28 or 31, wherein the number of tonal components, obtained by the processor, in the frame to be classified whose number of continuous frames is greater than the sixth threshold is the number of such tonal components that are greater than a seventh threshold in the frequency domain.
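Claim 33 restricts the count to tonal components above a seventh threshold in the frequency domain. A minimal sketch, assuming the threshold is a frequency in hertz and that tonal components are identified by FFT bin index (both assumptions; the names are hypothetical):

```python
def count_above_frequency(persistent_bins, sample_rate, fft_size, seventh_threshold_hz):
    """Count persistent tonal bins whose centre frequency exceeds the seventh threshold.

    Bin k of an fft_size-point transform is centred at k * sample_rate / fft_size Hz.
    """
    return sum(
        1 for k in persistent_bins
        if k * sample_rate / fft_size > seventh_threshold_hz
    )
```

The returned value would then be compared with the first threshold in place of the unrestricted count.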
PCT/CN2014/081400 2013-07-02 2014-07-01 Audio signal classification processing method, apparatus, and device WO2015000401A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310274580.9A CN104282315B (en) 2013-07-02 2013-07-02 Audio signal classification processing method, device and equipment
CN201310274580.9 2013-07-02

Publications (1)

Publication Number Publication Date
WO2015000401A1 true WO2015000401A1 (en) 2015-01-08

Family

ID=52143107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/081400 WO2015000401A1 (en) 2013-07-02 2014-07-01 Audio signal classification processing method, apparatus, and device

Country Status (2)

Country Link
CN (1) CN104282315B (en)
WO (1) WO2015000401A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104811864A (en) * 2015-04-20 2015-07-29 深圳市冠旭电子有限公司 Method and system for self-adaptive adjustment of audio effect

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9454893B1 (en) 2015-05-20 2016-09-27 Google Inc. Systems and methods for coordinating and administering self tests of smart home devices having audible outputs
EP3298598B1 (en) * 2015-05-20 2020-06-03 Google LLC Systems and methods for testing smart home devices

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5778335A (en) * 1996-02-26 1998-07-07 The Regents Of The University Of California Method and apparatus for efficient multiband celp wideband speech and music coding and decoding
WO2006019556A2 (en) * 2004-07-16 2006-02-23 Mindspeed Technologies, Inc. Low-complexity music detection algorithm and system
US20070271093A1 (en) * 2006-05-22 2007-11-22 National Cheng Kung University Audio signal segmentation algorithm
CN101236742A (en) * 2008-03-03 2008-08-06 中兴通讯股份有限公司 Music/ non-music real-time detection method and device
CN102237085A (en) * 2010-04-26 2011-11-09 华为技术有限公司 Method and device for classifying audio signals

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6785645B2 (en) * 2001-11-29 2004-08-31 Microsoft Corporation Real-time speech and music classifier
KR100964402B1 (en) * 2006-12-14 2010-06-17 삼성전자주식회사 Method and Apparatus for determining encoding mode of audio signal, and method and appartus for encoding/decoding audio signal using it
CN101577117B (en) * 2009-03-12 2012-04-11 无锡中星微电子有限公司 Extracting method of accompaniment music and device
CN101847412B (en) * 2009-03-27 2012-02-15 华为技术有限公司 Method and device for classifying audio signals
CN102446504B (en) * 2010-10-08 2013-10-09 华为技术有限公司 Voice/Music identifying method and equipment
CN102655000B (en) * 2011-03-04 2014-02-19 华为技术有限公司 Method and device for classifying unvoiced sound and voiced sound

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104811864A (en) * 2015-04-20 2015-07-29 深圳市冠旭电子有限公司 Method and system for self-adaptive adjustment of audio effect
CN104811864B (en) * 2015-04-20 2018-11-13 深圳市冠旭电子股份有限公司 A kind of method and system of automatic adjusument audio

Also Published As

Publication number Publication date
CN104282315A (en) 2015-01-14
CN104282315B (en) 2017-11-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 14819505; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 14819505; Country of ref document: EP; Kind code of ref document: A1