WO2015000401A1 - Method, apparatus and device for audio signal classification processing - Google Patents

Method, apparatus and device for audio signal classification processing

Info

Publication number
WO2015000401A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
classified
audio signal
frames
energy distribution
Prior art date
Application number
PCT/CN2014/081400
Other languages
English (en)
Chinese (zh)
Inventor
许丽净
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2015000401A1

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78Detection of presence or absence of voice signals
    • G10L25/81Detection of presence or absence of voice signals for discriminating voice from music

Definitions

  • Audio signal classification processing method, apparatus, and device
  • The embodiments of the present invention relate to the field of signal processing technologies, and in particular, to an audio signal classification processing method, apparatus, and device. Background Art
  • In practical applications, the signal to be analyzed may include a music signal, such as a ring tone.
  • In that case, a speech quality assessment model treats the music signal as a speech signal and gives an incorrect quality assessment.
  • Therefore, the signal to be analyzed should be classified before being input to the speech quality assessment module. If a segment of the signal is recognized as a speech signal, it is sent to the speech quality evaluation module for quality evaluation; if it is recognized as a music signal, it is not sent to the speech quality evaluation module.
  • The prior art provides an audio signal classification method applied to a joint speech/music encoder, but that classification method is designed for a joint speech/music encoder with a high sampling rate.
  • Music signals encountered in practice generally lack high-frequency information, so the existing audio signal classification method designed for the joint speech/music encoder can identify only a small number of music signals. Its classification accuracy is therefore low and cannot meet the requirements of voice quality assessment.
  • The embodiments of the present invention provide an audio signal classification processing method, apparatus, and device for improving the classification accuracy of audio signals.
  • A first aspect of the present invention provides an audio signal classification processing method, including: acquiring at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region;
  • and determining, according to at least one of these quantities, that the frame to be classified in the audio signal is a music signal, or determining that the frame to be classified in the audio signal is a voice signal.
  • In a possible implementation, acquiring the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal comprises: acquiring the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before the frame to be classified, and obtaining from these parameters the number of tonal components satisfying the continuity constraint, where N1 is a positive integer.
  • Determining that the frame to be classified in the audio signal is a music signal, and otherwise determining that it is a voice signal, includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal (a sketch of this decision rule follows below).
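To make the decision rule concrete, the following is a minimal sketch of the threshold test described above. The function name and the example default threshold values are illustrative assumptions; the patent names only a "first", "second", and "third" threshold without giving values.

```python
# Minimal sketch of the three-way threshold test described above.
# Default threshold values are placeholders, not values from the patent.

def classify_frame(num_continuous_tones: int,
                   low_freq_run: int,
                   high_freq_run: int,
                   first_threshold: int = 5,
                   second_threshold: int = 30,
                   third_threshold: int = 30) -> str:
    """Label a frame as music if any one of the three conditions holds."""
    if (num_continuous_tones > first_threshold
            or low_freq_run > second_threshold
            or high_freq_run > third_threshold):
        return "music"
    return "voice"
```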
  • Acquiring the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified includes: using the frequency domain distribution information of their tonal components as the pitch distribution parameters;
  • obtaining the number of tonal components satisfying the continuity constraint according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before it includes: obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified and of the N1 frames before it in the received audio signal, the number of tonal components in the frame to be classified whose number of persistent frames is greater than the sixth threshold.
  • Acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified includes: using the high-frequency energy distribution ratio and the sound pressure level of each frame as its energy distribution parameter;
  • obtaining, according to the energy distribution parameter of the frame to be classified and the energy distribution parameters of the N1 frames before it, the number of consecutive frames of the frame to be classified in the low-frequency region;
  • obtaining, according to the energy distribution parameter of the frame to be classified and the energy distribution parameters of the N1 frames before it, the number of consecutive frames of the frame to be classified in the high-frequency region;
  • when a classification result output delay of L1 frames is allowed, L1 is a positive integer.
  • the audio signal is acquired.
  • the number of tonal components in the to-be-classified frame that satisfy the continuity constraint includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N2 is a positive integer;
  • the N2 frame and the energy distribution parameter of the L1 frame after the frame to be classified acquire the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region;
  • Determining that the frame to be classified in the audio signal is a music signal, and otherwise determining that it is a voice signal, includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
  • the pitch distribution parameters of the frame include:
  • the pitch distribution parameter of the frame to be classified the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified include:
  • the acquiring an energy distribution parameter of a frame to be classified in the audio signal, an energy distribution parameter of the N2 frame before the frame to be classified, and a L1 after the frame to be classified The energy distribution parameters of the frame include:
  • using the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified as its energy distribution parameter, the high-frequency energy distribution ratios and sound pressure levels of the N2 frames before the frame to be classified as the energy distribution parameters of the N2 frames before the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the L1 frames after the frame to be classified as the energy distribution parameters of the L1 frames after the frame to be classified;
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before it, and the L1 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • When the classification result of the frame to be classified is obtained with a delay of L2+L3 frames, L2 and L3 are positive integers.
  • the number of tonal components satisfying the continuity constraint in the frame to be classified in the acquired audio signal includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N3 is a positive integer;
  • the energy distribution parameter of the L2 frame acquires the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region;
  • Determining, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, and otherwise determining that the frame to be classified in the audio signal is a voice signal, includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
  • If the frame to be classified in the audio signal is determined to be a music signal, it is determined whether the number of frames determined to be a voice signal among the N4 frames before the frame to be classified and the frame to be classified itself is greater than a fourth threshold; if so, the classification of the frame to be classified is corrected to a voice signal, where N4 is a positive integer. If the frame to be classified in the audio signal is determined to be a voice signal, it is determined whether the number of frames determined to be a music signal among the N4 frames before the frame to be classified and the L3 frames after it is greater than a fifth threshold; if so, the classification of the frame to be classified is corrected to a music signal (a sketch of this correction step follows below).
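A hedged sketch of the correction step above. The window sizes (N4, L3) and the fourth/fifth thresholds are parameters the patent names but does not value, and the exact set of frames counted in the music-to-voice case is ambiguous in the translation, so the preceding N4 labels plus the current label are used here as an assumption.

```python
# Sketch of the label-correction step; window contents and thresholds are
# illustrative assumptions, not definitions taken from the patent.

def correct_label(initial_label: str,
                  prev_labels: list,      # labels of the N4 frames before the current frame
                  next_labels: list,      # labels of the L3 frames after the current frame
                  fourth_threshold: int,
                  fifth_threshold: int) -> str:
    if initial_label == "music":
        # Count voice-labelled frames among the preceding window (and the current frame).
        voice_count = sum(1 for lbl in prev_labels + [initial_label] if lbl == "voice")
        if voice_count > fourth_threshold:
            return "voice"
    else:
        # Count music-labelled frames among the preceding and following windows.
        music_count = sum(1 for lbl in prev_labels + next_labels if lbl == "music")
        if music_count > fifth_threshold:
            return "music"
    return initial_label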
  • Acquiring the pitch distribution parameters of the frames includes:
  • using the frequency domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified, the frequency domain distribution information of the tonal components of the N3 frames before the frame to be classified as the pitch distribution parameters of the N3 frames before the frame to be classified, and the frequency domain distribution information of the tonal components of the L2 frames after the frame to be classified as the pitch distribution parameters of the L2 frames after the frame to be classified;
  • Obtaining the number of tonal components satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N3 frames before it, and the pitch distribution parameters of the L2 frames after it includes:
  • obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified, the N3 frames before it, and the L2 frames after it, the number of tonal components in the frame to be classified whose number of persistent frames is greater than the sixth threshold.
  • Acquiring the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified includes:
  • using the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames before it, and the L2 frames after it as their respective energy distribution parameters;
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified, obtaining the continuous frame number of the frame to be classified in the low frequency region includes:
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames before it, and the L2 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • A second aspect of the present invention provides an audio signal classification processing apparatus, including: a first acquisition module, configured to acquire at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region;
  • a classification determining module configured to determine, according to the number of tonal components satisfying the continuity constraint in the to-be-classified frame, the number of consecutive frames of the to-be-classified frame in the low-frequency region, and the number of consecutive frames in the high-frequency region of the to-be-classified frame And at least one of determining that the frame to be classified in the audio signal is a music signal, or determining that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module is specifically configured to acquire a to-be-classified frame in the audio signal, and a pitch distribution parameter of the N1 frame before the frame to be classified, and according to the to-be-classified frame And the pitch distribution parameter of the N1 frame before the frame to be classified obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N1 is a positive integer; or
  • or is specifically configured to obtain the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, and to obtain, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region and/or in the high-frequency region, where N1 is a positive integer.
  • The classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module acquires a pitch distribution parameter of the frame to be classified in the audio signal, and the pitch distribution parameters of the N1 frame before the frame to be classified include:
  • the information is used as a pitch distribution parameter of the N1 frame before the frame to be classified;
  • the classification determining module acquires the tone satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameter of the N1 frame before the frame to be classified
  • the number of components includes:
  • Acquiring, by the first acquiring module, the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified includes: obtaining the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, and the high-frequency energy distribution ratios and sound pressure levels of the N1 frames before the frame to be classified as the energy distribution parameters of the N1 frames before the frame to be classified (a sketch of these two features follows below);
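The claims use two per-frame energy features, the high-frequency energy distribution ratio and the sound pressure level, without specifying how they are computed. The sketch below shows one plausible way to derive them from a frame's spectrum; the band-split frequency, the dB reference, and the function name are assumptions, not definitions from the patent.

```python
# Illustrative per-frame energy features; the 2 kHz band split and the
# uncalibrated (relative) dB level are assumptions for this sketch.

import numpy as np

def energy_features(frame: np.ndarray, sample_rate: int, split_hz: float = 2000.0):
    """Return (high_freq_energy_ratio, relative_sound_pressure_level_db)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2                 # power spectrum of the frame
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)   # bin frequencies in Hz
    total_energy = spectrum.sum() + 1e-12                      # avoid division by zero
    high_energy = spectrum[freqs >= split_hz].sum()            # energy above the split frequency
    high_freq_ratio = high_energy / total_energy
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
    spl_db = 20.0 * np.log10(rms)                              # relative level in dB (no calibration)
    return high_freq_ratio, spl_db
```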
  • Obtaining, by the classification determining module, the number of consecutive frames of the frame to be classified in the low-frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before it includes:
  • obtaining, by the classification determining module, the number of consecutive frames of the frame to be classified in the high-frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before it includes:
  • L1 is a positive integer
  • The first acquiring module is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; or is specifically configured to acquire the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it, and to obtain, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region and/or in the high-frequency region.
  • The classification determining module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module acquires a pitch distribution parameter of a frame to be classified in the audio signal, a pitch distribution parameter of the N2 frame before the frame to be classified, and a frame to be classified
  • the pitch distribution parameters of the post L1 frame include:
  • the classification determining module acquires, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified, the tonal component that satisfies the continuity constraint in the frame to be classified.
  • the quantities include:
  • the first acquiring module acquires an energy distribution parameter of a frame to be classified in the audio signal, an energy distribution parameter of the N2 frame before the frame to be classified, and a to-be-classified
  • the energy distribution parameters of the L1 frame after the frame include:
  • the energy distribution parameter of the N2 frame and the high-frequency energy distribution ratio and the sound pressure level of the L1 frame after the frame frame to be classified are the energy distribution parameters of the L1 frame after the frame to be classified;
  • the classification determining module acquires, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N2 frame before the frame to be classified, and the energy distribution parameter of the L1 frame after the frame to be classified, the continuous frame of the frame to be classified in the low frequency region.
  • the numbers include:
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before it, and the L1 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the classification determining module is to be classified according to an energy distribution parameter of a frame to be classified in an audio signal
  • the energy distribution parameter of the N2 frame before the frame and the energy distribution parameter of the L1 frame after the frame to be classified obtain the continuous frame number of the frame to be classified in the high frequency region, including:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • L2 and L3 are positive integers.
  • the first acquiring module is specifically configured to acquire a to-be-classified frame in the audio signal, a N3 frame before the frame to be classified, and a pitch distribution parameter of the L2 frame after the frame to be classified, and according to the to-be-classified frame, the N3 frame before the frame to be classified and The pitch distribution parameter of the L2 frame after the frame to be classified obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N3 is a positive integer; or
  • the method is configured to obtain an energy distribution parameter of the to-be-classified frame in the audio signal, and a pre-frame N3 frame and an L3 frame to be classified, and according to the to-be-classified frame in the audio signal, the N3 frame to be classified and
  • the energy distribution parameter of the L3 frame after the frame to be classified obtains the number of consecutive frames of the frame to be classified in the low frequency region or the number of consecutive frames of the frame to be classified in the high frequency region;
  • The classification processing module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a voice signal; if the frame to be classified is determined to be a music signal, determine whether the number of frames determined to be a voice signal among the N4 frames before the frame to be classified and the frame to be classified itself is greater than a fourth threshold, and if so, correct the classification of the frame to a voice signal;
  • if the frame to be classified is determined to be a voice signal, determine whether the number of frames determined to be a music signal among the N4 frames before the frame to be classified and the L3 frames after it is greater than a fifth threshold, and if so, correct the classification of the frame to a music signal, where N4 is a positive integer.
  • the first acquiring module acquires a pitch distribution parameter of a frame to be classified in the audio signal, a pitch distribution parameter of the N3 frame before the frame to be classified, and a to-be-classified
  • the pitch distribution parameters of the L2 frame after the classification frame include:
  • the pitch distribution parameter of the N3 frame before the frame to be classified, and the frequency domain distribution information of the tonal component of the L2 frame after the frame frame to be classified are used as the pitch distribution parameter of the L2 frame after the frame to be classified;
  • the classification determining module acquires, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified, the tonal component that satisfies the continuity constraint in the frame to be classified.
  • the quantities include:
  • the first acquiring module acquires an energy distribution parameter of a frame to be classified in the audio signal, and an energy distribution parameter of the N3 frame before the frame to be classified and the to-be-classified
  • the energy distribution parameters of the L2 frame after the classification frame include:
  • the energy distribution parameter of the N3 frame, and the high-frequency energy distribution ratio and the sound pressure level of the L2 frame after the frame frame to be classified are the energy distribution parameters of the L2 frame after the frame to be classified;
  • the classification determining module acquires the continuous frame of the to-be-classified frame in the low frequency region according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified
  • the numbers include:
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames before it, and the L2 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the classification determining module obtains the continuation of the to-be-classified frame in the high-frequency region according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified
  • the number of frames includes:
  • the ratio is greater than the ninth threshold
  • the sound pressure level is greater than the tenth threshold.
  • The number of tonal components, acquired by the first acquiring module, whose number of persistent frames in the frame to be classified is greater than a sixth threshold is the number of such tonal components that lie above the seventh threshold in the frequency domain.
  • In combination with the first, the second, or the third possible implementation of the second aspect, the number of tonal components satisfying the continuity constraint is the number of tonal components greater than the seventh threshold in the frequency domain.
  • the first acquiring module is specifically configured to acquire a high frequency energy distribution ratio and a sound pressure level of each frame in the received audio signal, and according to a high frequency energy distribution ratio of each frame in the received audio signal.
  • a third aspect of the present invention provides an audio signal classification processing apparatus, including: a receiver, configured to receive an audio signal;
  • a processor configured to obtain, by the receiver, a quantity of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal received by the receiver, and a continuous frame in the low frequency region of the to-be-classified frame in the audio signal And the at least one of the number of consecutive frames of the frame to be classified in the high frequency region, according to the number of tonal components satisfying the continuity constraint in the frame to be classified, and the continuous frame of the frame to be classified in the low frequency region And determining at least one of the number of consecutive frames of the frame to be classified in the high frequency region, determining that the frame to be classified in the audio signal is a music signal, or determining that the frame to be classified in the audio signal is a voice signal.
  • the processor is specifically configured to acquire a to-be-classified frame in the audio signal, and a pitch distribution parameter of the N1 frame before the frame to be classified, and according to the to-be-classified frame, and to be classified
  • the pitch distribution parameters of the N1 frames before the frame to be classified, obtain the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer; acquire the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before it, and obtain, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, where N1 is a positive integer; and, when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than the first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than the second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than the third threshold, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a voice signal.
  • the processor obtains a pitch distribution parameter of the frame to be classified in the audio signal, and the pitch distribution parameters of the N1 frame before the frame to be classified include:
  • the distribution information is used as a pitch distribution parameter of the N1 frame before the frame to be classified;
  • the processor acquires the tone satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified and the pitch distribution parameter of the N1 frame before the frame to be classified
  • the number of components includes:
  • the processor acquires an energy distribution parameter of a frame to be classified in the audio signal, and an energy distribution parameter of the N1 frame before the frame to be classified includes:
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region includes: obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 frames before it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the high-frequency region includes: obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified and the N1 frames before it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold (a sketch of this run-length counting follows below).
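A minimal sketch of the run-length counting just described. The frame ordering (current frame last), the feature names, and the threshold variables are assumptions used for illustration.

```python
# Count how many consecutive frames, ending at the frame to be classified,
# satisfy the low-frequency or high-frequency energy conditions.

def run_length_low_freq(ratios, eighth_threshold):
    """ratios: high-frequency energy ratios, current frame last.  Counts the
    trailing run of frames whose ratio is below the eighth threshold."""
    count = 0
    for r in reversed(ratios):
        if r < eighth_threshold:
            count += 1
        else:
            break
    return count

def run_length_high_freq(ratios, spls, ninth_threshold, tenth_threshold):
    """Counts the trailing run of frames whose high-frequency energy ratio is
    above the ninth threshold and whose sound pressure level is above the tenth."""
    count = 0
    for r, spl in zip(reversed(ratios), reversed(spls)):
        if r > ninth_threshold and spl > tenth_threshold:
            count += 1
        else:
            break
    return count
```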
  • L1 is a positive integer
  • The processor is specifically configured to acquire the pitch distribution parameters of the frame to be classified in the audio signal, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer; and to acquire the energy distribution parameters of the frame to be classified in the audio signal, of the N2 frames before it, and of the L1 frames after it, and to obtain, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • the processor acquires a pitch distribution parameter of the frame to be classified in the audio signal, a pitch distribution parameter of the N2 frame before the frame to be classified, and a frame to be classified
  • the pitch distribution parameters of the L1 frame include:
  • frequency domain distribution information of a tonal component of the frame to be classified in the received audio signal as a pitch distribution parameter of a frame to be classified, and frequency domain distribution of a tonal component of the N2 frame before the frame to be classified
  • the information is used as the pitch distribution parameter of the N2 frame before the frame to be classified, and the frequency domain distribution information of the tonal component of the L1 frame after the frame frame to be classified is used as the pitch distribution parameter of the L1 frame after the frame frame to be classified;
  • the processor obtains, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified.
  • obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified, the N2 frames before it, and the L1 frames after it, the number of tonal components whose number of persistent frames in the frame to be classified is greater than a sixth threshold.
  • Acquiring, by the processor, the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N2 frames before the frame to be classified, and the energy distribution parameters of the L1 frames after the frame to be classified includes:
  • the energy distribution parameter of the N2 frame and the high frequency energy distribution ratio and the sound pressure level of the L1 frame after the frame to be classified are the energy distribution parameters of the L1 frame after the frame to be classified;
  • the processor obtains the continuous frame number of the to-be-classified frame in the low frequency region.
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before it, and the L1 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the processor Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N2 frame before the frame to be classified, and the energy distribution parameter of the L1 frame after the frame to be classified, the processor acquiring the continuous frame of the frame to be classified in the high frequency region
  • the numbers include:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the third aspect when the classification result of the to-be-classified frame is acquired in the delayed L2+L3 frame, L2 and L3 are positive integers,
  • the processor is specifically configured to obtain a to-be-classified frame in the audio signal, a pre-frame N3 frame, and a tone distribution parameter of the L2 frame to be classified, and according to the to-be-classified frame, the pre-frame N3 frame and the to-be-classified frame
  • the tone distribution parameter of the L2 frame obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N3 is a positive integer; acquiring the frame to be classified in the audio signal, and the N3 frame before the frame to be classified and the frame to be classified An energy distribution parameter of the L2 frame, and obtaining, according to the to-be-classified frame in the audio signal, the N3 frame of the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified, the number of consecutive frames of
  • the processor acquires a pitch distribution parameter of a frame to be classified in the audio signal, a pitch distribution parameter of the N3 frame before the frame to be classified, and a frame to be classified
  • the pitch distribution parameters of the L2 frame include:
  • using the frequency domain distribution information of the tonal components of the frame to be classified in the received audio signal as the pitch distribution parameter of the frame to be classified,
  • the frequency domain distribution information of the tonal components of the N3 frames before the frame to be classified as the pitch distribution parameters of the N3 frames before the frame to be classified, and the frequency domain distribution information of the tonal components of the L2 frames after the frame to be classified as the pitch distribution parameters of the L2 frames after the frame to be classified;
  • the processor obtains the number of tonal components satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified.
  • obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified, the N3 frames before it, and the L2 frames after it, the number of tonal components in the frame to be classified whose number of persistent frames is greater than the sixth threshold.
  • the processor acquires an energy distribution parameter of a frame to be classified in the audio signal, an energy distribution parameter of the N3 frame before the frame to be classified, and a frame to be classified
  • the energy distribution parameters of the L2 frame include:
  • the processor obtains the continuous frame number of the frame to be classified in the low frequency region Includes:
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N3 frames before it, and the L2 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • the numbers include:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • The number of tonal components in the frame to be classified, acquired by the processor, whose number of persistent frames is greater than the sixth threshold is the number of such tonal components that lie above the seventh threshold in the frequency domain.
  • That is, the number of tonal components satisfying the continuity constraint is the number of tonal components greater than the seventh threshold in the frequency domain.
  • In the embodiments of the present invention, the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal, and the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region, are acquired, and whether the frame to be classified is a music signal or a voice signal is confirmed according to this information. The audio signal classification processing method provided by the above technical solution can therefore improve the accuracy of audio signal classification and meet the requirements of voice quality assessment.
  • FIG. 1 is a schematic flowchart 1 of an audio signal classification processing method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart 1 of a specific embodiment of the present invention
  • Figure 3a is a waveform diagram 1 of the input signal "French male + ⁇ ";
  • Figure 3b is a spectrogram corresponding to Figure 3a;
  • Figure 4a is a waveform diagram of an input signal of an audio signal "Jinghu + French male voice";
  • Figure 4b is a spectrum diagram corresponding to Figure 4a;
  • Figure 5a is a waveform diagram of the input signal "Korean male + ensemble"
  • Figure 5b is a spectrum diagram corresponding to Figure 5a;
  • Figure 6a is a waveform diagram 2 of the input signal "French male + ⁇ ";
  • Figure 6b is the initial tone detection result of the input signal shown in Figure 6a;
  • Figure 6c is the result of the tone detection after the input signal is filtered as shown in Figure 6a;
  • Figure 7a is a waveform diagram 3 of the input signal "French male + ⁇ ";
  • Figure 7b is a graph of the pitch characteristic corresponding to Figure 7a;
  • Figure 8a is a waveform diagram of the input signal "Jinghu + French male voice"
  • Figure 8b is a graph of the high-frequency energy distribution ratio corresponding to Figure 8a;
  • Figure 9a is a waveform diagram of the input signal "Korean male + ensemble";
  • Figure 9b is a graph of the high-frequency energy distribution ratio corresponding to Figure 9a;
  • Figure 10 is a schematic flow chart 1 of the audio signal classification rule in the embodiment of the present invention.
  • Figure 11a is a waveform diagram 1 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 11b is a schematic diagram of the classification result corresponding to Figure 11a;
  • Figure 12a is a waveform diagram 2 of the input signal "Chinese female voice + ensemble + English male voice + ⁇ + German male voice + castanets";
  • Figure 12b is a schematic diagram of the smoothed classification result corresponding to Figure 12a;
  • FIG. 13 is a second schematic diagram of an audio signal classification rule according to an embodiment of the present invention.
  • Figure 14a is a waveform diagram 3 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 14b is a schematic diagram of the real-time classification result corresponding to Figure 14a;
  • FIG. 15 is a flowchart of a voice classification method in a case where the output delay is not fixed according to an embodiment of the present invention;
  • Figure 16a is a waveform diagram 4 of the input signal "Chinese female + ensemble + English male + ⁇ + German male + castanets";
  • Figure 16b is a schematic diagram showing the classification results of the three classification methods corresponding to Figure 16a;
  • FIG. 17 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention.
  • FIG. 18 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention.
  • FIG. 1 is a schematic flowchart 1 of an audio signal classification processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
  • Step 101: Acquire at least one of the number of tonal components satisfying a continuity constraint in a frame to be classified in an audio signal, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region;
  • Step 102: According to at least one of the acquired number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, determine that the frame to be classified in the audio signal is a music signal, or determine that the frame to be classified in the audio signal is a voice signal.
  • When classifying frames in the audio signal, the audio signal classification processing method provided by the embodiment of the present invention can output the classification result without output delay, that is, output the classification result in real time for each received audio signal frame, or with a certain output delay, that is, output the classification result for a received audio signal frame after a delay.
  • The technical solution provided by the above embodiments of the present invention mainly exploits the characteristics of music signals: for example, the tone duration of a music signal is long while the tone duration of a voice signal is short, and the energy of a music signal can be continuously distributed in the high-frequency region or the low-frequency region, whereas a voice signal is generally not continuously distributed in the high-frequency region or the low-frequency region.
  • The number of tonal components satisfying the continuity constraint in the frame to be classified, and the number of consecutive frames of the frame to be classified in the low-frequency region of the audio signal and/or the number of consecutive frames of the frame to be classified in the high-frequency region, are first acquired, and whether the frame to be classified is a music signal or a voice signal is confirmed according to this information. The audio signal classification processing method provided by the above technical solution can improve the accuracy of audio signal classification and meet the requirements of voice quality assessment.
  • The following description is divided into three cases according to different output delay requirements.
  • The first case allows no output delay, that is, the classification result is output in real time, and the judgment is made according to the frame to be classified and the frames before it. The second case allows a small classification result output delay, that is, an output delay of L1 frames, where L1 is a positive integer, and the judgment can be made according to the frame to be classified, the frames before the frame to be classified, and the L1 frames after the frame to be classified.
  • The third case allows a larger classification result output delay, that is, an output delay of L2+L3 frames, where L2 and L3 are positive integers; the judgment is first made according to the frame to be classified, the frames before the frame to be classified, and the L2 frames after the frame to be classified to obtain a classification result for the frame to be classified, which is then corrected according to the frames before the frame to be classified and the L3 frames after it.
  • The frames at the very beginning of the received audio signal cannot be classified in this way, and the first received frames can be set to a default value, the default being a voice signal or a music signal (a sketch of the buffering implied by these three output-delay modes follows below).
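A hedged sketch of the three output-delay modes: real-time (look-ahead of 0 frames), an L1-frame delay, or an L2+L3-frame delay. The buffering structure and the injected `classify` callback are assumptions for illustration; the patent does not prescribe this design.

```python
# Buffer frames so that each frame is labelled only after `lookahead`
# later frames (0, L1, or L2 + L3) have arrived.

from collections import deque
from typing import Any, Callable, List, Optional

class DelayedClassifier:
    def __init__(self,
                 lookahead: int,
                 classify: Callable[[Any, List[Any]], str],
                 default_label: str = "voice"):
        self.lookahead = lookahead            # 0, L1, or L2 + L3 frames of delay
        self.classify = classify              # user-supplied per-frame classifier
        self.default_label = default_label    # default label for undecidable frames
        self.buffer = deque()

    def push(self, frame: Any) -> Optional[str]:
        """Feed one frame; return a label once enough look-ahead has arrived."""
        self.buffer.append(frame)
        if len(self.buffer) <= self.lookahead:
            return None                       # result for this frame is still delayed
        target = self.buffer.popleft()        # frame that now has its full look-ahead
        return self.classify(target, list(self.buffer))

# Example: real-time mode with a trivial placeholder classifier.
rt = DelayedClassifier(lookahead=0, classify=lambda frame, future: "voice")
```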
  • Acquiring, in step 101 of the embodiment shown in FIG. 1, the number of tonal components in the frame to be classified in the audio signal that satisfy the continuity constraint specifically includes:
  • acquiring the pitch distribution parameter of the frame to be classified and the pitch distribution parameters of the N1 frames before the frame to be classified, and obtaining from these parameters the number of tonal components satisfying the continuity constraint, where N1 is a positive integer.
  • the obtaining of the continuous frame number of the frame to be classified in the low frequency region and/or the continuous frame number of the frame to be classified in the high frequency region in the step 102 of the embodiment shown in FIG. 1 includes:
  • step 103 of the embodiment shown in FIG. 1 according to the number of tonal components satisfying the continuity constraint in the to-be-classified frame, the number of consecutive frames of the to-be-classified frame in the low frequency region, and the frame to be classified in the high frequency Determining, in at least one of the continuous frames of the area, the frame to be classified in the audio signal is a music signal, and determining that the frame to be classified in the audio signal is a voice signal includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
  • Acquiring the pitch distribution parameter of the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified includes: using the frequency domain distribution information of their tonal components as the pitch distribution parameters;
  • obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified and of the N1 frames before it, the number of tonal components in the frame to be classified whose number of persistent frames is greater than the sixth threshold (a sketch of this counting follows below).
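A hedged sketch of counting tonal components that satisfy the continuity constraint. Each frame's tonal components are represented by their frequency-bin indices; treating a tone as "the same" when it reappears at the same bin in consecutive frames is an assumption. The persistence test (sixth threshold) follows the passage above, and restricting the count to tones above the seventh threshold in the frequency domain follows the definition given elsewhere in the claims.

```python
# Count tonal components in the current frame that persist over enough
# preceding frames (sixth threshold) and lie high enough in the frequency
# domain (seventh threshold).  The per-bin matching rule is an assumption.

from typing import List, Set

def count_continuous_tones(tone_bins: List[Set[int]],
                           sixth_threshold: int,
                           seventh_threshold: int) -> int:
    """tone_bins is ordered from the oldest preceding frame to the frame to be
    classified (last element).  Returns the number of qualifying tones."""
    current = tone_bins[-1]
    count = 0
    for bin_idx in current:
        if bin_idx <= seventh_threshold:        # ignore tones low in the frequency domain
            continue
        persistence = 0
        for frame_bins in reversed(tone_bins):  # walk backwards from the current frame
            if bin_idx in frame_bins:
                persistence += 1
            else:
                break
        if persistence > sixth_threshold:       # tone lasted long enough
            count += 1
    return count
```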
  • Acquiring the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified includes: using the high-frequency energy distribution ratio and the sound pressure level of each frame as its energy distribution parameter;
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than an eighth threshold;
  • obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the high-frequency region includes: obtaining the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
  • the number of tonal components includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N2 is a positive integer;
  • the obtaining of the continuous frame number of the frame to be classified in the low frequency region and/or the continuous frame number of the frame to be classified in the high frequency region in the step 102 of the embodiment shown in FIG. 1 includes:
  • the energy distribution parameter of the L1 frame acquires the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region;
  • step 103 of the embodiment shown in FIG. 1 according to the number of tonal components satisfying the continuity constraint in the to-be-classified frame, the number of consecutive frames of the to-be-classified frame in the low frequency region, and the frame to be classified in the high frequency Determining, in at least one of the continuous frames of the area, the frame to be classified in the audio signal is a music signal, and determining that the frame to be classified in the audio signal is a voice signal includes:
  • when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determining that the frame to be classified in the audio signal is a music signal; otherwise, determining that the frame to be classified in the audio signal is a voice signal.
  • Acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N2 frames before the frame to be classified, and the pitch distribution parameters of the L1 frames after the frame to be classified includes:
  • obtaining the number of tonal components satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N2 frames before it, and the pitch distribution parameters of the L1 frames after it includes:
  • the foregoing obtains an energy distribution parameter of the frame to be classified in the audio signal, before the frame to be classified
  • the energy distribution parameters of the N2 frame and the energy distribution parameters of the L1 frame after the frame to be classified include:
  • the energy distribution parameter of the N2 frame and the high-frequency energy distribution ratio and the sound pressure level of the L1 frame after the frame to be classified are the energy distribution parameters of the L1 frame after the frame to be classified;
  • obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before it, and the L1 frames after it in the received audio signal, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is less than the eighth threshold;
  • and obtaining, according to the high-frequency energy distribution ratios and sound pressure levels of the frame to be classified, the N2 frames before it, and the L1 frames after it, the number of consecutive frames, including the frame to be classified, whose high-frequency energy distribution ratio is greater than a ninth threshold and whose sound pressure level is greater than a tenth threshold.
  • the classification result output delay is allowed to be L2+L3 frames, that is, the delay L2+L3 frame is used to obtain the classification result of the to-be-classified frame
  • the step 101 of the embodiment shown in FIG. 1 acquires the to-be-classified frame in the audio signal.
  • the number of tonal components that satisfy the continuity constraint includes:
  • the distribution parameter obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N3 is a positive integer;
  • the obtaining of the continuous frame number of the frame to be classified in the low frequency region and/or the continuous frame number of the frame to be classified in the high frequency region in the step 102 of the embodiment shown in FIG. 1 includes:
  • the energy distribution parameter of the post L2 frame acquires the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region.
  • step 103 of the embodiment shown in FIG. 1 according to the number of tonal components satisfying the continuity constraint in the to-be-classified frame, the number of consecutive frames of the to-be-classified frame in the low frequency region, and the frame to be classified in the high frequency Determining, in at least one of the continuous frames of the area, the frame to be classified in the audio signal is a music signal, and determining that the frame to be classified in the audio signal is a voice signal includes:
  • the number of the tonal components satisfying the continuity constraint in the to-be-classified frame is greater than a first threshold, the number of consecutive frames of the to-be-classified frame in the low-frequency region is greater than a second threshold, or the duration of the to-be-classified frame in the high-frequency region
  • the number of frames is greater than the third threshold, determining that the frame to be classified in the audio signal is a music signal, otherwise determining that the frame to be classified in the audio signal is a voice signal
  • the acquiring the pitch distribution parameter of the frame to be classified in the audio signal is to be
  • the pitch distribution parameters of the N3 frame before the classification frame, and the pitch distribution parameters of the L2 frame after the frame to be classified include:
  • the pitch distribution parameter of the frame to be classified the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified include:
• obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified, the N3 frames before the frame to be classified, and the L2 frames after the frame to be classified, the number of tonal components whose number of persistent frames in the frame to be classified is greater than a sixth threshold.
  • the obtaining the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified include:
  • the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal are used as energy distribution parameters of the L2 frame after the frame to be classified;
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified, obtaining the continuous frame number of the frame to be classified in the low frequency region includes:
  • Obtaining a high-frequency energy distribution including the to-be-classified frame according to a high-frequency energy distribution ratio and a sound pressure level of the frame to be classified in the received audio signal, the pre-frame N3 frame to be classified, and the L2 frame to be classified a number of consecutive frames that are less than the eighth threshold;
• The tonal components whose number of persistent frames in the frame to be classified is greater than the sixth threshold are the tonal components whose position in the frequency domain is greater than a seventh threshold.
• Step 201: Perform an FFT on the current frame, namely the i-th frame; this transformation is performed on every received frame;
• Step 202: Obtain the pitch distribution parameter and the energy distribution parameter of the i-th frame based on the FFT result;
• Step 203: Determine whether i > L1 holds, that is, whether at least L1 frames have been received before the current frame. If it holds, step 204 is performed; otherwise, processing of this frame ends and the foregoing steps 201 and 202 continue to be performed on subsequent frames;
• Step 204: When i > L1, the audio signal classification result of the (i-L1)-th frame may be obtained. Specifically, according to the past information, that is, the pitch distribution parameters and energy distribution parameters of the frames before the (i-L1)-th frame obtained in steps 201 and 202, the current information, that is, the pitch distribution parameters and energy distribution parameters of the (i-L1)-th frame, and the future information, that is, the pitch distribution parameters and energy distribution parameters of the L1 frames after the (i-L1)-th frame, the audio signal classification result of the (i-L1)-th frame is obtained;
  • Step 205 Output an audio signal classification result of the i-L1 frame.
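As an informal illustration of steps 201-205, the sketch below shows how an output delay of L1 frames lets each frame be classified from past, current and future feature vectors. The helpers extract_features and classify_frame are hypothetical stand-ins for steps 201-202 and step 204; they are not defined in this document.

```python
from collections import deque

L1 = 10  # assumed small allowed output delay, in frames


def classify_stream(frames, extract_features, classify_frame):
    """Classify every frame with an output delay of L1 frames.

    extract_features(frame) is assumed to return the pitch and energy
    distribution parameters of one frame (steps 201-202), and
    classify_frame(past, current, future) is assumed to implement the
    decision of step 204.
    """
    history = deque()                                  # features of all frames so far
    results = []
    for i, frame in enumerate(frames, start=1):        # i is the 1-based frame index
        history.append(extract_features(frame))        # steps 201-202
        if i > L1:                                     # step 203
            k = i - L1 - 1                             # 0-based index of frame i-L1
            feats = list(history)
            past, current, future = feats[:k], feats[k], feats[k + 1:]
            results.append(classify_frame(past, current, future))   # steps 204-205
    return results
```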
  • FIG. 3a is a waveform diagram 1 of the input signal "French male voice + ⁇ ”
• FIG. 3b is a spectrum diagram corresponding to FIG. 3a.
  • the sampling rate is 8 kHz, wherein the horizontal axis is the sample point and the vertical axis is the normalized amplitude;
• For the spectrogram of Fig. 3b, the corresponding sampling rate is also 8 kHz, and the frequency analysis range is 0-4 kHz.
  • the horizontal axis is the frame, corresponding to the sample point on the horizontal axis of Figure 3a; the vertical axis is the frequency (Hz).
• The higher the brightness in a certain frequency range, the greater the energy of the signal in that band; if the signal continuously maintains a large amount of energy in a certain frequency band, a "bright band" is formed on the spectrogram, and this is a tone.
  • the pitch duration at the fundamental frequency is slightly longer, the pitch duration at the higher frequency is very short.
• In the speech signal, tones can be detected only where the speech is voiced. Since voiced segments are usually short, the corresponding tone durations are also short; in the music signal of the latter half, the tone durations are significantly longer.
  • FIG. 4a is a waveform diagram of an input signal of the audio signal "Jinghu + French male voice”
  • FIG. 4b is a spectrum diagram corresponding to FIG. 4a.
  • the horizontal axis is the sample point; the vertical axis is the normalized amplitude; in the spectrum diagram of Fig. 4b, the horizontal axis is the frame; and the vertical axis is the frequency (Hz).
• As for the energy distribution in Fig. 4b, in the music signal of the first half the energy is basically distributed above 1 kHz, spread over 1 kHz to 4 kHz; in the speech signal of the latter half, most of the voiced energy is distributed below 1 kHz, while the unvoiced energy is distributed from the low frequencies up to a higher frequency range. Therefore, the energy of the speech signal cannot be continuously distributed over a relatively high frequency range.
• Fig. 5a is a waveform diagram of the input signal "Korean male voice + ensemble", where the horizontal axis is the sample point and the vertical axis is the normalized amplitude; Fig. 5b is the spectrogram corresponding to Fig. 5a, where the horizontal axis is the frame and the vertical axis is the frequency (Hz).
• From the energy distribution it can be seen that the energy distribution of the speech signal in the first half of Fig. 5b is similar to that of the speech signal discussed above.
  • the energy distribution of the speech signal Due to the different energy distribution characteristics of voiced and unvoiced sounds, the energy distribution of the speech signal has a large fluctuation. Therefore, the energy of the speech signal is neither continuously distributed in a relatively high frequency range nor continuously distributed in the low frequency range; in the latter half of the music signal, the energy is mainly distributed below 1 kHz.
• In summary, the differences between the music signal and the speech signal mainly include the following. First, the tone duration of part of the music signals is long, while the tone duration of the speech signal is usually short. Second, the energy of part of the music signals can be continuously distributed in a relatively high frequency range, whereas the energy of the speech signal cannot. Third, the energy of part of the music signals can be continuously distributed in the low frequency region, whereas the energy of the speech signal cannot be continuously distributed in the low frequency region.
• The low-frequency and high-frequency division in the embodiments of the present invention may be determined according to the distribution area of the voice signal: the area where the voice signal is mainly distributed is defined as the low-frequency region, for example, the region below 1 kHz is defined as the low-frequency region, and the region above 1 kHz is defined as the high-frequency region.
  • the specific value may also be different according to the specific application scenario and the specific voice signal.
  • the features to be extracted mainly include pitch characteristics and energy characteristics. Specifically, extracting the tonal features can be divided into three steps:
  • tone component refers to a distribution form of energy in the frequency domain
• Obtaining the initial pitch detection result may include: first, performing an FFT on the data of each frame to obtain a power density spectrum; second, determining the local maximum points in the power density spectrum; and finally, analyzing a number of power density spectral coefficients centered on each local maximum point to determine whether the local maximum point is a true tonal component.
  • the sampling rate of the input signal is 8 kHz
  • the effective bandwidth is 4 kHz
• the FFT size is 1024.
• As for the local maximum points of the power density spectrum, in this embodiment the way of selecting the several power density spectral coefficients centered on a local maximum point is relatively flexible and can be set according to the algorithm. For example, it can be implemented as follows:
• tonal_flag_original[k][f] represents the initial pitch detection result;
  • a value of 1 indicates that the k-th frame data has a tonal component at f
  • a value of 0 indicates that the k-th frame data does not have a tonal component at f.
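A minimal sketch of the initial tone detection described above, assuming an 8 kHz input, a 1024-point FFT and a simple neighbourhood test around each local maximum; the window, the neighbourhood width and the margin are illustrative assumptions, not values specified by this document.

```python
import numpy as np

FFT_SIZE = 1024  # matches the FFT size stated above


def initial_tone_detection(frame, neighbor=3, margin_db=7.0):
    """Return tonal_flag_original[f] for one frame (1 = tonal component at bin f)."""
    spectrum = np.fft.rfft(frame * np.hanning(len(frame)), n=FFT_SIZE)
    power_db = 10.0 * np.log10(np.abs(spectrum) ** 2 + 1e-12)   # power density spectrum (dB)
    flags = np.zeros(len(power_db), dtype=np.uint8)
    for f in range(neighbor, len(power_db) - neighbor):
        # keep only local maximum points of the power density spectrum
        if power_db[f] < power_db[f - 1] or power_db[f] < power_db[f + 1]:
            continue
        # analyse a few coefficients centred on the local maximum: the peak must
        # stand out from its neighbourhood by an assumed margin to count as tonal
        neighbourhood = np.r_[power_db[f - neighbor:f - 1], power_db[f + 2:f + neighbor + 1]]
        if power_db[f] - neighbourhood.max() >= margin_db:
            flags[f] = 1
    return flags
```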
• The L1 frames of data located before the k-th frame are referred to as past frames, and the L1 frames of data located after the k-th frame are referred to as future frames.
• Suppose the k-th frame data has a tonal component at frequency f, that is, tonal_flag_original[k][f] = 1.
  • the steps of the tone continuity analysis are:
• Step 2: Count the number of future tonal components with which this tonal component has continuity, expressed as num_right. Similar to step 1 above, it is sequentially detected whether there is continuity between the tonal components of the (k+1)-th frame, the (k+2)-th frame, and so on and this tonal component, and num_right is output.
• Step 3: According to num_left and num_right, filter the initial tone detection results. If one of the following two conditions is met,
• for example num_right ≥ a3, it indicates that the tonal component of the k-th frame at frequency f has a certain continuity, and the initial pitch detection result is retained; otherwise, it is not retained.
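A sketch of the continuity filtering of steps 2 and 3 above, under the assumption that "continuity" means the tonal component reappears at the same or an adjacent frequency bin in the neighbouring frame; the bin tolerance and the threshold a3 are placeholders.

```python
import numpy as np


def filter_tonal_flags(tonal_flag_original, k, L1, a3=3, bin_tolerance=1):
    """Keep only the tonal components of frame k that persist into future frames.

    tonal_flag_original is a 2-D array [frame][bin] of initial detection results;
    the filtered flags of frame k are returned.
    """
    flags_k = tonal_flag_original[k]
    filtered = np.zeros_like(flags_k)
    for f in np.flatnonzero(flags_k):
        num_right = 0
        for j in range(k + 1, min(k + 1 + L1, len(tonal_flag_original))):
            lo, hi = max(0, f - bin_tolerance), f + bin_tolerance + 1
            if tonal_flag_original[j][lo:hi].any():   # continuity with future frame j
                num_right += 1
            else:
                break                                 # continuity interrupted
        if num_right >= a3:                           # step 3: retain persistent tones
            filtered[f] = 1
    return filtered
```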
  • Fig. 6a is a waveform diagram 2 of the input signal "French male voice + ⁇ "
• Fig. 6b shows the initial tone detection result of the input signal shown in Fig. 6a, where the horizontal axis is the frame, corresponding to the horizontal axis of Fig. 6a.
• Then the tone feature extraction is performed: for the filtered tone detection result, the number of tonal components per frame in the range from the lower frequencies up to the high frequencies (up to F/2) is counted and expressed as num_tonal_flag.
  • Fig. 7a is a waveform diagram 3 of the input signal "French male voice + ⁇ "
• Fig. 7b is a graph of the tone feature corresponding to Fig. 7a, where the horizontal axis is the frame. In the first half of the signal, num_tonal_flag is always 0, which is significantly different from the tone feature of the second half.
• The energy feature extraction method in the above embodiment of the present invention is as follows. Before extracting the energy features, the high-frequency energy distribution ratio ratio_energy_hf(k) and the sound pressure level of each frame first need to be calculated, where k denotes the frame index.
• Im_k(f) is the imaginary part of the FFT of the k-th frame.
• In the ratio, the denominator represents the total energy of the k-th frame, and the numerator represents the energy of the k-th frame in the high-frequency range.
• If ratio_energy_hf(k) is small, the energy of the k-th frame is mainly distributed at low frequencies; on the contrary, if it is large, the energy of the k-th frame is mainly distributed in a higher frequency range.
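Since the exact formula is not reproduced here, the sketch below shows one plausible way to compute the two per-frame quantities, taking 1 kHz as the low/high boundary and a plain dB energy measure as a stand-in for the sound pressure level; both choices are illustrative assumptions.

```python
import numpy as np


def energy_features(frame, sample_rate=8000, fft_size=1024, split_hz=1000.0):
    """Return (ratio_energy_hf, spl) for one frame."""
    spectrum = np.fft.rfft(frame, n=fft_size)
    power = np.abs(spectrum) ** 2                 # Re^2 + Im^2 per frequency bin
    freqs = np.fft.rfftfreq(fft_size, d=1.0 / sample_rate)
    total_energy = power.sum() + 1e-12            # denominator: total energy of the frame
    high_energy = power[freqs >= split_hz].sum()  # numerator: energy above the boundary
    ratio_energy_hf = high_energy / total_energy
    spl = 10.0 * np.log10(total_energy)           # assumed sound-pressure-level proxy (dB)
    return ratio_energy_hf, spl
```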
  • the distribution characteristics of energy at high frequencies and the distribution characteristics of energy at low frequencies are further analyzed.
  • Fig. 8a is a waveform diagram of the input signal "Jinghu + French male voice”.
• FIG. 8b is a graph of the high-frequency energy distribution ratio ratio_energy_hf(k) corresponding to Fig. 8a, where the horizontal axis is the frame, corresponding to the sample points on the horizontal axis of Fig. 8a, and the vertical axis is the high-frequency energy distribution ratio.
  • the variation of the high-frequency energy distribution ratio curve can be seen from Figure 8b:
  • the high-frequency energy distribution ratio is substantially greater than 0.8, indicating that the energy of the Jinghu signal can be continuously distributed in the higher frequency range;
• num_big_ratio_energy_left represents, among the L1 frames of data before the k-th frame, the number of past frames whose energy is continuously distributed in the higher frequency range;
• num_big_ratio_energy_right represents, among the L1 frames of data after the k-th frame, the number of future frames whose energy is continuously distributed in the higher frequency range.
• Step 1: Initialize num_big_ratio_energy_left to 0;
• Step 2: Initialize the variable num_non_big_ratio, which records how many of the examined frames do not satisfy the high-frequency condition, to 0;
• Step 3: Check whether ratio_energy_hf(k-1) and the sound pressure level of the (k-1)-th frame satisfy the required conditions, that is, whether both exceed their respective thresholds;
• After step 3, it is sequentially detected in the same way whether the energy of the (k-2)-th frame, the (k-3)-th frame, and so on is continuously distributed in a higher frequency range. Before each check, the size of num_non_big_ratio is first examined:
• if num_non_big_ratio exceeds its preset limit, the number of frames whose energy is not continuously distributed in the higher frequency range has already exceeded the allowed range, so the detection is not continued and num_big_ratio_energy_left is output; if num_non_big_ratio is still within the preset limit, the detection continues until all of the past L1 frames of data have been checked, and then num_big_ratio_energy_left is output. The steps for obtaining num_big_ratio_energy_right are similar: it is sequentially detected whether the energy of the (k+1)-th frame, the (k+2)-th frame, and so on is continuously distributed in a higher frequency range, and num_big_ratio_energy_right is output.
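As an illustration only, the counting just described could be sketched as follows; the thresholds and the tolerance for interrupting frames are placeholders, not values taken from this document.

```python
def count_high_freq_frames(ratio_hf, spl, k, L1, direction,
                           ratio_thr=0.6, spl_thr=30.0, max_breaks=2):
    """Count neighbouring frames whose energy stays in the higher frequency range.

    direction = -1 scans the past L1 frames (num_big_ratio_energy_left),
    direction = +1 scans the future L1 frames (num_big_ratio_energy_right).
    """
    count = 0
    num_non_big_ratio = 0                       # frames that break the continuity
    for step in range(1, L1 + 1):
        j = k + direction * step
        if j < 0 or j >= len(ratio_hf):
            break
        if num_non_big_ratio > max_breaks:      # too many interruptions: stop early
            break
        if ratio_hf[j] > ratio_thr and spl[j] > spl_thr:
            count += 1
        else:
            num_non_big_ratio += 1
    return count
```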
• Figure 9a shows the waveform of the input signal "Korean male voice + ensemble",
• and Figure 9b is a graph of the high-frequency energy distribution ratio ratio_energy_hf(k) corresponding to Fig. 9a.
• The horizontal axis is the frame; the vertical axis is the high-frequency energy distribution ratio.
• By observing the change of the high-frequency energy distribution ratio curve shown in Figure 9b, it can be seen that in the speech signal of the first half, the fluctuation of the high-frequency energy distribution ratio curve is large, indicating that the energy of the speech signal cannot be continuously distributed at low frequencies; in the music signal of the latter half, the high-frequency energy distribution ratio is substantially less than 0.1, indicating that the energy of the ensemble signal can be continuously distributed at low frequencies.
• num_small_ratio_energy_left indicates the number of past frames whose energy is continuously distributed at low frequencies, and num_small_ratio_energy_right indicates the number of future frames whose energy is continuously distributed at low frequencies.
• num_small_ratio_energy_left is obtained in a way similar to num_big_ratio_energy_left, except that, when the past L1 frames of data are analyzed, what is checked is whether ratio_energy_hf(i) is below its threshold, that is, whether the energy is concentrated at low frequencies.
• Step 1: Initialize num_small_ratio_energy_right to 0;
• Step 2: Sequentially detect whether the high-frequency energy distribution ratio ratio_energy_hf(i) of the (k+1)-th frame, the (k+2)-th frame, and so on satisfies the condition ratio_energy_hf(i) < a9. If the condition is not satisfied, the detection is not continued and num_small_ratio_energy_right is output; if the condition is satisfied,
• num_small_ratio_energy_right = num_small_ratio_energy_right + 1, and the detection continues until all of the future L1 frames of data have been checked, after which num_small_ratio_energy_right is output.
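The low-frequency counters reuse the same scheme with the complementary condition (a ratio below a threshold such as a9); a minimal sketch under the same placeholder assumptions.

```python
def count_low_freq_frames(ratio_hf, spl, k, L1, direction,
                          ratio_thr=0.1, spl_thr=30.0, max_breaks=2):
    """num_small_ratio_energy_left/right: frames whose energy stays at low frequencies."""
    count, breaks = 0, 0
    for step in range(1, L1 + 1):
        j = k + direction * step
        if j < 0 or j >= len(ratio_hf) or breaks > max_breaks:
            break
        if ratio_hf[j] < ratio_thr and spl[j] > spl_thr:
            count += 1
        else:
            breaks += 1
    return count
```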
  • the classification rule may be as shown in FIG. 10.
  • the classification rule may include the following steps:
• Step 301: Determine whether the number of persistent tonal components is greater than 0, that is, whether num_tonal_flag > 0. If the condition is met, the initial classification result may be output as a music signal; otherwise, continue to step 302. Step 302: Analyze the distribution characteristics of the energy in the higher frequency range; first determine whether the high-frequency energy distribution ratio of the current frame and its sound pressure level both exceed their respective thresholds. If yes, perform step 303; otherwise, perform step 304;
• Step 303: Determine whether num_big_ratio_energy_left, num_big_ratio_energy_left + num_big_ratio_energy_right, or num_big_ratio_energy_right exceeds its respective threshold (for example, num_big_ratio_energy_left ≥ a11 or num_big_ratio_energy_left + num_big_ratio_energy_right ≥ a10). If one of these conditions is satisfied, the initial classification result is output as a music signal; otherwise, step 304 is performed;
• Step 304: Determine whether the high-frequency energy distribution ratio is less than a9, that is, whether the energy of the current frame is mainly distributed at low frequencies. If yes, perform step 305; otherwise, output the initial classification result as a voice signal;
• Step 305: Determine whether num_small_ratio_energy_left ≥ a13 is satisfied, or num_small_ratio_energy_left + num_small_ratio_energy_right ≥ a12, or num_small_ratio_energy_right exceeds its threshold. If one of these conditions is satisfied, the initial classification result is output as a music signal; otherwise, the initial classification result is output as a voice signal.
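Putting steps 301-305 together, the decision logic of Figure 10 might look like the sketch below; the threshold dictionary and its a6...a13 entries are placeholders for the thresholds mentioned above, and the fall-through between steps is an interpretation of the flow just described.

```python
def initial_classification(num_tonal_flag, ratio_hf, spl,
                           big_left, big_right, small_left, small_right, thr):
    """Return 'music' or 'speech' for one frame (steps 301-305).

    thr is assumed to be a dict holding the a6...a13 thresholds.
    """
    if num_tonal_flag > 0:                                   # step 301: persistent tones
        return 'music'
    if ratio_hf > thr['a6'] and spl > thr['a7']:             # step 302: energy at high freq
        if (big_left >= thr['a11'] or
                big_left + big_right >= thr['a10']):         # step 303: persistently high
            return 'music'
    if ratio_hf < thr['a9']:                                 # step 304: energy at low freq
        if (small_left >= thr['a13'] or
                small_left + small_right >= thr['a12']):     # step 305: persistently low
            return 'music'
    return 'speech'
```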
• Figure 11a is a waveform diagram of the input signal "Chinese female + ensemble + English male + cymbal + German male + castanets"; three of the music segments, namely the ensemble, the cymbal and the castanets, are typical in terms of their pitch or energy characteristics;
• Figure 11b is a schematic diagram of the classification results corresponding to Figure 11a, where the horizontal axis is the sample point and the vertical axis is the classification result; a value of 0 corresponds to the speech signal and a non-zero value corresponds to the music signal. From bottom to top, the vertical axis gives four classification results:
• MUSIC_tone feature: the classification result obtained using only the tone feature, shown as a solid line; it shows which signals in Figure 11a are suitable for the classification rule based on the tone feature. MUSIC_energy feature_1: the classification result obtained using only "energy feature_1", shown as a dashed line. "Energy feature_1" here refers to whether the energy can be continuously distributed in a higher frequency range; it shows which signals in Fig. 11a are suitable for the classification rule based on the high-frequency energy distribution characteristic;
• For the ensemble signal between 100000 and 300000 sample points, the energy fluctuation of this piece of music is very large and only a few frames have energy continuously distributed in a higher frequency range, so energy feature_1 and energy feature_2 do not take effect here;
  • the pitch of the segment signal has good persistence and can be detected by using the tonal feature;
  • the energy of the segment signal is mainly distributed in the low frequency, and can be detected by using the energy feature_2;
• For the castanet signal after 600000 sample points, tonal components can hardly be detected, so the tone feature does not work;
• the energy of this segment of the signal is mainly distributed at high frequencies and can be detected by using energy feature_1.
  • the technical solution provided by the embodiment of the present invention can also be applied to an application scenario with a large output delay.
  • the output delay is L2+L3
  • the first embodiment may be provided according to the foregoing embodiment.
• In this technical solution, when i > L2, the audio signal classification result of the (i-L2)-th frame is obtained according to the past information, that is, the pitch distribution parameters and energy distribution parameters of several frames before the (i-L2)-th frame, the current information, that is, the pitch distribution parameters and energy distribution parameters of the (i-L2)-th frame, and the future information, that is, the pitch distribution parameters and energy distribution parameters of the L2 frames after the (i-L2)-th frame; for the specific implementation, refer to the above embodiment. Further, when i > (L2+L3), smoothing is performed, that is,
• the initial classification result of the (i-L2-L3)-th frame is corrected according to the initial classification results of the N4 frames before the (i-L2-L3)-th frame to be classified and the L3 frames after it.
• The foregoing N4 frames may be the preceding L3 frames. For the k-th frame, the correction process is as follows:
• the initial classification results of the L3 frames located before the k-th frame and the L3 frames located after the k-th frame are counted, and the number of frames classified as a music signal, num_music, and the number of frames classified as a voice signal among them are acquired.
• If the result of the initial classification of the k-th frame is a voice signal and num_music is greater than its threshold,
• the classification result of the k-th frame is corrected to a music signal; if the result of the initial classification of the k-th frame is a music signal and num_music is less than another threshold, the classification result of the k-th frame is corrected to a voice signal.
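A sketch of this correction (smoothing) step; the majority-style threshold is a placeholder for the thresholds of the embodiment.

```python
def smooth_classification(initial, k, L3, music_thr=None):
    """Correct the initial result of frame k using the L3 frames before and after it.

    initial is a list of 'music'/'speech' labels; music_thr is a placeholder for
    the correction thresholds (a simple majority is used by default).
    """
    window = initial[max(0, k - L3):k] + initial[k + 1:k + 1 + L3]
    num_music = sum(1 for label in window if label == 'music')
    thr = music_thr if music_thr is not None else len(window) // 2
    if initial[k] == 'speech' and num_music > thr:
        return 'music'                    # mostly music around it: correct to music
    if initial[k] == 'music' and num_music < len(window) - thr:
        return 'speech'                   # mostly speech around it: correct to speech
    return initial[k]
```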
• Figure 12a is a waveform diagram of the input signal "Chinese female + ensemble + English male + cymbal + German male + castanets", the same signal as in Figure 11a. Figure 12b shows the smoothed results; from bottom to top, the vertical axis gives two classification results:
• MUSIC_smoothed result: the result obtained by smoothing the initial classification result, shown as a dashed line.
• The initial classification result has a misjudgment between 250000 and 300000 sample points, where the music signal is misjudged as a voice signal; for the signal between 400000 and 550000 sample points,
• the initial classification result has a misjudgment at the end of the segment, where the music signal is again misjudged as a voice signal.
  • the above misjudgment was corrected by smoothing.
• The principle and the steps of acquiring the pitch distribution parameters and the energy distribution parameters are similar to those of the above technical solution, except that only past information and
• current information are referenced: because there is no output delay, the classification result needs to be obtained in real time and future information cannot be referenced.
  • the tone feature can be extracted by referring to the foregoing embodiment, and can be divided into three steps:
  • step A reference may be made to the above embodiment, and the following mainly describes the steps B and C.
• tonal_flag_original[k][f] represents the initial tone detection result; a value of 1 indicates that the k-th frame data has a tonal component at f,
• and a value of 0 indicates that the k-th frame data does not have a tonal component at f.
  • the L1 frame data located before the k-th frame is referred to as a past frame.
  • the steps of the tone continuity analysis are:
• Step 1: Count the number of past tonal components with which this tonal component has continuity, expressed as num_left; initialize the variable num_left to 0, and initialize the variable indicating discontinuity to 0;
• Step 2: Sequentially detect whether there is continuity between the tonal components of the (k-1)-th frame, the (k-2)-th frame, and so on and the tonal component of the current frame. Before each check, the size of the discontinuity counter is first examined;
• Step 3: Filter the initial pitch detection results according to num_left;
• num_big_ratio_energy_left: this feature refers to the number of past frames, among the L1 frames of data before the k-th frame, whose energy is continuously distributed in the higher frequency range. It is obtained as follows:
• Step 1: Initialize num_big_ratio_energy_left to 0;
• Step 2: Initialize the variable num_non_big_ratio to 0;
• Step 3: Check whether ratio_energy_hf(k-1) and the sound pressure level of the (k-1)-th frame satisfy the required conditions, for example ratio_energy_hf(k-1) greater than its threshold and the sound pressure level greater than its threshold. If the conditions are not satisfied, the energy of the (k-1)-th frame data is not continuously distributed in the higher frequency range, and this event is recorded: num_non_big_ratio = num_non_big_ratio + 1. If the conditions are satisfied, the energy of the (k-1)-th frame data is continuously distributed in the higher frequency range and num_big_ratio_energy_left is incremented;
• After step 3, it is sequentially detected in the same way whether the energy of the (k-2)-th frame, the (k-3)-th frame, and so on is continuously distributed in a higher frequency range, the size of num_non_big_ratio being checked before each step as described above.
• num_small_ratio_energy_left: this feature refers to the number of past frames whose energy is continuously distributed at low frequencies; it is obtained in a similar way.
  • the classification rule may be as shown in FIG. 13, and for the k-th frame data, it may include the following steps:
• Step 401: Determine whether the number of persistent tonal components is greater than 0, that is, whether num_tonal_flag > 0. If the condition is met, the classification result may be output as a music signal; otherwise, the energy features continue to be analyzed;
• Step 402: Analyze the distribution characteristics of the energy in the higher frequency range; first determine whether the high-frequency energy distribution ratio of the current frame and its sound pressure level both exceed their respective thresholds.
• If yes, execute step 403; otherwise, execute step 404;
• Step 403: Determine whether num_big_ratio_energy_left exceeds the threshold b8; if yes, output the classification result as a music signal; otherwise, perform step 404;
• Step 404: Determine whether the high-frequency energy distribution ratio is less than b7, that is, whether the energy of the current frame is mainly distributed at low frequencies; if yes, perform step 405; otherwise, output the classification result as a voice signal;
• Step 405: Determine whether num_small_ratio_energy_left exceeds the threshold b9; if it does, the classification result is output as a music signal; otherwise, the classification result is output as a voice signal.
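The zero-delay rule of Figure 13 has the same shape as Figure 10 but uses only the past-frame counters; a compressed sketch, with placeholder names for the step-402 thresholds.

```python
def realtime_classification(num_tonal_flag, ratio_hf, spl, big_left, small_left, thr):
    """Zero-delay variant of the decision rule (steps 401-405); thr holds the b-thresholds."""
    if num_tonal_flag > 0:                                   # step 401
        return 'music'
    if ratio_hf > thr['b5'] and spl > thr['b6']:             # step 402 (placeholder names)
        if big_left >= thr['b8']:                            # step 403
            return 'music'
    if ratio_hf < thr['b7']:                                 # step 404
        if small_left >= thr['b9']:                          # step 405
            return 'music'
    return 'speech'
```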
• Figure 14a is waveform diagram 3 of the input signal "Chinese female + ensemble + English male + cymbal + German male + castanets", the same signal as in Figure 11a; three of the music segments, namely the ensemble, the cymbal and the castanets, are typical in terms of their pitch or energy characteristics.
• Figure 14b gives an example of the real-time classification results, where the horizontal axis is the sample point and the vertical axis is the classification result; a value of 0 corresponds to the speech signal and a non-zero value corresponds to the music signal. It can be seen from Fig. 14a and Fig. 14b that, since there is no future information for reference, a small amount of the music signal is misjudged as a voice signal.
  • FIG. 15 is a flowchart of a voice classification method in a case where an output delay is not fixed according to an embodiment of the present invention, and as shown in FIG. 15, the following steps are included:
• Step 501: Perform an FFT on the current frame, namely the i-th frame;
  • Step 502 Obtain a tone distribution parameter of the ith frame and cache according to the FFT transform result.
  • Step 503 Obtain an energy distribution parameter of the ith frame and cache according to the FFT transform result.
  • Step 504 Generate and cache a real-time classification result of the ith frame.
• Specifically, the past information generated and cached in step 502 and step 503, that is, the pitch distribution parameters and energy distribution parameters of the frames before the i-th frame, is used together with the parameters of the i-th frame to obtain the tone feature and the energy feature of the i-th frame, and the real-time classification result is generated and cached.
• Step 505: When i > L1, where L1 is the small output delay that is allowed, in addition to obtaining the real-time classification result of each received frame, the initial classification result of the (i-L1)-th frame may also be generated and cached. Specifically, when generating the initial classification result of the (i-L1)-th frame, reference may be made to the past information, that is, the pitch distribution parameters and energy distribution parameters of several frames before the (i-L1)-th frame, the current information, that is, the pitch distribution parameters and energy distribution parameters of the (i-L1)-th frame, and the future information, that is, the pitch distribution parameters and energy distribution parameters of the L1 frames after the (i-L1)-th frame.
• Step 506: When i > (L2+L3), generate and cache the corrected classification result of the (i-L2-L3)-th frame. Specifically, with reference to the past information, that is, the initial classification results of several frames before the (i-L2-L3)-th frame,
• and the future information, that is, the initial classification results of the L3 frames located after the (i-L2-L3)-th frame, the initial classification result of the (i-L2-L3)-th frame is corrected; for the specific implementation, refer to the foregoing embodiment.
• Step 507: According to the allowed output delay, select one of the classification results of the foregoing step 504, step 505 and step 506 as the classification result of the j-th frame to be classified: if a sufficiently large output delay is allowed, the optimal result is output, that is, the corrected classification result of the j-th frame;
• if only a small output delay is allowed, the suboptimal result is output, that is, the initial classification result of the j-th frame;
• if no output delay is allowed, the zero-delay result is output, that is, the real-time classification result of the j-th frame.
  • the value of L2 can be set equal to L1.
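The selection of step 507 amounts to a dispatch on the delay the application can afford; in the sketch below the three caches are assumed to have been filled by steps 504-506.

```python
def select_result(j, allowed_delay, realtime, initial, corrected, L1, L2, L3):
    """Step 507: pick the best classification available for frame j.

    realtime, initial and corrected are the caches assumed to be filled in
    steps 504, 505 and 506 respectively.
    """
    if allowed_delay >= L2 + L3 and j < len(corrected):
        return corrected[j]        # optimal result, delayed by L2+L3 frames
    if allowed_delay >= L1 and j < len(initial):
        return initial[j]          # suboptimal result, delayed by L1 frames
    return realtime[j]             # zero-delay, real-time result
```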
• Figure 16a is waveform diagram 4 of the input signal "Chinese female + ensemble + English male + cymbal + German male + castanets", the same signal as in Figure 11a; three of the music segments, namely the ensemble, the cymbal and the castanets, are typical in terms of their pitch or energy characteristics.
• Figure 16b shows the classification results obtained by the three classification methods. On the vertical axis, the MUSIC_real-time classification results are indicated by solid lines, the MUSIC_initial classification results are indicated by dashed lines, and the MUSIC_corrected classification results are indicated by dotted lines.
  • the extracted feature can reflect the more essential features of the music signal different from the voice signal, so that the classification accuracy rate at the low sampling rate is significantly improved. Since the method for extracting features of the technical solution of the embodiment of the present invention is not limited to the sampling rate, it is applicable not only to a low sampling rate but also to signal classification at a high sampling rate. Under the premise of ensuring low algorithm complexity, users can flexibly select real-time classification results, sub-optimal classification results or optimal classification results according to their needs.
  • FIG. 17 is a schematic structural diagram of an audio signal classification processing apparatus according to an embodiment of the present invention. As shown in FIG. 17, the apparatus includes a first obtaining module 11 and a classification determining module 12, wherein the first acquiring module 11 is configured to acquire an audio signal.
• The classification determining module 12 is configured to determine, according to at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region, and the number of consecutive frames of the frame to be classified in the high-frequency region, that the frame to be classified in the audio signal is a music signal, or determine that the frame to be classified in the audio signal is a voice signal.
  • the technical solution provided by the above embodiments of the present invention mainly considers the characteristics of the music signal, for example, the tone duration of the music signal is long, and the tone duration of the voice signal is short, and the energy of the music signal can be continuously distributed in the high frequency region. Or a low frequency region, and the speech signal is generally not continuously distributed in the high frequency region or the low frequency region.
• First, the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal, and the number of consecutive frames of the frame to be classified in the low-frequency region of the audio signal and/or the number of consecutive frames of the frame to be classified in the high-frequency region, are obtained, and whether the frame to be classified is a music signal or a voice signal is confirmed according to the above information. The audio signal classification processing method provided by the above technical solution can improve the correct rate of audio signal classification and meet the requirements of voice quality assessment.
  • the execution steps of each module may be different according to the presence or absence of the output delay and the output delay length, and specifically include the following situations:
  • the first acquisition module is specifically configured to acquire a to-be-classified frame in the audio signal, and a pitch distribution parameter of the N1 frame before the frame to be classified, and according to the to-be-classified frame, and the to-be-classified frame
  • the tone distribution parameter of the first N1 frame obtains the number of tonal components satisfying the continuity constraint in the frame to be classified, and N1 is a positive integer; or, specifically, is used to acquire a frame to be classified in the audio signal, and a frame before the frame to be classified
  • the energy distribution parameter and obtaining, according to the to-be-classified frame in the audio signal, and the energy distribution parameter of the N1 frame before the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region or the frame to be classified in the high frequency region Number of consecutive frames
• The classification determining module 12 is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that
• the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a voice signal.
  • the first acquiring module obtains the pitch distribution parameter of the frame to be classified in the audio signal, and the pitch distribution parameters of the N1 frame before the frame to be classified include:
  • the frequency domain distribution information of the tonal component is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N1 frame is used as the pitch distribution parameter of the pre-frame N1 frame to be classified.
  • the classification determining module obtains, according to the pitch distribution parameter of the frame to be classified, and the pitch distribution parameter of the pre-frame N1 frame, the number of tonal components satisfying the continuity constraint in the frame to be classified, including:
  • the module obtains an energy distribution parameter of the frame to be classified in the audio signal, and an energy distribution parameter of the N1 frame before the frame to be classified includes:
  • the foregoing classification determining module acquires the continuous frame number of the frame to be classified in the low frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameter of the N1 frame before the frame to be classified, including:
  • the foregoing classification determining module acquires the continuous frame number of the to-be-classified frame in the high frequency region according to the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameter of the N1 frame before the frame to be classified Includes:
• In the second case, the first acquiring module is specifically configured to acquire the frame to be classified in the audio signal and the pitch distribution parameters of the N2 frames before the frame to be classified and the L1 frames after the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where
• N2 is a positive integer; or, it is specifically configured to acquire the frame to be classified in the audio signal and the energy distribution parameters of the N2 frames before the frame to be classified and the L1 frames after the frame to be classified, and to obtain, according to the energy distribution parameters of the frame to be classified in the audio signal, the N2 frames before the frame to be classified and the L1 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region or the number of consecutive frames of the frame to be classified in the high-frequency region;
  • the classification determining module is specifically configured to: the number of tonal components satisfying the continuity constraint in the to-be-classified frame is greater than a first threshold, and the number of persistent frames of the to-be-classified frame in the low-frequency region is greater than a second threshold or the to-be-determined When the number of consecutive frames in the high frequency region is greater than the third threshold, the frame to be classified in the audio signal is determined to be a music signal, and the frame to be classified in the audio signal is determined to be a voice signal.
  • the first acquiring module acquires a pitch distribution parameter of the frame to be classified in the audio signal, a pitch distribution parameter of the N2 frame before the frame to be classified, and a pitch distribution parameter of the L1 frame after the frame to be classified includes:
  • the frequency domain distribution information of the tonal component of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N2 frame is used as the pitch distribution parameter of the N2 frame before the frame to be classified.
• The frequency domain distribution information of the tonal components of the L1 frames after the frame to be classified is used as the pitch distribution parameters of the L1 frames after the frame to be classified.
  • the classification determining module obtains the number of tonal components satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified.
• According to the frequency domain distribution information of the tonal components of the frame to be classified, the N2 frames before the frame to be classified, and the L1 frames after the frame to be classified, the number of tonal components whose number of persistent frames in the frame to be classified is greater than a sixth threshold is acquired.
  • the first acquiring module acquires the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameter of the N2 frame before the frame to be classified and the energy distribution parameter of the L1 frame after the frame to be classified include:
• the high-frequency energy distribution ratio and the sound pressure level of the frame to be classified, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified are used as the energy distribution parameters of the frame to be classified, of the N2 frames before the frame to be classified, and of the L1 frames after the frame to be classified, respectively.
  • the classification determining module acquires the continuous frame of the to-be-classified frame in the low-frequency region according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N2 frame before the frame to be classified, and the energy distribution parameter of the L1 frame after the frame to be classified.
  • the numbers include:
  • the classification determining module obtains the continuation of the to-be-classified frame in the high-frequency region according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N2 frame before the frame to be classified, and the energy distribution parameter of the L1 frame after the frame to be classified.
  • the number of frames includes:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the third is to obtain a classification result of the to-be-classified frame, and the L2 and L3 are positive integers, and the first acquiring module is specifically configured to acquire a frame to be classified in the audio signal, and the N3 frame to be classified before the frame.
  • N3 is a positive integer; or, specifically, for acquiring a frame to be classified in the audio signal, and an energy distribution parameter of the N3 frame before the frame to be classified and the L2 frame after the frame to be classified, and according to the audio signal
  • the number of consecutive frames in the low frequency region or the number of consecutive frames in the high frequency region of the frame to be classified is obtained by the energy distribution parameter of the frame to be classified, the N3 frame before the frame to be classified, and the L2 frame after the frame to be classified;
• The classification processing module is specifically configured to: when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, determine that the frame to be classified in the audio signal is a music signal, and otherwise determine that the frame to be classified in the audio signal is a voice signal; and further to: if it is determined that the frame to be classified in the audio signal is a music signal, determine whether the number of frames determined to be music signals among the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is less than a corresponding threshold, and if it is less,
• the frame to be classified is corrected to a voice signal; if it is determined that the frame to be classified in the audio signal is a voice signal, it is determined whether the number of frames determined to be music signals among the N4 frames before the frame to be classified and the L3 frames after the frame to be classified is greater than a fifth threshold, and if it is greater, the frame to be classified in the audio signal is corrected to a music signal, where N4 is a positive integer.
  • the first acquiring module obtains the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified includes:
  • the frequency domain distribution information of the tonal components of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal components of the pre-frame N3 frame is used as the pitch distribution parameter of the N3 frame before the frame to be classified.
  • the frequency domain distribution information of the tonal components of the L2 frame after the frame to be classified is used as the pitch distribution parameter of the L2 frame after the frame to be classified.
  • the classification determining module obtains the number of tonal components satisfying the continuity constraint in the frame to be classified according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N3 frame before the frame to be classified, and the pitch distribution parameter of the L2 frame after the frame to be classified.
  • the first acquiring module acquires an energy distribution parameter of the frame to be classified in the audio signal, and an energy distribution parameter of the N3 frame before the frame to be classified and an energy distribution parameter of the L2 frame after the frame to be classified include:
• the high-frequency energy distribution ratio and the sound pressure level of the N3 frames before the frame to be classified are used as the energy distribution parameters of the N3 frames before the frame to be classified, and the high-frequency energy distribution ratio and the sound pressure level of the L2 frames after the frame to be classified are used as the energy distribution parameters of the L2 frames after the frame to be classified.
  • the foregoing classification determining module acquires the number of consecutive frames of the to-be-classified frame in the low-frequency region according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified.
  • Obtaining a high-frequency energy distribution including the to-be-classified frame according to a high-frequency energy distribution ratio and a sound pressure level of the frame to be classified in the received audio signal, the pre-frame N3 frame to be classified, and the L2 frame to be classified a number of consecutive frames that are less than the eighth threshold;
  • the classification determining module acquires the continuous frame of the to-be-classified frame in the high-frequency region according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N3 frame before the frame to be classified, and the energy distribution parameter of the L2 frame after the frame to be classified.
  • the numbers include:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
• The tonal components, counted by the first acquiring module, whose number of persistent frames in the frame to be classified is greater than the sixth threshold are the tonal components whose position in the frequency domain is greater than a seventh threshold.
  • FIG. 18 is a schematic structural diagram of an audio signal classification processing device according to an embodiment of the present invention.
• The device includes a receiver 21 and a processor 22, where the receiver 21 is configured to receive an audio signal, and the processor 22 is connected to the receiver 21 and configured to acquire at least one of the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal received by the receiver, the number of consecutive frames of the frame to be classified in the low-frequency region of the audio signal, and the number of consecutive frames of the frame to be classified in the high-frequency region, and to determine, according to at least one of these quantities, that the frame to be classified in the audio signal is a music signal, or determine that
• the frame to be classified in the audio signal is a voice signal.
  • the technical solution provided by the above embodiments of the present invention mainly considers the characteristics of the music signal, for example, the tone duration of the music signal is long, and the tone duration of the voice signal is short, and the energy of the music signal can be continuously distributed in the high frequency region. Or a low frequency region, and the speech signal is usually not continuously distributed in the high frequency region or the low frequency region, and based on the above characteristics of the music signal,
• In the technical solution provided by the embodiment of the present invention, first, the number of tonal components satisfying the continuity constraint in the frame to be classified in the audio signal, and the number of persistent frames of the frame to be classified in the low-frequency region and/or the number of persistent frames of the frame to be classified in the high-frequency region, are acquired, and the type of the frame to be classified is then confirmed according to this information.
• The audio signal classification processing method provided by the above technical solution can improve the correct rate of audio signal classification and satisfy the requirements of voice quality assessment.
  • the processor may be implemented by a software flow, or may be implemented by using a hardware entity device such as a digital signal processing (DSP) chip.
  • the processor may include the following situations according to the real-time acquisition of the classification result of the to-be-classified frame or the length of the delay of the classification result output:
• The processor is specifically configured to acquire the frame to be classified in the audio signal and the pitch distribution parameters of the N1 frames before the frame to be classified, and to obtain, according to the frame to be classified and the pitch distribution parameters of the N1 frames before the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N1 is a positive integer;
• and to acquire the frame to be classified in the audio signal and the energy distribution parameters of the N1 frames before the frame to be classified, and to obtain, according to the energy distribution parameters of the frame to be classified in the audio signal and of the N1 frames before the frame to be classified, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region.
  • the processor obtains a pitch distribution parameter of the frame to be classified in the audio signal, and a pitch distribution parameter of the N1 frame before the frame to be classified includes:
  • the frequency domain distribution information of the tonal component is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N1 frame is used as the pitch distribution parameter of the pre-frame N1 frame to be classified.
  • the processor according to the pitch distribution parameter of the frame to be classified, and the tone of the N1 frame before the frame to be classified
  • the obtaining, by the distribution parameter, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: obtaining the to-be-classified frame according to the frequency domain distribution information of the to-be-classified frame in the received audio signal and the tonal component of the pre-frame N1 frame to be classified The number of the tone components whose number of consecutive frames is greater than the sixth threshold value.
  • the processor acquires the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameters of the N1 frame before the frame to be classified include:
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameter of the pre-frame N1 frame, the number of consecutive frames of the frame to be classified in the low frequency region includes:
  • Obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, and the energy distribution parameter of the pre-frame N1 frame, the number of consecutive frames of the frame to be classified in the high frequency region includes:
• The second case is that, when the classification result of the frame to be classified is obtained with an allowed output delay of L1 frames, L1 being a positive integer, the processor is specifically configured to acquire the frame to be classified in the audio signal and the pitch distribution parameters of the N2 frames before the frame to be classified and the L1 frames after the frame to be classified, and to obtain, according to these pitch distribution parameters, the number of tonal components satisfying the continuity constraint in the frame to be classified, where N2 is a positive integer;
• and to acquire the energy distribution parameters of the frame to be classified, of the N2 frames before the frame to be classified and of the L1 frames after the frame to be classified, and to obtain, according to these energy distribution parameters, the number of consecutive frames of the frame to be classified in the low-frequency region and/or the number of consecutive frames of the frame to be classified in the high-frequency region;
• and to determine, when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low-frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high-frequency region is greater than a third threshold, that the frame to be classified in the audio signal is a music signal, and otherwise determine that it is a voice signal.
  • the processor acquires a pitch distribution parameter of the frame to be classified in the audio signal, before the frame to be classified
  • the pitch distribution parameters of the N2 frame, and the pitch distribution parameters of the L1 frame after the frame to be classified include:
  • the frequency domain distribution information of the tonal component of the to-be-classified frame in the audio signal is used as the pitch distribution parameter of the frame to be classified, and the frequency domain distribution information of the tonal component of the pre-frame N2 frame is used as the pitch distribution parameter of the N2 frame before the frame to be classified.
• and the frequency domain distribution information of the tonal components of the L1 frames after the frame to be classified is used as the pitch distribution parameters of the L1 frames after the frame to be classified.
  • the processor obtains, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameter of the N2 frame before the frame to be classified, and the pitch distribution parameter of the L1 frame after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes:
• obtaining, according to the frequency domain distribution information of the tonal components of the frame to be classified in the received audio signal, the N2 frames before the frame to be classified,
• and the L1 frames after the frame to be classified, the number of tonal components whose number of consecutive frames in the frame to be classified is greater than a sixth threshold.
  • the processor obtains an energy distribution parameter of the frame to be classified in the audio signal, and an energy distribution parameter of the N2 frame before the frame to be classified and an energy distribution parameter of the L1 frame after the frame to be classified include: acquiring a frame to be classified in the received audio signal
  • the high frequency energy distribution ratio and the sound pressure level are used as the energy distribution parameters of the frame to be classified
• the high-frequency energy distribution ratio and the sound pressure level of the N2 frames before the frame to be classified are used as the energy distribution parameters of the N2 frames before the frame to be classified, and
• the high-frequency energy distribution ratio and the sound pressure level of the L1 frames after the frame to be classified are used as the energy distribution parameters of the L1 frames after the frame to be classified.
  • the processor obtains, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameter of the N2 frame before the frame to be classified, and the energy distribution parameter of the L1 frame after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region includes:
  • the processor according to the energy distribution parameter of the frame to be classified in the audio signal, the N2 frame to be classified before the frame
  • the energy distribution parameter and the energy distribution parameter of the L1 frame after the frame to be classified obtain the continuous frame number of the frame to be classified in the high frequency region, including:
  • the ratio is greater than the ninth threshold, and the sound pressure level is greater than the tenth threshold.
  • the third implementation is that, when the classification result output delay is L2+L3 frames, L2 and L3 being positive integers, the processor is specifically configured to: acquire the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified, N3 being a positive integer, and obtain from them the number of tonal components satisfying the continuity constraint in the frame to be classified; acquire the energy distribution parameter of the frame to be classified in the audio signal and the energy distribution parameters of the N3 frames before the frame to be classified and of the L2 frames after the frame to be classified, and obtain from them the number of consecutive frames of the frame to be classified in the low frequency region and/or the number of consecutive frames of the frame to be classified in the high frequency region; and, when the number of tonal components satisfying the continuity constraint in the frame to be classified is greater than a first threshold, the number of consecutive frames of the frame to be classified in the low frequency region is greater than a second threshold, or the number of consecutive frames of the frame to be classified in the high frequency region is greater than a corresponding threshold, determine that the frame to be classified is a music signal (see the decision-rule sketch after this list); in a subsequent correction step evaluated over N4 frames, the frame to be classified in the audio signal may be corrected to a voice signal, N4 being a positive integer.
  • the processor acquiring the pitch distribution parameter of the frame to be classified in the audio signal, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified includes: using the frequency domain distribution information of the tonal components of the frame to be classified in the audio signal as the pitch distribution parameter of the frame to be classified, using the frequency domain distribution information of the tonal components of the N3 frames before the frame to be classified as the pitch distribution parameters of the N3 frames before the frame to be classified, and using the frequency domain distribution information of the tonal components of the L2 frames after the frame to be classified as the pitch distribution parameters of the L2 frames after the frame to be classified.
  • the processor obtaining, according to the pitch distribution parameter of the frame to be classified, the pitch distribution parameters of the N3 frames before the frame to be classified, and the pitch distribution parameters of the L2 frames after the frame to be classified, the number of tonal components satisfying the continuity constraint in the frame to be classified includes: acquiring the number of tonal components in the frame to be classified whose number of consecutive frames is greater than the sixth threshold.
  • the processor obtaining the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified includes: acquiring the high frequency energy distribution ratio and the sound pressure level of the frame to be classified in the received audio signal as the energy distribution parameter of the frame to be classified, the high frequency energy distribution ratio and the sound pressure level of the N3 frames before the frame to be classified as the energy distribution parameters of the N3 frames before the frame to be classified, and the high frequency energy distribution ratio and the sound pressure level of the L2 frames after the frame to be classified as the energy distribution parameters of the L2 frames after the frame to be classified.
  • the processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the low frequency region;
  • the processor obtaining, according to the energy distribution parameter of the frame to be classified in the audio signal, the energy distribution parameters of the N3 frames before the frame to be classified, and the energy distribution parameters of the L2 frames after the frame to be classified, the number of consecutive frames of the frame to be classified in the high frequency region includes: acquiring the number of consecutive frames whose high frequency energy distribution ratio is greater than the ninth threshold and whose sound pressure level is greater than the tenth threshold.
  • the number of tonal components acquired by the processor whose number of consecutive frames in the frame to be classified is greater than the sixth threshold is the number of such tonal components that are greater than a seventh threshold in the frequency domain (see the frequency-filtering sketch after this list).
  • the aforementioned program can be stored in a computer readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage media include ROM, RAM, magnetic disks, optical discs, and other media that can store program code.
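
The following sketches illustrate, in Python, one possible reading of the steps listed above; none of the function names, parameters, or threshold values are taken from the patent. First, the frequency domain distribution information of a frame's tonal components could be represented as the set of FFT bins at which the frame's spectrum has a pronounced local peak. A minimal sketch, assuming a Hann-windowed FFT and an illustrative peak margin:

```python
import numpy as np

def tonal_component_bins(frame, peak_margin_db=6.0):
    """Return the FFT bin indices of local spectral peaks in one frame.

    The resulting set of peak bins stands in for the frame's frequency
    domain distribution information of tonal components (its pitch
    distribution parameter in the sense used above).
    """
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    spectrum_db = 20.0 * np.log10(spectrum + 1e-12)
    bins = set()
    for k in range(1, len(spectrum_db) - 1):
        # A bin is treated as tonal if it stands out from both neighbours
        # by at least peak_margin_db (an illustrative value, not from the patent).
        if (spectrum_db[k] - spectrum_db[k - 1] >= peak_margin_db and
                spectrum_db[k] - spectrum_db[k + 1] >= peak_margin_db):
            bins.add(k)
    return bins
```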
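The continuity constraint could then be read as requiring that a tonal component of the frame to be classified also appears, at the same or a nearby bin, in enough consecutive neighbouring frames. A minimal sketch under that reading, assuming peak_sets holds one peak set per frame (the frames before, the frame to be classified, and the frames after it); the one-bin tolerance and the argument names are assumptions:

```python
def count_continuous_tonal_components(peak_sets, centre_index, sixth_threshold, bin_tolerance=1):
    """Count tonal components of the centre frame whose run of consecutive
    frames containing that component exceeds sixth_threshold."""
    def present(frame_peaks, bin_k):
        # The component counts as present if any peak lies within bin_tolerance bins.
        return any(abs(bin_k - p) <= bin_tolerance for p in frame_peaks)

    count = 0
    for k in sorted(peak_sets[centre_index]):
        run = 1
        i = centre_index - 1
        while i >= 0 and present(peak_sets[i], k):            # extend backwards
            run += 1
            i -= 1
        j = centre_index + 1
        while j < len(peak_sets) and present(peak_sets[j], k):  # extend forwards
            run += 1
            j += 1
        if run > sixth_threshold:
            count += 1
    return count
```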
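The two energy distribution parameters named above, the high frequency energy distribution ratio and the sound pressure level, could be computed per frame as the fraction of spectral energy above a cut-off frequency and a dB level derived from the frame's RMS amplitude. A minimal sketch under those assumptions; the cut-off and the 0 dB reference are placeholders:

```python
import numpy as np

def energy_distribution_parameters(frame, sample_rate, hf_cutoff_hz=2000.0):
    """Return (high_frequency_energy_ratio, sound_pressure_level_db) for one frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    total_energy = np.sum(power) + 1e-12
    hf_ratio = np.sum(power[freqs >= hf_cutoff_hz]) / total_energy

    # Sound pressure level approximated from the frame's RMS amplitude;
    # full-scale amplitude 1.0 is taken as the arbitrary 0 dB reference.
    rms = np.sqrt(np.mean(frame ** 2))
    spl_db = 20.0 * np.log10(rms + 1e-12)
    return hf_ratio, spl_db
```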
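The number of consecutive frames of the frame to be classified in the high frequency region could then be read as the length of the run of neighbouring frames, including the frame to be classified itself, whose high frequency energy distribution ratio exceeds the ninth threshold and whose sound pressure level exceeds the tenth threshold. A minimal sketch under that reading; the threshold values are supplied by the caller:

```python
def consecutive_high_frequency_frames(params, centre_index, ninth_threshold, tenth_threshold):
    """params is a list of (hf_ratio, spl_db) pairs covering the frames before,
    the frame to be classified, and the frames after it; returns the length of
    the consecutive run around the centre frame in which both conditions hold."""
    def in_high_frequency_region(p):
        hf_ratio, spl_db = p
        return hf_ratio > ninth_threshold and spl_db > tenth_threshold

    if not in_high_frequency_region(params[centre_index]):
        return 0
    run = 1
    i = centre_index - 1
    while i >= 0 and in_high_frequency_region(params[i]):
        run += 1
        i -= 1
    j = centre_index + 1
    while j < len(params) and in_high_frequency_region(params[j]):
        run += 1
        j += 1
    return run
```

An analogous run test with a low-frequency condition would give the count of consecutive frames in the low frequency region.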
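Finally, the decision itself is an OR over threshold tests on the three features. A minimal sketch; the name third_threshold is used only for illustration, since the lines above do not name the threshold applied to the high frequency run, and mapping the passing case to 'music' follows the abstract's music-versus-speech decision rather than an explicit statement in those lines:

```python
def classify_frame(tonal_count, low_freq_run, high_freq_run,
                   first_threshold, second_threshold, third_threshold):
    """Return 'music' if any feature exceeds its threshold, otherwise 'speech'.

    A later correction step (evaluated over further frames) may still
    re-label the frame as speech; that step is not modelled here.
    """
    if (tonal_count > first_threshold or
            low_freq_run > second_threshold or
            high_freq_run > third_threshold):
        return 'music'
    return 'speech'
```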
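If "greater than the seventh threshold in the frequency domain" is read as restricting the count to tonal components lying above a given frequency position, the peak sets could simply be filtered before the continuity count; that reading, and the bin-index form of the threshold, are assumptions of this sketch:

```python
def filter_high_frequency_peaks(peak_set, seventh_threshold_bin):
    """Keep only the tonal-component bins above the frequency-domain threshold."""
    return {k for k in peak_set if k > seventh_threshold_bin}
```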

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The present invention relates to an audio signal classification processing method, apparatus and device. The method comprises the steps of: obtaining at least one of the number of tonal components satisfying a continuity constraint condition in a frame to be classified, the number of consecutive frames of the frame to be classified in the audio signal in a low frequency region, and the number of consecutive frames of the frame to be classified in the audio signal in a high frequency region (101); and determining whether the frame to be classified in the audio signal is a music signal or a speech signal according to the at least one of the number of tonal components satisfying the continuity constraint condition in the frame to be classified in the audio signal, the number of consecutive frames of the frame to be classified in the audio signal in the low frequency region, and the number of consecutive frames of the frame to be classified in the audio signal in the high frequency region (102).
PCT/CN2014/081400 2013-07-02 2014-07-01 Procédé, appareil et dispositif de traitement de classification de signal audio WO2015000401A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310274580.9A CN104282315B (zh) 2013-07-02 2013-07-02 音频信号分类处理方法、装置及设备
CN201310274580.9 2013-07-02

Publications (1)

Publication Number Publication Date
WO2015000401A1 true WO2015000401A1 (fr) 2015-01-08

Family

ID=52143107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/081400 WO2015000401A1 (fr) 2013-07-02 2014-07-01 Procédé, appareil et dispositif de traitement de classification de signal audio

Country Status (2)

Country Link
CN (1) CN104282315B (fr)
WO (1) WO2015000401A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104811864A (zh) * 2015-04-20 2015-07-29 深圳市冠旭电子有限公司 一种自适应调节音效的方法及***

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9454893B1 (en) 2015-05-20 2016-09-27 Google Inc. Systems and methods for coordinating and administering self tests of smart home devices having audible outputs
EP3298598B1 (fr) * 2015-05-20 2020-06-03 Google LLC Systèmes et procédés de test de dispositifs domestiques intelligents

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5778335A (en) * 1996-02-26 1998-07-07 The Regents Of The University Of California Method and apparatus for efficient multiband celp wideband speech and music coding and decoding
WO2006019556A2 (fr) * 2004-07-16 2006-02-23 Mindspeed Technologies, Inc. Systeme et algorithme de detection de musique a faible complexite
US20070271093A1 (en) * 2006-05-22 2007-11-22 National Cheng Kung University Audio signal segmentation algorithm
CN101236742A (zh) * 2008-03-03 2008-08-06 中兴通讯股份有限公司 音乐/非音乐的实时检测方法和装置
CN102237085A (zh) * 2010-04-26 2011-11-09 华为技术有限公司 音频信号的分类方法及装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6785645B2 (en) * 2001-11-29 2004-08-31 Microsoft Corporation Real-time speech and music classifier
KR100964402B1 (ko) * 2006-12-14 2010-06-17 삼성전자주식회사 오디오 신호의 부호화 모드 결정 방법 및 장치와 이를 이용한 오디오 신호의 부호화/복호화 방법 및 장치
CN101577117B (zh) * 2009-03-12 2012-04-11 无锡中星微电子有限公司 伴奏音乐提取方法及装置
CN101847412B (zh) * 2009-03-27 2012-02-15 华为技术有限公司 音频信号的分类方法及装置
CN102446504B (zh) * 2010-10-08 2013-10-09 华为技术有限公司 语音/音乐识别方法及装置
CN102655000B (zh) * 2011-03-04 2014-02-19 华为技术有限公司 一种清浊音分类方法和装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5778335A (en) * 1996-02-26 1998-07-07 The Regents Of The University Of California Method and apparatus for efficient multiband celp wideband speech and music coding and decoding
WO2006019556A2 (fr) * 2004-07-16 2006-02-23 Mindspeed Technologies, Inc. Systeme et algorithme de detection de musique a faible complexite
US20070271093A1 (en) * 2006-05-22 2007-11-22 National Cheng Kung University Audio signal segmentation algorithm
CN101236742A (zh) * 2008-03-03 2008-08-06 中兴通讯股份有限公司 音乐/非音乐的实时检测方法和装置
CN102237085A (zh) * 2010-04-26 2011-11-09 华为技术有限公司 音频信号的分类方法及装置

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104811864A (zh) * 2015-04-20 2015-07-29 深圳市冠旭电子有限公司 一种自适应调节音效的方法及***
CN104811864B (zh) * 2015-04-20 2018-11-13 深圳市冠旭电子股份有限公司 一种自适应调节音效的方法及***

Also Published As

Publication number Publication date
CN104282315A (zh) 2015-01-14
CN104282315B (zh) 2017-11-24

Similar Documents

Publication Publication Date Title
WO2020181824A1 (fr) Procédé, appareil et dispositif de reconnaissance d'empreinte vocale et support de stockage lisible par ordinateur
CN106664486B (zh) 用于风噪声检测的方法和装置
CN108896878B (zh) 一种基于超声波的局部放电检测方法
US9959886B2 (en) Spectral comb voice activity detection
WO2015078121A1 (fr) Procédé et dispositif de détection de qualité de signal audio
WO2014177084A1 (fr) Procédé et dispositif de détection d'activation vocale
WO2019233228A1 (fr) Dispositif électronique, et procédé de commande de dispositif
US20200365173A1 (en) Method for constructing voice detection model and voice endpoint detection system
US20150106087A1 (en) Efficient Discrimination of Voiced and Unvoiced Sounds
US9792898B2 (en) Concurrent segmentation of multiple similar vocalizations
JP2010112995A (ja) 通話音声処理装置、通話音声処理方法およびプログラム
WO2013078677A1 (fr) Procédé et dispositif de réglage adaptatif d'un effet sonore
WO2015000401A1 (fr) Procédé, appareil et dispositif de traitement de classification de signal audio
WO2016078439A1 (fr) Procédé et appareil de traitement vocal
KR101295727B1 (ko) 적응적 잡음추정 장치 및 방법
CN104732984B (zh) 一种快速检测单频提示音的方法及***
Dekens et al. Speech rate determination by vowel detection on the modulated energy envelope
CN109994129A (zh) 语音处理***、方法和设备
CN109377982A (zh) 一种有效语音获取方法
CN108847218A (zh) 一种自适应门限整定语音端点检测方法,设备及可读存储介质
Craciun et al. Correlation coefficient-based voice activity detector algorithm
WO2022068440A1 (fr) Procédé et appareil de suppression de sifflement, dispositif informatique et support de stockage
CN115567845A (zh) 一种信息处理方法及装置
CN110268726A (zh) 新式智能助听器
CN109427345B (zh) 一种风噪检测方法、装置及***

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14819505

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14819505

Country of ref document: EP

Kind code of ref document: A1