CN113611288A - Audio feature extraction method, device and system - Google Patents

Audio feature extraction method, device and system

Info

Publication number
CN113611288A
Authority
CN
China
Prior art keywords
mel
audio information
frequency
audio
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110901109.2A
Other languages
Chinese (zh)
Inventor
岑吴镕
李骊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Huajie Imi Technology Co ltd
Beijing HJIMI Technology Co Ltd
Original Assignee
Nanjing Huajie Imi Technology Co ltd
Beijing HJIMI Technology Co Ltd
Application filed by Nanjing Huajie Imi Technology Co., Ltd. and Beijing HJIMI Technology Co., Ltd.
Priority: CN202110901109.2A
Publication: CN113611288A (legal status: pending)


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/18: the extracted parameters being spectral information of each sub-band
    • G10L25/24: the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application discloses an audio feature extraction method, device and system. Audio information is obtained and preprocessed to enhance voice-signal quality; the preprocessed audio information is converted from the time domain to the frequency domain by a fast Fourier transform; the frequency-domain audio information is then filtered by a Mel filter bank whose frequencies are set based on language information of the audio information, and an audio feature vector of the audio information is determined. In this scheme, because the Mel filter bank frequencies are set from the language information of the audio information itself, the filtering is tied to the language characteristics of the audio, so the determined audio feature vector better matches the language corresponding to the audio information, and the accuracy of audio recognition is improved.

Description

Audio feature extraction method, device and system
Technical Field
The present application relates to the field of speech recognition, and in particular, to a method, an apparatus, and a system for extracting audio features.
Background
In the field of speech recognition, extraction of Mel-scale Frequency Cepstral Coefficients (MFCC) is currently the most common feature extraction method.
However, MFCC feature extraction generally filters with a bank of M equally-divided triangular filters, while different languages place their emphasis on different parts of the utterance frequency response; for some languages, filtering with equally-divided triangular filter banks can therefore degrade speech recognition accuracy.
Disclosure of Invention
In view of the above, the present application provides an audio feature extraction method, apparatus and system, and the specific scheme is as follows:
an audio feature extraction method, comprising:
acquiring audio information;
performing, on the audio information, preprocessing for enhancing voice-signal quality to obtain preprocessed audio information;
converting the preprocessed audio information from a time domain to a frequency domain through fast Fourier transform;
and filtering the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information, and determining an audio feature vector of the audio information.
Further, the processing of filtering the audio information in the frequency domain by a mel filter bank that sets frequencies based on language information of the audio information includes:
determining language information corresponding to the audio information based on the audio information;
determining a specific starting Mel frequency and a specific termination Mel frequency corresponding to each Mel filter in a preset number of Mel filters matched with the language information;
subjecting the frequency-domain audio information to filtering by each Mel filter set based on the specific Mel frequencies.
Further, determining a specific start mel frequency and a specific end mel frequency corresponding to each mel filter includes:
determining a first mode and a second mode based on language information corresponding to the audio information;
the specific starting Mel frequencies of the k-th Mel filter and the Mel filters before it are determined in a first mode, and the specific starting Mel frequencies of the (k+1)-th Mel filter and the Mel filters after it are determined in a second mode;
the specific termination Mel frequencies of the (k-1)-th Mel filter and the Mel filters before it are determined in the first mode, and the specific termination Mel frequencies of the k-th Mel filter and the Mel filters after it are determined in the second mode;
wherein k is a positive integer less than half of the sum of the preset number and 1, k+1 is a positive integer greater than or equal to half of the sum of the preset number and 1, and the specific termination Mel frequency of each Mel filter is the specific starting Mel frequency of the next Mel filter.
Further, determining a specific start mel frequency and a specific end mel frequency corresponding to each mel filter includes:
determining a first mode and a second mode based on language information corresponding to the audio information;
if the preset number is M, the number of frequency points to be determined is M+1;
when i is less than half of the sum of M and 1, the i-th frequency point and the frequency points before it are determined in a first mode;
when i is greater than or equal to half of the sum of M and 1, the i-th frequency point and the frequency points after it are determined in a second mode;
and the M+1 frequency points are determined in sequence as the specific starting Mel frequencies or specific termination Mel frequencies of the preset number of Mel filters.
Further, the determining the audio feature vector of the audio information by filtering the audio information in the frequency domain through a mel filter bank that sets frequencies based on the language information of the audio information includes:
filtering the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information, to obtain feature vectors matching the number of Mel filters in the Mel filter bank;
and performing inverse cosine transform on the feature vectors matched with the number of the Mel filters to generate audio feature vectors of the audio information.
Further, the processing of filtering the audio information in the frequency domain by a mel filter bank that sets frequencies based on language information of the audio information includes:
and converting the audio information from the frequency scale of the frequency domain into a Mel frequency spectrum scale based on a preset relationship, and subjecting the audio information converted into the Mel frequency spectrum scale to filtering processing of a Mel filter bank for setting the frequency based on language information of the audio information.
Further, performing, on the audio information, preprocessing for enhancing voice-signal quality to obtain the preprocessed audio information includes:
performing framing processing on the audio information to obtain each frame of audio data;
and after performing pre-emphasis on each frame of audio data, applying a window function to the pre-emphasized audio data to obtain the preprocessed audio information.
An audio feature extraction system, comprising:
an acquisition unit configured to acquire audio information;
the preprocessing unit is used for performing, on the audio information, preprocessing for enhancing voice-signal quality to obtain preprocessed audio information;
the conversion unit is used for converting the audio information after the preprocessing into a frequency domain from a time domain through fast Fourier transform;
and the filtering unit is used for performing filtering processing on the audio information in the frequency domain through a Mel filter bank with frequency set based on language information of the audio information, and determining an audio characteristic vector of the audio information.
An audio feature extraction apparatus comprising:
a processor for obtaining audio information; performing, on the audio information, preprocessing for enhancing voice-signal quality to obtain preprocessed audio information; converting the preprocessed audio information from the time domain to the frequency domain by a fast Fourier transform; and filtering the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information, and determining an audio feature vector of the audio information;
and the memory is used for storing the program of the processor for executing the processing procedure.
A readable storage medium having stored thereon a computer program for execution by a processor for carrying out the steps of the audio feature extraction method as described above.
According to the technical scheme, the audio feature extraction method, device and system obtain audio information, perform preprocessing for enhancing voice-signal quality on it, convert the preprocessed audio information from the time domain to the frequency domain by a fast Fourier transform, and determine the audio feature vector by filtering the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information. Because the Mel filter bank frequencies are set from the language information of the audio information itself, the filtering is tied to the language characteristics of the audio, so the determined audio feature vector better matches the language corresponding to the audio information, and the accuracy of audio recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an audio feature extraction method disclosed in an embodiment of the present application;
fig. 2 is a flowchart of an audio feature extraction method disclosed in an embodiment of the present application;
FIG. 3 is a schematic diagram of filter frequency division for a prior art scheme and a scheme disclosed in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an audio feature extraction system disclosed in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an audio feature extraction apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The application discloses an audio feature extraction method, a flow chart of which is shown in fig. 1, comprising the following steps:
step S11, acquiring audio information;
step S12, performing, on the audio information, preprocessing for enhancing voice-signal quality to obtain preprocessed audio information;
step S13, converting the audio information after the preprocessing from a time domain to a frequency domain through fast Fourier transform;
step S14, the audio information in the frequency domain is subjected to filtering processing by a mel filter bank that sets frequencies based on language information of the audio information, and an audio feature vector of the audio information is determined.
MFCC (Mel-scale Frequency Cepstral Coefficients) are cepstral parameters extracted on the Mel-scale frequency domain, which describes the nonlinear frequency perception of the human ear. In speech recognition and speaker recognition, MFCC feature extraction is widely used to enable effective recognition of audio information.
At present, MFCC feature extraction generally filters with a bank of M equally-divided triangular filters: for a segment of audio data, the frequency range is divided into equal parts, and each equal band is filtered by one triangular filter.
Because the emphasis of the utterance frequency response differs between languages, equal division can degrade recognition for some languages. To solve this problem, in the present scheme, after the audio information is obtained it is preprocessed to enhance voice-signal quality, a fast Fourier transform then converts it from the time domain to the frequency domain, and the frequency-domain audio information is filtered by a Mel filter bank whose frequencies are set based on language information of the audio information, yielding the audio feature vector of the audio information.
After the audio information is obtained, while it is being preprocessed, its language is analyzed to determine the language information corresponding to the audio information, i.e., in which language the current audio information is uttered, such as English, Chinese, or Japanese.
Because the utterance frequency response of audio differs in emphasis between languages, once the language information is acquired, the starting frequency and termination frequency of each Mel filter in the bank are set based on the characteristics of that language. In this way, after the current audio information passes through the Mel filter bank whose frequencies were set from the language characteristics, those characteristics are highlighted, the audio feature vector is obtained on the basis of the language information, and the audio information can be recognized accurately.
In addition, the preprocessing of the audio information may specifically be: framing the audio information to obtain frames of audio data; performing pre-emphasis on each frame; and then applying a window function to the pre-emphasized audio data to obtain the preprocessed audio information.
Specifically, framing divides the audio information into multiple frames, each lasting about 20-30 ms. A frame may contain N sampling points; to avoid excessive variation between adjacent frames, adjacent frames share an overlapping region of M sampling points, where M is usually about 1/2 to 1/3 of N. For example: with 25 ms frames and a frame shift of 10 ms, adjacent frames overlap by 15 ms.
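As an illustration, a minimal sketch of this framing step follows, assuming a 16 kHz sampling rate and the 25 ms / 10 ms example above (the rate and the function name are assumptions, not from the original text):

```python
import numpy as np

def frame_signal(x, sample_rate=16000, frame_ms=25, hop_ms=10):
    # 25 ms frames with a 10 ms shift -> 15 ms overlap between adjacent frames
    frame_len = int(sample_rate * frame_ms / 1000)   # N sampling points per frame
    hop_len = int(sample_rate * hop_ms / 1000)       # frame shift
    n_frames = 1 + max(0, (len(x) - frame_len) // hop_len)
    return np.stack([x[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])
```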
In addition, noise introduced while processing and transmitting audio lowers the output signal-to-noise ratio relative to the input, and most of these processes are hardest on high frequencies: the higher the frequency, the stronger the noise, while in the spectrum of speech or music the higher-frequency components usually have smaller amplitude. As a result, the signal-to-noise ratio at the output degrades more severely toward the high end. To improve the transmission quality of the high-frequency components, they are processed at the audio input by increasing their amplitude, which raises the high-frequency signal-to-noise ratio at the demodulation output; this is pre-emphasis.
The pre-emphasis formula for audio may be:
Y_{t+1} = X_{t+1} - α · X_t
where X_t denotes the value of the sampling point at time t, Y denotes the value of the sampling point after pre-emphasis, and α is the pre-emphasis coefficient, ranging from 0.95 to 1; the first sampling point of the audio is left unchanged.
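A one-line sketch of this pre-emphasis step (α = 0.97 is a common choice within the stated 0.95-1 range; the function name is illustrative):

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    # Y[t+1] = X[t+1] - alpha * X[t]; the first sampling point is left unchanged
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))
```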
After pre-emphasis, a window function is applied, i.e., a Hamming window is added; this prevents the oscillation phenomenon after the Fourier transform. Each frame is multiplied by the window function to increase the continuity at the frame's left and right ends. The formula may be:
Z_n = Y_n · h_n
where Y denotes a sampling point before windowing, Z denotes the sampling point after windowing, and h_n is the windowing coefficient:
h_n = (1 - α) - α · cos(2πn / (N - 1))
Typically α = 0.46, N is the total number of points to be windowed, and n indexes a sampling point.
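A sketch of the windowing step, using the coefficient formula reconstructed above (the Hamming form with α = 0.46 is assumed from the text, since the original formula is an image placeholder):

```python
import numpy as np

def hamming_window(frame, alpha=0.46):
    # Z[n] = Y[n] * h[n], with h[n] = (1 - alpha) - alpha * cos(2*pi*n / (N - 1))
    N = len(frame)
    n = np.arange(N)
    h = (1 - alpha) - alpha * np.cos(2 * np.pi * n / (N - 1))
    return frame * h
```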
In addition, the filtering processing of subjecting the audio information in the frequency domain to a mel filter bank that sets frequencies based on language information of the audio information includes:
the audio information is converted from the frequency scale of the frequency domain to the mel frequency spectrum scale based on the preset relationship, and the audio information converted to the mel frequency spectrum scale is subjected to the filtering processing of the mel filter bank which sets the frequency based on the language information of the audio information.
The preset relationship is as follows:
Mel(f) = 2595 · log10(1 + f / 700)
through the preset relation, the linear frequency of the audio information can be converted into the Mel frequency, so that the audio feature vector of the MFCC can be conveniently obtained.
After the frequency-domain audio information is filtered by the Mel filter bank whose frequencies are set based on the language information, feature vectors matching the number of Mel filters in the bank are obtained, and performing an inverse cosine transform on these feature vectors generates the audio feature vector of the audio information.
If the Mel filter bank contains M Mel filters in total, the obtained feature vector is M-dimensional; applying the inverse cosine transform to the M-dimensional feature vector yields a 13-dimensional feature vector, which is the MFCC feature vector.
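A sketch of this final stage, taking log filter-bank energies and applying the cosine transform. Here the "inverse cosine transform" of the text is implemented, as is conventional for MFCC, with a type-II DCT; the variable names and the log/epsilon details are assumptions, not from the original:

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_power_spectrum(power_spec, filterbank, n_coeffs=13):
    # power_spec: (n_fft_bins,) power spectrum of one frame after the FFT
    # filterbank: (M, n_fft_bins) triangular Mel filter weights
    energies = filterbank @ power_spec            # M-dimensional feature vector
    log_energies = np.log(energies + 1e-10)       # avoid log(0)
    return dct(log_energies, type=2, norm='ortho')[:n_coeffs]  # 13-dim MFCC
```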
The audio feature extraction method disclosed in this embodiment obtains audio information, performs preprocessing for enhancing voice-signal quality on it, converts the preprocessed audio information from the time domain to the frequency domain by a fast Fourier transform, and determines the audio feature vector by filtering the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information. Because the Mel filter bank frequencies are set from the language information of the audio information itself, the filtering is tied to the language characteristics of the audio, so the determined audio feature vector better matches the language corresponding to the audio information, and the accuracy of audio recognition is improved.
The embodiment discloses an audio feature extraction method, a flowchart of which is shown in fig. 2, and the method comprises the following steps:
step S21, acquiring audio information;
step S22, performing, on the audio information, preprocessing for enhancing voice-signal quality to obtain preprocessed audio information;
step S23, converting the audio information after the preprocessing from a time domain to a frequency domain through fast Fourier transform;
step S24, determining language information corresponding to the audio information based on the audio information;
step S25, determining a specific starting Mel frequency and a specific ending Mel frequency corresponding to each Mel filter in a preset number of Mel filters matched with the language information;
step S26, the audio information in the frequency domain is subjected to a filtering process by each mel filter set based on a specific mel frequency, and an audio feature vector of the audio information is determined.
When each Mel filter in the bank is set based on the language information, the frequencies set for each Mel filter are its specific starting Mel frequency, center frequency, and specific termination Mel frequency; that is, the working band set for each Mel filter is in fact determined by the language information.
In the existing scheme, the Mel frequency range is divided equally among M Mel filters, i.e., triangular filters, according to the following equation:
f_i = i · F / (M + 1)
where i = 0, 1, …, M+1, F is the maximum frequency after conversion to the Mel spectrum, and M is the number of triangular filters.
Based on the above formula, the frequency band assigned to each triangular filter has the same length: the frequency range is divided into M equal sections allocated to the M triangular filters, and for every two adjacent triangular filters, the termination frequency of the former is the center frequency of the latter, while the center frequency of the former is the starting frequency of the latter, and so on.
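For comparison with the language-dependent division introduced next, a sketch of this prior-art equal division, using the formula f_i = i·F/(M+1) as reconstructed above (the function name is illustrative):

```python
import numpy as np

def equal_mel_points(F, M):
    # M+2 equally spaced Mel points; filter j spans points (j-1, j, j+1)
    return np.array([i * F / (M + 1) for i in range(M + 2)])
```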
In the scheme, the audio information is divided into two parts based on different languages, and the two parts of audio information adopt different modes to determine the frequency range, specifically:
A first mode and a second mode are determined based on the language information corresponding to the audio information, with the preset number of Mel filters arranged in sequence in the Mel filter bank. The specific starting Mel frequencies of the k-th Mel filter and the Mel filters before it are determined in the first mode, and those of the (k+1)-th Mel filter and the Mel filters after it are determined in the second mode; the specific termination Mel frequencies of the (k-1)-th Mel filter and the Mel filters before it are determined in the first mode, and those of the k-th Mel filter and the Mel filters after it are determined in the second mode. Here k is a positive integer less than half of the sum of the preset number and 1, and k+1 is a positive integer greater than or equal to half of the sum of the preset number and 1; the specific termination Mel frequency of each Mel filter is the center frequency of the next Mel filter, and the center frequency of each Mel filter is the specific starting Mel frequency of the next Mel filter.
With this scheme, an audio frequency band matched to the language can be set for each Mel filter individually, so that the bands of the Mel filters no longer divide the Mel spectrum equally but correspond to specific frequency bands.
Specifically, if the Mel filter bank contains M Mel filters, i.e., the preset number is M, then k is a positive integer less than half of the sum of M and 1, and k+1 is a positive integer greater than or equal to half of the sum of M and 1, that is: k ≤ (M+1)/2 - 1 and k+1 ≥ (M+1)/2.
Wherein, the first mode is as follows:
f_i = 2F · i² / (M + 1)²
the second mode is as follows:
f_i = F - 2F · (i - M - 1)² / (M + 1)²
where F is the maximum frequency after conversion to the Mel spectrum and i = 1, …, M. When determining the center frequency of a Mel filter, f_i in the formula represents the center frequency of the i-th Mel filter; when determining a specific starting Mel frequency, f_{i-1} represents the specific starting Mel frequency of the i-th Mel filter; and when determining a specific termination Mel frequency, f_{i+1} represents the specific termination Mel frequency of the i-th Mel filter. For example, i in the formula equals k when determining the center frequency of the k-th Mel filter, and i equals k-1 when determining the specific starting Mel frequency of the k-th Mel filter.
Here (M+1)/2 divides the original M+1 points into an upper part and a lower part, and squaring i (or i - M - 1) changes the numerical distribution of f_i so that the extracted signal features respond more sensitively to the low-frequency part, thereby improving the recognition rate.
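As a worked check of this sensitivity claim, using the first-mode formula as reconstructed above (an inference from the text, since the original equation is an image placeholder), the width of the i-th interval in the lower part is

\Delta f_i = f_i - f_{i-1} = \frac{2F}{(M+1)^2}\left(i^2 - (i-1)^2\right) = \frac{2F(2i-1)}{(M+1)^2},

which grows with i: the frequency points are packed most densely, and the filters are narrowest, at the low-frequency end.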
k is a positive integer less than (M+1)/2; that is, the specific starting Mel frequencies of the k-th Mel filter and the Mel filters arranged before it are determined in the first mode, and those of the (k+1)-th Mel filter and the Mel filters arranged after it are determined in the second mode.
When calculating the specific termination frequencies, the specific termination Mel frequencies of the (k-1)-th Mel filter and the Mel filters arranged before it are determined in the first mode, and those of the k-th Mel filter and the Mel filters arranged after it are determined in the second mode.
Since the center frequency of each Mel filter is the specific starting Mel frequency of the next Mel filter, once the center frequency of a Mel filter is determined, the specific starting Mel frequency of the next Mel filter is also determined. Therefore, instead of treating each Mel filter separately, the Mel frequency points can simply be determined in order; once the frequencies of all Mel frequency points are determined, the M Mel filters are arranged over them in sequence, as shown below:
f_i = 2F · i² / (M + 1)²,                 for i < (M + 1)/2
f_i = F - 2F · (i - M - 1)² / (M + 1)²,   for i ≥ (M + 1)/2
Here i indexes the frequency points. When M Mel filters are to be arranged on the Mel spectrum, the bands of adjacent Mel filters overlap; since the center frequency of the former of two adjacent Mel filters is the starting frequency of the latter, only M+2 frequency points need to be determined. After the M+2 frequency points are determined, one Mel filter is arranged over every three adjacent frequency points, which are its specific starting Mel frequency, center frequency, and specific termination Mel frequency;
where f_i denotes the center frequency of the i-th Mel filter, f_{i-1} its specific starting frequency, and f_{i+1} its specific termination frequency; for two adjacent Mel filters, the band from the former filter's center frequency to its specific termination frequency overlaps the band from the latter filter's specific starting frequency to its center frequency.
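A sketch of this point-by-point construction, under the piecewise formula reconstructed above (the formula itself is inferred from the text, since the original is an image placeholder; function and variable names are illustrative):

```python
import numpy as np

def language_mel_points(F, M):
    # M+2 Mel frequency points; quadratic spacing, split at (M+1)/2
    pts = np.empty(M + 2)
    for i in range(M + 2):
        if i < (M + 1) / 2:
            pts[i] = 2.0 * F * i**2 / (M + 1)**2
        else:
            pts[i] = F - 2.0 * F * (i - M - 1)**2 / (M + 1)**2
    return pts

# values taken from the fig. 3 configuration: M = 71 filters, F = 6539
points = language_mel_points(F=6539.0, M=71)
# the j-th Mel filter (j = 1..M) spans (points[j-1], points[j], points[j+1])
```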
As shown in fig. 3, the straight line 31 shows the filter frequency division of the conventional scheme, and the arc 32 shows the frequency division determined by the present scheme, with M = 71 and F = 6539.
The audio feature extraction method disclosed in this embodiment obtains audio information, performs preprocessing for enhancing voice-signal quality on it, converts the preprocessed audio information from the time domain to the frequency domain by a fast Fourier transform, and determines the audio feature vector by filtering the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information. Because the Mel filter bank frequencies are set from the language information of the audio information itself, the filtering is tied to the language characteristics of the audio, so the determined audio feature vector better matches the language corresponding to the audio information, and the accuracy of audio recognition is improved.
The embodiment discloses an audio feature extraction system, a schematic structural diagram of which is shown in fig. 4, and the audio feature extraction system includes:
an acquisition unit 41, a preprocessing unit 42, a conversion unit 43 and a filtering unit 44.
Wherein, the obtaining unit 41 is used for obtaining the audio information;
the preprocessing unit 42 is configured to perform, on the audio information, preprocessing for enhancing voice-signal quality to obtain preprocessed audio information;
the conversion unit 43 is configured to convert the preprocessed audio information from the time domain to the frequency domain through fast fourier transform;
the filtering unit 44 is configured to perform filtering processing on the audio information in the frequency domain through a mel filter bank that sets frequencies based on language information of the audio information, and determine an audio feature vector of the audio information.
The principles of MFCC feature extraction, the preprocessing performed by the preprocessing unit 42 (framing, pre-emphasis, and windowing), the conversion to the Mel spectrum scale, and the generation of the 13-dimensional audio feature vector by the inverse cosine transform are the same as described in the method embodiments above and are not repeated here.
Further, the filtering unit 44 is configured to: determine language information corresponding to the audio information based on the audio information; determine the specific starting Mel frequency and specific termination Mel frequency corresponding to each Mel filter in a preset number of Mel filters matched with the language information; and subject the frequency-domain audio information to filtering by each Mel filter set based on the specific Mel frequencies.
The manner in which the first and second modes determine the specific starting Mel frequency, center frequency, and specific termination Mel frequency of each Mel filter, and the resulting frequency division illustrated in fig. 3, are likewise the same as described in the method embodiments above and are not repeated here.
The audio feature extraction system disclosed in this embodiment obtains audio information, performs preprocessing for enhancing voice-signal quality on it, converts the preprocessed audio information from the time domain to the frequency domain by a fast Fourier transform, and determines the audio feature vector by filtering the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information. Because the Mel filter bank frequencies are set from the language information of the audio information itself, the filtering is tied to the language characteristics of the audio, so the determined audio feature vector better matches the language corresponding to the audio information, and the accuracy of audio recognition is improved.
The embodiment discloses an audio feature extraction device, a schematic structural diagram of which is shown in fig. 5, and the audio feature extraction device includes:
a processor 51 and a memory 52.
The processor 51 is configured to obtain audio information; perform, on the audio information, preprocessing for enhancing voice-signal quality to obtain preprocessed audio information; convert the preprocessed audio information from the time domain to the frequency domain by a fast Fourier transform; and filter the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information, to determine the audio feature vector of the audio information;
the memory 52 is used to store programs for the processor to perform the above-described processing procedures.
Further, the processor subjects the audio information in the frequency domain to a filtering process of a mel filter bank that sets frequencies based on language information of the audio information, including:
the processor determines language information corresponding to the audio information based on the audio information; determines the specific starting Mel frequency and specific termination Mel frequency corresponding to each Mel filter in a preset number of Mel filters matched with the language information; and subjects the frequency-domain audio information to filtering by each Mel filter set based on the specific Mel frequencies.
Further, the processor determines a specific start mel frequency and a specific end mel frequency corresponding to each mel filter, including:
the processor determines a first mode and a second mode based on the language information corresponding to the audio information; the specific starting Mel frequencies of the k-th Mel filter and the Mel filters before it are determined in the first mode, and those of the (k+1)-th Mel filter and the Mel filters after it are determined in the second mode; the specific termination Mel frequencies of the (k-1)-th Mel filter and the Mel filters before it are determined in the first mode, and those of the k-th Mel filter and the Mel filters after it are determined in the second mode; wherein k is a positive integer less than half of the sum of the preset number and 1, k+1 is a positive integer greater than or equal to half of the sum of the preset number and 1, and the specific termination Mel frequency of each Mel filter is the specific starting Mel frequency of the next Mel filter.
Further, the processor determines a specific start mel frequency and a specific end mel frequency corresponding to each mel filter, including:
the processor determines a first mode and a second mode based on the language information corresponding to the audio information; if the preset number is M, the number of frequency points to be determined is M+1; when i is less than half of the sum of M and 1, the i-th frequency point and the frequency points before it are determined in the first mode; when i is greater than or equal to half of the sum of M and 1, the i-th frequency point and the frequency points after it are determined in the second mode; and the M+1 frequency points are determined in sequence as the specific starting Mel frequencies or specific termination Mel frequencies of the preset number of Mel filters.
Further, the processor performs filtering processing on the audio information in the frequency domain through a mel filter bank which sets frequencies based on language information of the audio information, and determines the audio feature vector of the audio information, and the method includes:
the processor filters the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information, to obtain feature vectors matching the number of Mel filters in the Mel filter bank; and performs an inverse cosine transform on these feature vectors to generate the audio feature vector of the audio information.
Further, the processor subjects the audio information in the frequency domain to a filtering process of a mel filter bank that sets frequencies based on language information of the audio information, including:
the processor converts the audio information from the frequency scale of the frequency domain to a mel-frequency spectrum scale based on a preset relationship, and subjects the audio information converted to the mel-frequency spectrum scale to a filtering process of a mel filter bank that sets frequencies based on language information of the audio information.
Further, the processor performs preprocessing for enhancing the performance of the voice signal on the audio information to obtain preprocessed audio information, including:
the processor frames the audio information to obtain frames of audio data; and after performing pre-emphasis on each frame of audio data, applies a window function to the pre-emphasized audio data to obtain the preprocessed audio information.
The audio feature extraction device disclosed in this embodiment is implemented based on the audio feature extraction method disclosed in the above embodiment, and is not described herein again.
The audio feature extraction device disclosed in this embodiment obtains audio information, performs preprocessing for enhancing voice-signal quality on it, converts the preprocessed audio information from the time domain to the frequency domain by a fast Fourier transform, and determines the audio feature vector by filtering the frequency-domain audio information with a Mel filter bank whose frequencies are set based on language information of the audio information. Because the Mel filter bank frequencies are set from the language information of the audio information itself, the filtering is tied to the language characteristics of the audio, so the determined audio feature vector better matches the language corresponding to the audio information, and the accuracy of audio recognition is improved.
An embodiment of the present application further provides a readable storage medium storing a computer program that, when loaded and executed by a processor, implements the steps of the audio feature extraction method; for the specific implementation process, refer to the descriptions of the corresponding parts in the foregoing embodiments, which are not repeated here.
The present application also proposes a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the method provided by the various optional implementations of the audio feature extraction method described above; for the specific implementation process, refer to the description of the corresponding embodiment, which is not repeated here.
The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Since the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is brief; for relevant details, refer to the description of the method.
Those of skill will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. An audio feature extraction method, comprising:
acquiring audio information;
performing, on the audio information, preprocessing for enhancing speech signal performance, to obtain preprocessed audio information;
converting the preprocessed audio information from the time domain to the frequency domain through a fast Fourier transform;
and subjecting the audio information in the frequency domain to filtering by a mel filter bank whose frequencies are set based on language information of the audio information, and determining an audio feature vector of the audio information.
2. The method according to claim 1, wherein subjecting the audio information in the frequency domain to filtering by a mel filter bank whose frequencies are set based on language information of the audio information comprises:
determining, based on the audio information, language information corresponding to the audio information;
determining a specific start mel frequency and a specific end mel frequency for each of a preset number of mel filters matched with the language information;
and subjecting the audio information in the frequency domain to filtering by each mel filter set according to its specific mel frequencies.
3. The method of claim 2, wherein determining the specific start mel frequency and the specific end mel frequency of each mel filter comprises:
determining a first mode and a second mode based on the language information corresponding to the audio information;
determining the specific start mel frequencies of the kth mel filter and the mel filters before it using the first mode, and the specific start mel frequencies of the (k + 1)th mel filter and the mel filters after it using the second mode;
determining the specific end mel frequencies of the (k - 1)th mel filter and the mel filters before it using the first mode, and the specific end mel frequencies of the kth mel filter and the mel filters after it using the second mode;
wherein k is a positive integer less than half of the sum of the preset number and 1, k + 1 is a positive integer greater than or equal to half of the sum of the preset number and 1, and the specific end mel frequency of each mel filter is the specific start mel frequency of the next mel filter.
4. The method of claim 2, wherein determining the specific start mel frequency and the specific end mel frequency of each mel filter comprises:
determining a first mode and a second mode based on the language information corresponding to the audio information;
where, if the preset number is M, the number of frequency points to be determined is M + 1;
when i is less than half of the sum of M and 1, determining the ith frequency point and the frequency points before it using the first mode;
when i is greater than or equal to half of the sum of M and 1, determining the ith frequency point and the frequency points after it using the second mode;
and taking the M + 1 frequency points, in order, as the specific start mel frequencies or specific end mel frequencies of the preset number of mel filters.
5. The method according to claim 1, wherein determining the audio feature vector of the audio information by subjecting the audio information in the frequency domain to filtering by a mel filter bank whose frequencies are set based on language information of the audio information comprises:
filtering the audio information in the frequency domain through the mel filter bank whose frequencies are set based on language information of the audio information, to obtain feature vectors whose number matches the number of mel filters in the mel filter bank;
and performing an inverse cosine transform on the feature vectors to generate the audio feature vector of the audio information.
6. The method according to claim 1, wherein subjecting the audio information in the frequency domain to filtering by a mel filter bank whose frequencies are set based on language information of the audio information comprises:
converting the audio information from the frequency scale of the frequency domain to a mel spectrum scale based on a preset relationship, and subjecting the audio information converted to the mel spectrum scale to filtering by the mel filter bank whose frequencies are set based on language information of the audio information.
7. The method of claim 1, wherein performing, on the audio information, preprocessing for enhancing speech signal performance to obtain the preprocessed audio information comprises:
performing framing processing on the audio information to obtain individual frames of audio data;
and performing pre-emphasis on each frame of audio data, then applying a window function to each pre-emphasized frame, to obtain the preprocessed audio information.
8. An audio feature extraction system, comprising:
an acquisition unit, configured to acquire audio information;
a preprocessing unit, configured to perform, on the audio information, preprocessing for enhancing speech signal performance, to obtain preprocessed audio information;
a conversion unit, configured to convert the preprocessed audio information from the time domain to the frequency domain through a fast Fourier transform;
and a filtering unit, configured to filter the audio information in the frequency domain through a mel filter bank whose frequencies are set based on language information of the audio information, and determine an audio feature vector of the audio information.
9. An audio feature extraction device characterized by comprising:
a processor, configured to: acquire audio information; perform, on the audio information, preprocessing for enhancing speech signal performance, to obtain preprocessed audio information; convert the preprocessed audio information from the time domain to the frequency domain through a fast Fourier transform; and filter the audio information in the frequency domain through a mel filter bank whose frequencies are set based on language information of the audio information, to determine an audio feature vector of the audio information;
and a memory, configured to store a program for the processing executed by the processor.
10. A readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the steps of the audio feature extraction method described above.
CN202110901109.2A 2021-08-06 2021-08-06 Audio feature extraction method, device and system Pending CN113611288A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110901109.2A CN113611288A (en) 2021-08-06 2021-08-06 Audio feature extraction method, device and system


Publications (1)

Publication Number Publication Date
CN113611288A true CN113611288A (en) 2021-11-05

Family

ID=78307416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110901109.2A Pending CN113611288A (en) 2021-08-06 2021-08-06 Audio feature extraction method, device and system

Country Status (1)

Country Link
CN (1) CN113611288A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20090080777A * 2008-01-22 2009-07-27 Sungkyunkwan University Industry-Academic Cooperation Foundation (성균관대학교산학협력단) Method and Apparatus for detecting signal
CN108182949A (en) * 2017-12-11 2018-06-19 华南理工大学 A kind of highway anomalous audio event category method based on depth conversion feature
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN109978034A (en) * 2019-03-18 2019-07-05 华南理工大学 A kind of sound scenery identification method based on data enhancing


Similar Documents

Publication Publication Date Title
KR100930060B1 (en) Recording medium on which a signal detecting method, apparatus and program for executing the method are recorded
CN108305639B (en) Speech emotion recognition method, computer-readable storage medium and terminal
EP1093112B1 (en) A method for generating speech feature signals and an apparatus for carrying through this method
CN111128213A (en) Noise suppression method and system for processing in different frequency bands
CN108682432B (en) Speech emotion recognition device
CN108847253B (en) Vehicle model identification method, device, computer equipment and storage medium
JP3493033B2 (en) Circuit device for voice recognition
CN110942766A (en) Audio event detection method, system, mobile terminal and storage medium
CN112599148A (en) Voice recognition method and device
CN111599372B (en) Stable on-line multi-channel voice dereverberation method and system
US8000959B2 (en) Formants extracting method combining spectral peak picking and roots extraction
CN110379438B (en) Method and system for detecting and extracting fundamental frequency of voice signal
US5812966A (en) Pitch searching time reducing method for code excited linear prediction vocoder using line spectral pair
KR100571427B1 (en) Feature Vector Extraction Unit and Inverse Correlation Filtering Method for Speech Recognition in Noisy Environments
CN113611288A (en) Audio feature extraction method, device and system
JP4571871B2 (en) Speech signal analysis method and apparatus for performing the analysis method, speech recognition apparatus using the speech signal analysis apparatus, program for executing the analysis method, and storage medium thereof
CN112397087B (en) Formant envelope estimation method, formant envelope estimation device, speech processing method, speech processing device, storage medium and terminal
CN113948088A (en) Voice recognition method and device based on waveform simulation
CN114996489A (en) Method, device and equipment for detecting violation of news data and storage medium
JP4537821B2 (en) Audio signal analysis method, audio signal recognition method using the method, audio signal section detection method, apparatus, program and recording medium thereof
CN111341327A (en) Speaker voice recognition method, device and equipment based on particle swarm optimization
CN113643689B (en) Data filtering method and related equipment
JP4362072B2 (en) Speech signal analysis method and apparatus for performing the analysis method, speech recognition apparatus using the speech signal analysis apparatus, program for executing the analysis method, and storage medium thereof
CN110189765B (en) Speech feature estimation method based on spectrum shape
CN116543751A (en) Voice feature extraction method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination