CN113160797A - Audio feature processing method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN113160797A
CN113160797A (application CN202110447185.0A; granted publication CN113160797B)
Authority
CN
China
Prior art keywords
audio
target
feature
audio frame
feature data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110447185.0A
Other languages
Chinese (zh)
Other versions
CN113160797B (en)
Inventor
岑吴镕
李骊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing HJIMI Technology Co Ltd
Original Assignee
Beijing HJIMI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing HJIMI Technology Co Ltd filed Critical Beijing HJIMI Technology Co Ltd
Priority to CN202110447185.0A priority Critical patent/CN113160797B/en
Publication of CN113160797A publication Critical patent/CN113160797A/en
Application granted granted Critical
Publication of CN113160797B publication Critical patent/CN113160797B/en
Legal status: Active

Classifications

    • G PHYSICS › G10 Musical instruments; acoustics › G10L Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window


Abstract

The invention provides an audio feature processing method and apparatus, a storage medium, and an electronic device. The method includes: acquiring an audio feature of a target audio frame of audio to be processed, the audio feature consisting of sub-feature data in multiple dimensions; determining target sub-feature data of the audio feature from the sub-feature data of each dimension; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. Because only the sub-feature data of some feature dimensions needs to be enhanced to produce a new target audio feature, the time spent expanding audio features is greatly reduced, computing resources are saved, and the efficiency of audio feature expansion is improved.

Description

Audio feature processing method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of audio data processing technologies, and in particular, to an audio feature processing method and apparatus, a storage medium, and an electronic device.
Background
With the development of science and technology, speech recognition models are widely applied across industries and play an important role in many scenarios. A speech recognition model requires a large number of audio features for training, and the training effect is poor when the number of audio features is insufficient.
At present, the number of audio features is usually expanded by applying speed perturbation, volume perturbation, and added noise to the original audio. However, this approach increases the amount of audio to be processed, which in turn increases the time needed for feature extraction and consumes substantial computing resources.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an audio feature processing method that can effectively reduce the time needed to expand audio features.
The invention further provides an audio feature processing apparatus to ensure that the method can be implemented and applied in practice.
An audio feature processing method, comprising:
acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature consists of sub-feature data in multiple dimensions;
determining target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature.
Optionally, in the above method, acquiring the audio feature of the target audio frame of the audio to be processed includes:
framing the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed;
determining a target audio frame of the audio to be processed in each audio frame;
and extracting the characteristics of the target audio frame to obtain the audio characteristics of the target audio frame.
Optionally, in the above method, performing feature extraction on the target audio frame to obtain the audio feature of the target audio frame includes:
performing pre-emphasis processing on the target audio frame to obtain a first audio frame;
adding a Hamming window to the first audio frame to obtain a second audio frame;
performing a fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
obtaining a Mel spectrum corresponding to the target audio frame based on the frequency domain data;
obtaining, according to the Mel spectrum, a triangular filter corresponding to each set feature dimension;
and inputting the energy corresponding to the frequency domain data into each triangular filter to obtain the audio feature of the target audio frame.
Optionally, in the above method, enhancing the target sub-feature data to obtain the enhanced sub-feature data includes:
determining an enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
The above method, optionally, further includes:
and training a preset voice recognition model by applying the target audio data.
An audio feature processing apparatus comprising:
the acquiring unit is used for acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature consists of sub-feature data in multiple dimensions;
the determining unit is used for determining target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
the first execution unit is used for enhancing the target sub-feature data to obtain enhanced sub-feature data;
and the second execution unit is used for replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain the target audio feature.
The above apparatus, optionally, the obtaining unit includes:
the framing subunit is used for framing the audio to be processed based on the set number of the sampling points and the moving step length to obtain each audio frame of the audio to be processed;
a first determining subunit, configured to determine, in the respective audio frames, a target audio frame of the audio to be processed;
and the feature extraction subunit is used for performing feature extraction on the target audio frame to obtain the audio features of the target audio frame.
The above apparatus, optionally, the feature extraction subunit includes:
the pre-emphasis processing subunit is configured to perform pre-emphasis processing on the target audio frame to obtain a first audio frame;
the first execution subunit is used for adding a Hamming window to the first audio frame to obtain a second audio frame;
the second execution subunit is configured to perform fast fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
a third execution subunit, configured to obtain a mel spectrum corresponding to the target audio frame based on the frequency domain data;
a fourth execution subunit, configured to obtain, according to the mel spectrum, a triangular filter corresponding to each set feature dimension;
and the fifth execution subunit is configured to input the energy corresponding to the frequency domain data into each of the triangular filters to obtain the audio feature of the target audio frame.
The above apparatus, optionally, the first execution unit includes:
the second determining subunit is used for determining the enhancement multiple corresponding to the target sub-feature data;
and the data enhancement subunit is used for enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
The above apparatus, optionally, further comprises: and the model training unit is used for applying the target audio data to train a preset voice recognition model.
A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium resides to perform an audio feature processing method as described above.
An electronic device comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and are configured to be executed by the one or more processors to perform the audio feature processing method described above.
Compared with the prior art, the invention has the following advantages:
the invention provides an audio characteristic processing method and device, a storage medium and electronic equipment, wherein the method comprises the following steps: acquiring audio features of a target audio frame of audio to be processed, wherein the audio features consist of multi-dimensional sub-feature data; determining target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features; enhancing the target sub-characteristic data to obtain enhancer characteristic data; and replacing target sub-feature data in the audio features with the enhancer feature data to obtain target audio features. By applying the audio feature processing method provided by the invention, the sub-feature data of partial feature dimensions of the audio features can be enhanced, so that new target audio feature data can be obtained, the time spent on expanding the audio features can be greatly reduced, the computing resources are saved, and the expansion efficiency of the audio features is improved.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a method for processing audio features according to the present invention;
FIG. 2 is a flowchart of a process for obtaining audio features of a target audio frame of audio to be processed according to the present invention;
FIG. 3 is a flow chart of a process for obtaining audio features of a target audio frame according to the present invention;
FIG. 4 is a schematic structural diagram of an audio feature processing apparatus according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
An embodiment of the present invention provides an audio feature processing method, which may be applied to an electronic device, where a method flowchart of the method is shown in fig. 1, and specifically includes:
s101: the method comprises the steps of obtaining audio features of a target audio frame of audio to be processed, wherein the audio features are composed of multi-dimensional sub-feature data.
In the method provided by the embodiment of the present invention, the target audio frame may be a current audio frame to be processed.
Wherein the number of the target audio frames may be one or more.
Specifically, the audio feature may be an Fbank feature: a feature vector composed of sub-feature data in multiple dimensions, where the number of dimensions may vary, for example 71 or 72.
Optionally, one feasible way to obtain the audio features of the target audio frame of the audio to be processed is to perform feature extraction on the target audio frame to obtain the audio features of the target audio frame.
S102: and determining target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features.
In the method provided by the embodiment of the present invention, the target sub-feature data may be randomly determined sub-feature data or sub-feature data of a specified dimension.
Wherein the number of target sub-feature data in the audio feature may be one or more.
S103: and enhancing the target sub-characteristic data to obtain the characteristic data of the enhancer.
In the method provided by the embodiment of the invention, the target sub-feature data can be enhanced according to a preset enhancement mode, and the enhancement sub-feature data of the target sub-feature data is obtained.
And under the condition that the number of the target sub-feature data is multiple, enhancing each target sub-feature data to obtain enhanced sub-feature data of each target sub-feature data.
S104: and replacing target sub-feature data in the audio features with the enhancer feature data to obtain target audio features.
In the method provided by the embodiment of the present invention, the target audio feature includes each sub-feature data except the target sub-feature data in the audio feature and the enhanced sub-feature data.
The embodiment of the invention provides an audio feature processing method, which includes: acquiring an audio feature of a target audio frame of audio to be processed, the audio feature consisting of sub-feature data in multiple dimensions; determining target sub-feature data of the audio feature from the sub-feature data of each dimension; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. By applying this method, only the sub-feature data of some feature dimensions of an audio feature needs to be enhanced to obtain a new target audio feature, which greatly reduces the time spent expanding audio features, saves computing resources, and improves the efficiency of audio feature expansion.
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the process of acquiring the audio feature of the target audio frame of the audio to be processed specifically includes, as shown in fig. 2:
s201: and framing the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed.
In the method provided by the embodiment of the present invention, the number of the sampling points may be the number of the sampling points constituting one audio frame, and the length of the moving step may be a preset number of the sampling points, where the preset number may be smaller than the number of the sampling points.
The number of the sampling points may be any number, for example, 500 or 512.
Alternatively, the length of the moving step may be 160 points.
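As an illustration of the framing described above, a minimal numpy sketch follows; the function name and array-based representation are assumptions for illustration, not part of the patent's wording:

```python
import numpy as np

def frame_audio(signal, frame_len=512, hop=160):
    """Split a 1-D signal into overlapping frames of frame_len sampling points,
    advancing hop sampling points (the moving step) between consecutive frames."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```

With the values given above (512-point frames, a 160-point moving step), one second of 16 kHz audio yields 1 + (16000 - 512) // 160 = 97 frames.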
S202: and determining a target audio frame of the audio to be processed in each audio frame.
In the method provided by the embodiment of the invention, a plurality of audio frames which are continuous in sequence in the audio to be processed can be determined as the target audio frame, and the current audio frame to be processed in the audio to be processed can also be used as the target audio frame.
S203: and extracting the characteristics of the target audio frame to obtain the audio characteristics of the target audio frame.
In the method provided by the embodiment of the present invention, the audio features may be any type of audio features, and feature extraction may be performed on the target audio frame in a preset feature extraction manner to obtain audio features of the target audio frame corresponding to the feature extraction manner.
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the process of performing feature extraction on the target audio frame to obtain the audio feature of the target audio frame specifically includes, as shown in fig. 3:
s301: and carrying out pre-emphasis processing on the target audio frame to obtain a first audio frame.
The target audio frame may be pre-emphasized by a set pre-emphasis formula to obtain the first audio frame.
Optionally, the pre-emphasis formula may be: Y(t+1) = X(t+1) - α·X(t)
where X(t) represents the value of the sampling point at time t, Y represents the value of the sampling point after pre-emphasis, and α is the pre-emphasis coefficient, which may range from 0.95 to 1; the first sampling point of the target audio frame may be left unchanged.
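The pre-emphasis step above can be sketched as follows; this is a non-authoritative illustration, and the function name and default α = 0.97 (a value inside the stated 0.95 to 1 range) are assumptions:

```python
import numpy as np

def pre_emphasis(frame, alpha=0.97):
    """Y(t+1) = X(t+1) - alpha * X(t); the first sampling point is unchanged."""
    x = np.asarray(frame, dtype=float)
    y = x.copy()
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```

For example, with α = 0.5 the frame [1, 2, 3] becomes [1, 1.5, 2].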
S302: and adding a Hamming window to the first audio frame to obtain a second audio frame.
Wherein a Hamming window may be added to the first audio frame by a Hamming window processing formula.
Optionally, the Hamming window processing formula may be: Z(n) = Y(n)·h(n)
where Y represents a sampling point before windowing, Z represents the sampling point after windowing, and h represents the windowing coefficient, given by:
h(n) = (1 - β) - β·cos(2πn / (N - 1))
where β may be set to 0.46, N represents the total number of points to be windowed, and n represents a certain sampling point.
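The windowing step above can be sketched as follows (function names are assumptions); note that with β = 0.46 the coefficients coincide with the standard Hamming window:

```python
import numpy as np

def hamming_window(N, beta=0.46):
    """h(n) = (1 - beta) - beta * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    n = np.arange(N)
    return (1.0 - beta) - beta * np.cos(2.0 * np.pi * n / (N - 1))

def add_window(frame, beta=0.46):
    """Z(n) = Y(n) * h(n): element-wise windowing of one audio frame."""
    return np.asarray(frame, dtype=float) * hamming_window(len(frame), beta)
```

The window tapers the frame edges toward h(0) = 1 - 2β = 0.08, which suppresses edge oscillation after the Fourier transform.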
S303: and carrying out fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame.
S304: and obtaining a Mel frequency spectrum corresponding to the target audio frame based on the frequency domain data.
The frequency domain data may be converted by a Mel spectrum conversion formula to obtain the Mel spectrum corresponding to the target audio frame.
Optionally, the Mel spectrum conversion formula may be:
mel(f) = 2595·log10(1 + f/700)
where mel(f) is the Mel spectrum and f is the frequency domain data.
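This conversion is the common HTK-style mel mapping; a minimal sketch with its inverse follows (function names are assumptions):

```python
import numpy as np

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10**(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

The mapping is monotone, so points spaced equally on the mel scale compress the high-frequency range, mimicking the ear's resolution.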
S305: and obtaining a triangular filter corresponding to each set characteristic dimension according to the Mel frequency spectrum.
The Mel spectrum may be equally divided into a preset number of initial triangular filters, and each initial triangular filter is then converted back to the frequency domain to obtain the triangular filter corresponding to each set feature dimension.
S306: and inputting energy corresponding to the frequency domain data into each triangular filter to obtain the audio characteristics of the target audio.
In the method provided by the embodiment of the present invention, the energy corresponding to the frequency domain data can be obtained by adding the square of the real part of the frequency domain data to the square of the imaginary part of the frequency domain data.
The energy corresponding to the frequency domain data can be processed through each triangular filter to obtain sub-feature data of each feature dimension, and the audio features of the target audio are formed by the sub-feature data of each dimension.
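Steps S305 and S306 can be sketched as follows. The bin mapping shown is one common way to realize triangular filters spaced equally on the Mel scale and converted back to the frequency domain; the parameter values (71 filters, 512-point FFT, 16 kHz sampling rate) and function names are assumptions based on the example values in this description:

```python
import numpy as np

def mel_filterbank(n_filters=71, n_fft=512, sr=16000):
    """Triangular filters equally spaced on the mel scale, mapped back to FFT bins."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):          # rising edge of the triangle
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling edge of the triangle
            fbank[i, k] = (r - k) / max(r - c, 1)
    return fbank

def fbank_features(freq_bins, fbank):
    """Energy = Re^2 + Im^2 of the FFT bins, passed through each triangular filter."""
    energy = np.abs(np.asarray(freq_bins)) ** 2
    return fbank @ energy
```

Each row of the filterbank produces one sub-feature value, so the output is a vector with one dimension per set feature dimension.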
In the method provided in the embodiment of the present invention, based on the above implementation process, enhancing the target sub-feature data to obtain the enhanced sub-feature data specifically includes:
determining an enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
In the method provided by the embodiment of the invention, a preset set of enhancement multiples may be determined; any multiple is randomly selected from the set, and the selected multiple is used as the enhancement multiple corresponding to the target sub-feature data.
The enhancement multiple may be multiplied by the target sub-feature data to obtain the enhanced sub-feature data of the target sub-feature data.
The set of enhancement multiples may be set according to actual requirements; for example, it may be [0.95, 1.05] or [0.96, 1.06].
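Taken together, steps S102 to S104 with the enhancement multiple described here can be sketched as follows; the function name, random-number handling, and the default range [0.95, 1.05] (taken from the example set above) are all assumptions:

```python
import numpy as np

def augment_fbank(feature, low=0.95, high=1.05, rng=None):
    """Randomly choose one target dimension of the feature vector, multiply it
    by a random enhancement multiple in [low, high], and substitute the result
    back so all other sub-feature data stays unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(feature, dtype=float)
    d = int(rng.integers(len(out)))            # target sub-feature dimension
    out[d] = out[d] * rng.uniform(low, high)   # enhanced sub-feature data
    return out
```

Because only one dimension is recomputed, a new target audio feature is produced without re-extracting features from audio.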
By applying the method provided by the embodiment of the invention, the target sub-feature data can be effectively enhanced, so that the target audio feature can be formed from the enhanced sub-feature data together with all sub-feature data in the audio feature other than the target sub-feature data.
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the method further includes:
and training a preset voice recognition model by applying the target audio data.
In a practical application of the audio feature processing method provided by the present invention, the audio feature of the target audio frame may be an Fbank feature; the Fbank feature is taken as an example below:
First, the Fbank feature of the audio to be processed is extracted, as follows:
Step a1: the audio to be processed is framed, with 512 sampling points per frame and a moving step of 160 points, to obtain each audio frame.
Step a 2: FBank features are extracted for each audio frame:
(1) Audio pre-emphasis, the formula of which is: Y(t+1) = X(t+1) - α·X(t)
where X(t) represents the value of the sampling point at time t, Y represents the value of the sampling point after pre-emphasis, and α is the pre-emphasis coefficient, ranging from 0.95 to 1; the first sampling point of the audio is left unchanged.
(2) A hamming window is added.
The hamming window is added to prevent the oscillation phenomenon of the edge after Fourier transform.
The specific formula is: Z(n) = Y(n)·h(n)
where Y represents the sampling point before windowing, Z represents the sampling point after windowing, and h represents the windowing coefficient, given by:
h(n) = (1 - β) - β·cos(2πn / (N - 1))
where β may be set to 0.46, N represents the total number of points to be windowed, and n represents a certain sampling point.
(3) And converting the audio frame subjected to pre-emphasis and Hamming window addition from a time domain to a frequency domain through fast Fourier transform to obtain frequency domain data.
(4) The frequency domain data is converted into a Mel spectrum by the formula mel(f) = 2595·log10(1 + f/700); the Mel spectrum is then equally divided into 71 triangular filters, which are converted back to the frequency domain.
(5) The energy corresponding to the frequency domain data is passed through the triangular filters to obtain a 71-dimensional feature vector.
Second, the Fbank feature is enhanced as follows:
For the extracted 71-dimensional Fbank feature of each audio frame, the 71-dimensional vector (71 numbers) is randomly sampled once to extract 1 of the 71 numbers; the extracted number is scaled by a random factor between 0.95 and 1.05, and the transformed number replaces the original number to obtain the target audio feature of the audio frame.
For example, assuming the enhancement multiple is 0.97 and the extracted number is 10, the transformed number is 0.97 × 10 = 9.7, and the transformed number 9.7 then replaces the original number 10.
Corresponding to the method illustrated in fig. 1, an embodiment of the present invention further provides an audio feature processing apparatus, which is used for implementing the method illustrated in fig. 1 specifically, where the audio feature processing apparatus provided in the embodiment of the present invention may be applied to an electronic device, and a schematic structural diagram of the audio feature processing apparatus is illustrated in fig. 4, and specifically includes:
an obtaining unit 401, configured to obtain an audio feature of a target audio frame of an audio to be processed, where the audio feature is composed of sub-feature data of multiple dimensions;
a determining unit 402, configured to determine target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
a first execution unit 403, configured to enhance the target sub-feature data to obtain enhanced sub-feature data;
a second execution unit 404, configured to replace the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain the target audio feature.
The embodiment of the invention provides an audio feature processing apparatus, which can acquire an audio feature of a target audio frame of audio to be processed, the audio feature consisting of sub-feature data in multiple dimensions; determine target sub-feature data of the audio feature from the sub-feature data of each dimension; enhance the target sub-feature data to obtain enhanced sub-feature data; and replace the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. By applying this apparatus, only the sub-feature data of some feature dimensions of an audio feature needs to be enhanced to obtain a new target audio feature, which greatly reduces the time spent expanding audio features, saves computing resources, and improves the efficiency of audio feature expansion.
In an embodiment provided by the present invention, based on the above scheme, optionally, the obtaining unit 401 includes:
the framing subunit is used for framing the audio to be processed based on the set number of the sampling points and the moving step length to obtain each audio frame of the audio to be processed;
a first determining subunit, configured to determine, in the respective audio frames, a target audio frame of the audio to be processed;
and the feature extraction subunit is used for performing feature extraction on the target audio frame to obtain the audio features of the target audio frame.
In an embodiment provided by the present invention, based on the above scheme, optionally, the feature extraction subunit includes:
the pre-emphasis processing subunit is configured to perform pre-emphasis processing on the target audio frame to obtain a first audio frame;
the first execution subunit is used for adding a Hamming window to the first audio frame to obtain a second audio frame;
the second execution subunit is configured to perform fast fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
a third execution subunit, configured to obtain a mel spectrum corresponding to the target audio frame based on the frequency domain data;
a fourth execution subunit, configured to obtain, according to the mel spectrum, a triangular filter corresponding to each set feature dimension;
and the fifth execution subunit is configured to input the energy corresponding to the frequency domain data into each of the triangular filters to obtain the audio feature of the target audio frame.
In an embodiment of the present invention, based on the above scheme, optionally, the first execution unit 403 includes:
the second determining subunit is used for determining the enhancement multiple corresponding to the target sub-feature data;
and the data enhancement subunit is used for enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
In an embodiment provided by the present invention, based on the above scheme, optionally, the apparatus further includes: a model training unit, configured to train a preset speech recognition model by using the target audio data.
The specific principles and execution processes of the units and modules in the audio feature processing apparatus disclosed in the above embodiments of the present invention are the same as those of the audio feature processing method disclosed in the above embodiments; for details, reference may be made to the corresponding parts of the audio feature processing method, which are not described herein again.
An embodiment of the present invention further provides a storage medium that includes stored instructions, wherein, when the instructions are executed, the device on which the storage medium resides is controlled to perform the above audio feature processing method.
An embodiment of the present invention further provides an electronic device, whose structure is shown in fig. 5. The electronic device specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501 and configured to be executed by one or more processors 503 to perform the following operations:
acquiring audio features of a target audio frame of audio to be processed, wherein the audio features consist of multi-dimensional sub-feature data;
determining target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio features with the enhanced sub-feature data to obtain target audio features.
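The four operations above can be sketched end to end as follows. The choice of which dimensions constitute the target sub-feature data, and the enhancement factor, are left open by the patent and are passed in here as illustrative parameters:

```python
import numpy as np

def process_audio_features(features, target_dims, factor):
    """Select target sub-feature data by dimension index, enhance it by
    a multiple, and write it back to produce the target audio features."""
    # Operation 2: determine the target sub-feature data
    target_sub = features[target_dims]
    # Operation 3: enhance it to obtain the enhanced sub-feature data
    enhanced_sub = target_sub * factor
    # Operation 4: replace it in the audio features
    target_features = features.copy()
    target_features[target_dims] = enhanced_sub
    return target_features

x = np.arange(5, dtype=float)            # stand-in for extracted features
print(process_audio_features(x, [1, 2], 3.0))   # [0. 3. 6. 3. 4.]
```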
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. Since the apparatus embodiments are substantially similar to the method embodiments, they are described briefly; for relevant details, reference may be made to the description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above apparatus is described as being divided into various units by function. Of course, when implementing the present invention, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The audio feature processing method provided by the present invention is described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. An audio feature processing method, comprising:
acquiring audio features of a target audio frame of audio to be processed, wherein the audio features consist of multi-dimensional sub-feature data;
determining target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio features with the enhanced sub-feature data to obtain target audio features.
2. The method according to claim 1, wherein the acquiring audio features of a target audio frame of the audio to be processed comprises:
framing the audio to be processed based on a set number of sampling points and a set moving step to obtain the audio frames of the audio to be processed;
determining, among the audio frames, a target audio frame of the audio to be processed;
and performing feature extraction on the target audio frame to obtain the audio features of the target audio frame.
3. The method according to claim 2, wherein the performing feature extraction on the target audio frame to obtain the audio feature of the target audio frame comprises:
performing pre-emphasis processing on the target audio frame to obtain a first audio frame;
applying a Hamming window to the first audio frame to obtain a second audio frame;
performing a fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
obtaining a Mel spectrum corresponding to the target audio frame based on the frequency domain data;
obtaining, according to the Mel spectrum, a triangular filter corresponding to each set feature dimension;
and inputting the energy corresponding to the frequency domain data into each triangular filter to obtain the audio features of the target audio frame.
4. The method according to claim 1, wherein the enhancing the target sub-feature data to obtain enhanced sub-feature data comprises:
determining an enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
5. The method of claim 1, further comprising:
and training a preset speech recognition model by using the target audio data.
6. An audio feature processing apparatus, comprising:
an acquiring unit, configured to acquire audio features of a target audio frame of audio to be processed, wherein the audio features consist of sub-feature data of multiple dimensions;
a determining unit, configured to determine target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features;
a first execution unit, configured to enhance the target sub-feature data to obtain enhanced sub-feature data;
and a second execution unit, configured to replace the target sub-feature data in the audio features with the enhanced sub-feature data to obtain target audio features.
7. The apparatus of claim 6, wherein the obtaining unit comprises:
a framing subunit, configured to frame the audio to be processed based on a set number of sampling points and a set moving step, to obtain the audio frames of the audio to be processed;
a first determining subunit, configured to determine, among the audio frames, a target audio frame of the audio to be processed;
and a feature extraction subunit, configured to perform feature extraction on the target audio frame to obtain the audio features of the target audio frame.
8. The apparatus of claim 7, wherein the feature extraction subunit comprises:
a pre-emphasis processing subunit, configured to perform pre-emphasis processing on the target audio frame to obtain a first audio frame;
a first execution subunit, configured to apply a Hamming window to the first audio frame to obtain a second audio frame;
a second execution subunit, configured to perform a fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
a third execution subunit, configured to obtain a Mel spectrum corresponding to the target audio frame based on the frequency domain data;
a fourth execution subunit, configured to obtain, according to the Mel spectrum, a triangular filter corresponding to each set feature dimension;
and a fifth execution subunit, configured to input the energy corresponding to the frequency domain data into each triangular filter to obtain the audio features of the target audio frame.
9. A storage medium, characterized in that the storage medium comprises stored instructions, wherein, when the instructions are executed, a device on which the storage medium resides is controlled to perform the audio feature processing method according to any one of claims 1 to 5.
10. An electronic device, comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the audio feature processing method according to any one of claims 1 to 5.
CN202110447185.0A 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment Active CN113160797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110447185.0A CN113160797B (en) 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN113160797A true CN113160797A (en) 2021-07-23
CN113160797B CN113160797B (en) 2023-06-02

Family

ID=76870199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110447185.0A Active CN113160797B (en) 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113160797B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120185243A1 (en) * 2009-08-28 2012-07-19 International Business Machines Corp. Speech feature extraction apparatus, speech feature extraction method, and speech feature extraction program
US20130179158A1 (en) * 2012-01-10 2013-07-11 Kabushiki Kaisha Toshiba Speech Feature Extraction Apparatus and Speech Feature Extraction Method
CN104240719A (en) * 2013-06-24 2014-12-24 浙江大华技术股份有限公司 Feature extraction method and classification method for audios and related devices
CN108922541A (en) * 2018-05-25 2018-11-30 南京邮电大学 Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
US20190043477A1 (en) * 2018-06-28 2019-02-07 Intel Corporation Method and system of temporal-domain feature extraction for automatic speech recognition
CN111261189A (en) * 2020-04-02 2020-06-09 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences Vehicle sound signal feature extraction method


Also Published As

Publication number Publication date
CN113160797B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN105976812B (en) A kind of audio recognition method and its equipment
CN109740053B (en) Sensitive word shielding method and device based on NLP technology
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN107527620A (en) Electronic installation, the method for authentication and computer-readable recording medium
CN110517698B (en) Method, device and equipment for determining voiceprint model and storage medium
CN112767922B (en) Speech recognition method for contrast predictive coding self-supervision structure joint training
CN115083423B (en) Data processing method and device for voice authentication
CN109300484B (en) Audio alignment method and device, computer equipment and readable storage medium
Kumar et al. Gender classification using pitch and formants
CN113160797A (en) Audio feature processing method and device, storage medium and electronic equipment
CN111625468A (en) Test case duplicate removal method and device
CN103354091B (en) Based on audio feature extraction methods and the device of frequency domain conversion
CN113239151B (en) Method, system and equipment for enhancing spoken language understanding data based on BART model
CN114783423A (en) Speech segmentation method and device based on speech rate adjustment, computer equipment and medium
Wang et al. Audio fingerprint based on spectral flux for audio retrieval
EP3792917A1 (en) Pitch enhancement device, method, program and recording medium therefor
CN110600015B (en) Voice dense classification method and related device
Yang et al. Constant-q deep coefficients for playback attack detection
CN112820267B (en) Waveform generation method, training method of related model, related equipment and device
CN114400007A (en) Voice processing method and device
Hokking et al. A hybrid of fractal code descriptor and harmonic pattern generator for improving speech recognition of different sampling rates
CN116312583A (en) Tone color conversion method, device, storage medium and computer equipment
CN117351928A (en) Voice data processing method, device, computer equipment and storage medium
Nguyen et al. Removal of Various Noise Types and Voice-Based Gender Classification for Dubbed Videos
CN117037800A (en) Voiceprint recognition model training method, voiceprint recognition device and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant