CN113160797A - Audio feature processing method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN113160797A
CN113160797A (application CN202110447185.0A; granted publication CN113160797B)
Authority
CN
China
Prior art keywords
audio
target
feature
audio frame
feature data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110447185.0A
Other languages
Chinese (zh)
Other versions
CN113160797B (en)
Inventor
岑吴镕
李骊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing HJIMI Technology Co Ltd
Original Assignee
Beijing HJIMI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing HJIMI Technology Co Ltd filed Critical Beijing HJIMI Technology Co Ltd
Priority to CN202110447185.0A priority Critical patent/CN113160797B/en
Publication of CN113160797A publication Critical patent/CN113160797A/en
Application granted granted Critical
Publication of CN113160797B publication Critical patent/CN113160797B/en
Legal status: Active

Classifications

    • G PHYSICS › G10 Musical instruments; acoustics › G10L Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/45: Speech or voice analysis techniques characterised by the type of analysis window


Abstract

The invention provides an audio feature processing method and apparatus, a storage medium, and an electronic device. The method includes: acquiring an audio feature of a target audio frame of audio to be processed, the audio feature consisting of sub-feature data in multiple dimensions; determining target sub-feature data of the audio feature from the sub-feature data of each dimension; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. Because only the sub-feature data of some feature dimensions needs to be enhanced to produce a new target audio feature, the time spent expanding audio features is greatly reduced, computing resources are saved, and the efficiency of audio feature expansion is improved.

Description

Audio feature processing method and device, storage medium and electronic equipment
Technical Field
The present invention relates to the field of audio data processing technologies, and in particular, to an audio feature processing method and apparatus, a storage medium, and an electronic device.
Background
With the development of science and technology, speech recognition models are widely applied across industries and play an important role in many scenarios. A speech recognition model requires a large number of audio features for training, and the training effect is poor when the number of audio features is insufficient.
At present, the number of audio features is usually expanded by applying speed perturbation, volume perturbation, and added noise to the original audio. However, this approach increases the amount of audio to be processed, which in turn increases the time needed for feature extraction and consumes substantial computing resources.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide an audio feature processing method that can effectively reduce the time needed to expand audio features.
The invention further provides an audio feature processing apparatus to ensure that the method can be implemented and applied in practice.
An audio feature processing method, comprising:
acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature consists of sub-feature data in multiple dimensions;
determining target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature.
Optionally, in the above method, acquiring the audio feature of the target audio frame of the audio to be processed includes:
framing the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed;
determining a target audio frame of the audio to be processed in each audio frame;
and extracting the characteristics of the target audio frame to obtain the audio characteristics of the target audio frame.
Optionally, in the above method, performing feature extraction on the target audio frame to obtain the audio feature of the target audio frame includes:
performing pre-emphasis processing on the target audio frame to obtain a first audio frame;
adding a Hamming window to the first audio frame to obtain a second audio frame;
performing a fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
obtaining a Mel spectrum corresponding to the target audio frame based on the frequency domain data;
obtaining, according to the Mel spectrum, a triangular filter corresponding to each set feature dimension;
and inputting the energy corresponding to the frequency domain data into each triangular filter to obtain the audio feature of the target audio frame.
Optionally, in the above method, enhancing the target sub-feature data to obtain the enhanced sub-feature data includes:
determining an enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
The above method, optionally, further includes:
and training a preset voice recognition model by applying the target audio data.
An audio feature processing apparatus comprising:
the acquiring unit is used for acquiring an audio feature of a target audio frame of audio to be processed, wherein the audio feature consists of sub-feature data in multiple dimensions;
the determining unit is used for determining target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
the first execution unit is used for enhancing the target sub-feature data to obtain enhanced sub-feature data;
and the second execution unit is used for replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain the target audio feature.
The above apparatus, optionally, the obtaining unit includes:
the framing subunit is used for framing the audio to be processed based on the set number of the sampling points and the moving step length to obtain each audio frame of the audio to be processed;
a first determining subunit, configured to determine, in the respective audio frames, a target audio frame of the audio to be processed;
and the feature extraction subunit is used for performing feature extraction on the target audio frame to obtain the audio features of the target audio frame.
The above apparatus, optionally, the feature extraction subunit includes:
the pre-emphasis processing subunit is configured to perform pre-emphasis processing on the target audio frame to obtain a first audio frame;
the first execution subunit is used for adding a Hamming window to the first audio frame to obtain a second audio frame;
the second execution subunit is configured to perform fast fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
a third execution subunit, configured to obtain a mel spectrum corresponding to the target audio frame based on the frequency domain data;
a fourth execution subunit, configured to obtain, according to the mel spectrum, a triangular filter corresponding to each set feature dimension;
and the fifth execution subunit is configured to input the energy corresponding to the frequency domain data into each of the triangular filters to obtain the audio feature of the target audio frame.
The above apparatus, optionally, the first execution unit includes:
the second determining subunit is used for determining the enhancement multiple corresponding to the target sub-feature data;
and the data enhancement subunit is used for enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
The above apparatus, optionally, further comprises: and the model training unit is used for applying the target audio data to train a preset voice recognition model.
A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium resides to perform an audio feature processing method as described above.
An electronic device comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and are configured to be executed by the one or more processors to perform the audio feature processing method described above.
Compared with the prior art, the invention has the following advantages:
the invention provides an audio characteristic processing method and device, a storage medium and electronic equipment, wherein the method comprises the following steps: acquiring audio features of a target audio frame of audio to be processed, wherein the audio features consist of multi-dimensional sub-feature data; determining target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features; enhancing the target sub-characteristic data to obtain enhancer characteristic data; and replacing target sub-feature data in the audio features with the enhancer feature data to obtain target audio features. By applying the audio feature processing method provided by the invention, the sub-feature data of partial feature dimensions of the audio features can be enhanced, so that new target audio feature data can be obtained, the time spent on expanding the audio features can be greatly reduced, the computing resources are saved, and the expansion efficiency of the audio features is improved.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a method for processing audio features according to the present invention;
FIG. 2 is a flowchart of a process for obtaining audio features of a target audio frame of audio to be processed according to the present invention;
FIG. 3 is a flow chart of a process for obtaining audio features of a target audio frame according to the present invention;
FIG. 4 is a schematic structural diagram of an audio feature processing apparatus according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
An embodiment of the present invention provides an audio feature processing method, which may be applied to an electronic device, where a method flowchart of the method is shown in fig. 1, and specifically includes:
s101: the method comprises the steps of obtaining audio features of a target audio frame of audio to be processed, wherein the audio features are composed of multi-dimensional sub-feature data.
In the method provided by the embodiment of the present invention, the target audio frame may be a current audio frame to be processed.
Wherein the number of the target audio frames may be one or more.
Specifically, the audio feature may be an Fbank feature: a feature vector composed of sub-feature data in multiple dimensions, where the number of dimensions may vary, for example 71 or 72.
Optionally, one feasible way to obtain the audio features of the target audio frame of the audio to be processed is to perform feature extraction on the target audio frame to obtain the audio features of the target audio frame.
S102: and determining target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features.
In the method provided by the embodiment of the present invention, the target sub-feature data may be randomly determined sub-feature data or sub-feature data of a specified dimension.
Wherein the number of target sub-feature data in the audio feature may be one or more.
S103: and enhancing the target sub-characteristic data to obtain the characteristic data of the enhancer.
In the method provided by the embodiment of the invention, the target sub-feature data can be enhanced according to a preset enhancement mode, and the enhancement sub-feature data of the target sub-feature data is obtained.
And under the condition that the number of the target sub-feature data is multiple, enhancing each target sub-feature data to obtain enhanced sub-feature data of each target sub-feature data.
S104: and replacing target sub-feature data in the audio features with the enhancer feature data to obtain target audio features.
In the method provided by the embodiment of the present invention, the target audio feature includes each sub-feature data except the target sub-feature data in the audio feature and the enhanced sub-feature data.
The embodiment of the invention provides an audio feature processing method, which includes: acquiring an audio feature of a target audio frame of audio to be processed, the audio feature consisting of sub-feature data in multiple dimensions; determining target sub-feature data of the audio feature from the sub-feature data of each dimension; enhancing the target sub-feature data to obtain enhanced sub-feature data; and replacing the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. By applying this method, only the sub-feature data of some feature dimensions of an audio feature needs to be enhanced to obtain a new target audio feature, which greatly reduces the time spent expanding audio features, saves computing resources, and improves the efficiency of audio feature expansion.
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the process of acquiring the audio feature of the target audio frame of the audio to be processed specifically includes, as shown in fig. 2:
s201: and framing the audio to be processed based on the set number of sampling points and the moving step length to obtain each audio frame of the audio to be processed.
In the method provided by the embodiment of the present invention, the number of the sampling points may be the number of the sampling points constituting one audio frame, and the length of the moving step may be a preset number of the sampling points, where the preset number may be smaller than the number of the sampling points.
The number of the sampling points may be any number, for example, 500 or 512.
Alternatively, the length of the moving step may be 160 points.
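As an illustration of the framing described above, a minimal numpy sketch follows; the function name and array-based representation are assumptions for illustration, not part of the patent's wording:

```python
import numpy as np

def frame_audio(signal, frame_len=512, hop=160):
    """Split a 1-D signal into overlapping frames of frame_len sampling points,
    advancing hop sampling points (the moving step) between consecutive frames."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])
```

With the values given above (512-point frames, a 160-point moving step), one second of 16 kHz audio yields 1 + (16000 - 512) // 160 = 97 frames.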
S202: and determining a target audio frame of the audio to be processed in each audio frame.
In the method provided by the embodiment of the invention, a plurality of audio frames which are continuous in sequence in the audio to be processed can be determined as the target audio frame, and the current audio frame to be processed in the audio to be processed can also be used as the target audio frame.
S203: and extracting the characteristics of the target audio frame to obtain the audio characteristics of the target audio frame.
In the method provided by the embodiment of the present invention, the audio features may be any type of audio features, and feature extraction may be performed on the target audio frame in a preset feature extraction manner to obtain audio features of the target audio frame corresponding to the feature extraction manner.
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the process of performing feature extraction on the target audio frame to obtain the audio feature of the target audio frame specifically includes, as shown in fig. 3:
s301: and carrying out pre-emphasis processing on the target audio frame to obtain a first audio frame.
The target audio frame may be pre-emphasized by a set pre-emphasis formula to obtain the first audio frame.
Optionally, the pre-emphasis formula may be: Y(t+1) = X(t+1) - α·X(t)
where X(t) represents the value of the sampling point at time t, Y represents the value of the sampling point after pre-emphasis, and α is the pre-emphasis coefficient, which may range from 0.95 to 1; the first sampling point of the target audio frame may be left unchanged.
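The pre-emphasis step above can be sketched as follows; this is a non-authoritative illustration, and the function name and default α = 0.97 (a value inside the stated 0.95 to 1 range) are assumptions:

```python
import numpy as np

def pre_emphasis(frame, alpha=0.97):
    """Y(t+1) = X(t+1) - alpha * X(t); the first sampling point is unchanged."""
    x = np.asarray(frame, dtype=float)
    y = x.copy()
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```

For example, with α = 0.5 the frame [1, 2, 3] becomes [1, 1.5, 2].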
S302: and adding a Hamming window to the first audio frame to obtain a second audio frame.
Wherein a Hamming window may be added to the first audio frame by a Hamming window processing formula.
Optionally, the Hamming window processing formula may be: Z(n) = Y(n)·h(n)
where Y represents a sampling point before windowing, Z represents the sampling point after windowing, and h represents the windowing coefficient, given by:
h(n) = (1 - β) - β·cos(2πn / (N - 1))
where β may be set to 0.46, N represents the total number of points to be windowed, and n represents a certain sampling point.
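The windowing step above can be sketched as follows (function names are assumptions); note that with β = 0.46 the coefficients coincide with the standard Hamming window:

```python
import numpy as np

def hamming_window(N, beta=0.46):
    """h(n) = (1 - beta) - beta * cos(2*pi*n / (N - 1)) for n = 0..N-1."""
    n = np.arange(N)
    return (1.0 - beta) - beta * np.cos(2.0 * np.pi * n / (N - 1))

def add_window(frame, beta=0.46):
    """Z(n) = Y(n) * h(n): element-wise windowing of one audio frame."""
    return np.asarray(frame, dtype=float) * hamming_window(len(frame), beta)
```

The window tapers the frame edges toward h(0) = 1 - 2β = 0.08, which suppresses edge oscillation after the Fourier transform.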
S303: and carrying out fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame.
S304: and obtaining a Mel frequency spectrum corresponding to the target audio frame based on the frequency domain data.
The frequency domain data may be converted by a Mel spectrum conversion formula to obtain the Mel spectrum corresponding to the target audio frame.
Optionally, the Mel spectrum conversion formula may be:
mel(f) = 2595·log10(1 + f/700)
where mel(f) is the Mel spectrum and f is the frequency domain data.
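This conversion is the common HTK-style mel mapping; a minimal sketch with its inverse follows (function names are assumptions):

```python
import numpy as np

def hz_to_mel(f):
    """mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse mapping: f = 700 * (10**(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)
```

The mapping is monotone, so points spaced equally on the mel scale compress the high-frequency range, mimicking the ear's resolution.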
S305: and obtaining a triangular filter corresponding to each set characteristic dimension according to the Mel frequency spectrum.
The Mel spectrum may be equally divided into a preset number of initial triangular filters, and each initial triangular filter is then converted back to the frequency domain to obtain the triangular filter corresponding to each set feature dimension.
S306: and inputting energy corresponding to the frequency domain data into each triangular filter to obtain the audio characteristics of the target audio.
In the method provided by the embodiment of the present invention, the energy corresponding to the frequency domain data can be obtained by adding the square of the real part of the frequency domain data to the square of the imaginary part of the frequency domain data.
The energy corresponding to the frequency domain data can be processed through each triangular filter to obtain sub-feature data of each feature dimension, and the audio features of the target audio are formed by the sub-feature data of each dimension.
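Steps S305 and S306 can be sketched as follows. The bin mapping shown is one common way to realize triangular filters spaced equally on the Mel scale and converted back to the frequency domain; the parameter values (71 filters, 512-point FFT, 16 kHz sampling rate) and function names are assumptions based on the example values in this description:

```python
import numpy as np

def mel_filterbank(n_filters=71, n_fft=512, sr=16000):
    """Triangular filters equally spaced on the mel scale, mapped back to FFT bins."""
    hz2mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel2hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0.0), hz2mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):          # rising edge of the triangle
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling edge of the triangle
            fbank[i, k] = (r - k) / max(r - c, 1)
    return fbank

def fbank_features(freq_bins, fbank):
    """Energy = Re^2 + Im^2 of the FFT bins, passed through each triangular filter."""
    energy = np.abs(np.asarray(freq_bins)) ** 2
    return fbank @ energy
```

Each row of the filterbank produces one sub-feature value, so the output is a vector with one dimension per set feature dimension.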
In the method provided in the embodiment of the present invention, based on the above implementation process, enhancing the target sub-feature data to obtain the enhanced sub-feature data specifically includes:
determining an enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
In the method provided by the embodiment of the invention, a preset set of enhancement multiples may be determined; any multiple is randomly selected from the set, and the selected multiple is used as the enhancement multiple corresponding to the target sub-feature data.
The enhancement multiple may be multiplied by the target sub-feature data to obtain the enhanced sub-feature data of the target sub-feature data.
The set of enhancement multiples may be set according to actual requirements; for example, it may be [0.95, 1.05] or [0.96, 1.06].
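Taken together, steps S102 to S104 with the enhancement multiple described here can be sketched as follows; the function name, random-number handling, and the default range [0.95, 1.05] (taken from the example set above) are all assumptions:

```python
import numpy as np

def augment_fbank(feature, low=0.95, high=1.05, rng=None):
    """Randomly choose one target dimension of the feature vector, multiply it
    by a random enhancement multiple in [low, high], and substitute the result
    back so all other sub-feature data stays unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.array(feature, dtype=float)
    d = int(rng.integers(len(out)))            # target sub-feature dimension
    out[d] = out[d] * rng.uniform(low, high)   # enhanced sub-feature data
    return out
```

Because only one dimension is recomputed, a new target audio feature is produced without re-extracting features from audio.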
By applying the method provided by the embodiment of the invention, the target sub-feature data can be effectively enhanced, so that the target audio feature can be formed from the enhanced sub-feature data together with all sub-feature data in the audio feature other than the target sub-feature data.
In the method provided in the embodiment of the present invention, based on the implementation process, specifically, the method further includes:
and training a preset voice recognition model by applying the target audio data.
In a practical application of the audio feature processing method provided by the present invention, the audio feature of the target audio frame may be an Fbank feature; the Fbank feature is taken as an example below:
First, the Fbank feature of the audio to be processed is extracted, as follows:
Step a1: the audio to be processed is framed, with 512 sampling points per frame and a moving step of 160 points, to obtain each audio frame.
Step a 2: FBank features are extracted for each audio frame:
(1) Audio pre-emphasis, the formula of which is: Y(t+1) = X(t+1) - α·X(t)
where X(t) represents the value of the sampling point at time t, Y represents the value of the sampling point after pre-emphasis, and α is the pre-emphasis coefficient, ranging from 0.95 to 1; the first sampling point of the audio is left unchanged.
(2) A hamming window is added.
The hamming window is added to prevent the oscillation phenomenon of the edge after Fourier transform.
The specific formula is: Z(n) = Y(n)·h(n)
where Y represents the sampling point before windowing, Z represents the sampling point after windowing, and h represents the windowing coefficient, given by:
h(n) = (1 - β) - β·cos(2πn / (N - 1))
where β may be set to 0.46, N represents the total number of points to be windowed, and n represents a certain sampling point.
(3) And converting the audio frame subjected to pre-emphasis and Hamming window addition from a time domain to a frequency domain through fast Fourier transform to obtain frequency domain data.
(4) The frequency domain data is converted into a Mel spectrum by the formula mel(f) = 2595·log10(1 + f/700); the Mel spectrum is then equally divided into 71 triangular filters, which are converted back to the frequency domain.
(5) The energy corresponding to the frequency domain data is passed through the triangular filters to obtain a 71-dimensional feature vector.
Second, the Fbank feature is enhanced as follows:
For the extracted 71-dimensional Fbank feature of each audio frame, the 71-dimensional vector (71 numbers) is randomly sampled once to extract 1 of the 71 numbers; the extracted number is scaled by a random factor between 0.95 and 1.05, and the transformed number replaces the original number to obtain the target audio feature of the audio frame.
For example, assuming the enhancement multiple is 0.97 and the extracted number is 10, the transformed number is 0.97 × 10 = 9.7, and the transformed number 9.7 then replaces the original number 10.
Corresponding to the method illustrated in fig. 1, an embodiment of the present invention further provides an audio feature processing apparatus, which is used for implementing the method illustrated in fig. 1 specifically, where the audio feature processing apparatus provided in the embodiment of the present invention may be applied to an electronic device, and a schematic structural diagram of the audio feature processing apparatus is illustrated in fig. 4, and specifically includes:
an obtaining unit 401, configured to obtain an audio feature of a target audio frame of an audio to be processed, where the audio feature is composed of sub-feature data of multiple dimensions;
a determining unit 402, configured to determine target sub-feature data of the audio feature from the sub-feature data of each dimension of the audio feature;
a first execution unit 403, configured to enhance the target sub-feature data to obtain enhanced sub-feature data;
a second execution unit 404, configured to replace the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain the target audio feature.
The embodiment of the invention provides an audio feature processing apparatus, which can acquire an audio feature of a target audio frame of audio to be processed, the audio feature consisting of sub-feature data in multiple dimensions; determine target sub-feature data of the audio feature from the sub-feature data of each dimension; enhance the target sub-feature data to obtain enhanced sub-feature data; and replace the target sub-feature data in the audio feature with the enhanced sub-feature data to obtain a target audio feature. By applying this apparatus, only the sub-feature data of some feature dimensions of an audio feature needs to be enhanced to obtain a new target audio feature, which greatly reduces the time spent expanding audio features, saves computing resources, and improves the efficiency of audio feature expansion.
In an embodiment provided by the present invention, based on the above scheme, optionally, the obtaining unit 401 includes:
the framing subunit is used for framing the audio to be processed based on the set number of the sampling points and the moving step length to obtain each audio frame of the audio to be processed;
a first determining subunit, configured to determine, in the respective audio frames, a target audio frame of the audio to be processed;
and the feature extraction subunit is used for performing feature extraction on the target audio frame to obtain the audio features of the target audio frame.
In an embodiment provided by the present invention, based on the above scheme, optionally, the feature extraction subunit includes:
the pre-emphasis processing subunit is configured to perform pre-emphasis processing on the target audio frame to obtain a first audio frame;
the first execution subunit is used for adding a Hamming window to the first audio frame to obtain a second audio frame;
the second execution subunit is configured to perform fast fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
a third execution subunit, configured to obtain a mel spectrum corresponding to the target audio frame based on the frequency domain data;
a fourth execution subunit, configured to obtain, according to the mel spectrum, a triangular filter corresponding to each set feature dimension;
and the fifth execution subunit is configured to input the energy corresponding to the frequency domain data into each of the triangular filters to obtain the audio feature of the target audio frame.
In an embodiment of the present invention, based on the above scheme, optionally, the first execution unit 403 includes:
the second determining subunit is used for determining the enhancement multiple corresponding to the target sub-feature data;
and the data enhancement subunit is used for enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
In an embodiment provided by the present invention, based on the above scheme, optionally, the apparatus further includes: a model training unit, configured to train a preset speech recognition model by using the target audio data.
The specific principles and execution processes of the units and modules in the audio feature processing apparatus disclosed in the above embodiments of the present invention are the same as those of the audio feature processing method disclosed in the above embodiments; for details, reference may be made to the corresponding parts of the audio feature processing method, which are not described herein again.
An embodiment of the present invention further provides a storage medium that includes stored instructions, wherein, when the instructions are executed, the device on which the storage medium resides is controlled to perform the above audio feature processing method.
An embodiment of the present invention further provides an electronic device, whose structure is shown in fig. 5. The electronic device specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501 and configured to be executed by one or more processors 503 to perform the following operations:
acquiring audio features of a target audio frame of audio to be processed, wherein the audio features consist of multi-dimensional sub-feature data;
determining target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio features with the enhanced sub-feature data to obtain target audio features.
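The four operations above can be sketched end to end as follows. The choice of which dimensions constitute the target sub-feature data, and the enhancement factor, are left open by the patent and are passed in here as illustrative parameters:

```python
import numpy as np

def process_audio_features(features, target_dims, factor):
    """Select target sub-feature data by dimension index, enhance it by
    a multiple, and write it back to produce the target audio features."""
    # Operation 2: determine the target sub-feature data
    target_sub = features[target_dims]
    # Operation 3: enhance it to obtain the enhanced sub-feature data
    enhanced_sub = target_sub * factor
    # Operation 4: replace it in the audio features
    target_features = features.copy()
    target_features[target_dims] = enhanced_sub
    return target_features

x = np.arange(5, dtype=float)            # stand-in for extracted features
print(process_audio_features(x, [1, 2], 3.0))   # [0. 3. 6. 3. 4.]
```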
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another. Since the apparatus embodiments are substantially similar to the method embodiments, they are described briefly; for relevant details, reference may be made to the description of the method embodiments.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above apparatus is described as being divided into various units by function. Of course, when implementing the present invention, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the above description of the embodiments, it will be clear to those skilled in the art that the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present invention.
The audio feature processing method provided by the present invention is described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, for those skilled in the art, there may be variations in the specific implementation and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. An audio feature processing method, comprising:
acquiring audio features of a target audio frame of audio to be processed, wherein the audio features consist of multi-dimensional sub-feature data;
determining target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features;
enhancing the target sub-feature data to obtain enhanced sub-feature data;
and replacing the target sub-feature data in the audio features with the enhanced sub-feature data to obtain target audio features.
2. The method according to claim 1, wherein the acquiring audio features of a target audio frame of the audio to be processed comprises:
framing the audio to be processed based on a set number of sampling points and a set moving step to obtain the audio frames of the audio to be processed;
determining, among the audio frames, a target audio frame of the audio to be processed;
and performing feature extraction on the target audio frame to obtain the audio features of the target audio frame.
3. The method according to claim 2, wherein the performing feature extraction on the target audio frame to obtain the audio feature of the target audio frame comprises:
performing pre-emphasis processing on the target audio frame to obtain a first audio frame;
applying a Hamming window to the first audio frame to obtain a second audio frame;
performing a fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
obtaining a Mel spectrum corresponding to the target audio frame based on the frequency domain data;
obtaining, according to the Mel spectrum, a triangular filter corresponding to each set feature dimension;
and inputting the energy corresponding to the frequency domain data into each triangular filter to obtain the audio features of the target audio frame.
4. The method according to claim 1, wherein the enhancing the target sub-feature data to obtain enhanced sub-feature data comprises:
determining an enhancement multiple corresponding to the target sub-feature data;
and enhancing the target sub-feature data based on the enhancement multiple to obtain the enhanced sub-feature data.
5. The method of claim 1, further comprising:
and training a preset speech recognition model by using the target audio data.
6. An audio feature processing apparatus, comprising:
an acquiring unit, configured to acquire audio features of a target audio frame of audio to be processed, wherein the audio features consist of sub-feature data of multiple dimensions;
a determining unit, configured to determine target sub-feature data of the audio features from the sub-feature data of each dimension of the audio features;
a first execution unit, configured to enhance the target sub-feature data to obtain enhanced sub-feature data;
and a second execution unit, configured to replace the target sub-feature data in the audio features with the enhanced sub-feature data to obtain target audio features.
7. The apparatus of claim 6, wherein the obtaining unit comprises:
a framing subunit, configured to frame the audio to be processed based on a set number of sampling points and a set moving step, to obtain the audio frames of the audio to be processed;
a first determining subunit, configured to determine, among the audio frames, a target audio frame of the audio to be processed;
and a feature extraction subunit, configured to perform feature extraction on the target audio frame to obtain the audio features of the target audio frame.
8. The apparatus of claim 7, wherein the feature extraction subunit comprises:
a pre-emphasis processing subunit, configured to perform pre-emphasis processing on the target audio frame to obtain a first audio frame;
a first execution subunit, configured to apply a Hamming window to the first audio frame to obtain a second audio frame;
a second execution subunit, configured to perform a fast Fourier transform on the second audio frame to obtain frequency domain data of the second audio frame;
a third execution subunit, configured to obtain a Mel spectrum corresponding to the target audio frame based on the frequency domain data;
a fourth execution subunit, configured to obtain, according to the Mel spectrum, a triangular filter corresponding to each set feature dimension;
and a fifth execution subunit, configured to input the energy corresponding to the frequency domain data into each triangular filter to obtain the audio features of the target audio frame.
9. A storage medium, characterized in that the storage medium comprises stored instructions, wherein, when the instructions are executed, a device on which the storage medium resides is controlled to perform the audio feature processing method according to any one of claims 1 to 5.
10. An electronic device, comprising a memory, one or more processors, and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by the one or more processors to perform the audio feature processing method according to any one of claims 1 to 5.
CN202110447185.0A 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment Active CN113160797B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110447185.0A CN113160797B (en) 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN113160797A true CN113160797A (en) 2021-07-23
CN113160797B CN113160797B (en) 2023-06-02

Family

ID=76870199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110447185.0A Active CN113160797B (en) 2021-04-25 2021-04-25 Audio feature processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113160797B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120185243A1 (en) * 2009-08-28 2012-07-19 International Business Machines Corp. Speech feature extraction apparatus, speech feature extraction method, and speech feature extraction program
US20130179158A1 (en) * 2012-01-10 2013-07-11 Kabushiki Kaisha Toshiba Speech Feature Extraction Apparatus and Speech Feature Extraction Method
CN104240719A (en) * 2013-06-24 2014-12-24 浙江大华技术股份有限公司 Feature extraction method and classification method for audios and related devices
CN108922541A (en) * 2018-05-25 2018-11-30 南京邮电大学 Multidimensional characteristic parameter method for recognizing sound-groove based on DTW and GMM model
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
US20190043477A1 (en) * 2018-06-28 2019-02-07 Intel Corporation Method and system of temporal-domain feature extraction for automatic speech recognition
CN111261189A (en) * 2020-04-02 2020-06-09 Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences Vehicle sound signal feature extraction method


Also Published As

Publication number Publication date
CN113160797B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN105976812B (en) A kind of audio recognition method and its equipment
CN109740053B (en) Sensitive word shielding method and device based on NLP technology
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN107527620A (en) Electronic installation, the method for authentication and computer-readable recording medium
CN110517698B (en) Method, device and equipment for determining voiceprint model and storage medium
CN112767922B (en) Speech recognition method for contrast predictive coding self-supervision structure joint training
CN115083423B (en) Data processing method and device for voice authentication
CN109300484B (en) Audio alignment method and device, computer equipment and readable storage medium
Kumar et al. Gender classification using pitch and formants
CN113160797A (en) Audio feature processing method and device, storage medium and electronic equipment
CN111625468A (en) Test case duplicate removal method and device
CN103354091B (en) Based on audio feature extraction methods and the device of frequency domain conversion
CN113239151B (en) Method, system and equipment for enhancing spoken language understanding data based on BART model
CN114783423A (en) Speech segmentation method and device based on speech rate adjustment, computer equipment and medium
Wang et al. Audio fingerprint based on spectral flux for audio retrieval
EP3792917A1 (en) Pitch enhancement device, method, program and recording medium therefor
CN110600015B (en) Voice dense classification method and related device
Yang et al. Constant-q deep coefficients for playback attack detection
CN112820267B (en) Waveform generation method, training method of related model, related equipment and device
CN114400007A (en) Voice processing method and device
Hokking et al. A hybrid of fractal code descriptor and harmonic pattern generator for improving speech recognition of different sampling rates
CN116312583A (en) Tone color conversion method, device, storage medium and computer equipment
CN117351928A (en) Voice data processing method, device, computer equipment and storage medium
Nguyen et al. Removal of Various Noise Types and Voice-Based Gender Classification for Dubbed Videos
CN117037800A (en) Voiceprint recognition model training method, voiceprint recognition device and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant