CN112908307A - Audio feature extraction method, system, device and medium - Google Patents

Audio feature extraction method, system, device and medium

Info

Publication number
CN112908307A
Authority
CN
China
Prior art keywords
audio
features
feature extraction
file
content
Prior art date
Legal status
Pending
Application number
CN202110134475.XA
Other languages
Chinese (zh)
Inventor
邱实
Current Assignee
Yuncong Technology Group Co Ltd
Original Assignee
Yuncong Technology Group Co Ltd
Priority date: 2021-01-29
Filing date: 2021-01-29
Publication date: 2021-06-04
Application filed by Yuncong Technology Group Co Ltd
Priority to CN202110134475.XA
Publication of CN112908307A


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an audio feature extraction method, system, device and medium. A corresponding reading command is executed according to the audio data source to obtain audio content; feature extraction is then performed on the audio content one or more times, and the audio features extracted each time are saved into a preset file according to a preset framework. To address the problems of existing tools, the invention designs a multifunctional audio feature extraction scheme comprising three basic functions: audio data parsing, audio feature extraction, and storage of feature results. The scheme supports multiple result formats and optimizes processing efficiency. The invention enables one-click processing of audio data sets, and the extracted features can be used with a variety of speech algorithm frameworks, solving the difficulty of running comparison experiments across platforms whose feature extraction algorithms differ. The invention offers high extraction efficiency and a small resource footprint, reducing the cost of processing massive amounts of audio data.

Description

Audio feature extraction method, system, device and medium
Technical Field
The present invention relates to the field of audio extraction technologies, and in particular, to a method, a system, a device, and a medium for extracting audio features.
Background
In speech recognition and other speech-related scenarios, audio feature extraction is a key step. Feature extraction converts time-domain audio signals into various frequency-domain features, such as FFT (Fast Fourier Transform) spectra, Fbank (Mel filter bank) features, and MFCC (Mel-Frequency Cepstral Coefficients) features. Many speech algorithm tools and scientific computing libraries include feature extraction functions. However, existing audio feature extraction tools often suffer from the following problems: (1) different speech algorithm tools often use different data formats, so feature extraction results in one format are difficult to reuse, and a feature extraction tool built for one framework is hard to use under another algorithm framework; (2) the feature extraction tools bundled with open-source speech algorithm frameworks (such as kaldi) are often unsatisfactory in performance and resource consumption; (3) although languages such as C++ and Python have open-source tools for audio feature extraction, these tools only provide interfaces for basic functions and struggle to meet rich and varied data requirements.
Disclosure of Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide an audio feature extraction method, system, device and medium that solve the technical problems of the prior art.
To achieve the above and other related objects, the present invention provides an audio feature extraction method, comprising:
executing a corresponding reading command according to an audio data source to acquire audio content;
and performing feature extraction on the audio content one or more times, and saving the audio features extracted each time into a preset file according to a preset framework.
Optionally, the extracted audio features comprise at least one of: fast Fourier transform features, Mel filter coefficient features, Mel cepstrum coefficient features, pitch features, identity vector features.
Optionally, performing one or more feature extractions on the audio content, including:
performing discrete Fourier transform on the audio content, and extracting fast Fourier transform characteristics;
applying a Mel filter bank to the fast Fourier transform features, and extracting Mel filter coefficient features;
and carrying out discrete cosine transform on the Mel filter coefficient characteristics, and extracting Mel cepstrum coefficient characteristics.
Optionally, saving the audio features extracted each time into a preset file according to a preset framework includes: saving the audio features extracted each time to a csv file, a numpy npy file, and/or a kaldi binary ark file.
Optionally, executing a corresponding read command according to an audio data source to obtain the audio content, including:
if the audio data source is a wav file, directly reading the wav file to acquire corresponding audio content;
and if the audio data source is a shell command, executing the shell command in a pipeline to acquire the corresponding audio content.
Optionally, the method further includes: constructing a voice data information table according to a voice data set, and determining the audio data source according to the voice data information table;
wherein the content in the voice data set comprises at least one of: the audio number, the storage position of the audio file, the length of the audio and the text content label corresponding to the audio.
The invention also provides an audio feature extraction system, which comprises:
the acquisition module is used for executing a corresponding reading command according to an audio data source and acquiring audio content;
the audio feature extraction module is used for carrying out one or more times of feature extraction on the audio content;
and the storage module is used for saving the audio features extracted each time into a preset file according to a preset framework.
Optionally, the extracted audio features comprise at least one of: fast Fourier transform features, Mel filter coefficient features, Mel cepstrum coefficient features, pitch features, identity vector features.
Optionally, performing one or more feature extractions on the audio content, including:
performing discrete Fourier transform on the audio content, and extracting fast Fourier transform characteristics;
applying a Mel filter bank to the fast Fourier transform features, and extracting Mel filter coefficient features;
and carrying out discrete cosine transform on the Mel filter coefficient characteristics, and extracting Mel cepstrum coefficient characteristics.
Optionally, saving the audio features extracted each time into a preset file according to a preset framework includes: saving the audio features extracted each time to a csv file, a numpy npy file, and/or a kaldi binary ark file.
The present invention also provides a computer apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as in any one of the above.
The invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method as described in any one of the above.
As described above, the present invention provides an audio feature extraction method, system, device, and medium with the following advantages. A corresponding reading command is executed according to the audio data source to obtain audio content; feature extraction is then performed on the audio content one or more times, and the audio features extracted each time are saved into a preset file according to a preset framework. To address the problems of existing tools, the invention designs a multifunctional audio feature extraction scheme comprising three basic functions: audio data parsing, audio feature extraction, and storage of feature results. The scheme supports multiple result formats and optimizes processing efficiency. The invention enables one-click processing of audio data sets, and the extracted features can be used with a variety of speech algorithm frameworks, solving the difficulty of running comparison experiments across platforms whose feature extraction algorithms differ. The invention is rich in functionality and can extract FFT features (fast Fourier transform features), Fbank features (Mel filter coefficient features), and MFCC features (Mel cepstrum coefficient features) of different lengths; it has a wide range of applications, and the extracted results can be used for model training and prediction in a variety of speech recognition frameworks. At the same time, the invention offers high extraction efficiency and a small resource footprint, reducing the cost of processing massive amounts of audio data.
Drawings
Fig. 1 is a schematic flowchart of an audio feature extraction method according to an embodiment;
fig. 2 is a schematic flowchart of an audio feature extraction method according to another embodiment;
fig. 3 is a schematic hardware structure diagram of an audio feature extraction system according to an embodiment;
fig. 4 is a schematic hardware structure diagram of a terminal device according to an embodiment;
fig. 5 is a schematic diagram of a hardware structure of a terminal device according to another embodiment.
Description of the element reference numerals
M10 acquisition module
M20 audio feature extraction module
M30 storage module
1100 input device
1101 first processor
1102 output device
1103 first memory
1104 communication bus
1200 processing assembly
1201 second processor
1202 second memory
1203 communication assembly
1204 Power supply Assembly
1205 multimedia assembly
1206 Audio component
1207 input/output interface
1208 sensor assembly
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments only illustrate the basic idea of the invention in a schematic way. The drawings show only the components related to the invention rather than the number, shape, and size of the components in an actual implementation; in practice, the type, quantity, and proportion of the components may change freely, and the component layout may be more complicated.
Referring to fig. 1, the present invention provides an audio feature extraction method, which includes the following steps:
s100, executing a corresponding reading command according to an audio data source to acquire audio content;
s200, performing one or more times of feature extraction on the audio content, and storing the audio features extracted each time into a preset file according to a preset frame.
To address the existing problems, the method designs a multifunctional audio feature extraction scheme comprising three basic functions: audio data parsing, audio feature extraction, and storage of feature results. The scheme supports multiple result formats and optimizes processing efficiency. The method enables one-click processing of audio data sets, and the extracted features can be used with a variety of speech algorithm frameworks, solving the difficulty of running comparison experiments across platforms whose feature extraction algorithms differ. The method is rich in functionality and can extract FFT features (fast Fourier transform features), Fbank features (Mel filter coefficient features), and MFCC features (Mel cepstrum coefficient features) of different lengths; it has a wide range of applications, and the extracted results can be used for model training and prediction in a variety of speech recognition frameworks. At the same time, the method offers high extraction efficiency and a small resource footprint, reducing the cost of processing massive amounts of audio data.
In some exemplary embodiments, the extracted audio features include at least one of: fast Fourier transform features (FFT features), Mel filter coefficient features (Fbank features), Mel cepstrum coefficient features (MFCC features), pitch features, and identity vector features (including ivector features and xvector features). As an example, performing feature extraction on the audio content one or more times may include: performing a discrete Fourier transform on the audio content and extracting fast Fourier transform features; applying a Mel filter bank to the fast Fourier transform features and extracting Mel filter coefficient features; and performing a discrete cosine transform on the Mel filter coefficient features and extracting Mel cepstrum coefficient features. In this way, FFT features, Fbank features, and MFCC features of different lengths can be extracted, and the extraction results can be used for model training and prediction in a variety of speech recognition frameworks. As an example, the pitch feature can be equivalently replaced by a higher-dimensional Fbank feature when training a speech recognition model; the ivector and xvector features contain speaker information and require supervised training on a specific data set. Using only Fbank features, the effect of training a speech recognition model on kaldi (a speech recognition algorithm framework) essentially reaches the level of joint training on the three features Fbank + pitch + ivector.
In some exemplary embodiments, saving the audio features extracted each time into a preset file according to a preset framework includes: saving the audio features extracted each time to a csv file, a numpy npy file (numpy is a numerical computing extension library for python; npy is a type of binary matrix file in numpy), and/or a kaldi binary ark file. If the dimension of the audio features is large, the extracted feature data is also huge, so the audio features are usually stored as binary files to save storage space and speed up reading. The npy matrix from numpy is a widely used matrix format in python; by way of example, embodiments of the present application may choose npy as one of the result formats and can thereby support most python-based speech algorithm frameworks, and the results may also be saved as ark binary files from kaldi. When a csv file is used to organize the audio data, the various pieces of information about the audio data can easily be arranged into csv form with python and similar tools, so the method applies to most speech data processing scenarios. Meanwhile, by supporting the kaldi file format, the method handles kaldi-format data and can satisfy most requirements in the speech recognition field.

In addition, the storage location of an audio file may also take the form of a shell command, i.e., an enhancement operation to be performed on the original audio. Many speech algorithms apply enhancement operations to audio data during the data preparation stage, such as increasing or decreasing the volume, adjusting the audio speed, and adding reverberation or noise. If the enhanced audio were saved to files, the disk space occupied by the data would be multiplied, and reading that data back from files would slow down subsequent tasks. The method therefore supports the kaldi style of data enhancement: an enhancement recorded as a shell command is not executed immediately; instead, the command is executed at feature extraction time and its result is passed along a pipeline, so the intermediate data is never written to disk. By optimizing how pipelines are used, the invention reduces memory usage and increases the number of tasks that can run in parallel on a multi-core device.

Because there is no uniform standard for organizing audio features, the binary files used by different speech tools are rarely compatible with one another. The method therefore supports two common binary file formats: the npy matrix from numpy and the binary format of kaldi. Since the current mainstream machine learning and deep learning frameworks are all based on python, npy support means the method is compatible with these mainstream frameworks, while kaldi binary support provides the conditions for comparison experiments between kaldi and other frameworks. The inventor also compared the resource usage and efficiency of the method against kaldi's bundled feature extraction tool: on the same speech data set, the method can run about 3 times as many parallel tasks as kaldi, i.e., a single task occupies only about one third of the memory; and in the single-task case, the time the method needs to extract all three features (FFT, Fbank, and MFCC) is comparable to the time kaldi needs to extract the Fbank feature alone, i.e., at the same memory usage the extraction speed of the method is about three times that of kaldi.
In an exemplary embodiment, the method further includes constructing a voice data information table from the voice data set and determining the audio data source from that table. The voice data information table combines the above fields into a single table and supports output in csv format. The content of the voice data set includes at least one of: the audio number, the storage location of the audio file, the length of the audio, and the text content label corresponding to the audio. Specifically, in massive audio processing scenarios, whether the upstream and downstream data formats match directly affects the progress of the whole task. On the upstream data input side, the method supports two forms of data organization: one writes the various pieces of information into a csv file; the other is the kaldi data format, in which different pieces of information are stored in separate files.
In some exemplary embodiments, executing a corresponding read command according to the audio data source to obtain the audio content includes: if the audio data source is a wav file, reading the wav file directly to obtain the corresponding audio content; and if the audio data source is a shell command, executing the shell command in a pipeline to obtain the corresponding audio content. In the embodiments of the present application, a pipeline takes the output of the previous command as the input of the next command; in a C++ implementation, a pipe can be used to read the output of a shell command as a file object. By way of example, "20170001P00001U000A0018 /qs_data/train_1000h/ST-CMDS-20170001_1-OS/20170001P00001U000A0018.wav" can be identified as an audio source that is a wav file, while "00000 sox --vol 1.7082909716094652 -t wav /qs_data/speed/audio/LDC/vast_cmn_translation/wav/VVC038177.wav -t wav - |" can be identified as an audio source that is a shell command, where the leading "00000" is the audio ID.
As an example, as shown in fig. 2, a specific audio feature extraction process is provided, which includes:
and step S1, reading the voice data set and constructing a voice data information table. Among other things, a speech data set typically contains the following: the audio ID, the storage position of the audio file, the length of the audio and the text content label corresponding to the audio. Different speech algorithm frameworks have data sets of different formats. The embodiment of the application supports two data set formats: one is to write the above information in a csv file; the other is that different information is stored in different files in kaldi respectively. The storage location of the audio file may also be in the form of a shell command, i.e. an enhancement operation on the original audio. The voice data information table is a table combining the information of the fields and supports outputting the table in a csv format.
In step S2, the audio content is read. For each entry in the table generated in step S1, the audio is read from the audio source recorded in the table. If the audio source is a wav file, the file is read directly; if the audio source is a shell command, the command is executed in a pipeline to obtain the audio content. By way of example, "20170001P00001U000A0018 /qs_data/train_1000h/ST-CMDS-20170001_1-OS/20170001P00001U000A0018.wav" can be identified as an audio source that is a wav file, while "00000 sox --vol 1.7082909716094652 -t wav /qs_data/speed/audio/LDC/vast_cmn_translation/wav/VVC038177.wav -t wav - |" can be identified as an audio source that is a shell command, where the leading "00000" is the audio ID.
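A minimal sketch of this reading step follows, assuming (as in the examples above) that a source ending in "|" is a shell command whose standard output carries the wav stream, and that the audio is 16-bit PCM; the helper name read_audio is illustrative.

    import io
    import subprocess
    import wave

    import numpy as np

    def read_audio(source):
        src = source.strip()
        if src.endswith("|"):
            # Execute the enhancement command and take the wav bytes from the
            # pipe, so the enhanced audio is never written to disk.
            out = subprocess.run(src[:-1], shell=True, check=True,
                                 stdout=subprocess.PIPE).stdout
            wav = wave.open(io.BytesIO(out))
        else:
            wav = wave.open(src)  # a plain wav file on disk
        # Assumes 16-bit PCM; a streamed wav header may declare an unknown
        # length, which a production reader would have to tolerate.
        samples = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
        return samples, wav.getframerate()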
In step S3, FFT features are extracted. FFT features are extracted from the audio read in step S2 according to the discrete Fourier transform algorithm, and the user can set the dimension of the FFT result. The Fast Fourier Transform (FFT) is the collective name for the efficient algorithms that compute the Discrete Fourier Transform (DFT) on a computer.
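The sketch below continues the example for step S3: the samples are framed and windowed, then an FFT is taken per frame. The 25 ms frame, 10 ms shift, Hamming window, and n_fft=512 are assumed defaults standing in for the user-settable dimension; the audio is assumed to be at least one frame long.

    import numpy as np

    def fft_features(samples, rate, n_fft=512, frame_ms=25, shift_ms=10):
        frame_len = int(rate * frame_ms / 1000)
        shift = int(rate * shift_ms / 1000)
        n_frames = 1 + max(len(samples) - frame_len, 0) // shift
        window = np.hamming(frame_len)
        frames = np.stack([samples[i * shift:i * shift + frame_len] * window
                           for i in range(n_frames)])
        # Magnitude spectrum; rfft keeps the n_fft // 2 + 1 positive-frequency bins.
        return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))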
In step S4, Fbank features are extracted. A Mel filter bank is applied to the FFT features obtained in step S3 to obtain Fbank features, and the user can set the dimension of the Fbank result. Fbank (filter bank) is a front-end processing algorithm that processes audio in a manner similar to the human ear, which can improve speech recognition performance. The general steps for obtaining the Fbank features of a speech signal are: pre-emphasis, framing, windowing, Fourier transform (FFT), Mel filtering, mean normalization, and so on.
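Step S4 can then be sketched as below: a bank of triangular filters spaced evenly on the Mel scale is applied to the power spectrum from step S3, followed by a log. The textbook filter construction and the 40-filter default are assumptions standing in for the user-settable Fbank dimension.

    import numpy as np

    def mel_filterbank(n_filters, n_fft, rate):
        def hz_to_mel(hz):
            return 2595.0 * np.log10(1.0 + hz / 700.0)

        def mel_to_hz(mel):
            return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

        mel_pts = np.linspace(0.0, hz_to_mel(rate / 2.0), n_filters + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / rate).astype(int)
        fb = np.zeros((n_filters, n_fft // 2 + 1))
        for m in range(1, n_filters + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            for k in range(left, center):      # rising edge of triangle m
                fb[m - 1, k] = (k - left) / max(center - left, 1)
            for k in range(center, right):     # falling edge of triangle m
                fb[m - 1, k] = (right - k) / max(right - center, 1)
        return fb

    def fbank_features(spectrum, rate, n_filters=40):
        n_fft = (spectrum.shape[1] - 1) * 2    # undo the rfft bin count
        fb = mel_filterbank(n_filters, n_fft, rate)
        return np.log(spectrum ** 2 @ fb.T + 1e-10)  # log Mel filter-bank energies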
In step S5, MFCC features are extracted. A discrete cosine transform is applied to the Fbank features obtained in step S4 to obtain MFCC features, and the user can set the dimension of the MFCC result.
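Step S5 reduces to a type-II discrete cosine transform over the log filter-bank energies from step S4; in this sketch the 13 retained coefficients are an assumed default for the user-settable MFCC dimension.

    from scipy.fftpack import dct

    def mfcc_features(fbank, n_ceps=13):
        # The DCT decorrelates the log Mel energies; keep the low-order coefficients.
        return dct(fbank, type=2, axis=1, norm="ortho")[:, :n_ceps]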
In step S6, the feature extraction results are saved. If the dimension of the extracted audio features is large, the extracted feature data is also huge, so to save storage space and speed up reading, the audio features are stored as binary files. The npy matrix from numpy is a widely used matrix format in python, and the invention chooses npy as one of the result formats, thereby supporting most python-based speech algorithm frameworks; the invention may also save the results as ark binary files from kaldi.
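The saving step might look like the sketch below, which writes one feature matrix per utterance either as a numpy npy file or into a kaldi-style binary ark file. The ark branch assumes the third-party kaldiio package; any writer of kaldi's binary format would serve equally well, and the file names are illustrative.

    import numpy as np
    import kaldiio  # assumed third-party package for kaldi ark I/O

    def save_features(feature_dict, fmt="npy"):
        if fmt == "npy":
            for utt_id, feats in feature_dict.items():
                np.save(f"{utt_id}.npy", feats)    # one numpy binary matrix per utterance
        elif fmt == "ark":
            with kaldiio.WriteHelper("ark:feats.ark") as writer:
                for utt_id, feats in feature_dict.items():
                    writer(utt_id, feats)          # kaldi binary ark entries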
In summary, to address the existing problems, the method designs a multifunctional audio feature extraction scheme comprising three basic functions: audio data parsing, audio feature extraction, and storage of feature results. The scheme supports multiple result formats and optimizes processing efficiency. The method enables one-click processing of audio data sets, and the extracted features can be used with a variety of speech algorithm frameworks, solving the difficulty of running comparison experiments across platforms whose feature extraction algorithms differ. The method is rich in functionality and can extract FFT features (fast Fourier transform features), Fbank features (Mel filter coefficient features), and MFCC features (Mel cepstrum coefficient features) of different lengths; it has a wide range of applications, and the extraction results can be used for model training and prediction in a variety of speech recognition frameworks. At the same time, the method offers high extraction efficiency and a small resource footprint, reducing the cost of processing massive amounts of audio data.
As shown in fig. 3, the present invention further provides an audio feature extraction system, which includes:
the acquisition module M10 is used for executing a corresponding reading command according to an audio data source to acquire audio content;
an audio feature extraction module M20, configured to perform one or more feature extractions on the audio content;
and the storage module M30 is configured to save the audio features extracted each time into a preset file according to a preset framework.
To address the existing problems, the system implements a multifunctional audio feature extraction scheme comprising three basic functions: audio data parsing, audio feature extraction, and storage of feature results. The scheme supports multiple result formats and optimizes processing efficiency. The system enables one-click processing of audio data sets, and the extracted features can be used with a variety of speech algorithm frameworks, solving the difficulty of running comparison experiments across platforms whose feature extraction algorithms differ. The system is rich in functionality and can extract FFT features (fast Fourier transform features), Fbank features (Mel filter coefficient features), and MFCC features (Mel cepstrum coefficient features) of different lengths; it has a wide range of applications, and the extraction results can be used for model training and prediction in a variety of speech recognition frameworks. At the same time, the system offers high extraction efficiency and a small resource footprint, reducing the cost of processing massive amounts of audio data.
In some exemplary embodiments, the audio features extracted by the audio feature extraction module M20 include at least one of: fast Fourier transform features (FFT features), Mel filter coefficient features (Fbank features), Mel cepstrum coefficient features (MFCC features), pitch features, and identity vector features (including ivector features and xvector features). As an example, performing feature extraction on the audio content one or more times may include: performing a discrete Fourier transform on the audio content and extracting fast Fourier transform features; applying a Mel filter bank to the fast Fourier transform features and extracting Mel filter coefficient features; and performing a discrete cosine transform on the Mel filter coefficient features and extracting Mel cepstrum coefficient features. In this way, FFT features, Fbank features, and MFCC features of different lengths can be extracted, and the extraction results can be used for model training and prediction in a variety of speech recognition frameworks. As an example, the pitch feature can be equivalently replaced by a higher-dimensional Fbank feature when training a speech recognition model; the ivector and xvector features contain speaker information and require supervised training on a specific data set. With the system using only Fbank features, the effect of training a speech recognition model on kaldi (a speech recognition algorithm framework) essentially reaches the level of joint training on the three features Fbank + pitch + ivector.
In some exemplary embodiments, the storage module M30 saving the audio features extracted each time into a preset file according to a preset framework includes: saving the audio features extracted each time to a csv file, a numpy npy file (numpy is a numerical computing extension library for python; npy is a type of binary matrix file in numpy), and/or a kaldi binary ark file. If the dimension of the audio features is large, the extracted feature data is also huge, so the audio features are usually stored as binary files to save storage space and speed up reading. The npy matrix from numpy is a widely used matrix format in python; by way of example, embodiments of the present application may choose npy as one of the result formats and can thereby support most python-based speech algorithm frameworks, and the results may also be saved as ark binary files from kaldi. When a csv file is used to organize the audio data, the various pieces of information about the audio data can easily be arranged into csv form with python and similar tools, so the system applies to most speech data processing scenarios. Meanwhile, by supporting the kaldi file format, the system handles kaldi-format data and can satisfy most requirements in the speech recognition field.

In addition, the storage location of an audio file may also take the form of a shell command, i.e., an enhancement operation to be performed on the original audio. Many speech algorithms apply enhancement operations to audio data during the data preparation stage, such as increasing or decreasing the volume, adjusting the audio speed, and adding reverberation or noise. If the enhanced audio were saved to files, the disk space occupied by the data would be multiplied, and reading that data back from files would slow down subsequent tasks. The system therefore supports the kaldi style of data enhancement: an enhancement recorded as a shell command is not executed immediately; instead, the command is executed at feature extraction time and its result is passed along a pipeline, so the intermediate data is never written to disk. By optimizing how pipelines are used, the invention reduces memory usage and increases the number of tasks that can run in parallel on a multi-core device.

Because there is no uniform standard for organizing audio features, the binary files used by different speech tools are rarely compatible with one another. The system therefore supports two common binary file formats: the npy matrix from numpy and the binary format of kaldi. Since the current mainstream machine learning and deep learning frameworks are all based on python, npy support means the system is compatible with these mainstream frameworks, while kaldi binary support provides the conditions for comparison experiments between kaldi and other frameworks. The inventor also compared the resource usage and efficiency of the system against kaldi's bundled feature extraction tool: on the same speech data set, the system can run about 3 times as many parallel tasks as kaldi, i.e., a single task occupies only about one third of the memory; and in the single-task case, the time the system needs to extract all three features (FFT, Fbank, and MFCC) is comparable to the time kaldi needs to extract the Fbank feature alone, i.e., at the same memory usage the extraction speed of the system is about three times that of kaldi.
In an exemplary embodiment, the system further constructs a voice data information table from the voice data set and determines the audio data source from that table. The voice data information table combines the above fields into a single table and supports output in csv format. The content of the voice data set includes at least one of: the audio number, the storage location of the audio file, the length of the audio, and the text content label corresponding to the audio. Specifically, in massive audio processing scenarios, whether the upstream and downstream data formats match directly affects the progress of the whole task. On the upstream data input side, the system supports two forms of data organization: one writes the various pieces of information into a csv file; the other is the kaldi data format, in which different pieces of information are stored in separate files.
In some exemplary embodiments, executing a corresponding read command according to the audio data source to obtain the audio content includes: if the audio data source is a wav file, reading the wav file directly to obtain the corresponding audio content; and if the audio data source is a shell command, executing the shell command in a pipeline to obtain the corresponding audio content. In the embodiments of the present application, a pipeline takes the output of the previous command as the input of the next command; in a C++ implementation, a pipe can be used to read the output of a shell command as a file object. By way of example, "20170001P00001U000A0018 /qs_data/train_1000h/ST-CMDS-20170001_1-OS/20170001P00001U000A0018.wav" can be identified as an audio source that is a wav file, while "00000 sox --vol 1.7082909716094652 -t wav /qs_data/speed/audio/LDC/vast_cmn_translation/wav/VVC038177.wav -t wav - |" can be identified as an audio source that is a shell command, where the leading "00000" is the audio ID.
As an example, as shown in fig. 2, a specific audio feature extraction process is provided; for its specific functions and technical effects, refer to the foregoing embodiments, which are not repeated here for the present system.
In summary, to address the existing problems, the system implements a multifunctional audio feature extraction scheme comprising three basic functions: audio data parsing, audio feature extraction, and storage of feature results. The scheme supports multiple result formats and optimizes processing efficiency. The system enables one-click processing of audio data sets, and the extracted features can be used with a variety of speech algorithm frameworks, solving the difficulty of running comparison experiments across platforms whose feature extraction algorithms differ. The system is rich in functionality and can extract FFT features (fast Fourier transform features), Fbank features (Mel filter coefficient features), and MFCC features (Mel cepstrum coefficient features) of different lengths; it has a wide range of applications, and the extraction results can be used for model training and prediction in a variety of speech recognition frameworks. At the same time, the system offers high extraction efficiency and a small resource footprint, reducing the cost of processing massive amounts of audio data.
The embodiment of the present application further provides an audio feature extraction device, including:
a module for executing a corresponding reading command according to an audio data source to acquire audio content;

and a module for performing feature extraction on the audio content one or more times and saving the audio features extracted each time into a preset file according to a preset framework.
In this embodiment, the audio feature extraction device executes the above system or method; for specific functions and technical effects, refer to the foregoing embodiments, which are not described here again.
An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the apparatus may serve as a terminal device or as a server. Examples of the terminal device may include: a smart phone, a tablet computer, an electronic book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a vehicle-mounted computer, a desktop computer, a set-top box, an intelligent television, a wearable device, and the like.
The present embodiment also provides a non-volatile readable storage medium in which one or more modules (programs) are stored; when the one or more modules are applied to a device, the device can execute the instructions included in the data processing method of fig. 1 according to the embodiment of the present application.
Fig. 4 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes a function for executing each module of the speech recognition apparatus in each device, and specific functions and technical effects may refer to the above embodiments, which are not described herein again.
Fig. 5 is a schematic hardware structure diagram of a terminal device according to another embodiment of the present application. Fig. 5 is a specific embodiment of the implementation process of fig. 4. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: communication components 1203, power components 1204, multimedia components 1205, audio components 1206, input/output interfaces 1207, and/or sensor components 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps of the method illustrated in fig. 1 described above. Further, the processing component 1200 can include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 can include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia components 1205 include a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The audio component 1206 is configured to output and/or input speech signals. For example, the audio component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, audio component 1206 also includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing various aspects of status assessment for the terminal device. For example, the sensor component 1208 may detect an open/closed state of the terminal device, relative positioning of the components, presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the audio component 1206, the input/output interface 1207 and the sensor component 1208 in the embodiment of fig. 5 may be implemented as the input device in the embodiment of fig. 4.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.

Claims (12)

1. An audio feature extraction method is characterized by comprising the following steps:
executing a corresponding reading command according to an audio data source to acquire audio content;
and performing feature extraction on the audio content one or more times, and saving the audio features extracted each time into a preset file according to a preset framework.
2. The audio feature extraction method of claim 1, wherein the extracted audio features comprise at least one of: fast Fourier transform features, Mel filter coefficient features, Mel cepstrum coefficient features, pitch features, identity vector features.
3. The audio feature extraction method according to claim 1 or 2, wherein performing one or more feature extractions on the audio content comprises:
performing discrete Fourier transform on the audio content, and extracting fast Fourier transform characteristics;
applying a Mel filter bank to the fast Fourier transform features, and extracting Mel filter coefficient features;
and carrying out discrete cosine transform on the Mel filter coefficient characteristics, and extracting Mel cepstrum coefficient characteristics.
4. The audio feature extraction method of claim 1, wherein saving the audio features extracted each time into a preset file according to a preset framework comprises: saving the audio features extracted each time to a csv file, a numpy npy file, and/or a kaldi binary ark file.
5. The audio feature extraction method of claim 1, wherein executing a corresponding read command according to an audio data source to obtain audio content comprises:
if the audio data source is a wav file, directly reading the wav file to acquire corresponding audio content;
and if the audio data source is a shell command, executing the shell command in a pipeline to acquire corresponding audio content.
6. The audio feature extraction method according to claim 1 or 5, further comprising: constructing a voice data information table from a voice data set, and determining the audio data source from the voice data information table;
wherein the content in the voice data set comprises at least one of: an audio number, the storage location of the audio file, the audio length, and the text content label corresponding to the audio.
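Claim 6's voice data information table can be sketched as one csv row per utterance carrying the four listed fields; the column names and the transcript lookup are assumptions for illustration:

    import csv
    import wave

    def build_data_table(wav_paths, transcripts, out_csv="data_info.csv"):
        with open(out_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["audio_id", "path", "duration_s", "transcript"])
            for i, path in enumerate(wav_paths):
                with wave.open(path, "rb") as w:
                    dur = w.getnframes() / w.getframerate()  # audio length
                writer.writerow([i, path, round(dur, 3), transcripts.get(i, "")])
        return out_csv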
7. An audio feature extraction system, comprising:
the acquisition module is used for executing a corresponding reading command according to an audio data source and acquiring audio content;
the audio feature extraction module is used for performing feature extraction on the audio content one or more times;
and the storage module is used for storing the audio features extracted each time into a preset file according to a preset framework.
8. The audio feature extraction system of claim 7, wherein the audio features extracted by the audio feature extraction module comprise at least one of: fast Fourier transform (FFT) features, Mel filter bank coefficient features, Mel-frequency cepstral coefficient (MFCC) features, pitch features, and identity vector (i-vector) features.
9. The audio feature extraction system according to claim 7 or 8, wherein the audio feature extraction module performing feature extraction on the audio content one or more times comprises:
performing a discrete Fourier transform on the audio content to extract fast Fourier transform features;
applying a Mel filter bank to the fast Fourier transform features to extract Mel filter bank coefficient features;
and performing a discrete cosine transform on the Mel filter bank coefficient features to extract Mel-frequency cepstral coefficient features.
10. The audio feature extraction system of claim 7, wherein the storage module storing the audio features extracted each time into a preset file according to a preset framework comprises: saving the audio features extracted each time to a csv file, a numpy npy file, and/or a Kaldi binary ark file.
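Read together, claims 7-10 describe a three-module system. A minimal sketch of how the modules could compose, reusing the hypothetical helpers from the sketches above (framing/windowing between acquisition and extraction is omitted here):

    class AudioFeaturePipeline:
        # Illustrative composition only; module boundaries follow claims 7-10.
        def acquire(self, source):
            return read_audio(source)            # acquisition module

        def extract(self, frames):
            return extract_features(frames)      # audio feature extraction module

        def store(self, feats, basename):
            save_features(feats, basename)       # storage module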
11. A computer device, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the method of any of claims 1-6.
12. One or more machine readable media having instructions stored thereon that, when executed by one or more processors, cause an apparatus to perform the method of any of claims 1-6.
CN202110134475.XA 2021-01-29 2021-01-29 Audio feature extraction method, system, device and medium Pending CN112908307A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110134475.XA CN112908307A (en) 2021-01-29 2021-01-29 Audio feature extraction method, system, device and medium

Publications (1)

Publication Number Publication Date
CN112908307A true CN112908307A (en) 2021-06-04

Family

ID=76122323

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110134475.XA Pending CN112908307A (en) 2021-01-29 2021-01-29 Audio feature extraction method, system, device and medium

Country Status (1)

Country Link
CN (1) CN112908307A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673364A (en) * 2021-07-28 2021-11-19 上海影谱科技有限公司 Video violence detection method and device based on deep neural network

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6119116A (en) * 1997-12-18 2000-09-12 International Business Machines Corp. System and method for accessing and distributing audio CD data over a network
US20050265162A1 (en) * 2004-05-27 2005-12-01 Canon Kabushiki Kaisha File system, file recording method, and file reading method
US20110029583A1 (en) * 2008-04-10 2011-02-03 Masahiro Nakanishi Nonvolatile storage module, access module, musical sound data file generation module and musical sound generation system
CN101977260A (en) * 2010-09-07 2011-02-16 中兴通讯股份有限公司 Voice frequency regulating method and device
CN103269374A (en) * 2013-05-29 2013-08-28 北京小米科技有限责任公司 Method, device and equipment for recording synchronization
CN103870466A (en) * 2012-12-10 2014-06-18 哈尔滨网腾科技开发有限公司 Automatic extracting method for audio examples
CN109147818A (en) * 2018-10-30 2019-01-04 Oppo广东移动通信有限公司 Acoustic feature extracting method, device, storage medium and terminal device
CN109300483A (en) * 2018-09-14 2019-02-01 美林数据技术股份有限公司 A kind of intelligent audio abnormal sound detection method
CN111554318A (en) * 2020-04-27 2020-08-18 天津大学 Method for realizing mobile phone end pronunciation visualization system

Similar Documents

Publication Publication Date Title
CN109032470B (en) Screenshot method, screenshot device, terminal and computer-readable storage medium
CN112420069A (en) Voice processing method, device, machine readable medium and equipment
CN112200318B (en) Target detection method, device, machine readable medium and equipment
CN109947971B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN111898495B (en) Dynamic threshold management method, system, device and medium
CN111598012B (en) Picture clustering management method, system, device and medium
CN109360551B (en) Voice recognition method and device
CN111310725A (en) Object identification method, system, machine readable medium and device
CN111028828A (en) Voice interaction method based on screen drawing, screen drawing and storage medium
CN114676825A (en) Neural network model quantification method, system, device and medium
CN112908307A (en) Audio feature extraction method, system, device and medium
CN106873798B (en) Method and apparatus for outputting information
CN112989210A (en) Insurance recommendation method, system, equipment and medium based on health portrait
CN113157240A (en) Voice processing method, device, equipment, storage medium and computer program product
CN109710939B (en) Method and device for determining theme
WO2020258503A1 (en) Tensorflow-based voice fusion method, electronic device, and storage medium
US9047059B2 (en) Controlling a voice site using non-standard haptic commands
CN112331187B (en) Multi-task speech recognition model training method and multi-task speech recognition method
CN112417197B (en) Sorting method, sorting device, machine readable medium and equipment
US11238863B2 (en) Query disambiguation using environmental audio
CN114676785A (en) Method, system, equipment and medium for generating target detection model
CN112596846A (en) Method and device for determining interface display content, terminal equipment and storage medium
CN112069184A (en) Vector retrieval method, system, device and medium
CN112257581A (en) Face detection method, device, medium and equipment
CN211604645U (en) Anti-money laundering anti-counterfeit money propaganda device based on voice interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210604