CN104240697A - Audio data feature extraction method and device - Google Patents

Audio data feature extraction method and device

Info

Publication number
CN104240697A
CN104240697A (application CN201310255723.1A)
Authority
CN
China
Prior art keywords: audio data, subsequence, total amount, data subsequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310255723.1A
Other languages
Chinese (zh)
Inventor: 谢志明, 潘石柱, 张兴明, 傅利泉, 朱江明, 吴军, 吴坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN201310255723.1A
Publication of CN104240697A
Legal status: Pending


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an audio data feature extraction method and device for extracting feature vectors of equal length from audio data sequences of different lengths. The method includes: obtaining audio data sequences; cutting each obtained audio data sequence to obtain multiple audio data subsequences; extracting a specified feature from each of the multiple audio data subsequences; and combining the extracted specified features of all the audio data subsequences, wherein the number of the multiple audio data subsequences equals a preset number and each of the multiple audio data subsequences contains the same total amount of data.

Description

Audio data feature extraction method and device
Technical field
The present invention relates to the field of information processing, and in particular to an audio data feature extraction method and device.
Background Art
In audio classification and recognition, extracting the features common to audio data of the same class is extremely important, because in the prior art, classifying and identifying audio data of unknown class usually relies on these common features.
In the prior-art schemes for extracting the common features of audio data, an audio data sequence of fixed duration (i.e., a sequence composed of multiple audio data samples) is generally divided into frames for short-time processing: the fixed-duration audio data sequence is cut into multiple frames of audio data subsequences, each frame is pre-processed, and then features such as the Mel Frequency Cepstral Coefficients (MFCC) or the Linear Predictive Mel Frequency Cepstral Coefficients (LPMFCC) of each frame are extracted. The features extracted from the individual frames are then combined and used as the feature of the whole audio data segment. In this way, the features of each training audio data sequence can be obtained, and clustering over the training sequences yields the common features of each class of audio data sequence.
It should be noted that the mel is a unit of subjective pitch, whereas the hertz (Hz) is a unit of objective pitch. The mel frequency scale is based on the characteristics of human hearing and has a nonlinear correspondence with frequency in hertz. MFCC features are hertz-spectrum features computed by exploiting this relationship between the two. LPMFCC is a new characteristic parameter obtained on the basis of linear prediction coefficients: borrowing the MFCC computation method, a mel-cepstrum calculation is applied to the linear prediction coefficients. Classifying audio data based on these characteristic parameters helps improve the recognition rate of an audio data classifier.
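As a side note, the nonlinear hertz-to-mel correspondence mentioned above is commonly approximated by mel(f) = 2595·log10(1 + f/700). This formula is a widely used convention, not one specified by the patent itself; a minimal sketch:

```python
import math

def hz_to_mel(f_hz):
    # Widely used O'Shaughnessy approximation of the mel scale;
    # an illustrative convention, not a formula given in this patent.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The mapping is nonlinear: doubling the frequency in hertz
# less than doubles the subjective pitch in mels.
assert abs(hz_to_mel(1000.0) - 1000.0) < 2.0   # ~1000 Hz maps to ~1000 mel
assert hz_to_mel(2000.0) < 2 * hz_to_mel(1000.0)
```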
The above extraction scheme achieves good results in audio classification and recognition, but because it cuts audio data sequences according to a segmentation rule that produces subsequences of identical duration from different sequences, the duration of every audio data sequence must be fixed; only then can different sequences be cut into the same number of frames. The scheme therefore has certain defects: when the duration of an audio data sequence exceeds the stipulated time length, the sequence must be truncated, which destroys its integrity; and when the duration is shorter than the stipulated time length, the scheme cannot process the sequence at all. The underlying reason for these defects is that if different audio data sequences have unequal durations, the feature vectors extracted from them also have unequal lengths, and clustering or classification training cannot be performed on feature vectors of unequal length.
Summary of the invention
Embodiments of the present invention provide an audio data feature extraction method to solve the prior-art problem that feature vectors of equal length cannot be extracted from audio data sequences of different lengths.
The embodiments of the present invention adopt the following technical solutions.
An audio data feature extraction method, comprising: obtaining audio data sequences; and performing, for each obtained audio data sequence: cutting the audio data sequence to obtain multiple audio data subsequences; extracting a specified feature of each audio data subsequence in the multiple audio data subsequences; and combining the extracted specified features of the audio data subsequences; wherein the number of the multiple audio data subsequences equals a preset number, and each audio data subsequence in the multiple audio data subsequences contains the same total amount of data.
An audio data feature extraction device, comprising: an obtaining unit, configured to obtain audio data sequences; and
a feature extraction unit, configured to perform, for each audio data sequence obtained by the obtaining unit: cutting the audio data sequence to obtain multiple audio data subsequences; extracting a specified feature of each audio data subsequence in the multiple audio data subsequences; and combining the extracted specified features of the audio data subsequences; wherein the number of the multiple audio data subsequences equals a preset number, and each audio data subsequence in the multiple audio data subsequences contains the same total amount of data.
The beneficial effects of the embodiments of the present invention are as follows.
With the technical scheme provided by the embodiments of the present invention, since the audio data is cut into audio data subsequences of a fixed frame count, the length of the feature obtained by combining the specified features extracted from the individual subsequences is guaranteed to be fixed as well, which solves the prior-art problem that feature vectors of equal length cannot be extracted from audio data of different lengths. The scheme provided by the embodiments of the present invention thus makes more effective use of more audio data sequence samples when training on such samples.
Brief Description of the Drawings
Fig. 1 is a flowchart of the audio data feature extraction method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a specific application of the audio data feature extraction method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the audio data feature extraction device provided by an embodiment of the present invention.
Detailed Description
Through analysis and study of the prior art, the inventors found that the prior-art methods for extracting the common features of audio data share a common defect: the duration of the audio data sequences must be fixed so that different sequences can be cut into the same number of frames, which in turn guarantees that feature vectors of equal length are extracted from each sequence. To solve this problem, embodiments of the present invention provide a method for extracting the common features of audio data of different durations. In this method, each audio data sequence is cut into a fixed number of audio data subsequences, which guarantees that the length of the feature obtained by combining the specified features extracted from the subsequences is also fixed, thereby solving the prior-art problem that feature vectors of equal length cannot be extracted from audio data of different lengths.
Embodiments of the present invention are described below in conjunction with the accompanying drawings. It should be appreciated that the embodiments described herein are only for illustrating and explaining the present invention, and do not limit it. Moreover, where no conflict arises, the embodiments in this description and the features in the embodiments can be combined with one another.
First, an embodiment of the present invention provides an audio data feature extraction method whose flow is shown schematically in Fig. 1 and comprises the following steps. It should be noted that the following steps illustrate, for any one obtained audio data sequence, how features are extracted from that sequence. Those skilled in the art will understand that each obtained audio data sequence can be processed with the same steps, so that the features extracted from the individual sequences all have the same length.
Step 11: cut an obtained audio data sequence to obtain multiple audio data subsequences.
In an embodiment of the present invention, one concrete implementation of cutting the audio data sequence into multiple audio data subsequences may comprise the following sub-steps:
first, determine the total amount of data contained in each audio data subsequence, according to the total amount of data contained in the audio data sequence, the preset number, and a preset frame overlap percentage;
then, cut the audio data sequence according to the frame overlap percentage and the determined total amount of data per subsequence, obtaining multiple audio data subsequences.
In an embodiment of the present invention, the audio data sequence is generally obtained by sampling an actual audio signal; the preset number refers to the prespecified number of audio data subsequences into which the audio data sequence is cut, also called the fixed frame count; and the frame overlap percentage denotes the proportion that the audio data shared in time between two adjacent audio data subsequences occupies in the total amount of data contained in one subsequence.
In particular, the frame overlap percentage may be 0.
Optionally, if the audio data in the sequence is uniformly distributed — that is, every unit-time span of the sequence contains the same amount of data — and the time length of the sequence is evenly divisible by the aforementioned preset number, then the sequence need not be cut according to the sub-steps above. Instead, it can be divided directly, according to its time length and the preset number, into multiple audio data subsequences of equal time length, such that the number of subsequences obtained equals the preset number.
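When the overlap is 0 and the sequence length divides evenly by the preset number, the direct division described above reduces to an equal split. A minimal sketch (the sample values are illustrative):

```python
import numpy as np

# A uniformly sampled sequence whose length (1200) divides evenly
# by the preset number N = 4, so no overlap bookkeeping is needed.
signal = np.arange(1200)
subsequences = np.split(signal, 4)  # four equal-length subsequences

assert len(subsequences) == 4
assert all(len(s) == 300 for s in subsequences)
```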
Optionally, to facilitate the subsequent combination of the specified features extracted from the subsequences obtained by cutting, an embodiment of the present invention may also record the temporal order of the audio data subsequences obtained by cutting the audio data sequence.
Step 12: for each audio data sequence cut as described in step 11, extract the specified feature of each audio data subsequence in the multiple audio data subsequences.
In an embodiment of the present invention, in order to smooth the data waveform and reduce noise interference, each obtained audio data subsequence may first be pre-processed before the specified feature is extracted, yielding pre-processed subsequences; the specified features are then extracted from the pre-processed subsequences. Those skilled in the art will appreciate that if smoothing the waveform and reducing noise interference are not required, the pre-processing of the audio data subsequences may be omitted.
In general, there are many ways to pre-process an audio data subsequence. In an embodiment of the present invention, the pre-processing may include, but is not limited to, one or a combination of the following:
zero-mean processing, pre-emphasis processing, and windowing. Since these are mature prior-art techniques and not the emphasis of the present invention, they are not described further here.
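The three pre-processing operations named above can be sketched as follows. The pre-emphasis coefficient 0.97 and the choice of a Hamming window are conventional values assumed for illustration, not values given in the text:

```python
import numpy as np

def preprocess(frame, pre_emphasis=0.97):
    """Zero-mean, pre-emphasis, and windowing for one audio data subsequence.
    The 0.97 coefficient and the Hamming window are conventional choices,
    not parameters fixed by the patent."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()                                    # zero-mean processing
    x = np.append(x[0], x[1:] - pre_emphasis * x[:-1])  # pre-emphasis (high-pass)
    return x * np.hamming(len(x))                       # windowing

processed = preprocess(np.sin(np.linspace(0, 10, 256)))
assert processed.shape == (256,)
```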
Step 13: combine the extracted specified features of the audio data subsequences. The feature (usually a feature vector) obtained by the combination is the feature (feature vector) of the audio data sequence.
Optionally, if the temporal order of the audio data subsequences obtained by cutting has been recorded as described above, the specified features of the individual subsequences can be arranged in that recorded order to obtain a single combined feature.
The above method provided by the embodiments of the present invention guarantees that the feature vectors extracted from different audio data sequences have equal length, enabling the extraction of common features from audio data of different durations; the method therefore has a wider scope of application.
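Steps 11–13 can be sketched end to end as follows. The per-frame feature used here (mean, standard deviation, energy) is a toy stand-in for the MFCC/LPMFCC features discussed in the text, chosen only to keep the sketch self-contained:

```python
import numpy as np

def extract_fixed_length_feature(audio, n_frames=4, overlap=0.0):
    """Cut `audio` into `n_frames` subsequences (step 11), extract a
    per-frame feature (step 12), and concatenate in temporal order
    (step 13). The 3-dim per-frame feature is a toy stand-in, not
    the MFCC/LPMFCC features named in the text."""
    total = len(audio)
    x = int(round(total / (n_frames - n_frames * overlap + overlap)))
    hop = int(round(x * (1 - overlap)))
    frames = [audio[i * hop : i * hop + x] for i in range(n_frames)]
    feats = [np.array([f.mean(), f.std(), np.sum(f * f)]) for f in frames]
    return np.concatenate(feats)  # length n_frames * 3, whatever `total` is

# Sequences of different durations yield feature vectors of equal length.
short = extract_fixed_length_feature(np.random.randn(8000))
long_ = extract_fixed_length_feature(np.random.randn(32000))
assert short.shape == long_.shape == (12,)
```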
A specific application flow of the above method in practice is introduced below. As shown in Fig. 2, this application flow comprises the following steps.
Step 21: collect an audio data sequence and record the corresponding acquisition information, which includes the time length T and the sampling rate K of the sequence.
Step 22: cut the collected audio data sequence to obtain multiple audio data subsequences of equal data amount.
Specifically, first obtain the total amount of data of the sequence, K*T, from its time length and sampling rate. Since K*T satisfies the following formula [1]:
K*T = N*X*(1-P) + X*P [1]
the total amount of data X contained in each audio data subsequence can be calculated, as shown in formula [2]:
X = K*T/(N - N*P + P) [2]
where N is the preset frame count, P is the frame overlap percentage, T is the time length of the audio data sequence, K is the sampling rate used when sampling the original audio to obtain the sequence, K*T is the total amount of data contained in the sequence, and X is the total amount of data contained in each subsequence.
Finally, cut the audio data sequence according to the frame overlap percentage P and the calculated X, obtaining multiple audio data subsequences of equal data amount.
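Formulas [1] and [2] can be checked with a short sketch. The parameter values (16 kHz, 1 s, N = 31, P = 0.5) are illustrative choices that make X come out to a whole number:

```python
import numpy as np

def split_fixed_frames(audio, n_frames, overlap):
    """Cut `audio` into `n_frames` subsequences whose size X is given by
    formula [2]: X = K*T / (N - N*P + P), with hop X*(1 - P)."""
    total = len(audio)                                          # K*T
    x = int(round(total / (n_frames - n_frames * overlap + overlap)))
    hop = int(round(x * (1 - overlap)))
    return [audio[i * hop : i * hop + x] for i in range(n_frames)]

# K*T = 16000, N = 31, P = 0.5  ->  X = 16000 / 16 = 1000, hop = 500
signal = np.arange(16000)
frames = split_fixed_frames(signal, n_frames=31, overlap=0.5)
assert len(frames) == 31
assert all(len(f) == 1000 for f in frames)
# Formula [1] holds: N*X*(1-P) + X*P == K*T
assert 31 * 1000 * 0.5 + 1000 * 0.5 == 16000
```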
Step 23: pre-process each audio data subsequence to obtain pre-processed subsequences; then extract the specified feature from each pre-processed subsequence.
Optionally, in an embodiment of the present invention, because the frame count is fixed while the total duration varies, frequency-domain features are more stable than time-domain features. The specified feature may therefore generally be a frequency-domain feature, such as MFCC, LPMFCC, or a combination thereof, which achieves better results in the feature extraction process for audio data.
Step 24: combine the extracted specified features of the audio data subsequences: according to the recorded temporal order of the multiple subsequences, concatenate the specified features of the individual subsequences into one feature vector, and determine this feature vector to be the feature vector of the audio data sequence corresponding to those subsequences.
For example, suppose an audio data sequence ABC is cut, in order, into audio data subsequences A, B, and C, and the specified features extracted from the three subsequences A, B, C are a, b, and c respectively. Then, according to the temporal order of the three subsequences at cutting time, the features a, b, and c are concatenated into one feature vector abc, which is finally determined to be the feature vector of the audio data sequence ABC.
Since the length of the specified feature of each subsequence is fixed and the preset frame count is fixed, the length of the feature vector obtained for each audio data sequence is also fixed.
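The combination in step 24 is a straightforward ordered concatenation. The 13-dimensional per-frame features here are hypothetical placeholders (e.g. one MFCC vector per frame), not values fixed by the text:

```python
import numpy as np

# Hypothetical specified features a, b, c of subsequences A, B, C,
# each 13-dimensional (e.g. one MFCC vector per frame).
a = np.zeros(13)
b = np.ones(13)
c = np.full(13, 2.0)

# Concatenate in the recorded temporal order A, B, C.
abc = np.concatenate([a, b, c])
assert abc.shape == (39,)  # fixed length: 3 frames x 13 dims
assert abc[0] == 0 and abc[13] == 1 and abc[26] == 2
```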
With the technical scheme provided by the embodiments of the present invention, each audio data sequence of a known class is divided into the preset number of frames according to the preset number, the frame overlap percentage, and the total amount of data the sequence contains. This guarantees that the length of the feature vector extracted from the sequence is constant, so that specified features of equal length can be extracted from audio data sequences of different durations, widening the scope of application of the method provided by the embodiments of the present invention.
Corresponding to the audio data feature extraction method provided by the embodiments of the present invention, the embodiments of the present invention also provide an audio data feature extraction device. A schematic structural diagram of the device is shown in Fig. 3; it specifically comprises the following functional units:
an obtaining unit 31, configured to obtain audio data sequences; and
a feature extraction unit 32, configured to perform, for each audio data sequence obtained by the obtaining unit 31: cutting the audio data sequence to obtain multiple audio data subsequences; extracting a specified feature of each audio data subsequence in the multiple audio data subsequences; and combining the extracted specified features of the audio data subsequences;
wherein the number of the multiple audio data subsequences equals a preset number, and each audio data subsequence in the multiple audio data subsequences contains the same total amount of data.
Optionally, the feature extraction unit 32 may specifically be configured to: determine the total amount of data contained in each audio data subsequence according to the total amount of data contained in the audio data sequence, the preset number, and a preset frame overlap percentage, wherein the frame overlap percentage denotes the proportion that the audio data shared in time between two adjacent subsequences occupies in the total amount of data contained in one subsequence; and
cut the audio data sequence according to the frame overlap percentage and the determined total amount of data per subsequence to obtain the multiple audio data subsequences, recording the temporal order of the multiple audio data subsequences.
Optionally, the feature extraction unit 32 may specifically be configured to calculate the total amount of data contained in each audio data subsequence by the following formula:
X = K*T/(N - N*P + P)
where N is the preset number, P is the frame overlap percentage, T is the time length of the audio data sequence, K is the sampling rate used when sampling the original audio to obtain the sequence, K*T is the total amount of data contained in the sequence, and X is the total amount of data contained in each subsequence.
Optionally, the feature extraction unit 32 may specifically be configured to:
pre-process each audio data subsequence to obtain pre-processed subsequences, and extract the specified feature from each pre-processed subsequence, wherein the pre-processing specifically comprises one or a combination of the following:
zero-mean processing; pre-emphasis processing; windowing.
Optionally, the device provided by the embodiments of the present invention may further comprise a recording unit, configured to record the temporal order of the multiple audio data subsequences when the feature extraction unit 32 cuts the audio data sequence into the multiple subsequences. Based on the function of this recording unit, the feature extraction unit 32 may specifically be configured to concatenate the specified features of the individual subsequences into one feature vector according to the temporal order recorded by the recording unit.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a specific way, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a sequence of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art, once apprised of the basic inventive concept, may make other changes and modifications to these embodiments. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications falling within the scope of the invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to encompass them.

Claims (10)

1. An audio data feature extraction method, characterized by comprising:
obtaining audio data sequences; and
performing, for each obtained audio data sequence: cutting the audio data sequence to obtain multiple audio data subsequences; extracting a specified feature of each audio data subsequence in the multiple audio data subsequences; and combining the extracted specified features of the audio data subsequences;
wherein the number of the multiple audio data subsequences equals a preset number, and each audio data subsequence in the multiple audio data subsequences contains the same total amount of data.
2. the method for claim 1, is characterized in that, this audio data sequence is carried out cutting, obtains multiple voice data subsequence, specifically comprises:
According to data total amount, described predetermined number and default anchor-frame overlapping percentages that this audio data sequence comprises, determine the data total amount that voice data subsequence comprises; Wherein, described anchor-frame overlapping percentages represents the accounting in the data total amount that the time quantity of voice data that has of upper two adjacent voice data subsequences comprises voice data subsequence;
According to the data total amount that described anchor-frame overlapping percentages and the voice data subsequence determined comprise, this audio data sequence is carried out cutting, obtains described multiple voice data subsequence.
3. The method of claim 2, characterized in that determining the total amount of data contained in each audio data subsequence according to the total amount of data contained in the audio data sequence, the preset number, and the preset frame overlap percentage specifically comprises:
calculating the total amount of data contained in each audio data subsequence by the following formula:
X = K*T/(N - N*P + P)
where N is the preset number, P is the frame overlap percentage, T is the time length of the audio data sequence, K is the sampling rate used when sampling the original audio to obtain the sequence, K*T is the total amount of data contained in the sequence, and X is the total amount of data contained in each subsequence.
4. the method for claim 1, is characterized in that, extracts the specific characteristic of each voice data subsequence in described multiple voice data subsequence respectively, specifically comprises:
Pre-service is carried out to described each voice data subsequence, obtains each voice data subsequence pretreated;
Specific characteristic is extracted respectively from each voice data subsequence pretreated;
Wherein, described pre-service specifically comprises one or more the combination in following manner:
Zero-mean process; Pre-emphasis process; Windowing process.
5. the method for claim 1, is characterized in that, described method also comprises:
This audio data sequence is carried out cutting, when obtaining multiple voice data subsequence, record is carried out to described multiple voice data subsequence putting in order in time; Then
The specific characteristic of each voice data subsequence of extracting is combined, specifically comprises:
By described multiple voice data subsequences putting in order in time of record, the specific characteristic of each voice data subsequence described is synthesized a proper vector.
6. An audio data feature extraction device, characterized by comprising:
an obtaining unit, configured to obtain audio data sequences; and
a feature extraction unit, configured to perform, for each audio data sequence obtained by the obtaining unit: cutting the audio data sequence to obtain multiple audio data subsequences; extracting a specified feature of each audio data subsequence in the multiple audio data subsequences; and combining the extracted specified features of the audio data subsequences;
wherein the number of the multiple audio data subsequences equals a preset number, and each audio data subsequence in the multiple audio data subsequences contains the same total amount of data.
7. device as claimed in claim 6, it is characterized in that, described feature extraction unit, specifically for the data total amount, described predetermined number and the default anchor-frame overlapping percentages that comprise according to this audio data sequence, determines the data total amount that voice data subsequence comprises; Wherein, described anchor-frame overlapping percentages represents the accounting in the data total amount that the time quantity of voice data that has of upper two adjacent voice data subsequences comprises voice data subsequence;
According to the data total amount that described anchor-frame overlapping percentages and the voice data subsequence determined comprise, this audio data sequence is carried out cutting, obtain described multiple voice data subsequence, record is carried out to described multiple voice data subsequence putting in order in time.
8. The device according to claim 7, characterized in that the feature extraction unit is specifically configured to:
calculate the total amount of data comprised in each audio data subsequence by the following formula:
X = K*T/(N - N*P + P)
wherein N is the predetermined number, P is the frame overlap percentage, T is the time length of the audio data sequence, K is the sampling rate used when sampling the original audio to obtain the audio data sequence, K*T is the total amount of data comprised in the audio data sequence, and X is the total amount of data comprised in each audio data subsequence.
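The claim 8 formula can be checked with a short sketch. The function and argument names below are illustrative, not from the patent; only the formula itself comes from claim 8:

```python
def subsequence_size(k_rate, t_seconds, n_subseqs, p_overlap):
    """Samples per subsequence: X = K*T / (N - N*P + P).

    N subsequences of X samples, with each adjacent pair sharing P*X
    samples, together cover the K*T samples of the whole sequence:
    N*X - (N - 1)*P*X = K*T, which rearranges to the formula above.
    """
    total = k_rate * t_seconds  # K*T: samples in the full sequence
    return total / (n_subseqs - n_subseqs * p_overlap + p_overlap)

# e.g. an 8 kHz, 10 s sequence cut into 20 subsequences with 50% overlap
x = subsequence_size(8000, 10, 20, 0.5)  # 80000 / 10.5 samples each
```

With P = 0 the formula reduces to X = K*T/N, i.e. an even split with no shared samples between neighbours.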
9. The device according to claim 6, characterized in that the feature extraction unit is specifically configured to:
preprocess each audio data subsequence to obtain each preprocessed audio data subsequence; and
extract the specific feature from each preprocessed audio data subsequence respectively;
wherein the preprocessing specifically comprises one or more of the following in combination:
zero-mean processing; pre-emphasis processing; windowing processing.
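The three preprocessing manners named in claim 9 can be sketched as follows. The function names, the pre-emphasis coefficient 0.97, and the Hamming window are common conventions in audio processing, not values fixed by the patent:

```python
import math

def zero_mean(x):
    """Zero-mean processing: remove the DC offset of the subsequence."""
    m = sum(x) / len(x)
    return [s - m for s in x]

def pre_emphasis(x, alpha=0.97):
    """Pre-emphasis: y[n] = x[n] - alpha*x[n-1], boosting high frequencies."""
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]

def hamming_window(x):
    """Windowing: taper the subsequence edges with a Hamming window."""
    length = len(x)
    return [s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1)))
            for n, s in enumerate(x)]

def preprocess(x):
    # one possible combination of the three manners, applied in sequence
    return hamming_window(pre_emphasis(zero_mean(x)))
```

Any subset of the three steps may be applied; the `preprocess` chain above is just one combination.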
10. The device according to claim 6, characterized in that the device further comprises:
a recording unit, configured to record the temporal order of the multiple audio data subsequences when the feature extraction unit cuts the audio data sequence to obtain the multiple audio data subsequences; and
the feature extraction unit is specifically configured to: combine, according to the temporal order of the multiple audio data subsequences recorded by the recording unit, the specific feature of each audio data subsequence into one feature vector.
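The overall cut-extract-combine flow of claims 6 and 10 can be sketched as follows, with the subsequence size taken from the claim 8 formula. All names are illustrative, and `energy` is only a placeholder for whichever specific feature is actually extracted:

```python
def cut_sequence(seq, n_subseqs, p_overlap):
    """Cut seq into n_subseqs equal-size subsequences, in temporal order.

    Subsequence size X = len(seq) / (N - N*P + P), per claim 8; adjacent
    subsequences share P*X samples, so each start advances by X*(1 - P).
    """
    x = len(seq) / (n_subseqs - n_subseqs * p_overlap + p_overlap)
    size = int(round(x))
    step = x * (1 - p_overlap)
    return [seq[int(round(i * step)):int(round(i * step)) + size]
            for i in range(n_subseqs)]

def energy(subseq):
    # placeholder "specific feature": one number per subsequence
    return sum(s * s for s in subseq)

def feature_vector(seq, n_subseqs, p_overlap):
    # the subsequences come back in temporal order, so concatenating
    # their features yields the single feature vector of claim 10
    return [energy(sub) for sub in cut_sequence(seq, n_subseqs, p_overlap)]
```

Because the number of subsequences is fixed at N regardless of the input length, `feature_vector` returns vectors of the same length for audio sequences of different lengths, which is the stated aim of the invention.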
CN201310255723.1A 2013-06-24 2013-06-24 Audio data feature extraction method and device Pending CN104240697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310255723.1A CN104240697A (en) 2013-06-24 2013-06-24 Audio data feature extraction method and device

Publications (1)

Publication Number Publication Date
CN104240697A true CN104240697A (en) 2014-12-24

Family

ID=52228654

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310255723.1A Pending CN104240697A (en) 2013-06-24 2013-06-24 Audio data feature extraction method and device

Country Status (1)

Country Link
CN (1) CN104240697A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106658230A (en) * 2015-10-29 2017-05-10 北京国双科技有限公司 Method and device for dividing video time length
CN106658229A (en) * 2015-10-29 2017-05-10 北京国双科技有限公司 Video duration time dividing method and video duration time dividing device
CN110473519A (en) * 2018-05-11 2019-11-19 北京国双科技有限公司 A kind of method of speech processing and device
CN111009244A (en) * 2019-12-06 2020-04-14 贵州电网有限责任公司 Voice recognition method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090150164A1 (en) * 2007-12-06 2009-06-11 Hu Wei Tri-model audio segmentation
CN101878504A (en) * 2007-08-27 2010-11-03 爱立信电话股份有限公司 Low-complexity spectral analysis/synthesis using selectable time resolution
US7930179B1 (en) * 2002-08-29 2011-04-19 At&T Intellectual Property Ii, L.P. Unsupervised speaker segmentation of multi-speaker speech data
CN102760444A (en) * 2012-04-25 2012-10-31 清华大学 Support vector machine based classification method of base-band time-domain voice-frequency signal

Similar Documents

Publication Publication Date Title
US20200357427A1 (en) Voice Activity Detection Using A Soft Decision Mechanism
US20200004874A1 (en) Conversational agent dialog flow user interface
CN102214464B (en) Transient state detecting method of audio signals and duration adjusting method based on same
JP2019502144A (en) Audio information processing method and device
US9424743B2 (en) Real-time traffic detection
CN106356077B (en) A kind of laugh detection method and device
CN112037764B (en) Method, device, equipment and medium for determining music structure
US11990150B2 (en) Method and device for audio repair and readable storage medium
CN108847222B (en) Speech recognition model generation method and device, storage medium and electronic equipment
WO2021227308A1 (en) Video resource generation method and apparatus
CN103853749A (en) Mode-based audio retrieval method and system
CN104240697A (en) Audio data feature extraction method and device
EP4033483B1 (en) Method and apparatus for testing vehicle-mounted voice device, electronic device and storage medium
CN113542626B (en) Video dubbing method and device, computer equipment and storage medium
CN104978966B (en) Frame losing compensation implementation method and device in audio stream
CN110111811A (en) Audio signal detection method, device and storage medium
CN112331188A (en) Voice data processing method, system and terminal equipment
CN107680584B (en) Method and device for segmenting audio
CN104778221A (en) Music collaborate splicing method and device
CN103390403B (en) The extracting method of MFCC feature and device
CN110312161B (en) Video dubbing method and device and terminal equipment
CN104778958A (en) Method and device for splicing noise-containing songs
CN112423019A (en) Method and device for adjusting audio playing speed, electronic equipment and storage medium
WO2020186695A1 (en) Voice information batch processing method and apparatus, computer device, and storage medium
CN113035234B (en) Audio data processing method and related device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141224