CN109326294A - Text-dependent voiceprint key generation method - Google Patents
Text-dependent voiceprint key generation method
- Publication number
- CN109326294A (application CN201811139547.4A)
- Authority
- CN
- China
- Prior art keywords
- voiceprint
- key
- matrix
- training
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0861—Generation of secret information including derivation or calculation of cryptographic keys or passwords
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L9/00—Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
- H04L9/08—Key distribution or management, e.g. generation, sharing or updating, of cryptographic keys or passwords
- H04L9/0861—Generation of secret information including derivation or calculation of cryptographic keys or passwords
- H04L9/0866—Generation of secret information including derivation or calculation of cryptographic keys or passwords involving user or device identifiers, e.g. serial number, physical or biometrical information, DNA, hand-signature or measurable physical characteristics
Abstract
The present invention relates to a text-dependent voiceprint key generation method comprising voiceprint key training and voiceprint key extraction. Voiceprint key training derives a voiceprint key extraction matrix from previously collected voiceprint samples. Voiceprint key extraction preprocesses the voiceprint sample to be extracted and multiplies it by the extraction matrix obtained in training, yielding the voiceprint key. By using speaker- and text-dependent spectrograms, the invention expresses the speaker's vocal characteristics more fully while keeping successive samples more stably similar. On this basis, a matrix that extracts invariant voiceprint features is trained from multiple spectrograms by machine learning; processing subsequent samples with this matrix extracts a more stable voiceprint key. The method is stable, concise, and convenient to use.
Description
Technical field
The invention belongs to the field of cyberspace security and relates to a text-dependent voiceprint key generation method.
Background art
Voiceprint recognition is a relatively mature biometric identification technology. With the rapid development of artificial intelligence in recent years, its accuracy has improved considerably; in low-noise environments recognition accuracy can exceed 96%, and the technology is widely used in identity authentication scenarios.
As voiceprint applications have deepened, the field has begun trying to extract stable digital sequences directly from human voiceprints for use as biometric keys, that is, to generate cryptographic keys of all kinds directly from the voiceprint. Such keys integrate seamlessly with existing password and public/private-key techniques, eliminate the inconvenience and potential security risks of collecting and storing voiceprints, and further enrich the means and methods of network authentication.
Voiceprint biometric key technology has received some study. Chinese invention patent ZL201110003202.8, a file encryption and decryption method based on voiceprints, proposed a scheme for extracting a stable key sequence from voiceprint information; however, that scheme stabilizes voiceprint feature values only with a checkerboard method, its stabilizing effect is limited, and the key length is insufficient. Chinese invention patent ZL201410074511.8, a human voiceprint biometric key generation method, extracts a voiceprint Gaussian model and projects the model's feature parameters into a higher-dimensional space to obtain a stable voiceprint key. The stability of that scheme's key improves markedly on the earlier patent, but for authentication environments with high stability requirements, the stability of the extracted voiceprint biometric key still needs to be improved further.
Summary of the invention
The object of the present invention is to provide a text-dependent voiceprint key generation method.
The invention comprises voiceprint key training and voiceprint key extraction. Voiceprint key training derives a voiceprint key extraction matrix from previously collected voiceprint samples. Voiceprint key extraction preprocesses the sample to be extracted and multiplies it by the extraction matrix obtained in training to obtain the voiceprint key. The specific steps are as follows:
Step 1: voiceprint key training, with the following specific steps:
First, the user records his or her own voice speaking the same text, typically one to three consecutive words, repeated 20 times or more; the count may be adjusted by the user according to training results.
Second, record ten or more different users reading the same text, each repeated 20 times or more; also record ten or more different users reading different texts of similar duration, each repeated 20 times or more.
Third, preprocess the voices recorded in the first and second steps and extract voiceprint spectrograms, as follows:
1) Pre-emphasis: denote the speech time-domain signal S1(n), n = 0, 1, 2, ..., N-1. The pre-emphasis formula is S(n) = S1(n) - a*S1(n-1), with 0.9 < a < 1.0; a is the pre-emphasis coefficient, adjusting the amplitude to be emphasized.
2) Framing: divide the speech signal into frames.
3) Hamming windowing: let the framed speech time-domain signal be S(n), n = 0, 1, 2, ..., N-1, denoting the signal divided into n frames. The signal after applying the Hamming window is S'(n); see formula (1):
S'(n) = S(n) * W(n)    (1);
W(n) = (1 - a) - a*cos(2πn/(N-1)), with a = 0.46; the value of a ranges between 0.3 and 0.7, the specific value determined by experiment and empirical data. W(n) is the Hamming window function; it has a smooth low-pass characteristic and better reflects the frequency characteristics of short-time speech signals.
4) Fast Fourier transform (FFT): apply a radix-2 FFT to the windowed signal S'(n) to obtain the linear spectrum X(n, k); the radix-2 FFT is a standard algorithm in the art. X(n, k) is the spectral energy density of the n-th speech frame, k indexes the spectral bins, and each speech frame corresponds to a time slice on the time axis.
5) Generate the text-dependent voiceprint spectrogram: with time index n as the time axis and k as the frequency axis, express the value |X(n, k)|² as a gray level displayed at the corresponding coordinate point, which constitutes the voiceprint spectrogram. The transformation 10·log10(|X(n, k)|²) gives the spectrogram's dB representation.
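The preprocessing chain above (pre-emphasis, framing, Hamming windowing, FFT, dB conversion) can be sketched in a few lines of NumPy. The frame length, hop size, and pre-emphasis coefficient below are illustrative choices, not values fixed by the patent:

```python
import numpy as np

def voiceprint_spectrogram(signal, frame_len=256, hop=128, a=0.97):
    # Pre-emphasis: S(n) = S1(n) - a*S1(n-1), with 0.9 < a < 1.0
    emphasized = np.append(signal[0], signal[1:] - a * signal[:-1])
    # Framing into overlapping frames
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Hamming window: W(n) = (1 - 0.46) - 0.46*cos(2*pi*n/(N-1))
    window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
    frames = frames * window
    # FFT of each frame gives the linear spectrum X(n, k)
    X = np.fft.rfft(frames, axis=1)
    # dB representation: 10*log10(|X(n,k)|^2), floored to avoid log(0)
    return 10 * np.log10(np.maximum(np.abs(X) ** 2, 1e-12))
```

With a 256-sample frame and 128-sample hop, a 1-second signal at 16 kHz yields a 124-frame by 129-bin spectrogram.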
Fourth, filter and normalize the voiceprint spectrogram. Applicable filters include Gaussian, wavelet, binarization, and other filters common in signal processing; which filter, or combination of filters, to use is chosen by the user according to actual test results. Normalization means unifying spectrogram sizes to a fixed length and width and mapping each pixel value into the range 0-255; standard methods apply throughout, e.g., image resizing can be done with the imresize function of the MATLAB function library.
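The fourth step's normalization can be sketched as follows. The patent suggests MATLAB's imresize; the nearest-neighbour resize here is a dependency-free stand-in, and the 64×64 target size is an assumption for illustration:

```python
import numpy as np

def normalize_spectrogram(spec, out_h=64, out_w=64):
    # Resize to a fixed height x width by nearest-neighbour sampling
    # (stand-in for MATLAB's imresize).
    rows = (np.arange(out_h) * spec.shape[0] / out_h).astype(int)
    cols = (np.arange(out_w) * spec.shape[1] / out_w).astype(int)
    resized = spec[np.ix_(rows, cols)]
    # Rescale pixel values into the 0-255 range
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo if hi > lo else 1.0) * 255.0
```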
Fifth, apply machine learning to the voiceprint spectrograms to obtain the voiceprint invariant-feature learning matrix, i.e., the voiceprint key extraction matrix.
The spectrograms from the fourth step are divided into two classes: the user's own text-dependent spectrograms, and contrast spectrograms mixing other users' readings of the same text with unrelated texts. These are called the positive and negative sample sets.
Let M = [M1, M2] denote the positive and negative sample sets participating in training, Mi = [xi1, xi2, ..., xiL], i ∈ {1, 2}, the i-th class sample set, where i = 1 indexes positive samples and i = 2 negative samples. Each xir ∈ R^d, 1 ≤ i ≤ 2, 1 ≤ r ≤ L, is a one-dimensional column vector: the pixel values of one spectrogram form a two-dimensional matrix, its rows are concatenated in order into a one-dimensional row vector, and transposition yields the column vector xir of length d. R^d denotes the d-dimensional real field, and L is the number of spectrograms (column vectors) in each sample set.
Now, according to the characteristics of the two sample classes, train the voiceprint key extraction matrix W1, W1 ∈ R^(d×dz), giving formula (2):
where m̄1 is the positive-sample mean of the training samples and m̄2 the negative-sample mean. J is the cost function; it reflects the difference of the distances, computed as Euclidean distances, between the training samples projected through the extraction matrix W1 and the means of the positive and negative sample sets.
Let:
Solve for the eigenvalues and eigenvectors of the matrix (H1 - H2) to obtain the extraction matrix W1, that is: (H1 - H2)w = λw, where w is an eigenvector of (H1 - H2) and λ the corresponding eigenvalue.
The eigenvectors {w1, w2, ..., wdz} correspond to eigenvalues {λ1, λ2, ..., λdz} with λ1 ≥ λ2 ≥ ... ≥ λdz ≥ 0; eigenvectors with negative eigenvalues are not included in the construction of W1.
This completes the training of the voiceprint key extraction matrix W1.
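The training step can be sketched as follows. Formula (2) and the definitions of H1 and H2 appear only as images in the source, so the scatter-matrix construction below is an assumed reading of the stated criterion; the eigendecomposition of (H1 - H2) and the discarding of negative eigenvalues follow the text directly:

```python
import numpy as np

def train_extraction_matrix(M1, M2):
    """M1, M2: d x L arrays whose columns are flattened positive /
    negative spectrograms. Returns W1, whose columns are eigenvectors
    of (H1 - H2) with non-negative eigenvalues."""
    m1 = M1.mean(axis=1, keepdims=True)   # positive-sample mean
    m2 = M2.mean(axis=1, keepdims=True)   # negative-sample mean
    # Assumed scatter-style construction of H1, H2 (the patent's exact
    # formula (2) is not reproduced in the text).
    H1 = (M1 - m1) @ (M1 - m1).T
    H2 = (M2 - m2) @ (M2 - m2).T
    eigvals, eigvecs = np.linalg.eigh(H1 - H2)
    # Sort descending and keep only eigenvectors with eigenvalue >= 0
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvecs[:, eigvals >= 0]       # W1 in R^{d x dz}
```

Because np.linalg.eigh returns orthonormal eigenvectors, the columns of W1 remain orthonormal after the negative-eigenvalue columns are dropped.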
Step 2: voiceprint key extraction, with the following specific steps:
1. The user records about 3 seconds of his or her own text-dependent speech.
2. Extract the voiceprint spectrogram, following the third step of Step 1.
3. Filter and normalize the spectrogram, then convert it to matrix form and concatenate its rows in order to obtain the voiceprint vector xt.
4. Left-multiply xt by the transpose of the invariant-feature learning matrix W1 trained in Step 1, i.e., W1ᵀ·xt, obtaining the dz-dimensional voiceprint feature vector xtz; xtz is the stabilized voiceprint feature vector.
5. Apply checkerboard quantization to each component of xtz to further stabilize the feature vector. Denote each component of xtz as xtzi; the quantization formula is given in formula (3):
where D is the grid size of the checkerboard, a positive number whose specific value may be chosen by the user from experience, generally such that Λ(x) falls between 0 and 63; xtzi is a component of xtz and Λ(x) is an integer. Λ(x), the quantized value of xtzi, is the coordinate, relative to the origin, of the grid point nearest to xtzi.
6. Take the first 32 or 64 components of the result vector of step 5 and concatenate them; with each component value between 0 and 64 forming 4 key bits, a 128-bit or 256-bit voiceprint key is formed. This completes the extraction of the voiceprint key.
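The extraction step (projection, checkerboard quantization, bit assembly) can be sketched as follows. Formula (3) is not reproduced in the source, so nearest-grid-point rounding is an assumed reading, and folding each quantized component to 4 bits by modulo 16 is likewise an illustrative assumption:

```python
import numpy as np

def extract_key(W1, xt, D=0.5, n_components=32):
    # Project the flattened spectrogram: x_tz = W1^T . x_t
    xtz = W1.T @ xt
    # Checkerboard quantization: snap each component to the nearest
    # multiple of the grid size D (assumed reading of formula (3)).
    q = np.rint(xtz / D).astype(int)
    # 4 key bits per component; modulo-16 folding is an illustrative
    # assumption.  32 components -> 128 bits, 64 -> 256 bits.
    nibbles = [int(v) % 16 for v in q[:n_components]]
    return ''.join(format(n, '04b') for n in nibbles)
```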
By using speaker- and text-dependent spectrograms, the invention expresses the speaker's vocal characteristics more fully while keeping successive samples more stably similar. On this basis, a voiceprint invariant-feature extraction matrix is trained from multiple spectrograms by machine learning; processing subsequent samples with this matrix extracts a more stable voiceprint key. The method is stable, concise, and convenient to use.
Brief description of the drawings
Fig. 1 is the voiceprint key training flow chart of the invention;
Fig. 2 is the voiceprint spectrogram generation flow chart of the invention;
Fig. 3 is a voiceprint spectrogram of the invention;
Fig. 4 is the voiceprint key extraction flow chart of the invention;
Fig. 5 is a schematic diagram of voiceprint feature machine learning of the invention.
Specific embodiment
A text-dependent voiceprint key generation method comprises voiceprint key training and voiceprint key extraction. Voiceprint key training derives a voiceprint key extraction matrix from previously collected voiceprint samples. Voiceprint key extraction preprocesses the sample to be extracted and multiplies it by the extraction matrix obtained in training, obtaining the voiceprint key. The specific steps are as follows:
Step 1, voiceprint key training, is carried out exactly as described in the Summary above; the training flow is shown in Fig. 1, and the spectrogram extraction process in Figs. 2 and 3. Step 2, voiceprint key extraction, likewise follows the Summary; the extraction flow is shown in Fig. 4.
The invention exploits the high similarity of the voiceprint spectra of the same speaker's text-dependent speech. Voiceprint spectrograms are extracted from text-dependent speech: multiple spectrograms obtained by repeated sampling of the same speaker reading the same text are highly similar, while spectrograms extracted from the same text read by different speakers differ markedly. After spectrogram extraction, common feature information is extracted from the multiple spectrograms by the machine learning method shown in Fig. 5, and after segmented quantization the text-dependent voiceprint key is obtained. The voiceprint key requires no server-side retention of biometric templates and therefore offers higher security, and it can be combined with general-purpose network encryption and decryption algorithms such as AES and RSA, which is convenient for users. The method obtains a more stable voiceprint key: extraction accuracy exceeds 95%, and the key length reaches 256 bits.
Claims (2)
1. A text-dependent voiceprint key generation method, characterized by comprising voiceprint key training and voiceprint key extraction; voiceprint key training derives a voiceprint key extraction matrix from previously collected voiceprint samples; voiceprint key extraction preprocesses the voiceprint sample to be extracted and multiplies it by the key extraction matrix obtained in training to obtain the voiceprint key; the specific steps are as follows:
Step 1: voiceprint key training, with the following specific steps:
First, the user records his or her own voice speaking the same text, typically one to three consecutive words, repeated 20 times or more, the count being adjusted by the user according to training;
Second, record ten or more different users reading the same text, each repeated 20 times or more; record ten or more different users reading different texts of similar duration, each repeated 20 times or more;
Third, preprocess the voices recorded in the first and second steps and extract voiceprint spectrograms, specifically:
1) Pre-emphasis: denote the speech time-domain signal S1(n), n = 0, 1, 2, ..., N-1; the pre-emphasis formula is S(n) = S1(n) - a*S1(n-1), 0.9 < a < 1.0; a is the pre-emphasis coefficient, adjusting the amplitude to be emphasized;
2) Framing: divide the speech signal into frames;
3) Hamming windowing: let the framed speech time-domain signal be S(n), n = 0, 1, 2, ..., N-1, denoting the signal divided into n frames; the signal after applying the Hamming window is S'(n), see formula (1):
S'(n) = S(n) * W(n)    (1);
W(n) = (1 - a) - a*cos(2πn/(N-1)) with a = 0.46; the value of a ranges between 0.3 and 0.7, the specific value determined by experiment and empirical data; W(n) is the Hamming window function, has a smooth low-pass characteristic, and better reflects the frequency characteristics of short-time speech signals;
4) Fast Fourier transform (FFT): apply a radix-2 FFT to the windowed signal S'(n) to obtain the linear spectrum X(n, k); the radix-2 FFT is a standard algorithm in the art; X(n, k) is the spectral energy density of the n-th speech frame, k indexes the spectral bins, and each speech frame corresponds to a time slice on the time axis;
5) Generate the text-dependent voiceprint spectrogram: with time index n as the time axis and k as the frequency axis, express the value |X(n, k)|² as a gray level displayed at the corresponding coordinate point, constituting the voiceprint spectrogram; the transformation 10·log10(|X(n, k)|²) gives the spectrogram's dB representation;
Fourth, filter and normalize the voiceprint spectrogram; applicable filters include Gaussian, wavelet, binarization, and other filters common in signal processing, the specific filter or combination of filters being chosen by the user according to actual test results;
Fifth, apply machine learning to the voiceprint spectrograms to obtain the voiceprint invariant-feature learning matrix, i.e., the voiceprint key extraction matrix;
the spectrograms from the fourth step are divided into two classes: the user's own text-dependent spectrograms, and contrast spectrograms mixing other users' readings of the same text with unrelated texts, called the positive and negative sample sets;
let M = [M1, M2] denote the positive and negative sample sets participating in training, Mi = [xi1, xi2, ..., xiL], i ∈ {1, 2}, the i-th class sample set, where i = 1 indexes positive samples and i = 2 negative samples; xir ∈ R^d, 1 ≤ i ≤ 2, 1 ≤ r ≤ L, is a one-dimensional column vector: the pixel values of one spectrogram form a two-dimensional matrix, its rows are concatenated in order into a one-dimensional row vector, and transposition yields the column vector xir of length d; R^d denotes the d-dimensional real field, and L is the number of spectrograms (column vectors) in each sample set;
now, according to the characteristics of the two sample classes, train the voiceprint key extraction matrix W1, W1 ∈ R^(d×dz), giving formula (2):
where m̄1 is the positive-sample mean of the training samples and m̄2 the negative-sample mean; J is the cost function, reflecting the difference of the Euclidean distances between the training samples projected through W1 and the means of the positive and negative sample sets;
let:
solve for the eigenvalues and eigenvectors of (H1 - H2) to obtain the extraction matrix W1, that is: (H1 - H2)w = λw, where w is an eigenvector of (H1 - H2) and λ the corresponding eigenvalue;
the eigenvectors {w1, w2, ..., wdz} correspond to eigenvalues {λ1, λ2, ..., λdz} with λ1 ≥ λ2 ≥ ... ≥ λdz ≥ 0; eigenvectors with negative eigenvalues are not included in the construction of W1;
this completes the training of the voiceprint key extraction matrix W1;
Step 2: vocal print cipher key-extraction, specific steps are as follows:
Step 1, user enroll itself text related voice, and 3 seconds or so;
Step 2 extracts vocal print sound spectrograph, with specific reference to step 1 third step;
Step 3 the pretreatment such as is filtered to vocal print sound spectrograph, normalizes, vocal print sound spectrograph is then switched to matrix form, and
Sequentially splice by row, obtains vocal print vector xt;
Step 4, with the vocal print invariant feature learning matrix W of step 1 training1, premultiplication step 3 obtains after transposition vocal print vector
xt, i.e. W1 T·xt, obtain dzTie up vocal print feature vector xtz, xtzFor vocal print feature vector after stabilization;
Step 5, to xtzPer one-dimensional component carry out a chessboard method operation, further stablize vocal print feature vector be
Chessboard method operation, steps are as follows:
To xtzEach of dimension component be denoted as xtzi;
The quantization formula is shown in formula (3):
where D is the grid size of the chessboard method, a positive number whose specific value can be chosen empirically by the user, generally so that Λ(x) takes values between 0 and 63; xtzi is a component of xtz, and Λ(x) is an integer value;
Λ(x), the value of xtzi after quantization, is the coordinate of the grid point in the checkerboard closest to xtzi, measured relative to the coordinate origin;
Step 6: take the first 32 or 64 components of the result vector computed in Step 5 and concatenate them in sequence; with each component taking a value in the range 0–63 and forming 4 key bits, a 128-bit or 256-bit voiceprint key is obtained. This completes the extraction of the voiceprint key.
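Steps 3–6 above can be sketched as follows. Since formula (3) is not reproduced here, two details are assumptions: the chessboard quantizer is taken as Λ(x) = floor(x/D), and each quantized component is reduced modulo 16 so that it contributes exactly 4 key bits (32 components then yield a 128-bit key, 64 components a 256-bit key).

```python
import numpy as np

def extract_key(W1, x_t, D=0.5, n_components=32):
    """Sketch of Steps 3-6: project, chessboard-quantize, pack a key.

    W1 is the trained d x dz extraction matrix, x_t the flattened
    voiceprint vector, D the chessboard grid size. The floor
    quantizer and mod-16 nibble packing are assumptions.
    """
    x_tz = W1.T @ x_t                      # stable dz-dim feature vector
    q = np.floor(x_tz / D).astype(int)     # assumed grid-coordinate quantizer
    nibbles = q[:n_components] % 16        # 4 bits per component (assumption)
    bits = "".join(format(int(n), "04b") for n in nibbles)
    return bits                            # n_components * 4 key bits
```

With `n_components=32` this yields a 128-bit key string, and with `n_components=64` (given dz ≥ 64) a 256-bit key.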
2. The text-related voiceprint key generation method according to claim 1, characterized in that: the normalization described in the fourth step means unifying the spectrogram size to a fixed length and width and unifying the value of each spectrogram pixel into the range 0–255; this can be implemented with the imresize function in the MATLAB function library.
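A Python analogue of the imresize-based normalization in claim 2 might look as follows; the 64×64 target size is an assumption, and a simple nearest-neighbour resize stands in for imresize's default interpolation.

```python
import numpy as np

def normalize_spectrogram(S, out_shape=(64, 64)):
    """Resize a spectrogram to a fixed size and rescale pixels to 0-255.

    Nearest-neighbour resampling via integer index arithmetic; the
    target size out_shape is an assumed choice, not the patent's.
    """
    h, w = S.shape
    rows = np.arange(out_shape[0]) * h // out_shape[0]   # source row indices
    cols = np.arange(out_shape[1]) * w // out_shape[1]   # source col indices
    R = S[np.ix_(rows, cols)].astype(float)
    lo, hi = R.min(), R.max()
    if hi > lo:
        R = (R - lo) / (hi - lo) * 255.0                 # rescale to 0-255
    else:
        R = np.zeros_like(R)                             # flat input edge case
    return R
```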
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811139547.4A CN109326294B (en) | 2018-09-28 | 2018-09-28 | Text-related voiceprint key generation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109326294A true CN109326294A (en) | 2019-02-12 |
CN109326294B CN109326294B (en) | 2022-09-20 |
Family
ID=65266096
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811139547.4A Active CN109326294B (en) | 2018-09-28 | 2018-09-28 | Text-related voiceprint key generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109326294B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110223699A (en) * | 2019-05-15 | 2019-09-10 | 桂林电子科技大学 | Speaker identity confirmation method, device and storage medium |
CN110322887A (en) * | 2019-04-28 | 2019-10-11 | 武汉大晟极科技有限公司 | Multi-type audio signal energy feature extraction method |
CN111161705A (en) * | 2019-12-19 | 2020-05-15 | 上海寒武纪信息科技有限公司 | Voice conversion method and device |
CN112908303A (en) * | 2021-01-28 | 2021-06-04 | 广东优碧胜科技有限公司 | Audio signal processing method and device and electronic equipment |
CN113129897A (en) * | 2021-04-08 | 2021-07-16 | 杭州电子科技大学 | Voiceprint recognition method based on attention mechanism recurrent neural network |
CN113179157A (en) * | 2021-03-31 | 2021-07-27 | 杭州电子科技大学 | Text-related voiceprint biological key generation method based on deep learning |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001092974A (en) * | 1999-08-06 | 2001-04-06 | Internatl Business Mach Corp <Ibm> | Speaker recognizing method, device for executing the same, method and device for confirming audio generation |
CN103873254A (en) * | 2014-03-03 | 2014-06-18 | 杭州电子科技大学 | Method for generating human vocal print biometric key |
CN103971690A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Voiceprint recognition method and device |
CN106128465A (en) * | 2016-06-23 | 2016-11-16 | 成都启英泰伦科技有限公司 | Voiceprint recognition system and method |
CN107274890A (en) * | 2017-07-04 | 2017-10-20 | 清华大学 | Voiceprint spectrum extraction method and device |
CN108198561A (en) * | 2017-12-13 | 2018-06-22 | 宁波大学 | Pirated-recording speech detection method based on convolutional neural networks |
CN112786059A (en) * | 2021-03-11 | 2021-05-11 | 合肥市清大创新研究院有限公司 | Voiceprint feature extraction method and device based on artificial intelligence |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001092974A (en) * | 1999-08-06 | 2001-04-06 | Internatl Business Mach Corp <Ibm> | Speaker recognizing method, device for executing the same, method and device for confirming audio generation |
CN103971690A (en) * | 2013-01-28 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Voiceprint recognition method and device |
CN103873254A (en) * | 2014-03-03 | 2014-06-18 | 杭州电子科技大学 | Method for generating human vocal print biometric key |
CN106128465A (en) * | 2016-06-23 | 2016-11-16 | 成都启英泰伦科技有限公司 | Voiceprint recognition system and method |
CN107274890A (en) * | 2017-07-04 | 2017-10-20 | 清华大学 | Voiceprint spectrum extraction method and device |
CN108198561A (en) * | 2017-12-13 | 2018-06-22 | 宁波大学 | Pirated-recording speech detection method based on convolutional neural networks |
CN112786059A (en) * | 2021-03-11 | 2021-05-11 | 合肥市清大创新研究院有限公司 | Voiceprint feature extraction method and device based on artificial intelligence |
Non-Patent Citations (3)
Title |
---|
丁冬兵: "Research on a small-sample voiceprint recognition method under the TL-CNN-GAP model", Computer Knowledge and Technology (《电脑知识与技术》) * |
冯辉宗 et al.: "Identity authentication vector recognition method based on spectrogram features", Journal of Chongqing University (《重庆大学学报》) * |
马义德 et al.: "Application of PCNN-based spectrogram feature extraction in speaker recognition", Computer Engineering and Applications (《计算机工程与应用》) * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110322887A (en) * | 2019-04-28 | 2019-10-11 | 武汉大晟极科技有限公司 | Multi-type audio signal energy feature extraction method |
CN110223699A (en) * | 2019-05-15 | 2019-09-10 | 桂林电子科技大学 | Speaker identity confirmation method, device and storage medium |
CN110223699B (en) * | 2019-05-15 | 2021-04-13 | 桂林电子科技大学 | Speaker identity confirmation method, device and storage medium |
CN111161705A (en) * | 2019-12-19 | 2020-05-15 | 上海寒武纪信息科技有限公司 | Voice conversion method and device |
CN111161705B (en) * | 2019-12-19 | 2022-11-18 | 寒武纪(西安)集成电路有限公司 | Voice conversion method and device |
CN112908303A (en) * | 2021-01-28 | 2021-06-04 | 广东优碧胜科技有限公司 | Audio signal processing method and device and electronic equipment |
CN113179157A (en) * | 2021-03-31 | 2021-07-27 | 杭州电子科技大学 | Text-related voiceprint biological key generation method based on deep learning |
CN113179157B (en) * | 2021-03-31 | 2022-05-17 | 杭州电子科技大学 | Text-related voiceprint biological key generation method based on deep learning |
CN113129897A (en) * | 2021-04-08 | 2021-07-16 | 杭州电子科技大学 | Voiceprint recognition method based on attention mechanism recurrent neural network |
CN113129897B (en) * | 2021-04-08 | 2024-02-20 | 杭州电子科技大学 | Voiceprint recognition method based on attention mechanism cyclic neural network |
Also Published As
Publication number | Publication date |
---|---|
CN109326294B (en) | 2022-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109326294A (en) | Text-related voiceprint key generation method | |
Wu et al. | LVID: A multimodal biometrics authentication system on smartphones | |
Rui et al. | A survey on biometric authentication: Toward secure and privacy-preserving identification | |
Gomez-Barrero et al. | General framework to evaluate unlinkability in biometric template protection systems | |
Galbally et al. | Iris image reconstruction from binary templates: An efficient probabilistic approach based on genetic algorithms | |
Tolosana et al. | BioTouchPass2: Touchscreen password biometrics using time-aligned recurrent neural networks | |
US8862888B2 (en) | Systems and methods for three-factor authentication | |
US9430628B2 (en) | Access authorization based on synthetic biometric data and non-biometric data | |
Galbally et al. | Image quality assessment for fake biometric detection: Application to iris, fingerprint, and face recognition | |
CN110677260B (en) | Authentication method, device, electronic equipment and storage medium | |
CN105512535A (en) | User authentication method and user authentication device | |
CN106302330A (en) | Auth method, device and system | |
CN106503655A (en) | A kind of electric endorsement method and sign test method based on face recognition technology | |
CN103873253B (en) | Method for generating human fingerprint biometric key | |
JP7412496B2 (en) | Living body (liveness) detection verification method, living body detection verification system, recording medium, and training method for living body detection verification system | |
CN113505652A (en) | Living body detection method, living body detection device, electronic apparatus, and storage medium | |
CN112132996A (en) | Door lock control method, mobile terminal, door control terminal and storage medium | |
KR20220123118A (en) | Systems and methods for distinguishing user, action and device-specific characteristics recorded in motion sensor data | |
Zhang et al. | Volere: Leakage resilient user authentication based on personal voice challenges | |
Akasaka et al. | Model-free template reconstruction attack with feature converter | |
Liu et al. | Biohashing for human acoustic signature based on random projection | |
CN220983921U (en) | Recognition device based on face and voiceprint | |
Uzun | Security and Privacy in Biometrics-Based Systems. | |
Mtibaa | Towards robust and privacy-preserving speaker verification systems | |
Korshunov et al. | Vulnerability of Automatic Identity Recognition to Audio-Visual Deepfakes |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |