CN109887496A - Method and system for generating targeted adversarial audio in a black-box scenario - Google Patents
Method and system for generating targeted adversarial audio in a black-box scenario
- Publication number
- CN109887496A CN201910060662.0A
- Authority
- CN
- China
- Prior art keywords
- audio
- orientation
- black box
- particle
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention relates to the field of adversarial example generation, and discloses a method and system for generating targeted adversarial audio in a black-box scenario. The method comprises: (1) selecting a target black-box speech recognition model as the audio recognition model, selecting a source audio and setting an attack target; (2) resampling the source audio according to the audio recognition model's requirement on the input audio sample rate; (3) extracting the MFCC features of the resampled source audio; (4) recognizing the MFCC features with the audio recognition model to obtain a recognition result; (5) setting an objective function, using the particle swarm optimization algorithm to find the optimal noise that minimizes the objective function, and superimposing the optimal noise on the source audio to obtain targeted adversarial audio whose recognition result is the attack target. By adding a small perturbation to the source audio, the method can make the speech recognition model recognize the audio as specific content.
Description
Technical field
The present invention relates to the field of adversarial example generation, and in particular to a method and system for generating targeted adversarial audio in a black-box scenario.
Background technique
Speech recognition is rapidly taking the high ground of the intelligent era. A survey published by Google in the United States shows that about 55% of teenagers between 13 and 18 years old use voice search every day. With the development of big data, machine learning, cloud computing and artificial intelligence, speech recognition is step by step freeing users' hands, and voice input is largely replacing the mouse and keyboard. Together with the popularity of smart mobile devices, voice interaction, as a new mode of human-computer interaction, is attracting growing attention across the IT industry.
Although the development of artificial intelligence has greatly improved the accuracy of speech recognition models, the opaque internal mechanisms of artificial intelligence also bury many security risks in practical applications. When designing a machine learning system, to ensure that the system is safe, reliable and achieves the expected results, we usually consider a specific threat model: a set of assumptions about the capabilities and goals of attackers who attempt to make the machine learning system malfunction. So far, most existing machine learning models have been designed and implemented against a very weak threat model, with little concern for attackers. Although these models can perform very well on natural inputs, recent studies have found that even well-performing models are vulnerable to adversarial examples: after small perturbations imperceptible to humans are added to a sample, the sample can be misclassified with very high confidence. If an adversarial example is classified into a class specified by the attacker, it is called a targeted adversarial example.
Most existing work considers the generation of adversarial images; adversarial audio has rarely been studied, especially targeted adversarial audio in a black-box scenario. In the black-box scenario, the attacker does not know the internal structure or parameters of the attacked model and can only obtain the probability with which an input is classified into each class. Because the information available to the attacker in this scenario is very limited, targeted adversarial audio generation in the black-box scenario has not yet been studied. Since speech recognition models applied in real life are generally black boxes, studying the formation mechanism of black-box adversarial audio examples is necessary for researching corresponding defense methods and enhancing the robustness of speech recognition models in practical applications.
Summary of the invention
The present invention provides a method for generating targeted adversarial audio in a black-box scenario; by adding a small perturbation to the source audio, the method can make a speech recognition model recognize the audio as specific content.
The specific technical solution is as follows:
A method for generating targeted adversarial audio in a black-box scenario, comprising the following steps:
(1) selecting a target black-box speech recognition model as the audio recognition model, selecting a source audio and setting an attack target;
(2) resampling the source audio according to the audio recognition model's requirement on the input audio sample rate;
(3) extracting the MFCC features of the resampled source audio;
(4) recognizing the MFCC features with the audio recognition model to obtain a recognition result;
(5) setting an objective function, using the particle swarm optimization algorithm to find the optimal noise that minimizes the objective function, and superimposing the optimal noise on the source audio to obtain targeted adversarial audio whose recognition result is the attack target.
The black-box speech recognition model is a speech recognition model whose parameters are unknown. The black-box speech recognition model of the present invention classifies speech and outputs a fixed set of classes, for example a command-word recognition model. The attack target is the expected recognition result of the targeted adversarial audio under the black-box speech recognition model. For example, if a targeted adversarial audio sounds like "no" to the human ear but is recognized as "yes" by the black-box speech recognition model, then "yes" is its attack target.
In step (3), the MFCC features are Mel-frequency cepstral coefficients. Since MFCC simulates, to a certain extent, the way the human ear processes speech and applies research results on human auditory perception, using this technique helps improve the performance of speech recognition systems.
Step (3) comprises:
(3-1) pre-emphasizing the pre-processed audio to flatten its spectrum;
(3-2) dividing the audio into frames and multiplying each frame by a Hamming window;
(3-3) applying the fast Fourier transform to each frame to obtain its spectrum, and obtaining the energy spectrum of the audio from the spectrum;
(3-4) passing the energy spectrum of the audio through a bank of Mel-scale triangular filters;
(3-5) computing the log energy output by each triangular filter, and applying the discrete cosine transform to the log energies to obtain the Mel-scale cepstral coefficients up to the MFCC order; extracting the dynamic difference parameters of the audio;
(3-6) obtaining the MFCC features.
Preferably, the parameters of the MFCC feature extraction are: pre-emphasis coefficient 0.97; 512 samples per frame, with an overlap of 171 samples between adjacent frames; windowing parameter 0.46; 512 fast Fourier transform points; 26 triangular filters; MFCC order 16.
In step (5), the goal of the particle swarm optimization algorithm is to find an optimal noise δ such that, after δ is superimposed on the source audio, the result is recognized by the audio recognition model as the attack target.
In step (5), the objective function is:
    g(x + p_i) = max( max_{j≠t} f(x + p_i)_j − f(x + p_i)_t , κ )
where x is the source audio, p_i (i = 1, …, N) is the i-th particle, and N is a positive integer; f(x + p_i)_j is the probability that the audio recognition model classifies the input x + p_i as class j; t is the attack target, and f(x + p_i)_t is the probability that the model classifies x + p_i as t; the parameter κ is a constant less than or equal to 0. κ controls the confidence of the misclassification: a smaller κ means that the generated targeted adversarial audio will be recognized as t with higher confidence, i.e., the attack effect of the generated targeted adversarial audio is better.
In step (5), finding the optimal noise that minimizes the objective function with the particle swarm optimization algorithm comprises:
(5-1) initializing the iteration count to 0, and generating N particles p_i (i = 1, …, N) from a uniform distribution, each particle having the same length as the source audio;
(5-2) superimposing each particle p_i on the source audio x to obtain N audios x + p_i;
(5-3) extracting the MFCC features of each audio x + p_i and recognizing them with the audio recognition model, obtaining the recognition result of each x + p_i and computing its objective value g(x + p_i);
if the recognition result of any x + p_i is the attack target, the attack succeeds and the particle p_i is the optimal noise; otherwise, executing step (5-4);
(5-4) incrementing the iteration count by 1, generating N − 1 particles p_i (i = 1, …, N − 1) from a uniform distribution, and adding the particle with the minimum objective value from the previous round as the seed of the next iteration;
repeating steps (5-2)–(5-3) until the objective function converges; the particle p_i that makes the objective function converge is the optimal noise;
if the objective function has still not converged when the iteration count reaches the set maximum number of iterations, the attack fails.
The present invention also provides a system for generating targeted adversarial audio in a black-box scenario, comprising:
a data pre-processing module, which resamples the source audio data so that the sample rate of the source audio meets the black-box speech recognition model's requirement on the input audio sample rate;
an audio feature extraction module, which extracts the MFCC features of the audio data;
an audio recognition module, which contains the black-box speech recognition model that recognizes the MFCC features of the audio and obtains a recognition result;
a particle swarm optimization module, which contains the objective function, finds the optimal noise with the particle swarm optimization algorithm, and superimposes the optimal noise on the source audio to obtain the targeted adversarial audio.
The system generates targeted adversarial audio using the above targeted adversarial audio generation method.
Compared with the prior art, the present invention has the following beneficial effects:
The targeted adversarial audio generation method of the present invention can generate adversarial audio carrying a small perturbation: to the human ear, the adversarial audio sounds the same as the original audio, yet the speech recognition model recognizes it as specific content. Such adversarial audio provides a basis for an in-depth analysis of the vulnerability of deep-learning-based speech recognition models, and facilitates follow-up research on how to defend against adversarial audio and improve the robustness of speech recognition models.
Description of the drawings
Fig. 1 is an architecture diagram of the targeted adversarial audio generation system;
Fig. 2 is a flow diagram of the targeted adversarial audio generation method;
Fig. 3 is a flow diagram of MFCC feature extraction;
Fig. 4 is a flow diagram of finding the optimal noise with the particle swarm optimization algorithm.
Specific embodiment
The present invention is further described in detail below with reference to the drawings and embodiments. It should be noted that the embodiments described below are intended to facilitate understanding of the present invention and do not limit it in any way.
The system of the present invention for generating targeted adversarial audio in a black-box scenario based on particle swarm optimization comprises four modules: a data pre-processing module, a feature extraction module, an audio recognition module, and an objective function optimization module; the system architecture is shown in Fig. 1.
The process by which the system generates targeted adversarial audio is shown in Fig. 2. Suppose a black-box speech recognition model exists, and the user wants to generate a 1-second audio at a 12 kHz sample rate that sounds like "no" to the human ear (the original audio) but is recognized by this model as "yes" (the target text). The whole process is as follows:
(1) The command-word recognition model provided by Google is taken as the black-box speech recognition model, the audio that sounds like "no" to the human ear as the original audio, and "yes" as the attack target;
(2) The command-word recognition model requires an input sample rate of 16 kHz. According to this input requirement, the data pre-processing module resamples the original audio, i.e., resamples the 12 kHz audio to a 16 kHz audio;
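As an illustration of this resampling step, the following is a minimal NumPy sketch using linear interpolation (a stand-in for the polyphase or FFT resamplers used in practice; the function name is illustrative):

```python
import numpy as np

def resample_audio(audio, orig_sr, target_sr):
    """Resample a 1-D signal by linear interpolation (illustrative only;
    production code would use a polyphase or FFT-based resampler)."""
    n_out = int(round(len(audio) * target_sr / orig_sr))
    t_in = np.linspace(0.0, 1.0, num=len(audio), endpoint=False)
    t_out = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(t_out, t_in, audio)

original = np.random.randn(12000)                  # 1 s of audio at 12 kHz
resampled = resample_audio(original, 12000, 16000)
print(len(resampled))  # 16000
```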
(3) MFCC features are extracted from the pre-processed audio; the MFCC feature extraction process is shown in Fig. 3. The specific extraction process is as follows:
(i) Pre-emphasis. First, the speech signal is passed through a high-pass filter; the result of pre-emphasis is y(n) = x(n) − a·x(n−1), where x(n) is the speech sample at time n and a is the pre-emphasis coefficient, usually set to 0.97. The purpose of pre-emphasis is to eliminate the effect of the vocal cords and lips during vocalization, compensating for the high-frequency components of the speech signal suppressed by the articulatory system while emphasizing the high-frequency formants.
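The pre-emphasis filter y(n) = x(n) − a·x(n−1) is one line of NumPy; a sketch (the first sample is passed through unchanged):

```python
import numpy as np

def preemphasis(x, a=0.97):
    """High-pass pre-emphasis: y(n) = x(n) - a * x(n-1)."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]                    # no predecessor for the first sample
    y[1:] = x[1:] - a * x[:-1]
    return y

# a constant (purely low-frequency) signal is almost cancelled,
# illustrating the high-pass behaviour
y = preemphasis(np.ones(8))
```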
(ii) Framing and windowing. After pre-emphasis, the audio is divided into frames: every 512 samples form one frame, and the overlap between adjacent frames contains 171 samples. Each frame is then multiplied by a Hamming window to increase the continuity from the left end of the frame to its right end; the windowing parameter is a = 0.46.
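With 512-sample frames, a 171-sample overlap (i.e., a hop of 341 samples) and the Hamming window w(n) = (1 − a) − a·cos(2πn/(N − 1)) with a = 0.46, framing and windowing can be sketched as:

```python
import numpy as np

def frame_and_window(signal, frame_len=512, overlap=171, alpha=0.46):
    """Split into overlapping frames and apply a Hamming window
    w(n) = (1 - alpha) - alpha * cos(2*pi*n / (N - 1))."""
    hop = frame_len - overlap                      # 341 samples
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    n = np.arange(frame_len)
    window = (1 - alpha) - alpha * np.cos(2 * np.pi * n / (frame_len - 1))
    return signal[idx] * window

frames = frame_and_window(np.random.randn(16000))  # 1 s at 16 kHz
print(frames.shape)  # (46, 512)
```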
(iii) Fast Fourier transform. After framing and windowing, a fast Fourier transform is applied to each frame to obtain its spectrum. The energy spectrum of the speech signal is then obtained by taking the squared magnitude of the spectrum (the square of the absolute value) and dividing by the number of Fourier transform points, usually set to 512.
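The spectrum-to-energy-spectrum step (squared magnitude divided by the number of FFT points) can be sketched as:

```python
import numpy as np

def energy_spectrum(frames, nfft=512):
    """|FFT|^2 / nfft per frame; the real FFT keeps the
    nfft // 2 + 1 = 257 non-redundant bins."""
    spectrum = np.fft.rfft(frames, n=nfft, axis=-1)
    return (np.abs(spectrum) ** 2) / nfft

pow_spec = energy_spectrum(np.random.randn(46, 512))
print(pow_spec.shape)  # (46, 257)
```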
(iv) Triangular band-pass filtering. The energy spectrum is passed through a bank of Mel-scale triangular filters, which smooths the energy spectrum, eliminates harmonics, and highlights the formants of the original speech. The number of triangular band-pass filters is 26.
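A minimal sketch of such a Mel-scale triangular filter bank for the stated parameters (26 filters, 512-point FFT, 16 kHz sample rate); exact bin-placement conventions vary slightly between implementations:

```python
import numpy as np

def mel_filterbank(n_filters=26, nfft=512, sr=16000):
    """Triangular filters spaced uniformly on the Mel scale,
    mel(f) = 2595 * log10(1 + f / 700)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / sr).astype(int)

    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):              # rising edge
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):             # falling edge (peak = 1 at center)
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

fbank = mel_filterbank()
print(fbank.shape)  # (26, 257)
```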
(v) Computing the log energy of the filter outputs. First, the log energy s(m) of each filter output is computed:
    s(m) = ln( Σ_{k=0}^{N−1} |X(k)|² · H_m(k) ), 1 ≤ m ≤ M
The resulting log energies are then substituted into the discrete cosine transform to obtain the MFCC coefficients:
    C(n) = Σ_{m=1}^{M} s(m) · cos( πn(m − 0.5)/M ), n = 1, 2, …, L
where M is the number of triangular filters, 26; N is the number of Fourier transform points; and L is the MFCC coefficient order, 16.
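The DCT step can be sketched directly from C(n) = Σ_m s(m)·cos(πn(m − 0.5)/M), a type-II DCT of the log filter-bank energies (the small constant added before the logarithm is a numerical safeguard, not part of the formula; the function name is illustrative):

```python
import numpy as np

def mfcc_from_filter_energies(filter_energies, order=16):
    """s(m) = ln(energy of filter m); C(n) = sum_m s(m)*cos(pi*n*(m-0.5)/M)
    for n = 1..order."""
    s = np.log(filter_energies + 1e-12)                 # log energies, M per frame
    M = s.shape[-1]
    m = np.arange(1, M + 1)
    n = np.arange(1, order + 1)
    basis = np.cos(np.pi * np.outer(n, m - 0.5) / M)    # (order, M)
    return s @ basis.T                                  # (n_frames, order)

coeffs = mfcc_from_filter_energies(np.random.rand(46, 26) + 1.0)
print(coeffs.shape)  # (46, 16)
```

A useful sanity check on the basis: perfectly flat filter energies produce all-zero coefficients, since each cosine row sums to zero.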
(vi) Extraction of dynamic difference parameters. The standard cepstral parameters MFCC only reflect the static characteristics of the speech; the dynamic characteristics of the speech can be described by extracting dynamic difference (delta) parameters.
The difference parameters are computed as:
    d_t = C_{t+1} − C_t,  t < K
    d_t = ( Σ_{k=1}^{K} k·(C_{t+k} − C_{t−k}) ) / √( 2·Σ_{k=1}^{K} k² ),  otherwise
    d_t = C_t − C_{t−1},  t ≥ Q − K
where d_t denotes the t-th first-order difference parameter, C_t denotes the t-th cepstral coefficient, Q denotes the order of the cepstral coefficients, and K denotes the time span of the first derivative (taking the value 1 or 2). Applying the same formula to d_t yields the second-order difference parameters of the MFCC.
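The piecewise difference computation can be sketched as follows (using the √(2Σk²) normalization shown above; some references normalize by 2Σk² instead):

```python
import numpy as np

def delta(coeffs, K=2):
    """First-order difference of a cepstral sequence; simple one-step
    differences are used for the K boundary frames at each end."""
    T = len(coeffs)
    d = np.zeros_like(coeffs, dtype=float)
    denom = np.sqrt(2.0 * sum(k * k for k in range(1, K + 1)))
    for t in range(T):
        if t < K:
            d[t] = coeffs[t + 1] - coeffs[t]
        elif t >= T - K:
            d[t] = coeffs[t] - coeffs[t - 1]
        else:
            d[t] = sum(k * (coeffs[t + k] - coeffs[t - k])
                       for k in range(1, K + 1)) / denom
    return d

ramp = np.arange(10.0)       # a linear ramp of cepstral values
d1 = delta(ramp)             # first-order differences
d2 = delta(d1)               # applying the formula again gives second order
```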
(4) The extracted MFCC features are recognized with the command-word recognition model provided by Google; the recognition result is "no", with a confidence of 0.9;
(5) The particle swarm optimization algorithm is used to find the perturbation that minimizes the objective function. Specifically:
To move the swarm in the direction that maximizes the probability of the target class, the objective function is set as:
    g(x + p_i) = max( max_{j≠t} f(x + p_i)_j − f(x + p_i)_t , κ )
where x represents the input audio and p_i (i = 1, …, N) represents particle i, with N particles in total. f(x + p_i)_j represents the probability of class j output by the speech recognition model for the input x + p_i; t represents the class specified by the attacker, so f(x + p_i)_t represents the probability with which the speech model classifies the input as t. The parameter κ controls the confidence of the misclassification and takes a value less than or equal to 0; a smaller κ means that the generated adversarial audio will be recognized as t with higher confidence, i.e., the attack effect of the generated adversarial audio is better.
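A minimal sketch of this objective for one particle, assuming the margin form g = max(max_{j≠t} f_j − f_t, κ) given above (the function and argument names are illustrative):

```python
import numpy as np

def objective(probs, target, kappa=0.0):
    """g = max(max_{j != t} f_j - f_t, kappa): minimizing g pushes the
    target class probability above every other class, down to the
    confidence floor kappa."""
    others = np.delete(probs, target)
    return max(others.max() - probs[target], kappa)

p = np.array([0.1, 0.7, 0.2])
print(objective(p, target=1))  # 0.0  (class 1 already dominates)
print(objective(p, target=0))  # 0.6  (class 0 trails the top class by 0.6)
```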
With the given parameter (κ = 0), the objective function is minimized with the particle swarm optimization algorithm, as shown in Fig. 4, specifically:
(a) The iteration count is first initialized to 0, and 25 random particle sequences are generated from a uniform distribution over [−1, 1]; each particle has the same length as the original audio, namely 16000 points;
(b) Each particle p_i is added to the original audio x to obtain 25 new audios x + p_i; steps (3) and (4) are repeated, recording the recognition result f(x + p_i) of each x + p_i and computing its objective value g(x + p_i);
(c) If the recognition result of any x + p_i is "yes", the attack succeeds and the particle p_i is the desired optimal noise δ;
otherwise, step (d) is executed;
(d) The iteration count is incremented by 1, 24 particles are generated from a uniform distribution over [−1, 1], and the particle with the minimum objective value from the previous round is added as the seed of the next iteration;
Steps (b)–(c) are repeated until the objective function converges; the particle p_i that makes the objective function converge is the desired optimal noise δ;
If the objective function has still not converged when the iteration count reaches the set maximum number of iterations, the attack fails;
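The search loop of steps (a)–(d) can be sketched as below; `classify` stands in for the black-box model's probability output, and the toy model at the end is purely illustrative:

```python
import numpy as np

def find_optimal_noise(x, classify, target, n_particles=25,
                       max_iter=100, kappa=0.0, seed=0):
    """Each round draws fresh uniform particles in [-1, 1], carries the
    best particle of the previous round over as a seed, and stops when
    some x + p_i is classified as the target class."""
    rng = np.random.default_rng(seed)

    def g(p):                                   # objective for one particle
        probs = classify(x + p)
        return max(np.delete(probs, target).max() - probs[target], kappa)

    best = None
    for _ in range(max_iter):
        fresh = n_particles if best is None else n_particles - 1
        particles = [rng.uniform(-1.0, 1.0, size=len(x)) for _ in range(fresh)]
        if best is not None:
            particles.append(best)              # seed from the previous round
        scores = [g(p) for p in particles]
        best = particles[int(np.argmin(scores))]
        if np.argmax(classify(x + best)) == target:
            return best                         # attack succeeded
    return None                                 # attack failed

# toy stand-in for the black-box model: class 1 "wins" when the input mean > 0
toy_model = lambda a: np.array([0.0, 1.0]) if a.mean() > 0 else np.array([1.0, 0.0])
noise = find_optimal_noise(np.full(100, -0.01), toy_model, target=1)
```

With a real command-word model, `classify` would wrap the feature extraction of step (3) and the probability output of step (4).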
(6) The optimal noise δ is superimposed on the original audio x to obtain the adversarial audio, i.e., an audio that sounds like "no" to the human ear but is recognized as "yes" by the speech recognition model.
The embodiments described above explain the technical solution and beneficial effects of the present invention in detail. It should be understood that the above are only specific embodiments of the present invention and are not intended to limit it; any modification, supplement or equivalent replacement made within the spirit of the present invention shall fall within its protection scope.
Claims (7)
1. A method for generating targeted adversarial audio in a black-box scenario, characterized by comprising the following steps:
(1) selecting a target black-box speech recognition model as the audio recognition model, selecting a source audio and setting an attack target;
(2) resampling the source audio according to the audio recognition model's requirement on the input audio sample rate;
(3) extracting the MFCC features of the resampled source audio;
(4) recognizing the MFCC features with the audio recognition model to obtain a recognition result;
(5) setting an objective function, using the particle swarm optimization algorithm to find the optimal noise that minimizes the objective function, and superimposing the optimal noise on the source audio to obtain targeted adversarial audio whose recognition result is the attack target.
2. The method for generating targeted adversarial audio in a black-box scenario according to claim 1, characterized in that the black-box speech recognition model is a speech recognition model that classifies speech and outputs a fixed set of classes.
3. The method for generating targeted adversarial audio in a black-box scenario according to claim 1, characterized in that step (3) comprises:
(3-1) pre-emphasizing the pre-processed audio to flatten its spectrum;
(3-2) dividing the audio into frames and multiplying each frame by a Hamming window;
(3-3) applying the fast Fourier transform to each frame to obtain its spectrum, and obtaining the energy spectrum of the audio from the spectrum;
(3-4) passing the energy spectrum of the audio through a bank of Mel-scale triangular filters;
(3-5) computing the log energy output by each triangular filter, and applying the discrete cosine transform to the log energies to obtain the Mel-scale cepstral coefficients up to the MFCC order; extracting the dynamic difference parameters of the audio;
(3-6) obtaining the MFCC features.
4. The method for generating targeted adversarial audio in a black-box scenario according to claim 3, characterized in that the parameters of the MFCC feature extraction are: pre-emphasis coefficient 0.97; 512 samples per frame, with an overlap of 171 samples between adjacent frames; windowing parameter 0.46; 512 fast Fourier transform points; 26 triangular filters; MFCC order 16.
5. The method for generating targeted adversarial audio in a black-box scenario according to claim 1, characterized in that the objective function is:
    g(x + p_i) = max( max_{j≠t} f(x + p_i)_j − f(x + p_i)_t , κ )
where x is the source audio, p_i (i = 1, …, N) is the i-th particle, and N is a positive integer; f(x + p_i)_j is the probability that the audio recognition model classifies the input x + p_i as class j; t is the attack target, and f(x + p_i)_t is the probability that the model classifies x + p_i as t; the parameter κ is a constant less than or equal to 0.
6. The method for generating targeted adversarial audio in a black-box scenario according to claim 5, characterized in that, in step (5), finding the optimal noise that minimizes the objective function with the particle swarm optimization algorithm comprises:
(5-1) initializing the iteration count to 0, and generating N particles p_i (i = 1, …, N) from a uniform distribution, each particle having the same length as the source audio;
(5-2) superimposing each particle p_i on the source audio x to obtain N audios x + p_i;
(5-3) extracting the MFCC features of each audio x + p_i and recognizing them with the audio recognition model, obtaining the recognition result of each x + p_i and computing its objective value g(x + p_i);
if the recognition result of any x + p_i is the attack target, the attack succeeds and the particle p_i is the optimal noise; otherwise, executing step (5-4);
(5-4) incrementing the iteration count by 1, generating N − 1 particles p_i (i = 1, …, N − 1) from a uniform distribution, and adding the particle with the minimum objective value from the previous round as the seed of the next iteration;
repeating steps (5-2)–(5-3) until the objective function converges; the particle p_i that makes the objective function converge is the optimal noise;
if the objective function has still not converged when the iteration count reaches the set maximum number of iterations, the attack fails.
7. A system for generating targeted adversarial audio in a black-box scenario, characterized by comprising:
a data pre-processing module, which resamples the source audio data so that the sample rate of the source audio meets the black-box speech recognition model's requirement on the input audio sample rate;
an audio feature extraction module, which extracts the MFCC features of the audio data;
an audio recognition module, which contains a black-box speech recognition model that recognizes the MFCC features of the audio and obtains a recognition result;
a particle swarm optimization module, which contains the objective function, finds the optimal noise with the particle swarm optimization algorithm, and superimposes the optimal noise on the source audio to obtain the targeted adversarial audio;
wherein the system generates targeted adversarial audio using the targeted adversarial audio generation method according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910060662.0A CN109887496A (en) | 2019-01-22 | 2019-01-22 | Orientation confrontation audio generation method and system under a kind of black box scene |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910060662.0A CN109887496A (en) | 2019-01-22 | 2019-01-22 | Orientation confrontation audio generation method and system under a kind of black box scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109887496A true CN109887496A (en) | 2019-06-14 |
Family
ID=66926610
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910060662.0A Pending CN109887496A (en) | 2019-01-22 | 2019-01-22 | Orientation confrontation audio generation method and system under a kind of black box scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109887496A (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110444208A (en) * | 2019-08-12 | 2019-11-12 | 浙江工业大学 | A kind of speech recognition attack defense method and device based on gradient estimation and CTC algorithm |
CN110767216A (en) * | 2019-09-10 | 2020-02-07 | 浙江工业大学 | Voice recognition attack defense method based on PSO algorithm |
CN110992951A (en) * | 2019-12-04 | 2020-04-10 | 四川虹微技术有限公司 | Method for protecting personal privacy based on countermeasure sample |
CN111341327A (en) * | 2020-02-28 | 2020-06-26 | 广州国音智能科技有限公司 | Speaker voice recognition method, device and equipment based on particle swarm optimization |
CN111710327A (en) * | 2020-06-12 | 2020-09-25 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and medium for model training and sound data processing |
CN112216296A (en) * | 2020-09-25 | 2021-01-12 | 脸萌有限公司 | Audio anti-disturbance testing method and device and storage medium |
CN113345420A (en) * | 2021-06-07 | 2021-09-03 | 河海大学 | Countermeasure audio generation method and system based on firefly algorithm and gradient evaluation |
CN113362822A (en) * | 2021-06-08 | 2021-09-07 | 北京计算机技术及应用研究所 | Black box voice confrontation sample generation method with auditory masking |
WO2021212675A1 (en) * | 2020-04-21 | 2021-10-28 | 清华大学 | Method and apparatus for generating adversarial sample, electronic device and storage medium |
CN114627858A (en) * | 2022-05-09 | 2022-06-14 | 杭州海康威视数字技术股份有限公司 | Intelligent voice recognition security defense method and device based on particle swarm optimization |
CN116758899A (en) * | 2023-08-11 | 2023-09-15 | 浙江大学 | Speech recognition model safety assessment method based on semantic space disturbance |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105096955A (en) * | 2015-09-06 | 2015-11-25 | 广东外语外贸大学 | Speaker rapid identification method and system based on growing and clustering algorithm of models |
CN105139857A (en) * | 2015-09-02 | 2015-12-09 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Countercheck method for automatically identifying speaker aiming to voice deception |
US10007498B2 (en) * | 2015-12-17 | 2018-06-26 | Architecture Technology Corporation | Application randomization mechanism |
CN108446700A (en) * | 2018-03-07 | 2018-08-24 | 浙江工业大学 | A kind of car plate attack generation method based on to attack resistance |
CN108520268A (en) * | 2018-03-09 | 2018-09-11 | 浙江工业大学 | The black box antagonism attack defense method evolved based on samples selection and model |
CN109036385A (en) * | 2018-10-19 | 2018-12-18 | 北京旋极信息技术股份有限公司 | A kind of voice instruction recognition method, device and computer storage medium |
- 2019
  - 2019-01-22 CN CN201910060662.0A patent/CN109887496A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105139857A (en) * | 2015-09-02 | 2015-12-09 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Countercheck method for automatically identifying speaker aiming to voice deception |
CN105096955A (en) * | 2015-09-06 | 2015-11-25 | 广东外语外贸大学 | Speaker rapid identification method and system based on growing and clustering algorithm of models |
US10007498B2 (en) * | 2015-12-17 | 2018-06-26 | Architecture Technology Corporation | Application randomization mechanism |
CN108446700A (en) * | 2018-03-07 | 2018-08-24 | 浙江工业大学 | A kind of car plate attack generation method based on to attack resistance |
CN108520268A (en) * | 2018-03-09 | 2018-09-11 | 浙江工业大学 | The black box antagonism attack defense method evolved based on samples selection and model |
CN109036385A (en) * | 2018-10-19 | 2018-12-18 | 北京旋极信息技术股份有限公司 | A kind of voice instruction recognition method, device and computer storage medium |
Non-Patent Citations (2)
Title |
---|
MOUSTAFA ALZANTOT: "Did you hear that? Adversarial Examples Against Automatic Speech Recognition", 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA *
PIN-YU CHEN: "ZOO: Zeroth Order Optimization Based Black-box Attacks to Deep Neural Networks without Training Substitute Models", 2017 Association for Computing Machinery *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110444208A (en) * | 2019-08-12 | 2019-11-12 | Zhejiang University of Technology | Speech recognition attack defense method and device based on gradient estimation and the CTC algorithm |
CN110767216A (en) * | 2019-09-10 | 2020-02-07 | Zhejiang University of Technology | Voice recognition attack defense method based on the PSO algorithm |
CN110992951A (en) * | 2019-12-04 | 2020-04-10 | Sichuan Hongwei Technology Co., Ltd. | Method for protecting personal privacy based on adversarial samples |
CN111341327A (en) * | 2020-02-28 | 2020-06-26 | Guangzhou Guoyin Intelligent Technology Co., Ltd. | Speaker voice recognition method, device and equipment based on particle swarm optimization |
WO2021212675A1 (en) * | 2020-04-21 | 2021-10-28 | Tsinghua University | Method and apparatus for generating adversarial samples, electronic device and storage medium |
CN111710327B (en) * | 2020-06-12 | 2023-06-20 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device and medium for model training and sound data processing |
CN111710327A (en) * | 2020-06-12 | 2020-09-25 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device and medium for model training and sound data processing |
CN112216296A (en) * | 2020-09-25 | 2021-01-12 | Lemon Inc. | Audio adversarial perturbation testing method, device and storage medium |
CN112216296B (en) * | 2020-09-25 | 2023-09-22 | Lemon Inc. | Audio adversarial perturbation testing method, device and storage medium |
CN113345420A (en) * | 2021-06-07 | 2021-09-03 | Hohai University | Adversarial audio generation method and system based on the firefly algorithm and gradient evaluation |
CN113362822A (en) * | 2021-06-08 | 2021-09-07 | Beijing Institute of Computer Technology and Application | Black-box voice adversarial sample generation method with auditory masking |
CN114627858A (en) * | 2022-05-09 | 2022-06-14 | Hangzhou Hikvision Digital Technology Co., Ltd. | Intelligent voice recognition security defense method and device based on particle swarm optimization |
CN116758899A (en) * | 2023-08-11 | 2023-09-15 | Zhejiang University | Speech recognition model security assessment method based on semantic-space perturbation |
CN116758899B (en) * | 2023-08-11 | 2023-10-13 | Zhejiang University | Speech recognition model security assessment method based on semantic-space perturbation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109887496A (en) | Targeted adversarial audio generation method and system in a black-box scenario | |
CN109599109B (en) | Adversarial audio generation method and system for a white-box scenario | |
CN110767216B (en) | Voice recognition attack defense method based on PSO algorithm | |
Cui et al. | Data augmentation for deep neural network acoustic modeling | |
Yang et al. | Characterizing speech adversarial examples using self-attention u-net enhancement | |
CN111261147B (en) | Music embedding attack defense method for voice recognition system | |
CN105023573B (en) | Speech syllable/vowel/phone boundary detection using auditory attention cues | |
TW201935464A (en) | Method and device for voiceprint recognition based on memorability bottleneck features | |
WO2020043160A1 (en) | Method and system for detecting voice activity in noisy conditions | |
CN109887484A (en) | Speech recognition and speech synthesis method and device based on dual learning | |
CN102779510A (en) | Speech emotion recognition method based on feature space self-adaptive projection | |
CN109887489A (en) | Speech dereverberation method based on deep features of a generative adversarial network | |
CN113362822B (en) | Black-box voice adversarial sample generation method with auditory masking | |
Xu et al. | Cross-language transfer learning for deep neural network based speech enhancement | |
CN103985390A (en) | Method for extracting speech feature parameters based on gammatone correlograms | |
CN107274887A (en) | Speaker secondary feature extraction method based on the fusion feature MGFCC | |
CN114783418B (en) | End-to-end voice recognition method and system based on sparse self-attention mechanism | |
CN112183582A (en) | Multi-feature fusion underwater target identification method | |
Yi et al. | Audio deepfake detection: A survey | |
Shi et al. | Fusion feature extraction based on auditory and energy for noise-robust speech recognition | |
CN104952446A (en) | Digital building presentation system based on voice interaction | |
Sun et al. | A novel convolutional neural network voiceprint recognition method based on improved pooling method and dropout idea | |
Huang et al. | Research on robustness of emotion recognition under environmental noise conditions | |
CN111462737B (en) | Method for training a grouping model for speech grouping, and speech noise reduction method | |
CN113488069B (en) | Method and device for fast extraction of high-dimensional speech features based on a generative adversarial network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190614 |