CN111564163A

CN111564163A - RNN-based voice detection method for various counterfeit operations

Info

Publication number: CN111564163A
Application number: CN202010382185.2A
Authority: CN
Inventors: 严迪群; 乌婷婷; 王让定
Original assignee: Ningbo University
Current assignee: Ningbo University
Priority date: 2020-05-08
Filing date: 2020-05-08
Publication date: 2020-08-21
Anticipated expiration: 2040-05-08
Also published as: CN111564163B

Abstract

The invention discloses a method for detecting various fake operations based on RNN, which comprises the following steps: 1) obtaining an original voice sample, performing M kinds of counterfeiting processing on the original voice sample to obtain M voices after counterfeiting operation and 1 original voice without processing, performing feature extraction on the voices to obtain an LFCC matrix of a training voice sample, and sending the LFCC matrix into an RNN classifier network for training to obtain a multi-classification training model; 2) obtaining a section of test voice, extracting the characteristics of the test voice to obtain an LFCC matrix of the test voice data, sending the LFCC matrix into the RNN classifier trained in the step 1) for classification, obtaining an output probability for each test voice, and combining all the output probabilities as a final prediction result: if the prediction result is the original voice, the test voice is recognized as the original voice; if the prediction result is a voice subjected to a certain falsification operation, the test voice is recognized as a falsified voice subjected to a corresponding falsification operation.

Description

RNN-based voice detection method for various counterfeit operations

Technical Field

The invention relates to a voice detection method, in particular to a voice detection method for various false operations based on RNN.

Background

With the continuous enhancement of the functions of the voice editing software, a non-professional person can easily modify the voice content. If a lawless person maliciously forges and modifies the voice, even the modified voice is used in the fields of news reports, judicial evidence collection, scientific research and the like, which brings huge threats and even causes immeasurable influence on social stability. The digital voice evidence obtaining method is used for detecting the counterfeit operation, plays a vital role in identifying the originality and the authenticity of the audio material, and is a key research topic in the current multimedia evidence obtaining field.

Most of the existing digital voice evidence-obtaining detection technologies detect single counterfeiting operation, namely, a prover assumes that the voice to be detected can pass through a certain specific counterfeiting operation. Mengyu Qiao et al propose a detection algorithm based on statistical features of quantized MDCT coefficients and their derivatives, detect up-converted and down-converted MP3 audio files, generate reference audio signals by recompressing and calibrating the audio, and then classify with a support vector machine, experimental results show that the method effectively detects MP3 double compression and can detect digitally forensified audio processing history. For example, Wanglihua et al propose a history detection of tonal modification speech processing based on convolutional neural network, which performs tonal modification by applying four different tonal modification software to three speech libraries, and performs detection in the speech libraries, among the libraries and among tonal modification methods on tonal modification factors of speech by using CNN, and the detection rate reaches over 90%.

The existing digital voice evidence-obtaining detection technology can detect single counterfeiting operation, and the detection rate can reach a high level. However, in practical applications, the forensics often cannot predict the specific operation of counterfeiting, and a false judgment may occur when a certain operation classifier is used for detection.

At present, most of the existing digital evidence obtaining works suitable for various counterfeiting operations are concentrated on the field of digital images, and research on digital voice evidence obtaining is still less. In the field of digital speech, the luviauqi team designs a convolutional neural network model, which can be used for detecting audio processing operations of default settings in two different audio editing software and provides better results. However, although this experiment pioneers the detection of various voice forgery operations, there are some problems that are not negligible, such as too high computational complexity, and an ideal application scenario for the forgery operation.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method for detecting various counterfeit operation voices based on RNN, aiming at the defects in the prior art, and the detection accuracy can be improved.

The technical scheme adopted by the invention for solving the technical problems is as follows: a kind of voice detection method of various forging operations based on RNN, characterized by that: the method comprises the following steps:

1) training a network: obtaining an original voice sample, performing M kinds of counterfeiting processing on the original voice sample to obtain M voices after counterfeiting operation and 1 original voice without processing, performing feature extraction on the M voices after counterfeiting and the 1 original voice to obtain an LFCC matrix of a training voice sample, and sending the LFCC matrix into an RNN classifier network for training to obtain a multi-classification training model;

2) and (3) voice recognition: obtaining a section of test voice, extracting the characteristics of the test voice to obtain an LFCC matrix of the test voice data, sending the LFCC matrix into the RNN classifier trained in the step 1) for classification, obtaining an output probability for each test voice, and combining all the output probabilities as a final prediction result: if the prediction result is the original voice, the test voice is recognized as the original voice; if the prediction result is a voice subjected to a certain falsification operation, the test voice is recognized as a falsified voice subjected to a corresponding falsification operation.

Preferably, in steps 1) and 2), the step of obtaining the LFCC matrix is:

1) FFT: firstly, preprocessing voice, and calculating the spectrum energy E (i, k) of each voice frame after FFT:

where i is the number of speech frames, k is the frequency component, x_i(m) is speech signal data of the i-th frame, N is the number of fourier transforms;

then, calculating the energy of the spectral energy E (i, k) of each frame after passing through a triangular filter bank:

wherein H_i(k) Representing the frequency response of the triangular filter, f (L) is the filtering function of the ith triangular filter, S (i, L) is the spectral line energy after passing through the triangular filter bank, L represents the number of the triangular filter, and L is the total number of the triangular filters;

2) DCT: the output data lfcc (i, n) of each triangular filter bank is calculated using DCT:

wherein n represents a spectral line after DCT of the ith frame;

3) obtaining LFCC statistical moment: taking 12-order LFCC coefficients from LFCC (i, n), calculating a mean value and a correlation coefficient, and obtaining an LFCC matrix extracted from a section of voice, wherein the LFCC matrix is as follows:

wherein x_s,1…x_1,nN LFCCs for the calculated s-th frame of speech data.

Preferably, the RNN classifier includes an LSTM network, a Dropout layer, a full connection layer, and a Softmax layer connected in sequence, where the Dropout layer is connected to the last LSTM network.

Preferably, the LSTM network has two, with parameters set to (64,128) and (128,64), respectively.

Preferably, the LSTM network uses a tanh activation function.

Preferably, the Dropout function value of the Dropout layer is 0.5.

Preferably, the original speech is in WAV format.

Compared with the prior art, the invention has the advantages that: the voice cepstrum features are adopted, the result probability is output in a classified mode through a recurrent neural network, the accuracy of voice detection is improved, the method is more suitable for digital voice carriers, and different forged traces can be recognized; compared with the existing deep learning-based method, the calculation complexity of the method is greatly reduced through the shared parameters in the RNN.

Drawings

FIG. 1 is a diagram illustrating the process of extracting the LFCC statistical moments of the speech detection method according to the embodiment of the present invention;

FIG. 2 is a general framework schematic diagram of a speech detection method according to an embodiment of the present invention;

fig. 3 is a network structure diagram of a voice detection method according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions.

In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the present invention and to simplify the description, but are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and that the directional terms are used for purposes of illustration and are not to be construed as limiting, for example, because the disclosed embodiments of the present invention may be oriented in different directions, "lower" is not necessarily limited to a direction opposite to or coincident with the direction of gravity. Furthermore, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.

A speech detection method based on RNN (recurrent neural network) for various false operations is realized by constructing a recurrent neural network framework based on cepstrum characteristics. Referring to fig. 2, the frame is made up of two parts: firstly, extracting the cepstrum characteristics of a voice sample, and then sending the cepstrum characteristics into a designed network frame for classification, thereby achieving the task of identifying various counterfeiting operations.

Specifically, in the present invention, feature extraction of speech is realized in the following manner. The cepstrum used in the present invention is characterized by Linear Frequency Cepstral Coeffients (LFCC). The speech cepstrum feature is one of the most commonly used feature parameters in speech technology, characterizes the human auditory features, and is widely used for speaker recognition.

LFCC is the average distribution of the band pass filters from low to high frequency. The LFCC statistical moment extraction process of the invention can be seen in FIG. 1:

1) FFT: firstly, preprocessing voice, and calculating the spectral energy E (i, k) of each voice frame after Fast Fourier Transform (FFT):

where i is the number of speech frames, k is the frequency component, x_i(m) is speech signal data of the i-th frame, and N is the number of Fourier transforms.

Calculating the energy of the spectral energy E (i, k) of each frame after passing through a triangular filter bank:

wherein H_i(k) The frequency response of the triangular filter is shown, f (L) is the filtering function of the ith triangular filter, S (i, L) is the spectral line energy after passing through the triangular filter bank, L is the number of the triangular filter, and L is the total number of the triangular filters.

2) DCT: then, the output data lfcc (i, n) of each triangular filter bank is calculated using Discrete Cosine Transform (DCT):

wherein n represents the spectral line after DCT of the ith frame.

3) Obtaining LFCC statistical moment: taking LFCC coefficients of 12 orders from LFCC (i, n), calculating a mean value and a correlation coefficient, wherein the steps can be realized by the existing matlab function, and assuming that a certain segment of preprocessed voice has s frames in common, the LFCC matrix extracted from the segment of voice is as follows:

wherein x_s,1…x_1,nN LFCCs for the calculated s-th frame of speech data.

Referring to fig. 3, the network framework employs RNN classifiers, the selection of the number of network layers of which is crucial for the optimization algorithm, and deeper networks can learn more knowledge, but training at the same time also takes a long time and may be overfitting. Therefore, in the present invention, a network structure of the RNN classifier is proposed as shown in fig. 3. The network structure comprises 2 LSTM networks, parameters are respectively set to (64,128) and (128,64), and the performance of the model is improved by using a tanh activation function. The system also comprises a Dropout layer, a full connection layer (dense) and a Softmax layer which are connected in sequence, wherein the Dropout layer is connected with the last LSTM network. Setting the value of the Dropout function to 0.5 helps to reduce overfitting, and using the Softmax layer (Softmax classifier) to output the probability after the fully connected layer dimensionality reduction. The overall iterative training of the network framework was set to 50 rounds. Certain adjustments may be made during specific training.

Referring to fig. 2 again, the voice detection method includes the following steps:

1) the network framework needs to be trained first. Supposing that M kinds of forgery operations are provided, M kinds of forgery processing are respectively carried out on the original voice to obtain M +1 kinds of voice samples, including the voices after the M kinds of forgery operations and 1 original voice without processing. In the invention, certain constraint is provided for the input of the original voice, and a certain amount of WAV format audio sample library is required to be provided as training data of a network framework. Performing feature extraction on the M +1 voice samples to obtain an LFCC matrix of a training voice sample, and sending the LFCC matrix into a designed RNN classifier network for training to obtain a multi-classification training model; a plurality of original voice samples can be stored in a database, and each original voice sample is subjected to feature extraction and is sent to an RNN classifier for training;

2) then, obtaining a detection recognition result through the trained network framework: when a section of test voice is obtained, feature extraction is carried out on the test voice to obtain an LFCC matrix of the test voice data, and the LFCC matrix is sent into a trained RNN classifier to be classified. Each test voice will get an output probability, and all the output probabilities are combined as the final prediction result. If the prediction result is the original voice, the test voice is recognized as the original voice; if the prediction result is a voice subjected to a certain falsification operation, the test voice is recognized as the falsified voice. The acquirer can judge whether a certain voice is subjected to the counterfeiting operation according to the experimental result.

Claims

1. A kind of voice detection method of various forging operations based on RNN, characterized by that: the method comprises the following steps:

2. The RNN-based voice detection method for multiple false operations according to claim 1, wherein: in steps 1) and 2), the steps of obtaining the LFCC matrix are:

wherein H_i(k) Representing the frequency response of the triangular filter, f (l) being the filter function of the ith triangular filterS (i, L) is spectral line energy after passing through the triangular filter group, L represents the number of the triangular filter, and L is the total number of the triangular filter;

wherein n represents a spectral line after DCT of the ith frame;

wherein x_s,1…x_1,nN LFCCs for the calculated s-th frame of speech data.

3. The RNN-based voice detection method for multiple false operations according to claim 1, wherein: the RNN classifier comprises an LSTM network, and a Dropout layer, a full connection layer and a Softmax layer which are sequentially connected, wherein the Dropout layer is connected with the last LSTM network.

4. An RNN-based multiple false operation voice detection method according to claim 3, wherein: the LSTM network has two, with parameters set to (64,128) and (128,64), respectively.

5. An RNN-based multiple false operation voice detection method according to claim 3, wherein: the LSTM network uses a tanh activation function.

6. An RNN-based multiple false operation voice detection method according to claim 3, wherein: the Dropout function value of the Dropout layer is 0.5.

7. The RNN-based voice detection method for multiple false operations according to claim 1, wherein: the original speech is in WAV format.