CN113611331A - Transformer voiceprint anomaly detection method


Info

Publication number: CN113611331A
Application number: CN202110872885.4A
Authority: CN (China)
Prior art keywords: data, transformer, voiceprint, model, Mel
Legal status: Withdrawn
Other languages: Chinese (zh)
Inventors: 刘颜鹏, 吴道平, 章海兵, 汪中原
Current and original assignee: Hefei Technological University Intelligent Robot Technology Co ltd
Application filed by Hefei Technological University Intelligent Robot Technology Co ltd

Classifications

    • G10L25/51 — Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
    • G10L25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G06F18/214 — Pattern recognition; analysing; generating training patterns; bootstrap methods, e.g. bagging or boosting


Abstract

The invention discloses a transformer voiceprint anomaly detection method in the technical field of artificial intelligence, comprising the following steps: acquiring the transformer voiceprint data to be detected; denoising the transformer voiceprint data with the denoising model U-net to obtain denoised transformer voiceprint data; extracting Mel spectrum features from the denoised transformer voiceprint data with a Mel spectrum feature extraction method; detecting the Mel spectrum features with the detection model G-MADE to obtain a score for the transformer voiceprint data; and judging whether the transformer is normal according to the score of the transformer voiceprint data. The method removes the relevant noise to ensure data accuracy; the detection module uses a model that reflects the temporal ordering of the data, which better matches the real business scenario; and modeling relies mainly on unlabeled normal sample data, which addresses the scarcity of labeled data, the high labeling cost and the scarcity of abnormal samples in a big-data environment.

Description

Transformer voiceprint anomaly detection method
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a transformer voiceprint anomaly detection method.
Background
Voiceprint recognition is a biometric technology that automatically identifies a speaker from voice parameters in the speech waveform that reflect the speaker's physiological and behavioural characteristics. Its principle is to extract a speaker's unique voice features from a voice sample recorded in advance and store them in a database; at recognition time, the voice to be verified is matched against the features in the database to determine the speaker's identity.
When voiceprint recognition is applied to the industrial field, historical voiceprint data of equipment under various working conditions are collected and labelled, a model is built from the different voiceprint behaviour of the equipment in normal operation and in various fault states, and hidden equipment faults are then recognised automatically in the actual scene. In practice, however, the collected equipment data are almost entirely normal, with very few abnormal samples, so traditional supervised learning cannot be applied well; an unsupervised method, the G-MADE (Group-Masked Autoencoder) self-encoder, is therefore used.
Some existing technical schemes have the following defects:
(1) From the perspective of data preprocessing, most existing schemes consider data under ideal conditions and ignore the data noise present in real scenes (such as car noise, footstep noise and the like). Some schemes do mention noise reduction, but they fuse the noise-reduction process with the construction of the self-encoding model used for anomaly detection, which reduces the interpretability of the noise-reduction effect.
(2) From the perspective of feature extraction, some existing schemes apply natural-language techniques to the voiceprint data, and others use association-rule mining, but neither uses a spectral feature extraction method better suited to voiceprint data. A few schemes do mention spectral feature extraction, but they do not treat the extracted spectral features as image data when building the subsequent model, and they ignore the inherent temporal ordering of the data.
(3) From the perspective of the model, existing schemes use the traditional self-encoder structure, which cannot reconstruct data with the trained model, i.e. the quality of the model cannot be evaluated qualitatively; its pixel-wise learning objective also deviates, to some extent, from the goal of learning the pattern of normal data. Other schemes use a variational self-encoder, which learns the data distribution, can generate data, and remedies the shortcomings of the traditional self-encoder, but it does not model the temporal ordering.
Disclosure of Invention
The invention aims to overcome the defects in the background technology and improve the accuracy of detecting the abnormal voiceprint of the transformer.
In order to achieve the above object, a transformer voiceprint anomaly detection method is adopted, which comprises the following steps:
acquiring transformer voiceprint data to be detected;
denoising the transformer voiceprint data by utilizing a denoising model U-net to obtain the denoised transformer voiceprint data;
performing feature extraction on the transformer voiceprint data after the noise is removed by using a Mel frequency spectrum feature extraction method to obtain Mel frequency spectrum features;
detecting Mel frequency spectrum characteristics by using a detection model G-MADE to obtain scores of transformer voiceprint data;
and judging whether the transformer is normal or not according to the score of the transformer voiceprint data.
Further, the loss function of the denoising model U-net adopts weighted SDR loss, and the formula is as follows:
[Equation image: weighted SDR loss]

where ŷ is the output of the denoising model U-net, y is the actual label value, x is the input value of the denoising model U-net, and ∝ denotes proportionality.
Further, before acquiring the transformer voiceprint data to be detected, the method further includes:
collecting continuous transformer equipment voiceprint data as original data;
adding noise to the continuous transformer equipment voiceprint data to obtain noisy data;

and taking the noisy data and the original data as the input and the output of the denoising model U-net, respectively, and training the denoising model U-net to obtain the trained denoising model U-net.
Further, using the Mel frequency spectrum feature extraction method to perform feature extraction on the denoised transformer voiceprint data to obtain the Mel frequency spectrum features includes:
performing framing processing on the denoised transformer voiceprint data to obtain multi-frame data;
windowing each frame of data to obtain windowed data;
carrying out short-time Fourier transform on the windowed data to convert it from the time domain to the frequency domain, converting the frequency to the Mel scale, and obtaining a frequency spectrum sequence;
and performing Mel feature extraction on the spectrum sequence by using a Mel filter bank comprising k filters to obtain the Mel spectrum features.
Further, the functional form of the Mel filters in the Mel filter bank is as follows:
H_m(k) = 0,                                      k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),         f(m-1) <= k <= f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),         f(m) <= k <= f(m+1)
H_m(k) = 0,                                      k > f(m+1)

where k is the frequency bin to be evaluated, m indexes the m-th Mel filter, f(m) is the centre (mean) frequency of the m-th filter, f(m-1) is its minimum (lower-edge) frequency, and f(m+1) is its maximum (upper-edge) frequency.
Further, the Mel spectrum feature representation form obtained by extraction is as follows:
[Equation image: Mel spectrum feature expression]

where F(ω) denotes the short-time Fourier transform and F⁻¹ denotes the inverse Fourier transform.
Further, the loss function of the detection model G-MADE is defined as the overall negative log-likelihood:

loss(W, V) = - Σ_{d=1}^{D} log p(x_d; W, V)

where p(.) is a normal distribution or a mixture of normal distributions, D is the total number of samples, x_d is the d-th sample, and W and V are the model weight matrices.
Further, before acquiring the transformer voiceprint data to be detected, the method further includes:
performing Mel frequency spectrum feature extraction on the original data output by the denoising model U-net;
and training the detection model G-MADE by using the extracted Mel frequency spectrum characteristics to obtain the trained detection model G-MADE.
Further, the determining whether the transformer is normal according to the score of the transformer voiceprint data includes:
comparing the score of the transformer voiceprint data with a set detection threshold;
if the score is larger than the detection threshold, determining that the transformer is normal;
and if the score is less than or equal to the detection threshold value, determining that the transformer is abnormal.
Compared with the prior art, the invention has the following technical effects: during data preprocessing, in addition to the traditional methods (normalisation, outlier handling, interpolation and the like), a U-net self-encoder is used to denoise the voiceprint data from the actual scene; during feature extraction, the Mel spectrum method is used to extract the spectral features of the voiceprint while preserving the time-series relation of the data; and in model construction, a G-MADE (Group-Masked Autoencoder) self-encoder is used to learn the distribution of the time-series data.
Drawings
The following detailed description of embodiments of the invention refers to the accompanying drawings in which:
FIG. 1 is a flow chart of a transformer voiceprint anomaly detection method;
FIG. 2 is a block diagram of the denoising model U-net;
FIG. 3 is a detailed flow chart of the STFT;
FIG. 4 is a diagram of a convolution operation block;
FIG. 5 is a graph of water pump voiceprint data for a 10s period;
FIG. 6 is a diagram of voiceprint data framing;
FIG. 7 is a schematic diagram of the overlap between two frames of data;
FIG. 8 is a schematic view of frame data windowing;
FIG. 9 is a filter bank image containing 10 filters;
FIG. 10 is a schematic flow chart of framing, windowing, Fourier transforming, Mel feature extraction for time domain voiceprint data;
FIG. 11 is a diagram showing the structure of the detection model G-MADE;
fig. 12 is a schematic overall flow chart of a transformer voiceprint anomaly detection method.
Detailed Description
To further illustrate the features of the present invention, refer to the following detailed description of the invention and the accompanying drawings. The drawings are for reference and illustration purposes only and are not intended to limit the scope of the present disclosure.
As shown in fig. 1, the present embodiment discloses a transformer voiceprint anomaly detection method, which includes the following steps S1 to S5:
S1, acquiring transformer voiceprint data to be detected;
S2, denoising the transformer voiceprint data by utilizing a denoising model U-net to obtain the denoised transformer voiceprint data;
S3, performing feature extraction on the denoised transformer voiceprint data by using a Mel frequency spectrum feature extraction method to obtain Mel frequency spectrum features;
S4, detecting the Mel frequency spectrum features by using a detection model G-MADE to obtain a score for the transformer voiceprint data;
S5, judging whether the transformer is normal or not according to the score of the transformer voiceprint data.
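The five steps can be read as a simple sequential pipeline. The sketch below only illustrates that flow, not the patented implementation; the denoise_fn, feature_fn and score_fn callables are placeholders standing in for the trained U-net, the Mel feature extractor and the trained G-MADE model described later.

```python
import numpy as np

def detect_transformer_anomaly(waveform, denoise_fn, feature_fn, score_fn, threshold):
    """Illustrative S1-S5 pipeline; the three callables are placeholders."""
    clean = denoise_fn(waveform)                 # S2: remove background noise with U-net
    feats = feature_fn(clean)                    # S3: (n_frames, n_mel) Mel-spectrum features
    frame_scores = np.asarray(score_fn(feats))   # S4: one G-MADE score per frame
    score = float(np.median(frame_scores))       # aggregate over frames (mean is also possible)
    return ("normal" if score > threshold else "abnormal"), score   # S5: threshold decision
```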
It should be noted that, in the actual working environment, the voiceprint data collected by the inspection robot may contain noises other than the equipment itself, such as human footsteps, bird calls, human speech and car sounds. If this noise is not processed, the model is likely to flag a segment of equipment voiceprint data containing a bird call as abnormal (because an abnormal pattern is present in the data). In this embodiment, therefore, in addition to the traditional preprocessing methods (normalisation, outlier handling, interpolation and the like), a U-net self-encoder is used to denoise the actual-scene voiceprint data during preprocessing, which ensures the accuracy of data denoising.
Feature extraction from the voiceprint data uses the Mel spectrum feature extraction method. Its important property is that the human ear's perception of frequency in Hz is not linear: it is sensitive to low-frequency tones and insensitive to high-frequency tones. The Mel spectrum rescales frequency so that the transformed scale is perceived by the human ear as linear.
The Mel spectrum features are detected with an unsupervised learning method, the G-MADE (Group-Masked Autoencoder) self-encoder, which implements a self-encoder architecture for distribution learning on data with a sequential relation.
As a further preferred technical solution, as shown in fig. 2, the denoising model U-net in this embodiment consists of a short-time Fourier transform (STFT), convolution operations and an inverse Fourier transform. Here x is the input data of the model (which may or may not contain noise) and ŷ is the estimate of the corresponding original data; both are expressed as vectors, i.e. the amplitude vectors [x_1, x_2, …, x_n] and [y_1, y_2, …, y_n] read from the original wav data. The STFT converts time-domain data such as voiceprint data into frequency-domain data, yielding the X shown in the colour block; the detailed flow of the STFT is shown in fig. 3.

Fourier transform formula:

F(ω) = ∫ f(t) · e^(-jωt) dt

Inverse Fourier transform formula:

f(t) = (1/2π) ∫ F(ω) · e^(jωt) dω

The convolution operation in the denoising model U-net is a module commonly used for images, as shown in fig. 4. The loss function of the denoising model U-net adopts a weighted SDR (source-to-distortion ratio) loss, expressed as follows:

[Equation image: weighted SDR loss]

where ŷ is the output of the denoising model U-net, y is the actual label value, x is the value input to the self-encoder, and ∝ denotes proportionality.
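The exact expression is only available as an equation image in the original filing; the sketch below therefore implements the weighted SDR loss as it is commonly defined in the speech-enhancement literature, with x the noisy input, y the clean target and y_hat the U-net output, and may differ in detail from the patent's formula.

```python
import numpy as np

def weighted_sdr_loss(x, y, y_hat, eps=1e-8):
    """Weighted SDR loss (literature definition, assumed here): a convex
    combination of the negative SDR of the speech estimate and of the
    implied noise estimate z_hat = x - y_hat."""
    def neg_sdr(ref, est):
        return -np.dot(ref, est) / (np.linalg.norm(ref) * np.linalg.norm(est) + eps)
    z, z_hat = x - y, x - y_hat                   # true and estimated noise components
    alpha = np.sum(y**2) / (np.sum(y**2) + np.sum(z**2) + eps)   # energy-based weight
    return alpha * neg_sdr(y, y_hat) + (1.0 - alpha) * neg_sdr(z, z_hat)
```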
As a more preferable embodiment, in step S1: before acquiring transformer voiceprint data to be detected, training a noise-removing model U-net, specifically:
(1-1) collecting continuous transformer equipment voiceprint data as original data;

It should be noted that the original continuous transformer equipment voiceprint data are split into 10 s wav segments (the specific splitting is not limited to 10 s or to wav). Care is taken to ensure, as far as possible, that the collected data are free of noise; these data are used as the output part of the U-net training pairs.
(1-2) adding noise to the continuous transformer equipment voiceprint data to obtain noisy data;

Noise is added to the segments obtained in step (1-1). The open-source data set UrbanSound8K is used as the noise source; it contains 8732 common urban sound clips (≤ 4 s) covering 10 categories such as car horns, alarms, dog barks and music.

The noise is added as follows: for a segment of original 10 s wav data, a random number R is drawn from Bernoulli(1, 0.5) (this distribution produces 0 or 1 with probability 0.5; the 0.5 is a hyper-parameter and can be adjusted). R indicates whether noise is added: if R is 0, no noise is added to this segment; if R is 1, noise is added and the procedure continues.

For example, for a 10 s original voiceprint segment and a 4 s car-noise clip, an integer T_2 can be drawn randomly from (0, 10-4), and the 4 s sound clip is mixed into the original data starting at T_2 seconds, the mixing being a linear superposition (averaging) of the amplitudes.

This is done because, in actual application, the model should remove the noise when the input contains noise and should output the data essentially unchanged when it contains none; some noise-free inputs are therefore also included when the data set is constructed, as illustrated in the sketch below.
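A minimal sketch of this noise-injection step, assuming 1-D NumPy amplitude arrays at a common sample rate; the 0.5 Bernoulli probability and the averaging mix follow the description above, while the function name and argument layout are illustrative only.

```python
import numpy as np

def add_training_noise(clean, noise, p=0.5, rng=None):
    """With probability p, mix a (shorter) noise clip into a clean clip at a
    random offset by amplitude averaging; otherwise return the clip unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() >= p:                          # R = 0: keep the clip noise-free
        return clean.copy()
    start = int(rng.integers(0, len(clean) - len(noise) + 1))   # random offset T_2 (in samples)
    mixed = clean.copy()
    segment = mixed[start:start + len(noise)]
    mixed[start:start + len(noise)] = 0.5 * (segment + noise)   # linear superposition average
    return mixed
```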
(1-3) taking the noisy data and the original data as the input and the output of the denoising model U-net, respectively, and training the denoising model U-net to obtain the trained denoising model U-net.
The training process is as follows: the input data of the denoising model U-net are pairs (x, y), where y is the original data obtained in step (1-1) and x is y after the noise processing of step (1-2). Passing x through the U-net structure yields the estimate ŷ; by minimising the weighted SDR loss, the output ŷ of the U-net is made to approximate the noise-free data y as closely as possible, so that the U-net model learns how to denoise x.
The principle of denoising with U-net is that, through training, the model learns the functional mapping between noisy input data and noise-free output data, so that at inference time, given a segment of data as input, it outputs the denoised version of that segment.
As a more preferable embodiment, in step S3: performing feature extraction on the denoised transformer voiceprint data by using a Mel frequency spectrum feature extraction method to obtain Mel frequency spectrum features specifically comprises the following subdivision steps S31 to S34:

S31, performing framing processing on the denoised transformer voiceprint data to obtain multi-frame data;
in order to improve the continuity of the division in the actual frame division, there is often an overlapping portion between two frames, and the overlapping portion is called a frame shift.
S32, windowing each frame of data to obtain windowed data;
it should be noted that the commonly used window functions include a rectangular window, a gaussian window, a hamming window, etc., and the hamming window function has the form:
ω(n) = 0.54 - 0.46 · cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1

For a frame of data R(n), the windowing calculation is:

W(n) = ω(n) * R(n).
s33, carrying out short-time Fourier transform on the windowed data, and converting the windowed data from a time domain to a Mel frequency to obtain a frequency spectrum sequence;
It should be noted that feature extraction uses the Mel spectrum method: this step converts the time-domain voiceprint data into frequency-domain data and extracts features. The transformation is performed because time-domain voiceprint data look disordered and irregular, and are therefore converted into the frequency domain, which is easier to interpret.
Short-time Fourier transform, namely converting the windowed data from a time domain to a frequency domain by using a Fourier transform formula:
F(ω) = ∫ f(t) · e^(-jωt) dt

while converting the frequency to the Mel scale:

Mel(f) = 2595 * log10(1 + f / 700).
s34, performing Mel feature extraction on the spectrum sequence by using a Mel filter group comprising k filters to obtain Mel spectrum features.
As a further preferred technical solution, the functional form of the Mel filter in the Mel filter bank is as follows:
H_m(k) = 0,                                      k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),         f(m-1) <= k <= f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),         f(m) <= k <= f(m+1)
H_m(k) = 0,                                      k > f(m+1)

where k is the frequency bin to be evaluated, m indexes the m-th Mel filter, f(m) is the centre (mean) frequency of the m-th filter, f(m-1) is its minimum (lower-edge) frequency, and f(m+1) is its maximum (upper-edge) frequency.
For example, a filter bank image containing 10 filters is shown in fig. 9:
Each filter is multiplied point by point with M(ε) and the products are summed, in the same way as windowing, so that the data become a vector of k values. Taking the logarithm of these k values and applying a (discrete) inverse Fourier transform yields the extracted feature vector:
the Mel spectrum feature representation form obtained by extraction is as follows:
[Equation image: Mel spectrum feature expression]

where F(ω) denotes the short-time Fourier transform and F⁻¹ denotes the inverse Fourier transform.
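The chain just described (magnitude spectrum → triangular Mel filters → logarithm → inverse transform) can be sketched as below. This is a generic NumPy illustration under common assumptions (filter edges spaced uniformly on the Mel scale, a real FFT per frame); the patent does not fix these details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)           # Mel(f) = 2595 * log10(1 + f/700)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters H_m(k); edges f(m-1), f(m), f(m+1) are uniform on the Mel scale."""
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_edges = 700.0 * (10.0 ** (mel_edges / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_edges / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, cen, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:cen] = (np.arange(lo, cen) - lo) / max(cen - lo, 1)   # rising edge
        fb[m - 1, cen:hi] = (hi - np.arange(cen, hi)) / max(hi - cen, 1)   # falling edge
    return fb

def mel_features(windowed_frames, n_filters=40, sr=16000):
    """Per-frame features: |STFT| -> Mel filterbank -> log -> inverse transform."""
    n_fft = windowed_frames.shape[1]
    spectrum = np.abs(np.fft.rfft(windowed_frames, axis=1))        # M(ε), one row per frame
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_mel = np.log(spectrum @ fb.T + 1e-10)                      # point-wise multiply and sum, then log
    return np.real(np.fft.ifft(log_mel, axis=1))                   # discrete inverse transform
```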
As a more preferable embodiment, in step S1: before acquiring transformer voiceprint data to be detected, training a detection model G-MADE, which specifically comprises the following steps:
(2-1) subjecting the raw data output by the denoising model U-net in step (1-3) to Mel spectrum feature extraction, as shown in fig. 10.

Feature extraction uses the Mel spectrum method; this step converts the time-domain voiceprint data into frequency-domain data and extracts features. The transformation is performed because time-domain voiceprint data look disordered and irregular and are therefore converted into the frequency domain, which is easier to interpret. For example, a 10 s segment of water-pump voiceprint data is shown in fig. 5. The processing procedure is as follows:

(2-1-1) Framing: the denoised voiceprint data are divided evenly into k equal parts, i.e. k frames. Each frame is one input item in the subsequent modelling, and several frames can be combined into one piece of data to improve accuracy. In practice, to improve the continuity of the segmentation, adjacent frames usually overlap; the overlap is called the frame shift, as shown in fig. 6;

(2-1-2) Windowing: each frame of data is windowed, i.e. multiplied point by point with a window function, as shown in fig. 7;

(2-1-3) Short-time Fourier transform: the windowed data are converted from the time domain to the frequency domain with the Fourier transform formula, and the frequency is simultaneously converted to the Mel scale, giving the spectrum sequence M(ε), as shown in fig. 8;

(2-1-4) Mel spectrum feature vectors are extracted from the computed spectrum sequence M(ε).
And (2-2) training the detection model G-MADE by using the extracted Mel frequency spectrum characteristics to obtain the trained detection model G-MADE.
As a further preferred solution, as shown in fig. 11, the left side (Autoencoder) of the detection model G-MADE is an auto-encoder model that can be used to learn a data distribution (e.g. a variational auto-encoder, VAE), where V, W1 and W2 are weight matrices. MADE solves the problem of dependence on the temporal order of the data by adding constraint MASKs to the weights. A MASK is simply a matrix of 0s and 1s (black is 0, white is 1): in the figure, the parameter matrix V is multiplied point by point with the MASK matrix Mv, so that some weights in V become 0 (are inactivated), giving the actual connection graph on the right. Because the second row of Mv is all 0 (black), the connections from the top-row units to the second node (node 1) in the right-hand graph disappear. The time-ordering dependency can be realised through careful design of the MASKs; for example, for the top-right node (node 2) in the figure, the only complete connection path leads to the middle input node (node 1), i.e. x3 depends only on x2 and not on x1 (in this example the sequential order of the time series could be x2 → x3 → x1). The authors of the model give the following design principle for the MASKs:

[Equation image: MASK design principle]
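Since the MASK design principle is only shown as an equation image here, the sketch below gives the textbook MADE mask construction (Germain et al., 2015), with the input ordering taken as the time order of the frames; it is an assumed, generic form rather than the patent's exact design.

```python
import numpy as np

def made_masks(n_in, n_hidden, rng=None):
    """Standard MADE masks: unit degrees decide which connections survive."""
    rng = np.random.default_rng() if rng is None else rng
    degrees_in = np.arange(1, n_in + 1)                          # ordering of the input dimensions
    degrees_hid = rng.integers(1, n_in, size=n_hidden)           # hidden-unit degrees in [1, n_in - 1]
    mask_W = (degrees_hid[:, None] >= degrees_in[None, :]).astype(float)  # input -> hidden
    mask_V = (degrees_in[:, None] > degrees_hid[None, :]).astype(float)   # hidden -> output
    return mask_W, mask_V   # multiply element-wise with W and V to inactivate (zero) weights
```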
as a further preferred solution, the loss function of the detection model G-MADE is defined as an overall negative log-likelihood function:
loss(W, V) = - Σ_{d=1}^{D} log p(x_d; W, V)

where p(.) is a normal distribution or a mixture of normal distributions, D is the total number of samples, x_d is the d-th sample, and W and V are the model weight matrices.
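For the case where p(.) is a per-dimension normal distribution whose mean and log-variance are produced by the masked network, the negative log-likelihood of one sample can be written as in the sketch below (a standard Gaussian NLL, assumed here for illustration).

```python
import numpy as np

def gaussian_nll(x, mu, log_var):
    """-log p(x) for independent normal dimensions with parameters (mu, log_var);
    summing this over all D samples (or groups) gives the training objective."""
    return 0.5 * np.sum(log_var + (x - mu) ** 2 / np.exp(log_var) + np.log(2.0 * np.pi))
```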
The above describes the MADE model; the G-MADE model adds grouping. Step (2-1-1) mentioned that several frames can be combined into one piece of data; if such combination is used, G-MADE applies: one piece of data corresponds to one group, each group corresponds to one MADE module, and the likelihood functions are then combined (i.e. summed).
It should be noted that the model learns the distribution characteristics from a large amount of normal-equipment voiceprint data. When voiceprint data from normal equipment are encountered during detection, the loss computed by the model resembles that on the training (normal) data and is relatively low; when the model encounters fault data whose pattern differs from the training data, the loss is relatively high. To match usage habits, the result is described by the negative of the loss, called the score: voiceprint data from normal equipment receive a high score, and data from abnormal equipment receive a low score.
As a more preferable embodiment, in step S5: according to the score of the transformer voiceprint data, whether the transformer is normal or not is judged, and the method comprises the following steps:
comparing the score of the transformer voiceprint data with a set detection threshold;
if the score is larger than the detection threshold, determining that the transformer is normal;
and if the score is less than or equal to the detection threshold value, determining that the transformer is abnormal.
Specifically, with the trained detection model, a segment of voiceprint data to be detected (e.g. a 10 s wav clip as mentioned in step (1-1)) generally contains many frames, say N, so the model outputs N scores s1, s2, …, sN; the mean or median of these N scores is taken as the final score of the segment. A suitable detection threshold T is chosen from the score distribution of the training data, for example the lower 5% quantile (or, if label information is available, by using evaluation metrics of a classification model such as precision, recall or AUC). If the score of the data under test is greater than T, the model infers that the data are normal; if it is smaller than T, the model infers that they are abnormal.
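A small sketch of this scoring and thresholding step; the quantile value and the choice of median aggregation are the examples given above, and the function names are illustrative.

```python
import numpy as np

def clip_score(frame_scores, use_median=True):
    """Aggregate the N per-frame scores s1..sN of one clip into a single score."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    return float(np.median(frame_scores) if use_median else np.mean(frame_scores))

def pick_threshold(training_scores, quantile=0.05):
    """Detection threshold T, e.g. the lower 5% quantile of the normal training scores."""
    return float(np.quantile(np.asarray(training_scores, dtype=float), quantile))

def is_normal(frame_scores, threshold):
    return clip_score(frame_scores) > threshold     # score > T -> normal, otherwise abnormal
```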
It should be noted that, as shown in fig. 12, this embodiment first performs offline model building: the denoising model U-net and the detection model G-MADE are built on Python's TensorFlow framework, and two model files are produced, whose format may vary (e.g. .pb, .h5 or others). Mel spectrum feature extraction uses a Python library; it produces no model file, but a fixed set of processing parameters is recorded. Online inference is then performed on the voiceprint data to be detected; the intermediate data-communication technology is not restricted and may, for example, be based on Python's Flask framework or an MQTT framework.
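As one possible shape of the online-inference side, the sketch below exposes the scoring decision through a Flask endpoint. The route, payload format and threshold handling are purely hypothetical; the patent only states that Flask or MQTT may carry the data.

```python
from flask import Flask, jsonify, request
import numpy as np

app = Flask(__name__)
app.config["DETECTION_THRESHOLD"] = 0.0   # placeholder; set from the training-score quantile

@app.route("/detect", methods=["POST"])
def detect():
    # Expects JSON {"frame_scores": [...]} already produced by the G-MADE model.
    frame_scores = np.asarray(request.get_json()["frame_scores"], dtype=float)
    score = float(np.median(frame_scores))
    status = "normal" if score > app.config["DETECTION_THRESHOLD"] else "abnormal"
    return jsonify({"score": score, "status": status})
```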
The invention has the following beneficial effects:
(1) During modelling, mainly unlabelled normal sample data are used, which addresses the scarcity of labelled data, the high labelling cost and the scarcity of abnormal samples in a big-data environment:
in an actual working environment, a large amount of equipment voiceprint data can be acquired through the voiceprint acquisition equipment of the inspection robot, but most of the data are data under normal operation of the equipment, and few or no abnormal data exist. The invention constructs a model under the condition of only using normal data (or adding a small amount of abnormal data), and the model can identify the abnormal data.
The running modes of the equipment in the normal state are stable and uniform, but the abnormal state is various, and the model can detect various abnormal states under the condition of lacking abnormal data.
(2) The detection module uses a model which can embody the data time sequence relation and is more in line with objective business scenes.
(3) In the actual working environment, the voiceprint data collected by the inspection robot may be mixed with noises other than the equipment itself, such as human footsteps, bird calls, human speech and car sounds. By processing this noise, the invention enables the model to identify the equipment voiceprint data more accurately.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (9)

1. A transformer voiceprint anomaly detection method is characterized by comprising the following steps:
acquiring transformer voiceprint data to be detected;
denoising the transformer voiceprint data by utilizing a denoising model U-net to obtain the denoised transformer voiceprint data;
performing feature extraction on the transformer voiceprint data after the noise is removed by using a Mel frequency spectrum feature extraction method to obtain Mel frequency spectrum features;
detecting Mel frequency spectrum characteristics by using a detection model G-MADE to obtain scores of transformer voiceprint data;
and judging whether the transformer is normal or not according to the score of the transformer voiceprint data.
2. The method for detecting the abnormal voiceprint of the transformer according to claim 1, wherein a loss function of the denoising model U-net adopts a weighted SDR loss, and the formula is as follows:
[Equation image: weighted SDR loss]

where ŷ is the output of the denoising model U-net, y is the actual label value, x is the input value of the denoising model U-net, and ∝ denotes proportionality.
3. The method for detecting the abnormal voiceprint of the transformer according to claim 1, wherein before the step of acquiring the voiceprint data of the transformer to be detected, the method further comprises the following steps:
collecting continuous transformer equipment voiceprint data as original data;
adding noise to the continuous transformer equipment voiceprint data to obtain noisy data;

and taking the noisy data and the original data as the input and the output of the denoising model U-net, respectively, and training the denoising model U-net to obtain the trained denoising model U-net.
4. The method for detecting the abnormal voiceprint of the transformer according to claim 1, wherein the step of performing feature extraction on the de-noised voiceprint data of the transformer by using a Mel spectral feature extraction method to obtain Mel spectral features comprises the following steps:
performing framing processing on the denoised transformer voiceprint data to obtain multi-frame data;
windowing each frame of data to obtain windowed data;
carrying out short-time Fourier transform on the windowed data to convert it from the time domain to the frequency domain, converting the frequency to the Mel scale, and obtaining a frequency spectrum sequence;
and performing Mel feature extraction on the spectrum sequence by using a Mel filter bank comprising k filters to obtain the Mel spectrum features.
5. The transformer voiceprint anomaly detection method according to claim 4, wherein the functional form of the Mel filter in the Mel filter bank is as follows:
H_m(k) = 0,                                      k < f(m-1)
H_m(k) = (k - f(m-1)) / (f(m) - f(m-1)),         f(m-1) <= k <= f(m)
H_m(k) = (f(m+1) - k) / (f(m+1) - f(m)),         f(m) <= k <= f(m+1)
H_m(k) = 0,                                      k > f(m+1)

where k is the frequency bin to be evaluated, m indexes the m-th Mel filter, f(m) is the centre (mean) frequency of the m-th filter, f(m-1) is its minimum (lower-edge) frequency, and f(m+1) is its maximum (upper-edge) frequency.
6. The transformer voiceprint anomaly detection method according to claim 4, wherein the Mel frequency spectrum feature representation form obtained by extraction is as follows:
[Equation image: Mel spectrum feature expression]

where F(ω) denotes the short-time Fourier transform and F⁻¹ denotes the inverse Fourier transform.
7. The transformer voiceprint anomaly detection method of claim 1, wherein the loss function of the detection model G-MADE is defined as the overall negative log-likelihood:

loss(W, V) = - Σ_{d=1}^{D} log p(x_d; W, V)

where p(.) is a normal distribution or a mixture of normal distributions, D is the total number of samples, x_d is the d-th sample, and W and V are the model weight matrices.
8. The method for detecting the abnormal voiceprint of the transformer according to claim 3, wherein before the step of acquiring the voiceprint data of the transformer to be detected, the method further comprises the following steps:
performing Mel frequency spectrum feature extraction on the original data output by the denoising model U-net;
and training the detection model G-MADE by using the extracted Mel frequency spectrum characteristics to obtain the trained detection model G-MADE.
9. The method for detecting the voiceprint abnormality of the transformer according to claim 1, wherein the step of judging whether the transformer is normal or not according to the score of the voiceprint data of the transformer comprises the following steps:
comparing the score of the transformer voiceprint data with a set detection threshold;
if the score is larger than the detection threshold, determining that the transformer is normal;
and if the score is less than or equal to the detection threshold value, determining that the transformer is abnormal.
Application CN202110872885.4A, priority date 2021-07-30, filing date 2021-07-30 — Transformer voiceprint anomaly detection method; status: Withdrawn; publication: CN113611331A.

Priority Applications (1)

Application CN202110872885.4A, priority date 2021-07-30, filing date 2021-07-30 — Transformer voiceprint anomaly detection method.

Publications (1)

CN113611331A, published 2021-11-05.

Family

ID=78338771

Country Status (1)

CN — CN113611331A.

Cited By (1)

CN114171058A * — Transformer running state monitoring method and system based on voiceprint; 安徽继远软件有限公司; priority date 2021-12-03, publication date 2022-03-11.

* Cited by examiner, † Cited by third party


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination
WW01 — Invention patent application withdrawn after publication (application publication date: 2021-11-05)