CN110414536B: Playback detection method, storage medium, and electronic device

Publication number: CN110414536B (granted version of application publication CN110414536A)
Application number: CN201910646885.5A
Authority: CN (China)
Inventors: 郑方, 徐明星, 程星亮
Assignee: Beijing D Ear Technologies Co., Ltd. (original and current assignee)
Legal status: Active
Classifications

    • G06F18/253 — Pattern recognition; fusion techniques of extracted features
    • G06N3/045 — Neural network architectures; combinations of networks
    • G06V10/40 — Extraction of image or video features


Abstract

Embodiments of the invention provide a data feature extraction method, a playback detection method, a storage medium, and an electronic device. The data feature extraction method comprises: acquiring data to be processed; and extracting feature data from the data to be processed through a deep neural network comprising at least one network building block. The network building block comprises a fusion module and a plurality of parallel processing modules. Each processing module extracts part of the feature data of the data to be processed, the partial feature data extracted by the different processing modules belong to feature data of different dimensions, and the fusion module splices the partial feature data output by the processing modules together as the output data of the network building block. In this way, a small deep neural network for extracting feature data can be constructed from a small number of network building blocks, so that the desired function can be realized with a small deep neural network when data are insufficient or computing power is limited.

Description

Playback detection method, storage medium, and electronic device
Technical Field
Embodiments of the present invention relate to information processing technologies, and in particular, to a data feature extraction method, a recording playback detection method, a storage medium, and an electronic device.
Background
The residual neural network (ResNet) is a convolutional neural network that adopts a network-in-network structure and forms its final architecture by stacking basic residual building blocks. One basic building block of ResNet is shown in fig. 1, where the block is represented by two stacked boxes, each box representing a convolution operation and its label giving (number of input channels, convolution kernel size, number of output channels). In the basic building block shown in fig. 1, each convolution operation has 64 input channels, a 3 × 3 convolution kernel, and 64 output channels.
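As a rough sanity check, the weight count of such a block can be computed directly (a sketch; bias terms and batch-normalization parameters are ignored):

```python
def conv_params(in_ch, k, out_ch):
    """Weight count of a k x k convolution (bias ignored)."""
    return in_ch * k * k * out_ch

# The basic ResNet block of fig. 1: two stacked (64, 3x3, 64) convolutions.
block_params = 2 * conv_params(64, 3, 64)
print(block_params)  # 73728
```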
ResNeXt is an improved version of ResNet that increases the number of paths in each basic building block through a three-step split-transform-merge strategy, improving model performance while keeping the number of model parameters essentially unchanged. Reconstructing the basic ResNet building block of fig. 1 according to this idea yields the building block structure shown in fig. 2, in which the residual modules of 32 parallel paths perform the same kind of feature extraction, and the feature data extracted by the 32 paths are then summed element-wise to obtain the fused feature data.
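The split-transform-merge idea with additive fusion can be sketched in a few lines of pure Python; the path transforms below are toy stand-ins for the residual convolutions, and all names are illustrative:

```python
def resnext_merge(x, paths):
    """Apply every path transform to the same input and sum the
    results element-wise, as in the ResNeXt basic building block."""
    outputs = [path(x) for path in paths]
    return [sum(vals) for vals in zip(*outputs)]

# 32 toy paths that each produce an output of the SAME dimension.
paths = [lambda x, i=i: [v * (i + 1) for v in x] for i in range(32)]
fused = resnext_merge([1.0, 2.0], paths)
# sum over i of (i + 1) = 32 * 33 / 2 = 528, so fused == [528.0, 1056.0]
print(fused)
```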
When the depth of a basic building block is 2, a block constructed according to the ResNeXt idea degenerates into a single, wider dense convolution. The ResNeXt construction is therefore only meaningful for basic building blocks with a depth of at least 3.
On the other hand, when basic building blocks of depth ≥ 3 are used in ResNet, the resulting network is at least about 50 layers deep, whereas smaller neural network models typically use 2-layer basic building blocks. In many situations, such as insufficient training data or limited computing power, a small model is more practical than a large one, and in those cases ResNeXt cannot be applied.
Disclosure of Invention
It is an aim of embodiments of the present invention to provide a data processing scheme to enable the required data processing functionality to be achieved by a small neural network.
According to a first aspect of the embodiments of the present invention, there is provided a data feature extraction method, comprising: acquiring data to be processed; and extracting feature data from the data to be processed through a deep neural network comprising at least one network building block. The network building block comprises a fusion module and a plurality of parallel processing modules; each processing module extracts part of the feature data of the data to be processed, the partial feature data extracted by the different processing modules belong to feature data of different dimensions, and the fusion module splices the partial feature data output by the plurality of processing modules together as the output data of the network building block.
Optionally, the processing module includes a first convolutional layer and a second convolutional layer, an input end of the first convolutional layer receives an input of the processing module, an input end of the second convolutional layer receives an output of the first convolutional layer, and the number of output channels of the first convolutional layer is the same as the number of input channels of the second convolutional layer.
Optionally, the number of input channels of the first convolutional layer is greater than the number of output channels of the first convolutional layer.
Optionally, the number of input channels of the first convolutional layer is the same as the number of output channels of the second convolutional layer.
Optionally, a plurality of the network building blocks are stacked, and a later-stacked network building block has processing modules with a greater number of paths than an earlier-stacked one.
Optionally, the network building block is a residual learning-based network building block.
According to a second aspect of the embodiments of the present invention, there is provided a playback detection method, comprising: acquiring voice data to be processed; extracting feature data of the voice data to be processed according to any one of the aforementioned data feature extraction methods; and performing playback detection on the voice data according to the feature data of the voice data.
According to a third aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions, wherein the program instructions, when executed by a processor, implement the steps of any of the aforementioned data processing methods.
According to a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus, including: the system comprises a processor, a memory, a communication element and a communication bus, wherein the processor, the memory and the communication element are communicated with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to any one of the data feature extraction methods.
According to a fifth aspect of embodiments of the present invention, there is provided a computer program comprising computer program instructions, wherein the program instructions, when executed by a processor, implement the steps of any of the aforementioned data feature extraction methods.
According to the data feature extraction scheme provided by the embodiments of the present invention, a small deep neural network for extracting feature data can be constructed from network building blocks with few layers, so that the desired function can be realized with a small deep neural network when data are insufficient or computing power is limited.
In addition, according to the playback detection scheme provided by the embodiments of the present invention, playback detection of voice data can be realized effectively using a small speech feature extraction neural network built from the network building blocks used in the data feature extraction method, so that the scheme can be used on devices with limited computing power.
Drawings
FIG. 1 is a schematic diagram showing the structure of a basic building block in an existing residual network (ResNet);
fig. 2 is a schematic diagram showing the structure of a basic building block in a ResNeXt network;
FIG. 3 is a diagram illustrating an exemplary structure of a network building block 300 according to an embodiment of the invention;
FIG. 4 is a flow diagram illustrating a data feature extraction method according to some embodiments of the invention;
FIG. 5 is a schematic diagram illustrating an exemplary deep neural network, according to an embodiment of the present invention;
FIG. 6 is a flow diagram illustrating a playback detection method according to some embodiments of the invention;
fig. 7 is a schematic configuration diagram showing an electronic apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings.
In this application, "plurality" means two or more, and "at least one" means one, two or more. Any component, data, or structure referred to in this application, unless explicitly defined as such, may be understood to mean one or more.
Neural networks serve as a general-purpose model in many fields. In some application scenarios, however, few data sets are available for training, and a large neural network trained on them is prone to overfitting; such scenarios are better served by training a small neural network.
According to the general inventive concept of the present invention, in order to use neural networks on devices with limited computing power, or when the data sets available for training are small, at least the following improvement to the network building block is proposed: a multi-path split-transform-merge scheme is adopted, which increases the number of paths available for network learning, sparsifies the connection density, and reduces the number of network parameters. To this end, the exemplary network building block structure shown in fig. 3 is employed.
In the following, the structure of the proposed network building block and the related feature extraction method are explained using a residual-learning-based network building block as an example. However, the general inventive concept is not limited to residual networks and their residual building blocks; it applies to any network building block comprising a plurality of network layers that is used to construct a small neural network.
Referring to fig. 3, the network building block 300 comprises multiple (n) processing modules 310-1 to 310-n connected in parallel and a fusion module 320. The structure is similar to that of the ResNeXt basic building block, but the data processed by the modules, the operations they perform, and their processing results differ from those of the existing ResNeXt basic building block.
The processing modules 310-1 to 310-n each process the data to be processed (such as image data, voice data, or text data) and each acquire part of its feature data. The partial feature data acquired by the different processing modules belong to feature data of different dimensions; that is, the partial feature data acquired by the processing modules 310-1 to 310-n neither overlap nor intersect, and together they form the feature data of the data to be processed. The fusion module 320 splices the partial feature data output by the multi-path processing modules 310-1 to 310-n together as the output data of the network building block 300, i.e., the feature data of the data to be processed.
Unlike the ResNeXt basic building block, each of the processing modules 310-1 to 310-n shown in fig. 3 extracts partial feature data of a different dimension from the data to be processed, and the fusion module 320 accordingly concatenates, rather than sums, the partial feature data output by the multi-path processing modules to obtain the feature data of the data to be processed. Therefore, with the network building block 300 according to the embodiment of the present invention, a small neural network with relatively few layers can be built that achieves a function equivalent to that of a ResNeXt-based network while keeping the number of network parameters roughly unchanged.
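The contrast with ResNeXt can be illustrated with a minimal pure-Python sketch of the building block 300, in which toy stand-ins for the processing modules produce partial features of different dimensions and the fusion step concatenates rather than sums them (all names illustrative):

```python
def building_block_300(x, processing_modules):
    """Each processing module extracts PART of the feature data
    (different, non-overlapping dimensions); the fusion module
    concatenates ("splices") the partial outputs."""
    partials = [module(x) for module in processing_modules]
    fused = []
    for p in partials:          # fusion module 320: concatenation
        fused.extend(p)
    return fused

# Toy modules extracting partial features of different dimensions.
modules = [
    lambda x: [min(x)],          # 1-dim partial feature
    lambda x: [max(x), sum(x)],  # 2-dim partial feature
]
features = building_block_300([3.0, 1.0, 2.0], modules)
print(features)  # [1.0, 3.0, 6.0] -- output dims add up: 1 + 2 = 3
```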
As mentioned above, the network building block 300 may be a residual network building block, constructed and trained on the basis of residual learning between the network layers in the processing modules 310-1 to 310-n.
FIG. 4 is a flow diagram illustrating a data feature extraction method according to some embodiments of the invention.
Referring to fig. 4, in step S410, data to be processed is acquired.
Here, the acquired data to be processed may be any form of data having a content or pattern characteristic, such as text content data, voice data, image data, video data, and the like. In the subsequent processing, feature data thereof is to be extracted from the data to be processed.
In step S420, feature data are extracted from the data to be processed through a deep neural network comprising at least one network building block 300. The network building block 300 comprises a fusion module 320 and multiple parallel processing modules 310-1 to 310-n. Partial feature data of the data to be processed are extracted by the respective processing modules; the partial feature data extracted by the different processing modules belong to feature data of different dimensions; and the partial feature data output by the processing modules 310-1 to 310-n are spliced together by the fusion module 320 as the output data of the network building block 300.
Through the foregoing process, a small deep neural network for extracting feature data can be constructed from network building blocks with few layers, thereby achieving the desired function with a small deep neural network when data are insufficient or computing power is limited.
The processing module 310 may have any number of convolutional layers not less than 2. For example, in the network building block 300 shown in fig. 3, the processing module 310 has two convolutional layers: a first convolutional layer whose input receives the input of the processing module 310, and a second convolutional layer whose input receives the output of the first convolutional layer. Optionally, the first and second convolutional layers are learned by fitting a residual function.
The number of output channels of the first convolution layer is the same as the number of input channels of the second convolution layer. Thus, the second convolutional layer can directly receive and process the output data of the first convolutional layer without any conversion, merging, deletion, padding, and the like.
In order to keep the number of channels unchanged across the network building block 300, optionally, the number of input channels of the first convolutional layer in the processing module 310 is set to be the same as the number of output channels of the second convolutional layer.
In addition, in some neural networks, the number of input channels of the first convolutional layer in the processing module 310 may be set to be greater than the number of output channels of the first convolutional layer to control the number of network parameters in the processing module 310.
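The effect of these channel constraints on the per-path parameter count can be sketched with illustrative numbers (bias terms ignored; the figures are not taken from the patent):

```python
def path_params(block_in, bottleneck, k=3):
    """Weight count of one processing module: the first conv narrows
    block_in -> bottleneck channels (inputs > outputs), the second conv
    widens back bottleneck -> block_in so the block's channel count
    is preserved (bias ignored)."""
    first = block_in * k * k * bottleneck    # first conv
    second = bottleneck * k * k * block_in   # second conv
    return first + second

# Narrowing the first conv shrinks the per-path parameter count:
print(path_params(64, 64))  # 73728 (no narrowing)
print(path_params(64, 8))   # 9216  (8-channel bottleneck)
```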
According to an alternative embodiment of the present invention, a pooling layer for reducing the input size, for example a max-pooling layer, needs to be inserted at appropriate positions in the deep neural network. A plurality of network building blocks 300 are therefore stacked, and a later-stacked network building block 300 has a greater number of processing modules 310 (paths) than an earlier-stacked one, which reduces the loss of information and facilitates modeling.
FIG. 5 is a schematic diagram illustrating an exemplary deep neural network according to an embodiment of the present invention. The deep neural network 500 is built from the network building block 300 shown in fig. 3 and, specifically, follows the ResNet-18 architecture.
Referring to fig. 5, in the deep neural network 500, after one convolutional layer and the maximum pooling layer from the input end, 2 network building blocks 300 of 64-way processing modules, 2 network building blocks 300 of 128-way processing modules, 2 network building blocks 300 of 256-way processing modules, and 2 network building blocks 300 of 512-way processing modules are sequentially stacked.
It can be seen that, among the network building blocks 300 stacked in the aforementioned deep neural network 500, the number of paths of the processing modules doubles every two stacked blocks. With the small deep neural network 500, feature data can be extracted from input data for processing such as detection, classification, and recognition.
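The stacking pattern of fig. 5 can be written out as a small sketch (function name and arguments illustrative):

```python
def make_stack(base_paths=64, doublings=4, blocks_per_stage=2):
    """Path counts for the stacked building blocks: each later stage
    has twice the processing-module paths of the previous one."""
    return [base_paths * 2 ** s
            for s in range(doublings)
            for _ in range(blocks_per_stage)]

print(make_stack())  # [64, 64, 128, 128, 256, 256, 512, 512]
```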
The embodiments of the present invention also provide a computer-readable storage medium storing a program for executing the steps of any one of the aforementioned data feature extraction methods.
Furthermore, an embodiment of the present invention further provides a computer program product including at least one executable instruction, which when executed by a processor is configured to implement any one of the foregoing data feature extraction methods.
On the other hand, speaker verification is a technology for confirming a person's identity from his or her speech. The technology is widely used in many fields, but it is susceptible to malicious attacks.
Current attack techniques fall mainly into four types: imitation attacks, voice conversion attacks, speech synthesis attacks, and playback attacks. In an imitation attack, an attacker mimics the voice of the target speaker in an attempt to enter the system. In a voice conversion attack, an attacker uses computer algorithms to convert his or her own voice into a voice similar to the target speaker's and then attacks. In a speech synthesis attack, the target speaker's voice is synthesized directly by computer. In a playback attack, an attacker records the target speaker's voice in advance and then replays the recording through a playback device (such as a loudspeaker) to attack the system. Of these four attacks, the playback attack is very simple to carry out and requires no expertise, and a large body of literature shows that playback attacks seriously threaten the security of speaker verification; they are therefore a problem to be solved urgently. Moreover, in the application fields of speech and voiceprint recognition, the data sets available for training neural network models are relatively small, which makes these fields particularly suitable for feature extraction and detection with a small neural network.
Therefore, the embodiment of the invention also provides a playback detection method. The method uses a speech feature extraction neural network for extracting feature data from speech data, the speech feature extraction neural network comprising a plurality of stacked aforementioned network building blocks 300.
FIG. 6 is a flow diagram illustrating a playback detection method according to some embodiments of the invention.
Referring to fig. 6, voice data to be processed is acquired in step S610.
In step S620, feature data of the voice data to be processed is extracted according to the aforementioned data feature extraction method.
In step S630, playback detection is performed on the voice data based on the feature data of the voice data extracted in step S620.
In this way, playback detection of voice data is realized efficiently with a small speech feature extraction neural network built from the aforementioned network building blocks 300.
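The three steps of fig. 6 amount to a simple pipeline, which can be sketched as follows; the feature extractor and classifier below are toy placeholders for the trained networks, and all names are illustrative:

```python
def detect_playback(voice_data, extract_features, classify):
    """Pipeline of fig. 6: S610 acquire -> S620 extract -> S630 detect."""
    features = extract_features(voice_data)  # step S620
    return classify(features)                # step S630

# Toy stand-ins for the neural feature extractor and the detector:
is_replay = detect_playback(
    [0.2, 0.9, 0.4],
    extract_features=lambda x: [sum(x), max(x)],
    classify=lambda f: f[1] > 0.5,  # threshold on a toy feature
)
print(is_replay)  # True
```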
For model training of the speech feature extraction neural network, a suitable loss function and optimization method can be selected, and the model hyper-parameters can be tuned with a suitable hyper-parameter search algorithm.
The loss function must be chosen for the problem at hand. For sentence-level or frame-level playback detection, cross entropy may be selected as the loss function; for other tasks, the squared error loss or the like may also be used.
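For the binary case (genuine vs. replayed speech), the cross-entropy loss for one utterance can be sketched as follows (a minimal stdlib example, not the patent's implementation):

```python
import math

def cross_entropy(p_pred, label):
    """Binary cross-entropy for one utterance: label 1 = replayed
    recording, 0 = genuine speech; p_pred is P(replayed)."""
    eps = 1e-12  # numerical floor to avoid log(0)
    p = min(max(p_pred, eps), 1 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

# A confident correct prediction gives low loss, a wrong one high loss:
print(cross_entropy(0.9, 1))  # ~0.105
print(cross_entropy(0.9, 0))  # ~2.303
```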
Parameter estimation for the speech feature extraction neural network is typically done with gradient-based back-propagation. Various optimization algorithms are available, such as, but not limited to, stochastic gradient descent (SGD), Adam, AdaGrad, and RMSProp, and any of them may be used to optimize the neural network.
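A single vanilla SGD update, the simplest of the optimizers listed above, can be sketched as:

```python
def sgd_step(params, grads, lr=0.01):
    """One vanilla SGD update: theta <- theta - lr * grad."""
    return [p - lr * g for p, g in zip(params, grads)]

params = [0.5, -0.3]
grads = [1.0, 2.0]
print(sgd_step(params, grads))  # approximately [0.49, -0.32]
```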
The hyper-parameters of the model, by contrast, cannot be estimated directly by an optimization algorithm and must be tuned manually. Methods such as grid search, random search, and Bayesian optimization may be used for hyper-parameter tuning.
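Grid search, the simplest of the methods mentioned, can be sketched as follows; the search space and score function are toy illustrations (in practice the score would be the model's development-set performance after training with each configuration):

```python
from itertools import product

def grid_search(score_fn, grid):
    """Exhaustive grid search: evaluate every hyper-parameter
    combination and return the best one."""
    keys = sorted(grid)
    combos = (dict(zip(keys, vals))
              for vals in product(*(grid[k] for k in keys)))
    return max(combos, key=score_fn)

grid = {"lr": [0.1, 0.01, 0.001], "batch": [32, 64, 128]}
# Toy score that peaks at lr=0.01, batch=64:
score = lambda c: -abs(c["lr"] - 0.01) - abs(c["batch"] - 64) / 1000
print(grid_search(score, grid))  # {'batch': 64, 'lr': 0.01}
```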
Model training of the speech feature extraction neural network is performed on a training set, its generalization performance is monitored on a development set, and an appropriate time to stop training is chosen according to that performance. The performance of the trained speech feature extraction neural network is finally tested on a test set.
The overall training process of the model of the speech feature extraction neural network is as follows:
1. selecting a suitable loss function;
2. selecting a proper optimization algorithm;
3. selecting a set of hyper-parameters;
4. training the current model by using the current loss function, the optimization algorithm and the hyper-parameter, and stopping training at a proper time according to the performance of the current model on a development set;
5. observe the performance of the current model; if it is satisfactory, stop. Otherwise, reselect a set of hyper-parameters based on the performance of the current model and of previous models, and repeat step 4 to continue model training.
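Step 4's "stop training at a proper time" is typically early stopping on the development set, which can be sketched as follows (a toy example with made-up scores):

```python
def train_with_early_stopping(dev_scores, patience=2):
    """Stop training once the development-set score has not improved
    for `patience` consecutive epochs; return (best_epoch, best_score)."""
    best_epoch, best_score, waited = 0, float("-inf"), 0
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            best_epoch, best_score, waited = epoch, score, 0
        else:
            waited += 1
            if waited >= patience:
                break  # dev performance stalled: stop training
    return best_epoch, best_score

# Dev-set accuracy per epoch (toy numbers): improves, then degrades.
print(train_with_early_stopping([0.70, 0.80, 0.85, 0.84, 0.83, 0.82]))
# -> (2, 0.85): training stops after two epochs without improvement
```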
Embodiments of the present invention further provide a computer-readable storage medium storing a program for executing any of the foregoing playback detection methods.
Furthermore, embodiments of the present invention also provide a computer program product comprising at least one executable instruction, which when executed by a processor, is configured to implement any of the foregoing playback detection methods.
The embodiments of the present invention also provide an electronic device. Fig. 7 is a schematic diagram illustrating the structure of an electronic device 700 according to an embodiment of the present invention. The electronic device 700 may be, for example, a mobile terminal, a personal computer (PC), a tablet, or a server. As shown in fig. 7, the electronic device 700 includes a memory and one or more processors, communication elements, and the like, for example: one or more central processing units (CPUs) 701 and/or one or more graphics processors (GPUs) 713, which can perform various appropriate actions and processes according to executable instructions stored in a read-only memory (ROM) 702 or loaded from a storage section 708 into a random access memory (RAM) 703. The communication element includes a communication component 712 and/or a communication interface 709. The communication component 712 may include, but is not limited to, a network card, such as an IB (InfiniBand) card; the communication interface 709 includes an interface such as a LAN network interface card or a modem and performs communication processing via a network such as the Internet.
The processor may communicate with the read-only memory 702 and/or the random access memory 703 to execute the executable instructions, connect to the communication component 712 through the communication bus 704, and communicate with other target devices through the communication component 712, thereby performing operations corresponding to any data feature extraction method provided by the embodiments of the present invention, for example: acquiring data to be processed; and extracting feature data from the data to be processed through a deep neural network comprising at least one network building block, wherein the network building block comprises a fusion module and a plurality of parallel processing modules, each processing module extracts part of the feature data of the data to be processed, the partial feature data extracted by the different processing modules belong to feature data of different dimensions, and the fusion module splices the partial feature data output by the plurality of processing modules together as the output data of the network building block.
In addition, the RAM 703 may store various programs and data necessary for the operation of the device. The CPU 701 or GPU 713, the ROM 702, and the RAM 703 are connected to one another through the communication bus 704. When the RAM 703 is present, the ROM 702 is an optional module: at runtime, the executable instructions are stored in the RAM 703 or written into the ROM 702, and they cause the processor to perform the operations corresponding to the above-described method. An input/output (I/O) interface 705 is also connected to the communication bus 704. The communication component 712 may be integrated, or may be configured with multiple sub-modules (e.g., multiple IB cards) linked over the communication bus.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication interface 709 including a network interface card such as a LAN card, modem, or the like. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
It should be noted that the architecture shown in fig. 7 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 7 may be selected, deleted, added or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication element may be separately set, or the GPU and the CPU may be integrated, and so on. These alternative embodiments are all within the scope of the present disclosure.
In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program tangibly embodied on a machine-readable medium; the computer program comprises program code for performing the methods illustrated in the flowcharts, and the program code may include instructions corresponding to the method steps provided by the embodiments of the invention, for example: executable code for acquiring data to be processed; and executable code for extracting feature data from the data to be processed through a deep neural network comprising at least one network building block, wherein the network building block comprises a fusion module and a plurality of parallel processing modules, each processing module extracts part of the feature data of the data to be processed, the partial feature data extracted by the different processing modules belong to feature data of different dimensions, and the fusion module splices the partial feature data output by the plurality of processing modules together as the output data of the network building block.
In such an embodiment, the computer program may be downloaded and installed from a network via the communication element, and/or installed from the removable medium 711. When executed by the central processing unit (CPU) 701, the computer program performs the above-described functions defined in the methods of the embodiments of the present invention.
The electronic device of the embodiment of the present invention may be configured to implement the corresponding data feature extraction method or recording playback detection method in the above embodiments, and each device in the electronic device may be configured to perform each step in the above method embodiments. For example, the data feature extraction method or the recording playback detection method described above may be implemented by a processor of the electronic device calling related instructions stored in a memory; for brevity, no further description is provided here.
It should be noted that, according to implementation requirements, each component/step described in the present application may be divided into more components/steps, and two or more components/steps or partial operations thereof may also be combined into new components/steps to achieve the purpose of the embodiments of the present invention.
The methods and apparatuses, electronic devices, and storage media of the present disclosure may be implemented in many ways. For example, they may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order of the steps of the methods is for illustration only; the steps of the methods of the embodiments of the present invention are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the embodiments of the present invention. Thus, the present disclosure also covers a recording medium storing a program for executing a method according to an embodiment of the present invention.
The foregoing description of the embodiments has been presented for purposes of illustration and description, but is not intended to be exhaustive or to limit the invention to the forms disclosed; many modifications and variations will be apparent to those skilled in the art. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, and to enable others of ordinary skill in the art to understand the disclosure in its various embodiments, with the various modifications suited to the particular use contemplated.

Claims (8)

1. A playback detection method, comprising:
acquiring voice data to be processed;
extracting feature data from the voice data through a deep neural network comprising at least one network building block, wherein the network building block comprises a fusion module and a plurality of parallel processing modules, the processing modules respectively extract partial feature data of the voice data, the partial feature data extracted by different processing modules belong to different feature dimensions, the fusion module splices together the partial feature data output by the processing modules as the output data of the network building block, and each processing module comprises two convolutional layers;
and performing recording playback detection on the voice data according to the feature data of the voice data.
2. The method of claim 1, wherein the two convolutional layers comprise a first convolutional layer and a second convolutional layer, the input of the first convolutional layer receives the input of the processing module, the input of the second convolutional layer receives the output of the first convolutional layer, and the number of output channels of the first convolutional layer is the same as the number of input channels of the second convolutional layer.
3. The method of claim 2, wherein the number of input channels of the first convolutional layer is greater than the number of output channels of the first convolutional layer.
4. The method of claim 3, wherein the number of input channels of the first convolutional layer is the same as the number of output channels of the second convolutional layer.
5. The method according to any one of claims 1 to 4, wherein a plurality of said network building blocks are stacked, and a later-stacked network building block has a greater number of processing modules than an earlier-stacked network building block.
6. The method according to any one of claims 1 to 4, wherein the network building block is a residual learning based network building block.
7. A computer readable storage medium having computer program instructions stored thereon, wherein the program instructions, when executed by a processor, implement the steps of the playback detection method of any of claims 1-6.
8. An electronic device, comprising: the system comprises a processor, a memory, a communication element and a communication bus, wherein the processor, the memory and the communication element are communicated with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the playback detection method as claimed in any one of claims 1-6.
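The channel relations recited in claims 2-5 can be checked with a small bookkeeping sketch. This is a hypothetical illustration only: the names `ConvLayer` and `ProcessingModule` are invented, only channel counts are tracked (no convolution arithmetic), and the residual shortcut of claim 6 is not modeled.

```python
class ConvLayer:
    """Stands in for a convolutional layer; only channel counts matter here."""
    def __init__(self, in_channels, out_channels):
        self.in_channels = in_channels
        self.out_channels = out_channels

class ProcessingModule:
    """Two stacked convolutional layers forming a bottleneck."""
    def __init__(self, channels, bottleneck_channels):
        # Claim 3: the first layer narrows the channels (in > out).
        assert channels > bottleneck_channels
        self.first = ConvLayer(channels, bottleneck_channels)
        # Claim 2: output channels of the first layer equal the input
        # channels of the second layer.
        self.second = ConvLayer(self.first.out_channels, channels)
        # Claim 4: input channels of the first layer equal the output
        # channels of the second layer.
        assert self.first.in_channels == self.second.out_channels

# Claim 5: later-stacked building blocks have more processing modules.
blocks = [[ProcessingModule(64, 16) for _ in range(n)] for n in (2, 4, 8)]
module_counts = [len(b) for b in blocks]
print(module_counts)  # strictly increasing along the stack
assert all(a < b for a, b in zip(module_counts, module_counts[1:]))
```

The specific channel numbers (64, 16) and module counts (2, 4, 8) are made up for the sketch; the claims only fix the inequalities and equalities, not the values.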
CN201910646885.5A 2019-07-17 2019-07-17 Playback detection method, storage medium, and electronic device Active CN110414536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910646885.5A CN110414536B (en) 2019-07-17 2019-07-17 Playback detection method, storage medium, and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910646885.5A CN110414536B (en) 2019-07-17 2019-07-17 Playback detection method, storage medium, and electronic device

Publications (2)

Publication Number Publication Date
CN110414536A CN110414536A (en) 2019-11-05
CN110414536B true CN110414536B (en) 2022-03-25

Family

ID=68361736

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910646885.5A Active CN110414536B (en) 2019-07-17 2019-07-17 Playback detection method, storage medium, and electronic device

Country Status (1)

Country Link
CN (1) CN110414536B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105702263A (en) * 2016-01-06 2016-06-22 清华大学 Voice playback detection method and device
CN105869630A (en) * 2016-06-27 2016-08-17 上海交通大学 Method and system for detecting voice spoofing attack of speakers on basis of deep learning
WO2018133034A1 (en) * 2017-01-20 2018-07-26 Intel Corporation Dynamic emotion recognition in unconstrained scenarios
CN108364656A (en) * 2018-03-08 2018-08-03 北京得意音通技术有限责任公司 A kind of feature extracting method and device for speech playback detection
CN108647723A (en) * 2018-05-11 2018-10-12 湖北工业大学 A kind of image classification method based on deep learning network


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Aggregated Residual Transformations for Deep Neural Networks; Saining Xie et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 20171109; see abstract, Section 3.3, and Fig. 3(b) *
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning; Szegedy et al.; arXiv:1602.07261v2; 20160824; pp. 1-12 *
New Advances in General Object Detection: Deformable Convolutional Networks Upgraded Again (通用目标检测技术新进展:可变形卷积网络再升级); Dai Jifeng (代季峰); Artificial Intelligence (《人工智能》); 20190430; pp. 28-39 *

Also Published As

Publication number Publication date
CN110414536A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN109891897B (en) Method for analyzing media content
EP3961441B1 (en) Identity verification method and apparatus, computer device and storage medium
CN111523413B (en) Method and device for generating face image
CN107633218A (en) Method and apparatus for generating image
CN111292262B (en) Image processing method, device, electronic equipment and storage medium
CN111104954A (en) Object classification method and device
CN112231516B (en) Training method of video abstract generation model, video abstract generation method and device
CN112037800A (en) Voiceprint nuclear model training method and device, medium and electronic equipment
CN111557010A (en) Learning device and method, and program
Li et al. Cell dynamic morphology classification using deep convolutional neural networks
CN111046757A (en) Training method and device for face portrait generation model and related equipment
CN109934142A (en) Method and apparatus for generating the feature vector of video
KR102435035B1 (en) The Fake News Video Detection System and Method thereby
CN110414536B (en) Playback detection method, storage medium, and electronic device
CN116152938A (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN117009873A (en) Method for generating payment risk identification model, and method and device for payment risk identification
CN116311451A (en) Multi-mode fusion human face living body detection model generation method and device and electronic equipment
EP4198878A1 (en) Method and apparatus for image restoration based on burst image
CN116468902A (en) Image processing method, device and non-volatile computer readable storage medium
CN116486493A (en) Living body detection method, device and equipment
KR20180135616A (en) Structure of deep network and deep learning based visual image recognition system
CN113741863A (en) Application program generation method based on algorithm model, electronic device and storage medium
CN114120023A (en) Method and device for identifying copied image and computer readable storage medium
CN114510592A (en) Image classification method and device, electronic equipment and storage medium
CN113837374A (en) Neural network generation method, device and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Zheng Fang

Inventor after: Xu Mingxing

Inventor after: Jin Panshi

Inventor after: Cheng Xingliang

Inventor after: Yang Jie

Inventor before: Zheng Fang

Inventor before: Xu Mingxing

Inventor before: Cheng Xingliang