WO2021135454A1 - Method, device, and computer-readable storage medium for recognizing fake speech - Google Patents

Method, device, and computer-readable storage medium for recognizing fake speech

Info

Publication number
WO2021135454A1
WO2021135454A1 (PCT/CN2020/118450; CN2020118450W)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
feature map
recognized
layer
convolutional layer
Prior art date
Application number
PCT/CN2020/118450
Other languages
English (en)
French (fr)
Inventor
张超
马骏
王少军
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021135454A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04: Training, enrolment or model building
    • G10L17/06: Decision making techniques; Pattern matching strategies
    • G10L17/14: Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
    • G10L17/18: Artificial neural networks; Connectionist approaches

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a method for recognizing fake speech, a recognition device, a computer device, and a computer-readable storage medium.
  • the automatic speaker verification (ASV) or voiceprint system itself does not have the ability to recognize fake speech, and with the maturity of text-to-speech (TTS) synthesis technology, fake speech on the voice side is becoming increasingly difficult to recognize, including playback of recordings made with high-quality equipment, state-of-the-art speech synthesis, and so on.
  • the inventors realized that the existing implementations have at least the following problem: how to recognize fake speech is an urgent problem to be solved.
  • the purpose of the embodiments of this application is to propose a method for recognizing fake speech, a recognition device, a computer device, and a computer-readable storage medium, so as to solve the security vulnerabilities that may exist in the prior art due to the lack of means for recognizing fake speech.
  • the embodiments of the present application provide a method for recognizing fake speech, a recognition device, a computer device, and a computer-readable storage medium, and the following technical solutions are adopted:
  • an embodiment of the present application provides a method for recognizing fake speech, which may include: acquiring a voice to be recognized;
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into the target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • before the feature map of the voice signal to be recognized is input into the target DenseNet network model, the recognition method may further include: acquiring a real voice data set and a fake voice data set;
  • converting, by STFT transformation, the voice in the real voice data set into voice signal feature maps of a first category and the voice in the fake voice data set into voice signal feature maps of a second category, so as to obtain a first voice signal feature map data set that may include the first-category and second-category voice signal feature maps;
  • training an initial DenseNet network with the first voice signal feature map data set, adjusting the weight parameters of each layer based on a loss function, and locking the weight parameters of each layer when the loss function falls below a preset value, so as to obtain the target DenseNet network model;
  • further, the recognition method may include masking part of the frequency features of some feature maps in the first voice signal feature map data set to convert it into a second voice signal feature map data set, in which case the training of the initial DenseNet network uses the second voice signal feature map data set.
  • the target DenseNet network model may sequentially include a first convolutional layer, a first channel stretch module, a first transition layer, a second channel stretch module, a second transition layer, a third channel stretch module, a first fully connected layer, and a second fully connected layer.
  • the first convolutional layer, the first channel stretch module, the first transition layer, the second channel stretch module, the second transition layer, and the third channel stretch module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output a first feature map.
  • the first fully connected layer and the second fully connected layer are used to further extract the features of the first feature map and to output the binary classification result according to the extracted features.
  • each of the first channel stretch module, the second channel stretch module, and the third channel stretch module may include 4 upper structures and 4 lower structures; each upper structure may sequentially include a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block, and each lower structure may sequentially include a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
  • the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer are all convolutional layers with a kernel size of 1×1; the second convolutional layer and the fifth convolutional layer are used to reduce the number of channels of the input feature maps; the fourth convolutional layer is used to splice the feature maps output by the 4 third convolutional layers and input the result into the first SE block for processing; the seventh convolutional layer is used to splice the feature maps output by the 4 sixth convolutional layers and input the result into the second SE block for processing.
  • the first SE block is used to assign each feature map input from the fourth convolutional layer a corresponding weight per channel, and the second SE block is used to assign each feature map input from the seventh convolutional layer a corresponding weight per channel.
  • performing the STFT transformation processing on the voice to be recognized may include:
  • performing framing and windowing operations on the voice to be recognized, and then performing the STFT transformation processing.
  • an embodiment of the present application provides a fake voice recognition device, and the recognition device may include:
  • the first acquisition module is used to acquire the voice to be recognized
  • the first conversion module is configured to perform STFT conversion processing on the voice to be recognized, and convert the voice to be recognized into a feature map of the voice signal to be recognized;
  • the processing module is used to input the feature map of the voice signal to be recognized into the target DenseNet network model, and output the binary classification result of the voice to be recognized as a real voice or a fake voice.
  • an embodiment of the present application provides a computer device, including a memory and a processor.
  • the memory stores computer-readable instructions.
  • when the processor executes the computer-readable instructions, the following steps of the method for recognizing fake speech are implemented: a voice to be recognized is acquired;
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into a target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • an embodiment of the present application provides a computer-readable storage medium having computer-readable instructions stored on the computer-readable storage medium.
  • when the computer-readable instructions are executed by a processor, the following steps of the method for recognizing fake speech are implemented: a voice to be recognized is acquired;
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into a target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • this solution uses the target DenseNet network model for the speech recognition work; based on the self-learning capability of the neural network, it provides a highly accurate method of automatically identifying fake speech and reduces the generation of security vulnerabilities in ASV or voiceprint systems.
  • FIG. 1 is a schematic diagram of an embodiment of a method for recognizing fake speech provided by an embodiment of the present application
  • Figure 2 is a schematic structural diagram of a target DenseNet network model provided by an embodiment of the present application
  • Fig. 3 is a schematic structural diagram of the first channel expansion module in the target DenseNet network model shown in Fig. 2;
  • Fig. 4 is a schematic structural diagram of the first SE block in the first channel expansion module shown in Fig. 3;
  • FIG. 5 is a schematic diagram of another embodiment of a method for recognizing fake speech according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another embodiment of a method for recognizing fake speech provided by an embodiment of the present application.
  • FIG. 7A is a schematic diagram of an embodiment of a fake speech recognition device provided by an embodiment of the present application.
  • FIG. 7B is a schematic diagram of another embodiment of a fake speech recognition device provided by an embodiment of the present application.
  • FIG. 7C is a schematic diagram of another embodiment of a fake speech recognition device provided by an embodiment of the present application.
  • Fig. 8 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the method for recognizing fake speech includes the following steps:
  • Step S101 Obtain a voice to be recognized.
  • the method for recognizing fake speech runs on an electronic device (for example, a server or terminal device), and the electronic device can collect the voice data to be recognized.
  • the acquired speech to be recognized can be stored in the blockchain.
  • Step S102 Perform STFT conversion processing on the voice to be recognized, and convert the voice to be recognized into a feature map of the voice signal to be recognized.
  • the electronic device may perform a short-time Fourier transform (STFT) on the voice to be recognized, so as to convert the voice data to be recognized into a feature map of the voice signal to be recognized.
  • the conversion process may sequentially include framing, Hamming windowing, and STFT transformation operations.
  • framing refers to segmenting the collected speech to be recognized in the time domain, and then dividing the segmented speech obtained in the previous step into multiple frames according to the preset duration of one frame.
  • windowing uses a window function to process the voice data of each frame to obtain a time segment; the truncated time segment is then periodically extended to obtain a virtually infinite signal, so that mathematical processing such as the STFT and correlation analysis can be applied to the signal.
  • to avoid spectral energy leakage, the window function can be selected according to the waveform of the voice to be recognized; the choice of the specific window function is not limited here.
  • the electronic device further performs STFT transformation processing on each frame of voice data after the windowing operation, and converts the voice to be recognized in the time domain into a feature map of the voice signal to be recognized.
  • the horizontal axis represents the time dimension
  • the vertical axis represents the frequency dimension.
  • for example, in this solution, assuming the voice to be recognized is 50 seconds long, it can first be divided into 10 segments of 5 seconds each.
  • the common frame length is generally 20-50 milliseconds.
  • 25 milliseconds can be selected as the frame length, and each segmented voice can be divided into 200 frames.
  • the electronic device performs a windowing operation on each frame, and then performs STFT transformation on each frame of voice data after windowing, and converts it to obtain a feature map of the voice signal to be recognized.
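The framing, windowing, and STFT step described above can be illustrated with a short sketch. This is a minimal illustration rather than the patented implementation: it assumes 16 kHz mono audio and the 25 ms frame length from the example above, while the 50% hop and the log-magnitude scaling are assumptions of this sketch.

```python
import numpy as np
from scipy.signal import stft

def voice_to_feature_map(samples: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Convert a 1-D voice signal into a (frequency x time) feature map."""
    frame_len = int(0.025 * sample_rate)          # 25 ms frames, as in the example above
    hop_len = frame_len // 2                      # 50% overlap (an assumption of this sketch)
    _, _, spec = stft(samples, fs=sample_rate, window="hamming",
                      nperseg=frame_len, noverlap=frame_len - hop_len)
    # Log-magnitude spectrogram: vertical axis = frequency, horizontal axis = time.
    return np.log1p(np.abs(spec))

# Usage: random noise stands in for one 5-second segment of real audio.
feature_map = voice_to_feature_map(np.random.randn(5 * 16000))
print(feature_map.shape)  # (frequency bins, time frames)
```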
  • Step S103 input the feature map of the voice signal to be recognized into the target DenseNet network model, and output the binary classification result of the voice to be recognized as real voice or fake voice.
  • the electronic device inputs the feature map of the voice signal to be recognized obtained in step S102 into the trained target DenseNet network model, so as to output the binary classification result of the voice signal to be recognized as a real voice or a fake voice.
  • the target DenseNet network model may be trained in advance by the electronic device, or it may be sent to the electronic device by another electronic device after the training is completed.
  • the target DenseNet network model may be an improved DenseNet network model.
  • the main improvement to the DenseNet network lies in: changing the dense block of the existing DenseNet network to a custom channel stretch module (channel stretch block) structure.
  • the improved target DenseNet network model can significantly reduce the number of parameters of the training model.
  • specifically, the existing DenseNet network has 1.71×10^5 parameters and 7.16×10^9 floating-point operations, while the improved target DenseNet network model has 8.2×10^4 parameters and 3.53×10^9 floating-point operations.
  • the existing DenseNet network is a common network structure at present, and this embodiment will not elaborate too much.
  • the target DenseNet network model obtained by improving the existing DenseNet network can refer to Figure 2.
  • the target DenseNet network model can include:
  • the first convolutional layer, the first channel stretch module, the first transition layer, the second channel stretch module, the second transition layer, the third channel stretch module, the first fully connected layer (FC), and the second fully connected layer.
  • the first convolutional layer is a convolutional layer whose kernel size can be 1×1, and each of the first transition layer and the second transition layer is composed of one convolutional layer followed by one pooling layer.
  • the first convolutional layer, the first channel stretch module, the first transition layer, the second channel stretch module, the second transition layer, and the third channel stretch module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output the first feature map.
  • the first fully connected layer and the second fully connected layer can convert the input data into different categories; they can be used to further extract the features of the first feature map and output the binary classification result according to the extracted features.
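The overall layout described above can be summarized in a structural sketch, assuming PyTorch. This is a rough skeleton under stated assumptions: the channel stretch module is stood in for by a plain 3×3 convolution block here (its internals, and the SE block, are sketched after Fig. 4 below), and the global pooling before the fully connected layers and the layer widths are assumptions of this sketch, not details given by the patent.

```python
import torch
import torch.nn as nn

def transition(in_ch: int, out_ch: int) -> nn.Sequential:
    # A transition layer: one convolutional layer followed by one pooling layer.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1), nn.AvgPool2d(2))

class TargetDenseNetSketch(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        def stretch(c: int) -> nn.Sequential:
            # Placeholder for a channel stretch module (see the sketch after Fig. 4).
            return nn.Sequential(nn.Conv2d(c, c, kernel_size=3, padding=1), nn.ReLU())
        self.features = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=1),              # first convolutional layer, 1x1
            stretch(channels), transition(channels, channels),  # first stretch + transition
            stretch(channels), transition(channels, channels),  # second stretch + transition
            stretch(channels),                                  # third stretch module
            nn.AdaptiveAvgPool2d(1), nn.Flatten())              # bridge to the FC layers (assumed)
        self.fc1 = nn.Linear(channels, 32)   # first fully connected layer (width assumed)
        self.fc2 = nn.Linear(32, 2)          # second fully connected layer: 2 classes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(self.features(x))))

# One (batch, channel, frequency, time) feature map in, two class logits out.
logits = TargetDenseNetSketch()(torch.randn(1, 1, 256, 200))
print(logits.shape)  # torch.Size([1, 2])
```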
  • the aforementioned first channel stretch module, second channel stretch module, and third channel stretch module have the same structure, each including 4 upper structures and 4 lower structures.
  • Figure 3 is a schematic diagram of the structure of the first channel stretch module, which may include:
  • 4 upper structures, each including a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block (Squeeze-and-Excitation block); and 4 lower structures, each including a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
  • as shown in Figure 3, the second, fourth, fifth, and seventh convolutional layers may all be 1×1 convolutional layers.
  • the second convolutional layer and the fifth convolutional layer can be used to perform a 1×1 convolution on the received feature maps, reduce the number of input feature map channels, and feed the output feature maps in parallel to the 4 third convolutional layers and the 4 sixth convolutional layers, respectively. For example, if the feature map input to the second convolutional layer has 64 channels, a feature map with 32 channels can be output after the 1×1 convolution of the second convolutional layer, and the output feature map can be fed in parallel to the 4 third convolutional layers with 3×3 kernels.
  • the fourth convolutional layer and the seventh convolutional layer can respectively splice the feature maps output by the 4 third convolutional layers and the 4 sixth convolutional layers (joining them channel-wise), and the feature maps output after the splicing operation are input to the first SE block and the second SE block, respectively, for processing.
  • the first SE block and the second SE block have the same structure.
  • the first SE block is taken as an example.
  • FIG. 4 is a schematic diagram of the structure of the first SE block in an embodiment of the application.
  • the first SE block can sequentially include:
  • a global pooling layer, a fully connected layer, an activation layer (ReLU), a fully connected layer, a sigmoid layer, and a scale layer.
  • in the figure, C represents the number of channels, and r is a user-set parameter that can be set to 16.
  • the brief processing flow of the first SE block may include: the feature maps of the C channels output by the fourth convolutional layer are input into the first SE block, and the layers on the right side of the first SE block compute C weights W corresponding to the C channels; then, in the scale layer, each channel of the original input feature maps is multiplied by its weight W to output the weighted feature maps.
  • the first SE block and the second SE block are used to learn feature weights according to the loss function, increasing the weights of effective channels in the feature map and decreasing the weights of ineffective or weak channels, so that model training achieves better results.
  • that is, the weights corresponding to each channel of the feature map are allocated and adjusted within the network. For example, if there are 64 channels in the network, in the prior art these channels all contribute equally to the network, i.e., their weights are the same; if an SE block is added, different weights can be assigned to achieve better results.
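A hedged sketch of the first SE block (Fig. 4) and of one upper structure of the channel stretch module (Fig. 3) may make the data flow concrete. It assumes PyTorch; the channel counts follow the 64-to-32 example above, "splicing" is interpreted as channel-wise concatenation followed by the 1×1 fourth convolutional layer, and r = 16 as stated.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Global pooling -> FC -> ReLU -> FC -> sigmoid -> scale, with reduction r = 16."""
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r), nn.ReLU(),
            nn.Linear(channels // r, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(x.mean(dim=(2, 3)))   # C per-channel weights W from global pooling
        return x * w[:, :, None, None]    # scale layer: multiply each channel by its weight

class UpperStructure(nn.Module):
    """1x1 reduce -> 4 parallel 3x3 convs -> channel-wise splice -> 1x1 merge -> SE block."""
    def __init__(self, in_ch: int = 64, mid_ch: int = 32):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)     # second convolutional layer
        self.branches = nn.ModuleList(
            [nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)  # 4 third convolutional layers
             for _ in range(4)])
        self.merge = nn.Conv2d(4 * mid_ch, in_ch, kernel_size=1)  # fourth convolutional layer
        self.se = SEBlock(in_ch)                                  # first SE block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)                                    # e.g. 64 -> 32 channels
        x = torch.cat([b(x) for b in self.branches], dim=1)   # splice the 4 outputs by channel
        return self.se(self.merge(x))

out = UpperStructure()(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```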
  • to further ensure privacy and security, the voice to be recognized and the binary classification result can also be stored in a node of a blockchain.
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • the fake speech recognition method provided in the embodiments of the present application can be applied in the fields of smart medical care, smart government affairs, smart education, or technology finance.
  • the fake speech recognition method can be used to verify the identity of the collected voice and recognize whether it is a real person's voice, so as to avoid system security vulnerabilities caused by fake speech.
  • this solution uses the target DenseNet network model for the speech recognition work; based on the self-learning capability of the neural network, it provides a highly accurate method of automatically identifying fake speech and reduces the generation of security vulnerabilities in ASV or voiceprint systems.
  • this is a schematic diagram of another embodiment of a method for recognizing fake speech according to an embodiment of the present application:
  • the above-mentioned electronic device may also perform the training process of the initial DenseNet network, and obtain the target DenseNet network model when the training result reaches the expected target.
  • the process of obtaining the target DenseNet network model by training may include:
  • S501 Acquire a real voice data set and a fake voice data set.
  • the electronic device can obtain the real voice data set and the fake voice data set from the external device.
  • the real voice data set can include directly collected real-person voice data gathered under various conditions, such as different ages, different genders, different regions, and different emotions;
  • the fake voice data set can include fake voice imitating real people produced by speech synthesis (text-to-speech) technology, fake voice produced by voice conversion (speech-to-speech technology, which uses a segment of a target person's voice to convert any non-target person's voice into the target person's voice), voice data obtained by splicing partial real-person voice with machine-synthesized speech, and real-person voice recorded and replayed once or multiple times by recording equipment.
  • S502: Use the STFT transformation to convert the voice in the real voice data set into voice signal feature maps of the first category and the voice in the fake voice data set into voice signal feature maps of the second category, so as to obtain a first voice signal feature map data set that includes the first-category voice signal feature maps and the second-category voice signal feature maps.
  • the electronic device uses the STFT transformation to convert the voice in the real voice data set obtained above into first-category voice signal feature maps and the voice in the fake voice data set into second-category voice signal feature maps, so as to obtain the first voice signal feature map data set including both categories of feature maps.
  • STFT transformation is similar to the processing method of step S102 in the embodiment shown in FIG. 1, and will not be repeated here.
  • after obtaining the first-category and second-category voice signal feature maps, the electronic device also needs to respond to the user's labeling operation, set labels for the different categories of voice signal feature maps to generate a label file, and save it into the first voice signal feature map data set.
  • the label is set to a two-category form, for example, it can be set to 0 or 1, where 0 represents the feature map of the first category of speech signals, and 1 represents the feature map of the second category of speech signals.
  • S503: Train the initial DenseNet network with the first voice signal feature map data set, and adjust the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function falls below the preset value, lock the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
  • after obtaining the first voice signal feature map data set, the electronic device trains the initial DenseNet network with it and adjusts the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function falls below the preset value, the weight parameters of each layer of the initial DenseNet network are locked to obtain the target DenseNet network model.
  • this is a schematic diagram of another embodiment of a method for recognizing fake speech provided by an embodiment of the present application.
  • the method for recognizing fake speech may include:
  • S601 Perform a masking operation on part of the frequency features of the part of the voice signal feature maps in the first voice signal feature map data set, so as to convert the first voice signal feature map data set into the second voice signal feature map data set.
  • the electronic device may perform a mask operation on some features of a part of the voice signal feature maps in the first voice signal feature map data set.
  • specifically, a continuous portion of the features of a voice signal feature map can be entirely reset to 0.
  • for example, if the frequency dimension of the original voice signal feature map has 256 bins covering the range from 0 to 8000 Hz, 30 of the 256 bins can be randomly selected and zeroed; in frequency terms, a stretch of the information between 0 and 8000 Hz is erased, which increases the unknownness of the data to the model.
  • with this method, without using random dropout in the network, the generalization performance of the network is greatly improved, and the recognition accuracy of the network is further improved by about 30%.
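A minimal sketch of this frequency-masking augmentation, assuming a 256-bin (frequency) by 200-frame (time) feature map as in the examples above; both the random-bin variant (30 of 256 bins zeroed) and the contiguous-band variant mentioned above are shown.

```python
import numpy as np

def mask_random_bins(feature_map: np.ndarray, n_bins: int = 30) -> np.ndarray:
    """Zero out n_bins randomly chosen frequency rows of the feature map."""
    masked = feature_map.copy()
    rows = np.random.choice(feature_map.shape[0], size=n_bins, replace=False)
    masked[rows, :] = 0.0                          # erase those frequency bins entirely
    return masked

def mask_contiguous_band(feature_map: np.ndarray, width: int = 30) -> np.ndarray:
    """Zero out one continuous frequency band, per the contiguous-reset description."""
    masked = feature_map.copy()
    start = np.random.randint(0, feature_map.shape[0] - width)
    masked[start:start + width, :] = 0.0           # erase one continuous band of frequencies
    return masked

augmented = mask_random_bins(np.random.rand(256, 200))   # 256 frequency bins x 200 frames
```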
  • Step S503, training the initial DenseNet network with the first voice signal feature map data set and adjusting the weight parameters of each layer of the initial DenseNet network based on the loss function until the loss function falls below the preset value, then locking the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model, may include:
  • S602: Train the initial DenseNet network with the second voice signal feature map data set, and adjust the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function falls below the preset value, lock the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
  • step S602 in this embodiment is similar to step S503 in the embodiment shown in FIG. 5, and will not be repeated here.
  • the feature masking method increases the unknownness of the data for the model, greatly improves the generalization performance of the network, and thereby improves the recognition ability of the target DenseNet network model for unknown fake speech.
  • FIG. 7A is a schematic structural diagram of a fake speech recognition device provided by an embodiment of the application, and the recognition device may include:
  • the first obtaining module 701 is configured to obtain a voice to be recognized
  • the first conversion module 702 is configured to perform STFT conversion processing on the voice to be recognized, and convert the voice to be recognized into a feature map of the voice signal to be recognized;
  • the processing module 703 is configured to input the feature map of the voice signal to be recognized into the target DenseNet network model, and output the binary classification result of the voice to be recognized as a real voice or a fake voice.
  • FIG. 7B is another schematic structural diagram of a fake voice recognition device provided by an embodiment of the application, and the fake voice recognition device may further include:
  • the second obtaining module 704 is used to obtain a real voice data set and a fake voice data set
  • the second transformation module 705 is configured to use the STFT transformation to convert the voice in the real voice data set into first-category voice signal feature maps and the voice in the fake voice data set into second-category voice signal feature maps, so as to obtain a first voice signal feature map data set that may include the first-category voice signal feature maps and the second-category voice signal feature maps;
  • the training module 706 is used to train the initial DenseNet network with the first voice signal feature map data set and to adjust the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function falls below the preset value, the weight parameters of each layer of the initial DenseNet network are locked to obtain the target DenseNet network model.
  • FIG. 7C is another schematic structural diagram of a fake voice recognition device provided by an embodiment of the application, and the fake voice recognition device may further include:
  • the editing module 707 is configured to perform a masking operation on part of the frequency features of the part of the speech signal feature maps in the first speech signal feature map data set, so as to convert the first speech signal feature map data set into a second speech signal feature map data set ;
  • the training module 706 is also specifically configured to train the initial DenseNet network by using the second speech signal feature map data set.
  • the target DenseNet network model may sequentially include a first convolutional layer, a first channel stretch module, a first transition layer, a second channel stretch module, a second transition layer, a third channel stretch module, a first fully connected layer, and a second fully connected layer.
  • the first convolutional layer, the first channel stretch module, the first transition layer, the second channel stretch module, the second transition layer, and the third channel stretch module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output a first feature map.
  • the first fully connected layer and the second fully connected layer are used to further extract the features of the first feature map and to output the binary classification result according to the extracted features.
  • each of the first channel stretch module, the second channel stretch module, and the third channel stretch module may include 4 upper structures and 4 lower structures; each upper structure may sequentially include a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block, and each lower structure may sequentially include a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
  • the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer are all convolutional layers with a kernel size of 1×1; the second convolutional layer and the fifth convolutional layer are used to reduce the number of channels of the input feature maps; the fourth convolutional layer is used to splice the feature maps output by the 4 third convolutional layers and input the result into the first SE block for processing.
  • the seventh convolutional layer is used to splice the feature maps output by the 4 sixth convolutional layers and input the result into the second SE block for processing.
  • the first SE block is used to assign each feature map input from the fourth convolutional layer a corresponding weight per channel.
  • the second SE block is used to assign each feature map input from the seventh convolutional layer a corresponding weight per channel.
  • the first conversion module 702 is specifically configured to perform STFT conversion processing after performing framing and windowing operations on the voice to be recognized.
  • FIG. 8 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 8 includes a memory 81, a processor 82, and a network interface 83 that are connected to each other in communication through a system bus. It should be pointed out that the figure only shows the computer device 8 with components 81-83, but it should be understood that it is not required to implement all the shown components, and more or fewer components may be implemented instead. Among them, those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with pre-set or stored instructions.
  • Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • the memory 81 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 81 may be an internal storage unit of the computer device 8, such as a hard disk or a memory of the computer device 8.
  • the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 8.
  • the memory 81 may also include both the internal storage unit of the computer device 8 and its external storage device.
  • the memory 81 is generally used to store the operating system and various application software installed on the computer device 8, for example, to implement the steps of the method for recognizing fake speech as described below:
  • a voice to be recognized is acquired;
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into a target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • the memory 81 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 82 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips.
  • the processor 82 is generally used to control the overall operation of the computer device 8.
  • the processor 82 is configured to run the computer-readable instructions or process the data stored in the memory 81, for example, to run the computer-readable instructions of the method for recognizing fake speech in the embodiments shown in FIG. 1, 5, or 6.
  • the network interface 83 may include a wireless network interface or a wired network interface, and the network interface 83 is generally used to establish a communication connection between the computer device 8 and other electronic devices.
  • the processor 82 on the computer device 8 executes the computer-readable instructions of the method for recognizing fake speech in the embodiments shown in FIG. 1, 5, or 6, thereby providing a neural-network-based method for automatically identifying fake speech.
  • the present application also provides another implementation manner, that is, a computer-readable storage medium is provided, where the computer-readable storage medium may be non-volatile or volatile.
  • the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor, so that the at least one processor executes the steps of the method for recognizing fake speech as described below:
  • a voice to be recognized is acquired;
  • STFT transformation processing is performed on the voice to be recognized, and the voice to be recognized is converted into a feature map of the voice signal to be recognized;
  • the feature map of the voice signal to be recognized is input into a target DenseNet network model, and the binary classification result of the voice to be recognized as a real voice or a fake voice is output.
  • this application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
  • This application may be described in the general context of computer-executable instructions executed by a computer, such as a program module.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • This application can also be practiced in distributed computing environments. In these distributed computing environments, tasks are performed by remote processing devices connected through a communication network.
  • program modules can be located in local and remote computer storage media including storage devices.
  • the aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • the technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the recognition method described in each embodiment of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiments of this application belong to the field of artificial intelligence and relate to a method for recognizing fake speech, including: acquiring a voice to be recognized; performing STFT transformation processing on the voice to be recognized to convert it into a feature map of the voice signal to be recognized; and inputting the feature map of the voice signal to be recognized into a target DenseNet network model and outputting a binary classification result of whether the voice to be recognized is real voice or fake voice. This application also provides a fake speech recognition device, a computer device, and a storage medium. In addition, this application relates to blockchain technology: the acquired user voice data and the binary classification result can be stored in a blockchain. By using the target DenseNet network model for the speech recognition work, and based on the self-learning capability of the neural network, this solution provides a highly accurate method for automatically identifying fake speech and reduces the generation of security vulnerabilities in ASV or voiceprint systems. This application can be applied in fields such as smart healthcare, smart government affairs, smart education, and fintech.

Description

Method, device, and computer-readable storage medium for recognizing fake speech
This application is based on the Chinese invention patent application No. 202010688484.9, filed on July 16, 2020 and entitled "Method, device, and computer-readable storage medium for recognizing fake speech", and claims priority thereto.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a method for recognizing fake speech, a recognition device, a computer device, and a computer-readable storage medium.
Background
With the gradual development of speech recognition and artificial intelligence (AI) technology, their practical applications are becoming increasingly common; in particular, automatic speaker verification (ASV) and voiceprint technology are widely used in phone wake-up, voice unlocking, smart speakers, and voice payment.
However, it should be noted that ASV or voiceprint systems themselves do not have the ability to recognize fake speech, and with the maturity of text-to-speech (TTS) synthesis technology, fake speech on the voice side is becoming increasingly difficult to recognize, including playback of recordings made with high-quality equipment, state-of-the-art speech synthesis, and so on. When ASV and voiceprint technology are used in the future, especially in security-related directions, or when big data mining is involved, it becomes more and more important to be able to distinguish whether a voice really comes from the user or customer. Therefore, in the course of implementing this application, the inventors realized that the existing implementations have at least the following problem: how to recognize fake speech is an urgent problem to be solved.
Summary
The purpose of the embodiments of this application is to propose a method for recognizing fake speech, a recognition device, a computer device, and a computer-readable storage medium, so as to solve the security vulnerabilities that may exist in the prior art due to the lack of means for recognizing fake speech.
To solve the above technical problem, the embodiments of this application provide a method for recognizing fake speech, a recognition device, a computer device, and a computer-readable storage medium, adopting the following technical solutions:
In a first aspect, an embodiment of this application provides a method for recognizing fake speech, which may include:
acquiring a voice to be recognized;
performing STFT transformation processing on the voice to be recognized, and converting the voice to be recognized into a feature map of the voice signal to be recognized;
inputting the feature map of the voice signal to be recognized into a target DenseNet network model, and outputting a binary classification result of whether the voice to be recognized is real voice or fake voice.
Further, before the feature map of the voice signal to be recognized is input into the target DenseNet network model, the recognition method may further include:
acquiring a real voice data set and a fake voice data set;
converting, by STFT transformation, the voice in the real voice data set into voice signal feature maps of a first category and the voice in the fake voice data set into voice signal feature maps of a second category, so as to obtain a first voice signal feature map data set that may include the first-category voice signal feature maps and the second-category voice signal feature maps;
training an initial DenseNet network with the first voice signal feature map data set, and adjusting the weight parameters of each layer of the initial DenseNet network based on a loss function; when the loss function falls below a preset value, locking the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
Further, after the voice in the real voice data set is converted by STFT transformation into the first-category voice signal feature maps and the voice in the fake voice data set is converted into the second-category voice signal feature maps to obtain the first voice signal feature map data set that may include both categories of feature maps, the recognition method may further include:
performing a masking operation on part of the frequency features of some voice signal feature maps in the first voice signal feature map data set, so as to convert the first voice signal feature map data set into a second voice signal feature map data set;
in this case, the training of the initial DenseNet network with the first voice signal feature map data set may include:
training the initial DenseNet network with the second voice signal feature map data set.
Further, the target DenseNet network model may sequentially include a first convolutional layer, a first channel stretch module, a first transition layer, a second channel stretch module, a second transition layer, a third channel stretch module, a first fully connected layer, and a second fully connected layer; the first convolutional layer, the first channel stretch module, the first transition layer, the second channel stretch module, the second transition layer, and the third channel stretch module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output a first feature map, and the first fully connected layer and the second fully connected layer are used to further extract the features of the first feature map and output the binary classification result according to the extracted features.
Further, each of the first channel stretch module, the second channel stretch module, and the third channel stretch module may include 4 upper structures and 4 lower structures; each upper structure may sequentially include a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block, and each lower structure may sequentially include a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
Further, the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer are all convolutional layers with a kernel size of 1×1; the second convolutional layer and the fifth convolutional layer are used to reduce the number of channels of the input feature maps; the fourth convolutional layer is used to splice the feature maps output by the 4 third convolutional layers and input the result into the first SE block for processing; the seventh convolutional layer is used to splice the feature maps output by the 4 sixth convolutional layers and input the result into the second SE block for processing; the first SE block is used to assign each feature map input from the fourth convolutional layer a corresponding weight per channel, and the second SE block is used to assign each feature map input from the seventh convolutional layer a corresponding weight per channel.
Further, performing the STFT transformation processing on the voice to be recognized may include:
performing framing and windowing operations on the voice to be recognized, and then performing the STFT transformation processing.
In a second aspect, an embodiment of this application provides a fake speech recognition device, which may include:
a first acquisition module, configured to acquire a voice to be recognized;
a first transformation module, configured to perform STFT transformation processing on the voice to be recognized and convert the voice to be recognized into a feature map of the voice signal to be recognized;
a processing module, configured to input the feature map of the voice signal to be recognized into a target DenseNet network model and output a binary classification result of whether the voice to be recognized is real voice or fake voice.
In a third aspect, an embodiment of this application provides a computer device, including a memory and a processor, where the memory stores computer-readable instructions, and when the processor executes the computer-readable instructions, the following steps of the method for recognizing fake speech are implemented:
acquiring a voice to be recognized;
performing STFT transformation processing on the voice to be recognized, and converting the voice to be recognized into a feature map of the voice signal to be recognized;
inputting the feature map of the voice signal to be recognized into a target DenseNet network model, and outputting a binary classification result of whether the voice to be recognized is real voice or fake voice.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium, where computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following steps of the method for recognizing fake speech are implemented:
acquiring a voice to be recognized;
performing STFT transformation processing on the voice to be recognized, and converting the voice to be recognized into a feature map of the voice signal to be recognized;
inputting the feature map of the voice signal to be recognized into a target DenseNet network model, and outputting a binary classification result of whether the voice to be recognized is real voice or fake voice.
Compared with the prior art, the embodiments of this application mainly have the following beneficial effects:
After the voice to be recognized is acquired, STFT transformation processing is performed on it to obtain the processed feature map of the voice signal to be recognized. Then, the feature map is input into the target DenseNet network model, which outputs the binary classification result of whether the voice to be recognized is real voice or fake voice. That is, by using the target DenseNet network model for the speech recognition work, and based on the self-learning capability of the neural network, this solution provides a highly accurate method for automatically identifying fake speech and reduces the generation of security vulnerabilities in ASV or voiceprint systems.
Brief Description of the Drawings
To explain the solutions in this application more clearly, the following briefly introduces the drawings needed in the description of the embodiments. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic diagram of an embodiment of a method for recognizing fake speech provided by an embodiment of this application;
FIG. 2 is a schematic structural diagram of the target DenseNet network model provided by an embodiment of this application;
FIG. 3 is a schematic structural diagram of the first channel stretch module in the target DenseNet network model shown in FIG. 2;
FIG. 4 is a schematic structural diagram of the first SE block in the first channel stretch module shown in FIG. 3;
FIG. 5 is a schematic diagram of another embodiment of a method for recognizing fake speech provided by an embodiment of this application;
FIG. 6 is a schematic diagram of yet another embodiment of a method for recognizing fake speech provided by an embodiment of this application;
FIG. 7A is a schematic diagram of an embodiment of a fake speech recognition device provided by an embodiment of this application;
FIG. 7B is a schematic diagram of yet another embodiment of a fake speech recognition device provided by an embodiment of this application;
FIG. 7C is a schematic diagram of yet another embodiment of a fake speech recognition device provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of an embodiment of a computer device according to this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application; the terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit this application; the terms "including" and "having" and any variations thereof in the specification, claims, and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the specification, claims, or the above drawings are used to distinguish different objects, not to describe a specific order.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily all refer to the same embodiment, nor to independent or alternative embodiments mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
To enable those skilled in the art to better understand the solutions of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings.
As shown in FIG. 1, a flowchart of an embodiment of a method for recognizing fake speech according to this application is shown. The method for recognizing fake speech includes the following steps:
Step S101: Acquire a voice to be recognized.
In this embodiment, the method for recognizing fake speech runs on an electronic device (for example, a server or terminal device), and the electronic device can collect the voice data to be recognized.
In some possible implementations, the acquired voice to be recognized can be stored in a blockchain.
Step S102: Perform STFT transformation processing on the voice to be recognized, and convert the voice to be recognized into a feature map of the voice signal to be recognized.
In this embodiment, the electronic device may perform a short-time Fourier transform (STFT) on the voice to be recognized, so as to convert the voice data to be recognized into the feature map of the voice signal to be recognized. Specifically, the conversion process may sequentially include framing, windowing (e.g., with a Hamming window), and the STFT transformation. Framing means that the collected voice to be recognized is first segmented in the time domain, and the segmented voice obtained in the previous step is then divided into multiple frames according to a preset frame duration. Windowing uses a window function to process the voice data of each frame to obtain a time segment, and the truncated time segment is then periodically extended to obtain a virtually infinite signal, so that mathematical processing such as the STFT and correlation analysis can be applied to the signal. To avoid spectral energy leakage, the window function can be selected according to the waveform of the voice to be recognized; the specific choice of window function is not limited here. After the windowing operation, the electronic device further performs the STFT transformation on each frame of windowed voice data, converting the time-domain voice to be recognized into the feature map of the voice signal to be recognized, in which the horizontal axis represents the time dimension and the vertical axis represents the frequency dimension.
As an example, in this solution, assuming the voice to be recognized is 50 seconds long, it can first be divided into 10 segments of 5 seconds each. A common frame length is generally 20-50 milliseconds; in this solution, 25 milliseconds can be chosen as the frame length, so each segmented voice can be divided into 200 frames. The electronic device then performs the windowing operation on each frame, performs the STFT transformation on each windowed frame of voice data, and converts the result into the feature map of the voice signal to be recognized.
Step S103: Input the feature map of the voice signal to be recognized into the target DenseNet network model, and output the binary classification result of whether the voice to be recognized is real voice or fake voice.
In this embodiment, the electronic device inputs the feature map of the voice signal to be recognized obtained in step S102 into the trained target DenseNet network model, which outputs the binary classification result of whether the voice to be recognized is real voice or fake voice. The target DenseNet network model may be trained in advance by the electronic device, or it may be sent to the electronic device by another electronic device after training is completed.
In some possible implementations, compared with the DenseNet network in the prior art, the target DenseNet network model may be an improved DenseNet network model. Specifically, in this application, the main improvement to the DenseNet network lies in replacing the dense block of the existing DenseNet network with a custom channel stretch module (channel stretch block) structure.
The improved target DenseNet network model can significantly reduce the number of parameters of the trained model. Specifically, the existing DenseNet network has 1.71×10^5 parameters and 7.16×10^9 floating-point operations, while the improved target DenseNet network model has 8.2×10^4 parameters and 3.53×10^9 floating-point operations. It should be noted that the existing DenseNet network is a common network structure at present, and this embodiment does not elaborate on it.
The target DenseNet network model obtained by improving the existing DenseNet network can be seen in FIG. 2, and the target DenseNet network model can include:
a first convolutional layer, a first channel stretch module, a first transition layer, a second channel stretch module, a second transition layer, a third channel stretch module, a first fully connected layer (FC), and a second fully connected layer.
The connection relationships of the layers are shown in FIG. 2. The first convolutional layer is a convolutional layer whose kernel size can be 1×1, and each of the first transition layer and the second transition layer is composed of one convolutional layer and one pooling layer. The first convolutional layer, the first channel stretch module, the first transition layer, the second channel stretch module, the second transition layer, and the third channel stretch module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output a first feature map. The first fully connected layer and the second fully connected layer can convert the input data into different categories; they can be used to further extract the features of the first feature map and output the binary classification result according to the extracted features.
In some possible implementations, the first channel stretch module, the second channel stretch module, and the third channel stretch module have the same structure, each including 4 upper structures and 4 lower structures. FIG. 3 is a schematic structural diagram of the first channel stretch module, which may include:
4 upper structures, each including a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block (Squeeze-and-Excitation block); and 4 lower structures, each including a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
In some possible implementations, as shown in FIG. 3, the second, fourth, fifth, and seventh convolutional layers may all be 1×1 convolutional layers. The second and fifth convolutional layers can be used to perform a 1×1 convolution on the received feature maps, reduce the number of input feature map channels, and feed the output feature maps in parallel to the 4 third convolutional layers and the 4 sixth convolutional layers, respectively. For example, if the feature map input to the second convolutional layer has 64 channels, a feature map with 32 channels can be output after the 1×1 convolution of the second convolutional layer, and the output feature map can be fed in parallel to the 4 third convolutional layers with 3×3 kernels. The fourth and seventh convolutional layers can respectively splice the feature maps output by the 4 third convolutional layers and the 4 sixth convolutional layers (joining them channel-wise), and the feature maps output after the splicing operation are input to the first SE block and the second SE block, respectively, for processing.
The first SE block and the second SE block have the same structure. Taking the first SE block as an example, reference can be made to FIG. 4, which is a schematic structural diagram of the first SE block in an embodiment of this application. The first SE block may sequentially include:
a global pooling layer, a fully connected layer, an activation layer (ReLU), a fully connected layer, a sigmoid layer, and a scale layer. In the figure, C represents the number of channels, and r is a user-set parameter that can be set to 16.
A brief processing flow of the first SE block may include: the feature maps of the C channels output by the fourth convolutional layer are input into the first SE block and first pass through the layers on the right side of the first SE block, which finally compute C weights W corresponding to the C channels. Then, in the scale layer, each channel of the original input feature maps is multiplied by its weight W to output the weighted feature maps.
As can be seen from the above, the first SE block and the second SE block are used to learn feature weights according to the loss function, increasing the weights of effective channels in the feature map and decreasing the weights of ineffective or weak channels, so that model training achieves better results; that is, the weights corresponding to each channel of the feature map are allocated and adjusted within the network. For example, if there are 64 channels in the network, in the prior art these channels contribute equally to the network, i.e., their weights are the same; if an SE block is added, different weights can be assigned to achieve better results.
It should be emphasized that, to further ensure the privacy and security of the voice to be recognized and the binary classification result, the voice to be recognized and the binary classification result can also be stored in a node of a blockchain.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain can include the underlying blockchain platform, a platform product service layer, an application service layer, and so on.
In some possible implementations, the method for recognizing fake speech provided in the embodiments of this application can be applied in fields such as smart healthcare, smart government affairs, smart education, and fintech. For example, when applied in smart government affairs or fintech, the method can be used to verify the identity of collected voice and recognize whether it is a real person's voice, so as to avoid system security vulnerabilities caused by fake speech.
Compared with the prior art, the embodiments of this application mainly have the following beneficial effects:
In this solution, after the voice to be recognized is acquired, STFT transformation processing is performed on it to obtain the processed feature map of the voice signal to be recognized. Then, the feature map is input into the target DenseNet network model, which outputs the binary classification result of whether the voice to be recognized is real voice or fake voice. That is, by using the target DenseNet network model for the speech recognition work, and based on the self-learning capability of the neural network, this solution provides a highly accurate method for automatically identifying fake speech and reduces the generation of security vulnerabilities in ASV or voiceprint systems.
In some optional implementations of the embodiments of this application, referring specifically to FIG. 5, which is a schematic diagram of another embodiment of a method for recognizing fake speech provided by an embodiment of this application:
Based on the embodiment shown in FIG. 1, before step S103, the above electronic device may also perform the training process of the initial DenseNet network and obtain the target DenseNet network model when the training result reaches the expected goal. Specifically, the process of obtaining the target DenseNet network model through training may include:
S501: Acquire a real voice data set and a fake voice data set.
In this embodiment, the electronic device can acquire the real voice data set and the fake voice data set from an external device. The real voice data set can include directly collected real-person voice data gathered under various conditions such as different ages, different genders, different regions, and different emotions, while the fake voice data set can include fake voice imitating real people produced by speech synthesis (text-to-speech) technology, fake voice produced by voice conversion (speech-to-speech technology, which uses a segment of a target person's voice to convert any non-target person's voice into the target person's voice), voice data obtained by splicing partial real-person voice with machine-synthesized speech, and real-person voice recorded and replayed once or multiple times by recording equipment.
S502: Use the STFT transformation to convert the voice in the real voice data set into first-category voice signal feature maps and the voice in the fake voice data set into second-category voice signal feature maps, so as to obtain a first voice signal feature map data set including the first-category voice signal feature maps and the second-category voice signal feature maps.
In this embodiment, the electronic device uses the STFT transformation to convert the voice in the real voice data set obtained above into first-category voice signal feature maps and the voice in the fake voice data set into second-category voice signal feature maps, so as to obtain the first voice signal feature map data set including both categories of feature maps. The specific STFT transformation process is similar to the processing of step S102 in the embodiment shown in FIG. 1 and is not repeated here.
It should be noted that, after obtaining the first-category and second-category voice signal feature maps, the electronic device also needs to respond to the user's labeling operation, set labels for the different categories of voice signal feature maps to generate a label file, and save it into the first voice signal feature map data set. The labels are set in binary form; for example, they can be set to 0 or 1, where 0 represents a first-category voice signal feature map and 1 represents a second-category voice signal feature map.
S503: Train the initial DenseNet network with the first voice signal feature map data set, and adjust the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function falls below the preset value, lock the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
In this embodiment, after obtaining the first voice signal feature map data set, the electronic device trains the initial DenseNet network with it and adjusts the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function falls below the preset value, the weight parameters of each layer of the initial DenseNet network are locked to obtain the target DenseNet network model.
In some possible implementations, the loss function is the binary cross-entropy loss function. Specifically, for a sample (x, y), x is the sample and y is the corresponding label; in a binary classification problem, the set of possible label values is {0, 1}. Suppose the true label of a sample is yt and the predicted probability that yt = 1 is yp; then the loss function for that sample is:
loss(yt, yp) = -(yt*log(yp) + (1 - yt)*log(1 - yp))
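A hedged sketch of one training iteration under this loss may help. It assumes PyTorch and uses the equivalent two-class cross-entropy form of the formula above; the stand-in linear model, the Adam optimizer, the learning rate, and the loss threshold are illustrative assumptions, not values specified by this application.

```python
import torch
import torch.nn as nn

# Stand-in for the initial DenseNet network: any module mapping a
# (batch, 1, 256, 200) feature map to 2 logits (0 = real, 1 = fake).
model = nn.Sequential(nn.Flatten(), nn.Linear(1 * 256 * 200, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()  # two-class cross-entropy, equivalent to the BCE above
loss_threshold = 0.01              # the "preset value" below which weights are locked

for step in range(1000):
    x = torch.randn(8, 1, 256, 200)   # a batch of voice signal feature maps
    y = torch.randint(0, 2, (8,))     # labels: 0 = first category, 1 = second category
    loss = criterion(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if loss.item() < loss_threshold:  # lock each layer's weight parameters
        for p in model.parameters():
            p.requires_grad_(False)
        break
```

With real labeled feature maps, the loop stops once the loss falls below the preset value; here the random stand-in data simply exercises the mechanics.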
Compared with the prior art, the embodiments of this application mainly have the following beneficial effects:
In this solution, a real-person voice data set collected under various conditions such as different ages, different genders, different regions, and different emotions is acquired, together with a fake voice data set containing fake voice imitating real people produced by speech synthesis, fake voice produced by voice conversion, voice data obtained by splicing partial real-person voice with machine synthesis, and real-person voice recorded and replayed once or multiple times by recording equipment; a first voice signal feature map data set is constructed based on the real-person voice data set and the fake voice data set. Then, the initial DenseNet network is trained with the first voice signal feature map data set, yielding a target DenseNet network model that can recognize many types of fake speech and broadening the recognition range for fake speech.
In some optional implementations of the embodiments of this application, referring specifically to FIG. 6, which is a schematic diagram of yet another embodiment of a method for recognizing fake speech provided by an embodiment of this application.
In the embodiment shown in FIG. 5, after step S502 and before step S503, the method for recognizing fake speech may include:
S601: Perform a masking operation on part of the frequency features of some voice signal feature maps in the first voice signal feature map data set, so as to convert the first voice signal feature map data set into a second voice signal feature map data set.
In this embodiment, after obtaining the first voice signal feature map data set, the electronic device can perform a mask operation on part of the features of a portion of the voice signal feature maps in the data set. Specifically, a continuous part of the features of a voice signal feature map can be entirely reset to 0. For example, if the frequency dimension of the original voice signal feature map has 256 bins covering the range from 0 to 8000 Hz, 30 of the 256 bins can be randomly selected and zeroed; in frequency terms, a stretch of the information between 0 and 8000 Hz is erased, which increases the unknownness of the data to the model. With this method, without using random dropout in the network, the generalization performance of the network is greatly improved, and the recognition accuracy of the network is further improved by about 30%.
Step S503, training the initial DenseNet network with the first voice signal feature map data set and adjusting the weight parameters of each layer of the initial DenseNet network based on the loss function until the loss function falls below the preset value, then locking the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model, may include:
S602: Train the initial DenseNet network with the second voice signal feature map data set, and adjust the weight parameters of each layer of the initial DenseNet network based on the loss function; when the loss function falls below the preset value, lock the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
It should be noted that step S602 in this embodiment is similar to step S503 in the embodiment shown in FIG. 5 and is not repeated here.
Compared with the prior art, the embodiments of this application mainly have the following beneficial effects:
In the embodiments of this application, the feature masking method increases the unknownness of the data to the model and greatly improves the generalization performance of the network, thereby improving the ability of the target DenseNet network model to recognize unknown fake speech.
Referring specifically to FIG. 7A below, which is a schematic structural diagram of a fake speech recognition device provided by an embodiment of this application, the recognition device may include:
a first acquisition module 701, configured to acquire a voice to be recognized;
a first transformation module 702, configured to perform STFT transformation processing on the voice to be recognized and convert the voice to be recognized into a feature map of the voice signal to be recognized;
a processing module 703, configured to input the feature map of the voice signal to be recognized into a target DenseNet network model and output a binary classification result of whether the voice to be recognized is real voice or fake voice.
Further, referring specifically to FIG. 7B, which is another schematic structural diagram of a fake speech recognition device provided by an embodiment of this application, the fake speech recognition device may further include:
a second acquisition module 704, a second transformation module 705, and a training module 706;
where the second acquisition module 704 is configured to acquire a real voice data set and a fake voice data set;
the second transformation module 705 is configured to use the STFT transformation to convert the voice in the real voice data set into first-category voice signal feature maps and the voice in the fake voice data set into second-category voice signal feature maps, so as to obtain a first voice signal feature map data set that may include the first-category voice signal feature maps and the second-category voice signal feature maps;
the training module 706 is configured to train an initial DenseNet network with the first voice signal feature map data set, and adjust the weight parameters of each layer of the initial DenseNet network based on a loss function; when the loss function falls below a preset value, the weight parameters of each layer of the initial DenseNet network are locked to obtain the target DenseNet network model.
Further, referring specifically to FIG. 7C, which is yet another schematic structural diagram of a fake speech recognition device provided by an embodiment of this application, the fake speech recognition device may further include:
an editing module 707, configured to perform a masking operation on part of the frequency features of some voice signal feature maps in the first voice signal feature map data set, so as to convert the first voice signal feature map data set into a second voice signal feature map data set;
the training module 706 is further specifically configured to train the initial DenseNet network with the second voice signal feature map data set.
Further, the target DenseNet network model may sequentially include a first convolutional layer, a first channel stretch module, a first transition layer, a second channel stretch module, a second transition layer, a third channel stretch module, a first fully connected layer, and a second fully connected layer; the first convolutional layer, the first channel stretch module, the first transition layer, the second channel stretch module, the second transition layer, and the third channel stretch module are used to sequentially extract the features of the feature map of the voice signal to be recognized and output a first feature map, and the first fully connected layer and the second fully connected layer are used to further extract the features of the first feature map and output the binary classification result according to the extracted features.
Further, each of the first channel stretch module, the second channel stretch module, and the third channel stretch module may include 4 upper structures and 4 lower structures; each upper structure may sequentially include a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block, and each lower structure may sequentially include a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
Further, the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer are all convolutional layers with a kernel size of 1×1; the second convolutional layer and the fifth convolutional layer are used to reduce the number of channels of the input feature maps; the fourth convolutional layer is used to splice the feature maps output by the 4 third convolutional layers and input the result into the first SE block for processing; the seventh convolutional layer is used to splice the feature maps output by the 4 sixth convolutional layers and input the result into the second SE block for processing; the first SE block is used to assign each feature map input from the fourth convolutional layer a corresponding weight per channel, and the second SE block is used to assign each feature map input from the seventh convolutional layer a corresponding weight per channel.
Further, the first transformation module 702 is specifically configured to perform framing and windowing operations on the voice to be recognized and then perform the STFT transformation processing.
To solve the above technical problem, an embodiment of this application further provides a computer device. Referring specifically to FIG. 8, FIG. 8 is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 8 includes a memory 81, a processor 82, and a network interface 83 that are communicatively connected to each other through a system bus. It should be pointed out that the figure only shows the computer device 8 with components 81-83, but it should be understood that not all of the illustrated components are required to be implemented, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, and so on.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device can interact with the user through a keyboard, mouse, remote control, touch panel, or voice control device.
The memory 81 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, and so on. In some embodiments, the memory 81 may be an internal storage unit of the computer device 8, such as the hard disk or memory of the computer device 8. In other embodiments, the memory 81 may also be an external storage device of the computer device 8, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the computer device 8. Of course, the memory 81 may also include both the internal storage unit of the computer device 8 and its external storage device. In this embodiment, the memory 81 is generally used to store the operating system and the various application software installed on the computer device 8, for example, to implement the following steps of the method for recognizing fake speech:
acquiring a voice to be recognized;
performing STFT transformation processing on the voice to be recognized, and converting the voice to be recognized into a feature map of the voice signal to be recognized;
inputting the feature map of the voice signal to be recognized into a target DenseNet network model, and outputting a binary classification result of whether the voice to be recognized is real voice or fake voice.
In addition, the memory 81 can also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 82 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 82 is generally used to control the overall operation of the computer device 8. In this embodiment, the processor 82 is configured to run the computer-readable instructions or process the data stored in the memory 81, for example, to run the computer-readable instructions of the method for recognizing fake speech in the embodiments shown in FIG. 1, 5, or 6.
The network interface 83 may include a wireless network interface or a wired network interface, and the network interface 83 is generally used to establish a communication connection between the computer device 8 and other electronic devices.
In the embodiments of this application, the processor 82 on the computer device 8 executes the computer-readable instructions of the method for recognizing fake speech in the embodiments shown in FIG. 1, 5, or 6, thereby providing a neural-network-based method for automatically identifying fake speech.
This application further provides another implementation, namely a computer-readable storage medium, where the computer-readable storage medium may be non-volatile or volatile. The computer-readable storage medium stores computer-readable instructions that can be executed by at least one processor, so that the at least one processor executes the following steps of the method for recognizing fake speech:
acquiring a voice to be recognized;
performing STFT transformation processing on the voice to be recognized, and converting the voice to be recognized into a feature map of the voice signal to be recognized;
inputting the feature map of the voice signal to be recognized into a target DenseNet network model, and outputting a binary classification result of whether the voice to be recognized is real voice or fake voice.
In the embodiments of this application, computer-readable instructions are stored on the computer-readable storage medium such that, when executed by at least one processor, they implement the method for recognizing fake speech in the embodiments shown in FIG. 1, 5, or 6, thereby providing a neural-network-based method for automatically identifying fake speech.
In addition, it should be noted that this application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on. This application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in local and remote computer storage media including storage devices.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a computer-readable storage medium, and when the program is executed, it can include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM), etc.
It should be understood that although the steps in the flowcharts of the drawings are displayed in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in the flowcharts of the drawings may include multiple sub-steps or stages, which are not necessarily completed at the same moment but can be executed at different moments, and their execution order is not necessarily sequential; they can be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the recognition method described in each embodiment of this application.
Obviously, the embodiments described above are only some of the embodiments of this application, not all of them; the drawings show preferred embodiments of this application but do not limit the scope of the patent. This application can be implemented in many different forms; rather, these embodiments are provided so that the disclosure of this application will be understood more thoroughly and comprehensively. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific implementations or replace some of the technical features with equivalents. Any equivalent structure made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. 一种伪冒语音的识别方法,其中,所述识别方法包括:
    获取待识别语音;
    对所述待识别语音进行STFT变换处理,将所述待识别语音转换为待识别语音信号特征图;
    将所述待识别语音信号特征图输入目标DenseNet网络模型,输出所述待识别语音为真实语音或伪冒语音的二分类判别结果。
  2. 根据权利要求1所述的识别方法,其中,所述将所述待识别语音信号特征图输入目标DenseNet网络模型之前,所述识别方法还包括:
    获取真实语音数据集和伪冒语音数据集;
    利用STFT变换将所述真实语音数据集中的语音转换为第一类别语音信号特征图,将所述伪冒语音数据集中的语音转换为第二类别语音信号特征图,以得到包括所述第一类别语音信号特征图和所述第二类别语音信号特征图的第一语音信号特征图数据集;
    利用所述第一语音信号特征图数据集训练初始DenseNet网络,并基于损失函数对所述初始DenseNet网络各层的权重参数进行调整,直至所述损失函数小于预设值时,锁定所述初始DenseNet网络各层的权重参数以得到所述目标DenseNet网络模型。
  3. 根据权利要求2所述的识别方法,其中,所述利用STFT变换将所述真实语音数据集中的语音转换为第一类别语音信号特征图,将所述伪冒语音数据集中的语音转换为第二类别语音信号特征图,以得到包括所述第一类别语音信号特征图和所述第二类别语音信号特征图的第一语音信号特征图数据集之后,所述识别方法还包括:
    将所述第一语音信号特征图数据集中部分语音信号特征图的部分频率特征执行遮掩操作,以将所述第一语音信号特征图数据集转换为第二语音信号特征图数据集;
    所述利用所述第一语音信号特征图数据集训练初始DenseNet网络,包括:
    利用所述第二语音信号特征图数据集训练初始DenseNet网络。
  4. 根据权利要求1至3中任一项所述的识别方法,其中,所述目标DenseNet网络模型依次包括第一卷积层、第一通道伸缩模块、第一过渡层、第二通道伸缩模块、第二过渡层、第三通道伸缩模块、第一全连接层和第二全连接层,所述第一卷积层、所述第一通道伸缩模块、所述第一过渡层、所述第二通道伸缩模块、所述第二过渡层和所述第三通道伸缩模块用于按照顺序依次提取所述待识别语音信号特征图的特征并输出第一特征图,所述第一全连接层和所述第二全连接层用于进一步提取所述第一特征图的特征,并根据提取的特征输出二分类的判别结果。
  5. 根据权利要求4所述的识别方法,其中,所述第一通道伸缩模块、所述第二通道伸缩模块和所述第三通道伸缩模块均分别包括4个上层结构和4个下层结构,所述上层结构依次包括第二卷积层、4个并列的第三卷积层、第四卷积层和第一SE block,所述下层结构依次包括第五卷积层、4个并列的第六卷积层、第七卷积层和第二SE block。
  6. 根据权利要求5所述的识别方法,其中,所述第二卷积层、所述第四卷积层、所述第五卷积层和所述第七卷积层均为核大小为1×1的卷积层,所述第二卷积层和所述第五 卷积层用于减少输入的特征图的通道数,所述第四卷积层用于对4个所述第三卷积层输出的特征图进行拼接操作并输入所述第一SE block进行处理,所述第七卷积层用于对4个所述第六卷积层输出的特征图进行拼接操作并输入所述第二SE block进行处理,所述第一SE block用于为所述第四卷积层输入的各个特征图按照通道分配对应的权重,所述第二SE block用于为所述第七卷积层输入的各个特征图按照通道分配对应的权重。
  7. The recognition method according to any one of claims 1 to 3, wherein performing the STFT transform processing on the speech to be recognized comprises:
    performing framing and windowing operations on the speech to be recognized, followed by the STFT transform processing.
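A compact numeric illustration of claim 7's framing and windowing ahead of the transform; the frame and hop sizes and the Hamming window are assumed choices:

```python
# Framing and windowing before the STFT, per claim 7 (illustrative values).
import numpy as np

def frame_window_stft(signal, frame_len=400, hop=160, n_fft=512):
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])  # framing + windowing
    return np.fft.rfft(frames, n=n_fft, axis=1)    # per-frame FFT = the STFT
```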
  8. A device for recognizing fake speech, wherein the recognition device comprises:
    a first acquisition module, configured to acquire a speech to be recognized;
    a first transform module, configured to perform STFT transform processing on the speech to be recognized to convert the speech to be recognized into a speech signal feature map to be recognized;
    a processing module, configured to input the speech signal feature map to be recognized into a target DenseNet network model and output a binary classification result indicating that the speech to be recognized is real speech or fake speech.
  9. A computer device, comprising a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the following steps of the method for recognizing fake speech:
    acquiring a speech to be recognized;
    performing STFT transform processing on the speech to be recognized to convert the speech to be recognized into a speech signal feature map to be recognized;
    inputting the speech signal feature map to be recognized into a target DenseNet network model, and outputting a binary classification result indicating that the speech to be recognized is real speech or fake speech.
  10. The computer device according to claim 9, wherein, before inputting the speech signal feature map to be recognized into the target DenseNet network model, the processor, when executing the computer-readable instructions, further implements the following steps:
    acquiring a real speech dataset and a fake speech dataset;
    using an STFT transform to convert the speech in the real speech dataset into first-category speech signal feature maps and the speech in the fake speech dataset into second-category speech signal feature maps, so as to obtain a first speech signal feature map dataset comprising the first-category speech signal feature maps and the second-category speech signal feature maps;
    training an initial DenseNet network with the first speech signal feature map dataset, and adjusting the weight parameters of each layer of the initial DenseNet network based on a loss function, until the loss function is smaller than a preset value, then locking the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
  11. The computer device according to claim 10, wherein, after using the STFT transform to convert the speech in the real speech dataset into the first-category speech signal feature maps and the speech in the fake speech dataset into the second-category speech signal feature maps so as to obtain the first speech signal feature map dataset comprising the first-category speech signal feature maps and the second-category speech signal feature maps, the processor, when executing the computer-readable instructions, further implements the following steps:
    performing a masking operation on some frequency features of some of the speech signal feature maps in the first speech signal feature map dataset, so as to convert the first speech signal feature map dataset into a second speech signal feature map dataset;
    wherein training the initial DenseNet network with the first speech signal feature map dataset comprises:
    training the initial DenseNet network with the second speech signal feature map dataset.
  12. The computer device according to any one of claims 9 to 11, wherein the target DenseNet network model sequentially comprises a first convolutional layer, a first channel scaling module, a first transition layer, a second channel scaling module, a second transition layer, a third channel scaling module, a first fully connected layer, and a second fully connected layer; the first convolutional layer, the first channel scaling module, the first transition layer, the second channel scaling module, the second transition layer, and the third channel scaling module are used to sequentially extract the features of the speech signal feature map to be recognized and output a first feature map, and the first fully connected layer and the second fully connected layer are used to further extract the features of the first feature map and output a binary classification result according to the extracted features.
  13. The computer device according to claim 12, wherein the first channel scaling module, the second channel scaling module, and the third channel scaling module each comprise 4 upper-layer structures and 4 lower-layer structures; each upper-layer structure sequentially comprises a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block, and each lower-layer structure sequentially comprises a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
  14. The computer device according to claim 13, wherein the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer are all convolutional layers with a kernel size of 1×1; the second convolutional layer and the fifth convolutional layer are used to reduce the number of channels of the input feature maps; the fourth convolutional layer is used to concatenate the feature maps output by the 4 third convolutional layers and input the result into the first SE block for processing; the seventh convolutional layer is used to concatenate the feature maps output by the 4 sixth convolutional layers and input the result into the second SE block for processing; the first SE block is used to assign corresponding weights by channel to the feature maps input by the fourth convolutional layer, and the second SE block is used to assign corresponding weights by channel to the feature maps input by the seventh convolutional layer.
  15. A computer-readable storage medium, wherein computer-readable instructions are stored on the computer-readable storage medium, and when the computer-readable instructions are executed by a processor, the following steps of the method for recognizing fake speech are implemented:
    acquiring a speech to be recognized;
    performing STFT transform processing on the speech to be recognized to convert the speech to be recognized into a speech signal feature map to be recognized;
    inputting the speech signal feature map to be recognized into a target DenseNet network model, and outputting a binary classification result indicating that the speech to be recognized is real speech or fake speech.
  16. The computer-readable storage medium according to claim 15, wherein, before inputting the speech signal feature map to be recognized into the target DenseNet network model, the computer-readable instructions, when executed by the processor, cause the processor to further execute the following steps:
    acquiring a real speech dataset and a fake speech dataset;
    using an STFT transform to convert the speech in the real speech dataset into first-category speech signal feature maps and the speech in the fake speech dataset into second-category speech signal feature maps, so as to obtain a first speech signal feature map dataset comprising the first-category speech signal feature maps and the second-category speech signal feature maps;
    training an initial DenseNet network with the first speech signal feature map dataset, and adjusting the weight parameters of each layer of the initial DenseNet network based on a loss function, until the loss function is smaller than a preset value, then locking the weight parameters of each layer of the initial DenseNet network to obtain the target DenseNet network model.
  17. The computer-readable storage medium according to claim 16, wherein, after using the STFT transform to convert the speech in the real speech dataset into the first-category speech signal feature maps and the speech in the fake speech dataset into the second-category speech signal feature maps so as to obtain the first speech signal feature map dataset comprising the first-category speech signal feature maps and the second-category speech signal feature maps, the computer-readable instructions, when executed by the processor, cause the processor to further execute the following steps:
    performing a masking operation on some frequency features of some of the speech signal feature maps in the first speech signal feature map dataset, so as to convert the first speech signal feature map dataset into a second speech signal feature map dataset;
    wherein training the initial DenseNet network with the first speech signal feature map dataset comprises:
    training the initial DenseNet network with the second speech signal feature map dataset.
  18. The computer-readable storage medium according to any one of claims 15 to 17, wherein the target DenseNet network model sequentially comprises a first convolutional layer, a first channel scaling module, a first transition layer, a second channel scaling module, a second transition layer, a third channel scaling module, a first fully connected layer, and a second fully connected layer; the first convolutional layer, the first channel scaling module, the first transition layer, the second channel scaling module, the second transition layer, and the third channel scaling module are used to sequentially extract the features of the speech signal feature map to be recognized and output a first feature map, and the first fully connected layer and the second fully connected layer are used to further extract the features of the first feature map and output a binary classification result according to the extracted features.
  19. The computer-readable storage medium according to claim 18, wherein the first channel scaling module, the second channel scaling module, and the third channel scaling module each comprise 4 upper-layer structures and 4 lower-layer structures; each upper-layer structure sequentially comprises a second convolutional layer, 4 parallel third convolutional layers, a fourth convolutional layer, and a first SE block, and each lower-layer structure sequentially comprises a fifth convolutional layer, 4 parallel sixth convolutional layers, a seventh convolutional layer, and a second SE block.
  20. The computer-readable storage medium according to claim 19, wherein the second convolutional layer, the fourth convolutional layer, the fifth convolutional layer, and the seventh convolutional layer are all convolutional layers with a kernel size of 1×1; the second convolutional layer and the fifth convolutional layer are used to reduce the number of channels of the input feature maps; the fourth convolutional layer is used to concatenate the feature maps output by the 4 third convolutional layers and input the result into the first SE block for processing; the seventh convolutional layer is used to concatenate the feature maps output by the 4 sixth convolutional layers and input the result into the second SE block for processing; the first SE block is used to assign corresponding weights by channel to the feature maps input by the fourth convolutional layer, and the second SE block is used to assign corresponding weights by channel to the feature maps input by the seventh convolutional layer.
PCT/CN2020/118450 2020-07-16 2020-09-28 Method and device for recognizing fake speech, and computer-readable storage medium WO2021135454A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010688484.9 2020-07-16
CN202010688484.9A CN111933154B (zh) 2020-07-16 2020-07-16 Method and device for recognizing fake speech, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2021135454A1 true WO2021135454A1 (zh) 2021-07-08

Family

ID=73313228

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118450 WO2021135454A1 (zh) 2020-07-16 2020-09-28 Method and device for recognizing fake speech, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN111933154B (zh)
WO (1) WO2021135454A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113327621A (zh) * 2021-06-09 2021-08-31 携程旅游信息技术(上海)有限公司 Model training method, user identification method, system, device and medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108281158A (zh) * 2018-01-12 2018-07-13 平安科技(深圳)有限公司 Deep learning-based voice liveness detection method, server and storage medium
CN110767218A (zh) * 2019-10-31 2020-02-07 南京励智心理大数据产业研究院有限公司 End-to-end speech recognition method, system, apparatus and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190172476A1 (en) * 2017-12-04 2019-06-06 Apple Inc. Deep learning driven multi-channel filtering for speech enhancement
US20200005046A1 (en) * 2018-07-02 2020-01-02 Adobe Inc. Brand safety in video content
CN109767776A (zh) * 2019-01-14 2019-05-17 广东技术师范学院 Spoofed speech detection method based on dense neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220172739A1 (en) * 2020-12-02 2022-06-02 Google Llc Self-Supervised Speech Representations for Fake Audio Detection
US11756572B2 (en) * 2020-12-02 2023-09-12 Google Llc Self-supervised speech representations for fake audio detection
US20230386506A1 (en) * 2020-12-02 2023-11-30 Google Llc Self-supervised speech representations for fake audio detection

Also Published As

Publication number Publication date
CN111933154B (zh) 2024-02-13
CN111933154A (zh) 2020-11-13

Similar Documents

Publication Publication Date Title
WO2021208287A1 Voice endpoint detection method and apparatus for emotion recognition, electronic device, and storage medium
WO2020177380A1 Short text-based voiceprint detection method, apparatus, device, and storage medium
CN110443692B Enterprise credit review method, apparatus, device, and computer-readable storage medium
CN112562691B Voiceprint recognition method, apparatus, computer device, and storage medium
CN106887225B Acoustic feature extraction method and apparatus based on convolutional neural network, and terminal device
WO2020073665A1 Method, system, and storage medium for emotion recognition of speech on the spectrum
CN110276259A Lip reading recognition method, apparatus, computer device, and storage medium
CN107221320A Method, apparatus, device, and computer storage medium for training an acoustic feature extraction model
WO2021208728A1 Neural network-based voice endpoint detection method, apparatus, device, and medium
CN112328761B Intent label setting method, apparatus, computer device, and storage medium
CN110633991A Risk identification method, apparatus, and electronic device
WO2021135454A1 Method and device for recognizing fake speech, and computer-readable storage medium
CN111653274B Wake word recognition method, apparatus, and storage medium
Tiwari et al. Virtual home assistant for voice based controlling and scheduling with short speech speaker identification
CN112669876B Emotion recognition method, apparatus, computer device, and storage medium
CN107341464A Method, device, and system for providing friend-making candidates
CN111507259B Facial feature extraction method, apparatus, and electronic device
CN113450822B Speech enhancement method, apparatus, device, and storage medium
US10446138B2 (en) System and method for assessing audio files for transcription services
WO2021128847A1 Terminal interaction method, apparatus, computer device, and storage medium
WO2023222071A1 Speech signal processing method, apparatus, device, and medium
Shah et al. Speech recognition using spectrogram-based visual features
Koteswararao et al. Multichannel speech separation using hybrid GOMF and enthalpy-based deep neural networks
Bear et al. Comparing heterogeneous visual gestures for measuring the diversity of visual speech signals
CN114241411B Counting model processing method and apparatus based on object detection, and computer device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 20908575
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 20908575
    Country of ref document: EP
    Kind code of ref document: A1