WO2022116420A1 - Speech event detection method and apparatus, electronic device, and computer storage medium - Google Patents

Speech event detection method and apparatus, electronic device, and computer storage medium Download PDF

Info

Publication number
WO2022116420A1
WO2022116420A1 · PCT/CN2021/082872
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
classification model
feature
speech
event
Prior art date
Application number
PCT/CN2021/082872
Other languages
French (fr)
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2022116420A1

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/45 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window

Definitions

  • the present application relates to the technical field of artificial intelligence, and in particular, to a voice event detection method, apparatus, electronic device, and computer-readable storage medium.
  • Voice event detection refers to the detection of events in audio such as human voices, singing, tapping, dog barking, and car horns, and the marking of their start and end times.
  • Traditional speech event detection methods include methods based on signal processing and methods based on hidden Markov models.
  • the inventor realized that the occurrence of events is often highly uncertain and that large numbers of speech event samples are difficult to collect, so the accuracy of traditional speech event detection methods is low; at the same time, for a speech event that occurs at random, a model that makes frame-level judgments may classify different frames of the same event differently, making the event detection result unstable.
  • a voice event detection method provided by this application includes:
  • acquiring audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence;
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the present application also provides a voice event detection device, the device comprising:
  • a feature extraction module, used to acquire the audio to be detected and perform acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • a self-attention module, used to perform feature analysis on the speech frame feature sequence by using a classification model based on the self-attention mechanism to obtain a hidden state sequence to be identified;
  • an identification module, configured to perform event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence;
  • a smoothing module, used to smooth the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the present application also provides an electronic device, the electronic device comprising:
  • the processor executes the computer program stored in the memory to implement the following steps:
  • acquiring audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence;
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the present application also provides a computer-readable storage medium, including a stored data area and a stored program area; the stored data area stores created data, and the stored program area stores a computer program; wherein, when executed by a processor, the computer program implements the following steps:
  • acquiring audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence;
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • FIG. 1 is a schematic flowchart of a voice event detection method provided by an embodiment of the present application.
  • FIG. 2 is a schematic block diagram of a voice event detection apparatus provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing a voice event detection method provided by an embodiment of the present application.
  • the embodiment of the present application provides a voice event detection method.
  • the execution body of the voice event detection method includes, but is not limited to, at least one of the electronic devices, such as a server and a terminal, that can be configured to execute the method provided by the embodiments of the present application.
  • the voice event detection method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • FIG. 1 is a schematic flowchart of a voice event detection method according to an embodiment of the present application.
  • the voice event detection method includes:
  • the audio to be detected is audio that contains various sound events, such as human voices, singing, percussion, dog barking, and car horns.
  • the to-be-detected audio can be obtained from a database.
  • the audio to be detected can be obtained from a node of a blockchain.
  • in detail, performing acoustic feature extraction on the to-be-detected audio to obtain a speech frame feature sequence includes:
  • framing the audio to be detected to obtain a speech frame sequence;
  • obtaining the corresponding spectrum of each speech frame in the speech frame sequence through a fast Fourier transform;
  • converting the spectrum into a Mel spectrum through a Mel filter bank;
  • performing cepstral analysis on the Mel spectrum to obtain the speech frame feature sequence corresponding to the audio to be detected.
  • to extract the acoustic features, the embodiment of the present application first divides the audio to be detected into frames, for example one frame every 10 milliseconds; the Mel spectrum is then calculated for each speech frame and cepstral analysis is performed, thereby extracting the acoustic features of the audio to be detected.
  • the Mel spectrum is a commonly used speech feature representation that effectively captures the basic information of speech and facilitates subsequent analysis and processing.
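  • as a rough illustration of this extraction pipeline, the following sketch uses librosa; the patent names no library, and the sampling rate, window length, Mel band count, and number of cepstral coefficients are all assumptions chosen for illustration.

```python
# Hedged sketch of the described feature extraction: framing, FFT,
# Mel filter bank, then cepstral analysis. All parameter values are
# illustrative assumptions, not taken from the patent.
import librosa
import numpy as np

def extract_frame_features(path: str) -> np.ndarray:
    audio, sr = librosa.load(path, sr=16000)     # the audio to be detected
    hop = int(0.010 * sr)                        # one frame every 10 ms
    win = int(0.025 * sr)                        # 25 ms analysis window (assumed)
    # Per-frame FFT -> Mel filter bank -> Mel spectrum
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=512, hop_length=hop, win_length=win, n_mels=40)
    # Cepstral analysis on the log-Mel spectrum (MFCC-style coefficients)
    feats = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)
    return feats.T                               # speech frame feature sequence x_t
```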
  • the classification model is a deep learning model that can recognize speech features and is used to detect and classify different sound events.
  • the classification model includes an input layer, a hidden layer and a fully connected layer.
  • the self-attention mechanism is a mechanism that attends to relevant details according to the detection target, rather than analyzing the input globally.
  • feature vectors recomputed through the self-attention mechanism can fully account for the contextual relationships within continuous audio.
  • in detail, performing feature analysis on the speech frame feature sequence using the classification model based on the self-attention mechanism to obtain the hidden state sequence to be identified includes:
  • dividing the speech frame feature sequence into multiple windows of a preset length;
  • selecting one of the windows as the current window in chronological order;
  • computing, through the input layer of the classification model, the feature vector of the current window and the feature vector of the next window, and obtaining the feature vector of the previous window;
  • merging, through the input layer of the classification model, the feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window to obtain a common speech feature;
  • inputting the common speech feature into the hidden layer of the classification model and obtaining the hidden state sequence to be recognized based on the self-attention mechanism.
  • assuming the speech frame feature sequence is x_t, the embodiment of the present application may divide the speech frame feature sequence into several windows: z_t = [x_t, …, x_{t+T}]
  • where t is the frame index, T is the window length, and x_t is the length-1 (single-frame) feature at position t in the speech frame feature sequence; a windowing sketch follows.
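  • the sketch below splits the frame feature sequence into such windows; whether windows overlap is not specified in the source, so non-overlapping windows are an assumption here.

```python
import numpy as np

def make_windows(frame_feats: np.ndarray, T: int) -> list[np.ndarray]:
    """Split the speech frame feature sequence into windows z_t = [x_t, ..., x_{t+T}].

    Non-overlapping windows are an assumption; the patent does not specify a stride.
    """
    return [frame_feats[t:t + T]
            for t in range(0, len(frame_feats) - T + 1, T)]
```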
  • the hidden layer of the classification model described in the embodiments of the present application is a deep neural network composed of several layers of self-attention mechanism networks.
  • inputting the common speech feature into the hidden layer of the classification model and obtaining the hidden state sequence to be recognized based on the self-attention mechanism includes:
  • inputting the common speech feature into the first-layer self-attention network of the hidden layer of the classification model for calculation to obtain a first hidden state sequence;
  • taking the first hidden state sequence as the input of the second-layer self-attention network in the hidden layer of the classification model to obtain a second hidden state sequence;
  • taking the second hidden state sequence as the input of the third-layer self-attention network, and repeating this calculation step layer by layer until the last self-attention layer in the hidden layer of the classification model is reached, to obtain the hidden state sequence to be recognized.
  • for example, if the common speech feature of the current window is z_t, the hidden state sequence o_t calculated by the stacked self-attention layers of the hidden layer is: h_t^(0) = z_t; h_t^(l) = D_l(h_t^(l-1)) for l = 1, …, L; o_t = h_t^(L)
  • where h_t^(l) denotes the hidden state sequence of the l-th layer for the t-th window, there are L layers in total, D_l is the l-th self-attention network in the hidden layer of the classification model, and the output of layer l-1 serves as the input of the next self-attention layer; a sketch of this stack follows.
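  • the following PyTorch-style sketch stacks self-attention layers in this way; the patent does not specify the attention variant, so nn.MultiheadAttention and the head count are assumptions.

```python
import torch
import torch.nn as nn

class HiddenLayers(nn.Module):
    """L stacked self-attention layers: h^(0) = z_t, h^(l) = D_l(h^(l-1)), o_t = h^(L)."""

    def __init__(self, dim: int, num_layers: int, heads: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_layers))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = z                              # z: (batch, window_len, dim)
        for attn in self.layers:           # output of layer l-1 feeds layer l
            h, _ = attn(h, h, h)           # self-attention: query = key = value
        return h                           # o_t, the hidden state sequence
```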
  • in detail, using the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence includes:
  • mapping the hidden state sequence to be identified to a multi-dimensional space vector through the fully connected layer of the classification model;
  • performing a probability calculation on the multi-dimensional space vector using a preset activation function to obtain the event label sequence.
  • wherein the event label sequence gives, for every speech frame in the audio to be detected, the probability of each sound event class.
  • the embodiment of the present application may define the categories of sound events contained in one window as y_t = [r_1, …, r_N], where r_i denotes the i-th sound event class: r_i = 1 if that sound event is present in the window and r_i = 0 otherwise, with N sound event classes in total; a sketch of this classification head follows.
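  • a minimal sketch of the fully connected output layer: a sigmoid activation is assumed here because the window labels r_i are independent 0/1 indicators (multi-label), while the patent only says "a preset activation function".

```python
import torch
import torch.nn as nn

class EventHead(nn.Module):
    """Fully connected layer + activation: hidden states -> per-class probabilities."""

    def __init__(self, dim: int, num_events: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_events)  # map to an N-dimensional space vector

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        # Sigmoid (assumed) gives an independent probability for each event class r_i
        return torch.sigmoid(self.fc(o))      # the event label sequence
```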
  • optionally, before the classification model based on the self-attention mechanism is used to perform feature analysis on the speech frame feature sequence to obtain the hidden state sequence to be identified, the method further includes training the classification model:
  • obtaining a training sample set and labeling the training sample set to obtain a real event sequence;
  • inputting the training sample set into the classification model to obtain a predicted label sequence;
  • calculating the training error of the predicted label sequence using a preset loss function and the real event sequence, and updating the classification model according to the training error to obtain the trained classification model.
  • wherein the real event sequence y_t described in the embodiment of the present application is obtained by labeling.
  • assuming the predicted label sequence of the classification model is ŷ_t, this embodiment of the present application can use a cross-entropy loss function to calculate the training error (Loss), so as to update and learn the classification model; training stops once the classification model converges, yielding the trained classification model.
  • the cross-entropy loss function may be written, for the N 0/1 event labels, as: Loss = −Σ_{i=1}^{N} [ y_{t,i} log ŷ_{t,i} + (1 − y_{t,i}) log(1 − ŷ_{t,i}) ]
  • where Loss is the training error, N is the total number of sound event classes, y_t is the real event sequence, and ŷ_t is the predicted label sequence; a training-step sketch follows.
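  • one hedged training step consistent with this loss, using binary cross-entropy over the N classes; the model and optimizer are passed in, and the optimizer choice is an illustrative assumption.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module,
               optimizer: torch.optim.Optimizer,
               feats: torch.Tensor,
               y_true: torch.Tensor) -> float:
    """One update of the classification model.

    y_true holds the labeled real event sequence y_t as 0/1 values r_i.
    """
    criterion = nn.BCELoss()             # cross-entropy over the 0/1 event labels
    y_pred = model(feats)                # the predicted label sequence
    loss = criterion(y_pred, y_true)     # the training error (Loss)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # update the classification model
    return loss.item()
```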
  • the event label sequence is output frame by frame, so glitches and jitter in the result are unavoidable; the embodiment of the present application therefore uses a sequence matching network (SMN) to smooth the event label sequence along the time dimension, filling sudden gaps in the event predictions as well as very brief glitches.
  • in detail, smoothing the event label sequence to obtain the speech event detection result corresponding to the speech to be detected includes:
  • smoothing the event label sequence using a preset sequence matching network to obtain a smooth event label sequence;
  • determining, according to the endpoints in the smooth event label sequence, the start time and end time of each event contained in the speech to be detected, to obtain multiple event detection results;
  • aggregating the multiple event detection results to obtain the speech event detection result corresponding to the speech to be detected.
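  • the patent gives no internals for the sequence matching network, so the sketch below substitutes a simple median filter to show the smooth-then-endpoint flow; the threshold and filter width are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import median_filter

def detect_events(probs: np.ndarray, threshold: float = 0.5, width: int = 5):
    """Smooth per-frame event probabilities, then read off start/end frames.

    probs: (frames, N) array of per-class probabilities. A median filter
    stands in for the trained sequence matching network.
    """
    binary = (probs > threshold).astype(int)
    smooth = median_filter(binary, size=(width, 1))   # fill gaps, drop brief glitches
    results = []
    for c in range(smooth.shape[1]):
        edges = np.diff(np.concatenate(([0], smooth[:, c], [0])))
        starts = np.flatnonzero(edges == 1)           # event start frames (endpoints)
        ends = np.flatnonzero(edges == -1)            # event end frames
        results += [(c, int(s), int(e)) for s, e in zip(starts, ends)]
    return results                                    # (class, start, end) per event
```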
  • the classification model described in the embodiments of the present application can detect audio in real time, so that it can be applied to scenarios with high real-time requirements, such as sports competitions, live broadcasts and other fields, and has high practical application value.
  • in this embodiment of the present application, a classification model based on a self-attention mechanism is used to perform feature analysis on the speech frame feature sequence.
  • the classification model merges the features of multiple windows, which improves the accuracy of the speech features.
  • because the classification model is based on a self-attention mechanism, it can improve the accuracy of event detection, and smoothing the event label sequence improves the stability of the detection results; therefore, the voice event detection method, apparatus, and computer-readable storage medium proposed in this application can improve the stability and accuracy of voice event detection.
  • FIG. 2 is a schematic block diagram of the voice event detection apparatus of the present application.
  • the voice event detection apparatus 100 described in this application can be installed in an electronic device. According to the implemented functions, the voice event detection apparatus may include a feature extraction module 101 , a self-attention module 102 , a recognition module 103 and a smoothing module 104 .
  • the modules described in this application may also be referred to as units, which refer to a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the feature extraction module 101 is configured to acquire the audio to be detected, and perform acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence.
  • the audio to be detected is audio that contains various sound events, such as human voices, singing, percussion, dog barking, and car horns.
  • the to-be-detected audio can be obtained from a database.
  • the audio to be detected can be obtained from a node of a blockchain.
  • in detail, when performing acoustic feature extraction on the to-be-detected audio to obtain a speech frame feature sequence, the feature extraction module 101 specifically performs the following operations:
  • framing the audio to be detected to obtain a speech frame sequence;
  • obtaining the corresponding spectrum of each speech frame in the speech frame sequence through a fast Fourier transform;
  • converting the spectrum into a Mel spectrum through a Mel filter bank;
  • performing cepstral analysis on the Mel spectrum to obtain the speech frame feature sequence corresponding to the audio to be detected.
  • to extract the acoustic features, the embodiment of the present application first divides the audio to be detected into frames, for example one frame every 10 milliseconds; the Mel spectrum is then calculated for each speech frame and cepstral analysis is performed, thereby extracting the acoustic features of the audio to be detected.
  • the Mel spectrum is a commonly used speech feature representation that effectively captures the basic information of speech and facilitates subsequent analysis and processing.
  • the self-attention module 102 is configured to perform feature analysis on the speech frame feature sequence using a classification model based on the self-attention mechanism to obtain a hidden state sequence to be identified.
  • the classification model is a deep learning model that can recognize speech features and is used to detect and classify different sound events.
  • the classification model includes an input layer, a hidden layer and a fully connected layer.
  • the self-attention mechanism is a mechanism that attends to relevant details according to the detection target, rather than analyzing the input globally.
  • feature vectors recomputed through the self-attention mechanism can fully account for the contextual relationships within continuous audio.
  • in detail, the self-attention module 102 is specifically used for:
  • dividing the speech frame feature sequence into multiple windows of a preset length;
  • selecting one of the windows as the current window in chronological order;
  • computing, through the input layer of the classification model, the feature vector of the current window and the feature vector of the next window, and obtaining the feature vector of the previous window;
  • merging, through the input layer of the classification model, the feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window to obtain a common speech feature;
  • inputting the common speech feature into the hidden layer of the classification model and obtaining the hidden state sequence to be recognized based on the self-attention mechanism.
  • assuming the speech frame feature sequence is x_t, the embodiment of the present application may divide the speech frame feature sequence into several windows: z_t = [x_t, …, x_{t+T}]
  • where t is the frame index, T is the window length, and x_t is the length-1 (single-frame) feature at position t in the speech frame feature sequence.
  • the hidden layer of the classification model described in the embodiments of the present application is a deep neural network composed of several layers of self-attention mechanism networks.
  • inputting the common speech feature into the hidden layer of the classification model and obtaining the hidden state sequence to be recognized based on the self-attention mechanism includes:
  • inputting the common speech feature into the first-layer self-attention network of the hidden layer of the classification model for calculation to obtain a first hidden state sequence;
  • taking the first hidden state sequence as the input of the second-layer self-attention network in the hidden layer of the classification model to obtain a second hidden state sequence;
  • taking the second hidden state sequence as the input of the third-layer self-attention network, and repeating this calculation step layer by layer until the last self-attention layer in the hidden layer of the classification model is reached, to obtain the hidden state sequence to be recognized.
  • for example, if the common speech feature of the current window is z_t, the hidden state sequence o_t calculated by the stacked self-attention layers of the hidden layer is: h_t^(0) = z_t; h_t^(l) = D_l(h_t^(l-1)) for l = 1, …, L; o_t = h_t^(L)
  • where h_t^(l) denotes the hidden state sequence of the l-th layer for the t-th window, there are L layers in total, D_l is the l-th self-attention network in the hidden layer of the classification model, and the output of layer l-1 serves as the input of the next self-attention layer.
  • the identifying module 103 is configured to use the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence.
  • the identification module 103 is specifically used for:
  • mapping the hidden state sequence to be identified to a multi-dimensional space vector through the fully connected layer of the classification model;
  • Probability calculation is performed on the multi-dimensional space vector by using a preset activation function to obtain an event label sequence.
  • wherein the event label sequence gives, for every speech frame in the audio to be detected, the probability of each sound event class.
  • the embodiment of the present application may define the categories of sound events contained in one window as y_t = [r_1, …, r_N], where r_i denotes the i-th sound event class: r_i = 1 if that sound event is present in the window and r_i = 0 otherwise, with N sound event classes in total.
  • optionally, before the classification model based on the self-attention mechanism is used to perform feature analysis on the speech frame feature sequence to obtain the hidden state sequence to be identified, the method further includes training the classification model:
  • obtaining a training sample set and labeling the training sample set to obtain a real event sequence;
  • inputting the training sample set into the classification model to obtain a predicted label sequence;
  • calculating the training error of the predicted label sequence using a preset loss function and the real event sequence, and updating the classification model according to the training error to obtain the trained classification model.
  • wherein the real event sequence y_t described in the embodiment of the present application is obtained by labeling.
  • assuming the predicted label sequence of the classification model is ŷ_t, this embodiment of the present application can use a cross-entropy loss function to calculate the training error (Loss), so as to update and learn the classification model; training stops once the classification model converges, yielding the trained classification model.
  • the cross-entropy loss function may be written, for the N 0/1 event labels, as: Loss = −Σ_{i=1}^{N} [ y_{t,i} log ŷ_{t,i} + (1 − y_{t,i}) log(1 − ŷ_{t,i}) ]
  • where Loss is the training error, N is the total number of sound event classes, y_t is the real event sequence, and ŷ_t is the predicted label sequence.
  • the smoothing module 104 is configured to perform smoothing processing on the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the event label sequence is output frame by frame, so glitches and jitter in the result are unavoidable; the embodiment of the present application therefore uses a sequence matching network (SMN) to smooth the event label sequence along the time dimension, filling sudden gaps in the event predictions as well as very brief glitches.
  • the smoothing module 104 is specifically used for:
  • the event label sequence is smoothed by using a preset sequence matching network to obtain a smooth event label sequence
  • determining, according to the endpoints in the smooth event label sequence, the start time and end time of each event contained in the speech to be detected, to obtain multiple event detection results;
  • the multiple event detection results are collected to obtain a voice event detection result corresponding to the to-be-detected voice.
  • the classification model described in the embodiments of the present application can detect audio in real time, so that it can be applied to scenarios with high real-time requirements, such as sports competitions, live broadcasts and other fields, and has high practical application value.
  • FIG. 3 is a schematic structural diagram of an electronic device implementing the voice event detection method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a voice event detection program 12.
  • the memory 11 includes at least one type of readable storage medium, including flash memory, removable hard disks, multimedia cards, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disks, optical discs, and the like.
  • the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1.
  • the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in removable hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) equipped on the electronic device 1.
  • further, the memory 11 may include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the voice event detection program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control unit of the electronic device; it connects the various components of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the voice event detection program) and by calling the data stored in the memory 11.
  • the bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus can be divided into an address bus, a data bus, a control bus, and so on.
  • the bus is configured to implement connection and communication between the memory 11, the at least one processor 10, and other components.
  • FIG. 3 only shows an electronic device with certain components; those skilled in the art will understand that the structure shown in FIG. 3 does not limit the electronic device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components; preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface or a Bluetooth interface), which is typically used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface; the user interface may be a display and an input unit (such as a keyboard), and optionally may also be a standard wired interface or a wireless interface.
  • optionally, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, or the like.
  • the display may also appropriately be called a display screen or display unit, and is used to display the information processed in the electronic device 1 and to display a visualized user interface.
  • the voice event detection program 12 stored in the memory 11 of the electronic device 1 is a combination of multiple computer programs which, when run on the processor 10, can realize:
  • acquiring audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence;
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM).
  • the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor of an electronic device, can realize:
  • acquiring audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
  • performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
  • performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence;
  • smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  • the computer-usable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function, and the like, and the stored data area may store data created during use, etc.
  • modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • blockchain is essentially a decentralized database: a series of data blocks associated with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A speech event detection method, a speech event detection apparatus (100), an electronic device (1), and a computer-readable storage medium, relating to artificial intelligence technology. The method comprises: obtaining audio under detection, and performing acoustic feature extraction on the audio to obtain a speech frame feature sequence (S1); performing feature analysis on the speech frame feature sequence by using a self-attention-mechanism-based classification model to obtain a hidden state sequence to be identified (S2); performing event identification on the hidden state sequence by using the classification model to obtain an event label sequence (S3); and performing smoothing processing on the event label sequence to obtain a speech event detection result corresponding to the speech under detection (S4). The present application also relates to blockchain technology, and the audio under detection may be stored in a blockchain node. The stability and accuracy of speech event detection are improved.

Description

Voice event detection method, apparatus, electronic device and computer storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on December 1, 2020, with application number CN202011381842.8 and titled "Voice Event Detection Method, Apparatus, Electronic Device and Computer Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to a voice event detection method, apparatus, electronic device, and computer-readable storage medium.
Background
Voice event detection refers to the detection of events in audio such as human voices, singing, tapping, dog barking, and car horns, and the marking of their start and end times.
Traditional speech event detection methods include methods based on signal processing and methods based on hidden Markov models. The inventor realized that the occurrence of events is often highly uncertain and that large numbers of speech event samples are difficult to collect, so the accuracy of traditional speech event detection methods is low; at the same time, for a speech event that occurs at random, a model that makes frame-level judgments may classify different frames of the same event differently, making the event detection result unstable.
Summary of the Invention
A voice event detection method provided by this application includes:
acquiring audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
The present application also provides a voice event detection apparatus, the apparatus comprising:
a feature extraction module, used to acquire audio to be detected and perform acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
a self-attention module, used to perform feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
an identification module, used to perform event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
a smoothing module, used to smooth the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
The present application also provides an electronic device, the electronic device comprising:
a memory storing at least one computer program; and
a processor executing the computer program stored in the memory to implement the following steps:
acquiring audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
The present application also provides a computer-readable storage medium, including a stored data area and a stored program area; the stored data area stores created data, and the stored program area stores a computer program which, when executed by a processor, implements the following steps:
acquiring audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
performing feature analysis on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified;
performing event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence; and
smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a voice event detection method provided by an embodiment of the present application;
FIG. 2 is a schematic block diagram of a voice event detection apparatus provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the internal structure of an electronic device implementing a voice event detection method provided by an embodiment of the present application.
The realization of the objectives, functional features, and advantages of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are intended only to explain the present application, not to limit it.
The embodiment of the present application provides a voice event detection method. The execution body of the voice event detection method includes, but is not limited to, at least one of the electronic devices, such as a server and a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the voice event detection method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform. The server includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
Referring to FIG. 1, which is a schematic flowchart of a voice event detection method according to an embodiment of the present application.
In this embodiment, the voice event detection method includes:
S1. Acquire audio to be detected, and perform acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence.
In the embodiment of the present application, the audio to be detected is audio that contains various sound events, such as human voices, singing, percussion, dog barking, and car horns. Further, the audio to be detected can be obtained from a database. To ensure the security and privacy of the audio to be detected, it can be obtained from a node of a blockchain.
In detail, performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence includes:
framing the audio to be detected to obtain a speech frame sequence;
obtaining the corresponding spectrum of each speech frame in the speech frame sequence through a fast Fourier transform;
converting the spectrum into a Mel spectrum through a Mel filter bank; and
performing cepstral analysis on the Mel spectrum to obtain the speech frame feature sequence corresponding to the audio to be detected.
To extract the acoustic features, the embodiment of the present application first divides the audio to be detected into frames, for example one frame every 10 milliseconds; the Mel spectrum is then calculated for each speech frame and cepstral analysis is performed, thereby extracting the acoustic features of the audio to be detected. The Mel spectrum is a commonly used speech feature representation that effectively captures the basic information of speech and facilitates subsequent analysis and processing.
S2. Use a classification model based on a self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain a hidden state sequence to be identified.
In the embodiment of the present application, the classification model is a deep learning model that can recognize speech features and is used to detect and classify different sound events. The classification model includes an input layer, a hidden layer, and a fully connected layer.
The self-attention mechanism attends to relevant details according to the detection target rather than analyzing the input globally; feature vectors recomputed through the self-attention mechanism can fully account for the contextual relationships within continuous audio.
In detail, performing feature analysis on the speech frame feature sequence using the classification model based on the self-attention mechanism to obtain the hidden state sequence to be identified includes:
dividing the speech frame feature sequence into multiple windows of a preset length;
selecting one of the windows as the current window in chronological order;
computing, through the input layer of the classification model, the feature vector of the current window and the feature vector of the next window, and obtaining the feature vector of the previous window;
merging, through the input layer of the classification model, the feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window to obtain a common speech feature; and
inputting the common speech feature into the hidden layer of the classification model and obtaining the hidden state sequence to be recognized based on the self-attention mechanism.
In one example of the present application, assuming that the speech frame feature sequence is x_t, the embodiment of the present application may divide the speech frame feature sequence into several windows:
z_t = [x_t, …, x_{t+T}]
where t is the frame index, T is the window length, and x_t is the length-1 (single-frame) feature at position t in the speech frame feature sequence.
Further, the hidden layer of the classification model described in the embodiments of the present application is a deep neural network composed of several stacked self-attention network layers. Inputting the common speech feature into the hidden layer of the classification model and obtaining the hidden state sequence to be recognized based on the self-attention mechanism includes:
inputting the common speech feature into the first-layer self-attention network of the hidden layer of the classification model for calculation to obtain a first hidden state sequence;
taking the first hidden state sequence as the input of the second-layer self-attention network in the hidden layer of the classification model to obtain a second hidden state sequence; and
taking the second hidden state sequence as the input of the third-layer self-attention network, and repeating this calculation step layer by layer until the last self-attention layer in the hidden layer of the classification model is reached, to obtain the hidden state sequence to be recognized.
For example, if the common speech feature of the current window is z_t, the hidden state sequence o_t calculated by the stacked self-attention layers of the hidden layer is:
h_t^(0) = z_t
h_t^(l) = D_l(h_t^(l-1)), l = 1, …, L
o_t = h_t^(L)
where z_t is the common speech feature, h_t^(l) denotes the hidden state sequence of the l-th layer for the t-th window, there are L layers in total, D_l is the l-th self-attention network in the hidden layer of the classification model, and the output of layer l-1 serves as the input of the next self-attention layer.
S3. Use the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence.
In detail, using the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence includes:
mapping the hidden state sequence to be identified to a multi-dimensional space vector through the fully connected layer of the classification model; and
performing a probability calculation on the multi-dimensional space vector using a preset activation function to obtain the event label sequence.
The event label sequence gives, for every speech frame in the audio to be detected, the probability of each sound event class.
The embodiment of the present application may define the categories of sound events contained in one window as y_t:
y_t = [r_1, …, r_N]
where r_i denotes the i-th sound event class: r_i = 1 if that sound event is present in the window and r_i = 0 otherwise, with N sound event classes in total.
Optionally, in the embodiment of the present application, before the classification model based on the self-attention mechanism is used to perform feature analysis on the speech frame feature sequence to obtain the hidden state sequence to be identified, the method further includes training the classification model:
obtaining a training sample set and labeling the training sample set to obtain a real event sequence;
inputting the training sample set into the classification model to obtain a predicted label sequence; and
calculating the training error of the predicted label sequence using a preset loss function and the real event sequence, and updating the classification model according to the training error to obtain the trained classification model.
In the embodiment of the present application, the real event sequence y_t is obtained by labeling. Assuming the predicted label sequence of the classification model is ŷ_t, this embodiment of the present application can use a cross-entropy loss function to calculate the training error (Loss), so as to update and learn the classification model; training stops once the classification model converges, yielding the trained classification model.
In this embodiment of the present application, the cross-entropy loss function may be written, for the N 0/1 event labels, as:
Loss = −Σ_{i=1}^{N} [ y_{t,i} log ŷ_{t,i} + (1 − y_{t,i}) log(1 − ŷ_{t,i}) ]
where Loss is the training error, N is the total number of sound event classes, y_t is the real event sequence, and ŷ_t is the predicted label sequence.
S4. Perform smoothing processing on the event label sequence to obtain the speech event detection result corresponding to the speech to be detected.
In the embodiment of the present application, the event label sequence is output frame by frame, so glitches and jitter in the result are unavoidable; the embodiment therefore uses a sequence matching network (SMN) to smooth the event label sequence along the time dimension, filling sudden gaps in the event predictions as well as very brief glitches.
In detail, smoothing the event label sequence to obtain the speech event detection result corresponding to the speech to be detected includes:
smoothing the event label sequence using a preset sequence matching network to obtain a smooth event label sequence;
determining, according to the endpoints in the smooth event label sequence, the start time and end time of each event contained in the speech to be detected, to obtain multiple event detection results; and
aggregating the multiple event detection results to obtain the speech event detection result corresponding to the speech to be detected.
The classification model described in the embodiments of the present application can detect audio in real time, so it can be applied in scenarios with high real-time requirements, such as sports competitions and live broadcasts, and has high practical value.
In this embodiment of the present application, a classification model based on a self-attention mechanism is used to perform feature analysis on the speech frame feature sequence. The classification model merges the features of multiple windows, which improves the accuracy of the speech features; because the model is based on a self-attention mechanism, it can improve the accuracy of event detection; and smoothing the event label sequence improves the stability of the detection results. Therefore, the voice event detection method, apparatus, and computer-readable storage medium proposed in this application can improve the stability and accuracy of voice event detection.
As shown in FIG. 2, which is a schematic block diagram of the modules of the voice event detection apparatus of the present application.
The voice event detection apparatus 100 described in this application can be installed in an electronic device. According to the implemented functions, the voice event detection apparatus may include a feature extraction module 101, a self-attention module 102, an identification module 103, and a smoothing module 104. The modules described in this application may also be referred to as units, which are a series of computer program segments that can be executed by the processor of an electronic device and can perform fixed functions, and which are stored in the memory of the electronic device.
In this embodiment, the functions of each module/unit are as follows:
所述特征提取模块101,用于获取待检测音频,并对所述待检测音频进行声学特征提取,得到语音帧特征序列。The feature extraction module 101 is configured to acquire the audio to be detected, and perform acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence.
本申请实施例中,所述待检测音频是包含多种声音事件的音频,如人声、歌声、敲击声、狗叫、车鸣等事件。进一步地,所述待检测音频可以从数据库中获取。为了保证所述待检测音频的安全性和私密性,所述待检测音频可以从一区块链的节点中获取。In the embodiment of the present application, the audio to be detected is audio that includes various sound events, such as human voice, singing, percussion, dog barking, car chirping and other events. Further, the to-be-detected audio can be obtained from a database. In order to ensure the security and privacy of the audio to be detected, the audio to be detected can be obtained from a node of a blockchain.
详细地,在对所述待检测音频进行声学特征提取,得到语音帧特征序列时,所述特征提取模块101具体执行下述操作:In detail, when performing acoustic feature extraction on the to-be-detected audio to obtain a speech frame feature sequence, the feature extraction module 101 specifically performs the following operations:
将所述待检测音频进行分帧处理,得到语音帧序列;Framing the audio to be detected to obtain a sequence of speech frames;
对所述语音帧序列中的每一帧语音,通过快速傅里叶变换得到对应的频谱;For each frame of speech in the speech frame sequence, obtain the corresponding frequency spectrum through fast Fourier transform;
通过梅尔滤波器组将所述频谱转换为梅尔频谱;converting the spectrum to a mel spectrum through a mel filter bank;
在所述梅尔频谱上进行倒谱分析,得到所述待检测音频对应的语音帧特征序列。Cepstral analysis is performed on the Mel spectrum to obtain a speech frame feature sequence corresponding to the audio to be detected.
本申请实施例对所述待检测音频进行声学特征提取,首先需要对所述待检测音频进行分帧处理,如将每10毫秒为一帧,然后对每一帧语音通过计算梅尔频谱,并进行倒谱分析,从而提取所述待检测音频的声学特征。其中,所述梅尔频谱是一种常用的语音特征表示,可以有效地刻画语音的基本信息,便于后续对语音进行分析处理。The embodiment of the present application performs acoustic feature extraction on the audio to be detected. First, the audio to be detected needs to be framed, for example, every 10 milliseconds is a frame, and then the Mel spectrum is calculated for each frame of speech, and the Cepstral analysis is performed to extract acoustic features of the audio to be detected. The Mel spectrum is a commonly used speech feature representation, which can effectively describe the basic information of speech and facilitate subsequent analysis and processing of speech.
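As a non-authoritative sketch of this feature extraction pipeline, the framing, fast Fourier transform, mel filter bank and cepstral analysis can be approximated with the librosa library. The 16 kHz sample rate, 25 ms analysis window and 20 cepstral coefficients are assumptions for illustration; the 10 ms hop follows the example above:

```python
# Illustrative sketch of the acoustic feature extraction step.
# Assumptions: 16 kHz audio, 25 ms analysis window, 20 MFCC coefficients.
import librosa
import numpy as np

def extract_frame_features(wav_path: str, sr: int = 16000, n_mfcc: int = 20) -> np.ndarray:
    """Return a (num_frames, n_mfcc) speech frame feature sequence."""
    audio, _ = librosa.load(wav_path, sr=sr)
    hop = int(0.010 * sr)   # 10 ms per frame, as in the example above
    win = int(0.025 * sr)   # 25 ms analysis window (assumed)
    # librosa applies framing + FFT + mel filter bank internally, then the
    # DCT-based cepstral analysis that yields the MFCC feature per frame.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=hop, win_length=win)
    return mfcc.T           # one feature vector per frame
```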
所述自注意力模块102，用于利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析，得到待识别隐藏状态序列。The self-attention module 102 is configured to perform feature analysis on the speech frame feature sequence using a classification model based on the self-attention mechanism to obtain a hidden state sequence to be identified.
本申请实施例中,所述分类模型是一种可以识别语音特征,用于对不同声音事件进行检测分类的深度学习模型。所述分类模型包括输入层、隐藏层和全连接层。In the embodiment of the present application, the classification model is a deep learning model that can recognize speech features and is used to detect and classify different sound events. The classification model includes an input layer, a hidden layer and a fully connected layer.
所述自注意机制是根据检测目标去关注部分细节,而不是基于全局进行分析的一种机制,通过所述自注意机制重新计算的特征向量,可以充分考虑到连续音频中上下文之间的联系。The self-attention mechanism is a mechanism that pays attention to some details according to the detection target, rather than based on the global analysis. The feature vector recalculated by the self-attention mechanism can fully consider the connection between the contexts in the continuous audio.
详细地,所述自注意力模块102具体用于:In detail, the self-attention module 102 is specifically used for:
将所述语音帧特征序列划分为多个预设长度的窗口;dividing the speech frame feature sequence into a plurality of windows of preset lengths;
按照时间顺序选择所述窗口中的其中一个窗口作为当前窗口;Select one of the windows as the current window in chronological order;
通过所述分类模型的输入层计算所述当前窗口的特征向量,计算当前窗口的下一个窗口的特征向量,并获取当前窗口的上一个窗口的特征向量;Calculate the feature vector of the current window through the input layer of the classification model, calculate the feature vector of the next window of the current window, and obtain the feature vector of the previous window of the current window;
通过所述分类模型的输入层将所述上一个窗口的特征向量,所述当前窗口的特征向量,以及所述下一个窗口的特征向量进行合并,得到共同语音特征;The feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window are combined through the input layer of the classification model to obtain a common speech feature;
将所述共同语音特征输入至所述分类模型的隐藏层中,基于自注意机制得到待识别隐藏状态序列。The common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism.
本申请其中一个示例中，假设语音帧特征序列为 $x_t$，则本申请实施例可以将所述语音帧特征序列分成若干个窗口：In one example of the present application, assuming the speech frame feature sequence is $x_t$, the embodiment may divide the speech frame feature sequence into several windows:

$$z_t = [x_t, \ldots, x_{t+T}]$$

其中，t 表示帧序号，T 表示窗口的长度，$x_t$ 表示所述语音帧特征序列中长度为 1 的特征。Here, t is the frame index, T is the window length, and $x_t$ is a feature of length 1 (a single frame) in the speech frame feature sequence.
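A minimal sketch of this windowing step, assuming the feature sequence is held as a NumPy matrix with one row per frame; non-overlapping windows are assumed here for simplicity:

```python
# Illustrative sketch: split the frame feature sequence into windows z_t.
import numpy as np

def split_into_windows(features: np.ndarray, T: int):
    """features: (num_frames, dim) -> list of (T, dim) windows z_t = [x_t, ..., x_{t+T}]."""
    return [features[t:t + T]
            for t in range(0, len(features) - T + 1, T)]
```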
进一步地,本申请实施例中所述分类模型的隐藏层是由若干层自注意力(Self Attention)机制网络组成的深度神经网络。本申请实施例所述将所述共同语音特征输入至所述分类模型的隐藏层中,基于自注意机制得到待识别隐藏状态序列,包括:Further, the hidden layer of the classification model described in the embodiments of the present application is a deep neural network composed of several layers of self-attention mechanism networks. According to the embodiment of the present application, the common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism, including:
将所述共同语音特征输入至所述分类模型的隐藏层的第一层自注意力机制网络中进行计算,得到第一隐藏状态序列;Inputting the common speech feature into the first-layer self-attention mechanism network of the hidden layer of the classification model for calculation to obtain a first hidden state sequence;
将所述第一隐藏状态序列作为所述分类模型的隐藏层中第二层自注意力机制网络的输入进行计算,得到第二隐藏状态序列;Calculate the first hidden state sequence as the input of the second layer of self-attention mechanism network in the hidden layer of the classification model to obtain the second hidden state sequence;
将所述第二隐藏状态序列作为所述分类模型的隐藏层中第三层自注意力机制网络的输入进行计算步骤，并进行重复递进，直到到达所述分类模型的隐藏层中的最后一层自注意力机制网络，得到待识别隐藏状态序列。The second hidden state sequence is used as the input of the third-layer self-attention mechanism network in the hidden layer of the classification model for calculation, and this step is repeated layer by layer until the last self-attention mechanism network layer in the hidden layer of the classification model is reached, yielding the hidden state sequence to be recognized.
例如，当前窗口的共同语音特征为 $z_t$，通过所述隐藏层的若干层自注意力机制网络计算得到的隐藏状态序列为 $o_t$：For example, if the common speech feature of the current window is $z_t$, the hidden state sequence computed through the several self-attention layers of the hidden layer is $o_t$:

$$h_t^{(l)} = D_l\left(h_t^{(l-1)}\right), \qquad h_t^{(0)} = z_t$$

$$o_t = h_t^{(L)}$$

其中，$z_t$ 是共同语音特征，$h_t^{(l)}$ 表示第 t 个窗口在第 l 层的隐藏状态序列，l 总共有 L 层。$D_l$ 为所述分类模型的隐藏层中第 l 层自注意力机制网络，上一层 l-1 的输出 $h_t^{(l-1)}$ 作为下一层自注意力机制网络的输入。Here, $z_t$ is the common speech feature and $h_t^{(l)}$ denotes the hidden state sequence of the t-th window at layer l, with L layers in total. $D_l$ is the l-th self-attention mechanism network in the hidden layer of the classification model, and the output $h_t^{(l-1)}$ of the previous layer l-1 serves as the input of the next self-attention mechanism network.
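The following is a hedged sketch of these stacked self-attention layers. PyTorch's TransformerEncoderLayer is used only as a stand-in for each D_l, since no specific self-attention module is prescribed here, and the dimensions are illustrative:

```python
# Illustrative sketch of the stacked self-attention layers h_t^(l) = D_l(h_t^(l-1)).
import torch
import torch.nn as nn

class SelfAttentionStack(nn.Module):
    def __init__(self, dim: int = 128, num_layers: int = 4, num_heads: int = 4):
        super().__init__()
        # each TransformerEncoderLayer stands in for one D_l
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                       batch_first=True)
            for _ in range(num_layers))

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        """z_t: (batch, T, dim) common speech feature -> o_t of the same shape."""
        h = z_t                    # h_t^(0) = z_t
        for layer in self.layers:  # output of layer l-1 feeds layer l
            h = layer(h)
        return h                   # o_t = h_t^(L)
```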
所述识别模块103,用于使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列。The identifying module 103 is configured to use the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence.
详细地,所述识别模块103具体用于:In detail, the identification module 103 is specifically used for:
通过所述分类模型的全连接层将所述待识别隐藏状态序列映射为多维空间向量；The hidden state sequence to be identified is mapped to a multi-dimensional space vector through the fully connected layer of the classification model;
利用预设的激活函数对所述多维空间向量进行概率计算,得到事件标签序列。Probability calculation is performed on the multi-dimensional space vector by using a preset activation function to obtain an event label sequence.
其中,所述事件标签序列是所述待检测音频中每帧语音包含的各种声音事件的概率值。Wherein, the event tag sequence is the probability value of various sound events included in each frame of speech in the audio to be detected.
本申请实施例中可以定义在一个窗口内包含的声音事件的类别为 $y_t$：In this embodiment of the present application, the categories of sound events contained in a window may be defined as $y_t$:

$$y_t = [r_1, \ldots, r_N]$$

其中，$r_i$ 表示第 i 个声音事件的事件类别，如果窗口里存在某一个声音事件，则 $r_i = 1$，否则 $r_i = 0$；总共有 N 个声音事件的类别。Here, $r_i$ indicates the i-th sound event: $r_i = 1$ if that event is present in the window, and $r_i = 0$ otherwise; there are N sound-event categories in total.
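An illustrative sketch of this recognition head follows. A sigmoid is assumed for the preset activation function, since each r_i is an independent present/absent label rather than one of N mutually exclusive classes:

```python
# Illustrative sketch of the event recognition head: fully connected layer
# plus activation. A sigmoid is assumed because each of the N events is an
# independent 0/1 label r_i.
import torch
import torch.nn as nn

class EventHead(nn.Module):
    def __init__(self, dim: int = 128, num_events: int = 10):
        super().__init__()
        self.fc = nn.Linear(dim, num_events)  # map hidden states to N-dim vectors

    def forward(self, o_t: torch.Tensor) -> torch.Tensor:
        """o_t: (batch, T, dim) -> per-frame event probabilities in [0, 1]."""
        return torch.sigmoid(self.fc(o_t))
```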
可选地,本申请实施例中,在利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列之前,还包括对所述分类模型进行训练:Optionally, in the embodiment of the present application, before using a classification model based on a self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain a hidden state sequence to be identified, it also includes training the classification model:
获取训练样本集,并对所述训练样本集进行标注,得到真实事件序列;Obtain a training sample set, and label the training sample set to obtain a real event sequence;
将所述训练样本集输入至分类模型中,得到预测标签序列;The training sample set is input into the classification model to obtain the predicted label sequence;
利用预设的损失函数和真实事件序列计算所述预测标签序列的训练误差,并根据所述训练误差对所述分类模型进行更新,得到训练好的所述分类模型。The training error of the predicted label sequence is calculated by using a preset loss function and a real event sequence, and the classification model is updated according to the training error to obtain the trained classification model.
其中，本申请实施例中所述真实事件序列 $y_t$ 由标注获得。假设分类模型的预测标签序列为 $\hat{y}_t$，本申请实施例可以使用交叉熵（Cross Entropy）损失函数计算训练误差（Loss），从而对所述分类模型进行更新学习，直到所述分类模型收敛，停止训练，得到训练好的所述分类模型。Wherein, the real event sequence $y_t$ described in this embodiment of the present application is obtained by annotation. Assuming the predicted label sequence of the classification model is $\hat{y}_t$, this embodiment may use a cross-entropy (Cross Entropy) loss function to calculate the training error (Loss), so as to update and train the classification model until it converges, at which point training stops and the trained classification model is obtained.

本申请实施例中，所述交叉熵损失函数，包括：In this embodiment of the present application, the cross-entropy loss function includes:

$$\mathrm{Loss} = -\sum_{i=1}^{N}\left[ y_t^{(i)} \log \hat{y}_t^{(i)} + \left(1 - y_t^{(i)}\right) \log\left(1 - \hat{y}_t^{(i)}\right) \right]$$

其中，Loss 为训练误差，N 为声音事件的类别总数，$y_t$ 为真实事件序列，$\hat{y}_t$ 为预测标签序列。Here, Loss is the training error, N is the total number of sound-event categories, $y_t$ is the real event sequence, and $\hat{y}_t$ is the predicted label sequence.
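A minimal training-step sketch consistent with the loss above; model, features and y_true are hypothetical placeholders, and torch.nn.BCELoss is assumed as the concrete form of the cross-entropy over the N independent 0/1 labels:

```python
# Illustrative training-step sketch. `model`, `features` and `y_true` are
# placeholders: y_true holds the annotated event sequences y_t as 0/1 tensors
# of shape (batch, T, N), and the model is assumed to end in a sigmoid.
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               features: torch.Tensor, y_true: torch.Tensor) -> float:
    criterion = nn.BCELoss()          # cross-entropy over independent 0/1 labels
    optimizer.zero_grad()
    y_pred = model(features)          # predicted label sequence \hat{y}_t
    loss = criterion(y_pred, y_true)  # training error Loss
    loss.backward()                   # update the classification model
    optimizer.step()
    return loss.item()
```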
所述平滑模块104,用于对所述事件标签序列进行平滑处理,得到所述待检测语音对应的语音事件检测结果。The smoothing module 104 is configured to perform smoothing processing on the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
本申请实施例中，所述事件标签序列是逐帧输出的结果，因此难免会存在结果的毛刺和抖动，本申请实施例中使用序列匹配网络（SMN），将所述事件标签序列在时间维度上进行平滑，填补事件预测过程中突然的缺失，以及极短时的毛刺。In this embodiment of the present application, the event label sequence is output frame by frame, so glitches and jitter in the result are unavoidable. This embodiment therefore uses a sequence matching network (SMN) to smooth the event label sequence along the time dimension, filling in sudden gaps in the event prediction as well as very short glitches.
详细地,所述平滑模块104具体用于:In detail, the smoothing module 104 is specifically used for:
利用预设的序列匹配网络对所述事件标签序列进行平滑处理,得到平滑事件标签序列;The event label sequence is smoothed by using a preset sequence matching network to obtain a smooth event label sequence;
根据所述平滑事件标签序列中的端点,确定所述待检测语音中包含的各事件的开始时间和结束时间,得到多个事件检测结果;According to the endpoint in the smooth event label sequence, determine the start time and end time of each event included in the to-be-detected speech, and obtain multiple event detection results;
汇集所述多个事件检测结果,得到所述待检测语音对应的语音事件检测结果。The multiple event detection results are collected to obtain a voice event detection result corresponding to the to-be-detected voice.
本申请实施例中所述分类模型可以对音频进行实时检测,从而可以应用在对实时性要求比较高的场景,例如体育比赛,直播等领域,具有较高的实际应用价值。The classification model described in the embodiments of the present application can detect audio in real time, so that it can be applied to scenarios with high real-time requirements, such as sports competitions, live broadcasts and other fields, and has high practical application value.
如图3所示,是本申请实现语音事件检测方法的电子设备的结构示意图。As shown in FIG. 3 , it is a schematic structural diagram of an electronic device implementing the voice event detection method of the present application.
所述电子设备1可以包括处理器10、存储器11和总线,还可以包括存储在所述存储器11中并可在所述处理器10上运行的计算机程序,如语音事件检测程序12。The electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a voice event detection program 12.
其中，所述存储器11至少包括一种类型的可读存储介质，所述可读存储介质包括闪存、移动硬盘、多媒体卡、卡型存储器（例如：SD或DX存储器等）、磁性存储器、磁盘、光盘等。所述存储器11在一些实施例中可以是电子设备1的内部存储单元，例如该电子设备1的移动硬盘。所述存储器11在另一些实施例中也可以是电子设备1的外部存储设备，例如电子设备1上配备的插接式移动硬盘、智能存储卡（Smart Media Card，SMC）、安全数字（Secure Digital，SD）卡、闪存卡（Flash Card）等。进一步地，所述存储器11还可以既包括电子设备1的内部存储单元也包括外部存储设备。所述存储器11不仅可以用于存储安装于电子设备1的应用软件及各类数据，例如语音事件检测程序12的代码等，还可以用于暂时地存储已经输出或者将要输出的数据。Wherein, the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example: SD or DX memory, etc.), magnetic memory, magnetic disk, optical disc, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a mobile hard disk of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in mobile hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card or a flash card (Flash Card) equipped on the electronic device 1. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the voice event detection program 12, but can also be used to temporarily store data that has been output or will be output.
所述处理器10在一些实施例中可以由集成电路组成，例如可以由单个封装的集成电路所组成，也可以是由多个相同功能或不同功能封装的集成电路所组成，包括一个或者多个中央处理器（Central Processing unit，CPU）、微处理器、数字处理芯片、图形处理器及各种控制芯片的组合等。所述处理器10是所述电子设备的控制核心（Control Unit），利用各种接口和线路连接整个电子设备的各个部件，通过运行或执行存储在所述存储器11内的程序或者模块（例如执行语音事件检测程序等），以及调用存储在所述存储器11内的数据，以执行电子设备1的各种功能和处理数据。In some embodiments, the processor 10 may be composed of integrated circuits; for example, it may be composed of a single packaged integrated circuit, or of multiple integrated circuits packaged with the same or different functions, including combinations of one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors and various control chips. The processor 10 is the control core (Control Unit) of the electronic device; it connects the various components of the entire electronic device through various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, executing the voice event detection program, etc.) and calling the data stored in the memory 11.
所述总线可以是外设部件互连标准(peripheral component interconnect,简称PCI)总线或扩展工业标准结构(extended industry standard architecture,简称EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。所述总线被设置为实现所述存储器11以及至少一个处理器10等之间的连接通信。The bus may be a peripheral component interconnect (PCI for short) bus or an extended industry standard architecture (Extended industry standard architecture, EISA for short) bus or the like. The bus can be divided into address bus, data bus, control bus and so on. The bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
图3仅示出了具有部件的电子设备,本领域技术人员可以理解的是,图3示出的结构并不构成对所述电子设备1的限定,可以包括比图示更少或者更多的部件,或者组合某些部件,或者不同的部件布置。FIG. 3 only shows an electronic device with components. Those skilled in the art can understand that the structure shown in FIG. 3 does not constitute a limitation on the electronic device 1, and may include fewer or more components than those shown in the figure. components, or a combination of certain components, or a different arrangement of components.
例如，尽管未示出，所述电子设备1还可以包括给各个部件供电的电源（比如电池），优选地，电源可以通过电源管理装置与所述至少一个处理器10逻辑相连，从而通过电源管理装置实现充电管理、放电管理、以及功耗管理等功能。电源还可以包括一个或一个以上的直流或交流电源、再充电装置、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。所述电子设备1还可以包括多种传感器、蓝牙模块、Wi-Fi模块等，在此不再赘述。For example, although not shown, the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions such as charge management, discharge management and power consumption management are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components. The electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
进一步地，所述电子设备1还可以包括网络接口，可选地，所述网络接口可以包括有线接口和/或无线接口（如WI-FI接口、蓝牙接口等），通常用于在该电子设备1与其他电子设备之间建立通信连接。Further, the electronic device 1 may also include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a Bluetooth interface, etc.), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
可选地,该电子设备1还可以包括用户接口,用户接口可以是显示器(Display)、输入单元(比如键盘(Keyboard)),可选地,用户接口还可以是标准的有线接口、无线接口。可选地,在一些实施例中,显示器可以是LED显示器、液晶显示器、触控式液晶显示器以及OLED(Organic Light-Emitting Diode,有机发光二极管)触摸器等。其中,显示器也可以适当的称为显示屏或显示单元,用于显示在电子设备1中处理的信息以及用于显示可视化的用户界面。Optionally, the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like. The display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
应该了解,所述实施例仅为说明之用,在专利申请范围上并不受此结构的限制。It should be understood that the embodiments are only used for illustration, and are not limited by this structure in the scope of the patent application.
所述电子设备1中的所述存储器11存储的语音事件检测程序12是多个计算机程序的组合,在所述处理器10中运行时,可以实现:The voice event detection program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple computer programs, and when running in the processor 10, it can realize:
获取待检测音频,并对所述待检测音频进行声学特征提取,得到语音帧特征序列;Acquiring the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列;Using a classification model based on a self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain a hidden state sequence to be identified;
使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列;Using the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence;
对所述事件标签序列进行平滑处理,得到所述待检测语音对应的语音事件检测结果。The event label sequence is smoothed to obtain a speech event detection result corresponding to the speech to be detected.
进一步地，所述电子设备1集成的模块/单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时，可以存储在一个计算机可读存储介质中。所述计算机可读存储介质可以是易失性的，也可以是非易失性的。例如，所述计算机可读介质可以包括：能够携带所述计算机程序代码的任何实体或装置、记录介质、U盘、移动硬盘、磁碟、光盘、计算机存储器、只读存储器（ROM，Read-Only Memory）。Further, if the modules/units integrated in the electronic device 1 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, or a read-only memory (ROM, Read-Only Memory).
本申请还提供一种计算机可读存储介质,所述可读存储介质存储有计算机程序,所述计算机程序在被电子设备的处理器所执行时,可以实现:The present application also provides a computer-readable storage medium, where the readable storage medium stores a computer program, and when executed by a processor of an electronic device, the computer program can realize:
获取待检测音频,并对所述待检测音频进行声学特征提取,得到语音帧特征序列;Acquiring the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列;Using a classification model based on a self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain a hidden state sequence to be identified;
使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列;Using the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence;
对所述事件标签序列进行平滑处理,得到所述待检测语音对应的语音事件检测结果。The event label sequence is smoothed to obtain a speech event detection result corresponding to the speech to be detected.
进一步地，所述计算机可用存储介质可主要包括存储程序区和存储数据区，其中，存储程序区可存储操作系统、至少一个功能所需的应用程序等；存储数据区可存储根据区块链节点的使用所创建的数据等。Further, the computer-usable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of blockchain nodes, and the like.
在本申请所提供的几个实施例中，应该理解到，所揭露的设备，装置和方法，可以通过其它的方式实现。例如，以上所描述的装置实施例仅仅是示意性的，例如，所述模块的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式。In the several embodiments provided in this application, it should be understood that the disclosed device, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative; for example, the division of the modules is only a logical function division, and there may be other division manners in actual implementation.
所述作为分离部件说明的模块可以是或者也可以不是物理上分开的,作为模块显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。The modules described as separate components may or may not be physically separated, and components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
另外,在本申请各个实施例中的各功能模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能模块的形式实现。In addition, each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
对于本领域技术人员而言,显然本申请不限于上述示范性实施例的细节,而且在不背离本申请的精神或基本特征的情况下,能够以其他的具体形式实现本申请。It will be apparent to those skilled in the art that the present application is not limited to the details of the above-described exemplary embodiments, but that the present application can be implemented in other specific forms without departing from the spirit or essential characteristics of the present application.
因此，无论从哪一点来看，均应将实施例看作是示范性的，而且是非限制性的，本申请的范围由所附权利要求而不是上述说明限定，因此旨在将落在权利要求的等同要件的含义和范围内的所有变化涵括在本申请内。不应将权利要求中的任何附关联图表记视为限制所涉及的权利要求。Accordingly, the embodiments are to be regarded in all respects as illustrative and not restrictive, and the scope of the application is defined by the appended claims rather than by the foregoing description; all changes that fall within the meaning and scope of the equivalents of the claims are therefore intended to be embraced in this application. Any reference signs in the claims shall not be construed as limiting the claims involved.
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链（Blockchain），本质上是一个去中心化的数据库，是一串使用密码学方法相关联产生的数据块，每一个数据块中包含了一批次网络交易的信息，用于验证其信息的有效性（防伪）和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains information on a batch of network transactions, which is used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
此外，显然“包括”一词不排除其他单元或步骤，单数不排除复数。系统权利要求中陈述的多个单元或装置也可以由一个单元或装置通过软件或者硬件来实现。第二等词语用来表示名称，而并不表示任何特定的顺序。Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in the system claims may also be implemented by one unit or device through software or hardware. Words such as "second" are used to denote names and do not denote any particular order.
最后应说明的是，以上实施例仅用以说明本申请的技术方案而非限制，尽管参照较佳实施例对本申请进行了详细说明，本领域的普通技术人员应当理解，可以对本申请的技术方案进行修改或等同替换，而不脱离本申请技术方案的精神和范围。Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present application without departing from the spirit and scope of the technical solutions of the present application.

Claims (20)

  1. 一种语音事件检测方法,其中,所述方法包括:A voice event detection method, wherein the method comprises:
    获取待检测音频,并对所述待检测音频进行声学特征提取,得到语音帧特征序列;Acquiring the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
    利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列;Using a classification model based on a self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain a hidden state sequence to be identified;
    使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列;Using the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence;
    对所述事件标签序列进行平滑处理,得到所述待检测语音对应的语音事件检测结果。The event label sequence is smoothed to obtain a speech event detection result corresponding to the speech to be detected.
  2. 如权利要求1所述的语音事件检测方法,其中,所述对所述待检测音频进行声学特征提取,得到语音帧特征序列,包括:The voice event detection method according to claim 1, wherein the extraction of acoustic features on the to-be-detected audio to obtain a voice frame feature sequence, comprising:
    将所述待检测音频进行分帧处理,得到语音帧序列;Framing the audio to be detected to obtain a sequence of speech frames;
    对所述语音帧序列中的每一帧语音,通过快速傅里叶变换得到对应的频谱;For each frame of speech in the speech frame sequence, obtain the corresponding frequency spectrum through fast Fourier transform;
    通过梅尔滤波器组将所述频谱转换为梅尔频谱;converting the spectrum to a mel spectrum through a mel filter bank;
    在所述梅尔频谱上进行倒谱分析,得到所述待检测音频对应的语音帧特征序列。Cepstral analysis is performed on the Mel spectrum to obtain a speech frame feature sequence corresponding to the audio to be detected.
  3. 如权利要求1所述的语音事件检测方法,其中,所述利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列,包括:The voice event detection method according to claim 1, wherein the feature analysis is performed on the voice frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified, comprising:
    将所述语音帧特征序列划分为多个预设长度的窗口;dividing the speech frame feature sequence into a plurality of windows of preset lengths;
    按照时间顺序选择所述窗口中的其中一个窗口作为当前窗口;Select one of the windows as the current window in chronological order;
    通过所述分类模型的输入层计算所述当前窗口的特征向量,计算当前窗口的下一个窗口的特征向量,并获取当前窗口的上一个窗口的特征向量;Calculate the feature vector of the current window through the input layer of the classification model, calculate the feature vector of the next window of the current window, and obtain the feature vector of the previous window of the current window;
    通过所述分类模型的输入层将所述上一个窗口的特征向量,所述当前窗口的特征向量,以及所述下一个窗口的特征向量进行合并,得到共同语音特征;The feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window are combined through the input layer of the classification model to obtain a common speech feature;
    将所述共同语音特征输入至所述分类模型的隐藏层中,基于自注意机制得到待识别隐藏状态序列。The common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism.
  4. 如权利要求3所述的语音事件检测方法,其中,所述将所述共同语音特征输入至所述分类模型的隐藏层中,基于自注意机制得到待识别隐藏状态序列,包括:The voice event detection method according to claim 3, wherein the inputting the common voice feature into the hidden layer of the classification model, and obtaining the hidden state sequence to be identified based on a self-attention mechanism, comprising:
    将所述共同语音特征输入至所述分类模型的隐藏层的第一层自注意力机制网络中进行计算,得到第一隐藏状态序列;Inputting the common speech feature into the first-layer self-attention mechanism network of the hidden layer of the classification model for calculation to obtain a first hidden state sequence;
    将所述第一隐藏状态序列作为所述分类模型的隐藏层中第二层自注意力机制网络的输入进行计算,得到第二隐藏状态序列;Calculate the first hidden state sequence as the input of the second layer of self-attention mechanism network in the hidden layer of the classification model to obtain the second hidden state sequence;
    将所述第二隐藏状态序列作为所述分类模型的隐藏层中第三层自注意力机制网络的输入进行计算步骤，并进行重复递进，直到到达所述分类模型的隐藏层中的最后一层自注意力机制网络，得到待识别隐藏状态序列。The second hidden state sequence is used as the input of the third-layer self-attention mechanism network in the hidden layer of the classification model for calculation, and this step is repeated layer by layer until the last self-attention mechanism network layer in the hidden layer of the classification model is reached, yielding the hidden state sequence to be recognized.
  5. 如权利要求1中所述的语音事件检测方法,其中,所述使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列,包括:The voice event detection method as claimed in claim 1, wherein the using the classification model to perform event recognition on the hidden state sequence to be recognized to obtain an event label sequence, comprising:
    通过所述分类模型的全连接层将所述待识别隐藏状态序列映射为多维空间向量；The hidden state sequence to be identified is mapped to a multi-dimensional space vector through the fully connected layer of the classification model;
    利用预设的激活函数对所述多维空间向量进行概率计算,得到事件标签序列。Probability calculation is performed on the multi-dimensional space vector by using a preset activation function to obtain an event label sequence.
  6. 如权利要求1至5中任一项所述的语音事件检测方法，其中，在利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析，得到待识别隐藏状态序列之前，该方法还包括：The speech event detection method according to any one of claims 1 to 5, wherein, before using a classification model based on a self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain a hidden state sequence to be identified, the method further includes:
    获取训练样本集,并对所述训练样本集进行标注,得到真实事件序列;Obtain a training sample set, and label the training sample set to obtain a real event sequence;
    将所述训练样本集输入至分类模型中,得到预测标签序列;The training sample set is input into the classification model to obtain the predicted label sequence;
    利用预设的损失函数和真实事件序列计算所述预测标签序列的训练误差,并根据所述训练误差对所述分类模型进行更新,得到训练好的所述分类模型。The training error of the predicted label sequence is calculated by using a preset loss function and a real event sequence, and the classification model is updated according to the training error to obtain the trained classification model.
  7. 如权利要求1所述的语音事件检测方法，其中，所述对所述事件标签序列进行平滑处理，得到所述待检测语音对应的语音事件检测结果，包括：The voice event detection method as claimed in claim 1, wherein the smoothing of the event label sequence to obtain the voice event detection result corresponding to the voice to be detected includes:
    利用预设的序列匹配网络对所述事件标签序列进行平滑处理,得到平滑事件标签序列;The event label sequence is smoothed by using a preset sequence matching network to obtain a smooth event label sequence;
    根据所述平滑事件标签序列中的端点,确定所述待检测语音中包含的各事件的开始时间和结束时间,得到多个事件检测结果;According to the endpoints in the smooth event label sequence, determine the start time and end time of each event included in the to-be-detected speech, and obtain multiple event detection results;
    汇集所述多个事件检测结果,得到所述待检测语音对应的语音事件检测结果。The multiple event detection results are collected to obtain a voice event detection result corresponding to the to-be-detected voice.
  8. 一种语音事件检测装置,其中,所述装置包括:A voice event detection device, wherein the device comprises:
    特征提取模块,用于获取待检测音频,并对所述待检测音频进行声学特征提取,得到语音帧特征序列;a feature extraction module, used to obtain the audio to be detected, and perform acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
    自注意力模块,用于利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列;The self-attention module is used to perform feature analysis on the speech frame feature sequence by using the classification model based on the self-attention mechanism to obtain the hidden state sequence to be identified;
    识别模块,用于使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列;an identification module, configured to perform event identification on the hidden state sequence to be identified by using the classification model to obtain an event label sequence;
    平滑模块,用于对所述事件标签序列进行平滑处理,得到所述待检测语音对应的语音事件检测结果。The smoothing module is used for smoothing the event label sequence to obtain a speech event detection result corresponding to the speech to be detected.
  9. 一种电子设备,其中,所述电子设备包括:An electronic device, wherein the electronic device comprises:
    至少一个处理器;以及,at least one processor; and,
    与所述至少一个处理器通信连接的存储器;其中,a memory communicatively coupled to the at least one processor; wherein,
    所述存储器存储有可被所述至少一个处理器执行的计算机程序,所述计算机程序被所述至少一个处理器执行,以使所述至少一个处理器能够执行如下步骤:The memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the steps of:
    获取待检测音频,并对所述待检测音频进行声学特征提取,得到语音帧特征序列;Acquiring the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
    利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列;Using a classification model based on a self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain a hidden state sequence to be identified;
    使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列;Using the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence;
    对所述事件标签序列进行平滑处理,得到所述待检测语音对应的语音事件检测结果。The event label sequence is smoothed to obtain a speech event detection result corresponding to the speech to be detected.
  10. 如权利要求9所述的电子设备,其中,所述对所述待检测音频进行声学特征提取,得到语音帧特征序列,包括:The electronic device according to claim 9, wherein the extraction of acoustic features on the to-be-detected audio to obtain a speech frame feature sequence, comprising:
    将所述待检测音频进行分帧处理,得到语音帧序列;Framing the audio to be detected to obtain a sequence of speech frames;
    对所述语音帧序列中的每一帧语音,通过快速傅里叶变换得到对应的频谱;For each frame of speech in the speech frame sequence, obtain the corresponding frequency spectrum through fast Fourier transform;
    通过梅尔滤波器组将所述频谱转换为梅尔频谱;converting the spectrum to a mel spectrum through a mel filter bank;
    在所述梅尔频谱上进行倒谱分析,得到所述待检测音频对应的语音帧特征序列。Cepstral analysis is performed on the Mel spectrum to obtain a speech frame feature sequence corresponding to the audio to be detected.
  11. 如权利要求9所述的电子设备,其中,所述利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列,包括:The electronic device according to claim 9, wherein the feature analysis is performed on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified, comprising:
    将所述语音帧特征序列划分为多个预设长度的窗口;dividing the speech frame feature sequence into a plurality of windows of preset lengths;
    按照时间顺序选择所述窗口中的其中一个窗口作为当前窗口;Select one of the windows as the current window in chronological order;
    通过所述分类模型的输入层计算所述当前窗口的特征向量,计算当前窗口的下一个窗口的特征向量,并获取当前窗口的上一个窗口的特征向量;Calculate the feature vector of the current window through the input layer of the classification model, calculate the feature vector of the next window of the current window, and obtain the feature vector of the previous window of the current window;
    通过所述分类模型的输入层将所述上一个窗口的特征向量,所述当前窗口的特征向量,以及所述下一个窗口的特征向量进行合并,得到共同语音特征;The feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window are combined through the input layer of the classification model to obtain a common speech feature;
    将所述共同语音特征输入至所述分类模型的隐藏层中,基于自注意机制得到待识别隐藏状态序列。The common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism.
  12. 如权利要求11所述的电子设备,其中,所述将所述共同语音特征输入至所述分类模型的隐藏层中,基于自注意机制得到待识别隐藏状态序列,包括:The electronic device according to claim 11 , wherein, inputting the common speech feature into a hidden layer of the classification model, and obtaining a sequence of hidden states to be identified based on a self-attention mechanism, comprises:
    将所述共同语音特征输入至所述分类模型的隐藏层的第一层自注意力机制网络中进行计算,得到第一隐藏状态序列;Inputting the common speech feature into the first-layer self-attention mechanism network of the hidden layer of the classification model for calculation to obtain a first hidden state sequence;
    将所述第一隐藏状态序列作为所述分类模型的隐藏层中第二层自注意力机制网络的输入进行计算，得到第二隐藏状态序列；The first hidden state sequence is used as the input of the second-layer self-attention mechanism network in the hidden layer of the classification model for calculation to obtain the second hidden state sequence;
    将所述第二隐藏状态序列作为所述分类模型的隐藏层中第三层自注意力机制网络的输入进行计算步骤，并进行重复递进，直到到达所述分类模型的隐藏层中的最后一层自注意力机制网络，得到待识别隐藏状态序列。The second hidden state sequence is used as the input of the third-layer self-attention mechanism network in the hidden layer of the classification model for calculation, and this step is repeated layer by layer until the last self-attention mechanism network layer in the hidden layer of the classification model is reached, yielding the hidden state sequence to be recognized.
  13. 如权利要求9中所述的电子设备,其中,所述使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列,包括:The electronic device as claimed in claim 9, wherein the use of the classification model to perform event recognition on the to-be-identified hidden state sequence to obtain an event label sequence, comprising:
    通过所述分类模型的全连接层将所述待识别隐藏状态序列映射为多维空间向量；The hidden state sequence to be identified is mapped to a multi-dimensional space vector through the fully connected layer of the classification model;
    利用预设的激活函数对所述多维空间向量进行概率计算,得到事件标签序列。Probability calculation is performed on the multi-dimensional space vector by using a preset activation function to obtain an event label sequence.
  14. 如权利要求9至13中任一项所述的电子设备，其中，在利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析，得到待识别隐藏状态序列之前，所述计算机程序被所述至少一个处理器执行时还实现如下步骤：The electronic device according to any one of claims 9 to 13, wherein, before using a classification model based on a self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain a hidden state sequence to be identified, the computer program, when executed by the at least one processor, further implements the following steps:
    获取训练样本集,并对所述训练样本集进行标注,得到真实事件序列;Obtain a training sample set, and label the training sample set to obtain a real event sequence;
    将所述训练样本集输入至分类模型中,得到预测标签序列;The training sample set is input into the classification model to obtain the predicted label sequence;
    利用预设的损失函数和真实事件序列计算所述预测标签序列的训练误差,并根据所述训练误差对所述分类模型进行更新,得到训练好的所述分类模型。The training error of the predicted label sequence is calculated by using a preset loss function and a real event sequence, and the classification model is updated according to the training error to obtain the trained classification model.
  15. 如权利要求9所述的电子设备,其中,所述对所述事件标签序列进行平滑处理,得到所述待检测语音对应的语音事件检测结果,包括:The electronic device according to claim 9, wherein the smoothing of the event label sequence to obtain a voice event detection result corresponding to the to-be-detected voice comprises:
    利用预设的序列匹配网络对所述事件标签序列进行平滑处理,得到平滑事件标签序列;The event label sequence is smoothed by using a preset sequence matching network to obtain a smooth event label sequence;
    根据所述平滑事件标签序列中的端点,确定所述待检测语音中包含的各事件的开始时间和结束时间,得到多个事件检测结果;According to the endpoint in the smooth event label sequence, determine the start time and end time of each event included in the to-be-detected speech, and obtain multiple event detection results;
    汇集所述多个事件检测结果,得到所述待检测语音对应的语音事件检测结果。The multiple event detection results are collected to obtain a voice event detection result corresponding to the to-be-detected voice.
  16. 一种计算机可读存储介质，包括存储数据区和存储程序区，存储数据区存储创建的数据，存储程序区存储有计算机程序；其中，所述计算机程序被处理器执行时实现如下步骤：A computer-readable storage medium, comprising a data storage area and a program storage area, the data storage area storing created data and the program storage area storing a computer program; wherein, when the computer program is executed by a processor, the following steps are implemented:
    获取待检测音频,并对所述待检测音频进行声学特征提取,得到语音帧特征序列;Acquiring the audio to be detected, and performing acoustic feature extraction on the audio to be detected to obtain a speech frame feature sequence;
    利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列;Using the classification model based on the self-attention mechanism to perform feature analysis on the speech frame feature sequence to obtain the hidden state sequence to be identified;
    使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列;Using the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence;
    对所述事件标签序列进行平滑处理,得到所述待检测语音对应的语音事件检测结果。The event label sequence is smoothed to obtain a speech event detection result corresponding to the speech to be detected.
  17. 如权利要求16所述的计算机可读存储介质,其中,所述对所述待检测音频进行声学特征提取,得到语音帧特征序列,包括:The computer-readable storage medium according to claim 16, wherein the extraction of acoustic features on the to-be-detected audio to obtain a sequence of speech frame features comprises:
    将所述待检测音频进行分帧处理,得到语音帧序列;Framing the audio to be detected to obtain a sequence of speech frames;
    对所述语音帧序列中的每一帧语音,通过快速傅里叶变换得到对应的频谱;For each frame of speech in the speech frame sequence, obtain the corresponding frequency spectrum through fast Fourier transform;
    通过梅尔滤波器组将所述频谱转换为梅尔频谱;converting the spectrum to a mel spectrum through a mel filter bank;
    在所述梅尔频谱上进行倒谱分析,得到所述待检测音频对应的语音帧特征序列。Cepstral analysis is performed on the Mel spectrum to obtain a speech frame feature sequence corresponding to the audio to be detected.
  18. 如权利要求16所述的计算机可读存储介质,其中,所述利用基于自注意力机制的分类模型对所述语音帧特征序列进行特征分析,得到待识别隐藏状态序列,包括:The computer-readable storage medium according to claim 16, wherein the feature analysis is performed on the speech frame feature sequence by using a classification model based on a self-attention mechanism to obtain a hidden state sequence to be identified, comprising:
    将所述语音帧特征序列划分为多个预设长度的窗口;dividing the speech frame feature sequence into a plurality of windows of preset lengths;
    按照时间顺序选择所述窗口中的其中一个窗口作为当前窗口;Select one of the windows as the current window in chronological order;
    通过所述分类模型的输入层计算所述当前窗口的特征向量,计算当前窗口的下一个窗口的特征向量,并获取当前窗口的上一个窗口的特征向量;Calculate the feature vector of the current window through the input layer of the classification model, calculate the feature vector of the next window of the current window, and obtain the feature vector of the previous window of the current window;
    通过所述分类模型的输入层将所述上一个窗口的特征向量,所述当前窗口的特征向量,以及所述下一个窗口的特征向量进行合并,得到共同语音特征;The feature vector of the previous window, the feature vector of the current window, and the feature vector of the next window are combined through the input layer of the classification model to obtain a common speech feature;
    将所述共同语音特征输入至所述分类模型的隐藏层中,基于自注意机制得到待识别隐藏状态序列。The common speech feature is input into the hidden layer of the classification model, and the hidden state sequence to be recognized is obtained based on the self-attention mechanism.
  19. 如权利要求18所述的计算机可读存储介质,其中,所述将所述共同语音特征输入至所述分类模型的隐藏层中,基于自注意机制得到待识别隐藏状态序列,包括:The computer-readable storage medium according to claim 18, wherein the inputting the common speech feature into a hidden layer of the classification model, and obtaining a sequence of hidden states to be identified based on a self-attention mechanism, comprises:
    将所述共同语音特征输入至所述分类模型的隐藏层的第一层自注意力机制网络中进行计算,得到第一隐藏状态序列;Inputting the common speech feature into the first layer of the self-attention mechanism network of the hidden layer of the classification model for calculation to obtain a first hidden state sequence;
    将所述第一隐藏状态序列作为所述分类模型的隐藏层中第二层自注意力机制网络的输入进行计算,得到第二隐藏状态序列;Calculate the first hidden state sequence as the input of the second layer of self-attention mechanism network in the hidden layer of the classification model to obtain the second hidden state sequence;
    将所述第二隐藏状态序列作为所述分类模型的隐藏层中第三层自注意力机制网络的输入进行计算步骤，并进行重复递进，直到到达所述分类模型的隐藏层中的最后一层自注意力机制网络，得到待识别隐藏状态序列。The second hidden state sequence is used as the input of the third-layer self-attention mechanism network in the hidden layer of the classification model for calculation, and this step is repeated layer by layer until the last self-attention mechanism network layer in the hidden layer of the classification model is reached, yielding the hidden state sequence to be recognized.
  20. 如权利要求16中所述的计算机可读存储介质,其中,所述使用所述分类模型对所述待识别隐藏状态序列进行事件识别,得到事件标签序列,包括:The computer-readable storage medium as claimed in claim 16, wherein the using the classification model to perform event identification on the hidden state sequence to be identified to obtain an event label sequence, comprising:
    通过所述分类模型的全连接层将所述待识别隐藏状态序列映射为多维空间向量；The hidden state sequence to be identified is mapped to a multi-dimensional space vector through the fully connected layer of the classification model;
    利用预设的激活函数对所述多维空间向量进行概率计算,得到事件标签序列。Probability calculation is performed on the multi-dimensional space vector by using a preset activation function to obtain an event label sequence.
PCT/CN2021/082872 2020-12-01 2021-03-25 Speech event detection method and apparatus, electronic device, and computer storage medium WO2022116420A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011381842.8 2020-12-01
CN202011381842.8A CN112447189A (en) 2020-12-01 2020-12-01 Voice event detection method and device, electronic equipment and computer storage medium

Publications (1)

Publication Number Publication Date
WO2022116420A1 true WO2022116420A1 (en) 2022-06-09

Family

ID=74740231

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/082872 WO2022116420A1 (en) 2020-12-01 2021-03-25 Speech event detection method and apparatus, electronic device, and computer storage medium

Country Status (2)

Country Link
CN (1) CN112447189A (en)
WO (1) WO2022116420A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112447189A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and device, electronic equipment and computer storage medium
CN113140226B (en) * 2021-04-28 2022-06-21 桂林电子科技大学 Sound event marking and identifying method adopting double Token labels
CN113239872B (en) * 2021-06-01 2024-03-19 平安科技(深圳)有限公司 Event identification method, device, equipment and storage medium
CN113782051B (en) * 2021-07-28 2024-03-19 北京中科模识科技有限公司 Broadcast effect classification method and system, electronic equipment and storage medium
CN113707175B (en) * 2021-08-24 2023-12-19 上海师范大学 Acoustic event detection system based on feature decomposition classifier and adaptive post-processing
CN113724734B (en) * 2021-08-31 2023-07-25 上海师范大学 Sound event detection method and device, storage medium and electronic device
CN113555037B (en) * 2021-09-18 2022-01-11 中国科学院自动化研究所 Method and device for detecting tampered area of tampered audio and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827804A (en) * 2019-11-14 2020-02-21 福州大学 Sound event labeling method from audio frame sequence to event label sequence
CN110929092A (en) * 2019-11-19 2020-03-27 国网江苏省电力工程咨询有限公司 Multi-event video description method based on dynamic attention mechanism
US10783434B1 (en) * 2019-10-07 2020-09-22 Audio Analytic Ltd Method of training a sound event recognition system
CN111753549A (en) * 2020-05-22 2020-10-09 江苏大学 Multi-mode emotion feature learning and recognition method based on attention mechanism
CN112447189A (en) * 2020-12-01 2021-03-05 平安科技(深圳)有限公司 Voice event detection method and device, electronic equipment and computer storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115623531A (en) * 2022-11-29 2023-01-17 浙大城市学院 Hidden monitoring equipment discovering and positioning method using wireless radio frequency signal
CN115623531B (en) * 2022-11-29 2023-03-31 浙大城市学院 Hidden monitoring equipment discovering and positioning method using wireless radio frequency signal
CN117316184A (en) * 2023-12-01 2023-12-29 常州分音塔科技有限公司 Event detection feedback processing system based on audio signals
CN117316184B (en) * 2023-12-01 2024-02-09 常州分音塔科技有限公司 Event detection feedback processing system based on audio signals

Also Published As

Publication number Publication date
CN112447189A (en) 2021-03-05


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21899464

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21899464

Country of ref document: EP

Kind code of ref document: A1