WO2023173890A1 - Real-time speech recognition method, model training method, device, equipment and storage medium - Google Patents

Real-time speech recognition method, model training method, device, equipment and storage medium

Info

Publication number
WO2023173890A1
WO2023173890A1 PCT/CN2022/142596 CN2022142596W WO2023173890A1 WO 2023173890 A1 WO2023173890 A1 WO 2023173890A1 CN 2022142596 W CN2022142596 W CN 2022142596W WO 2023173890 A1 WO2023173890 A1 WO 2023173890A1
Authority
WO
WIPO (PCT)
Prior art keywords
block
feature sequence
audio
module
frame
Prior art date
Application number
PCT/CN2022/142596
Other languages
English (en)
French (fr)
Inventor
刘晶晶
张弼弘
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2023173890A1
Priority to US18/384,009 (published as US20240062744A1)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0499 Feedforward networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • Embodiments of the present application relate to the field of artificial intelligence technology, and in particular to a real-time speech recognition method, model training method, device, equipment and storage medium.
  • Speech recognition refers to recognizing speech data provided by a user and obtaining the corresponding text data.
  • Speech recognition is generally divided into real-time speech recognition and non-real-time speech recognition.
  • Non-real-time speech recognition means that the system performs recognition after the user has finished speaking a sentence or a paragraph, while real-time speech recognition means that the system performs recognition while the user is still speaking.
  • For real-time speech recognition, recognition speed and latency often become the bottleneck for practical deployment.
  • Embodiments of the present application provide a real-time speech recognition method, model training method, device, equipment and storage medium, which can solve the technical problem of a large amount of calculation during the encoding process, resulting in a large delay in speech recognition.
  • the technical solutions are as follows:
  • Embodiments of the present application provide a real-time speech recognition method, which is executed by a computer device in which a real-time speech recognition model is deployed.
  • the method includes:
  • acquiring an audio feature sequence of a block to be recognized of the speech data, where the block to be recognized includes at least two consecutive audio frames in the speech data, and the audio feature sequence of the block to be recognized includes the audio features of each audio frame contained in the block to be recognized;
  • obtaining, from the data stored in the cache area, the intermediate processing result of the historical block corresponding to the block to be recognized, and encoding the audio feature sequence of the block to be recognized to obtain the hidden layer features of the block to be recognized, where the hidden layer features are the features of the block to be recognized after encoding, and a historical block is a block that shares at least one audio frame with the block to be recognized and has already been encoded;
  • decoding, according to the hidden layer features, to obtain the real-time speech recognition result of the block to be recognized.
  • Embodiments of the present application also provide a method for training a real-time speech recognition model.
  • the method includes:
  • acquiring an audio feature sequence of sample speech data, where the audio feature sequence includes the audio features of multiple audio frames of the sample speech data;
  • inputting the audio feature sequence to the encoder of the real-time speech recognition model, dividing the audio feature sequence into blocks according to a mask matrix through the encoder, and encoding each block to obtain a hidden layer feature sequence of the sample speech data, where the hidden layer feature sequence includes the hidden layer features of each block; each block includes at least two consecutive audio frames among the multiple audio frames, two adjacent blocks share at least one audio frame, and when encoding the current block the encoder uses the intermediate processing result, stored in the cache area, of at least one historical block whose audio frames overlap the current block; a historical block is a block that shares at least one audio frame with the current block and has already been encoded;
  • decoding the hidden layer feature sequence through the decoder of the real-time speech recognition model to obtain the predicted recognition result of the sample speech data;
  • training the real-time speech recognition model based on the predicted recognition result and the real recognition result of the sample speech data.
  • An embodiment of the present application also provides a real-time speech recognition device, which includes:
  • a sequence acquisition module configured to acquire an audio feature sequence of a block to be recognized of the speech data, where the block to be recognized includes at least two consecutive audio frames in the speech data, and the audio feature sequence of the block to be recognized includes the audio features of each audio frame contained in the block to be recognized;
  • an encoding processing module configured to obtain, from the data stored in the cache area, the intermediate processing result of the historical block corresponding to the block to be recognized, and to encode the audio feature sequence of the block to be recognized to obtain the hidden layer features of the block to be recognized, where the hidden layer features are the features of the block to be recognized after encoding, and a historical block is a block that shares at least one audio frame with the block to be recognized and has already been encoded;
  • a decoding processing module configured to decode, according to the hidden layer features, to obtain the real-time speech recognition result of the block to be recognized.
  • An embodiment of the present application also provides a training device for a real-time speech recognition model.
  • the device includes:
  • a sample acquisition module configured to acquire an audio feature sequence of sample voice data, where the audio feature sequence includes audio features of multiple audio frames of the sample voice data;
  • an encoding processing module configured to input the audio feature sequence to the encoder of the real-time speech recognition model, divide the audio feature sequence into blocks according to the mask matrix through the encoder, and encode each block to obtain a hidden layer feature sequence of the sample speech data, where the hidden layer feature sequence includes the hidden layer features of each block; each block includes at least two consecutive audio frames among the multiple audio frames, and there is at least one overlapping audio frame between two adjacent blocks;
  • when encoding the current block, the encoder uses the intermediate processing result, stored in the cache area, of at least one historical block whose audio frames overlap the current block; a historical block is a block that shares at least one audio frame with the current block and has already been encoded;
  • a decoding processing module configured to decode the hidden layer feature sequence through the decoder of the real-time speech recognition model to obtain the predicted recognition result of the sample speech data
  • a model training module configured to train the real-time speech recognition model based on the predicted recognition results and real recognition results of the sample speech data.
  • An embodiment of the present application also provides a computer device.
  • the computer device includes a processor and a memory; at least one instruction, at least one program, a code set or an instruction set is stored in the memory, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the above real-time speech recognition method or the above training method for a real-time speech recognition model.
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium stores at least one instruction, at least one program, a code set or an instruction set.
  • the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to implement the above real-time speech recognition method or the above training method for a real-time speech recognition model.
  • Embodiments of the present application also provide a computer program product or computer program.
  • the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above-mentioned real-time speech recognition method, or the above-mentioned real-time speech recognition model training method.
  • Figure 1 is a schematic diagram of the AED (Attention-based Encoder-Decoder)-CTC (Connectionist Temporal Classification)/Attention architecture provided by some embodiments of this application;
  • Figure 2 is a training diagram of a real-time speech recognition model based on chunking operations provided by some embodiments of the present application;
  • Figure 3 is a schematic diagram of the real-time speech recognition model provided by some embodiments of the present application in the use stage;
  • Figure 4 is a schematic diagram of the solution implementation environment provided by some embodiments of the present application.
  • Figure 5 is a workflow diagram of real-time speech recognition in applications provided by some embodiments of the present application.
  • Figure 6 is a flow chart of a real-time speech recognition method provided by some embodiments of the present application.
  • Figure 7 is a schematic diagram of the division of blocks to be recognized of voice data provided by some embodiments of the present application.
  • Figure 8 is a flow chart of a real-time speech recognition method provided by another embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a Conformer network provided by some embodiments of the present application.
  • Figure 10 is a schematic structural diagram of a convolution module provided by some embodiments of the present application.
  • Figure 11 is a schematic structural diagram of a multi-head self-attention module provided by some embodiments of the present application.
  • Figure 12 is a schematic structural diagram of a Conformer with a buffer added according to some embodiments of the present application.
  • Figure 13 is a schematic diagram of the calculation method of the self-attention mechanism provided by some embodiments of the present application.
  • Figure 14 is a schematic diagram of using the cache area for calculation provided by some embodiments of the present application.
  • Figure 15 is a flow chart of a training method for a real-time speech recognition model provided by some embodiments of the present application.
  • Figure 16 is a schematic diagram of block division based on a mask matrix provided by some embodiments of the present application.
  • Figure 17 is a block diagram of a real-time speech recognition device provided by some embodiments of the present application.
  • Figure 18 is a block diagram of a real-time speech recognition device provided by other embodiments of the present application.
  • Figure 19 is a block diagram of a training device for a real-time speech recognition model provided by some embodiments of the present application.
  • Figure 20 is a schematic structural diagram of a computer device provided by some embodiments of the present application.
  • The audio features of the speech data input by the user are fed to the encoder, which encodes them to obtain hidden layer features; the decoder then decodes these hidden layer features to obtain the corresponding speech recognition result.
  • However, this approach requires a large amount of calculation during the encoding process, resulting in a large delay in speech recognition.
  • This speech recognition model uses the AED-CTC/Attention architecture.
  • The architecture is exemplarily shown in Figure 1 and includes an encoding network 10, a decoding network 20, a CTC module 30 and a CE (Cross Entropy) module 40.
  • The encoding network 10 models acoustic features, while the decoding network 20 models language features jointly with acoustic features.
  • The CTC module 30 can automatically learn word boundary alignment.
  • The automatic alignment capability of the CTC module 30 helps align text with acoustic features, and the decoding network 20 can avoid problems such as long-sentence truncation.
  • The joint modeling capability of the decoding network 20 also gives CTC richer text context and stronger recognition capability.
  • For real-time recognition, the encoding network 10 can only encode the audio features up to the current moment, or including a limited number of future moments, and then decode using this encoded information together with historical predictions; the usual method is to divide the audio into chunks and perform attention calculations within each chunk.
  • the training loss is calculated through the CTC module 30 and the CE module 40 respectively, and the parameters of the speech recognition model are adjusted based on the calculated training loss.
  • the encoding network 10 uses a Conformer structure
  • the decoding network 20 uses a Transformer structure.
  • a CTC module 30 is added at the end of the encoding network 10 to calculate the CTC loss
  • a CE module 40 is added at the output end of the decoding network 20 to calculate the CE loss.
  • the entire model combines the two training criteria (ie, the above two losses) for parameter update.
  • In the decoding (use) stage, the decoding network 20 and the CE module 40 are removed; only the encoding network 10 and the CTC module 30 are used to generate the acoustic posterior probabilities, an n-gram language model is then introduced, and decoding is performed through search on the weighted finite state transducer (WFST) graph constructed from it to obtain the recognition result.
  • Transformer network is a deep self-attention transformation network, which is also commonly used to refer to all similar deep self-attention transformation network structures.
  • the Transformer network breaks through the limitation that recurrent neural networks cannot perform parallel calculations. Compared with convolutional neural networks, the number of operations required to calculate the association between two positions does not increase with distance, and self-attention can produce a more interpretable model.
  • the Conformer network is a convolution-enhanced Transformer network.
  • Conformer uses convolution to enhance the effect of Transformer in the field of speech recognition.
  • the Transformer is good at capturing global features, while the convolutional neural network is effective at representing local features.
  • the fusion of the two can better extract the global and local dependencies of audio features, thereby enhancing the effect of speech recognition.
  • Figure 2 exemplarily shows a training diagram of a real-time speech recognition model based on block operations
  • Figure 2 is also an expanded view of the architecture diagram given in Figure 1.
  • the input speech data 21 is first divided into multiple blocks based on audio frames, and each block contains multiple consecutive audio frames.
  • In some embodiments, each block contains the same number of audio frames.
  • In other embodiments, the number of audio frames included in each block may also be different, and this application does not limit this.
  • The audio frames contained in each block are divided into historical frames, valid frames and future frames.
  • The valid frames are the part of the block whose speech needs to be recognized.
  • The historical frames and future frames are used to help the speech recognition of the valid-frame part of the block.
  • The historical frames are the frames preceding the valid frames, and the future frames are the frames following the valid frames; the valid-frame part of a block is recognized with the help of the preceding historical frames and the following future frames. As shown in area 22 in Figure 2, area 22 shows the speech data divided into multiple blocks.
  • Nc denotes the valid frames of a block, Ni denotes the historical frames of a block, and Nr denotes the future frames of a block.
  • the future frame portion of Block 1 may coincide with the valid frame portion of Block 2
  • the historical frame portion of Block 2 may coincide with the valid frame portion of Block 1.
  • the above complete hidden layer feature sequence is decoded by the decoder to obtain the predicted recognition result (such as predicted text information).
  • the CTC module is used to calculate the training loss of the complete hidden layer feature sequence, and the parameters of the encoder are adjusted; the CE module is used to calculate the training loss of the prediction recognition results, and the parameters of the encoder and decoder are adjusted.
  • The acoustic posterior probability calculation of a block is started only when the speech features of Ni+Nc+Nr frames have been accumulated.
  • Only the posterior probabilities of the Nc valid frames are then taken out and decoded with the CTC-WFST graph to obtain the recognition result.
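
As an illustrative sketch of this accumulation rule (not the patent's code; `encode_block` and `ctc_wfst_decode` are hypothetical stand-ins for the encoder and the CTC-WFST decoding step), a streaming loop could look like this:

```python
def stream_decode(frame_features, n_hist, n_valid, n_future,
                  encode_block, ctc_wfst_decode):
    """frame_features: iterable of per-frame audio feature vectors.
    A block is processed only once enough frames (up to Ni + Nc + Nr,
    fewer at the start of the utterance) have accumulated; only the
    posteriors of the Nc valid frames are kept for decoding.
    (The tail of the utterance is ignored here for brevity.)"""
    frames, valid_start = [], 0
    for feat in frame_features:
        frames.append(feat)
        # the block whose valid frames are [valid_start, valid_start + n_valid)
        # is complete once its future frames have arrived
        while len(frames) >= valid_start + n_valid + n_future:
            block = frames[max(0, valid_start - n_hist):
                           valid_start + n_valid + n_future]
            posteriors = encode_block(block)  # acoustic posterior probabilities
            valid_post = posteriors[-(n_valid + n_future):
                                    len(posteriors) - n_future]
            yield ctc_wfst_decode(valid_post)  # partial recognition result
            valid_start += n_valid
```
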
  • Figure 3 exemplarily shows a schematic diagram of the speech recognition model in the actual use stage.
  • Audio frames are extracted from the speech data input in real time, and blocks are formed from these audio frames.
  • The blocks thus obtained are input to the encoder.
  • Each block is encoded to obtain its hidden layer feature sequence, and the complete hidden layer feature sequence corresponding to the speech data is obtained by combining the hidden layer feature sequences of the blocks.
  • The complete hidden layer feature sequence is then decoded to obtain the predicted text information.
  • FIG 4 shows a schematic diagram of the solution implementation environment provided by some embodiments of the present application.
  • The implementation environment of this solution can be realized as a real-time speech recognition system used to recognize the speech data input by the user, that is, to realize the real-time speech recognition function.
  • the solution implementation environment may include: a model training device 410 and a model using device 420.
  • the model training device 410 may be a terminal device or a server.
  • the model using device 420 may be a terminal device or a server.
  • the model training device 410 may be an electronic device such as a computer, a server, an intelligent robot, or other electronic devices with strong computing capabilities.
  • the model training device 410 is used to train the real-time speech recognition model 430.
  • the real-time speech recognition model 430 is a model used to recognize speech data.
  • the real-time speech recognition model 430 may include a coding network 431 and a decoding network 432.
  • the model training device 410 can use machine learning to train the real-time speech recognition model 430 so that it has better performance.
  • the trained real-time speech recognition model 430 can be deployed in the model using device 420 to recognize speech data and obtain corresponding recognition results (i.e., predicted text data).
  • the model using device 420 can be a terminal device such as a mobile phone, computer, smart TV, multimedia playback device, wearable device, medical device, intelligent voice interaction device, smart home appliance, or vehicle-mounted terminal device, or it can be a server, which is not limited in this application.
  • Embodiments of this application can be applied to various scenarios, including but not limited to artificial intelligence, smart transportation, assisted driving, etc.
  • Terminal devices can be electronic devices such as mobile phones, tablets, PCs (Personal Computers), wearable devices, vehicle-mounted terminal devices, VR (Virtual Reality) devices and AR (Augmented Reality) devices, which is not limited in this application. Clients running applications can be installed on terminal devices.
  • the above-mentioned application program refers to an application program that can recognize voice data input by an object.
  • the application may be an input method application, a social networking application, an interactive entertainment application, a map navigation application, etc., where the object can input voice data.
  • the server can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides cloud computing services.
  • the server can be a backend server of the above-mentioned application, used to provide background services for the client of the application, such as recognizing the voice data input by the user, sending the result to the client, and displaying the text information corresponding to the voice data on the client.
  • speech recognition can also be completed locally by the client, which is not limited in this application.
  • the above-mentioned application program may be an independently developed APP (Application), an applet, or another form of application such as a web application, which is not limited in this application.
  • Terminal devices and servers can communicate with each other through the network.
  • The user inputs voice data in the client corresponding to the application program, and the voice data may be input by the user in real time.
  • When the client of the application obtains the voice data input by the user, it sends the voice data to the server, and the server recognizes the voice data to obtain the corresponding predicted text data.
  • The recognized predicted text data is then sent back to the client and displayed in the client.
  • the real-time speech recognition model of this application can be used for online real-time speech recognition products with different latency requirements, such as voice input methods, voice notes, vehicle-mounted intelligent speech recognition, simultaneous interpretation, and online live broadcast speech recognition products.
  • Figure 5 exemplarily shows a workflow diagram of real-time speech recognition in an application. The user clicks the button to start inputting voice, and the client starts the recording function. Valid speech segments are detected through the client-side VAD (Voice Activity Detection) and then uploaded to the background server after audio compression and encoding.
  • The server first decompresses the audio, then further detects the valid speech segments through the server-side VAD, and sends them to the server-side speech recognition decoder for recognition and decoding.
  • the recognition results are then post-processed and sent back to the client through the network to present to the object.
  • This application optimizes the server-side ASR decoding part in Figure 5, and proposes the use method and training method of the real-time speech recognition model described in the embodiments below.
  • FIG 6 shows a flow chart of a real-time speech recognition method provided by some embodiments of the present application.
  • The execution subject of this method may be the model using device 420 in the solution implementation environment shown in Figure 4.
  • the method may include at least one of the following steps (610-630):
  • Step 610 Obtain the audio feature sequence of the block to be recognized of the speech data.
  • the block to be recognized includes at least two consecutive audio frames in the speech data.
  • the audio feature sequence of the block to be recognized includes the audio features of each audio frame contained in the block to be recognized.
  • the process of collecting voice data and the process of recognizing voice data may be performed in the same device, for example, both are performed in a terminal device.
  • The process of collecting voice data and the process of recognizing voice data can also be performed in different devices; for example, the collection is performed by a terminal device, which then sends the collected voice data to the server, and the server recognizes the voice data.
  • Voice data is the voice data to be recognized, provided by the user in the client.
  • The voice data can be input or recorded by the user in real time on the client, or it can be voice data recorded in advance.
  • By recognizing the voice data, the corresponding text data can be obtained. For example, if the user wants to input the text "Good morning" in the client, the user can input the voice corresponding to "Good morning" in the client's voice input area.
  • Voice data can be divided into frames according to time to obtain multiple audio frames, and each audio frame has the same duration.
  • the block to be recognized is part of the speech data obtained by dividing the speech data into chunks.
  • the speech data is divided according to the number of frames to obtain multiple chunks.
  • multiple blocks include the same number of frames, and each block includes at least two consecutive audio frames.
  • the number of frames included in multiple blocks may also be different, and this application does not limit this.
  • the introduction and explanation are mainly based on an example in which the number of frames contained in multiple blocks is the same.
  • each block includes: at least one valid frame, at least one historical frame preceding the valid frame, and at least one future frame following the valid frame.
  • a block consists of valid frames, historical frames and future frames.
  • the valid frame is the audio frame to be recognized in the block.
  • the historical frame and the future frame are the audio frames used to help improve the accuracy of the recognition result.
  • With the help of the preceding historical frames and the following future frames, the recognition result of the valid-frame part can be identified more accurately.
  • The more valid frames and future frames are selected, the more accurate the recognition result of the valid-frame part will be, but the larger the delay in the speech recognition scenario; the fewer valid frames and future frames are selected, the less accurate the recognition result of the valid-frame part, but the smaller the delay in the speech recognition scenario.
  • Because the blocks are divided according to valid frames, there are overlapping portions between adjacent blocks (the overlapping portions are overlapping audio frames).
  • the audio feature sequence of the block to be recognized is a set of audio features corresponding to each audio frame of the block to be recognized.
  • the audio feature sequence is generated by combining the audio features corresponding to each audio frame.
  • Audio features are used to represent the semantic characteristics of audio frames. Audio features can be obtained from the waveform corresponding to the audio frame: by calculating characteristics such as the frequency, phase, amplitude and Mel cepstrum of the waveform corresponding to the audio frame, the audio features corresponding to that audio frame are obtained.
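
A minimal illustrative sketch (not part of the patent text) of computing such per-frame audio features as log-Mel filterbank features with torchaudio; the 80 Mel bins and the 25 ms / 10 ms window settings are assumed values, not values taken from the patent:

```python
import torchaudio

def extract_audio_features(wav_path):
    """Return a (num_frames, num_mel_bins) tensor in which each row is the
    audio feature of one audio frame of the utterance."""
    waveform, sample_rate = torchaudio.load(wav_path)  # (channels, samples)
    feats = torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=80,        # assumed feature dimension
        frame_length=25.0,      # ms, assumed window length
        frame_shift=10.0,       # ms, assumed frame shift
        sample_frequency=sample_rate,
    )
    return feats
```
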
  • Figure 7 exemplarily shows a schematic diagram of block division of voice data.
  • Assuming the number of audio frames of the input voice data 70 is 20, and each block is set to have 4 valid frames, 8 historical frames and 4 future frames, the input voice data 70 can be divided into 5 blocks.
  • The numbers of frames in these blocks are 8, 12, 16, 16 and 12 respectively, as shown in blocks 71, 72, 73, 74 and 75 in the figure.
  • Block 71 has only 4 valid frames and 4 future frames; block 72 has 4 historical frames, 4 valid frames and 4 future frames; blocks 73 and 74 each have 8 historical frames, 4 valid frames and 4 future frames; block 75 has 8 historical frames and 4 valid frames.
  • the future frame portion of block 71 overlaps with the effective frame portion of block 72
  • the historical frame portion of block 72 overlaps with the effective frame portion of block 71 .
  • In some embodiments, the audio frames of the speech data are acquired in real time while the speech data is being collected.
  • Whenever the number of newly acquired audio frames satisfies the acquisition condition, a block is obtained, and acquisition then continues until the next block is obtained.
  • For example, the valid frames of each block are set to 4, the historical frames to 8, and the future frames to 4.
  • When 8 audio frames have been obtained, the first block is obtained: it has 4 valid frames and 4 future frames, and consists of frames 1 to 8.
  • When 4 more audio frames have been obtained (12 in total), the second block is obtained: it has 4 historical frames, 4 valid frames and 4 future frames, and consists of frames 1 to 12.
  • When another 4 audio frames have been obtained (16 in total), the third block is obtained: it has 8 historical frames, 4 valid frames and 4 future frames, and consists of frames 1 to 16.
  • Similarly, the fourth block consists of frames 5 to 20.
  • In other words, the number of audio frames that must be accumulated before the first block can be obtained is the set number of valid frames plus the number of future frames, and the acquisition condition for each subsequent block is that the set number of valid frames of new audio frames has been obtained.
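
The block boundaries of the 20-frame example above can be reproduced with a short illustrative sketch (not from the patent; the function name is hypothetical):

```python
def block_boundaries(total_frames, n_valid=4, n_hist=8, n_future=4):
    """Yield (start, end) frame indices (1-based, inclusive) of each block.
    The first block can be formed once n_valid + n_future frames exist;
    every later block needs n_valid additional frames."""
    valid_start = 1
    while valid_start <= total_frames:
        valid_end = min(valid_start + n_valid - 1, total_frames)
        start = max(1, valid_start - n_hist)            # historical frames
        end = min(valid_end + n_future, total_frames)   # future frames
        yield start, end
        valid_start += n_valid

# Reproduces Figure 7: blocks of 8, 12, 16, 16 and 12 frames.
print([(s, e, e - s + 1) for s, e in block_boundaries(20)])
# [(1, 8, 8), (1, 12, 12), (1, 16, 16), (5, 20, 16), (9, 20, 12)]
```
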
  • Step 620 Obtain the intermediate processing result of the historical block corresponding to the block to be recognized from the data stored in the cache area, and use the intermediate processing result of the historical block to encode the audio feature sequence of the block to be recognized to obtain the hidden layer features of the block to be recognized.
  • The hidden layer features are the features of the block to be recognized after encoding; a historical block is a block that shares at least one audio frame with the block to be recognized and has already been encoded.
  • The intermediate processing result of a historical block is an intermediate quantity that needs to be used in the process of encoding the block to be recognized.
  • Taking block 73 as the block to be recognized as an example, block 71 and block 72 have already been processed at this time.
  • The historical-frame part of the block 73 to be recognized is the valid-frame part of block 71 and the valid-frame part of block 72.
  • Therefore, both block 71 and block 72 are historical blocks of the block 73 to be recognized.
  • The valid-frame parts of block 71 and block 72 have already been processed, so the intermediate processing results of the valid-frame parts of block 71 and block 72 can be used directly when encoding the block 73 to be recognized.
  • the hidden layer feature sequence is the result of encoding the audio feature sequence corresponding to the speech data.
  • The encoding process encodes the valid-frame part of each block into which the audio data is divided: based on the historical frames, valid frames and future frames in the block to be recognized, the valid-frame part is encoded to obtain the hidden layer features corresponding to the valid frames.
  • the encoded hidden layer features corresponding to the valid frames of each block to be recognized are combined to generate a hidden layer feature sequence.
  • the hidden layer features correspond to the audio features.
  • the audio features are the features of the audio frame before encoding
  • the hidden layer features are the features of the audio frame after encoding.
  • Block 71 is encoded first: based on the 4 valid frames and 4 future frames of block 71, the hidden layer features corresponding to the 4 valid frames of block 71 are obtained. Then block 72 is encoded: based on the 4 historical frames, 4 valid frames and 4 future frames of block 72, the hidden layer features corresponding to the 4 valid frames of block 72 are obtained. When block 72 is the block to be recognized, its 4 historical frames are the valid-frame part of block 71, that is, block 71 is a historical block of block 72. Since the calculation results of the 4 valid frames of block 71 have already been computed, those 4 audio frames do not need to be recomputed when encoding block 72. The same applies to blocks 73, 74 and 75, which will not be described again.
  • the number of valid frames and future frames is determined based on the delay requirements of the current speech recognition scenario.
  • the delay is used to represent the delay in the speech recognition scenario.
  • the delay includes the first word delay and the last word delay.
  • The first-word delay represents the time required from the user inputting speech data to obtaining the first recognized word.
  • The last-word delay represents the time required from the user inputting voice data to obtaining the last recognized word.
  • The real-time rate is obtained by dividing the time required to process a piece of voice data by the duration of that voice data.
  • the real-time speech recognition model in this application has a delay within 500ms and a real-time decoding rate of about 0.5, achieving a high recognition accuracy.
  • By adjusting the number of valid frames and future frames, the time required to display partial real-time speech recognition results in the client can be controlled.
  • The number of valid frames and future frames can be adjusted according to the needs of the user, increasing the diversity and flexibility of the real-time speech recognition function.
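
As a rough illustrative calculation (not from the patent; the 40 ms per frame used in the example call is an assumed value), the frame settings relate to latency and to the real-time rate as follows:

```python
def first_block_audio_ms(n_valid, n_future, frame_shift_ms):
    """Audio that must be buffered before the first block can be encoded:
    the first block needs n_valid + n_future frames."""
    return (n_valid + n_future) * frame_shift_ms

def real_time_rate(processing_time_s, audio_duration_s):
    """Real-time rate: processing time divided by the audio duration."""
    return processing_time_s / audio_duration_s

print(first_block_audio_ms(4, 4, 40.0))  # 320.0 ms of audio before the first block
print(real_time_rate(5.0, 10.0))         # 0.5: decoding twice as fast as real time
```
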
  • Step 630 According to the hidden layer features, decode to obtain the real-time speech recognition result of the block to be recognized.
  • According to the encoded hidden layer features, a predicted recognition result is obtained by decoding.
  • In this way, the intermediate processing results of the preceding historical block or blocks of the block to be recognized are reused, which reduces the amount of calculation in the encoding process and speeds up speech recognition, thereby better meeting the needs of real-time speech recognition.
  • FIG. 8 shows a flow chart of a real-time speech recognition method provided by other embodiments of the present application.
  • The execution subject of this method may be the model using device 420 in the solution implementation environment shown in Figure 4.
  • the method may include at least one of the following steps (810-840):
  • Step 810 Obtain the audio feature sequence of the block to be recognized of the speech data.
  • the block to be recognized includes at least two consecutive audio frames in the speech data.
  • the audio feature sequence of the block to be recognized includes the audio features of each audio frame contained in the block to be recognized.
  • Step 820 Obtain the intermediate processing result of the historical block from the data stored in the cache area.
  • the cache area is an area used to store intermediate processing results of historical blocks.
  • the cache area stores effective frame calculation results corresponding to the historical blocks of the blocks to be identified.
  • the number of valid frame calculation results stored in the buffer area is the same as the set number of historical frames.
  • Step 830 Use the intermediate processing result of the historical block to encode the audio feature sequence of the block to be recognized through the encoder of the real-time speech recognition model to obtain the hidden layer features of the block to be recognized.
  • the real-time speech recognition model is a model used for real-time speech recognition of speech data.
  • the structure of the real-time speech recognition model is introduced in the embodiment below.
  • a real-time speech recognition model can be a model built based on a neural network.
  • the real-time speech recognition model includes an encoder (or encoding network) and a decoder (or decoding network).
  • the encoder is used to encode the input audio features to obtain hidden layer features
  • the decoder is used to decode the hidden layer features to obtain speech recognition results.
  • The server-side ASR decoding part shown in Figure 5 above refers to the complete process in which the server executes the real-time speech recognition method provided by the embodiments of the present application: obtaining the audio feature sequence of the block to be recognized of the speech data; encoding, through the encoder of the real-time speech recognition model and according to the intermediate processing results of the historical block corresponding to the block to be recognized, the audio feature sequence of the block to be recognized to obtain its hidden layer features; and then decoding, through the server-side CTC-WFST decoder and according to the hidden layer features, to obtain the real-time speech recognition result of the block to be recognized.
  • When the real-time speech recognition model encodes the block to be recognized, it can use the intermediate processing results of the historical blocks stored in the cache area to help the subsequent encoding of the block to be recognized. Further, when newly calculated intermediate calculation results of valid frames are obtained, they overwrite the oldest valid-frame intermediate calculation results stored in the cache area.
  • When the speech data is encoded, block 71 is encoded first.
  • With block 71 as the block to be recognized, the hidden layer features corresponding to the 4 valid frames of block 71 are obtained based on its 4 valid frames and 4 future frames, and the intermediate calculation results corresponding to the 4 valid frames are stored in the cache area.
  • the intermediate calculation results may be the K and V vectors calculated by the multi-head self-attention module of each coding layer, and the convolution results calculated by the convolution module.
  • the block 72 is encoded, and based on the 4 historical frames, 4 valid frames and 4 future frames of the block 72 to be recognized, the hidden layer features corresponding to the 4 valid frames in the block 72 to be recognized are obtained.
  • the four historical frames of the block 72 to be recognized are also the valid frame parts of the block 71 , that is, the block 71 is a historical block of the block 72 to be recognized.
  • the intermediate calculation results corresponding to the four valid frames can be obtained from the cache area, so there is no need to recompute those four audio frames.
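
A minimal sketch (not the patent's implementation; the class and method names are hypothetical) of such a rolling cache for one coding layer, which keeps only as many cached frames as the configured number of historical frames and lets newer entries overwrite the oldest ones:

```python
import torch

class LayerCache:
    """Rolling cache for one coding layer: the first cache area holds the
    self-attention K/V vectors and the second holds the convolution
    intermediate results of the most recent `max_hist` valid frames."""

    def __init__(self, max_hist):
        self.max_hist = max_hist
        self.attn_kv = None     # first cache area: (2, cached_frames, dim)
        self.conv_state = None  # second cache area: (cached_frames, dim)

    def append_attn(self, new_kv):
        kv = new_kv if self.attn_kv is None else torch.cat([self.attn_kv, new_kv], dim=1)
        self.attn_kv = kv[:, -self.max_hist:]          # drop the oldest frames

    def append_conv(self, new_state):
        st = new_state if self.conv_state is None else torch.cat([self.conv_state, new_state], dim=0)
        self.conv_state = st[-self.max_hist:]          # drop the oldest frames
```
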
  • Step 840 According to the hidden layer features, decode to obtain the real-time speech recognition result of the block to be recognized.
  • Step 810 and step 840 have been introduced in the above embodiments and will not be described again here.
  • the encoder includes n serially connected encoding layers, where n is an integer greater than 1.
  • the encoding layer includes a multi-head self-attention module and a convolution module.
  • the multi-head self-attention module is used to process the input feature sequence using a multi-head self-attention mechanism
  • the convolution module is used to convolve the input feature sequence.
  • the cache area includes a first cache area and a second cache area. The first cache area is used to store the intermediate processing results of the multi-head self-attention module for historical blocks, and the second cache area is used to store the convolution module's results for historical blocks. Intermediate processing results.
  • The functions of the first cache area and the second cache area are the same: both store intermediate results produced by a module.
  • The difference is that the first cache area stores the intermediate processing results of the multi-head self-attention module, while the second cache area stores the intermediate processing results of the convolution module.
  • the encoding layer further includes a first feedforward module, a second feedforward module, and a layer normalization module.
  • the first feedforward module is also known as the Feed Forward Module (FFM for short);
  • the multi-head self-attention module is also known as the Multi-Head Self-Attention (MHSA) module, and the convolution module is also known as the Convolution Module.
  • the second feedforward module is the same as the first feedforward module and is used to perform feedforward processing on the third intermediate feature sequence to obtain a fourth intermediate feature sequence.
  • the layer normalization module (Layernorm) is used to normalize the fourth intermediate feature sequence to obtain the output feature sequence.
  • Figure 9 exemplarily shows the structure diagram of the Conformer network.
  • The Conformer first preprocesses the input feature sequence through a data augmentation module (SpecAug) 910, a convolution subsampling module (Convolution Subsampling) 920, a linear module (Linear) 930 and a dropout module (Dropout) 940 to obtain the preprocessed feature sequence.
  • the feature sequence is then input into the encoding module 950 for encoding processing.
  • the encoding module 950 has multiple encoding layers, and the structure of each encoding layer is the same. In some embodiments, the structure of each encoding layer can also be different.
  • The structure of each coding layer is the same, as shown in area 960 in Figure 9.
  • The coding layer consists of multiple modules: a first feedforward module 961, a multi-head self-attention module 962, a convolution module 963, a second feedforward module 964, and a layer normalization module 965.
  • the modules are introduced in the examples below.
  • the number of coding layers possessed by the coding module 950 may be 4 layers or 6 layers, which is not limited in this application.
  • the output result of the first feedforward module 961 is the first intermediate feature sequence.
  • the first intermediate feature sequence is input to the multi-head self-attention module 962 to obtain the second intermediate feature sequence.
  • the second intermediate feature sequence is input to the convolution module. 963 to obtain the third intermediate feature sequence, input the third intermediate feature sequence to the second feedforward module 964 to obtain the fourth intermediate feature sequence, and input the fourth intermediate feature sequence to the layer normalization module 965 to obtain the output feature sequence.
  • Its calculation is a sequential composition of these modules and can be written as: x1 = FFN(x), x2 = MHSA(x1), x3 = Conv(x2), x4 = FFN(x3), y = Layernorm(x4), where x is the input feature sequence of the coding layer and y is its output feature sequence.
  • FFN is the feedforward module, MHSA is the multi-head self-attention module, Conv is the convolution module, and Layernorm is the layer normalization module.
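
A minimal PyTorch-style sketch (not the patent's implementation; the layer width, head count and kernel size are assumed values) of a coding layer composed in the order described above. The canonical Conformer block additionally uses residual connections and half-step feedforward scaling, which are omitted here to mirror the textual description:

```python
import torch
import torch.nn as nn

class CodingLayer(nn.Module):
    """First FFN -> multi-head self-attention -> convolution module
    -> second FFN -> layer normalization, as described for area 960."""

    def __init__(self, dim=256, heads=4, kernel_size=15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.mhsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, 2 * dim, 1),    # pointwise: changes channel count only
            nn.GLU(dim=1),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size - 1, groups=dim),  # depthwise
            nn.SiLU(),
            nn.Conv1d(dim, dim, 1),        # pointwise
        )
        self.ffn2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, attn_mask=None):
        # x: (batch, frames, dim); attn_mask follows the PyTorch convention
        # (True marks positions that must NOT be attended to)
        x1 = self.ffn1(x)                                   # first intermediate feature sequence
        x2, _ = self.mhsa(x1, x1, x1, attn_mask=attn_mask)  # second intermediate feature sequence
        x3 = self.conv(x2.transpose(1, 2))[..., : x2.size(1)].transpose(1, 2)  # third (causal conv)
        x4 = self.ffn2(x3)                                  # fourth intermediate feature sequence
        return self.norm(x4)                                # output feature sequence
```
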
  • Figure 10 exemplarily shows a schematic structural diagram of a convolution module.
  • The convolution module has a multi-layer structure, consisting of a layer normalization module, three convolution sub-modules, two activation function modules, a batch normalization module (which accelerates neural network training) and a dropout module.
  • The first convolution sub-module and the third convolution sub-module are the same kind of module: both are pointwise convolutions, which only change the number of feature maps without changing the feature map size.
  • The second convolution sub-module is a depthwise convolution, which changes the size of the feature map without changing the number of channels.
  • Figure 11 exemplarily shows a schematic structural diagram of a multi-head self-attention module.
  • Part (a) in Figure 11 is a schematic structural diagram of the multi-head self-attention module
  • part (b) in Figure 11 is an expanded structural diagram of the scaled attention mechanism in part (a).
  • the calculation method for the self-attention mechanism is introduced below.
  • Figure 12 exemplarily shows a schematic structural diagram of a Conformer with a cache area added.
  • Cache areas are added in front of the multi-head self-attention module 962 and the convolution module 963.
  • The first cache area 121 is added in front of the multi-head self-attention module 962, and the second cache area 122 is added in front of the convolution module 963.
  • Figure 13 exemplarily shows a schematic diagram of the calculation method of the self-attention mechanism.
  • Unlike the rest of the encoding process, the attention mechanism is a process of finding the correlations between the various inputs. Therefore, in this application, the self-attention mechanism is used to obtain the correlations between the historical or future frames and the valid frames, and thereby predict the recognition result of the valid frames.
  • The first intermediate feature sequences of the historical frame, the valid frame and the future frame output by the above first feedforward module are used as the inputs of the attention mechanism, where the feedforward result of the historical frame is denoted a1, the feedforward result of the valid frame a2, and the feedforward result of the future frame a3.
  • The softmax value of an element is the ratio of the exponential of that element to the sum of the exponentials of all elements, that is, softmax(α1,j) = exp(α1,j) / (exp(α1,1) + exp(α1,2) + exp(α1,3)). Taking α1,1 as an example, its softmax operation divides exp(α1,1) by the sum of exp(α1,1), exp(α1,2) and exp(α1,3).
  • b1 is the second intermediate feature sequence between the historical frame and the valid frame, and b3 is the second intermediate feature sequence between the future frame and the valid frame.
  • the convolution module is configured to perform convolution processing on the second intermediate feature of the current frame and the second intermediate feature of at least one historical frame of the current frame to obtain a third intermediate feature of the current frame.
  • The convolution module in the Conformer uses causal convolution: if the convolution kernel size is 15, the 14 historical frames and the current frame are needed to predict the convolution output of the current frame.
  • Therefore, the cache mechanism is designed as follows: before each block is calculated by the convolution module, the last 14 frames of the Nc valid frames of the current block are cached, and this cached part is used as historical frames when the next block is calculated by the convolution module.
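
An illustrative sketch (not the patent's code; the function name is hypothetical) of one such convolution step, which prepends the 14 cached frames from the previous block before running the causal depthwise convolution over the current block's frames:

```python
import torch
import torch.nn.functional as F

def cached_causal_depthwise_conv(block_feats, weight, conv_cache, kernel_size=15):
    """block_feats: (frames, dim) features of the current block.
    weight: depthwise kernel of shape (dim, 1, kernel_size).
    conv_cache: (kernel_size - 1, dim) frames cached from the previous block
    (zeros for the first block). Returns (output, new_cache)."""
    dim = block_feats.size(1)
    x = torch.cat([conv_cache, block_feats], dim=0)        # prepend cached history
    y = F.conv1d(x.t().unsqueeze(0), weight, groups=dim)   # (1, dim, frames)
    new_cache = block_feats[-(kernel_size - 1):]           # cache the last 14 frames
    return y.squeeze(0).t(), new_cache
```
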
  • part L in Figure 14 is the intermediate result of self-attention calculation cached in the previous block.
  • When calculating the second intermediate feature sequence, the Query dimension of each block is C and the Key dimension is L+C.
  • The C in the Key dimension and the C in the Query dimension come from the same source: both are calculated in the current block. L, however, is not calculated in the current block; it reuses the intermediate result cached in the first cache area when the previous block was calculated, and L is generally n times C.
  • The intermediate calculation results of blocks that have already been calculated are stored and reused in subsequent calculations, so the amount of calculation is reduced and calculation time is saved.
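
A corresponding illustrative sketch (names are hypothetical) of the attention step itself, in which the query covers only the C frames of the current block while the keys and values also include the L cached frames, so that only C new frames are computed per block:

```python
import torch
import torch.nn.functional as F

def attention_with_cache(q_cur, k_cur, v_cur, k_cache, v_cache):
    """q_cur: (C, d) queries of the current block's frames.
    k_cur, v_cur: (C, d) keys/values computed for the current block.
    k_cache, v_cache: (L, d) keys/values reused from the first cache area."""
    k = torch.cat([k_cache, k_cur], dim=0)              # (L + C, d)
    v = torch.cat([v_cache, v_cur], dim=0)              # (L + C, d)
    scores = q_cur @ k.t() / (q_cur.size(-1) ** 0.5)    # (C, L + C)
    return F.softmax(scores, dim=-1) @ v                # (C, d)
```
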
  • Figure 15 shows a flow chart of a real-time speech recognition model training method provided by some embodiments of the present application.
  • The execution subject of this method may be the model training device 410 in the solution implementation environment shown in Figure 4.
  • the method may include at least one of the following steps (1510-1540):
  • Step 1510 Obtain an audio feature sequence of the sample voice data.
  • the audio feature sequence includes audio features of multiple audio frames of the sample voice data.
  • the sample speech data is the speech data used to train the real-time speech recognition model.
  • the sample speech data corresponds to the real recognition result.
  • the real recognition result is the accurate recognition result expressed by the sample speech data.
  • After the sample voice data is obtained, it is divided into frames according to time to obtain multiple audio frames; the audio features of each frame are obtained and combined to obtain the audio feature sequence of the sample speech data.
  • Step 1520 The audio feature sequence is input to the encoder of the real-time speech recognition model.
  • the encoder divides the audio feature sequence into blocks according to the mask matrix, and encodes each block to obtain the hidden layer feature sequence of the sample speech data.
  • the hidden layer feature sequence includes the hidden layer features of each block; wherein each block includes at least two consecutive audio frames among the multiple audio frames, there is at least one overlapping audio frame between two adjacent blocks, and the encoder, when encoding the current block, uses the intermediate processing result of at least one historical block whose audio frames overlap the current block, stored in the cache area.
  • the historical block refers to a block that has at least one overlapping audio frame with the block to be identified and has been encoded.
  • the current block is the block to be identified that is being encoded
  • the historical block is the block that has been encoded.
  • the intermediate processing result of the historical block can be reused to help the current block perform encoding processing.
  • the blocks to be recognized are no longer obtained in real time, but the entire audio feature sequence is input to the encoder.
  • the audio feature sequence is chunked to generate multiple chunks.
  • Each block is encoded to obtain the hidden layer features corresponding to the effective frame part of each block.
  • the calculation results of the valid-frame part of a historical block can be reused when encoding subsequent blocks to be recognized, as in the above embodiments, to save calculation time.
  • the encoder determines the frames contained in each block according to the mask matrix; the mask matrix includes multiple sets of elements, each set of elements is used to indicate the frames included in one block, and each set of elements uses two different values to distinguish the frames that are included in the block from those that are not.
  • Figure 16 exemplarily shows a schematic diagram of block division based on the mask matrix.
  • Figure 16 only shows the valid frames and historical frames of the block.
  • the set valid frames of the block are 2 frames and the historical frames are 4 frames.
  • A value of 1 represents an audio frame that the mask matrix attends to, and a value of 0 represents an audio frame that the mask matrix does not attend to.
  • Each solid-line box represents a block, in which the rightmost 1's of each block are the valid-frame part and the remaining 1's are the historical-frame part.
  • the vertical axis of the Mask matrix represents the Query (Q) direction
  • the horizontal axis represents the Key (K) direction.
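
An illustrative sketch (not the patent's code; the function name is hypothetical) of building such a mask, using the 2 valid frames and 4 historical frames per block of Figure 16; the optional future-frame argument corresponds to the first coding layer described below, while the other layers use no future frames:

```python
import torch

def build_block_mask(num_frames, n_valid=2, n_hist=4, n_future=0):
    """Boolean mask of shape (num_frames, num_frames): entry (q, k) is 1/True
    when query frame q attends to key frame k, matching the 1/0 convention
    of the mask matrix in Figure 16."""
    mask = torch.zeros(num_frames, num_frames, dtype=torch.bool)
    for valid_start in range(0, num_frames, n_valid):
        valid_end = min(valid_start + n_valid, num_frames)
        k_start = max(0, valid_start - n_hist)            # historical frames
        k_end = min(valid_end + n_future, num_frames)     # optional future frames
        mask[valid_start:valid_end, k_start:k_end] = True
    return mask

print(build_block_mask(8).int())   # rows: Query direction, columns: Key direction
```
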
  • the specific calculation formula of the multi-head self-attention module is as follows:
  • MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O
  • Figure 16 describes the QK^T matrix calculation.
  • Q is an m×n matrix and K is a k×n matrix, where T denotes the matrix transpose, so the result of the QK^T calculation is an m×k matrix.
  • This matrix describes the correlation coefficient (degree) between each frame of the Query and each frame of the Key.
  • the encoder includes n serial coding layers, where n is an integer greater than 1.
  • Inputting the audio feature sequence to the encoder of the real-time speech recognition model and dividing the audio feature sequence into blocks according to the mask matrix through the encoder includes: determining, according to the mask matrix, the blocks corresponding to the first coding layer, where each such block includes at least one valid frame, at least one historical frame located before the valid frame, and at least one future frame located after the valid frame;
  • and determining, according to the mask matrix, the blocks corresponding to the i-th coding layer, where each such block includes at least one valid frame and at least one historical frame located before the valid frame, and i is an integer greater than 1 and less than or equal to n.
  • Performing encoding processing on each block to obtain the hidden layer feature sequence of the sample speech data includes:
  • the audio feature sequence is input to the first coding layer of the encoder and encoded through the first coding layer to obtain the output feature sequence of the first coding layer;
  • for the i-th coding layer of the encoder, the output feature sequence of the (i-1)-th coding layer is input to the i-th coding layer and encoded through the i-th coding layer to obtain the output feature sequence of the i-th coding layer; when encoding the current block, the i-th coding layer reuses the output features of at least one historical block at the (i-1)-th coding layer;
  • the output feature sequence of the n-th coding layer of the encoder is used as the hidden layer feature sequence.
  • In the model training phase, future frames are considered only when encoding the valid frames of the blocks in the first coding layer; in each subsequent i-th coding layer, only the historical frames and valid frames of the block are encoded. If the future frames of the blocks were encoded in all coding layers, a large delay would be produced, much larger than Nc+n*Nr; if only the future frames of the blocks in the first coding layer are encoded, the resulting delay is only Nc+Nr, which is much smaller and reduces the time delay of the real-time speech recognition system. In other words, only the first Conformer layer attends to future frames, while the other layers attend only to a limited number of historical frames and the current valid frames, so the delay of the speech recognition system can be controlled intuitively and flexibly, as in the sketch below.
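The latency trade-off described above can be made explicit with a tiny helper; the function below is only a back-of-the-envelope sketch in frames (Nc valid frames, Nr future frames, n coding layers), not an exact delay model from the patent.

```python
def lookahead_frames(n_layers: int, Nc: int, Nr: int, future_in_all_layers: bool) -> int:
    """Future context (in frames) the encoder waits for before a chunk can be emitted."""
    # If every layer attends to Nr future frames, the look-ahead stacks layer by layer
    # (roughly Nc + n * Nr); if only the first layer does, it stays at Nc + Nr.
    return Nc + (n_layers * Nr if future_in_all_layers else Nr)

n, Nc, Nr = 12, 4, 4
print(lookahead_frames(n, Nc, Nr, future_in_all_layers=True))    # 52 frames of look-ahead
print(lookahead_frames(n, Nc, Nr, future_in_all_layers=False))   # 8 frames of look-ahead
```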
  • the above coding layer includes: a first feedforward module, a multi-head self-attention module, a convolution module, a second feedforward module and a layer normalization module;
  • the first feedforward module is used to perform feedforward processing on the input feature sequence to obtain the first intermediate feature sequence
  • the multi-head self-attention module is used to process the first intermediate feature sequence using the multi-head self-attention mechanism to obtain the second intermediate feature sequence;
  • the convolution module is used to perform convolution processing on the second intermediate feature sequence to obtain the third intermediate feature sequence
  • the second feedforward module is used to perform feedforward processing on the third intermediate feature sequence to obtain the fourth intermediate feature sequence;
  • the layer normalization module is used to normalize the fourth intermediate feature sequence to obtain an output feature sequence.
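The module ordering listed above can be summarized in a rough PyTorch sketch of one coding layer. The residual connections, the half-weighted feed-forward branches and the hidden sizes follow the standard Conformer layer and are assumptions here, not details stated in the bullets.

```python
import torch
import torch.nn as nn

class CodingLayer(nn.Module):
    """First FFN -> multi-head self-attention -> convolution -> second FFN -> layer norm."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, kernel: int = 15):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                  nn.Linear(4 * d_model, d_model))
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel, groups=d_model,
                              padding=kernel - 1)              # padded, then trimmed below -> causal
        self.ffn2 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.SiLU(),
                                  nn.Linear(4 * d_model, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):                      # x: (batch, frames, d_model)
        x = x + 0.5 * self.ffn1(x)                              # first intermediate feature sequence
        attn, _ = self.mhsa(x, x, x, attn_mask=attn_mask)       # attn_mask can encode the chunk restriction
        x = x + attn                                            # second intermediate feature sequence
        conv_out = self.conv(x.transpose(1, 2))[:, :, :x.size(1)]
        x = x + conv_out.transpose(1, 2)                        # third intermediate feature sequence
        x = x + 0.5 * self.ffn2(x)                              # fourth intermediate feature sequence
        return self.norm(x)                                     # output feature sequence
```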
  • the convolution module is configured to perform convolution processing on the second intermediate feature of the current frame and the second intermediate feature of at least one historical frame of the current frame to obtain a third intermediate feature of the current frame.
  • The convolution module in the Conformer uses causal convolution: if the convolution kernel size is 15, the 14 historical frames and the current frame are used to predict the convolution output of the current frame, as in the cache sketch below.
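A cache-based causal convolution step for one chunk might look like the following sketch; the tensor layout and function name are assumptions for illustration, but the arithmetic matches the bullet above (kernel size 15 implies 14 cached history frames per channel).

```python
import torch
import torch.nn as nn

def causal_conv_step(conv: nn.Conv1d, x: torch.Tensor, cache: torch.Tensor):
    """x: (B, C, frames) valid frames of the current chunk; cache: (B, C, k-1) previous frames."""
    k = conv.kernel_size[0]
    ctx = torch.cat([cache, x], dim=-1)           # prepend the cached history frames
    y = conv(ctx)                                  # no padding -> exactly one output per frame of x
    return y, ctx[:, :, -(k - 1):]                 # new cache: the last k-1 frames for the next chunk

conv = nn.Conv1d(256, 256, kernel_size=15, groups=256, padding=0)
cache = torch.zeros(1, 256, 14)                    # 14 history frames when the kernel size is 15
chunk = torch.randn(1, 256, 4)                     # e.g. 4 valid frames in the current chunk
out, cache = causal_conv_step(conv, chunk, cache)  # out: (1, 256, 4)
```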
  • Step 1530 Decode the hidden layer feature sequence through the decoder of the real-time speech recognition model to obtain the predicted recognition result of the sample speech data.
  • the hidden layer feature sequence is decoded by the decoder to obtain the prediction recognition result of the sample speech data.
  • the above-mentioned encoder may be a Transformer encoder.
  • Step 1540 Train the real-time speech recognition model based on the predicted recognition results and real recognition results of the sample speech data.
  • the training loss of the real-time speech recognition model is determined, and the network parameters of the real-time speech recognition model are adjusted based on the training loss.
  • the training loss of the real-time speech recognition model is used to measure the gap between the predicted recognition results and the real recognition results.
  • the gradient descent method is used to adjust the model parameters based on the training loss, and finally a real-time speech recognition model that has completed training is obtained.
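A minimal training step consistent with this description could be sketched as below. The CTC criterion, the Adam optimizer, the placeholder linear encoder and all dimensions are assumptions made so the snippet runs; the filing's actual training combines CTC and CE losses over the full encoder-decoder model.

```python
import torch
import torch.nn as nn

encoder = nn.Linear(80, 5000)                      # stand-in for the chunk-based encoder (80-dim features)
criterion = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)

def train_step(feats, targets, feat_lens, target_lens):
    log_probs = encoder(feats).log_softmax(-1).transpose(0, 1)    # (T, N, vocab) as CTCLoss expects
    loss = criterion(log_probs, targets, feat_lens, target_lens)  # gap between predicted and real results
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                              # gradient-descent parameter update
    return loss.item()

feats = torch.randn(2, 50, 80)                     # 2 utterances, 50 frames, 80-dim features
targets = torch.randint(1, 5000, (2, 10))          # 10 labels each (0 is the blank)
print(train_step(feats, targets, torch.full((2,), 50), torch.full((2,), 10)))
```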
  • By training on whole utterances rather than pre-split chunks, and performing the chunking through the mask matrix, the model training speed is accelerated and the training efficiency of the speech recognition model is improved.
  • When encoding the current block, using the intermediate processing result, stored in the buffer area, of at least one historical block that has overlapping audio frames with the current block reduces the amount of calculation and calculation time, further speeding up model training and improving the training efficiency of the speech recognition model.
  • FIG. 17 shows a block diagram of a real-time speech recognition device provided by an embodiment of the present application.
  • the device has the function of implementing the above-mentioned real-time speech recognition method.
  • the function can be implemented by hardware, or can be implemented by hardware executing corresponding software.
  • the device can be the model using equipment introduced above, or can be set in the model using equipment.
  • the device 1700 may include: a sequence acquisition module 1710, an encoding processing module 1720, and a decoding processing module 1730.
  • The sequence acquisition module 1710 is used to obtain the audio feature sequence of the block to be recognized of the speech data; the block to be recognized includes at least two consecutive audio frames in the speech data, and the audio feature sequence of the block to be recognized includes the audio features of each audio frame contained in the block to be recognized.
  • The encoding processing module 1720 is used to obtain, from the data stored in the cache area, the intermediate processing result of the historical block corresponding to the block to be recognized, and to use the intermediate processing result of the historical block to encode the audio feature sequence of the block to be recognized, obtaining the hidden layer features of the block to be recognized, the hidden layer features being the features of the block to be recognized after encoding; the historical block refers to a block that has at least one overlapping audio frame with the block to be recognized and has been encoded.
  • the decoding processing module 1730 is configured to decode and obtain the real-time speech recognition result of the block to be recognized according to the hidden layer features.
  • the encoder includes n serially connected encoding layers, where n is an integer greater than 1;
  • the coding layer includes a multi-head self-attention module and a convolution module.
  • the multi-head self-attention module is used to process the input feature sequence using a multi-head self-attention mechanism.
  • the convolution module is used to perform convolution processing on the input feature sequence;
  • the cache area includes a first cache area and a second cache area.
  • the first cache area is used to store the intermediate processing results of the multi-head self-attention module for the historical block.
  • the second cache area is used to store the intermediate processing results of the convolution module for the historical block, as in the cache sketch below.
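A possible in-memory layout for these two cache areas is sketched below; the class and field names are illustrative, not taken from the patent.

```python
class EncoderCache:
    """Per-coding-layer caches used while streaming.

    att_cache: intermediate results of the multi-head self-attention module for the
               historical block (e.g. its key/value projections) -- the first cache area.
    cnn_cache: the last kernel_size-1 frames fed to the causal convolution of the
               historical block -- the second cache area.
    """

    def __init__(self, n_layers: int):
        self.att_cache = [None] * n_layers
        self.cnn_cache = [None] * n_layers

    def update(self, layer: int, kv, conv_tail):
        self.att_cache[layer] = kv
        self.cnn_cache[layer] = conv_tail
```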
  • the block to be recognized includes at least one valid frame and at least one historical frame located before the valid frame; the at least one historical frame is an audio frame overlapping between the block to be recognized and the historical block.
  • As shown in Figure 18, the encoding processing module 1720 includes: a cache result acquisition unit 1721 and an encoding processing unit 1722.
  • The cache result acquisition unit 1721 is used to obtain, from the data stored in the first cache area, the intermediate processing results of the multi-head self-attention module for the historical block, and to obtain, from the second cache area, the intermediate processing results of the convolution module for the historical block.
  • The encoding processing unit 1722 is configured to use the encoder of the real-time speech recognition model to encode the audio features of the at least one valid frame of the block to be recognized according to the intermediate processing results obtained from the first cache area and the second cache area, obtaining the hidden layer features of the valid frame as the hidden layer features of the block to be recognized; a streaming sketch follows.
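Putting the two units together, the streaming control flow could look like the sketch below. It is illustrative only: the layer call signature, the cache object and the chunk sizes are assumptions, and the real encoder also handles feature extraction and subsampling.

```python
def stream_encode(frames_iter, coding_layers, cache, Nc: int = 4, Nr: int = 4):
    """Yield hidden features of the valid frames of each chunk as audio frames arrive."""
    buffer = []
    for frame in frames_iter:
        buffer.append(frame)
        if len(buffer) >= Nc + Nr:                 # enough frames for one chunk (valid + future)
            hidden = buffer[:Nc + Nr]              # history context comes from the caches instead
            for i, layer in enumerate(coding_layers):
                hidden = layer(hidden, cache, i)   # reuses cache.att_cache[i] / cache.cnn_cache[i]
            yield hidden[:Nc]                      # only the valid frames' hidden features are kept
            buffer = buffer[Nc:]                   # already-encoded valid frames are not recomputed
```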
  • the encoding layer further includes a first feedforward module, a second feedforward module and a layer normalization module;
  • the first feedforward module is used to perform feedforward processing on the input feature sequence to obtain a first intermediate feature sequence
  • the multi-head self-attention module is used to process the first intermediate feature sequence using a multi-head self-attention mechanism to obtain a second intermediate feature sequence;
  • the convolution module is used to perform convolution processing on the second intermediate feature sequence to obtain a third intermediate feature sequence
  • the second feedforward module is used to perform feedforward processing on the third intermediate feature sequence to obtain a fourth intermediate feature sequence
  • the layer normalization module is used to normalize the fourth intermediate feature sequence to obtain an output feature sequence.
  • the convolution module is configured to perform convolution processing on the second intermediate feature of the current frame and the second intermediate feature of at least one historical frame of the current frame to obtain a third intermediate feature of the current frame. feature.
  • the block to be identified further includes: at least one future frame located after the valid frame.
  • the apparatus further includes: a future frame determination module 1740.
  • the future frame determination module 1740 is used to determine the number of future frames based on the delay requirements of the current speech recognition scenario.
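A trivial helper for this module might convert a latency budget into a number of future frames; the 40 ms frame shift below is an assumed value (e.g. 10 ms frames after 4x subsampling) and is not specified in the text.

```python
def num_future_frames(latency_budget_ms: float, frame_shift_ms: float = 40.0) -> int:
    """Largest Nr whose extra waiting time still fits inside the latency budget."""
    return max(0, int(latency_budget_ms // frame_shift_ms))

print(num_future_frames(200.0))   # -> 5 future frames under the assumed 40 ms frame shift
```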
  • The processing results of the previous one or more historical blocks of the block to be recognized are reused, thereby reducing the amount of calculation in the encoding process and speeding up speech recognition, which better meets the needs of real-time speech recognition.
  • FIG. 19 shows a block diagram of a training device for a real-time speech recognition model provided by an embodiment of the present application.
  • the device has the function of implementing the above-mentioned training method of the real-time speech recognition model.
  • the function can be implemented by hardware, or can be implemented by hardware executing corresponding software.
  • the device can be the model training equipment introduced above, or can be set in the model training equipment.
  • the device 1900 may include: a sample acquisition module 1910, an encoding processing module 1920, a decoding processing module 1930, and a model training module 1940.
  • the sample acquisition module 1910 is configured to acquire an audio feature sequence of sample voice data, where the audio feature sequence includes audio features of multiple audio frames of the sample voice data.
  • The encoding processing module 1920 is used to input the audio feature sequence to the encoder of the real-time speech recognition model, divide the audio feature sequence into blocks according to the mask matrix through the encoder, and encode each block to obtain the hidden layer feature sequence of the sample speech data, the hidden layer feature sequence including the hidden layer features of each block; each block includes at least two consecutive audio frames among the plurality of audio frames, and there is at least one overlapping audio frame between two adjacent blocks.
  • When encoding the current block, the encoder uses the intermediate processing result, stored in the buffer area, of at least one historical block that has overlapping audio frames with the current block; the historical block refers to a block that has at least one overlapping audio frame with the block to be recognized and has been encoded.
  • the decoding processing module 1930 is configured to decode the hidden layer feature sequence through the decoder of the real-time speech recognition model to obtain the predicted recognition result of the sample speech data.
  • the model training module 1940 is used to train the real-time speech recognition model based on the predicted recognition results and real recognition results of the sample speech data.
  • the mask matrix includes multiple sets of elements, each set of elements is used to indicate the audio frames included in a block, and each set of elements uses two different values to distinguish the audio frames that are included in a block from those that are not.
  • the encoder includes n serially connected coding layers, where n is an integer greater than 1; each of the n coding layers includes a multi-head self-attention module;
  • when calculating the multi-head self-attention coefficients, the multi-head self-attention module uses the mask matrix to mask out audio frames that do not belong to the current block.
  • the encoder includes n serially connected encoding layers, n is an integer greater than 1, and the encoding processing module 1920 is used to:
  • the block corresponding to the first coding layer is determined according to the mask matrix, including: at least one valid frame, at least one historical frame located before the valid frame, and at least one future frame located after the valid frame;
  • the block corresponding to the i-th coding layer is determined according to the mask matrix, including: at least one valid frame, and at least one historical frame before the valid frame, where i is an integer greater than 1 and less than or equal to n.
  • the encoding processing module 1920 is further used to:
  • the output feature sequence of the (i-1)-th coding layer is input to the i-th coding layer, and the output feature sequence of the (i-1)-th coding layer is encoded through the i-th coding layer to obtain the output feature sequence of the i-th coding layer; when encoding the current block, the i-th coding layer uses the output features of the at least one historical block at the (i-1)-th coding layer stored in the buffer area.
  • the output feature sequence of the nth coding layer of the encoder is used as the hidden layer feature sequence.
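During training the whole utterance is pushed through the n coding layers at once, with a per-layer mask enforcing which frames each chunk may see; a compact sketch (reusing the CodingLayer from the earlier example, with assumed interfaces) follows.

```python
def encode_all_chunks(audio_feats, coding_layers, masks_per_layer):
    """Layer 1's mask exposes valid + history + future frames; layers 2..n expose only valid + history."""
    x = audio_feats
    for layer, mask in zip(coding_layers, masks_per_layer):
        x = layer(x, attn_mask=mask)      # output of layer i-1 feeds layer i
    return x                              # output of layer n = hidden layer feature sequence
```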
  • the encoding layer includes: a first feedforward module, a multi-head self-attention module, a convolution module, a second feedforward module and a layer normalization module;
  • the first feedforward module is used to perform feedforward processing on the input feature sequence to obtain a first intermediate feature sequence
  • the multi-head self-attention module is used to process the first intermediate feature sequence using a multi-head self-attention mechanism to obtain a second intermediate feature sequence;
  • the convolution module is used to perform convolution processing on the second intermediate feature sequence to obtain a third intermediate feature sequence
  • the second feedforward module is used to perform feedforward processing on the third intermediate feature sequence to obtain a fourth intermediate feature sequence
  • the layer normalization module is used to normalize the fourth intermediate feature sequence to obtain an output feature sequence.
  • the convolution module is configured to perform convolution processing on the second intermediate feature of the current frame and the second intermediate feature of at least one historical frame of the current frame to obtain a third intermediate feature of the current frame. feature.
  • The encoding of the current block is performed using the intermediate processing result, stored in the buffer area, of at least one historical block that has overlapping audio frames with the current block.
  • the amount of calculation and calculation time are reduced, thereby speeding up model training and improving the training efficiency of the speech recognition model.
  • FIG. 20 shows a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device can be any electronic device with data calculation, processing and storage functions, such as a mobile phone, tablet computer, PC (Personal Computer, personal computer) or server, etc.
  • The computer device can be implemented as a model using device, used to implement the real-time speech recognition method provided in the above embodiments; or the computer device can be implemented as a model training device, used to implement the training method of the real-time speech recognition model provided in the above embodiments. Specifically:
  • the computer device 2000 includes a central processing unit 2001 (such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit) or an FPGA (Field Programmable Gate Array)), a system memory 2004 including a RAM (Random-Access Memory) 2002 and a ROM (Read-Only Memory) 2003, and a system bus 2005 connecting the system memory 2004 and the central processing unit 2001.
  • The computer device 2000 also includes a basic input/output system (I/O system) 2006 that helps transfer information between the various components within the server, and a mass storage device 2007 for storing an operating system 2013, application programs 2014 and other program modules 2015.
  • The basic input/output system 2006 includes a display 2008 for displaying information and an input device 2009, such as a mouse or keyboard, for inputting information. The display 2008 and the input device 2009 are both connected to the central processing unit 2001 through the input/output controller 2010 connected to the system bus 2005.
  • the basic input/output system 2006 may also include an input/output controller 2010 for receiving and processing input from a plurality of other devices such as a keyboard, mouse, or electronic stylus. Similarly, input and output controller 2010 also provides output to a display screen, printer, or other type of output device.
  • the mass storage device 2007 is connected to the central processing unit 2001 through a mass storage controller (not shown) connected to the system bus 2005 .
  • the mass storage device 2007 and its associated computer-readable media provide non-volatile storage for the computer device 2000 . That is, the mass storage device 2007 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
  • the computer-readable media may include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory, Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory, Electrically Erasable Programmable Read-Only Memory), flash memory or Other solid-state storage technologies, CD-ROM, DVD (Digital Video Disc, high-density digital video disc) or other optical storage, tape cassettes, magnetic tapes, disk storage or other magnetic storage devices.
  • The computer device 2000 can also run by connecting, through a network such as the Internet, to a remote computer on the network. That is, the computer device 2000 can be connected to the network 2012 through the network interface unit 2011 connected to the system bus 2005, or the network interface unit 2011 can be used to connect to other types of networks or remote computer systems (not shown).
  • The memory also includes at least one instruction, at least one program, a code set, or an instruction set, which is stored in the memory and configured to be executed by one or more processors to implement the above real-time speech recognition method or the training method of the real-time speech recognition model.
  • A computer-readable storage medium is also provided, which stores at least one instruction, at least one program, a code set, or an instruction set; when executed by the processor of a computer device, it implements the real-time speech recognition method or the training method of the real-time speech recognition model provided by the above embodiments.
  • the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random-Access Memory), SSD (Solid State Drives, solid state drive) or an optical disc, etc.
  • random access memory can include ReRAM (Resistance Random Access Memory, resistive random access memory) and DRAM (Dynamic Random Access Memory, dynamic random access memory).
  • a computer program product or computer program is also provided, the computer program product or computer program including computer instructions stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the above-mentioned real-time speech recognition method or the training method of the real-time speech recognition model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Telephonic Communication Services (AREA)

Abstract

一种实时语音识别方法、模型训练方法、装置及设备,涉及人工智能技术领域。其中,实时语音识别方法包括:获取目标语音数据的待识别块的音频特征序列(610);复用待识别块的历史块的处理结果,对待识别块的音频特征序列进行编码处理,得到待识别块的隐层特征序列,历史块是指与待识别块具有至少一个重叠的帧,且已经过编码处理的块(620);根据隐层特征序列,解码得到待识别块的语音识别结果(630)。该方法减少了编码处理过程的计算量,加快了语音识别的速度,满足了实时语音识别的需求。

Description

实时语音识别方法、模型训练方法、装置、设备及存储介质
本申请要求于2022年3月15日提交中国专利局、申请号为202210253123.0,发明名称为“实时语音识别方法、模型训练方法、装置及设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及人工智能技术领域,特别涉及一种实时语音识别方法、模型训练方法、装置、设备及存储介质。
背景技术
语音识别是指对对象提供的语音数据进行识别,得到相应的文本数据。
语音识别一般分为实时语音识别和非实时语音识别。非实时语音识别是指***在对象说完一句话或一段话后再进行识别,而实时语音识别是指***在对象还在说话的时候便同步进行识别,在实时语音识别场景下识别速度和延时往往成为其实际落地的瓶颈。
技术内容
本申请实施例提供了一种实时语音识别方法、模型训练方法、装置、设备及存储介质,可以解决在编码处理过程中计算量较大,导致语音识别的延时较大的技术问题。所述技术方案如下:
本申请实施例提供了一种实时语音识别方法,由计算机设备执行,所述计算机设备中部署有实时语音识别模型,所述方法包括:
获取所述语音数据的待识别块的音频特征序列,所述待识别块包括所述语音数据中的至少两个连续的音频帧,所述待识别块的音频特征序列包括所述待识别块所包含的各个帧的音频特征;
从缓存区存储的数据中,获取所述待识别块对应的历史块的中间处理结果,对所述待识别块的音频特征序列进行编码处理,得到所述待识别块的隐层特征,所述隐层特征为所述待识别块经过编码处理后的特征;其中,所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块;
根据所述隐层特征,解码得到所述待识别块的实时语音识别结果。
本申请实施例还提供了一种实时语音识别模型的训练方法,所述方法包括:
获取样本语音数据的音频特征序列,所述音频特征序列包括所述样本语音数据的多个音频帧的音频特征;
将所述音频特征序列输入至所述实时语音识别模型的编码器,通过所述编码器根据掩码矩阵对所述音频特征序列进行分块,对各个块进行编码处理,得到所述样本语音数据的隐层特征序列,所述隐层特征序列包括所述各个块的隐层特征;其中,每个块包括所述多个音频帧中的至少两个连续的音频帧,且相邻两个块之间存在至少一个重叠的音频帧,所述编码器在对当前块进行编码处理时,使用缓存区存储的与所述当前块具有重叠的音频帧的至少一个历史块的中间处理结果;所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块;
通过所述实时语音识别模型的解码器解码所述隐层特征序列,得到所述样本语音数据的预测识别结果;
基于所述样本语音数据的预测识别结果和真实识别结果,对所述实时语音识别模型进行训练。
本申请实施例还提供了一种实时语音识别装置,所述装置包括:
序列获取模块,用于获取所述语音数据的待识别块的音频特征序列,所述待识别块包括所述语音数据中的至少两个连续的音频帧,所述待识别块的音频特征序列包括所述待识别块所包含的各个音频帧的音频特征;
编码处理模块,用于从缓存区存储的数据中,获取所述待识别块对应的历史块的中间处理结果,对所述待识别块的音频特征序列进行编码处理,得到所述待识别块的隐层特征,所述隐层特征为所述待识别块经过编码处理后的特征;其中,所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块;
解码处理模块,用于根据所述隐层特征,解码得到所述待识别块的实时语音识别结果。
本申请实施例还提供了一种实时语音识别模型的训练装置,所述装置包括:
样本获取模块,用于获取样本语音数据的音频特征序列,所述音频特征序列包括所述样本语音数据的多个音频帧的音频特征;
编码处理模块,用于将所述音频特征序列输入至所述实时语音识别模型的编码器,通过所述编码器根据掩码矩阵对所述音频特征序列进行分块,对各个块进行编码处理,得到所述样本语音数据的隐层特征序列,所述隐层特征序列包括所述各个块的隐层特征;其中,每个块包括所述多个音频帧中的至少两个连续的音频帧,且相邻两个块之间存在至少一个重叠的音频帧,所述编码器在对当前块进行编码处理时,使用缓存区存储的与所述当前块具有重叠的音频帧的至少一个历史块的中间处理结果;所述历史块是指与所述待识别块具 有至少一个重叠的音频帧,且已经过编码处理的块;
解码处理模块,用于通过所述实时语音识别模型的解码器解码所述隐层特征序列,得到所述样本语音数据的预测识别结果;
模型训练模块,用于基于所述样本语音数据的预测识别结果和真实识别结果,对所述实时语音识别模型进行训练。
本申请实施例还提供了一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现上述实时语音识别方法,或者上述实时语音识别模型的训练方法。
本申请实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现上述实时语音识别方法,或者上述实时语音识别模型的训练方法。
本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述实时语音识别方法,或者上述实时语音识别模型的训练方法。
附图简要说明
图1是本申请一些实施例提供的AED(Attention based Encoder-Decoder,基于注意力的编码器-解码器)-CTC(Connectionist Temporal Classification,联结主义时间分类)/Attention(注意力)架构的示意图;
图2是本申请一些实施例提供的基于分块(chunk)操作的实时语音识别模型的训练图;
图3是本申请一些实施例提供的实时语音识别模型在使用阶段的示意图;
图4是本申请一些实施例提供的方案实施环境的示意图;
图5是本申请一些实施例提供的应用程序中实时语音识别的工作流程图;
图6是本申请一些实施例提供的实时语音识别方法的流程图;
图7是本申请一些实施例提供的语音数据的待识别块的划分示意图;
图8是本申请另一个实施例提供的实时语音识别方法的流程图;
图9是本申请一些实施例提供的Conformer网络的结构示意图;
图10是本申请一些实施例提供的卷积模块的结构示意图;
图11是本申请一些实施例提供的多头自注意力模块的结构示意图;
图12是本申请一些实施例提供的添加了缓存区的Conformer的结构示意图;
图13是本申请一些实施例提供的自注意力机制的计算方法的示意图;
图14是本申请一些实施例提供的运用缓存区进行计算的示意图;
图15是本申请一些实施例提供的实时语音识别模型的训练方法的流程图;
图16是本申请一些实施例提供的基于掩码矩阵进行块划分的示意图;
图17是本申请一些实施例提供的实时语音识别装置的框图;
图18是本申请另一些实施例提供的实时语音识别装置的框图;
图19是本申请一些实施例提供的实时语音识别模型的训练装置的框图;
图20是本申请一些实施例提供的计算机设备的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述。
本申请实施例提供的方案涉及人工智能的语音技术和机器学习等技术,具体通过如下实施例进行说明。
在一些端到端的实时语音识别方案中,通过构建包含编码器和解码器的语音识别模型,将对象输入的语音数据的音频特征输入至编码器,由编码器对其进行编码处理得到隐层特征,然后由解码器根据上述隐层特征解码得到相应的语音识别结果。然而,这种方式在编码处理过程中计算量较大,导致语音识别的延时较大。
该语音识别模型采用AED-CTC/Attention架构。该架构示例性如图1所示,包括编码网络10、解码网络20、CTC模块30和CE(Cross Entropy,交叉熵)模块40。其中,编码网络10是对声学特征进行建模,解码网络20是联合语言特征与声学特征进行建模,CTC模块30能自动学习字边界对齐,CTC模块30的自动对齐能力可以使得文本与声学特征上有更强单调对齐关系,解码网络20可以避免长句截断等问题;而解码网络20的联合建模能力也可以使得CTC带有更丰富的文本上下文能力,拥有更强的识别能力。在基于AED-CTC/Attention的端到端实时识别***中,编码网络10只能对截止当前时刻或包含有限未来时刻的部分音频特征进行编码,然后利用这些编码信息和历史预测来进行解码,通常的做法是对音频进行分块(chunk),在块内进行注意力(Attention)计算。
在模型训练阶段,通过CTC模块30和CE模块40两个模块分别计算训练损失,基于 计算得到的训练损失对语音识别模型进行参数调整。在一些实施例中,编码网络10使用Conformer结构,解码网络20使用Transformer结构。在编码网络10的末端增加CTC模块30计算CTC损失,在解码网络20的输出端增加CE模块40计算CE损失,整个模型联合两个训练准则(即上述两个损失)进行参数更新。
在模型使用阶段,去掉解码网络20和CE模块40,只使用编码网络10和CTC模块生成声学后验概率,然后引入n元语音模型,通过搜索构建的加权有限状态转换器(Weighted Finite State Transducer,WFST)图进行解码得到识别结果。
Transformer网络是深度自注意力变换网络,也常用于指代所有类似的深度自注意力变换网络结构。Transformer网络突破了循环神经网络不能并行计算的限制,相比卷积神经网络,计算两个位置之间的关联所需的操作次数不随距离增长,自注意力可以产生更具可解释性的模型。
Conformer网络是卷积增强的Transformer网络,Conformer用卷积去加强Transformer在语音识别领域的效果。在编码端利用Transformer擅长捕捉全局特征和卷积神经网络能够有效地表示局部特征的特性,将二者融合更好地提取音频特征的全局和局部依赖性,从而加强语音识别的效果。
在一些实施例中,如图2所示,图2示例性示出了基于分块操作的实时语音识别模型的训练图,图2也是图1给出的架构图的展开图。在实时语音识别模型的训练阶段,首先将输入的语音数据21基于音频帧划分为多个块,每个块包含多个连续的音频帧。在一些实施例中,每个块所包含的音频帧数量相同。当然在一些其他实施例中,每个块所包含的音频帧数量也可以是不同的,本申请对此不作限定。将每个块所包含的音频帧划分为历史帧、有效帧和未来帧。基于设置的有效帧帧数对输入的语音数据进行块的划分。在本申请例中,设置每个块的音频帧的数量是相同的,同时,由于块的划分是基于有效帧进行划分的,所以后一个块的历史帧必定和前一个块的有效帧部分重合,因此块与块之间存在重合部分。其中,有效帧是块中所需要被语音识别的部分,历史帧和未来帧是用来帮助语音识别块中有效帧部分的,历史帧是有效帧的前置帧,未来帧是有效帧的后置帧,通过有效帧的前置历史帧和后置未来帧帮助块中的有效帧部分进行语音识别。如图2中区域22,区域22中展示了划分为多个块的语音数据,Nc所表示的是块的有效帧,Ni所表示的是块的历史帧,Nr所表示的是块的未来帧。在一些实施例中,块1的未来帧部分可以和块2的有效帧部分重合,块2的历史帧部分可以和块1的有效帧部分重合。然后通过编码器对每个块进行编码处理,分别得到每个块的隐层特征序列,将每个块的隐层特征序列拼接组合后得到语音数据对应的完整隐层特征序列。同时,为了实现解码延时可控,在编码器端采用动 态的分块方式,训练每个识别结果时随机选择一组块的参数来进行分块操作。
最后,通过解码器对上述完整隐层特征序列进行解码,得到预测识别结果(如预测文本信息)。其中,基于CTC模块计算完整隐层特征序列的训练损失,对编码器进行参数调整;基于CE模块计算预测识别结果的训练损失,对编码器和解码器进行参数调整。为了保持和训练策略的一致性,当对象开始输入语音后,累积到Ni+Nc+Nr帧语音特征时才开始一个块的声学后验概率计算,最终只取出有效帧Nc帧的后验概率然后使用CTC-WFST图解码得到识别结果。
在一些实施例中,如图3所示,图3示例性示出了语音识别模型在实际使用阶段的示意图。为了加快语音识别的速度,在语音识别模型的使用阶段,通过实时输入的语音数据进行音频帧的抽取,在音频帧的数量满足一个块的音频帧数量时,将上述得到的块输入至编码器进行编码,得到该块的隐层特征序列,将各个块的隐层特征序列组合后得到语音数据对应的完整隐层特征序列。最后,对上述完整隐层特征序列进行解码,得到预测文本信息。
请参考图4,其示出了本申请一些实施例提供的方案实施环境的示意图。该方案实施环境可以实现成为一种实时语音识别***,用于对对象输入的语音数据进行识别,如实现实时语音识别的功能。该方案实施环境可以包括:模型训练设备410和模型使用设备420。其中,模型训练设备410可以是终端设备,也可以是服务器。同样的,模型使用设备420可以是终端设备,也可以是服务器。
模型训练设备410可以是诸如电脑、服务器、智能机器人等电子设备,或者是其他一些具有较强计算能力的电子设备。模型训练设备410用于对实时语音识别模型430进行训练。在本申请实施例中,实时语音识别模型430是用于对语音数据进行识别的模型,实时语音识别模型430可以包括编码网络431、解码网络432。模型训练设备410可以采用机器学习的方式对该语音识别模型430进行训练,以使得其具备较好的性能。
上述训练完成的实时语音识别模型430可部署在模型使用设备420中使用,以对语音数据进行识别,得到相应的识别结果(即预测文本数据)。模型使用设备420可以是诸如手机、电脑、智能电视、多媒体播放设备、可穿戴设备、医疗设备、智能语音交互设备、智能家电、车载终端设备等终端设备,也可以是服务器,本申请对此不作限定。本申请实施例可应用于各种场景,包括但不限于人工智能、智慧交通、辅助驾驶等。
终端设备可以是诸如手机、平板电脑、PC(Personal Computer,个人计算机)、可穿戴设备、车载终端设备、VR(Virtual Reality,虚拟现实)设备和AR(Augmented Reality,增强现实)设备等电子设备,本申请对此不作限定。终端设备中可以安装运行有应用程序 的客户端。
在本申请实施例中,上述应用程序是指能够对对象输入的语音数据进行识别的应用程序。示例性地,该应用程序可以是输入法类应用程序、社交类应用程序、互动娱乐类应用程序、地图导航类应用程序等等对象可以输入语音数据的应用程序。
服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或分布式***,还可以是提供云计算服务的云服务器。服务器可以是上述应用程序的后台服务器,用于为应用程序的客户端提供后台服务,例如对对象输入的语音数据进行识别并发送给客户端,在客户端中显示该语音数据对应的文本信息。在一些实施例中,语音识别也可以由客户端在本地完成,本申请对此不作限定。
在一些实施例中,上述应用程序可以是单独开发的独立APP(Application,应用程序),也可以是小程序,或者网页应用等其他形式的应用程序,本申请对此不作限定。
终端设备和服务器之间可以通过网络进行互相通信。
在一些实施例中,对象在应用程序对应的客户端中输入语音数据,该语音数据可以是对象实时输入的。应用程序的客户端获取到对象输入的语音数据后,将该语音数据发送至服务器中,通过服务器对语音数据进行识别,得到对应的预测文本数据。之后将识别得到的预测文本数据发送至客户端并显示在客户端中。
在一些实施例中,本申请的实时语音识别模型可用于具有不同时延需求的在线实时语音识别产品,例如语音输入法、语音笔记、车载智能语音识别、同声传译和在线直播语音识别等产品。如图5所示,图5示例性示出了应用程序中实时语音识别的工作流程图。对象点击按钮开始输入语音,客户端启动录音功能,通过客户端VAD(Voice Activity Detection,语音激活检测),将有效语音段检测出,经过音频压缩编码后上传至后台服务端,服务端首先完成音频解压缩,然后通过服务端VAD进一步检测出有效语音段,将其送入服务端语音识别解码器进行识别解码,识别结果再经过后处理操作后通过网络传回客户端呈现给对象。本申请对图5中的服务端ASR解码部分进行了优化,提出了下文实施例介绍的实时语音识别模型的使用方法和训练方法。
请参考图6,其示出了本申请一些实施例提供的实时语音识别方法的流程图。该方法的执行主体可以是图1所示方案实施环境中的模型使用设备20。该方法可以包括如下几个步骤(610~630)中的至少一个步骤:
步骤610,获取语音数据的待识别块的音频特征序列,待识别块包括语音数据中的至少两个连续的音频帧,待识别块的音频特征序列包括待识别块所包含的各个音频帧的音频特征。
在一些实施例中,采集语音数据的过程和对语音数据进行识别的过程可以是同一个设备中进行的,例如都在终端设备中进行。在一些实施例中,采集语音数据的过程和对语音数据进行识别的过程也可以在不同的设备中进行,例如,采集语音数据的过程是终端设备进行的,然后终端设备将采集的语音数据发送给服务端,由服务端对语音数据进行识别。
语音数据是对象在客户端中提供的待识别的语音数据。例如,该语音数据可以是对象在客户端中实时输入或录制的语音数据,也可以是提前录制好的语音数据。通过对对象输入的语音数据进行识别,可以得到对应的文本数据。例如对象想要在客户端中输入“早上好”文本数据,则对象可以在客户端对应的语音输入区域中输入对应文本数据是“早上好”的语音数据。
语音数据可以按照时间进行分帧,得到多个音频帧,每个音频帧的时间是相同的。待识别块是将语音数据进行块(chunk)划分后的部分语音数据,根据帧数对语音数据进行划分,得到多个块。在一些实施例中,多个块所包含的帧数是相同的,每个块至少包括两个连续的音频帧。当然,在一些其他实施例中,多个块所包含的帧数也可以是不相同的,本申请对此不作限定。在本申请中,主要以多个块所包含的帧数是相同的为例进行介绍说明。
在一些实施例中,每个块包括:至少一个有效帧、位于有效帧之前的至少一个历史帧,以及位于有效帧之后的至少一个未来帧。
块由有效帧,历史帧和未来帧组成。有效帧是块中所要识别的音频帧,历史帧和未来帧是用于帮助提高识别结果的精确度所用的音频帧,通过有效帧与历史帧、有效帧与未来帧之间的关系,通过前后音频帧来更精确地识别出有效帧部分的识别结果。在一些实施例中,选择的有效帧和未来帧的数量越多,有效帧部分的识别结果越精确,但语音识别场景的时延就越大;选择的有效帧和未来帧的数量越少,有效帧部分的识别结果越不精确,语音识别场景的时延就越小。其中,由于块的划分是根据有效帧进行的,所以各个块之间存在重叠部分(该重叠部分为重叠的部分音频帧)。
待识别块的音频特征序列是待识别块的各个音频帧所对应的音频特征的集合,对于待识别块的每一个音频帧,通过将各个音频帧对应的音频特征组合生成音频特征序列。音频特征用于表示音频帧的语义特征,音频特征可以通过音频帧对应的波形图得到,通过对音频帧对应的波形图进行频率、相位、幅度、梅尔频谱倒数等特征的计算,得到音频帧对应的音频特征。
在一些实施例中,如图7所示,图7示例性示出了语音数据的块划分示意图。假设输入的语音数据70的音频帧数量为20帧,且每个块设置的有效帧为4帧,历史帧为8帧, 未来帧为4帧,则可以将输入的语音数据70划分为5个块,各个块的帧数依次为:8帧、12帧、16帧、16帧和12帧,如图中的块71、72、73、74和75所示。其中,块71只拥有4个有效帧和4个未来帧,块72拥有4个历史帧、4个有效帧和4个未来帧,块73和74都拥有8个历史帧、4个有效帧和4个未来帧,块75拥有8个历史帧和4个有效帧。其中,各个块之间存在重合部分,例如块71的未来帧部分和块72的有效帧部分重合,块72的历史帧部分和块71的有效帧部分重合。
在实际语音识别过程中,为了减少语音识别过程的延时,在获取语音数据时,实时获取语音数据的音频帧。在音频帧的数量满足一个块的音频帧的数量时,获取该块。同样的,在音频帧数量满足下一个块的音频帧的数量时,获取下一个块。
在一些实施例中,假设块中设置的有效帧为4帧,历史帧为8帧,未来帧为4帧。在获取到音频帧的数量为8帧时,获取第一个块,第一个块拥有4个有效帧和4个未来帧,第一个块由第1~8帧组成;在获取的音频帧的数量为8个基础上,再获取到音频帧的数量为4帧时,获取第二个块,第二个块拥有4个历史帧、4个有效帧和4个未来帧,第二个块由第1~12帧组成;在获取的音频帧的数量为12帧基础上,再获取到音频帧的数量为4帧时,获取第三个块,第三个块拥有8个历史帧、4个有效帧和4个未来帧,第三个块由第1~16帧组成。依次类推,第四个块由第5~20帧组成。其中,第一个块需要获取的音频帧数量为设置的有效帧数量加未来帧数量,后续的块的获取条件为再获取的音频帧数量为设置的有效帧数量。
步骤620,从缓存区存储的数据中,获取待识别块对应的历史块的中间处理结果,利用所述历史块的中间处理结果,对待识别块的音频特征序列进行编码处理,得到待识别块的隐层特征,隐层特征为所述待识别块经过编码处理后的特征;其中,历史块是指与待识别块具有至少一个重叠的音频帧,且已经过编码处理的块。
在一些实施例中,所述历史块的中间处理结果为:在对待识别块进行编码处理的过程中需要用到的中间量。如图7所示,以块73为待识别块为例,在处理待识别块73时,此时块71和块72已被处理。对待识别块73的有效帧部分进行编码处理时,需要同时考虑待识别块73的历史帧部分和未来帧部分。而此时待识别块73的历史帧部分为块71的有效帧部分和块72的有效帧部分,此时块71和块72都为待识别块73的历史块。同时,块71和块72的有效帧部分已经处理完毕,因此可以直接使用块71和块72的有效帧部分的中间处理结果,来对待识别块73进行编码处理。
隐层特征序列是对语音数据对应的音频特征序列进行编码处理后的结果,编码处理是对音频数据划分出的各个块中的有效帧部分进行编码处理,基于待识别块中的历史帧、有 效帧和未来帧对有效帧部分进行编码处理,得到有效帧对应的隐层特征。各个待识别块的有效帧对应的编码处理后的隐层特征组合后生成隐层特征序列。
隐层特征与音频特征相对应,音频特征是音频帧未经编码处理时的特征,隐层特征是音频帧经过编码处理后的特征。
在一些实施例中,如图7所示,对语音数据进行编码处理,首先对块71进行编码处理,基于块71的4个有效帧和4个未来帧,得到块71中4个有效帧对应的隐层特征。接着对块72进行编码处理,基于块72的4个历史帧、4个有效帧和4个未来帧,得到块72中4个有效帧对应的隐层特征。其中,当块72是待识别块时,待识别块72的4个历史帧也就是块71的有效帧部分,也就是块71为待识别块72的历史块。在已经计算得到块71的4个有效帧的计算结果时,在编码处理待识别块72时,就不用再重复计算该4个音频帧了。同样的,对于块73、74和75也是如此,在此不再赘述。
在一些实施例中,基于当前语音识别场景的时延需求,确定有效帧和未来帧的数量。
时延用于表示语音识别场景的延时,时延包含首字时延和尾字时延,首字时延表示用户输入语音数据到得到第一个识别字所需要的时间,尾字时延表示用户输入语音数据到得到最后一个识别字所需要的时间。实时率是处理一段语音数据所要的时间除以该段语音数据的时间所得到的计算结果,实时率表示的是解码识别速度,实时率越小,表示解码识别速度越快,相应的时延也就越小。例如,如果处理一段长度为2小时的音频花了8个小时,则实时率为8/2=4。经实验评估,本申请中的实时语音识别模型的时延能够在500ms以内、解码实时率0.5左右,达到了很高的识别准确率。
上述编码处理中,通过对历史块的中间计算结果的复用,无需重复对历史块中音频帧进行重复的计算,从而节约了对待识别块中历史帧进行计算所需要的时间。因此,对于待识别块识别过程中,在对象获取到音频帧的数量满足历史帧+有效帧+未来帧时,获取第一个待识别块,并对该待识别块进行识别,得到对应的识别结果并显示在对象的客户端中。其中,通过调整未来帧的数量,来调整历史帧+有效帧+未来帧的帧数总和,相当于调整了获取第一个待识别块所需要的时间,相当于调整了对象在客户端中看到部分实时语音识别结果所需要的时间。
通过调整有效帧和未来帧的数量,来控制客户端中显示部分实时语音识别结果所需要的时间。可以根据对象的需求来进行有效帧和未来帧数量的调整,增加了实时语音识别功能的多样性和灵活性。
步骤630,根据隐层特征,解码得到待识别块的实时语音识别结果。
在一些实施例中,根据编码得到的隐层特征,通过对隐层特征进行解码,得到预测识 别结果。
在本申请实施例中,通过将语音数据划分为多个块,在对当前待识别块进行编码处理时,复用该待识别块的前一个或前几个历史块的中间处理结果,从而减少了编码处理过程的计算量,加快了语音识别的速度,从而更好地满足实时语音识别的需求。
请参考图8,其示出了本申请另一些实施例提供的实时语音识别方法的流程图。该方法的执行主体可以是图1所示方案实施环境中的模型使用设备20。该方法可以包括如下几个步骤(810~840)中的至少一个步骤:
步骤810,获取语音数据的待识别块的音频特征序列,待识别块包括语音数据中的至少两个连续的音频帧,待识别块的音频特征序列包括待识别块所包含的各个音频帧的音频特征。
步骤820,从缓存区存储的数据中,获取历史块的中间处理结果。
缓存区是用于存储历史块的中间处理结果的区域,缓存区中存储有待识别块的历史块对应的有效帧计算结果。在一些实施例中,缓存区中存储的有效帧计算结果的数量与设置的历史帧数量相同。
步骤830,利用所述历史块的中间处理结果,通过实时语音识别模型的编码器,对待识别块的音频特征序列进行编码处理,得到待识别块的隐层特征。
实时语音识别模型是用于对语音数据进行实时语音识别的模型,对于实时语音识别模型的结构在下文实施例中介绍。例如,实时语音识别模型可以是基于神经网络构建的模型。在一些实施例中,实时语音识别模型包括编码器(或称为编码网络)和解码器(或称为解码网络)。其中,编码器用于对输入的音频特征进行编码处理,得到隐层特征;解码器用于对隐层特征进行解码处理,得到语音识别结果。
需要说明的是,上文图5示出的服务端ASR解码部分,是指服务端执行本申请实施例提供的实时语音识别方法的完整流程,例如获取语音数据的待识别块的音频特征序列,然后通过实时语音识别模型的编码器根据待识别块对应的历史块的处理结果,对待识别块的音频特征序列进行编码处理,得到待识别块的隐层特征,之后再通过服务端CTC-WFST解码器的解码器根据隐层特征,解码得到待识别块的实时语音识别结果。
实时语音识别模型在对待识别块进行编码处理时,可以使用缓存区中的历史块的中间处理结果来帮助后续待识别块进行编码处理。进一步地,当获取到最新计算得到的有效帧的中间计算结果时,覆盖掉缓存区中相对最早存储进来的有效帧的中间计算结果。
在一些实施例中,结合参考图7,对语音数据进行编码处理,首先对块71进行编码处理,此时块71为待识别块,基于待识别块71的4个有效帧和4个未来帧,得到待识别块 71中4个有效帧对应的隐层特征,将该4个有效帧对应的中间计算结果存储在缓存区中。在一些实施例中,所述中间计算结果可以是每个编码层的多头自注意力模块计算的K和V向量,以及卷积模块计算的卷积结果。接着对块72进行编码处理,基于待识别块72的4个历史帧、4个有效帧和4个未来帧,得到待识别块72中4个有效帧对应的隐层特征。其中,当块72是待识别块时,待识别块72的4个历史帧也就是块71的有效帧部分,也就是块71为待识别块72的历史块。此时,可以从缓存区中获取该4个有效帧对应的中间计算结果,也就不用再计算该4个音频帧了。
步骤840,根据隐层特征,解码得到待识别块的实时语音识别结果。
步骤810和步骤840已在上文实施例中介绍,在此不再赘述。
在一些实施例中,编码器包括n个串联的编码层,n为大于1的整数。编码层包括多头自注意力模块和卷积模块,多头自注意力模块用于采用多头自注意力机制对输入的特征序列进行处理,卷积模块用于对输入的特征序列进行卷积处理。相应地,缓存区包括第一缓存区和第二缓存区,第一缓存区用于存储多头自注意力模块针对历史块的中间处理结果,第二缓存区用于存储卷积模块针对历史块的中间处理结果。
第一缓存区和第二缓存区的功能相同,都是对模块的输出结果进行存储,不同的是,第一缓存区存储的是多头自注意力模块中的中间处理结果,第二缓存区存储的是卷积模块中的中间处理结果。
在一些实施例中,编码层还包括第一前馈模块、第二前馈模块和层归一化模块。第一前馈模块也就是前馈模块(Feed Forward Module,简称为FFM),用于对输入特征序列进行前馈处理,得到第一中间特征序列。多头自注意力模块(Multi-Head Self Attention Module,简称为MHSA)用于采用多头自注意力机制对第一中间特征序列进行处理,得到第二中间特征序列。卷积模块(Convolution Module)用于对第二中间特征序列进行卷积处理,得到第三中间特征序列。第二前馈模块与第一前馈模块相同,用于对第三中间特征序列进行前馈处理,得到第四中间特征序列。层归一化模块(Layernorm)用于对第四中间特征序列进行归一化处理,得到输出特征序列。
在一些实施例中,如图9所示,图9示例性示出了Conformer网络的结构图,Conformer首先对输入的特征序列进行预处理,包括数据增强模块(SqecAug)910、卷积下采样模块(Convolution Subsampling)920、排列模块(Linear)930和去除模块(Dropout)940,得到预处理后的特征序列。然后将特征序列输入至编码模块950中进行编码处理,编码模块950拥有多层编码层,每层编码层的结构相同,在一些实施例中,每层编码层的结构也可以不同。在本实施例中,每层编码层的结构相同,都为图9中区域960所示的结构图。 如区域960中所示,编码层由多层组成:第一前馈模块961、多头自注意力模块962、卷积模块963、第二前馈模块964和层归一化模块965,对于每个模块的介绍在下文实施例中。编码模块950所拥有的编码层的数量可以是4层,也可以是6层,本申请对此不作限定。其中,第一前馈模块961的输出结果是第一中间特征序列,将第一中间特征序列输入至多头自注意力模块962得到第二中间特征序列,将第二中间特征序列输入至卷积模块963得到第三中间特征序列,将第三中间特征序列输入至第二前馈模块964得到第四中间特征序列,将第四中间特征序列输入至层归一化模块965得到输出特征序列。其计算公式如下所示:
$\tilde{x}_i = x_i + \tfrac{1}{2}\mathrm{FFN}(x_i)$
$x'_i = \tilde{x}_i + \mathrm{MHSA}(\tilde{x}_i)$
$x''_i = x'_i + \mathrm{Conv}(x'_i)$
$y_i = \mathrm{Layernorm}\big(x''_i + \tfrac{1}{2}\mathrm{FFN}(x''_i)\big)$
其中,FFN是前馈模块,MHSA是多头自注意力机制模块,Conv是卷积模块,Layernorm是层归一化模块。
在一些实施例中,如图10所示,图10示例性示出了卷积模块的结构示意图。卷积模块通过多层结构组成,通过层归一化模块,三个卷积模块、两个激活函数模块、加速神经网络训练和暂时去除模块组成。其中,第一卷积模块和第三卷积模块是相同的卷积模块,都是Pointwise卷积模块,只改变特征图的数量而不改变特征图大小,第二卷积模块是Depthwise卷积模块,指改变特征图的大小,不改变通道的数量。
在一些实施例中,如图11所示,图11示例性示出了多头自注意力模块的结构示意图。图11中(a)部分是多头自注意力模块的结构示意图,图11中(b)部分是(a)部分中的缩放后的注意力机制的展开结构图。对于自注意力机制的计算方法在下文中介绍。
在一些实施例中,如图12所示,图12示例性示出了添加了缓存区的Conformer的结构示意图。图12中在多头自注意力模块962和卷积模块963前都添加了缓存区,在多头自注意力模块962前添加了第一缓存区121,在卷积模块963前添加了第二缓存区122。
在一些实施例中,如图13所示,图13示例性示出了自注意力机制的计算方法示意图。注意力机制和编码过程不同,它是一个求各个输入之间的关联度的过程。因此,在本申请中,通过采用自注意力机制获取历史帧或未来帧与有效帧之间的关联度,来对有效帧的识别结果进行预测。
首先,如图13所示,将上述第一前馈模块输出的历史帧或未来帧和有效帧第一中间 特征序列作为注意力机制的输入,其中,将历史帧的前馈结果设为a 1,有效帧的前馈结果设为a 2,未来帧的前馈结果设为a 3
接着,将向量a 1,a 2,a 3分别乘上三个不同的嵌入转换矩阵W q,W k,W v,分别得到不同的向量q,k,v,其中以a 1为例,就可以得到三个向量q 1,k 1,v 1。其中,q代表的是查询向量,k代表的是键向量,v代表的是信息提取向量。
然后,以a 1为例,使用向量q 1与向量k 1相乘,就是一个注意力匹配的过程,其中,为了防止数值过大,要进行归一化的过程,将向量q 1与向量k 1相乘后,要除以
$\sqrt{d}$,
得到α 1.1,以此类推,可以得到α 1.2,α 1.3。其中d是q和k的维度,维度通常的理解是:“点是0维、直线是1维、平面是2维、体是3维”。通过这个过程可以得出内积向量α 1.i
其次,我们将得到的内积向量α 1.i进行softmax函数操作,该元素的softmax函数值,就是该元素的指数与所有元素指数和的比值。以α 1.1为例,它的softmax函数操作就是将α 1.1的指数除以α 1.1的指数,α 1.2的指数,α 1.3的指数的总和。
$\hat{\alpha}_{1,i}$
为α 1.i进行softmax函数操作后的值。
之后,将得到的
$\hat{\alpha}_{1,i}$
与v i相乘,具体来说,把
$\hat{\alpha}_{1,1}$
乘上v 1,把
$\hat{\alpha}_{1,2}$
乘上v 2,把
$\hat{\alpha}_{1,3}$
乘上v 3,把得到的结果相加得到b 1,其中b 1就是最后的输出结果,以此类推,可以得到b 2,b 3。其中,b 1即为历史帧和有效帧之间的第二中间特征序列,b 3即为未来帧和有效帧之间的第二中间特征序列。
在一些实施例中,卷积模块用于对当前帧的第二中间特征和当前帧的至少一个历史帧的第二中间特征进行卷积处理,得到当前帧的第三中间特征。
为了减少语音识别过程中的样式,Conformer里的卷积模块使用因果卷积,如果卷积核数是15的话,需要利用历史14帧和当前帧来预测当前帧的卷积输出。
设计缓存机制如下:每个块在计算卷积模块之前,把当前块的有效帧Nc部分的最后14帧进行缓存,下一个块计算卷积模块之前把这部分缓存拿来当作历史帧来使用。
在一些实施例中,如图14所示,图14中L部分为上一个块中缓存的自注意力计算中间结果,显而易见的,每个块在计算第二中间特征序列时Query维度是C,Key维度是L+C,在多头自注意力机制中,Key维度的C和Query维度的C是同源的,都是在当前块内计算的,但L不是在当前块计算,而是复用上一个块计算时缓存在第一缓存区中的中间结果,一般L是C的n倍。
在本申请实施例中,通过设置第一缓存区和第二缓存区,对已经计算完的块的中间计算结果进行存储,并将其运用于后续计算中,通过复用中间计算结果,减少了计算量,节约了计算时间。
另外,还通过在卷积模块中不对未来帧进行卷积,减少了计算量,节约了计算时间,减少了语音识别过程的时延。
请参考图15,其示出了本申请一些实施例提供的实时语音识别模型的训练方法的流程图。该方法的执行主体可以是图1所示方案实施环境中的模型训练设备10。该方法可以包括如下几个步骤(1510~1540)中的至少一个步骤:
步骤1510,获取样本语音数据的音频特征序列,音频特征序列包括样本语音数据的多个音频帧的音频特征。
样本语音数据是用来训练实时语音识别模型的语音数据,样本语音数据对应有真实识别结果,真实识别结果是样本语音数据所要表达的准确识别结果。
获取样本语音数据,按照时间对样本语音数据进行分帧,得到多个音频帧。获取各个帧的音频特征,整合得到样本语音数据的音频特征序列。
步骤1520,将音频特征序列输入至实时语音识别模型的编码器,通过编码器根据掩码矩阵对音频特征序列进行分块,对各个块进行编码处理,得到样本语音数据的隐层特征序列,隐层特征序列包括各个块的隐层特征;其中,每个块包括多个音频帧中的至少两个连续的音频帧,且相邻两个块之间存在至少一个重叠的音频帧,编码器在对当前块进行编码处理时,使用缓存区存储的与当前块具有重叠的音频帧的至少一个历史块的中间处理结果。所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块。
当前块是正在编码处理的待识别块,历史块是已经编码处理完成的块。当当前块和历史块存在重叠部分(该重叠部分为重叠的部分音频帧)时,可以复用历史块的中间处理结果来帮助当前块进行编码处理。
与模型使用过程不同的是,在模型训练过程中,不再实时获取待识别块,而是将整个音频特征序列输入至编码器中。例如,在样本语音数据为10帧的数据时,将该10帧全部输入至编码器中;又如在样本语音数据为20帧的数据时,将该20帧全部输入至编码器中。在编码器中,对音频特征序列进行分块,生成多个块。对各个块进行编码处理,得到各个块的有效帧部分对应的隐层特征。同样的,由于各个块存在重叠,可以像上述实施例一样,在对后续待识别块进行编码处理时,复用历史块中的有效帧部分的计算结果,节约计算时间。
在一些实施例中,编码器根据掩码矩阵确定各个块所包含的帧;其中,掩码矩阵包括多组元素,每组元素用于指示一个块所包含的帧,每组元素中通过两种不同的数值来区分一个块中包含和不包含的帧。
当编码器使用Transformer或Conformer结构时,Transformer或Conformer结构中 拥有多头自注意力模块。在多头自注意力模块中通过掩码(Mask)矩阵的方式对语音数据进行块的划分。如图16所示,图16示例性示出了掩码矩阵进行块的划分时的示意图。图16中仅显示了块的有效帧和历史帧部分,其中,设置的块的有效帧为2帧,历史帧为4帧,如图16中(a)部分所示,1表示的是掩码矩阵关注的音频帧,0表示的是掩码矩阵未关注的音频帧。各个实线框表示的是各个块,其中每个块的右边4个1为有效帧部分,剩下部分都为历史帧部分。如图16中(b)部分所示,Mask矩阵竖轴代表Query(查询)(Q)方向,横轴代表Key(关键词)(K)方向。具体的多头自注意力模块的计算公式如下:
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
$\mathrm{MultiHead}(Q,K,V)=\mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)W^{O}$
$\mathrm{head}_i=\mathrm{Attention}(QW_i^{Q},KW_i^{K},VW_i^{V})$
其中,图16是描述的QK^T这部分矩阵计算,Q是m×n维矩阵,K是k×n维矩阵,T代表矩阵的转置,QK^T矩阵计算结果是一个m×k维的矩阵,描述的是Query的每一帧和Key的每一帧之间的关联系数(程度)。
在一些实施例中,编码器包括n个串联的编码层,n为大于1的整数,步骤1520中,将所述音频特征序列输入至所述实时语音识别模型的编码器,通过所述编码器根据掩码矩阵对所述音频特征序列进行分块,包括:根据所述掩码矩阵确定第1个编码层所对应的块,包括:至少一个有效帧、位于所述有效帧之前的至少一个历史帧,以及位于所述有效帧之后的至少一个未来帧;
根据所述掩码矩阵确定第i个编码层所对应的块,包括:至少一个有效帧,以及位于所述有效帧之前的至少一个历史帧,i为大于1且小于或等于n的整数。
所述对各个块进行编码处理,得到所述样本语音数据的隐层特征序列,包括:
将音频特征序列输入至编码器的第1个编码层,通过第1个编码层对音频特征序列进行编码处理,得到第1个编码层的输出特征序列;
对于编码器的第i个编码层,将第i-1个编码层的输出特征序列输入至第i个编码层,通过第i个编码层对第i-1个编码层的输出特征序列进行编码处理,得到第i个编码层的输出特征序列;其中,第i个编码层在对当前块进行编码处理时,复用至少一个历史块在第i-1个编码层的输出特征;
其中,编码器的第n个编码层的输出特征序列,作为隐层特征序列。
对于模型训练阶段,在编码过程中,仅对第1个编码层的块中有效帧编码处理时考虑未来帧,在后续第i个编码层中,仅对块的历史帧和有效帧进行编码处理。因为在编码过 程如果对所有编码层中的块的未来帧进行编码处理,会产生大量的时延,其时延会远大于Nc+n*Nr。如果仅对第1层编码层中的块的未来帧进行编码处理时,所产生的时延仅为Nc+Nr,相比上述编码方法所产生的时延小很多,降低了实时语音识别***的时延。同时,只在第一层conformer关注未来帧,其他层只关注有限历史帧和当前有效帧,这样就可以直观且灵活地控制语音识别***的延时。
在一些实施例中,上述编码层包括:第一前馈模块、多头自注意力模块、卷积模块、第二前馈模块和层归一化模块;
第一前馈模块用于对输入特征序列进行前馈处理,得到第一中间特征序列;
多头自注意力模块用于采用多头自注意力机制对第一中间特征序列进行处理,得到第二中间特征序列;
卷积模块用于对第二中间特征序列进行卷积处理,得到第三中间特征序列;
第二前馈模块用于对第三中间特征序列进行前馈处理,得到第四中间特征序列;
层归一化模块用于对第四中间特征序列进行归一化处理,得到输出特征序列。
对于Conformer编码层的介绍已在上文实施例中介绍,在此不再赘述。
在一些实施例中,卷积模块用于对当前帧的第二中间特征和当前帧的至少一个历史帧的第二中间特征进行卷积处理,得到当前帧的第三中间特征。
为了减少语音识别过程中的样式,Conformer里的卷积模块使用因果卷积,如果卷积的***内核核数是15的话,需要利用历史14帧和当前帧来预测当前帧的卷积输出。
步骤1530,通过实时语音识别模型的解码器解码隐层特征序列,得到样本语音数据的预测识别结果。
根据得到的隐层特征序列,通过解码器对隐层特征序列进行解码,得到样本语音数据的预测识别结果。在一些实施例中,上述编码器可以是Transformer编码器。
步骤1540,基于样本语音数据的预测识别结果和真实识别结果,对实时语音识别模型进行训练。
根据真实文本数据和预测文本数据,确定实时语音识别模型的训练损失,并基于训练损失调整实时语音识别模型的网络参数。实时语音识别模型的训练损失用于衡量预测识别结果和真实识别结果差距,在一些实施例中,基于训练损失采用梯度下降法调整模型参数,最终得到完成训练的实时语音识别模型。
在本申请实施例中,通过对语音数据整句训练而不是对语音数据分块训练,并通过掩码矩阵来进行分块处理,从而加快了模型训练速度,提高了语音识别模型的训练效率。另外,在对当前块进行编码处理时,通过使用缓存区中存储的与当前块具有重叠的音频帧的 至少一个历史块的中间处理结果,来进行当前块的编码处理,减少了计算量和计算时间,进一步加快了模型训练速度,提高了语音识别模型的训练效率。
另外,通过在第二层编码层及其后续编码层中不对未来帧进行编码处理,减少了实时语音识别模型训练的时延。
下述为本申请装置实施例,可以用于执行本申请方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请方法实施例。
请参考图17,其示出了本申请一个实施例提供的实时语音识别装置的框图。该装置具有实现上述实时语音识别方法的功能,所述功能可以由硬件实现,也可以由硬件执行相应的软件实现。该装置可以是上文介绍的模型使用设备,也可以设置在模型使用设备中。该装置1700可以包括:序列获取模块1710、编码处理模块1720和解码处理模块1730。
序列获取模块1710,用于获取所述语音数据的待识别块的音频特征序列,所述待识别块包括所述语音数据中的至少两个连续的音频帧,所述待识别块的音频特征序列包括所述待识别块所包含的各个音频帧的音频特征。
编码处理模块1720,用于从缓存区存储的数据中,获取所述待识别块对应的历史块的中间处理结果,利用所述历史块的中间处理结果,对所述待识别块的音频特征序列进行编码处理,得到所述待识别块的隐层特征,所述隐层特征为所述待识别块经过编码处理后的特征;其中,所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块。
解码处理模块1730,用于根据所述隐层特征,解码得到所述待识别块的实时语音识别结果。
在一些实施例中,所述编码器包括n个串联的编码层,n为大于1的整数;
所述编码层包括多头自注意力模块和卷积模块,所述多头自注意力模块用于采用多头自注意力机制对输入的特征序列进行处理,所述卷积模块用于对输入的特征序列进行卷积处理;
所述缓存区包括第一缓存区和第二缓存区,所述第一缓存区用于存储所述多头自注意力模块针对所述历史块的中间处理结果,所述第二缓存区用于存储所述卷积模块针对所述历史块的中间处理结果。
在一些实施例中,所述待识别块包括至少一个有效帧和位于所述有效帧之前的至少一个历史帧;所述至少一个历史帧为所述待识别块和所述历史块之间重叠的音频帧;如图18所示,所述编码处理模块1720包括:缓存结果获取单元1721和编码处理单元1722。
缓存结果获取单元1721,用于从第一缓存区存储的数据中,获取所述多头自注意力模 块针对所述历史块的中间处理结果,从所述第二缓存区中获取所述卷积模块针对所述历史块的中间处理结果。
编码处理单元1722,用于通过实时语音识别模型的编码器,根据从所述第一缓存区和第二缓存区获取的中间处理结果,对所述待识别块的至少一个有效帧的音频特征进行编码处理,得到所述有效帧的隐层特征,作文所述待识别块的隐层特征。
在一些实施例中,所述编码层还包括第一前馈模块、第二前馈模块和层归一化模块;
所述第一前馈模块用于对输入特征序列进行前馈处理,得到第一中间特征序列;
所述多头自注意力模块用于采用多头自注意力机制对所述第一中间特征序列进行处理,得到第二中间特征序列;
所述卷积模块用于对所述第二中间特征序列进行卷积处理,得到第三中间特征序列;
所述第二前馈模块用于对所述第三中间特征序列进行前馈处理,得到第四中间特征序列;
所述层归一化模块用于对所述第四中间特征序列进行归一化处理,得到输出特征序列。
在一些实施例中,所述卷积模块用于对当前帧的第二中间特征和所述当前帧的至少一个历史帧的第二中间特征进行卷积处理,得到所述当前帧的第三中间特征。
在一些实施例中,所述待识别块还包括:位于所述有效帧之后的至少一个未来帧。
在一些实施例中,如图18所示,所述装置还包括:未来帧确定模块1740。
未来帧确定模块1740,用于基于当前语音识别场景的时延需求,确定所述未来帧的数量。
在本申请实施例中,通过将语音数据划分为多个待识别块,在对该待识别块进行编码处理时,复用该待识别块的前一个或前几个历史块的处理结果,从而减少了编码处理过程的计算量,加快了语音识别的速度,从而更好地满足实时语音识别的需求。
请参考图19,其示出了本申请一个实施例提供的实时语音识别模型的训练装置的框图。该装置具有实现上述实时语音识别模型的训练方法的功能,所述功能可以由硬件实现,也可以由硬件执行相应的软件实现。该装置可以是上文介绍的模型训练设备,也可以设置在模型训练设备中。该装置1900可以包括:样本获取模块1910、编码处理模块1920、解码处理模块1930和模型训练模块1940。
样本获取模块1910,用于获取样本语音数据的音频特征序列,所述音频特征序列包括所述样本语音数据的多个音频帧的音频特征。
编码处理模块1920,用于将所述音频特征序列输入至所述实时语音识别模型的编码器,通过所述编码器根据掩码矩阵对所述音频特征序列进行分块,对各个块进行编码处理,得 到所述样本语音数据的隐层特征序列,所述隐层特征序列包括所述各个块的隐层特征;其中,每个块包括所述多个音频帧中的至少两个连续的音频帧,且相邻两个块之间存在至少一个重叠的音频帧,所述编码器在对当前块进行编码处理时,使用缓存区存储的与所述当前块具有重叠的音频帧的至少一个历史块的处理结果,所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块。
解码处理模块1930,用于通过所述实时语音识别模型的解码器解码所述隐层特征序列,得到所述样本语音数据的预测识别结果。
模型训练模块1940,用于基于所述样本语音数据的预测识别结果和真实识别结果,对所述实时语音识别模型进行训练。
在一些实施例中,所述掩码矩阵包括多组元素,每组元素用于指示一个块所包含的音频帧,每组元素中通过两种不同的数值来区分一个块中包含和不包含的音频帧。
在一些实施例中,所述编码器包括n个串联的编码层,n为大于1的整数;所述n个编码层中的每个编码层包含多头自注意力模块;
所述多头自注意力模块在计算多头自注意力系数时,通过所述掩码矩阵屏蔽掉不属于当前块的音频帧。
在一些实施例中,所述编码器包括n个串联的编码层,n为大于1的整数,所述编码处理模块1920用于:
根据所述掩码矩阵确定第1个编码层所对应的块,包括:至少一个有效帧、位于所述有效帧之前的至少一个历史帧,以及位于所述有效帧之后的至少一个未来帧;
根据所述掩码矩阵确定第i个编码层所对应的块,包括:至少一个有效帧,以及位于所述有效帧之前的至少一个历史帧,i为大于1且小于或等于n的整数。
在一些实施例中,所述编码处理模块1920进一步用于:
将所述音频特征序列输入至所述编码器的第1个编码层,通过所述第1个编码层对所述音频特征序列进行编码处理,得到所述第1个编码层的输出特征序列;
对于所述编码器的第i个编码层,将第i-1个编码层的输出特征序列输入至所述第i个编码层,通过所述第i个编码层对所述第i-1个编码层的输出特征序列进行编码处理,得到所述第i个编码层的输出特征序列;其中,所述第i个编码层在对所述当前块进行编码处理时,使用缓存区中存储的所述至少一个历史块在所述第i-1个编码层的输出特征;
其中,所述编码器的第n个编码层的输出特征序列,作为所述隐层特征序列。
在一些实施例中,所述编码层包括:第一前馈模块、多头自注意力模块、卷积模块、第二前馈模块和层归一化模块;
所述第一前馈模块用于对输入特征序列进行前馈处理,得到第一中间特征序列;
所述多头自注意力模块用于采用多头自注意力机制对所述第一中间特征序列进行处理,得到第二中间特征序列;
所述卷积模块用于对所述第二中间特征序列进行卷积处理,得到第三中间特征序列;
所述第二前馈模块用于对所述第三中间特征序列进行前馈处理,得到第四中间特征序列;
所述层归一化模块用于对所述第四中间特征序列进行归一化处理,得到输出特征序列。
在一些实施例中,所述卷积模块用于对当前帧的第二中间特征和所述当前帧的至少一个历史帧的第二中间特征进行卷积处理,得到所述当前帧的第三中间特征。
在本申请实施例中,在对当前块进行编码处理时,通过使用缓存区中存储的与当前块具有重叠的音频帧的至少一个历史块的中间处理结果,来进行当前块的编码处理,较少了计算量和计算时间,从而加快了模型训练速度,提高了语音识别模型的训练效率。
需要说明的是,上述实施例提供的装置,在实现其功能时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
请参考图20,其示出了本申请一个实施例提供的计算机设备的结构示意图。该计算机设备可以是任何具备数据计算、处理和存储功能的电子设备,如手机、平板电脑、PC(Personal Computer,个人计算机)或服务器等。该计算机设备可以实现成为模型使用设备,用于实施上述实施例中提供的实时语音识别方法;或,该计算机设备可以实现成为模型训练设备,用于实施上述实施例中提供的实时语音识别模型的训练方法。具体来讲:
该计算机设备2000包括中央处理单元(如CPU(Central Processing Unit,中央处理器)、GPU(Graphics Processing Unit,图形处理器)和FPGA(Field Programmable Gate Array,现场可编程逻辑门阵列)等)2001、包括RAM(Random-Access Memory,随机存取存储器)2002和ROM(Read-Only Memory,只读存储器)2003的***存储器2004,以及连接***存储器2004和中央处理单元2001的***总线2005。该计算机设备2000还包括帮助服务器内的各个器件之间传输信息的基本输入/输出***(Input Output System,I/O***)2006,和用于存储操作***2013、应用程序2014和其他程序模块2015的大容量存储设备2007。
该基本输入/输出***2006包括有用于显示信息的显示器2008和用于对象输入信息的诸如鼠标、键盘之类的输入设备2009。其中,该显示器2008和输入设备2009都通过连接到***总线2005的输入输出控制器2010连接到中央处理单元2001。该基本输入/输出***2006还可以包括输入输出控制器2010以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器2010还提供输出到显示屏、打印机或其他类型的输出设备。
该大容量存储设备2007通过连接到***总线2005的大容量存储控制器(未示出)连接到中央处理单元2001。该大容量存储设备2007及其相关联的计算机可读介质为计算机设备2000提供非易失性存储。也就是说,该大容量存储设备2007可以包括诸如硬盘或者CD-ROM(Compact Disc Read-Only Memory,只读光盘)驱动器之类的计算机可读介质(未示出)。
不失一般性,该计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM(Erasable Programmable Read-Only Memory,可擦写可编程只读存储器)、EEPROM(Electrically Erasable Programmable Read-Only Memory,电可擦写可编程只读存储器)、闪存或其他固态存储技术,CD-ROM、DVD(Digital Video Disc,高密度数字视频光盘)或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知该计算机存储介质不局限于上述几种。上述的***存储器2004和大容量存储设备2007可以统称为存储器。
根据本申请实施例,该计算机设备2000还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即计算机设备2000可以通过连接在该***总线2005上的网络接口单元2011连接到网络2012,或者说,也可以使用网络接口单元2011来连接到其他类型的网络或远程计算机***(未示出)。
存储器还包括至少一条指令、至少一段程序、代码集或指令集,该至少一条指令、至少一段程序、代码集或指令集存储于存储器中,且经配置以由一个或者一个以上处理器执行,以实现上述实时语音识别方法或实时语音识别模型的训练方法。
在示例性实施例中,还提供了一种计算机可读存储介质,存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,至少一条指令、至少一段程序、代码集或指令集在被计算机设备的处理器执行时实现上述实施例提供的实时语音识别方法或实时语音识别模型的训练方法。
在一些实施例中,该计算机可读存储介质可以包括:ROM(Read-Only Memory,只读存储器)、RAM(Random-Access Memory,随机存储器)、SSD(Solid State Drives,固态硬盘)或光盘等。其中,随机存取记忆体可以包括ReRAM(Resistance Random Access Memory,电阻式随机存取记忆体)和DRAM(Dynamic Random Access Memory,动态随机存取存储器)。
在示例性实施例中,还提供了一种计算机程序产品或计算机程序,计算机程序产品或计算机程序包括计算机指令,计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质中读取计算机指令,处理器执行计算机指令,使得计算机设备执行上述实时语音识别方法或实时语音识别模型的训练方法。
应当理解的是,在本文中提及的“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。另外,本文中描述的步骤编号,仅示例性示出了步骤间的一种可能的执行先后顺序,在一些其它实施例中,上述步骤也可以不按照编号顺序来执行,如两个不同编号的步骤同时执行,或者两个不同编号的步骤按照与图示相反的顺序执行,本申请实施例对此不作限定。
以上仅为本申请的示例性实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (19)

  1. 一种实时语音识别模型的训练方法,由计算机设备执行,所述方法包括:
    获取样本语音数据的音频特征序列,所述音频特征序列包括所述样本语音数据的多个音频帧的音频特征;
    将所述音频特征序列输入至所述实时语音识别模型的编码器,通过所述编码器根据掩码矩阵对所述音频特征序列进行分块,对各个块进行编码处理,得到所述样本语音数据的隐层特征序列,所述隐层特征序列包括所述各个块的隐层特征;其中,每个块包括所述多个音频帧中的至少两个连续的音频帧,且相邻两个块之间存在至少一个重叠的音频帧,所述编码器在对当前块进行编码处理时,使用缓存区存储的与所述当前块具有重叠的音频帧的至少一个历史块的中间处理结果;所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块;
    通过所述实时语音识别模型的解码器解码所述隐层特征序列,得到所述样本语音数据的预测识别结果;
    基于所述样本语音数据的预测识别结果和真实识别结果,对所述实时语音识别模型进行训练。
  2. 根据权利要求1所述的方法,其中,所述掩码矩阵包括多组元素,每组元素用于指示一个块所包含的音频帧,每组元素中通过两种不同的数值来区分一个块中包含和不包含的音频帧。
  3. 根据权利要求1所述的方法,其中,所述编码器包括n个串联的编码层,n为大于1的整数;所述n个编码层中的每个编码层包含多头自注意力模块;
    所述多头自注意力模块在计算多头自注意力系数时,通过所述掩码矩阵屏蔽掉不属于当前块的音频帧。
  4. 根据权利要求1所述的方法,其中,所述编码器包括n个串联的编码层,n为大于1的整数;
    所述将所述音频特征序列输入至所述实时语音识别模型的编码器,通过所述编码器根据掩码矩阵对所述音频特征序列进行分块,包括:
    根据所述掩码矩阵确定第1个编码层所对应的块,包括:至少一个有效帧、位于所述有效帧之前的至少一个历史帧,以及位于所述有效帧之后的至少一个未来帧;
    根据所述掩码矩阵确定第i个编码层所对应的块,包括:至少一个有效帧,以及位于所述有效帧之前的至少一个历史帧,i为大于1且小于或等于n的整数。
  5. 根据权利要求4所述的方法,其中,所述对各个块进行编码处理,得到所述样本语音数据的隐层特征序列,包括:
    将所述音频特征序列输入至所述编码器的第1个编码层,通过所述第1个编码层对所述音频特征序列进行编码处理,得到所述第1个编码层的输出特征序列;
    对于所述编码器的第i个编码层,将第i-1个编码层的输出特征序列输入至所述第i个编码层,通过所述第i个编码层对所述第i-1个编码层的输出特征序列进行编码处理,得到所述第i个编码层的输出特征序列;其中,所述第i个编码层在对所述当前块进行编码处理时,使用缓存区中存储的所述至少一个历史块在所述第i-1个编码层的输出特征;
    其中,所述编码器的第n个编码层的输出特征序列,作为所述隐层特征序列。
  6. 根据权利要求4所述的方法,其中,所述n个编码层中的每个编码层包括:第一前馈模块、多头自注意力模块、卷积模块、第二前馈模块和层归一化模块;
    所述第一前馈模块用于对输入特征序列进行前馈处理,得到第一中间特征序列;
    所述多头自注意力模块用于采用多头自注意力机制对所述第一中间特征序列进行处理,得到第二中间特征序列;
    所述卷积模块用于对所述第二中间特征序列进行卷积处理,得到第三中间特征序列;
    所述第二前馈模块用于对所述第三中间特征序列进行前馈处理,得到第四中间特征序列;
    所述层归一化模块用于对所述第四中间特征序列进行归一化处理,得到输出特征序列。
  7. 根据权利要求6所述的方法,其中,所述卷积模块用于对当前帧的第二中间特征和所述当前帧的至少一个历史帧的第二中间特征进行卷积处理,得到所述当前帧的第三中间特征。
  8. 一种实时语音识别方法,由计算机设备执行,所述计算机设备中部署有实时语音识别模型,所述方法包括:
    获取语音数据的待识别块的音频特征序列,所述待识别块包括所述语音数据中的至少两个连续的音频帧,所述待识别块的音频特征序列包括所述待识别块所包含的各个音频帧 的音频特征;
    从缓存区存储的数据中,获取所述待识别块对应的历史块的中间处理结果,利用所述历史块的中间处理结果,对所述待识别块的音频特征序列进行编码处理,得到所述待识别块的隐层特征;其中,所述隐层特征为所述待识别块经过编码处理后的特征;所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块;
    根据所述隐层特征,解码得到所述待识别块的实时语音识别结果。
  9. 根据权利要求8所述的方法,其中,所述实时语音识别模型包括编码器,所述编码器包括n个串联的编码层,n为大于1的整数;
    所述n个编码层中的每个编码层包括多头自注意力模块和卷积模块,所述多头自注意力模块用于采用多头自注意力机制对输入的特征序列进行处理,所述卷积模块用于对输入的特征序列进行卷积处理;
    所述缓存区包括第一缓存区和第二缓存区,所述第一缓存区用于存储所述多头自注意力模块针对所述历史块的中间处理结果,所述第二缓存区用于存储所述卷积模块针对所述历史块的中间处理结果。
  10. 根据权利要求9所述的方法,其中,所述待识别块包括至少一个有效帧和位于所述有效帧之前的至少一个历史帧;所述至少一个历史帧为所述待识别块和所述历史块之间重叠的音频帧;
    所述利用所述历史块的中间处理结果,对所述待识别块的音频特征序列进行编码处理,得到所述待识别块的隐层特征序列,包括:
    从所述第一缓存区中获取所述多头自注意力模块针对所述历史块的中间处理结果,从所述第二缓存区中获取所述卷积模块针对所述历史块的中间处理结果,根据从所述第一缓存区和第二缓存区获取的中间处理结果,对所述待识别块的至少一个有效帧的音频特征进行编码处理,得到所述有效帧的隐层特征,作为所述待识别块的隐层特征。
  11. 根据权利要求9所述的方法,其中,所述n个编码层中的每个编码层还包括第一前馈模块、第二前馈模块和层归一化模块;
    所述第一前馈模块用于对输入特征序列进行前馈处理,得到第一中间特征序列;
    所述多头自注意力模块用于采用多头自注意力机制对所述第一中间特征序列进行处理,得到第二中间特征序列;
    所述卷积模块用于对所述第二中间特征序列进行卷积处理,得到第三中间特征序列;
    所述第二前馈模块用于对所述第三中间特征序列进行前馈处理,得到第四中间特征序列;
    所述层归一化模块用于对所述第四中间特征序列进行归一化处理,得到输出特征序列。
  12. 根据权利要求11所述的方法,其中,所述卷积模块用于对当前帧的第二中间特征和所述当前帧的至少一个历史帧的第二中间特征进行卷积处理,得到所述当前帧的第三中间特征。
  13. 根据权利要求10至12任一项所述的方法,其中,所述待识别块进一步包括:位于所述有效帧之后的至少一个未来帧。
  14. 根据权利要求13所述的方法,还包括:
    基于当前语音识别场景的时延需求,确定所述未来帧的数量。
  15. 一种实时语音识别模型的训练装置,包括:
    样本获取模块,用于获取样本语音数据的音频特征序列,所述音频特征序列包括所述样本语音数据的多个音频帧的音频特征;
    编码处理模块,用于将所述音频特征序列输入至所述实时语音识别模型的编码器,通过所述编码器根据掩码矩阵对所述音频特征序列进行分块,对各个块进行编码处理,得到所述样本语音数据的隐层特征序列,所述隐层特征序列包括所述各个块的隐层特征;其中,每个块包括所述多个音频帧中的至少两个连续的音频帧,且相邻两个块之间存在至少一个重叠的音频帧,所述编码器在对当前块进行编码处理时,使用缓存区存储的与所述当前块具有重叠的音频帧的至少一个历史块的中间处理结果;所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块;
    解码处理模块,用于通过所述实时语音识别模型的解码器解码所述隐层特征序列,得到所述样本语音数据的预测识别结果;
    模型训练模块,用于基于所述样本语音数据的预测识别结果和真实识别结果,对所述实时语音识别模型进行训练。
  16. 一种实时语音识别装置,包括:
    序列获取模块,用于获取所述语音数据的待识别块的音频特征序列,所述待识别块包括所述语音数据中的至少两个连续的音频帧,所述待识别块的音频特征序列包括所述待识别块所包含的各个音频帧的音频特征;
    编码处理模块,用于从缓存区存储的数据中,获取所述待识别块对应的历史块的中间处理结果,利用所述历史块的中间处理结果,对所述待识别块的音频特征序列进行编码处理,得到所述待识别块的隐层特征;其中,所述隐层特征为所述待识别块经过编码处理后的特征;所述历史块是指与所述待识别块具有至少一个重叠的音频帧,且已经过编码处理的块;
    解码处理模块,用于根据所述隐层特征,解码得到所述待识别块的实时语音识别结果。
  17. 一种计算机设备,所述计算机设备包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至7任一项所述的方法,或者实现如权利要求8至14任一项所述的方法。
  18. 一种计算机可读存储介质,所述计算机可读存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由处理器加载并执行以实现如权利要求1至7任一项所述的方法,或者实现如权利要求8至14任一项所述的方法。
  19. 一种计算机程序产品,所述计算机程序产品包括计算机指令,所述计算机指令存储在计算机可读存储介质中,处理器从所述计算机可读存储介质中读取并执行所述计算机指令,以实现如权利要求1至7任一项所述的方法,或者实现如权利要求8至14任一项所述的方法。
PCT/CN2022/142596 2022-03-15 2022-12-28 实时语音识别方法、模型训练方法、装置、设备及存储介质 WO2023173890A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/384,009 US20240062744A1 (en) 2022-03-15 2023-10-26 Real-time voice recognition method, model training method, apparatuses, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210253123.0A CN114596841A (zh) 2022-03-15 2022-03-15 实时语音识别方法、模型训练方法、装置及设备
CN202210253123.0 2022-03-15

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/384,009 Continuation US20240062744A1 (en) 2022-03-15 2023-10-26 Real-time voice recognition method, model training method, apparatuses, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023173890A1 true WO2023173890A1 (zh) 2023-09-21

Family

ID=81808903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/142596 WO2023173890A1 (zh) 2022-03-15 2022-12-28 实时语音识别方法、模型训练方法、装置、设备及存储介质

Country Status (3)

Country Link
US (1) US20240062744A1 (zh)
CN (1) CN114596841A (zh)
WO (1) WO2023173890A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114596841A (zh) * 2022-03-15 2022-06-07 腾讯科技(深圳)有限公司 实时语音识别方法、模型训练方法、装置及设备
CN116913266B (zh) * 2023-09-13 2024-01-05 腾讯科技(深圳)有限公司 一种语音检测方法、装置、设备及存储介质
CN117275484B (zh) * 2023-11-17 2024-02-20 深圳市友杰智新科技有限公司 命令词识别方法、装置、设备和介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696526A (zh) * 2020-06-22 2020-09-22 北京达佳互联信息技术有限公司 语音识别模型的生成方法、语音识别方法、装置
CN114023309A (zh) * 2020-07-15 2022-02-08 阿里巴巴集团控股有限公司 语音识别***、相关方法、装置及设备
CN114596841A (zh) * 2022-03-15 2022-06-07 腾讯科技(深圳)有限公司 实时语音识别方法、模型训练方法、装置及设备

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10593321B2 (en) * 2017-12-15 2020-03-17 Mitsubishi Electric Research Laboratories, Inc. Method and apparatus for multi-lingual end-to-end speech recognition
CN110619871B (zh) * 2018-06-20 2023-06-30 阿里巴巴集团控股有限公司 语音唤醒检测方法、装置、设备以及存储介质
CN113362812B (zh) * 2021-06-30 2024-02-13 北京搜狗科技发展有限公司 一种语音识别方法、装置和电子设备
CN113889076B (zh) * 2021-09-13 2022-11-01 北京百度网讯科技有限公司 语音识别及编解码方法、装置、电子设备及存储介质

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111696526A (zh) * 2020-06-22 2020-09-22 北京达佳互联信息技术有限公司 语音识别模型的生成方法、语音识别方法、装置
CN114023309A (zh) * 2020-07-15 2022-02-08 阿里巴巴集团控股有限公司 语音识别***、相关方法、装置及设备
CN114596841A (zh) * 2022-03-15 2022-06-07 腾讯科技(深圳)有限公司 实时语音识别方法、模型训练方法、装置及设备

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIE CHEN; YU WU; ZHENGHAO WANG; SHUJIE LIU; JINYU LI: "Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 28 February 2021 (2021-02-28), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081885970 *
YAO ZHUOYUAN, WU DI, WANG XIONG, ZHANG BINBIN, YU FAN, YANG CHAO, PENG ZHENDONG, CHEN XIAOYU, XIE LEI, LEI XIN: "WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit", ARXIV: 2102.01547V5, 29 December 2021 (2021-12-29), XP093092811, Retrieved from the Internet <URL:https://arxiv.org/pdf/2102.01547.pdf> [retrieved on 20231018] *

Also Published As

Publication number Publication date
CN114596841A (zh) 2022-06-07
US20240062744A1 (en) 2024-02-22

Similar Documents

Publication Publication Date Title
WO2023173890A1 (zh) 实时语音识别方法、模型训练方法、装置、设备及存储介质
US20220180202A1 (en) Text processing model training method, and text processing method and apparatus
CN109874029B (zh) 视频描述生成方法、装置、设备及存储介质
CN108419094B (zh) 视频处理方法、视频检索方法、装置、介质及服务器
US20200126539A1 (en) Speech recognition using convolutional neural networks
CN112333179B (zh) 虚拟视频的直播方法、装置、设备及可读存储介质
CN111581437A (zh) 一种视频检索方法及装置
WO2020257812A2 (en) Modeling dependencies with global self-attention neural networks
CN110622176A (zh) 视频分区
WO2023273628A1 (zh) 一种视频循环识别方法、装置、计算机设备及存储介质
US20230090590A1 (en) Speech recognition and codec method and apparatus, electronic device and storage medium
US20200364576A1 (en) Utilizing deep recurrent neural networks with layer-wise attention for punctuation restoration
CN115362497A (zh) 具有延迟阈值的序列到序列语音识别
CN112804558B (zh) 视频拆分方法、装置及设备
CN112989120B (zh) 一种视频片段查询***和视频片段查询方法
CN113987269A (zh) 数字人视频生成方法、装置、电子设备和存储介质
CN114390218A (zh) 视频生成方法、装置、计算机设备和存储介质
WO2023207541A1 (zh) 一种语音处理方法及相关设备
KR20230062429A (ko) 문장 기반 스케치 추천 방법 및 장치
CN114207711A (zh) 用于识别用户的语音的***和方法
CN113762056A (zh) 演唱视频识别方法、装置、设备及存储介质
CN116662604A (zh) 一种基于分层Transformer的视频摘要方法
US20230419082A1 (en) Improved Processing of Sequential Data via Machine Learning Models Featuring Temporal Residual Connections
CN115910083A (zh) 一种实时语音转换方法、装置、电子设备及介质
CN115312043A (zh) 语音识别方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22931895

Country of ref document: EP

Kind code of ref document: A1