CN116259308A - Context-aware air traffic control speech recognition method and electronic device - Google Patents


Info

Publication number
CN116259308A
CN116259308A (application CN202310548256.5A)
Authority
CN
China
Prior art keywords
context
voice
dimensional
speech
air traffic control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310548256.5A
Other languages
Chinese (zh)
Other versions
CN116259308B (en)
Inventor
郭东岳 (Guo Dongyue)
林毅 (Lin Yi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority claimed from CN202310548256.5A
Publication of CN116259308A
Application granted
Publication of CN116259308B
Status: Active

Classifications

    • G10L15/02: Feature extraction for speech recognition; selection of recognition unit
    • G08G5/0017: Arrangements for implementing traffic-related aircraft activities, e.g. generating, displaying, acquiring or managing traffic information
    • G08G5/003: Flight plan management
    • G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/083: Recognition networks
    • G10L15/16: Speech classification or search using artificial neural networks
    • G10L15/183: Speech classification or search using natural language modelling, using context dependencies, e.g. language models
    • G10L15/26: Speech to text systems
    • Y02T10/40: Engine management systems (climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of air traffic control (ATC) speech recognition, and in particular to a context-aware ATC speech recognition method and an electronic device. The method imports contextual information and builds a contextual information text sequence; it then receives ATC speech data, feeds it into a pre-built end-to-end context-aware speech recognition model for recognition, and outputs the corresponding predicted text sequence. To address the low recognition rate of conventional speech recognition models on low-frequency words and key named entities, the invention extracts contextual information from secondary surveillance radar data, flight plans and basic control data to build a contextual information text sequence, making full use of the contextual information implicit in the air-ground communication environment to improve speech recognition performance and accuracy. An end-to-end context-aware speech recognition model is also constructed; by using an end-to-end, non-autoregressive decoder, it achieves context-aware speech recognition with high real-time performance and effectively improves both the performance and the efficiency of context-dependent speech recognition.

Description

Context-aware air traffic control speech recognition method and electronic device
Technical Field
The invention relates to the field of air traffic control (ATC) speech recognition, and in particular to a context-aware ATC speech recognition method and an electronic device.
Background
Air-ground communication is an essential link in the air traffic control (ATC) process: controllers and pilots communicate over very-high-frequency (VHF) radio to ensure that flight missions are carried out safely and efficiently. Unlike flight plans, tracks and other data that are exchanged as structured messages, voice communication is flexible and diverse. At present, air-ground communication remains independent of the control automation system and is the clearest embodiment of the "human in the loop" in today's ATC process. With the development of big data and artificial intelligence, many technologies have been introduced into ATC to help controllers manage airspace, reducing controller workload and improving control efficiency. Automatic speech recognition and semantic understanding technology transcribes air-ground communication into text and parses it into structured data, providing the underlying data for applications such as ATC data analysis and real-time safety monitoring. Speech recognition is therefore the foundational technology for a range of intelligent applications based on air-ground communication, and high-performance ATC speech recognition is a prerequisite for such applications.
Existing ATC speech recognition techniques are almost entirely migrations and refinements of general-purpose algorithms: a model trained on a large amount of historical data can meet basic recognition requirements, but recognizing the many low-frequency words and similar named entities in ATC speech remains a major challenge. Current ATC speech recognition techniques can be divided into two categories, context-dependent and context-independent, according to whether contextual information is used. Context-independent methods resemble general speech recognition models: given an input speech signal, the model extracts acoustic features and outputs the probabilities of the corresponding text sequence. In air-ground communication, however, the content of an utterance is strongly correlated with the current control context and situation, and the performance of context-independent models often degrades sharply when low-frequency words or strong environmental noise are encountered.
Most current context-dependent methods fuse contextual information during decoding: the contextual information is dynamically compiled into a graph and decoded together with a language model using a Weighted Finite-State Transducer (WFST), thereby boosting the probability of hot words from the context. Although this improves recognition performance to some extent, ATC contextual information changes rapidly, and the context graph must be refreshed and recompiled for every decoding pass, which reduces recognition efficiency and makes the approach unsuitable for applications with strict real-time requirements.
What is needed is a context-aware ATC speech recognition method and electronic device that improve the recognition accuracy and efficiency of ATC speech in context-dependent environments.
Disclosure of Invention
The invention aims to solve the problems of low performance and low efficiency of context-dependent speech recognition in the prior art, and provides a context-aware ATC speech recognition method and an electronic device.
In order to achieve the above object, the present invention provides the following technical solutions:
A context-aware ATC speech recognition method, comprising the following steps:
S1: importing contextual information and establishing a contextual information text sequence;
S2: receiving ATC speech data and segmenting it into single-sentence speech signals;
S3: preprocessing each single-sentence speech signal to generate spectrogram features;
S4: feeding the contextual information text sequence and the spectrogram features into a pre-built end-to-end context-aware speech recognition model, which outputs a vocabulary probability matrix;
S5: decoding the vocabulary probability matrix into text and outputting the predicted text sequence corresponding to each single-sentence speech signal.
As a preferred embodiment of the present invention, S1 comprises:
S11: establishing a contextual information set;
S12: importing flight plan information, decoding it, extracting the named entities in the flight plan context, and adding them to the contextual information set; the named entities in the flight plan context include call signs and destination points;
S13: importing track information, decoding it, extracting the named entities in the track context, and adding them to the contextual information set; the named entities in the track context include call signs, target altitudes and flight speeds;
S14: importing the basic data of the control sector, decoding it, extracting the named entities in the sector context of the current control sector, and adding them to the contextual information set; the named entities in the sector context include geographic information, landmark points, waypoints, runway numbers and handover frequencies;
S15: defining a special symbol as a separator and inserting it between the named entities in the contextual information set;
S16: mapping each named entity to a named-entity text subsequence expressed in the speech recognition output text units;
S17: linking the named-entity text subsequences with separators and outputting the result as the contextual information text sequence;
wherein the contextual information set and the contextual information text sequence are dynamically updated as the context changes.
As a preferred embodiment of the present invention, S2 comprises:
S21: receiving ATC speech data and dividing it into speech frame signals of a preset duration;
S22: feeding the speech frame signals into a pre-built speech frame classification model, which predicts the category of each speech frame signal; the categories are voice signal and non-voice signal;
S23: outputting runs of speech frame signals continuously predicted as voice signals as single-sentence speech signals.
As a preferred embodiment of the invention, the speech frame classification model comprises a one-dimensional convolution module, an LSTM module and an output layer;
the one-dimensional convolution module learns local features in the speech frame signals, extracting feature representations of voice and non-voice signals and learning the distinction between them; the one-dimensional convolution module comprises a one-dimensional convolution layer, a ReLU activation function and a BatchNorm layer;
the LSTM module is built on an LSTM network and learns the relationship between successive speech frames, capturing the feature transitions from voice to non-voice (and vice versa) and the high-dimensional category representation of each frame;
the output layer comprises a fully connected layer and a Sigmoid activation function and outputs the probability of each speech frame signal belonging to each category.
As a preferred embodiment of the present invention, the training of the speech frame classification model comprises:
(1) collecting continuous real ATC communication speech data and, after data preprocessing and cleaning, manually labelling voice frames and non-voice frames to form a speech frame classification dataset;
(2) dividing the speech frame classification dataset into a training set, a validation set and a test set according to a preset ratio;
(3) training the speech frame classification model on the training set with the loss function $\mathcal{L}$, using an Adam optimizer for back-propagation to update the network parameters; specifically, the loss function $\mathcal{L}$ is the binary cross-entropy:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p_i + (1-y_i)\log(1-p_i)\,\right]$$

where $p_i$ and $y_i$ are respectively the predicted probability and the true probability that the $i$-th speech frame signal is a voice signal, and $N$ is the number of speech frames.
As a preferred embodiment of the present invention, the end-to-end context-aware speech recognition model comprises a speech encoder, a context encoder and an attention-based decoder;
the speech encoder learns a representation of the speech features and maps them into a high-dimensional feature space;
the context encoder learns a representation of the contextual information and maps it into a high-dimensional feature space;
the attention-based decoder computes the correspondence between the high-dimensional speech representation and the context representation, forms a joint context-speech representation, and outputs a vocabulary probability matrix;
the end-to-end context-aware speech recognition model uses Connectionist Temporal Classification (CTC) as the loss function and is trained with an Adam optimizer.
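For illustration only, the following minimal PyTorch sketch shows a CTC training step with an Adam optimizer, as described above; the tensor shapes, the stand-in linear encoder and the learning rate are assumptions made for the example, not the model defined in this disclosure.

```python
import torch
import torch.nn as nn

# Hypothetical sizes: T frames, B utterances, F feature dim, V vocabulary units (CTC blank at index 0), L label length.
T, B, F, V, L = 200, 4, 80, 770, 30

encoder = nn.Linear(F, V)                                 # stand-in for the context-aware encoder/decoder stack
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-4)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

features = torch.randn(T, B, F)                           # spectrogram features (toy data)
targets = torch.randint(1, V, (B, L), dtype=torch.long)   # label sequences (no blanks)
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), L, dtype=torch.long)

log_probs = encoder(features).log_softmax(dim=-1)         # (T, B, V) vocabulary log-probabilities
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)

optimizer.zero_grad()
loss.backward()                                           # back-propagation
optimizer.step()                                          # Adam parameter update
```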
As a preferred embodiment of the present invention, the speech encoder in the end-to-end context-aware speech recognition model consists of a dynamic convolutional recurrent neural network, which at run time can be expressed as:

$$H^{S} = \mathrm{SpeechEncoder}(X^{S})$$

where $H^{S}$ denotes the high-dimensional speech feature sequence extracted from the speech features $X^{S}$, and $X^{S}$ is the given input sequence of speech signal features;
the dynamic convolutional recurrent neural network comprises a convolution module and a recurrent neural network module;
the convolution module consists of N dynamic convolution modules and extracts high-dimensional local audio features; each dynamic convolution module is a cascade of a one-dimensional dynamic convolution layer, batch normalization and a HardTanh activation function; the one-dimensional dynamic convolution layer contains several convolution kernels of the same size, each learning a different high-dimensional category representation, and its output is obtained by weighted fusion;
the one-dimensional dynamic convolution layer is computed as:

$$y = \sum_{k=1}^{K} \alpha_k \,\mathrm{Conv}_k(x), \qquad \{\alpha_1,\dots,\alpha_K\} = \pi(x)$$

where $x$ is the given input feature, $y$ is the output feature of the one-dimensional dynamic convolution layer, $\mathrm{Conv}_k(\cdot)$ denotes the convolution operation of the $k$-th convolution kernel, $\alpha_k$ is the weight coefficient of the $k$-th convolution kernel, and $\pi(\cdot)$ is a learnable weight assignment function;
the recurrent neural network module consists of several bidirectional gated recurrent unit (Bi-GRU) modules and captures the temporal relationship between speech frames, outputting the high-dimensional speech feature representation.
As a preferred embodiment of the present invention, the context encoder comprises a word embedding layer and a long short-term memory (LSTM) network layer, and at run time can be expressed as:

$$H^{C} = \mathrm{ContextEncoder}(X^{C})$$

where $X^{C}$ is the contextual information text sequence output by S1 and $H^{C}$ denotes the high-dimensional context feature sequence extracted from $X^{C}$;
the word embedding layer maps the words of the context text sequence into a feature space of fixed dimension and learns the semantic representation between words; the LSTM layer captures the relationships between the words that make up each named entity and the boundary information of each named entity in the contextual information.
As a preferred embodiment of the present invention, the attention-based decoder comprises a context attention module and a fully connected layer; the context attention module computes the alignment between the high-dimensional speech features and the context features through an attention mechanism and fuses them into a context-aware high-dimensional speech representation, with the following main computation flow:

$$K = W^{K} H^{C} + b^{K}$$
$$\mathit{Scores} = \mathrm{Softmax}\!\left(H^{S} K^{\top}\right)$$
$$H^{CS} = \sigma\!\left(W \left[\,H^{S};\ \mathit{Scores}\, H^{C}\right] + b\right)$$

where $K$ is a linear transformation of the contextual information, $\mathit{Scores}$ is the relevance score between the speech features and the context attention features, $H^{CS}$ is the context-aware high-dimensional speech representation, $[\cdot\,;\cdot]$ is the vector concatenation operation, $\sigma$ is a nonlinear activation function, $W$ (and $W^{K}$) are preset learnable parameter matrices, and $b$ (and $b^{K}$) are bias vectors;
the fully connected layer takes the context-aware high-dimensional speech representation as input, is activated with a Softmax function, and outputs the corresponding vocabulary probability matrix:

$$P = \mathrm{Softmax}\!\left(\mathrm{FC}\!\left(H^{CS}\right)\right)$$

where $\mathrm{FC}$ denotes the fully connected layer and $P$ the generated probability matrix;
the probability matrix $P$, combined with the speech recognition modeling vocabulary, is decoded with a greedy algorithm or a beam search algorithm to obtain the readable text sequence $Y$, i.e. the final recognition result of the speech.
An electronic device comprising at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, which, when executed, enable the at least one processor to perform any of the methods described above.
Compared with the prior art, the invention has the beneficial effects that:
aiming at the problem of low recognition rate of low-frequency words and key named entities of the traditional voice recognition model, the invention establishes a context information text sequence by extracting context information from secondary radar, flight plan and control basic data, and fully utilizes the hidden context information in the land-air communication environment to improve the performance and precision of voice recognition. Meanwhile, an end-to-end context-aware speech recognition model is constructed, and the context-aware speech recognition with high real-time performance is realized in an end-to-end and non-autoregressive decoder mode, so that the performance and efficiency of the context-related speech recognition are effectively improved.
The invention further provides a context attention mechanism, which computes the association between the speech features and the context features by matrix multiplication, enabling fast and effective fusion of the speech and context features.
Drawings
Fig. 1 is a flow chart of the context-aware ATC speech recognition method according to embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the end-to-end context-aware speech recognition model in the context-aware ATC speech recognition method according to embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of the structure of the speech frame classification model in the context-aware ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 4 is a schematic diagram of the dynamic convolution module of the end-to-end context-aware speech recognition model in the context-aware ATC speech recognition method according to embodiment 2 of the present invention;
Fig. 5 is a schematic diagram of the structure of an electronic device according to embodiment 4 of the present invention, which uses the context-aware ATC speech recognition method of embodiment 1 or 2.
Detailed Description
The present invention will be described in further detail with reference to the examples and embodiments. The scope of the invention is not limited to the following embodiments; all techniques realized on the basis of the present invention fall within the scope of the invention.
Example 1
As shown in Fig. 1, a context-aware ATC speech recognition method comprises the following steps:
S1: importing contextual information and establishing a contextual information text sequence;
S2: receiving ATC speech data and segmenting it into single-sentence speech signals;
S3: preprocessing each single-sentence speech signal to generate spectrogram features;
S4: feeding the contextual information text sequence and the spectrogram features into a pre-built end-to-end context-aware speech recognition model, which outputs a vocabulary probability matrix;
S5: decoding the vocabulary probability matrix into text and outputting the predicted text sequence corresponding to each single-sentence speech signal.
As shown in Fig. 2, the end-to-end context-aware speech recognition model comprises a speech encoder (Speech Encoder), a context encoder (Context Encoder) and an attention-based decoder (Attention-based Decoder).
Example 2
This example is a specific implementation of the method described in embodiment 1 and comprises the following steps:
S1: importing the contextual information and establishing the contextual information text sequence. In the current ATC working mode, a controller obtains radar, flight plan and other information from the automation system, aggregates it to understand the current air traffic situation, and issues voice instructions over VHF radio to the aircraft under control according to each aircraft's flight mission, so as to adjust and control air traffic. The content of a control call is therefore closely related to the current air traffic situation; in other words, the air traffic situation can be regarded as the context of the air-ground communication, and making full use of this contextual information is the key to breaking through the bottleneck of current ATC speech recognition technology.
S11: a set of context information is established, which is initially an empty set.
S12: leading in flight plan information, decoding and extracting named entities in a flight plan context, and inputting the context information set; named entities in the flight plan context include call signs, destination punctuations, and the like.
Specifically, after the flight plan information is led in, decoding is carried out according to the composition rule of the flight plan message, clear plan information is obtained, the flight plans of all the aircrafts in the current control range are extracted from the plan information, the named entity fields such as call signs, destination punctuations and the like are extracted, and the context information set is added.
S13: leading track information, decoding and extracting named entities in track context, and inputting the context information set; named entities in the track context include call signs, target altitude, flight speed, and the like.
Specifically, after the track information is led, decoding is carried out according to the composition rule of the track message to obtain the track information of the plaintext, the track information is filtered, the flight plans of all airplanes in the current control range are extracted, the named entity fields such as call sign, target height, speed and the like are extracted, and the context information set is added.
S14: the basic data of the control sector is led in, the named entity in the sector context of the current control sector is decoded and extracted, and the context information set is input; named entities in the sector context include geographic information, landmark points, waypoints, runway numbers, handoff frequencies, and the like.
Specifically, after the basic data of the current control sector is led, decoding is carried out according to a message rule to obtain key naming entity information such as geographic information, landmark points, waypoints, track numbers, handover frequencies and the like of the current control sector, and a context information set is added.
Steps S11-S14 complete the initial extraction of the contextual information. Unlike conventional speech recognition, where contextual information can be fed to the model directly, the context set in the ATC domain still has the following problems: a) each item in the context set is usually a semantically indivisible named entity composed of several words, and the boundaries between entities need to be distinguished; b) the messages consist mostly of English letters and digits, which are heterogeneous with the text units output by speech recognition and thus unfavourable for feature representation and fusion. In view of this, the present embodiment uses the following method to obtain the final contextual information text sequence.
S15: a special symbol is defined as a separator and inserted into the middle of each named entity in the context information set to distinguish the boundaries of adjacent named entities and enhance the semantic representation inside the named entities. In this embodiment, < eos > is used as the separator.
S16: mapping each named entity into a named entity text subsequence corresponding to the voice recognition output text unit. In this embodiment, the entity mapping operates according to the rule of the null management message.
S17: and linking the named entity text subsequences through separators, and outputting the named entity text subsequences as a context information text sequence.
The context information set and the context information text sequence are dynamically updated along with the change of the context, namely new context information is added in real time, and outdated context information is removed at fixed time, so that the accuracy and timeliness of the context information set are maintained; accordingly, the context information text sequence is updated with the update of the set of context information.
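For illustration, the following Python sketch shows one way steps S15-S17 could be realized; the character-to-unit mapping table and the example entity values are invented for the sketch and do not reproduce the actual ATC message rules.

```python
# Hypothetical helper illustrating S15-S17: named entities gathered from flight plans, tracks and
# sector data are mapped to speech-recognition text units and joined with an <eos> separator.

SEP = "<eos>"

# toy mapping from message characters to recognition output units (the real rules follow the ATC message spec)
CHAR_MAP = {"C": "charlie", "A": "alpha", "1": "one", "2": "two", "0": "zero", "5": "five"}

def map_entity(entity: str) -> str:
    """Map one named entity to a text subsequence in the recognizer's output units (S16)."""
    return " ".join(CHAR_MAP.get(ch, ch.lower()) for ch in entity)

def build_context_sequence(entities: list[str]) -> str:
    """Link the mapped subsequences with separators (S17)."""
    return f" {SEP} ".join(map_entity(e) for e in entities)

if __name__ == "__main__":
    context_set = ["CA1250", "10500", "125.0"]   # e.g. call sign, target altitude, handover frequency (made up)
    print(build_context_sequence(context_set))
```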
S2: and receiving the blank pipe voice data and performing segmentation processing to generate a single sentence voice signal. The land-air call is presented in the form of a continuous voice dialogue, and the segmentation of a single sentence of voice signal from a continuous voice stream is a basic requirement of air-management voice-related applications. Because the land-air call is based on half-duplex very high frequency radio call, the problems of overlapping of voice signals and the like do not exist, and therefore, the voice signals of people and the voice signals of non-people are recognized from continuous audio signals, and the segmentation of the land-air call voice can be completed.
S21: receiving blank pipe voice data and dividing the blank pipe voice data into a plurality of voice frame signals with preset duration; in this embodiment, the duration is 25ms.
S22: inputting the voice frame signals into a pre-constructed voice frame classification model, and predicting the category of the voice frame signals; the category includes voice speech signals and non-voice speech signals.
S23: and outputting the voice frame signal continuously predicted as the voice signal as a single sentence voice signal, wherein the single sentence voice signal is an audio file stored as a wav format.
Specifically, as shown in Fig. 3, the speech frame classification model comprises a one-dimensional convolution module (Conv1D), an LSTM module and an output layer. Its workflow can be expressed as follows: given a speech frame sequence $S=\{s_1, s_2, \dots, s_k, \dots, s_n\}$, to predict the category of $s_k$ the subsequence $S'=\{s_{k-m}, \dots, s_k, \dots, s_{k+m}\}$ of $S$ is fed into the model, which outputs the classification result.
Specifically, the one-dimensional convolution module is a feature extractor that learns local features in the speech frame signals, extracting feature representations of voice and non-voice signals and learning the distinction between the two; it comprises a one-dimensional convolution layer, a ReLU activation function and a BatchNorm layer.
The LSTM module is built on an LSTM network and learns the relationship between successive speech frames, capturing the feature transitions from voice to non-voice (and vice versa) and the high-dimensional category representation of each frame.
The output layer comprises a fully connected layer and a Sigmoid activation function and outputs the probability of each speech frame signal belonging to each category.
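A minimal PyTorch sketch of a speech frame classifier with this Conv1D + LSTM + Sigmoid structure is given below; the feature dimension, channel widths, window length and the use of the last LSTM state for the centre frame are assumptions made for the example, not parameters fixed by this embodiment.

```python
import torch
import torch.nn as nn

class SpeechFrameClassifier(nn.Module):
    """Conv1D feature extractor + LSTM + Sigmoid output, following the structure described above."""

    def __init__(self, n_feats: int = 40, hidden: int = 128):
        super().__init__()
        self.conv = nn.Sequential(                                   # one-dimensional convolution module
            nn.Conv1d(n_feats, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.BatchNorm1d(64),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)            # LSTM module over the frame sequence
        self.out = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid()) # output layer: P(voice)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 2m+1 frames, n_feats), the subsequence s_{k-m}..s_{k+m} centred on frame s_k
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)             # (batch, frames, 64)
        h, _ = self.lstm(h)
        return self.out(h[:, -1, :]).squeeze(-1)                     # probability that s_k is a voice frame

model = SpeechFrameClassifier()
probs = model(torch.randn(8, 11, 40))                                # 8 windows of 11 frames -> 8 probabilities
```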
The training of the speech frame classification model comprises the following steps:
(1) Continuous real ATC communication speech data is collected and, after data preprocessing and cleaning, voice frames and non-voice frames are labelled manually to form a speech frame classification dataset.
(2) The speech frame classification dataset is divided into a training set, a validation set and a test set according to a preset ratio.
(3) The speech frame classification model is trained on the training set with the loss function $\mathcal{L}$, using an Adam optimizer for back-propagation to update the network parameters. Specifically, the loss function $\mathcal{L}$ is the binary cross-entropy:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p_i + (1-y_i)\log(1-p_i)\,\right]$$

where $p_i$ and $y_i$ are respectively the predicted probability and the true probability that the $i$-th speech frame is a voice signal, and $N$ is the number of speech frames.
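The following sketch illustrates one training step with the binary cross-entropy loss and the Adam optimizer described above; the toy model and synthetic tensors are placeholders for the real classifier and the labelled dataset.

```python
import torch

# Synthetic stand-ins: in practice `model` is the speech frame classifier and the tensors come
# from the labelled speech frame classification dataset.
model = torch.nn.Sequential(torch.nn.Linear(40, 1), torch.nn.Sigmoid())
windows = torch.randn(32, 40)                     # 32 frames of 40-dim features (toy data)
labels = torch.randint(0, 2, (32,)).float()       # 1 = voice frame, 0 = non-voice frame

bce = torch.nn.BCELoss()                          # the binary cross-entropy loss L given above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

optimizer.zero_grad()
p = model(windows).squeeze(-1)                    # predicted probabilities p_i
loss = bce(p, labels)
loss.backward()                                   # back-propagation
optimizer.step()                                  # Adam update of the network parameters
```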
S3: and preprocessing the single sentence voice signal to generate a spectrogram characteristic.
S31: and pre-emphasis processing is carried out on the single speech signals, so that the influence of lip radiation is removed, and the high-frequency resolution of speech is increased.
S32: dividing the pre-emphasized single sentence speech signal into a plurality of speech frame sequences; the slicing parameter in this embodiment is a frame length of 25ms, and the frame is shifted by 10ms.
S33: and windowing the voice frame sequence by using a Hamming window to obtain a good sidelobe reduction amplitude.
S34: and performing fast Fourier transform on the voice frame sequence subjected to the windowing operation, and outputting spectrogram characteristics. And performing the fast Fourier transform operation, namely performing square after Fourier transform and dividing the square by the number of points of the fast Fourier transform, so as to obtain the spectrogram characteristics.
S4: and inputting the context information text sequence and the spectrogram characteristics into a pre-constructed end-to-end context aware speech recognition model, and outputting a vocabulary probability matrix.
The end-to-end context aware speech recognition model includes a speech encoder, a context encoder, and an attention-based decoder. The speech coder is configured to learn representative features of speech features and map the speech features to a high-dimensional feature space. The context encoder is used to learn a high-dimensional representation of the context information. The attention mechanism-based decoder is used for calculating the correspondence between the high-dimensional voice features and the context features and outputting a vocabulary probability matrix. The end-to-end context aware speech recognition model uses time series connection sense classification as a loss function and is trained by an Adam optimizer.
(1) The speech encoder learns a representation of the speech features and maps them into a high-dimensional feature space; it consists of a dynamic convolutional recurrent neural network, which at run time can be expressed as:

$$H^{S} = \mathrm{SpeechEncoder}(X^{S})$$

where $H^{S}$ denotes the high-dimensional speech feature sequence extracted from the speech features $X^{S}$, and $X^{S}$ is the given input sequence of speech signal features.
Specifically, the dynamic convolutional recurrent neural network comprises a convolution module and a recurrent neural network module. The convolution module consists of N dynamic convolution modules and extracts high-dimensional local audio features. As shown in Fig. 4, each dynamic convolution module is a cascade of a one-dimensional dynamic convolution layer (1D dynamic convolutional layer, DConv1D), batch normalization (BN) and a HardTanh (HT) activation function.
The one-dimensional dynamic convolution layer is an extension of the ordinary one-dimensional convolution layer: compared with ordinary one-dimensional convolution, DConv1D has several convolution kernels of the same size, each of which learns a different high-dimensional category representation, and the output of the dynamic convolution layer is obtained by weighted fusion.
Given the input feature $x$, the dynamic convolution layer is computed as:

$$y = \sum_{k=1}^{K} \alpha_k \,\mathrm{Conv}_k(x), \qquad \{\alpha_1,\dots,\alpha_K\} = \pi(x)$$

where $x$ is the given input feature, $y$ is the output feature of the one-dimensional dynamic convolution layer, $\mathrm{Conv}_k(\cdot)$ denotes the convolution operation of the $k$-th convolution kernel, $\alpha_k$ is the weight coefficient of the $k$-th convolution kernel, and $\pi(\cdot)$ is a learnable weight assignment function. Since $\pi(\cdot)$ is a function of the input features, the weight coefficient of each convolution kernel can be adjusted dynamically according to the input, which makes the layer more robust than ordinary convolution.
The recurrent neural network module consists of M bidirectional gated recurrent unit (Bi-GRU) modules and captures the temporal relationship between speech frames, outputting the high-dimensional speech feature representation.
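The following PyTorch sketch illustrates one possible form of the DConv1D + BN + HardTanh module described above; the global-pooling-plus-linear form of the weight assignment function π(·) and the layer sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class DynamicConv1d(nn.Module):
    """K same-size Conv1d kernels whose outputs are fused with input-dependent weights pi(x)."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 5, num_kernels: int = 4, stride: int = 1):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(in_ch, out_ch, kernel_size, stride=stride, padding=kernel_size // 2)
            for _ in range(num_kernels)
        )
        # pi(.): learnable weight-assignment function (here: global pooling + linear + softmax, an assumption)
        self.pi = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(in_ch, num_kernels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_ch, time)
        alpha = torch.softmax(self.pi(x), dim=-1)                     # (batch, K) kernel weights
        outs = torch.stack([conv(x) for conv in self.convs], dim=1)   # (batch, K, out_ch, time)
        return (alpha[:, :, None, None] * outs).sum(dim=1)            # weighted fusion

block = nn.Sequential(DynamicConv1d(80, 512), nn.BatchNorm1d(512), nn.Hardtanh())  # DConv1D + BN + HT
y = block(torch.randn(2, 80, 100))   # toy spectrogram batch: (batch, features, frames)
```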
(2) The context encoder learns a high-dimensional category representation of the contextual information; it comprises a word embedding layer (Word Embedding) and a long short-term memory network layer (LSTM). The word embedding layer maps the words of the context text sequence into a feature space of fixed dimension and learns the semantic representation between words; the LSTM layer captures the relationships between the words that make up each named entity and the boundary information of each named entity in the contextual information. The function of the context encoder can be summarized as:

$$H^{C} = \mathrm{ContextEncoder}(X^{C})$$

where $X^{C}$ is the contextual information text sequence output by S1 and $H^{C}$ denotes the high-dimensional context feature sequence extracted from $X^{C}$.
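A minimal PyTorch sketch of such a word-embedding + LSTM context encoder is shown below; the vocabulary size, embedding dimension and hidden size are illustrative values only.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Word embedding layer + LSTM layer producing the high-dimensional context representation H^C."""

    def __init__(self, vocab_size: int = 770, embed_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # fixed-dimension word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, context_len) indices of the context text sequence X^C
        h, _ = self.lstm(self.embedding(token_ids))
        return h                                               # (batch, context_len, hidden) = H^C

encoder = ContextEncoder()
H_c = encoder(torch.randint(0, 770, (2, 40)))                  # toy batch of 40-token context sequences
```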
(3) The attention-based decoder computes the correspondence between the high-dimensional speech feature representation and the context feature representation, forms the joint context-speech representation, and outputs a vocabulary probability matrix; it comprises a context attention module and a fully connected layer. The context attention module fuses the high-dimensional speech features $H^{S}$ and the high-dimensional context features $H^{C}$ into a context-aware high-dimensional speech representation $H^{CS}$: the correspondence between the speech features and the context features is computed by an attention mechanism, and feature fusion is performed by weighted summation and vector concatenation. Finally, the fully connected layer outputs the corresponding vocabulary probability matrix. The computation of the attention-based decoder is as follows:

$$K = W^{K} H^{C} + b^{K}$$
$$\mathit{Scores} = \mathrm{Softmax}\!\left(H^{S} K^{\top}\right)$$
$$H^{CS} = \sigma\!\left(W \left[\,H^{S};\ \mathit{Scores}\, H^{C}\right] + b\right)$$

where $K$ is a linear transformation of the contextual information, $\mathit{Scores}$ is the relevance score between the speech features and the context attention features, $H^{CS}$ is the context-aware high-dimensional speech representation, $[\cdot\,;\cdot]$ is the vector concatenation operation, $\sigma$ is a nonlinear activation function, $W$ (and $W^{K}$) are preset learnable parameter matrices, and $b$ (and $b^{K}$) are bias vectors.
The fully connected layer takes the context-aware high-dimensional speech representation $H^{CS}$ as input, is activated with a Softmax function, and outputs the corresponding vocabulary probability matrix:

$$P = \mathrm{Softmax}\!\left(\mathrm{FC}\!\left(H^{CS}\right)\right)$$

where $\mathrm{FC}$ denotes the fully connected layer and $P$ the generated probability matrix. The probability matrix $P$, combined with the speech recognition modeling vocabulary, is decoded with a greedy algorithm or a beam search algorithm to obtain the readable text sequence $Y$, i.e. the final recognition result of the speech.
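The following PyTorch sketch illustrates the attention-based fusion and Softmax output described above (linear key transform, matrix-multiplication scores, weighted summation and concatenation); the tanh fusion activation and the layer sizes are assumptions made for the example.

```python
import torch
import torch.nn as nn

class ContextAttentionDecoder(nn.Module):
    """Attention between H^S and H^C, concatenation fusion, and a Softmax vocabulary output."""

    def __init__(self, speech_dim: int = 512, ctx_dim: int = 512, vocab_size: int = 770):
        super().__init__()
        self.key = nn.Linear(ctx_dim, speech_dim)                    # K: linear transform of the context
        self.fuse = nn.Linear(speech_dim + ctx_dim, speech_dim)      # W, b of the fusion step
        self.fc = nn.Linear(speech_dim, vocab_size)                  # fully connected output layer

    def forward(self, H_s: torch.Tensor, H_c: torch.Tensor) -> torch.Tensor:
        # H_s: (batch, T, speech_dim); H_c: (batch, C, ctx_dim)
        K = self.key(H_c)                                            # (batch, C, speech_dim)
        scores = torch.softmax(H_s @ K.transpose(1, 2), dim=-1)      # relevance scores (batch, T, C)
        context = scores @ H_c                                       # weighted summation of context features
        H_cs = torch.tanh(self.fuse(torch.cat([H_s, context], dim=-1)))  # context-aware speech representation
        return torch.softmax(self.fc(H_cs), dim=-1)                  # vocabulary probability matrix P

decoder = ContextAttentionDecoder()
P = decoder(torch.randn(2, 100, 512), torch.randn(2, 40, 512))       # (batch, T, 770)
```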
The construction of the end-to-end context-aware speech recognition model comprises the following steps:
(1) Dataset construction
a) Air-ground communication speech data is collected, together with the contextual information at the time of each dialogue;
b) the speech data is labelled, and the key contextual information is extracted and converted into text sequences;
c) the processed data is divided into a training set, a validation set and a test set for model training.
(2) Determining the loss function
The proposed end-to-end context-aware speech recognition model uses Connectionist Temporal Classification (CTC) as the loss function to learn the difference between the target text sequence and the model output, thereby optimizing the model parameters.
(3) Training the model
In this embodiment, an Adam optimizer is used to train the model; after training, the model is validated on the test set and the speech recognition model with the best performance is selected.
S5: decoding the vocabulary probability matrix into text information, and outputting a predicted text sequence corresponding to each single sentence voice signal; the word probability matrix is decoded by adopting a greedy algorithm or a beam search algorithm.
Example 3
This embodiment is a concrete application of the method described in embodiment 2. The experimental data was collected from a real ATC system, recording control calls and the contemporaneous air traffic context information. After data cleaning, labelling, checking and alignment, 100,000 speech utterances were obtained, and the speech recognition model was trained on a dataset split 90% / 5% / 5% into training, validation and test sets. The details of the dataset are shown in the following table:
Table 1. Dataset information
(table given as an image in the original publication; not reproduced here)
where #H denotes the speech duration in hours and #U denotes the number of utterances.
In this embodiment, Chinese characters and English letters are used as the basic modeling units; the experimental data contains 770 basic modeling units, including 738 Chinese characters, 26 English letters and 6 special modeling units (<blank>, <space>, <eos>, <sos>, <pad>, <unk>).
S1: the context information is led to establish a text sequence of the context information.
S2: and receiving the blank pipe voice data and performing segmentation processing to generate a single sentence voice signal.
The method comprises the steps of leading in real-time land-air communication voice signals (namely, blank pipe voice data) from the current control system, classifying and segmenting continuous voice signals into single sentence voice signals by using the voice frame classification method.
S3: and preprocessing the single sentence voice signal to generate a spectrogram characteristic.
S31: the cut single sentence speech signal is pre-emphasized.
S32: the audio signal is divided into several frame sequences with a frame length of 25ms and a frame shift of 10ms.
S33: and windowing is carried out on the voice frame by adopting a Hamming window so as to obtain a better sidelobe descending amplitude.
S34: and performing fast Fourier transform on the voice frame, and squaring and dividing by the number of points of the fast Fourier transform to obtain the spectrogram characteristics.
S4: and inputting the context information text sequence and the spectrogram characteristics into a pre-constructed end-to-end context aware speech recognition model, and outputting a vocabulary probability matrix.
The specific parameters of the end-to-end context aware speech recognition model in this embodiment are set as follows:
a speech encoder: the speech encoder in this embodiment consists of 3 one-dimensional dynamic convolutional layers (DConv 1D) and 7 Bi-GRU network layers. The number of convolution kernels, the size of the convolution kernels, the step length and the channel number of the convolution layer are sequentially set to be [4, 4, 4], [5, 5, 5], [1, 1, 2], [512, 512, 512]; the number of neurons of the Bi-GRU network is set to 512.
Context encoder: the context encoder consists of a layer 1 word embedding layer and a layer 1 LSTM network containing 512 neurons.
Attention-based decoder: comprises a Context attention module and a full-connection layer. Wherein the output dimension of the fully connected layer is 770 neurons, corresponding to the number of basic modeling units in the dataset.
S5: and decoding the vocabulary probability matrix into text information, and outputting a predicted text sequence corresponding to each single sentence voice signal.
To verify the effectiveness of the proposed method, two groups of comparison experiments, A and B, were set up in this embodiment. Group A consists of conventional speech recognition methods, including Deep Speech 2, RNN-T and Transformer models. Group B consists of existing speech recognition methods from other domains that fuse contextual information, including CLAS and CRNN-T. The comparison of model performance is summarized below:
Table 2. Comparison of model performance
(table given as an image in the original publication; not reproduced here)
Character error rate (CER): measures the edit distance between the recognition result and the label text.
Call sign accuracy (CSA): measures whether the call sign in the recognition result is recognized correctly.
Instruction parameter accuracy (IPA): measures whether the instruction-related parameters in the recognition result are recognized correctly.
Instruction accuracy (IA): measures whether both the call sign and the instruction parameters in the recognition result are recognized correctly.
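For illustration, the edit-distance-based CER defined above can be computed as in the following sketch; the example strings are invented and are not taken from the experimental data.

```python
def character_error_rate(hypothesis: str, reference: str) -> float:
    """CER = edit (Levenshtein) distance between recognized text and label text, divided by label length."""
    h, r = list(hypothesis), list(reference)
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(h)][len(r)] / max(1, len(r))

print(character_error_rate("国航一二五洞保持", "国航一二五拐保持"))  # one substitution over 8 characters -> 0.125
```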
The experimental results show that the method proposed in this embodiment outperforms the other existing methods in ATC speech recognition.
Example 4
As shown in Fig. 5, an electronic device comprises at least one processor, a memory communicatively connected to the at least one processor, and at least one input/output interface communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, which, when executed, enable the at least one processor to perform the context-aware ATC speech recognition method described in the preceding embodiments. The input/output interface may include a display, a keyboard, a mouse and a USB interface for inputting and outputting data.
Those skilled in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware controlled by program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the storage medium includes various media capable of storing program code, such as a removable storage device, a read-only memory (ROM), a magnetic disk or an optical disk.
When implemented in the form of software functional units and sold or used as stand-alone products, the integrated units of the invention may also be stored in a computer-readable storage medium. Based on this understanding, the essence of the technical solutions of the embodiments of the invention, or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, which includes several instructions that cause a computer device (for example a personal computer, a server or a network device) to execute all or part of the methods described in the embodiments of the invention; the storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a magnetic disk or an optical disk.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, and alternatives falling within the spirit and principles of the invention.

Claims (10)

1. A context-aware air traffic control (ATC) speech recognition method, comprising the following steps:
S1: importing contextual information and establishing a contextual information text sequence;
S2: receiving ATC speech data and segmenting it into single-sentence speech signals;
S3: preprocessing each single-sentence speech signal to generate spectrogram features;
S4: feeding the contextual information text sequence and the spectrogram features into a pre-built end-to-end context-aware speech recognition model, which outputs a vocabulary probability matrix;
S5: decoding the vocabulary probability matrix into text and outputting the predicted text sequence corresponding to each single-sentence speech signal.
2. The context-aware ATC speech recognition method according to claim 1, wherein S1 comprises:
S11: establishing a contextual information set;
S12: importing flight plan information, decoding it, extracting the named entities in the flight plan context, and adding them to the contextual information set; named entities in the flight plan context include call signs and destination points;
S13: importing track information, decoding it, extracting the named entities in the track context, and adding them to the contextual information set; named entities in the track context include call signs, target altitudes and flight speeds;
S14: importing the basic data of the control sector, decoding it, extracting the named entities in the sector context of the current control sector, and adding them to the contextual information set; named entities in the sector context include geographic information, landmark points, waypoints, runway numbers and handover frequencies;
S15: defining a special symbol as a separator and inserting it between the named entities in the contextual information set;
S16: mapping each named entity to a named-entity text subsequence expressed in the speech recognition output text units;
S17: linking the named-entity text subsequences with separators and outputting the result as the contextual information text sequence;
wherein the contextual information set and the contextual information text sequence are dynamically updated as the context changes.
3. The context-aware ATC speech recognition method according to claim 1, wherein S2 comprises:
S21: receiving ATC speech data and dividing it into speech frame signals of a preset duration;
S22: feeding the speech frame signals into a pre-built speech frame classification model, which predicts the category of each speech frame signal; the categories are voice signal and non-voice signal;
S23: outputting runs of speech frame signals continuously predicted as voice signals as single-sentence speech signals.
4. The context-aware ATC speech recognition method according to claim 3, wherein the speech frame classification model comprises a one-dimensional convolution module, an LSTM module and an output layer;
the one-dimensional convolution module learns local features in the speech frame signals, extracting feature representations of voice and non-voice signals and learning the distinction between them; the one-dimensional convolution module comprises a one-dimensional convolution layer, a ReLU activation function and a BatchNorm layer;
the LSTM module is built on an LSTM network and learns the relationship between successive speech frames, capturing the feature transitions from voice to non-voice (and vice versa) and the high-dimensional category representation of each frame;
the output layer comprises a fully connected layer and a Sigmoid activation function and outputs the probability of each speech frame signal belonging to each category.
5. The context-aware ATC speech recognition method according to claim 3, wherein the training of the speech frame classification model comprises:
(1) collecting continuous real ATC communication speech data and, after data preprocessing and cleaning, manually labelling voice frames and non-voice frames to form a speech frame classification dataset;
(2) dividing the speech frame classification dataset into a training set, a validation set and a test set according to a preset ratio;
(3) training the speech frame classification model on the training set with the loss function $\mathcal{L}$, using an Adam optimizer for back-propagation to update the network parameters; specifically, the loss function $\mathcal{L}$ is the binary cross-entropy:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\,y_i \log p_i + (1-y_i)\log(1-p_i)\,\right]$$

where $p_i$ and $y_i$ are respectively the predicted probability and the true probability that the $i$-th speech frame signal is a voice signal, and $N$ is the number of speech frames.
6. The context-aware ATC speech recognition method according to claim 1, wherein the end-to-end context-aware speech recognition model comprises a speech encoder, a context encoder and an attention-based decoder;
the speech encoder learns a representation of the speech features and maps them into a high-dimensional feature space;
the context encoder learns a representation of the contextual information and maps it into a high-dimensional feature space;
the attention-based decoder computes the correspondence between the high-dimensional speech representation and the context representation, forms a joint context-speech representation, and outputs a vocabulary probability matrix;
the end-to-end context-aware speech recognition model uses Connectionist Temporal Classification (CTC) as the loss function and is trained with an Adam optimizer.
7. A context aware blank pipe speech recognition method according to claim 6, wherein the speech coder in the end-to-end context aware speech recognition model consists of a dynamic convolutional recurrent neural network, whose expression in operation is as follows:
Figure QLYQS_8
wherein ,
Figure QLYQS_9
representation from speech feature->
Figure QLYQS_10
The extracted high-dimensional speech feature representing sequence, < >>
Figure QLYQS_11
For a given learned sequence of speech signal features;
the dynamic convolutional recurrent neural network comprises a convolution module and a recurrent neural network module;
the convolution module consists of N dynamic convolution modules and is used for extracting high-dimensional local audio features; each dynamic convolution module is formed by cascading a one-dimensional dynamic convolution layer, batch normalization and a HardTanh activation function; the one-dimensional dynamic convolution layer comprises a plurality of convolution kernels of the same size, each convolution kernel learning a different high-dimensional class representation, and the output of the one-dimensional dynamic convolution layer is obtained by weighted fusion of their outputs;
the calculation formula of the one-dimensional dynamic convolution layer is as follows:
$$y = \sum_{k=1}^{K} \alpha_k \cdot \mathrm{Conv}_k(x), \qquad \alpha = \pi(x)$$

wherein $x$ is the given input feature, $y$ is the output feature of the one-dimensional dynamic convolution layer, $\mathrm{Conv}_k(\cdot)$ denotes the convolution operation of the $k$-th convolution kernel, $\alpha_k$ is the weight coefficient of the $k$-th convolution kernel, and $\pi(\cdot)$ is the learnable weight distribution function;
the recurrent neural network module is composed of a plurality of bidirectional gated recurrent network modules and is used for capturing the temporal relations among speech frames so as to output the high-dimensional speech feature representation.
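For illustration, a minimal PyTorch sketch of a one-dimensional dynamic convolution layer in the spirit of this claim follows; the use of global average pooling plus a Softmax-activated linear layer as the weight distribution function, as well as all dimensions and the number of kernels, are assumptions.

```python
# Illustrative dynamic 1-D convolution: several same-sized kernels whose
# outputs are fused with input-dependent, learnable weights.
import torch
import torch.nn as nn

class DynamicConv1d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, num_kernels: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
             for _ in range(num_kernels)]
        )
        # Learnable weight distribution function: pooled input -> one weight per kernel.
        self.attn = nn.Linear(in_ch, num_kernels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_ch, time)
        weights = torch.softmax(self.attn(x.mean(dim=-1)), dim=-1)     # (batch, K)
        outs = torch.stack([conv(x) for conv in self.convs], dim=1)    # (batch, K, out_ch, time)
        return (weights[:, :, None, None] * outs).sum(dim=1)           # weighted fusion

# One dynamic convolution module: dynamic conv -> batch norm -> HardTanh
block = nn.Sequential(DynamicConv1d(80, 256), nn.BatchNorm1d(256), nn.Hardtanh())
```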
8. A context aware blank pipe speech recognition method according to claim 6, wherein the context encoder comprises a word embedding layer and a long short-term memory (LSTM) network layer, whose operation is expressed as follows:
$$H^{c} = \mathrm{ContextEncoder}\left(X^{c}\right)$$

wherein $X^{c}$ is the context information text sequence output by S1, and $H^{c}$ denotes the high-dimensional context feature representation sequence extracted from the context information text sequence $X^{c}$;
the word embedding layer is used for mapping the words in the context text sequence to a feature space of fixed dimension and learning the semantic representations among words; the long short-term memory network layer is used for capturing the relations among the words that form named entities and the boundary information of each named entity in the context information.
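For illustration, a minimal PyTorch sketch of a word-embedding-plus-LSTM context encoder follows; the vocabulary size, embedding dimension and hidden size are assumptions.

```python
# Illustrative context encoder: embedding layer followed by an LSTM that
# maps the context text sequence to a high-dimensional feature sequence.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    def __init__(self, vocab_size: int = 5000, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)   # fixed-dimension word vectors
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) indices of the context information text
        emb = self.embedding(token_ids)       # (batch, seq_len, emb_dim)
        feats, _ = self.lstm(emb)             # (batch, seq_len, hidden) context features H^c
        return feats
```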
9. A context aware blank pipe speech recognition method according to claim 6, wherein the attention mechanism based decoder comprises a context attention module and a fully connected layer; the context attention module is used for calculating the alignment relation between the high-dimensional voice features and the context features through an attention mechanism and fusing the high-dimensional voice features and the context features into a high-dimensional voice representation for perceiving the context features, and the main calculation flow is as follows:
$$K = \mathrm{Linear}\left(H^{c}\right)$$
$$\mathit{Scores} = \mathrm{Softmax}\left(H^{s} K^{\top}\right)$$
$$H^{cs} = \sigma\left(W \left[\, H^{s} \oplus \mathit{Scores} \cdot H^{c} \,\right] + b\right)$$

wherein $K$ is a linear transformation of the context information, $\mathit{Scores}$ is the relevance score between the speech features and the context attention features, $H^{cs}$ is the high-dimensional speech representation perceiving the context features, $\oplus$ is the vector concatenation operation, $\sigma(\cdot)$ is a nonlinear activation function, $W$ is a preset learnable parameter matrix, and $b$ is a bias vector;
the fully connected layer takes the high-dimensional speech representation perceiving the context features as input, is activated by a Softmax function, and outputs the corresponding vocabulary probability matrix; the calculation expression is as follows:
$$P = \mathrm{Softmax}\left(\mathrm{FC}\left(H^{cs}\right)\right)$$

wherein $\mathrm{FC}(\cdot)$ represents the fully connected layer and $P$ represents the generated probability matrix;

the probability matrix $P$ is combined with the speech recognition modeling vocabulary and decoded with a greedy algorithm or a beam search algorithm to obtain a readable text sequence $Y$, which is the final recognition result of the speech.
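For illustration, a minimal PyTorch sketch of a context attention module and output layer follows; the dot-product score function, the Tanh activation and all dimensions are assumptions consistent with, but not dictated by, the claim. A greedy decode (argmax over the vocabulary axis per frame, mapped through the modeling vocabulary) would then yield the readable text sequence.

```python
# Illustrative attention-based decoder: linearly transform the context
# features, score them against the speech features, fuse the attended
# context with the speech representation, and emit vocabulary probabilities.
import torch
import torch.nn as nn

class ContextAttentionDecoder(nn.Module):
    def __init__(self, speech_dim: int = 512, ctx_dim: int = 256, vocab_size: int = 5000):
        super().__init__()
        self.key_proj = nn.Linear(ctx_dim, speech_dim)            # K: linear transform of context
        self.fuse = nn.Linear(speech_dim + ctx_dim, speech_dim)   # learnable W and bias b
        self.act = nn.Tanh()                                      # nonlinear activation
        self.fc = nn.Linear(speech_dim, vocab_size)               # fully connected output layer

    def forward(self, speech: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # speech:  (batch, T, speech_dim) high-dimensional speech features H^s
        # context: (batch, L, ctx_dim)    high-dimensional context features H^c
        keys = self.key_proj(context)                                  # (batch, L, speech_dim)
        scores = torch.softmax(speech @ keys.transpose(1, 2), dim=-1)  # (batch, T, L) relevance
        attended = scores @ context                                    # (batch, T, ctx_dim)
        fused = self.act(self.fuse(torch.cat([speech, attended], dim=-1)))
        return torch.softmax(self.fc(fused), dim=-1)                   # vocabulary probability matrix
```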
10. An electronic device comprising at least one processor, and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 9.
CN202310548256.5A 2023-05-16 2023-05-16 Context-aware blank pipe voice recognition method and electronic equipment Active CN116259308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310548256.5A CN116259308B (en) 2023-05-16 2023-05-16 Context-aware blank pipe voice recognition method and electronic equipment


Publications (2)

Publication Number Publication Date
CN116259308A true CN116259308A (en) 2023-06-13
CN116259308B (en) 2023-07-21

Family

ID=86686560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310548256.5A Active CN116259308B (en) 2023-05-16 2023-05-16 Context-aware blank pipe voice recognition method and electronic equipment

Country Status (1)

Country Link
CN (1) CN116259308B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112581964A (en) * 2020-12-04 2021-03-30 浙江大有实业有限公司杭州科技发展分公司 Multi-domain oriented intelligent voice interaction method
CN114023316A (en) * 2021-11-04 2022-02-08 匀熵科技(无锡)有限公司 TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
WO2022121150A1 (en) * 2020-12-10 2022-06-16 平安科技(深圳)有限公司 Speech recognition method and apparatus based on self-attention mechanism and memory network
CN115240651A (en) * 2022-07-18 2022-10-25 四川大学 Land-air communication speaker role identification method and device based on feature fusion
CN115563290A (en) * 2022-12-06 2023-01-03 广东数业智能科技有限公司 Intelligent emotion recognition method based on context modeling
CN116110405A (en) * 2023-04-11 2023-05-12 四川大学 Land-air conversation speaker identification method and equipment based on semi-supervised learning


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
""基于深度学习的空管语音识别"", 《西华大学学报(自然科学版)》 *
DONGYUE GUO: ""A Context-Aware Language Model to Improve the Speech Recognition in Air Traffic Control"", 《AEROSPACE》, pages 1 - 11 *
LIN Y: ""ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems"", 《 APPLIED SOFT COMPUTING》 *
ZHOU K: "\"Improved CTC-attention based end-to-end speech recognition on air traffic control\"", 《ISCIDE 2019》 *
郭东岳: ""基于CGRU多输入特征的地空通话自动切分"", 《四川大学学报(自然科学版)》 *
郭东岳: ""基于CNN的空管地空通话自动切分"", 《中国指挥与控制学会第一届空中交通管理***技术学术年会论文集》, pages 1 - 4 *

Also Published As

Publication number Publication date
CN116259308B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
US11688391B2 (en) Mandarin and dialect mixed modeling and speech recognition
CN108899013B (en) Voice search method and device and voice recognition system
US20200035228A1 (en) Method and apparatus for speech recognition
CN112420024B (en) Full-end-to-end Chinese and English mixed empty pipe voice recognition method and device
WO2021147041A1 (en) Semantic analysis method and apparatus, device, and storage medium
CN113066499B (en) Method and device for identifying identity of land-air conversation speaker
KR102281504B1 (en) Voice sythesizer using artificial intelligence and operating method thereof
CN116110405B (en) Land-air conversation speaker identification method and equipment based on semi-supervised learning
CN112151015A (en) Keyword detection method and device, electronic equipment and storage medium
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN114596844A (en) Acoustic model training method, voice recognition method and related equipment
CN110473571A (en) Emotion identification method and device based on short video speech
CN113674732A (en) Voice confidence detection method and device, electronic equipment and storage medium
CN114842835A (en) Voice interaction system based on deep learning model
Shi et al. An end-to-end conformer-based speech recognition model for mandarin radiotelephony communications in civil aviation
CN116259308B (en) Context-aware blank pipe voice recognition method and electronic equipment
CN116189671A (en) Data mining method and system for language teaching
CN113821675B (en) Video identification method, device, electronic equipment and computer readable storage medium
KR102642617B1 (en) Voice synthesizer using artificial intelligence, operating method of voice synthesizer and computer readable recording medium
CN112037772B (en) Response obligation detection method, system and device based on multiple modes
Sawakare et al. Speech recognition techniques: a review
Miao et al. [Retracted] English Speech Feature Recognition‐Based Fuzzy Algorithm and Artificial Intelligent
Liu et al. Keyword retrieving in continuous speech using connectionist temporal classification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant