CN113421551B - Speech recognition method, speech recognition device, computer readable medium and electronic equipment - Google Patents

Speech recognition method, speech recognition device, computer readable medium and electronic equipment

Info

Publication number: CN113421551B
Authority: CN (China)
Prior art keywords: voice, text, recognized, voice data, probability distribution
Legal status: Active
Application number: CN202011280315.8A
Other languages: Chinese (zh)
Other versions: CN113421551A (en)
Inventors: 林炳怀, 王丽园
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Events: application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202011280315.8A; publication of CN113421551A; application granted; publication of CN113421551B.

Classifications

    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit (under G Physics > G10L Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding > G10L 15/00 Speech recognition)
    • G10L 15/063: Training (under G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models (under G10L 15/08 Speech classification or search > G10L 15/18 using natural language modelling)

Abstract

The application belongs to the technical field of artificial intelligence, and particularly relates to a speech recognition method, a speech recognition device, a computer readable medium and electronic equipment. The method comprises the following steps: acquiring voice data to be recognized and a voice reference text corresponding to the voice data to be recognized; extracting features of the voice data to be recognized to obtain voice decoding features of the voice data to be recognized, and predicting a first text probability distribution of the voice data to be recognized according to the voice decoding features; extracting features of the voice reference text to obtain text coding features of the voice reference text, and predicting a second text probability distribution of the voice data to be recognized according to the similarity between the text coding features and the voice decoding features; fusing the first text probability distribution and the second text probability distribution to obtain a comprehensive text probability distribution of the voice data to be recognized; and selecting, according to the comprehensive text probability distribution, a target text from the candidate texts as the voice recognition result of the voice data to be recognized. The method can improve the accuracy of voice recognition.

Description

Speech recognition method, speech recognition device, computer readable medium and electronic equipment
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a voice recognition method, a voice recognition device, a computer readable medium and electronic equipment.
Background
With the development of computer and network technologies, speech recognition technology has been widely popularized and applied. Based on the voice recognition technology, the computer can convert voice data into corresponding text data or other types of data through a recognition and understanding process to output, for example, applications such as voice-based text input, machine translation and the like can be realized.
The traditional voice recognition technology needs to train an acoustic model and a language model which are matched with each other, the acoustic model can split voice data into phonemes and determine a corresponding word list, and the language model can finally map the voice data to corresponding words so as to achieve the effect of recognizing voice content. However, due to the limited number of training samples, conventional speech recognition techniques generally only work in designated scenes and fields, but often cannot accurately recognize in other untrained scenes, and thus have difficulty in achieving wide applicability.
Disclosure of Invention
The present application aims to provide a voice recognition method, a voice recognition device, a computer readable medium and an electronic device, which at least to some extent overcome technical problems in the related art such as low voice recognition accuracy and poor general applicability.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned in part by the practice of the application.
According to an aspect of the embodiments of the present application, there is provided a voice recognition method, including: acquiring voice data to be recognized and a voice reference text corresponding to the voice data to be recognized; extracting the characteristics of the voice data to be recognized to obtain voice decoding characteristics of the voice data to be recognized, and predicting first text probability distribution of the voice data to be recognized according to the voice decoding characteristics; extracting the characteristics of the voice reference text to obtain text coding characteristics of the voice reference text, and predicting second text probability distribution of the voice data to be recognized according to the similarity between the text coding characteristics and the voice decoding characteristics; fusion processing is carried out on the first text probability distribution and the second text probability distribution to obtain comprehensive text probability distribution of the voice data to be recognized; and selecting target texts serving as voice recognition results of the voice data to be recognized from the candidate texts according to the comprehensive text probability distribution.
According to an aspect of the embodiments of the present application, there is provided a voice recognition apparatus, including: the data acquisition module is configured to acquire voice data to be recognized and voice reference text corresponding to the voice data to be recognized; the voice decoding module is configured to perform feature extraction on the voice data to be recognized to obtain voice decoding features of the voice data to be recognized, and predict first text probability distribution of the voice data to be recognized according to the voice decoding features; the text coding module is configured to extract the characteristics of the voice reference text to obtain text coding characteristics of the voice reference text, and predict second text probability distribution of the voice data to be recognized according to the similarity of the text coding characteristics and the voice decoding characteristics; the probability fusion module is configured to fuse the first text probability distribution and the second text probability distribution to obtain comprehensive text probability distribution of the voice data to be recognized; and the text selection module is configured to select target text serving as a voice recognition result of the voice data to be recognized from candidate texts according to the comprehensive text probability distribution.
In some embodiments of the present application, based on the above technical solution, the probability fusion module includes: the feature fusion unit is configured to perform weighted fusion on the text coding features according to the second text probability distribution to obtain fusion coding features of the voice reference text; the coefficient determining unit is configured to map the fusion coding feature and the voice decoding feature to obtain a weight coefficient for carrying out weighted fusion on the first text probability distribution and the second text probability distribution; and the probability fusion unit is configured to perform weighted fusion on the first text probability distribution and the second text probability distribution according to the weight coefficient to obtain the comprehensive text probability distribution of the voice data to be recognized.
In some embodiments of the present application, based on the above technical solution, the coefficient determining unit includes: a parameter acquisition subunit configured to acquire weight distribution parameters respectively corresponding to the fusion coding feature and the speech coding feature; a feature mapping subunit configured to perform weighted mapping on the fusion coding feature and the speech coding feature according to the weight allocation parameter to obtain a weight allocation feature; and the feature normalization subunit is configured to normalize the weight distribution features to obtain weight coefficients for carrying out weighted fusion on the first text probability distribution and the second text probability distribution.
In some embodiments of the present application, based on the above technical solutions, the parameter obtaining subunit includes: a first error determination subunit configured to perform a speech recognition process on the speech data training sample to determine a loss error representing recognition accuracy of the speech data training sample according to a result of the speech recognition process; an error gradient determination subunit configured to determine an error gradient of the fusion coding feature and the speech coding feature, respectively, from the loss error; and a parameter determination subunit configured to determine weight distribution parameters respectively corresponding to the fusion coding feature and the speech coding feature according to the error gradient.
In some embodiments of the present application, based on the above technical solutions, the parameter obtaining subunit includes: a second error determination subunit configured to acquire recognized voice data related to the voice data to be recognized, and acquire a loss error representing recognition accuracy of the recognized voice data; an error gradient determination subunit configured to determine an error gradient of the fusion coding feature and the speech coding feature, respectively, from the loss error; and a parameter determination subunit configured to determine weight distribution parameters respectively corresponding to the fusion coding feature and the speech coding feature according to the error gradient.
In some embodiments of the present application, based on the above technical solutions, the data acquisition module includes: a scene determination unit configured to acquire voice data to be recognized and determine a voice recognition scene in which the voice data to be recognized is subjected to voice recognition; and the text selection unit is configured to select a voice reference text corresponding to the voice data to be recognized from a candidate text database according to the voice recognition scene.
In some embodiments of the present application, based on the above technical solutions, the scene determining unit includes: a model acquisition subunit configured to acquire a scene classification model for scene classification of the speech data to be recognized; the feature processing subunit is configured to perform feature extraction and feature mapping processing on the voice data to be recognized through the scene classification model so as to obtain scene probability distribution of the voice data to be recognized; and the first scene selection subunit is configured to select a voice recognition scene for performing voice recognition on the voice data to be recognized from a plurality of candidate scenes according to the scene probability distribution.
In some embodiments of the present application, based on the above technical solutions, the scene determining unit includes: a recognized data acquisition subunit configured to acquire recognized voice data related to the voice data to be recognized; and a second scene selection subunit configured to select, from a plurality of candidate scenes, a voice recognition scene for performing voice recognition on the voice data to be recognized according to the voice recognition result of the recognized voice data.
In some embodiments of the present application, based on the above technical solutions, the data acquisition module includes: a data acquisition unit configured to acquire voice data to be recognized and recognized voice data related to the voice data to be recognized; and a text selection unit configured to select a speech reference text corresponding to the speech data to be recognized from a candidate text database according to a speech recognition result of the recognized speech data.
In some embodiments of the present application, based on the above technical solutions, the voice recognition device further includes: the text word segmentation module is configured to perform word segmentation processing on the voice reference text to obtain text fields forming the voice reference text; and the field ordering module is configured to randomly order the text fields in the voice reference text to obtain the voice reference text consisting of the text fields with disordered orders.
In some embodiments of the present application, based on the above technical solutions, the speech decoding module includes: a first model acquisition unit configured to acquire a pre-trained speech feature extraction model including an embedded layer, an encoder, and a decoder connected in sequence; the first data embedding unit is configured to perform vectorization processing on the voice data to be recognized through the embedding layer to obtain an embedding vector corresponding to the voice data to be recognized; the first feature coding unit is configured to extract features of the embedded vectors through the encoder to obtain voice coding features of the voice data to be recognized; and the feature decoding unit is configured to extract the features of the voice coding features through the decoder to obtain voice decoding features of the voice data to be recognized.
In some embodiments of the present application, based on the above technical solutions, the text encoding module includes: a second model acquisition unit configured to acquire a text feature extraction model trained in advance, the text feature extraction model including an embedded layer and an encoder; the second data embedding unit is configured to perform vectorization processing on the voice reference text through the embedding layer to obtain an embedding vector corresponding to the voice reference text; and the second feature coding unit is configured to extract the features of the embedded vector through the coder to obtain the text coding features of the voice reference text.
According to an aspect of the embodiments of the present application, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processor, implements a speech recognition method as in the above technical solution.
According to an aspect of the embodiments of the present application, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the speech recognition method as in the above technical solution via execution of the executable instructions.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the voice recognition method as in the above technical solution.
According to the technical scheme provided by the embodiments of the application, by separately extracting features from the voice data to be recognized and from the reference text corresponding to the voice data to be recognized, the text probability distribution of the voice data to be recognized can be predicted based on the two types of data features and on the result of comparing them, thereby improving the accuracy of voice recognition. In addition, for different application scenes, different reference texts can be dynamically fused with the voice data to be recognized, thereby improving the general applicability of the voice recognition technology in various application scenes.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
fig. 1 shows an exemplary system architecture block diagram of a speech recognition system to which the technical solution of the present application is applied.
FIG. 2 is an interactive flow chart of speech recognition processing in an application scenario according to an embodiment of the present application.
FIG. 3 illustrates a flow chart of steps of a speech recognition method in some embodiments of the present application.
FIG. 4 illustrates a flow chart of a method for obtaining reference text based on a speech recognition scenario in some embodiments of the present application.
Fig. 5 illustrates a flow chart of a method for obtaining reference text based on recognized voice data in some embodiments of the present application.
Fig. 6 is a flow chart of a method for feature extraction of speech data to be recognized in some embodiments of the present application.
FIG. 7 illustrates a functional block diagram of a speech feature extraction model used in an embodiment of the present application.
FIG. 8 illustrates a block diagram of the model structure of a Transformer model used in some embodiments of the present application.
FIG. 9 is a flowchart illustrating steps in a method for fusing text probability distributions in some embodiments of the present application.
Fig. 10 is a schematic diagram of a speech recognition model structure and a speech recognition process used in an application scenario according to an embodiment of the present application.
Fig. 11 is a schematic diagram of an operation on a specific vector in an application scenario according to an embodiment of the present application.
Fig. 12 schematically shows a block diagram of a voice recognition apparatus according to an embodiment of the present application.
Fig. 13 schematically illustrates a block diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Before explaining the technical scheme of the application, first, an artificial intelligence technology and a cloud technology related to the technical scheme of the application are briefly explained.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques and the like.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specifically studies how computers can simulate or implement human learning behaviour to acquire new knowledge or skills, and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
With the research and advancement of artificial intelligence technology, it is being researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical treatment and smart customer service, and it is believed that, with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.
Cloud technology (Cloud technology) refers to a hosting technology for integrating hardware, software, network and other series resources in a wide area network or a local area network to realize calculation, storage, processing and sharing of data.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology and the like based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. The background services of technical network systems, such as video websites, picture websites and other portals, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, each item may in the future carry its own identification mark, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong backing from the system, which can only be realized through cloud computing.
Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, enabling various application systems to obtain computing power, storage space and information services as needed. The network that provides the resources is referred to as the "cloud". From the user's point of view, resources in the cloud can be expanded without limit, and can be acquired at any time, used on demand, expanded at any time and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (cloud platform for short, generally referred to as IaaS (Infrastructure as a Service, infrastructure as a service) platform) is established, in which multiple types of virtual resources are deployed for external clients to select for use.
Artificial intelligence cloud services are also commonly referred to as AIaaS (AI as a Service). This is currently a mainstream service mode for artificial intelligence platforms; specifically, an AIaaS platform splits several common types of AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI-themed mall: all developers can access one or more of the artificial intelligence services provided by the platform through API interfaces, and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own proprietary cloud artificial intelligence services.
For example, the voice recognition method and related technical schemes provided by the embodiments of the application can train a machine learning model for natural language processing based on artificial intelligence cloud services, and the trained machine learning model can then be used in voice recognition applications.
Fig. 1 shows an exemplary system architecture block diagram of a speech recognition system to which the technical solution of the present application is applied.
As shown in fig. 1, speech recognition system 100 may include a terminal device 110, a network 120, and a server 130. Terminal device 110 may include various electronic devices such as smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart televisions, smart wearable devices, virtual reality devices, smart vehicle devices, and the like. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like. Network 120 may be a communication medium of various connection types capable of providing a data communication link between terminal device 110 and server 130, and may be, for example, a wired communication link or a wireless communication link.
The speech recognition system in the embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by the terminal device 110 and the server 130 together, which is not limited in particular in this application.
The voice recognition system based on the above technical scheme can be applied to various scenes such as reading aloud and recitation, spoken-language examinations and simultaneous interpretation, and can improve the accuracy of voice recognition in specific scenes. Fig. 2 shows an interactive flowchart of a voice recognition process in an application scenario. The application scenario may be, for example, a spoken test scenario with semi-open questions; the semi-open questions may include, for example, brief-description items and question-and-answer items, in which the examinee needs to answer the corresponding points according to the question and a prompt article, and the answer points are generally also contained in the prompt article. As shown in fig. 2, the method for performing the voice recognition processing in this application scenario includes the following steps.
Step S210: the user opens an application program (APP) for taking the spoken-language examination on a terminal device such as a smart phone, and the corresponding questions are displayed on the interactive interface of the terminal device.
Step S220: after the user clicks to start answering, voice is input to the APP through a microphone or other voice input modes of the terminal equipment to form answering audio.
Step S230: the APP sends the collected answer audio and the reference answer of the question to the server.
Step S240: and the server performs voice recognition processing on the answer audio by combining the reference answers through a voice recognition model obtained through pre-training, so as to obtain an answer text.
Step S250: the server calculates the similarity between the recognized answer text and the reference answer to obtain a scoring result for the user's answer.
Step S260: and the server sends the scoring result to the APP, and the corresponding answer score is displayed on the interactive interface of the terminal equipment.
In the application scene, a mode of combining answer audios with corresponding reference answers to perform voice recognition processing is adopted, so that dynamic fusion of voice data and a pre-provided reference text (such as the reference answers or other texts in the application scene) can be realized, and the accuracy of voice recognition in a specific scene is improved.
The following describes in detail the technical schemes such as the voice recognition method, the voice recognition device, the computer readable medium, and the electronic device provided in the present application with reference to the specific embodiments.
Fig. 3 is a flowchart illustrating steps of a voice recognition method in some embodiments of the present application, where the voice recognition method may be performed by the terminal device shown in fig. 1, by a server, or by both the terminal device and the server. As shown in fig. 3, the voice recognition method may mainly include the following steps S310 to S350.
Step S310: and acquiring voice data to be recognized and a voice reference text corresponding to the voice data to be recognized.
Step S320: and extracting the characteristics of the voice data to be recognized to obtain voice decoding characteristics of the voice data to be recognized, and predicting first text probability distribution of the voice data to be recognized according to the voice decoding characteristics.
Step S330: and extracting the characteristics of the voice reference text to obtain text coding characteristics of the voice reference text, and predicting second text probability distribution of the voice data to be recognized according to the similarity between the text coding characteristics and the voice decoding characteristics.
Step S340: and carrying out fusion processing on the first text probability distribution and the second text probability distribution to obtain comprehensive text probability distribution of the voice data to be recognized.
Step S350: and selecting target text serving as a voice recognition result of the voice data to be recognized from the candidate texts according to the comprehensive text probability distribution.
In the voice recognition method provided by the embodiments of the application, by separately extracting features from the voice data to be recognized and from the reference text corresponding to the voice data to be recognized, the text probability distribution of the voice data to be recognized can be predicted based on the two types of data features and on the result of comparing them, thereby improving the accuracy of voice recognition. In addition, for different application scenes, different reference texts can be dynamically fused with the voice data to be recognized, thereby improving the general applicability of the voice recognition technology in various application scenes.
The following describes the steps of the speech recognition method in the embodiment of the present application in detail in conjunction with the specific embodiments.
In step S310, voice data to be recognized and a voice reference text corresponding to the voice data to be recognized are acquired.
The voice data to be recognized can be data directly collected by an audio collection device such as a microphone, or can be data received by network transmission. For example, when the embodiment of the present application is executed by the terminal device, the user may input voice data to the terminal device through the microphone, and the user may also receive voice messages sent by other users through a social application or an instant messaging tool installed on the terminal device. When the embodiment of the present application is executed by the server or executed jointly by the server and the terminal device, the voice data to be recognized may be voice data transmitted by the terminal device to the server, which is received through network communication between the server and the terminal device.
The voice reference text is text data having a corresponding relation with the voice data to be recognized. For example, if the voice data to be recognized is voice data formed by a user reading a section of designated text aloud, the voice reference text can be the designated text itself, or text data formed by preprocessing the designated text to a certain extent. Taking the application scenario shown in fig. 2 as an example, the voice data to be recognized is the answer audio input when the user takes a spoken test, and the corresponding voice reference text may be the reference answer of the test question or a question text including the reference answer.
FIG. 4 illustrates a flow chart of a method for obtaining reference text based on a speech recognition scenario in some embodiments of the present application. As shown in fig. 4, on the basis of the above embodiment, the acquisition of the voice data to be recognized and the voice reference text corresponding to the voice data to be recognized in step S310 may include the following steps S410 to S420.
Step S410: and acquiring the voice data to be recognized, and determining a voice recognition scene for performing voice recognition on the voice data to be recognized.
Step S420: and selecting a voice reference text corresponding to the voice data to be recognized from the candidate text database according to the voice recognition scene.
In the embodiment of the application, firstly, voice data to be recognized is obtained, and voice recognition scenes for carrying out voice recognition on the data are determined, wherein each different voice recognition scene can correspond to part or all of text data in the candidate text database. For example, when the speech recognition scene is a simultaneous interpretation scene for a lecture or a conference, a candidate text database composed of text contents such as papers, news stories, lectures and the like of related topics may be pre-established according to different lecture topics or conference topics, and a certain content correlation exists between a speech reference text selected from the candidate text database and speech data to be recognized. For another example, when the speech recognition scene is a read-following scene, the corresponding speech reference text may be the text content actually read-followed; when the speech recognition scene is an examination scene or an answer scene, the corresponding speech reference text can be a reference answer of the question or a plurality of alternative answers or prompt articles of the question itself, and the like.
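As a purely illustrative aid (not part of the patent text), the scene-based selection in step S420 can be pictured as a lookup into a scene-keyed candidate text database; the scene names and texts below are hypothetical placeholders.
```python
# Minimal sketch of step S420, assuming the candidate text database is keyed by
# speech recognition scene. Scene names and texts are hypothetical placeholders.
candidate_text_database = {
    "conference_interpretation": ["paper on the conference topic ...", "news story on the topic ..."],
    "read_aloud": ["text content that is actually read aloud ..."],
    "spoken_exam": ["reference answer / prompt article for the current question ..."],
}

def select_reference_texts(speech_recognition_scene):
    # Each scene corresponds to part (or all) of the text data in the database.
    return candidate_text_database.get(speech_recognition_scene, [])

reference_texts = select_reference_texts("spoken_exam")
```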
In some alternative embodiments, the method of determining a speech recognition scenario in step S410 may include steps S411 to S413 as follows.
Step S411: a scene classification model for scene classification of the speech data to be recognized is obtained.
Step S412: and carrying out feature extraction and feature mapping processing on the voice data to be recognized through the scene classification model so as to obtain scene probability distribution of the voice data to be recognized.
Step S413: and selecting a voice recognition scene for carrying out voice recognition on the voice data to be recognized from the plurality of candidate scenes according to the scene probability distribution.
The scene classification model may be a machine learning model trained in advance for performing scene classification, for example, a Decision Tree model (Decision Tree) or a Random Forest model (Random Forest) may be used, or a neural network model such as a convolutional neural network (Convolutional Neural Networks, CNN), a recurrent neural network (Recurrent Neural Network, RNN), or a Long Short-Term Memory (LSTM) may be used. Based on a pre-trained scene classification model, feature extraction can be performed on voice data to be recognized, extracted data features are mapped into scene probability distribution of candidate scenes, and finally a voice recognition scene with the highest classification probability (the highest probability of representing correctly classified scenes) can be selected according to the scene probability distribution.
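The following minimal sketch illustrates one way such a scene classification model could be arranged, assuming a recurrent encoder over acoustic feature frames and a fixed set of candidate scenes; the layer sizes, feature dimensions and scene labels are illustrative assumptions, not the patent's configuration.
```python
import torch
import torch.nn as nn

CANDIDATE_SCENES = ["lecture", "spoken_exam", "read_aloud"]  # hypothetical scene labels

class SceneClassifier(nn.Module):
    def __init__(self, feat_dim=80, hidden=128, num_scenes=len(CANDIDATE_SCENES)):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # feature extraction
        self.head = nn.Linear(hidden, num_scenes)                  # feature mapping

    def forward(self, frames):                  # frames: (batch, time, feat_dim)
        _, h = self.encoder(frames)             # summarize the utterance
        logits = self.head(h[-1])               # one score per candidate scene
        return torch.softmax(logits, dim=-1)    # scene probability distribution

probs = SceneClassifier()(torch.randn(1, 200, 80))
scene = CANDIDATE_SCENES[int(probs.argmax(dim=-1))]   # scene with highest probability
```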
In other optional embodiments, the method for determining a speech recognition scenario in step S410 may also include: acquiring recognized voice data related to the voice data to be recognized; and selecting, from a plurality of candidate scenes, a voice recognition scene for performing voice recognition on the voice data to be recognized according to the voice recognition result of the recognized voice data.
The voice data to be recognized is generally a voice sequence composed of voice fragments with a certain length, and when voice recognition is performed, each voice fragment can be recognized sequentially. Where the identified and unrecognized portions have some correlation over the voice content and the more closely ordered voice segments will generally be stronger in content correlation. The embodiment of the application can predict the voice recognition scene of the voice data to be recognized by utilizing the voice recognition result of the recognized voice data. For example, in the process of performing voice recognition, the speech of the conference host is included in the recognized speech segment, related information such as the introduction of the participant and the introduction of the conference subject can be obtained based on the recognized speech content, and the voice recognition scene of the voice data to be recognized can be predicted as the subject speech or the subject conference according to the recognized related information. For another example, the recognized voice segment includes relevant information such as a standard introduction of a question, a content description of a question, etc., so that the voice recognition scene of the voice data to be recognized can be predicted to be an examination scene or a question answering scene.
Fig. 5 illustrates a flow chart of a method for obtaining reference text based on recognized voice data in some embodiments of the present application. As shown in fig. 5, on the basis of the above embodiment, the acquisition of the voice data to be recognized and the voice reference text corresponding to the voice data to be recognized in step S310 may include steps S510 to S520 as follows.
Step S510: and acquiring the voice data to be recognized and the recognized voice data related to the voice data to be recognized.
Step S520: and selecting a voice reference text corresponding to the voice data to be recognized from the candidate text database according to the voice recognition result of the recognized voice data.
The recognized voice data may be data having a time correlation with the voice data to be recognized in the voice data sequence, and the voice reference text corresponding to the following voice data may be selected from the candidate text database according to the recognition result of the preceding voice data.
In order to reduce the interference of the semantic content contained in the reference text with the voice recognition result, the embodiments of the application can preprocess the voice reference text before extracting features from it to obtain the text coding features. The preprocessing may include the following steps: performing word segmentation on the voice reference text to obtain the text fields that make up the voice reference text; and randomly reordering the text fields in the voice reference text to obtain a voice reference text consisting of the text fields in scrambled order. When performing word segmentation on the voice reference text, specific characters in the text can be used, such as space characters and punctuation characters in English text. For Chinese text, word segmentation can be performed on the voice reference text by means of semantic recognition, so as to obtain text fields with independent semantics.
In addition to random ordering, other schemes that can eliminate or weaken text semantics of the speech reference text may be used in the embodiments of the present application, for example, random deletion and substitution of text fields in the speech reference text may be performed.
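A minimal sketch of this preprocessing, assuming an English-style reference text that can be segmented on word characters; the deletion probability and the use of a regular expression for segmentation are illustrative assumptions.
```python
import random
import re

def weaken_reference_semantics(reference_text, drop_prob=0.1, seed=None):
    rng = random.Random(seed)
    # Word segmentation using spaces/punctuation as delimiters (English-style text).
    fields = re.findall(r"[\w']+", reference_text)
    # Optional random deletion of some text fields.
    fields = [f for f in fields if rng.random() >= drop_prob]
    # Random ordering, yielding a reference text with scrambled field order.
    rng.shuffle(fields)
    return " ".join(fields)

print(weaken_reference_semantics("The quick brown fox jumps over the lazy dog."))
```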
In step S320, feature extraction is performed on the voice data to be recognized to obtain voice decoding features of the voice data to be recognized, and a first text probability distribution of the voice data to be recognized is predicted according to the voice decoding features.
Fig. 6 is a flow chart of a method for feature extraction of speech data to be recognized in some embodiments of the present application. As shown in fig. 6, on the basis of the above embodiment, the feature extraction of the voice data to be recognized in step S320 to obtain the voice decoding feature of the voice data to be recognized may include the following steps S610 to S640.
Step S610: a pre-trained speech feature extraction model is obtained, the speech feature extraction model comprising an embedded layer, an encoder and a decoder connected in sequence.
Step S620: and carrying out vectorization processing on the voice data to be recognized through the embedding layer to obtain an embedding vector corresponding to the voice data to be recognized.
Step S630: and extracting the characteristics of the embedded vectors through an encoder to obtain the voice coding characteristics of the voice data to be recognized.
Step S640: and extracting the characteristics of the voice coding characteristics through a decoder to obtain voice decoding characteristics of the voice data to be recognized.
In the embodiment of the application, the voice data to be recognized is encoded and then decoded by adopting an end-to-end voice feature extraction model to obtain voice decoding features of the voice data.
The Embedding layer (Embedding) is used for uniformly converting the voice data to be recognized into vectors with fixed sizes, namely, corresponding embedded vectors are obtained through vectorization processing. The embedding layer may perform One-Hot Encoding (One-Hot Encoding) on the voice data to be recognized to obtain a One-Hot Encoding vector, and then perform matrix operation on the One-Hot Encoding vector and a pre-established embedding matrix to obtain an embedding vector corresponding to the voice data to be recognized.
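The one-hot-plus-matrix operation described above can be sketched as follows; the vocabulary size and embedding dimension are illustrative assumptions, and in practice the matrix product reduces to a simple row lookup.
```python
import numpy as np

vocab_size, embed_dim = 1000, 256
embedding_matrix = np.random.randn(vocab_size, embed_dim)  # pre-established embedding matrix

def embed(token_ids):
    one_hot = np.eye(vocab_size)[token_ids]    # (seq_len, vocab_size) one-hot encoding
    return one_hot @ embedding_matrix          # (seq_len, embed_dim) embedding vectors

vectors = embed(np.array([3, 17, 42]))         # equivalent to embedding_matrix[[3, 17, 42]]
```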
FIG. 7 illustrates a functional block diagram of a speech feature extraction model used in an embodiment of the present application. As shown in fig. 7, the speech data 701 to be recognized is encoded by the encoder 702 to obtain corresponding speech encoding features, and then decoded by the decoder 703 to obtain corresponding speech decoding features 704. The first text probability distribution of the speech data to be recognized can be predicted after the linear transformation and normalization of the speech decoding features 704. The speech decoding feature 704 is a real vector that is linearly transformed by projecting it into a log probability (logits) vector over a fully connected neural network, the length of the vector being the same as the number of words of the candidate text. Each element in the log-probability vector corresponds to a score of a candidate text, and each score can be converted into a corresponding probability value through the Softmax layer for normalization processing, and the probability values are combined together to form a first text probability distribution of the voice data to be recognized.
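The projection of the speech decoding feature into a logits vector and its Softmax normalization can be sketched as below; the feature dimension and the number of candidate texts are illustrative assumptions.
```python
import torch
import torch.nn as nn

decode_dim, num_candidates = 512, 5000
projection = nn.Linear(decode_dim, num_candidates)    # fully connected projection layer

speech_decoding_feature = torch.randn(1, decode_dim)  # placeholder for feature 704
logits = projection(speech_decoding_feature)          # one score per candidate text
first_text_prob = torch.softmax(logits, dim=-1)       # first text probability distribution
```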
In the embodiments of the application, a Transformer model based on the self-attention mechanism can be used as the voice feature extraction model to extract features from the voice data to be recognized and obtain the first text probability distribution. FIG. 8 illustrates a block diagram of the model structure of a Transformer model used in some embodiments of the present application. As shown in fig. 8, the model as a whole includes an Encoder section located on the left side of the figure and a Decoder section located on the right side of the figure. The encoder is composed of a stack of several encoding units, and the decoder is composed of a stack of the same number of decoding units.
The Input part (Inputs) of the encoder is the voice data to be recognized, the voice data to be recognized is firstly word-embedded through an Input Embedding layer (Input Embedding) to obtain an embedded vector, and the embedded vector is Input into the encoding unit after being combined with the position encoding (Positional Encoding). The coding unit mainly comprises a Multi-Head Self-Attention layer (Multi-Head Self-Attention) and a Feed Forward neural network (Feed Forward), and the output of each part is subjected to summation and normalization (Add & normal) based on a residual network.
The output part (Outputs) is shifted right as a whole by one position (Shifted Right) and a start symbol is added; it is then input, together with the encoder output, to the decoding unit of the decoder to start the prediction process.
The decoding unit mainly comprises a mask Multi-Head Self-Attention layer (mask Multi-Head Self-Attention), a coding decoding Multi-Head Attention layer (Encoder Decoder Multi-Head Attention) and a Feed Forward neural network (Feed Forward), and the output of each part is subjected to summation and normalization processing based on a residual network.
The output of the decoder is processed by a Linear layer (Linear) and a normalization layer (Softmax) in turn to obtain output probability (Output Probabilities) which is used as a first text probability distribution of the voice data to be recognized.
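As a rough illustration only, an encoder-decoder Transformer of the kind shown in fig. 8 can be assembled from stock PyTorch layers as below; the dimensions, layer counts, vocabulary size and random placeholder inputs are assumptions and do not reflect the patent's actual configuration.
```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 5000
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)
output_head = nn.Linear(d_model, vocab_size)          # Linear layer before Softmax

src = torch.randn(1, 200, d_model)   # embedded speech input plus positional encoding
tgt = torch.randn(1, 20, d_model)    # right-shifted output embeddings with a start symbol
tgt_mask = model.generate_square_subsequent_mask(20)  # mask for masked self-attention

decoder_states = model(src, tgt, tgt_mask=tgt_mask)
output_probabilities = torch.softmax(output_head(decoder_states), dim=-1)
```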
In step S330, feature extraction is performed on the speech reference text to obtain text coding features of the speech reference text, and a second text probability distribution of the speech data to be recognized is predicted according to the similarity between the text coding features and the speech decoding features.
For the speech reference text, a pre-trained text feature extraction model can be adopted to extract features of the speech reference text to obtain text coding features. Specifically, the embodiment of the application can obtain a pre-trained text feature extraction model, wherein the text feature extraction model comprises an embedded layer and an encoder; carrying out vectorization processing on the voice reference text through the embedding layer to obtain an embedding vector corresponding to the voice reference text; and extracting the characteristics of the embedded vector by an encoder to obtain the text coding characteristics of the speech reference text.
In some embodiments of the present application, the encoder used in the text feature extraction model may have the same network structure as the encoder portion of the speech feature extraction model.
In some embodiments of the present application, the similarity between the text encoding feature and the speech decoding feature may be calculated by means of a vector inner product or cosine distance. After the similarity between the text coding feature and the voice decoding feature is obtained, the similarity can be normalized through a normalization function, and a corresponding second text probability distribution is obtained.
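A minimal sketch of this step, assuming one text coding feature per reference-text field and a single speech decoding feature, with the inner product (or cosine similarity) normalized by Softmax; dimensions are illustrative.
```python
import torch
import torch.nn.functional as F

text_encoding_features = torch.randn(30, 512)    # one vector per reference-text field
speech_decoding_feature = torch.randn(512)

dot_similarity = text_encoding_features @ speech_decoding_feature       # vector inner product
cos_similarity = F.cosine_similarity(text_encoding_features,            # or cosine similarity
                                     speech_decoding_feature.unsqueeze(0), dim=-1)

second_text_prob = torch.softmax(dot_similarity, dim=-1)  # second text probability distribution
```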
In step S340, the first text probability distribution and the second text probability distribution are fused to obtain a comprehensive text probability distribution of the speech data to be recognized.
The method for fusing the first text probability distribution and the second text probability distribution may be to perform weighted fusion on the first text probability distribution and the second text probability distribution according to a preset weight coefficient. The preset weight coefficient may be a fixed coefficient obtained through model training, or a coefficient dynamically adjusted in the voice recognition process.
FIG. 9 is a flowchart illustrating steps in a method for fusing text probability distributions in some embodiments of the present application. As shown in fig. 9, the fusion processing of the first text probability distribution and the second text probability distribution in step S340 to obtain a comprehensive text probability distribution of the speech data to be recognized may include the following steps S910 to S930.
Step S910: and carrying out weighted fusion on the text coding features according to the second text probability distribution to obtain fusion coding features of the voice reference text.
Step S920: and mapping the fusion coding feature and the voice decoding feature to obtain a weight coefficient for carrying out weighted fusion on the first text probability distribution and the second text probability distribution.
Step S930: and carrying out weighted fusion on the first text probability distribution and the second text probability distribution according to the weight coefficient to obtain the comprehensive text probability distribution of the voice data to be recognized.
In some embodiments of the present application, the fusion encoded features and the speech decoded features may be mapped through a pre-trained neural network with a full-connection layer and a normalization layer. Specifically, the embodiment of the application may first obtain weight distribution parameters corresponding to the fusion coding feature and the speech coding feature, then perform weighted mapping on the fusion coding feature and the speech coding feature according to the weight distribution parameters to obtain weight distribution features, and then perform normalization processing on the weight distribution features to obtain weight coefficients for performing weighted fusion on the first text probability distribution and the second text probability distribution.
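The following sketch puts steps S910 to S930 together under two stated assumptions that are not spelled out in the text above: the weight coefficient is produced by a learned linear map over the two features followed by a sigmoid, and the second distribution is mapped onto the candidate-text vocabulary by assigning each reference-text field to one vocabulary entry.
```python
import torch
import torch.nn as nn

dim, vocab_size, n_fields = 512, 5000, 30
w_fusion = nn.Linear(dim, 1, bias=False)   # weight distribution parameter, fusion coding feature
w_speech = nn.Linear(dim, 1, bias=False)   # weight distribution parameter, speech feature

def fuse_distributions(first_prob, second_prob, field_vocab_ids,
                       text_encoding_features, speech_decoding_feature):
    # S910: weighted fusion of the text coding features by the second distribution.
    fusion_feature = second_prob @ text_encoding_features                     # (dim,)
    # S920: weighted mapping of the two features, then normalization -> coefficient g in [0, 1].
    g = torch.sigmoid(w_fusion(fusion_feature) + w_speech(speech_decoding_feature)).squeeze()
    # Map the per-field second distribution onto the candidate-text vocabulary (assumption).
    second_over_vocab = torch.zeros(vocab_size)
    second_over_vocab.index_add_(0, field_vocab_ids, second_prob)
    # S930: weighted fusion of the two text probability distributions.
    return (1 - g) * first_prob + g * second_over_vocab

combined = fuse_distributions(torch.softmax(torch.randn(vocab_size), dim=-1),
                              torch.softmax(torch.randn(n_fields), dim=-1),
                              torch.randint(0, vocab_size, (n_fields,)),
                              torch.randn(n_fields, dim), torch.randn(dim))
```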
The weight distribution parameters corresponding to the fusion coding feature and the voice coding feature can be network parameters obtained by training in a neural network, and the embodiments of the application can obtain fixed weight distribution parameters by training the neural network with voice data samples. Specifically, the embodiments of the application can perform voice recognition processing on a voice data training sample so as to determine, according to the result of the voice recognition processing, a loss error representing the recognition accuracy of the voice data training sample; determine the error gradients of the fusion coding feature and the voice coding feature respectively according to the loss error; and determine the weight distribution parameters respectively corresponding to the fusion coding feature and the voice coding feature according to the error gradients. For example, when a voice recognition process is performed using a batch of voice data training samples, a loss error indicating the recognition accuracy is obtained. The loss error may be calculated, for example, as the cross-entropy H(t, p) between the real text and the predicted text:
H(t, p) = -Σ_x t(x)·log p(x)
where x is the speech data training sample, t(x) is the probability distribution of the real text, and p(x) is the probability distribution of the predicted text.
After the loss error is calculated, the weight of the loss error can be respectively assigned to the fusion coding characteristic by the weight assignment parameter w 1 And weight allocation parameters w for speech coding features 2 Obtaining error gradient of two weight distribution parameters by partial derivativeAnd->
Based on the obtained error gradients, the corresponding weight distribution parameters may be updated according to the following formulas:

w1 ← w1 − η·∂H/∂w1
w2 ← w2 − η·∂H/∂w2

where η is a preset learning rate.
When the next speech data training sample is used for speech recognition processing, the updated weight distribution parameters are used in the calculation, and a new loss error is obtained from the recognition result. By iterating this training process, weight distribution parameters meeting the error requirement are finally determined.
In some embodiments of the present application, the weight distribution parameters corresponding to the fusion coding feature and the speech decoding feature may also be parameters that are dynamically adjusted according to the speech recognition results during the speech recognition process. Specifically, the embodiment of the application can acquire recognized speech data related to the speech data to be recognized and acquire a loss error representing the recognition accuracy of the recognized speech data; determine error gradients of the fusion coding feature and the speech decoding feature respectively according to the loss error; and determine the weight distribution parameters respectively corresponding to the fusion coding feature and the speech decoding feature based on the error gradients. For example, after the speech recognition processing of the preceding speech data is completed, the weight distribution parameters may be updated in real time according to the accuracy of its recognition result, and the subsequent speech data may then be recognized using the updated weight distribution parameters. In the embodiment of the application, the loss error may be calculated using the cross-entropy loss function described above, and the partial derivatives obtained by differentiation are used as the error gradients of the fusion coding feature and the speech decoding feature, so that the corresponding weight distribution parameters are updated according to the error gradients. For example, suppose the speech data to be recognized is a speech data sequence x1, x2, x3, … consisting of a plurality of speech fragments obtained continuously in time order. Based on the recognition result of the speech data x1, a loss error H1 can be determined; after the error gradients are calculated from H1, the weight distribution parameter of the fusion coding feature can be updated from w11 to w12, and the weight distribution parameter of the speech decoding feature can be updated from w21 to w22. The speech data x2 is then recognized using the updated weight distribution parameters w12 and w22, a loss error H2 is determined based on its recognition result, the error gradients are calculated from H2, and new weight distribution parameters w13 and w23 are obtained by updating, and so on, until speech recognition of the complete speech sequence is finished. The manner of calculating the error gradients from the loss error and updating the weight distribution parameters can refer to the description in the foregoing embodiment and is not repeated here.
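Both the fixed (training-time) and dynamically adjusted variants rely on the same gradient step w ← w − η·∂H/∂w. The following Python/NumPy sketch illustrates that step under several simplifying assumptions that are not part of the application: the fusion coding feature and the speech decoding feature are reduced to scalars, the distributions are toy two-word distributions, and the gradients are taken numerically rather than by backpropagation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_entropy(t, p, eps=1e-12):
    # H(t, p) = -sum_x t(x) * log p(x)
    return -np.sum(t * np.log(p + eps))

def fused_distribution(w1, w2, finalvalue, d, p1, p2):
    # Gate lambda = sigmoid(w1*finalvalue + w2*d); the features are reduced
    # to scalars here purely for readability.
    lam = sigmoid(w1 * finalvalue + w2 * d)
    return lam * p1 + (1.0 - lam) * p2

def update_step(w1, w2, sample, eta=0.1, h=1e-5):
    """One update w <- w - eta * dH/dw; the gradients are taken numerically
    here (an autodiff framework would normally provide them)."""
    def loss(a, b):
        p = fused_distribution(a, b, sample["finalvalue"], sample["d"],
                               sample["p1"], sample["p2"])
        return cross_entropy(sample["t"], p)
    g1 = (loss(w1 + h, w2) - loss(w1 - h, w2)) / (2 * h)  # dH/dw1
    g2 = (loss(w1, w2 + h) - loss(w1, w2 - h)) / (2 * h)  # dH/dw2
    return w1 - eta * g1, w2 - eta * g2

# Toy two-word vocabulary; the second word is the real text.
sample = {"finalvalue": 0.3, "d": 0.7,
          "p1": np.array([0.6, 0.4]),   # first text probability distribution p
          "p2": np.array([0.2, 0.8]),   # second text probability distribution p'
          "t":  np.array([0.0, 1.0])}   # one-hot distribution of the real text

w1, w2 = 0.5, 0.5
# Offline training repeats this step over a batch of training samples; the
# dynamic variant instead applies it after each finished utterance x1, x2, ...
for _ in range(3):
    w1, w2 = update_step(w1, w2, sample)
```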
In step S350, a target text serving as the speech recognition result of the speech data to be recognized is selected from the candidate texts according to the comprehensive text probability distribution.
The comprehensive text probability distribution represents the probability of each candidate text being selected as the speech recognition result. The embodiment of the application can select the candidate text with the highest probability as the target text based on the comprehensive text probability distribution and output it as the speech recognition result of the speech data to be recognized.
Fig. 10 is a schematic diagram of a speech recognition model structure and a speech recognition process used in an application scenario according to an embodiment of the present application. As shown in fig. 10, the speech recognition model used in the embodiment of the present application mainly includes a first encoder, a second encoder, and a decoder. The first encoder and the decoder are connected to form a voice feature extraction model, and the second encoder is used as a text feature extraction model.
The speech data to be recognized is input into the first encoder; after encoding, the speech encoding features are output. After the speech encoding features are decoded by the decoder, the corresponding speech decoding features are obtained, which can be expressed as feature vectors d_i. After the speech decoding features are mapped, a first text probability distribution p used for predicting the text selection probability is obtained.
The reference text corresponding to the speech data to be recognized is input into the second encoder; after encoding, the text encoding features are output, which can be expressed as feature vectors h_i.
Based on the attention mechanism, a similarity calculation can be performed between the speech decoding features and the text encoding features to obtain a second text probability distribution p′ for the speech data to be recognized. The first text probability distribution p and the second text probability distribution p′ are then weighted and fused to obtain the comprehensive text probability distribution p″ used for representing the final selection probability.
The attention mechanism mainly involves three parts: query, key and value. A similarity calculation is performed between the query and the keys to obtain the weight of each value, and the values are then weighted and summed to obtain a final value representation. In the embodiment of the application, the query is the feature representation d of the speech decoding feature output by the decoder, and the keys and values are the feature representations h obtained by encoding each word in the reference text with the second encoder. Based on the speech decoding feature d_i, the second text probability distribution p′ may be calculated according to the following formula (1), and the fusion coding feature finalvalue may be calculated according to formula (2).
p′ = softmax([d_i·h_1, d_i·h_2, …, d_i·h_n])    (1)
finalvalue = Σ_j p′_j·h_j    (2)
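As a minimal NumPy sketch of formulas (1) and (2), using made-up feature vectors (the dimensionality and values are illustrative only):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

d = np.array([0.5, 0.3, 0.1])          # speech decoding feature d_i (query)
H = np.array([[0.1, 0.3, 0.2],         # text encoding features h_1 ... h_n
              [0.5, 0.1, 0.2]])        # (keys and values, one row per word)

p_second = softmax(H @ d)              # formula (1): p' over the reference words
finalvalue = p_second @ H              # formula (2): weighted sum of the values
```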
The weight coefficient λ can be obtained by mapping the fusion coding feature finalvalue and the speech decoding feature d. It can be controlled by a dynamically weighted gate and, for example, dynamically adjusted according to formula (3).
λ = sigmoid(w1·finalvalue + w2·d)    (3)
where w1 and w2 are the weight distribution parameters corresponding to the fusion coding feature finalvalue and the speech decoding feature d, respectively.
According to formula (4), the first text probability distribution p and the second text probability distribution p′ are weighted and fused according to the weight coefficient λ to obtain the comprehensive text probability distribution p″.
p″ = λ·p + (1 − λ)·p′    (4)
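Continuing that sketch, formulas (3) and (4) can be written as below; treating w1 and w2 as vectors that project the features to a scalar gate, and mapping p′ onto the full candidate vocabulary, are assumptions made only so the toy shapes line up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

finalvalue = np.array([0.18, 0.26, 0.20])   # fusion coding feature
d = np.array([0.5, 0.3, 0.1])               # speech decoding feature
w1 = np.array([0.2, -0.1, 0.3])             # weight distribution parameters,
w2 = np.array([0.1, 0.4, -0.2])             # here vectors projecting to a scalar

lam = sigmoid(w1 @ finalvalue + w2 @ d)     # formula (3): dynamically weighted gate

p_first  = np.array([0.10, 0.01, 0.20, 0.40])    # p  over candidate words
p_second = np.array([0.00, 0.00, 0.80, 0.20])    # p' mapped onto the same vocabulary
p_total  = lam * p_first + (1 - lam) * p_second  # formula (4): p''
best = int(np.argmax(p_total))              # index of the target text (step S350)
```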
Fig. 11 is a schematic diagram of an operation on specific vectors in an application scenario according to an embodiment of the present application. As shown in fig. 11, taking a specific vector operation as an example, the speech encoding feature of the data to be recognized after encoding by the first encoder is the vector [0.6, 0.8, 0.9], and the speech decoding feature after decoding by the decoder is the vector [0.5, 0.3, 0.1]. After the speech decoding feature is mapped, a first text probability distribution p used for predicting the text selection probability is obtained. The first text probability distribution p represents the probability of each word being selected as the recognition result and may be, for example, "I:0.1, a:0.01, choose:0.2, …, new:0.4", meaning that the first text probability corresponding to the candidate text I is 0.1, the first text probability corresponding to the candidate text a is 0.01, the first text probability corresponding to the candidate text choose is 0.2, …, and the first text probability corresponding to the candidate text new is 0.4.
The two words choose and relax are used as the reference text, and the corresponding text encoding features [0.1, 0.3, 0.2] and [0.5, 0.1, 0.2] are obtained after the reference text is encoded by the second encoder 1002.
The feature similarity is calculated between the speech decoding feature [0.5, 0.3, 0.1] and the two text encoding features [0.1, 0.3, 0.2] and [0.5, 0.1, 0.2], respectively, and a second text probability distribution p′ is determined according to the calculation result. The second text probability distribution p′ represents the probability corresponding to each word in the reference text, for example "choose:0.8, relax:0.2", meaning that the second text probability corresponding to the reference word choose is 0.8 and the second text probability corresponding to the reference word relax is 0.2.
The weight coefficient λ may be obtained according to formula (3), and the comprehensive text probability distribution may then be obtained according to formula (4). For example, when the weight coefficient λ takes the value 0.1, the comprehensive text probability corresponding to the word choose is 0.1×0.2 + 0.9×0.8 = 0.74, so a comprehensive text probability distribution "choose:0.74, I:0.01, …, new:0.04" is obtained, in which choose has the largest comprehensive text probability among all words; the final speech recognition result is therefore determined as choose. If the reference text were not introduced, the word new, which has the highest first text probability in the first text probability distribution determined directly from the speech decoding features, would be selected, resulting in an erroneous speech recognition result.
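The arithmetic of this example can be checked directly in a few lines; the probability values below are the ones quoted above, not outputs of an actual model:

```python
# Fusion step of the example above with the quoted values.
lam = 0.1
p_first  = {"I": 0.1, "a": 0.01, "choose": 0.2, "new": 0.4}   # from the decoder
p_second = {"choose": 0.8, "relax": 0.2}                      # from the reference text

vocab = set(p_first) | set(p_second)
p_total = {w: lam * p_first.get(w, 0.0) + (1 - lam) * p_second.get(w, 0.0)
           for w in vocab}

assert abs(p_total["choose"] - 0.74) < 1e-9   # 0.1*0.2 + 0.9*0.8
assert abs(p_total["new"] - 0.04) < 1e-9      # 0.1*0.4 + 0.9*0.0
result = max(p_total, key=p_total.get)        # -> "choose"
```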
To verify the effect of the reference text on improving speech recognition accuracy, four types of reference text were constructed in an application scenario: 1. out-of-order text (shuffle) obtained by shuffling the text to which the audio actually corresponds; 2. corrected text for the recognition errors in the original speech recognition result (true key); 3. text unrelated to the audio (wrong key); 4. a mixed text (mix key) combining the corrected text for the recognition errors in the original speech recognition result with text unrelated to the audio. The data sets used in this experiment come from an actual spoken-examination scenario; there are two data sets in total, each containing 1 hour of audio data and the corresponding ground-truth transcription of the audio. Table 1 shows the word error rate (WER) obtained when speech recognition processing is performed in combination with the different reference texts.
TABLE 1  Comparison of word error rates (WER)

              base    shuffle   wrong key   true key   mix key
Example 1     24.6    11.7      24.7        14.9       15.8
Example 2     12.8     5.8      13.3         7.6        7.9
As shown in Table 1, when the shuffle, true key or mix key text is used as the reference text, the word error rate (WER) is reduced to different extents compared with the base result obtained by performing speech recognition directly without a reference text. When the audio-unrelated wrong key text is used as the reference text, a WER similar to that obtained without a reference text is observed, so the accuracy of speech recognition is hardly adversely affected. Therefore, selecting a suitable reference text according to the application scenario can effectively improve speech recognition accuracy.
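The word error rate reported in Table 1 is the conventional metric; for a single utterance it is commonly computed from the word-level edit distance, as in the sketch below (this reflects the standard definition rather than a procedure specified by the application):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution in a five-word reference -> WER 0.2 (20%).
print(word_error_rate("i choose to relax today", "i chose to relax today"))
```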
It should be noted that although the steps of the methods in the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes an embodiment of an apparatus of the present application that may be used to perform the speech recognition method of the above-described embodiments of the present application. Fig. 12 schematically shows a block diagram of a voice recognition apparatus according to an embodiment of the present application. As shown in fig. 12, the voice recognition apparatus 1200 may mainly include: a data acquisition module 1210 configured to acquire voice data to be recognized and a voice reference text corresponding to the voice data to be recognized; the voice decoding module 1220 is configured to perform feature extraction on the voice data to be recognized to obtain voice decoding features of the voice data to be recognized, and predict a first text probability distribution of the voice data to be recognized according to the voice decoding features; the text encoding module 1230 is configured to perform feature extraction on the speech reference text to obtain text encoding features of the speech reference text, and predict a second text probability distribution of the speech data to be recognized according to the similarity between the text encoding features and the speech decoding features; the probability fusion module 1240 is configured to fuse the first text probability distribution and the second text probability distribution to obtain a comprehensive text probability distribution of the voice data to be recognized; the text selection module 1250 is configured to select a target text as a speech recognition result of the speech data to be recognized from the candidate texts according to the integrated text probability distribution.
In some embodiments of the present application, based on the above embodiments, the probability fusion module 1240 includes: the feature fusion unit is configured to perform weighted fusion on the text coding features according to the second text probability distribution to obtain fusion coding features of the voice reference text; the coefficient determining unit is configured to map the fusion coding feature and the voice decoding feature to obtain a weight coefficient for carrying out weighted fusion on the first text probability distribution and the second text probability distribution; and the probability fusion unit is configured to carry out weighted fusion on the first text probability distribution and the second text probability distribution according to the weight coefficient to obtain the comprehensive text probability distribution of the voice data to be recognized.
In some embodiments of the present application, based on the above embodiments, the coefficient determining unit includes: a parameter acquisition subunit configured to acquire weight distribution parameters respectively corresponding to the fusion coding feature and the speech decoding feature; a feature mapping subunit configured to perform weighted mapping on the fusion coding feature and the speech decoding feature according to the weight distribution parameters to obtain a weight distribution feature; and a feature normalization subunit configured to normalize the weight distribution feature to obtain a weight coefficient for performing weighted fusion on the first text probability distribution and the second text probability distribution.
In some embodiments of the present application, based on the above embodiments, the parameter acquisition subunit includes: a first error determination subunit configured to perform speech recognition processing on a speech data training sample to determine, according to the speech recognition processing result, a loss error representing the recognition accuracy of the speech data training sample; an error gradient determination subunit configured to determine error gradients of the fusion coding feature and the speech decoding feature respectively according to the loss error; and a parameter determination subunit configured to determine the weight distribution parameters respectively corresponding to the fusion coding feature and the speech decoding feature according to the error gradients.
In some embodiments of the present application, based on the above embodiments, the parameter acquisition subunit includes: a second error determination subunit configured to acquire recognized voice data related to the voice data to be recognized and acquire a loss error representing the recognition accuracy of the recognized voice data; an error gradient determination subunit configured to determine error gradients of the fusion coding feature and the speech decoding feature respectively according to the loss error; and a parameter determination subunit configured to determine the weight distribution parameters respectively corresponding to the fusion coding feature and the speech decoding feature according to the error gradients.
In some embodiments of the present application, based on the above embodiments, the data acquisition module 1210 includes: the scene determining unit is configured to acquire voice data to be recognized and determine a voice recognition scene for performing voice recognition on the voice data to be recognized; and the text selection unit is configured to select a voice reference text corresponding to the voice data to be recognized from the candidate text database according to the voice recognition scene.
In some embodiments of the present application, based on the above embodiments, the scene determining unit includes: a model acquisition subunit configured to acquire a scene classification model for scene classification of the voice data to be recognized; the feature processing subunit is configured to perform feature extraction and feature mapping processing on the voice data to be recognized through the scene classification model so as to obtain scene probability distribution of the voice data to be recognized; the first scene selection subunit is configured to select a voice recognition scene for performing voice recognition on voice data to be recognized from a plurality of candidate scenes according to a scene probability distribution.
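As a toy illustration of this scene-classification step, the sketch below maps an already-extracted feature vector to a scene probability distribution and picks the most probable candidate scene; the scene names, weights and feature values are placeholders, not data from the application:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

CANDIDATE_SCENES = ["oral_exam", "meeting", "navigation"]   # illustrative names

def classify_scene(speech_features, W, b):
    """Feature mapping plus normalization to a scene probability distribution,
    then selection of the most probable candidate scene."""
    scene_probs = softmax(W @ speech_features + b)
    return CANDIDATE_SCENES[int(np.argmax(scene_probs))], scene_probs

# Placeholder classifier parameters and an already-extracted feature vector.
W = np.random.default_rng(0).normal(size=(3, 4))
b = np.zeros(3)
scene, probs = classify_scene(np.array([0.2, 0.5, 0.1, 0.7]), W, b)
```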
In some embodiments of the present application, based on the above embodiments, the scene determining unit includes: a recognized data acquisition subunit configured to acquire recognized voice data related to voice data to be recognized; and a second scene selection subunit configured to select a speech recognition scene for performing speech recognition on the speech data to be recognized from the plurality of candidate scenes according to the speech recognition result of the recognized speech data.
In some embodiments of the present application, based on the above embodiments, the data acquisition module 1210 includes: a data acquisition unit configured to acquire voice data to be recognized and recognized voice data related to the voice data to be recognized; and a text selection unit configured to select a speech reference text corresponding to the speech data to be recognized from the candidate text database according to a speech recognition result of the recognized speech data.
In some embodiments of the present application, based on the above embodiments, the voice recognition apparatus 1200 further includes: the text word segmentation module is configured to perform word segmentation processing on the voice reference text to obtain text fields forming the voice reference text; and the field ordering module is configured to randomly order the text fields in the voice reference text to obtain the voice reference text consisting of the text fields with disordered orders.
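For an English reference text, this preprocessing can reduce to splitting on whitespace and shuffling the resulting fields, as sketched below (a real system might use a dedicated word segmenter for languages without whitespace delimiters):

```python
import random

def shuffle_reference_text(text, seed=None):
    """Word segmentation followed by random ordering of the text fields,
    producing the out-of-order ("shuffle") reference text."""
    fields = text.split()          # word segmentation (whitespace-based here)
    rng = random.Random(seed)
    rng.shuffle(fields)            # random ordering of the text fields
    return " ".join(fields)

# e.g. "i choose to relax today" might become "relax i today choose to"
shuffled = shuffle_reference_text("i choose to relax today", seed=42)
```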
In some embodiments of the present application, based on the above embodiments, the speech decoding module 1220 includes: a first model acquisition unit configured to acquire a pre-trained speech feature extraction model including an embedded layer, an encoder, and a decoder connected in sequence; the first data embedding unit is configured to perform vectorization processing on the voice data to be recognized through the embedding layer to obtain an embedding vector corresponding to the voice data to be recognized; the first feature coding unit is configured to perform feature extraction on the embedded vector through an encoder to obtain voice coding features of voice data to be recognized; and the feature decoding unit is configured to extract the features of the voice coding features through a decoder to obtain voice decoding features of the voice data to be recognized.
In some embodiments of the present application, based on the above embodiments, the text encoding module 1230 includes: a second model acquisition unit configured to acquire a text feature extraction model trained in advance, the text feature extraction model including an embedded layer and an encoder; the second data embedding unit is configured to perform vectorization processing on the voice reference text through the embedding layer to obtain an embedding vector corresponding to the voice reference text; and the second feature coding unit is configured to perform feature extraction on the embedded vector through an encoder to obtain text coding features of the voice reference text.
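As a rough illustration of these two extraction models, the following PyTorch sketch wires an embedding layer, an encoder and a decoder together for the speech branch, and an embedding layer plus an encoder for the text branch; the layer types, sizes and the GRU choice are assumptions made for the sake of a concrete example and are not the structure disclosed by the application:

```python
import torch
import torch.nn as nn

class SpeechFeatureExtractor(nn.Module):
    """Embedding layer -> encoder -> decoder, yielding speech decoding features."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.embed = nn.Linear(feat_dim, hidden)          # frame-wise "embedding"
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, frames):                            # frames: (B, T, feat_dim)
        x = self.embed(frames)
        enc, _ = self.encoder(x)                          # speech encoding features
        dec, _ = self.decoder(enc)                        # speech decoding features d_i
        return dec

class TextFeatureExtractor(nn.Module):
    """Embedding layer -> encoder, yielding text encoding features."""
    def __init__(self, vocab=10000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, token_ids):                         # token_ids: (B, N)
        x = self.embed(token_ids)
        enc, _ = self.encoder(x)                          # text encoding features h_i
        return enc

speech_d = SpeechFeatureExtractor()(torch.randn(1, 120, 80))
text_h = TextFeatureExtractor()(torch.randint(0, 10000, (1, 6)))
```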
Specific details of the voice recognition device provided in each embodiment of the present application have been described in the corresponding method embodiments, and are not described herein.
Fig. 13 schematically shows a block diagram of a computer system for implementing an electronic device according to an embodiment of the present application.
It should be noted that, the computer system 1300 of the electronic device shown in fig. 13 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 13, the computer system 1300 includes a central processing unit 1301 (Central Processing Unit, CPU) which can execute various appropriate actions and processes according to a program stored in a read-only memory 1302 (ROM) or a program loaded from a storage portion 1308 into a random access memory 1303 (Random Access Memory, RAM). In the random access memory 1303, various programs and data necessary for the system operation are also stored. The CPU 1301, the ROM 1302, and the RAM 1303 are connected to each other via a bus 1304. An input/output interface 1305 (i.e., an I/O interface) is also connected to the bus 1304.
The following components are connected to the input/output interface 1305: an input section 1306 including a keyboard, a mouse, and the like; an output portion 1307 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and the like, a speaker, and the like; a storage portion 1308 including a hard disk or the like; and a communication section 1309 including a network interface card such as a local area network card, a modem, or the like. The communication section 1309 performs a communication process via a network such as the internet. The drive 1310 is also connected to the input/output interface 1305 as needed. Removable media 1311, such as magnetic disks, optical disks, magneto-optical disks, semiconductor memory, and the like, is installed as needed on drive 1310 so that a computer program read therefrom is installed as needed into storage portion 1308.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication portion 1309 and/or installed from the removable medium 1311. The computer programs, when executed by the central processor 1301, perform the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal that propagates in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (14)

1. A method of speech recognition, comprising:
acquiring voice data to be recognized and a voice reference text corresponding to the voice data to be recognized;
extracting the characteristics of the voice data to be recognized to obtain voice decoding characteristics of the voice data to be recognized, and predicting first text probability distribution of the voice data to be recognized according to the voice decoding characteristics;
extracting the characteristics of the voice reference text to obtain text coding characteristics of the voice reference text, and predicting second text probability distribution of the voice data to be recognized according to the similarity between the text coding characteristics and the voice decoding characteristics;
weighting and fusing the text coding features according to the second text probability distribution to obtain fused coding features of the voice reference text;
mapping the fusion coding feature and the voice decoding feature to obtain a weight coefficient for carrying out weighted fusion on the first text probability distribution and the second text probability distribution;
weighting and fusing the first text probability distribution and the second text probability distribution according to the weight coefficient to obtain comprehensive text probability distribution of the voice data to be recognized;
and selecting target text serving as a voice recognition result of the voice data to be recognized from candidate texts according to the comprehensive text probability distribution.
2. The method according to claim 1, wherein the mapping the fusion encoding feature and the speech decoding feature to obtain weight coefficients for weighted fusion of the first text probability distribution and the second text probability distribution, comprises:
acquiring weight distribution parameters corresponding to the fusion coding feature and the voice decoding feature respectively;
weighting mapping is carried out on the fusion coding feature and the voice decoding feature according to the weight distribution parameter to obtain a weight distribution feature;
and normalizing the weight distribution characteristics to obtain weight coefficients for carrying out weighted fusion on the first text probability distribution and the second text probability distribution.
3. The method according to claim 2, wherein the acquiring weight distribution parameters respectively corresponding to the fusion coding feature and the speech decoding feature comprises:
performing voice recognition processing on the voice data training sample to determine a loss error for representing recognition accuracy of the voice data training sample according to a voice recognition processing result;
determining error gradients of the fusion coding feature and the voice decoding feature according to the loss errors respectively;
and determining weight distribution parameters respectively corresponding to the fusion coding feature and the voice decoding feature according to the error gradient.
4. The method according to claim 2, wherein the acquiring weight distribution parameters respectively corresponding to the fusion coding feature and the speech decoding feature comprises:
acquiring the recognized voice data related to the voice data to be recognized, and acquiring a loss error for representing the recognition accuracy of the recognized voice data;
determining error gradients of the fusion coding feature and the voice decoding feature according to the loss errors respectively;
and determining weight distribution parameters respectively corresponding to the fusion coding feature and the voice decoding feature according to the error gradient.
5. The method for voice recognition according to claim 1, wherein the acquiring the voice data to be recognized and the voice reference text corresponding to the voice data to be recognized includes:
acquiring voice data to be recognized, and determining a voice recognition scene for performing voice recognition on the voice data to be recognized;
And selecting a voice reference text corresponding to the voice data to be recognized from a candidate text database according to the voice recognition scene.
6. The method according to claim 5, wherein the determining a speech recognition scenario for speech recognition of the speech data to be recognized includes:
acquiring a scene classification model for classifying the scenes of the voice data to be recognized;
performing feature extraction and feature mapping processing on the voice data to be recognized through the scene classification model to obtain scene probability distribution of the voice data to be recognized;
and selecting a voice recognition scene for carrying out voice recognition on the voice data to be recognized from a plurality of candidate scenes according to the scene probability distribution.
7. The method according to claim 5, wherein the determining a speech recognition scenario for speech recognition of the speech data to be recognized includes:
acquiring recognized voice data related to the voice data to be recognized;
and selecting a voice recognition scene for carrying out voice recognition on the voice data to be recognized from a plurality of candidate scenes according to the voice recognition result of the recognized voice data.
8. The method for voice recognition according to claim 1, wherein the acquiring the voice data to be recognized and the voice reference text corresponding to the voice data to be recognized includes:
acquiring voice data to be recognized and recognized voice data related to the voice data to be recognized;
and selecting a voice reference text corresponding to the voice data to be recognized from a candidate text database according to the voice recognition result of the recognized voice data.
9. The method of claim 1, wherein prior to feature extraction of the speech reference text to obtain text-encoded features of the speech reference text, the method further comprises:
word segmentation is carried out on the voice reference text to obtain text fields forming the voice reference text;
and randomly sequencing text fields in the voice reference text to obtain the voice reference text consisting of the text fields with disordered sequences.
10. The method for recognizing speech according to claim 1, wherein the feature extraction of the speech data to be recognized to obtain speech decoding features of the speech data to be recognized comprises:
acquiring a pre-trained voice feature extraction model, wherein the voice feature extraction model comprises an embedded layer, an encoder and a decoder which are connected in sequence;
performing vectorization processing on the voice data to be recognized through the embedding layer to obtain an embedding vector corresponding to the voice data to be recognized;
extracting features of the embedded vectors through the encoder to obtain voice coding features of the voice data to be recognized;
and extracting the characteristics of the voice coding characteristics through the decoder to obtain voice decoding characteristics of the voice data to be recognized.
11. The method according to claim 1, wherein the feature extraction of the speech reference text to obtain text coding features of the speech reference text comprises:
acquiring a pre-trained text feature extraction model, wherein the text feature extraction model comprises an embedded layer and an encoder;
performing vectorization processing on the voice reference text through the embedding layer to obtain an embedding vector corresponding to the voice reference text;
and extracting the characteristics of the embedded vector through the encoder to obtain the text coding characteristics of the voice reference text.
12. A speech recognition apparatus, comprising:
the data acquisition module is configured to acquire voice data to be recognized and voice reference text corresponding to the voice data to be recognized;
the voice decoding module is configured to perform feature extraction on the voice data to be recognized to obtain voice decoding features of the voice data to be recognized, and predict first text probability distribution of the voice data to be recognized according to the voice decoding features;
the text coding module is configured to extract the characteristics of the voice reference text to obtain text coding characteristics of the voice reference text, and predict second text probability distribution of the voice data to be recognized according to the similarity of the text coding characteristics and the voice decoding characteristics;
the probability fusion module is configured to carry out weighted fusion on the text coding features according to the second text probability distribution to obtain fusion coding features of the voice reference text; mapping the fusion coding feature and the voice decoding feature to obtain a weight coefficient for carrying out weighted fusion on the first text probability distribution and the second text probability distribution; weighting and fusing the first text probability distribution and the second text probability distribution according to the weight coefficient to obtain comprehensive text probability distribution of the voice data to be recognized;
and the text selection module is configured to select target text serving as a voice recognition result of the voice data to be recognized from candidate texts according to the comprehensive text probability distribution.
13. A computer readable medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method of any of claims 1 to 11.
14. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the speech recognition method of any one of claims 1 to 11 via execution of the executable instructions.
CN202011280315.8A 2020-11-16 2020-11-16 Speech recognition method, speech recognition device, computer readable medium and electronic equipment Active CN113421551B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011280315.8A CN113421551B (en) 2020-11-16 2020-11-16 Speech recognition method, speech recognition device, computer readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011280315.8A CN113421551B (en) 2020-11-16 2020-11-16 Speech recognition method, speech recognition device, computer readable medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113421551A CN113421551A (en) 2021-09-21
CN113421551B true CN113421551B (en) 2023-12-19

Family

ID=77711672

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011280315.8A Active CN113421551B (en) 2020-11-16 2020-11-16 Speech recognition method, speech recognition device, computer readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113421551B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2708842A1 (en) * 2009-07-01 2011-01-01 Comcast Interactive Media, Llc Generating topic-specific language models
CN104346389A (en) * 2013-08-01 2015-02-11 安徽科大讯飞信息科技股份有限公司 Scoring method and system of semi-open-ended questions of oral test
CN110544477A (en) * 2019-09-29 2019-12-06 北京声智科技有限公司 Voice recognition method, device, equipment and medium
CN111276149A (en) * 2020-01-19 2020-06-12 科大讯飞股份有限公司 Voice recognition method, device, equipment and readable storage medium
CN111292728A (en) * 2018-11-21 2020-06-16 三星电子株式会社 Speech recognition method and apparatus
CN111816165A (en) * 2020-07-07 2020-10-23 北京声智科技有限公司 Voice recognition method and device and electronic equipment
CN111933129A (en) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2708842A1 (en) * 2009-07-01 2011-01-01 Comcast Interactive Media, Llc Generating topic-specific language models
CN104346389A (en) * 2013-08-01 2015-02-11 安徽科大讯飞信息科技股份有限公司 Scoring method and system of semi-open-ended questions of oral test
CN111292728A (en) * 2018-11-21 2020-06-16 三星电子株式会社 Speech recognition method and apparatus
CN110544477A (en) * 2019-09-29 2019-12-06 北京声智科技有限公司 Voice recognition method, device, equipment and medium
CN111276149A (en) * 2020-01-19 2020-06-12 科大讯飞股份有限公司 Voice recognition method, device, equipment and readable storage medium
CN111816165A (en) * 2020-07-07 2020-10-23 北京声智科技有限公司 Voice recognition method and device and electronic equipment
CN111933129A (en) * 2020-09-11 2020-11-13 腾讯科技(深圳)有限公司 Audio processing method, language model training method and device and computer equipment

Also Published As

Publication number Publication date
CN113421551A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN111444340B (en) Text classification method, device, equipment and storage medium
CN111027331B (en) Method and apparatus for evaluating translation quality
CN112131366B (en) Method, device and storage medium for training text classification model and text classification
WO2022022421A1 (en) Language representation model system, pre-training method and apparatus, device and medium
CN108121800B (en) Information generation method and device based on artificial intelligence
CN110516253B (en) Chinese spoken language semantic understanding method and system
CN110795552B (en) Training sample generation method and device, electronic equipment and storage medium
CN111428010B (en) Man-machine intelligent question-answering method and device
CN112988979B (en) Entity identification method, entity identification device, computer readable medium and electronic equipment
CN109801527B (en) Method and apparatus for outputting information
CN112214591A (en) Conversation prediction method and device
CN110162675B (en) Method and device for generating answer sentence, computer readable medium and electronic device
CN112767910A (en) Audio information synthesis method and device, computer readable medium and electronic equipment
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN111666416A (en) Method and apparatus for generating semantic matching model
CN113761190A (en) Text recognition method and device, computer readable medium and electronic equipment
WO2020052061A1 (en) Method and device for processing information
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN116467417A (en) Method, device, equipment and storage medium for generating answers to questions
CN111400454A (en) Abstract generation method and device, electronic equipment and storage medium
CN112580343A (en) Model generation method, question and answer quality judgment method, device, equipment and medium
CN112349294A (en) Voice processing method and device, computer readable medium and electronic equipment
CN112307738A (en) Method and device for processing text
CN112163434A (en) Text translation method, device, medium and electronic equipment based on artificial intelligence
CN113421551B (en) Speech recognition method, speech recognition device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40052273

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant