CN112397053B - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents

Voice recognition method and device, electronic equipment and readable storage medium

Info

Publication number
CN112397053B
CN112397053B (application CN202011205969.4A)
Authority
CN
China
Prior art keywords
word
recognition
node
recognized
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011205969.4A
Other languages
Chinese (zh)
Other versions
CN112397053A (en)
Inventor
唐立亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011205969.4A priority Critical patent/CN112397053B/en
Publication of CN112397053A publication Critical patent/CN112397053A/en
Application granted granted Critical
Publication of CN112397053B publication Critical patent/CN112397053B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 Segmentation; Word boundary detection
    • G10L15/05 Word boundary detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application provide a voice recognition method, a voice recognition device, an electronic device, and a readable storage medium, relating to the technical field of artificial intelligence. The method comprises the following steps: acquiring acoustic characteristic information and language characteristic information of voice data to be recognized; determining, according to the acoustic characteristic information and the language characteristic information, a target recognition path matching the voice data to be recognized in a pre-constructed static word graph, and recording word node information of each word node in the candidate recognition paths of the voice data to be recognized; obtaining a voice recognition result of the voice data to be recognized according to the target recognition path; and backtracking along the target recognition path, based on the word identifier of each recognized word in the voice recognition result and the word node information of each word node in the candidate recognition paths, to obtain the word boundary information of each recognized word of the voice recognition result in the voice data to be recognized. In the embodiments of the present application, the word boundary information is obtained at the same time as the voice recognition result, reducing the time consumed in recognizing word boundary information.

Description

Voice recognition method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing and artificial intelligence technologies, and in particular, to a speech recognition method, apparatus, electronic device, and readable storage medium.
Background
With the development of artificial intelligence, speech recognition has made great progress in accuracy, and its application scenarios have become richer. In recent years, voice input methods have become available on smartphones, and voice search is increasingly popular and applied on other smart devices such as smart televisions, smart speakers, and television boxes. For example, in the film and short-video fields, traditional manual subtitle annotation can be replaced by speech recognition to save labor cost. In speech recognition, accuracy is naturally important, but the corresponding word boundary information usually must also be returned, so that it is known which part of the audio corresponds to which recognized word.
At present, word boundary information is mainly computed by a secondary forced alignment (FA) between the speech recognition result and the speech; that is, after the speech recognition result is obtained, a recognition network is constructed from the recognition result to obtain the word boundary information. However, this method requires reconstructing the recognition network after speech recognition is complete, which not only takes a long time to compute but also prevents the word boundary information from being returned synchronously with the speech recognition result.
Disclosure of Invention
The embodiments of the present application provide a voice recognition method, a voice recognition device, an electronic device, and a readable storage medium, which can obtain word boundary information at the same time as the recognition result, are more convenient, reduce the time consumed in recognizing word boundaries, and improve processing efficiency.
In one aspect, an embodiment of the present application provides a speech recognition method, where the method includes:
acquiring acoustic characteristic information and language characteristic information of voice data to be recognized;
determining a target recognition path matched with the voice data to be recognized in a pre-constructed static word graph according to the acoustic characteristic information and the language characteristic information, and recording word node information of each word node in a candidate recognition path of the voice data to be recognized;
obtaining a voice recognition result of the voice data to be recognized according to the target recognition path;
and backtracking according to the target recognition path based on the word identification of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path to obtain the word boundary information of each recognition word in the voice data to be recognized in the voice recognition result.
In another aspect, an embodiment of the present application provides a speech recognition apparatus, including:
the characteristic acquisition module is used for acquiring acoustic characteristic information and language characteristic information of the voice data to be recognized;
the relevant information determining module is used for determining a target recognition path matched with the voice data to be recognized in the pre-constructed static word graph according to the acoustic characteristic information and the language characteristic information, and recording word node information of each word node in the candidate recognition path of the voice data to be recognized;
the recognition result determining module is used for obtaining the recognition result of the voice data to be recognized according to the target recognition path;
and the word boundary information determining module is used for backtracking according to the target recognition path based on the word identification of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path to obtain the word boundary information of each recognition word in the voice data to be recognized in the voice recognition result.
In one aspect, an embodiment of the present application provides an electronic device, including a processor and a memory: the memory is configured to store a computer program which, when executed by the processor, causes the processor to perform the speech recognition method as described above.
In still another aspect, the present application provides a computer-readable storage medium for storing a computer program which, when run on a computer, enables the computer to execute the speech recognition method described above.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
in the embodiments of the present application, in the process of determining the recognition result of the voice data to be recognized, the word node information corresponding to each word node in the candidate recognition paths of the voice data to be recognized can be recorded. The voice recognition result of the voice data to be recognized is then obtained according to the target recognition path, and path backtracking is performed along the target recognition path based on the word identifier of each recognized word contained in the recognition result and the recorded word node information, yielding the word boundary information of each recognized word of the voice recognition result in the voice data to be recognized. Because each piece of word node information belongs to a node that can be used for obtaining the final result in the process of obtaining the voice recognition result, there is no need to reconstruct a recognition network to determine the relation between each recognized word in the voice recognition result and its word boundary information; the word boundary information of each recognized word can be read directly from the recorded word node information. The word boundary information is thus obtained at the same time as the voice recognition result, which is more convenient, reduces the time consumed in recognizing word boundary information, and improves efficiency.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a diagram of a static word graph provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of another speech recognition method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any combination of one or more of the associated listed items.
According to the speech recognition method of the present application, the voice data to be recognized can be recognized based on the speech processing technology in artificial intelligence to obtain the corresponding recognition result and word boundary information. The word boundary information is obtained at the same time as the recognition result, which is more convenient, reduces the time consumed in recognizing word boundaries, and improves processing efficiency.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech processing technology (Speech Technology) include automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes.
Optionally, the data processing according to the embodiments of the present application may be implemented in a cloud-computing-based manner. Cloud computing is a computing model that distributes computing tasks over a resource pool formed by a large number of computers, so that various application systems can obtain computing power, storage space, and information services as needed. The network that provides the resources is referred to as the "cloud". Resources in the "cloud" appear to the user to be infinitely expandable, available at any time, used on demand, and paid for according to use.
As a basic capability provider of cloud computing, a cloud computing resource pool (referred to as an IaaS (Infrastructure as a Service) platform for short) is established, and multiple types of virtual resources are deployed in the resource pool for external clients to use as needed.
The terms referred to in the embodiments of the present application are explained below:
Acoustic features: features characterizing the audio, such as energy, zero-crossing rate, and linear prediction coefficients (LPC). Acoustic features include time-domain features and/or frequency-domain features: time-domain features are extracted directly from the original speech signal, while frequency-domain features are extracted after the original signal has been transformed into the frequency domain by a Fourier transform.
In the present embodiment, when the extracted acoustic features are frequency-domain features, they may be filter bank (Fbank) features, Mel-Frequency Cepstral Coefficient (MFCC) features, Perceptual Linear Prediction (PLP) features, or the like.
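As a concrete illustration of the frequency-domain features above, the following is a minimal sketch of computing log-Mel filter bank (Fbank) features with librosa; the sample rate, frame length, frame shift, and number of Mel bands are illustrative assumptions, not values prescribed by this application.

    import librosa

    def extract_fbank(wav_path, sr=16000, n_mels=40):
        """Minimal Fbank (log-Mel filter bank) sketch; 25 ms windows with a
        10 ms shift are common ASR defaults, assumed here for illustration."""
        signal, sr = librosa.load(wav_path, sr=sr)
        mel = librosa.feature.melspectrogram(
            y=signal, sr=sr,
            n_fft=int(0.025 * sr),       # 25 ms analysis window
            hop_length=int(0.010 * sr),  # 10 ms frame shift
            n_mels=n_mels)
        return librosa.power_to_db(mel).T  # (num_frames, n_mels)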
Acoustic Model (AM): a knowledge representation of the differences in acoustics, phonetics, environmental variables, speaker gender, accent, and the like. The acoustic model may be based on a Hidden Markov Model (HMM), which is a weighted finite-state automaton in the discrete time domain, such as a Gaussian Mixture Model HMM (GMM-HMM) or a Deep Neural Network HMM (DNN-HMM); it may of course also be an end-to-end acoustic model, such as a Connectionist Temporal Classification (CTC) model, a Long Short-Term Memory (LSTM) model, or an attention (Attention) model.
Each state of the acoustic model represents the probability distribution of the acoustic features of a speech unit (such as a word, a syllable, or a phoneme) in that state. Transitions between states connect them into an ordered state sequence, from which the sequence of speech units represented by a segment of speech signal is obtained.
A Language Model (LM) is a knowledge representation of language structure, which may include rules among words and between sentences, such as grammar and common word collocations. The language model may include an N-gram model, a Recurrent Neural Network (RNN), and the like.
The Viterbi algorithm is a dynamic programming algorithm used to find the hidden state sequence (the Viterbi path) most likely to have generated a sequence of observed events, especially in the context of Markov information sources and hidden Markov models. The terms "Viterbi path" and "Viterbi algorithm" are also applied to related dynamic programming algorithms that find the single most likely explanation for a sequence of observations. For example, in statistical parsing, a dynamic programming algorithm can be used to find the most likely context-free derivation (parse) of a string, which is sometimes called "Viterbi parsing".
The viterbi algorithm is also used today in speech recognition, keyword recognition, computational linguistics and bioinformatics, among others. For example, in speech (speech recognition), the acoustic signal is considered as the observed sequence of events, while the text string is considered as the underlying cause of the acoustic signal, so that the viterbi algorithm can be applied to the acoustic signal to find the most likely text string.
The basis of the Viterbi algorithm can be summarized in the following three points:
First, if the most probable path passes through a point in the trellis, then the sub-path from the starting point to that point must itself be the most probable path from the start to that point.
Second, assuming there are k states at time i, there are k shortest paths from the start to these k states, and the final shortest path must pass through one of them, where i and k are natural numbers.
Third, by the above properties, when computing the shortest path to a state at time i + 1, only the shortest paths from the start to the current k states and the transitions from those k states to the state at time i + 1 need to be considered. For example, the shortest path at t = 3 is obtained from the shortest paths to all state nodes at t = 2, plus the shortest transition from each of those nodes between t = 2 and t = 3.
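To make these three points concrete, below is a minimal Viterbi decoding sketch over a trellis of k states, written in plain Python/NumPy as an illustration; the initial, transition, and emission log-probability tables are hypothetical inputs, not structures defined by this application.

    import numpy as np

    def viterbi(log_init, log_trans, log_emit, observations):
        """Minimal Viterbi decoding sketch.

        log_init:  (k,) initial log-probabilities of the k states
        log_trans: (k, k) log-probabilities of state transitions
        log_emit:  (k, num_symbols) emission log-probabilities
        observations: sequence of observation symbol indices
        Returns the most probable hidden state sequence.
        """
        k, T = len(log_init), len(observations)
        score = np.full((T, k), -np.inf)    # best log-prob ending in state s at time t
        back = np.zeros((T, k), dtype=int)  # predecessor state for traceback

        score[0] = log_init + log_emit[:, observations[0]]
        for t in range(1, T):
            for s in range(k):
                # Point 3: only the best path into each current state matters
                cand = score[t - 1] + log_trans[:, s]
                back[t, s] = int(np.argmax(cand))
                score[t, s] = cand[back[t, s]] + log_emit[s, observations[t]]

        # Points 1 and 2: trace back from the best final state
        path = [int(np.argmax(score[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1]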
The following describes the technical solution of the present application and how to solve the above technical problems in detail by specific embodiments. These several specific embodiments may be combined with each other below, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The voice recognition method provided by the embodiment of the application can be executed through the terminal device or the server, and can also be executed interactively through the server and the terminal device. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart watch, and the like. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
Optionally, when the speech recognition method provided in the embodiment of the present application is executed by a terminal device, the speech recognition method may be an offline recognition scheme for a smart phone, a tablet computer, a smart television, and the like, that is, the terminal device may obtain speech recognition related data from a cloud in advance based on the speech data to be recognized, and perform server-independent speech recognition depending on a processor and a memory of the terminal device to obtain a recognition result and corresponding word boundary information.
When the voice recognition method provided by the embodiments of the present application is executed interactively by the server and the terminal device, the terminal device sends the voice data to be recognized to the server after acquiring it; the server performs voice recognition, and sends the recognition result and the corresponding word boundary information to the terminal device after obtaining them. This cloud-based speech recognition scheme is applied in scenarios that need to call a speech recognition function, such as smart homes, voice input transcription, vehicle navigation, and smart speakers. It can be packaged as a speech recognition application, or embedded as a speech recognition engine in various applications, providing effective speech recognition support for all kinds of intelligent voice interaction scenarios.
Fig. 1 shows a schematic flowchart of a speech recognition method provided in an embodiment of the present application, where the method may be executed by any electronic device, and the method includes:
step S101, obtaining acoustic characteristic information and language characteristic information of voice data to be recognized.
The to-be-recognized voice data refers to data that needs to be subjected to voice recognition, and may be audio data or audio data in a video, and the source of the to-be-recognized voice data is not limited in this embodiment of the present application, and may be voice data input by a user or voice data received by an electronic device.
Optionally, when the method is implemented by a terminal device, the acquired voice data to be recognized may be voice data acquired by the terminal device based on a voice acquisition device of the terminal device, or may be voice data that has been stored and acquired by the terminal device from a storage medium of the terminal device. When the method is implemented by a server, the voice data to be recognized may be voice data received by the server from the terminal device.
The acoustic feature information and the language feature information of the voice data to be recognized can be obtained based on the acoustic features of the voice data to be recognized. Correspondingly, when the voice data to be recognized is obtained, its acoustic features can be extracted, and the acoustic feature information and the language feature information are determined based on these acoustic features. The acoustic features extracted from the voice data to be recognized may be Fbank features, MFCC features, PLP features, or other types of acoustic features, which is not limited in the embodiments of the present application. Optionally, the acoustic features in the embodiments of the present application may be Fbank features.
In an optional embodiment of the present application, the acquiring acoustic feature information and language feature information of the voice data to be recognized includes:
obtaining an acoustic model score of the voice data to be recognized through an acoustic model, wherein the acoustic characteristic information comprises the acoustic model score, the acoustic model score represents the probability that the voice data to be recognized corresponds to a preset state, and the preset state is a basic composition element of a text;
and acquiring a language model score of the voice data to be recognized through a language model, wherein the language characteristic information comprises the language model score, and the language model score represents the score of the text corresponding to the candidate recognition path of the voice data to be recognized.
The acoustic feature information is used to represent specific information of the speech units (for example, words, syllables, or phonemes) that the voice data to be recognized may correspond to, and the language feature information is used to represent the language structure of the text corresponding to the voice data to be recognized, such as the rules among words and sentences in the text. Optionally, in an optional embodiment of the present application, the acoustic feature information may be characterized by an acoustic model score and the language feature information by a language model score. The acoustic model score characterizes the probability that the voice data to be recognized corresponds to the basic constituent elements of the text, such as the probability that it corresponds to each word, syllable, or phoneme; the language model score characterizes the probability that the text corresponding to a candidate recognition path of the voice data to be recognized forms a complete sentence.
Optionally, when determining the acoustic model score of the to-be-recognized voice data based on its acoustic features, the acoustic features may be input into a preset acoustic model to obtain the acoustic model score; the preset acoustic model may be a CTC model, but may also be an LSTM acoustic model, a CNN-DNN acoustic model, or the like. When determining the language model score of the voice data to be recognized based on its acoustic features, the language model score can be determined according to a preset language model. The preset language model may be an N-gram model, and the embodiments of the present application are not limited thereto.
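As a rough illustration of the two kinds of score, the following sketch accumulates per-frame acoustic log-scores along one candidate state alignment and scores a word sequence with a toy bigram table; the function names, data shapes, and the bigram table standing in for a real N-gram model are assumptions of this sketch, not the application's prescribed implementation.

    def acoustic_path_score(frame_log_posteriors, state_alignment):
        """Acoustic model score of one candidate alignment: the sum of the
        per-frame log-scores of the states it passes through.

        frame_log_posteriors: (num_frames, num_states) array of
                              log P(state | frame) from the acoustic model
        state_alignment:      the state index chosen for each frame
        """
        return float(sum(frame_log_posteriors[t, s]
                         for t, s in enumerate(state_alignment)))

    def bigram_lm_score(words, bigram_logprob, unk=-10.0):
        """Language model score of a word sequence under a toy bigram
        table (a hypothetical stand-in for a real N-gram model).

        bigram_logprob: dict mapping (previous_word, word) -> log-probability
        """
        score, prev = 0.0, "<s>"
        for w in words:
            score += bigram_logprob.get((prev, w), unk)
            prev = w
        return score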
Step S102, according to the acoustic characteristic information and the language characteristic information, determining a target recognition path matched with the voice data to be recognized in the pre-constructed static word graph, and recording word node information of each word node in the candidate recognition path of the voice data to be recognized.
The static word graph is a common representation of speech recognition results; it represents, on a directed acyclic graph, the multiple candidate results obtained by decoding during speech recognition. The static word graph includes nodes and arcs connecting the nodes. Among the nodes are a start node and an end node: the start node represents the starting position when the voice data is decoded based on the static word graph, and the end node represents the ending position. Alternatively, phonemes may be represented by the nodes of the static word graph and the connection relationships between phonemes by the arcs connecting them; a path of the static word graph then consists of nodes and the arcs connecting them, and the words represented along the path form a sentence. When the phoneme represented by a node can form a word by itself, or can form a word together with the phonemes of nodes connected to it by arcs, the node may be referred to as a word node (hereinafter also called a word output node).
For example, a path on the static word graph includes node A and node B, connected by an arc; node A represents the phoneme "w" and node B the phoneme "o". Node A and node B can then form the word "wo" (我, "I"), and node B is the word output node.
Alternatively, in practical applications, the static word graph may also use arcs to represent words, and nodes represent the connection relationship of the words, and each word belongs to a path from the beginning node to the ending node, that is, a path of the static word graph may represent a sentence composed of words represented by arcs.
For example, as shown in fig. 2, a static word graph is shown, and the static word graph has a path S, which includes a start node (i.e., enter), a node a, a node B, a node C, a node D, and an end node (i.e., exit), wherein an arc between the node a and the node B represents "i", an arc between the node B and the node C represents "want", and an arc between the node C and the node D represents "eat", and then the path S represents "i want to eat".
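One possible in-memory representation of such a static word graph, with words on arcs and nodes carrying only connectivity (as in the Fig. 2 example), is sketched below; this application does not prescribe a data layout, so all names and fields here are illustrative assumptions.

    from dataclasses import dataclass, field

    @dataclass
    class Arc:
        word: str   # word represented by this arc, e.g. "i", "want", "eat"
        dest: int   # index of the destination node

    @dataclass
    class Node:
        arcs: list = field(default_factory=list)  # outgoing arcs

    class StaticWordGraph:
        """Words on arcs; nodes encode the connection relationships."""
        def __init__(self):
            self.nodes = [Node(), Node()]  # 0: enter (start), 1: exit (end)

        def add_node(self):
            self.nodes.append(Node())
            return len(self.nodes) - 1

        def add_arc(self, src, dest, word):
            self.nodes[src].arcs.append(Arc(word, dest))

        def sentence(self, path):
            """A path (sequence of node indices) spells out a sentence."""
            words = []
            for src, dest in zip(path, path[1:]):
                arc = next(a for a in self.nodes[src].arcs if a.dest == dest)
                words.append(arc.word)
            return " ".join(words)

Building a path with the arcs "i", "want", "eat" from the Fig. 2 example and calling sentence() on it would return the corresponding sentence, mirroring path S above.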
In an optional embodiment of the present application, for any word node, the word node information includes a word identifier and word boundary information corresponding to the node.
The word identifier is used to identify words that can be formed by word nodes, and the word boundary information refers to time information of the voice data that activates the word node in the voice data to be recognized, for example, the 2 nd frame data activates the word node a, and the word boundary information corresponding to the word node a is the 2 nd frame. The method of recording the word node information of each word node may be configured in advance, for example, a trace module may be added in the process of decoding the voice data to be recognized, the trace module may include each trace unit, and each trace unit records the word node information of one word node.
Optionally, the voice data to be recognized includes at least one frame of voice data. Each frame of voice data may be decoded based on the preset static word graph, yielding a set of active nodes; these active nodes can form at least one candidate recognition path for obtaining the recognition result of the voice data to be recognized. The word output nodes among the active nodes are then determined (for example, an active node having a word identifier is taken as a word output node), and the word identifier and word boundary information corresponding to each word output node (that is, the corresponding word node information) are obtained and recorded.
Optionally, when each candidate recognition path is obtained based on the preset static word graph decoding, a target recognition path for obtaining a recognition result of the to-be-recognized voice data may be obtained from each candidate recognition path by using a viterbi algorithm based on the acoustic model score and the language model score of the to-be-recognized voice data. Optionally, when the target identification path is obtained from each candidate identification path, for each candidate identification path, the acoustic model scores and the language model scores corresponding to each node included in the candidate identification path may be respectively accumulated, then the composite score of the candidate identification path is obtained based on the acoustic model scores and the language model scores obtained through accumulation, and the target identification path is determined from each candidate identification path based on the composite score of each candidate identification path.
In an example, after the voice data to be recognized is decoded according to a preset static word graph, P candidate paths are found in the word graph. The acoustic model scores and the language model scores of the P candidate paths may then be obtained, along with the weight values corresponding to each score. According to these weight values and scores, the acoustic model score and the language model score of each candidate path are combined by weighted summation to determine the composite scores of the P candidate paths; the composite scores are then ranked, and the candidate path with the highest composite score is determined as the target recognition path. The language model score of each candidate path is the sum of the language model scores of the active nodes it includes, and likewise the acoustic model score of each candidate path is the sum of the acoustic model scores of those active nodes; the acoustic model score and the language model score of each active node can be obtained based on the acoustic features corresponding to the voice data to be recognized.
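The weighted summation described in this example might be realized as in the following sketch; the weight values w_am and w_lm are tunable assumptions (the application leaves the weights to be preset), not prescribed constants.

    def composite_score(am_scores, lm_scores, w_am=1.0, w_lm=0.8):
        """Weighted sum of the accumulated acoustic and language model
        scores of one candidate path; the weights are illustrative."""
        return w_am * sum(am_scores) + w_lm * sum(lm_scores)

    def pick_target_path(candidates):
        """candidates: list of (path, am_scores, lm_scores) triples.
        Returns the candidate path with the highest composite score."""
        return max(candidates,
                   key=lambda c: composite_score(c[1], c[2]))[0]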
And step S103, obtaining a recognition result of the voice data to be recognized according to the target recognition path.
Correspondingly, after the target recognition path is obtained, the nodes included in the target recognition path and the words represented by the arcs between the nodes can be determined, and the voice recognition result of the voice data to be recognized is then obtained by combining these words in node order. For example, assume a preset static word graph represents words by arcs and the connection relationships of words by nodes, and the target recognition path includes a start node, node 1, node 2, node 3, and an end node, where the arc between node 1 and node 2 represents "放假" ("vacation") and the arc between node 2 and node 3 represents the particle "了"; the recognition result obtained based on the target recognition path is then "放假了" ("vacation has started").
And step S104, based on the word identifiers of the recognition words in the voice recognition result and the word node information of the word nodes in the candidate recognition path, performing path backtracking according to the target recognition path to obtain the word boundary information of the recognition words in the voice data to be recognized in the voice recognition result.
Optionally, path backtracking may be performed along the target recognition path to determine each word output node it includes, and the word node information corresponding to each of those word output nodes is determined from the recorded word node information. Since the word node information of a word output node includes the word identifier and the word boundary information, the word boundary information of each recognized word of the voice recognition result in the voice data to be recognized can then be determined, according to the word identifier of each recognized word, from the word node information corresponding to the word output nodes included in the target recognition path.
In the embodiments of the present application, in the process of determining the recognition result of the voice data to be recognized, the word node information corresponding to each word node in the candidate recognition paths can be recorded. The voice recognition result is then obtained according to the target recognition path, and path backtracking is performed along the target recognition path based on the word identifier of each recognized word in the voice recognition result and the recorded word node information of each word node, yielding the word boundary information of each recognized word in the voice data to be recognized. Because the word node information of each word node is information about nodes that can be used to obtain the final result, the relation between the voice recognition result and the word boundary information can be determined without reconstructing a recognition network; it can be read directly from the recorded word node information. The word boundary information is thus obtained at the same time as the voice recognition result, which is more convenient, reduces the time consumed in recognizing word boundary information, and improves processing efficiency.
In an optional embodiment of the present application, the method further comprises:
acquiring word information of each recognized word in the recognition result, wherein, for any recognized word, the word information includes at least one of the number of characters contained in the recognized word or the acoustic characteristic information corresponding to the recognized word;
based on the word identifiers of the recognition words in the voice recognition result and the word node information of the word nodes in the candidate recognition path, backtracking is carried out according to the target recognition path, and word boundary information of the recognition words in the voice data to be recognized in the voice recognition result is obtained, and the method comprises the following steps:
backtracking according to the target recognition path based on the word identification of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path to obtain initial word boundary information of each recognition word in the voice data to be recognized in the voice recognition result;
and for each recognition word in the voice recognition result, correcting initial word boundary information of the recognition word in the voice data to be recognized based on the word information of the recognition word to obtain corrected word boundary information.
The word information of each recognized word in the obtained speech recognition result includes at least one of the number of characters included in the recognized word or the acoustic feature information corresponding to the recognized word. Optionally, the acoustic feature information corresponding to the recognition word includes an acoustic model score corresponding to the recognition word, and the acoustic model score corresponding to the recognition word may be obtained based on the acoustic model score of the to-be-recognized voice data, for example, the acoustic model score of the to-be-recognized voice data is 100, the target recognition path includes 5 word nodes, at this time, the 5 word nodes respectively correspond to one acoustic model score, and the sum of the acoustic model scores of the 5 word nodes is 100; for a recognized word in the speech recognition result, the acoustic model score of the recognized word refers to the sum of the acoustic model scores of the word nodes used for obtaining the recognized word.
Optionally, when path backtracking is performed along the target recognition path based on the word identifier of each recognized word and the word node information of each word node, the initial word boundary information of each recognized word in the voice recognition result is obtained; however, this initial word boundary information may be inaccurate. Therefore, in the embodiments of the present application, the initial word boundary information of each recognized word of the voice recognition result in the voice data to be recognized can be corrected, effectively improving the accuracy of the recognized word boundary information.
In an optional embodiment of the present application, for any recognized word, the word information of the recognized word includes the number of characters included in the recognized word and acoustic feature information corresponding to the recognized word, and the initial word boundary information of the recognized word in the to-be-recognized speech data is modified based on the word information of the recognized word to obtain modified word boundary information, including:
acquiring a first weight corresponding to the number of characters and a second weight corresponding to the acoustic characteristic information;
determining a correction amount according to the first weight, the second weight, the number of characters contained in the recognition word, the acoustic characteristic information corresponding to the recognition word and a score threshold value;
and modifying the initial word boundary information of the recognition word in the voice data to be recognized according to the correction quantity to obtain modified word boundary information.
Optionally, for any recognized word included in the recognition result, the number of characters contained in the recognized word and the acoustic feature information corresponding to the recognized word may differ in their importance for determining the word boundary information. In that case, for any recognized word in the voice recognition result, the weights corresponding to the number of characters and to the acoustic feature information may be determined separately; a correction amount is then determined according to these weights, the number of characters contained in the recognized word, the acoustic feature information corresponding to the recognized word, and a score threshold. Correcting the initial word boundary information of the recognized word in the voice data to be recognized based on the determined correction amount can effectively improve the accuracy of the corrected word boundary information. The score threshold and the specific values of the first weight and the second weight may be preset according to empirical and/or experimental values, which is not limited in the embodiments of the present application.
In an optional embodiment of the present application, the determining, by using the first weight, the second weight, the number of characters included in the recognized word, the acoustic feature information corresponding to the recognized word, and a score threshold, a correction amount includes:
for either word, the correction amount is determined by the following expression:
a ═ number of characters k1+ (AMscore-thres) × k2
Wherein, a represents a correction amount, k1 represents a first weight, k2 represents a second weight, AMscore represents an acoustic model score of the recognized word, and thres represents a score threshold;
correcting initial word boundary information of the recognition word in the voice data to be recognized according to the correction amount to obtain corrected word boundary information, wherein the method comprises the following steps:
obtaining corrected word boundary information through the following expression:
the corrected word boundary information is initial word boundary information + a.
Optionally, the acoustic feature information corresponding to the recognized word includes an acoustic model score of the recognized word, and at this time, the second weight corresponding to the acoustic model score is the second weight corresponding to the acoustic model score of the recognized word. Accordingly, when the first weight, the second weight, the number of characters included in the recognition word, the acoustic model score corresponding to the recognition word, and the score threshold are obtained, the correction amount may be obtained based on a preset expression for determining the correction amount, and then the corrected word boundary information may be obtained based on a preset expression for determining the corrected word boundary information.
In an example, assume the recognized words in the voice recognition result are [Zhang San] [loves] [to eat] [Manhan Quanxi] (满汉全席, a four-character dish name). For the recognized word [Manhan Quanxi], assuming its initial word boundary information is 3500 ms and the number of characters it contains is 4, the corrected word boundary information of the recognized word is:

3500 ms + 4 × k1 + (AMscore - thres) × k2
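Transcribed directly into code, the correction might look as follows; the values of k1, k2, and thres are hypothetical placeholders, since the application leaves them to be preset from empirical and/or experimental values.

    def corrected_boundary(initial_boundary_ms, num_chars, am_score,
                           k1=25.0, k2=0.5, thres=100.0):
        """Correction: a = num_chars * k1 + (AMscore - thres) * k2,
        then corrected boundary = initial boundary + a.
        k1, k2, and thres are illustrative values only."""
        a = num_chars * k1 + (am_score - thres) * k2  # correction amount
        return initial_boundary_ms + a

    # The worked example above: a four-character recognized word with an
    # initial boundary of 3500 ms (AMscore of 110 assumed for illustration).
    print(corrected_boundary(3500, 4, am_score=110.0))  # 3500 + 100 + 5 = 3605.0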
in an optional embodiment of the present application, backtracking is performed according to the target recognition path based on the word identifier of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path, so as to obtain word boundary information of each recognition word in the voice data to be recognized in the voice recognition result, where the method includes:
tracing the path to a starting node in the target recognition path forwards according to a tail node in the target recognition path to obtain each target word node corresponding to the target recognition path;
determining word node information corresponding to each target word node from the recorded word node information;
for each target word node, obtaining word boundary information corresponding to the target word node based on the word node information corresponding to the target word node and the word identifier corresponding to the target word node;
and obtaining the word boundary information of each recognition word in the voice data to be recognized in the voice recognition result according to the word boundary information corresponding to each target word node.
Optionally, when obtaining the word boundary information of the recognition result, a backtracking method may be adopted, tracing the target recognition path from its last (tail) node back to its start node. Each word node (also referred to as a word output node) encountered during backtracking is a target word node (hereinafter, target word output node) of the target recognition path. The word node information corresponding to each target word output node is then determined from the word node information of the word output nodes recorded while obtaining the voice recognition result. Correspondingly, for each target word node, its word node information includes the corresponding word identifier and the word boundary information of that word identifier in the voice data to be recognized; and since each recognized word in the voice recognition result was obtained from the word identifier of some target word node, the word identifier of each recognized word is known to be the word identifier corresponding to a target word node. The word boundary information corresponding to each target word node can therefore be obtained, based on its word identifier, from its word node information. Once the word boundary information corresponding to each target word node is obtained, the word boundary information of each recognized word of the voice recognition result in the voice data to be recognized can be considered obtained.
Optionally, in practical applications, the obtained voice recognition result usually includes a tail character (generally an end-of-sentence symbol such as </s>), and the node corresponding to the tail character (i.e., the tail node) and its word node information are known in advance. If the word node information of each word node corresponds to a trace unit, then path backtracking along the target recognition path, based on the word identifier of each recognized word in the voice recognition result and the word node information of each word node in the candidate recognition paths, obtains the word boundary information of each recognized word in the voice data to be recognized in the following way:
a. Determine the word node information corresponding to the tail character according to the node corresponding to the tail character in the voice recognition result (i.e., the tail node) and the trace unit corresponding to that node, and determine the word boundary information corresponding to the tail character and the corresponding phoneme identifier (denoted s1) according to the word identifier of the tail character and the word node information corresponding to the node;
b. According to s1, find the predecessor node of the node corresponding to the tail character in the target recognition path until the predecessor node is a word output node, and obtain the phoneme identifier corresponding to that word output node, denoted sk;
c. Determine the trace unit corresponding to the word output node (sk) in the trace module to obtain the word node information corresponding to the word output node, and then determine the word boundary information corresponding to the word output node (i.e., the word boundary information of a certain recognized word of the voice recognition result in the voice data to be recognized) according to the word identifier corresponding to the word output node (i.e., the identifier of that recognized word) and the corresponding word node information;
d. According to sk, continue to find the predecessor node in the target recognition path that is a word output node, and obtain the corresponding word boundary information according to the word identifier and word node information corresponding to that predecessor node; when the determined predecessor node is empty (that is, the start node of the target recognition path has been reached), go to step e;
e. Return the collected word boundary information corresponding to each predecessor node as the word boundary information of each recognized word of the voice recognition result in the voice data to be recognized.
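Steps a to e might be realized roughly as in the sketch below; the node and trace-unit structures (fields such as predecessor, is_word_node, and the keyed trace dictionary) are assumptions made for this illustration, not structures defined by the application.

    def trace_back_boundaries(tail_node, trace_units, result_words):
        """Walk predecessors from the tail node of the target recognition
        path, collecting word boundary info at each word output node,
        until the start node (predecessor None) is reached (steps a-e).

        tail_node:    nodes are assumed to expose .node_id, .predecessor
                      and .is_word_node
        trace_units:  dict node_id -> recorded word node information, e.g.
                      {"word_id": ..., "phoneme_id": ..., "boundary": ...}
        result_words: dict word_id -> recognized word in the result
        """
        boundaries = []
        node = tail_node
        while node is not None:                  # step d's stop condition
            if node.is_word_node:                # a word output node
                info = trace_units[node.node_id]
                boundaries.append((result_words[info["word_id"]],
                                   info["boundary"]))
            node = node.predecessor              # keep tracing back (steps b, d)
        boundaries.reverse()                     # restore start-to-end order
        return boundaries                        # step e: return collected info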
In an optional embodiment of the present application, the word node information further includes a phoneme identifier corresponding to the word node; obtaining word boundary information corresponding to the target word node based on the word node information corresponding to the target word node and the word identifier corresponding to the target word node, including:
and for each target word node, obtaining word boundary information corresponding to the target word node according to the word identifier, the phoneme identifier and the word node information corresponding to the target word node.
Optionally, in practical applications, when the voice data to be recognized is decoded based on the preset static word graph, voice data of different frames may correspond to the same active node. If that active node is a word output node, its word node information will be recorded multiple times, but the phoneme identifier and the word boundary information in the word node information recorded each time differ.
Optionally, although the word node information of the same active node may be recorded multiple times, since the phoneme identifier and the word boundary information differ in each recording, the word boundary information corresponding to a target word node can be determined from the word node information according to both the word identifier and the phoneme identifier corresponding to that target word node. This effectively avoids the situation where multiple pieces of word boundary information exist for the same word identifier corresponding to a target word node.
In an alternative embodiment of the present application, each word node is obtained by:
decoding each frame of voice data in the voice data to be recognized according to a preset static word graph, and recording each active node corresponding to each frame of voice data;
and taking each active node corresponding to each frame of voice data as each word node in the candidate recognition path.
In practical applications, when performing speech recognition on voice data to be recognized, it is usually necessary to frame the voice data before recognition starts, that is, to cut it into small segments, each called a frame of voice data. Each frame is usually 25 ms long, and adjacent frames overlap; for example, with a frame shift of 10 ms, every two adjacent frames overlap by 15 ms.
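A minimal framing sketch under these assumptions (25 ms frames, 10 ms shift, hence 15 ms overlap; the values are common defaults, not mandated by the application):

    import numpy as np

    def frame_signal(signal, sr=16000, frame_ms=25, shift_ms=10):
        """Cut a waveform into overlapping frames; with 25 ms frames taken
        every 10 ms, adjacent frames overlap by 15 ms (illustrative values)."""
        frame_len = int(sr * frame_ms / 1000)  # 400 samples at 16 kHz
        shift = int(sr * shift_ms / 1000)      # 160 samples at 16 kHz
        num_frames = 1 + max(0, (len(signal) - frame_len) // shift)
        return np.stack([signal[i * shift : i * shift + frame_len]
                         for i in range(num_frames)])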
Optionally, when speech recognition is performed on the voice data to be recognized, each frame of voice data may be decoded based on the preset static word graph. For each frame, decoding yields the active nodes corresponding to that frame in the preset static word graph; each active node obtained by decoding can then be used as a word node, and the word identifier, phoneme identifier, and word boundary information corresponding to each word node are recorded to obtain the relevant information of each word node.
In an optional embodiment of the present application, taking each active node corresponding to each frame of voice data as each word node in a candidate recognition path includes:
for any active node, when the active node has a corresponding word identifier, taking the active node as a word node in the candidate recognition path.
Optionally, among the active nodes obtained by decoding, some may not have a corresponding word identifier; such active nodes can be filtered out. That is, when an active node has a corresponding word identifier, the active node is taken as a word node, and the word identifier, phoneme identifier, and word boundary information corresponding to the active node are recorded to obtain the word node information, as sketched below.
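A sketch of such frame-by-frame recording, filtering out active nodes without a word identifier; the node fields and the trace-unit dictionary are assumptions of this illustration.

    def record_word_nodes(active_nodes_per_frame, trace_units):
        """For each frame's active nodes, keep only those that have a word
        identifier and record their word node information as trace units.

        active_nodes_per_frame: per-frame lists of nodes assumed to expose
        .node_id, .word_id (None if absent) and .phoneme_id.
        """
        for frame_idx, active_nodes in enumerate(active_nodes_per_frame):
            for node in active_nodes:
                if node.word_id is None:
                    continue                   # no word identifier: filter out
                trace_units.setdefault(node.node_id, []).append({
                    "word_id": node.word_id,
                    "phoneme_id": node.phoneme_id,
                    "boundary": frame_idx,     # frame that activated the node
                })
        return trace_units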
Optionally, in practical applications, the speech recognition method provided in the embodiments of the present application may be used as a logic module for speech recognition and integrated into an online recognition system, for example, in products such as smart television services and smart homes. To better understand the method provided in the embodiments of the present application, it is described in detail below with reference to Fig. 3. In this example, the speech recognition method is integrated as a logic module into the online recognition system of a terminal device (such as a smartphone or a smart television); the terminal device performs speech recognition on the acquired voice data to be recognized and obtains the corresponding recognition result and word boundary information. An acoustic model score is used to characterize the acoustic feature information, a language model score is used to characterize the language feature information, the word graph is a static word graph, and the acoustic feature information corresponding to a recognized word is the acoustic model score of that word. The method specifically comprises the following steps:
step S301, acquiring voice data to be recognized;
optionally, the speech data to be recognized may be speech data acquired by the terminal device through its own voice acquisition device, or speech data that has been stored in and is retrieved from the terminal device's own storage medium; for example, the speech data to be recognized acquired by the terminal device through its own voice acquisition device is the utterance "zhang san eat full of chinese".
Step S302, extracting acoustic features of voice data to be recognized (feature extraction);
alternatively, the acoustic feature extracted from the speech data to be recognized may be an Fbank (log mel filter-bank) feature.
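A minimal sketch of Fbank extraction is given below, assuming 16 kHz audio and the 25 ms / 10 ms framing described earlier; the choice of librosa and the specific parameter values are illustrative, not mandated by the patent:

```python
import librosa
import numpy as np

def extract_fbank(wav_path: str, n_mels: int = 40) -> np.ndarray:
    """Compute log mel filter-bank (Fbank) features.

    Uses a 25 ms window (400 samples at 16 kHz) and a 10 ms
    hop (160 samples), matching the framing described above.
    """
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    fbank = np.log(mel + 1e-10)  # log compression
    return fbank.T               # shape: (num_frames, n_mels)
```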
Step S303, determining an acoustic model score of the voice data to be recognized (namely calculating the acoustic model score) based on the acoustic characteristics of the voice data to be recognized;
optionally, the acoustic features of the speech data to be recognized may be input into a preset acoustic model to obtain the acoustic model score of the speech data to be recognized, where the acoustic model may be a CTC model, an LSTM acoustic model, a CNN-DNN acoustic model, or the like.
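As one hedged illustration of how a sentence-level acoustic score could be derived from such a model's outputs, the sketch below converts per-frame logits into log-probabilities and sums them along a given state alignment; both the logits and the alignment are assumed inputs, since the patent does not fix a particular model:

```python
import numpy as np

def acoustic_score(frame_logits: np.ndarray, state_alignment: list) -> float:
    """Sum per-frame log-probabilities along a given state alignment.

    frame_logits:     array of shape (num_frames, num_states) with raw
                      scores from an acoustic model (CTC, LSTM, CNN-DNN, ...).
    state_alignment:  one state index per frame.
    """
    # numerically stable log-softmax over the state axis
    m = frame_logits.max(axis=1, keepdims=True)
    log_probs = frame_logits - m - np.log(
        np.exp(frame_logits - m).sum(axis=1, keepdims=True))
    return float(sum(log_probs[t, s] for t, s in enumerate(state_alignment)))
```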
Step S304, determining the language model score of the voice data to be recognized (namely calculating the language model score) based on the acoustic characteristics of the voice data to be recognized;
optionally, the acoustic features of the speech data to be recognized may be input into a preset language model to obtain a language model score of the speech data to be recognized, where the language model may be an N-gram model or the like.
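For reference, the following is a minimal sketch of how an N-gram (here, bigram) language model can score a candidate word sequence; the probability table and the fallback value are stand-in assumptions, as the patent does not specify the model's internals:

```python
def bigram_lm_score(words: list, bigram_logprob: dict,
                    unk_logprob: float = -10.0) -> float:
    """Sum bigram log-probabilities over a word sequence.

    bigram_logprob maps (previous_word, word) pairs to
    log P(word | previous_word); unseen pairs fall back to unk_logprob.
    """
    score = 0.0
    prev = "<s>"  # sentence-start symbol
    for w in words:
        score += bigram_logprob.get((prev, w), unk_logprob)
        prev = w
    return score

# usage sketch: bigram_lm_score(["zhang", "san"], {("<s>", "zhang"): -1.2})
```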
Step S305, decoding the voice data to be recognized according to a preset word graph, and recording word node information of each word output node (namely recording word node information);
optionally, when performing speech recognition on the speech data to be recognized, each frame of speech data is decoded based on the preset word graph to obtain the active nodes corresponding to each frame in the preset word graph; word nodes are then determined from these active nodes, and the word identifiers, phoneme identifiers, and boundary information corresponding to each word node are recorded to obtain the word node information of each word node.
Step S306, determining a target recognition path of the voice data to be recognized (namely determining the target recognition path) according to the acoustic model score, the language model score and the preset word graph;
optionally, after the candidate recognition paths are obtained by decoding based on the preset word graph, a target recognition path may be selected from the candidate recognition paths by using the Viterbi algorithm, based on the acoustic model score and the language model score of the speech data to be recognized; the target recognition path may then be used to obtain the speech recognition result of the speech data to be recognized.
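A simplified sketch of this selection step follows: each candidate path carries an accumulated acoustic score and language score, and the best combined score wins. In a real decoder the Viterbi search prunes hypotheses frame by frame inside the word graph; the flat list of finished candidates and the interpolation weights here are simplifying assumptions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CandidatePath:
    word_ids: List[int]
    am_score: float  # accumulated acoustic model score along the path
    lm_score: float  # accumulated language model score along the path

def select_target_path(candidates: List[CandidatePath],
                       am_weight: float = 1.0,
                       lm_weight: float = 0.5) -> CandidatePath:
    """Pick the candidate recognition path with the best combined score."""
    return max(candidates,
               key=lambda p: am_weight * p.am_score + lm_weight * p.lm_score)
```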
Step S307, obtaining a voice recognition result of the voice data to be recognized (namely, determining a recognition result) according to the target recognition path;
optionally, after the target recognition path is obtained, the nodes contained in the target recognition path and the words represented by the arcs between those nodes may be determined, and the represented words are then combined according to the order of the nodes to obtain the speech recognition result of the speech data to be recognized, such as the text "zhang san eat full chinese chairman".
Step S308, performing path backtracking according to the target recognition path based on the word identifiers of the recognition words in the speech recognition result and the word node information in the candidate recognition path, to obtain initial word boundary information of the recognition words in the speech recognition result in the speech data to be recognized (namely determining initial word boundary information);
optionally, path backtracking may be performed along the target recognition path to determine each word output node contained in it; the word node information corresponding to those word output nodes is then looked up in the recorded word node information, and the word boundary information matching the word identifier and the phoneme identifier of each recognition word in the speech recognition result is taken as the initial word boundary information of that recognition word. For example, for the speech recognition result above, the initial word boundary information of "zhang san" in the speech data to be recognized corresponds to 1400 milliseconds, the initial word boundary information of "eat" corresponds to 2100 milliseconds, and the initial word boundary information of "full-chinese position" corresponds to 4900 milliseconds.
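The backtracking can be pictured as walking parent links from the tail node of the target path back to its start node and collecting the recorded word node information along the way; the node structure with a parent pointer and the lookup table keyed by node identifier are assumptions used for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class PathNode:
    node_id: int
    word_id: Optional[int]        # None for nodes that emit no word
    parent: Optional["PathNode"]  # link toward the start of the path

def backtrack_boundaries(tail: PathNode,
                         boundary_by_node: Dict[int, float]
                         ) -> List[Tuple[int, float]]:
    """Walk from the tail node to the start node and collect
    (word_id, boundary_ms) pairs in utterance order.

    boundary_by_node maps a node_id to its recorded boundary in ms.
    """
    boundaries = []
    node = tail
    while node is not None:
        if node.word_id is not None and node.node_id in boundary_by_node:
            boundaries.append((node.word_id, boundary_by_node[node.node_id]))
        node = node.parent
    boundaries.reverse()          # backtracking visits words last-to-first
    return boundaries
```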
Step S309, for each recognition word contained in the voice recognition result, correcting the initial word boundary information of the recognition word based on the word information of the recognition word to obtain the corrected word boundary information (i.e. adjusting the initial word boundary information);
optionally, when the word information includes the number of characters contained in the recognition word and the acoustic model score corresponding to the recognition word, a correction amount may be determined for each recognition word in the speech recognition result based on its word information and a preset expression, and the initial word boundary information of the recognition word is then modified by the correction amount to obtain the corrected word boundary information. For example, for "full-chinese position" in the speech recognition result, the initial word boundary information is 4900 milliseconds; after modification by the correction amount, the corrected word boundary information is 5000 milliseconds.
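A direct sketch of the correction expression that appears later in the claims (A = number of characters × k1 + (AMscore - thres) × k2, with the corrected boundary equal to the initial boundary plus A) might look as follows; the weight and threshold values are placeholders chosen so that the 4900 ms to 5000 ms example above works out:

```python
def correct_boundary(initial_boundary_ms: float,
                     num_characters: int,
                     am_score: float,
                     k1: float = 20.0,   # placeholder first weight (per character)
                     k2: float = 5.0,    # placeholder second weight
                     thres: float = 0.0  # placeholder score threshold
                     ) -> float:
    """Apply the claimed correction: A = num_characters*k1 + (am_score - thres)*k2."""
    correction = num_characters * k1 + (am_score - thres) * k2
    return initial_boundary_ms + correction

# e.g. a 3-character word at 4900 ms with am_score 8.0:
# correct_boundary(4900, 3, 8.0) -> 4900 + 3*20 + 8*5 = 5000 ms
```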
Step S310, returning the speech recognition result of the speech data to be recognized and the word boundary information of each recognition word in the speech recognition result in the speech data to be recognized (namely, returning the result).
Optionally, if the terminal device includes a display device, when the speech recognition result of the speech data to be recognized and the word boundary information of each recognition word are obtained, the speech recognition result and the word boundary information may be displayed to the user through the display device.
It should be understood that the execution sequence of the steps above is only an example, and the execution sequence of the steps in the embodiment of the present application is not limited. For example, step S304 and step S305 may be executed simultaneously, or step S305 may be executed first and then step S304.
An embodiment of the present application provides a speech recognition apparatus, and as shown in fig. 4, the speech recognition apparatus 60 may include: a feature acquisition module 601, a related information determination module 602, a recognition result determination module 603, and a word boundary information determination module 604, wherein,
the feature obtaining module 601 is configured to obtain acoustic feature information and language feature information of the voice data to be recognized;
a related information determining module 602, configured to determine, according to the acoustic feature information and the language feature information, a target recognition path matched with the to-be-recognized speech data in the pre-constructed static word graph, and record word node information of each word node in a candidate recognition path of the to-be-recognized speech data;
a recognition result determining module 603, configured to obtain a recognition result of the to-be-recognized speech data according to the target recognition path;
and the word boundary information determining module 604 is configured to perform backtracking according to the target recognition path based on the word identifier of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path, so as to obtain word boundary information of each recognition word in the voice data to be recognized in the voice recognition result.
Optionally, for any word node, the word node information includes a word identifier and word boundary information corresponding to the node.
Optionally, the apparatus further includes a word information obtaining module, configured to:
acquiring word information of each recognition word in the recognition result, wherein for any word, the word information comprises at least one of the number of characters contained in the recognition word or acoustic characteristic information corresponding to the recognition word;
the word boundary information determining module is specifically configured to, when obtaining word boundary information of each recognition word in the speech recognition result in the speech data to be recognized, perform backtracking according to the target recognition path based on the word identifier of each recognition word in the speech recognition result and the word node information of each word node in the candidate recognition path:
backtracking according to the target recognition path based on the word identification of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path to obtain initial word boundary information of each recognition word in the voice data to be recognized in the voice recognition result;
and for each recognition word in the voice recognition result, correcting the initial word boundary information of the recognition word in the voice data to be recognized based on the word information of the recognition word to obtain the corrected word boundary information.
Optionally, for any recognition word, the word information of the recognition word includes the number of characters contained in the recognition word and the acoustic feature information corresponding to the recognition word, and the word boundary information determining module is specifically configured to, when modifying the initial word boundary information of the recognition word in the speech data to be recognized based on the word information of the recognition word to obtain the modified word boundary information:
acquiring a first weight corresponding to the number of characters and a second weight corresponding to the acoustic feature information;
determining the correction amount according to the first weight, the second weight, the number of characters contained in the recognition word, the acoustic feature information corresponding to the recognition word and a score threshold value;
and modifying the initial word boundary information of the recognized word in the voice data to be recognized according to the correction amount to obtain modified word boundary information.
Optionally, when acquiring the acoustic feature information and the language feature information of the speech data to be recognized, the feature acquisition module is specifically configured to:
obtaining an acoustic model score of the voice data to be recognized through an acoustic model, wherein the acoustic characteristic information comprises the acoustic model score, the acoustic model score represents the probability that the voice data to be recognized corresponds to a preset state, and the preset state is a basic composition element of a text;
and acquiring a language model score of the voice data to be recognized through a language model, wherein the language characteristic information comprises the language model score, and the language model score represents the score of the text corresponding to the candidate recognition path of the voice data to be recognized.
Optionally, the acoustic feature information corresponding to the recognition word includes the acoustic model score of the recognition word, and the word boundary information determining module is specifically configured to, when determining the correction amount according to the first weight, the second weight, the number of characters contained in the recognition word, the acoustic feature information corresponding to the recognition word, and the score threshold:
for any recognition word, the correction amount is determined by the following expression:
A = number of characters × k1 + (AMscore - thres) × k2
where A represents the correction amount, k1 represents the first weight, k2 represents the second weight, AMscore represents the acoustic model score of the recognition word, and thres represents the score threshold;
when the word boundary information determining module corrects the initial word boundary information of the recognition word in the speech data to be recognized according to the correction amount to obtain the corrected word boundary information, the corrected word boundary information is obtained specifically according to the following expression:
corrected word boundary information = initial word boundary information + A.
Optionally, the word boundary information determining module is specifically configured to, when obtaining the word boundary information of each recognition word in the speech recognition result in the speech data to be recognized, perform backtracking according to the target recognition path based on the word identifier of each recognition word in the speech recognition result and the word node information of each word node in the candidate recognition path:
tracing the path to a starting node in the target recognition path forwards according to a tail node in the target recognition path to obtain each target word node corresponding to the target recognition path;
determining word node information corresponding to each target word node from the recorded word node information;
for each target word node, obtaining word boundary information corresponding to the target word node based on the word node information corresponding to the target word node and the word identifier corresponding to the target word node;
and obtaining word boundary information of each recognition word in the voice data to be recognized in the voice recognition result according to the word boundary information corresponding to each target word node.
Optionally, the word node information further includes a phoneme identifier corresponding to the word node; for each target word node, the word boundary information determining module is specifically configured to, when obtaining word boundary information corresponding to the target word node based on the word node information corresponding to the target word node and the word identifier corresponding to the target word node:
and for each target word node, obtaining word boundary information corresponding to the target word node according to the word identifier, the phoneme identifier and the word node information corresponding to the target word node.
Optionally, each word node is obtained by the following method:
decoding each frame of voice data in the voice data to be recognized according to the pre-constructed static word graph, and recording each active node corresponding to each frame of voice data;
and taking each active node corresponding to each frame of voice data as each word node in the candidate recognition path.
Optionally, when taking each active node corresponding to each frame of voice data as each word node in the candidate recognition path, the apparatus is specifically configured to:
and for any active node, when the corresponding word identification exists in the active node, taking the active node as a word node in the candidate recognition path.
The speech recognition apparatus provided in the embodiment of the present application can execute the speech recognition method provided in the embodiment of the present application; the implementation principles are similar and are not described here again.
An embodiment of the present application provides an electronic device, as shown in fig. 5, an electronic device 2000 shown in fig. 5 includes: a processor 2001 and a memory 2003. Wherein the processor 2001 is coupled to the memory 2003, such as via bus 2002. Optionally, the electronic device 2000 may also include a transceiver 2004. It should be noted that the transceiver 2004 is not limited to one in practical applications, and the structure of the electronic device 2000 is not limited to the embodiment of the present application.
The processor 2001 is applied in the embodiment of the present application to implement the functions of the modules shown in fig. 4.
The processor 2001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 2001 may also be a combination of devices implementing computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 2002 may include a path that conveys information between the aforementioned components. The bus 2002 may be a PCI bus or an EISA bus, etc. The bus 2002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
The memory 2003 may be, but is not limited to, a ROM or another type of static storage device that can store static information and computer programs, a RAM or another type of dynamic storage device that can store information and computer programs, an EEPROM, a CD-ROM or other optical disc storage (including compact disc, laser disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer.
The memory 2003 is used for storing computer programs for executing the application programs of the present scheme and is controlled in execution by the processor 2001. The processor 2001 is used to execute a computer program of an application program stored in the memory 2003 to realize the actions of the voice recognition apparatus provided by the embodiment shown in fig. 4.
An embodiment of the present application provides an electronic device, including a processor and a memory: the memory is configured to store a computer program which, when executed by the processor, causes the processor to perform any of the methods of the above embodiments.
The present application provides a computer-readable storage medium for storing a computer program, which, when run on a computer, enables the computer to execute any one of the above-mentioned methods.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations described above.
The terms and implementation principles related to a computer-readable storage medium in the present application may specifically refer to a speech recognition method in the embodiment of the present application, and are not described herein again.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not performed in a strictly limited order and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The foregoing describes only some embodiments of the present application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be regarded as falling within the protection scope of the present application.

Claims (11)

1. A speech recognition method, comprising:
acquiring acoustic characteristic information and language characteristic information of voice data to be recognized;
determining a target recognition path matched with the voice data to be recognized in a pre-constructed static word graph according to the acoustic characteristic information and the language characteristic information, and recording word node information of each word node in a candidate recognition path of the voice data to be recognized, wherein the word node information comprises word identifiers, phoneme identifiers and word boundary information corresponding to the nodes;
obtaining a voice recognition result of the voice data to be recognized according to the target recognition path;
and backtracking according to the target recognition path based on the word identification and the phoneme identification of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path to obtain the word boundary information of each recognition word in the voice data to be recognized in the voice recognition result.
2. The method of claim 1, further comprising:
acquiring word information of each recognition word in the recognition result, wherein for any word, the word information comprises at least one item of the number of characters contained in the recognition word or acoustic characteristic information corresponding to the recognition word;
the obtaining word boundary information of each recognition word in the voice recognition result in the voice data to be recognized based on the word identifier and the phoneme identifier of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path by backtracking according to the target recognition path comprises:
backtracking according to the target recognition path based on the word identifier and the phoneme identifier of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path to obtain initial word boundary information of each recognition word in the voice recognition result in the voice data to be recognized;
and for each recognition word in the voice recognition result, correcting initial word boundary information of the recognition word in the voice data to be recognized based on the word information of the recognition word to obtain corrected word boundary information.
3. The method according to claim 2, wherein for any recognized word, the word information of the recognized word includes the number of characters included in the recognized word and acoustic feature information corresponding to the recognized word, and the modifying initial word boundary information of the recognized word in the speech data to be recognized based on the word information of the recognized word to obtain modified word boundary information includes:
acquiring a first weight corresponding to the number of characters and a second weight corresponding to the acoustic characteristic information;
determining a correction amount according to the first weight, the second weight, the number of characters contained in the recognition word, the acoustic feature information corresponding to the recognition word and a score threshold value;
and modifying the initial word boundary information of the recognized word in the voice data to be recognized according to the correction amount to obtain modified word boundary information.
4. The method according to any one of claims 1 to 3, wherein the obtaining of the acoustic feature information and the language feature information of the voice data to be recognized comprises:
obtaining an acoustic model score of the voice data to be recognized through an acoustic model, wherein the acoustic feature information comprises the acoustic model score, the acoustic model score represents the probability that the voice data to be recognized corresponds to a preset state, and the preset state is a basic composition element of a text;
and acquiring a language model score of the voice data to be recognized through a language model, wherein the language feature information comprises the language model score, and the language model score represents the score of the text corresponding to the candidate recognition path of the voice data to be recognized.
5. The method according to claim 3, wherein the acoustic feature information corresponding to the recognition word includes an acoustic model score of the recognition word, and the determining the correction amount based on the first weight, the second weight, the number of characters included in the recognition word, the acoustic feature information corresponding to the recognition word, and a score threshold value includes:
for either word, the correction amount is determined by the following expression:
A = number of characters × k1 + (AMscore - thres) × k2
wherein A represents the correction amount, k1 represents the first weight, k2 represents the second weight, AMscore represents the acoustic model score of the recognition word, and thres represents the score threshold;
correcting the initial word boundary information of the recognition word in the voice data to be recognized according to the correction amount to obtain corrected word boundary information, wherein the correction method comprises the following steps:
obtaining corrected word boundary information through the following expression:
corrected word boundary information = initial word boundary information + A.
6. The method according to claim 1, wherein the obtaining word boundary information of each recognition word in the speech recognition result in the speech data to be recognized by performing backtracking according to the target recognition path based on the word identifier and the phoneme identifier of each recognition word in the speech recognition result and the word node information of each word node in the candidate recognition path comprises:
according to the tail node in the target recognition path, carrying out path backtracking to the start node in the target recognition path to obtain each target word node corresponding to the target recognition path;
determining word node information corresponding to each target word node from each recorded word node information;
for each target word node, obtaining word boundary information corresponding to the target word node based on the word node information corresponding to the target word node and the word identification and phoneme identification corresponding to the target word node;
and obtaining word boundary information of each recognition word in the voice data to be recognized in the voice recognition result according to the word boundary information corresponding to each target word node.
7. The method of claim 1, wherein each of the word nodes is obtained by:
decoding each frame of voice data in the voice data to be recognized according to the pre-constructed static word graph, and recording each active node corresponding to each frame of voice data;
and taking each active node corresponding to each frame of voice data as each word node in the candidate recognition path.
8. The method according to claim 7, wherein the using active nodes corresponding to each frame of voice data as word nodes in the candidate recognition path includes:
and for any active node, when the active node has a corresponding word identifier, taking the active node as a word node in the candidate recognition path.
9. A speech recognition apparatus, comprising:
the characteristic acquisition module is used for acquiring acoustic characteristic information and language characteristic information of the voice data to be recognized;
a relevant information determining module, configured to determine, according to the acoustic feature information and the language feature information, a target recognition path matched with the to-be-recognized speech data in a pre-constructed static word graph, and record word node information of each word node in a candidate recognition path of the to-be-recognized speech data, where the word node information includes a word identifier, a phoneme identifier, and word boundary information corresponding to the node;
the recognition result determining module is used for obtaining a recognition result of the voice data to be recognized according to the target recognition path;
and the word boundary information determining module is used for backtracking according to the target recognition path based on the word identifier and the phoneme identifier of each recognition word in the voice recognition result and the word node information of each word node in the candidate recognition path to obtain the word boundary information of each recognition word in the voice data to be recognized in the voice recognition result.
10. An electronic device, comprising a processor and a memory:
the memory is configured to store a computer program which, when executed by the processor, causes the processor to perform the method of any one of claims 1-8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store a computer program which, when run on a computer, enables the computer to perform the method of any of claims 1-8.
CN202011205969.4A 2020-11-02 2020-11-02 Voice recognition method and device, electronic equipment and readable storage medium Active CN112397053B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011205969.4A CN112397053B (en) 2020-11-02 2020-11-02 Voice recognition method and device, electronic equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN112397053A CN112397053A (en) 2021-02-23
CN112397053B true CN112397053B (en) 2022-09-06

Family

ID=74597701

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011205969.4A Active CN112397053B (en) 2020-11-02 2020-11-02 Voice recognition method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112397053B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333774B (en) * 2021-12-15 2024-02-23 腾讯科技(深圳)有限公司 Speech recognition method, device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03102581A (en) * 1989-09-18 1991-04-26 Sharp Corp Recognized result deciding device
CN1455387A (en) * 2002-11-15 2003-11-12 中国科学院声学研究所 Rapid decoding method for voice identifying system
JP2016099515A (en) * 2014-11-21 2016-05-30 日本放送協会 Voice recognition error correction device
CN111640423A (en) * 2020-05-29 2020-09-08 北京声智科技有限公司 Word boundary estimation method and device and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001249684A (en) * 2000-03-02 2001-09-14 Sony Corp Device and method for recognizing speech, and recording medium


Also Published As

Publication number Publication date
CN112397053A (en) 2021-02-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40038723

Country of ref document: HK

GR01 Patent grant