CN109727603B - Voice processing method and device, user equipment and storage medium

Voice processing method and device, user equipment and storage medium

Info

Publication number
CN109727603B
CN109727603B (application CN201811467944.4A)
Authority
CN
China
Prior art keywords
decoding
decoding path
acoustic
score
bifurcation point
Prior art date
Legal status
Active
Application number
CN201811467944.4A
Other languages
Chinese (zh)
Other versions
CN109727603A (en)
Inventor
邵俊尧 (Shao Junyao)
钱胜 (Qian Sheng)
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201811467944.4A
Publication of CN109727603A
Application granted
Publication of CN109727603B
Legal status: Active

Landscapes

  • Error Detection And Correction (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of information processing technology, and discloses a voice processing method, a voice processing device, user equipment and a storage medium, which solve the problem that the decoding process relies only on preceding context and cannot effectively exploit following context. The method comprises the following steps: acquiring voice data; matching, according to an acoustic model, an acoustic score and a decoding path for each syllable datum in the voice data; when Viterbi decoding reaches a bifurcation point of a decoding path and the syllable data after the bifurcation point is to be matched against a language model, pruning the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point; matching the syllable data on the pruned decoding paths against the language model to obtain language scores; and performing Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths. The embodiment of the invention is suitable for processing voice data.

Description

Voice processing method and device, user equipment and storage medium
Technical Field
The present invention relates to the field of information processing technology, and in particular to a voice processing method and apparatus, user equipment, and a storage medium.
Background
A traditional speech recognition system comprises a decoder, a language model and an acoustic model. After a voice signal is obtained, a plurality of decoding paths is built in a decoding space; the voice signal traverses each decoding path and is scored against the acoustic model and the language model, yielding an acoustic model score and a language model score for each decoding path; an optimal decoding path is then determined based on these scores, and the final recognition result is output according to the optimal decoding path.
Viterbi decoding is usually accompanied by a pruning algorithm that cuts away some low-scoring paths to speed up decoding. The mainstream prior-art pruning algorithms are beam pruning and histogram pruning. In beam pruning, after a decoding path branches, the highest total score (acoustic score plus language score) over all paths is determined, and the paths whose scores fall below that highest score by more than a preset range are cut. In histogram pruning, after the total scores of all paths following a branch are determined, a preset number of the highest-scoring paths is retained and the remaining paths are cut. In both pruning methods, paths on which a large amount of computation has already been spent are pruned because of low scores, so the earlier computation is wasted.
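For illustration, these two prior-art strategies can be sketched as follows. This is a minimal sketch in Python under our own naming (Path, beam_prune, histogram_prune are not from the patent), assuming each active path already carries a combined acoustic-plus-language score:

```python
from dataclasses import dataclass

@dataclass
class Path:
    syllables: list    # syllable data decoded so far
    total_score: float # acoustic score + language score

def beam_prune(paths, beam):
    # Keep only paths whose total score is within `beam` of the best path.
    best = max(p.total_score for p in paths)
    return [p for p in paths if p.total_score >= best - beam]

def histogram_prune(paths, max_paths):
    # Keep only the `max_paths` highest-scoring paths, cut the rest.
    return sorted(paths, key=lambda p: p.total_score, reverse=True)[:max_paths]
```

In both sketches, the language score has already been computed for every path before any of them is cut, which is exactly the wasted computation the patent targets.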
Disclosure of Invention
The invention aims to solve the problem that, during decoding, the prior art depends only on preceding context and cannot effectively exploit following context. It provides a voice processing method, a device, user equipment and a storage medium, so that following context is better utilized, unnecessary language model scoring is effectively reduced, the decoding speed is increased, and the efficiency of voice data processing is improved.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a speech processing method, where the method includes: acquiring voice data; matching, according to a pre-established acoustic model, an acoustic score corresponding to each syllable datum in the voice data and the decoding path on which it lies; when Viterbi decoding reaches a bifurcation point of the decoding path and the syllable data after the bifurcation point is to be matched against a pre-established language model, pruning the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point; matching the syllable data on the pruned decoding paths against the language model to obtain language scores; and performing Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths.
Further, the pruning of the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point includes: pruning the decoding paths after the bifurcation point according to the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point.
Further, this pruning includes: comparing the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point to determine the highest acoustic score; and pruning each decoding path whose first syllable datum has an acoustic score below the highest acoustic score by more than a preset range.
Further, the performing of Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths includes: selecting an optimal decoding path according to those scores and outputting it as the recognition result of the voice data.
An embodiment of a second aspect of the present invention provides a speech processing apparatus, including: an acquisition unit configured to acquire voice data; an acoustic matching unit configured to match, according to a pre-established acoustic model, an acoustic score corresponding to each syllable datum in the voice data and the decoding path on which it lies; a pruning unit configured to, when Viterbi decoding reaches a bifurcation point of the decoding path and the syllable data after the bifurcation point is to be matched against a pre-established language model, prune the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point; a language matching unit configured to match the syllable data on the pruned decoding paths against the language model to obtain language scores; and a decoding unit configured to perform Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths.
Further, the pruning unit is further configured to prune the decoding paths after the bifurcation point according to the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point.
Further, the pruning unit is further configured to compare the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point, determine the highest acoustic score, and prune each decoding path whose first syllable datum has an acoustic score below the highest acoustic score by more than a preset range.
Further, the decoding unit is further configured to select an optimal decoding path according to the acoustic scores and language scores on the pruned decoding paths, and to output the optimal decoding path as the recognition result of the voice data.
An embodiment of a third aspect of the present invention provides user equipment, which includes a microphone, a processor, a memory, and a computer program stored in the memory and executable on the processor, where the microphone is used to acquire a voice signal, and the processor implements the voice processing method described above when executing the program.
An embodiment of a fourth aspect of the present invention provides a storage medium having stored therein instructions that, when run on a computer, cause the computer to perform the speech processing method described above.
By the above technical solution, pruning is performed in advance during Viterbi decoding using the acoustic scores of the following context, which effectively reduces unnecessary language model scoring, increases the decoding speed, and improves the efficiency of voice data processing.
Drawings
Fig. 1 is a schematic flowchart of a speech processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a decoding network according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of another decoding network according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples are given by way of illustration and explanation only, and do not limit the present invention.
Fig. 1 is a flowchart illustrating a speech processing method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
step 101, acquiring voice data;
step 102, matching, according to a pre-established acoustic model, an acoustic score corresponding to each syllable datum in the voice data and the decoding path on which it lies;
step 103, when Viterbi decoding reaches a bifurcation point of a decoding path and the syllable data after the bifurcation point is to be matched against a pre-established language model, pruning the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point;
step 104, matching the syllable data on the pruned decoding paths against the language model to obtain language scores; and
step 105, performing Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths.
In an embodiment of the present invention, the received voice signal is first preprocessed, for example by front-end processing and feature extraction, to obtain the voice data corresponding to the signal; the voice data is then matched against the acoustic model and the language model, and finally the recognition result of the voice data is obtained.
Front-end processing means that, before feature extraction, the received voice signal is generally processed first to eliminate, as far as possible, the influence of noise and of different speakers, so that the processed signal better reflects the essential features of the speech. The most common front-end processes are endpoint detection and speech enhancement. Endpoint detection distinguishes speech from non-speech segments in the signal so as to accurately determine the starting point of the speech. After endpoint detection, subsequent processing can be applied to the speech segment only, which plays an important role in improving model accuracy and recognition accuracy. The main task of speech enhancement is to eliminate the effect of ambient noise on the speech.
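As a concrete illustration of endpoint detection, the sketch below implements a toy frame-energy detector in Python. It is an assumption for illustration only (the patent does not specify any particular detector, and the function name and thresholds are ours); real front ends use far more robust features:

```python
import numpy as np

def detect_endpoints(samples, frame_len=400, energy_threshold=0.02):
    """Return (first, last) voiced frame indices using simple frame energy."""
    usable = len(samples) // frame_len * frame_len
    frames = np.asarray(samples[:usable]).reshape(-1, frame_len)
    energy = (frames ** 2).mean(axis=1)          # mean energy per frame
    voiced = np.nonzero(energy > energy_threshold)[0]
    if voiced.size == 0:
        return None                              # no speech detected
    return int(voiced[0]), int(voiced[-1])
```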
The preprocessing techniques, such as the front-end processing utilized in the embodiments of the present invention, may be existing techniques or ones that appear in the future; the present invention is not limited in this respect.
In one embodiment of the present invention, the pre-established acoustic model is trained based on CTC (Connectionist Temporal Classification). Specifically, feature extraction may be performed on a large amount of speech data to obtain a feature vector for each utterance. Blank labels are then inserted at intervals of a preset number of pronunciation units, the speech data with the added blank labels is trained based on connectionist temporal classification, and the acoustic model is established. The acoustic model comprises a plurality of syllable data.
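The blank-insertion step described above can be sketched as follows; a minimal illustration assuming integer syllable ids with id 0 reserved for the CTC blank (the function name and interval parameter are ours, not the patent's):

```python
BLANK = 0  # label id reserved for the CTC blank

def insert_blanks(labels, interval=1):
    """Insert a blank after every `interval` pronunciation units,
    plus a leading blank; interval=1 gives the standard CTC expansion."""
    out = [BLANK]
    for i, label in enumerate(labels, start=1):
        out.append(label)
        if i % interval == 0:
            out.append(BLANK)
    return out

# e.g. insert_blanks([7, 3, 9]) -> [0, 7, 0, 3, 0, 9, 0]
```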
The plurality of syllable data in the acoustic model, together with the jump relationships between them, can form a large number of decoding paths, and these decoding paths form a decoding network. For example, Fig. 2 is a schematic diagram of a decoding network according to one embodiment of the present invention; circles represent syllable data in the decoding network and arrows represent jump relationships between syllable data. As can be seen from Fig. 2, there are multiple decoding paths in the network, each of which is one possible decoding result for the speech data. In the embodiment of the present invention, performing Viterbi decoding on the voice data is the process of selecting an optimal decoding path from the decoding network according to the feature vector frames of the voice data.
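A minimal sketch of such a network follows, with the syllables of Fig. 3's example used as placeholder nodes; an adjacency map stands in for the circles-and-arrows structure:

```python
# Syllable nodes and their jump relations; "jing1" is a bifurcation point
# with two outgoing branches, as in Fig. 3.
decoding_network = {
    "bei3": ["jing1"],
    "jing1": ["da4", "ren2"],
    "da4": ["xue2"],
    "ren2": [],
    "xue2": [],
}

def successors(syllable):
    """Syllables reachable in one jump from the given node."""
    return decoding_network.get(syllable, [])
```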
In one embodiment of the invention, following context is leveraged to reduce unnecessary language model scoring. Specifically, after the acoustic score and decoding path corresponding to each syllable datum in the acquired voice data are obtained by matching against the acoustic model, and when Viterbi decoding reaches a bifurcation point of the decoding path, the decoding paths after the bifurcation point are pruned according to the acoustic scores corresponding to the syllable data after the bifurcation point, before those syllable data are matched against the pre-established language model. Concretely, when Viterbi decoding reaches the bifurcation point, the language model scores of the following paths are not calculated; instead, the acoustic scores corresponding to the first syllable datum on each path after the bifurcation point are compared and the highest acoustic score is determined. Each decoding path whose first syllable datum has an acoustic score below the highest acoustic score by more than a preset range (for example, 5%) is cut, yielding the pruned decoding paths. The present invention does not limit the form of the preset range: it may be a percentage as in the example above, a numerical range, or another form, and it may be set according to specific needs.
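The pre-pruning step itself can be sketched as below. This assumes acoustic scores where higher is better and reads the 5% example as a relative margin below the best score; the patent leaves the exact form of the preset range open, so both this reading and the names (prune_at_bifurcation, margin) are our assumptions:

```python
def prune_at_bifurcation(branch_syllables, acoustic_score, margin=0.05):
    """Prune branches after a bifurcation point using acoustic scores only,
    before any language model scoring is performed.

    branch_syllables: first syllable datum on each path after the bifurcation.
    acoustic_score:   mapping syllable -> acoustic score (higher is better).
    margin:           the preset range, e.g. 5% below the highest score.
    """
    best = max(acoustic_score[s] for s in branch_syllables)
    cutoff = best * (1.0 - margin)  # assumed reading of "5% below the highest"
    return [s for s in branch_syllables if acoustic_score[s] >= cutoff]
```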
For example, Fig. 3 is a schematic diagram of another decoding network according to an embodiment of the present invention. As shown in Fig. 3, when Viterbi decoding reaches the bifurcation point A of the decoding path, the acoustic scores of da4 and ren2 after point A are compared first to determine the highest acoustic score. If the acoustic score of da4 is the highest and the acoustic score of ren2 falls below it by more than the preset range, the decoding path corresponding to ren2 is cut; that is, the branch from point A that decodes to "Beijing person" is discarded directly. Similarly, if the acoustic score of ren2 is the highest and the acoustic score of da4 falls below it by more than the preset range, the decoding path corresponding to da4 is cut and the branch from point A that decodes to "Beijing University" is abandoned. If, however, the acoustic score of da4 does not fall below the acoustic score of ren2 by more than the preset range, the decoding path corresponding to da4 is preserved.
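Running the sketch above on Fig. 3's bifurcation point A, with made-up acoustic scores for illustration:

```python
scores = {"da4": 0.90, "ren2": 0.80}  # hypothetical acoustic scores
surviving = prune_at_bifurcation(["da4", "ren2"], scores, margin=0.05)
print(surviving)  # ['da4']: ren2 is more than 5% below da4, so its path is cut
```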
With this method of pre-pruning using following context, the decoding paths are pruned according to the acoustic scores after the bifurcation point before the syllable data after the bifurcation point undergoes language model scoring, and only the pruned paths are then scored by the language model. This reduces unnecessary language model scoring (the decoding paths with low acoustic scores are cut), preserves decoding accuracy (the decoding paths with higher acoustic scores are retained), and improves the Viterbi decoding speed.
In an embodiment of the present invention, after the pruning, the syllable data on the pruned decoding paths may be matched against the pre-established language model to obtain language scores. The language model applied in the embodiment of the invention is an N-gram language model.
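For concreteness, scoring the surviving paths with an N-gram model can be sketched as follows; a bigram stand-in with a crude unigram backoff, where the tables and backoff penalty are illustrative assumptions rather than the patent's model:

```python
def bigram_score(syllables, bigram_logp, unigram_logp, backoff_penalty=-10.0):
    """Sum log-probabilities of consecutive syllable pairs, backing off
    to the unigram log-probability (or a flat penalty) for unseen pairs."""
    score = 0.0
    for prev, cur in zip(syllables, syllables[1:]):
        score += bigram_logp.get((prev, cur),
                                 unigram_logp.get(cur, backoff_penalty))
    return score
```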
Then, an optimal decoding path is selected according to the acoustic scores and language scores on the pruned decoding paths and output as the recognition result of the voice data. Prior-art pruning techniques can also be used to select the optimal decoding path from the pruned paths, which is not described again here.
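Combining both scores over the pruned paths to pick the output can be sketched as below; per-path score totals are taken as given, and the language model weight is a common refinement we assume rather than something the patent specifies:

```python
def select_best_path(pruned_paths, acoustic_total, language_total, lm_weight=1.0):
    """Return the pruned decoding path with the best combined score.

    pruned_paths:   candidate paths, each a tuple of syllables.
    acoustic_total: mapping path -> summed acoustic score.
    language_total: mapping path -> summed language score.
    """
    return max(pruned_paths,
               key=lambda p: acoustic_total[p] + lm_weight * language_total[p])
```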
According to the embodiment of the invention, pruning is performed in advance during Viterbi decoding using the acoustic scores of the following context, which effectively reduces unnecessary language model scoring, increases the decoding speed, and improves the efficiency of voice data processing.
Correspondingly, Fig. 4 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present invention. As shown in Fig. 4, the apparatus includes an acquisition unit 41 for acquiring voice data; an acoustic matching unit 42 configured to match, according to a pre-established acoustic model, an acoustic score corresponding to each syllable datum in the voice data and the decoding path on which it lies; a pruning unit 43 configured to, when Viterbi decoding reaches a bifurcation point of the decoding path and the syllable data after the bifurcation point is to be matched against a pre-established language model, prune the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point; a language matching unit 44 configured to match the syllable data on the pruned decoding paths against the language model to obtain language scores; and a decoding unit 45 configured to perform Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths.
Further, the pruning unit is further configured to prune the decoding paths after the bifurcation point according to the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point. The pruning unit is further configured to compare the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point, determine the highest acoustic score, and prune each decoding path whose first syllable datum has an acoustic score below the highest acoustic score by more than the preset range.
Further, the decoding unit is further configured to select an optimal decoding path according to the acoustic scores and language scores on the pruned decoding paths, and to output the optimal decoding path as the recognition result of the voice data.
Correspondingly, an embodiment of the present invention further provides user equipment, which includes a microphone, a processor, a memory, and a computer program stored in the memory and executable on the processor, where the microphone is used to acquire a voice signal, and the processor implements the voice processing method described above when executing the program.
Correspondingly, an embodiment of the present invention further provides a storage medium having stored therein instructions that, when run on a computer, cause the computer to execute the voice processing method described above.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n) ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of speech processing, the method comprising:
acquiring voice data;
matching, according to a pre-established acoustic model, an acoustic score corresponding to each syllable datum in the voice data and the decoding path on which it lies;
when Viterbi decoding reaches a bifurcation point of the decoding path and the syllable data after the bifurcation point is to be matched against a pre-established language model, pruning the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point;
matching the syllable data on the pruned decoding paths against the language model to obtain language scores; and
performing Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths.
2. The method of claim 1, wherein the pruning of the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point comprises:
pruning the decoding paths after the bifurcation point according to the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point.
3. The method of claim 2, wherein the pruning of the decoding paths after the bifurcation point according to the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point comprises:
comparing the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point to determine the highest acoustic score; and
pruning each decoding path whose first syllable datum has an acoustic score below the highest acoustic score by more than a preset range.
4. The method of claim 1, wherein the performing of Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths comprises:
selecting an optimal decoding path according to the acoustic scores and language scores on the pruned decoding paths, and outputting the optimal decoding path as the recognition result of the voice data.
5. A speech processing apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire voice data;
an acoustic matching unit configured to match, according to a pre-established acoustic model, an acoustic score corresponding to each syllable datum in the voice data and the decoding path on which it lies;
a pruning unit configured to, when Viterbi decoding reaches a bifurcation point of the decoding path and the syllable data after the bifurcation point is to be matched against a pre-established language model, prune the decoding paths after the bifurcation point according to the acoustic scores corresponding to the syllable data after the bifurcation point;
a language matching unit configured to match the syllable data on the pruned decoding paths against the language model to obtain language scores; and
a decoding unit configured to perform Viterbi decoding on the voice data frame by frame according to the acoustic scores and language scores on the pruned decoding paths.
6. The apparatus of claim 5, wherein the pruning unit is further configured to prune the decoding paths after the bifurcation point according to the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point.
7. The apparatus of claim 6, wherein the pruning unit is further configured to compare the acoustic scores corresponding to the first syllable datum on each decoding path after the bifurcation point to determine the highest acoustic score, and to prune each decoding path whose first syllable datum has an acoustic score below the highest acoustic score by more than a preset range.
8. The apparatus of claim 5, wherein the decoding unit is further configured to select an optimal decoding path according to the acoustic scores and language scores on the pruned decoding paths, and to output the optimal decoding path as the recognition result of the voice data.
9. User equipment comprising a microphone, a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the microphone is configured to acquire a voice signal and the processor implements the speech processing method according to any one of claims 1-4 when executing the program.
10. A storage medium having stored therein instructions which, when run on a computer, cause the computer to execute the speech processing method according to any one of claims 1-4.
CN201811467944.4A 2018-12-03 2018-12-03 Voice processing method and device, user equipment and storage medium Active CN109727603B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811467944.4A 2018-12-03 2018-12-03 Voice processing method and device, user equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811467944.4A 2018-12-03 2018-12-03 Voice processing method and device, user equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109727603A CN109727603A (en) 2019-05-07
CN109727603B 2020-11-03

Family

ID=66295851

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811467944.4A Active CN109727603B (en) 2018-12-03 2018-12-03 Voice processing method and device, user equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109727603B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956954B * 2019-11-29 2020-12-11 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition model training method and device and electronic equipment
CN113903340A * 2020-06-18 2022-01-07 Beijing SoundAI Technology Co., Ltd. Sample screening method and electronic device
CN113066489A * 2021-03-16 2021-07-02 Shenzhen Horizon Robotics Technology Co., Ltd. Voice interaction method and device, computer readable storage medium and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103871403A * 2012-12-13 2014-06-18 Beijing Baidu Netcom Science and Technology Co., Ltd. Method of setting up speech recognition model, speech recognition method and corresponding device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030369B * 2007-03-30 2011-06-29 Tsinghua University Embedded speech recognition method based on sub-word hidden Markov models
CN102376305B * 2011-11-29 2013-06-19 Anhui USTC iFlytek Co., Ltd. Speech recognition method and system
WO2014025682A2 * 2012-08-07 2014-02-13 Interactive Intelligence, Inc. Method and system for acoustic data selection for training the parameters of an acoustic model
CN105513589B * 2015-12-18 2020-04-28 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method and device
CN105845128B * 2016-04-06 2020-01-03 University of Science and Technology of China Speech recognition efficiency optimization method based on dynamic pruning beam width prediction
CN107403620A * 2017-08-16 2017-11-28 Guangdong Haixiang Education Technology Co., Ltd. Speech recognition method and device
CN108510990A * 2018-07-04 2018-09-07 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method, device, user equipment and storage medium

Also Published As

Publication number Publication date
CN109727603A (en) 2019-05-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant