CN111145756A - Voice recognition method and device for voice recognition - Google Patents

Voice recognition method and device for voice recognition

Info

Publication number
CN111145756A
Authority
CN
China
Prior art keywords
personalized
user
users
word
determining
Prior art date
Legal status
Granted
Application number
CN201911369489.9A
Other languages
Chinese (zh)
Other versions
CN111145756B (en)
Inventor
郑宏
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201911369489.9A
Publication of CN111145756A
Priority to PCT/CN2020/110037 (published as WO2021128880A1)
Application granted
Publication of CN111145756B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems


Abstract

Embodiments of the invention provide a voice recognition method and device, and a device for voice recognition. The method specifically includes the following steps: receiving voice information input by a user; acquiring a personalized word bank of the user, where the personalized word bank is established according to historical input content generated in the process of the user using an input method; determining decoding path weights corresponding to the voice information according to the personalized word bank; and determining a voice recognition result corresponding to the voice information according to the decoding path weights. Embodiments of the invention can improve the accuracy of voice recognition.

Description

Voice recognition method and device for voice recognition
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a speech recognition method and apparatus, and an apparatus for speech recognition.
Background
Speech recognition technology enables a machine to convert speech signals into corresponding text or commands through a process of recognition and understanding. With the continuous development of science and technology, speech recognition has advanced rapidly, its accuracy keeps improving, and its applications in the field of human-computer interaction are gradually expanding.
In practical applications, speech with the same pronunciation can carry different meanings for different users. For example, when user A says "libing", the voice recognition result user A wants is "李冰" (Li Bing), whereas when user B says "libing", the result user B wants is the homophone "李兵" (Li Bing, written differently). In such cases, the accuracy of voice recognition can be low.
Disclosure of Invention
The embodiment of the invention provides a voice recognition method, a voice recognition device and a device for voice recognition, which can improve the accuracy of voice recognition.
In order to solve the above problem, an embodiment of the present invention discloses a speech recognition method, where the method includes:
receiving voice information input by a user;
acquiring a personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of using an input method by the user;
determining decoding path weight corresponding to the voice information according to the personalized word bank;
and determining a voice recognition result corresponding to the voice information according to the decoding path weight.
In another aspect, an embodiment of the present invention discloses a speech recognition apparatus, including:
the voice receiving module is used for receiving voice information input by a user;
the word stock acquisition module is used for acquiring the personalized word stock of the user, and the personalized word stock is established according to historical input content generated in the process of using the input method by the user;
the weight determining module is used for determining the decoding path weight corresponding to the voice information according to the personalized word bank;
and the result determining module is used for determining a voice recognition result corresponding to the voice information according to the decoding path weight.
In yet another aspect, an embodiment of the present invention discloses an apparatus for speech recognition, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving voice information input by a user;
acquiring a personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of using an input method by the user;
determining decoding path weight corresponding to the voice information according to the personalized word bank;
and determining a voice recognition result corresponding to the voice information according to the decoding path weight.
In yet another aspect, embodiments of the invention disclose a machine-readable medium having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform a speech recognition method as described in one or more of the preceding.
The embodiment of the invention has the following advantages:
according to the embodiment of the invention, after voice information input by a user is received, the personalized word bank of the user is obtained, the decoding path weight corresponding to the voice information is determined according to the personalized word bank, and the voice recognition result corresponding to the voice information is determined according to the decoding path weight. The personalized word bank is established according to historical input content generated in the process that the user uses the input method, the personalized word bank conforms to the input habit of the user, and the voice decoding path can be weighted in real time according to the personalized word bank of the user in the process of decoding voice information input by the user, so that the final recognition result is inclined to the input habit of the user, and the accuracy of voice recognition can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a flow chart of the steps of one embodiment of a speech recognition method of the present invention;
FIG. 2 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus 800 for speech recognition of the present invention; and
FIG. 4 is a schematic diagram of a server in some embodiments of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Method embodiment
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a speech recognition method of the present invention is shown, which may specifically include the following steps:
step 101, receiving voice information input by a user;
step 102, acquiring a personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of the user using an input method;
step 103, determining a decoding path weight corresponding to the voice information according to the personalized word bank;
and step 104, determining a voice recognition result corresponding to the voice information according to the decoding path weight.
The speech recognition method of the embodiment of the invention can be applied to electronic equipment, including but not limited to: a server, a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a car computer, a desktop computer, a set-top box, a smart TV, a wearable device, and so on.
The voice recognition method provided by the embodiment of the invention can automatically recognize voice information input by a user and convert it into corresponding text information. The voice information may be a continuous piece of speech, such as a sentence or a longer passage. It is to be understood that the source of the voice information is not limited by the embodiment of the present invention; for example, the voice information may be a speech fragment collected in real time through a recording function of the electronic device.
In an optional embodiment of the present invention, the acquiring the voice information input by the user may specifically include: and acquiring voice information input or sent or received by a user through the instant messaging application.
The instant messaging application is an application program for realizing online chatting and exchanging through an instant messaging technology. The voice information acquired by the embodiment of the invention can comprise: the voice information input by the user through the instant messaging application, the voice information sent by the user to the communication opposite terminal through the instant messaging application and the voice information received by the user from the communication opposite terminal through the instant messaging application.
In the embodiment of the invention, the voice information input by the user can be segmented into multiple frames of speech segments according to a preset window length and frame shift, where the window length represents the duration of each frame and the frame shift represents the time difference between adjacent frames. For example, with a window length of 25 ms and a frame shift of 15 ms, the first frame covers 0-25 ms, the second frame covers 15-40 ms, and so on. The specific window length and frame shift may be set according to actual requirements, which is not limited in the embodiments of the present invention.
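As a minimal illustration of this windowing scheme (a sketch, not code from the patent; the 16 kHz sample rate is an assumption), the snippet below splits a signal into overlapping frames:

```python
import numpy as np

def frame_speech(samples: np.ndarray, sample_rate: int = 16000,
                 window_ms: float = 25.0, shift_ms: float = 15.0) -> np.ndarray:
    """Split a 1-D speech signal into overlapping frames: window_ms is the
    duration of each frame, shift_ms the time difference between frames."""
    window = int(sample_rate * window_ms / 1000)  # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)    # 240 samples at 16 kHz
    n_frames = max(0, (len(samples) - window) // shift + 1)
    if n_frames == 0:
        return np.empty((0, window))
    return np.stack([samples[i * shift : i * shift + window]
                     for i in range(n_frames)])

# One second of audio yields frames covering 0-25 ms, 15-40 ms, 30-55 ms, ...
frames = frame_speech(np.zeros(16000))
print(frames.shape)  # (66, 400)
```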
Optionally, before segmenting the speech information into multiple frames of speech segments, the electronic device may further perform noise reduction processing on the speech information to improve the processing capability of a subsequent system on the information.
In order to improve the accuracy of voice recognition, the embodiment of the invention establishes a personalized word bank for the user in advance, wherein the personalized word bank is established according to historical input contents generated in the process of using the input method by the user and can reflect the input habits of the user.
In the current voice recognition process, a main cause of inaccurate recognition is the erroneous substitution of homophones. For example, when user A says "libing", the voice recognition result user A wants is "李冰", whereas when user B says "libing", the result user B wants is "李兵". For the homophone "libing", different users have different speech recognition intentions.
According to the embodiment of the invention, based on the historical input content generated by each user in the process of using the input method, the personalized word bank is established for each user, the personalized word bank conforms to the input habit of the user, the decoding path weight corresponding to the voice information input by the user can be determined according to the personalized word bank of the user, and the voice recognition result corresponding to the voice information can be further determined according to the decoding path weight.
The history input content generated by the user in the process of using the input method may specifically include: text content that has been previously on-screen at the current cursor position, text content that the user copied, etc. The historical input content can also be text content which is input by a user in the instant messaging application and is sent to a communication opposite terminal, or can also be text content which is input by the user in input environments such as a browser, a document, a microblog and a mail; it is to be understood that the specific source of the historical input content is not limited by the embodiments of the present invention.
In an application example of the present invention, when the voice information "libing" input by user A is received, the speech decoder can learn during decoding, by consulting user A's personalized word bank, that the historical input content generated while user A used the input method often contains "李冰". An additional weight can therefore be added to decoding paths containing "李冰", so that such a path is preferentially selected and the voice recognition result for user A's voice information is "李冰".
Therefore, with the voice recognition method of the invention, after receiving voice information input by the user through a microphone, the voice decoder can automatically load the user's personalized word bank and then, during decoding, weight in real time the decoding paths that hit the word bank, so that the final recognition result leans toward the user's input habits and the accuracy of voice recognition is further improved.
In an optional embodiment of the present invention, before the obtaining the personalized word bank of the user, the method may further include:
step S11, collecting the history input content generated by the user in the process of using the input method;
step S12, preprocessing the history input content to obtain preprocessed history input content;
step S13, filtering the non-personalized words in the preprocessed historical input content to obtain personalized words;
and step S14, establishing the personalized word bank of the user according to the personalized words.
The embodiment of the invention can store the historical input contents generated by the user through the input method in the process that the user uses the input method, and establish the personalized word stock of the user according to the historical input contents.
It is understood that the historical input content may be stored in the user's terminal device or in a cloud server. The embodiment of the invention can preprocess the user's historical input content, for example cleaning it and removing noise data, to obtain preprocessed historical input content, and then filter out the non-personalized vocabulary in the preprocessed content to obtain the personalized vocabulary. The non-personalized vocabulary may include conjunctions, prepositions, and other function words. The personalized vocabulary may include common person names, place names, organization names, personal idioms, words of interest in particular domains, trending internet terms, and the like. It can be understood that the categories of personalized vocabulary may be preset according to actual needs, and the embodiment of the present invention does not limit them. The personalized vocabulary can be regarded as a profile of the user.
The personalized word bank can be stored in the terminal equipment of the user or can be stored in the cloud server, and the personalized word bank of the user has a one-to-one correspondence relationship with the identity information of the user. For example, a correspondence between the user identifier and the personalized word stock of the user may be established, or a correspondence between the voiceprint characteristics of the user and the personalized word stock of the user may also be established.
When a user inputs a voice to a microphone of a terminal device, a user identifier of the user, such as a login account, can be acquired; or, the voice input by the user can be subjected to voiceprint recognition to obtain the voiceprint characteristics of the user. According to the user identification or the voiceprint characteristics of the user, the personalized word bank of the user can be loaded from the cloud server, and in the process of carrying out voice recognition on the voice information input by the user, the weight of a decoding path hitting the personalized word bank can be strengthened in real time, so that the final voice recognition result is inclined to personalized words in the personalized word bank of the user, and the input habit of the user is met.
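A minimal sketch of this identity-keyed lookup (the storage layout and key names are assumptions for illustration, not the patent's implementation):

```python
from collections import Counter

# Hypothetical cloud-side store mapping an identity key (login account
# or a key derived from voiceprint features) to a personalized word bank.
LEXICON_STORE: dict[str, Counter] = {}

def load_lexicon(user_id: str = "", voiceprint_key: str = "") -> Counter:
    """Load a personalized word bank by user identifier, falling back to
    a key derived from the user's voiceprint features."""
    key = user_id or voiceprint_key
    return LEXICON_STORE.get(key, Counter())
```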
Optionally, the user's personalized word bank may be updated periodically. For example, after the personalized word bank is established, input content newly generated by the user through the input method can be obtained in real time, preprocessed, and filtered to obtain personalized vocabulary, which is added to the user's existing word bank; the personalized word bank is thus continuously updated and keeps adapting to the user's new input habits and preferences.
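The collect, preprocess, filter, and build steps, together with the periodic update, can be sketched as follows (the cleaning rules and the function-word list are illustrative assumptions):

```python
import re
from collections import Counter

# Assumed stand-in for the non-personalized vocabulary to filter out
# (conjunctions, prepositions, and other function words).
FUNCTION_WORDS = {"the", "and", "of", "to", "in", "a", "is"}

def preprocess(history: list[str]) -> list[str]:
    """Clean historical input content: normalize whitespace, drop noise."""
    cleaned = []
    for text in history:
        text = re.sub(r"\s+", " ", text).strip().lower()
        if text:  # discard empty or pure-noise entries
            cleaned.append(text)
    return cleaned

def build_lexicon(history: list[str]) -> Counter:
    """Build a personalized word bank with per-word frequencies."""
    words = (w for text in preprocess(history) for w in text.split())
    return Counter(w for w in words if w not in FUNCTION_WORDS)

def update_lexicon(lexicon: Counter, new_input: list[str]) -> Counter:
    """Periodic update: fold newly generated input into the word bank."""
    lexicon.update(build_lexicon(new_input))
    return lexicon
```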
In an optional embodiment of the present invention, after the creating the personalized word bank of the user, the method may further include:
step S21, calculating the similarity between the personalized word banks of at least two users;
and step S22, if the similarity meets the preset condition, merging the personalized word banks of the at least two users to obtain a merged personalized word bank.
In a particular application, some users have similar input habits and usually have the same or similar speech recognition results. Therefore, the embodiment of the invention judges whether the users have similar input habits or not by calculating the similarity of the personalized word banks of different users, and merges the personalized word banks of the users with the similar input habits so as to expand the personalized word banks of the users.
Specifically, if the similarity between the personalized word banks of the at least two users meets the preset condition, which indicates that the at least two users have similar input habits, the personalized word banks of the at least two users may be merged to obtain a merged personalized word bank.
For example, for user U and user V, if the similarity between user U's personalized word bank N(U) and user V's personalized word bank N(V) satisfies the preset condition, N(U) and N(V) may be merged to obtain a merged personalized word bank N(UV), where N(UV) contains the personalized vocabulary in N(U) and the personalized vocabulary in N(V). In this case, step 102 of obtaining the personalized word bank of the user may specifically include: acquiring the merged personalized word bank.
In the process of recognizing voice information input by user U, the merged personalized word bank N(UV) can be loaded, and the decoding path weights corresponding to the voice information are determined according to the personalized vocabulary in N(UV). Similarly, in the process of recognizing voice information input by user V, the merged personalized word bank N(UV) may be loaded and the decoding path weights determined according to the personalized vocabulary in N(UV). The personalized word banks of user U and user V are thereby expanded, which enriches each user's personalized word bank and further improves the accuracy of voice recognition.
In an optional embodiment of the present invention, the calculating a similarity between the personalized word banks of at least two users specifically may include:
step S31, calculating the cosine distance between the personalized word banks of the at least two users according to the common vocabulary contained in the personalized word banks of the at least two users;
and step S32, calculating the similarity between the personalized word libraries of the at least two users according to the cosine distance.
Generally, if more common personalized words are contained in the personalized word libraries of two users, the more similar the input habits of the two users are, that is, the higher the similarity of the personalized word libraries of the two users is. Therefore, the embodiment of the invention can calculate the similarity between the users through the common personalized vocabulary contained in the personalized word bank of the users.
Specifically, in the embodiment of the present invention, the cosine distance between the personalized word banks of the at least two users may be calculated according to the common vocabulary amount of those word banks, where the common vocabulary amount refers to the number of personalized words the word banks have in common.
For example, for user U and user V, let N(U) denote the personalized word bank of user U, containing the personalized vocabulary of user U, and let N(V) denote the personalized word bank of user V, containing the personalized vocabulary of user V. The similarity W_UV between N(U) and N(V) can then be calculated via the cosine measure as:

W_UV = |N(U) ∩ N(V)| / √(|N(U)| · |N(V)|)    (1)

where |N(U) ∩ N(V)| is the common vocabulary amount, i.e. the number of personalized words the two word banks share.
As can be seen from formula (1), the more common vocabulary N(U) and N(V) share, the higher the similarity between them. In step S22, determining that the similarity between the personalized word banks of the at least two users satisfies the preset condition may specifically include: if the cosine distance between the personalized word banks of the at least two users is smaller than a preset threshold, determining that the similarity between those word banks satisfies the preset condition. Since the cosine distance falls as the similarity W_UV rises, this is equivalent to W_UV exceeding a corresponding threshold; in that case it can be determined that the similarity between user U's word bank N(U) and user V's word bank N(V) satisfies the preset condition, and N(U) and N(V) can be merged.
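Formula (1) and the merge test can be sketched in a few lines (the threshold value is an assumption):

```python
import math

def lexicon_similarity(nu: set[str], nv: set[str]) -> float:
    """Cosine similarity of two personalized word banks, per formula (1):
    |N(U) & N(V)| / sqrt(|N(U)| * |N(V)|)."""
    if not nu or not nv:
        return 0.0
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

def should_merge(nu: set[str], nv: set[str], threshold: float = 0.5) -> bool:
    """Merge when the cosine distance (1 - similarity) is below a threshold."""
    return 1.0 - lexicon_similarity(nu, nv) < threshold
```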
In an optional embodiment of the present invention, after the creating the personalized word bank of the user, the method may further include:
step S41, classifying the personalized vocabulary according to the corresponding field of the personalized vocabulary in the personalized vocabulary bank;
and step S42, carrying out statistics on personalized words corresponding to different fields in the personalized word stock, and determining personalized labels corresponding to the personalized word stock.
After the personalized word bank of the user is established, the embodiment of the invention can also determine the field to which the personalized words in the personalized word bank of the user belong, and further can mark the personalized word bank of the user with a personalized tag representing the field tendency.
For example, if the user's personalized word bank contains personalized words such as "hot pot", "western food", and "wanghong (influencer) restaurant", then since the field corresponding to these words is "food", the personalized tag corresponding to the user's word bank can be determined to be "food" or the like. For another example, if the user's personalized word bank includes personalized words such as "blockchain", "face recognition", and "cloud computing", then since the field corresponding to these words is "IT technology", personalized tags such as "IT engineer" and "IT technology" can be determined for the word bank.
In the embodiment of the invention, personalized words corresponding to different fields can be preset; by counting the field-specific personalized words in a user's personalized word bank, the personalized tags corresponding to that word bank can be determined. In practical applications, if a user's personalized word bank contains personalized words from several different fields, the word bank can be marked with several different personalized tags. For example, counting the field-specific personalized words in a certain user's word bank may show that it contains vocabulary from multiple fields such as "finance and economics", "judicial", and "medical health"; the word bank can then be marked with the corresponding tags "finance and economics", "judicial", "medical health", and so on.
Step S22, determining that the similarity between the personalized word libraries of the at least two users satisfies a preset condition, which may specifically include: and if the personalized word banks of the at least two users have the same personalized tag, determining that the similarity between the personalized word banks of the at least two users meets a preset condition.
If two users have the same personalized tag, it is stated that the two users have the same domain tendency, for example, the two users may be engaged in the same profession, or have the same hobbies, etc. Therefore, if it is determined that the personalized word banks of the at least two users have the same personalized tag, it may be determined that the similarity between the personalized word banks of the at least two users satisfies the preset condition, and the personalized word banks of the at least two users may be merged.
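One way to realize this tagging (a sketch; the per-field word lists and the hit threshold are invented for illustration):

```python
# Hypothetical mapping from fields to characteristic personalized words.
DOMAIN_WORDS: dict[str, set[str]] = {
    "food": {"hot pot", "western food"},
    "IT technology": {"blockchain", "face recognition", "cloud computing"},
}

def personalized_tags(lexicon: set[str], min_hits: int = 2) -> list[str]:
    """Count personalized words per field and tag the word bank with every
    field that has at least min_hits matching words."""
    return [field for field, words in DOMAIN_WORDS.items()
            if len(lexicon & words) >= min_hits]

print(personalized_tags({"blockchain", "cloud computing", "hot pot"}))
# ['IT technology']
```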
In an optional embodiment of the present invention, the merging the personalized word banks of the at least two users in step S22 to obtain a merged personalized word bank specifically includes:
all the vocabularies in the personalized word banks of the at least two users are merged to obtain a merged personalized word bank; or
And merging the vocabularies meeting the matching conditions in the personalized word banks of the at least two users to obtain a merged personalized word bank.
In the embodiment of the present invention, after determining the personalized word banks of at least two users whose similarity satisfies the preset condition, the personalized word banks of the at least two users may be merged, and two schemes may be adopted for the merging.
Specifically, all the words in the personalized words to be merged may be merged, for example, for the user U and the user V, let n (U) represent the personalized word bank of the user U, which includes the personalized words of the user U; let n (V) represent a personalized lexicon of the user V, which comprises the personalized vocabulary of the user V. If the similarity between N (u) and N (v) meets the preset condition, all the vocabularies in N (u) and all the vocabularies in N (v) can be merged, and of course, after merging, processing such as duplication removal can be performed, so that a merged personalized word bank is obtained.
Or, only the words meeting the matching condition in the personalized words to be merged may be merged. Wherein the matching condition may include: belonging to the same domain, having the same personalized tag, etc. It is understood that the matching condition can be set by those skilled in the art according to actual needs. For example, in the above example, after determining that the similarity between n (u) and n (v) satisfies the preset condition, the vocabularies satisfying the matching condition may be further determined in n (u) and n (v), for example, the vocabularies in n (u) and n (v) that belong to the same field are determined, and only the vocabularies in n (u) and n (v) that belong to the same field are merged, and of course, processing such as deduplication may be performed after merging, so as to obtain a merged personalized word bank.
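Both merging schemes can be sketched as set operations (deduplication falls out of the union; the field lookup used for the matching condition is an assumption):

```python
def merge_all(nu: set[str], nv: set[str]) -> set[str]:
    """Scheme 1: merge all vocabulary from both word banks, deduplicated."""
    return nu | nv

def merge_matching(nu: set[str], nv: set[str],
                   field_of: dict[str, str]) -> set[str]:
    """Scheme 2: merge only vocabulary satisfying the matching condition,
    illustrated here as 'belongs to a field present in both word banks'."""
    shared = ({field_of.get(w) for w in nu} &
              {field_of.get(w) for w in nv})
    shared.discard(None)  # ignore words with no known field
    return {w for w in nu | nv if field_of.get(w) in shared}
```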
In an optional embodiment of the present invention, the determining, according to the personalized word bank, a decoding path weight corresponding to the voice information may specifically include:
step S51, matching the word sequences of the decoding paths corresponding to the voice information with the personalized word stock respectively;
and step S52, determining the weight of the decoding path corresponding to the matched vocabulary according to the word frequency information corresponding to the matched vocabulary of the word sequence in the personalized word stock.
The embodiment of the invention can carry out acoustic model scoring and language model checking on the voice segments corresponding to the voice information frame by frame through a preset decoding network so as to obtain a voice recognition result. The basic structure of the decoding network is a directed graph which consists of nodes and arcs. Each arc may hold an entry and acoustic model information and/or language model information for that entry. In practice, the acoustic model information is generally expressed as an acoustic model score, the language model information is generally expressed as a language model score, and the voice recognition is a process of finding an optimal path on the directed graph according to the input voice data. Where acoustic models are used for probability calculations from speech to syllables and language models are used for probability calculations from syllables to words. Both the acoustic model score and the language model score may be obtained through prior model training.
In the embodiment of the invention, after the electronic device on which the speech recognition method operates performs acoustic model scoring and language model score checking on the speech segments frame by frame, the speech recognition result can be obtained according to the final scoring result. Specifically, after the acoustic model scoring and the language model scoring are performed, the scores of all nodes on each decoding path in the decoding network may be added as the score of the decoding path. And then backtracking one or more decoding paths with the highest score to obtain the word sequence corresponding to the corresponding decoding path. Thus, a phrase or sentence composed of the obtained word sequence can be used as a speech recognition result.
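A toy illustration of this path-scoring and backtracking step (the node scores are invented; a real decoder searches the directed graph rather than enumerating paths):

```python
def best_paths(paths, top_k=1):
    """Score each decoding path by summing the scores of its nodes
    (acoustic model score plus language model score per node) and
    return the highest-scoring path(s) with their word sequences."""
    scored = [(sum(a + l for a, l in nodes), words) for words, nodes in paths]
    return sorted(scored, key=lambda p: p[0], reverse=True)[:top_k]

# Each path: (word sequence, [(acoustic score, LM score) per node]).
paths = [(["李", "冰"], [(-1.2, -0.5), (-0.9, -0.7)]),
         (["李", "兵"], [(-1.2, -0.5), (-1.1, -1.0)])]
print(best_paths(paths))  # [(-3.3, ['李', '冰'])] (approximately)
```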
Step 104, determining a speech recognition result corresponding to the speech information according to the decoding path weight, which may specifically include: and determining a voice recognition result corresponding to the voice information according to the weight of the decoding path corresponding to the matched vocabulary in the word sequence of each decoding path.
During decoding, the word sequences corresponding to each decoding path are matched against the user's personalized word bank; if one or more word sequences have matching words in the word bank, the weights of the decoding paths corresponding to the matching words are determined according to the word frequency information of those words. For example, user U often inputs "李冰" through the input method, so the personalized word "李冰" and its word frequency are recorded in the personalized word bank. When user U inputs the voice information "libing" through a microphone, the speech decoder matches the word sequences of each decoding path against the user's personalized word bank during decoding. Suppose the word sequences containing "李冰" and "李兵" both have matching words in user U's personalized word bank; the word frequencies of "李冰" and "李兵" are then retrieved, and if the frequency of "李冰" is greater than that of "李兵", the weight of the decoding path for "李冰" can be set greater than that of the path for "李兵". The final recognition result "李冰" is thus obtained, which matches user U's input habits and is therefore more accurate.
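The real-time weighting can be sketched as a rescoring pass over candidate decoding paths (the log-frequency bonus and the weight alpha are assumptions, not the patent's formula):

```python
import math

def boost_paths(paths: list[tuple[list[str], float]],
                lexicon: dict[str, int], alpha: float = 1.0):
    """For every word in a path's word sequence that hits the personalized
    word bank, add a bonus that grows with the word's recorded frequency;
    return the path with the highest boosted score."""
    rescored = []
    for words, score in paths:
        bonus = sum(alpha * math.log1p(lexicon[w])
                    for w in words if w in lexicon)
        rescored.append((words, score + bonus))
    return max(rescored, key=lambda p: p[1])

# Toy run: "李冰" outscores the homophone "李兵" because its word
# frequency in the personalized word bank is higher.
lexicon = {"李冰": 37, "李兵": 2}
paths = [(["李冰"], -10.0), (["李兵"], -10.0)]
print(boost_paths(paths, lexicon)[0])  # ['李冰']
```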
In an optional embodiment of the present invention, after determining, in step 104, a speech recognition result corresponding to the speech information according to the decoding path weight, the method may further include:
step S61, context information of the matched vocabulary in the voice recognition result is obtained;
and step S62, correcting the voice recognition result according to the context information to obtain a voice recognition result after error correction.
In practical applications, although the personalized vocabulary in the user's personalized word bank reflects the user's input habits, determining decoding path weights solely from that vocabulary may cause false boosts. For example, suppose a user's voice input is "lijiewansui" (理解万岁, "long live understanding"). Because the high-frequency personalized word "李姐" (Sister Li, a homophone of 理解) is in the user's word bank, the recognition result "李姐万岁" ("long live Sister Li") may be produced, even though the context indicates that the result the user wants is "理解万岁". The recognition result is then wrong.
Therefore, after the voice recognition result corresponding to the voice information is determined, the embodiment of the present invention may further obtain context information of the matching vocabulary in the voice recognition result, and perform error correction on the voice recognition result according to the context information to obtain an error-corrected voice recognition result.
For example, in the above example, the speech recognition result determined from the user's personalized word bank may be "李姐万岁"; at this point, the context information of the matching word "李姐" in the recognition result can be obtained. If the surrounding context concerns ideas and mutual understanding rather than a person, it can be determined from that context that the recognition result contains an error, and the result can then be corrected. Specifically, among the remaining candidates, the word sequence with the highest decoding path weight can be obtained, say "理解" ("understanding"); since "理解万岁" is consistent with the current context, "理解" can be used to correct "李姐", yielding the corrected speech recognition result "理解万岁", which is then output.
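A toy sketch of this context check (the context-compatibility scorer is invented; a real system would use a language model over the surrounding text):

```python
def correct_with_context(result: list[str], boosted_word: str,
                         alternatives: list[str], context_score) -> list[str]:
    """If an alternative word fits the surrounding context better than the
    word chosen via the personalized word bank, substitute it."""
    context = [w for w in result if w != boosted_word]
    best = max([boosted_word] + alternatives,
               key=lambda w: context_score(w, context))
    return [best if w == boosted_word else w for w in result]

# Hypothetical scorer: plausibility of word w next to each context word.
def toy_score(w: str, context: list[str]) -> float:
    plausible = {("理解", "万岁"): 2.0, ("李姐", "万岁"): 0.5}
    return sum(plausible.get((w, c), 0.0) for c in context)

print(correct_with_context(["李姐", "万岁"], "李姐", ["理解"], toy_score))
# ['理解', '万岁']
```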
To sum up, after receiving the voice information input by the user, the embodiment of the present invention may obtain the personalized word bank of the user, determine the decoding path weight corresponding to the voice information according to the personalized word bank, and determine the voice recognition result corresponding to the voice information according to the decoding path weight. The personalized word bank is established according to historical input content generated in the process that the user uses the input method, the personalized word bank conforms to the input habit of the user, and the voice decoding path can be weighted in real time according to the personalized word bank of the user in the process of decoding voice information input by the user, so that the final recognition result is inclined to the input habit of the user, and the accuracy of voice recognition can be improved.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Device embodiment
Referring to fig. 2, a block diagram of a speech recognition apparatus according to an embodiment of the present invention is shown, where the apparatus may specifically include:
a voice receiving module 201, configured to receive voice information input by a user;
a lexicon obtaining module 202, configured to obtain an individualized lexicon of the user, where the individualized lexicon is established according to historical input content generated in a process in which the user uses an input method;
the weight determining module 203 is configured to determine, according to the personalized word bank, a decoding path weight corresponding to the voice information;
and a result determining module 204, configured to determine a speech recognition result corresponding to the speech information according to the decoding path weight.
Optionally, the apparatus may further include:
the data collection module is used for collecting historical input contents generated in the process of using the input method by the user;
the data processing module is used for preprocessing the historical input content to obtain preprocessed historical input content;
the data filtering module is used for filtering the non-personalized vocabulary in the preprocessed historical input content to obtain personalized vocabulary;
and the word stock establishing module is used for establishing the personalized word stock of the user according to the personalized words.
Optionally, the apparatus may further include:
the similarity calculation module is used for calculating the similarity between the personalized word banks of at least two users;
the database merging module is used for merging the personalized word banks of the at least two users to obtain a merged personalized word bank if the similarity between the personalized word banks of the at least two users is determined to meet a preset condition;
the word stock obtaining module is specifically configured to obtain the combined personalized word stock.
Optionally, the similarity calculation module may specifically include:
the distance calculation submodule is used for calculating the cosine distance between the personalized word banks of the at least two users according to the common vocabulary contained in the personalized word banks of the at least two users;
the similarity calculation operator module is used for calculating the similarity between the personalized word banks of the at least two users according to the cosine distance;
the database merging module may specifically include:
the first determining submodule is used for determining that the similarity between the personalized word banks of the at least two users meets a preset condition if the cosine distance between the personalized word banks of the at least two users is smaller than a preset threshold.
Optionally, the apparatus may further include:
the vocabulary classification module is used for classifying the personalized vocabulary according to the corresponding field of the personalized vocabulary in the personalized vocabulary bank;
the tag establishing module is used for counting personalized words corresponding to different fields in the personalized word stock and determining personalized tags corresponding to the users;
optionally, the database merging module may specifically include:
the first merging submodule is used for merging all vocabularies in the personalized word banks of the at least two users to obtain a merged personalized word bank; or
And the first merging submodule is used for merging the vocabularies meeting the matching conditions in the personalized word banks of the at least two users to obtain a merged personalized word bank.
The database merging module may specifically include:
and the second determining submodule is used for determining that the similarity between the personalized word banks of the at least two users meets the preset condition if the personalized word banks of the at least two users have the same personalized tag.
Optionally, the weight determining module may specifically include:
the word bank matching submodule is used for respectively matching the word sequences of the decoding paths corresponding to the voice information with the personalized word bank;
the weight determining submodule is used for determining the weight of a decoding path corresponding to the matched vocabulary according to the word frequency information corresponding to the matched vocabulary of the word sequence in the personalized word bank;
and the result determining module is specifically configured to determine a speech recognition result corresponding to the speech information according to the weight of the decoding path corresponding to the matched vocabulary in the word sequence of each decoding path.
Optionally, the apparatus may further include:
the context acquisition module is used for acquiring context information of the matched vocabulary in the voice recognition result;
and the result error correction module is used for correcting the voice recognition result according to the context information to obtain the voice recognition result after error correction.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
An embodiment of the present invention provides an apparatus for speech recognition, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for: receiving voice information input by a user; acquiring a personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of the user using an input method; determining decoding path weights corresponding to the voice information according to the personalized word bank; and determining a voice recognition result corresponding to the voice information according to the decoding path weights.
Fig. 3 is a block diagram illustrating an apparatus 800 for speech recognition according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 3, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice information processing mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800; it may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the apparatus 800 and other devices. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 4 is a schematic diagram of a server in some embodiments of the invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors) and memory 1932, one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. Memory 1932 and storage medium 1930 can be, among other things, transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instructions operating on a server. Still further, a central processor 1922 may be provided in communication with the storage medium 1930 to execute a series of instruction operations in the storage medium 1930 on the server 1900.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform the speech recognition method shown in fig. 1.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (server or terminal), enable the apparatus to perform a speech recognition method, the method comprising: receiving voice information input by a user; acquiring an individualized word bank of the user, wherein the individualized word bank is established according to historical input content generated in the process of using an input method by the user; determining decoding path weight corresponding to the voice information according to the personalized word bank; and determining a voice recognition result corresponding to the voice information according to the decoding path weight.
The embodiment of the invention discloses A1, a voice recognition method, the method comprising:
receiving voice information input by a user;
acquiring a personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of using an input method by the user;
determining decoding path weight corresponding to the voice information according to the personalized word bank;
and determining a voice recognition result corresponding to the voice information according to the decoding path weight.
A2, before the obtaining the personalized word bank of the user according to the method of A1, the method further comprises:
collecting historical input contents generated by the user in the process of using an input method;
preprocessing the historical input content to obtain preprocessed historical input content;
filtering the non-personalized vocabulary in the preprocessed historical input content to obtain personalized vocabulary;
and establishing an individualized word bank of the user according to the individualized words.
A3, after the creating of the personalized thesaurus of the user according to the method of A2, the method further comprising:
calculating the similarity between the personalized word banks of at least two users;
if the similarity between the personalized word banks of the at least two users is determined to meet a preset condition, combining the personalized word banks of the at least two users to obtain a combined personalized word bank;
the acquiring of the personalized word bank of the user includes:
and acquiring the combined personalized word stock.
A4, according to the method in A3, the calculating the similarity between the personalized word banks of at least two users includes:
calculating the cosine distance between the personalized word banks of the at least two users according to the common vocabulary contained in the personalized word banks of the at least two users;
calculating the similarity between the personalized word banks of the at least two users according to the cosine distance;
the determining that the similarity between the personalized word banks of the at least two users meets a preset condition includes:
and if the cosine distance between the personalized word banks of the at least two users is smaller than a preset threshold, determining that the similarity between the personalized word banks of the at least two users meets a preset condition (see the sketch below).
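A minimal sketch of the A4 test follows. Using word frequencies as the vector components over the common vocabulary, and the value of the preset threshold, are assumptions of this illustration.

```python
# Minimal, non-normative sketch of A4: cosine distance between two
# personalized word banks over their common vocabulary.
import math

def cosine_distance(bank_a, bank_b):
    common = set(bank_a) & set(bank_b)
    if not common:
        return 1.0  # no shared vocabulary: treat as maximally distant
    dot = sum(bank_a[w] * bank_b[w] for w in common)
    norm_a = math.sqrt(sum(bank_a[w] ** 2 for w in common))
    norm_b = math.sqrt(sum(bank_b[w] ** 2 for w in common))
    return 1.0 - dot / (norm_a * norm_b)

def similar_enough(bank_a, bank_b, threshold=0.2):
    # The preset condition of A4: cosine distance below a preset threshold.
    return cosine_distance(bank_a, bank_b) < threshold

u1 = {"tensor": 5, "gradient": 3, "meeting": 1}
u2 = {"tensor": 4, "gradient": 2, "lunch": 2}
print(similar_enough(u1, u2))  # -> True for these overlapping banks
```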
A5. In the method of A3, after the establishing of the personalized word bank of the user, the method further comprises:
classifying the personalized vocabularies according to the fields corresponding to the personalized vocabularies in the personalized word bank;
carrying out statistics on personalized words corresponding to different fields in the personalized word bank, and determining personalized tags corresponding to the users;
the determining that the similarity between the personalized word banks of the at least two users meets a preset condition includes:
and if the personalized word banks of the at least two users have the same personalized tag, determining that the similarity between the personalized word banks of the at least two users meets a preset condition (a sketch of the tagging step follows).
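A minimal sketch of the tagging of A5: the word-to-field dictionary DOMAIN_OF is a toy assumption, since the embodiment does not specify how personalized words are mapped to fields.

```python
# Minimal, non-normative sketch of A5: deriving a user's personalized
# tag from domain statistics over the personalized word bank.
from collections import Counter

DOMAIN_OF = {"tensor": "machine-learning", "gradient": "machine-learning",
             "forehand": "tennis", "backhand": "tennis"}  # toy mapping

def personalized_tags(bank, top_n=1):
    domain_counts = Counter()
    for word, freq in bank.items():
        domain = DOMAIN_OF.get(word)
        if domain:
            domain_counts[domain] += freq  # weight domains by word frequency
    return [d for d, _ in domain_counts.most_common(top_n)]

print(personalized_tags({"tensor": 5, "gradient": 3, "forehand": 1}))
# -> ['machine-learning']
```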
A6. In the method of A3, the merging of the personalized word banks of the at least two users to obtain a merged personalized word bank includes:
merging all the vocabularies in the personalized word banks of the at least two users to obtain a merged personalized word bank; or
merging the vocabularies meeting a matching condition in the personalized word banks of the at least two users to obtain a merged personalized word bank (both strategies are sketched below).
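Both merge strategies of A6 are sketched below. Summing frequencies on overlap, and using "present in both banks" as the matching condition, are assumptions of this illustration.

```python
# Minimal, non-normative sketch of A6: two merge strategies for
# personalized word banks, represented as {word: frequency} dicts.
from collections import Counter

def merge_all(bank_a, bank_b):
    # Strategy 1: merge all vocabulary from both banks.
    return dict(Counter(bank_a) + Counter(bank_b))

def merge_matching(bank_a, bank_b):
    # Strategy 2: keep only words meeting a matching condition, here
    # (as an assumption) words that appear in both banks.
    return {w: bank_a[w] + bank_b[w] for w in set(bank_a) & set(bank_b)}

a, b = {"tensor": 5, "lunch": 1}, {"tensor": 2, "gradient": 4}
print(merge_all(a, b))       # {'tensor': 7, 'lunch': 1, 'gradient': 4}
print(merge_matching(a, b))  # {'tensor': 7}
```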
A7. In the method of A1, the determining of the decoding path weight corresponding to the voice information according to the personalized word bank includes:
matching the word sequences of the decoding paths corresponding to the voice information with the personalized word bank respectively;
determining the weight of the decoding path corresponding to the matched vocabulary according to the word frequency information, in the personalized word bank, of the vocabulary matched by the word sequence;
the determining of a voice recognition result corresponding to the voice information according to the decoding path weight includes:
and determining a voice recognition result corresponding to the voice information according to the weight of the decoding path corresponding to the matched vocabulary in the word sequence of each decoding path (see the sketch below).
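A minimal sketch of the A7 weighting follows; the log-scaled frequency bonus is an assumption, since the embodiment only requires that the path weight follow the word frequency information of the matched vocabulary.

```python
# Minimal, non-normative sketch of A7: matching each decoding path's
# word sequence against the personalized word bank and weighting the
# path by the frequency of the matched words.
import math

def weight_paths(nbest, lexicon, scale=1.0):
    """nbest: [(word_sequence, base_score)]; lexicon: {word: frequency}."""
    weighted = []
    for words, base in nbest:
        bonus = sum(scale * math.log1p(lexicon[w])
                    for w in words if w in lexicon)
        weighted.append((words, base + bonus))
    return weighted

nbest = [(["play", "go"], 1.0), (["play", "gou"], 1.2)]
best = max(weight_paths(nbest, {"go": 10}), key=lambda p: p[1])
print(best[0])  # -> ['play', 'go']: the matched path overtakes the other
```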
A8. In the method of A7, after the determining of the voice recognition result corresponding to the voice information according to the decoding path weight, the method further comprises:
obtaining context information of the matched vocabulary in the voice recognition result;
and correcting the voice recognition result according to the context information to obtain the corrected voice recognition result (a sketch follows).
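A minimal sketch of the A8 correction: the homophone table and the context-hint rule are toy assumptions standing in for whatever context model an implementation would actually use.

```python
# Minimal, non-normative sketch of A8: using context around a word in
# the recognition result to substitute a better-supported alternative.
HOMOPHONES = {"their": ["there", "they're"]}   # toy homophone table
CONTEXT_HINTS = {"there": {"over", "go"},      # toy context cues
                 "they're": {"think", "say"}}

def correct_with_context(words):
    corrected = list(words)
    for i, w in enumerate(words):
        # Context window: up to two words on each side.
        context = set(words[max(0, i - 2):i] + words[i + 1:i + 3])
        for candidate in HOMOPHONES.get(w, []):
            if context & CONTEXT_HINTS.get(candidate, set()):
                corrected[i] = candidate  # context supports the alternative
                break
    return corrected

print(correct_with_context(["go", "over", "their"]))
# -> ['go', 'over', 'there']
```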
The embodiments of the invention further disclose B9, a speech recognition apparatus, comprising:
the voice receiving module is used for receiving voice information input by a user;
the word bank acquisition module is used for acquiring the personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of the user using an input method;
the weight determining module is used for determining the decoding path weight corresponding to the voice information according to the personalized word bank;
and the result determining module is used for determining a voice recognition result corresponding to the voice information according to the decoding path weight.
B10, the apparatus of B9, the apparatus further comprising:
the data collection module is used for collecting historical input contents generated in the process of using the input method by the user;
the data processing module is used for preprocessing the historical input content to obtain preprocessed historical input content;
the data filtering module is used for filtering the non-personalized vocabulary in the preprocessed historical input content to obtain personalized vocabulary;
and the word stock establishing module is used for establishing the personalized word stock of the user according to the personalized words.
B11, the apparatus of B10, the apparatus further comprising:
the similarity calculation module is used for calculating the similarity between the personalized word banks of at least two users;
the database merging module is used for merging the personalized word banks of the at least two users to obtain a merged personalized word bank if the similarity between the personalized word banks of the at least two users is determined to meet a preset condition;
the word bank acquisition module is specifically configured to acquire the combined personalized word bank.
B12, the apparatus of B11, the similarity calculation module comprising:
the distance calculation submodule is used for calculating the cosine distance between the personalized word banks of the at least two users according to the common vocabulary contained in the personalized word banks of the at least two users;
the similarity calculation submodule is used for calculating the similarity between the personalized word banks of the at least two users according to the cosine distance;
the database merging module comprises:
the first determining submodule is used for determining that the similarity between the personalized word banks of the at least two users meets a preset condition if the cosine distance between the personalized word banks of the at least two users is smaller than a preset threshold.
B13, the apparatus of B11, the apparatus further comprising:
the vocabulary classification module is used for classifying the personalized vocabulary according to the field corresponding to the personalized vocabulary in the personalized word bank;
the tag establishing module is used for performing statistics on the personalized words corresponding to different fields in the personalized word bank and determining the personalized tag corresponding to the user;
the database merging module comprises:
and the second determining submodule is used for determining that the similarity between the personalized word banks of the at least two users meets the preset condition if the personalized word banks of the at least two users have the same personalized tag.
B14. The apparatus of B11, the database merging module comprising:
the first merging submodule is used for merging all the vocabularies in the personalized word banks of the at least two users to obtain a merged personalized word bank; or
the second merging submodule is used for merging the vocabularies meeting a matching condition in the personalized word banks of the at least two users to obtain a merged personalized word bank.
B15. The apparatus of B9, the weight determining module comprising:
the word bank matching submodule is used for respectively matching the word sequences of the decoding paths corresponding to the voice information with the personalized word bank;
the weight determining submodule is used for determining the weight of the decoding path corresponding to the matched vocabulary according to the word frequency information, in the personalized word bank, of the vocabulary matched by the word sequence;
and the result determining module is specifically configured to determine a speech recognition result corresponding to the speech information according to the weight of the decoding path corresponding to the matched vocabulary in the word sequence of each decoding path.
B16, the apparatus of B15, the apparatus further comprising:
the context acquisition module is used for acquiring context information of the matched vocabulary in the voice recognition result;
and the result error correction module is used for correcting the voice recognition result according to the context information to obtain the voice recognition result after error correction.
The embodiments of the invention further disclose C17, a device for speech recognition, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving voice information input by a user;
acquiring a personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of the user using an input method;
determining decoding path weight corresponding to the voice information according to the personalized word bank;
and determining a voice recognition result corresponding to the voice information according to the decoding path weight.
C18. The device of C17, wherein the device is further configured to execute, by the one or more processors, the one or more programs including instructions for:
collecting historical input contents generated by the user in the process of using an input method;
preprocessing the historical input content to obtain preprocessed historical input content;
filtering the non-personalized vocabulary in the preprocessed historical input content to obtain personalized vocabulary;
and establishing a personalized word bank of the user according to the personalized words.
C19. The device of C18, wherein the device is further configured to execute, by the one or more processors, the one or more programs including instructions for:
calculating the similarity between the personalized word banks of at least two users;
if the similarity between the personalized word banks of the at least two users is determined to meet a preset condition, combining the personalized word banks of the at least two users to obtain a combined personalized word bank;
the acquiring of the personalized word bank of the user includes:
and acquiring the combined personalized word bank.
C20. In the device of C19, the calculating of the similarity between the personalized word banks of the at least two users includes:
calculating the cosine distance between the personalized word banks of the at least two users according to the common vocabulary contained in the personalized word banks of the at least two users;
calculating the similarity between the personalized word banks of the at least two users according to the cosine distance;
the determining that the similarity between the personalized word banks of the at least two users meets a preset condition includes:
and if the cosine distance between the personalized word banks of the at least two users is smaller than a preset threshold, determining that the similarity between the personalized word banks of the at least two users meets a preset condition.
C21. The device of C19, wherein the device is further configured to execute, by the one or more processors, the one or more programs including instructions for:
classifying the personalized vocabularies according to the fields corresponding to the personalized vocabularies in the personalized word bank;
carrying out statistics on personalized words corresponding to different fields in the personalized word bank, and determining personalized tags corresponding to the users;
the determining that the similarity between the personalized word banks of the at least two users meets a preset condition includes:
and if the personalized word banks of the at least two users have the same personalized tag, determining that the similarity between the personalized word banks of the at least two users meets a preset condition.
C22. In the device of C19, the merging of the personalized word banks of the at least two users to obtain a merged personalized word bank includes:
merging all the vocabularies in the personalized word banks of the at least two users to obtain a merged personalized word bank; or
merging the vocabularies meeting a matching condition in the personalized word banks of the at least two users to obtain a merged personalized word bank.
C23. In the device of C17, the determining of the decoding path weight corresponding to the voice information according to the personalized word bank includes:
matching the word sequences of the decoding paths corresponding to the voice information with the personalized word bank respectively;
determining the weight of the decoding path corresponding to the matched vocabulary according to the word frequency information, in the personalized word bank, of the vocabulary matched by the word sequence;
the determining of a voice recognition result corresponding to the voice information according to the decoding path weight includes:
and determining a voice recognition result corresponding to the voice information according to the weight of the decoding path corresponding to the matched vocabulary in the word sequence of each decoding path.
C24. The device of C23, wherein the device is further configured to execute, by the one or more processors, the one or more programs including instructions for:
obtaining context information of the matched vocabulary in the voice recognition result;
and correcting the voice recognition result according to the context information to obtain the corrected voice recognition result.
The embodiments of the present invention disclose D25, a machine-readable medium having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform a speech recognition method as described in one or more of A1 to A8.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The foregoing has described in detail a speech recognition method, a speech recognition apparatus, and a device for speech recognition provided by the present invention. Specific examples have been applied herein to explain the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help understand the method and its core ideas. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the scope of application according to the ideas of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A method of speech recognition, the method comprising:
receiving voice information input by a user;
acquiring a personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of the user using an input method;
determining decoding path weight corresponding to the voice information according to the personalized word bank;
and determining a voice recognition result corresponding to the voice information according to the decoding path weight.
2. The method of claim 1, wherein prior to obtaining the personalized word bank of the user, the method further comprises:
collecting historical input contents generated by the user in the process of using an input method;
preprocessing the historical input content to obtain preprocessed historical input content;
filtering the non-personalized vocabulary in the preprocessed historical input content to obtain personalized vocabulary;
and establishing a personalized word bank of the user according to the personalized words.
3. The method of claim 2, wherein after the establishing of the personalized lexicon of the user, the method further comprises:
calculating the similarity between the personalized word banks of at least two users;
if the similarity between the personalized word banks of the at least two users is determined to meet a preset condition, combining the personalized word banks of the at least two users to obtain a combined personalized word bank;
the acquiring of the personalized word bank of the user includes:
and acquiring the combined personalized word bank.
4. The method of claim 3, wherein calculating the similarity between the personalized word banks of at least two users comprises:
calculating the cosine distance between the personalized word banks of the at least two users according to the common vocabulary contained in the personalized word banks of the at least two users;
calculating the similarity between the personalized word banks of the at least two users according to the cosine distance;
the determining that the similarity between the personalized word banks of the at least two users meets a preset condition includes:
and if the cosine distance between the personalized word banks of the at least two users is smaller than a preset threshold, determining that the similarity between the personalized word banks of the at least two users meets a preset condition.
5. The method of claim 3, wherein after the establishing of the personalized lexicon of the user, the method further comprises:
classifying the personalized vocabularies according to the fields corresponding to the personalized vocabularies in the personalized word bank;
carrying out statistics on personalized words corresponding to different fields in the personalized word bank, and determining personalized tags corresponding to the users;
the determining that the similarity between the personalized word banks of the at least two users meets a preset condition includes:
and if the personalized word banks of the at least two users have the same personalized tag, determining that the similarity between the personalized word banks of the at least two users meets a preset condition.
6. The method of claim 3, wherein the merging the personalized word banks of the at least two users to obtain a merged personalized word bank comprises:
merging all the vocabularies in the personalized word banks of the at least two users to obtain a merged personalized word bank; or
merging the vocabularies meeting a matching condition in the personalized word banks of the at least two users to obtain a merged personalized word bank.
7. The method of claim 1, wherein the determining of the decoding path weight corresponding to the voice information according to the personalized word bank comprises:
matching the word sequences of the decoding paths corresponding to the voice information with the personalized word bank respectively;
determining the weight of the decoding path corresponding to the matched vocabulary according to the word frequency information, in the personalized word bank, of the vocabulary matched by the word sequence;
the determining of a voice recognition result corresponding to the voice information according to the decoding path weight comprises:
and determining a voice recognition result corresponding to the voice information according to the weight of the decoding path corresponding to the matched vocabulary in the word sequence of each decoding path.
8. A speech recognition apparatus, characterized in that the apparatus comprises:
the voice receiving module is used for receiving voice information input by a user;
the word bank acquisition module is used for acquiring the personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of the user using an input method;
the weight determining module is used for determining the decoding path weight corresponding to the voice information according to the personalized word bank;
and the result determining module is used for determining a voice recognition result corresponding to the voice information according to the decoding path weight.
9. An apparatus for speech recognition, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising instructions for:
receiving voice information input by a user;
acquiring a personalized word bank of the user, wherein the personalized word bank is established according to historical input content generated in the process of the user using an input method;
determining decoding path weight corresponding to the voice information according to the personalized word bank;
and determining a voice recognition result corresponding to the voice information according to the decoding path weight.
10. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform a speech recognition method as claimed in one or more of claims 1 to 7.
CN201911369489.9A 2019-12-26 2019-12-26 Voice recognition method and device for voice recognition Active CN111145756B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911369489.9A CN111145756B (en) 2019-12-26 2019-12-26 Voice recognition method and device for voice recognition
PCT/CN2020/110037 WO2021128880A1 (en) 2019-12-26 2020-08-19 Speech recognition method, device, and device for speech recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911369489.9A CN111145756B (en) 2019-12-26 2019-12-26 Voice recognition method and device for voice recognition

Publications (2)

Publication Number Publication Date
CN111145756A true CN111145756A (en) 2020-05-12
CN111145756B CN111145756B (en) 2022-06-14

Family

ID=70520583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911369489.9A Active CN111145756B (en) 2019-12-26 2019-12-26 Voice recognition method and device for voice recognition

Country Status (2)

Country Link
CN (1) CN111145756B (en)
WO (1) WO2021128880A1 (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5937383A (en) * 1996-02-02 1999-08-10 International Business Machines Corporation Apparatus and methods for speech recognition including individual or speaker class dependent decoding history caches for fast word acceptance or rejection
JP3542026B2 (en) * 2000-05-02 2004-07-14 インターナショナル・ビジネス・マシーンズ・コーポレーション Speech recognition system, speech recognition method, and computer-readable recording medium
TWI396184B (en) * 2009-09-17 2013-05-11 Tze Fen Li A method for speech recognition on all languages and for inputing words using speech recognition
CN105448292B (en) * 2014-08-19 2019-03-12 北京羽扇智信息科技有限公司 A kind of time Speech Recognition System and method based on scene
CN108717851B (en) * 2018-03-28 2021-04-06 深圳市三诺数字科技有限公司 Voice recognition method and device
CN109377998B (en) * 2018-12-11 2022-02-25 科大讯飞股份有限公司 Voice interaction method and device
CN111145756B (en) * 2019-12-26 2022-06-14 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103262156A (en) * 2010-08-27 2013-08-21 思科技术公司 Speech recognition language model
CN103065630A (en) * 2012-12-28 2013-04-24 安徽科大讯飞信息科技股份有限公司 User personalized information voice recognition method and user personalized information voice recognition system
EP3340239A1 (en) * 2016-12-23 2018-06-27 Samsung Electronics Co., Ltd. Electronic device and speech recognition method therefor
CN109243430A (en) * 2017-07-04 2019-01-18 北京搜狗科技发展有限公司 A kind of audio recognition method and device
CN108831439A (en) * 2018-06-27 2018-11-16 广州视源电子科技股份有限公司 Audio recognition method, device, equipment and system
CN109243468A (en) * 2018-11-14 2019-01-18 北京羽扇智信息科技有限公司 Audio recognition method, device, electronic equipment and storage medium
CN110211576A (en) * 2019-04-28 2019-09-06 北京蓦然认知科技有限公司 A kind of methods, devices and systems of speech recognition

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021128880A1 (en) * 2019-12-26 2021-07-01 北京搜狗科技发展有限公司 Speech recognition method, device, and device for speech recognition
CN111627438A (en) * 2020-05-21 2020-09-04 四川虹美智能科技有限公司 Voice recognition method and device
CN113808593A (en) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 Voice interaction system, related method, device and equipment
CN112735428A (en) * 2020-12-27 2021-04-30 科大讯飞(上海)科技有限公司 Hot word acquisition method, voice recognition method and related equipment
CN112700764A (en) * 2021-03-19 2021-04-23 北京沃丰时代数据科技有限公司 Hot word voice recognition method and device, electronic equipment and storage medium
CN113593566A (en) * 2021-06-08 2021-11-02 深圳双猴科技有限公司 Voice recognition processing method and system
CN113436614A (en) * 2021-07-02 2021-09-24 科大讯飞股份有限公司 Speech recognition method, apparatus, device, system and storage medium
CN113436614B (en) * 2021-07-02 2024-02-13 中国科学技术大学 Speech recognition method, device, equipment, system and storage medium
CN113744740A (en) * 2021-09-03 2021-12-03 北京烽火万家科技有限公司 Speech recognition method, device and system based on edge calculation over-calculation AIpass
CN113963695A (en) * 2021-10-13 2022-01-21 深圳市欧瑞博科技股份有限公司 Awakening method, awakening device, equipment and storage medium of intelligent equipment
CN114327355A (en) * 2021-12-30 2022-04-12 科大讯飞股份有限公司 Voice input method, electronic device and computer storage medium

Also Published As

Publication number Publication date
CN111145756B (en) 2022-06-14
WO2021128880A1 (en) 2021-07-01

Similar Documents

Publication Publication Date Title
CN111145756B (en) Voice recognition method and device for voice recognition
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN107221330B (en) Punctuation adding method and device and punctuation adding device
CN110210310B (en) Video processing method and device for video processing
US20170154104A1 (en) Real-time recommendation of reference documents
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN108399914B (en) Voice recognition method and device
CN111128183B (en) Speech recognition method, apparatus and medium
CN111368541B (en) Named entity identification method and device
CN109471919B (en) Zero pronoun resolution method and device
CN107274903B (en) Text processing method and device for text processing
JP7116088B2 (en) Speech information processing method, device, program and recording medium
CN111369978B (en) Data processing method and device for data processing
CN108628819B (en) Processing method and device for processing
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN114154459A (en) Speech recognition text processing method and device, electronic equipment and storage medium
CN112133295B (en) Speech recognition method, device and storage medium
CN111739535A (en) Voice recognition method and device and electronic equipment
CN111832297A (en) Part-of-speech tagging method and device and computer-readable storage medium
CN112331194A (en) Input method and device and electronic equipment
CN110781689A (en) Information processing method, device and storage medium
CN116127062A (en) Training method of pre-training language model, text emotion classification method and device
CN107301188B (en) Method for acquiring user interest and electronic equipment
CN111324214A (en) Statement error correction method and device
CN110931013B (en) Voice data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant