CN105931642B - Voice recognition method, device and system - Google Patents

Voice recognition method, device and system

Info

Publication number
CN105931642B
CN105931642B (application CN201610375073.8A)
Authority
CN
China
Prior art keywords
user
recognition
speech
voice
identification information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610375073.8A
Other languages
Chinese (zh)
Other versions
CN105931642A (en)
Inventor
汤跃忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
iFlytek Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201610375073.8A
Publication of CN105931642A
Application granted
Publication of CN105931642B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187: Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a voice recognition method, device, and system. The method comprises the following steps: acquiring a voice input of a user; selecting a speech database to recognize the voice input and outputting the resulting recognition output; selecting one or more candidate optimal recognition outputs from the recognition output using a domain decision; and determining the optimal recognition output among the one or more candidate optimal recognition outputs, with the individual identification information of the user as the decision condition. This scheme improves the accuracy of speech recognition without increasing the response time.

Description

Voice recognition method, device and system
Technical Field
The invention relates to the field of voice recognition, and in particular to a voice recognition method, device, and system.
Background
With the popularization of intelligent devices, voice recognition systems have become a new means of information access, and at the same time enable intelligent control of those devices.
In the use of speech recognition systems, user experience has become a central concern. For voice recognition applications, response time and decision accuracy are the core factors in improving user experience. In current approaches, a single general-purpose data model is often used to judge the speech data, so that decisions for all speech environments are made by one common system. This inevitably increases the workload of voice recognition, prolongs the response time, and thus degrades the user experience.
In the art, automatic speech recognition (ASR) systems commonly recognize speech input through a recognition engine. The engine model of a speech recognition system usually consists of two parts, an acoustic model and a language model, corresponding respectively to the computation of speech-to-syllable probabilities and of syllable-to-word probabilities. Language models are mainly divided into rule-based models and statistical models; the statistical regularities inherent in language units are revealed by probabilistic-statistical methods. The engine unit completes the recognition output of the voice input through a knowledge-domain decision.
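The two-part engine model described above can be pictured as a toy decoder that combines acoustic and language-model probabilities in log space. This is a minimal sketch: the dictionary-based scoring and all names are illustrative stand-ins, since a real engine decodes over a lattice of syllable hypotheses rather than a fixed candidate list.

```python
import math

def decode(acoustic_scores, language_model, candidates):
    """Pick the candidate word sequence maximizing the combined
    acoustic (speech-to-syllable) and language-model (syllable-to-word)
    log-probability, as in a conventional ASR engine."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        score = math.log(acoustic_scores[cand]) + math.log(language_model[cand])
        if score > best_score:
            best, best_score = cand, score
    return best
```

With equal acoustic scores, the language model breaks the tie; that tie-breaking is exactly where the domain and identity decisions of the later steps come in.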
There are various ways to restrict the voice decision to a specific range by adding specific user information identifiers to the general system, thereby improving the response time and the decision accuracy. A common form in the art is to set up database classifications for different dialects and accents, so that the voice input can be classified at the initial decision stage and a fast response achieved. Specific information identifiers can be added to this database-selection step. The information identifier can come from the user terminal, can be obtained by processing the user's voice input, or can be obtained in other ways, such as from the user's location information or the signal source of the mobile device. This information is input into the ASR system as the user's identification information, assisting data selection and decision for that user, shortening the response time and reducing the misjudgment rate.
However, although the above forms add identification information of the user, that information merely assists the system in selecting the language database by supplying the language type and the location. While the response time is reduced, the final recognition output cannot be targeted to the particular user through that identification information; that is, the recognition efficiency is not high.
There is therefore a need for a recognition method that can improve recognition accuracy for a particular user while keeping the response time short.
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide a speech recognition method, device, and system that improve the accuracy of speech recognition without increasing the response time.
According to an aspect of the present invention, there is provided a speech recognition method comprising: acquiring a voice input of a user; selecting a speech database to recognize the voice input and outputting the resulting recognition output; selecting one or more candidate optimal recognition outputs from the recognition output using a domain decision; and determining the optimal recognition output among the one or more candidate optimal recognition outputs, with the individual identification information of the user as the decision condition.
According to another aspect of the present invention, there is provided a speech recognition apparatus including: a voice acquisition unit for acquiring a voice input of a user; a voice recognition unit for selecting a voice database to recognize a voice input by a user and outputting a recognition output as a result; a first decision unit for selecting one or more candidate optimal recognition outputs from the recognition outputs using domain decision; and a second determination unit configured to determine an optimal recognition output among the one or more candidate optimal recognition outputs using the individual identification information of the user as a determination condition.
According to a third aspect of the present invention, there is provided a speech recognition system comprising: the above-described voice recognition device; and a client device communicatively coupled to the speech recognition device.
According to the above scheme, a secondary decision on the voice recognition result is performed using the user's specific information identifier, and the result of that decision is output as the final result, realizing a multi-stage decision on the recognition output; the newly added decision stage takes the output of the domain decision as its input. Since only a small number of results are retained for the final decision, the scheme does not increase the load on the system, and can determine the voice recognition output more accurately without lengthening the response time.
Drawings
The above features and advantages of the present invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flow diagram of a speech recognition method according to an embodiment of the present invention;
FIG. 2 provides a flow diagram of a method for speech recognition using native information of a user in accordance with an embodiment of the present invention;
FIG. 3 shows a flow diagram of another speech recognition method according to an embodiment of the invention;
FIG. 4 is a schematic block diagram illustrating a speech recognition device for implementing a speech recognition method according to an embodiment of the present invention; and
FIG. 5 shows a schematic block diagram of a speech recognition system according to an embodiment of the present invention.
Detailed Description
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals are used to designate the same or similar components, although they are shown in different drawings. For the purposes of clarity and conciseness, a detailed description of known functions and configurations incorporated herein will be omitted to avoid making the subject matter of the present invention unclear.
Fig. 1 shows a schematic flow diagram of a speech recognition method according to an embodiment of the present invention.
As shown in fig. 1, in step S01, a voice input of the user is acquired.
In some examples, the user's voice input may be obtained through a client device (e.g., a voice receiving unit of the client device, such as a microphone, etc.) that the user is using. A speech recognition device communicatively coupled to the client device may then obtain speech input from the client device.
Here, the client device used by the user may be a mobile phone, a fixed terminal, a PDA (personal digital assistant), a notebook, a netbook, a tablet computer, etc. of the user, however, the present invention is not limited thereto, and any mobile or non-mobile device conceivable to those skilled in the art may be used as the client device.
The voice recognition device described in the present application may be referred to as a server, a cloud server, a remote terminal, etc. in some implementations, however, the present invention is not limited thereto, and the voice recognition device in the present invention may be any device that can be used to implement the technical solution of the present invention, regardless of whether it is mobile or non-mobile, and regardless of the name of the device in the specific implementation.
In some examples, the user's voice information may be read by a unit such as a microphone of the client device, converted to an electronic signal, and stored. For example, the user may make a voice input through the microphone system of the electronic device: "play a musical drama", "play a drama", "I want to listen to a drama", and the like.
In some examples the client device may even be omitted, for example where the speech recognition device is local to the user, who may then input speech directly at the speech recognition device (e.g., its microphone).
At step S02, the speech database is selected to recognize the speech input by the user, and the resulting recognition output is output.
In some examples, a speech database to be used may be selected, and recognition of speech is performed using an acoustic model and a language model of a speech recognition engine, etc., according to the selected speech database, and a recognition result is output.
In step S03, one or more candidate optimal recognition outputs are selected from the recognition outputs using domain determination.
The most preferable candidate output results can be selected for output by the domain decision. The output may include a plurality of candidate results; for example, the candidates may be "I want to listen to Yue opera", "I want to listen to Cantonese opera", and the like. Of course, in some cases only one output result may be produced.
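A minimal sketch of such a domain decision, assuming the engine emits scored hypotheses; the threshold and candidate cap are illustrative parameters, not values from the patent:

```python
def domain_decision(recognition_outputs, threshold=0.2, max_candidates=3):
    """Retain only the highest-scoring hypotheses as candidate optimal
    recognition outputs; may return one candidate or several.

    recognition_outputs: list of (text, score) pairs from the engine.
    """
    kept = [(text, s) for text, s in recognition_outputs if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in kept[:max_candidates]]
```

When only one hypothesis survives, the later identity-based decision of step S05 can be bypassed, as the text below notes.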
Optionally, in step S04, the personal identification information of the user is detected.
This step may be performed between step S03 and step S05 (described in detail below), but the present invention is not limited thereto; the step may be performed at any time before step S05. For example, where the user has used the voice recognition device before, the personal identification information detected on a previous use may be stored and reused in the present recognition.
The individual identification information may include, for example, the user's geographic location, the signal source currently connected to the user's mobile device, the user's native place, and any other information known to those skilled in the art that can individually identify the user. The user's geographic location can be obtained in a variety of ways, used singly or in combination. For example: the IP address may be obtained through the user's network connection (when the user uses an intelligent voice device connected to a cloud server, inspection of the network information may show the user's location to be "Shaoxing City, Zhejiang Province"); the location may be determined from the base station with which the user's mobile device is associated; or the geographic location may be obtained from the GPS system of the user's mobile device. One of these acquisition methods may be used alone, or several may be combined to avoid misjudgment (for example, when the user connects through a proxy server, it is difficult to judge the user's location from network information alone).
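The combination of acquisition methods might look like the following sketch, which trusts a value confirmed by two sources before falling back to the single most reliable one; the source ordering and the agreement rule are assumptions for illustration:

```python
def detect_location(gps=None, base_station=None, ip_lookup=None):
    """Fuse several geographic-location sources for the user.

    A value reported by at least two sources is trusted (guarding
    against e.g. a proxied IP address); otherwise fall back to the
    first available source, in rough order of reliability.
    """
    sources = [s for s in (gps, base_station, ip_lookup) if s is not None]
    if not sources:
        return None
    for s in sources:
        if sources.count(s) >= 2:  # two independent sources agree
            return s
    return sources[0]
```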
In step S05, an optimal recognition output among the one or more candidate optimal recognition outputs is determined with the individual identification information of the user as a determination condition.
The plurality of candidate optimal recognition outputs are further judged using the user's individual identification information as the decision condition, and the most suitable one among them is determined by small-range retrieval and recognition. For example, suppose the candidate optimal recognition outputs determined in step S03 are "I want to hear Yue opera" and "I want to hear Cantonese opera", and the geographic location obtained in step S04 is "Shaoxing City, Zhejiang Province". Using that location as the decision condition, a small-sample retrieval over the candidates from step S03 determines the output result to be "I want to hear Yue opera".
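The small-range retrieval can be pictured as re-ranking the handful of candidates by their relevance to the identification information. The `affinity` table here is a hypothetical stand-in for whatever retrieval index the system maintains:

```python
def decide_optimal(candidates, identity_info, affinity):
    """Return the candidate whose association with the user's
    identification information scores highest; unknown pairs score 0."""
    return max(candidates, key=lambda c: affinity.get((c, identity_info), 0.0))
```

Because only the few candidates from step S03 are scored, this second decision stays cheap relative to the full recognition pass.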
In this way, recognition accuracy is improved through the association between the user's individual identification information and the recognition output. In the above method, the second recognition decision in step S05 is performed only within a small recognition domain, so it does not impose an excessive load on the overall response time; the method therefore improves the recognition rate of the user's voice input without substantially increasing the response time, yielding a better user experience.
In another example, if the user inputs "I want to listen to 'swan goose'" through the intelligent voice system, the system may judge that the combinations with higher probability are "swan" and "red goose", and output both as candidate optimal combinations; the individual identification is then applied in the final selection stage for the system's decision, so that different results are output according to the different individual identifications acquired. This improves the user experience to a greater extent and identifies the user's requirement accurately.
When the user's input clearly points to a geographic place name, the small-sample retrieval can be skipped after the multiple candidate optimal results are output: the user's identified geographic information is directly compared, as the decision identifier, against the multiple optimal solutions, so the result is output more quickly. For example, the user inputs "Chaoyang weather" ("Chaoyang" is the name of several districts and cities); among the multiple output areas named Chaoyang, one is selected by the user's identified geographic information. This approach further simplifies the recognition pattern, but applies only when the multiple optimal solutions all turn on the same item of individual identification information (e.g., geographic information).
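When the candidates differ only in a place name, the shortcut reduces to a direct comparison against the user's own region tag. A sketch under that assumption (the substring matching is illustrative):

```python
def direct_geo_match(candidates, user_region):
    """Pick the candidate that mentions the user's region directly,
    skipping small-sample retrieval; fall back to the top candidate."""
    for cand in candidates:
        if user_region in cand:
            return cand
    return candidates[0]
```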
There may also be cases where only one candidate recognition output is produced in step S03. In that case, step S05 may be bypassed. In other examples, however, step S05 may still be used to judge whether that single candidate is suitable, discarding a clearly unsuitable output and prompting the user to input speech again.
When the optimal recognition output has been determined, it is output in step S06. The output means may include, but is not limited to, sound, image, text, or any other means used in the art to output information; the present invention is not limited in this respect.
In the above description of the technical solution, the user's geographic location has been used as the example of individual identification information; however, other individual identification information, such as the user's native place, may also be used.
In the case of using the user's native place, the dialect or accent of the user's voice input can be judged to determine the native place. FIG. 2 provides a flow diagram of a method for speech recognition using the user's native-place information, according to an embodiment of the present invention.
In step S01 shown in fig. 2, when the voice input of the user is acquired, the dialect and/or accent attributes of the user may be recognized from the acquired voice to determine the user's native place (step S07).
After the above-mentioned native information is acquired, the determination of the optimum output result is made in step S05 using the native information as the individual identification information of the user.
For example, in step S02, the dialect attribute of the speech may be determined by the speech recognition system; the determination result might be, for example, "Zhejiang dialect".
The plurality of candidate optimal results selected in step S03 may then be further judged in step S05 using the "Zhejiang dialect" attribute as the decision condition. For example, when the candidates to be judged are "I want to listen to Yue opera" and "I want to listen to Cantonese opera", the decision condition "Zhejiang dialect" determines the final output to be "I want to listen to Yue opera".
Making the determination with the native place as the individual identification information avoids decision errors caused by geographic identifiers obtained through device association, for example the error that could arise when a user whose native place is Zhejiang is currently using the voice recognition device in Guangdong.
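Mapping a detected dialect or accent label to a native place can be as simple as a lookup; the labels and mapping below are illustrative stand-ins, and a real system would obtain the label from an acoustic dialect classifier:

```python
# Illustrative dialect-label -> native-place mapping (not exhaustive).
DIALECT_TO_NATIVE = {
    "zhejiang_dialect": "Zhejiang",
    "cantonese": "Guangdong",
}

def native_place(detected_dialect):
    """Derive the user's native place from the dialect detected in the
    voice input, independent of where the device currently is."""
    return DIALECT_TO_NATIVE.get(detected_dialect)
```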
The case of using the geographical location information and the native information as the individual identification information of the user is described above with reference to fig. 1 and 2, respectively. However, in some examples, the two situations may be combined to obtain a more accurate determination result. For example, the native information of the user and the geographical location information of the user may be used in combination as the individual identification information.
A third specific embodiment is a combination of the first and second embodiments. In one form, the native-place determination and the geographic-location determination are used together: the two determination results are compared, and the comparison result serves as the decision identification information in S05. For example, when the two results are the same (e.g., both are Zhejiang), that result is used as the decision identification information. In other embodiments, if the two determinations differ, a higher priority may be given to the native-place determination or to the geographic-location determination, for example based on system settings or user settings. In still other embodiments, where more items of individual identification information are available, the determination may combine them, for example by assigning different weights to the different items and selecting the result with the largest total score. Any other determination method using various items of individual identification information that is readily conceivable to those skilled in the art may be adopted in the technical solution of the present invention, and is not described further here.
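The weighted-combination variant can be sketched as follows, summing the weight of every signal voting for each value and keeping the value with the largest total; the weight values themselves are assumptions for illustration:

```python
from collections import defaultdict

def combined_identity(signals, weights):
    """Fuse several identification signals (e.g. native place, geographic
    location) into one decision value by weighted voting.

    signals: dict mapping signal name -> value, e.g. {"native": "Zhejiang"}
    weights: dict mapping signal name -> weight; missing names weigh 1.0
    """
    totals = defaultdict(float)
    for name, value in signals.items():
        totals[value] += weights.get(name, 1.0)
    return max(totals, key=totals.get)  # value with the largest total score
```

When the signals agree, the weights are irrelevant; when they disagree, the weights encode the priority the text describes (e.g. trusting native place over device location).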
The individual identification information of the user is used only in the determination of step S05 in the above example, however, in some examples, the individual identification information of the user may also be used in the voice recognition of step S02. FIG. 3 shows a flow diagram of another speech recognition method according to an embodiment of the invention.
As shown in fig. 3, in step S01, a voice input of the user is acquired.
In a next step, the individual identification information of the user is detected. For example, the user's native place can be detected from the voice input, or the user's geographic location can be detected by other means; the invention is not limited in this respect. As previously described, this detection step may be performed at any time before the individual identification information is used (in this example, before step S02), and in some cases previously acquired and stored identification information may even be reused.
Then, in the voice recognition step S02, the individual identification information is used as the criterion for database selection, to speed up the voice recognition. In the subsequent steps, data recognition proceeds as before, and in S05 the individual identification information is used again for the small-sample decision, so that the output data is finally obtained accurately.
In the above example, the individual identification information is used twice. The first use selects the voice decision database (for example, a specific geographic identifier selects the database used in voice recognition); the second use selects the appropriate output from the candidate optimal results, because even with a suitable voice database, outputs that are inappropriate for this user can still arise from the probability combination. The optimal result can therefore be screened by the individual identification information (for example, native-place information or a geographic identifier).
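Pulling the pieces together, the twice-used identification information of fig. 3 can be sketched as the pipeline below; every callable is a hypothetical stand-in for a component described above, not an API from the patent:

```python
def recognize(voice_input, identity, databases, engine, select_candidates, final_decision):
    """End-to-end sketch: the identity first picks the speech database,
    then disambiguates among the candidate optimal outputs."""
    db = databases.get(identity, databases["default"])  # first use: DB selection (S02)
    outputs = engine(voice_input, db)                   # scored hypotheses
    candidates = select_candidates(outputs)             # domain decision (S03)
    return final_decision(candidates, identity)         # second use: final decision (S05)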
Fig. 4 is a schematic block diagram illustrating a speech recognition apparatus for implementing the above-described speech recognition method according to an embodiment of the present invention. As shown in fig. 4, the voice recognition apparatus may include a voice acquiring unit 410 for acquiring a voice input of a user; a voice recognition unit 420 for selecting a voice database to recognize a voice input by a user and outputting a recognition output as a result; a first decision unit 430 for selecting one or more candidate optimal recognition outputs from the recognition outputs using domain decision; and a second decision unit 440 for deciding an optimal recognition output among the one or more candidate optimal recognition outputs with the individual identification information of the user as a decision condition.
In some examples, the speech recognition unit 420 may also be to: the speech input by the user is recognized using the acoustic model and the language model of the speech recognition engine according to the selected speech database.
In some examples, the speech recognition device may further include: the information detecting unit 450 is configured to detect the individual identification information of the user.
In some examples, the voice recognition apparatus may further include a memory 460 for storing the individual identification information detected by the information detecting unit 450. In addition, the memory may also store any data used by the voice recognition device in performing voice recognition, such as the voice database described above, which is not limited by the present invention.
The individual identification information of the user in the present invention may include one or more of: the user's geographic location, the signal source currently connected to the user's mobile device, and the user's native place. As described above, the individual identification information in the present invention is not limited to these, and may be any information used in the art to individually identify a user.
In some examples, the information detection unit 450 is further configured to: the native place of the user is obtained by recognizing the dialect and/or accent attribute of the user when performing recognition of the voice input by the user.
In some examples, the speech recognition unit 420 is further to: the individual identification information of the user is used to select a voice database for voice recognition.
A schematic block diagram of a speech recognition device according to an embodiment of the present invention has been described above in terms of modules/units. It should be noted, however, that one or more of the modules/units may be implemented by specific hardware, and fig. 4 is only a schematic diagram for explaining the technical solution; an actual implementation may include more or fewer modules/units. For example, some implementations may also include an output device such as a speaker or display for outputting information, and various storage devices for the data/programs required by, or generated when implementing, the technical solution; the present invention is not limited in this respect.
FIG. 5 shows a schematic block diagram of a speech recognition system according to an embodiment of the present invention. As shown in fig. 5, the speech recognition system includes a cloud server (i.e., the speech recognition device shown in fig. 4) and a client voice intelligent device (i.e., the client device) communicatively connected to it. As previously described, the client device may be omitted when the user is co-located with the speech recognition device; the user may then input speech directly at the speech recognition device.
The speech recognition process of the speech recognition apparatus shown in fig. 5 is the same as the process described with reference to fig. 1, 2, and 3, and will not be described again.
It should be noted that the technical solutions described in the embodiments of the present invention may be arbitrarily combined without conflict.
In the embodiments provided in the present invention, it should be understood that the disclosed method and apparatus may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
The above description covers only specific implementations of the embodiments of the present invention. Those skilled in the art will understand that any modification or partial replacement that does not depart from the scope of the present invention falls within the scope defined by the claims; the protection scope of the present invention shall therefore be subject to the claims.

Claims (13)

1. A speech recognition method comprising:
acquiring voice input of a user;
selecting a speech database to recognize the speech input by the user, and outputting resulting recognition outputs;
selecting one or more candidate optimal recognition outputs from the recognition outputs using a domain decision; and
performing retrieval and discrimination on the one or more candidate optimal recognition outputs, using the individual identification information of the user as a decision condition, so as to determine the optimal recognition output among the one or more candidate optimal recognition outputs.
2. The speech recognition method of claim 1, wherein the selecting a speech database to recognize speech input by a user comprises:
recognizing the speech input by the user using the acoustic model and the language model of a speech recognition engine, according to the selected speech database.
3. The speech recognition method of claim 1, further comprising:
detecting the individual identification information of the user.
4. The speech recognition method of claim 3, wherein the individual identification information of the user comprises one or more of: geographical location information of the user, a signal source to which a mobile device used by the user is currently connected, and the user's whereabouts.
5. The speech recognition method of claim 4, wherein the individual identification information of the user includes the native place of the user, which is obtained by recognizing dialect and/or accent attributes of the user when the speech input by the user is recognized.
6. The speech recognition method of claim 1, further comprising:
selecting a voice database for voice recognition using the individual identification information of the user.
7. A speech recognition device comprising:
a voice acquisition unit for acquiring a voice input of a user;
a speech recognition unit for selecting a speech database to recognize the speech input by the user, and outputting resulting recognition outputs;
a first decision unit for selecting one or more candidate optimal recognition outputs from the recognition outputs using domain decision; and
a second decision unit for performing retrieval and discrimination on the one or more candidate optimal recognition outputs, using the individual identification information of the user as a decision condition, so as to determine the optimal recognition output among the one or more candidate optimal recognition outputs.
8. The speech recognition device of claim 7, wherein the speech recognition unit is further configured to:
the speech input by the user is recognized using the acoustic model and the language model of the speech recognition engine according to the selected speech database.
9. The speech recognition device of claim 7, further comprising an information detection unit for:
detecting the individual identification information of the user.
10. The speech recognition device of claim 9, wherein the individual identification information of the user comprises one or more of: geographical location information of the user, a signal source to which a mobile device used by the user is currently connected, and the user's whereabouts.
11. The speech recognition device of claim 10, wherein the information detection unit is further configured to: obtain the native place of the user by recognizing dialect and/or accent attributes of the user when the speech input by the user is recognized.
12. The speech recognition device of claim 7, wherein the speech recognition unit is further configured to:
selecting a voice database for voice recognition using the individual identification information of the user.
13. A speech recognition system comprising:
the speech recognition device of any one of claims 7 to 12; and
a client device in communication with the speech recognition device.
CN201610375073.8A 2016-05-31 2016-05-31 Voice recognition method, device and system Active CN105931642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610375073.8A CN105931642B (en) 2016-05-31 2016-05-31 Voice recognition method, device and system


Publications (2)

Publication Number Publication Date
CN105931642A CN105931642A (en) 2016-09-07
CN105931642B true CN105931642B (en) 2020-11-10

Family

ID=56832830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610375073.8A Active CN105931642B (en) 2016-05-31 2016-05-31 Voice recognition method, device and system

Country Status (1)

Country Link
CN (1) CN105931642B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108206020A (en) * 2016-12-16 2018-06-26 北京智能管家科技有限公司 A kind of audio recognition method, device and terminal device
CN109101475B (en) * 2017-06-20 2021-07-27 北京嘀嘀无限科技发展有限公司 Travel voice recognition method and system and computer equipment
TW201921336A (en) 2017-06-15 2019-06-01 大陸商北京嘀嘀無限科技發展有限公司 Systems and methods for speech recognition
CN107464115A (en) * 2017-07-20 2017-12-12 北京小米移动软件有限公司 personal characteristic information verification method and device
CN107785021B (en) * 2017-08-02 2020-06-02 深圳壹账通智能科技有限公司 Voice input method, device, computer equipment and medium
CN110517660A (en) * 2019-08-22 2019-11-29 珠海格力电器股份有限公司 Noise-reduction method and device based on built-in Linux real-time kernel

Citations (1)

Publication number Priority date Publication date Assignee Title
CN104836720A (en) * 2014-02-12 2015-08-12 北京三星通信技术研究有限公司 Method for performing information recommendation in interactive communication, and device

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US6983244B2 (en) * 2003-08-29 2006-01-03 Matsushita Electric Industrial Co., Ltd. Method and apparatus for improved speech recognition with supplementary information
CN103037117B (en) * 2011-09-29 2016-08-03 中国电信股份有限公司 Audio recognition method, system and audio access platform
CN103578469A (en) * 2012-08-08 2014-02-12 百度在线网络技术(北京)有限公司 Method and device for showing voice recognition result
CN103903611B (en) * 2012-12-24 2018-07-03 联想(北京)有限公司 A kind of recognition methods of voice messaging and equipment
CN103956169B (en) * 2014-04-17 2017-07-21 北京搜狗科技发展有限公司 A kind of pronunciation inputting method, device and system
CN105070288B (en) * 2015-07-02 2018-08-07 百度在线网络技术(北京)有限公司 Vehicle-mounted voice instruction identification method and device

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN104836720A (en) * 2014-02-12 2015-08-12 北京三星通信技术研究有限公司 Method for performing information recommendation in interactive communication, and device



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190312

Address after: 100086 8th Floor, 76 Zhichun Road, Haidian District, Beijing

Applicant after: Beijing Jingdong Shangke Information Technology Co., Ltd.

Applicant after: Iflytek Co., Ltd.

Address before: Room C-301, 3rd floor, No. 2 Building, 20 Suzhou Street, Haidian District, Beijing 100080

Applicant before: BEIJING LINGLONG TECHNOLOGY CO., LTD.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant