CN111199730B - Voice recognition method, device, terminal and storage medium - Google Patents

Voice recognition method, device, terminal and storage medium

Info

Publication number: CN111199730B
Application number: CN202010019444.5A
Authority: CN (China)
Prior art keywords: information, voice, voice information, corpus, response
Legal status: Active (assumed; not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN111199730A
Inventors: 孙天炜, 马百鸣, 高璐宇
Current assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Original assignee: Beijing Xiaomi Pinecone Electronic Co Ltd
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202010019444.5A
Publication of application CN111199730A, grant of CN111199730B

Classifications

    • G10L 15/08 - Speech classification or search (under G10L 15/00 Speech recognition)
    • G06F 16/3329 - Natural language query formulation or dialogue systems (under G06F 16/332 Query formulation)
    • G06F 16/3343 - Query execution using phonetics (under G06F 16/334 Query execution)
    • G10L 15/26 - Speech to text systems
    • G10L 25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination (under G10L 25/00, not restricted to groups G10L 15/00 to G10L 21/00)
    • G10L 2015/088 - Word spotting (under G10L 15/08 Speech classification or search)

Abstract

The present disclosure relates to a voice recognition method, apparatus, terminal and storage medium. The method comprises: receiving first voice information; comparing the first voice information with information stored in a corpus of a current user; and, in response to second voice information that satisfies a first similarity condition with the first voice information existing in the corpus, outputting response information of the first voice information based on a current correct recognition result of the second voice information. Because the first voice information is compared against information pre-stored in the corpus, when such second voice information exists the response can be generated from an already-verified recognition result rather than from raw recognition of the first voice information, which improves the accuracy of speech recognition.

Description

Voice recognition method, device, terminal and storage medium
Technical Field
The present disclosure relates to the field of terminal technologies, and in particular, to a voice recognition method, apparatus, terminal, and storage medium.
Background
Terminals are widely used as voice recognition devices, for example in smart homes and vehicles, and intelligent voice assistants are implemented through terminal-side voice recognition. However, in the related art, accurate recognition often requires the user to speak a standard language, so improving the accuracy of speech recognition has become an urgent technical problem to be solved.
Disclosure of Invention
According to a first aspect of the embodiments of the present disclosure, there is provided a speech recognition method, including:
receiving first voice information;
comparing the first voice information with information stored in a corpus of a current user;
and, in response to second voice information that satisfies a first similarity condition with the first voice information existing in the corpus, outputting response information of the first voice information based on a current correct recognition result of the second voice information.
Optionally, the method further comprises:
carrying out voice recognition on the first voice information to obtain first text information;
the comparing the first voice information with information stored in a corpus of a current user includes:
comparing the first text information with historical text information stored in the corpus of the current user;
the outputting response information of the first voice information based on a current correct recognition result of the second voice information in response to the second voice information meeting a first similarity condition with the first voice information in the corpus comprises:
responding to second text information which meets a first similarity condition with the first text information and represents the current correct recognition result of the second voice information in the corpus, and outputting response information of the first voice information based on the second text information
Optionally, the method further comprises:
in response to no second voice information satisfying the first similarity condition with the first voice information existing in the corpus, recognizing the first voice information;
and outputting response information of the first voice information based on third text information obtained by recognizing the first voice information.
Optionally, the method further comprises:
acquiring feedback information based on the response information;
and in response to the fact that the feedback information represents that the third text information is correctly recognized, storing the third text information into the corpus as a current correct recognition result of the first voice information.
Optionally, the method further comprises:
receiving a plurality of pieces of first voice information within a first predetermined time;
determining similarity between the plurality of pieces of first voice information;
acquiring feedback information corresponding to each piece of first voice information;
wherein the storing, in response to the feedback information representing that the third text information is correctly recognized, the third text information in the corpus as a current correct recognition result of the first voice information comprises:
in response to the similarity between the plurality of pieces of first voice information satisfying a second similarity condition and at least one piece of the corresponding feedback information representing that its piece of first voice information is correctly recognized,
storing the third text information corresponding to the correctly recognized first voice information in the corpus as the current correct recognition result of the first voice information.
Optionally, the acquiring feedback information based on the response information includes at least one of:
acquiring confirmation information received within a second predetermined time after the response information is output;
acquiring negative confirmation information received within the second predetermined time after the response information is output;
generating feedback information indicating that the third text information is correctly recognized when no user feedback is received within the second predetermined time after the response information is output;
and generating feedback information indicating that the third text information is recognized incorrectly when a next piece of first voice information satisfying a second similarity condition is received within the second predetermined time after the response information is output.
Optionally, the method further comprises:
receiving a plurality of pieces of first voice information within a third predetermined time;
determining similarity between the plurality of pieces of first voice information;
wherein the storing, in response to the feedback information representing that the third text information is correctly recognized, the third text information in the corpus as a current correct recognition result of the first voice information comprises:
in response to the similarity between the plurality of pieces of first voice information satisfying a second similarity condition, storing third text information corresponding to the last piece of the plurality of pieces of first voice information in the corpus as the current correct recognition result of the first voice information.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech recognition apparatus including:
a first receiving module configured to receive first voice information;
the comparison module is configured to compare the first voice information with information stored in a corpus of a current user;
a first output module configured to, in response to second voice information satisfying a first similarity condition with the first voice information existing in the corpus, output response information of the first voice information based on a current correct recognition result of the second voice information.
Optionally, the apparatus further comprises:
the first recognition module is configured to perform voice recognition on the first voice information to obtain first text information;
the comparison module is further configured to:
comparing the first text information with historical text information stored in the corpus of the current user;
a first output module further configured to:
responding to second text information which meets a first similarity condition with the first text information and represents a current correct recognition result of the second voice information in the corpus, and outputting response information of the first voice information based on the second text information.
Optionally, the apparatus further comprises:
a second recognition module configured to recognize the first voice information in response to no second voice information satisfying a first similarity condition with the first voice information existing in the corpus;
and the second output module is configured to output response information of the first voice message based on the third text message obtained by the first voice message recognition.
Optionally, the apparatus further comprises:
an acquisition module configured to acquire feedback information based on the response information;
and the storage module is configured to respond to the fact that the feedback information represents that the third text information is correctly recognized, and then store the third text information into the corpus as a current correct recognition result of the first voice information.
Optionally, the apparatus further comprises:
a second receiving module configured to receive a plurality of pieces of first voice information within a first predetermined time;
the first determining module is configured to acquire feedback information corresponding to the plurality of pieces of first voice information respectively;
the storage module further configured to:
in response to the similarity between the plurality of pieces of first voice information satisfying a second similarity condition and at least one piece of the corresponding feedback information representing that its piece of first voice information is correctly recognized,
store the third text information corresponding to the correctly recognized first voice information in the corpus as the current correct recognition result of the first voice information.
Optionally, the obtaining module is further configured to at least one of:
acquiring confirmation information received within a second predetermined time after the response information is output;
acquiring negative confirmation information received within the second predetermined time after the response information is output;
generating feedback information indicating that the third text information is correctly recognized when no user feedback is received within the second predetermined time after the response information is output;
and generating feedback information indicating that the third text information is recognized incorrectly when a next piece of first voice information satisfying a second similarity condition is received within the second predetermined time after the response information is output.
Optionally, the apparatus further comprises:
a third receiving module configured to receive a plurality of pieces of first voice information within a third predetermined time;
a fourth determining module configured to determine a similarity between the plurality of pieces of first voice information;
the storage module further configured to:
and in response to the similarity between the plurality of pieces of first voice information satisfying a second similarity condition, store third text information corresponding to the last piece of the plurality of pieces of first voice information in the corpus as the current correct recognition result of the first voice information.
According to a third aspect of the embodiments of the present disclosure, there is provided a terminal, including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the voice recognition method of any of the above embodiments.
According to a fourth aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program for execution by a processor to perform the method steps of any of the above.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
in the embodiments of the present disclosure, first voice information is received and compared with information stored in a corpus of a current user; in response to second voice information satisfying a first similarity condition with the first voice information existing in the corpus, response information of the first voice information is output based on a current correct recognition result of the second voice information. That is, the terminal does not output the response directly from recognition of the first voice information; instead, it finds second voice information in the corpus that satisfies the first similarity condition and outputs the response based on the current correct recognition result of that second voice information. Because the corpus stores information reflecting the individual pronunciation habits or pronunciation characteristics of the user, the first voice information can be recognized correctly even when its pronunciation is not standard, which enables personalized recognition. In other words, the speech recognition of this embodiment tolerates accents and recognizes non-standard pronunciation accurately, improving recognition accuracy. Meanwhile, since the current correct recognition result of the matching second voice information is obtained directly from the corpus, recognition efficiency is also improved, ultimately improving user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a method of speech recognition according to an exemplary embodiment;
FIG. 2 is another flow diagram illustrating a method of speech recognition according to an exemplary embodiment;
FIG. 3 is yet another flow diagram illustrating a method of speech recognition in accordance with an exemplary embodiment;
FIG. 4 is yet another flow diagram illustrating a method of speech recognition in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating a speech recognition apparatus according to an example embodiment;
fig. 6 is a block diagram illustrating a terminal according to an example embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations set forth in the following exemplary embodiments do not represent all implementations consistent with the present invention; rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
FIG. 1 is a flow chart illustrating a method of speech recognition, as shown in FIG. 1, according to an exemplary embodiment, the method including the steps of:
step 101: first voice information is received.
The method of this embodiment is applied to a terminal, and the terminal may be at least one of a mobile phone, a tablet computer, a notebook computer and a smart device. Here, the smart device may be a smart speaker, a smart television, a smart refrigerator, or the like. It should be noted that the smart speaker may be a vehicle-mounted smart speaker or a home smart speaker.
Here, the first voice information includes a voice request of the user, and the voice request may be a voice request for voice control, for example, voice control for controlling the terminal device to perform some operation, such as voice control for turning on music playing. In other embodiments, the voice request may also be a voice request for a voice conversation, such as a voice question for a conversation with the intelligent voice robot, or the like.
Step 102: and comparing the first voice information with information stored in a corpus of a current user.
Here, the information stored in the corpus of the current user may include: historical text information obtained by recognizing historical voice information. For example, the historical text information "please play the song mountain" is obtained by recognizing the historical voice information "please play the song mountain".
In other embodiments, the information stored in the corpus of the current user may further include: historical speech information of the user.
It should be added that the corpus of the current user stores not only the historical text information obtained by recognizing historical voice information, but also the correct recognition result corresponding to that historical text information. For example, the correct recognition result corresponding to the historical text information "please play song mountain" is "please play song three".
Alternatively, the corpus of the current user stores not only the historical voice information of the user but also the correct recognition result corresponding to that historical voice information. For example, the correct recognition result corresponding to the historical voice information "please play song mountain" is "please play song three".
Step 103: in response to second voice information that satisfies a first similarity condition with the first voice information existing in the corpus, outputting response information of the first voice information based on a current correct recognition result of the second voice information.
Here, the first similarity condition may include: the matching degree of the first voice information and the second voice information is larger than the threshold value of the matching degree. That is, the second speech information having a matching degree with the first speech information greater than the threshold matching degree is the second speech information satisfying the first similarity condition with the first speech information.
Here, the second voice information may in practice include historical voice information and/or historical text information recognized from historical voice information.
In the case where the information stored in the corpus includes the user's historical voice information, the matching degree between the first voice information and the second voice information being greater than the matching-degree threshold may mean: if more than a preset proportion of the pronunciation features of the first voice information and the second voice information are the same or similar, their matching degree is considered greater than the threshold. For example, if more than 80% of the pronunciation features of the two pieces of voice information "play song mountain" and "play song three" are the same, their matching degree is considered greater than the matching-degree threshold.
In other embodiments, when the information stored in the corpus of the user includes historical text information identified from the historical speech information of the user, the method further includes: carrying out voice recognition on the first voice information to obtain first text information;
the comparing the first voice message with the information stored in the corpus of the current user includes:
comparing the first text information with historical text information stored in the corpus of the current user;
the outputting response information of the first voice information based on a current correct recognition result of the second voice information in response to the second voice information meeting a first similarity condition with the first voice information in the corpus comprises:
responding to second text information which meets a first similarity condition with the first text information and represents a current correct recognition result of the second voice information in the corpus, and outputting response information of the first voice information based on the second text information.
Here, in this case the matching degree being greater than the matching-degree threshold may mean: if more than a preset proportion of the words in the first text information and in a piece of historical text information are the same, the matching degree between them is considered greater than the matching-degree threshold. For example, if more than 80% of the words in the first text information "love, please close music" match the historical text information "love, please close song", their matching degree is greater than the threshold.
On the contrary, if the matching degree is smaller than or equal to the threshold value of the matching degree, it is indicated that the second voice information meeting the first similarity condition with the first voice information does not exist in the corpus.
In this embodiment, a matching degree greater than the threshold is the basic condition for judging whether second voice information satisfying the first similarity condition exists, and second voice information whose matching degree is less than or equal to the threshold is excluded. This reduces recognition errors caused by blind use of the corpus, i.e. by recognizing speech with corpus entries whose matching degree is too low.
Further, in order to enable more accurate recognition, in some embodiments, the terminal matches the first voice information with a plurality of second voice information in the corpus to obtain a plurality of matching degrees; and if the matching degree is greater than the threshold value of the matching degree and is the highest matching degree in the multiple matching degrees, determining that the first similarity condition is met.
Here, the terminal uses the second voice information with the matching degree satisfying the threshold value of the matching degree and the highest matching degree as the voice information satisfying the first similar condition with the first voice information, so as to improve the accuracy of voice recognition.
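As an illustration only, and not part of the claimed method, the threshold-and-best-match selection described above can be sketched in a few lines of Python; the matching_degree stand-in and the 0.8 threshold are assumptions of this sketch:

```python
from difflib import SequenceMatcher

def matching_degree(first: str, second: str) -> float:
    # Assumed stand-in for the matching-degree computation: the ratio
    # of shared content between two recognized strings.
    return SequenceMatcher(None, first, second).ratio()

def find_similar_entry(first_info: str, corpus: dict, threshold: float = 0.8):
    """Return the verified recognition result of the corpus entry with
    the highest matching degree above the threshold, i.e. the second
    voice information satisfying the first similarity condition, or
    None when no entry qualifies."""
    best_key, best_score = None, threshold
    for stored_info in corpus:
        score = matching_degree(first_info, stored_info)
        if score > best_score:
            best_key, best_score = stored_info, score
    # The corpus maps stored (possibly mis-recognized) text to its
    # verified correct recognition result.
    return corpus[best_key] if best_key is not None else None

# e.g. find_similar_entry("play song mountain",
#                         {"play song mountain": "play song three"})
```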
In other embodiments, the matching the first speech information with a plurality of second speech information in the corpus to obtain a plurality of matching degrees includes:
converting the first voice information into a first voice vector and converting the second voice information into a second voice vector;
and matching the first voice vector with a plurality of second voice vectors to obtain a plurality of matching degrees.
Here, converting voice information into a voice vector means expressing the voice information in vector form. Specifically, each pronunciation feature in the voice information is expressed by the vector, which is constructed, for example, from characteristics of the pronunciation tone or pronunciation audio of the voice information.
The voice vector covers the pronunciation features of one or more pronunciation elements in the first voice information, and each element may have one or more pronunciation features.
For different speakers, the same pronunciation element can sound different: some people pronounce "H" close to "F", while others pronounce "H" as yet other sounds. In general, the same pronunciation element may carry multiple pronunciation features.
It should also be noted that different pronunciation elements have different pronunciation features; for example, the features of "p" differ from those of "a".
Therefore, in this embodiment, matching the first voice vector against the second voice vectors simplifies the computation of voice information matching and improves its efficiency.
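For illustration, a minimal sketch of the vector matching, assuming the voice vectors have already been extracted as fixed-length numeric arrays (the feature extraction itself is not shown):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Matching degree of two voice vectors as the cosine of the angle
    # between them; values near 1.0 indicate similar pronunciation
    # features under this representation.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_against_corpus(first_vec: np.ndarray, second_vecs: list) -> list:
    """Match the first voice vector against the stored second voice
    vectors, yielding the plurality of matching degrees."""
    return [cosine_similarity(first_vec, np.asarray(v)) for v in second_vecs]
```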
Further, the converting the first speech information into a first speech vector and the converting the second speech information into a second speech vector includes:
converting the first voice information into first pinyin information, and converting the first pinyin information into the first voice vector; and
converting the second voice information into second pinyin information, and converting the second pinyin information into the second voice vector.
It can be understood that converting the voice information first into pinyin information and then into a voice vector widens the expressive range compared with converting the voice information into a voice vector directly, yielding a more accurate vector expression. For example, suppose a piece of voice information means "will it rain tomorrow" but is spoken with an accent. If it is converted into a voice vector directly, differences in tone limit the conversion, the matching degree drops, and the match becomes inaccurate. If the voice information "mintianyuma" is first converted into the pinyin information "min tian xia yu ma", it can be matched completely, so the matching accuracy, and ultimately the recognition accuracy, is improved.
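A pinyin-level comparison along these lines might look as follows; the third-party pypinyin package and the use of difflib are assumptions of this sketch, not something the disclosure names:

```python
from difflib import SequenceMatcher

from pypinyin import lazy_pinyin  # third-party package, assumed available

def pinyin_matching_degree(first_text: str, second_text: str) -> float:
    """Convert both recognized texts to toneless pinyin before
    comparing, so tone differences introduced by accent do not
    depress the matching degree."""
    first_py = " ".join(lazy_pinyin(first_text))
    second_py = " ".join(lazy_pinyin(second_text))
    return SequenceMatcher(None, first_py, second_py).ratio()

# "张山" and "张三" differ by a single letter of toneless pinyin
# ("zhang shan" vs "zhang san"), so the matching degree is high and
# the two requests can satisfy the first similarity condition.
print(pinyin_matching_degree("播放张山的歌", "播放张三的歌"))
```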
Of course, in other embodiments, if the second speech information includes historical text information identified from the historical speech information, the method further includes:
identifying the first voice information to obtain first text information;
the matching the first speech information with the plurality of second speech information in the corpus to obtain a plurality of matching degrees includes:
converting the first text information into a first word vector and converting the historical text information into second word vectors;
and matching the first word vector with a plurality of second word vectors to obtain a plurality of matching degrees.
Here, converting text information into a word vector means expressing the text information in vector form; for example, the word vector is constructed according to the meanings of the words in the text information.
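For illustration, a sketch of the word-vector matching in which per-character counts stand in for a real Chinese word segmenter (an assumption of the sketch; any tokenizer could be substituted):

```python
import math
from collections import Counter

def to_word_vector(text: str) -> Counter:
    # Per-character counts as a stand-in for proper word segmentation.
    return Counter(text)

def word_vector_matching_degree(first_text: str, second_text: str) -> float:
    """Cosine similarity between the first word vector and a second
    word vector built from historical text information."""
    a, b = to_word_vector(first_text), to_word_vector(second_text)
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```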
It should be understood that the corpus stores the correct recognition results of multiple pieces of historical voice information, or the correct recognition results corresponding to the historical text information recognized from them, where a correct recognition result is one that has been verified as correct. For example, the historical text information recognized from the second voice information is "please play song mountain", while the second text information, i.e. the current correct recognition result of that second voice information, is "please play song three". In other words, the second text information is the calibrated, correct recognition result of the second voice information; even if the user's pronunciation of the word "three" is not standard, recognizing via the second text information corresponding to the second voice information still yields the correct result.
Therefore, compared with the case of directly performing speech recognition by using the first speech information, in this embodiment, if the second speech information satisfying the first similarity condition with the first speech information exists in the corpus, the response information of the speech request is output based on the second text information representing the current correct recognition result of the second speech information, so that the phenomenon of inaccurate speech recognition due to the possible nonstandard pronunciation characteristics in the first speech information can be reduced.
It should be noted that different users may have different corpora. For example, Zhang San and Li Si have different accents and their own characteristic pronunciations of particular words, so Zhang San's corpus differs from Li Si's. That is, different corpora can provide accurate speech recognition results for different users. In this embodiment, the correct recognition result of the second voice information satisfying the first similarity condition with the first text information is obtained by comparison against the information stored in the corpus, which improves recognition accuracy for the individual and facilitates personalization of the terminal's speech recognition.
It should be added that in some embodiments the corpus is stored on the terminal device. When voice recognition is needed, the corpus can therefore be called directly from local storage, the second voice information satisfying the first similarity condition with the first voice information can be found, and the first voice information can thereby be recognized to obtain a voice recognition result.
In this embodiment, the operation of retrieving the corpus does not need to rely on a network, and speech recognition can be performed directly based on the local corpus.
In other embodiments, the corpus is stored in a server, and when speech recognition is required, the corpus corresponding to the identification information is called from the server based on terminal identification information. Therefore, the storage space of the terminal equipment can be saved, and the performance of the terminal equipment is improved.
In other embodiments, to reduce user operations, the user does not need to build a corpus for his own voice in advance. User corpora for different regions may be stored in the server, and when voice recognition is required, the corpus corresponding to the terminal's location information is called from the server. For example, the corpus for the northeast region differs from the corpus for the Hunan region, and also from the corpora for the Fujian and Guangdong regions. In this way, the user does not need to set up a corpus in advance, which reduces user operations.
Here, the geographic location information may be the current geographic location information of the terminal, or may also be the geographic location information set by the terminal, which is not limited herein.
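A small sketch of the corpus selection just described; the store layout and lookup keys are hypothetical:

```python
from typing import Optional

def load_corpus(terminal_id: Optional[str], region: Optional[str],
                store: dict) -> dict:
    """Prefer the per-user corpus keyed by terminal identification
    information; otherwise fall back to the regional corpus keyed by
    geographic location information."""
    by_terminal = store.get("by_terminal", {})
    if terminal_id is not None and terminal_id in by_terminal:
        return by_terminal[terminal_id]
    return store.get("by_region", {}).get(region, {})
```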
In summary, in this embodiment the terminal can correct the first voice information based on the corpus, so users can be recognized correctly without standard pronunciation. The method adapts to various groups of people, tolerates various accents, and facilitates personalized speech recognition on the terminal; and because the corpus provides the verified result rather than recognition being performed from scratch, speech recognition efficiency is improved.
Referring to fig. 2 as an alternative embodiment, fig. 2 is another flow chart illustrating a speech recognition according to an exemplary embodiment, as shown in fig. 2, the method further includes:
step 201: identifying the first voice information in response to the fact that second voice information meeting a first similarity condition with the first voice information does not exist in the corpus;
step 202: and outputting response information of the first voice message based on the third text message obtained by the first voice message recognition.
Here, when there is no second Speech information satisfying a first similarity condition with the first Speech information in the corpus, ASR (Automatic Speech Recognition) is used to recognize the first Speech information, so as to obtain third text information corresponding to the first Speech information, and response information of the first Speech information is output.
In this embodiment, when no second voice information satisfying the first similarity condition with the first voice information exists, the terminal can automatically recognize the first voice information based on ASR to obtain the third text information and the corresponding response information. This reduces the chance that a voice request goes unanswered when no sufficiently similar second voice information is stored in the corpus.
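Putting the two branches together, a sketch of this fallback logic, reusing find_similar_entry from the earlier sketch; run_asr and build_response are placeholders rather than a real API:

```python
def respond(first_info: str, corpus: dict) -> str:
    """Prefer the verified corpus result; fall back to plain ASR when
    no second voice information satisfies the first similarity
    condition with the first voice information."""
    result = find_similar_entry(first_info, corpus)
    if result is None:
        result = run_asr(first_info)   # third text information
    return build_response(result)      # response information
```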
As another alternative embodiment, referring to fig. 3, fig. 3 is a further flowchart illustrating a speech recognition according to an exemplary embodiment, as shown in fig. 3, the method further includes:
step 301: acquiring feedback information based on the response information;
step 302: and in response to the feedback information representing that the third text information is correctly recognized, storing the third text information into the corpus as a current correct recognition result of the first voice information.
Here, the terminal receives feedback information based on the response information and determines from it whether the third text information was correctly recognized; if the feedback information represents a correct recognition, the third text information is stored in the corpus as the current correct recognition result of the first voice information. In subsequent recognitions, the correct result can then be found directly in the corpus and used for recognition, improving both the accuracy and the efficiency of speech recognition.
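A minimal sketch of this feedback-driven update, assuming the corpus is a mapping from voice-request text to its verified recognition result:

```python
def on_feedback(first_info: str, third_text: str,
                recognized_correctly: bool, corpus: dict) -> None:
    """Store the third text information as the current correct
    recognition result of the first voice information only when the
    feedback information represents a correct recognition."""
    if recognized_correctly:
        corpus[first_info] = third_text
```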
As an optional embodiment, the method further comprises:
receiving a plurality of pieces of first voice information within a first predetermined time;
the acquiring feedback information based on the response information includes:
acquiring feedback information corresponding to each piece of first voice information;
the step of, in response to the feedback information representing that the third text information is correctly recognized, storing the third text information as a current correct recognition result of the first speech information in the corpus, includes:
in response to the similarity between the plurality of pieces of first voice information satisfying a second similarity condition and at least one piece of the corresponding feedback information representing that its piece of first voice information is correctly recognized,
storing the third text information corresponding to the correctly recognized first voice information in the corpus as the current correct recognition result of the first voice information.
Here, the second similarity condition may include: the matching degree between the plurality of pieces of first voice information is greater than a preset value. For example, if more than a predetermined proportion of the words in the third text information corresponding to each piece of first voice information are the same, the matching degree between those pieces of first voice information can be considered greater than the preset value.
The second similarity may be the same as or different from the first similarity.
In practical applications, in order to ensure accuracy of speech recognition and to increase richness of information stored in the corpus, the first similarity is greater than the second similarity.
In fact, the preset value here can be set low in practice, so that even just the consecutive voice information received within the first predetermined time can be grouped as first voice information satisfying the second similarity condition, which ensures the richness of the information stored in the corpus.
Specifically, when the terminal receives multiple pieces of first voice information satisfying the second similarity condition within the first predetermined time, the user is evidently issuing the same request repeatedly, which indicates that some of the responses to those requests were not the correct response information. Only when at least one piece of first voice information is correctly recognized does the third text information corresponding to that correctly recognized piece need to be used as the current correct recognition result.
For example, within the first predetermined time the terminal receives three pieces of first voice information satisfying the second similarity condition, namely "play song mountain", "play song dealer" and "play song three", and only the piece "play song three" is correctly recognized.
In this embodiment, the terminal may store in the corpus the correspondence between all pieces of first voice information satisfying the second similarity condition and the third text information determined to be correctly recognized: here, the three pieces "play song mountain", "play song dealer" and "play song three" all map to the correctly recognized "play song three". The next time the user issues "play song mountain" as a voice request, the text information "play song three" can still be recognized correctly, which guarantees recognition accuracy. Moreover, because the correctly recognized third text information is derived from multiple pieces of first voice information, the user can be recognized correctly next time without standard pronunciation, realizing personalized recognition and improving user experience.
Further, in other embodiments, the user's brief voice information "song mountain" may also serve as one of the N pieces of first voice information satisfying the second similarity condition. In that case, the next time the user speaks, the short voice request alone suffices to recognize "play song three", improving user experience.
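A sketch of the batch update described above, reusing the hypothetical matching_degree helper; the 0.6 threshold for the second similarity condition is an assumption:

```python
def store_from_batch(batch: list, corpus: dict,
                     threshold: float = 0.6) -> None:
    """batch holds (first_info, third_text, recognized_correctly)
    triples received within the first predetermined time. If every
    pair satisfies the second similarity condition and at least one
    piece was recognized correctly, map every variant to the verified
    third text information."""
    similar = all(matching_degree(a[0], b[0]) > threshold
                  for a in batch for b in batch)
    verified = next((text for _, text, ok in batch if ok), None)
    if similar and verified is not None:
        for first_info, _, _ in batch:
            corpus[first_info] = verified
```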
As another optional embodiment, in step 301, the obtaining of the feedback information based on the response information includes at least one of:
acquiring confirmation information received within a second predetermined time after the response information is output;
acquiring negative confirmation information received within the second predetermined time after the response information is output;
generating feedback information indicating that the third text information is correctly recognized when no user feedback is received within the second predetermined time after the response information is output;
and generating feedback information indicating that the third text information is recognized incorrectly when a next piece of first voice information satisfying a second similarity condition is received within the second predetermined time after the response information is output.
Here, the second predetermined time may be different from the first predetermined time, and the second predetermined time may include: the average time or the longest time between the terminal outputting the response information historically and receiving the feedback information historically based on the response information.
The first predetermined time may be a feedback time including N consecutive pieces of the first voice information, that is, the first predetermined time includes at least N-1 second predetermined times.
Here, the acquiring of the confirmation information received within the second predetermined time after the response information is output may include: the terminal acquiring a voice reply indicating confirmation received within that time.
For example, when the terminal receives a voice reply indicating confirmation, such as "OK" or "thank you", within the second predetermined time after the response information is output, it indicates the user's confirmation of the response information. This is a positive confirmation, and the feedback information for the response information then indicates that the third text information was correctly recognized.
In other embodiments, the acquiring of the confirmation information may further include: the terminal acquiring first operation information received within the second predetermined time after the response information is output, the first operation information indicating the user's confirmation of the response information.
For example, the terminal receives a first operation on a "confirmation control" within the second predetermined time after the response information is output, which indicates the user's confirmation of the response information. This positive confirmation means the feedback information indicates that the third text information was correctly recognized.
Here, the acquiring of the negative confirmation information received within the second predetermined time after the response information is output may include: the terminal acquiring a voice reply indicating denial received within that time.
For example, when the terminal receives a voice reply indicating denial, such as "no" or a request to retry, within the second predetermined time after the response information is output, it indicates the user's denial of the response information. This denial means the third text information was recognized incorrectly, i.e. the feedback information for the response information indicates an incorrect recognition.
In other embodiments, the acquiring of the negative confirmation information may further include: the terminal acquiring second operation information received within the second predetermined time after the response information is output, the second operation information indicating the user's denial of the response information.
For example, the terminal receives a second operation on a "deny control" within the second predetermined time after the response information is output, which indicates the user's denial of the response information. The feedback information then indicates that the third text information was recognized incorrectly.
Generating feedback information indicating that the third text information is correctly recognized when no user feedback is received within the second predetermined time after the response information is output can be understood as follows: the absence of user feedback within that time indicates that the response information is correct, since the user issues no further voice request or other feedback; feedback information indicating a correct recognition is therefore generated.
For example, for a smart speaker, if the voice request initiated by the user is to play an English song and the response made by the smart speaker is indeed to play an English song, the user gives no further feedback on this response. The terminal receiving no user feedback within the predetermined time thus indicates that the response information is what the user wanted, so feedback information indicating that the third text information was correctly recognized is generated.
Generating feedback information indicating that the third text information is recognized incorrectly when a next piece of first voice information satisfying the second similarity condition is received within the second predetermined time after the response information is output can be understood as follows: receiving such a similar request so soon indicates that the response to the first voice information was wrong, prompting the user to promptly re-issue a similar voice request; feedback information indicating an incorrect recognition of the third text information is therefore generated.
Again taking the example of a user requesting an English song: suppose the user did not express the voice request completely, or the expression deviated, so the response made by the smart speaker is not to play an English song. The user then promptly issues a second voice request in reaction to the wrong response, i.e. the terminal receives another piece of first voice information satisfying the second similarity condition. From this it can be determined that the third text information recognized from the current first voice information is wrong, and feedback information indicating an incorrect recognition is generated.
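The four feedback cases just walked through can be summarized in a sketch like the following; the event labels, the reuse of matching_degree, and the 0.6 threshold are all hypothetical:

```python
from enum import Enum

class Feedback(Enum):
    CORRECT = "third text information recognized correctly"
    WRONG = "third text information recognized incorrectly"

def classify_feedback(event: str, next_info: str = "",
                      prev_info: str = "") -> Feedback:
    """Map what the terminal observes within the second predetermined
    time after outputting the response information to feedback about
    the third text information."""
    if event == "confirm":   # e.g. a voice reply such as "OK"
        return Feedback.CORRECT
    if event == "deny":      # e.g. a denying reply or a deny control
        return Feedback.WRONG
    if event == "repeat" and matching_degree(next_info, prev_info) > 0.6:
        return Feedback.WRONG   # a similar request was re-issued
    return Feedback.CORRECT     # silence: no user feedback received
```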
As another optional embodiment, the method further comprises:
receiving, by the terminal, a plurality of pieces of first voice information within a third predetermined time;
determining similarity between the first voice information;
the step of, in response to the feedback information representing that the third text information is correctly recognized, storing the third text information as a current correct recognition result of the first speech information in the corpus, includes:
and in response to that the similarity among the first voice information meets a second similarity condition, storing third text information corresponding to the last first voice information in the first voice information as a current correct recognition result of the first voice information in the corpus.
It can be understood that when the terminal receives multiple pieces of first voice information within the third predetermined time, no new first voice information follows the last piece. Whether or not feedback information based on the response to that last piece is received, the third text information corresponding to the last piece of first voice information can therefore be treated as correct by default.
Here, the third predetermined time may be the same as or different from the first predetermined time.
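A sketch of the last-piece-wins rule, again with the hypothetical matching_degree helper and an assumed threshold:

```python
def store_last_of_batch(batch: list, corpus: dict,
                        threshold: float = 0.6) -> None:
    """batch holds (first_info, third_text) pairs received within the
    third predetermined time. When every pair satisfies the second
    similarity condition, the last piece's third text information is
    taken as the current correct recognition result by default, since
    no further retry followed it."""
    if not batch:
        return
    similar = all(matching_degree(a[0], b[0]) > threshold
                  for a in batch for b in batch)
    if similar:
        _, last_text = batch[-1]
        for first_info, _ in batch:
            corpus[first_info] = last_text
```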
Further, the present disclosure provides a specific embodiment to further understand the speech recognition method provided by the embodiment of the present disclosure.
It should be noted that the speech recognition principle of ASR is to analyze whether general features of a recognized sentence conform to a trained speech model in order to correct errors the sentence may contain. First, a large amount of speech is used for model training to obtain a general speech model. The recognized sentence is then preprocessed, e.g. by word and sentence segmentation, and the model computes the overall probability of the sentence and the probability of each segmented word appearing in it. When a probability falls below a specified threshold, an error is suspected; candidate corrections are then selected by pinyin similarity, edit distance and similar measures, and a scoring strategy picks the best candidate as the replacement. Although existing ASR handles general recognition problems and noise in the user's audio well, it is a generic method: the trained model reflects overall human habits and does not consider an individual user's speech habits, such as accent.
Referring to fig. 4, fig. 4 is a further flowchart illustrating a method of speech recognition according to an exemplary embodiment, as shown in fig. 4, the method comprising:
step 401: receiving first voice information;
here, the step 401 may be understood as the step 101 described in the above embodiment.
Step 402: calling a personal corpus;
here, step 402 may be understood as invoking a personal corpus from a local terminal or a server, where the personal corpus may be understood as the corpus in the above embodiments.
In some embodiments, the terminal may pre-construct a personal corpus for the user, which includes the user's historical voice information and the historical text information recognized from it. Here, the user's historical voice information may be understood as the second voice information described in the above embodiments.
Step 403: calculating the similarity between the first text information obtained by recognizing the first voice information and the historical text information in the personal corpus;
Step 403 can be understood as the terminal, as described in the above embodiments, matching the first text information obtained by recognizing the first voice information against multiple pieces of second voice information in the personal corpus to obtain multiple matching degrees; the similarity here corresponds to the matching degree described in the above embodiments.
Step 404: judging whether the similarity is greater than a threshold value, if so, executing a step 405, and if not, executing a step 407;
step 405: selecting historical text information with the maximum similarity;
here, the selected historical text information with the maximum similarity may be understood as that the embodiment selects the historical text information which satisfies the first similarity condition with the first text information obtained by the first speech information recognition, that is, the historical text information which satisfies that the matching degree is greater than the matching degree threshold and corresponds to the highest matching in the matching degrees.
Step 406: outputting a first voice recognition result;
in step 406, a first speech recognition result of the first speech information is output based on the second text information, which is actually the selected maximum similarity and is characteristic of correct recognition of the second speech information.
Step 407: outputting a second voice recognition result based on the first voice information.
In step 407, it may be understood that, if the similarity is not greater than the threshold, a second voice recognition result of the first voice information is output based on the first voice information itself. That is, if the similarity is not greater than the threshold, no second voice information similar to the first voice information exists in the personal corpus, and therefore speech recognition may be performed directly on the first voice information using the general ASR method.
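Under the assumption that the matching degree of step 403 can be approximated by a simple character-overlap ratio, the branch in steps 404 to 407 might look like the following sketch; the names `personal_corpus` and `general_asr`, and the threshold value, are illustrative choices, not the patent's actual API.

```python
# Sketch of steps 403-407: compare the recognized text against the personal
# corpus; above the threshold, reuse the stored correct result (steps 405-406),
# otherwise fall back to general ASR (step 407).
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.8  # the threshold of step 404 (illustrative value)

def recognize(first_text, personal_corpus, general_asr):
    # personal_corpus maps historical recognized text -> correct recognition.
    scored = [
        (SequenceMatcher(None, first_text, hist).ratio(), correct)
        for hist, correct in personal_corpus.items()
    ]
    best_score, best_correct = max(scored, default=(0.0, None))
    if best_score > MATCH_THRESHOLD:
        return best_correct          # first voice recognition result (step 406)
    return general_asr(first_text)   # second voice recognition result (step 407)

corpus = {"play zou jielun": "play Jay Chou"}
print(recognize("play zhou jielun", corpus, lambda t: t))  # -> 'play Jay Chou'
```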
Step 408: updating the personal corpus with the third text information correctly recognized from the first voice information.
Here, step 408 may be understood as storing the third text information recognized from the first voice information into the personal corpus if the feedback information indicates that the second voice recognition result is correct.
In some embodiments, for a plurality of consecutive voice requests with high similarity, that is, when the terminal continuously receives a plurality of pieces of first voice information satisfying the second similarity condition within the first predetermined time, the third text information recognized from the first voice information corresponding to the last voice request may be considered the correctly recognized third text information. The terminal may therefore update the personal corpus with this third text information and the first voice information corresponding to the consecutive voice requests.
In this embodiment, the corrected second voice information in the constructed personal corpus is used to recognize the user's first voice information by mapping it to the current correct recognition result, so that the user's personal language habits are fully utilized: the correction speed is increased, and the correction accuracy is also improved.
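The consecutive-request update of step 408 can be sketched as follows; the 30-second window and the similarity test are assumptions made for the example, and `corpus` is the same illustrative mapping used in the previous sketch.

```python
# Sketch of step 408 with the consecutive-request rule: a burst of mutually
# similar requests within the first predetermined time suggests the earlier
# recognitions failed, so the last request's text is stored as the correct
# result for all of them.
import time
from difflib import SequenceMatcher

class CorpusUpdater:
    def __init__(self, window_seconds=30.0, similarity=0.8):
        self.window_seconds = window_seconds
        self.similarity = similarity
        self.recent = []  # (timestamp, recognized_text) of recent requests

    def observe(self, third_text, corpus, now=None):
        now = time.time() if now is None else now
        # Keep only requests inside the first predetermined time.
        self.recent = [(t, s) for t, s in self.recent
                       if now - t < self.window_seconds]
        similar = [s for _, s in self.recent
                   if SequenceMatcher(None, s, third_text).ratio() > self.similarity]
        if similar:
            for earlier_text in similar:
                corpus[earlier_text] = third_text  # last request treated as correct
        self.recent.append((now, third_text))

updater, corpus = CorpusUpdater(), {}
updater.observe("play zou jielun", corpus, now=0.0)
updater.observe("play zhou jielun", corpus, now=5.0)
print(corpus)  # -> {'play zou jielun': 'play zhou jielun'}
```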
FIG. 5 is a block diagram illustrating a speech recognition device according to an example embodiment. Referring to fig. 5, the apparatus includes a first receiving module 51, a comparison module 52, and a first output module 53, wherein:
the first receiving module 51 is configured to receive first voice information;
the comparison module 52 is configured to compare the first voice information with information stored in a corpus of a current user;
the first output module 53 is configured to, in response to that second voice information which satisfies a first similarity condition with the first voice information exists in the corpus, output response information of the first voice information based on a current correct recognition result of the second voice information.
In some embodiments, the apparatus further comprises:
the first recognition module is configured to perform voice recognition on the first voice information to obtain first text information;
the comparison module is further configured to:
comparing the first text information with historical text information stored in the corpus of the current user;
the first output module is further configured to:
responding to second text information which meets a first similarity condition with the first text information and represents a current correct recognition result of the second voice information in the corpus, and outputting response information of the first voice information based on the second text information.
In some embodiments, the apparatus further comprises:
a second recognition module configured to recognize the first voice information in response to no second voice information satisfying a first similarity condition with the first voice information existing in the corpus;
and a second output module configured to output response information of the first voice information based on the third text information obtained by recognizing the first voice information.
In some embodiments, the apparatus further comprises:
an obtaining module configured to acquire feedback information based on the response information;
and a storage module configured to, in response to the feedback information representing that the third text information is correctly recognized, store the third text information into the corpus as the current correct recognition result of the first voice information.
In some embodiments, the apparatus further comprises:
a second receiving module configured to receive a plurality of pieces of first voice information within a first predetermined time;
a first determining module configured to determine a similarity between the plurality of pieces of first voice information;
the obtaining module is further configured to obtain feedback information corresponding to the plurality of pieces of first voice information respectively;
the storage module further configured to:
in response to the similarity among the plurality of pieces of first voice information satisfying a second similarity condition, and at least one piece of feedback information, among the feedback information corresponding to the plurality of pieces of first voice information respectively, representing that the corresponding first voice information is correctly recognized,
storing the third text information corresponding to the correctly recognized first voice information into the corpus as the current correct recognition result of the first voice information.
In some embodiments, the obtaining module is further configured to perform at least one of the following, as illustrated in the sketch after this list:
acquiring confirmation information received within a second preset time after the response information is output;
acquiring negative confirmation information received within a second preset time after the response information is output;
when it is determined that no user feedback is received within the second preset time after the response information is output, generating feedback information indicating that the third text information is correctly recognized;
and when the next piece of first voice information meeting a second similarity condition is received within the second preset time after the response information is output, generating feedback information indicating that the third text information is incorrectly recognized.
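As a hedged summary of these four cases, the sketch below reduces each to a boolean event observed within the second preset time; the function and parameter names are illustrative only.

```python
# Sketch of the feedback rules: explicit confirmation, or silence within the
# window, counts as correct recognition; explicit negation, or a repeated
# similar request, counts as incorrect. Returns None while still undecided.
def feedback(confirmed, negated, similar_request_repeated, window_elapsed):
    if confirmed:
        return True    # confirmation information received
    if negated or similar_request_repeated:
        return False   # negative confirmation, or the user re-asked
    if window_elapsed:
        return True    # no user feedback within the second preset time
    return None        # still waiting inside the window

print(feedback(False, False, False, True))  # -> True (silence = correct)
```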
In some embodiments, the apparatus further comprises:
a third receiving module configured to receive a plurality of pieces of first voice information within a third predetermined time;
a fourth determining module configured to determine a similarity between the plurality of pieces of first voice information;
the storage module further configured to:
and in response to the similarity among the plurality of pieces of first voice information satisfying a second similarity condition, storing the third text information corresponding to the last piece of first voice information as the current correct recognition result of the first voice information in the corpus.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 6 is a block diagram illustrating a terminal 600 according to an example embodiment. For example, the terminal 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, terminal 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operation of the terminal 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 may include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operations at the terminal 600. Examples of such data include instructions for any application or method operating on terminal 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks.
The power component 606 provides power to the various components of terminal 600. The power component 606 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for terminal 600.
The multimedia component 608 comprises a screen providing an output interface between the terminal 600 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the terminal 600 is in an operation mode, such as a photographing mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the terminal 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing various aspects of status assessment for the terminal 600. For example, sensor component 614 can detect an open/closed state of terminal 600, relative positioning of components, such as a display and keypad of terminal 600, change in position of terminal 600 or a component of terminal 600, presence or absence of user contact with terminal 600, orientation or acceleration/deceleration of terminal 600, and temperature change of terminal 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communication between the terminal 600 and other terminals in a wired or wireless manner. The terminal 600 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the terminal 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium, wherein instructions of the storage medium, when executed by a processor of a terminal, enable the terminal to perform the speech recognition method according to the above embodiments.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (16)

1. A speech recognition method, comprising:
receiving first voice information;
comparing the first voice information with information stored in a corpus of a current user; the information stored in the corpus of the current user includes: historical voice information of the user;
responding to second voice information which meets a first similarity condition with the first voice information in the corpus, and outputting response information of the first voice information based on a current correct recognition result of the second voice information; the first similarity condition includes: the proportion of identical or similar pronunciation features between the first voice information and the second voice information exceeds a preset proportion.
2. The method of claim 1, further comprising:
performing voice recognition on the first voice information to obtain first text information;
the comparing the first voice information with information stored in a corpus of a current user includes:
comparing the first text information with historical text information stored in the corpus of the current user;
the outputting response information of the first voice information based on a current correct recognition result of the second voice information in response to the second voice information meeting a first similarity condition with the first voice information in the corpus comprises:
responding to second text information which meets a first similarity condition with the first text information and represents a current correct recognition result of the second voice information in the corpus, and outputting response information of the first voice information based on the second text information.
3. The method according to claim 1 or 2, characterized in that the method further comprises:
identifying the first voice information in response to the fact that second voice information meeting a first similarity condition with the first voice information does not exist in the corpus;
and outputting response information of the first voice message based on the third text message obtained by the first voice message recognition.
4. The method of claim 3, further comprising:
acquiring feedback information based on the response information;
and in response to the feedback information representing that the third text information is correctly recognized, storing the third text information into the corpus as a current correct recognition result of the first voice information.
5. The method of claim 4, further comprising:
receiving a plurality of pieces of first voice information within first preset time;
determining similarity between the first voice information;
the acquiring feedback information based on the response information includes:
acquiring feedback information corresponding to the first voice information respectively;
the step of, in response to the feedback information representing that the third text information is correctly recognized, storing the third text information as a current correct recognition result of the first speech information in the corpus, includes:
and in response to that the similarity among the first voice information meets a second similarity condition and at least one piece of feedback information exists in the feedback information corresponding to the first voice information respectively to represent that the corresponding first voice information is correctly recognized, storing third text information corresponding to the correctly recognized first voice information as a current correct recognition result of the first voice information in the corpus.
6. The method according to claim 4 or 5, wherein the obtaining feedback information based on the response information comprises at least one of:
acquiring confirmation information received within a second preset time after the response information is output;
acquiring negative confirmation information received within a second preset time after the response information is output;
when it is determined that no user feedback is received within the second preset time after the response information is output, generating feedback information indicating that the third text information is correctly identified;
and when the next piece of first voice information meeting a second similarity condition is received within the second preset time after the response information is output, generating feedback information indicating that the third text information is incorrectly identified.
7. The method of claim 4, further comprising:
receiving a plurality of pieces of first voice information within a third preset time;
determining similarity between the first voice information;
the step of, in response to the feedback information representing that the third text information is correctly recognized, storing the third text information as a current correct recognition result of the first speech information in the corpus, includes:
and in response to the similarity among the plurality of pieces of first voice information meeting a second similarity condition, storing third text information corresponding to the last piece of first voice information as a current correct recognition result of the first voice information in the corpus.
8. A speech recognition apparatus, comprising:
a first receiving module configured to receive first voice information;
the comparison module is configured to compare the first voice information with information stored in a corpus of a current user; the information stored in the corpus of the current user includes: historical voice information of the user;
a first output module configured to output response information of the first voice information based on a current correct recognition result of second voice information in response to the second voice information meeting a first similarity condition with the first voice information existing in the corpus; the first similarity condition includes: the proportion of identical or similar pronunciation features between the first voice information and the second voice information exceeds a preset proportion.
9. The apparatus of claim 8, further comprising:
the first recognition module is configured to perform voice recognition on the first voice information to obtain first text information;
the comparison module is further configured to:
comparing the first text information with historical text information stored in the corpus of the current user;
the first output module is further configured to:
responding to second text information which meets a first similarity condition with the first text information and represents a current correct recognition result of the second voice information in the corpus, and outputting response information of the first voice information based on the second text information.
10. The apparatus of claim 8 or 9, further comprising:
a second recognition module configured to recognize the first voice information in response to no second voice information satisfying a first similarity condition with the first voice information existing in the corpus;
and a second output module configured to output response information of the first voice information based on the third text information obtained by the first voice information recognition.
11. The apparatus of claim 10, further comprising:
an obtaining module configured to acquire feedback information based on the response information;
and a storage module configured to, in response to the feedback information representing that the third text information is correctly recognized, store the third text information into the corpus as a current correct recognition result of the first voice information.
12. The apparatus of claim 11, further comprising:
a second receiving module configured to receive a plurality of pieces of first voice information within a first predetermined time;
a first determination module configured to determine a similarity between the plurality of pieces of first voice information;
the obtaining module is further configured to obtain feedback information corresponding to the plurality of pieces of first voice information respectively;
the storage module further configured to:
responding to that the similarity among the first voice information satisfies a second similarity condition, and at least one piece of feedback information exists in the feedback information corresponding to the first voice information respectively and represents that the corresponding first voice information is correctly identified,
and storing the third text information corresponding to the correctly recognized first voice information into the corpus as the current correct recognition result of the first voice information.
13. The apparatus of claim 11 or 12, wherein the obtaining module is further configured to perform at least one of:
acquiring confirmation information received within a second preset time after the response information is output;
acquiring negative confirmation information received within a second preset time after the response information is output;
when it is determined that no user feedback is received within the second preset time after the response information is output, generating feedback information indicating that the third text information is correctly identified;
and when the next piece of first voice information meeting a second similarity condition is received within the second preset time after the response information is output, generating feedback information indicating that the third text information is incorrectly identified.
14. The apparatus of claim 11, further comprising:
a third receiving module configured to receive a plurality of pieces of first voice information within a third predetermined time;
a fourth determining module configured to determine a similarity between the plurality of pieces of first voice information;
the storage module further configured to:
and in response to the similarity among the plurality of pieces of first voice information meeting a second similarity condition, storing third text information corresponding to the last piece of first voice information into the corpus as a current correct recognition result of the first voice information.
15. A terminal, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to implement the speech recognition method of any one of claims 1 to 7 when executing the instructions.
16. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method steps of any one of claims 1 to 7.
CN202010019444.5A 2020-01-08 2020-01-08 Voice recognition method, device, terminal and storage medium Active CN111199730B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010019444.5A CN111199730B (en) 2020-01-08 2020-01-08 Voice recognition method, device, terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010019444.5A CN111199730B (en) 2020-01-08 2020-01-08 Voice recognition method, device, terminal and storage medium

Publications (2)

Publication Number Publication Date
CN111199730A CN111199730A (en) 2020-05-26
CN111199730B true CN111199730B (en) 2023-02-03

Family

ID=70746924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010019444.5A Active CN111199730B (en) 2020-01-08 2020-01-08 Voice recognition method, device, terminal and storage medium

Country Status (1)

Country Link
CN (1) CN111199730B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349279A (en) * 2020-06-16 2021-02-09 深圳Tcl新技术有限公司 Remote semantic recognition method, device, equipment and computer readable storage medium
CN111933107A (en) * 2020-09-04 2020-11-13 珠海格力电器股份有限公司 Speech recognition method, speech recognition device, storage medium and processor
CN113077803B (en) * 2021-03-16 2024-01-23 联想(北京)有限公司 Voice processing method and device, readable storage medium and electronic equipment

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8775177B1 (en) * 2012-03-08 2014-07-08 Google Inc. Speech recognition process
KR102215579B1 (en) * 2014-01-22 2021-02-15 삼성전자주식회사 Interactive system, display apparatus and controlling method thereof
CN105810188B (en) * 2014-12-30 2020-02-21 联想(北京)有限公司 Information processing method and electronic equipment
CN105489220B (en) * 2015-11-26 2020-06-19 北京小米移动软件有限公司 Voice recognition method and device
CN108231063A (en) * 2016-12-13 2018-06-29 ***通信有限公司研究院 A kind of recognition methods of phonetic control command and device
CN108174030B (en) * 2017-12-26 2020-11-17 努比亚技术有限公司 Customized voice control implementation method, mobile terminal and readable storage medium
CN109036424A (en) * 2018-08-30 2018-12-18 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium
CN109166581A (en) * 2018-09-26 2019-01-08 出门问问信息科技有限公司 Audio recognition method, device, electronic equipment and computer readable storage medium
CN109726265A (en) * 2018-12-13 2019-05-07 深圳壹账通智能科技有限公司 Assist information processing method, equipment and the computer readable storage medium of chat
CN110556127B (en) * 2019-09-24 2021-01-01 北京声智科技有限公司 Method, device, equipment and medium for detecting voice recognition result

Also Published As

Publication number Publication date
CN111199730A (en) 2020-05-26

Similar Documents

Publication Publication Date Title
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN111199730B (en) Voice recognition method, device, terminal and storage medium
CN105489220B (en) Voice recognition method and device
WO2021128880A1 (en) Speech recognition method, device, and device for speech recognition
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN107945806B (en) User identification method and device based on sound characteristics
CN109961791B (en) Voice information processing method and device and electronic equipment
CN111831806B (en) Semantic integrity determination method, device, electronic equipment and storage medium
WO2021031308A1 (en) Audio processing method and device, and storage medium
CN113362812B (en) Voice recognition method and device and electronic equipment
CN111832316A (en) Semantic recognition method and device, electronic equipment and storage medium
CN114223029A (en) Server supporting device to perform voice recognition and operation method of server
CN111583919A (en) Information processing method, device and storage medium
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN110610720B (en) Data processing method and device and data processing device
CN111832315A (en) Semantic recognition method and device, electronic equipment and storage medium
CN110674246A (en) Question-answering model training method, automatic question-answering method and device
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN111580773B (en) Information processing method, device and storage medium
CN112331194A (en) Input method and device and electronic equipment
CN112863499B (en) Speech recognition method and device, storage medium
CN109102812B (en) Voiceprint recognition method and system and electronic equipment
CN114462410A (en) Entity identification method, device, terminal and storage medium
CN111816174A (en) Speech recognition method, device and computer readable storage medium
CN113420553A (en) Text generation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100085 unit C, building C, lin66, Zhufang Road, Qinghe, Haidian District, Beijing

Applicant after: Beijing Xiaomi pinecone Electronic Co.,Ltd.

Address before: 100085 unit C, building C, lin66, Zhufang Road, Qinghe, Haidian District, Beijing

Applicant before: BEIJING PINECONE ELECTRONICS Co.,Ltd.

GR01 Patent grant