CN110797014A - Voice recognition method and device and computer storage medium - Google Patents

Voice recognition method and device and computer storage medium Download PDF

Info

Publication number
CN110797014A
Authority
CN
China
Prior art keywords
terminal
voice
voice recognition
area
recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810784944.0A
Other languages
Chinese (zh)
Other versions
CN110797014B (en)
Inventor
梁晓辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZTE Corp
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp filed Critical ZTE Corp
Priority to CN201810784944.0A priority Critical patent/CN110797014B/en
Priority claimed from CN201810784944.0A external-priority patent/CN110797014B/en
Publication of CN110797014A publication Critical patent/CN110797014A/en
Application granted granted Critical
Publication of CN110797014B publication Critical patent/CN110797014B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

An embodiment of the invention discloses a voice recognition method, which includes: receiving a voice signal to be recognized; acquiring a voice recognition model corresponding to a terminal, where the voice recognition model corresponding to the terminal is determined according to the correspondence between the terminal and its historical use voice areas; and recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result whose accuracy meets a set condition as the target voice recognition result. Embodiments of the invention also disclose a voice recognition device and a computer storage medium.

Description

Voice recognition method and device and computer storage medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a speech recognition method, apparatus, and computer storage medium.
Background
Speech recognition is a technique for converting a speech signal into text: generally, an input speech signal is recognized according to an established speech recognition model to obtain a speech recognition result. Because users differ in characteristic attributes such as region, the input speech signal carries corresponding characteristic attributes. Recognizing the input speech signals of all regions with a single speech recognition model therefore not only affects recognition efficiency but also reduces recognition accuracy. In addition, when a user's geographic location changes because of work, life, and so on, the geographic location of the terminal, that is, the area where the terminal is located, changes correspondingly; if the input voice signal is then recognized with the voice recognition model corresponding to the area where the terminal is currently located, the accuracy may be low.
Disclosure of Invention
In order to solve the existing technical problems, embodiments of the present invention provide a speech recognition method, a speech recognition device, and a computer storage medium, which can improve speech recognition efficiency and accuracy.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
in a first aspect, an embodiment of the present invention provides a speech recognition method, including:
receiving a voice signal to be recognized;
acquiring a voice recognition model corresponding to a terminal, wherein the voice recognition model corresponding to the terminal is determined according to the corresponding relation between the terminal and a historical use voice area;
and recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with the accuracy meeting the set condition as a target voice recognition result.
In the foregoing solution, the obtaining a speech recognition model corresponding to a terminal includes:
sending an acquisition request carrying the terminal identification to a cloud server;
and receiving a voice recognition model corresponding to a historical use voice area matched with the terminal identification returned by the cloud server based on the terminal identification.
In the foregoing solution, the receiving a voice recognition model corresponding to a historical use voice area matched with the terminal identifier, returned by the cloud server based on the terminal identifier, includes:
receiving, returned by the cloud server based on the terminal identifier, a historical use voice record matched with the terminal identifier and the voice recognition models corresponding to the historical use voice areas matched with the terminal identifier, where the historical use voice record includes the frequency and/or duration of the terminal's voice use in each historical use voice area;
the recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with accuracy meeting set conditions as a target voice recognition result, comprising:
determining, according to the frequency and/or the duration, the screening order of the voice recognition models corresponding to the historical use voice areas of the terminal;
and sequentially screening, according to the screening order, the voice recognition results generated by the voice recognition models recognizing the voice signal, until a voice recognition result whose accuracy is higher than a set accuracy threshold is obtained, and taking that voice recognition result as the target voice recognition result.
In the foregoing solution, the sequentially screening, according to the screening order, the voice recognition results generated by the voice recognition models recognizing the voice signal, until a voice recognition result whose accuracy is higher than a set accuracy threshold is obtained, and taking that voice recognition result as the target voice recognition result includes:
taking the speech recognition model with the highest screening priority as an initial target speech recognition model;
recognizing the voice signal according to the target voice recognition model to obtain a voice recognition result;
when the accuracy of the voice recognition result is lower than the set accuracy threshold, sequentially selecting, according to the screening order, the voice recognition model with the next screening priority as the updated target voice recognition model, and recognizing the voice signal according to the updated target voice recognition model;
and when the accuracy of the voice recognition result is determined to be higher than the set accuracy threshold, taking the voice recognition result with the accuracy higher than the set accuracy threshold as a target voice recognition result.
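The screening loop described above can be sketched as follows (a minimal illustration; the function name and the toy recognizers are hypothetical stand-ins for region-specific speech recognition models, not anything defined by the patent):

```python
def screen_recognition(models, signal, threshold):
    """models: recognizers in screening-priority order (highest first).
    Each recognizer maps a signal to a (text, accuracy) pair."""
    best = None
    for model in models:
        text, accuracy = model(signal)
        if best is None or accuracy > best[1]:
            best = (text, accuracy)
        if accuracy > threshold:          # first result passing the threshold wins
            return text
    return best[0] if best else None      # fall back to the best result seen

# Toy stand-ins for region-specific recognition models
model_a = lambda s: ("hello", 0.62)       # lower-accuracy model, tried first
model_b = lambda s: ("hello world", 0.91)

result = screen_recognition([model_a, model_b], b"voice-bytes", threshold=0.8)
```

Here the first model's result fails the threshold, so the model with the next screening priority is tried and its result is taken as the target.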
In the foregoing solution, the determining, according to the frequency and/or the duration, the screening order of the voice recognition models corresponding to the historical use voice areas of the terminal includes:
taking the frequency of voice use in each historical use voice area corresponding to the terminal as the first sorting reference index and the duration as the second sorting reference index, sorting the voice recognition models according to a rule of frequency from high to low and duration from large to small, and obtaining the screening order of the voice recognition models from high priority to low priority.
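The two-key ordering rule can be illustrated with a short sketch (the area names and the frequency/duration values are invented for illustration only):

```python
# Historical use voice records for one terminal (invented values):
records = [
    {"area": "A", "frequency": 12, "duration": 340},
    {"area": "B", "frequency": 30, "duration": 150},
    {"area": "C", "frequency": 30, "duration": 600},
]
# Frequency is the first sorting reference index (high to low);
# duration breaks ties (large to small).
screening_order = sorted(records, key=lambda r: (-r["frequency"], -r["duration"]))
order = [r["area"] for r in screening_order]   # C, then B, then A
```

Areas B and C tie on frequency, so C's larger duration gives it the higher screening priority.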
In the foregoing solution, the recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with accuracy meeting a set condition as a target voice recognition result includes:
sorting, according to a rule of accuracy from high to low, the voice recognition results generated by recognizing the voice signal with the voice recognition models corresponding to the historical use voice areas of the terminal;
and selecting a set number of the voice recognition results, from high accuracy to low, as target voice recognition results.
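A sketch of this selection rule (the result strings and accuracy values are illustrative, not from the patent):

```python
# Candidate (result, accuracy) pairs produced by several models (invented values):
results = [("text1", 0.71), ("text2", 0.93), ("text3", 0.85)]
top_n = 2                                   # the "set number" of target results
# Sort by accuracy, high to low, and keep the set number of results.
targets = sorted(results, key=lambda r: r[1], reverse=True)[:top_n]
```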
In the foregoing scheme, before the obtaining of the speech recognition model corresponding to the terminal, the method further includes:
determining the area information of the area where the terminal is located currently;
collecting the corpus of the area where the terminal is located currently;
and sending the corpus and the area information of the area where the terminal is currently located to the cloud server, wherein the corpus and the area information of the area where the terminal is currently located are used for the cloud server to establish a voice recognition model corresponding to the area where the terminal is currently located.
In a second aspect, an embodiment of the present invention provides a speech recognition apparatus, including:
the receiving module is used for receiving a voice signal to be recognized;
the terminal comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a voice recognition model corresponding to the terminal, and the voice recognition model corresponding to the terminal is determined according to the corresponding relation between the terminal and a historical use voice area;
and the recognition module is used for recognizing the voice signal according to the voice recognition model corresponding to the terminal and obtaining a voice recognition result with the accuracy meeting the set condition as a target voice recognition result.
In a third aspect, an embodiment of the present invention provides a speech recognition apparatus, including a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor is configured to execute the speech recognition method of the first aspect when running the computer program.
In a fourth aspect, an embodiment of the present invention provides a computer storage medium, where a computer program is stored, and the computer program, when executed by a processor, implements the speech recognition method according to the first aspect.
The voice recognition method, the voice recognition device and the computer storage medium provided by the embodiments acquire the voice recognition model corresponding to the terminal determined according to the corresponding relationship between the terminal and the history use voice area, and recognize the voice signal to be recognized according to the voice recognition model corresponding to the terminal, thereby acquiring the voice recognition result with the accuracy meeting the set condition as the target voice recognition result. Therefore, the voice recognition model for obtaining the target voice recognition result is determined according to the corresponding relation between the terminal and the historical voice using area, and compared with a simple mode of selecting the voice recognition model according to the current area of the terminal or adopting a unified voice recognition model for voice recognition, the matching between the voice recognition model for obtaining the target voice recognition result and the terminal is optimized, and the voice recognition efficiency and accuracy are improved.
Drawings
FIG. 1 is a diagram illustrating an application environment of a speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a speech recognition method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a voice recognition apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a voice recognition apparatus according to another embodiment of the present invention;
FIG. 6 is a flow chart illustrating a speech recognition method according to an alternative embodiment of the present invention;
FIG. 7 is a diagram illustrating a training phase and a recognition phase according to an embodiment of the present invention;
fig. 8 is a flowchart illustrating a process of obtaining and displaying a speech recognition result according to rule one in an embodiment of the present invention.
Detailed Description
The technical solution of the invention is further elaborated below with reference to the drawings and specific embodiments. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
Fig. 1 is a schematic diagram of an optional application environment of the voice recognition method provided in the embodiment of the present invention. The application environment includes a terminal 10 and a cloud server 11, where software with a voice recognition function is installed on the terminal 10 and the terminal 10 is connected to the cloud server 11 through a network 12. The terminal 10 receives a voice signal to be recognized input by a user, and sends the voice signal and a terminal identifier to the cloud server 11 through the network 12. After acquiring the voice signal and the terminal identifier, the cloud server 11 queries the correspondence between the terminal 10 and its historical use voice areas according to the terminal identifier, thereby determining the voice recognition model corresponding to the terminal 10. The cloud server 11 then sends the voice recognition model corresponding to the terminal 10 through the network 12, and the terminal 10 recognizes the voice signal according to that model and obtains a voice recognition result whose accuracy meets a set condition as the target voice recognition result. Alternatively, the cloud server 11 recognizes the voice signal according to the voice recognition model corresponding to the terminal 10, obtains a voice recognition result whose accuracy meets the set condition as the target voice recognition result, and sends the target voice recognition result to the terminal 10.
Therefore, the voice recognition model for obtaining the target voice recognition result is determined according to the corresponding relation between the terminal and the historical voice using area, the matching between the voice recognition model for obtaining the target voice recognition result and the terminal is optimized, the problem of low accuracy in recognition of the voice signal by using the voice recognition model corresponding to the area where the terminal is located at present is solved, and the voice recognition efficiency and accuracy are improved. The terminal 10 may be an intelligent device such as a personal computer, a mobile phone, a tablet computer, and the like, and the cloud server 11 may be a physical server or a logical server formed by virtualizing a plurality of physical servers, or may be a server cluster formed by a plurality of servers capable of communicating with each other.
As shown in fig. 2, which is an optional structural diagram of the terminal 10 in fig. 1, the terminal 10 includes a processor, a storage medium, a memory, a network interface, a display screen, and a microphone connected by a system bus, where the storage medium stores an operating system and a speech recognition device for implementing the speech recognition method applied to the terminal 10 according to the embodiment of the present application. The processor provides the computing and control capabilities that support the operation of the entire terminal 10. The memory in the terminal 10 provides an environment for the operation of the voice recognition device in the storage medium, and the network interface performs network communication with the cloud server 11, receiving or transmitting data, for example, transmitting the voice signal and the terminal identifier to the cloud server 11. The display screen displays the target voice recognition result, and the microphone receives the voice signal input by the user. For a touch-enabled terminal 10, the display screen may be a touch screen.
Referring to fig. 3, a speech recognition method provided in an embodiment of the present invention can be applied to the terminal 10 shown in fig. 1 and 2, and the method includes the following steps:
step S101: receiving a voice signal to be recognized;
here, the receiving the voice signal to be recognized may be that the terminal receives a voice signal to be recognized input by a user. It is understood that, when a user uses an Application (APP) installed on a terminal, a voice signal to be recognized may be input through a voice recognition function provided by the APP. Here, the APP may refer to a speech recognition application program for implementing the speech recognition method in the embodiment of the present application.
Step S102: acquiring a voice recognition model corresponding to a terminal, wherein the voice recognition model corresponding to the terminal is determined according to the corresponding relation between the terminal and a historical use voice area;
here, the terminal and/or the cloud server stores a historical usage voice region corresponding to the terminal, where the historical usage voice region corresponding to the terminal is located when the terminal uses the historical usage voice. The terminal can locally store the terminal identifier and the area where the terminal historical use voice is located, and establish or update the corresponding relation between the terminal identifier and the historical use voice area corresponding to the terminal; or, the terminal may send the terminal identifier and the area where the terminal uses the voice to the cloud server, so that the cloud server establishes or updates a corresponding relationship between the terminal identifier and the historical voice area used by the terminal. The voice recognition model corresponding to the terminal can be stored in the cloud server and can also be stored in the terminal. Correspondingly, the voice recognition model corresponding to the terminal can be obtained by the terminal according to the corresponding relation between the terminal and the historical use voice area to determine the voice recognition model corresponding to the terminal, or the terminal sends the terminal identifier to the cloud server, and the voice recognition model corresponding to the terminal determined according to the corresponding relation between the terminal and the historical use voice area is obtained from the cloud server. The voice recognition model corresponding to the terminal includes a voice recognition model corresponding to a history use voice region corresponding to the terminal, and the voice recognition model corresponding to the history use voice region corresponding to the terminal may include a voice recognition model corresponding to a region where the terminal is currently located. 
If the voice recognition model corresponding to the historical voice using area corresponding to the terminal does not include the voice recognition model corresponding to the current area of the terminal, the voice recognition model corresponding to the terminal can also include the voice recognition model corresponding to the current area of the terminal.
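One possible way to hold the correspondences described in the two paragraphs above is sketched below. The patent does not prescribe a storage format; all identifiers, area names, and the function name are hypothetical, and strings stand in for trained models:

```python
# Terminal identifier -> areas where that terminal historically used voice
history_by_terminal = {"IMEI-001": ["area_A", "area_B"]}
# Area -> speech recognition model (strings stand in for trained models)
model_by_area = {"area_A": "model_A", "area_B": "model_B", "area_C": "model_C"}

def models_for_terminal(terminal_id, current_area=None):
    areas = list(history_by_terminal.get(terminal_id, []))
    # Include the current area's model when the historical areas
    # do not already cover it, as described above.
    if current_area and current_area not in areas:
        areas.append(current_area)
    return [model_by_area[a] for a in areas if a in model_by_area]
```

For example, a terminal with history in areas A and B that is currently in area C would be matched with all three models.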
In an alternative embodiment, the step S102: the method for acquiring the voice recognition model corresponding to the terminal comprises the following steps:
sending an acquisition request carrying the terminal identification to a cloud server;
and receiving a voice recognition model corresponding to a historical use voice area matched with the terminal identification returned by the cloud server based on the terminal identification.
Specifically, when a voice recognition model corresponding to a historical use voice area corresponding to the terminal is stored in a cloud server, the terminal sends an acquisition request carrying a terminal identifier to the cloud server, and receives a voice recognition model corresponding to the historical use voice area matched with the terminal identifier returned by the cloud server based on the terminal identifier.
Here, when the cloud server stores the correspondence between the terminal identifier and the historical use voice areas of the terminal, as well as the correspondences between different areas and voice recognition models, the terminal may obtain the voice recognition models corresponding to its historical use voice areas from the cloud server by sending an acquisition request. The terminal identifier uniquely identifies the terminal and may be, for example, the International Mobile Equipment Identity (IMEI) of the terminal. Because the cloud server has large storage space and high operation speed, this arrangement can improve voice recognition efficiency and speed.
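The request/response exchange might be sketched as follows (a toy in-memory handler on the cloud-server side; field names such as `terminal_id` and the handler name are assumptions, not defined by the patent):

```python
def handle_acquisition_request(request, history_by_terminal, model_by_area):
    """Match the identifier carried by the acquisition request against the
    stored correspondences and return the models for the matched areas."""
    terminal_id = request["terminal_id"]
    areas = history_by_terminal.get(terminal_id, [])
    return {"models": {a: model_by_area[a] for a in areas if a in model_by_area}}

response = handle_acquisition_request(
    {"terminal_id": "IMEI-001"},              # acquisition request from terminal
    {"IMEI-001": ["area_A"]},                 # stored terminal -> area correspondence
    {"area_A": "model_A"},                    # stored area -> model correspondence
)
```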
In an alternative embodiment, the step S102: before obtaining the speech recognition model corresponding to the terminal, the method further comprises:
determining the area information of the area where the terminal is located currently;
collecting the corpus of the area where the terminal is located currently;
and sending the corpus and the area information of the area where the terminal is currently located to the cloud server, wherein the corpus and the area information of the area where the terminal is currently located are used for the cloud server to establish a voice recognition model corresponding to the area where the terminal is currently located.
Specifically, a terminal acquires the current geographical position information of the terminal, and determines the area information of the area where the terminal is located according to the current geographical position information of the terminal; and collecting the corpus of the current area of the terminal, and sending the corpus and the area information of the current area of the terminal to the cloud server, wherein the corpus and the area information of the current area of the terminal are used for the cloud server to establish a voice recognition model corresponding to the current area of the terminal.
Here, the terminal may obtain its current geographic position information through satellite positioning, network positioning, triangulation, and the like. From this position information, the area information of the area where the terminal is currently located can be obtained directly or indirectly; the area information includes, but is not limited to, the street area, administrative area, longitude-and-latitude range, or user-defined area where the terminal is currently located. Taking a mobile phone as the terminal and assuming its current geographic position is the Capital Airport in Chaoyang District, Beijing: if areas are divided by street region, the area information of the area where the mobile phone is currently located is Capital Airport Street, Chaoyang District, Beijing; if divided by administrative region, it is Chaoyang District, Beijing.
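The choice of granularity can be sketched as below; a real implementation would reverse-geocode the positioning fix through some geocoding service, for which the lookup table here is merely a stand-in (the coordinates and strings are illustrative):

```python
position = {"lat": 40.08, "lon": 116.58}      # illustrative positioning fix

def area_info(position, granularity="street"):
    # Stand-in for a reverse-geocoding lookup at the chosen granularity;
    # a real system would derive these strings from the coordinates.
    lookup = {
        "street": "Capital Airport Street, Chaoyang District, Beijing",
        "administrative": "Chaoyang District, Beijing",
    }
    return lookup[granularity]
```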
Here, the corpus is the linguistic material required to establish the speech recognition model, that is, the source data from which the model is built. The terminal may collect the corpus of the area where it is currently located by collecting voice uttered by users in that area, or may directly obtain an established corpus of the area from a third party. The corpus of each region has regional characteristics; for example, the voices corresponding to the same word may differ between regions. Based on the corpus of the area where the terminal is currently located, the cloud server can establish a speech recognition model corresponding to that area using existing speech recognition algorithms such as deep neural networks and hidden Markov models. The cloud server establishes the speech recognition model for an area according to the corpora of that area collected by different terminals; for example, the cloud server establishes the speech recognition model corresponding to area A according to the corpora collected by all the terminals in area A.
It should be noted that, if a terminal establishes a voice recognition model corresponding to the terminal, the terminal acquires current geographical position information of the terminal, and determines area information of an area where the terminal is currently located according to the current geographical position information of the terminal; and collecting the corpus of the current area of the terminal, and establishing a voice recognition model corresponding to the current area of the terminal according to the corpus.
Therefore, the terminal collects the corpora of the current region and sends the corpora to the cloud server so that the cloud server can establish the voice recognition model corresponding to the current region, the voice recognition model is ensured to have high accuracy, and the recognition accuracy of the voice recognition model is improved.
Step S103: and recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with the accuracy meeting the set condition as a target voice recognition result.
Specifically, the terminal identifies the voice signal according to a voice identification model corresponding to the terminal, and obtains a voice identification result with accuracy meeting set conditions as a target voice identification result; or the cloud server identifies the voice signal according to the voice identification model corresponding to the terminal, obtains a voice identification result with the accuracy meeting set conditions as a target voice identification result, and sends the target voice identification result to the terminal.
Here, the set condition may be set and adjusted according to actual needs; for example, accuracy meeting the set condition may be accuracy higher than a set accuracy threshold. Because the voice recognition model corresponding to the terminal comprises one or more voice recognition models, recognizing the voice signal according to these models generates one or more corresponding voice recognition results, from which the result whose accuracy meets the set condition is taken as the target voice recognition result. In addition, because the voice recognition model corresponding to the terminal includes not only the voice recognition model corresponding to the area where the terminal is currently located but also the voice recognition models corresponding to the historical use voice areas of the terminal, those historical areas effectively represent the regional pattern of the terminal user's voice use, such as the regions where voice is used frequently or only occasionally; from this pattern, characteristic attributes of the user, such as the place where the user lives, can be effectively inferred. Therefore, when the area where the terminal is currently located does not belong to the historical use voice areas of the terminal, recognizing the voice signal based on the voice recognition models corresponding to those historical areas yields a recognition result with higher accuracy, avoids repeated recognition operations caused by low accuracy, and improves voice recognition efficiency.
For example, assuming that the user's native place and residence are both in the A area, if the user invokes the terminal's speech recognition function in the B area and the input speech signal matches the corpus of the A area, the accuracy of the recognition result obtained with the speech recognition model corresponding to the B area will be lower than that obtained with the model corresponding to the A area.
In summary, in the voice recognition method provided in the above embodiment, the voice recognition model corresponding to the terminal determined according to the correspondence between the terminal and the history use voice area is obtained, and the voice signal to be recognized is recognized according to the voice recognition model corresponding to the terminal, so that the voice recognition result with the accuracy meeting the set condition is obtained as the target voice recognition result. Therefore, the voice recognition model for obtaining the target voice recognition result is determined according to the corresponding relation between the terminal and the historical voice using area, and compared with a simple mode of performing voice recognition according to the voice recognition model corresponding to the current area of the terminal or by adopting a unified voice recognition model, the method optimizes the matching between the voice recognition model for obtaining the target voice recognition result and the terminal, and improves the voice recognition efficiency and accuracy.
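The selection-and-thresholding flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the model callables, the 0.90 threshold, and all names are hypothetical, not part of the disclosure.

```python
# Hypothetical sketch of the flow: run the signal through the models
# associated with the terminal (current area plus historical voice use
# areas) and keep the first result whose accuracy meets the set condition.
ACCURACY_THRESHOLD = 0.90  # the "set condition": accuracy above a threshold

def recognize_with_terminal_models(signal, terminal_models):
    """Return the first (text, accuracy) pair meeting the set condition,
    or None if no model's result does."""
    for model in terminal_models:
        text, accuracy = model(signal)
        if accuracy >= ACCURACY_THRESHOLD:
            return text, accuracy
    return None

# Toy stand-ins for area-specific models: each returns (text, accuracy).
model_area_b = lambda s: ("wrong text", 0.60)   # current area, poor match
model_area_a = lambda s: ("right text", 0.95)   # historical home area

result = recognize_with_terminal_models("signal", [model_area_b, model_area_a])
# The historical-area model supplies the target result here.
```

In this toy run the B-area model's accuracy falls short, so the A-area model's result becomes the target, mirroring the native-place example above.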
In an optional embodiment, the receiving, by the cloud server, a speech recognition model corresponding to a historical usage speech area matched with the terminal identifier returned based on the terminal identifier includes:
receiving a historical use voice record which is returned by the cloud server based on the terminal identification and is matched with the terminal identification, and a voice recognition model corresponding to a historical use voice area which is matched with the terminal identification, wherein the historical use voice record comprises the frequency and/or duration of the terminal using voice in each historical use voice area;
the recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with accuracy meeting set conditions as a target voice recognition result, comprising:
determining the screening sequence of the voice recognition model corresponding to the historical use voice area corresponding to the terminal according to the frequency and/or the duration;
and according to the screening sequence, sequentially screening the voice recognition results generated by recognizing the voice signals by using the voice recognition model corresponding to the voice area in the history corresponding to the terminal until the voice recognition results with the accuracy higher than the set accuracy threshold are obtained, and taking the voice recognition results with the accuracy higher than the set accuracy threshold as target voice recognition results.
Here, the historical voice use record includes the historical voice use areas corresponding to the terminal and the frequency and/or duration of the terminal's voice use in each such area. The frequency of voice use in a given area is the number of times the terminal used voice in that area within a set time range; the duration is the total time of voice use in that area within the set time range, obtained by accumulating the length of each voice session in the area. The set time range can be chosen according to actual conditions, for example, 30 days or 180 days. The terminal sends the area and duration of each voice session, together with the terminal identifier, to the cloud server, and the cloud server establishes or updates the historical voice use record matched with that identifier. The screening order of the voice recognition models corresponding to the terminal's historical voice use areas may also be computed by the cloud server and sent to the terminal. Furthermore, when the cloud server computes the screening order, it can itself screen, in that order, the recognition results produced by the models corresponding to the terminal's historical voice use areas, obtain the target voice recognition result, and return it to the terminal. The accuracy threshold may be set and adjusted according to actual requirements, for example, to 90%.
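As a rough sketch of how the cloud server might maintain such a record, the following assumes a per-terminal, per-area counter of session count (frequency) and accumulated minutes (duration); the structure and all field names are illustrative assumptions, not specified by this disclosure.

```python
from collections import defaultdict

# Hypothetical historical voice use record, keyed by terminal identifier,
# then by area: {"count": sessions in the set time range, "minutes": total time}.
usage_records = defaultdict(lambda: defaultdict(lambda: {"count": 0, "minutes": 0.0}))

def report_voice_use(terminal_id, area, minutes):
    """Called once per voice session the terminal reports; the server
    creates or updates the record matched to the terminal identifier."""
    rec = usage_records[terminal_id][area]
    rec["count"] += 1          # frequency: one more session in this area
    rec["minutes"] += minutes  # duration: accumulate this session's length

report_voice_use("T1", "A", 2.5)
report_voice_use("T1", "A", 1.5)
report_voice_use("T1", "B", 0.5)
```

Expiring entries outside the set time range (30 or 180 days) would require timestamping each session, omitted here for brevity.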
Therefore, the voice recognition results generated by recognizing the voice signals by the voice recognition models corresponding to the historical voice using areas corresponding to the terminals are sequentially screened according to the screening sequence of the voice recognition models corresponding to the historical voice using areas corresponding to the terminals, so that the voice recognition results with the accuracy higher than the set accuracy threshold are obtained as the target voice recognition results, and the voice recognition accuracy is further improved.
In an optional embodiment, the determining, according to the frequency and/or the duration, a screening order of the voice recognition models corresponding to the terminal's historical voice use areas includes:
and taking the frequency of using the voice in the historical voice using area corresponding to the terminal as a first sequencing reference index and the duration as a second sequencing reference index, sequencing the voice recognition models according to a sequencing rule that the frequency is from high to low and the duration is from large to small, and obtaining the screening sequencing of the priority of the voice recognition models corresponding to the historical voice using area corresponding to the terminal from high to low.
It should be noted that, if the screening order of the voice recognition models corresponding to the terminal's historical voice use areas is determined according to frequency alone, the frequency of voice use in each historical area serves as the ranking reference index, and the models are ranked from high frequency to low, yielding the screening order of the models from high priority to low. If the screening order is determined according to duration alone, the duration of voice use in each historical area serves as the ranking reference index, and the models are ranked from long duration to short, likewise yielding the screening order of the models from high priority to low.
For example, if the historical speech area corresponding to the terminal includes A, B, C and D, where the frequency of speech used by the terminal in the a area is 180 times per month and the duration is 320 minutes per month, the frequency of speech used by the terminal in the B area is 50 times per month and the duration is 40 minutes per month, the frequency of speech used by the terminal in the C area is 10 times per month and the duration is 12 minutes per month, and the frequency of speech used by the terminal in the D area is 5 times per month and the duration is 8 minutes per month, the priority of the speech recognition model corresponding to the historical speech area corresponding to the terminal is sequentially ranked from high to low as the speech recognition model corresponding to the a area, the speech recognition model corresponding to the B area, the speech recognition model corresponding to the C area, and the speech recognition model corresponding to the D area.
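The ranking in this example follows directly from sorting on (frequency, duration) in descending order, frequency as the first key and duration as the tie-breaker. A minimal Python sketch, with the figures above hard-coded:

```python
# Per-area usage from the example: frequency (times/month) and duration (minutes/month).
usage = {
    "A": {"freq": 180, "minutes": 320},
    "B": {"freq": 50,  "minutes": 40},
    "C": {"freq": 10,  "minutes": 12},
    "D": {"freq": 5,   "minutes": 8},
}

# Sort areas by (frequency, duration), both descending: frequency is the
# first ranking reference index, duration the second (tie-breaker).
screening_order = sorted(
    usage,
    key=lambda area: (usage[area]["freq"], usage[area]["minutes"]),
    reverse=True,
)
# screening_order gives the priority order of the area models: A, B, C, D.
```

With these figures the tie-breaker never fires; it would matter only for two areas with equal monthly frequency.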
Therefore, ranking the voice recognition models with frequency as the first reference index and duration as the second raises the screening priority of the models corresponding to the areas where the terminal uses voice most often and for the longest time. Because those areas best reflect the regional pattern of the terminal user's voice use, the accuracy of the obtained target voice recognition result is further improved.
In an optional embodiment, the sequentially screening, according to the screening order, the voice recognition results generated by recognizing the voice signal by using the voice recognition model corresponding to the voice region in the history corresponding to the terminal until a voice recognition result with an accuracy higher than a set accuracy threshold is obtained, and taking the voice recognition result with the accuracy higher than the set accuracy threshold as the target voice recognition result includes:
taking the speech recognition model with the highest screening priority as an initial target speech recognition model;
recognizing the voice signal according to the target voice recognition model to obtain a voice recognition result;
when the accuracy of the voice recognition result is lower than a set accuracy threshold value, sequentially selecting the voice recognition model with the next screening priority as an updated target voice recognition model according to the screening sequence, and recognizing the voice signal according to the updated target voice recognition model;
and when the accuracy of the voice recognition result is determined to be higher than the set accuracy threshold, taking the voice recognition result with the accuracy higher than the set accuracy threshold as a target voice recognition result.
It should be noted that, when it is determined that the accuracy of the speech recognition result is lower than the set accuracy threshold, sequentially selecting the speech recognition model of the next screening priority as the updated target speech recognition model according to the screening ranking, recognizing the speech signal according to the updated target speech recognition model, and after obtaining the corresponding speech recognition result, if the accuracy of the corresponding speech recognition result is still lower than the set accuracy threshold, returning to the step of sequentially selecting the speech recognition model of the next screening priority as the updated target speech recognition model according to the screening ranking, and recognizing the speech signal according to the updated target speech recognition model. The terminal can also recognize the voice signal by using the voice recognition model corresponding to the history use voice area corresponding to the terminal to generate a corresponding voice recognition result, and then obtain a target voice recognition result from the voice recognition result according to the screening sequence. 
For example, assuming that the priorities of the voice recognition models corresponding to the terminal's historical voice use areas are, from high to low, the model corresponding to the A area, the model corresponding to the B area, the model corresponding to the C area, and the model corresponding to the D area, it is first determined whether the accuracy of the A-area model's recognition result for the voice signal is higher than the set accuracy threshold; if so, that result is taken as the target voice recognition result; if not, the B-area model is tried next, and if the accuracy of its result is still lower than the set accuracy threshold, the process repeats with the C-area and D-area models in turn.
Here, when the accuracy of every recognition result produced by the models corresponding to the terminal's historical voice use areas is lower than the set accuracy threshold, the recognition result produced by a universal voice recognition model may be taken as the target voice recognition result. The universal voice recognition model is trained on a universal corpus, that is, a corpus not differentiated by area.
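The screening loop with a universal-model fallback might look like the following sketch; the threshold value, model callables, and all names are assumptions for illustration, not part of the disclosure.

```python
THRESHOLD = 0.90  # set accuracy threshold, e.g. 90%

def screen_results(signal, ordered_models, universal_model):
    """Try area models in screening order; if none exceeds the threshold,
    fall back to the universal model trained on an undifferentiated corpus."""
    for model in ordered_models:
        text, accuracy = model(signal)
        if accuracy > THRESHOLD:
            return text  # first result meeting the condition is the target
    text, _ = universal_model(signal)
    return text

# Toy models in screening order (e.g. A area then B area).
models = [lambda s: ("a-text", 0.40), lambda s: ("b-text", 0.95)]
universal = lambda s: ("generic-text", 0.70)
```

Calling `screen_results("sig", models, universal)` stops at the B-area model; passing only low-accuracy models would return the universal model's text instead.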
Therefore, the voice recognition results of the voice recognition models are screened according to the screening sequence of the voice recognition models corresponding to the historical use voice areas corresponding to the terminal until the voice recognition results with the accuracy higher than the set accuracy threshold are obtained, and the accuracy of the voice recognition results is improved.
In an optional embodiment, the recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with accuracy meeting a set condition as a target voice recognition result includes:
according to a sorting rule from high accuracy to low accuracy, sorting the voice recognition results generated by recognizing the voice signals by using the voice recognition model corresponding to the voice area in the history corresponding to the terminal;
and selecting a set number of the voice recognition results as target voice recognition results from high to low.
Specifically, the terminal sorts the recognition results produced by the voice recognition models corresponding to its historical voice use areas from high accuracy to low, and selects, in that order, a set number of results as target voice recognition results.
For example, assume the models corresponding to the terminal's historical voice use areas are those for the A, B, C, and D areas, and the accuracy of the recognition result each produces for the voice signal is 95%, 65%, 15%, and 5%, respectively. Sorting the results from high accuracy to low, the two most accurate results, namely those produced by the A-area and B-area models, may be selected as the target voice recognition results.
Here, the set number may be set according to actual needs, for example, may be set to be 1, or may be 2 or more, such as 3 or 4.
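Sorting the results by accuracy and keeping a set number can be sketched as follows, using the accuracy figures from the example above; the names are illustrative.

```python
# (recognition text, accuracy) pairs from the A/B/C/D example.
results = [
    ("A-result", 0.95),
    ("B-result", 0.65),
    ("C-result", 0.15),
    ("D-result", 0.05),
]

def top_n_results(results, n):
    """Sort by accuracy descending and keep the set number of results
    as candidate target voice recognition results."""
    ranked = sorted(results, key=lambda r: r[1], reverse=True)
    return [text for text, accuracy in ranked[:n]]

targets = top_n_results(results, 2)  # set number = 2
```

Presenting several candidates rather than one lets the user pick, as the surrounding text notes; `n=1` reduces this to the single best result.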
Therefore, the voice recognition results generated by recognizing the voice signals through the voice recognition model corresponding to the historical voice using area corresponding to the terminal are sorted, and the set number of the voice recognition results are selected from high to low as the target voice recognition results for the user to select, so that the method is simple and convenient, and has high accuracy.
In order to implement the foregoing method, an embodiment of the present invention further provides a speech recognition apparatus, which is applied to a terminal, and as shown in fig. 4, includes: a receiving module 20, an obtaining module 21 and an identifying module 22; wherein:
the receiving module 20 is configured to receive a voice signal to be recognized;
the obtaining module 21 is configured to obtain a voice recognition model corresponding to a terminal, where the voice recognition model corresponding to the terminal is determined according to a correspondence between the terminal and a history used voice area;
and the recognition module 22 is configured to recognize the voice signal according to the voice recognition model corresponding to the terminal, and obtain a voice recognition result with accuracy meeting a set condition as a target voice recognition result.
In summary, in the voice recognition apparatus provided in the above embodiment, the voice recognition model corresponding to the terminal determined according to the correspondence between the terminal and the history use voice area is obtained, and the voice signal to be recognized is recognized according to the voice recognition model corresponding to the terminal, so that the voice recognition result with the accuracy meeting the set condition is obtained as the target voice recognition result. Therefore, the voice recognition model for obtaining the target voice recognition result is determined according to the corresponding relation between the terminal and the historical voice using area, and compared with a simple mode of selecting the voice recognition model according to the current area of the terminal or adopting a unified voice recognition model for voice recognition, the matching between the voice recognition model for obtaining the target voice recognition result and the terminal is optimized, and the voice recognition efficiency and accuracy are improved.
In an optional embodiment, the obtaining module 21 includes a sending sub-module 210 and a receiving sub-module 211; wherein:
the sending submodule 210 is configured to send an obtaining request carrying the terminal identifier to a cloud server;
the receiving submodule 211 is configured to receive a voice recognition model corresponding to a history use voice area matched with the terminal identifier, which is returned by the cloud server based on the terminal identifier.
Therefore, the cloud server has large storage space and high operation speed, and can improve the voice recognition efficiency and speed.
In an alternative embodiment,
the receiving sub-module 211 is specifically configured to: receiving a historical use voice record which is returned by the cloud server based on the terminal identification and is matched with the terminal identification, and a voice recognition model corresponding to a historical use voice area which is matched with the terminal identification, wherein the historical use voice record comprises the frequency and/or duration of the terminal using voice in each historical use voice area;
the identification module 22 is specifically configured to:
determining the screening sequence of the voice recognition model corresponding to the historical use voice area corresponding to the terminal according to the frequency and/or the duration;
and sequentially screening the voice recognition results generated by the voice recognition model for recognizing the voice signals according to the screening sequence until the voice recognition results with the accuracy higher than the set accuracy threshold are obtained, and taking the voice recognition results with the accuracy higher than the set accuracy threshold as target voice recognition results.
Therefore, the voice recognition results generated by the voice recognition models for recognizing the voice signals are sequentially screened according to the screening sequence of the voice recognition models corresponding to the historical use voice areas corresponding to the terminals, so that the voice recognition results with the accuracy higher than the set accuracy threshold are obtained and serve as the target voice recognition results, and the voice recognition accuracy is further improved.
In an alternative embodiment, the identification module 22 is specifically configured to:
taking the speech recognition model with the highest screening priority as an initial target speech recognition model;
recognizing the voice signal according to the target voice recognition model to obtain a voice recognition result;
when the accuracy of the voice recognition result is lower than a set accuracy threshold value, sequentially selecting the voice recognition model with the next screening priority as an updated target voice recognition model according to the screening sequence, and recognizing the voice signal according to the updated target voice recognition model;
and when the accuracy of the voice recognition result is determined to be higher than the set accuracy threshold, taking the voice recognition result with the accuracy higher than the set accuracy threshold as a target voice recognition result.
Therefore, the voice recognition results of the voice recognition models are screened according to the screening sequence of the voice recognition models corresponding to the historical use voice areas corresponding to the terminal until the voice recognition results with the accuracy higher than the set accuracy threshold are obtained, and the accuracy of the voice recognition results is improved.
In an alternative embodiment, the identification module 22 is specifically configured to:
and taking the frequency of using the voice in the historical voice using area corresponding to the terminal as a first sequencing reference index and the duration as a second sequencing reference index, sequencing the voice recognition models according to a sequencing rule that the frequency is from high to low and the duration is from large to small, and obtaining the screening sequencing of the priority of the voice recognition models corresponding to the historical voice using area corresponding to the terminal from high to low.
Therefore, ranking the voice recognition models with frequency as the first reference index and duration as the second raises the screening priority of the models corresponding to the areas where the terminal uses voice most often and for the longest time. Because those areas best reflect the regional pattern of the terminal user's voice use, the accuracy of the obtained target voice recognition result is further improved.
In an alternative embodiment, the identification module 22 is specifically configured to:
according to a sorting rule from high accuracy to low accuracy, sorting the voice recognition results generated by recognizing the voice signals by using the voice recognition model corresponding to the voice area in the history corresponding to the terminal;
and selecting a set number of the voice recognition results as target voice recognition results from high to low.
In an alternative embodiment, the apparatus further comprises a determination module 23 and a collection module 24; wherein:
the determining module 23 is configured to determine area information of an area where the terminal is currently located;
the collecting module 24 is configured to collect corpora of an area where the terminal is currently located;
the sending submodule 210 is further configured to send the corpus and the area information of the area where the terminal is currently located to the cloud server, where the corpus and the area information of the area where the terminal is currently located are used for the cloud server to establish a speech recognition model corresponding to the area where the terminal is currently located.
Therefore, the terminal collects the corpora of the current region and sends the corpora to the cloud server so that the cloud server can establish the voice recognition model corresponding to the current region, the voice recognition model is ensured to have higher accuracy, and the recognition accuracy of the voice recognition model is improved.
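A hypothetical shape for the message the terminal might send, carrying the terminal identifier, the current-area information, and the collected corpus, is sketched below; every field name is an assumption, as the disclosure does not specify a message format.

```python
# Illustrative payload the terminal could send so the cloud server can
# build (or refine) the speech recognition model for the current area.
def build_corpus_report(terminal_id, area_info, corpus_samples):
    """Package the current area's corpus together with the terminal
    identifier and area information for upload to the cloud server."""
    return {
        "terminal_id": terminal_id,   # matches the historical use record
        "area": area_info,            # area where the terminal currently is
        "corpus": list(corpus_samples),  # collected utterances from this area
    }

report = build_corpus_report("T1", "A", ["sample utterance 1", "sample utterance 2"])
```

On the server side, corpora accumulated per area would feed the training of that area's model, which is what keeps the area-specific models accurate.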
In an optional embodiment, the apparatus further comprises a display module 25 for displaying the target speech recognition result.
It should be noted that: in the speech recognition apparatus provided in the above embodiment, when implementing the speech recognition method, the division into the above program modules is taken only as an example; in practical applications, the processing may be distributed to different program modules as needed, that is, the internal structure of the speech recognition apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided by the above embodiment and the corresponding speech recognition method embodiment belong to the same concept; the specific implementation process thereof is detailed in the method embodiment and will not be described herein again.
An embodiment of the present invention provides a speech recognition apparatus, as shown in fig. 5, comprising: at least one processor 310 and a memory 311 for storing computer programs capable of running on the processor 310. The single processor 310 illustrated in fig. 5 does not indicate that the number of processors is one; it merely indicates the position of the processor 310 relative to other components, and in practical applications the number of processors 310 may be one or more. Likewise, the single memory 311 shown in fig. 5 merely indicates its position relative to other components, and in practical applications the number of memories 311 may be one or more.
Wherein, when the processor 310 is configured to run the computer program, the following steps are executed:
receiving a voice signal to be recognized;
acquiring a voice recognition model corresponding to a terminal, wherein the voice recognition model corresponding to the terminal is determined according to the corresponding relation between the terminal and a historical use voice area;
and recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with the accuracy meeting the set condition as a target voice recognition result.
In an alternative embodiment, the processor 310 is further configured to execute the following steps when the computer program is executed:
sending an acquisition request carrying the terminal identification to a cloud server;
and receiving a voice recognition model corresponding to a historical use voice area matched with the terminal identification returned by the cloud server based on the terminal identification.
In an alternative embodiment, the processor 310 is further configured to execute the following steps when the computer program is executed:
receiving a historical use voice record which is returned by the cloud server based on the terminal identification and is matched with the terminal identification, and a voice recognition model corresponding to a historical use voice area which is matched with the terminal identification, wherein the historical use voice record comprises the frequency and/or duration of the terminal using voice in each historical use voice area;
the recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with accuracy meeting set conditions as a target voice recognition result, comprising:
determining the screening sequence of the voice recognition model corresponding to the historical use voice area corresponding to the terminal according to the frequency and/or the duration;
and according to the screening sequence, sequentially screening the voice recognition results generated by recognizing the voice signals by using the voice recognition model corresponding to the voice area in the history corresponding to the terminal until the voice recognition results with the accuracy higher than the set accuracy threshold are obtained, and taking the voice recognition results with the accuracy higher than the set accuracy threshold as target voice recognition results.
In an alternative embodiment, the processor 310 is further configured to execute the following steps when the computer program is executed:
taking the speech recognition model with the highest screening priority as an initial target speech recognition model;
recognizing the voice signal according to the target voice recognition model to obtain a voice recognition result;
when the accuracy of the voice recognition result is lower than a set accuracy threshold value, sequentially selecting the voice recognition model with the next screening priority as an updated target voice recognition model according to the screening sequence, and recognizing the voice signal according to the updated target voice recognition model;
and when the accuracy of the voice recognition result is determined to be higher than the set accuracy threshold, taking the voice recognition result with the accuracy higher than the set accuracy threshold as a target voice recognition result.
In an alternative embodiment, the processor 310 is further configured to execute the following steps when the computer program is executed:
and taking the frequency of using the voice in the historical voice using area corresponding to the terminal as a first sequencing reference index and the duration as a second sequencing reference index, sequencing the voice recognition models according to a sequencing rule that the frequency is from high to low and the duration is from large to small, and obtaining the screening sequencing of the voice recognition models from high to low.
In an alternative embodiment, the processor 310 is further configured to execute the following steps when the computer program is executed:
according to a sorting rule from high accuracy to low accuracy, sorting the voice recognition results generated by recognizing the voice signal with the voice recognition models corresponding to the historical use voice areas of the terminal;
and selecting a set number of the voice recognition results, from high accuracy to low, as target voice recognition results.
In an alternative embodiment, the processor 310 is further configured to execute the following steps when the computer program is executed:
determining the area information of the area where the terminal is located currently;
collecting the corpus of the area where the terminal is located currently;
and sending the corpus and the area information of the area where the terminal is currently located to the cloud server, wherein the corpus and the area information of the area where the terminal is currently located are used for the cloud server to establish a voice recognition model corresponding to the area where the terminal is currently located.
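The corpus-collection steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `build_corpus_upload` helper, the JSON field names, and the sample values are all assumed for the example.

```python
import json

def build_corpus_upload(terminal_id, region_info, corpus_samples):
    """Package the current area's corpus and area information for the
    cloud server, which uses them to build the area's recognition model.

    terminal_id    -- unique terminal identifier (e.g. the IMEI number)
    region_info    -- the area the terminal is currently located in
    corpus_samples -- speech samples collected in that area (here,
                      stand-in file names)
    """
    return json.dumps({
        "terminal_id": terminal_id,
        "region": region_info,
        "corpus": corpus_samples,
    })

payload = build_corpus_upload("356938035643809", "Yanta District",
                              ["sample_001.wav", "sample_002.wav"])
```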
The speech recognition apparatus further includes: at least one network interface 312. The various components of the speech recognition apparatus are coupled together by a bus system 313. It will be appreciated that the bus system 313 is used to enable communications among the components connected. The bus system 313 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled as bus system 313 in FIG. 5.
The memory 311 may be a volatile memory, a nonvolatile memory, or may include both. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk storage or tape storage. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 311 described in the embodiments of the invention is intended to include, without being limited to, these and any other suitable types of memory.
The memory 311 in the embodiment of the present invention is used to store various types of data to support the operation of the voice recognition apparatus. Examples of such data include: any computer program for operating on the speech recognition device, such as operating systems and application programs; contact data; telephone book data; a message; a picture; video, etc. The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs may include various application programs such as a Media Player (Media Player), a Browser (Browser), etc. for implementing various application services. Here, the program that implements the method of the embodiment of the present invention may be included in an application program.
The present embodiment also provides a computer storage medium in which a computer program is stored. The computer storage medium may be a memory such as a Ferroelectric Random Access Memory (FRAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); or it may be any device including one or a combination of the above memories, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
A computer storage medium having a computer program stored therein, the computer program, when executed by a processor, performing the steps of:
receiving a voice signal to be recognized;
acquiring a voice recognition model corresponding to a terminal, wherein the voice recognition model corresponding to the terminal is determined according to the corresponding relation between the terminal and a historical use voice area;
and recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with the accuracy meeting the set condition as a target voice recognition result.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
sending an acquisition request carrying the terminal identification to a cloud server;
and receiving a voice recognition model corresponding to a historical use voice area matched with the terminal identification returned by the cloud server based on the terminal identification.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
receiving a historical use voice record which is returned by the cloud server based on the terminal identification and is matched with the terminal identification, and a voice recognition model corresponding to a historical use voice area which is matched with the terminal identification, wherein the historical use voice record comprises the frequency and/or duration of the terminal using voice in each historical use voice area;
the recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with accuracy meeting set conditions as a target voice recognition result, comprising:
determining the screening sequence of the voice recognition model corresponding to the historical use voice area corresponding to the terminal according to the frequency and/or the duration;
and according to the screening order, sequentially screening the voice recognition results generated by recognizing the voice signal with the voice recognition models corresponding to the historical use voice areas of the terminal, until a voice recognition result with an accuracy higher than the set accuracy threshold is obtained, and taking that voice recognition result as the target voice recognition result.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
taking the speech recognition model with the highest screening priority as an initial target speech recognition model;
recognizing the voice signal according to the target voice recognition model to obtain a voice recognition result;
when the accuracy of the voice recognition result is lower than a set accuracy threshold value, sequentially selecting the voice recognition model with the next screening priority as an updated target voice recognition model according to the screening sequence, and recognizing the voice signal according to the updated target voice recognition model;
and when the accuracy of the voice recognition result is determined to be higher than the set accuracy threshold, taking the voice recognition result with the accuracy higher than the set accuracy threshold as a target voice recognition result.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
and taking the frequency with which voice is used in the historical use voice areas corresponding to the terminal as a first sorting index and the duration as a second sorting index, sorting the voice recognition models from high frequency to low frequency and from long duration to short duration, and obtaining the screening order, from high priority to low priority, of the voice recognition models corresponding to the historical use voice areas of the terminal.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
according to a sorting rule from high accuracy to low accuracy, sorting the voice recognition results generated by recognizing the voice signal with the voice recognition models corresponding to the historical use voice areas of the terminal;
and selecting a set number of the voice recognition results, from high accuracy to low, as target voice recognition results.
In an alternative embodiment, the computer program, when executed by the processor, further performs the steps of:
determining the area information of the area where the terminal is located currently;
collecting the corpus of the area where the terminal is located currently;
and sending the corpus and the area information of the area where the terminal is currently located to the cloud server, wherein the corpus and the area information of the area where the terminal is currently located are used for the cloud server to establish a voice recognition model corresponding to the area where the terminal is currently located.
The following describes an embodiment of the present invention in further detail by using specific examples, and please refer to fig. 4 again, where the voice recognition apparatus provided in the embodiment of the present invention is applied to a terminal, and includes:
a receiving module 20, configured to receive a voice signal uttered by a user and, upon receiving it, read terminal identification information, such as an IMEI number, that uniquely identifies the physical terminal, and then trigger the obtaining module 21 to determine the current area information and historical use voice area information of the terminal;
the obtaining module 21, configured to read, when the terminal uses a voice function, the area information corresponding to the terminal's current location and record the frequency and duration of voice use in that area; and to read the voice recognition models, stored on the cloud server, corresponding to the terminal's current area and historical use voice areas;
the recognition module 22, configured to provide the corpora of voice use in different areas to the cloud server; and to perform voice recognition according to the voice recognition models of the different areas obtained by the obtaining module 21 and the voice recognition model reading rules, and output a voice recognition result;
and the display module 25, configured to display one or more voice recognition results passed from the recognition module 22 for the user to view and select.
The principle process of the voice recognition device applied to the terminal provided by the embodiment of the invention for voice recognition is as follows:
1) when the terminal uses the voice recognition software function, the terminal records the frequency and the duration of the voice function used in the area by reading the area information corresponding to the current position;
2) the terminal provides the voice materials of the voice functions used in different areas to the cloud server;
3) a terminal receives a voice signal sent by a user;
4) the terminal reads the current geographical position information of the terminal;
5) the terminal determines the current area information of the terminal according to the current geographical position information;
6) the terminal reads a voice recognition model corresponding to a current area and a historical voice using area;
7) the terminal reads the voice recognition models according to a preset voice recognition model reading rule to obtain a voice recognition result; the voice recognition model reading rules include rule one and rule two, where:
the first rule is: perform priority sorting according to the frequency with which the voice function is used in each area, obtaining the screening order of the terminal's voice recognition models for the different areas from high priority to low; read the voice recognition model corresponding to the first-ranked area; when the terminal determines that the accuracy contained in the voice recognition result output by that model is lower than the preset accuracy threshold, select the voice recognition model corresponding to the second-ranked area, and so on, until the accuracy contained in the obtained voice recognition result is higher than the preset accuracy threshold, and display that voice recognition result;
the second rule is: read the terminal's voice recognition models for the different areas simultaneously, compare the accuracy contained in the voice recognition results output by the models, and sort the results in descending order of accuracy; preferably, the terminal selects the three results with the highest accuracy for display, from which the user makes the final choice.
As shown in fig. 6, a schematic diagram of a speech recognition method according to an alternative embodiment of the present invention includes the following steps:
step S201: the terminal receives a voice signal;
specifically, the terminal receives a voice signal uttered by the user through voice software or the like having a voice recognition function.
Step S202: the terminal acquires the position and determines the area;
specifically, the terminal reads the IMEI number of the terminal, acquires the position of the terminal, and determines the area information of the terminal according to the position of the terminal.
Step S203: the terminal is connected with the cloud server, corpus collection and training of the voice recognition model are carried out, and a corresponding relation between the area and the voice recognition model is established;
referring to fig. 7, if the areas of the terminal include a first area, a second area, and so on, the terminal may collect the corpus of the first area, the corpus of the second area, a general corpus, and the like, and send the collected corpora to the cloud server, so that the cloud server gathers the corpora of different terminals, or of the same terminal in different areas; the corpora may be voice signals. For example, if the terminal uses voice recognition software to utter voice signals in areas such as the Yanta District of Xi'an, the Weiyang District of Xi'an, and Maya Bay in Thailand, the cloud server can collect those voice signals as the corpora of these three areas.
Here, the cloud server trains on the corpora of the different areas to generate a speech recognition model for each area: a speech recognition model of the first area is established from the corpus of the first area, a speech recognition model of the second area from the corpus of the second area, and a universal speech recognition model from the general corpus. For example, the speech recognition model of Maya Bay is generated from the corpus of Maya Bay. The cloud server then establishes the correspondence between areas and speech recognition models, so that the corresponding speech recognition model can be obtained from the area information.
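The area-to-model correspondence maintained on the cloud server can be sketched as a small registry. The `CloudModelRegistry` class and the toy `train_fn` (which merely records the vocabulary of a corpus) are hypothetical stand-ins for the server's real training pipeline:

```python
from collections import defaultdict

class CloudModelRegistry:
    """Maintain the correspondence between areas and speech models."""

    def __init__(self, train_fn):
        self._train = train_fn             # builds a model from a corpus
        self._corpora = defaultdict(list)  # area -> collected samples
        self._models = {}                  # area -> trained model

    def add_corpus(self, area, samples):
        # Gather samples sent by terminals located in the given area.
        self._corpora[area].extend(samples)

    def train_area(self, area):
        # Train (or retrain) the model for one area from its corpus.
        self._models[area] = self._train(self._corpora[area])

    def model_for(self, area):
        # Resolve the model established for an area, if any.
        return self._models.get(area)

# A toy "training" function standing in for real model training.
registry = CloudModelRegistry(train_fn=lambda corpus: set(corpus))
registry.add_corpus("Yanta District", ["ni hao", "bing ma yong"])
registry.train_area("Yanta District")
```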
Step S204: the terminal records the frequency and duration of voice signals sent out by using voice recognition software in each area;
specifically, the terminal records the frequency and duration of the voice signals the user utters with the voice software in each area. Here, the frequency may be regarded as the number of times the user utters a voice signal with the voice software within a set time range, and the duration as the total time of those voice signals within that range. Preferably, the cloud server stores the frequency and duration recorded by the terminal in the format IMEI_Area_F_T; that is, the terminal identified by IMEI used the voice software in area Area with frequency F and duration T.
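The IMEI_Area_F_T record format can be sketched with a pair of helpers; the function names and the choice of integer fields are illustrative assumptions:

```python
def encode_usage_record(imei, area, frequency, duration):
    """Serialize one usage record as IMEI_Area_F_T."""
    return f"{imei}_{area}_{frequency}_{duration}"

def decode_usage_record(record):
    """Parse a record back into its fields. The IMEI is split off from
    the left and F/T from the right, so an area name may itself
    contain underscores."""
    imei, rest = record.split("_", 1)
    area, f, t = rest.rsplit("_", 2)
    return {"imei": imei, "area": area,
            "frequency": int(f), "duration": int(t)}

record = encode_usage_record("356938035643809", "Yanta_District", 12, 340)
```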
Step S205: the terminal transmits the corresponding relation between the terminal and the area information to a cloud server;
specifically, the terminal sends information such as the duration of a voice signal sent by the terminal in a current area by using voice software, the current area and the like to the cloud server, so that the cloud server establishes a corresponding relation between the IMEI number of the terminal, a historical voice using area and the frequency and duration of voice using in the corresponding area.
Step S206: the terminal acquires a corresponding voice recognition model;
specifically, the terminal acquires a voice recognition model corresponding to the terminal in a historical voice using area from a cloud server.
Step S207: judging whether the preset speech recognition model reading rule is a rule one or a rule two, and if the preset speech recognition model reading rule is the rule one, executing the step S208; if the preset speech recognition model reading rule is rule two, executing step S209;
step S208: acquiring and displaying a voice recognition result according to the first rule;
here, the process of acquiring and displaying the speech recognition result according to rule one is shown in fig. 8, and includes:
step S2081: reading the recorded frequency F and duration T of voice signals sent by voice recognition software used in a corresponding area according to the IMEI number of the terminal;
specifically, the terminal acquires the frequency F and the duration T of voice signals sent by voice recognition software used by the terminal in each area, which are stored on the cloud server, according to the IMEI number of the terminal.
Step S2082: sorting by the sum of F and T, the voice recognition model of the area with the larger sum having the higher priority;
specifically, the terminal sorts the areas by the sum of the frequency F and duration T obtained for each area, and the voice recognition model of the area with the largest sum has the highest priority. That is: high frequency with long duration ranks highest; high frequency with short duration next; low frequency with long duration after that; and low frequency with short duration lowest.
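The ordering in step S2082 matches the frequency-first, duration-second rule of claim 5 and can be sketched as below (the sum-of-F-and-T key would be a one-line change to the sort key); the data values are illustrative:

```python
def rank_areas(usage):
    """Order areas (and hence their models) by screening priority.

    usage maps each area to a (frequency, duration) pair; Python's
    tuple comparison gives frequency-first, duration-second ordering,
    and reverse=True makes both descending.
    """
    return sorted(usage, key=lambda area: usage[area], reverse=True)

order = rank_areas({"Yanta District": (12, 340),
                    "Weiyang District": (12, 85),
                    "Maya Bay": (3, 400)})
# order[0] is the area where voice was used most often and longest
```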
Step S2083: adopting a voice recognition model corresponding to the region with the highest priority to recognize the voice signal sent by the user to obtain a recognition result;
here, the terminal presets a threshold K of accuracy of the voice recognition result.
Step S2084: judging whether the accuracy of the identification result is greater than a preset accuracy threshold, if so, executing a step S2087, otherwise, executing a step S2085;
step S2085: adopting a voice recognition model with the next priority to recognize the voice signal sent by the user until the accuracy of the obtained voice recognition result is greater than an accuracy threshold;
specifically, when the accuracy D1 included in the speech recognition result S1 is lower than the accuracy threshold K, selecting a speech recognition model with a second priority according to the order of priority to recognize the speech signal sent by the user, and obtaining a speech recognition result S2; when the accuracy D2 included in the speech recognition result S2 is greater than the accuracy threshold K, the speech recognition result S2 is displayed; when the accuracy D2 included in the speech recognition result S2 is still lower than the accuracy threshold K, the speech recognition models of the next priority are continuously read in the order of priority until the accuracy of the obtained speech recognition result is greater than the accuracy threshold K.
Step S2086: when the accuracy of the voice recognition results does not reach the accuracy threshold, sorting the voice recognition results of all the voice recognition models according to the accuracy, and selecting the voice recognition result with the highest accuracy for display;
specifically, when the terminal has recognized the voice signal uttered by the user with the voice recognition models corresponding to its current area and each historical use voice area, and the accuracy of every resulting voice recognition result is less than or equal to the threshold K, the voice recognition results of all the models are sorted by accuracy, and the result with the highest accuracy is selected for display.
Step S2087: displaying a voice recognition result;
step S2088: and when the displayed recognition result still does not meet the requirements of the user, performing voice recognition by using the universal voice recognition model of the cloud server, and collecting the corpus of the current area to update the voice recognition model corresponding to the current area.
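Rule one's sequential screening (steps S2083 through S2086) can be sketched as a loop over the ranked models; the stub models and the (text, accuracy) return convention are assumptions made for the example:

```python
def recognize_rule_one(signal, ranked_models, threshold):
    """Try models in screening order; return the first result whose
    accuracy exceeds the threshold. If none does, fall back to the
    most accurate result seen, as in step S2086."""
    best = None
    for model in ranked_models:
        text, accuracy = model(signal)
        if accuracy > threshold:
            return text, accuracy
        if best is None or accuracy > best[1]:
            best = (text, accuracy)
    return best  # no model beat the threshold; show the best attempt

# Stub recognizers standing in for per-area models, highest priority first.
models = [lambda s: ("ni hao ma", 0.62), lambda s: ("ni hao", 0.91)]
result = recognize_rule_one("<pcm audio>", models, threshold=0.8)
```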
Step S209: acquiring and displaying a voice recognition result according to a rule II;
specifically, according to the IMEI number of the terminal, the voice recognition models stored on the cloud server for each of the terminal's areas are used simultaneously to recognize the voice signal uttered by the user; the per-area models include the voice recognition model corresponding to the terminal's current area and the models corresponding to its historical use voice areas. The voice recognition results obtained from all the models are sorted by accuracy, the three results with the highest accuracy are extracted and displayed, and the user makes the final selection.
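Rule two can be sketched as running every area model on the signal and keeping the top three results by accuracy; the stub models are illustrative:

```python
def recognize_rule_two(signal, models, top_n=3):
    """Run all area models on the signal, sort their (text, accuracy)
    results by accuracy in descending order, and keep the top_n
    candidates for the user to choose from."""
    results = [model(signal) for model in models]
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:top_n]

models = [lambda s: ("hello", 0.70),
          lambda s: ("halo", 0.40),
          lambda s: ("hella", 0.55),
          lambda s: ("hullo", 0.20)]
candidates = recognize_rule_two("<pcm audio>", models)
```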
To sum up, the voice recognition method provided in the embodiment of the present invention recognizes the voice signal uttered by the user through voice recognition software on the terminal by using the voice recognition models corresponding to the terminal's current area and its historical use voice areas, which improves the efficiency of voice recognition; and, during recognition, the preset rules for obtaining the voice recognition model of the corresponding area enhance the accuracy of voice recognition.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. The scope of the invention is to be determined by the scope of the appended claims.

Claims (10)

1. A method of speech recognition, the method comprising:
receiving a voice signal to be recognized;
acquiring a voice recognition model corresponding to a terminal, wherein the voice recognition model corresponding to the terminal is determined according to the corresponding relation between the terminal and a historical use voice area;
and recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with the accuracy meeting the set condition as a target voice recognition result.
2. The method of claim 1, wherein the obtaining of the voice recognition model corresponding to the terminal comprises:
sending an acquisition request carrying the terminal identification to a cloud server;
and receiving a voice recognition model corresponding to a historical use voice area matched with the terminal identification returned by the cloud server based on the terminal identification.
3. The method of claim 2, wherein the receiving the voice recognition model corresponding to the historical use voice area matched with the terminal identification and returned by the cloud server based on the terminal identification comprises:
receiving a historical use voice record which is returned by the cloud server based on the terminal identification and is matched with the terminal identification, and a voice recognition model corresponding to a historical use voice area which is matched with the terminal identification, wherein the historical use voice record comprises the frequency and/or duration of the terminal using voice in each historical use voice area;
the recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with accuracy meeting set conditions as a target voice recognition result, comprising:
determining the screening sequence of the voice recognition model corresponding to the historical use voice area corresponding to the terminal according to the frequency and/or the duration;
and sequentially screening the voice recognition results generated by the voice recognition model for recognizing the voice signals according to the screening sequence until the voice recognition results with the accuracy higher than the set accuracy threshold are obtained, and taking the voice recognition results with the accuracy higher than the set accuracy threshold as target voice recognition results.
4. The method of claim 3, wherein the sequentially filtering the speech recognition results generated by the speech recognition model for recognizing the speech signal according to the filtering ordering until obtaining the speech recognition result with accuracy higher than a set accuracy threshold, and using the speech recognition result with accuracy higher than the set accuracy threshold as the target speech recognition result comprises:
taking the speech recognition model with the highest screening priority as an initial target speech recognition model;
recognizing the voice signal according to the target voice recognition model to obtain a voice recognition result;
when the accuracy of the voice recognition result is lower than a set accuracy threshold value, sequentially selecting the voice recognition model with the next screening priority as an updated target voice recognition model according to the screening sequence, and recognizing the voice signal according to the updated target voice recognition model;
and when the accuracy of the voice recognition result is determined to be higher than the set accuracy threshold, taking the voice recognition result with the accuracy higher than the set accuracy threshold as a target voice recognition result.
5. The method according to claim 3 or 4, wherein the determining, according to the frequency and/or the duration, the screening order of the voice recognition models corresponding to the historical use voice areas of the terminal comprises:
and taking the frequency with which voice is used in the historical use voice areas corresponding to the terminal as a first sorting index and the duration as a second sorting index, sorting the voice recognition models from high frequency to low frequency and from long duration to short duration, and obtaining the screening order of the voice recognition models from high priority to low priority.
6. The method according to claim 1 or 2, wherein the recognizing the voice signal according to the voice recognition model corresponding to the terminal, and obtaining a voice recognition result with an accuracy meeting a set condition as a target voice recognition result comprises:
according to a sorting rule from high accuracy to low accuracy, sorting the voice recognition results generated by recognizing the voice signal with the voice recognition models corresponding to the historical use voice areas of the terminal;
and selecting a set number of the voice recognition results as target voice recognition results from high to low.
7. The method according to claim 1 or 2, wherein before the obtaining the speech recognition model corresponding to the terminal, the method further comprises:
determining the area information of the area where the terminal is located currently;
collecting the corpus of the area where the terminal is located currently;
and sending the corpus and the area information of the area where the terminal is currently located to the cloud server, wherein the corpus and the area information of the area where the terminal is currently located are used for the cloud server to establish a voice recognition model corresponding to the area where the terminal is currently located.
8. A speech recognition apparatus, comprising:
the receiving module is used for receiving a voice signal to be recognized;
the terminal comprises an acquisition module, a storage module and a processing module, wherein the acquisition module is used for acquiring a voice recognition model corresponding to the terminal, and the voice recognition model corresponding to the terminal is determined according to the corresponding relation between the terminal and a historical use voice area;
and the recognition module is used for recognizing the voice signal according to the voice recognition model corresponding to the terminal and obtaining a voice recognition result with the accuracy meeting the set condition as a target voice recognition result.
9. A speech recognition apparatus, characterized in that the speech recognition apparatus comprises a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor is adapted to perform the steps of the speech recognition method according to any of claims 1 to 7 when running the computer program.
10. A computer storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 7.
CN201810784944.0A 2018-07-17 Speech recognition method, device and computer storage medium Active CN110797014B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810784944.0A CN110797014B (en) 2018-07-17 Speech recognition method, device and computer storage medium

Publications (2)

Publication Number Publication Date
CN110797014A true CN110797014A (en) 2020-02-14
CN110797014B CN110797014B (en) 2024-06-07

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0883091A (en) * 1994-09-09 1996-03-26 Matsushita Electric Ind Co Ltd Voice recognition device
US20140088962A1 (en) * 2012-09-25 2014-03-27 Nvoq Incorporated Apparatus and methods for managing resources for a system using voice recognition
CN104036421A (en) * 2014-06-15 2014-09-10 王美金 Telephone number voice recognition based banking business application form generation system
WO2015005679A1 (en) * 2013-07-09 2015-01-15 주식회사 윌러스표준기술연구소 Voice recognition method, apparatus, and system
CN104509079A (en) * 2012-08-01 2015-04-08 谷歌公司 Speech recognition models based on location indicia
CN105162836A (en) * 2015-07-29 2015-12-16 百度在线网络技术(北京)有限公司 Method for executing speech communication, server and intelligent terminal equipment
CN105336342A (en) * 2015-11-17 2016-02-17 科大讯飞股份有限公司 Method and system for evaluating speech recognition results
KR20160055059A (en) * 2014-11-07 2016-05-17 삼성전자주식회사 Method and apparatus for speech signal processing
CN105679314A (en) * 2015-12-28 2016-06-15 百度在线网络技术(北京)有限公司 Speech recognition method and device
WO2016127795A1 (en) * 2015-02-13 2016-08-18 腾讯科技(深圳)有限公司 Service processing method, server, and terminal
CN105895103A (en) * 2015-12-03 2016-08-24 乐视致新电子科技(天津)有限公司 Speech recognition method and device
CN106164921A (en) * 2014-07-18 2016-11-23 谷歌公司 The spokesman utilizing colocated information verifies
CN107274885A (en) * 2017-05-31 2017-10-20 广东欧珀移动通信有限公司 Audio recognition method and Related product
CN107945792A (en) * 2017-11-06 2018-04-20 百度在线网络技术(北京)有限公司 Method of speech processing and device
WO2018076664A1 (en) * 2016-10-27 2018-05-03 中兴通讯股份有限公司 Voice broadcasting method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017663A (en) * 2020-08-14 2020-12-01 博泰车联网(南京)有限公司 Voice generalization method and device and computer storage medium
CN112017663B (en) * 2020-08-14 2024-04-30 博泰车联网(南京)有限公司 Voice generalization method and device and computer storage medium
CN113763925A (en) * 2021-05-26 2021-12-07 腾讯科技(深圳)有限公司 Speech recognition method, speech recognition device, computer equipment and storage medium
CN113763925B (en) * 2021-05-26 2024-03-12 腾讯科技(深圳)有限公司 Speech recognition method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant