CN115762477A - Voice recognition model selection method and device, electronic equipment and storage medium - Google Patents

Voice recognition model selection method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN115762477A
Authority
CN
China
Prior art keywords
recognition model
audio data
text data
data
test
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211449826.7A
Other languages
Chinese (zh)
Inventor
徐铭驰
高峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202211449826.7A priority Critical patent/CN115762477A/en
Publication of CN115762477A publication Critical patent/CN115762477A/en

Abstract

The application provides a speech recognition model selection method and apparatus, an electronic device and a storage medium. The method comprises the following steps: establishing an evaluation index according to a decision tree model; performing a preprocessing operation on pre-recorded initial audio data to determine test audio data, and determining test text data corresponding to the test audio data according to the initial text data corresponding to the initial audio data; for each pre-acquired speech recognition model, inputting the test audio data into the speech recognition model to determine recognition text data, and determining evaluation data of the speech recognition model on the evaluation index according to the test text data and the recognition text data; ranking each speech recognition model under the evaluation indexes according to the plurality of evaluation data to determine a total score corresponding to each speech recognition model; and selecting the speech recognition model with the highest total score as the target speech recognition model. The adaptability of the speech recognition model to the current application scenario is improved, and the accuracy of speech recognition is improved.

Description

Voice recognition model selection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for selecting a speech recognition model, an electronic device, and a storage medium.
Background
In the related art, for different speech recognition application scenarios, a general speech recognition technology model is usually adopted, but because different models have corresponding constraint conditions, the general speech recognition model cannot be applied to all application scenarios, and therefore the related art has the problem of low speech recognition accuracy caused by poor adaptability of the speech recognition model to the application scenarios.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method, an apparatus, an electronic device and a storage medium for selecting a speech recognition model.
In view of the above, in a first aspect, the present application provides a speech recognition model selection method, including:
establishing an evaluation index according to the decision tree model;
performing preprocessing operation on pre-recorded initial audio data to determine test audio data, and determining test text data corresponding to the test audio data according to the initial text data corresponding to the initial audio data;
for each of the speech recognition models that are acquired in advance,
inputting the test audio data into the speech recognition model to determine recognized text data,
determining evaluation data of the speech recognition model in evaluation indexes according to the test text data and the recognition text data;
ranking each speech recognition model under the evaluation indexes according to the plurality of evaluation data to determine a total score corresponding to each speech recognition model;
and selecting the speech recognition model with the highest total score as a target speech recognition model.
In a second aspect, the present application provides a speech recognition model selection apparatus comprising:
a construction module configured to construct an evaluation index according to the decision tree model;
a determining module configured to perform a preprocessing operation on pre-recorded initial audio data to determine test audio data, and to determine test text data corresponding to the test audio data according to the initial text data corresponding to the initial audio data;
a testing module configured to, for each of the pre-acquired speech recognition models,
inputting the test audio data into the speech recognition model to determine recognized text data,
determining evaluation data of the speech recognition model in an evaluation index according to the test text data and the recognition text data;
an evaluation module configured to rank each speech recognition model under the evaluation indexes according to the plurality of evaluation data, so as to determine a total score corresponding to each speech recognition model;
a selection module configured to select the speech recognition model with the highest overall score as a target speech recognition model.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the speech recognition model selection method according to the first aspect when executing the program.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the speech recognition model selection method according to the first aspect.
As can be seen from the above, according to the speech recognition model selection method and apparatus, electronic device and storage medium provided by the present application, an evaluation index can be constructed according to a decision tree model; a preprocessing operation can be performed on pre-recorded initial audio data to obtain test audio data, and test text data corresponding to the test audio data can be determined according to the initial text data corresponding to the initial audio data. Further, for each pre-acquired speech recognition model, the test audio data can be input into the speech recognition model to obtain recognition text data, and the evaluation data of the speech recognition model on each evaluation index is determined according to the test text data and the recognition text data obtained through the test. Further, each speech recognition model can be ranked under each evaluation index according to its evaluation data, so that the total score corresponding to each speech recognition model is determined; to meet the comprehensive requirements of the current application scenario, the speech recognition model with the highest total score can be selected as the target speech recognition model to perform the speech recognition work in the current application scenario. In this way, a plurality of speech recognition models can be scored for different application scenarios and the speech recognition model best suited to the current application scenario can be selected, which improves the adaptability of the speech recognition model to the current application scenario and improves the accuracy of speech recognition.
Drawings
In order to more clearly illustrate the technical solutions in the present application or the related art, the drawings needed to be used in the description of the embodiments or the related art will be briefly introduced below, and it is obvious that the drawings in the following description are only embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 illustrates an exemplary flowchart of a speech recognition model selection method provided in an embodiment of the present application.
Fig. 2 shows an exemplary structural diagram of a speech recognition model selection apparatus provided in an embodiment of the present application.
Fig. 3 shows an exemplary structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that technical terms or scientific terms used in the embodiments of the present application should have a general meaning as understood by those having ordinary skill in the art to which the present application belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the present application is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the background section, in the related art, a generic speech recognition technology model is typically employed for different speech recognition application scenarios.
However, the applicant finds through research that different models have corresponding constraint conditions, and a general speech recognition model cannot be applied to all application scenarios, so that the related art has the problem of low speech recognition accuracy caused by poor adaptability of the speech recognition model to the application scenarios.
Therefore, according to the speech recognition model selection method, the speech recognition model selection device, the electronic equipment and the storage medium, the evaluation index can be constructed according to the decision tree model, the pre-processing operation can be performed on the pre-recorded initial audio data, the test audio data can be obtained, and the test text data corresponding to the test audio data can be determined according to the initial text data corresponding to the initial audio data; further, for each pre-acquired voice recognition model, the test audio data can be input into the voice recognition model to obtain recognition text data, and the evaluation data of the voice recognition model on each evaluation index is determined according to the test text data and the recognition text data obtained through the test; and further, sequencing each voice recognition model under each evaluation index according to the evaluation data corresponding to each voice recognition model, and further determining the total score corresponding to each voice recognition model. By the method, the plurality of voice recognition models can be graded according to different application scenes, and then the voice recognition model most suitable for the current application scene is selected, so that the adaptability of the voice recognition model and the current application scene is improved, and the accuracy of voice recognition is improved.
The following describes a speech recognition model selection method provided in the embodiments of the present application with specific embodiments.
Fig. 1 illustrates an exemplary flowchart of a speech recognition model selection method provided in an embodiment of the present application.
Referring to fig. 1, a method for selecting a speech recognition model provided in an embodiment of the present application specifically includes the following steps:
s102: and establishing an evaluation index according to the decision tree model.
S104: the method comprises the steps of executing preprocessing operation on pre-recorded initial audio data to determine test audio data, and determining test text data corresponding to the test audio data according to the initial text data corresponding to the initial audio data.
S106: for each of the speech recognition models that were previously acquired,
inputting the test audio data into the speech recognition model to determine recognized text data,
and determining evaluation data of the speech recognition model in an evaluation index according to the test text data and the recognition text data.
S108: and sequencing each voice recognition model in the evaluation indexes according to the plurality of evaluation data to determine the total score corresponding to each voice recognition model.
S110: and selecting the speech recognition model with the highest total score as a target speech recognition model.
In some embodiments, the evaluation objects are speech recognition models of different manufacturers, and the evaluation index may be constructed according to a decision tree model, for example, a basic evaluation index is related to speech, the basic evaluation index may be a volume, an accent, a speech speed, a timbre, a dialect, a dialogue scene, a language, a sound source, a speaking mode, a text content, a sound pickup device, and the like, and the decision tree model may select a basic evaluation index with a higher weight as a final evaluation index according to a weight of each basic evaluation index. The weight of each basic evaluation index can be determined according to an expert experience method, and taking the basic evaluation index of the volume U1 as an example, the volume U1 can be subdivided into: the normal volume U11, the small volume U12 and the large volume U13 are basically normal volumes in practical application scenarios, so the weight of the normal volume U11 can be set to 0.9, the weight of the small volume U12 can be set to 0.05, and the weight of the large volume U13 can be set to 0.05.
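The weight-based index selection just described can be sketched in a few lines of Python. The function name and the 0.5 weight threshold are illustrative assumptions; the volume sub-indexes and their 0.9/0.05/0.05 weights come from the example above.

```python
def select_indexes(weighted_indexes, threshold=0.5):
    """Keep the basic evaluation indexes whose expert weight is at least the threshold."""
    return [name for name, weight in weighted_indexes.items() if weight >= threshold]

# Volume sub-indexes and weights from the example above (expert experience method).
volume_sub_indexes = {
    "normal volume (U11)": 0.9,
    "low volume (U12)": 0.05,
    "high volume (U13)": 0.05,
}

final_indexes = select_indexes(volume_sub_indexes)
# Only the high-weight sub-index "normal volume (U11)" survives the cut.
```

In practice the decision tree model would apply such a cut at each branching level of the basic evaluation indexes.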
In some embodiments, different speech recognition models may be tested separately by recording initial audio data in order to evaluate the capabilities of the different speech recognition models. However, because the scenarios that speech recognition needs to consider are quite varied, testing can be performed according to evaluation indexes selected for the applicable scenarios; but obtaining such test audio data is very difficult, and manual voice collection and labeling consume a lot of time and effort. Therefore, the initial audio data pre-recorded under normal conditions can be processed in an automated manner, that is, a preprocessing operation is performed on the initial audio data to obtain different kinds of test audio data.
Specifically, the initial audio data may include one or more of audio data of different accents, audio data of a multi-person conversation scene, or audio data of different languages, so as to meet test requirements of different application scenes.
In order to further meet the test requirements of different application scenarios and reduce the workload of the pre-recording stage, a plurality of types of test audio data can be determined by performing the preprocessing operation on the initial audio data. For example, the test audio data may include variable-speed test audio data. Initial audio data at a normal speech speed is first obtained through recording, and the initial audio data is then adjusted according to a speed change parameter to determine the variable-speed test audio data. Specifically, the speech speed of the variable-speed test audio data may be expressed as

V_test = V_initial × A

where V_initial denotes the normal speech speed and A denotes the speed change parameter, which can be any positive multiple. The speed change parameter may be set to less than 1 to obtain test audio data with a slower speech speed, and to greater than 1 to obtain test audio data with a faster speech speed.
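A minimal sketch of the speed-change preprocessing, assuming audio is held as a list of normalized samples. The naive linear resampler below changes duration (and, as a side effect, pitch); a production pipeline would more likely use a pitch-preserving time-stretch from an audio library. Names are illustrative.

```python
def change_speed(samples, rate):
    """Resample so playback speed becomes rate times the original.

    rate > 1 gives faster speech (fewer samples); rate < 1 gives slower
    speech (more samples). Linear interpolation between neighbouring samples.
    """
    out_len = int(len(samples) / rate)
    out = []
    for i in range(out_len):
        pos = i * rate              # fractional position in the source signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

slow = change_speed([0.0, 1.0, 0.0, -1.0], 0.5)  # half speed: twice the samples
fast = change_speed([0.0, 1.0, 0.0, -1.0], 2.0)  # double speed: half the samples
```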
In some embodiments, the test audio data may include variable-volume test audio data. The current volume of the pre-recorded initial audio data is first determined and taken as the normal volume, and the initial audio data is then adjusted according to a volume adjustment parameter to determine the variable-volume test audio data. Specifically, the volume of the variable-volume test audio data may be expressed as

bel_test = bel_initial + db

where bel_initial represents the current volume of the initial audio data and db represents the volume adjustment parameter, in decibels. The volume adjustment parameter may be set to a negative number to obtain test audio data with a lower volume, and to a positive number to obtain test audio data with a higher volume.
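The volume adjustment can be sketched by converting the decibel offset db into a linear amplitude gain via the standard relation gain = 10^(db/20). The function name and the normalized-sample format are illustrative assumptions.

```python
def adjust_volume(samples, db):
    """Apply the volume adjustment parameter db (in decibels) to the samples.

    Positive db raises the volume and negative db lowers it, matching
    bel_test = bel_initial + db on the logarithmic decibel scale.
    """
    gain = 10 ** (db / 20)  # decibel offset -> linear amplitude gain
    return [s * gain for s in samples]

louder = adjust_volume([0.1, -0.2], 20)    # +20 dB: amplitude x10
quieter = adjust_volume([0.1, -0.2], -20)  # -20 dB: amplitude x0.1
```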
In some embodiments, the test audio data may include mixed test audio data. Noise audio data and the pre-recorded initial audio data are first determined, and the initial audio data and the noise audio data are then mixed and superposed to determine the mixed test audio data.
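The mixing and superposing step can be sketched as sample-wise addition with clipping. Repeating a shorter noise clip and the noise_gain parameter are illustrative assumptions, not requirements stated above.

```python
def mix_audio(speech, noise, noise_gain=1.0):
    """Superpose noise onto the speech, sample by sample.

    Noise shorter than the speech is repeated, and each mixed sample is
    clipped to the normalized range [-1.0, 1.0].
    """
    mixed = []
    for i, s in enumerate(speech):
        n = noise[i % len(noise)] * noise_gain
        mixed.append(max(-1.0, min(1.0, s + n)))
    return mixed

mixed = mix_audio([0.5, -0.5, 0.9], [0.2, -0.1])  # third sample clips at 1.0
```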
In some embodiments, the most important evaluation factor for a speech recognition model should be how accurately it recognizes speech under different influencing factors, so the evaluation indexes may include a first word accuracy rate in addition to the basic evaluation indexes. Specifically, after the test audio data is input into the different speech recognition models, the following can be determined for any one speech recognition model by comparing the corresponding character positions in the test text data and the recognition text data. If a character of the test text data has been replaced, the recognition result of the current speech recognition model contains a substitution error, and the number of replaced characters can be determined. If a character of the test text data has been removed, the recognition result contains a removal error, and the number of removed characters can be determined. If a character has been inserted into the recognition text data, the recognition result contains an insertion error, and the number of inserted characters can be determined.
Further, the total number of characters in the test text data can be determined, and the first word accuracy rate can then be determined according to the total number of characters, the number of replaced characters, the number of removed characters and the number of inserted characters, with the quantized value corresponding to the first word accuracy rate used as the evaluation data of the speech recognition model.
For example, the test text data is the Chinese sentence rendered in English as "When will the cloud manufacturing town be built?", and the recognition text data obtained by the current speech recognition model from the corresponding test audio data differs from it in three places in the original Chinese. The character meaning "intelligent" replaces the character meaning "manufacture" in the test text data, which is a substitution error, so the number of replaced characters is 1. Compared with the test text data, the character meaning "small" is missing from the recognition text data, which is a removal error, so the number of removed characters is 1. An extra modal particle is inserted into the recognition text data, which is an insertion error, so the number of inserted characters is 1. In addition, the total number of characters in the test text data is 12. The first word error rate of the current speech recognition model may be expressed as
WER = (S + D + I) / N
Wherein S represents the number of replaced characters, D represents the number of removed characters, I represents the number of inserted characters, and N represents the total number of characters.
Further, determining the first word correct rate based on the first word error rate may be expressed as
WCR=1-WER
The first word accuracy of the speech recognition model is
WCR = 1 - (1 + 1 + 1) / 12 = 75%
Here, 75% is the quantized value of the first word accuracy rate, i.e. the evaluation data.
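The replaced, removed and inserted counts and the resulting WER and WCR can be computed automatically with a standard Levenshtein alignment. This is a generic sketch of the computation the example above performs by hand, with illustrative function names.

```python
def error_counts(reference, hypothesis):
    """Count substitutions, deletions and insertions that turn the reference
    character sequence into the hypothesis (Levenshtein alignment)."""
    R, H = len(reference), len(hypothesis)
    # dist[i][j] is the edit distance between reference[:i] and hypothesis[:j].
    dist = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dist[i][0] = i
    for j in range(H + 1):
        dist[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back through the table to split the distance into S, D and I.
    s = d = ins = 0
    i, j = R, H
    while i > 0 or j > 0:
        cost = 1
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            s += cost                # cost 1 means a substitution happened here
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            d += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return s, d, ins

def word_accuracy(reference, hypothesis):
    """WCR = 1 - WER = 1 - (S + D + I) / N, as in the formulas above."""
    s, d, ins = error_counts(reference, hypothesis)
    return 1 - (s + d + ins) / len(reference)
```

Applied to a 12-character reference with one substitution, one removal and one insertion, word_accuracy returns 0.75, matching the 75% above.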
In some embodiments, the recognition result may contain punctuation marks in addition to characters, so in order to further evaluate the recognition accuracy of a speech recognition model, the recognition of punctuation marks may be added as a reference factor, and the evaluation indexes may further include a second word accuracy rate. Specifically, after the test audio data is input into the different speech recognition models, the following can be determined for any one speech recognition model by comparing the corresponding punctuation positions in the test text data and the recognition text data. If a punctuation mark of the test text data has been replaced, the recognition result of the current speech recognition model contains a substitution error, and the number of replaced punctuation marks can be determined. If a punctuation mark of the test text data has been removed, the recognition result contains a removal error, and the number of removed punctuation marks can be determined. If a punctuation mark has been inserted into the recognition text data, the recognition result contains an insertion error, and the number of inserted punctuation marks can be determined.
Further, the total number of characters in the test text data can be determined, and further, a second word accuracy rate can be determined according to the total number of characters, the number of replaced characters, the number of removed characters, the number of inserted characters, the number of replaced punctuations, the number of removed punctuations and the number of inserted punctuations, and a quantization value corresponding to the second word accuracy rate is used as evaluation data of the voice recognition model.
For example, as in the previous example, the test text data is the Chinese sentence rendered in English as "When will the cloud manufacturing town be built?", and the recognition text data obtained by the current speech recognition model from the corresponding test audio data differs from it in four places in the original Chinese. The character meaning "intelligent" replaces the character meaning "manufacture" in the test text data, which is a substitution error, so the number of replaced characters is 1; the "!" in the recognition text data replaces the "?" in the test text data, which is also a substitution error, so the number of replaced punctuation marks is 1. Compared with the test text data, the character meaning "small" is missing from the recognition text data, which is a removal error, so the number of removed characters is 1. An extra modal particle is inserted into the recognition text data, which is an insertion error, so the number of inserted characters is 1. The total number of characters in the test text data is 12.
The second word accuracy of the speech recognition model is
WCR_2 = 1 - (1 + 1 + 1 + 1) / 12 ≈ 66.7%
Here, 66.7% is the quantized value of the second word accuracy rate, i.e. the evaluation data.
In some embodiments, in order to reflect the recognition accuracy of the speech recognition model more fully, the evaluation indexes may further include a first sentence accuracy rate. Specifically, after the test audio data is input into the different speech recognition models, for any one speech recognition model it can be determined whether at least one first target sentence exists in the test text data, where a first target sentence is a sentence containing any replaced-character, removed-character or inserted-character error, and any sentence can be represented as the set of characters between two adjacent punctuation marks. If at least one first target sentence exists in the test text data, the number of first target sentences can be determined. Further, the total number of sentences in the test text data can be determined by treating the characters between every two adjacent punctuation marks as one sentence; the first sentence accuracy rate is then determined according to the number of first target sentences and the total number of sentences, and the quantized value corresponding to the first sentence accuracy rate is used as the evaluation data. That is, the first sentence accuracy rate does not take punctuation recognition errors into account.
The first sentence error rate can be expressed as

SER = SE / N'

where SE represents the number of first target sentences and N' represents the total number of sentences; the first sentence accuracy rate is then 1 - SER.
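A minimal sketch of this sentence-level computation: sentences are the character runs between adjacent punctuation marks, and a reference sentence counts as wrong if the hypothesis does not reproduce it exactly. Pairing sentences by position is a simplifying assumption (inserted or removed punctuation would shift the alignment), as are the names and the punctuation set.

```python
import re

# Punctuation marks that delimit sentences (an illustrative set covering
# common Chinese and Western marks).
PUNCTUATION = r"[，。！？、；：,.!?;:]"

def split_sentences(text):
    """A sentence is the run of characters between adjacent punctuation marks."""
    return [s for s in re.split(PUNCTUATION, text) if s]

def first_sentence_accuracy(reference, hypothesis):
    """1 - SE / N': the share of reference sentences reproduced exactly,
    ignoring punctuation recognition errors."""
    ref = split_sentences(reference)
    hyp = split_sentences(hypothesis)
    wrong = sum(1 for i, s in enumerate(ref) if i >= len(hyp) or hyp[i] != s)
    return 1 - wrong / len(ref)
```

For example, first_sentence_accuracy("hello. world.", "hello. w0rld.") gives 0.5, since one of the two sentences contains a character error; replacing only a punctuation mark leaves this rate unchanged, whereas the second sentence accuracy rate below would count such a sentence as a target sentence.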
In some embodiments, since punctuation marks may exist in the recognition result in addition to characters, the evaluation indexes may further include a second sentence accuracy rate in order to reflect the recognition accuracy of the speech recognition model more fully. Specifically, after the test audio data is input into the different speech recognition models, for any one speech recognition model it can be determined whether at least one second target sentence exists in the test text data, where a second target sentence is a sentence containing any replaced-character, removed-character, inserted-character, replaced-punctuation, removed-punctuation or inserted-punctuation error, and any sentence can be represented as the set of characters between two adjacent punctuation marks. If at least one second target sentence exists in the test text data, the number of second target sentences can be determined. Further, the total number of sentences in the test text data can be determined by treating the characters between every two adjacent punctuation marks as one sentence; the second sentence accuracy rate is then determined according to the number of second target sentences and the total number of sentences, and the quantized value corresponding to the second sentence accuracy rate is used as the evaluation data. That is, the second sentence accuracy rate does take punctuation recognition errors into account.
In some embodiments, when there is a single evaluation index, after the evaluation data of each speech recognition model is obtained, the speech recognition models can be sorted from high to low according to the evaluation data, and the rank corresponding to each speech recognition model is determined. For example, if the evaluation index is the first word accuracy rate, and the evaluation data of speech recognition model A is 100%, that of speech recognition model B is 75% and that of speech recognition model C is 80%, then the ranking is A, C, B. Further, scores can be assigned in descending order according to the rank corresponding to each speech recognition model, thereby determining the total score corresponding to each speech recognition model; for example, the score of speech recognition model A is 3, the score of speech recognition model C is 2 and the score of speech recognition model B is 1. Following the highest-score-first selection principle, speech recognition model A can be selected as the target speech recognition model in the current application scenario.
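The rank-then-score step for a single evaluation index can be sketched as follows, reproducing the A/C/B example above; the function name is an illustrative assumption.

```python
def score_by_rank(evaluations):
    """Rank models by their evaluation data (higher is better) and assign
    scores n, n-1, ..., 1 from best to worst, where n is the number of models."""
    ordered = sorted(evaluations, key=evaluations.get, reverse=True)
    n = len(ordered)
    return {model: n - rank for rank, model in enumerate(ordered)}

scores = score_by_rank({"A": 1.00, "B": 0.75, "C": 0.80})
# A is ranked first (score 3), then C (score 2), then B (score 1).
best = max(scores, key=scores.get)
```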
In some embodiments, there are at least two evaluation indexes, i.e. the evaluation indexes include at least any two of the following: the first word accuracy rate, the second word accuracy rate, the first sentence accuracy rate and the second sentence accuracy rate. After the evaluation data of each speech recognition model is obtained, the speech recognition models can be sorted from high to low under one evaluation index according to the evaluation data, and the first rank corresponding to each speech recognition model is determined. For example, if the evaluation index is the first word accuracy rate, and the evaluation data of speech recognition model A is 100%, that of speech recognition model B is 75% and that of speech recognition model C is 80%, then the ranking is A, C, B. Further, first scores can be assigned in descending order according to the first rank corresponding to each speech recognition model; for example, the first score of speech recognition model A is 3, the first score of speech recognition model C is 2 and the first score of speech recognition model B is 1. Still further, the speech recognition models can be sorted from high to low under another evaluation index according to the evaluation data, and the second rank corresponding to each speech recognition model is determined. For example, if the evaluation index is the second sentence accuracy rate, and the evaluation data of speech recognition model A is 85%, that of speech recognition model B is 90% and that of speech recognition model C is 80%, then the ranking is B, A, C.
Further, second scores may be assigned in descending order of the second rank; for example, the second score of speech recognition model B is 3 points, that of model A is 2 points, and that of model C is 1 point. Finally, the total score corresponding to each speech recognition model is determined from the sum of its first score and second score: the total score of speech recognition model A is 5 points, that of model B is 4 points, and that of model C is 3 points. According to the principle of selecting the highest score, speech recognition model A can be selected as the target speech recognition model in the current application scene.
Specifically, the score corresponding to the speech recognition model under each evaluation index can be expressed as
M_i = x + 1 - Rank(x)
where x denotes the number of vendors participating in the comparison (i.e., there are speech recognition models from x vendors), and Rank(x) denotes the rank of a given vendor's speech recognition model under the evaluation index.
The overall score of the speech recognition model may be expressed as
M = Σ_i m_i × M_i
where i denotes the index of the evaluation index, and m_i denotes the weight value corresponding to each evaluation index.
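The two-index scoring walk-through above can be sketched in code. The following Python is illustrative only; the model names, evaluation values, and weight values m_i are assumptions, not taken from the patent:

```python
# Rank the models on each evaluation index, convert ranks to scores
# M_i = x + 1 - Rank(x) with x the number of models, then combine the
# per-index scores with weights m_i into the overall score M.
def scores_for_index(evaluations):
    """evaluations: model name -> evaluation value (higher is better)."""
    ranked = sorted(evaluations, key=evaluations.get, reverse=True)
    x = len(ranked)
    return {model: x + 1 - (rank + 1) for rank, model in enumerate(ranked)}

first = scores_for_index({"A": 1.00, "B": 0.75, "C": 0.80})   # first word accuracy
second = scores_for_index({"A": 0.85, "B": 0.90, "C": 0.80})  # second sentence accuracy
weights = {"first": 0.6, "second": 0.4}                        # assumed weights m_i
overall = {m: weights["first"] * first[m] + weights["second"] * second[m]
           for m in first}
target = max(overall, key=overall.get)  # the model with the highest overall score
```

With the values above, model A is ranked first on the first index (3 points) and second on the second index (2 points), so it ends up as the target model, consistent with the worked example in the text.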
In some embodiments, the speech recognition models may be tested in different dimensions, so the overall score of the speech recognition model may also be refined by dimension. Because the pre-recorded initial audio data may cover a plurality of different dimensions, the current dimension of the test audio data may be determined according to the initial audio data, where the current dimension may include: accent, volume, speech speed, dialect type, single-/multi-person dialog scenario, and/or language type. Further, the first score corresponding to each speech recognition model may be weighted according to a preset weight value corresponding to the current dimension to determine the first weighted score of the speech recognition model in each current dimension. Similarly, the second score corresponding to each speech recognition model may be weighted according to the preset weight value corresponding to the current dimension to determine the second weighted score of the speech recognition model in each current dimension. The total score corresponding to each speech recognition model is then determined from the sum of all its first weighted scores and all its second weighted scores. Referring to Table 1, different numbers of test text data may be set in the different dimensions.
Table 1 test text data in different dimensions
In some embodiments, experiments are performed in the dimension of different volumes: the first word accuracy, the second word accuracy, the first sentence accuracy, and the second sentence accuracy are selected as the evaluation indexes, and the speech recognition models of four different vendors are tested. The test results are shown in Table 2.
TABLE 2 dimensional test results for different volumes
It can be seen that the speech recognition model of vendor D depends heavily on volume, and its recognition performance at low volume is markedly worse. After weighting, taking into account the weight values of the different dimensions, the scores of each vendor under the different evaluation indexes are shown in Table 3.
TABLE 3 Final score
In some embodiments, it can be found from the above experimental results that a speech recognition model may perform well in most scenes yet very poorly in individual scenes. In order to make such a significant defect clearly visible, a defect level is introduced; this parameter amplifies the defect in that scene so that the comparative evaluation is fairer. For example, if the four speech recognition models score [4, 2, 3, 1] in a certain speech speed dimension, the model scoring 1 point is ranked last, so the corresponding defect level is applied to it, and the final scores of the four models in the speech speed dimension become [4, 2, 3, -2].
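A minimal sketch of the defect-level adjustment described above, assuming (as the patent does not specify) that the penalty is a preset constant score applied to the lowest-ranked model in a dimension:

```python
# The model ranked last in a dimension has its score replaced by a preset
# penalty so that a severe weakness in one scenario is amplified.
def apply_defect_level(scores, defect_score=-2):
    """scores: per-model scores in one dimension; the minimum is penalized."""
    worst = min(range(len(scores)), key=lambda i: scores[i])
    adjusted = list(scores)
    adjusted[worst] = defect_score
    return adjusted

adjusted = apply_defect_level([4, 2, 3, 1])  # [4, 2, 3, -2], as in the example
```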
In some embodiments, inconsistent character forms affect the calculation of the accuracy even though the recognition result is not actually wrong, for example, when "WiFi" is recognized as "WIFI" or "6" is recognized as "six". Therefore, a normalization operation may be performed on the initial audio data so that Chinese and English text, Arabic numerals, punctuation marks, special characters, and the like are normalized during the statistics.
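The character normalization described above might look like the following sketch; the concrete rules (case folding, digit-word mapping, punctuation stripping) are assumptions for illustration, and punctuation would be preserved when computing the punctuation-sensitive indexes:

```python
# Unify case, map digit words to digits, and strip punctuation/special
# characters so that differences such as "WiFi" vs "WIFI" or "6" vs "six"
# are not counted as recognition errors.
import re

DIGIT_WORDS = {"six": "6"}  # assumed mapping; extend per language as needed

def normalize(text):
    text = text.lower()
    for word, digit in DIGIT_WORDS.items():
        text = text.replace(word, digit)  # naive: does not check word boundaries
    # drop everything except word characters (Python's \w covers CJK too)
    return re.sub(r"[^\w]+", "", text)
```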
As can be seen from the above, according to the speech recognition model selection method and apparatus, the electronic device, and the storage medium provided by the present application, an evaluation index can be constructed according to a decision tree model; a preprocessing operation can be performed on pre-recorded initial audio data to obtain test audio data, and test text data corresponding to the test audio data can be determined according to the initial text data corresponding to the initial audio data. Further, for each pre-acquired speech recognition model, the test audio data can be input into the speech recognition model to obtain recognition text data, and the evaluation data of the speech recognition model on each evaluation index can be determined according to the test text data and the recognition text data. Still further, the speech recognition models can be ranked under each evaluation index according to their evaluation data, and the total score corresponding to each speech recognition model can be determined. In this way, multiple speech recognition models can be scored for different application scenes, and the speech recognition model best suited to the current application scene can be selected, which improves the adaptability of the speech recognition model to the current application scene and the accuracy of speech recognition.
It should be noted that the method of the embodiment of the present application may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and is completed by the mutual cooperation of a plurality of devices. In this distributed scenario, one device of the multiple devices may only perform one or more steps of the method of the embodiment of the present application, and the multiple devices interact with each other to complete the method.
It should be noted that the above describes some embodiments of the present application. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Fig. 2 shows an exemplary structural diagram of a speech recognition model selection apparatus provided in an embodiment of the present application.
Based on the same inventive concept, corresponding to the method of any embodiment, the application also provides a voice recognition model selection device.
Referring to fig. 2, the speech recognition model selection apparatus includes: a construction module, a determination module, a test module, an evaluation module, and a selection module; wherein:
a construction module configured to construct an evaluation index according to the decision tree model;
the device comprises a determining module, a pre-processing module and a processing module, wherein the determining module is configured to execute pre-processing operation on pre-recorded initial audio data to determine test audio data, and determine test text data corresponding to the test audio data according to the initial text data corresponding to the initial audio data;
a testing module configured to, for each speech recognition model acquired in advance,
inputting the test audio data into the speech recognition model to determine recognized text data,
determining evaluation data of the speech recognition model in evaluation indexes according to the test text data and the recognition text data;
the evaluation module is configured to rank each voice recognition model in an evaluation index according to a plurality of evaluation data so as to determine a total score corresponding to each voice recognition model;
a selection module configured to select the speech recognition model with the highest overall score as a target speech recognition model.
In one possible implementation, the test audio data includes: variable speed test audio data;
the determination module is further configured to:
recording to obtain initial audio data with normal voice speed;
adjusting the initial audio data according to a variable speed parameter to determine the variable speed test audio data; wherein the speech speed of the variable speed test audio data is represented as
V_test = V_initial × A
where V_initial denotes the normal speech speed, and A denotes the variable speed parameter.
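A naive illustration of the speed change V_test = V_initial × A; this is an assumption about the implementation, and a production pipeline would typically use a time-stretching tool, since plain resampling as shown here also shifts pitch:

```python
# Step through the sample array at rate A: A > 1 speeds up (fewer samples),
# A < 1 slows down (more samples, repeated by index rounding).
import numpy as np

def change_speed(samples, a):
    idx = np.arange(0, len(samples), a)
    return samples[idx.astype(int).clip(max=len(samples) - 1)]

audio = np.sin(np.linspace(0, 100, 16000))  # 1 s of dummy audio at 16 kHz
fast = change_speed(audio, 2.0)             # ~0.5 s: twice the speech speed
```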
In one possible implementation, the test audio data includes: variable volume test audio data;
the determination module is further configured to:
determining the current volume of the initial audio data obtained by recording;
adjusting the initial audio data according to a volume adjustment parameter to determine the variable volume test audio data; wherein the volume of the variable volume test audio data is represented as
bel_test = bel_initial + db
where bel_initial represents the current volume of the initial audio data, and db represents the volume adjustment parameter.
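The volume change bel_test = bel_initial + db can be illustrated by applying a decibel gain to the sample amplitudes; the clipping range below assumes floating-point audio in [-1, 1], which is an assumption, not something the patent specifies:

```python
# A gain of `db` decibels corresponds to multiplying the amplitudes by
# 10 ** (db / 20); positive db raises volume, negative db lowers it.
import numpy as np

def change_volume(samples, db):
    gain = 10.0 ** (db / 20.0)
    return np.clip(samples * gain, -1.0, 1.0)  # keep within float audio range

audio = 0.1 * np.ones(8)
louder = change_volume(audio, 6.0)   # +6 dB, roughly double the amplitude
```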
In one possible implementation, the test audio data includes: mixed test audio data;
the determination module is further configured to:
determining noise audio data and the initial audio data obtained by recording;
mixing and superimposing the noise audio data and the initial audio data to determine the mixed test audio data.
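One possible sketch of mixing noise into the initial audio; the signal-to-noise-ratio (SNR) control is an assumption, as the patent does not specify how the superposition is scaled:

```python
# Overlay recorded noise on the clean audio at a target SNR to produce
# the mixed test audio data.
import numpy as np

def mix_with_noise(clean, noise, snr_db=10.0):
    noise = np.resize(noise, len(clean))           # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12          # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

mixed = mix_with_noise(np.ones(8), np.ones(8), snr_db=0.0)  # equal-power mix
```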
In one possible implementation, the initial audio data includes one or more of the following audio data: audio data of different accents, audio data of a multi-person conversation scene, or audio data of different languages.
In one possible implementation manner, the evaluation index includes: a first word accuracy rate;
the test module is further configured to:
determining whether replaced characters exist in the test text data according to the character positions corresponding to the test text data and the recognition text data;
determining a number of replaced words in response to the presence of replaced words in the test text data;
determining whether the removed characters exist in the test text data according to the corresponding character positions of the test text data and the recognition text data;
determining the number of the removed characters in response to the removed characters existing in the test text data;
determining whether inserted characters exist in the test text data according to the character positions corresponding to the test text data and the recognition text data;
in response to the presence of inserted words in the test text data, determining a number of words inserted;
determining the total number of characters in the test text data, determining a first character accuracy according to the total number of characters, the number of replaced characters, the number of removed characters and the number of inserted characters, and taking a quantization value corresponding to the first character accuracy as the evaluation data.
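The substitution/deletion/insertion counting above is in essence an edit-distance alignment between the test text (reference) and the recognition text (hypothesis). A sketch follows; the accuracy formula is stated as an assumption, since the patent does not spell out the exact expression:

```python
# Count edit operations (substituted, deleted, inserted characters combined)
# via the Wagner-Fischer dynamic program, then compute a word accuracy as
# 1 - (S + D + I) / N with N the reference character count (assumed formula).
def edit_ops(ref, hyp):
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion (removed character)
                          d[i][j - 1] + 1,         # insertion (inserted character)
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

def word_accuracy(ref, hyp):
    return 1.0 - edit_ops(ref, hyp) / max(len(ref), 1)
```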
In a possible implementation manner, the evaluation index includes: a second word accuracy;
the test module is further configured to:
determining whether replaced punctuation marks exist in the test text data according to the punctuation mark positions corresponding to the test text data and the recognition text data;
determining the number of punctuation marks to be replaced in response to the punctuation marks to be replaced existing in the test text data;
determining whether removed punctuation marks exist in the test text data according to the punctuation mark positions corresponding to the test text data and the recognition text data;
in response to the presence of removed punctuation marks in the test text data, determining the number of removed punctuation marks;
determining whether inserted punctuation marks exist in the test text data according to the punctuation mark positions corresponding to the test text data and the recognition text data;
determining the number of punctuation marks inserted in response to the presence of punctuation marks inserted in the test text data;
and determining a second word accuracy rate according to the total number of characters, the number of replaced characters, the number of removed characters, the number of inserted characters, the number of replaced punctuations, the number of removed punctuations and the number of inserted punctuations, and taking a quantization value corresponding to the second word accuracy rate as the evaluation data.
In one possible implementation manner, the evaluation index includes: a first sentence accuracy rate;
the test module is further configured to:
determining whether at least one first target sentence exists in the test text data; wherein the first target sentence is a set of characters between two adjacent punctuation marks that contains a replaced character, a removed character, or an inserted character;
in response to at least one first target sentence existing in the test text data, determining the number of the first target sentences;
determining the total number of sentences in the test text data according to a character set between two adjacent punctuations in the test text data;
and determining a first sentence accuracy rate according to the first target sentence quantity and the sentence total number, and taking a quantization value corresponding to the first sentence accuracy rate as the evaluation data.
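A sketch of the first sentence accuracy: sentences are taken as character runs between adjacent punctuation marks, and a sentence counts as correct only when its characters match exactly. The punctuation set used for splitting is an assumption; the second sentence accuracy would additionally compare the punctuation marks themselves:

```python
# Split both texts into sentences at punctuation, then count a sentence as
# correct only if it matches the reference sentence exactly.
import re

def sentences(text):
    return [s for s in re.split(r"[，。,.!?！？]", text) if s]

def sentence_accuracy(ref, hyp):
    r, h = sentences(ref), sentences(hyp)
    correct = sum(1 for a, b in zip(r, h) if a == b)
    return correct / max(len(r), 1)  # denominator: total sentences in reference
```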
In one possible implementation manner, the evaluation index includes: second sentence accuracy;
the test module is further configured to:
determining whether at least one second target sentence exists in the test text data; wherein the second target sentence is a set of characters and punctuation marks between two adjacent punctuation marks that contains a replaced character, a removed character, an inserted character, a replaced punctuation mark, a removed punctuation mark, or an inserted punctuation mark;
determining the number of second target sentences in response to the existence of at least one second target sentence in the test text data;
and determining a second sentence accuracy rate according to the second target sentence quantity and the sentence total number, and taking a quantization value corresponding to the second sentence accuracy rate as the evaluation data.
In one possible implementation, the evaluation module is further configured to:
sequencing all the voice recognition models in the evaluation indexes according to the sequence of the evaluation data from high to low, and determining the corresponding rank of each voice recognition model;
and assigning scores to each voice recognition model in a descending order according to the ranking corresponding to each voice recognition model from high to low in sequence to determine the total score corresponding to each voice recognition model.
In a possible implementation manner, the evaluation index at least includes any two of the following indexes: a first word accuracy rate, a second word accuracy rate, a first sentence accuracy rate, and a second sentence accuracy rate;
the evaluation module is further configured to:
sequencing all the voice recognition models in one evaluation index according to the sequence of the evaluation data from high to low, and determining a first ranking corresponding to each voice recognition model;
sequencing all the voice recognition models in the other evaluation index according to the sequence of the evaluation data from high to low, and determining a second ranking corresponding to each voice recognition model;
assigning scores to each voice recognition model in a descending order according to the first ranking corresponding to each voice recognition model from high to low in sequence to determine a first score corresponding to each voice recognition model;
assigning scores to each voice recognition model in a descending order according to the second ranking corresponding to each voice recognition model from high to low in sequence to determine a second score corresponding to each voice recognition model;
and respectively determining the total score corresponding to each voice recognition model according to the sum of the first score and the second score corresponding to each voice recognition model.
In one possible implementation, the evaluation module is further configured to:
determining the current dimensionality of the test audio data according to the initial audio data; wherein the current dimension comprises: at least one of accent, volume, speed of speech, dialect type, single/multi-person dialog scenario, and/or language type;
performing weighting according to a preset weight value corresponding to the current dimension and a first score corresponding to each voice recognition model to determine a first weighted score corresponding to the voice recognition model under each current dimension;
performing weighting according to a preset weight value corresponding to the current dimension and a second score corresponding to each voice recognition model to determine a second weighted score corresponding to the voice recognition model under each current dimension;
and respectively determining a total score corresponding to each voice recognition model according to the sum of all the first weighted scores and all the second weighted scores corresponding to each voice recognition model.
In one possible implementation, the evaluation module is further configured to:
determining the current dimensionality of the test audio data according to the initial audio data; wherein the current dimension comprises: at least one of accent, volume, speed of speech, dialect type, single/multi-person dialog scenario, and/or language type;
in response to the fact that the ranking of the voice recognition model under the current dimension is the lowest, based on a preset defect grade corresponding to the current dimension, giving the defect grade to a first score corresponding to the voice recognition model under the current dimension to determine a first defect score; wherein the first defect score is less than the first score.
In one possible implementation, the determining module is further configured to:
a normalization operation is performed on the initial audio data.
For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. Of course, the functionality of the various modules may be implemented in the same one or more software and/or hardware implementations as the present application.
The apparatus of the foregoing embodiment is used to implement the corresponding speech recognition model selection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 3 shows an exemplary structural schematic diagram of an electronic device provided in an embodiment of the present application.
Based on the same inventive concept, corresponding to the method of any embodiment described above, the present application further provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the program, the method for selecting a speech recognition model according to any embodiment described above is implemented. Fig. 3 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 310, a memory 320, an input/output interface 330, a communication interface 340, and a bus 350. Wherein the processor 310, memory 320, input/output interface 330, and communication interface 340 are communicatively coupled to each other within the device via bus 350.
The processor 310 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present specification.
The Memory 320 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 320 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 320 and called to be executed by the processor 310.
The input/output interface 330 is used for connecting an input/output module to realize information input and output. The input/output module may be configured as a component within the device (not shown) or may be external to the device to provide corresponding functionality. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 340 is used for connecting a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, bluetooth and the like).
Bus 350 includes a path that transfers information between the various components of the device, such as processor 310, memory 320, input/output interface 330, and communication interface 340.
It should be noted that although the above-mentioned device only shows the processor 310, the memory 320, the input/output interface 330, the communication interface 340 and the bus 350, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only the components necessary to implement the embodiments of the present disclosure, and need not include all of the components shown in the figures.
The electronic device of the foregoing embodiment is used to implement the corresponding speech recognition model selection method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-described embodiment methods, the present application also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the speech recognition model selection method according to any of the above embodiments.
Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
The computer instructions stored in the storage medium of the foregoing embodiment are used to enable the computer to execute the speech recognition model selection method according to any of the foregoing embodiments, and have the beneficial effects of the corresponding method embodiment, which are not described herein again.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, is limited to these examples; within the context of the present application, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present application as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the application. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the application, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the application are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the application, it should be apparent to one skilled in the art that the embodiments of the application can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present application has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures, such as Dynamic RAM (DRAM), may use the discussed embodiments.
The present embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present application are intended to be included within the scope of the present application.

Claims (17)

1. A method for selecting a speech recognition model, comprising:
establishing an evaluation index according to the decision tree model;
performing preprocessing operation on pre-recorded initial audio data to determine test audio data, and determining test text data corresponding to the test audio data according to the initial text data corresponding to the initial audio data;
for each of the speech recognition models that are acquired in advance,
inputting the test audio data into the speech recognition model to determine recognized text data,
determining evaluation data of the speech recognition model in an evaluation index according to the test text data and the recognition text data;
ranking the evaluation indexes of each speech recognition model according to the plurality of evaluation data to determine a total score corresponding to each speech recognition model;
and selecting the speech recognition model with the highest total score as a target speech recognition model.
2. The method of claim 1, wherein the test audio data comprises: variable speed test audio data;
the pre-processing operation performed on the pre-recorded initial audio data to determine test audio data includes:
recording to obtain initial audio data with normal voice speed;
adjusting the initial audio data according to a variable speed parameter to determine the variable speed test audio data; wherein the speech speed of the variable speed test audio data is represented as
V_test = V_initial × A
where V_initial denotes the normal speech speed, and A denotes the variable speed parameter.
3. The method of claim 1, wherein the test audio data comprises: variable volume test audio data;
the pre-processing operation performed on the pre-recorded initial audio data to determine test audio data includes:
determining the current volume of the initial audio data obtained by recording;
adjusting the initial audio data according to a volume adjustment parameter to determine the variable volume test audio data; wherein the volume of the variable volume test audio data is represented as
bel_test = bel_initial + db
where bel_initial represents the current volume of the initial audio data, and db represents the volume adjustment parameter.
4. The method of claim 1, wherein the test audio data comprises: mixed test audio data;
the pre-processing operation performed on the pre-recorded initial audio data to determine test audio data includes:
determining noise audio data and the initial audio data obtained by recording;
mixing and superimposing the noise audio data and the initial audio data to determine the mixed test audio data.
5. The method of claim 1, wherein the initial audio data comprises one or more of the following audio data: audio data of different accents, audio data of a multi-person dialog scene, or audio data of different languages.
6. The method according to claim 1, wherein the evaluation index includes: a first word accuracy rate;
for each pre-acquired speech recognition model, determining evaluation data of the speech recognition model in an evaluation index according to the test text data and the recognition text data, including:
determining whether replaced characters exist in the test text data according to the character positions corresponding to the test text data and the recognition text data;
determining a number of replaced words in response to the presence of replaced words in the test text data;
determining whether the removed characters exist in the test text data according to the corresponding character positions of the test text data and the recognition text data;
determining the number of the removed characters in response to the removed characters existing in the test text data;
determining whether the inserted characters exist in the test text data or not according to the character positions corresponding to the test text data and the identification text data;
in response to the presence of inserted words in the test text data, determining a number of words inserted;
determining the total number of characters in the test text data; determining a first word accuracy according to the total number of characters, the number of replaced characters, the number of removed characters and the number of inserted characters; and taking a quantization value corresponding to the first word accuracy as the evaluation data.
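The substitution/deletion/insertion counting above is the classic Levenshtein alignment. A self-contained sketch (the dynamic program below is a standard formulation, not code from the patent):

```python
def align_counts(ref, hyp):
    """Levenshtein-align reference (test text) and hypothesis (recognized
    text) characters; return (substituted, deleted, inserted) counts."""
    m, n = len(ref), len(hyp)
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    ops = [[(0, 0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0], ops[i][0] = i, (0, i, 0)  # all deletions
    for j in range(1, n + 1):
        cost[0][j], ops[0][j] = j, (0, 0, j)  # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            s0, d0, i0 = ops[i - 1][j - 1]
            s1, d1, i1 = ops[i - 1][j]
            s2, d2, i2 = ops[i][j - 1]
            choices = [
                (cost[i - 1][j - 1] + sub, (s0 + sub, d0, i0)),  # match/sub
                (cost[i - 1][j] + 1, (s1, d1 + 1, i1)),          # deletion
                (cost[i][j - 1] + 1, (s2, d2, i2 + 1)),          # insertion
            ]
            cost[i][j], ops[i][j] = min(choices, key=lambda c: c[0])
    return ops[m][n]

def first_word_accuracy(ref, hyp):
    """1 - (S + D + I) / N, with N the total characters in the test text."""
    s, d, i = align_counts(ref, hyp)
    return 1 - (s + d + i) / len(ref)
```

`first_word_accuracy("abcd", "abed")` is then 0.75: one substituted character out of four.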
7. The method of claim 6, wherein the evaluation index comprises: a second word accuracy;
for each pre-acquired voice recognition model, determining evaluation data of the voice recognition model in an evaluation index according to the test text data and the recognition text data, including:
determining whether replaced punctuation marks exist in the test text data according to the punctuation mark positions corresponding to the test text data and the recognition text data;
in response to the presence of replaced punctuation marks in the test text data, determining the number of replaced punctuation marks;
determining whether removed punctuation marks exist in the test text data according to the punctuation mark positions corresponding to the test text data and the recognition text data;
in response to the presence of removed punctuation marks in the test text data, determining the number of removed punctuation marks;
determining whether inserted punctuation marks exist in the test text data according to the punctuation mark positions corresponding to the test text data and the recognition text data;
in response to the presence of inserted punctuation marks in the test text data, determining the number of inserted punctuation marks;
and determining a second word accuracy rate according to the total number of characters, the number of replaced characters, the number of removed characters, the number of inserted characters, the number of replaced punctuations, the number of removed punctuations and the number of inserted punctuations, and taking a quantization value corresponding to the second word accuracy rate as the evaluation data.
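A compact sketch of the combined character-and-punctuation measure. It approximates the positional comparison with `difflib.SequenceMatcher` opcodes rather than a full alignment, and the punctuation set and normalization by total reference characters are assumptions:

```python
import difflib
import string

PUNCT = set("，。！？；：、") | set(string.punctuation)

def error_counts(ref, hyp):
    """Approximate replaced/removed/inserted counts from difflib opcodes."""
    s = d = i = 0
    for tag, r1, r2, h1, h2 in difflib.SequenceMatcher(
            None, ref, hyp).get_opcodes():
        if tag == "replace":
            s += max(r2 - r1, h2 - h1)
        elif tag == "delete":
            d += r2 - r1
        elif tag == "insert":
            i += h2 - h1
    return s, d, i

def second_word_accuracy(ref, hyp):
    """Errors over characters and punctuation together, normalized by the
    total character count of the reference test text."""
    chars = lambda t: "".join(c for c in t if c not in PUNCT)
    puncts = lambda t: "".join(c for c in t if c in PUNCT)
    cs, cd, ci = error_counts(chars(ref), chars(hyp))
    ps, pd, pi = error_counts(puncts(ref), puncts(hyp))
    return 1 - (cs + cd + ci + ps + pd + pi) / len(ref)
```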
8. The method of claim 6, wherein the evaluation index comprises: a first sentence accuracy rate;
for each pre-acquired voice recognition model, determining evaluation data of the voice recognition model in an evaluation index according to the test text data and the recognition text data, including:
determining whether at least one first target sentence exists in the test text data; wherein the first target sentence is a set of characters between two adjacent punctuation marks that contains a replaced character, a removed character, or an inserted character;
in response to at least one first target sentence existing in the test text data, determining the number of the first target sentences;
determining the total number of sentences in the test text data according to a character set between two adjacent punctuations in the test text data;
and determining a first sentence accuracy rate according to the first target sentence quantity and the sentence total number, and taking a quantization value corresponding to the first sentence accuracy rate as the evaluation data.
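A sketch of the sentence-level measure: split both texts on punctuation, count the sentences whose characters differ, and normalize by the sentence total. The punctuation class and the handling of unpaired sentences are assumptions:

```python
import re

_SENT_SPLIT = re.compile(r"[，。！？；：,.!?;:]")

def first_sentence_accuracy(ref, hyp):
    """A 'first target sentence' is a run of characters between adjacent
    punctuation marks that differs from the recognized text."""
    split = lambda t: [s for s in _SENT_SPLIT.split(t) if s]
    ref_s, hyp_s = split(ref), split(hyp)
    wrong = sum(1 for r, h in zip(ref_s, hyp_s) if r != h)
    wrong += abs(len(ref_s) - len(hyp_s))  # unpaired sentences count as wrong
    return 1 - wrong / len(ref_s)
```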
9. The method according to claim 7, wherein the evaluation index includes: second sentence accuracy;
for each pre-acquired speech recognition model, determining evaluation data of the speech recognition model in an evaluation index according to the test text data and the recognition text data, including:
determining whether at least one second target sentence exists in the test text data; wherein the second target sentence is a set of characters and punctuation marks between two adjacent punctuation marks that contains a replaced character, a removed character, an inserted character, a replaced punctuation mark, a removed punctuation mark, or an inserted punctuation mark;
in response to the existence of at least one second target sentence in the test text data, determining the number of the second target sentences;
and determining a second sentence accuracy rate according to the second target sentence quantity and the sentence total number, and taking a quantization value corresponding to the second sentence accuracy rate as the evaluation data.
10. The method of claim 1, wherein ranking each speech recognition model in an evaluation index according to the plurality of evaluation data to determine a total score for each speech recognition model comprises:
sorting all the voice recognition models in the evaluation index in descending order of the evaluation data, and determining the rank corresponding to each voice recognition model;
and assigning descending scores to the voice recognition models in order of rank, from highest to lowest, to determine the total score corresponding to each voice recognition model.
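The ranking-then-scoring step of claim 10 can be sketched as follows; the concrete score values (n down to 1) are an assumption, since the claim fixes only the descending order:

```python
def rank_scores(evaluations):
    """Sort models by evaluation data, high to low, then assign
    descending scores n, n-1, ..., 1 by rank."""
    ranked = sorted(evaluations, key=evaluations.get, reverse=True)
    n = len(ranked)
    return {model: n - rank for rank, model in enumerate(ranked)}

print(rank_scores({"model_a": 0.91, "model_b": 0.88, "model_c": 0.95}))
# {'model_c': 3, 'model_a': 2, 'model_b': 1}
```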
11. The method according to claim 1, wherein the evaluation index includes at least any two of the following: a first word accuracy rate, a second word accuracy rate, a first sentence accuracy rate, and a second sentence accuracy rate;
the ranking each speech recognition model in an evaluation index according to the plurality of evaluation data to determine a total score corresponding to each speech recognition model comprises:
sorting all the voice recognition models in one evaluation index in descending order of the evaluation data, and determining a first ranking corresponding to each voice recognition model;
sorting all the voice recognition models in the other evaluation index in descending order of the evaluation data, and determining a second ranking corresponding to each voice recognition model;
assigning descending scores to the voice recognition models in order of the first ranking, from highest to lowest, to determine a first score corresponding to each voice recognition model;
assigning descending scores to the voice recognition models in order of the second ranking, from highest to lowest, to determine a second score corresponding to each voice recognition model;
and respectively determining the total score corresponding to each voice recognition model according to the sum of the first score and the second score corresponding to each voice recognition model.
12. The method of claim 11, wherein the determining the total score for each of the speech recognition models according to the sum of the first score and the second score for each of the speech recognition models comprises:
determining the current dimension of the test audio data according to the initial audio data; wherein the current dimension comprises: at least one of accent, volume, speech speed, dialect type, single/multi-person dialog scenario, and language type;
performing weighting according to a preset weight value corresponding to the current dimension and a first score corresponding to each voice recognition model to determine a first weighted score corresponding to the voice recognition model under each current dimension;
performing weighting according to a preset weight value corresponding to the current dimension and a second score corresponding to each voice recognition model to determine a second weighted score corresponding to the voice recognition model under each current dimension;
and respectively determining a total score corresponding to each voice recognition model according to the sum of all the first weighted scores and all the second weighted scores corresponding to each voice recognition model.
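The weighted summation of claim 12 can be sketched as below, with hypothetical dimension names and weights; `scores_by_dim[dim][model]` stands for either a first or a second score under that dimension:

```python
def total_scores(scores_by_dim, weights):
    """Weight each model's per-dimension score by the preset weight of
    that dimension, then sum into one total per model."""
    totals = {}
    for dim, scores in scores_by_dim.items():
        for model, score in scores.items():
            totals[model] = totals.get(model, 0.0) + weights[dim] * score
    return totals

print(total_scores(
    {"volume": {"A": 2, "B": 1}, "accent": {"A": 1, "B": 2}},
    {"volume": 0.6, "accent": 0.4},
))
```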
13. The method of claim 11, wherein the assigning descending scores to each of the speech recognition models according to the first ranking to determine the first score corresponding to each of the speech recognition models comprises:
determining the current dimension of the test audio data according to the initial audio data; wherein the current dimension comprises: at least one of accent, volume, speech speed, dialect type, single/multi-person dialog scenario, and language type;
in response to the ranking of the voice recognition model under the current dimension being the lowest, applying a preset defect grade corresponding to the current dimension to the first score of the voice recognition model under the current dimension to determine a first defect score; wherein the first defect score is less than the first score.
14. The method of claim 1, further comprising, before the performing a preprocessing operation on pre-recorded initial audio data to determine test audio data:
performing a normalization operation on the initial audio data.
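The claim does not fix a normalization method; a common choice is peak normalization, which rescales every recording so its loudest sample sits at the same level before the preprocessing variants are generated. A sketch under that assumption:

```python
def peak_normalize(samples, peak=1.0):
    """Rescale float samples so the maximum absolute sample equals `peak`,
    putting all initial recordings on a comparable level."""
    loudest = max(abs(s) for s in samples)
    if loudest == 0:
        return list(samples)  # silence: nothing to scale
    return [s * peak / loudest for s in samples]
```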
15. A speech recognition model selection apparatus, comprising:
a construction module configured to construct an evaluation index according to the decision tree model;
the device comprises a determining module, a pre-processing module and a processing module, wherein the determining module is configured to execute pre-processing operation on pre-recorded initial audio data to determine test audio data, and determine test text data corresponding to the test audio data according to the initial text data corresponding to the initial audio data;
a testing module configured to, for each pre-acquired speech recognition model, input the test audio data into the speech recognition model to determine recognition text data, and determine evaluation data of the speech recognition model in an evaluation index according to the test text data and the recognition text data;
the evaluation module is configured to rank each voice recognition model in an evaluation index according to a plurality of evaluation data so as to determine a total score corresponding to each voice recognition model;
a selection module configured to select the speech recognition model with the highest overall score as a target speech recognition model.
16. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 14 when executing the program.
17. A non-transitory computer readable storage medium storing computer instructions for causing a computer to implement the method of any one of claims 1 to 14.
CN202211449826.7A 2022-11-18 2022-11-18 Voice recognition model selection method and device, electronic equipment and storage medium Pending CN115762477A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211449826.7A CN115762477A (en) 2022-11-18 2022-11-18 Voice recognition model selection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115762477A true CN115762477A (en) 2023-03-07

Family

ID=85373684

Country Status (1)

Country Link
CN (1) CN115762477A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination