CN117219058A - Method, system and medium for improving speech recognition accuracy - Google Patents


Info

Publication number
CN117219058A
Authority
CN
China
Prior art keywords
data, voice, recognition, characteristic data, sound
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311487594.9A
Other languages
Chinese (zh)
Other versions
CN117219058B (en)
Inventor
邓从健
陈茂强
张志青
邵德伟
汤冬儿
江晓锋
陈小丰
李礼红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Yunqu Information Technology Co ltd
Original Assignee
Guangzhou Yunqu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Yunqu Information Technology Co ltd filed Critical Guangzhou Yunqu Information Technology Co ltd
Priority to CN202311487594.9A priority Critical patent/CN117219058B/en
Publication of CN117219058A publication Critical patent/CN117219058A/en
Application granted granted Critical
Publication of CN117219058B publication Critical patent/CN117219058B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application provide a method, a system and a medium for improving the accuracy of voice recognition. The method comprises the following steps: processing the voice fragment characteristic data of a user to obtain a voice emotion induced factor index; processing the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient; obtaining a corresponding model by combining these with the voice fragment characteristic data; obtaining a plurality of voice key ideographic data and voice expression action data through recognition processing; processing these in combination with the voice emotion induced factor index to obtain semantic behavior prejudging response data; correcting that data to obtain semantic behavior recognition response correction data; and finally comparing the correction data with a semantic behavior recognition response threshold to judge the accuracy of voice behavior recognition. In this way, based on voice big data, the voice information is combined with personalized information and scene environment information for data processing and evaluation, the accuracy of the voice recognition result is judged, and the calibration judgment of user voice recognition accuracy is improved.

Description

Method, system and medium for improving speech recognition accuracy
Technical Field
The application relates to the technical field of intelligent voice, in particular to a method, a system and a medium for improving voice recognition accuracy.
Background
Voice recognition technology is now widely used across all fields of human-computer interaction. The core difficulties of voice recognition are recognizing individualized user voice expression habits and discriminating voice semantics under different environmental interferences. Differences in profession, identity, context and language system among users, together with the emotional coloring of their semantic expression, further increase the difficulty of achieving accurate and effective recognition. At present, however, there is a lack of effective means to compensate and correct recognition results according to user expression information in combination with personality, expression context and the voice acquisition environment, and thereby to verify the accuracy of the voice recognition response capability. Therefore, acquiring individualized expression information and semantic information of users, recognizing the interference factors of the voice environment and of user emotional expression, correcting the recognition and judgment response to user voice semantic behavior, and improving and checking the accuracy of the voice recognition response, is of practical application significance.
In view of the above problems, an effective technical solution is currently needed.
Disclosure of Invention
The embodiment of the application aims to provide a method, a system and a medium for improving the accuracy of voice recognition, which can process and evaluate data of voice information combined with scene environment based on voice big data, judge the accuracy of voice recognition results and improve the calibration judgment of the accuracy of voice recognition of users.
The embodiment of the application also provides a method for improving the accuracy of voice recognition, which comprises the following steps:
collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information and voice environment information of a voice acquisition environment;
extracting voice fragment characteristic data according to the voice fragment information, and processing according to the voice fragment characteristic data to obtain a voice tone characteristic factor, a user emotion correction factor and a voice emotion induction factor index;
extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information, and respectively processing according to the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient;
according to the data of the user attribute marking information, combining the voice fragment characteristic data, the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a corresponding preset type semantic picking recognition model and a semantic behavior recognition response threshold;
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and combining the voice emotion induced factor index to perform processing to obtain semantic behavior prejudging response data;
correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data;
and comparing the semantic behavior recognition response correction data with the semantic behavior recognition response threshold, and judging the accuracy of the user's voice behavior recognition according to the threshold comparison result.
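The seven steps above can be sketched end to end as follows. Since the patent discloses its formulas only as images and names no concrete model, every function body, weight and threshold below is an illustrative assumption rather than the patented computation:

```python
# Hypothetical sketch of the seven-step flow above. All functional forms and
# coefficient values are assumed placeholders, not the patented formulas.

def emotion_factor_index(tone_factor, emotion_corr, weights=(0.6, 0.4)):
    # Step 2 (assumed): blend the voice tone characteristic factor with the
    # user emotion correction factor.
    return weights[0] * tone_factor + weights[1] * emotion_corr

def environment_coefficients(field_features, noise_features):
    # Step 3 (assumed): average each feature group into a single coefficient.
    net_state = sum(field_features) / len(field_features)
    disturbance = sum(noise_features) / len(noise_features)
    return net_state, disturbance

def correct_response(prejudged, net_state, disturbance):
    # Step 6 (assumed): boost by environmental cleanliness, damp by noise.
    return prejudged * net_state * (1.0 - disturbance)

def judge_accuracy(corrected, threshold):
    # Step 7: threshold comparison on the corrected response data.
    return corrected >= threshold

net, dist = environment_coefficients([0.9, 0.8, 0.7, 0.85],
                                     [0.2, 0.1, 0.15, 0.05])
corrected = correct_response(0.9, net, dist)
print(judge_accuracy(corrected, threshold=0.6))
```

With these placeholder values the corrected response is about 0.64, above the assumed 0.6 threshold, so the recognition would be judged accurate.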
Optionally, in the method for improving accuracy of voice recognition according to the embodiment of the present application, the collecting voice clip information of a user in a preset time period, and obtaining user attribute mark information and voice environment information of a voice obtaining environment includes:
collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information;
acquiring voice environment information of the environment where the user voice is located;
and extracting user identity attribute characteristic data and user native language category marking data according to the user attribute marking information.
Optionally, in the method for improving speech recognition accuracy according to the embodiment of the present application, extracting speech segment feature data according to the speech segment information, and obtaining a speech tone feature factor, a user emotion correction factor, and a speech emotion induced disturbance factor index according to processing of the speech segment feature data includes:
extracting voice segment characteristic data according to the voice segment information, wherein the voice segment characteristic data comprises tone audio characteristic data, note pronunciation characteristic data, broadcasting definition characteristic data, language tone fluctuation characteristic data and modal fluctuation characteristic data;
processing according to the tone audio characteristic data, the note pronunciation characteristic data, the broadcasting definition characteristic data, the language tone fluctuation characteristic data and the modal fluctuation characteristic data through a preset voice emotion induced interference recognition model to respectively obtain a voice tone characteristic factor and a user emotion correction factor;
processing according to the voice tone characteristic factors and the user emotion correction factors to obtain voice emotion induced factor indexes;
the program formula of the voice emotion induced factor index is as follows:
wherein,index of voice emotion inducing factor, +.>For speech tone characteristic factor, < >>For the user emotion correction factor, < >>、/>、/>、/>、/>Respectively, timbre audio characteristic data, note pronunciation characteristic data, broadcasting definition characteristic data, language tone fluctuation characteristic data and moral fluctuation characteristic data, +.>For the preset category of native language recognition compensation factors, +.>、/>、/>、/>、/>、/>Is a preset characteristic coefficient.
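As a hedged illustration of how such an index could be computed: the published formula is not reproduced in the text, so the linear native-language-compensated form, the coefficient values and the function name below are all assumptions.

```python
# Hypothetical reconstruction of the voice emotion induced factor index.
# A weighted-sum form is assumed; all coefficients are placeholders.
def voice_emotion_induced_factor_index(
    tone_factor, emotion_corr,
    timbre, pronunciation, clarity, intonation, modal,
    native_lang_comp=1.0,                      # preset category compensation
    feature_coeffs=(0.2, 0.2, 0.2, 0.2, 0.2),  # placeholder preset coefficients
    blend=(0.5, 0.5),                          # placeholder factor weights
):
    # Weighted sum of the five voice-segment features, blended with the tone
    # characteristic factor and the emotion correction factor, then scaled by
    # the native-language recognition compensation factor.
    features = (timbre, pronunciation, clarity, intonation, modal)
    feature_term = sum(c * f for c, f in zip(feature_coeffs, features))
    return native_lang_comp * (
        blend[0] * tone_factor + blend[1] * emotion_corr + feature_term)
```

For example, `voice_emotion_induced_factor_index(0.8, 0.6, 0.7, 0.7, 0.9, 0.5, 0.4)` evaluates to about 1.34 under these placeholder weights.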
Optionally, in the method for improving accuracy of voice recognition according to the embodiment of the present application, extracting sound field environmental feature data and environmental sound noise disturbance feature data according to the voice environmental information, and processing the sound field environmental feature data and the environmental sound noise disturbance feature data respectively to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient, including:
extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information;
the sound field environment characteristic data comprise environment space index data, sound dispersion distribution index data, reverberation degree index data and sound coverage rate data, and the environment sound noise disturbance characteristic data comprise environment noisy degree index data, noise frequency color classification data, sound dispersion attenuation rate data and howling index data;
obtaining a ring condition net state coefficient according to the sound field environmental characteristic data processing, and obtaining a ring sound disturbance compensation coefficient according to the environmental sound noise disturbance characteristic data processing;
the calculation formula of the ring condition net state coefficient is as follows:
the calculation formula of the annular sound disturbance compensation coefficient is as follows:
wherein,is the net state coefficient of the ring condition, < >>For the annular disturbance compensation coefficient +.>、/>、/>、/>Respectively being environmental space index data, sound dispersion distribution index data, reverberation index data, sound coverage rate data, +.>、/>、/>、/>Respectively being environmental noisy degree index data, noise frequency color classification data, sound dispersion attenuation rate data and howling index data, < ->、/>、/>、/>、/>、/>、/>Is a preset characteristic coefficient.
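A hedged sketch of the two environment coefficients follows; the published formulas are not reproduced in the text, so weighted linear forms are assumed and every coefficient value is a placeholder.

```python
# Hypothetical reconstruction of the two environment coefficients; the
# functional forms and weights below are assumptions, not the patented formulas.
def ring_condition_net_state(space, dispersion, reverb, coverage,
                             w=(0.25, 0.25, 0.25, 0.25)):
    # Assumed: more space, better dispersion, higher coverage and LOWER
    # reverberation mean a "cleaner" sound-field environment.
    return w[0]*space + w[1]*dispersion + w[2]*(1.0 - reverb) + w[3]*coverage

def ring_sound_disturbance_compensation(noisiness, noise_color, attenuation,
                                        howling, w=(0.4, 0.2, 0.2, 0.2)):
    # Assumed: a weighted sum of the four noise-disturbance features.
    return w[0]*noisiness + w[1]*noise_color + w[2]*attenuation + w[3]*howling
```

All inputs are taken to be normalized to [0, 1]; with real data each feature would first need its own normalization step.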
Optionally, in the method for improving speech recognition accuracy according to the embodiment of the present application, the obtaining, by combining the data of the user attribute flag information with the speech segment feature data and the sound field environmental feature data and the environmental sound noise disturbance feature data, a corresponding preset type of semantic pick-up recognition model and a semantic behavior recognition response threshold includes:
according to the user identity attribute characteristic data and the user native language category marking data, combining the tone audio characteristic data, the reverberation degree index data, the sound dispersion distribution index data, the noise frequency color classification data and the sound dispersion attenuation rate data, a corresponding preset type semantic picking recognition model and a corresponding semantic behavior recognition response threshold are obtained through a preset type semantic picking recognition model library.
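The patent does not describe the structure of this model library, so as a hedged illustration only: the lookup step can be thought of as a keyed mapping from coarse user/environment categories to a (model, threshold) pair. The key scheme, bucket names, model identifiers and threshold values below are invented placeholders.

```python
# Hypothetical model-library lookup keyed by coarse user/environment buckets.
# Keys, model ids and thresholds are invented placeholders.
SEMANTIC_MODEL_LIBRARY = {
    ("mandarin", "office"):  ("model_mandarin_quiet", 0.75),
    ("mandarin", "street"):  ("model_mandarin_noisy", 0.60),
    ("cantonese", "office"): ("model_cantonese_quiet", 0.72),
}

def pick_model(native_language, environment_bucket):
    # Return the preset semantic picking recognition model id and its
    # semantic behavior recognition response threshold, with a generic
    # fallback when no specific entry exists.
    return SEMANTIC_MODEL_LIBRARY.get(
        (native_language, environment_bucket), ("model_generic", 0.65))
```

For example, `pick_model("mandarin", "street")` returns the noisy-environment Mandarin entry, while an unknown combination falls back to the generic pair.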
Optionally, in the method for improving accuracy of voice recognition according to the embodiment of the present application, the performing recognition processing on the voice clip information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and performing processing in combination with the voice emotion induced factor index to obtain semantic behavior pre-judging response data includes:
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data;
processing through a preset type semantic behavior detection model according to a plurality of voice key ideographic data and voice expression action data and the voice emotion induced factor index to obtain semantic behavior prejudging response data;
The calculation formula of the semantic behavior prejudging response data is as follows:
[formula not reproduced in the source text]
wherein the semantic behavior prejudging response data is computed from the plurality of voice key ideographic data (indexed z) and the plurality of voice expression action data (indexed y), the preset category native language recognition compensation factor, the voice emotion induced factor index, and preset characteristic coefficients.
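A hedged sketch of this pre-judging step: the published formula is not reproduced in the text, so a compensated combination of per-item scores scaled by the emotion index is assumed, with placeholder coefficients.

```python
# Hypothetical reconstruction of the semantic behavior prejudging response
# data; the combination form and coefficients are assumptions.
def semantic_behavior_prejudge(ideographic_scores, action_scores,
                               emotion_index, native_lang_comp=1.0,
                               coeffs=(0.5, 0.3, 0.2)):  # placeholder presets
    # Average the per-item scores for the recognized key ideographic data and
    # expression-action data, then blend in the emotion induced factor index.
    ideo = sum(ideographic_scores) / len(ideographic_scores)
    act = sum(action_scores) / len(action_scores)
    return native_lang_comp * (coeffs[0]*ideo + coeffs[1]*act
                               + coeffs[2]*emotion_index)
```

For example, `semantic_behavior_prejudge([0.8, 0.6], [0.5, 0.7], 0.9)` evaluates to about 0.71 with these placeholder coefficients.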
Optionally, in the method for improving speech recognition accuracy according to the embodiment of the present application, the correcting the semantic behavior pre-judging response data according to the ring condition net state coefficient and the ring acoustic disturbance compensation coefficient to obtain semantic behavior recognition response correction data includes:
correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data;
the correction formula of the semantic behavior recognition response correction data is as follows:
[formula not reproduced in the source text]
wherein the semantic behavior recognition response correction data is computed from the semantic behavior prejudging response data, the ring condition net state coefficient, the ring sound disturbance compensation coefficient, and preset characteristic coefficients.
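A hedged sketch of the correction step: the published formula is an image, so a simple compensated form is assumed here, raising the score with environmental cleanliness and lowering it with ambient-sound disturbance; the three coefficients are placeholders.

```python
# Hypothetical correction of the prejudging response data by the two
# environment coefficients; the form and coefficients are assumptions.
def correct_prejudge(prejudge, net_state, disturbance,
                     coeffs=(1.0, 0.3, 0.2)):  # placeholder presets
    # Add a cleanliness bonus and subtract a disturbance penalty.
    return coeffs[0]*prejudge + coeffs[1]*net_state - coeffs[2]*disturbance
```

For example, `correct_prejudge(0.7, 0.8, 0.5)` evaluates to about 0.84, which would then be compared against the semantic behavior recognition response threshold.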
In a second aspect, an embodiment of the present application provides a system for improving accuracy of speech recognition, the system including: the device comprises a memory and a processor, wherein the memory comprises a program for improving the voice recognition accuracy, and the program for improving the voice recognition accuracy realizes the following steps when being executed by the processor:
collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information and voice environment information of the voice acquisition environment;
extracting voice fragment characteristic data according to the voice fragment information, and processing according to the voice fragment characteristic data to obtain a voice tone characteristic factor, a user emotion correction factor and a voice emotion induction factor index;
extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information, and respectively processing according to the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient;
according to the data of the user attribute marking information, combining the voice fragment characteristic data, the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a corresponding preset type semantic picking recognition model and a semantic behavior recognition response threshold;
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and combining the voice emotion induced factor index to perform processing to obtain semantic behavior prejudging response data;
correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data;
and comparing the semantic behavior recognition response correction data with the semantic behavior recognition response threshold, and judging the accuracy of the user's voice behavior recognition according to the threshold comparison result.
Optionally, in the system for improving accuracy of voice recognition according to the embodiment of the present application, the collecting voice clip information of a user in a preset time period, and obtaining user attribute mark information and voice environment information of a voice obtaining environment includes:
collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information;
acquiring voice environment information of the environment where the user voice is located;
and extracting user identity attribute characteristic data and user native language category marking data according to the user attribute marking information.
In a third aspect, an embodiment of the present application further provides a computer readable storage medium, where a method program for improving accuracy of speech recognition is included, where the method program for improving accuracy of speech recognition, when executed by a processor, implements the steps of the method for improving accuracy of speech recognition according to any one of the above.
It can be seen from the foregoing that the method, system and medium for improving speech recognition accuracy provided by the embodiments of the present application work as follows: voice fragment information, user attribute marking information and voice environment information are collected; the extracted voice fragment characteristic data is processed to obtain a voice emotion induced factor index; the extracted sound field environment characteristic data and environment sound noise disturbance characteristic data are processed to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient; a corresponding preset type semantic picking recognition model is obtained in combination with the voice fragment characteristic data; the voice fragment information is recognized to obtain a plurality of voice key ideographic data and voice expression action data, which are processed in combination with the voice emotion induced factor index to obtain semantic behavior prejudging response data; this data is corrected according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data; and finally the correction data is compared with a semantic behavior recognition response threshold to determine the accuracy of voice behavior recognition. In this way, data processing and evaluation are performed on the voice information combined with the scene environment based on voice big data, the accuracy of the voice recognition result is judged, and the calibration judgment of user voice recognition accuracy is improved.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and should not be considered as limiting the scope, and other related drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for improving accuracy of speech recognition according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for improving accuracy of speech recognition according to an embodiment of the present application, in which speech segment information, user attribute flag information, and speech environment information are obtained;
FIG. 3 is a flowchart of obtaining a voice emotion induced factor index in a method for improving accuracy of speech recognition according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a system for improving accuracy of speech recognition according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present application, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for improving accuracy of speech recognition according to some embodiments of the application. The method for improving the voice recognition accuracy is used in terminal equipment, such as computers, mobile phone terminals and the like. The method for improving the accuracy of voice recognition comprises the following steps:
S101, collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information and voice environment information of a voice acquisition environment;
S102, extracting voice fragment characteristic data according to the voice fragment information, and processing according to the voice fragment characteristic data to obtain a voice tone characteristic factor, a user emotion correction factor and a voice emotion induced factor index;
S103, extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information, and respectively processing according to the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient;
S104, according to the data of the user attribute marking information, combining the voice fragment characteristic data, the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a corresponding preset type semantic picking recognition model and a semantic behavior recognition response threshold;
S105, carrying out recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and processing by combining the voice emotion induction factor index to obtain semantic behavior prejudging response data;
S106, correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data;
and S107, comparing the semantic behavior recognition response correction data with the semantic behavior recognition response threshold, and judging the accuracy of the user's voice behavior recognition according to the threshold comparison result.
It should be noted that, in order to compensate and check the semantic recognition result of user voice in combination with the user's personalized attributes, emotional elements and the voice collection environment, and thereby obtain an accurate check of the voice recognition capability and an effective judgment of the user's voice recognition accuracy, the method proceeds as follows. Voice fragment information of the user is collected in a preset time period, and the user attribute marking information and the voice environment information of the collection environment are obtained at the same time. Voice fragment characteristic data is extracted and processed to obtain the voice tone characteristic factor, the user emotion correction factor and the voice emotion induced factor index, that is, interference element factors reflecting the user's personalized voice frequency and tone, emotional expression and voice emotion. The extracted sound field environment characteristic data and environment sound noise disturbance characteristic data are processed to obtain the ring condition net state coefficient and the ring sound disturbance compensation coefficient, that is, coefficient evaluations of the cleanliness of the voice environment and of its sound disturbance. A corresponding preset type semantic picking recognition model and a semantic behavior recognition response threshold are then obtained according to the user attribute marking information in combination with the voice fragment characteristic data, the sound field environment characteristic data and the environment sound noise disturbance characteristic data. The voice fragment information is recognized by this model to obtain a plurality of voice key ideographic data and voice expression action data, which are processed in combination with the emotion interference factor of the user's voice expression to obtain semantic behavior prejudging response data. The prejudging response data is corrected according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data, making the recognition response result more accurate through compensation and correction. Finally, the obtained correction data is compared with the obtained semantic behavior recognition response threshold, and the accuracy of the user's voice behavior recognition is judged according to the threshold comparison result, so that the accuracy of the user's voice recognition response is effectively judged.
Referring to fig. 2, fig. 2 is a flowchart of a method for improving accuracy of speech recognition according to some embodiments of the application for obtaining speech segment information, user attribute flag information, and speech environment information. According to the embodiment of the application, the voice clip information of the user in the preset time period is collected, and the user attribute marking information and the voice environment information of the voice acquisition environment are acquired, specifically:
S201, collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information;
S202, acquiring voice environment information of the environment where the voice of the user is located;
and S203, extracting user identity attribute characteristic data and user native language category marking data according to the user attribute marking information.
It should be noted that, firstly, the voice fragment information of the user, that is, the user's voice fragments, is collected in a preset time period, and the user attribute marking information is obtained; the voice environment information of the environment where the user's voice is located is then acquired; and characteristic data reflecting the user's identity attributes (such as identity, occupation, household registration and residence) and marking data of the user's native language category are extracted according to the user attribute marking information.
Referring to fig. 3, fig. 3 is a flowchart of obtaining a voice emotion induced factor index according to some embodiments of the present application. According to the embodiment of the application, the voice segment characteristic data is extracted according to the voice segment information, and the voice tone characteristic factor, the user emotion correction factor and the voice emotion induced factor index are obtained according to the voice segment characteristic data processing, specifically:
S301, extracting voice segment characteristic data according to the voice segment information, wherein the voice segment characteristic data comprises tone audio characteristic data, note pronunciation characteristic data, broadcasting definition characteristic data, language tone fluctuation characteristic data and modal fluctuation characteristic data;
S302, processing, through a preset voice emotion induced interference recognition model, the tone audio characteristic data, the note pronunciation characteristic data, the broadcasting definition characteristic data, the language tone fluctuation characteristic data and the modal fluctuation characteristic data, to respectively obtain a voice tone characteristic factor and a user emotion correction factor;
s303, processing according to the voice tone characteristic factors and the user emotion correction factors to obtain voice emotion induced factor indexes;
the voice emotion induced factor index is calculated from the voice tone characteristic factor and the user emotion correction factor, together with the tone audio characteristic data, note pronunciation characteristic data, broadcasting definition characteristic data, language tone fluctuation characteristic data, and modal fluctuation characteristic data, a preset category native language recognition compensation factor, and six preset feature coefficients (the category native language recognition compensation factor and the feature coefficients are obtained by querying the semantic pick-up recognition platform database).
It should be noted that, in order to evaluate the interference effect of personalized voice elements such as the user's voice frequency, timbre, and language-characteristic emotion on the recognition of the user's voice, the user's voice segment characteristic data is analyzed to obtain the relevant interference factors, which are further processed into a factor index. The voice segment characteristic data, comprising the tone audio, note pronunciation, broadcasting definition, language tone fluctuation, and modal fluctuation characteristic data, is extracted from the voice segment information. The extracted characteristic data is then processed through the preset voice emotion induced interference recognition model of a third-party semantic pick-up recognition platform to obtain the voice tone characteristic factor, which reflects the user's voice tone characteristics, and the user emotion correction factor, which reflects the influence of the user's emotional characteristics on the voice. Finally, the voice tone characteristic factor and the user emotion correction factor are processed to obtain the voice emotion induced factor index, that is, a factor index of the interference induced by the emotional expression of the user's voice.
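The published formula for this index is carried in figures that do not survive text extraction, so the exact form is unknown. Purely as an illustrative sketch, assuming a linear weighted combination and hypothetical names for every quantity, the computation of the voice emotion induced factor index might look like:

```python
from dataclasses import dataclass

@dataclass
class VoiceSegmentFeatures:
    tone_audio: float            # tone audio characteristic data
    note_pronunciation: float    # note pronunciation characteristic data
    broadcast_clarity: float     # broadcasting definition characteristic data
    tone_fluctuation: float      # language tone fluctuation characteristic data
    modal_fluctuation: float     # modal fluctuation characteristic data

def emotion_induced_factor_index(
    tone_factor: float,          # voice tone characteristic factor
    emotion_correction: float,   # user emotion correction factor
    feats: VoiceSegmentFeatures,
    native_lang_comp: float,     # category native language recognition compensation factor
    coeffs: list[float],         # six preset feature coefficients (queried from the platform DB)
) -> float:
    """Hypothetical linear form: weight the five segment features, scale by the
    tone factor, add the weighted emotion correction, apply the compensation factor."""
    k1, k2, k3, k4, k5, k6 = coeffs
    weighted = (k1 * feats.tone_audio
                + k2 * feats.note_pronunciation
                + k3 * feats.broadcast_clarity
                + k4 * feats.tone_fluctuation
                + k5 * feats.modal_fluctuation)
    return native_lang_comp * (tone_factor * weighted + k6 * emotion_correction)
```

In the patent, both the compensation factor and the six coefficients are obtained by querying the semantic pick-up recognition platform database rather than being hard-coded.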
According to the embodiment of the invention, the sound field environment characteristic data and the environment sound noise disturbance characteristic data are extracted according to the voice environment information, and are respectively processed according to the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain the ring condition net state coefficient and the ring sound disturbance compensation coefficient, which are specifically as follows:
Extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information;
the sound field environment characteristic data comprise environment space index data, sound dispersion distribution index data, reverberation degree index data and sound coverage rate data, and the environment sound noise disturbance characteristic data comprise environment noisy degree index data, noise frequency color classification data, sound dispersion attenuation rate data and howling index data;
obtaining a ring condition net state coefficient according to the sound field environmental characteristic data processing, and obtaining a ring sound disturbance compensation coefficient according to the environmental sound noise disturbance characteristic data processing;
the ring condition net state coefficient is calculated from the environment space index data, sound dispersion distribution index data, reverberation degree index data, and sound coverage rate data; the ring sound disturbance compensation coefficient is calculated from the environment noisy degree index data, noise frequency color classification data, sound dispersion attenuation rate data, and howling index data. Both calculations use preset feature coefficients obtained by querying the semantic pick-up recognition platform database.
It should be noted that, because the collection environment of the user's voice is complex, diverse, and variable, the voice environment either benefits or weakens voice recognition, so the interference factor coefficients of the collection environment must be evaluated in order to compensate and correct the evaluation of the voice recognition effect. The sound field environment characteristic data and the environment sound noise disturbance characteristic data are extracted from the voice environment information. The sound field environment characteristic data comprises the space size index data of the collection environment, the sound dispersion distribution index data, the reverberation degree index data, and the sound coverage rate data of the site; the environment sound noise disturbance characteristic data comprises the environment noisy degree index data, the noise frequency color classification data, the sound dispersion attenuation rate data, and the howling index data of the site. These two groups of characteristic data are then processed through preset formulas to obtain the ring condition net state coefficient and the ring sound disturbance compensation coefficient, that is, a coefficient reflecting the sound purity of the collection environment and a compensation coefficient reflecting the environmental sound disturbance.
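The two environment coefficients above are each computed from four feature values and preset coefficients; the published formulas are not recoverable from the text, so the following is only a sketch under the assumption of a simple weighted sum, with hypothetical parameter names:

```python
def ring_condition_net_state(space_idx: float, dispersion_idx: float,
                             reverb_idx: float, coverage: float,
                             w: list[float]) -> float:
    # Weighted combination of the four sound-field environment features;
    # w holds preset feature coefficients queried from the platform database.
    return w[0] * space_idx + w[1] * dispersion_idx + w[2] * reverb_idx + w[3] * coverage

def ring_sound_disturbance_compensation(noisy_idx: float, noise_color: float,
                                        attenuation: float, howling: float,
                                        w: list[float]) -> float:
    # Weighted combination of the four ambient-noise disturbance features.
    return w[0] * noisy_idx + w[1] * noise_color + w[2] * attenuation + w[3] * howling
```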
According to the embodiment of the invention, a corresponding preset type semantic pick-up recognition model and a semantic behavior recognition response threshold are obtained according to the data of the user attribute marking information in combination with the voice segment characteristic data, the sound field environment characteristic data, and the environment sound noise disturbance characteristic data, specifically:
and according to the user identity attribute characteristic data and the user native language category marking data, combining the tone color audio characteristic data, the reverberation index data, the sound dispersion distribution index data, the noise frequency color classification data and the sound dispersion attenuation rate data, obtaining a corresponding preset type semantic picking and identifying model and a corresponding semantic behavior identifying response threshold value through a preset type semantic picking and identifying model library.
It should be noted that, because voice elements such as personality, expression style, and native language category vary between users, a semantic pick-up recognition model of the corresponding category must be obtained in a targeted manner in order to accurately recognize the semantic information and expression behavior information in the user's voice segments, that is, to recognize the meaning and instruction behavior they express. A model adapted to the user's personalized voice characteristics is therefore selected: according to the user identity attribute characteristic data and the user native language category marking data, in combination with the tone audio characteristic data, reverberation degree index data, sound dispersion distribution index data, noise frequency color classification data, and sound dispersion attenuation rate data, the corresponding preset type semantic pick-up recognition model is obtained from a preset third-party semantic pick-up recognition model library. The model library is an integrated library containing semantic pick-up recognition models for a plurality of voice types; candidates are matched by data similarity, which may be computed as cosine similarity, yielding the corresponding preset type semantic pick-up recognition model and its semantic behavior recognition response threshold.
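The text states that the library lookup may use cosine similarity. A minimal sketch of such a selection, with hypothetical model identifiers, feature-vector layout, and library structure (none of which are specified in the source):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_recognition_model(user_vector: list[float], model_library: dict):
    """model_library maps model_id -> (feature_vector, response_threshold).
    Returns the id and threshold of the most similar model."""
    best_id, best_sim = None, -1.0
    for model_id, (vec, _threshold) in model_library.items():
        sim = cosine_similarity(user_vector, vec)
        if sim > best_sim:
            best_id, best_sim = model_id, sim
    _vec, threshold = model_library[best_id]
    return best_id, threshold
```

Here the user vector would be assembled from the identity attribute, native language, tone audio, reverberation, dispersion, noise color, and attenuation data listed above; how those are encoded as numbers is left unspecified by the patent.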
According to the embodiment of the invention, the voice fragment information is identified according to the preset type semantic picking identification model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and the voice key ideographic data and the voice expression action data are processed by combining the voice emotion induced factor index to obtain semantic behavior prejudging response data, specifically:
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data;
processing through a preset type semantic behavior detection model according to a plurality of voice key ideographic data and voice expression action data and the voice emotion induced factor index to obtain semantic behavior prejudging response data;
the semantic behavior prejudging response data is calculated over the plurality of voice key ideographic data and the plurality of voice expression action data, in combination with the preset category native language recognition compensation factor, the voice emotion induced factor index, and three preset feature coefficients (obtained by querying the semantic pick-up recognition platform database).
It should be noted that the voice fragment information is recognized by the preset type semantic pick-up recognition model obtained from the model library to extract a plurality of voice key ideographic data and a plurality of voice expression action data, that is, the relevant data are recognized and extracted from the voice segments by the recognition model of the corresponding type. The voice key ideographic data and the voice expression action data are then combined with the voice emotion induced factor index and calculated through a preset formula to obtain the semantic behavior prejudging response data, that is, the voice recognition result data is combined with the emotion interference factor of the user's voice expression to obtain evaluation result data for the prejudged recognition response to the voice's intended action, reflecting how accurately the user's voice is judged to be recognized.
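The aggregation of the many ideographic and action data items into one prejudging value is not specified in the surviving text. As a sketch only, assuming averaging over the items and treating the emotion index as a subtractive interference term (both assumptions, with hypothetical names):

```python
def semantic_behavior_prejudge(ideographic: list[float], actions: list[float],
                               native_comp: float, emotion_idx: float,
                               k: tuple[float, float, float]) -> float:
    """Combine averaged ideographic and action scores, apply the native-language
    compensation factor, and subtract the weighted emotion interference index."""
    k1, k2, k3 = k  # three preset feature coefficients from the platform database
    ideo_term = k1 * sum(ideographic) / len(ideographic)
    act_term = k2 * sum(actions) / len(actions)
    return native_comp * (ideo_term + act_term) - k3 * emotion_idx
```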
According to the embodiment of the invention, the semantic behavior prejudging response data is corrected according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data, specifically:
correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior identification response correction data;
The semantic behavior recognition response correction data is calculated from the semantic behavior prejudging response data, the ring condition net state coefficient, and the ring sound disturbance compensation coefficient, using three preset feature coefficients (obtained by querying the semantic pick-up recognition platform database).
It should be noted that, to further improve the evaluation accuracy of the voice recognition response result and thus correct the judgment of the user's voice recognition accuracy, the obtained semantic behavior prejudging response data is corrected according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient through a correction formula to obtain the semantic behavior recognition response correction data. The compensation correction makes the judgment of the voice recognition capability more accurate and improves the calibration of the voice recognition accuracy.
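The correction step described above applies the two environment coefficients to the prejudging value; its published formula is likewise unrecoverable from the text. One plausible sketch, assuming a multiplicative correction where a clean environment (high net-state coefficient) raises the result and disturbance compensation lowers it:

```python
def correct_response(prejudge: float, net_state: float,
                     disturbance_comp: float,
                     k: tuple[float, float, float]) -> float:
    """Scale the prejudging response by an environment-dependent correction term."""
    k1, k2, k3 = k  # three preset feature coefficients from the platform database
    return prejudge * (k1 + k2 * net_state - k3 * disturbance_comp)
```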
As shown in fig. 4, the present invention also discloses a system for improving the accuracy of speech recognition, which comprises a memory 41 and a processor 42, wherein the memory stores a program of the method for improving the accuracy of speech recognition, and when the program is executed by the processor, the following steps are implemented:
Collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information and voice environment information of a voice acquisition environment;
extracting voice fragment characteristic data according to the voice fragment information, and processing according to the voice fragment characteristic data to obtain a voice tone characteristic factor, a user emotion correction factor and a voice emotion induction factor index;
extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information, and respectively processing according to the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient;
according to the data of the user attribute marking information, combining the voice fragment characteristic data, the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a corresponding preset type semantic picking recognition model and a semantic behavior recognition response threshold;
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and combining the voice emotion induced factor index to perform processing to obtain semantic behavior prejudging response data;
Correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior identification response correction data;
and comparing the semantic behavior recognition response correction data with the semantic behavior recognition response threshold, and judging the accuracy of the user's voice behavior recognition according to the threshold comparison result.
It should be noted that, in order to compensate and check the semantic recognition result of the user's voice in combination with the user's personalized attributes, emotional elements, and the voice collection environment information, and thereby obtain an accurate check of the voice recognition capability and an effective judgment of the user's voice recognition accuracy, the system proceeds as follows. The voice fragment information of the user within a preset time period is collected, and the user attribute marking information and the voice environment information of the collection environment are acquired. Voice segment characteristic data is extracted and processed to obtain the voice tone characteristic factor, the user emotion correction factor, and the voice emotion induced factor index, that is, interference factors reflecting the user's personalized voice frequency and timbre, emotional expression, and voice mood. The extracted sound field environment characteristic data and environment sound noise disturbance characteristic data are processed to obtain the ring condition net state coefficient and the ring sound disturbance compensation coefficient, that is, the sound purity and the sound disturbance condition of the voice environment are evaluated as coefficients. The corresponding preset type semantic pick-up recognition model and the semantic behavior recognition response threshold are then obtained according to the user attribute marking information in combination with the voice segment characteristic data, the sound field environment characteristic data, and the environment sound noise disturbance characteristic data. The voice fragment information is recognized by this model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, which are processed together with the voice emotion induced factor index to obtain the semantic behavior prejudging response data, that is, the voice recognition result data is combined with the emotion interference factor of the user's voice expression to obtain accurate response result data for the prejudged recognition of the voice's intended behavior. The semantic behavior prejudging response data is corrected according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain the semantic behavior recognition response correction data, making the recognition response result more accurate. Finally, the semantic behavior recognition response correction data is compared with the obtained semantic behavior recognition response threshold, and the accuracy of the user's voice behavior recognition is judged according to the comparison result, effectively judging the user's voice recognition response accuracy.
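The final step of the pipeline above is a plain threshold comparison between the corrected response data and the threshold returned with the selected model. A sketch, assuming higher corrected values indicate more accurate recognition:

```python
def judge_accuracy(corrected_response: float, response_threshold: float) -> bool:
    """True when the corrected semantic behavior recognition response data
    meets or exceeds the semantic behavior recognition response threshold."""
    return corrected_response >= response_threshold
```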
According to the embodiment of the invention, the voice clip information of the user in the preset time period is collected, and the user attribute marking information and the voice environment information of the voice acquisition environment are acquired, specifically:
collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information;
acquiring voice environment information of the environment where the user voice is located;
and extracting user identity attribute characteristic data and user native language category marking data according to the user attribute marking information.
It should be noted that the voice fragment information of the user, that is, the user's voice fragments within a preset time period, is collected first, and the user attribute marking information and the voice environment information of the environment where the user's voice is located are acquired. The user identity attribute characteristic data, such as the user's identity, occupation, household registration, and residence, together with the user's native language category marking data, are then extracted from the user attribute marking information.
According to the embodiment of the invention, the voice segment characteristic data is extracted according to the voice segment information, and the voice tone characteristic factor, the user emotion correction factor and the voice emotion induced factor index are obtained according to the voice segment characteristic data processing, specifically:
Extracting voice segment characteristic data according to the voice segment information, wherein the voice segment characteristic data comprises tone audio characteristic data, note pronunciation characteristic data, broadcasting definition characteristic data, language tone fluctuation characteristic data and modal fluctuation characteristic data;
processing according to the tone audio characteristic data, the note pronunciation characteristic data, the broadcasting definition characteristic data, the language tone fluctuation characteristic data and the modal fluctuation characteristic data through a preset voice emotion induced interference recognition model to respectively obtain a voice tone characteristic factor and a user emotion correction factor;
processing according to the voice tone characteristic factors and the user emotion correction factors to obtain voice emotion induced factor indexes;
the voice emotion induced factor index is calculated from the voice tone characteristic factor and the user emotion correction factor, together with the tone audio characteristic data, note pronunciation characteristic data, broadcasting definition characteristic data, language tone fluctuation characteristic data, and modal fluctuation characteristic data, a preset category native language recognition compensation factor, and six preset feature coefficients (the category native language recognition compensation factor and the feature coefficients are obtained by querying the semantic pick-up recognition platform database).
It should be noted that, in order to evaluate the interference effect of personalized voice elements such as the user's voice frequency, timbre, and language-characteristic emotion on the recognition of the user's voice, the user's voice segment characteristic data is analyzed to obtain the relevant interference factors, which are further processed into a factor index. The voice segment characteristic data, comprising the tone audio, note pronunciation, broadcasting definition, language tone fluctuation, and modal fluctuation characteristic data, is extracted from the voice segment information. The extracted characteristic data is then processed through the preset voice emotion induced interference recognition model of a third-party semantic pick-up recognition platform to obtain the voice tone characteristic factor, which reflects the user's voice tone characteristics, and the user emotion correction factor, which reflects the influence of the user's emotional characteristics on the voice. Finally, the voice tone characteristic factor and the user emotion correction factor are processed to obtain the voice emotion induced factor index, that is, a factor index of the interference induced by the emotional expression of the user's voice.
According to the embodiment of the invention, the sound field environment characteristic data and the environment sound noise disturbance characteristic data are extracted according to the voice environment information, and are respectively processed according to the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain the ring condition net state coefficient and the ring sound disturbance compensation coefficient, which are specifically as follows:
Extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information;
the sound field environment characteristic data comprise environment space index data, sound dispersion distribution index data, reverberation degree index data and sound coverage rate data, and the environment sound noise disturbance characteristic data comprise environment noisy degree index data, noise frequency color classification data, sound dispersion attenuation rate data and howling index data;
obtaining a ring condition net state coefficient according to the sound field environmental characteristic data processing, and obtaining a ring sound disturbance compensation coefficient according to the environmental sound noise disturbance characteristic data processing;
the ring condition net state coefficient is calculated from the environment space index data, sound dispersion distribution index data, reverberation degree index data, and sound coverage rate data; the ring sound disturbance compensation coefficient is calculated from the environment noisy degree index data, noise frequency color classification data, sound dispersion attenuation rate data, and howling index data. Both calculations use preset feature coefficients obtained by querying the semantic pick-up recognition platform database.
It should be noted that, because the collection environment of the user's voice is complex, diverse, and variable, the voice environment either benefits or weakens voice recognition, so the interference factor coefficients of the collection environment must be evaluated in order to compensate and correct the evaluation of the voice recognition effect. The sound field environment characteristic data and the environment sound noise disturbance characteristic data are extracted from the voice environment information. The sound field environment characteristic data comprises the space size index data of the collection environment, the sound dispersion distribution index data, the reverberation degree index data, and the sound coverage rate data of the site; the environment sound noise disturbance characteristic data comprises the environment noisy degree index data, the noise frequency color classification data, the sound dispersion attenuation rate data, and the howling index data of the site. These two groups of characteristic data are then processed through preset formulas to obtain the ring condition net state coefficient and the ring sound disturbance compensation coefficient, that is, a coefficient reflecting the sound purity of the collection environment and a compensation coefficient reflecting the environmental sound disturbance.
According to the embodiment of the invention, a corresponding preset type semantic pick-up recognition model and a semantic behavior recognition response threshold are obtained according to the data of the user attribute marking information in combination with the voice segment characteristic data, the sound field environment characteristic data, and the environment sound noise disturbance characteristic data, specifically:
and according to the user identity attribute characteristic data and the user native language category marking data, combining the tone color audio characteristic data, the reverberation index data, the sound dispersion distribution index data, the noise frequency color classification data and the sound dispersion attenuation rate data, obtaining a corresponding preset type semantic picking and identifying model and a corresponding semantic behavior identifying response threshold value through a preset type semantic picking and identifying model library.
It should be noted that, because voice elements such as personality, expression style, and native language category vary between users, a semantic pick-up recognition model of the corresponding category must be obtained in a targeted manner in order to accurately recognize the semantic information and expression behavior information in the user's voice segments, that is, to recognize the meaning and instruction behavior they express. A model adapted to the user's personalized voice characteristics is therefore selected: according to the user identity attribute characteristic data and the user native language category marking data, in combination with the tone audio characteristic data, reverberation degree index data, sound dispersion distribution index data, noise frequency color classification data, and sound dispersion attenuation rate data, the corresponding preset type semantic pick-up recognition model is obtained from a preset third-party semantic pick-up recognition model library. The model library is an integrated library containing semantic pick-up recognition models for a plurality of voice types; candidates are matched by data similarity, which may be computed as cosine similarity, yielding the corresponding preset type semantic pick-up recognition model and its semantic behavior recognition response threshold.
According to the embodiment of the invention, the voice fragment information is identified according to the preset type semantic picking identification model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and the voice key ideographic data and the voice expression action data are processed by combining the voice emotion induced factor index to obtain semantic behavior prejudging response data, specifically:
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data;
processing through a preset type semantic behavior detection model according to a plurality of voice key ideographic data and voice expression action data and the voice emotion induced factor index to obtain semantic behavior prejudging response data;
the semantic behavior prejudging response data is calculated over the plurality of voice key ideographic data and the plurality of voice expression action data, in combination with the preset category native language recognition compensation factor, the voice emotion induced factor index, and three preset feature coefficients (obtained by querying the semantic pick-up recognition platform database).
It should be noted that the voice fragment information is recognized by the preset type semantic pick-up recognition model obtained from the model library to extract a plurality of voice key ideographic data and a plurality of voice expression action data, that is, the relevant data are recognized and extracted from the voice segments by the recognition model of the corresponding type. The voice key ideographic data and the voice expression action data are then combined with the voice emotion induced factor index and calculated through a preset formula to obtain the semantic behavior prejudging response data, that is, the voice recognition result data is combined with the emotion interference factor of the user's voice expression to obtain evaluation result data for the prejudged recognition response to the voice's intended action, reflecting how accurately the user's voice is judged to be recognized.
According to the embodiment of the invention, the semantic behavior prejudging response data is corrected according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data, specifically:
correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior identification response correction data;
The correction formula for the semantic behavior recognition response correction data is as follows (the formula appears only as an image in the source document and is not reproduced here):
wherein the quantities in the formula are: the semantic behavior recognition response correction data; the semantic behavior pre-judgment response data; the ring-condition net state coefficient; the ring-sound disturbance compensation coefficient; and three preset characteristic coefficients (obtained by querying the semantic pick-up recognition platform database).
In order to further improve the evaluation accuracy of the voice recognition response result, and thereby obtain a corrected judgment of the user's voice recognition accuracy, the obtained semantic behavior pre-judgment response data is corrected by a correction formula according to the ring-condition net state coefficient and the ring-sound disturbance compensation coefficient, so as to obtain the semantic behavior recognition response correction data. Through this compensation correction, the judged response result for voice recognition capability is more accurate, improving the accuracy of the calibration and judgment of the voice recognition effect.
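This correction step can be sketched as follows. The form is hypothetical — the patent supplies the correction formula only as an image — so the scaling-plus-offset structure and the stand-in coefficients `b1`..`b3` are illustrative assumptions.

```python
def corrected_response(prejudge, net_state, disturb_comp,
                       b1=1.0, b2=0.1, b3=0.0):
    """Hypothetical correction of the pre-judgment response data using
    the ring-condition net state coefficient (net_state) and the
    ring-sound disturbance compensation coefficient (disturb_comp)."""
    # b1..b3 stand in for the patent's preset characteristic
    # coefficients, which are queried from the platform database.
    return b1 * prejudge * net_state + b2 * disturb_comp + b3
```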
A third aspect of the present invention provides a readable storage medium having embodied therein a method program for improving the accuracy of speech recognition, which when executed by a processor, implements the steps of the method for improving the accuracy of speech recognition as described in any one of the preceding claims.
The invention discloses a method, a system, and a medium for improving the accuracy of voice recognition. Voice segment information, user attribute marking information, and voice environment information are collected. A voice emotion induced factor index is obtained by processing the extracted voice segment characteristic data, and a ring-condition net state coefficient and a ring-sound disturbance compensation coefficient are obtained by processing the extracted sound field environment characteristic data and environmental sound noise disturbance characteristic data. A corresponding preset-type semantic pick-up recognition model is then obtained in combination with the voice segment characteristic data, and the voice segment information is recognized to obtain a plurality of voice key ideographic data and voice expression action data. These are combined with the voice emotion induced factor index to obtain semantic behavior pre-judgment response data, which is then corrected according to the ring-condition net state coefficient and the ring-sound disturbance compensation coefficient to obtain semantic behavior recognition response correction data. Finally, the accuracy of voice behavior recognition is judged by threshold comparison with a semantic behavior recognition response threshold. In this way, based on voice big data, the voice information is combined with scene environment information for data processing and evaluation, the accuracy of the voice recognition result is judged, and the calibration and judgment of the user's voice recognition accuracy is improved.
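The end-to-end flow summarized above can be sketched as a pipeline skeleton. This is an illustrative reconstruction only: every function name and the direction of the final threshold comparison are assumptions, since the patent specifies the concrete computations via unreproduced formula images; they are therefore passed in here as callables.

```python
def judge_recognition_accuracy(segment_info, user_attrs, env_info,
                               extract_features, compute_emotion_index,
                               compute_env_coeffs, select_model,
                               recognize, prejudge, correct):
    """Skeleton of the claimed method; the concrete computations are
    supplied as callables (hypothetical decomposition)."""
    seg_feats = extract_features(segment_info)
    emotion_idx = compute_emotion_index(seg_feats)
    # Environment processing yields the ring-condition net state
    # coefficient and the ring-sound disturbance compensation coefficient.
    net_state, disturb_comp = compute_env_coeffs(env_info)
    model, threshold = select_model(user_attrs, seg_feats, env_info)
    ideographic, actions = recognize(model, segment_info)
    pre = prejudge(ideographic, actions, emotion_idx)
    corrected = correct(pre, net_state, disturb_comp)
    # Final step: threshold comparison decides whether the recognition
    # is judged accurate (the comparison direction is an assumption).
    return corrected >= threshold
```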
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be implemented by hardware related to program instructions. The foregoing program may be stored in a readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes any medium that can store program code, such as a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Alternatively, the above-described integrated units of the present invention may be stored in a readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solution of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.

Claims (10)

1. A method for improving the accuracy of speech recognition, comprising the steps of:
collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information and voice environment information of a voice acquisition environment;
extracting voice fragment characteristic data according to the voice fragment information, and processing according to the voice fragment characteristic data to obtain a voice tone characteristic factor, a user emotion correction factor and a voice emotion induction factor index;
extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information, and respectively processing according to the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient;
according to the data of the user attribute marking information, combining the voice fragment characteristic data, the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a corresponding preset type semantic picking recognition model and a semantic behavior recognition response threshold;
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and combining the voice emotion induced factor index to perform processing to obtain semantic behavior prejudging response data;
Correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior identification response correction data;
and comparing the threshold value with the semantic behavior recognition response threshold value according to the semantic behavior recognition response correction data, and judging the accuracy of voice behavior recognition of the user according to a threshold value comparison result.
2. The method for improving accuracy of voice recognition according to claim 1, wherein the steps of collecting voice clip information of a user in a preset time period, and obtaining user attribute mark information and voice environment information of a voice obtaining environment comprise:
collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information;
acquiring voice environment information of the environment where the user voice is located;
and extracting user identity attribute characteristic data and user native language category marking data according to the user attribute marking information.
3. The method for improving accuracy of speech recognition according to claim 2, wherein extracting speech segment feature data from the speech segment information and obtaining speech tone feature factors and user emotion correction factors and speech emotion induced factor indices from processing the speech segment feature data comprises:
Extracting voice segment characteristic data according to the voice segment information, wherein the voice segment characteristic data comprises tone audio characteristic data, note pronunciation characteristic data, broadcasting definition characteristic data, language tone fluctuation characteristic data and modal fluctuation characteristic data;
processing according to the tone audio characteristic data, the note pronunciation characteristic data, the broadcasting definition characteristic data, the language tone fluctuation characteristic data and the modal fluctuation characteristic data through a preset voice emotion induced interference recognition model to respectively obtain a voice tone characteristic factor and a user emotion correction factor;
processing according to the voice tone characteristic factors and the user emotion correction factors to obtain voice emotion induced factor indexes;
the formula for computing the voice emotion induced factor index is as follows (the formula appears only as an image in the source document and is not reproduced here):
wherein the quantities in the formula are: the voice emotion induced factor index; the voice tone characteristic factor; the user emotion correction factor; the timbre audio characteristic data, note pronunciation characteristic data, broadcasting definition characteristic data, language tone fluctuation characteristic data, and modal fluctuation characteristic data; the preset-category native language recognition compensation factor; and preset characteristic coefficients.
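The computation in claim 3 can be illustrated with a minimal sketch. The linear split into a tone characteristic factor and an emotion correction factor, the weights `w`, and the combination coefficients `k1`, `k2` are assumptions — the patent's actual formula exists only as an image.

```python
def emotion_factor_index(timbre, note, clarity, tone_fluct, modal_fluct,
                         native_comp,
                         w=(0.2, 0.2, 0.2, 0.2, 0.2), k1=0.6, k2=0.4):
    """Hypothetical computation of the voice emotion induced factor
    index from the five voice segment features of claim 3."""
    # Voice tone characteristic factor from timbre, pronunciation, and
    # clarity; user emotion correction factor from the two fluctuation
    # features (this split is an assumption).
    tone_factor = w[0] * timbre + w[1] * note + w[2] * clarity
    emotion_corr = w[3] * tone_fluct + w[4] * modal_fluct
    # Combine and apply the native-language recognition compensation.
    return (k1 * tone_factor + k2 * emotion_corr) * native_comp
```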
4. The method for improving accuracy of voice recognition according to claim 3, wherein extracting sound field environmental characteristic data and environmental sound noise disturbance characteristic data according to the voice environmental information, and processing respectively according to the sound field environmental characteristic data and the environmental sound noise disturbance characteristic data to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient comprises:
Extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information;
the sound field environment characteristic data comprise environment space index data, sound dispersion distribution index data, reverberation degree index data and sound coverage rate data, and the environment sound noise disturbance characteristic data comprise environment noisy degree index data, noise frequency color classification data, sound dispersion attenuation rate data and howling index data;
obtaining a ring condition net state coefficient according to the sound field environmental characteristic data processing, and obtaining a ring sound disturbance compensation coefficient according to the environmental sound noise disturbance characteristic data processing;
the calculation formula of the ring-condition net state coefficient and the calculation formula of the ring-sound disturbance compensation coefficient are as follows (both formulas appear only as images in the source document and are not reproduced here):
wherein the quantities in the formulas are: the ring-condition net state coefficient; the ring-sound disturbance compensation coefficient; the environmental space index data, sound dispersion distribution index data, reverberation degree index data, and sound coverage rate data; the environmental noisy degree index data, noise frequency color classification data, sound dispersion attenuation rate data, and howling index data; and preset characteristic coefficients.
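The two environment coefficients of claim 4 can be illustrated as weighted sums over the two feature groups. This weighted-sum form and the uniform weights `u`, `v` are assumptions standing in for the patent's unreproduced formulas and preset coefficients.

```python
def env_coefficients(sound_field, noise_disturb,
                     u=(0.25, 0.25, 0.25, 0.25),
                     v=(0.25, 0.25, 0.25, 0.25)):
    """Hypothetical weighted-sum form of the two environment coefficients.

    sound_field: (environmental space index, sound dispersion
                  distribution index, reverberation degree index,
                  sound coverage rate)
    noise_disturb: (noisy degree index, noise frequency color class,
                    sound dispersion attenuation rate, howling index)
    """
    net_state = sum(ui * x for ui, x in zip(u, sound_field))
    disturb_comp = sum(vi * x for vi, x in zip(v, noise_disturb))
    return net_state, disturb_comp
```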
5. The method according to claim 4, wherein the obtaining the corresponding preset type of semantic pick-up recognition model according to the data of the user attribute marking information in combination with the speech segment feature data and the sound field environmental feature data and the environmental sound noise disturbance feature data, and the semantic behavior recognition response threshold includes:
And according to the user identity attribute characteristic data and the user native language category marking data, combining the tone color audio characteristic data, the reverberation index data, the sound dispersion distribution index data, the noise frequency color classification data and the sound dispersion attenuation rate data, obtaining a corresponding preset type semantic picking and identifying model and a corresponding semantic behavior identifying response threshold value through a preset type semantic picking and identifying model library.
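Claim 5's model selection amounts to a keyed lookup into a preset model library that returns both a recognition model and its response threshold. A minimal sketch follows; for brevity it keys only on identity and native-language category, whereas the claim also factors in audio and environment features, and the keys, model names, and thresholds here are invented placeholders.

```python
# Hypothetical model library keyed by (identity attribute, native
# language category); entries are (model name, response threshold).
MODEL_LIBRARY = {
    ("adult", "mandarin"): ("semantic_model_adult_mandarin", 0.75),
    ("child", "cantonese"): ("semantic_model_child_cantonese", 0.70),
}

def select_model(identity, native_lang):
    """Return (semantic pick-up recognition model, semantic behavior
    recognition response threshold); falls back to a default entry."""
    return MODEL_LIBRARY.get((identity, native_lang),
                             ("semantic_model_default", 0.80))
```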
6. The method for improving accuracy of speech recognition according to claim 5, wherein the performing recognition processing on the speech segment information according to the preset type semantic pick-up recognition model to obtain a plurality of speech key ideographic data and a plurality of speech expression action data, and performing processing in combination with the speech emotion induced factor index to obtain semantic behavior pre-judgment response data comprises:
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data;
processing through a preset type semantic behavior detection model according to a plurality of voice key ideographic data and voice expression action data and the voice emotion induced factor index to obtain semantic behavior prejudging response data;
The formula for computing the semantic behavior pre-judgment response data is as follows (the formula appears only as an image in the source document and is not reproduced here):
wherein the quantities in the formula are: the semantic behavior pre-judgment response data; the z-th voice key ideographic data; the y-th voice expression action data; the preset-category native language recognition compensation factor; the voice emotion induced factor index; and preset characteristic coefficients.
7. The method for improving accuracy of speech recognition according to claim 6, wherein the correcting the semantic behavior pre-judgment response data according to the loop condition net state coefficient and the loop noise compensation coefficient to obtain semantic behavior recognition response correction data comprises:
correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior identification response correction data;
the correction formula for the semantic behavior recognition response correction data is as follows (the formula appears only as an image in the source document and is not reproduced here):
wherein the quantities in the formula are: the semantic behavior recognition response correction data; the semantic behavior pre-judgment response data; the ring-condition net state coefficient; the ring-sound disturbance compensation coefficient; and preset characteristic coefficients.
8. A system for improving speech recognition accuracy, the system comprising: the device comprises a memory and a processor, wherein the memory comprises a program for improving the voice recognition accuracy, and the program for improving the voice recognition accuracy realizes the following steps when being executed by the processor:
Collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information and voice environment information of a voice acquisition environment;
extracting voice fragment characteristic data according to the voice fragment information, and processing according to the voice fragment characteristic data to obtain a voice tone characteristic factor, a user emotion correction factor and a voice emotion induction factor index;
extracting sound field environment characteristic data and environment sound noise disturbance characteristic data according to the voice environment information, and respectively processing according to the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a ring condition net state coefficient and a ring sound disturbance compensation coefficient;
according to the data of the user attribute marking information, combining the voice fragment characteristic data, the sound field environment characteristic data and the environment sound noise disturbance characteristic data to obtain a corresponding preset type semantic picking recognition model and a semantic behavior recognition response threshold;
performing recognition processing on the voice fragment information according to the preset type semantic picking recognition model to obtain a plurality of voice key ideographic data and a plurality of voice expression action data, and combining the voice emotion induced factor index to perform processing to obtain semantic behavior prejudging response data;
Correcting the semantic behavior prejudging response data according to the ring condition net state coefficient and the ring sound disturbance compensation coefficient to obtain semantic behavior identification response correction data;
and comparing the threshold value with the semantic behavior recognition response threshold value according to the semantic behavior recognition response correction data, and judging the accuracy of voice behavior recognition of the user according to a threshold value comparison result.
9. The system for improving accuracy of speech recognition according to claim 8, wherein the collecting the speech segment information of the user in the preset time period and obtaining the user attribute mark information and the speech environment information of the speech obtaining environment comprises:
collecting voice fragment information of a user in a preset time period, and acquiring user attribute marking information;
acquiring voice environment information of the environment where the user voice is located;
and extracting user identity attribute characteristic data and user native language category marking data according to the user attribute marking information.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a method program for improving the accuracy of speech recognition, which method program, when being executed by a processor, implements the steps of the method for improving the accuracy of speech recognition according to any one of claims 1 to 7.
CN202311487594.9A 2023-11-09 2023-11-09 Method, system and medium for improving speech recognition accuracy Active CN117219058B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311487594.9A CN117219058B (en) 2023-11-09 2023-11-09 Method, system and medium for improving speech recognition accuracy


Publications (2)

Publication Number Publication Date
CN117219058A true CN117219058A (en) 2023-12-12
CN117219058B CN117219058B (en) 2024-02-06

Family

ID=89042951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311487594.9A Active CN117219058B (en) 2023-11-09 2023-11-09 Method, system and medium for improving speech recognition accuracy

Country Status (1)

Country Link
CN (1) CN117219058B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226742A (en) * 2007-12-05 2008-07-23 浙江大学 Method for recognizing sound-groove based on affection compensation
CN109935240A (en) * 2017-12-18 2019-06-25 上海智臻智能网络科技股份有限公司 Pass through the method for speech recognition mood
EP3716159A1 (en) * 2017-11-24 2020-09-30 Genesis Lab, Inc. Multi-modal emotion recognition device, method and storage medium using artificial intelligence
CN115019788A (en) * 2022-04-18 2022-09-06 厦门快商通科技股份有限公司 Voice interaction method, system, terminal equipment and storage medium
US20230048098A1 (en) * 2021-08-16 2023-02-16 Hong Kong Applied Science and Technology Research Institute Company Limited Apparatus and method for speech-emotion recognition with quantified emotional states
CN116665669A (en) * 2023-07-19 2023-08-29 上海海启科技有限公司 Voice interaction method and system based on artificial intelligence


Also Published As

Publication number Publication date
CN117219058B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN109145204B (en) Portrait label generation and use method and system
CN110990685B (en) Voiceprint-based voice searching method, voiceprint-based voice searching equipment, storage medium and storage device
CN107948255B (en) The method for pushing and computer readable storage medium of APP
CN109841214B (en) Voice wakeup processing method and device and storage medium
CN107515852A (en) Particular type of information recognition methods and device
CN112632318A (en) Audio recommendation method, device and system and storage medium
CN109117622A (en) A kind of identity identifying method based on audio-frequency fingerprint
CN109739354A (en) A kind of multimedia interaction method and device based on sound
CN108363765B (en) Audio paragraph identification method and device
CN107977678A (en) Method and apparatus for output information
CN104142831A (en) Application program searching method and device
CN117076941A (en) Optical cable bird damage monitoring method, system, electronic equipment and readable storage medium
CN111161746B (en) Voiceprint registration method and system
CN106933380B (en) A kind of update method and device of dictionary
CN114666618A (en) Audio auditing method, device, equipment and readable storage medium
CN117219058B (en) Method, system and medium for improving speech recognition accuracy
CN107948257B (en) The method for pushing and computer readable storage medium of APP
CN114298039A (en) Sensitive word recognition method and device, electronic equipment and storage medium
CN108777804B (en) Media playing method and device
CN111640450A (en) Multi-person audio processing method, device, equipment and readable storage medium
CN113782051B (en) Broadcast effect classification method and system, electronic equipment and storage medium
CN115019788A (en) Voice interaction method, system, terminal equipment and storage medium
CN112966296A (en) Sensitive information filtering method and system based on rule configuration and machine learning
CN106559759A (en) A kind of method and apparatus of intercepting multimedia message in a mobile device
CN113113051A (en) Audio fingerprint extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant