CN114333828A - Quick voice recognition system for digital product - Google Patents

Quick voice recognition system for digital product

Info

Publication number
CN114333828A
Authority
CN
China
Prior art keywords
user
module
voice
text
recording
Prior art date
Legal status
Pending
Application number
CN202210218615.6A
Other languages
Chinese (zh)
Inventor
周俊太
蒋博峰
Current Assignee
Shenzhen China Ark Information Industry Co ltd
Original Assignee
Shenzhen China Ark Information Industry Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen China Ark Information Industry Co ltd
Priority to CN202210218615.6A
Publication of CN114333828A
Legal status: Pending


Abstract

The invention relates to the field of voice recognition and discloses a rapid voice recognition system for digital products, comprising: a starting module, used for starting a program, managing program operation and sending run instructions; a recording module, used for recording voice data spoken by a user; a voiceprint recognition module, used for extracting voiceprint features from the user's voice data and determining whether the speaker is the user himself or herself; a binding module, used for binding user login information and recording the user's voiceprint features so as to unlock program operation; and a conversion module, used for converting voice data input by the user into text data in real time. The invention provides the user with error correction during voice recognition: it issues an error alert for voice data that cannot be recognized and offers similar text commands with high sentence similarity for the user to choose from, helping the user control the device quickly, so that the required instruction can still be obtained directly even when inaccurate voice data is input.

Description

Quick voice recognition system for digital product
Technical Field
The invention relates to the technical field of voice recognition, in particular to a rapid voice recognition system for digital products.
Background
Speech recognition is a cross-disciplinary field. With the development of science and technology, speech recognition has made remarkable progress and begun to move from the laboratory to the market, gradually entering fields such as industry, household appliances, communication, automotive electronics, medical treatment, home services and consumer electronics. Speech recognition technology enables a machine to convert speech signals into corresponding text or commands through a process of recognition and understanding, and mainly involves three aspects: feature extraction, pattern matching criteria and model training;
many intelligent digital products have also increasingly adopted speech recognition technology;
however, in the speech recognition systems carried on existing digital products, recognition still fails when a user's habitual wording differs from the template text recorded in the database, even when the two are close in meaning. The user cannot custom-edit the template text in the database, no error-correction help is provided, and the user experience suffers.
Disclosure of Invention
Technical problem to be solved
In view of the deficiencies of the prior art, the invention provides a rapid voice recognition system for digital products. It addresses the problems that speech recognition systems carried on existing digital products often fail to recognize input because a user's habitual wording differs from the template text recorded in the database even when their meanings are similar, that the user cannot custom-edit the template text in the database, and that no error-correction help is provided, all of which harm the user experience.
(II) Technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention discloses a rapid voice recognition system for digital products, which comprises:
the starting module is used for starting a program, managing program operation and sending run instructions;
the recording module is used for recording voice data spoken by a user;
the voiceprint recognition module is used for extracting voiceprint features from the user's voice data and determining whether the speaker is the user himself or herself;
the binding module is used for binding user login information and recording the user's voiceprint features so as to unlock program operation;
the conversion module is used for converting voice data input by the user into text data in real time;
the database module is used for recording the text data of trigger instructions and supports write operations;
the retrieval module is used for searching the database for the converted text data to find the corresponding text data;
the error correction module is used for issuing an error alert for text sentences that cannot be accurately recognized during error detection;
the replacement selection module is used for offering similar text commands with high sentence similarity for the user to choose from when a sentence triggers an error;
the memory module is used for recording the user's choices after repeated error corrections and storing an association between the corrected sentence and the correct text sentence in the database;
and the instruction sending module is used for sending the instruction corresponding to the final text.
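As a non-limiting illustration of how these modules could cooperate, the following Python sketch wires a retrieval step, a replacement-selection fallback and an instruction-sending step together; the class name, method names and the 0.4 similarity cutoff are assumptions of this illustration and not features defined by the invention.

```python
# Illustrative sketch only: names and interfaces are assumptions, not the patent's implementation.
import difflib
from dataclasses import dataclass, field

@dataclass
class RecognitionSystem:
    templates: dict                                   # database module: trigger text -> instruction
    corrections: dict = field(default_factory=dict)   # memory module: corrected text -> template

    def handle_utterance(self, text: str):
        """Route converted text through retrieval, error correction and instruction sending."""
        if text in self.corrections:                  # memory module shortcut
            text = self.corrections[text]
        if text in self.templates:                    # retrieval module hit
            return self.send(self.templates[text])
        candidates = self.similar(text)               # replacement selection module
        raise LookupError(f"Unrecognized command, candidates: {candidates}")

    def similar(self, text, limit=3):
        return difflib.get_close_matches(text, self.templates, n=limit, cutoff=0.4)

    def send(self, instruction):
        print("sending instruction:", instruction)    # instruction sending module
        return instruction
```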
Furthermore, the database module is interactively connected with a shortcut word module through a wireless network, and the shortcut word module is used for editing shortcut words so that they correspond to the related long-text trigger instructions.
Furthermore, the memory module is interactively connected with the database module through a wireless network; the memory module reports its recorded results to the database module in real time, so that the memorized result can be presented when the user records the same input again.
Furthermore, the binding module is interactively connected with the recording module through a wireless network; when a user logs in for the first time, the user's voiceprint features are captured through the recording module, reported to the binding module for recording, and program operation is unlocked.
The fast speech recognition method for digital products comprises the following steps:
Step 1: the user enters an initial voice and the voiceprint features are recorded;
Step 2: the user enters shortcut words to replace long voice instructions, and the shortcut words are stored as text in the database;
Step 3: the user wakes up the voice program with a specific phrase;
Step 4: after the voice is input, it is converted into text and an identification search is performed against the database text;
Step 5: if recognition is normal, the corresponding instruction is sent according to the text;
Step 6: if recognition is abnormal, the user is alerted and text instructions with similar meanings are offered for selection;
Step 7: the user selects and confirms one of the offered instruction options;
Step 8: sentences that have been corrected many times, together with the sentences the user subsequently selected, are recorded and uploaded to the database;
Step 9: the instruction is sent and the process ends.
Furthermore, the voiceprint features in Step 1 are expressed specifically as timbre, duration, intensity and pitch; after such features are extracted, voice parameters reflecting the speaker's physiological and behavioral characteristics in the voiceprint waveform are obtained;
when the voiceprint features are extracted, the input sound signal must be processed and analyzed to obtain a group of feature description vectors, which can be divided into auditory features and acoustic features: auditory features are sound characteristics that the human ear can identify and describe, while acoustic features are a group of acoustic description parameters extracted from the sound signal by a computer algorithm;
the feature extraction methods include: the Gaussian mixture model, joint factor analysis and deep neural network methods.
Further, the identification and retrieval process in Step 4 includes:
analyzing the voice signal to obtain its characteristic parameters, and then processing those parameters to form standard templates;
when text converted from voice enters the program, the system processes the voice signal and matches it against the templates in the reference database to obtain a recognition result.
Further, the abnormal-recognition reminding manner in Step 6 includes: announcing the error through a preset error-report voice prompt, and displaying a reminder by sending an error-report text message.
Further, the text instructions with similar meanings offered in Step 6 are specifically: among the abnormal text data and the text data recorded in the database, words with similar pinyin, words with a high degree of character overlap, and words with similar meanings.
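The sketch below shows one possible way to rank such candidates by combining pinyin similarity with character overlap; the pypinyin dependency and the equal weighting of the two scores are assumptions of this illustration, not requirements of the invention.

```python
# Sketch: rank database commands by pinyin similarity and character overlap (assumed scoring).
import difflib
from pypinyin import lazy_pinyin  # assumed third-party dependency

def candidate_commands(bad_text, templates, top_k=3):
    def pinyin(s):
        return " ".join(lazy_pinyin(s))
    def score(template):
        char_sim = difflib.SequenceMatcher(None, bad_text, template).ratio()
        pinyin_sim = difflib.SequenceMatcher(None, pinyin(bad_text), pinyin(template)).ratio()
        return (char_sim + pinyin_sim) / 2   # equal weighting is an assumption
    return sorted(templates, key=score, reverse=True)[:top_k]
```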
Furthermore, the voice command entry in Step 2 has an identification rate. The identification rate is the probability that the voice to be identified can correctly find the corresponding speaker in the target speaker set. When the speaker with the greatest similarity in the target speaker set is taken as the recognition result, the proportion recognized correctly is called the Top-1 recognition recall rate; when the N speakers with the greatest similarity in the target speaker set include the correct speaker and this is counted as correct, the resulting proportion is called the Top-N recognition recall rate. The recognition recall rate is calculated as:
Top-N = m / g
where m is the number of successful recalls and g is the number of test voices.
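As a small worked example of this formula, the function below counts a trial as a successful recall when the correct speaker appears among the N most similar candidates; the in-memory data layout is an assumption made only for illustration.

```python
# Sketch: Top-N recognition recall = successful recalls m / test voices g.
def top_n_recall(results, n):
    """results: list of (correct_speaker, ranked_candidate_speakers) per test voice."""
    m = sum(1 for correct, ranked in results if correct in ranked[:n])
    g = len(results)
    return m / g

# e.g. two test voices, one recalled within the top 2 -> Top-2 recall = 0.5
print(top_n_recall([("alice", ["alice", "bob"]), ("carol", ["bob", "dave"])], n=2))
```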
(III) Advantageous effects
Compared with the known prior art, the technical scheme provided by the invention has the following beneficial effects:
1. The invention provides the user with error correction during voice recognition: it issues an error alert for voice data that cannot be recognized and offers similar text commands with high sentence similarity for the user to choose from, improving the user experience and facilitating quick voice control.
2. After repeated error corrections, the invention automatically associates the corrected sentences with the correct template text in the database, so that in subsequent use the required instruction can be obtained directly even when inaccurate voice data is input; it also supports custom editing, allowing long instructions to be defined as words edited by the user.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic diagram of a fast speech recognition system for digital products;
FIG. 2 is a schematic structural flow chart of a fast speech recognition method for digital products;
FIG. 3 is a schematic illustration of a speech recognition demonstration process of the present invention;
the reference numerals in the drawings denote: 1. a starting module; 2. a recording module; 3. a voiceprint recognition module; 4. a binding module; 5. a conversion module; 6. a database module; 7. a shortcut word module; 8. a retrieval module; 9. an error correction module; 10. a permutation selection module; 11. a memory module; 12. and an instruction sending module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention will be further described with reference to the following examples.
Example 1
The fast speech recognition system for digital products of the present embodiment, as shown in fig. 1, includes:
the starting module 1 is used for starting a program, managing program operation and sending run instructions;
the recording module 2 is used for recording voice data spoken by a user;
the voiceprint recognition module 3 is used for extracting voiceprint features from the user's voice data and determining whether the speaker is the user himself or herself;
the binding module 4 is used for binding user login information and recording the user's voiceprint features so as to unlock program operation;
the conversion module 5 is used for converting voice data input by the user into text data in real time;
the database module 6 is used for recording the text data of trigger instructions and supports write operations;
the retrieval module 8 is used for searching the database for the converted text data to find the corresponding text data;
the error correction module 9 is used for issuing an error alert for text sentences that cannot be accurately recognized during error detection;
the replacement selection module 10 is used for offering similar text commands with high sentence similarity for the user to choose from when a sentence triggers an error;
the memory module 11 is used for recording the user's choices after repeated error corrections and storing an association between the corrected sentence and the correct text sentence in the database;
and the instruction sending module 12 is used for sending the instruction corresponding to the final text.
As shown in fig. 1, the database module 6 is interactively connected with a shortcut word module 7 through a wireless network, and the shortcut word module 7 is used for editing shortcut words so that they correspond to the related long-text trigger instructions.
As shown in fig. 1, the memory module 11 is interactively connected with the database module 6 through a wireless network; the memory module 11 reports its recorded results to the database module 6 in real time, so that the memorized result can be presented when the user records the same input again.
As shown in fig. 1, the binding module 4 is interactively connected with the recording module 2 through a wireless network; when a user logs in for the first time, the user's voiceprint features are captured through the recording module 2, reported to the binding module 4 for recording, and program operation is unlocked.
After the system is deployed, the user first enters an initial voice through the recording module 2; the voice is identified by the voiceprint recognition module 3 and recorded and bound by the binding module 4. When the starting module 1 is awakened and started, the user inputs a voice command. After the voiceprint recognition module 3 recognizes the speaker normally, the conversion module 5 converts the voice into text data, and the retrieval module 8 searches for it in the database module 6. If a result is matched, the instruction sending module 12 sends the corresponding command. If the result deviates, the error correction module 9 reminds the user, the replacement selection module 10 offers commands with similar meanings for selection, and after the user chooses, the instruction sending module 12 sends the instruction of the selected text. The memory module 11 records the user's choice after correction, uploads the recorded text data to the database module 6, and associates the corrected text with the text selected after correction. Through the shortcut word module 7 the user can quickly edit words, converting a long command into a self-defined word that is uploaded to the database module 6 for storage.
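The snippet below sketches the error-correction and memory-association loop described in this embodiment; the confirmation callback and the rule of storing an association after three identical corrections are assumptions introduced for the example, not limitations of the embodiment.

```python
# Sketch of the correction-and-memory loop; the threshold and callback are assumed.
from collections import Counter

correction_counts = Counter()   # (wrong text, chosen template) -> times chosen
ASSOCIATE_AFTER = 3             # assumed: remember after 3 identical corrections

def correct_and_remember(wrong_text, candidates, ask_user, corrections):
    """Let the user pick a candidate, then store an association after repeated picks."""
    chosen = ask_user(candidates)                        # replacement selection module
    correction_counts[(wrong_text, chosen)] += 1
    if correction_counts[(wrong_text, chosen)] >= ASSOCIATE_AFTER:
        corrections[wrong_text] = chosen                 # memory module writes the association
    return chosen
```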
Example 2
In other aspects, the present embodiment further provides a fast speech recognition method for digital products, as shown in fig. 2, comprising the following steps:
Step 1: the user enters an initial voice and the voiceprint features are recorded;
Step 2: the user enters shortcut words to replace long voice instructions, and the shortcut words are stored as text in the database;
Step 3: the user wakes up the voice program with a specific phrase;
Step 4: after the voice is input, it is converted into text and an identification search is performed against the database text;
Step 5: if recognition is normal, the corresponding instruction is sent according to the text;
Step 6: if recognition is abnormal, the user is alerted and text instructions with similar meanings are offered for selection;
Step 7: the user selects and confirms one of the offered instruction options;
Step 8: sentences that have been corrected many times, together with the sentences the user subsequently selected, are recorded and uploaded to the database;
Step 9: the instruction is sent and the process ends.
As shown in fig. 2, the voiceprint features in Step 1 are expressed specifically as timbre, duration, intensity and pitch; after such features are extracted, voice parameters reflecting the speaker's physiological and behavioral characteristics in the voiceprint waveform are obtained;
when the voiceprint features are extracted, the input sound signal must be processed and analyzed to obtain a group of feature description vectors, which can be divided into auditory features and acoustic features: auditory features are sound characteristics that the human ear can identify and describe, while acoustic features are a group of acoustic description parameters extracted from the sound signal by a computer algorithm;
the feature extraction methods include: the Gaussian mixture model, joint factor analysis and deep neural network methods.
As shown in fig. 2, the identification and retrieval process in Step 4 includes:
analyzing the voice signal to obtain its characteristic parameters, and then processing those parameters to form standard templates;
when text converted from voice enters the program, the system processes the voice signal and matches it against the templates in the reference database to obtain a recognition result.
As shown in fig. 2, the abnormal-recognition reminding manner in Step 6 includes: announcing the error through a preset error-report voice prompt, and displaying a reminder by sending an error-report text message.
As shown in fig. 2, the text instructions with similar meanings offered in Step 6 are specifically: among the abnormal text data and the text data recorded in the database, words with similar pinyin, words with a high degree of character overlap, and words with similar meanings.
As shown in fig. 2, the voice command entry in Step 2 has an identification rate. The identification rate is the probability that the voice to be identified can correctly find the corresponding speaker in the target speaker set. When the speaker with the greatest similarity in the target speaker set is taken as the recognition result, the proportion recognized correctly is called the Top-1 recognition recall rate; when the N speakers with the greatest similarity in the target speaker set include the correct speaker and this is counted as correct, the resulting proportion is called the Top-N recognition recall rate. The recognition recall rate is calculated as:
Top-N = m / g
where m is the number of successful recalls and g is the number of test voices.
Example 3
In this example, as shown in fig. 3, feature extraction must be performed first during voice input, for example pre-emphasis. During audio recording the high-frequency components are more easily attenuated, and the pronunciation of some phonemes contains a large proportion of high-frequency components; losing the high-frequency signal makes the formants of those phonemes less distinct and weakens the acoustic model's ability to model them. Pre-emphasis is a first-order high-pass filter that boosts the energy of the high-frequency part of the signal. Framing follows: the speech signal is a non-stationary, time-varying signal, but over a short time span it can be regarded as stationary and time-invariant. This short span is generally 10-30 ms, so when processing the speech signal it is segmented to reduce the influence of its overall non-stationary, time-varying nature; each segment is called a frame, and the frame length is generally 25 ms. To make the transition between frames smooth and keep them continuous, framing generally uses overlapping segmentation, ensuring that two adjacent frames overlap by a certain amount. The time difference between the starting positions of two adjacent frames is called the frame shift, and in practice the frame shift is generally 10 ms;
the signal is then transformed with the FFT. The FFT requires the signal either to extend from -∞ to +∞ or to be periodic, while in the real world only signals of finite duration can be acquired. Because the framed signal is non-periodic, frequency leakage occurs after the FFT, and to minimize this leakage error a weighting function, also called a window function, is applied. Windowing mainly makes the time-domain signal better satisfy the periodicity requirement of FFT processing and reduces leakage.
An acoustic model models pronunciation: it converts the speech input into an acoustic representation, giving the probability that the speech belongs to each acoustic symbol;
a language model represents the probability of a certain word sequence occurring and is a knowledge representation of how word sequences are composed. One of its functions is to resolve polyphonic characters: after the acoustic model outputs a pronunciation sequence, the character sequence with the maximum probability is found among the candidate character sequences;
given an input feature sequence, a search algorithm looks for the word sequence with the maximum probability in the search space formed by knowledge sources such as the acoustic model, the pronunciation dictionary and the language model; decoding is thereby completed and the final text is output.
In summary, the invention provides the user with error correction during voice recognition: it issues an error alert for unrecognized voice data and offers similar text commands with high sentence similarity for the user to choose from. After repeated corrections it automatically associates the corrected sentence with the correct template text in the database, so that in subsequent use the required instruction can be obtained directly even when inaccurate voice data is input, and it supports custom editing, allowing a long instruction to be defined as a word edited by the user.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (10)

1. A fast speech recognition system for digital products, comprising:
the starting module (1) is used for starting a program, managing program operation and sending run instructions;
the recording module (2) is used for recording voice data spoken by a user;
the voiceprint recognition module (3) is used for extracting voiceprint features from the user's voice data and determining whether the speaker is the user himself or herself;
the binding module (4) is used for binding user login information and recording the user's voiceprint features so as to unlock program operation;
the conversion module (5) is used for converting voice data input by the user into text data in real time;
the database module (6) is used for recording the text data of trigger instructions and supports write operations;
the retrieval module (8) is used for searching the database for the converted text data to find the corresponding text data;
the error correction module (9) is used for issuing an error alert for text sentences that cannot be accurately recognized during error detection;
the replacement selection module (10) is used for offering similar text commands with high sentence similarity for the user to choose from when a sentence triggers an error;
the memory module (11) is used for recording the user's choices after repeated error corrections and storing an association between the corrected sentence and the correct text sentence in the database;
and the instruction sending module (12) is used for sending the instruction corresponding to the final text.
2. The fast speech recognition system for digital products according to claim 1, wherein: the database module (6) is interactively connected with a shortcut word module (7) through a wireless network, and the shortcut word module (7) is used for editing shortcut words so that they correspond to the related long-text trigger instructions.
3. The fast speech recognition system for digital products according to claim 1, wherein: the memory module (11) is interactively connected with the database module (6) through a wireless network; the memory module (11) reports its recorded results to the database module (6) in real time, so that the memorized result can be displayed when the user records the same input again.
4. The fast speech recognition system for digital products according to claim 3, wherein: the binding module (4) is interactively connected with the recording module (2) through a wireless network; when a user logs in for the first time, the user's voiceprint features are captured through the recording module (2), reported to the binding module (4) for recording, and program operation is unlocked.
5. A fast speech recognition method for digital products, the method being implemented in a fast speech recognition system for digital products according to any one of claims 1 to 4, comprising the following steps:
Step 1: the user enters an initial voice and the voiceprint features are recorded;
Step 2: the user enters shortcut words to replace long voice instructions, and the shortcut words are stored as text in the database;
Step 3: the user wakes up the voice program with a specific phrase;
Step 4: after the voice is input, it is converted into text and an identification search is performed against the database text;
Step 5: if recognition is normal, the corresponding instruction is sent according to the text;
Step 6: if recognition is abnormal, the user is alerted and text instructions with similar meanings are offered for selection;
Step 7: the user selects and confirms one of the offered instruction options;
Step 8: sentences that have been corrected many times, together with the sentences the user subsequently selected, are recorded and uploaded to the database;
Step 9: the instruction is sent and the process ends.
6. The fast speech recognition method for digital products according to claim 5, wherein: the voiceprint features in Step 1 are expressed specifically as timbre, duration, intensity and pitch; after such features are extracted, voice parameters reflecting the speaker's physiological and behavioral characteristics in the voiceprint waveform are obtained;
when the voiceprint features are extracted, the input sound signal must be processed and analyzed to obtain a group of feature description vectors, which can be divided into auditory features and acoustic features: auditory features are sound characteristics that the human ear can identify and describe, while acoustic features are a group of acoustic description parameters extracted from the sound signal by a computer algorithm;
the feature extraction methods include: the Gaussian mixture model, joint factor analysis and deep neural network methods.
7. The fast speech recognition method for digital products according to claim 5, wherein: the identification and retrieval process in Step 4 includes:
analyzing the voice signal to obtain its characteristic parameters, and then processing those parameters to form standard templates;
when text converted from voice enters the program, the system processes the voice signal and matches it against the templates in the reference database to obtain a recognition result.
8. The fast speech recognition method for digital products according to claim 5, wherein: the abnormal-recognition reminding manner in Step 6 includes: announcing the error through a preset error-report voice prompt, and displaying a reminder by sending an error-report text message.
9. The fast speech recognition method for digital products according to claim 5, wherein: the text instructions with similar meanings selected in Step 6 are specifically: among the abnormal text data and the text data recorded in the database, words with similar pinyin, words with a high degree of character overlap, and words with similar meanings.
10. The fast speech recognition method for digital products according to claim 5, wherein: the voice command entry in Step 2 has an identification rate, which is the probability that the voice to be identified can correctly find the corresponding speaker in the target speaker set; when the speaker with the greatest similarity in the target speaker set is taken as the recognition result, the proportion recognized correctly is called the Top-1 recognition recall rate, and when the N speakers with the greatest similarity in the target speaker set include the correct speaker and this is counted as correct, the resulting proportion is called the Top-N recognition recall rate; the recognition recall rate is calculated as:
Top-N = m / g
where m is the number of successful recalls and g is the number of test voices.
CN202210218615.6A 2022-03-08 2022-03-08 Quick voice recognition system for digital product Pending CN114333828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210218615.6A CN114333828A (en) 2022-03-08 2022-03-08 Quick voice recognition system for digital product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210218615.6A CN114333828A (en) 2022-03-08 2022-03-08 Quick voice recognition system for digital product

Publications (1)

Publication Number Publication Date
CN114333828A true CN114333828A (en) 2022-04-12

Family

ID=81033021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210218615.6A Pending CN114333828A (en) 2022-03-08 2022-03-08 Quick voice recognition system for digital product

Country Status (1)

Country Link
CN (1) CN114333828A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117931299A (en) * 2023-12-29 2024-04-26 北京红旗软件有限公司 Intelligent Linux operating system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978964A (en) * 2014-04-14 2015-10-14 美的集团股份有限公司 Voice control instruction error correction method and system
CN106205613A (en) * 2016-07-22 2016-12-07 深圳智眸科技有限公司 A kind of navigation audio recognition method and system
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN107678561A (en) * 2017-09-29 2018-02-09 百度在线网络技术(北京)有限公司 Phonetic entry error correction method and device based on artificial intelligence
CN108735209A (en) * 2018-04-28 2018-11-02 广东美的制冷设备有限公司 Wake up word binding method, smart machine and storage medium
CN110675870A (en) * 2019-08-30 2020-01-10 深圳绿米联创科技有限公司 Voice recognition method and device, electronic equipment and storage medium
CN111462748A (en) * 2019-01-22 2020-07-28 北京猎户星空科技有限公司 Voice recognition processing method and device, electronic equipment and storage medium
CN111508505A (en) * 2020-04-28 2020-08-07 讯飞智元信息科技有限公司 Speaker identification method, device, equipment and storage medium
CN112349287A (en) * 2020-10-30 2021-02-09 深圳Tcl新技术有限公司 Display apparatus, control method thereof, slave apparatus, and computer-readable storage medium
CN113593556A (en) * 2021-07-26 2021-11-02 深圳市捌零零在线科技有限公司 Human-computer interaction method and device for vehicle-mounted voice operating system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220412