CN112331210B - Speech recognition device - Google Patents

Speech recognition device

Info

Publication number
CN112331210B
CN112331210B (application CN202110005142.7A)
Authority
CN
China
Prior art keywords
voice
data
unit
user
registered
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110005142.7A
Other languages
Chinese (zh)
Other versions
CN112331210A (en)
Inventor
黄海峰 (Huang Haifeng)
Current Assignee
Taiji Computer Corp Ltd
Original Assignee
Taiji Computer Corp Ltd
Priority date
Filing date
Publication date
Application filed by Taiji Computer Corp Ltd
Priority to CN202110005142.7A
Publication of CN112331210A
Application granted
Publication of CN112331210B
Status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/06 Network architectures or network communication protocols for network security for supporting key management in a packet data network
    • H04L63/067 Network architectures or network communication protocols for network security for supporting key management in a packet data network using one-time keys


Abstract

The invention discloses a speech recognition device comprising a preprocessing unit, a collation data generating unit, a collation unit, a registered-voice updating unit and a user-intention recognizing unit. By automatically updating the registered voice data in a timely manner and recognizing user intention, the device keeps the user's registered voice data effectively up to date, improves the accuracy of speech recognition in any given period, preserves the convenience of user verification, reduces false triggering of smart devices, and improves the user experience when using smart devices.

Description

Speech recognition device
Technical Field
The invention relates to the field of network information security, in particular to a voice data recognition device.
Background
With economic development, network information transmission has become ubiquitous. With advances in information technology and the spread of artificial-intelligence applications, more and more customer services are becoming intelligent, and people can interact with smart devices through simple voice input. Natural language processing is an important direction in computer science and artificial intelligence, and now that smart devices are widespread, it is common to perform security verification of a smart device, such as start-up verification, through natural-language voice input.
In current natural-speech processing for secure device activation, the voice samples collected at registration were typically recorded long ago. For the sake of verification safety, systems that maintain verification data, such as bank databases, generally prioritize the stability of the originally collected registration voice data to guarantee security. As a result, the registration voice data lacks freshness: when the stale registration sample is matched against live speech, the match score is low and the discriminative power of recognition suffers. Moreover, a voice-controlled device that has been forgotten or left continuously switched on is prone to misjudgment. Home devices usually remain in standby or powered states, and when the user's voice characteristics change, for example during a cold, or when the device picks up other people's speech or the user's own unintentional utterances, the device may start or refuse to start simply because sound was received. In short, existing devices cannot capture the user's true intention or verify it well during interaction, which degrades the user's confidence in the smart device.
In view of the above, a speech recognition device is needed that ensures voice data is recognized accurately and improves the user experience.
Disclosure of Invention
The invention provides a speech recognition device. A preprocessing unit performs pre-emphasis, framing and windowing on the voice signal, normalizes the preprocessed signal, and sends it to a collation data generating unit. The collation data generating unit generates collation voice data from the signal received from the preprocessing unit. A registration data storage unit stores the user's pre-registered voice data.
A collation unit compares the collation data generated by the collation data generating unit with the user's registered voice data stored in the registration data storage unit. The collation unit judges collation successful when the similarity between the input collation voice data and the registered data is equal to or greater than a threshold, and failed when the similarity is below the threshold. When the number of collation failures between the user's collation data and registration data exceeds a preset limit, security authentication is performed through another pre-registered channel; when that authentication passes, both the failed and the successful voice collation data are registered as verification data, forming a distribution of verification data over time.
A registered-voice updating unit updates the registered voice data according to the distribution of verification data over time maintained by the collation unit.
Further, the verification data includes the false rejection rate, namely the verification data for which the user passed another authentication channel while the voice collation had reported failure.
Further, the registered-voice updating unit updates the registered voice data by extracting time parameters that characterize the voice features of different time periods, and, from the distribution of verification data over time, selecting as new registered voice data the user input voice data that shows no difference in its current voice feature points but a stable decline in similarity.
Further, the registered-voice updating unit updates the registered voice when the current similarity remains continuously equal to or below a set threshold W, which indicates that a full update is required.
Further, when the current similarity remains continuously equal to or below a set threshold P but above the threshold W, the registered voice data is partially updated.
Further, the registered-voice updating unit stores two sets of registration data simultaneously, one being the original registered voice data and the other the updated new registered voice data; whether the non-original registered voice data is further updated is judged from the scores obtained by comparing both sets with the input collation voice data.
Further, the preprocessing unit obtains the starting point of the voice signal by zero-crossing and endpoint detection, and obtains the voice information through MFCC features.
The speech recognition device further comprises an intention recognizing unit, which collects the user's habitual operating language commands issued before initial operation commands, extracts features from correctly domain-classified historical data with a word segmentation tool and a term-frequency matrix to form a feature word list, and judges the user's contextual intention through the user's historical matching template.
Further, the other registration channels include mobile-phone random-code verification, SMS verification and mailbox-activation verification.
Further, the intention recognizing unit obtains the user's historical matching template by neural-network training and uses it to predict intention.
By providing the registered-voice updating unit and the user-intention recognizing unit, the invention keeps the user's registration data effectively updated, ensures the accuracy of speech recognition in any given period, and improves the user's experience with smart devices.
Drawings
The features and advantages of the present invention will be more clearly understood by reference to the accompanying drawings, which are schematic and should not be construed as limiting the invention in any way, in which:
FIG. 1 is a prior art speech recognition framework diagram;
Fig. 2 is a schematic diagram of the framework of the speech device of the present application.
Detailed Description
These and other features and characteristics of the present invention, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will be better understood upon consideration of the following description and the accompanying drawings, which form a part of this specification. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. It will be understood that the figures are not drawn to scale. Various block diagrams are used in the present invention to illustrate various variations of embodiments according to the present invention.
Example 1
Since various noises may exist in the human-computer interaction environment, speech recognition in noisy environments is divided, as shown in fig. 1, into a training stage and a recognition stage, comprising speech-signal preprocessing, endpoint detection, feature extraction, training and recognition.
Fig. 2 is a schematic diagram of the main functional blocks of the speech recognition apparatus according to the present invention. The voice signal is first preprocessed; in the preprocessing unit, at least pre-emphasis, framing and windowing are applied. Pre-emphasis boosts the high-frequency part of the speech so that the whole spectrum becomes flatter. Framing, realized by weighting with a sliding finite-length window, exploits the short-time stationarity of speech, and the window is chosen so that the main lobe is sharp and the side lobes are low. Endpoint detection locates the start and stop of the meaningful signal within a speech segment; under noise, short-time energy or the short-time zero-crossing rate alone cannot reliably detect the signal, so combining endpoint detection with the zero-crossing rate improves robustness. The speech features are Mel-frequency cepstral coefficients (MFCC), a feature set based on the auditory model of the human ear: the signal spectrum is converted from a linear frequency scale to the Mel scale in the frequency domain, then transformed to the cepstral domain to obtain the cepstral coefficients.
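The preprocessing chain just described, pre-emphasis, framing and windowing plus a simple energy and zero-crossing endpoint check, can be sketched as follows. The function names, frame sizes and thresholds below are illustrative assumptions, not values taken from the patent:

```python
import numpy as np

def preprocess(signal, sample_rate=16000, alpha=0.97,
               frame_len_ms=25, frame_shift_ms=10):
    """Pre-emphasis, framing and Hamming windowing (parameter values
    are illustrative assumptions)."""
    # Pre-emphasis: boost high frequencies so the spectrum becomes flatter
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Split into short overlapping frames for short-time stationarity
    frame_len = int(sample_rate * frame_len_ms / 1000)
    frame_shift = int(sample_rate * frame_shift_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    frames = np.stack([emphasized[i * frame_shift:i * frame_shift + frame_len]
                       for i in range(n_frames)])

    # Hamming window: sharp main lobe, low side lobes
    return frames * np.hamming(frame_len)

def is_speech_frame(frame, energy_thr=0.01, zcr_thr=0.1):
    """Simple endpoint check combining short-time energy and
    zero-crossing rate, as the passage above suggests."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    return energy > energy_thr or zcr > zcr_thr
```

In practice an MFCC front end (e.g. a filter-bank plus DCT step) would follow the windowing stage; it is omitted here for brevity.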
Once voice data is obtained, the collation data generating unit generates collation voice data from the information produced by the preprocessing unit. The registration data storage unit holds the user registration data registered in advance. The collation unit compares the generated collation data with the user's registered voice data in the registration data storage unit: it judges collation successful when the similarity between the collation data and the registration data is equal to or greater than a threshold, and failed when the similarity is below the threshold. When the number of collation failures between the user's collation data and registration data exceeds a predetermined period or count, security authentication is performed through another pre-registered channel; once that authentication passes, the earlier failed and successful collation data are registered and recorded as verification data, forming a distribution of similarity over time for these events. The other channels comprise verification via the contact details given at pre-registration: random codes, mailbox, SMS and the like.
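The collation logic above, a threshold comparison with fallback to secondary authentication after repeated failures and a log of verification events over time, might be sketched like this. The class name, field names, failure limit and threshold value are all assumptions for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class CollationUnit:
    """Accepts when similarity >= threshold; after several consecutive
    failures, signals fallback to secondary authentication (random code,
    SMS, mailbox). Keeps a (timestamp, similarity, passed) log from
    which the verification-data-over-time distribution can be built."""
    threshold: float = 0.8
    max_failures: int = 3
    failures: int = 0
    verification_log: list = field(default_factory=list)

    def collate(self, similarity, timestamp):
        passed = similarity >= self.threshold
        self.failures = 0 if passed else self.failures + 1
        self.verification_log.append((timestamp, similarity, passed))
        return passed

    def needs_secondary_auth(self):
        return self.failures >= self.max_failures
```

The log is what the registered-voice updating unit would later consume to decide whether the registration template has gone stale.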
In practice, a user's voice production changes over time, and if such a change persists for a period while the registration data is not updated, the verification failure rate rises even for the genuine speaker, reducing convenience. A time-statistics factor is therefore introduced: when acceptance through the auxiliary authentication channels increases while the false rejection rate of voice verification also increases, the user's voice feature information needs to be updated. When updating, the registered-voice updating unit extracts time parameters that characterize the voice features of different periods; optionally, the granularity can be fine (morning, noon, evening) or coarse (per day). Long-term results are used to judge the stored collation data, determining acceptance and rejection rates at different similarity levels, and users whose current voice features show no abnormal difference but whose voice quality differs temporarily over the long term are selected, by time, as the source of new registered verification data. The selection computes the voice feature data over each recent fixed period rather than using a long-term average, which preserves verification convenience over short spans, for example automatically refreshing the user's feature data during a cold, when the voice changes.
Optionally, the collation unit computes the similarity between the input voice data and the registered voice feature data and decides whether they match from that similarity. Each time the similarity between the registered voice and a comparison voice is computed by the collation data generating unit, it is stored together with the time, building a history of similarity over time. When the collation unit judges the registered voice and the comparison voice to match, the registered voice may be updated based on this similarity-over-time distribution: when the update determination unit judges the registered voice updatable, the registered-voice updating unit adopts the matching comparison voice as the new registered voice.
When the collation unit judges the registered voice and the comparison voice to match, the registered-voice updating unit computes the average of the similarities from the similarity-over-time distribution. When that average, or the past stored similarities, are above the set threshold P, the registered voice need not be updated; it is judged updatable when the current similarity remains continuously equal to or below the set threshold W. When the average similarity is above P and the current similarity has not stayed continuously below the set threshold, the registered voice is not updated.
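The two-threshold rule above (W triggers a full update, P a partial one, consistent with claims 2 and 3) can be written as a small decision function. The numeric values of W, P and the streak length are assumptions for illustration:

```python
def update_decision(recent_sims, W=0.6, P=0.85, streak=3):
    """Decide how to update the registered voiceprint from the recent
    similarity history. Returns 'full', 'partial' or 'none'.
    Threshold values are illustrative assumptions."""
    if len(recent_sims) < streak:
        return "none"
    last = recent_sims[-streak:]
    if all(s <= W for s in last):
        return "full"      # persistently at/below W: replace the template
    if all(W < s <= P for s in last):
        return "partial"   # between W and P: blend rather than replace
    return "none"          # similarity still high: keep the template
```

Requiring a *streak* of low scores, rather than a single one, guards against a one-off bad recording triggering an update.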
Optionally, to ensure that the update of the voice feature data is triggered only by a reliable trend, when the slope of the regression line fitted to the similarity-over-time data exceeds a predetermined angle and the collation similarity of the currently collected voice data is clearly below the set threshold Th, the device further checks whether the number of mismatch judgments has reached K. If the current collation data is judged inconsistent with the registered feature data, the regression slope exceeds the predetermined angle, and the mismatch count is K or more, the registrant's voice is judged to have changed significantly over time and re-registration is required.
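The trigger just described, regression slope over the similarity history plus a current-score check and a mismatch count, can be sketched as follows. The slope threshold, Th and K values are illustrative assumptions:

```python
import numpy as np

def needs_reregistration(times, sims, slope_thr=-0.01, Th=0.7, K=3,
                         mismatches=0):
    """Trigger re-registration when the regression line over the
    similarity-vs-time history falls steeply, the latest similarity is
    below the threshold Th, and at least K mismatch judgments have been
    recorded (all constants are illustrative assumptions)."""
    # Least-squares line through (time, similarity); [0] is the slope
    slope = np.polyfit(np.asarray(times, float),
                       np.asarray(sims, float), 1)[0]
    return slope < slope_thr and sims[-1] < Th and mismatches >= K
```

All three conditions must hold, so a noisy dip in one score does not by itself force the user to re-enroll.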
Another key point of the present scheme is that updating the registered voice data is driven by similarity, for which the two parameters W and P are set. As with the average similarity computed earlier, when the average similarity is above the threshold P, similarity is high and no update is necessary. Only when the average similarity lies between W and P is a weighted update performed in proportion to the similarity: with current similarity C, the input signal and the registered speech feature signal are weighted by the coefficients C and (1-C) respectively. The registered feature signal may also be updated by weighting a normalized superposition of several signals above W, selected by the time parameter. The threshold W means a complete update is performed, while P means a partial one.
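The weighted partial update can be written in one line over the feature vectors. This is a sketch: real features would be MFCC statistics or speaker embeddings rather than the toy vectors used here, and the weighting order follows the text above (input weighted by C, old template by 1-C):

```python
import numpy as np

def blend_template(registered, incoming, similarity):
    """Partial update of the registered feature vector: with current
    similarity C, weight the new input by C and the old registered
    template by (1 - C), as described in the passage above."""
    C = float(similarity)
    registered = np.asarray(registered, dtype=float)
    incoming = np.asarray(incoming, dtype=float)
    return C * incoming + (1.0 - C) * registered
```

At C = 1 the template is fully replaced (the W case); smaller C values shift the template only gradually toward the new input (the P case).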
Optionally, the voice signal is represented as a waveform image, and comparison is performed by weighted superposition of waveform images to extract corresponding feature points. The registration storage unit retrieves the updated feature data of the voice signal closest to the user's voice and judges whether it corresponds to the pre-registered feature data; if the similarity meets the set value, the voice is judged correct.
Optionally, the registered-voice updating unit stores two sets of registration data simultaneously: the original registration data and the updated new registration data. The input voice data is matched against both the original and the updated registration data, the two match scores are combined, and the combined score determines whether the newly stored registration data needs further updating. This preserves the integrity of the original data while keeping the verification information current, improving the user experience.
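A minimal version of this dual-template scoring might look like the following; the margin value and the three-way outcome are assumptions, since the patent only says the two scores are combined into a judgment:

```python
def choose_template(score_original, score_updated, margin=0.05):
    """Dual-template scoring sketch: the input voice is scored against
    both the original and the updated registration data, and the
    template whose score is clearly better is trusted. The margin is
    an illustrative assumption."""
    if score_updated >= score_original + margin:
        return "updated"   # new template clearly fits better
    if score_original >= score_updated + margin:
        return "original"  # drift hypothesis not supported; keep original
    return "either"        # scores close: no further update needed
```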
Optionally, for start-up authentication the speech recognition apparatus further includes an intention recognizing unit that performs an intention judgment after speech recognition. After the voice trigger, the authentication signal must also pass the intention judgment for verification to succeed; if the intention does not match, the device stays in standby, and optionally interacts with the user or waits for the user's next trigger, learning from the context. In the start-up authentication mode, the acquired voice data may further be checked against an intention judgment template. In practice, start commands are usually issued in a quiet environment, but a trigger may also occur during verification; the device therefore uses the user's historical voice command data and adds contextual meaning inference to the judgment of start-related speech, instead of directly extracting characteristic voice command data. Accordingly, before an initial operation command, the method of the present application collects the user's habitual operating language commands following previous verification commands, and extracts features from a large amount of correctly domain-classified historical data with a word segmentation tool and a term-frequency matrix to form a feature word list.
During real-time voice transcription, filler words are filtered out and the speech is automatically segmented, using the surrounding semantics, pause durations and the like. A domain-specific keyword library provides a keyword optimization function; entering professional vocabulary in advance markedly improves keyword recognition accuracy. Taking a user turning on an air conditioner as an example, keywords are extracted from the text of the user's single-request voice command within a preset time window, and matched against the occurrence probabilities of keywords in the historical command data, improving the accuracy of command recognition. Illustratively, when the user issues a start request for the air conditioner within the preset time, the currently applicable historical data is consulted, and after the keyword "turn on" is matched, the transition probabilities to words such as "cooling", "heating", "temperature", "turn down", "turn up" and "cold air" are checked. A template of the typical user is built from the user's usage habits, and intention matching is performed when the user's voice data is acquired, improving accuracy and preventing misjudgment.
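A toy version of this context check, following the air-conditioner example, might score a command by how many of its words occur in the user's historical keyword list after the trigger phrase is matched. The scoring scheme, trigger phrase and word lists are assumptions for illustration:

```python
def match_intent(command_text, history_keywords, trigger="turn on"):
    """Score a recognized command against the user's historical keyword
    list: 0.0 if the trigger phrase is absent, otherwise the fraction of
    the command's words found in the history (scheme is an assumption)."""
    text = command_text.lower()
    if trigger not in text:
        return 0.0
    words = text.split()
    hits = sum(1 for w in words if w in history_keywords)
    return hits / len(words)
```

A real system would use the word-frequency matrix and transition probabilities described above rather than this flat set lookup, but the flow is the same: trigger match first, then contextual scoring.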
Optionally, a transition-probability map is built from the historical statistical data and described by a tuple <S, A, T, R>, where S is the set of system states; A the set of system actions; T(s', a, s) the state-transition function, describing the probability that executing action a in state s leads to state s'; and R(s, a) the reward function, describing the immediate reward obtained when action a is executed in state s, known from historical user data. At each moment, the system is in a hidden state s, selects an action a according to its current belief distribution b, obtains an immediate reward r, and transitions to the next hidden state s', which depends on s and a. The statistically derived probabilities determine the confidence of the transition into state s'.
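The <S, A, T, R> tuple can be sketched minimally as below. For compactness T is stored here as a map from (state, action) to a distribution over next states, an equivalent restructuring of T(s', a, s); the toy tables in the usage are illustrative assumptions, not data from the patent:

```python
import random

class IntentPOMDP:
    """Minimal sketch of the <S, A, T, R> tuple described above."""
    def __init__(self, T, R):
        self.T = T  # {(s, a): {s_next: probability}}
        self.R = R  # {(s, a): immediate reward}

    def step(self, s, a, rng=random):
        """Execute action a in state s: sample the next state from
        T and return it with the immediate reward R(s, a)."""
        dist = self.T[(s, a)]
        s_next = rng.choices(list(dist), weights=list(dist.values()))[0]
        return s_next, self.R[(s, a)]
```

For example, a table with T[("standby", "turn_on")] = {"running": 1.0} models a trigger that, when confirmed by the intention check, deterministically moves the device out of standby.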
These probabilities can be obtained by direct input and training or by machine learning to yield the user's predicted transition probabilities, for example with an RNN: a classification and recognition model is trained on historical learning knowledge as samples, producing prediction probabilities that disambiguate the intention.
Example 2
A speech recognition apparatus, which may also be implemented in software, comprises a processor and a memory storing a computer program that, when executed by the processor, carries out the method steps of the functional apparatus of embodiment 1.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of the above types of memory.
As used in this application, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being: a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of example, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Claims (7)

1. A speech recognition apparatus, characterized in that:
the speech recognition apparatus comprises a preprocessing unit, which performs pre-emphasis, framing, and windowing on a voice signal, normalizes the preprocessed signal, and sends it to a collation data generation unit; the collation data generation unit generates collation voice data from the voice signal received from the preprocessing unit; a registration data storage unit stores the user's pre-registered voice data;
a collation unit compares the collation voice data generated by the collation data generation unit with the user's registered voice data stored in the registration data storage unit; the collation unit determines that collation has succeeded when the similarity between the input collation voice data and the registered voice data is equal to or greater than a threshold, and that collation has failed when the similarity is below the threshold; when the number of failed collations between the collation voice data and the user's registered voice data exceeds a preset number, security authentication is performed by another pre-registered method, and when that authentication passes, both the voice data that failed collation and the voice data that passed are recorded as verification data, forming a distribution of the verification data over time;
a registered voice updating unit updates the registered voice data according to the distribution of the verification data over time held by the collation unit;
to update the registered voice data, the registered voice updating unit extracts time parameters, characterizes the voice features of different time periods through these parameters, and, from the distribution of the verification data over time, determines as new registered voice data the collation voice data input by the user whose voice feature points are unchanged but whose similarity, while reduced, remains stable.
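Outside the claim language itself, the collation behaviour of claim 1 can be sketched in Python. This is an illustrative model only: the cosine-similarity measure, the threshold value, the failure limit, and all names (`CollationUnit`, `collate`, `secondary_auth_ok`) are assumptions rather than details taken from the patent.

```python
import time

THRESHOLD = 0.80   # assumed similarity threshold; the claim fixes no value
MAX_FAILURES = 3   # assumed "preset number" of allowed collation failures

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class CollationUnit:
    def __init__(self, registered):
        self.registered = registered
        self.failures = 0
        # (timestamp, features, passed) tuples: the distribution of
        # verification data over time used by the updating unit.
        self.verification_log = []

    def collate(self, features, secondary_auth_ok=None):
        sim = cosine_similarity(features, self.registered)
        if sim >= THRESHOLD:
            self.failures = 0
            self.verification_log.append((time.time(), features, True))
            return True
        self.failures += 1
        if self.failures > MAX_FAILURES and secondary_auth_ok:
            # Secondary authentication passed: also record the failed
            # sample so its similarity/time distribution can be examined.
            self.verification_log.append((time.time(), features, False))
            self.failures = 0
            return True
        return False
```

Keeping both the failed and the successful samples in the log is what lets a later update step distinguish a genuine drift in the user's voice from a one-off mismatch.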
2. The speech recognition apparatus of claim 1, wherein: the registered voice updating unit performs a full update of the registered voice data when the similarity remains continuously equal to or below a set threshold W, W being the level that indicates a complete update is required.
3. The speech recognition apparatus of claim 2, wherein: the registered voice updating unit performs a partial update of the registered voice data when the similarity remains continuously equal to or below a set threshold P while staying above the threshold W.
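The two-tier update policy of claims 2 and 3 can be sketched as follows. The concrete values of W and P, the run length, and the function name are hypothetical; the claims fix only their ordering (W below P) and the requirement that the condition hold continuously.

```python
# Hypothetical thresholds: W (full update) sits below P (partial update).
W = 0.60
P = 0.75

def update_action(recent_similarities, run_length=3):
    """Decide how to update the registered voice data from a run of
    recent verification similarities (most recent last)."""
    if len(recent_similarities) < run_length:
        return "none"
    run = recent_similarities[-run_length:]
    if all(s <= W for s in run):
        return "full"     # claim 2: persistently at or below W
    if all(W < s <= P for s in run):
        return "partial"  # claim 3: persistently between W and P
    return "none"
```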
4. The speech recognition apparatus of claim 3, wherein: the preprocessing unit locates the starting point of the voice signal through zero-crossing detection and endpoint detection, and obtains voice feature information through the MFCC algorithm.
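A minimal sketch of the preprocessing steps named in claims 1 and 4: pre-emphasis, framing, Hamming windowing, and a zero-crossing count usable for endpoint detection. The coefficient 0.97, the frame and hop sizes, and the function names are illustrative assumptions, and the full MFCC computation (mel filter bank plus DCT) is omitted.

```python
import math

def preprocess(signal, frame_len=256, hop=128, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a raw signal."""
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1] boosts high frequencies.
    emphasized = [signal[0]] + [signal[n] - alpha * signal[n - 1]
                                for n in range(1, len(signal))]
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = emphasized[start:start + frame_len]
        # Hamming window reduces spectral leakage at the frame edges.
        frame = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1)))
                 for i, s in enumerate(frame)]
        frames.append(frame)
    return frames

def zero_crossing_rate(frame):
    """Sign-change count per frame, a cheap cue for endpoint detection."""
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
```

In practice the starting point is found by scanning frames until both the zero-crossing rate and the short-time energy exceed noise-floor thresholds.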
5. The speech recognition apparatus of claim 4, wherein: the speech recognition apparatus further comprises an intention recognition unit, which collects the user's habitual spoken commands prior to the initial operating command, extracts features from the correctly field-classified historical data through a word segmentation tool and a term-frequency matrix to form a feature word list, and determines the user's contextual intention through the user's history matching template.
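The feature-word-list construction of claim 5 might be modeled as below. Whitespace tokenization stands in for a real word segmentation tool, a simple per-field term-frequency count stands in for the term-frequency matrix, and the function names, `top_k`, and the toy command history are all assumptions.

```python
from collections import Counter

def build_feature_vocab(labeled_history, top_k=5):
    """Build a per-field feature word list from correctly classified
    historical commands, keeping each field's most frequent terms."""
    by_field = {}
    for text, field in labeled_history:
        by_field.setdefault(field, Counter()).update(text.split())
    return {f: [w for w, _ in c.most_common(top_k)]
            for f, c in by_field.items()}

def match_intent(command, vocab):
    """Score a command against each field's feature words and return
    the best match — a stand-in for the user's history matching template."""
    tokens = set(command.split())
    scores = {f: len(tokens & set(words)) for f, words in vocab.items()}
    return max(scores, key=scores.get)
```

Claim 7 replaces this overlap scoring with a neural network trained on the same history, but the vocabulary-building step is unchanged.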
6. The speech recognition apparatus of any one of claims 1-5, wherein: the other pre-registered authentication methods include mobile-phone random-code verification, SMS verification, and mailbox activation verification.
7. The speech recognition apparatus of claim 5, wherein: the user history matching template in the intention recognition unit is trained using a neural network.
CN202110005142.7A 2021-01-05 2021-01-05 Speech recognition device Active CN112331210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110005142.7A CN112331210B (en) 2021-01-05 2021-01-05 Speech recognition device


Publications (2)

Publication Number Publication Date
CN112331210A 2021-02-05
CN112331210B (granted) 2021-05-18

Family

ID=74302101

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110005142.7A Active CN112331210B (en) 2021-01-05 2021-01-05 Speech recognition device

Country Status (1)

Country Link
CN (1) CN112331210B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107240397A (en) * 2017-08-14 2017-10-10 广东工业大学 A kind of smart lock and its audio recognition method and system based on Application on Voiceprint Recognition
CN108231082A (en) * 2017-12-29 2018-06-29 广州势必可赢网络科技有限公司 A kind of update method and device of self study Application on Voiceprint Recognition
EP3557501A1 (en) * 2018-04-20 2019-10-23 Facebook, Inc. Assisting users with personalized and contextual communication content
CN110400567A (en) * 2019-07-30 2019-11-01 深圳秋田微电子股份有限公司 Register vocal print dynamic updating method and computer storage medium
CN111261172A (en) * 2020-01-21 2020-06-09 北京爱数智慧科技有限公司 Voiceprint recognition method and device
CN112017642A (en) * 2019-05-31 2020-12-01 华为技术有限公司 Method, device and equipment for speech recognition and computer readable storage medium


Also Published As

Publication number Publication date
CN112331210A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN111712874B (en) Method, system, device and storage medium for determining sound characteristics
US11657832B2 (en) User presence detection
KR100655491B1 (en) Two stage utterance verification method and device of speech recognition system
US10013985B2 (en) Systems and methods for audio command recognition with speaker authentication
US7689418B2 (en) Method and system for non-intrusive speaker verification using behavior models
US6490560B1 (en) Method and system for non-intrusive speaker verification using behavior models
KR101622111B1 (en) Dialog system and conversational method thereof
WO2022134833A1 (en) Speech signal processing method, apparatus and device, and storage medium
JP4588069B2 (en) Operator recognition device, operator recognition method, and operator recognition program
US20090119103A1 (en) Speaker recognition system
CN108989349B (en) User account unlocking method and device, computer equipment and storage medium
CN110750774B (en) Identity recognition method and device
US20230042420A1 (en) Natural language processing using context
US11862170B2 (en) Sensitive data control
CN110544468A (en) Application awakening method and device, storage medium and electronic equipment
CN110853669B (en) Audio identification method, device and equipment
CN109065026B (en) Recording control method and device
JP4143541B2 (en) Method and system for non-intrusive verification of speakers using behavior models
CN111179941B (en) Intelligent device awakening method, registration method and device
CN112331210B (en) Speech recognition device
CN113330513A (en) Voice information processing method and device
CN112669836B (en) Command recognition method and device and computer readable storage medium
CN115691478A (en) Voice wake-up method and device, man-machine interaction equipment and storage medium
CN110419078B (en) System and method for automatic speech recognition
WO2021061512A1 (en) Multi-assistant natural language input processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant