CN108257604B - Speech recognition method, terminal device and computer-readable storage medium

Info

Publication number: CN108257604B
Application number: CN201711293919.4A
Authority: CN (China)
Prior art keywords: voice, input, segmented, input voice, data content
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN108257604A (en)
Inventor: 梁承飞
Current assignee: Ping An Puhui Enterprise Management Co Ltd
Original assignee: Ping An Puhui Enterprise Management Co Ltd
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority application: CN201711293919.4A

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L17/00: Speaker identification or verification techniques

Abstract

The invention belongs to the technical field of information processing and provides a voice recognition method, a terminal device and a computer-readable storage medium. The voice recognition method comprises: monitoring the comparison result between a received first input voice and a prestored first reference voice, and, when the comparison result is a match, calling a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice; when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, comparing the second input voice with the first segmented voice, and, when the two do not match, comparing the voiceprint features of the second input voice with those of the second segmented voice. The reference voice is thus updated during voice recognition, avoiding inaccurate recognition caused by natural changes in the human voice.

Description

Speech recognition method, terminal device and computer-readable storage medium
Technical Field
The present invention belongs to the field of information processing technologies, and in particular, to a speech recognition method, a terminal device, and a computer-readable storage medium.
Background
Biometric recognition technology is widely used in information verification services; existing biometric recognition techniques include face recognition, fingerprint recognition, iris recognition, voice recognition and the like.
In existing voice recognition schemes, after a reference voice is recorded in advance, the user's voice is collected in real time and compared acoustically with the reference voice, and voice recognition is completed according to the comparison result. However, the human voice changes naturally with age and other physiological changes; when such a change occurs, taking the previously recorded reference voice as the benchmark leads to inaccurate voice recognition results.
Disclosure of Invention
In view of this, embodiments of the present invention provide a speech recognition method, a terminal device and a computer-readable storage medium, so as to avoid inaccurate speech recognition caused by human voice changes.
A first aspect of an embodiment of the present invention provides a speech recognition method, including:
if a preset operation for voice recognition is detected, monitoring the comparison result between a first input voice received in the preset operation and a prestored first reference voice;
if the comparison result is that the first input voice matches the first reference voice, setting a mark stamp for the first input voice;
calling a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice;
when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice;
comparing the second input voice received in the re-detected preset operation with the first segmented voice by voiceprint features;
if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing the voiceprint features of the second input voice with those of the second segmented voice;
and if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice matches the second reference voice.
A second aspect of embodiments of the present invention provides a speech recognition apparatus, including means for performing the method of the first aspect.
A third aspect of an embodiment of the present invention provides a terminal device, including: a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of the first aspect when executing the computer program.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
When a preset operation for voice recognition is detected, the embodiment of the invention monitors the comparison result between a first input voice received in the preset operation and a prestored first reference voice; when the comparison result is a match, a mark stamp is set for the first input voice, and a voice splicing tool is called to splice the first input voice with the first reference voice to obtain a second reference voice. When the preset operation is detected again, the second reference voice is segmented into a first segmented voice and a second segmented voice; the newly detected second input voice is compared with the first segmented voice by voiceprint features to obtain a first matching rate, and whether to compare the voiceprint features of the second input voice with the second segmented voice is decided according to the comparison between the first matching rate and the first preset matching rate. The reference voice is thereby updated during voice recognition and can change along with the natural change of the same recognized person's voice, avoiding inaccurate voice recognition caused by changes in the human voice.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating an implementation of a speech recognition method according to another embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Referring to fig. 1, it is a flowchart of an implementation of a speech recognition method provided in an embodiment of the present invention, and the speech recognition method shown in fig. 1 may include:
S11: if a preset operation for voice recognition is detected, monitoring the comparison result between the first input voice received in the preset operation and a prestored first reference voice.
In step S11, the preset operation for voice recognition may be a trigger operation that starts voice recognition when a preset application is opened on the terminal, a trigger operation that is manually initiated to input a voice password while the preset application is in use, or a trigger operation that makes the current operation interface jump to the voice recognition interface during a permission request. The trigger operation may be implemented by a single click, a double click, or a continuous press of a voice recognition trigger button.
It should be noted that the comparison result between the first input voice and the prestored first reference voice reflects whether the source of the first input voice targeted by the voice recognition is the same as the source of the first reference voice.
In this embodiment, if the source of the first input voice is the same as that of the first reference voice, the first reference voice can be updated according to the first input voice; if the sources differ, it cannot. The comparison result between the first input voice received in the preset operation and the prestored first reference voice can be monitored indirectly: the terminal acquires the interface content displayed after voice recognition, judges whether it is consistent with the interface content corresponding to the preset operation, and thereby determines the comparison result between the first input voice and the prestored first reference voice.
Take a user paying for resources in a preset application through voice recognition as an example: the interface content corresponding to the preset operation is a prompt that the resource payment succeeded. If the interface content displayed after voice recognition prompts that the payment is incomplete or has failed, the comparison result between the first input voice and the prestored first reference voice is determined to be a mismatch; if it prompts that the payment succeeded, the comparison result is determined to be a match.
In other embodiments, the comparison result between the first input voice received in the preset operation and the prestored first reference voice can also be monitored by determining whether a newly added task or process exists, or by obtaining the content of the newly added task or process.
Voice recognition for account login is illustrated as another example.
For example, the preset operation is used to input a first input voice for login verification. When the comparison result between the first input voice and the prestored first reference voice is a match, voice recognition succeeds and the corresponding logged-in interface is loaded and displayed; when the comparison result is a mismatch, no operation is performed. The comparison result between the first input voice and the prestored first reference voice can therefore be determined by judging whether a newly added task or process that loads and displays the logged-in interface exists, as the sketch below illustrates.
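As a concrete illustration of this indirect monitoring, the following is a minimal sketch that infers the comparison result from the interface displayed after recognition; get_foreground_interface and expected_interface_id are hypothetical names, since the patent does not specify a concrete API:

```python
# Minimal sketch: infer the comparison result from the interface displayed
# after voice recognition. Both names below are hypothetical placeholders,
# not APIs named by the patent.

def comparison_matched(get_foreground_interface, expected_interface_id) -> bool:
    """Return True when the interface shown after voice recognition is the
    one the preset operation should lead to (e.g. the logged-in interface)."""
    current = get_foreground_interface()      # identifier of the visible interface
    return current == expected_interface_id   # match => the voices compared as equal
```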
S12: if the comparison result is that the first input voice matches the first reference voice, setting a mark stamp for the first input voice.
In step S12, the mark stamp is used to mark the first input voice and to reflect that the source of the first input voice is legitimate, i.e. that the source of the first input voice is the same as the source of the first reference voice.
It should be noted that the first input voice and the first reference voice each comprise a corresponding data header protocol and voice data content, where the data header protocol reflects at least the file size of the voice, the duration of the voice content and the voice format.
In this embodiment, the setting of the mark stamp for the first input voice may be setting a marker in a header protocol corresponding to the first input voice, or setting a mark keyword in a file name of the first input voice.
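By way of illustration only, the following minimal sketch sets a mark stamp using the second approach, a mark keyword in the file name; the "_verified" keyword is an assumed choice, not one fixed by the patent:

```python
import os

MARK_KEYWORD = "_verified"   # assumed mark keyword; the patent fixes none

def set_mark_stamp(path: str) -> str:
    """Mark an input voice file as having a legitimate source by inserting
    a mark keyword into its file name."""
    base, ext = os.path.splitext(path)
    stamped = base + MARK_KEYWORD + ext
    os.rename(path, stamped)   # the mark keyword now appears in the file name
    return stamped
```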
As a possible implementation manner, step S12 may include: if the comparison result is that the first input voice is matched with the first reference voice and the file format of the first input voice is consistent with that of the first reference voice, setting a mark stamp for the first input voice; if the comparison result is that the first input voice is matched with the first reference voice and the file format of the first input voice is not consistent with that of the first reference voice, a voice format conversion tool is called to convert the first input voice into a target input voice with the file format consistent with that of the first reference voice, and a mark stamp is set for the target input voice.
It is understood that the voice format conversion tool may be an existing voice file format conversion tool. For example, if the first input voice is in MP3 format and the first reference voice is in WAV format, the voice file format conversion tool is called to modify the suffix of the first input voice from ".mp3" to ".wav", so that the first input voice and the first reference voice can be spliced and played.
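A minimal sketch of this format-alignment step follows, assuming the pydub library as the conversion tool (the patent names no specific tool). Note that renaming the suffix alone leaves the encoding unchanged, so the sketch transcodes the data as well:

```python
from pydub import AudioSegment  # assumed third-party conversion tool

def to_wav(input_path: str) -> str:
    """Convert an input recording (e.g. MP3) to WAV so that its file format
    matches that of the reference voice before splicing."""
    audio = AudioSegment.from_file(input_path)        # decode MP3/WAV/...
    out_path = input_path.rsplit(".", 1)[0] + ".wav"
    audio.export(out_path, format="wav")              # re-encode as PCM WAV
    return out_path
```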
S13: calling a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice.
In step S13, the first input voice with the tag stamp is spliced with the first reference voice, specifically, the voice data of the first input voice with the tag stamp is spliced with the voice data of the first reference voice, and the spliced voice data is encapsulated with the new header protocol, so as to obtain the second reference voice.
It should be noted that the voice data corresponding to the second reference voice at least includes the voice data of the first input voice and the voice data of the first reference voice.
In this embodiment, the voice splicing tool is a script file for splicing the mark-stamped first input voice with the first reference voice, where the objects the script file operates on are the voice data of the mark-stamped first input voice and the voice data of the first reference voice.
It should be noted that voice splicing differs from voice synthesis: voice splicing joins the voice data in at least two voice files, either head-to-tail or by segmented interception. For head-to-tail splicing, the start timestamp position, the splicing-point timestamp position and the end timestamp position of the spliced voice data are determined within the voice data of the files. For segmented splicing, several voice segments to be spliced are obtained and joined into complete voice data according to a preset voice data segment splicing strategy.
It can be understood that, in the process of calling the voice splicing tool to obtain the second reference voice, the script file corresponding to the voice splicing tool may be written in an existing logic language; in practical applications, the object the voice splicing tool operates on may also be a voice splicing process.
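A minimal head-to-tail splicing sketch using Python's standard wave module is given below; it assumes both files are PCM WAV with identical channel count, sample width and sample rate (the patent's own splicing script is not published):

```python
import wave

def splice_wav(reference_path: str, input_path: str, out_path: str) -> None:
    """Append the mark-stamped input voice to the reference voice and
    re-encapsulate the joined voice data under a fresh header."""
    with wave.open(reference_path, "rb") as ref, wave.open(input_path, "rb") as inp:
        # head-to-tail splicing requires matching channels/width/rate
        assert ref.getparams()[:3] == inp.getparams()[:3], "voice formats must match"
        frames = ref.readframes(ref.getnframes()) + inp.readframes(inp.getnframes())
        params = ref.getparams()
    with wave.open(out_path, "wb") as out:
        out.setparams(params)     # the wave module rewrites header sizes on close
        out.writeframes(frames)   # spliced voice data content
```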
S14: when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice.
In step S14, the first segmented speech corresponds to the first reference speech, and the second segmented speech corresponds to the first input speech.
In this embodiment, "the first segmented voice corresponds to the first reference voice" means that the voice data of the first segmented voice is the voice data of the first reference voice, i.e. their voice contents are the same; likewise, the second segmented voice corresponds to the first input voice, i.e. the voice content of the second segmented voice is the same as that of the first input voice.
It should be noted that, to segment the second reference voice into the first segmented voice and the second segmented voice, a mark point distinguishing the two is set in the second reference voice: the mark position is determined according to the respective voice data lengths of the first reference voice and the first input voice, so that the second reference voice is divided into the first segmented voice and the second segmented voice.
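A sketch of this segmentation step follows; it simplifies the mark-point scheme to a recorded frame count for the original first reference voice, which is an assumption about how the mark position is stored:

```python
import wave

def split_reference(path: str, first_len_frames: int):
    """Divide the second reference voice into the first segmented voice
    (the old reference voice) and the second segmented voice (the old
    input voice), returned as raw PCM byte strings."""
    with wave.open(path, "rb") as w:
        first = w.readframes(first_len_frames)                     # first segmented voice
        second = w.readframes(w.getnframes() - first_len_frames)   # second segmented voice
    return first, second
```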
S15: comparing the second input voice received in the re-detected preset operation with the first segmented voice by voiceprint features.
In step S15, the voiceprint feature comparison between the second input voice and the first segmented voice is performed by drawing a target voiceprint map for the second input voice and a first voiceprint map for the first segmented voice, extracting the voiceprint features from the target voiceprint map, and comparing them against the first voiceprint map as the reference.
It is noted that the voiceprint map can be at least one of a broadband voiceprint, a narrowband voiceprint, an amplitude voiceprint, a contour voiceprint, a time-spectrum voiceprint and a cross-sectional voiceprint, where the cross-sectional voiceprint comprises a cross-sectional broadband voiceprint and a cross-sectional narrowband voiceprint. The broadband and narrowband voiceprint maps reflect how the frequency and intensity of the voice change over time; the amplitude, contour and time-spectrum voiceprint maps reflect how the voice intensity or sound pressure changes over time; the cross-sectional voiceprint reflects the intensity and frequency characteristics of the sound wave at a given time point.
In all embodiments of the present invention, when voiceprint maps are compared between voices, the two voiceprint maps being compared are of the same category.
In this embodiment, the voiceprint feature comparison between the second input voice and the first segmented voice may specifically compare similar features in the voiceprints of the same characters and words in the two voices. For example, the frequency values of the formants in the voiceprint maps of the second input voice and the first segmented voice are selected and compared to find the similarities and differences between the two voices.
It can be understood that, in practical applications, when comparing voiceprint features between the second input voice and the first segmented voice, the feature points used may differ depending on which voiceprint map the comparison is based on; specific voiceprint feature comparison schemes exist in the prior art and are not detailed here.
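The patent compares formant features read from voiceprint maps; as a stand-in for such a scheme, the sketch below scores similarity from MFCC features and cosine similarity using the librosa library, which is an assumed feature extractor rather than one prescribed by the patent:

```python
import numpy as np
import librosa  # assumed feature-extraction library

def matching_rate(path_a: str, path_b: str) -> float:
    """Rough matching rate in [0, 1] between two voices, computed from the
    cosine similarity of their time-averaged MFCC vectors."""
    feats = []
    for path in (path_a, path_b):
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape (20, n_frames)
        feats.append(mfcc.mean(axis=1))                     # average over time
    a, b = feats
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return (cos + 1.0) / 2.0  # map cosine in [-1, 1] to a matching rate in [0, 1]
```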
S16: if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing the voiceprint features of the second input voice with those of the second segmented voice.
In step S16, the first matching rate is used to reflect the comparison result between the second input speech and the first segmented speech. The first preset matching rate is used for describing the lowest matching rate standard when the comparison result of the second input voice and the first segmented voice is matched.
It should be noted that, in all embodiments of the present invention, the matching rate is used to describe the similarity degree between two voices being compared, that is, the higher the matching rate value is, the more similar the two voices being compared with the voiceprint feature are, and the greater the possibility of belonging to the same source is.
In this embodiment, the specific implementation of comparing the voiceprint features of the second input voice with those of the second segmented voice is similar to that in step S15 and is therefore not detailed here.
It is understood that in other embodiments of the present invention, the speech recognition method further comprises a first parallel step alongside step S16: if the first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than the first preset matching rate, determining that the second input voice matches the second reference voice.
It should be noted that there is no fixed execution order between step S16 and the first parallel step; they are mutually exclusive: the first parallel step is not executed after step S16 is executed, and step S16 is not executed after the first parallel step is executed.
S17: if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice matches the second reference voice.
In step S17, the second matching rate is used to reflect the comparison result between the second input speech and the second segmented speech. The second preset matching rate is used for describing the lowest matching rate standard when the comparison result of the second input voice and the second segmented voice is matched.
It should be noted that, in all embodiments of the present invention, the matching rate is used to describe the similarity degree between two voices being compared, that is, the higher the matching rate value is, the more similar the two voices being compared with the voiceprint feature are, and the greater the possibility of belonging to the same source is.
In this embodiment, the specific implementation manner of comparing the voiceprint characteristics of the second input voice with the second segmented voice is similar to that in step S15, and therefore, the detailed description is omitted here.
In the present embodiment, the speech recognition method further includes a second parallel step parallel to step S17: if a second matching rate obtained by comparing the second input voice with the second segmented voice is smaller than a second preset matching rate, determining that the second input voice is not matched with the second reference voice; wherein the first preset matching rate is equal to the second preset matching rate.
It should be noted that there is no fixed execution order between step S17 and the second parallel step; they are mutually exclusive: the second parallel step is not executed after step S17 is executed, and step S17 is not executed after the second parallel step is executed.
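The control flow of steps S15 to S17 together with the two parallel steps can be sketched as follows, reusing the matching_rate sketch above; the numeric thresholds are illustrative, since the patent fixes no values and only states that the two preset rates are equal:

```python
FIRST_PRESET_RATE = 0.80    # illustrative value only
SECOND_PRESET_RATE = 0.80   # the embodiment sets the two preset rates equal

def recognize(second_input, first_segment, second_segment) -> bool:
    r1 = matching_rate(second_input, first_segment)    # S15
    if r1 >= FIRST_PRESET_RATE:                        # first parallel step
        return True                                    # matches the second reference voice
    r2 = matching_rate(second_input, second_segment)   # S16
    return r2 >= SECOND_PRESET_RATE                    # S17 / second parallel step
```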
As can be seen from the above, in the embodiment of the present invention, when a preset operation for voice recognition is detected, the comparison result between a first input voice received in the preset operation and a prestored first reference voice is monitored; when the comparison result is a match, a mark stamp is set for the first input voice, and a voice splicing tool is called to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice. When the preset operation is detected again, the second reference voice is segmented into a first segmented voice and a second segmented voice; the newly detected second input voice is compared with the first segmented voice by voiceprint features to obtain a first matching rate, and whether to compare the voiceprint features of the second input voice with the second segmented voice is decided according to the comparison between the first matching rate and the first preset matching rate. The reference voice is thereby updated during voice recognition and changes along with the natural change of the same recognized person's voice, avoiding inaccurate voice recognition caused by changes in the human voice.
Referring to fig. 2, fig. 2 is a schematic flow chart of a speech recognition method according to another embodiment of the present invention. As shown in fig. 2, a speech recognition method according to another embodiment of the present invention may include:
S21: if a preset operation for voice recognition is detected, monitoring the comparison result between the first input voice received in the preset operation and a prestored first reference voice.
In step S21, the preset operation for voice recognition may be a trigger operation that starts voice recognition when a preset application is opened on the terminal, a trigger operation that is manually initiated to input a voice password while the preset application is in use, or a trigger operation that makes the current operation interface jump to the voice recognition interface during a permission request. The trigger operation may be implemented by a single click, a double click, or a continuous press of a voice recognition trigger button.
It is understood that in the present embodiment, a specific implementation manner of the step S21 is the same as the specific implementation manner of the step S11 in the previous embodiment, and please refer to the description of the step S11, which is not described herein again.
S22: if the comparison result is that the first input voice matches the first reference voice, setting a mark stamp for the first input voice.
In step S22, the mark stamp is used to mark the first input voice and to reflect that the source of the first input voice is legitimate, i.e. that the source of the first input voice is the same as the source of the first reference voice.
It is understood that in the present embodiment, a specific implementation manner of the step S22 is the same as the specific implementation manner of the step S12 in the previous embodiment, and please refer to the description of the step S12, which is not described herein again.
S23: calling a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice.
In step S23, the voice splicing tool includes a data header protocol tool and a data content splicing tool; the first input voice and the first reference voice each include a data header protocol and voice data content.
As a possible implementation manner of this embodiment, step S23 may specifically include: calling the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice; generating a new data header protocol from the first data header protocol and the second data header protocol; calling the data content splicing tool to splice the first voice data content and the second voice data content into new voice data content; and encapsulating the new data header protocol with the new voice data content to obtain the second reference voice.
In this embodiment, the data header protocol tool may be a preset WavHeader.h script; by executing the header-protocol parsing content in the script, the first data header protocol and first voice data content, and the second data header protocol and second voice data content, are obtained.
The WavHeader.h script defines and distinguishes the number of bits of each parameter in a voice data header protocol and the number of bits occupied by the voice data content. Running the WavHeader.h script splits the first input voice and the first reference voice into the first data header protocol and first voice data content corresponding to the first input voice, and the second data header protocol and second voice data content corresponding to the first reference voice.
In this embodiment, the first data header protocol and the second data header protocol respectively describe the voice duration, voice size and other attributes of the first input voice and the first reference voice. In the new data header protocol generated from them, the described voice duration is the sum of the durations of the first input voice and the first reference voice, and the described voice size is the sum of their sizes.
The data content splicing tool may include a voice data reading tool DataRead and a voice data writing tool DataWriter.
It should be noted that both the voice data reading tool DataRead and the voice data writing tool DataWriter can be packaged and read via corresponding binary data streams.
In this embodiment, a new header protocol and new voice data content are encapsulated to obtain a second reference voice, where various voice data parameters in the new header protocol correspond to the new voice data content, that is, voice duration information and voice size information in the new header protocol are consistent with the duration and size of the voice data content.
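For the canonical 44-byte PCM WAV header, the split and merge behaviour described above can be sketched as follows; the patent's WavHeader.h script is not published, so the field offsets here follow the standard RIFF/WAVE layout:

```python
import struct

HEADER_FMT = "<4sI4s4sIHHIIHH4sI"   # RIFF / fmt / data chunks of a simple PCM WAV

def split_voice(raw: bytes):
    """Split a WAV file's bytes into (data header protocol fields, voice data content)."""
    return struct.unpack(HEADER_FMT, raw[:44]), raw[44:]

def merge_voices(fields_a, data_a: bytes, fields_b, data_b: bytes) -> bytes:
    """Generate a new data header protocol whose data size (and hence duration,
    via the byte rate) is the sum of both inputs, and encapsulate it with the
    spliced voice data content."""
    data = data_a + data_b
    new = list(fields_a)        # reuse the format parameters of the first voice
    new[1] = 36 + len(data)     # RIFF chunk size = 36 + data size
    new[12] = len(data)         # data chunk size
    return struct.pack(HEADER_FMT, *new) + data
```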
S24: when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice.
S25: comparing the second input voice received in the re-detected preset operation with the first segmented voice by voiceprint features.
S26: if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing the voiceprint features of the second input voice with those of the second segmented voice.
S27: if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice matches the second reference voice.
It should be noted that, the specific implementation manner of steps S24 to S27 in the present embodiment corresponds to steps S14 to S17 in the previous embodiment, and please refer to the description of steps S14 to S17, which is not described herein again.
It is understood that, in the present embodiment, the step S27 is performed only when the step S26 is performed.
In the present embodiment, step S27 is followed by step S28 and step S29.
Step S28: setting the count value in the counter to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n equals the preset matching threshold N, setting a mark stamp for the second input voice.
In step S28, I_n, n and N are integers, n ≥ 1 and N > 1.
In this embodiment, the result that the second input voice matches the second reference voice occurs at most once during each voice recognition. When voice recognition has been performed n times and every recognition result is a match between the second input voice and the second reference voice, the count value in the counter is I_n; that is, each time the second input voice is determined to match the second reference voice, the count value is set to I_n = I_{n-1} + 1, with I_n ≥ 0. When I_n equals the preset matching threshold N, the match between the second input voice and the second reference voice is regarded as an established rather than accidental event, i.e. the possibility that the match is a coincidence is eliminated.
In practical applications, the preset matching threshold may be determined according to the period over which the voice changes, according to the number of comparisons in which the second reference voice has served as the comparison standard, or according to how long the second reference voice has been in use.
It should be noted that a mark stamp is set for the second input voice; the mark stamp is used to mark the second input voice and to reflect that its source is legitimate, i.e. that the source of the second input voice is the same as the source of the second reference voice.
S29: calling a voice splicing tool to splice the mark-stamped second input voice with a target voice segment in the second reference voice to obtain a third reference voice, where the target voice is the voice segment corresponding to the first input voice.
In step S29, the second reference voice includes the voice data content corresponding to the first input voice and the voice data content corresponding to the first reference voice; the target voice is the voice segment in the second reference voice corresponding to the first input voice.
It should be noted that, to prevent the reference voice from growing continuously as the number of voice recognitions increases, when the voice splicing tool is called to splice the mark-stamped second input voice with the target voice segment in the second reference voice, the target voice is the voice segment in the second reference voice corresponding to the first input voice; the oldest segment (the original first reference voice) is thus discarded and the reference voice keeps a bounded length.
In this embodiment, a preset matching threshold N is set; when it is determined that the second input voice matches the second reference voice, the count value in the counter is set to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1. When I_n equals the preset matching threshold N, a mark stamp is set for the second input voice, and the voice splicing tool is called to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain a third reference voice. The reference voice used for voice recognition is thus continuously updated, ensuring that it changes along with the change of the user's voice, while avoiding the phenomenon that the matching rate gradually decreases as the reference voice is updated.
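The counter logic of steps S28 and S29 can be condensed into a small sketch; resetting the counter after a re-splice is an assumption, since the patent does not state what happens to I_n once the third reference voice has been produced:

```python
class ReferenceUpdater:
    """Counts matches; after N consecutive ones the reference is re-spliced."""

    def __init__(self, threshold_n: int):
        assert threshold_n > 1   # the embodiment requires N > 1
        self.threshold_n = threshold_n
        self.count = 0           # I_0 = 0

    def on_match(self) -> bool:
        """Call once per recognition in which the second input voice matched
        the second reference voice; returns True when I_n == N, i.e. when a
        mark stamp should be set and the third reference voice spliced."""
        self.count += 1                      # I_n = I_{n-1} + 1
        if self.count == self.threshold_n:
            self.count = 0                   # assumed reset for the next window
            return True
        return False
```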
As can be seen from the above, in the embodiment of the present invention, when a preset operation for voice recognition is detected, the comparison result between a first input voice received in the preset operation and a prestored first reference voice is monitored; when the comparison result is a match, a mark stamp is set for the first input voice, and a voice splicing tool is called to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice. When the preset operation is detected again, the second reference voice is segmented into a first segmented voice and a second segmented voice; the newly detected second input voice is compared with the first segmented voice by voiceprint features to obtain a first matching rate, and whether to compare the voiceprint features of the second input voice with the second segmented voice is decided according to the comparison between the first matching rate and the first preset matching rate. The reference voice is thereby updated during voice recognition and changes along with the natural change of the same recognized person's voice, avoiding inaccurate voice recognition caused by changes in the human voice.
A preset matching threshold N is further set: when the second input voice is determined to match the second reference voice, the count value in the counter is set to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n equals N, a mark stamp is set for the second input voice and the voice splicing tool is called to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain a third reference voice. The reference voice used for voice recognition is thus continuously updated, ensuring that it changes along with the user's voice while avoiding a gradual decrease in the matching rate caused by updating the reference voice.
Referring to fig. 3, fig. 3 is a schematic block diagram of a speech recognition apparatus according to an embodiment of the present invention. The speech recognition apparatus 3 of this embodiment includes: a monitoring unit 31, a first marking unit 32, a first splicing unit 33, a segmentation unit 34, a first comparison unit 35, a second comparison unit 36 and a determination unit 37. Specifically:
the monitoring unit 31 is configured to monitor a comparison result between a first input voice received in a preset operation and a pre-stored first reference voice if the preset operation for performing voice recognition is detected.
For example, if the monitoring unit 31 detects a preset operation for performing voice recognition, it monitors a comparison result between a first input voice received in the preset operation and a pre-stored first reference voice.
A first marking unit 32, configured to set a marking stamp for the first input voice if the comparison result is that the first input voice matches the first reference voice.
For example, if the comparison result is that the first input voice matches the first reference voice, the first labeling unit 32 sets a labeling stamp for the first input voice.
A first splicing unit 33, configured to call a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice.
For example, the first splicing unit 33 calls a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain the second reference voice.
Further, the voice splicing tool comprises a data header protocol tool and a data content splicing tool; the first input voice and the first reference voice each comprise a data header protocol and voice data content.
The first splicing unit 33 is specifically configured to: call the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice; generate a new data header protocol from the first data header protocol and the second data header protocol; call the data content splicing tool to splice the first voice data content and the second voice data content into new voice data content; and encapsulate the new data header protocol with the new voice data content to obtain the second reference voice.
For example, the first splicing unit 33 calls the data header protocol tool to split the first input voice and the first reference voice respectively to obtain the first data header protocol and first voice data content, and the second data header protocol and second voice data content; generates a new data header protocol from the two; calls the data content splicing tool to splice the two voice data contents into new voice data content; and encapsulates the new data header protocol with the new voice data content to obtain the second reference voice.
A segmentation unit 34, configured to segment, when the preset operation is detected again, the second reference voice into a first segmented voice and a second segmented voice, where the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice.
For example, when the preset operation is detected again, the segmentation unit 34 segments the second reference voice into a first segmented voice corresponding to the first reference voice and a second segmented voice corresponding to the first input voice.
A first comparison unit 35, configured to compare, by voiceprint features, the second input voice received in the re-detected preset operation with the first segmented voice.
For example, the first comparison unit 35 compares, by voiceprint features, the second input voice received in the re-detected preset operation with the first segmented voice.
A second comparison unit 36, configured to compare the voiceprint features of the second input voice with those of the second segmented voice if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate.
For example, if the first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than the first preset matching rate, the second comparison unit 36 compares the voiceprint features of the second input voice with those of the second segmented voice.
A determination unit 37, configured to determine that the second input voice matches the second reference voice if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate.
For example, if the second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than the second preset matching rate, the determination unit 37 determines that the second input voice matches the second reference voice.
Optionally, the speech recognition apparatus 3 may further include a second marking unit 38 and a second splicing unit 39. Specifically:
A second marking unit 38, configured to set the count value in the counter to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1, and, when I_n equals the preset matching threshold N, to set a mark stamp for the second input voice, where n ≥ 1 and N > 1.
For example, the second marking unit 38 sets the count value in the counter to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n equals the preset matching threshold N, it sets a mark stamp for the second input voice, where n ≥ 1 and N > 1.
A second splicing unit 39, configured to call a voice splicing tool to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain a third reference voice, where the target voice is the voice segment corresponding to the first input voice.
For example, the second splicing unit 39 calls a voice splicing tool to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain the third reference voice, where the target voice is the voice segment corresponding to the first input voice.
As can be seen from the above, in the embodiment of the present invention, when a preset operation for voice recognition is detected, the comparison result between a first input voice received in the preset operation and a prestored first reference voice is monitored; when the comparison result is a match, a mark stamp is set for the first input voice, and a voice splicing tool is called to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice. When the preset operation is detected again, the second reference voice is segmented into a first segmented voice and a second segmented voice; the newly detected second input voice is compared with the first segmented voice by voiceprint features to obtain a first matching rate, and whether to compare the voiceprint features of the second input voice with the second segmented voice is decided according to the comparison between the first matching rate and the first preset matching rate. The reference voice is thereby updated during voice recognition and changes along with the natural change of the same recognized person's voice, avoiding inaccurate voice recognition caused by changes in the human voice.
A preset matching threshold N is further set: when the second input voice is determined to match the second reference voice, the count value in the counter is set to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n equals N, a mark stamp is set for the second input voice and the voice splicing tool is called to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain a third reference voice. The reference voice used for voice recognition is thus continuously updated, ensuring that it changes along with the user's voice while avoiding a gradual decrease in the matching rate caused by updating the reference voice.
Referring to fig. 4, a schematic block diagram of a terminal device according to another embodiment of the present invention is shown. The terminal device 400 in this embodiment may include: one or more processors 401, one or more input devices 402, one or more output devices 403 and a memory 404. The processor 401, the input device 402, the output device 403 and the memory 404 are connected by a bus 405. The memory 404 is used for storing a computer program comprising instructions, and the processor 401 performs the following operations by calling the computer program stored in the memory 404:
the processor 401 is configured to: if the preset operation for voice recognition is detected, monitoring a comparison result of the first input voice received in the preset operation and a pre-stored first reference voice.
The processor 401 is configured to: and if the comparison result is that the first input voice is matched with the first reference voice, setting a mark stamp for the first input voice.
The processor 401 is configured to: and calling a voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain second reference voice.
The processor 401 is configured to: when the preset operation is detected again, dividing the second reference voice segment into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice, and the second segmented voice corresponds to the first input voice.
The processor 401 is configured to: and comparing the second input voice received in the redetected preset operation with the first segmented voice by voiceprint features.
The processor 401 is configured to: and if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing the voiceprint characteristics of the second input voice with the second segmented voice.
The processor 401 is configured to: and if a first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than a first preset matching rate, determining that the second input voice is matched with the second reference voice.
The processor 401 is further configured to: and if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice is matched with the second reference voice.
The processor 401 is further configured to: if a second matching rate obtained by comparing the second input voice with the second segmented voice is smaller than a second preset matching rate, determining that the second input voice is not matched with the second reference voice; wherein the first preset matching rate is equal to the second preset matching rate.
The processor 401 is further configured to: setting the count value in the counter to InWherein, InNot less than 0 and In=In-1+1, when InAnd when the preset matching threshold value N is equal, setting a marking stamp for the second input voice, wherein N is more than or equal to 1, and N is more than 1.
The processor 401 is further configured to: and calling a voice splicing tool to splice the second input voice with the mark stamp and a target voice section in the second reference voice to obtain a third reference voice, wherein the target voice is a voice section corresponding to the first input voice.
The processor 401 is specifically configured to call the voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain the second reference voice, including:
calling the data header protocol tool to split the first input voice and the first reference voice respectively, to obtain a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice;
generating a new data header protocol from the first data header protocol and the second data header protocol;
calling the data content splicing tool to splice the first voice data content and the second voice data content to obtain new voice data content;
and encapsulating the new data header protocol with the new voice data content to obtain the second reference voice.
It should be understood that, in the embodiment of the present invention, the processor 401 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The input device 402 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, etc., and the output device 403 may include a display (LCD, etc.), a speaker, etc.
The memory 404 may include a read-only memory and a random access memory, and provides instructions and data to the processor 401. A portion of the memory 404 may also include non-volatile random access memory. For example, the memory 404 may also store device type information.
In a specific implementation, the processor 401, the input device 402, and the output device 403 described in this embodiment of the present invention may execute the implementation manners described in the first embodiment and the second embodiment of the speech recognition method provided in this embodiment of the present invention, and may also execute the implementation manners of the devices described in this embodiment of the present invention, which is not described herein again.
In another embodiment of the invention, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements:
if the preset operation for voice recognition is detected, monitoring a comparison result of a first input voice received in the preset operation and a prestored first reference voice;
if the comparison result is that the first input voice is matched with the first reference voice, setting a mark stamp for the first input voice;
calling a voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain second reference voice;
when the preset operation is detected again, dividing the second reference voice segment into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice, and the second segmented voice corresponds to the first input voice;
comparing the second input voice received in the redetected preset operation with the first segmented voice by voiceprint features;
if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing voiceprint features of the second input voice with those of the second segmented voice;
and if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice is matched with the second reference voice.
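As a minimal sketch of this two-stage comparison, assume a function compare_voiceprint(a, b) that returns a matching rate in [0, 1]; both the function and the threshold values are assumptions for illustration, since this embodiment does not fix a particular voiceprint algorithm or preset rate:

```python
FIRST_PRESET_RATE = 0.80   # first preset matching rate (assumed value)
SECOND_PRESET_RATE = 0.80  # second preset matching rate (assumed value)

def matches_second_reference(second_input, first_segment, second_segment,
                             compare_voiceprint):
    # Stage 1: compare against the first segmented voice, i.e. the part
    # of the reference that came from the older first reference voice.
    first_rate = compare_voiceprint(second_input, first_segment)
    if first_rate >= FIRST_PRESET_RATE:
        return True
    # Stage 2: only when stage 1 falls short, compare against the second
    # segmented voice, i.e. the more recent first input voice, which
    # better reflects gradual drift in the speaker's voice.
    second_rate = compare_voiceprint(second_input, second_segment)
    return second_rate >= SECOND_PRESET_RATE
```

Note that setting the two preset rates equal, as a later embodiment does, makes the second stage a pure recency fallback rather than a relaxed test.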
The computer program when executed by the processor further implements:
calling the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice;
generating a new data header protocol according to the first data header protocol and the second data header protocol;
calling the data content splicing tool to splice the first voice data content and the second voice data content to obtain new voice data content;
and encapsulating the new data header protocol and the new voice data content to obtain the second reference voice.
The computer program when executed by the processor further implements:
and if a first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than a first preset matching rate, determining that the second input voice is matched with the second reference voice.
The computer program when executed by the processor further implements:
if a second matching rate obtained by comparing the second input voice with the second segmented voice is smaller than a second preset matching rate, determining that the second input voice is not matched with the second reference voice; wherein the first preset matching rate is equal to the second preset matching rate.
The computer program when executed by the processor further implements:
setting the count value in the counter to I_n, wherein I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n is equal to a preset matching threshold N, setting a mark stamp for the second input voice, wherein n ≥ 1 and N > 1;
and calling the voice splicing tool to splice the second input voice with the mark stamp and a target voice section in the second reference voice to obtain a third reference voice, wherein the target voice section is the voice segment corresponding to the first input voice.
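The counter-gated refresh in the two steps above may be sketched as follows; the default threshold N = 5 and the injected splice function are illustrative assumptions:

```python
class ReferenceUpdater:
    # Counts confirmed matches and refreshes the reference voice
    # every N matches, as described above (sketch only).
    def __init__(self, splice_fn, threshold_n=5):
        self.splice_fn = splice_fn      # e.g. splice_wav sketched earlier
        self.threshold_n = threshold_n  # preset matching threshold N
        self.count = 0                  # I_n, initially 0

    def on_match(self, second_input, target_segment, second_reference):
        # I_n = I_{n-1} + 1 after every confirmed match.
        self.count += 1
        if self.count < self.threshold_n:
            return second_reference     # keep the current reference voice
        self.count = 0
        # Third reference voice: the target voice section (the segment
        # that came from the first input voice) spliced with the newly
        # stamped second input voice; the oldest segment drops out.
        return self.splice_fn(target_segment, second_input)
```

Splicing the fresh input onto the target voice section rather than onto the whole second reference voice keeps the reference at a bounded length across updates, which is consistent with avoiding the gradual decrease in matching rate described below.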
As can be seen from the above, in the embodiment of the present invention, when a preset operation for voice recognition is detected, the comparison result between the first input voice received in the preset operation and the prestored first reference voice is monitored; when the comparison result is a match, a mark stamp is set for the first input voice, and the voice splicing tool is called to splice the stamped first input voice with the first reference voice to obtain the second reference voice. When the preset operation is detected again, the second reference voice is segmented into the first segmented voice and the second segmented voice, voiceprint features of the newly received second input voice are compared with the first segmented voice to obtain a first matching rate, and the comparison of the first matching rate with the first preset matching rate determines whether the second input voice is further compared with the second segmented voice. The reference voice is thus updated during voice recognition and can follow the natural changes in the voice of the same recognized person, avoiding inaccurate recognition caused by such changes.
By setting the preset matching threshold N and, each time the second input voice is determined to match the second reference voice, setting the count value in the counter to I_n, wherein I_n ≥ 0 and I_n = I_{n-1} + 1, a mark stamp is set for the second input voice when I_n equals N, and the voice splicing tool is called to splice the stamped second input voice with the target voice section in the second reference voice to obtain the third reference voice. The reference voice used for voice recognition is thereby continuously updated, ensuring that it follows changes in the user's voice while avoiding the gradual decrease in matching rate that reference updates could otherwise cause.
The computer readable storage medium may be an internal storage unit of the device according to any of the foregoing embodiments, for example, a hard disk or a memory of a computer. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the apparatus. The computer-readable storage medium is used for storing the computer program and other programs and data required by the apparatus. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To illustrate clearly the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A speech recognition method, comprising:
if the preset operation for voice recognition is detected, monitoring a comparison result of a first input voice received in the preset operation and a prestored first reference voice;
if the comparison result is that the first input voice is matched with the first reference voice, setting a mark stamp for the first input voice;
calling a voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain second reference voice;
when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice;
comparing voiceprint features of the second input voice received in the re-detected preset operation with those of the first segmented voice;
if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing voiceprint features of the second input voice with those of the second segmented voice;
if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice is matched with the second reference voice;
after determining that the second input voice matches the second reference voice, the method further comprises:
setting the count value in the counter to I_n, wherein I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n is equal to a preset matching threshold N, setting a mark stamp for the second input voice, wherein n ≥ 1 and N > 1;
and calling the voice splicing tool to splice the second input voice with the mark stamp and a target voice section in the second reference voice to obtain a third reference voice, wherein the target voice section is the voice segment corresponding to the first input voice.
2. The speech recognition method of claim 1, wherein the speech splicing tool comprises a data header protocol tool and a data content splicing tool; the first input voice and the first reference voice both comprise a data header protocol and voice data content;
wherein said calling the voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain the second reference voice comprises:
calling the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice;
generating a new data header protocol according to the first data header protocol and the second data header protocol;
calling the data content splicing tool to splice the first voice data content and the second voice data content to obtain new voice data content;
and encapsulating the new data header protocol and the new voice data content to obtain the second reference voice.
3. The speech recognition method of claim 1, wherein after comparing voiceprint features of the second input voice received in the re-detected preset operation with the first segmented voice, the method further comprises:
and if a first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than a first preset matching rate, determining that the second input voice is matched with the second reference voice.
4. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
if the preset operation for voice recognition is detected, monitoring a comparison result of a first input voice received in the preset operation and a prestored first reference voice;
if the comparison result is that the first input voice is matched with the first reference voice, setting a mark stamp for the first input voice;
calling a voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain second reference voice;
when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice;
comparing voiceprint features of the second input voice received in the re-detected preset operation with those of the first segmented voice;
if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing voiceprint features of the second input voice with those of the second segmented voice;
if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice is matched with the second reference voice;
after determining that the second input voice matches the second reference voice, the processor further implements:
setting the count value in the counter to I_n, wherein I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n is equal to a preset matching threshold N, setting a mark stamp for the second input voice, wherein n ≥ 1 and N > 1;
and calling the voice splicing tool to splice the second input voice with the mark stamp and a target voice section in the second reference voice to obtain a third reference voice, wherein the target voice section is the voice segment corresponding to the first input voice.
5. The terminal device of claim 4, wherein the voice splicing tool comprises a data header protocol tool and a data content splicing tool; the first input voice and the first reference voice both comprise a data header protocol and voice data content;
wherein said calling the voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain the second reference voice comprises:
calling the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice;
generating a new data header protocol according to the first data header protocol and the second data header protocol;
calling the data content splicing tool to splice the first voice data content and the second voice data content to obtain new voice data content;
and encapsulating the new data header protocol and the new voice data content to obtain the second reference voice.
6. The terminal device of claim 4, wherein after comparing voiceprint features of the second input voice received in the re-detected preset operation with the first segmented voice, the processor further implements:
and if a first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than a first preset matching rate, determining that the second input voice is matched with the second reference voice.
7. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.
CN201711293919.4A 2017-12-08 2017-12-08 Speech recognition method, terminal device and computer-readable storage medium Active CN108257604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711293919.4A CN108257604B (en) 2017-12-08 2017-12-08 Speech recognition method, terminal device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711293919.4A CN108257604B (en) 2017-12-08 2017-12-08 Speech recognition method, terminal device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN108257604A CN108257604A (en) 2018-07-06
CN108257604B true CN108257604B (en) 2021-01-08

Family

ID=62720987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711293919.4A Active CN108257604B (en) 2017-12-08 2017-12-08 Speech recognition method, terminal device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN108257604B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110896352B (en) * 2018-09-12 2022-07-08 阿里巴巴集团控股有限公司 Identity recognition method, device and system
CN109495636A (en) 2018-10-23 2019-03-19 慈中华 Information interacting method and device
CN109584887B (en) * 2018-12-24 2022-12-02 科大讯飞股份有限公司 Method and device for generating voiceprint information extraction model and extracting voiceprint information
CN112786015A (en) * 2019-11-06 2021-05-11 阿里巴巴集团控股有限公司 Data processing method and device
CN111933183B (en) * 2020-08-17 2023-11-24 深圳一块互动网络技术有限公司 Audio identification method of Bluetooth equipment for merchants

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7222072B2 (en) * 2003-02-13 2007-05-22 Sbc Properties, L.P. Bio-phonetic multi-phrase speaker identity verification
CN101311953A (en) * 2007-05-25 2008-11-26 上海电虹软件有限公司 Network payment method and system based on voiceprint authentication
US8417525B2 (en) * 2010-02-09 2013-04-09 International Business Machines Corporation Adaptive voice print for conversational biometric engine
CN102402985A (en) * 2010-09-14 2012-04-04 盛乐信息技术(上海)有限公司 Voiceprint authentication system for improving voiceprint identification safety and method for realizing the same
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN104239456B (en) * 2014-09-02 2019-05-03 百度在线网络技术(北京)有限公司 The extracting method and device of user characteristic data
CN105575391B (en) * 2014-10-10 2020-04-03 阿里巴巴集团控股有限公司 Voiceprint information management method and device and identity authentication method and system
CN104616655B (en) * 2015-02-05 2018-01-16 北京得意音通技术有限责任公司 The method and apparatus of sound-groove model automatic Reconstruction
CN105991593B (en) * 2015-02-15 2019-08-30 阿里巴巴集团控股有限公司 A kind of method and device identifying consumer's risk
CN105049882B (en) * 2015-08-28 2019-02-22 北京奇艺世纪科技有限公司 A kind of video recommendation method and device
CN106156583A (en) * 2016-06-03 2016-11-23 深圳市金立通信设备有限公司 A kind of method of speech unlocking and terminal
CN106782564B (en) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 Method and apparatus for handling voice data

Also Published As

Publication number Publication date
CN108257604A (en) 2018-07-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant