CN108257604B - Speech recognition method, terminal device and computer-readable storage medium

Info

Publication number: CN108257604B
Application number: CN201711293919.4A
Authority: CN (China)
Prior art keywords: voice, input, segmented, input voice, data content
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN108257604A (en)
Inventor: 梁承飞
Current assignee: Ping An Puhui Enterprise Management Co Ltd
Original assignee: Ping An Puhui Enterprise Management Co Ltd
Application filed by Ping An Puhui Enterprise Management Co Ltd
Priority application: CN201711293919.4A

Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L17/00: Speaker identification or verification techniques

Abstract

The invention belongs to the technical field of information processing and provides a voice recognition method, a terminal device and a computer-readable storage medium. The voice recognition method comprises: monitoring the comparison result between a received first input voice and a prestored first reference voice, and, when the comparison result is a match, calling a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice; when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, comparing the second input voice with the first segmented voice, and, when the two do not match, comparing the voiceprint features of the second input voice with those of the second segmented voice. The reference voice is thus updated during voice recognition, avoiding inaccurate recognition caused by natural changes in the human voice.

Description

Speech recognition method, terminal device and computer-readable storage medium
Technical Field
The present invention belongs to the field of information processing technologies, and in particular, to a speech recognition method, a terminal device, and a computer-readable storage medium.
Background
Biometric recognition technology is widely used in information verification services; existing biometric recognition techniques include face recognition, fingerprint recognition, iris recognition, voice recognition and the like.
In existing voice recognition schemes, after a reference voice is recorded in advance, the user's voice is collected in real time and compared acoustically with the reference voice, and voice recognition is completed according to the comparison result. However, the human voice changes naturally with age and other physiological changes; when such a change occurs, taking the previously recorded reference voice as the benchmark leads to inaccurate voice recognition results.
Disclosure of Invention
In view of this, embodiments of the present invention provide a speech recognition method, a terminal device and a computer-readable storage medium, so as to avoid inaccurate speech recognition caused by human voice changes.
A first aspect of an embodiment of the present invention provides a speech recognition method, including:
if a preset operation for voice recognition is detected, monitoring the comparison result between a first input voice received in the preset operation and a prestored first reference voice;
if the comparison result is that the first input voice matches the first reference voice, setting a mark stamp for the first input voice;
calling a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice;
when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice;
comparing the second input voice received in the re-detected preset operation with the first segmented voice by voiceprint features;
if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing the voiceprint features of the second input voice with those of the second segmented voice;
and if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice matches the second reference voice.
A second aspect of embodiments of the present invention provides a speech recognition apparatus, including means for performing the method of the first aspect.
A third aspect of an embodiment of the present invention provides a terminal device, including: a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of the first aspect when executing the computer program.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
When a preset operation for voice recognition is detected, the embodiment of the invention monitors the comparison result between a first input voice received in the preset operation and a prestored first reference voice; when the comparison result is a match, a mark stamp is set for the first input voice, and a voice splicing tool is called to splice the first input voice with the first reference voice to obtain a second reference voice. When the preset operation is detected again, the second reference voice is segmented into a first segmented voice and a second segmented voice; the newly detected second input voice is compared with the first segmented voice by voiceprint features to obtain a first matching rate, and whether to compare the voiceprint features of the second input voice with the second segmented voice is decided according to the comparison between the first matching rate and the first preset matching rate. The reference voice is thereby updated during voice recognition and can change along with the natural change of the same recognized person's voice, avoiding inaccurate voice recognition caused by changes in the human voice.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flow chart illustrating an implementation of a speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating an implementation of a speech recognition method according to another embodiment of the present invention;
fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Referring to fig. 1, it is a flowchart of an implementation of a speech recognition method provided in an embodiment of the present invention, and the speech recognition method shown in fig. 1 may include:
S11: if a preset operation for voice recognition is detected, monitoring the comparison result between the first input voice received in the preset operation and a prestored first reference voice.
In step S11, the preset operation for voice recognition may be a trigger operation that starts voice recognition when a preset application is opened on the terminal, a trigger operation that is manually initiated to input a voice password while the preset application is in use, or a trigger operation that makes the current operation interface jump to the voice recognition interface during a permission request. The trigger operation may be implemented by a single click, a double click, or a continuous press of a voice recognition trigger button.
It should be noted that the comparison result between the first input voice and the prestored first reference voice reflects whether the source of the first input voice targeted by the voice recognition is the same as the source of the first reference voice.
In this embodiment, if the source of the first input voice is the same as that of the first reference voice, the first reference voice can be updated according to the first input voice; if the sources differ, it cannot. The comparison result between the first input voice received in the preset operation and the prestored first reference voice can be monitored indirectly: the terminal acquires the interface content displayed after voice recognition, judges whether it is consistent with the interface content corresponding to the preset operation, and thereby determines the comparison result between the first input voice and the prestored first reference voice.
Take a user paying for resources in a preset application through voice recognition as an example: the interface content corresponding to the preset operation is a prompt that the resource payment succeeded. If the interface content displayed after voice recognition prompts that the payment is incomplete or has failed, the comparison result between the first input voice and the prestored first reference voice is determined to be a mismatch; if it prompts that the payment succeeded, the comparison result is determined to be a match.
In other embodiments, the comparison result between the first input voice received in the preset operation and the prestored first reference voice can also be monitored by determining whether a newly added task or process exists, or by obtaining the content of the newly added task or process.
Voice recognition for account login is illustrated as another example.
For example, the preset operation is used to input a first input voice for login verification. When the comparison result between the first input voice and the prestored first reference voice is a match, voice recognition succeeds and the corresponding logged-in interface is loaded and displayed; when the comparison result is a mismatch, no operation is performed. The comparison result between the first input voice and the prestored first reference voice can therefore be determined by judging whether a newly added task or process that loads and displays the logged-in interface exists, as the sketch below illustrates.
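As a concrete illustration of this indirect monitoring, the following is a minimal sketch that infers the comparison result from the interface displayed after recognition; get_foreground_interface and expected_interface_id are hypothetical names, since the patent does not specify a concrete API:

```python
# Minimal sketch: infer the comparison result from the interface displayed
# after voice recognition. Both names below are hypothetical placeholders,
# not APIs named by the patent.

def comparison_matched(get_foreground_interface, expected_interface_id) -> bool:
    """Return True when the interface shown after voice recognition is the
    one the preset operation should lead to (e.g. the logged-in interface)."""
    current = get_foreground_interface()      # identifier of the visible interface
    return current == expected_interface_id   # match => the voices compared as equal
```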
S12: if the comparison result is that the first input voice matches the first reference voice, setting a mark stamp for the first input voice.
In step S12, the mark stamp is used to mark the first input voice and to reflect that the source of the first input voice is legitimate, i.e. that the source of the first input voice is the same as the source of the first reference voice.
It should be noted that the first input voice and the first reference voice each comprise a corresponding data header protocol and voice data content, where the data header protocol reflects at least the file size of the voice, the duration of the voice content and the voice format.
In this embodiment, the setting of the mark stamp for the first input voice may be setting a marker in a header protocol corresponding to the first input voice, or setting a mark keyword in a file name of the first input voice.
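By way of illustration only, the following minimal sketch sets a mark stamp using the second approach, a mark keyword in the file name; the "_verified" keyword is an assumed choice, not one fixed by the patent:

```python
import os

MARK_KEYWORD = "_verified"   # assumed mark keyword; the patent fixes none

def set_mark_stamp(path: str) -> str:
    """Mark an input voice file as having a legitimate source by inserting
    a mark keyword into its file name."""
    base, ext = os.path.splitext(path)
    stamped = base + MARK_KEYWORD + ext
    os.rename(path, stamped)   # the mark keyword now appears in the file name
    return stamped
```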
As a possible implementation manner, step S12 may include: if the comparison result is that the first input voice is matched with the first reference voice and the file format of the first input voice is consistent with that of the first reference voice, setting a mark stamp for the first input voice; if the comparison result is that the first input voice is matched with the first reference voice and the file format of the first input voice is not consistent with that of the first reference voice, a voice format conversion tool is called to convert the first input voice into a target input voice with the file format consistent with that of the first reference voice, and a mark stamp is set for the target input voice.
It is understood that the voice format conversion tool may be an existing voice file format conversion tool. For example, if the first input voice is in MP3 format and the first reference voice is in WAV format, the voice file format conversion tool is called to modify the suffix of the first input voice from ".mp3" to ".wav", so that the first input voice and the first reference voice can be spliced and played.
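A minimal sketch of this format-alignment step follows, assuming the pydub library as the conversion tool (the patent names no specific tool). Note that renaming the suffix alone leaves the encoding unchanged, so the sketch transcodes the data as well:

```python
from pydub import AudioSegment  # assumed third-party conversion tool

def to_wav(input_path: str) -> str:
    """Convert an input recording (e.g. MP3) to WAV so that its file format
    matches that of the reference voice before splicing."""
    audio = AudioSegment.from_file(input_path)        # decode MP3/WAV/...
    out_path = input_path.rsplit(".", 1)[0] + ".wav"
    audio.export(out_path, format="wav")              # re-encode as PCM WAV
    return out_path
```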
S13: calling a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice.
In step S13, the first input voice with the tag stamp is spliced with the first reference voice, specifically, the voice data of the first input voice with the tag stamp is spliced with the voice data of the first reference voice, and the spliced voice data is encapsulated with the new header protocol, so as to obtain the second reference voice.
It should be noted that the voice data corresponding to the second reference voice at least includes the voice data of the first input voice and the voice data of the first reference voice.
In this embodiment, the voice splicing tool is a script file for splicing the mark-stamped first input voice with the first reference voice, where the objects the script file operates on are the voice data of the mark-stamped first input voice and the voice data of the first reference voice.
It should be noted that voice splicing differs from voice synthesis: voice splicing joins the voice data in at least two voice files, either head-to-tail or by segmented interception. For head-to-tail splicing, the start timestamp position, the splicing-point timestamp position and the end timestamp position of the spliced voice data are determined within the voice data of the files. For segmented splicing, several voice segments to be spliced are obtained and joined into complete voice data according to a preset voice data segment splicing strategy.
It can be understood that, in the process of calling the voice splicing tool to obtain the second reference voice, the script file corresponding to the voice splicing tool may be written in an existing logic language; in practical applications, the object the voice splicing tool operates on may also be a voice splicing process.
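A minimal head-to-tail splicing sketch using Python's standard wave module is given below; it assumes both files are PCM WAV with identical channel count, sample width and sample rate (the patent's own splicing script is not published):

```python
import wave

def splice_wav(reference_path: str, input_path: str, out_path: str) -> None:
    """Append the mark-stamped input voice to the reference voice and
    re-encapsulate the joined voice data under a fresh header."""
    with wave.open(reference_path, "rb") as ref, wave.open(input_path, "rb") as inp:
        # head-to-tail splicing requires matching channels/width/rate
        assert ref.getparams()[:3] == inp.getparams()[:3], "voice formats must match"
        frames = ref.readframes(ref.getnframes()) + inp.readframes(inp.getnframes())
        params = ref.getparams()
    with wave.open(out_path, "wb") as out:
        out.setparams(params)     # the wave module rewrites header sizes on close
        out.writeframes(frames)   # spliced voice data content
```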
S14: when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice.
In step S14, the first segmented speech corresponds to the first reference speech, and the second segmented speech corresponds to the first input speech.
In this embodiment, "the first segmented voice corresponds to the first reference voice" means that the voice data of the first segmented voice is the voice data of the first reference voice, i.e. their voice contents are the same; likewise, the second segmented voice corresponds to the first input voice, i.e. the voice content of the second segmented voice is the same as that of the first input voice.
It should be noted that, to segment the second reference voice into the first segmented voice and the second segmented voice, a mark point distinguishing the two is set in the second reference voice: the mark position is determined according to the respective voice data lengths of the first reference voice and the first input voice, so that the second reference voice is divided into the first segmented voice and the second segmented voice.
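A sketch of this segmentation step follows; it simplifies the mark-point scheme to a recorded frame count for the original first reference voice, which is an assumption about how the mark position is stored:

```python
import wave

def split_reference(path: str, first_len_frames: int):
    """Divide the second reference voice into the first segmented voice
    (the old reference voice) and the second segmented voice (the old
    input voice), returned as raw PCM byte strings."""
    with wave.open(path, "rb") as w:
        first = w.readframes(first_len_frames)                     # first segmented voice
        second = w.readframes(w.getnframes() - first_len_frames)   # second segmented voice
    return first, second
```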
S15: comparing the second input voice received in the re-detected preset operation with the first segmented voice by voiceprint features.
In step S15, the voiceprint feature comparison between the second input voice and the first segmented voice is performed by drawing a target voiceprint map for the second input voice and a first voiceprint map for the first segmented voice, extracting the voiceprint features from the target voiceprint map, and comparing them against the first voiceprint map as the reference.
It is noted that the voiceprint map can be at least one of a broadband voiceprint, a narrowband voiceprint, an amplitude voiceprint, a contour voiceprint, a time-spectrum voiceprint and a cross-sectional voiceprint, where the cross-sectional voiceprint comprises a cross-sectional broadband voiceprint and a cross-sectional narrowband voiceprint. The broadband and narrowband voiceprint maps reflect how the frequency and intensity of the voice change over time; the amplitude, contour and time-spectrum voiceprint maps reflect how the voice intensity or sound pressure changes over time; the cross-sectional voiceprint reflects the intensity and frequency characteristics of the sound wave at a given time point.
In all embodiments of the present invention, when voiceprint maps are compared between voices, the two voiceprint maps being compared are of the same category.
In this embodiment, the voiceprint feature comparison between the second input voice and the first segmented voice may specifically compare similar features in the voiceprints of the same characters and words in the two voices. For example, the frequency values of the formants in the voiceprint maps of the second input voice and the first segmented voice are selected and compared to find the similarities and differences between the two voices.
It can be understood that, in practical applications, when comparing voiceprint features between the second input voice and the first segmented voice, the feature points used may differ depending on which voiceprint map the comparison is based on; specific voiceprint feature comparison schemes exist in the prior art and are not detailed here.
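The patent compares formant features read from voiceprint maps; as a stand-in for such a scheme, the sketch below scores similarity from MFCC features and cosine similarity using the librosa library, which is an assumed feature extractor rather than one prescribed by the patent:

```python
import numpy as np
import librosa  # assumed feature-extraction library

def matching_rate(path_a: str, path_b: str) -> float:
    """Rough matching rate in [0, 1] between two voices, computed from the
    cosine similarity of their time-averaged MFCC vectors."""
    feats = []
    for path in (path_a, path_b):
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # shape (20, n_frames)
        feats.append(mfcc.mean(axis=1))                     # average over time
    a, b = feats
    cos = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return (cos + 1.0) / 2.0  # map cosine in [-1, 1] to a matching rate in [0, 1]
```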
S16: if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing the voiceprint features of the second input voice with those of the second segmented voice.
In step S16, the first matching rate is used to reflect the comparison result between the second input speech and the first segmented speech. The first preset matching rate is used for describing the lowest matching rate standard when the comparison result of the second input voice and the first segmented voice is matched.
It should be noted that, in all embodiments of the present invention, the matching rate is used to describe the similarity degree between two voices being compared, that is, the higher the matching rate value is, the more similar the two voices being compared with the voiceprint feature are, and the greater the possibility of belonging to the same source is.
In this embodiment, the specific implementation of comparing the voiceprint features of the second input voice with those of the second segmented voice is similar to that in step S15 and is therefore not detailed here.
It is understood that in other embodiments of the present invention, the speech recognition method further comprises a first parallel step alongside step S16: if the first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than the first preset matching rate, determining that the second input voice matches the second reference voice.
It should be noted that there is no fixed execution order between step S16 and the first parallel step; they are mutually exclusive: the first parallel step is not executed after step S16 is executed, and step S16 is not executed after the first parallel step is executed.
S17: if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice matches the second reference voice.
In step S17, the second matching rate is used to reflect the comparison result between the second input speech and the second segmented speech. The second preset matching rate is used for describing the lowest matching rate standard when the comparison result of the second input voice and the second segmented voice is matched.
It should be noted that, in all embodiments of the present invention, the matching rate is used to describe the similarity degree between two voices being compared, that is, the higher the matching rate value is, the more similar the two voices being compared with the voiceprint feature are, and the greater the possibility of belonging to the same source is.
In this embodiment, the specific implementation manner of comparing the voiceprint characteristics of the second input voice with the second segmented voice is similar to that in step S15, and therefore, the detailed description is omitted here.
In the present embodiment, the speech recognition method further includes a second parallel step parallel to step S17: if a second matching rate obtained by comparing the second input voice with the second segmented voice is smaller than a second preset matching rate, determining that the second input voice is not matched with the second reference voice; wherein the first preset matching rate is equal to the second preset matching rate.
It should be noted that there is no fixed execution order between step S17 and the second parallel step; they are mutually exclusive: the second parallel step is not executed after step S17 is executed, and step S17 is not executed after the second parallel step is executed.
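The control flow of steps S15 to S17 together with the two parallel steps can be sketched as follows, reusing the matching_rate sketch above; the numeric thresholds are illustrative, since the patent fixes no values and only states that the two preset rates are equal:

```python
FIRST_PRESET_RATE = 0.80    # illustrative value only
SECOND_PRESET_RATE = 0.80   # the embodiment sets the two preset rates equal

def recognize(second_input, first_segment, second_segment) -> bool:
    r1 = matching_rate(second_input, first_segment)    # S15
    if r1 >= FIRST_PRESET_RATE:                        # first parallel step
        return True                                    # matches the second reference voice
    r2 = matching_rate(second_input, second_segment)   # S16
    return r2 >= SECOND_PRESET_RATE                    # S17 / second parallel step
```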
As can be seen from the above, in the embodiment of the present invention, when a preset operation for voice recognition is detected, the comparison result between a first input voice received in the preset operation and a prestored first reference voice is monitored; when the comparison result is a match, a mark stamp is set for the first input voice, and a voice splicing tool is called to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice. When the preset operation is detected again, the second reference voice is segmented into a first segmented voice and a second segmented voice; the newly detected second input voice is compared with the first segmented voice by voiceprint features to obtain a first matching rate, and whether to compare the voiceprint features of the second input voice with the second segmented voice is decided according to the comparison between the first matching rate and the first preset matching rate. The reference voice is thereby updated during voice recognition and changes along with the natural change of the same recognized person's voice, avoiding inaccurate voice recognition caused by changes in the human voice.
Referring to fig. 2, fig. 2 is a schematic flow chart of a speech recognition method according to another embodiment of the present invention. As shown in fig. 2, a speech recognition method according to another embodiment of the present invention may include:
S21: if a preset operation for voice recognition is detected, monitoring the comparison result between the first input voice received in the preset operation and a prestored first reference voice.
In step S21, the preset operation for voice recognition may be a trigger operation that starts voice recognition when a preset application is opened on the terminal, a trigger operation that is manually initiated to input a voice password while the preset application is in use, or a trigger operation that makes the current operation interface jump to the voice recognition interface during a permission request. The trigger operation may be implemented by a single click, a double click, or a continuous press of a voice recognition trigger button.
It is understood that in the present embodiment, a specific implementation manner of the step S21 is the same as the specific implementation manner of the step S11 in the previous embodiment, and please refer to the description of the step S11, which is not described herein again.
S22: if the comparison result is that the first input voice matches the first reference voice, setting a mark stamp for the first input voice.
In step S22, the mark stamp is used to mark the first input voice and to reflect that the source of the first input voice is legitimate, i.e. that the source of the first input voice is the same as the source of the first reference voice.
It is understood that in the present embodiment, a specific implementation manner of the step S22 is the same as the specific implementation manner of the step S12 in the previous embodiment, and please refer to the description of the step S12, which is not described herein again.
S23: calling a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice.
In step S23, the voice splicing tool includes a data header protocol tool and a data content splicing tool; the first input voice and the first reference voice each include a data header protocol and voice data content.
As a possible implementation manner of this embodiment, step S23 may specifically include: calling the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice; generating a new data header protocol from the first data header protocol and the second data header protocol; calling the data content splicing tool to splice the first voice data content and the second voice data content into new voice data content; and encapsulating the new data header protocol with the new voice data content to obtain the second reference voice.
In this embodiment, the data header protocol tool may be a preset WavHeader.h script; by executing the header-protocol parsing content in the script, the first data header protocol and first voice data content, and the second data header protocol and second voice data content, are obtained.
The WavHeader.h script defines and distinguishes the number of bits of each parameter in a voice data header protocol and the number of bits occupied by the voice data content. Running the WavHeader.h script splits the first input voice and the first reference voice into the first data header protocol and first voice data content corresponding to the first input voice, and the second data header protocol and second voice data content corresponding to the first reference voice.
In this embodiment, the first data header protocol and the second data header protocol respectively describe the voice duration, voice size and other attributes of the first input voice and the first reference voice. In the new data header protocol generated from them, the described voice duration is the sum of the durations of the first input voice and the first reference voice, and the described voice size is the sum of their sizes.
The data content splicing tool may include a voice data reading tool DataRead and a voice data writing tool DataWriter.
It should be noted that both the voice data reading tool DataRead and the voice data writing tool DataWriter can be packaged and read via corresponding binary data streams.
In this embodiment, a new header protocol and new voice data content are encapsulated to obtain a second reference voice, where various voice data parameters in the new header protocol correspond to the new voice data content, that is, voice duration information and voice size information in the new header protocol are consistent with the duration and size of the voice data content.
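For the canonical 44-byte PCM WAV header, the split and merge behaviour described above can be sketched as follows; the patent's WavHeader.h script is not published, so the field offsets here follow the standard RIFF/WAVE layout:

```python
import struct

HEADER_FMT = "<4sI4s4sIHHIIHH4sI"   # RIFF / fmt / data chunks of a simple PCM WAV

def split_voice(raw: bytes):
    """Split a WAV file's bytes into (data header protocol fields, voice data content)."""
    return struct.unpack(HEADER_FMT, raw[:44]), raw[44:]

def merge_voices(fields_a, data_a: bytes, fields_b, data_b: bytes) -> bytes:
    """Generate a new data header protocol whose data size (and hence duration,
    via the byte rate) is the sum of both inputs, and encapsulate it with the
    spliced voice data content."""
    data = data_a + data_b
    new = list(fields_a)        # reuse the format parameters of the first voice
    new[1] = 36 + len(data)     # RIFF chunk size = 36 + data size
    new[12] = len(data)         # data chunk size
    return struct.pack(HEADER_FMT, *new) + data
```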
S24: when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice.
S25: comparing the second input voice received in the re-detected preset operation with the first segmented voice by voiceprint features.
S26: if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing the voiceprint features of the second input voice with those of the second segmented voice.
S27: if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice matches the second reference voice.
It should be noted that, the specific implementation manner of steps S24 to S27 in the present embodiment corresponds to steps S14 to S17 in the previous embodiment, and please refer to the description of steps S14 to S17, which is not described herein again.
It is understood that, in the present embodiment, the step S27 is performed only when the step S26 is performed.
In the present embodiment, step S27 is followed by step S28 and step S29.
Step S28: setting the count value in the counter to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n equals the preset matching threshold N, setting a mark stamp for the second input voice.
In step S28, I_n, n and N are integers, n ≥ 1 and N > 1.
In this embodiment, the result that the second input voice matches the second reference voice occurs at most once during each voice recognition. When voice recognition has been performed n times and every recognition result is a match between the second input voice and the second reference voice, the count value in the counter is I_n; that is, each time the second input voice is determined to match the second reference voice, the count value is set to I_n = I_{n-1} + 1, with I_n ≥ 0. When I_n equals the preset matching threshold N, the match between the second input voice and the second reference voice is regarded as an established rather than accidental event, i.e. the possibility that the match is a coincidence is eliminated.
In practical applications, the preset matching threshold may be determined according to the period over which the voice changes, according to the number of comparisons in which the second reference voice has served as the comparison standard, or according to how long the second reference voice has been in use.
It should be noted that a mark stamp is set for the second input voice; the mark stamp is used to mark the second input voice and to reflect that its source is legitimate, i.e. that the source of the second input voice is the same as the source of the second reference voice.
S29: calling a voice splicing tool to splice the mark-stamped second input voice with a target voice segment in the second reference voice to obtain a third reference voice, where the target voice is the voice segment corresponding to the first input voice.
In step S29, the second reference voice includes the voice data content corresponding to the first input voice and the voice data content corresponding to the first reference voice; the target voice is the voice segment in the second reference voice corresponding to the first input voice.
It should be noted that, to prevent the reference voice from growing continuously as the number of voice recognitions increases, when the voice splicing tool is called to splice the mark-stamped second input voice with the target voice segment in the second reference voice, the target voice is the voice segment in the second reference voice corresponding to the first input voice; the oldest segment (the original first reference voice) is thus discarded and the reference voice keeps a bounded length.
In this embodiment, a preset matching threshold N is set; when it is determined that the second input voice matches the second reference voice, the count value in the counter is set to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1. When I_n equals the preset matching threshold N, a mark stamp is set for the second input voice, and the voice splicing tool is called to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain a third reference voice. The reference voice used for voice recognition is thus continuously updated, ensuring that it changes along with the change of the user's voice, while avoiding the phenomenon that the matching rate gradually decreases as the reference voice is updated.
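The counter logic of steps S28 and S29 can be condensed into a small sketch; resetting the counter after a re-splice is an assumption, since the patent does not state what happens to I_n once the third reference voice has been produced:

```python
class ReferenceUpdater:
    """Counts matches; after N consecutive ones the reference is re-spliced."""

    def __init__(self, threshold_n: int):
        assert threshold_n > 1   # the embodiment requires N > 1
        self.threshold_n = threshold_n
        self.count = 0           # I_0 = 0

    def on_match(self) -> bool:
        """Call once per recognition in which the second input voice matched
        the second reference voice; returns True when I_n == N, i.e. when a
        mark stamp should be set and the third reference voice spliced."""
        self.count += 1                      # I_n = I_{n-1} + 1
        if self.count == self.threshold_n:
            self.count = 0                   # assumed reset for the next window
            return True
        return False
```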
As can be seen from the above, in the embodiment of the present invention, when a preset operation for voice recognition is detected, the comparison result between a first input voice received in the preset operation and a prestored first reference voice is monitored; when the comparison result is a match, a mark stamp is set for the first input voice, and a voice splicing tool is called to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice. When the preset operation is detected again, the second reference voice is segmented into a first segmented voice and a second segmented voice; the newly detected second input voice is compared with the first segmented voice by voiceprint features to obtain a first matching rate, and whether to compare the voiceprint features of the second input voice with the second segmented voice is decided according to the comparison between the first matching rate and the first preset matching rate. The reference voice is thereby updated during voice recognition and changes along with the natural change of the same recognized person's voice, avoiding inaccurate voice recognition caused by changes in the human voice.
A preset matching threshold N is further set: when the second input voice is determined to match the second reference voice, the count value in the counter is set to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n equals N, a mark stamp is set for the second input voice and the voice splicing tool is called to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain a third reference voice. The reference voice used for voice recognition is thus continuously updated, ensuring that it changes along with the user's voice while avoiding a gradual decrease in the matching rate caused by updating the reference voice.
Referring to fig. 3, fig. 3 is a schematic block diagram of a speech recognition apparatus according to an embodiment of the present invention. The speech recognition apparatus 3 of this embodiment includes: a monitoring unit 31, a first marking unit 32, a first splicing unit 33, a segmentation unit 34, a first comparison unit 35, a second comparison unit 36 and a determination unit 37. Specifically:
the monitoring unit 31 is configured to monitor a comparison result between a first input voice received in a preset operation and a pre-stored first reference voice if the preset operation for performing voice recognition is detected.
For example, if the monitoring unit 31 detects a preset operation for performing voice recognition, it monitors a comparison result between a first input voice received in the preset operation and a pre-stored first reference voice.
A first marking unit 32, configured to set a marking stamp for the first input voice if the comparison result is that the first input voice matches the first reference voice.
For example, if the comparison result is that the first input voice matches the first reference voice, the first labeling unit 32 sets a labeling stamp for the first input voice.
A first splicing unit 33, configured to call a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice.
For example, the first splicing unit 33 calls a voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain the second reference voice.
Further, the voice splicing tool comprises a data header protocol tool and a data content splicing tool; the first input voice and the first reference voice each comprise a data header protocol and voice data content.
The first splicing unit 33 is specifically configured to: call the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice; generate a new data header protocol from the first data header protocol and the second data header protocol; call the data content splicing tool to splice the first voice data content and the second voice data content into new voice data content; and encapsulate the new data header protocol with the new voice data content to obtain the second reference voice.
For example, the first splicing unit 33 calls the data header protocol tool to split the first input voice and the first reference voice respectively to obtain the first data header protocol and first voice data content, and the second data header protocol and second voice data content; generates a new data header protocol from the two; calls the data content splicing tool to splice the two voice data contents into new voice data content; and encapsulates the new data header protocol with the new voice data content to obtain the second reference voice.
A segmentation unit 34, configured to segment, when the preset operation is detected again, the second reference voice into a first segmented voice and a second segmented voice, where the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice.
For example, when the preset operation is detected again, the segmentation unit 34 segments the second reference voice into a first segmented voice corresponding to the first reference voice and a second segmented voice corresponding to the first input voice.
A first comparison unit 35, configured to compare, by voiceprint features, the second input voice received in the re-detected preset operation with the first segmented voice.
For example, the first comparison unit 35 compares, by voiceprint features, the second input voice received in the re-detected preset operation with the first segmented voice.
A second comparison unit 36, configured to compare the voiceprint features of the second input voice with those of the second segmented voice if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate.
For example, if the first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than the first preset matching rate, the second comparison unit 36 compares the voiceprint features of the second input voice with those of the second segmented voice.
A determination unit 37, configured to determine that the second input voice matches the second reference voice if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate.
For example, if the second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than the second preset matching rate, the determination unit 37 determines that the second input voice matches the second reference voice.
Optionally, the speech recognition apparatus 3 may further include a second marking unit 38 and a second splicing unit 39. Specifically:
A second marking unit 38, configured to set the count value in the counter to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1, and, when I_n equals the preset matching threshold N, to set a mark stamp for the second input voice, where n ≥ 1 and N > 1.
For example, the second marking unit 38 sets the count value in the counter to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n equals the preset matching threshold N, it sets a mark stamp for the second input voice, where n ≥ 1 and N > 1.
A second splicing unit 39, configured to call a voice splicing tool to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain a third reference voice, where the target voice is the voice segment corresponding to the first input voice.
For example, the second splicing unit 39 calls a voice splicing tool to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain the third reference voice, where the target voice is the voice segment corresponding to the first input voice.
As can be seen from the above, in the embodiment of the present invention, when a preset operation for voice recognition is detected, the comparison result between a first input voice received in the preset operation and a prestored first reference voice is monitored; when the comparison result is a match, a mark stamp is set for the first input voice, and a voice splicing tool is called to splice the mark-stamped first input voice with the first reference voice to obtain a second reference voice. When the preset operation is detected again, the second reference voice is segmented into a first segmented voice and a second segmented voice; the newly detected second input voice is compared with the first segmented voice by voiceprint features to obtain a first matching rate, and whether to compare the voiceprint features of the second input voice with the second segmented voice is decided according to the comparison between the first matching rate and the first preset matching rate. The reference voice is thereby updated during voice recognition and changes along with the natural change of the same recognized person's voice, avoiding inaccurate voice recognition caused by changes in the human voice.
A preset matching threshold N is further set: when the second input voice is determined to match the second reference voice, the count value in the counter is set to I_n, where I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n equals N, a mark stamp is set for the second input voice and the voice splicing tool is called to splice the mark-stamped second input voice with the target voice segment in the second reference voice to obtain a third reference voice. The reference voice used for voice recognition is thus continuously updated, ensuring that it changes along with the user's voice while avoiding a gradual decrease in the matching rate caused by updating the reference voice.
Referring to fig. 4, a schematic block diagram of a terminal device according to another embodiment of the present invention is shown. The terminal device 400 in this embodiment may include: one or more processors 401, one or more input devices 402, one or more output devices 403 and a memory 404. The processor 401, the input device 402, the output device 403 and the memory 404 are connected by a bus 405. The memory 404 is used for storing a computer program comprising instructions, and the processor 401 performs the following operations by calling the computer program stored in the memory 404:
the processor 401 is configured to: if the preset operation for voice recognition is detected, monitoring a comparison result of the first input voice received in the preset operation and a pre-stored first reference voice.
The processor 401 is configured to: and if the comparison result is that the first input voice is matched with the first reference voice, setting a mark stamp for the first input voice.
The processor 401 is configured to: and calling a voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain second reference voice.
The processor 401 is configured to: when the preset operation is detected again, dividing the second reference voice segment into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice, and the second segmented voice corresponds to the first input voice.
The processor 401 is configured to: and comparing the second input voice received in the redetected preset operation with the first segmented voice by voiceprint features.
The processor 401 is configured to: and if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing the voiceprint characteristics of the second input voice with the second segmented voice.
The processor 401 is configured to: and if a first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than a first preset matching rate, determining that the second input voice is matched with the second reference voice.
The processor 401 is further configured to: and if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice is matched with the second reference voice.
The processor 401 is further configured to: if a second matching rate obtained by comparing the second input voice with the second segmented voice is smaller than a second preset matching rate, determining that the second input voice is not matched with the second reference voice; wherein the first preset matching rate is equal to the second preset matching rate.
The processor 401 is further configured to: setting the count value in the counter to InWherein, InNot less than 0 and In=In-1+1, when InAnd when the preset matching threshold value N is equal, setting a marking stamp for the second input voice, wherein N is more than or equal to 1, and N is more than 1.
The processor 401 is further configured to: and calling a voice splicing tool to splice the second input voice with the mark stamp and a target voice section in the second reference voice to obtain a third reference voice, wherein the target voice is a voice section corresponding to the first input voice.
The processor 401 is specifically configured to call the voice splicing tool to splice the mark-stamped first input voice with the first reference voice to obtain the second reference voice, including:
calling the data header protocol tool to split the first input voice and the first reference voice respectively, to obtain a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice;
generating a new data header protocol from the first data header protocol and the second data header protocol;
calling the data content splicing tool to splice the first voice data content and the second voice data content to obtain new voice data content;
and encapsulating the new data header protocol with the new voice data content to obtain the second reference voice.
It should be understood that, in the embodiment of the present invention, the processor 401 may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The input device 402 may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, etc., and the output device 403 may include a display (LCD, etc.), a speaker, etc.
The memory 404 may include a read-only memory and a random access memory, and provides instructions and data to the processor 401. A portion of the memory 404 may also include non-volatile random access memory. For example, the memory 404 may also store device type information.
In a specific implementation, the processor 401, the input device 402, and the output device 403 described in this embodiment of the present invention may execute the implementation manners described in the first embodiment and the second embodiment of the speech recognition method provided in this embodiment of the present invention, and may also execute the implementation manners of the devices described in this embodiment of the present invention, which is not described herein again.
In another embodiment of the invention, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements:
if the preset operation for voice recognition is detected, monitoring a comparison result of a first input voice received in the preset operation and a prestored first reference voice;
if the comparison result is that the first input voice is matched with the first reference voice, setting a mark stamp for the first input voice;
calling a voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain second reference voice;
when the preset operation is detected again, dividing the second reference voice segment into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice, and the second segmented voice corresponds to the first input voice;
comparing the second input voice received in the redetected preset operation with the first segmented voice by voiceprint features;
if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing voiceprint features of the second input voice with those of the second segmented voice;
and if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice is matched with the second reference voice.
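As a minimal sketch of this two-stage comparison, assume a function compare_voiceprint(a, b) that returns a matching rate in [0, 1]; both the function and the threshold values are assumptions for illustration, since this embodiment does not fix a particular voiceprint algorithm or preset rate:

```python
FIRST_PRESET_RATE = 0.80   # first preset matching rate (assumed value)
SECOND_PRESET_RATE = 0.80  # second preset matching rate (assumed value)

def matches_second_reference(second_input, first_segment, second_segment,
                             compare_voiceprint):
    # Stage 1: compare against the first segmented voice, i.e. the part
    # of the reference that came from the older first reference voice.
    first_rate = compare_voiceprint(second_input, first_segment)
    if first_rate >= FIRST_PRESET_RATE:
        return True
    # Stage 2: only when stage 1 falls short, compare against the second
    # segmented voice, i.e. the more recent first input voice, which
    # better reflects gradual drift in the speaker's voice.
    second_rate = compare_voiceprint(second_input, second_segment)
    return second_rate >= SECOND_PRESET_RATE
```

Note that setting the two preset rates equal, as a later embodiment does, makes the second stage a pure recency fallback rather than a relaxed test.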
The computer program when executed by the processor further implements:
calling the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice;
generating a new data header protocol according to the first data header protocol and the second data header protocol;
calling the data content splicing tool to splice the first voice data content and the second voice data content to obtain new voice data content;
and encapsulating the new data header protocol and the new voice data content to obtain the second reference voice.
The computer program when executed by the processor further implements:
and if a first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than a first preset matching rate, determining that the second input voice is matched with the second reference voice.
The computer program when executed by the processor further implements:
if a second matching rate obtained by comparing the second input voice with the second segmented voice is smaller than a second preset matching rate, determining that the second input voice is not matched with the second reference voice; wherein the first preset matching rate is equal to the second preset matching rate.
The computer program when executed by the processor further implements:
setting the count value in the counter to I_n, wherein I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n is equal to a preset matching threshold N, setting a mark stamp for the second input voice, wherein n ≥ 1 and N > 1;
and calling the voice splicing tool to splice the second input voice with the mark stamp and a target voice section in the second reference voice to obtain a third reference voice, wherein the target voice section is the voice segment corresponding to the first input voice.
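The counter-gated refresh in the two steps above may be sketched as follows; the default threshold N = 5 and the injected splice function are illustrative assumptions:

```python
class ReferenceUpdater:
    # Counts confirmed matches and refreshes the reference voice
    # every N matches, as described above (sketch only).
    def __init__(self, splice_fn, threshold_n=5):
        self.splice_fn = splice_fn      # e.g. splice_wav sketched earlier
        self.threshold_n = threshold_n  # preset matching threshold N
        self.count = 0                  # I_n, initially 0

    def on_match(self, second_input, target_segment, second_reference):
        # I_n = I_{n-1} + 1 after every confirmed match.
        self.count += 1
        if self.count < self.threshold_n:
            return second_reference     # keep the current reference voice
        self.count = 0
        # Third reference voice: the target voice section (the segment
        # that came from the first input voice) spliced with the newly
        # stamped second input voice; the oldest segment drops out.
        return self.splice_fn(target_segment, second_input)
```

Splicing the fresh input onto the target voice section rather than onto the whole second reference voice keeps the reference at a bounded length across updates, which is consistent with avoiding the gradual decrease in matching rate described below.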
As can be seen from the above, in the embodiment of the present invention, when a preset operation for voice recognition is detected, the comparison result between the first input voice received in the preset operation and the prestored first reference voice is monitored; when the comparison result is a match, a mark stamp is set for the first input voice, and the voice splicing tool is called to splice the stamped first input voice with the first reference voice to obtain the second reference voice. When the preset operation is detected again, the second reference voice is segmented into the first segmented voice and the second segmented voice, voiceprint features of the newly received second input voice are compared with the first segmented voice to obtain a first matching rate, and the comparison of the first matching rate with the first preset matching rate determines whether the second input voice is further compared with the second segmented voice. The reference voice is thus updated during voice recognition and can follow the natural changes in the voice of the same recognized person, avoiding inaccurate recognition caused by such changes.
By setting the preset matching threshold N and, each time the second input voice is determined to match the second reference voice, setting the count value in the counter to I_n, wherein I_n ≥ 0 and I_n = I_{n-1} + 1, a mark stamp is set for the second input voice when I_n equals N, and the voice splicing tool is called to splice the stamped second input voice with the target voice section in the second reference voice to obtain the third reference voice. The reference voice used for voice recognition is thereby continuously updated, ensuring that it follows changes in the user's voice while avoiding the gradual decrease in matching rate that reference updates could otherwise cause.
The computer readable storage medium may be an internal storage unit of the device according to any of the foregoing embodiments, for example, a hard disk or a memory of a computer. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the apparatus. The computer-readable storage medium is used for storing the computer program and other programs and data required by the apparatus. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of the two. To illustrate clearly the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A speech recognition method, comprising:
if the preset operation for voice recognition is detected, monitoring a comparison result of a first input voice received in the preset operation and a prestored first reference voice;
if the comparison result is that the first input voice is matched with the first reference voice, setting a mark stamp for the first input voice;
calling a voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain second reference voice;
when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice;
comparing voiceprint features of the second input voice received in the re-detected preset operation with those of the first segmented voice;
if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing voiceprint features of the second input voice with those of the second segmented voice;
if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice is matched with the second reference voice;
after determining that the second input voice matches the second reference voice, the method further comprises:
setting the count value in the counter to I_n, wherein I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n is equal to a preset matching threshold N, setting a mark stamp for the second input voice, wherein n ≥ 1 and N > 1;
and calling the voice splicing tool to splice the second input voice with the mark stamp and a target voice section in the second reference voice to obtain a third reference voice, wherein the target voice section is the voice segment corresponding to the first input voice.
2. The speech recognition method of claim 1, wherein the speech splicing tool comprises a data header protocol tool and a data content splicing tool; the first input voice and the first reference voice both comprise a data header protocol and voice data content;
wherein said calling the voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain the second reference voice comprises:
calling the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice;
generating a new data header protocol according to the first data header protocol and the second data header protocol;
calling the data content splicing tool to splice the first voice data content and the second voice data content to obtain new voice data content;
and encapsulating the new data header protocol and the new voice data content to obtain the second reference voice.
3. The speech recognition method of claim 1, wherein after comparing voiceprint features of the second input voice received in the re-detected preset operation with the first segmented voice, the method further comprises:
and if a first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than a first preset matching rate, determining that the second input voice is matched with the second reference voice.
4. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
if the preset operation for voice recognition is detected, monitoring a comparison result of a first input voice received in the preset operation and a prestored first reference voice;
if the comparison result is that the first input voice is matched with the first reference voice, setting a mark stamp for the first input voice;
calling a voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain second reference voice;
when the preset operation is detected again, segmenting the second reference voice into a first segmented voice and a second segmented voice, wherein the first segmented voice corresponds to the first reference voice and the second segmented voice corresponds to the first input voice;
comparing voiceprint features of the second input voice received in the re-detected preset operation with those of the first segmented voice;
if a first matching rate obtained by comparing the second input voice with the first segmented voice is smaller than a first preset matching rate, comparing voiceprint features of the second input voice with those of the second segmented voice;
if a second matching rate obtained by comparing the second input voice with the second segmented voice is equal to or greater than a second preset matching rate, determining that the second input voice is matched with the second reference voice;
after determining that the second input voice matches the second reference voice, the processor further implements:
setting the count value in the counter to I_n, wherein I_n ≥ 0 and I_n = I_{n-1} + 1; when I_n is equal to a preset matching threshold N, setting a mark stamp for the second input voice, wherein n ≥ 1 and N > 1;
and calling the voice splicing tool to splice the second input voice with the mark stamp and a target voice section in the second reference voice to obtain a third reference voice, wherein the target voice section is the voice segment corresponding to the first input voice.
5. The terminal device of claim 4, wherein the voice splicing tool comprises a data header protocol tool and a data content splicing tool; the first input voice and the first reference voice both comprise a data header protocol and voice data content;
wherein said calling the voice splicing tool to splice the first input voice with the mark stamp and the first reference voice to obtain the second reference voice comprises:
calling the data header protocol tool to split the first input voice and the first reference voice respectively, obtaining a first data header protocol and first voice data content corresponding to the first input voice, and a second data header protocol and second voice data content corresponding to the first reference voice;
generating a new data header protocol according to the first data header protocol and the second data header protocol;
calling the data content splicing tool to splice the first voice data content and the second voice data content to obtain new voice data content;
and encapsulating the new data header protocol and the new voice data content to obtain the second reference voice.
6. The terminal device of claim 4, wherein after comparing voiceprint features of the second input voice received in the re-detected preset operation with the first segmented voice, the processor further implements:
and if a first matching rate obtained by comparing the second input voice with the first segmented voice is equal to or greater than a first preset matching rate, determining that the second input voice is matched with the second reference voice.
7. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 3.
CN201711293919.4A 2017-12-08 2017-12-08 Speech recognition method, terminal device and computer-readable storage medium Active CN108257604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711293919.4A CN108257604B (en) 2017-12-08 2017-12-08 Speech recognition method, terminal device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711293919.4A CN108257604B (en) 2017-12-08 2017-12-08 Speech recognition method, terminal device and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN108257604A CN108257604A (en) 2018-07-06
CN108257604B true CN108257604B (en) 2021-01-08

Family

ID=62720987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711293919.4A Active CN108257604B (en) 2017-12-08 2017-12-08 Speech recognition method, terminal device and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN108257604B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110896352B (en) * 2018-09-12 2022-07-08 阿里巴巴集团控股有限公司 Identity recognition method, device and system
CN109495636A (en) 2018-10-23 2019-03-19 慈中华 Information interacting method and device
CN109584887B (en) * 2018-12-24 2022-12-02 科大讯飞股份有限公司 Method and device for generating voiceprint information extraction model and extracting voiceprint information
CN112786015A (en) * 2019-11-06 2021-05-11 阿里巴巴集团控股有限公司 Data processing method and device
CN111933183B (en) * 2020-08-17 2023-11-24 深圳一块互动网络技术有限公司 Audio identification method of Bluetooth equipment for merchants

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7222072B2 (en) * 2003-02-13 2007-05-22 Sbc Properties, L.P. Bio-phonetic multi-phrase speaker identity verification
CN101311953A (en) * 2007-05-25 2008-11-26 上海电虹软件有限公司 Network payment method and system based on voiceprint authentication
US8417525B2 (en) * 2010-02-09 2013-04-09 International Business Machines Corporation Adaptive voice print for conversational biometric engine
CN102402985A (en) * 2010-09-14 2012-04-04 盛乐信息技术(上海)有限公司 Voiceprint authentication system for improving voiceprint identification safety and method for realizing the same
CN102760434A (en) * 2012-07-09 2012-10-31 华为终端有限公司 Method for updating voiceprint feature model and terminal
CN104239456B (en) * 2014-09-02 2019-05-03 百度在线网络技术(北京)有限公司 The extracting method and device of user characteristic data
CN105575391B (en) * 2014-10-10 2020-04-03 阿里巴巴集团控股有限公司 Voiceprint information management method and device and identity authentication method and system
CN104616655B (en) * 2015-02-05 2018-01-16 北京得意音通技术有限责任公司 The method and apparatus of sound-groove model automatic Reconstruction
CN105991593B (en) * 2015-02-15 2019-08-30 阿里巴巴集团控股有限公司 A kind of method and device identifying consumer's risk
CN105049882B (en) * 2015-08-28 2019-02-22 北京奇艺世纪科技有限公司 A kind of video recommendation method and device
CN106156583A (en) * 2016-06-03 2016-11-23 深圳市金立通信设备有限公司 A kind of method of speech unlocking and terminal
CN106782564B (en) * 2016-11-18 2018-09-11 百度在线网络技术(北京)有限公司 Method and apparatus for handling voice data

Also Published As

Publication number Publication date
CN108257604A (en) 2018-07-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant