CN115376490A - Speech recognition method and apparatus, and electronic device

Speech recognition method and apparatus, and electronic device

Info

Publication number
CN115376490A
CN115376490A (application CN202211000406.0A); granted as CN115376490B
Authority
CN
China
Prior art keywords
language
voice
speech
fragment
target
Prior art date
Legal status
Granted
Application number
CN202211000406.0A
Other languages
Chinese (zh)
Other versions
CN115376490B (en)
Inventor
孙天宇 (Sun Tianyu)
陈泉 (Chen Quan)
杨晶生 (Yang Jingsheng)
Current Assignee
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd
Priority to CN202211000406.0A
Publication of CN115376490A
Application granted
Publication of CN115376490B
Active legal status
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/005: Language recognition
    • G10L15/26: Speech to text systems
    • G10L15/28: Constructional details of speech recognition systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a speech recognition method, a speech recognition apparatus, and an electronic device, and in particular to the field of speech recognition technology. The method comprises the following steps: continuously receiving a voice stream and performing speech recognition with a first speech recognition engine corresponding to a first language; when a first speech segment whose duration is greater than or equal to a first preset duration is received, performing language detection on the first speech segment; in response to detecting that a second language to which the first speech segment belongs differs from the most recently detected first language, determining a conversion moment between the first language and the second language; and performing speech recognition on the target speech after the conversion moment with a second speech recognition engine corresponding to the second language to obtain target text content. The disclosed embodiments address the problem of part of the speech being recognized inaccurately.

Description

Speech recognition method and apparatus, and electronic device
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method and apparatus, and an electronic device.
Background
Conferences conducted in multiple languages are now increasingly common. In a conference system with a captioning function, once a participant joins a conference, the system continuously performs language detection on the speaker's voice in order to generate captions. When a different language is detected, the corresponding automatic speech recognition (ASR) engine is used for recognition. To ensure the accuracy of language detection, speech usually has to be accumulated for a certain duration (typically 3 to 5 seconds) before detection can run; when a speaker switches languages during a conference, this accumulation delay means part of the audio may be recognized by the wrong ASR engine, making part of the recognition inaccurate.
Disclosure of Invention
To solve, or at least partially solve, the above technical problems, the present disclosure provides a speech recognition method, apparatus, and electronic device that can correct speech recognition errors caused by using the wrong speech recognition engine, improving the accuracy of speech recognition.
In order to achieve the above purpose, the technical solutions provided by the embodiments of the present disclosure are as follows:
in a first aspect, a speech recognition method is provided, including:
continuously receiving a voice stream, and performing speech recognition with a first speech recognition engine corresponding to a first language;
when a first speech segment whose duration is greater than or equal to a first preset duration is received, performing language detection on the first speech segment;
in response to detecting that a second language to which the first speech segment belongs differs from the most recently detected first language, determining a conversion moment between the first language and the second language;
and performing speech recognition on the target speech after the conversion moment with a second speech recognition engine corresponding to the second language to obtain target text content.
In some embodiments, determining the conversion moment between the first language and the second language includes:
when no other temporally continuous speech segment exists before the first speech segment, determining the starting moment of the first speech segment as the conversion moment.
In some embodiments, determining the conversion moment between the first language and the second language includes:
when other temporally continuous speech segments exist before the first speech segment, sequentially selecting at least one historical speech segment, starting from the moment the second language was detected, according to a first sliding step length and a first time window length, and determining a target speech segment from the at least one historical speech segment, where the target speech segment is the historical speech segment closest to the current moment whose language detection result is not the second language;
determining the conversion moment based on a first historical speech segment, where the first historical speech segment is the historical speech segment selected immediately before the target speech segment.
In some embodiments, the target speech segment satisfies any one of the following conditions:
the probability that the target speech segment is in the second language is less than or equal to a preset probability threshold;
the probability that the target speech segment is in the first language is greater than the preset probability threshold;
the probability that the target speech segment is in any language is less than or equal to the preset probability threshold.
In some embodiments, the first time window length is less than or equal to the first preset duration, and the first sliding step length is less than or equal to the first time window length.
In some embodiments, determining the starting moment of the first historical speech segment selected before the target speech segment as the conversion moment includes:
determining a first probability that the first historical speech segment is in the second language;
determining a second probability that a second historical speech segment is in the second language, where the second historical speech segment is the historical speech segment selected immediately before the first historical speech segment;
and when the difference between the first probability and the second probability is smaller than a preset difference, determining the starting moment of the first historical speech segment as the conversion moment.
In some embodiments, the voice stream is the voice stream corresponding to any participant device in the conference, or the voice stream corresponding to any individual participant in the conference.
In some embodiments, the method further comprises: presenting the result of the speech recognition on a client interface.
In some embodiments, after performing speech recognition on the target speech after the conversion moment with the second speech recognition engine corresponding to the second language to obtain the target text content, the method further includes:
replacing a first recognition result based on the target text content, where the first recognition result is the speech recognition result obtained by the first speech recognition engine after the conversion moment.
In a second aspect, a speech recognition apparatus is provided, including:
a receiving module, configured to continuously receive a voice stream;
a speech recognition module, configured to perform speech recognition with a first speech recognition engine corresponding to a first language;
a language detection module, configured to perform language detection on a first speech segment when a first speech segment whose duration is greater than or equal to a first preset duration is received;
the speech recognition module is further configured to:
determine, in response to detecting that a second language to which the first speech segment belongs differs from the most recently detected first language, a conversion moment between the first language and the second language;
and perform speech recognition on the target speech after the conversion moment with a second speech recognition engine corresponding to the second language to obtain target text content.
In some embodiments, the speech recognition module is specifically configured to:
determine the starting moment of the first speech segment as the conversion moment when no other temporally continuous speech segment exists before the first speech segment.
In some embodiments, the speech recognition module is specifically configured to:
select, when other temporally continuous speech segments exist before the first speech segment, at least one historical speech segment sequentially, starting from the moment the second language was detected, according to a first sliding step length and a first time window length, and determine a target speech segment from the at least one historical speech segment, where the target speech segment is the historical speech segment closest to the current moment whose language detection result is not the second language;
determine the conversion moment based on a first historical speech segment, where the first historical speech segment is the historical speech segment selected immediately before the target speech segment.
In some embodiments, the target speech segment satisfies any one of the following conditions:
the probability that the target speech segment is in the second language is less than or equal to a preset probability threshold;
the probability that the target speech segment is in the first language is greater than the preset probability threshold;
the probability that the target speech segment is in any language is less than or equal to the preset probability threshold.
In some embodiments, the first time window length is less than or equal to the first preset duration, and the first sliding step length is less than or equal to the first time window length.
In some embodiments, the speech recognition module is specifically configured to:
determine a first probability that the first historical speech segment is in the second language;
determine a second probability that a second historical speech segment is in the second language, where the second historical speech segment is the historical speech segment selected immediately before the first historical speech segment;
and when the difference between the first probability and the second probability is smaller than a preset difference, determine the starting moment of the first historical speech segment as the conversion moment.
In some embodiments, the voice stream is the voice stream corresponding to any participant device in the conference, or the voice stream corresponding to any individual participant in the conference.
In some embodiments, the apparatus further comprises a recognition result processing module, configured to present the result of the speech recognition on a client interface.
In some embodiments, the recognition result processing module is further configured to replace, based on the target text content, a first recognition result after the speech recognition module performs speech recognition on the target speech after the conversion moment with the second speech recognition engine corresponding to the second language to obtain the target text content, where the first recognition result is the speech recognition result obtained by the first speech recognition engine after the conversion moment.
In a third aspect, an electronic device is provided, including: a processor, a memory and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the speech recognition method according to the first aspect or any one of its alternative embodiments.
In a fourth aspect, a computer-readable storage medium is provided, comprising: the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements a speech recognition method according to the first aspect or any one of its alternative embodiments.
In a fifth aspect, there is provided a computer program product comprising: when the computer program product runs on a computer, the computer is caused to implement the speech recognition method according to the first aspect or any one of its alternative embodiments.
The speech recognition method provided by the embodiments of the present disclosure continuously receives a voice stream and performs speech recognition with a first speech recognition engine corresponding to a first language; when a first speech segment whose duration is greater than or equal to a first preset duration is received, it performs language detection on the first speech segment; in response to detecting that a second language to which the first speech segment belongs differs from the most recently detected first language, it determines a conversion moment between the first language and the second language; and it performs speech recognition on the target speech after the conversion moment with a second speech recognition engine corresponding to the second language to obtain target text content. With this scheme, language detection is performed on a first speech segment longer than the first preset duration, and a language conversion can be identified when the first speech segment is detected as the second language while the previous detection result was the first language. After the conversion moment between the first language and the second language is determined, the target speech received after that moment, which may have been recognized by the wrong engine, is recognized again by the second speech recognition engine; this corrects the speech recognition errors previously introduced by the first speech recognition engine and improves the accuracy of speech recognition.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; it will be apparent that other drawings can be derived from these drawings by those skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of an implementation scenario provided by an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram for determining a conversion moment according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of another way of determining a conversion moment according to an embodiment of the present disclosure;
fig. 5 is a schematic diagram illustrating comparison of recognition results provided by an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features, and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure are further described below. It should be noted that, in the absence of conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth to facilitate a thorough understanding of the present disclosure; however, the present disclosure may also be practiced in other ways than those described herein. Obviously, the embodiments described in the specification are only some, not all, of the embodiments of the present disclosure.
In the related art, conducting conferences in multiple languages is becoming increasingly common. In a conference system with a captioning function, once a participant joins a conference, the system continuously performs language detection on the speaker's voice in order to generate captions. When a different language is detected, the corresponding automatic speech recognition (ASR) engine is used for recognition. To ensure the accuracy of language detection, speech usually has to be accumulated for a certain duration (typically 3 to 5 seconds) before detection can run; when a speaker switches languages during a conference, this accumulation delay means part of the audio may be recognized by the wrong ASR engine, making part of the recognition inaccurate.
To solve the foregoing problems, embodiments of the present disclosure provide a speech recognition method, apparatus, and electronic device. Language detection is performed on a first speech segment longer than a first preset duration; when the first speech segment is detected as a second language while the previous detection result was the first language, it is determined that a language conversion has occurred, and the conversion moment between the first language and the second language is determined. Because the target speech received after the conversion moment may have been recognized by the wrong engine, it is recognized again by a second speech recognition engine, which corrects the speech recognition errors previously introduced by the first speech recognition engine and improves the accuracy of speech recognition.
Fig. 1 is a schematic diagram of an implementation scenario provided by an embodiment of the present disclosure. The scenario involves a server 10 and three electronic devices: electronic device 11, electronic device 12, and electronic device 13, through which users A, B, and C respectively attend an online conference. After user A enters the current conference with electronic device 11, either the electronic device 11 itself may continuously perform language detection and carry out speech recognition with the corresponding speech recognition engine based on the speech recognition method provided by the embodiments of the present disclosure, or the server may receive the voice stream of electronic device 11 and do the same for that voice stream.
The speech recognition method provided by the embodiments of the present disclosure can be applied to a speech recognition apparatus or an electronic device, where the speech recognition apparatus may be a functional module or functional entity within the electronic device capable of implementing the method.
It should be understood that, before or during the application of the technical solutions disclosed in the embodiments of the present disclosure, the user should be informed, in an appropriate manner and in accordance with the relevant laws and regulations, of the type, scope of use, and usage scenarios of the personal information involved (e.g., user information, user voice), and the user's authorization should be obtained.
For example, the embodiments of the present disclosure involve acquiring the voice streams of speakers at participant devices in a conference scenario. In practical applications, before a speaker's voice stream is acquired, user authorization may be requested to allow the voice stream to be acquired and used for speech recognition and language detection.
The electronic device may be a server, a tablet computer, a mobile phone, a notebook computer, a palmtop computer, a vehicle-mounted terminal, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a personal computer (PC), or the like, which is not limited by this disclosure.
Fig. 2 is a schematic flowchart of a speech recognition method according to an embodiment of the present disclosure, where the method may include the following steps:
201. Continuously receive the voice stream.
The speech recognition method provided by the embodiments of the present disclosure can be applied to speech recognition in a conference scenario, for example, recognizing the voice stream of a participant device in the conference, or recognizing the voice stream of an individual participant.
In some embodiments, the voice stream is the voice stream corresponding to any participant device in the conference, or the voice stream corresponding to any individual participant in the conference.
202. For the continuously received voice stream, perform speech recognition with a first speech recognition engine corresponding to a first language.
The first language may be the language determined by the most recent language detection, or the default language at the beginning of the conference.
Different languages usually require different speech recognition engines; the first speech recognition engine is the engine that recognizes the first language, as in the hypothetical registry sketched below.
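For illustration only, a per-language engine registry might look like the following minimal Python sketch; the `StubASREngine` class and the language codes are assumptions, not part of this disclosure:

```python
class StubASREngine:
    """Placeholder recognizer; a real system would wrap an actual ASR model."""

    def __init__(self, language: str):
        self.language = language

    def recognize(self, audio: bytes) -> str:
        # Stand-in output; a real engine would return the transcript.
        return f"<{self.language} transcript of {len(audio)} bytes>"


# One engine per language: the "first" and "second" speech recognition
# engines of the disclosure are simply looked up by language code.
asr_engines = {
    "zh": StubASREngine("zh"),  # engine for the first language (e.g. Chinese)
    "en": StubASREngine("en"),  # engine for the second language (e.g. English)
}
```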
203. When a first speech segment whose duration is greater than or equal to a first preset duration is received, perform language detection on the first speech segment.
The first preset duration is the configured shortest accumulation duration of speech at which language detection is allowed to run.
In the embodiments of the present disclosure, while the voice stream is being continuously received, once a first speech segment whose duration is greater than or equal to the first preset duration has been received, enough speech has accumulated for language detection to run reliably, so language detection is performed on the first speech segment. A minimal sketch of steps 201 to 203 follows.
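This sketch rests on stated assumptions: `stream` yields objects carrying `.audio` and `.duration`, `detect_language` is the language classifier, `asr_engines` is a registry like the one above, and `on_language_switch` stands in for steps 204 to 206; none of these names come from the disclosure.

```python
FIRST_PRESET_DURATION = 3.0  # seconds; an assumed value for the first preset duration

def run_recognition(stream, asr_engines, detect_language, on_language_switch,
                    first_language="zh"):
    """Step 201: continuously receive the voice stream. Step 202: recognize
    with the engine of the current language. Step 203: once at least the
    first preset duration has accumulated, run language detection."""
    current_language = first_language
    buffered_chunks, buffered_seconds = [], 0.0
    for chunk in stream:
        caption = asr_engines[current_language].recognize(chunk.audio)
        print(caption)  # stand-in for presenting the result on a client interface
        buffered_chunks.append(chunk)
        buffered_seconds += chunk.duration
        if buffered_seconds >= FIRST_PRESET_DURATION:
            detected = detect_language(buffered_chunks)
            if detected != current_language:
                # Steps 204-206: determine the conversion moment and
                # re-recognize the target speech with the new engine.
                on_language_switch(current_language, detected, buffered_chunks)
                current_language = detected
            buffered_chunks, buffered_seconds = [], 0.0
```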
204. Detect that the second language to which the first speech segment belongs differs from the first language obtained by the most recent language detection.
Language detection on the first speech segment yields the result that the first speech segment belongs to a second language, where the second language is a language different from the first language.
For example, the first language is Chinese and the second language is English, or the first language is English and the second language is Chinese.
205. Determine the conversion moment between the first language and the second language.
In response to detecting that the second language to which the first speech segment belongs differs from the most recently detected first language, it can be concluded that a language conversion from the first language to the second language has occurred, which triggers determining the conversion moment between the first language and the second language.
In some embodiments, after the language of the first speech segment is detected to be the second language, the second speech recognition engine corresponding to the second language may be used to recognize the subsequently received voice stream, i.e., the voice stream received after the detection moment of the second language.
Since the language has converted from the first language to the second language at this point, selecting the second speech recognition engine for the voice stream received after the first speech segment ensures that the correct engine accurately recognizes the subsequent speech.
206. Perform speech recognition on the target speech after the conversion moment with the second speech recognition engine corresponding to the second language to obtain target text content.
Because the conversion moment is the moment at which the first language converts into the second language, the target speech after that moment is entirely in the second language; performing speech recognition on it with the second speech recognition engine therefore yields accurate target text content.
It should be noted that obtaining the target text content may proceed in two parts, though not exclusively: after the second language is recognized from the first speech segment but before the conversion moment is determined, the subsequently received speech is recognized by the second speech recognition engine, producing one part of the text content; then, once the conversion moment is determined, the speech received between the conversion moment and the moment the second language was recognized is recognized by the second speech recognition engine, producing the other part. The two parts together form the target text content, as in the sketch below.
After the language of the first speech segment is detected to be the second language, indicating that the language has converted from the first language to the second language, the conversion moment must be determined, and the target speech received between the conversion moment and the detection moment of the second language must be recognized again by the second speech recognition engine.
In some embodiments, determining the conversion moment between the first language and the second language may include, but is not limited to: when no other temporally continuous speech segment exists before the first speech segment, determining the starting moment of the first speech segment as the conversion moment.
The absence of a temporally continuous speech segment before the first speech segment indicates that the speaker probably paused; given real speaking habits, a language switch is likely to happen after a pause, so the starting moment of the first speech segment is determined as the moment of conversion from the first language to the second language.
That is, in this case, speech recognition may be redone with the second speech recognition engine for the target speech received after the starting moment of the first speech segment, as in the sketch below.
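A minimal sketch of this first case, assuming segment objects with `.start` and `.end` attributes:

```python
def conversion_moment_after_pause(first_segment, previous_segment):
    """If no temporally continuous segment precedes the first speech segment
    (the speaker paused), its own start is the conversion moment; otherwise
    fall through to the sliding-window search described next."""
    if previous_segment is None or previous_segment.end < first_segment.start:
        return first_segment.start
    return None  # temporally continuous speech exists: use the window search
```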
Fig. 3 is a schematic diagram for determining a conversion moment according to an embodiment of the present disclosure. In the speech segment 31 received first, the detected language is Chinese, so the speech recognition engine corresponding to Chinese is used. After an interval T1 following speech segment 31, a speech segment 32 of duration T2 accumulates, starting at time t1 and ending at time t2; at time t2 the language is detected to be English, so the English speech recognition engine is used from t2 onward. However, the part of speech segment 32 between t1 and t2 is actually English but was recognized by the Chinese engine, so that part must be recognized again by the English speech recognition engine in order to recognize it correctly, improving the accuracy of speech recognition.
In some embodiments, determining the conversion moment between the first language and the second language may include, but is not limited to: when other temporally continuous speech segments exist before the first speech segment, sequentially selecting at least one historical speech segment, starting from the moment the second language was detected, according to a first sliding step length and a first time window length, and determining a target speech segment from the at least one historical speech segment, where the target speech segment is the historical speech segment closest to the current moment whose language detection result is not the second language; and determining the starting moment of the first historical speech segment, the one selected immediately before the target speech segment, as the conversion moment.
When other temporally continuous speech segments exist before the first speech segment, the conversion moment from the first language to the second language cannot be determined directly. In that case, at least one historical speech segment is selected sequentially, from near to far in time, starting from the detection moment of the second language, and the language of each is detected, until a target speech segment that is not in the second language is found. The first historical speech segment selected before the target speech segment is then the earliest segment that can still be detected as the second language, so its starting moment is determined as the conversion moment.
That is, in this case, speech recognition may be redone by the second speech recognition engine for the target speech received after the starting moment of the first historical speech segment. A sketch of this backward sliding-window search follows.
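The sketch assumes a `history` object with a `.start` attribute and a `.slice(begin, end)` audio accessor; both are illustrative conventions, not part of the disclosure.

```python
def conversion_moment_by_window_search(history, detection_time, step, window,
                                       second_language, detect_language):
    """Walk backward from the detection moment of the second language in
    strides of `step`, classifying windows of length `window`, until the most
    recent window whose detected language is NOT the second language (the
    target speech segment) is found; the window selected just before it (the
    first historical speech segment) marks the conversion moment."""
    t = detection_time
    previous_window_start = None  # start of the previously selected window
    while t - window >= history.start:
        segment = history.slice(t - window, t)
        if detect_language(segment) != second_language:
            # `segment` is the target speech segment; convert at the start of
            # the window chosen just before it (fall back to t if none).
            return previous_window_start if previous_window_start is not None else t
        previous_window_start = t - window
        t -= step  # slide one step further into the past
    return history.start  # everything scanned was in the second language
```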
In some embodiments, the target speech segment satisfies any one of the following conditions:
condition 1: the probability that the target speech segment is in the second language is less than or equal to a preset probability threshold;
condition 2: the probability that the target speech segment is in the first language is greater than the preset probability threshold;
condition 3: the probability that the target speech segment is in any language is less than or equal to the preset probability threshold.
Language detection works as follows: a classifier receives a speech segment of a certain data volume and outputs, for that segment, a probability for each language; when the probability of a target language exceeds the preset probability threshold, the segment's detected language is taken to be that language, and if every language's probability is below the threshold, no language can be detected. Under condition 1, the probability that the target speech segment is in the second language is at most the preset probability threshold, meaning the segment is not in the second language. Under condition 2, the probability that it is in the first language exceeds the threshold, meaning the segment is in the first language. Under condition 3, the probability for every language is at most the threshold, which indicates that multiple languages may be mixed in the segment and no single language can be detected. A sketch of these condition checks follows.
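The threshold value and the `probs` mapping below are assumptions made for illustration:

```python
PROB_THRESHOLD = 0.8  # an assumed value for the preset probability threshold

def is_target_segment(probs, first_language, second_language):
    """`probs` maps each language code to the classifier probability for one
    historical speech segment; the segment qualifies as the target segment
    if it satisfies any one of conditions 1 to 3."""
    cond1 = probs[second_language] <= PROB_THRESHOLD          # not the second language
    cond2 = probs[first_language] > PROB_THRESHOLD            # clearly the first language
    cond3 = all(p <= PROB_THRESHOLD for p in probs.values())  # mixed: nothing detectable
    return cond1 or cond2 or cond3
```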
For example, fig. 4 is another schematic diagram for determining a conversion moment according to an embodiment of the present disclosure, in which a temporally continuous Chinese speech segment 41 exists before an English speech segment 42. During detection, three outcomes are therefore possible for any given segment: the detected language is Chinese; Chinese and English are mixed and no language can be detected; or the detected language is English. In this case, before the detection moment T3 of English in fig. 4, the conversion moment between Chinese and English cannot be determined directly. At least one historical speech segment is then selected sequentially, from near to far in time, according to a step length S and a time window ΔT; the most recent target speech segment 43, whose language cannot be detected because of Chinese-English mixing, is determined from these historical segments; the historical speech segment selected just before the target speech segment 43, namely the first historical speech segment 44, is identified; and the starting moment T4 of the first historical speech segment 44 is determined as the conversion moment.
In some embodiments, determining the conversion moment between the first language and the second language may include, but is not limited to: when other temporally continuous speech segments exist before the first speech segment, sequentially selecting at least one historical speech segment, from near to far in time starting from the detection moment of the second language, according to the first sliding step length and the first time window, and determining from the at least one historical speech segment the most recent target speech segment whose language detection result is not the second language; determining a first probability that the first historical speech segment, the one selected immediately before the target speech segment, is in the second language; determining a second probability that the second historical speech segment, the one selected immediately before the first historical speech segment, is in the second language; and, when the difference between the first probability and the second probability is smaller than a preset difference, determining the starting moment of the first historical speech segment as the conversion moment.
Normally, when historical speech segments are selected sequentially from near to far in time according to the first sliding step length and the first time window, and the segments before the target speech segment are continuous second-language speech, the segments selected earlier (closer to the detection moment) are more purely in the second language, while segments selected later are more likely to contain first-language speech mixed in. The first probability is therefore expected to be less than or equal to the second probability. In the embodiment above, the starting moment of the first historical speech segment is accepted as the conversion moment only when the difference between the two probabilities is smaller than the preset difference, that is, only when the first probability has not dropped noticeably relative to the second probability, which improves the accuracy of determining the conversion moment. A minimal sketch of this check follows.
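The sketch assumes the difference is measured as the drop from the second probability to the first, and the threshold value is an assumption:

```python
PRESET_DIFFERENCE = 0.1  # an assumed value for the preset difference

def confirm_conversion_moment(p_first_hist, p_second_hist, first_hist_start):
    """p_first_hist: probability that the first historical segment (selected
    just before the target segment) is the second language; p_second_hist:
    the same probability for the segment selected before it. Accept the start
    of the first historical segment as the conversion moment only when the
    first probability has not dropped noticeably."""
    if p_second_hist - p_first_hist < PRESET_DIFFERENCE:
        return first_hist_start
    return None  # large drop: refine inside the first historical segment
```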
Correspondingly, when the difference between the first probability and the second probability is greater than or equal to the preset difference, the first probability has dropped noticeably relative to the second, indicating that an audio span in the first language may lie inside the first historical speech segment itself. In that case, at least one sub-segment is selected sequentially, from near to far in time starting from the end moment of the first historical speech segment, according to a second step length and a second time window, and the language of each sub-segment is detected until a target sub-segment that is not in the second language is found. The sub-segment selected just before it, namely the first sub-segment, is then the most recent span that can still be detected as the second language, and its starting moment is determined as the conversion moment, as in the sketch below.
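A sketch of that finer-grained second pass, reusing the hypothetical `.slice` accessor from the earlier search:

```python
def refine_conversion_moment(first_hist, second_step, second_window,
                             second_language, detect_language):
    """When the probability drop is large, first-language audio may lie inside
    the first historical segment itself; scan it backward from its end with
    the (typically finer) second step and window until a sub-segment that is
    not the second language appears, and convert at the start of the
    sub-segment selected just before it."""
    t = first_hist.end
    previous_sub_start = None
    while t - second_window >= first_hist.start:
        sub = first_hist.slice(t - second_window, t)
        if detect_language(sub) != second_language:
            return previous_sub_start if previous_sub_start is not None else first_hist.start
        previous_sub_start = t - second_window
        t -= second_step
    return first_hist.start
```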
The foregoing embodiments provide ways of determining the moment of conversion from the first language to the second language for different situations. When the speech recognition method provided by the embodiments of the present disclosure is implemented in practice, the actual situation can be detected and the conversion moment determined in the corresponding manner, so the method applies to a variety of implementation scenarios.
In some embodiments, before the second speech recognition engine corresponding to the second language recognizes the target speech after the conversion moment to obtain the target text content, the first speech recognition engine corresponding to the first language has typically already recognized the target speech and produced first text content.
After the target speech is recognized by the second speech recognition engine, text content in the second language is obtained; this text content then replaces the first-language text content previously produced for the target speech by the first speech recognition engine and serves as the caption content for the target speech, as in the sketch below.
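A sketch of the replacement step; the `ui` object and its `replace_captions` method are assumptions about the client interface, not an API from this disclosure:

```python
def correct_captions(ui, second_engine, target_audio, conversion_moment):
    """Re-recognize the target speech with the second engine and replace the
    first-language text shown since the conversion moment."""
    target_text = second_engine.recognize(target_audio)
    ui.replace_captions(since=conversion_moment, new_text=target_text)
```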
In the embodiments of the present disclosure, after speech recognition is performed, the recognition result may be presented on the client interface.
In some embodiments, after the second speech recognition engine corresponding to the second language recognizes the target speech after the conversion moment and obtains the target text content, the first recognition result may further be replaced based on the target text content, where the first recognition result was obtained by the first speech recognition engine after the conversion moment.
In the embodiment above, the text content that was recognized and displayed by the engine of the pre-conversion language is replaced with the text content recognized by the engine of the post-conversion language, which prevents the text displayed on the client interface from remaining incorrect because of the earlier misrecognition.
Fig. 5 is a schematic comparison of speech recognition results. For a Chinese speech segment from a speaker A, part (a) of fig. 5 shows the text produced by the English speech recognition (ASR) engine, a garbled result along the lines of "She likes doing his own word heading lead text of …", while part (b) shows the text produced by the Chinese ASR engine, roughly "people who are unwilling to wait and yield, people who are of our own flesh and blood …". The text in (b), recognized by the engine matching the language the speaker is actually using, contains no obvious errors and reads correctly, whereas the text in (a), recognized by an engine that does not match the speaker's language, is disfluent and carries no clear meaning. Only when the engine matching the speaker's language is used can the correct text content ultimately be recognized.
The speech recognition method provided by the embodiments of the present disclosure thus determines the moment of language conversion and, after that moment, performs speech recognition with the engine corresponding to the post-conversion language, so that the correct text content is obtained.
As shown in fig. 6, an embodiment of the present disclosure provides a speech recognition apparatus, including:
a receiving module 601, configured to continuously receive a voice stream;
a speech recognition module 602, configured to perform speech recognition with a first speech recognition engine corresponding to a first language;
a language detection module 603, configured to perform language detection on a first speech segment when a first speech segment whose duration is greater than or equal to a first preset duration is received;
the speech recognition module 602 is further configured to:
determine, in response to detecting that a second language to which the first speech segment belongs differs from the most recently detected first language, a conversion moment between the first language and the second language;
and perform speech recognition on the target speech after the conversion moment with a second speech recognition engine corresponding to the second language to obtain target text content.
In some embodiments, the speech recognition module 602 is specifically configured to:
determine the starting moment of the first speech segment as the conversion moment when no other temporally continuous speech segment exists before the first speech segment.
In some embodiments, the speech recognition module 602 is specifically configured to:
select, when other temporally continuous speech segments exist before the first speech segment, at least one historical speech segment sequentially, starting from the moment the second language was detected, according to a first sliding step length and a first time window length, and determine a target speech segment from the at least one historical speech segment, where the target speech segment is the historical speech segment closest to the current moment whose language detection result is not the second language;
determine the conversion moment based on a first historical speech segment, where the first historical speech segment is the historical speech segment selected immediately before the target speech segment.
In some embodiments, the target speech segment satisfies any one of the following conditions:
the probability that the target speech segment is in the second language is less than or equal to a preset probability threshold;
the probability that the target speech segment is in the first language is greater than the preset probability threshold;
the probability that the target speech segment is in any language is less than or equal to the preset probability threshold.
In some embodiments, the first time window length is less than or equal to the first preset duration, and the first sliding step length is less than or equal to the first time window length.
In some embodiments, the speech recognition module 602 is specifically configured to:
determine a first probability that the first historical speech segment is in the second language;
determine a second probability that a second historical speech segment is in the second language, where the second historical speech segment is the historical speech segment selected immediately before the first historical speech segment;
and when the difference between the first probability and the second probability is smaller than a preset difference, determine the starting moment of the first historical speech segment as the conversion moment.
In some embodiments, the apparatus further comprises a recognition result processing module 604, configured to present the result of the speech recognition on a client interface.
In some embodiments, the recognition result processing module 604 is further configured to replace, based on the target text content, a first recognition result after the speech recognition module 602 performs speech recognition on the target speech after the conversion moment with the second speech recognition engine corresponding to the second language to obtain the target text content, where the first recognition result is the speech recognition result obtained by the first speech recognition engine after the conversion moment.
In some embodiments, the voice stream is the voice stream corresponding to any participant device in the conference, or the voice stream corresponding to any individual participant in the conference.
As shown in fig. 7, an embodiment of the present disclosure provides an electronic device, including: a processor 701, a memory 702, and a computer program stored on the memory 702 and executable on the processor 701. When executed by the processor 701, the computer program implements the respective processes of the speech recognition method in the above method embodiments and can achieve the same technical effects; to avoid repetition, the details are not repeated here.
The embodiment of the present disclosure provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the speech recognition method in the foregoing method embodiments, and can achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
An embodiment of the present disclosure provides a computer program product storing a computer program which, when executed by a processor, implements each process of the speech recognition method in the foregoing method embodiments and can achieve the same technical effects; to avoid repetition, the details are not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.
In the present disclosure, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In the present disclosure, the memory may include a volatile memory in a computer-readable medium, a random access memory (RAM), and/or a non-volatile memory such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.
In the present disclosure, computer-readable media include permanent and non-permanent, removable and non-removable storage media. Storage media may implement information storage by any method or technology, and the information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A speech recognition method, comprising:
continuously receiving a voice stream, and performing speech recognition with a first speech recognition engine corresponding to a first language;
when a first speech segment whose duration is greater than or equal to a first preset duration is received, performing language detection on the first speech segment;
in response to detecting that a second language to which the first speech segment belongs differs from the first language obtained by the most recent language detection, determining a conversion moment between the first language and the second language;
and performing speech recognition on the target speech after the conversion moment with a second speech recognition engine corresponding to the second language to obtain target text content.
2. The method of claim 1, wherein determining the conversion moment between the first language and the second language comprises:
determining the starting moment of the first speech segment as the conversion moment when no other temporally continuous speech segment exists before the first speech segment.
3. The method of claim 1, wherein determining the conversion moment between the first language and the second language comprises:
when other temporally continuous speech segments exist before the first speech segment, sequentially selecting at least one historical speech segment, starting from the moment the second language was detected, according to a first sliding step length and a first time window length, and determining a target speech segment from the at least one historical speech segment, wherein the target speech segment is the historical speech segment closest to the current moment whose language detection result is not the second language;
determining the conversion moment based on a first historical speech segment, wherein the first historical speech segment is the historical speech segment selected immediately before the target speech segment.
4. The method of claim 3, wherein the target speech segment satisfies any one of the following conditions:
the probability that the target speech segment is in the second language is less than or equal to a preset probability threshold;
the probability that the target speech segment is in the first language is greater than the preset probability threshold;
the probability that the target speech segment is in any language is less than or equal to the preset probability threshold.
5. The method of claim 3, wherein the first time window length is less than or equal to the first preset duration, and the first sliding step length is less than or equal to the first time window length.
6. The method of claim 3, wherein said determining the transition time based on the first historical speech segment comprises:
determining a first probability that the first historical speech segment is in the second language;
determining a second probability that a second historical speech segment is in the second language, wherein the second historical speech segment is the speech segment immediately preceding the first historical speech segment;
and determining a start time of the first historical speech segment as the transition time when a difference between the first probability and the second probability is less than a preset difference value (see the transition-time sketch after claim 12).
7. The method of claim 1, further comprising: presenting a recognition result of the speech recognition on a client interface.
8. The method according to claim 7, wherein after performing speech recognition on the target speech after the transition time using the second speech recognition engine corresponding to the second language to obtain the target text content, the method further comprises:
replacing a first recognition result based on the target text content, wherein the first recognition result is a speech recognition result obtained by the first speech recognition engine after the transition time.
9. The method according to any one of claims 1 to 8, wherein the speech stream is a speech stream corresponding to any participant device in a conference, or a speech stream corresponding to any participant in a conference.
10. A speech recognition apparatus, comprising:
a receiving module configured to continuously receive a speech stream;
a speech recognition module configured to perform speech recognition using a first speech recognition engine corresponding to a first language;
a language detection module configured to perform language detection on a first speech segment in a case where the first speech segment, whose duration is greater than or equal to a first preset duration, is received;
wherein the speech recognition module is further configured to:
in response to detecting that a second language to which the first speech segment belongs is different from the first language obtained by the most recent language detection, determine a transition time between the first language and the second language;
and perform speech recognition on target speech after the transition time using a second speech recognition engine corresponding to the second language to obtain target text content.
11. An electronic device, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the speech recognition method according to any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 9.
CN202211000406.0A 2022-08-19 2022-08-19 Voice recognition method and device and electronic equipment Active CN115376490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211000406.0A CN115376490B (en) 2022-08-19 2022-08-19 Voice recognition method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211000406.0A CN115376490B (en) 2022-08-19 2022-08-19 Voice recognition method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN115376490A 2022-11-22
CN115376490B (en) 2024-07-30

Family

ID=84064781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211000406.0A Active CN115376490B (en) 2022-08-19 2022-08-19 Voice recognition method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115376490B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185375A (en) * 2015-08-10 2015-12-23 联想(北京)有限公司 Information processing method and electronic equipment
US9812130B1 (en) * 2014-03-11 2017-11-07 Nvoq Incorporated Apparatus and methods for dynamically changing a language model based on recognized text
US20180182396A1 (en) * 2016-12-12 2018-06-28 Sorizava Co., Ltd. Multi-speaker speech recognition correction system
US20190096399A1 (en) * 2017-09-27 2019-03-28 Hitachi Information & Telecommunication Engineering, Ltd. Call voice processing system and call voice processing method
US20190096396A1 (en) * 2016-06-16 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd. Multiple Voice Recognition Model Switching Method And Apparatus, And Storage Medium
CN110634487A (en) * 2019-10-24 2019-12-31 科大讯飞股份有限公司 Bilingual mixed speech recognition method, device, equipment and storage medium
CN110853618A (en) * 2019-11-19 2020-02-28 腾讯科技(深圳)有限公司 Language identification method, model training method, device and equipment
CN111128126A (en) * 2019-12-30 2020-05-08 上海浩琨信息科技有限公司 Multi-language intelligent voice conversation method and system
CN111613208A (en) * 2020-05-22 2020-09-01 云知声智能科技股份有限公司 Language identification method and equipment
CN111798836A (en) * 2020-08-03 2020-10-20 上海茂声智能科技有限公司 Method, device, system, equipment and storage medium for automatically switching languages
CN111986655A (en) * 2020-08-18 2020-11-24 北京字节跳动网络技术有限公司 Audio content identification method, device, equipment and computer readable medium
CN114239613A (en) * 2022-02-23 2022-03-25 阿里巴巴达摩院(杭州)科技有限公司 Real-time voice translation method, device, equipment and storage medium
CN114373446A (en) * 2021-12-28 2022-04-19 北京字跳网络技术有限公司 Conference language determination method and device and electronic equipment

Also Published As

Publication number Publication date
CN115376490B (en) 2024-07-30

Similar Documents

Publication Publication Date Title
US11127416B2 (en) Method and apparatus for voice activity detection
CN107437416B (en) Consultation service processing method and device based on voice recognition
US9922640B2 (en) System and method for multimodal utterance detection
CN108962233B (en) Voice conversation processing method and system for voice conversation platform
KR101255402B1 (en) Redictation of misrecognized words using a list of alternatives
CN110210310B (en) Video processing method and device for video processing
CN111145756B (en) Voice recognition method and device for voice recognition
US20160078020A1 (en) Speech translation apparatus and method
CN110610700B (en) Decoding network construction method, voice recognition method, device, equipment and storage medium
US10850745B2 (en) Apparatus and method for recommending function of vehicle
KR20170045709A (en) Speech endpointing
JPWO2005122144A1 (en) Speech recognition apparatus, speech recognition method, and program
CN107274903B (en) Text processing method and device for text processing
CN111369978B (en) Data processing method and device for data processing
US11120839B1 (en) Segmenting and classifying video content using conversation
CN108628819B (en) Processing method and device for processing
CN112487381A (en) Identity authentication method and device, electronic equipment and readable storage medium
CN111031329A (en) Method, apparatus and computer storage medium for managing audio data
CN114842849B (en) Voice dialogue detection method and device
CN111640452A (en) Data processing method and device and data processing device
CN115376490B (en) Voice recognition method and device and electronic equipment
CN111681646A (en) Universal scene Chinese Putonghua speech recognition method of end-to-end architecture
CN114373446A (en) Conference language determination method and device and electronic equipment
US10841411B1 (en) Systems and methods for establishing a communications session
CN111970311B (en) Session segmentation method, electronic device and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant