CN115410578A

CN115410578A - Processing method of voice recognition, processing system thereof, vehicle and readable storage medium

Info

Publication number: CN115410578A
Application number: CN202211327016.4A
Authority: CN
Inventors: 韩森淼; 郭华鹏; 张岩
Original assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Current assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date: 2022-10-27
Filing date: 2022-10-27
Publication date: 2022-11-29

Abstract

The invention discloses a processing method of voice recognition, a processing system thereof, a vehicle and a readable storage medium. The processing method of voice recognition comprises the following steps: acquiring a voice to be recognized; performing voice recognition on the voice to be recognized to obtain a voice recognition result, wherein the voice recognition result comprises a first recognition result and a second recognition result, the first recognition result is the final result of the vehicle end of a vehicle recognizing the voice to be recognized, and the second recognition result is the final result of the cloud recognizing the voice to be recognized; and displaying one of the first recognition result and the second recognition result according to a display strategy formed by the network environment and the confidence coefficient of the voice recognition result. Because the first recognition result or the second recognition result is displayed according to this display strategy, the respective voice recognition advantages of the local end and the cloud are fully exerted, the cloud result and the local result are fused in a streaming manner for on-screen display, and a smooth and accurate on-screen display of the voice recognition result is provided.

Description

Processing method of voice recognition, processing system thereof, vehicle and readable storage medium

Technical Field

The present invention relates to the field of vehicle voice recognition technology, and in particular, to a processing method and a processing system for voice recognition, a vehicle, and a readable storage medium.

Background

In the related art, the ASR (Automatic Speech Recognition) result of a Query spoken by a user is displayed on a screen. The on-screen effect of in-vehicle ASR is very important because it is the part of the vehicle-mounted dialog system that the user can see. Generally speaking, cloud ASR has strong computing power and good recognition quality, but it returns streaming results more slowly than local ASR because of the network delay overhead of device-cloud interaction.

Disclosure of Invention

The invention provides a processing method of voice recognition, a processing system of the voice recognition, a vehicle and a readable storage medium.

The processing method of the voice recognition comprises the following steps: acquiring a voice to be recognized; performing voice recognition on the voice to be recognized to obtain a voice recognition result, wherein the voice recognition result comprises a first recognition result and a second recognition result, the first recognition result is a final result of the vehicle end of the vehicle for recognizing the voice to be recognized, and the second recognition result is a final result of the cloud for recognizing the voice to be recognized; and displaying one of the first recognition result and the second recognition result according to a display strategy formed by a network environment and the confidence coefficient of the voice recognition result.

According to the processing method of the voice recognition, one of the first recognition result and the second recognition result is displayed according to a display strategy formed by a network environment and the confidence coefficient of the voice recognition result, the respective voice recognition advantages of a local terminal and a cloud terminal can be fully exerted, the cloud terminal result and the local result are fused in a streaming mode to be displayed on a screen, and a smooth and accurate voice recognition result screen display effect is provided.

The processing method of the voice recognition comprises the following steps: acquiring original voice, wherein the original voice comprises at least one human voice part; under the condition that a voice part is detected currently, audio interception is carried out on the part, starting from the voice part, in the original voice to obtain a plurality of audio packets, and the plurality of audio packets form the voice to be recognized after being issued; and under the condition that the voice part is not detected within a first preset time, stopping issuing the audio packet. Thus, the processing efficiency of the voice recognition can be improved.

The processing method of the voice recognition comprises the following steps: in the case that the plurality of audio packets are obtained by performing audio interception on the detected voice part, stopping issuing the audio packets once the voice part has not been detected for a second preset duration, wherein the second preset duration is less than the first preset duration. This helps to improve the on-screen speed of the voice recognition result.

Acquiring the voice to be recognized comprises: acquiring the plurality of audio packets frame by frame; and performing voice recognition on the voice to be recognized to obtain a voice recognition result comprises: displaying the voice recognition result in a streaming manner according to the continuously acquired audio packets. Therefore, a real-time on-screen effect of the voice recognition result can be achieved.

The processing method of the voice recognition comprises the following steps: and displaying the first recognition result when the current streaming generates an intermediate result, wherein the intermediate result is a voice recognition result generated for all the currently acquired audio packets. Therefore, the fastest voice recognition result screen-loading speed can be guaranteed.

Displaying one of the first recognition result and the second recognition result according to a display strategy formed by a network environment and the confidence coefficient of the voice recognition result comprises: under the condition that the network environment is in a normal state, displaying the first recognition result or the second recognition result according to the receiving waiting duration of the second recognition result and the confidence coefficients of the first recognition result and the second recognition result; under the condition that the network environment is in a weak network state, preferentially displaying the first recognition result, and, when the second recognition result is received within a preset timeout duration, refreshing the display to show the first recognition result or the second recognition result according to their confidence coefficients; and displaying the first recognition result under the condition that the network environment is in a no-network state. Therefore, the displayed voice recognition result can be ensured to have sufficient on-screen speed and credibility.

The processing method of the voice recognition comprises the following steps: after a test packet is sent from the vehicle end of the vehicle, determining that the network environment is in the normal state under the condition that a feedback packet is received within the preset timeout duration, wherein the feedback packet is the cloud's processing result for the test packet; after the test packet is sent from the vehicle end of the vehicle, determining that the network environment is in the weak network state under the condition that the feedback packet is not received within the preset timeout duration; and under the condition that the long connection between the vehicle end of the vehicle and the cloud is disconnected, determining that the network environment is in the no-network state. Therefore, the network condition of communication between the vehicle end of the vehicle and the cloud can be conveniently determined.
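The three network states described above can be illustrated with a minimal Python sketch. The names `NetworkState` and `classify_network`, the parameters, and the choice to check the long connection before the test-packet timing are assumptions made for illustration only; the text does not specify an implementation or the order of the checks.

```python
from enum import Enum
from typing import Optional


class NetworkState(Enum):
    NORMAL = "normal"
    WEAK = "weak"
    NONE = "no-network"


def classify_network(long_connection_alive: bool,
                     feedback_delay_s: Optional[float],
                     timeout_s: float) -> NetworkState:
    """Classify the vehicle-to-cloud network state (illustrative only).

    feedback_delay_s is the time between sending the test packet and
    receiving the feedback packet, or None if no feedback has arrived.
    """
    # Assumption: a disconnected long connection takes precedence
    # over any test-packet timing.
    if not long_connection_alive:
        return NetworkState.NONE
    # Feedback received within the preset timeout duration: normal state.
    # Treating a delay exactly equal to the timeout as "within" is an assumption.
    if feedback_delay_s is not None and feedback_delay_s <= timeout_s:
        return NetworkState.NORMAL
    # Connection is up but feedback is late or missing: weak network state.
    return NetworkState.WEAK
```

For example, `classify_network(True, None, 1.0)` yields the weak network state, because the long connection is alive but no feedback packet has arrived.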

A vehicle according to the present invention includes a memory in which a computer program is stored, and a processor that realizes the steps of the speech recognition processing method described in any one of the above when the processor executes the computer program.

According to the vehicle, one of the first recognition result and the second recognition result is displayed according to a display strategy formed by a network environment and the confidence coefficient of the voice recognition result, the respective voice recognition advantages of the local end and the cloud can be fully exerted, the cloud result and the local result are fused in a streaming mode and are displayed on the screen, and a smooth and accurate voice recognition result screen display effect is provided.

The invention relates to a processing system for voice recognition, which comprises a vehicle and a cloud, wherein the vehicle is used for: acquiring a voice to be recognized; performing voice recognition on the voice to be recognized to obtain a voice recognition result, wherein the voice recognition result comprises a first recognition result and a second recognition result, the first recognition result is the final result of the vehicle end of the vehicle recognizing the voice to be recognized, and the second recognition result is the final result of the cloud recognizing the voice to be recognized; and displaying one of the first recognition result and the second recognition result according to a display strategy formed by a network environment and the confidence coefficient of the voice recognition result; and the cloud is used for: receiving the voice to be recognized sent by the vehicle; recognizing the voice to be recognized to obtain the second recognition result; and sending the second recognition result to the vehicle.

According to the voice recognition processing system, one of the first recognition result and the second recognition result is displayed according to a display strategy formed by a network environment and the confidence coefficient of the voice recognition result, the respective voice recognition advantages of a local terminal and a cloud terminal can be fully exerted, the cloud terminal result and the local result are fused in a streaming mode and are displayed on a screen, and a smooth and accurate voice recognition result screen display effect is provided.

A computer-readable storage medium of the present invention stores thereon a computer program that, when executed by a processor, implements the steps of the processing method for speech recognition described in any one of the above.

According to the computer-readable storage medium, one of the first recognition result and the second recognition result is displayed according to a display strategy formed by a network environment and the confidence coefficient of the voice recognition result, the respective voice recognition advantages of a local terminal and a cloud terminal can be fully exerted, the cloud terminal result and the local result are fused in a streaming mode and are displayed on a screen, and a smooth and accurate voice recognition result screen display effect is provided.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The above and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the invention taken in conjunction with the accompanying drawings in which:

FIG. 1 is a flow chart of a method of processing speech recognition of the present invention;

FIG. 2 is a schematic diagram of a speech recognition processing system of the present invention;

FIG. 3 is a block schematic diagram of the vehicle of the present invention;

FIG. 4 is a schematic view of a scenario of the speech recognition processing method of the present invention.

Description of the main element symbols:

a vehicle 10, a vehicle end 11, a memory 12, and a processor 13; a cloud 20; a processing system 30 for speech recognition.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

Referring to fig. 1 and fig. 2, a processing method of speech recognition according to the present invention includes:

01: acquiring a voice to be recognized;

02: performing voice recognition on the voice to be recognized to obtain a voice recognition result, wherein the voice recognition result comprises a first recognition result and a second recognition result, the first recognition result is the final result of the vehicle end 11 of a vehicle 10 recognizing the voice to be recognized, and the second recognition result is the final result of a cloud 20 recognizing the voice to be recognized;

03: and displaying one of the first recognition result and the second recognition result according to a display strategy formed by the network environment and the confidence coefficient of the voice recognition result.

The processing method of the voice recognition of the invention can be realized by the vehicle 10 of the invention. Specifically, referring to FIG. 3, the vehicle 10 includes a memory 12 and a processor 13. The memory 12 stores a computer program. The processor 13 is capable of executing the computer program to carry out the steps of the speech recognition processing method of the present invention. In particular, the processor 13 is configured to: acquire a voice to be recognized; perform voice recognition on the voice to be recognized to obtain a voice recognition result, wherein the voice recognition result comprises a first recognition result and a second recognition result, the first recognition result is the final result of the vehicle end 11 of the vehicle 10 recognizing the voice to be recognized, and the second recognition result is the final result of the cloud 20 recognizing the voice to be recognized; and display one of the first recognition result and the second recognition result according to a display strategy formed by the network environment and the confidence coefficient of the voice recognition result.

According to the voice recognition processing method and the vehicle 10, one of the first recognition result and the second recognition result is displayed according to a display strategy formed by a network environment and the confidence coefficient of the voice recognition result, so that the respective voice recognition advantages of the local end and the cloud 20 can be fully exerted, the result of the cloud 20 and the local result are fused in a streaming mode and displayed on the screen, and a smooth and accurate voice recognition result screen display effect is provided.

Referring to FIG. 4, after the user utters the Query audio, VAD (Voice Activity Detection) processing may be performed by a VAD unit at the vehicle end 11 (local to the vehicle) of the vehicle 10, so that a plurality of audio packets are obtained continuously. The vehicle-mounted device can send the obtained audio packets to its own ASR unit and to the ASR unit of the cloud 20, so that the vehicle-mounted device performs streaming processing on the VAD-processed Query audio of the user and finally obtains the first recognition result (the local ASR final result), and the cloud 20 performs streaming processing on the same VAD-processed Query audio and finally obtains the second recognition result (the cloud ASR final result). The voice recognition results obtained locally at the vehicle and by the cloud 20 are both returned, so that the local final result and the final result of the cloud 20 can be fused according to a display strategy formed by the network environment and the confidence coefficient of the voice recognition result, and the one that is more suitable for the user is displayed on the vehicle-mounted large-screen UI.

In addition, the speech to be recognized may be audio characterizing a user's voice request (Query). Performing speech recognition on the speech to be recognized may mean converting the speech to be recognized, which is in audio form, into a corresponding speech recognition result in text form.

The processing method of the speech recognition comprises the following steps:

acquiring original voice, wherein the original voice comprises at least one voice part;

under the condition that the voice part is detected currently, audio interception is carried out on the part, starting from the voice part, in the original voice to obtain a plurality of audio packets, and the plurality of audio packets form the voice to be recognized after being issued;

and under the condition that the voice part is not detected within the first preset time, stopping sending the audio packet.

The processing method of the speech recognition of the present invention can be realized by the vehicle 10 of the present invention. Specifically, referring to fig. 3, the processor 13 is configured to: acquiring original voice, wherein the original voice comprises at least one voice part; under the condition that the voice part is detected currently, audio interception is carried out on the part, starting from the voice part, in the original voice to obtain a plurality of audio packets, and the plurality of audio packets form the voice to be recognized after being issued; and under the condition that the voice part is not detected within the first preset time, stopping sending the audio packet.

Thus, the processing efficiency of the voice recognition can be improved.

Specifically, audio may be received through a microphone of the vehicle 10, such that the user's voice is recorded and the original voice is obtained. The original speech is audio obtained in an actual scene, which includes both a vocal part corresponding to speech uttered by a user and a non-vocal part (e.g., noise) in the scene. The human voice portion may characterize a particular word in speech.

Referring to fig. 4, when detecting the original voice, if the voice part in the original voice is detected, the original voice is audio-captured from the currently detected voice part to obtain a plurality of audio packets. Multiple audio packets may be issued to form the speech to be recognized. The issued audio packets can be sequentially arranged to form a voice to be recognized, and can also be synthesized to form the voice to be recognized. The detection of the original speech may be a voice activity detection.

If no voice part is detected in the original voice within the first preset duration, or a non-voice part is detected and lasts for the first preset duration, the current audio part of the original voice does not need voice recognition, so a voice activity detection termination state (VAD END) can be entered. On entering VAD END, interception of the original voice stops and the last audio packet (tail packet) is obtained; audio intercepted after VAD END is not issued and does not form part of the voice to be recognized.

It can be understood that, in a voice recognition scenario, the voice recognition result needs to appear on screen quickly. When the audio is recognized in sequence, the voice part of the original voice is intercepted into a plurality of audio packets and issuing stops when no voice part is present, so voice recognition is not performed on non-voice parts. Voice recognition is thus targeted, its processing efficiency is improved, and the on-screen speed of the voice recognition result is ultimately improved.

In addition, the processor 13 may include a voice activity detection unit (not shown) and a voice recognition unit (not shown). The voice activity detection unit can be used for carrying out audio interception on the acquired original voice to obtain a plurality of audio packets, and the voice recognition unit can be used for receiving the plurality of audio packets sent by the voice activity detection unit and forming voice to be recognized for voice recognition on the plurality of received audio packets.

In addition, the first preset duration can be determined by calibration, or can be obtained by adjusting according to actual conditions. The first preset duration may range from 1 ms to 100 ms.
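As a rough illustration of the issuing rule described above, the following Python sketch decides which frames of the original voice are issued as audio packets. It assumes that a voice activity detector has already labelled each fixed-length frame as voice or non-voice; the function name `issue_packets`, the parameters `frame_ms` and `first_preset_ms`, and the handling of only a single utterance are assumptions for illustration and are not taken from the text.

```python
from typing import Iterable, List


def issue_packets(voice_flags: Iterable[bool],
                  frame_ms: int,
                  first_preset_ms: int) -> List[int]:
    """Return the indices of the frames issued as audio packets.

    voice_flags holds one boolean per frame (True = voice detected).
    Issuing starts at the first voice frame and stops once non-voice
    audio has lasted for first_preset_ms (the VAD END state).
    """
    issued: List[int] = []
    started = False
    silence_ms = 0
    for index, is_voice in enumerate(voice_flags):
        if not started:
            if not is_voice:
                continue  # leading non-voice audio is never issued
            started = True
        issued.append(index)
        # Track how long the current run of non-voice audio has lasted.
        silence_ms = 0 if is_voice else silence_ms + frame_ms
        if silence_ms >= first_preset_ms:
            break  # VAD END: this frame closes the tail packet
    return issued
```

With 10 ms frames and a first preset duration of 20 ms, the flags `[False, True, True, False, False, False, True]` produce `[1, 2, 3, 4]`: the leading non-voice frame is skipped, and nothing after the second consecutive non-voice frame is issued.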

The processing method of the speech recognition comprises the following steps:

and under the condition that a plurality of audio packets are obtained by carrying out audio interception on the detected voice part, stopping issuing the audio packets until a second preset time length after the voice part is not detected, wherein the second preset time length is less than the first preset time length.

The processing method of the speech recognition of the present invention can be realized by the vehicle 10 of the present invention. Specifically, referring to fig. 3, the processor 13 is configured to: and under the condition that a plurality of audio packets are obtained by carrying out audio interception on the detected voice part, stopping issuing the audio packets until a second preset time length after the voice part is not detected, wherein the second preset time length is less than the first preset time length.

Therefore, the method can be beneficial to improving the screen-loading speed of the voice recognition result.

Specifically, in the original speech, a non-voice part may follow the end of a voice part. If, after a voice part has been detected, no voice part is detected for the second preset duration, the current part of the original speech is taken to be a non-voice part that needs no voice recognition, and issuing of audio packets stops. Because the second preset duration is less than the first preset duration, the tail-point waiting time within the total duration of the issued audio packets is compressed, the voice activity detection termination state is entered earlier (VAD EarlyEnd), and the last audio packet (tail packet) is smaller, so the tail packet can be sent more quickly, which speeds up voice recognition and ultimately the on-screen display of the voice recognition result.

In addition, in FIG. 4, the tail packet corresponding to VAD EarlyEnd is used as the local ASR tail packet, which can increase the speed at which the local ASR of the vehicle processes the speech to be recognized, and the tail packet corresponding to VAD End is used as the cloud ASR tail packet, which can increase the precision with which the cloud ASR processes the speech to be recognized. Of course, the tail packet corresponding to VAD EarlyEnd can also be used as the cloud ASR tail packet, so that the cloud ASR can obtain the second recognition result more quickly.
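The relationship between the two tail points can be sketched in Python as follows. The sketch assumes per-frame voice flags from a voice activity detector; the function name `find_tail_points` and its parameters are hypothetical, and withdrawing the early end when speech resumes before the first preset duration elapses is an assumption, since the text does not describe that case.

```python
from typing import Optional, Sequence, Tuple


def find_tail_points(voice_flags: Sequence[bool],
                     frame_ms: int,
                     first_preset_ms: int,
                     second_preset_ms: int
                     ) -> Tuple[Optional[int], Optional[int]]:
    """Return (early_end_index, end_index) for one utterance.

    early_end_index: frame at which non-voice audio has lasted for the
    second preset duration (VAD EarlyEnd, e.g. the local ASR tail packet).
    end_index: frame at which it has lasted for the first preset duration
    (VAD End, e.g. the cloud ASR tail packet).
    """
    if not second_preset_ms < first_preset_ms:
        raise ValueError("second preset duration must be less than the first")
    early_end: Optional[int] = None
    started = False
    silence_ms = 0
    for index, is_voice in enumerate(voice_flags):
        if is_voice:
            started = True
            silence_ms = 0
            early_end = None  # assumption: resumed speech withdraws the early end
            continue
        if not started:
            continue  # non-voice audio before any voice part is ignored
        silence_ms += frame_ms
        if early_end is None and silence_ms >= second_preset_ms:
            early_end = index
        if silence_ms >= first_preset_ms:
            return early_end, index
    return early_end, None
```

With 10 ms frames, a first preset duration of 40 ms and a second of 20 ms, two voice frames followed by four non-voice frames give `(3, 5)`: the early tail point is reached two frames before the regular one.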

Step 01 (obtaining the speech to be recognized), comprising:

acquiring a plurality of audio packets frame by frame;

step 02 (recognizing the speech to be recognized to obtain a speech recognition result), which includes:

and displaying the voice recognition result in a streaming mode according to the continuously acquired audio packets.

The processing method of the speech recognition of the present invention can be realized by the vehicle 10 of the present invention. Specifically, referring to fig. 3, the processor 13 is configured to: acquiring a plurality of audio packets frame by frame; and displaying the voice recognition result in a streaming mode according to the continuously acquired audio packet.

Therefore, the real-time screen-on effect of the voice recognition result can be realized.

Specifically, the plurality of audio packets are acquired frame by frame, so that the plurality of acquired audio packets can be arranged in sequence. The acquisition of the audio packet is a continuous process, and in the process, voice recognition can be performed according to the acquired audio packet, and a voice recognition result obtained by the voice recognition is displayed.

The streaming display may be to display partial contents of all previously recognized speech recognition results, and when a partial content of a new speech recognition result is currently acquired, combine the partial contents of all previously recognized speech recognition results with the partial content of the new speech recognition result to display.

Referring to FIG. 4, taking the user's voice request "navigate to Pudong International Airport" as an example, in the process of acquiring a plurality of audio packets frame by frame, a plurality of intermediate results (local ASR intermediate results) such as "navigate", "navigate to", and "navigate to Pudong" are obtained successively through voice recognition, each corresponding to part of the user's voice request. After voice recognition of the continuously acquired first part of the audio packets yields "navigate", the displayed voice recognition result is "navigate". After voice recognition of the continuously acquired second part of the audio packets yields "to", the displayed voice recognition result is "navigate to". After voice recognition of the continuously acquired third part of the audio packets yields "Pudong", the displayed voice recognition result is "navigate to Pudong". By analogy, every voice recognition result displayed before "navigate to Pudong International Airport" is an intermediate result.

By displaying the voice recognition result in a streaming manner, each newly obtained voice recognition result is appended to what is already displayed as the continuously acquired audio packets are recognized, and the final result "navigate to Pudong International Airport" is displayed at the end. The user can thus quickly check how the voice request is being recognized from the displayed intermediate results and confirm the final effect of the voice recognition from the displayed final result, so that a real-time on-screen effect of the voice recognition result can be achieved.
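The append-and-redisplay behaviour of streaming display can be sketched in a few lines of Python. The function name `stream_display` is hypothetical, and joining fragments with a space is an assumption that suits the English rendering of the example.

```python
from typing import Iterable, Iterator


def stream_display(fragments: Iterable[str]) -> Iterator[str]:
    """Yield the text shown on screen after each newly recognized fragment."""
    shown = ""
    for fragment in fragments:
        # Combine everything recognized so far with the new fragment.
        shown = f"{shown} {fragment}".strip()
        yield shown
```

Feeding the fragments "navigate", "to" and "Pudong" yields "navigate", then "navigate to", then "navigate to Pudong", matching the intermediate results in the example.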

The processing method of the speech recognition comprises the following steps:

and displaying the first recognition result in the case that the current streaming generates an intermediate result, wherein the intermediate result is a voice recognition result generated for all the currently acquired audio packets.

The processing method of the speech recognition of the present invention can be realized by the vehicle 10 of the present invention. Specifically, referring to fig. 3, the processor 13 is configured to: and displaying the first recognition result in the case that the current streaming generates an intermediate result, wherein the intermediate result is a voice recognition result generated for all the currently acquired audio packets.

Therefore, the fastest voice recognition result screen-loading speed can be guaranteed.

Referring to FIG. 4, specifically, because the vehicle end 11 of the vehicle 10 has no network delay, it can display the first recognition result directly after obtaining it, so that the voice recognition result appears on screen faster. In addition, when combined with entering the voice activity detection termination state earlier by means of the second preset duration, the first recognition result can be obtained even faster.

Step 03 (displaying one of the first recognition result and the second recognition result according to a display policy consisting of a network environment and a confidence of the speech recognition result) includes:

under the condition that the network environment is in a normal state, displaying the first recognition result or the second recognition result according to the receiving waiting time of the second recognition result and the confidence coefficient between the first recognition result and the second recognition result;

preferentially displaying the first recognition result under the condition that the network environment is in a weak network state, and refreshing and displaying the first recognition result or the second recognition result according to the confidence coefficient between the first recognition result and the second recognition result when the second recognition result is received within a preset timeout duration;

and displaying the first recognition result under the condition that the network environment is in a no-network state.

The processing method of the speech recognition of the present invention can be realized by the vehicle 10 of the present invention. Specifically, referring to fig. 3, the processor 13 is configured to: under the condition that the network environment is in a normal state, displaying the first recognition result or the second recognition result according to the receiving waiting time of the second recognition result and the confidence coefficient between the first recognition result and the second recognition result; preferentially displaying the first recognition result under the condition that the network environment is in a weak network state, and refreshing and displaying the first recognition result or the second recognition result according to the confidence coefficient between the first recognition result and the second recognition result when the second recognition result is received within a preset timeout duration; and displaying the first recognition result under the condition that the network environment is in a no-network state.

Therefore, the displayed voice recognition result can be ensured to have enough screen-on speed and credibility.

The confidence level of the speech recognition result may be a degree of reliability characterizing the speech recognition result. For a speech recognition result, the higher the confidence of the speech recognition result, the higher the reliability of the speech recognition result, and the less likely there is an error, and the lower the confidence of the speech recognition result, the lower the reliability of the speech recognition result, and the more likely there is an error.

Specifically, if it is determined that the network environment is currently in a normal state, since voice recognition through the cloud 20 may have higher processing accuracy, priority is given to waiting for the second recognition result even if the first recognition result has already been obtained. The first recognition result is not displayed while waiting for the second recognition result. If the second recognition result is received within the preset timeout duration, the confidences of the first recognition result and the second recognition result are compared and the one with the higher confidence is displayed: if the confidence of the first recognition result is higher than that of the second recognition result, the first recognition result is displayed, and if the confidence of the first recognition result is lower than that of the second recognition result, the second recognition result is displayed. If the second recognition result is not received within the preset timeout duration, the receiving waiting duration is determined to be too long, and the first recognition result is displayed.

If it is determined that the network environment is currently in a weak network state, reception of the second recognition result may time out. While waiting for the second recognition result, if the first recognition result is obtained first, the first recognition result is displayed first. If the time spent waiting for the second recognition result exceeds the preset timeout duration, the first recognition result remains displayed. If the second recognition result is received within the preset timeout duration, the confidences of the first recognition result and the second recognition result are compared. If the confidence of the first recognition result is greater than that of the second recognition result, the first recognition result remains displayed, and if the confidence of the first recognition result is less than that of the second recognition result, the display is refreshed to show the second recognition result.

If it is determined that the network environment is currently in the no-network state, it may be determined that the second recognition result sent by the cloud 20 cannot be received, and thus the first recognition result may be directly displayed under the condition that the first recognition result is obtained.

On this basis, according to the current network environment of the communication between the vehicle 10 and the cloud 20 and the confidences of the different voice recognition results, whichever of the first recognition result and the second recognition result has sufficient confidence can be selected and displayed while still ensuring a sufficient on-screen display speed.
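The three display strategies described above can be illustrated with a minimal Python sketch. The names `NetworkState`, `Result`, `pick` and `display_sequence` are hypothetical and do not come from the invention. Passing `None` for the second recognition result stands for "not received within the preset timeout duration". The text does not say what happens when the two confidences are equal, so keeping the first recognition result on a tie is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class NetworkState(Enum):
    NORMAL = "normal"
    WEAK = "weak"
    NONE = "none"


@dataclass
class Result:
    text: str
    confidence: float


def pick(first: Result, second: Result) -> Result:
    # Tie handling is an assumption: the first (vehicle-end) result is kept.
    return second if second.confidence > first.confidence else first


def display_sequence(state: NetworkState, first: Result,
                     second: Optional[Result]) -> List[str]:
    """Return the texts shown on screen, in order.

    `second` is None when the cloud result did not arrive within the
    preset timeout duration (or cannot arrive at all).
    """
    if state is NetworkState.NONE:
        # No network: the first recognition result is displayed directly.
        return [first.text]
    if state is NetworkState.NORMAL:
        # Normal: nothing is shown while waiting; on timeout fall back
        # to the first result, otherwise show the more confident one.
        if second is None:
            return [first.text]
        return [pick(first, second).text]
    # Weak network: show the first result at once, then refresh only if
    # the second result arrives in time and has higher confidence.
    shown = [first.text]
    if second is not None and pick(first, second) is second:
        shown.append(second.text)
    return shown
```

For instance, with a first result of confidence 0.9 and a second result of confidence 1.0, the weak network case yields two display events (the first result, then the refreshed second result), while the normal case yields only the second result.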

The processing method of the speech recognition comprises the following steps:

after the vehicle end 11 of the vehicle 10 sends the test packet, under the condition that a feedback packet is received within a preset timeout period, determining that the network environment is in a normal state, wherein the feedback packet is a processing result of the cloud 20 on the test packet;

after the vehicle end 11 of the vehicle 10 sends the test packet, determining that the network environment is in a weak network state under the condition that the feedback packet is not received within a preset timeout duration;

in the case where the long connection between the vehicle end 11 of the vehicle 10 and the cloud end 20 is disconnected, it is determined that the network environment is in the no-network state.

The processing method of the speech recognition of the present invention can be realized by the vehicle 10 of the present invention. Specifically, referring to fig. 3, the processor 13 is configured to: after the vehicle end 11 of the vehicle 10 sends the test packet, under the condition that a feedback packet is received within a preset timeout period, determining that the network environment is in a normal state, wherein the feedback packet is a processing result of the cloud 20 on the test packet; after the vehicle end 11 of the vehicle 10 sends the test packet, determining that the network environment is in a weak network state under the condition that the feedback packet is not received within a preset timeout duration; in the case where the long connection between the vehicle end 11 of the vehicle 10 and the cloud end 20 is disconnected, it is determined that the network environment is in the no-network state.

In this way, the network status of the communication between the vehicle end 11 of the vehicle 10 and the cloud 20 can be determined conveniently.

Specifically, since the communication between the vehicle end 11 of the vehicle 10 and the cloud 20 is long-distance communication over a long connection, it places high demands on the network environment. By judging whether the feedback packet is received within the preset timeout duration, the network condition of the communication between the vehicle end 11 of the vehicle 10 and the cloud 20 can be determined simply, which facilitates selecting the display strategy for the voice recognition result. The test packet may be a Ping packet and the feedback packet may be a Pong packet.
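The network state determination above can be sketched as follows. This is only an illustration: the function `classify_network` and its parameters are hypothetical, the delay of the feedback packet is assumed to be measured elsewhere, and treating a delay exactly equal to the timeout as "within" the timeout is an assumption.

```python
from enum import Enum
from typing import Optional


class NetworkState(Enum):
    NORMAL = "normal"
    WEAK = "weak"
    NONE = "none"


def classify_network(long_connection_alive: bool,
                     feedback_delay_s: Optional[float],
                     timeout_s: float) -> NetworkState:
    """Classify the network environment between vehicle end and cloud.

    feedback_delay_s is the time between sending the test (Ping) packet
    and receiving the feedback (Pong) packet, or None if no feedback
    packet has been received.
    """
    if not long_connection_alive:
        # Long connection disconnected: no-network state.
        return NetworkState.NONE
    if feedback_delay_s is not None and feedback_delay_s <= timeout_s:
        # Feedback packet received within the preset timeout duration.
        return NetworkState.NORMAL
    # Connection is up but the feedback packet was late or missing.
    return NetworkState.WEAK
```

Checking the long connection first reflects that a disconnected link makes the test packet result meaningless; the order of the two checks is a design choice of this sketch, not something stated in the text.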

Based on the foregoing, take the user's voice request "please help me to open the comfortable driving sharing mode" as an example. After VAD processing of the voice request, a plurality of intermediate results can be obtained in a streaming manner, such as "please help me" and "please help me to open comfortable". A first recognition result "please help me to open the comfortable driving thinking mode" with a confidence of 0.9 and a second recognition result "please help me to open the comfortable driving sharing mode" with a confidence of 1.0 can then be obtained. The display then depends on the network environment of the communication between the vehicle end 11 of the vehicle 10 and the cloud 20. In the case that the network environment is in a normal state, if the voice recognition result of the cloud 20 is received before the timeout, "please help me to open the comfortable driving sharing mode" is displayed, and if receiving the voice recognition result of the cloud 20 times out, "please help me to open the comfortable driving thinking mode" is displayed. In the case that the network environment is in a weak network state, "please help me to open the comfortable driving thinking mode" is displayed first; if the voice recognition result of the cloud 20 is received before the timeout, the display is refreshed to "please help me to open the comfortable driving sharing mode" once it is received, and if receiving the voice recognition result of the cloud 20 times out, "please help me to open the comfortable driving thinking mode" remains displayed. In the case that the network environment is in a no-network state, "please help me to open the comfortable driving thinking mode" is displayed directly.

Referring to fig. 2, a processing system 30 for speech recognition according to the present invention includes a vehicle 10 and a cloud 20. The vehicle 10 is used for: acquiring a voice to be recognized; recognizing the voice to be recognized to obtain a voice recognition result, wherein the voice recognition result comprises a first recognition result and a second recognition result, the first recognition result is the final result obtained by the vehicle end 11 of the vehicle 10 recognizing the voice to be recognized, and the second recognition result is the final result obtained by the cloud 20 recognizing the voice to be recognized; and displaying one of the first recognition result and the second recognition result according to a display strategy formed by the network environment and the confidence of the voice recognition result. The cloud 20 is used for: receiving the voice to be recognized sent by the vehicle 10; recognizing the voice to be recognized to obtain the second recognition result; and transmitting the second recognition result to the vehicle 10.

The processing system 30 for speech recognition displays one of the first recognition result and the second recognition result according to a display strategy formed by the network environment and the confidence of the voice recognition result, so that the respective voice recognition advantages of the local vehicle end and the cloud 20 can be fully exploited, the result of the cloud 20 and the local result are fused into a single streaming on-screen display, and the voice recognition result is displayed smoothly and accurately.

Specifically, referring to fig. 2, the vehicle 10 may obtain a voice request sent by a user through the vehicle end 11. The vehicle end 11 may perform voice recognition on the processed voice request (the voice to be recognized) to obtain voice recognition results (an intermediate result and a first recognition result), and may send the voice to be recognized to the cloud 20. The cloud 20 can receive the voice to be recognized sent by the vehicle end 11 and perform voice recognition on it to obtain a second recognition result. The second recognition result is transmitted to the processor 13, so that the processor 13 determines, according to the display strategy, which of the first recognition result and the second recognition result to display.
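The interaction between the vehicle end and the cloud can be sketched with Python's `asyncio`. Everything here is a stand-in: `local_recognize` and `cloud_recognize` are hypothetical placeholders that return fixed text and confidence after a simulated delay, and starting the timeout timer once the first recognition result is available is an assumption, since the text does not state when the preset timeout duration begins.

```python
import asyncio
from typing import Optional, Tuple

Recognition = Tuple[str, float]  # (text, confidence)


async def local_recognize(audio: bytes) -> Recognition:
    # Placeholder for vehicle-end recognition (first recognition result).
    await asyncio.sleep(0.01)
    return ("local text", 0.9)


async def cloud_recognize(audio: bytes, delay_s: float) -> Recognition:
    # Placeholder for the cloud round trip (second recognition result).
    await asyncio.sleep(delay_s)
    return ("cloud text", 1.0)


async def recognize(audio: bytes, cloud_delay_s: float,
                    timeout_s: float) -> Tuple[Recognition, Optional[Recognition]]:
    # Send the voice to the cloud and recognize locally at the same time.
    cloud_task = asyncio.create_task(cloud_recognize(audio, cloud_delay_s))
    first = await local_recognize(audio)
    try:
        # Wait for the cloud result, but only up to the timeout.
        second = await asyncio.wait_for(cloud_task, timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the pending cloud task on timeout.
        second = None
    return first, second
```

Calling `asyncio.run(recognize(b"", 0.0, 0.5))` returns both results, while a cloud delay longer than the timeout returns `None` for the second result, which the processor can then treat as a timeout when applying the display strategy.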

A computer-readable storage medium of the present invention, on which a computer program is stored, is characterized in that the computer program realizes the steps of the processing method of speech recognition of any one of the above items when being executed by the processor 13.

For example, in the case of a computer program being executed, the following steps may be implemented:

01: acquiring a voice to be recognized;

The computer-readable storage medium may be provided in the vehicle 10 or in another terminal, and the vehicle 10 can communicate with the other terminal to obtain the corresponding program.

It is understood that the computer program includes computer program code. The computer program code may be in the form of source code, object code, an executable file, some intermediate form, and the like. The computer-readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive (U-disk), a removable hard disk, a magnetic disk, an optical disk, a computer storage medium, a Read-Only Memory (ROM), a Random Access Memory (RAM), a software distribution medium, and the like.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps in the process, and alternate implementations are included within the scope of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processing module-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features. In the description of the present invention, "a plurality" means two or more unless specifically defined otherwise.

While the invention has been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made herein without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method for processing speech recognition, comprising:

acquiring a voice to be recognized;

recognizing the voice to be recognized to obtain a voice recognition result, wherein the voice recognition result comprises a first recognition result and a second recognition result, the first recognition result is a final result recognized by a vehicle end of the vehicle for the voice to be recognized, and the second recognition result is a final result recognized by a cloud for the voice to be recognized;

and displaying one of the first recognition result and the second recognition result according to a display strategy formed by a network environment and the confidence degree of the voice recognition result.

2. The processing method of speech recognition according to claim 1, wherein the processing method of speech recognition comprises:

acquiring original voice, wherein the original voice comprises at least one human voice part;

under the condition that a human voice part is currently detected, performing audio interception on the portion of the original voice starting from the human voice part to obtain a plurality of audio packets, wherein the plurality of audio packets form the voice to be recognized after being issued;

and under the condition that no human voice part is detected within a first preset duration, stopping issuing the audio packets.

3. The processing method for speech recognition according to claim 2, wherein the processing method for speech recognition comprises:

under the condition that the plurality of audio packets have been obtained by audio interception of the detected human voice part, stopping issuing the audio packets once a second preset duration has elapsed without the human voice part being detected, wherein the second preset duration is less than the first preset duration.

4. The processing method of speech recognition according to claim 2, wherein the obtaining of the speech to be recognized comprises:

acquiring the plurality of audio packets frame by frame;

and the recognizing the voice to be recognized to obtain a voice recognition result comprises:

displaying the voice recognition result in a streaming manner according to the continuously acquired audio packets.

5. The processing method for speech recognition according to claim 4, wherein the processing method for speech recognition comprises:

and displaying the first recognition result when the current streaming generates an intermediate result, wherein the intermediate result is a voice recognition result generated for all the currently acquired audio packets.

6. The processing method of speech recognition according to claim 1, wherein the displaying one of the first recognition result and the second recognition result according to a display strategy formed by a network environment and the confidence of the voice recognition result comprises:

and displaying the first recognition result under the condition that the network environment is in a no-network state.

7. The method for processing speech recognition according to claim 6, wherein the method for processing speech recognition comprises:

after a test packet is sent from a vehicle end of the vehicle, determining that the network environment is in the normal state under the condition that a feedback packet is received within the preset timeout duration, wherein the feedback packet is a processing result of the cloud end on the test packet;

after the vehicle end of the vehicle sends the test packet, determining that the network environment is in the weak network state under the condition that the feedback packet is not received within the preset timeout duration;

and determining that the network environment is in the no-network state under the condition that the long connection between the vehicle end of the vehicle and the cloud is disconnected.

8. A vehicle, characterized by comprising a memory storing a computer program and a processor implementing the steps of the processing method of speech recognition according to any one of claims 1 to 7 when the processor executes the computer program.

9. A processing system for speech recognition, comprising a vehicle and a cloud, the vehicle configured to:

acquiring a voice to be recognized;

recognizing the voice to be recognized to obtain a voice recognition result, wherein the voice recognition result comprises a first recognition result and a second recognition result, the first recognition result is a final result obtained by the vehicle end of the vehicle recognizing the voice to be recognized, and the second recognition result is a final result obtained by the cloud recognizing the voice to be recognized;

displaying one of the first recognition result and the second recognition result according to a display strategy formed by a network environment and a confidence coefficient of a voice recognition result;

the cloud is used for:

receiving the voice to be recognized sent by the vehicle;

recognizing the voice to be recognized to obtain a second recognition result;

and sending the second recognition result to the vehicle.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of processing of speech recognition according to any one of claims 1 to 7.