CN112863496B - Voice endpoint detection method and device - Google Patents

Voice endpoint detection method and device

Info

Publication number
CN112863496B
Authority
CN
China
Prior art keywords
voice
target
end point
endpoint
rear end
Prior art date
Legal status
Active
Application number
CN201911181820.4A
Other languages
Chinese (zh)
Other versions
CN112863496A (en)
Inventor
袁斌 (Yuan Bin)
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority claimed from CN201911181820.4A
Publication of CN112863496A
Application granted
Publication of CN112863496B


Classifications

    • Section G (Physics) > G10 (Musical instruments; acoustics) > G10L (Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding):
    • G10L 15/142 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/1822 Parsing for meaning understanding (natural language modelling)
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 19/008 Multichannel audio signal coding or decoding using interchannel correlation to reduce redundancy, e.g. joint-stereo, intensity-coding or matrixing
    • G10L 25/87 Detection of discrete points within a voice signal (detection of presence or absence of voice signals)
    • G10L 2015/225 Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice endpoint detection method and device. The method comprises the following steps: obtaining target voice data; obtaining an intermediate speech recognition result generated after a speech recognition decoder performs speech recognition on the target voice data; decoding the target voice data based on a voice back-endpoint discrimination model to obtain a target voice unit sequence; adjusting voice back-endpoint detection parameters according to the intermediate speech recognition result to obtain target detection parameters; and discriminating the back endpoint of the target voice unit sequence according to the target detection parameters to obtain a voice back-endpoint discrimination result. With this method, the back-endpoint detection parameters can be adjusted in real time based on the intermediate recognition results output during speech recognition, realizing dynamic detection of the voice back endpoint and avoiding the limitation of existing voice endpoint detection caused by its over-dependence on the speech recognition result.

Description

Voice endpoint detection method and device
Technical Field
The application relates to the technical field of computers, in particular to a voice endpoint detection method. The application also relates to a voice endpoint detection device and electronic equipment.
Background
A voice endpoint marks the transition between silence and an effective voice signal. Voice endpoint detection (Voice Activity Detection, VAD), also called voice activity detection or voice boundary detection, identifies and eliminates long silence periods in a voice signal and is used to determine where speech starts and ends. Its correctness has a great influence on speech recognition performance; in human-machine interaction applications in particular, the endpoint detection effect directly affects the user experience.
For example, in voice learning software, endpoint detection runs while the user records a spoken evaluation; when the end of speech is detected, recording stops automatically, sparing the user the step of clicking a stop-recording button and improving the experience. As another example, some recording scenarios require the user to read a complete text and then stop recording; if the user reads half of the text and then pauses for a long time, existing endpoint detection may prematurely treat the detected silence as the end of speech and stop recording, failing to meet the intended recording requirement and degrading the user experience.
The existing voice endpoint detection method is mainly implemented inside a speech recognition decoder: endpoint detection is performed while the input voice data is being recognized. In this implementation, however, the endpoint detection effect depends too heavily on the recognition result. For example, when a token (Token) of the speech recognition decoder reaches a token that marks the end of speech, a corresponding recognition result must already be available, because the decision to stop is made from the backtracking information carried on the state node corresponding to that token. If the decoder has no recognition result, no end-of-speech decision can be made. The voice endpoint detection process is therefore limited by its over-dependence on the speech recognition result.
Disclosure of Invention
The embodiment of the application provides a voice endpoint detection method to solve the problem that the existing voice endpoint detection process depends too heavily on the speech recognition result and is therefore limited. The application further provides a voice endpoint detection device and an electronic device.
The embodiment of the application provides a voice endpoint detection method, which comprises the following steps:
obtaining target voice data;
obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence;
according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters;
and discriminating the back endpoint of the target voice unit sequence according to the target detection parameters to obtain a voice back-endpoint discrimination result. A minimal sketch of this overall flow is given below.
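For illustration only, here is a minimal Python sketch of the five steps above. The decoder and discrimination-model objects and every helper name (`recognize_partial`, `decode_units`, `adjust_params`, `judge_tail`) are hypothetical placeholders, not components named by the application:

```python
# Hypothetical sketch of the claimed five-step flow; all helpers are
# placeholders for the components described in the embodiments below.

def detect_voice_back_endpoint(audio_source, asr_decoder, tail_model,
                               base_silence_ms=800):
    """Run back-endpoint detection alongside speech recognition."""
    speech = audio_source.read()                          # step 1: target voice data
    partial = asr_decoder.recognize_partial(speech)       # step 2: intermediate result
    units = tail_model.decode_units(speech)               # step 3: voice unit sequence
    silence_ms = adjust_params(partial, base_silence_ms)  # step 4: target parameters
    return judge_tail(units, silence_ms)                  # step 5: discriminate endpoint
```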
Optionally, the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes:
mapping a context-dependent modeling unit in the speech recognition decoder to a context-independent modeling unit;
establishing a voice back-endpoint discrimination model based on the context-independent modeling unit;
and identifying the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence.
Optionally, the modeling unit is a phoneme, and the mapping the context-dependent modeling unit in the speech recognition decoder to the context-independent modeling unit includes:
obtaining left and right correlated phonemes for each target phoneme in the speech recognition decoder;
obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left correlation phoneme and a probability value of each state output observation sequence corresponding to the left correlation phoneme, and obtaining a state transition probability value of the right correlation phoneme and a probability value of each state output observation sequence corresponding to the right correlation phoneme;
calculating average values of the state transition probability value of the target phoneme, the state transition probability value of the left related phoneme and the state transition probability value of the right related phoneme to obtain a state transition probability average value; the probability value of each state output observation sequence corresponding to the target phoneme, the probability value of each state output observation sequence corresponding to the left correlation phoneme and the probability value of each state output observation sequence corresponding to the right correlation phoneme are averaged to obtain a probability average value of the state output observation sequence;
And determining the state transition probability mean value as a target state transition probability value of the target phoneme, and determining the probability mean value of the state output observation sequence as a probability value of the target state output observation sequence of the target phoneme.
Optionally, the identifying the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes: obtaining a target observation sequence of the target voice data; sequentially recursively calculating and outputting probability values of the target observation sequences according to the occurrence sequence of the target observation sequences in the voice rear endpoint discrimination model; adopting a token passing algorithm, and decoding the target voice data by utilizing the voice rear end point discrimination model to obtain a target state path corresponding to the maximum probability value of an output target observation sequence; and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
Optionally, decoding the target voice data with the voice back-endpoint discrimination model using a token passing algorithm includes: preprocessing the target voice data to obtain audio frames; extracting features from the audio frames to obtain target audio features; and inputting the target audio features into the voice back-endpoint discrimination model for decoding with the token passing algorithm. The preprocessing and feature extraction are sketched below.
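As an illustration of the preprocessing and feature-extraction steps named above, here is a short sketch. It assumes `librosa` for the MFCC computation; the 0.97 pre-emphasis coefficient and the 25 ms window / 10 ms hop are conventional ASR defaults, not values fixed by the application:

```python
import numpy as np
import librosa

def extract_mfcc(signal: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Pre-emphasize the digitized signal, then compute per-frame MFCCs."""
    # Pre-emphasis boosts the high-frequency part of the signal.
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and windowing happen inside the STFT; each MFCC column
    # corresponds to one short-time audio frame on the Mel scale.
    return librosa.feature.mfcc(y=emphasized, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
```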
Optionally, the adjusting the voice back end point detection parameter according to the intermediate voice recognition result to obtain a target detection parameter includes: according to the intermediate voice recognition result, adjusting the mute detection time of the voice rear end point to obtain target mute detection time;
correspondingly, the step of determining the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point determination result comprises the following steps: and judging the rear end point of the target voice unit sequence according to the target silence detection time to obtain a voice rear end point judging result.
Optionally, the adjusting the silence detection time of the voice rear endpoint according to the intermediate voice recognition result to obtain the target silence detection time includes: and if the intermediate voice recognition result is unchanged in the first preset time period, shortening the silence detection time of the voice rear end point, and obtaining the target silence detection time for distinguishing the voice rear end point.
Optionally, the method further comprises: if the intermediate voice recognition result is unchanged in the second preset time period, carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with the target semantic information which is preset and used for judging the voice rear endpoint;
The decoding of the target voice data based on the voice back endpoint discrimination model to obtain a target voice unit sequence comprises the following steps: and if the target semantic recognition result is not matched with the target semantic information for distinguishing the voice rear end point, decoding the target voice data based on the voice rear end point distinguishing model to obtain a target voice unit sequence.
Optionally, the method further comprises: and if the target semantic recognition result is matched with the target semantic information for judging the voice back end point, determining the current time point as the voice back end point of the target voice data.
Optionally, the obtaining the target voice data includes: and if the voice input end of the target voice data does not detect the voice rear end point for obtaining the target voice data, receiving the target voice data sent by the voice input end.
Optionally, the method further comprises: and if the voice back end point judging result indicates that the voice back end point of the target voice data is detected, outputting identification information corresponding to the voice back end point.
Optionally, the method further comprises: outputting voice back end point approval information, wherein the voice back end point approval information is used for a user to confirm whether the voice back end point is a real voice back end point or not; and obtaining feedback information of the user aiming at the voice post-endpoint approval information.
The embodiment of the application also provides a voice endpoint detection method, which comprises the following steps:
after a client does not detect a voice back end point for obtaining target voice data, receiving the target voice data sent by the client; obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence; according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters; judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result; and sending the voice back end point discrimination result to the client.
Optionally, the sending the voice back end point discrimination result to the client includes: and if the voice back end point judging result indicates that the voice back end point of the target voice data is detected, sending the voice back end point judging result to the client.
The embodiment of the application also provides a voice endpoint detection system, which comprises a first voice endpoint detection module, a semantic detection module and a second voice endpoint detection module;
The first voice endpoint detection module is used for detecting a voice rear endpoint of target voice data through a voice recognition decoder, and sending the target voice data to the semantic detection module after the voice rear endpoint of the target voice data is not detected;
the semantic detection module is used for carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with the target semantic information which is preset and used for judging the voice rear endpoint; if the target semantic recognition result is matched with the target semantic information for judging the voice back end point, determining the current time point as the voice back end point of the target voice data; if the target semantic recognition result is not matched with the target semantic information which is preset and used for judging the voice back end point, the target voice data is sent to the second voice end point detection module;
the second voice endpoint detection module is used for obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence; according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters; and judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
Optionally, the first voice endpoint detection module is disposed at the client, and the semantic detection module and the second voice endpoint detection module are disposed at the server.
The embodiment of the application also provides another voice endpoint detection system, which comprises a target voice data distribution module, a first voice endpoint detection module, a semantic detection module, a second voice endpoint detection module and a voice back-endpoint confirmation module;
the target voice data distribution module is used for distributing target voice data to the first voice endpoint detection module, the semantic detection module and the second voice endpoint detection module;
the first voice endpoint detection module is used for detecting the voice endpoint of the target voice data through a voice recognition decoder to obtain a first voice endpoint judgment result;
the semantic detection module is used for carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with target semantic information which is preset and used for judging the voice rear end point or not, and obtaining a semantic matching result; obtaining a second voice rear endpoint discrimination result according to the semantic matching result;
The second voice endpoint detection module is used for obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence; according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters; judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a third voice rear end point judging result;
the voice back end point confirming module is used for confirming a target voice back end point judging result aiming at the target voice data according to at least two judging results among the first voice back end point judging result, the second voice back end point judging result and the third voice back end point judging result.
Optionally, the determining the target voice endpoint determination result for the target voice data according to at least two of the first voice endpoint determination result, the second voice endpoint determination result and the third voice endpoint determination result includes: and determining the first obtained discrimination result as a target voice rear end point discrimination result aiming at the target voice data according to the time sequence of at least two discrimination results among the first voice rear end point discrimination result, the second voice rear end point discrimination result or the third voice rear end point discrimination result.
The embodiment of the application also provides a voice back end point detection device, which comprises:
a target voice data obtaining unit for obtaining target voice data;
an intermediate voice recognition result obtaining unit, configured to obtain an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
the target voice unit sequence obtaining unit is used for decoding the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence;
a target detection parameter obtaining unit, configured to adjust the voice back-endpoint detection parameters according to the intermediate voice recognition result to obtain target detection parameters;
and the voice rear end point judging unit is used for judging the voice rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
The embodiment of the application also provides an electronic device, comprising a processor and a memory for storing a voice back-endpoint detection program which, when read and executed by the processor, performs the following operations: obtaining target voice data; obtaining an intermediate speech recognition result generated after the speech recognition decoder performs speech recognition on the target voice data; decoding the target voice data based on a voice back-endpoint discrimination model to obtain a target voice unit sequence; adjusting the voice back-endpoint detection parameters according to the intermediate speech recognition result to obtain target detection parameters; and discriminating the voice back endpoint of the target voice unit sequence according to the target detection parameters to obtain a voice back-endpoint discrimination result.
Compared with the prior art, the embodiment of the application has the following advantages:
according to the voice endpoint detection method, the target voice data is decoded by using the voice endpoint judgment model to obtain a target voice unit sequence, the voice endpoint detection parameters are adjusted according to the intermediate voice recognition result, the rear endpoint of the target voice unit sequence is judged according to the adjusted detection parameters to obtain the voice endpoint judgment result, the voice endpoint detection process and the voice recognition process are synchronously carried out, the voice endpoint detection and the voice recognition process are decoupled through the endpoint detection link provided by the method, the voice endpoint detection process is combined with the applicable scene of the current voice, the voice endpoint detection parameters are adjusted in real time based on the intermediate voice recognition result output in the voice recognition process, the dynamic detection of the voice endpoint is realized, and the problem that the existing voice endpoint detection process has limitation due to too high dependence on the voice recognition result is avoided.
Drawings
Fig. 1 is a flowchart of a voice endpoint detection method provided in a first embodiment of the present application;
FIG. 2 is a flowchart of a method for detecting a voice endpoint according to a second embodiment of the present application;
FIG. 3 is a schematic diagram of a voice endpoint detection system according to a third embodiment of the present application;
FIG. 4 is a schematic diagram of a voice endpoint detection system according to a fourth embodiment of the present application;
fig. 5 is a block diagram of a unit of a voice endpoint detection apparatus provided in a fifth embodiment of the present application;
fig. 6 is a schematic logic structure of an electronic device according to a sixth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be embodied in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific embodiments disclosed below.
To improve the applicability of voice endpoint detection, the application provides a voice endpoint detection method, a corresponding voice endpoint detection device, and an electronic device. Embodiments are provided below to describe the method, device and electronic device in detail.
A first embodiment of the present application provides a voice endpoint detection method; the executing body of the method may be a computing device used for voice endpoint detection. Fig. 1 is a flowchart of the voice endpoint detection method provided in the first embodiment, and the method is described in detail below with reference to fig. 1. The embodiments referred to in the following description are intended to illustrate the principles of the method, not to limit its practical use.
As shown in fig. 1, the voice endpoint detection method provided in this embodiment includes the following steps:
s101, obtaining target voice data.
The target voice data refers to voice data to be subjected to voice rear end point detection, such as navigation voice of a driver in a vehicle-mounted environment.
Obtaining the target voice data may mean receiving target voice data sent by a voice input end after that input end has failed to detect the voice back endpoint of the target voice data itself. The voice input end may be an intelligent mobile terminal used by the user or a terminal local to the speaking user, such as a car navigation device.
S102, obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data.
In the present embodiment, speech recognition of the target voice data proceeds in parallel with voice endpoint detection. The method is implemented on the basis of a speech recognition decoder: once the decoder's intermediate speech recognition result is obtained, back-endpoint detection for the target voice data can proceed. In this embodiment, the intermediate speech recognition result is preferably the number of silence frames in the intermediate speech.
S103, decoding the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence.
The implementation of this step differs from the decoding process of an existing speech recognition decoder; it is a simplified operation built on that decoding process. This step decodes the target voice data based on a voice back-endpoint discrimination model whose modeling units are decoding units obtained by mapping the modeling units of the speech recognition decoder described above: context-dependent modeling units are mapped to context-independent modeling units. Because the number of state nodes corresponding to context-independent modeling units is substantially smaller than the number corresponding to context-dependent ones, the purpose of the mapping is to reduce the number of state nodes in the model and thereby the computational load of decoding. The reason this is possible is that voice endpoint detection places a lower accuracy requirement on the recognition result than speech recognition does: endpoint detection only needs to distinguish sound from silence and has no requirement on the specific content of the sound, so the path search can be comparatively simple.
In this embodiment, the process of decoding the target voice data based on the voice back end point discrimination model to obtain the target voice unit sequence may include the following:
firstly, mapping a modeling unit related to the context in the voice recognition decoder into a modeling unit not related to the context;
secondly, based on the modeling unit irrelevant to the context, obtaining the voice rear endpoint discrimination model;
and finally, identifying the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence.
Mapping the context-dependent modeling units in the speech recognition decoder to context-independent modeling units means applying a simplifying mapping to the modeling units of the decoder's acoustic model, which is preferably a hidden Markov model (HMM) network, thereby reducing the number of state nodes in the model. In this embodiment, the context-dependent modeling unit in the speech recognition decoder may be a phoneme, and the context-independent modeling unit produced by the simplified mapping is also a phoneme. The process specifically includes:
First, obtaining left and right correlated phonemes for each target phoneme in the speech recognition decoder; the left-related phoneme and the right-related phoneme are both context-related phonemes of the target phoneme. For example, if 26 phones (modeling units) are included in the acoustic model of the speech recognition decoder, then all of the 26 phones are target phones.
Secondly, obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left correlation phoneme and a probability value of each state output observation sequence corresponding to the left correlation phoneme, and obtaining a state transition probability value of the right correlation phoneme and a probability value of each state output observation sequence corresponding to the right correlation phoneme;
then, averaging the state transition probability value of the target phoneme, the state transition probability value of the left correlation phoneme and the state transition probability value of the right correlation phoneme to obtain a state transition probability average value, and averaging the probability value of each state output observation sequence corresponding to the target phoneme, the probability value of each state output observation sequence corresponding to the left correlation phoneme and the probability value of each state output observation sequence corresponding to the right correlation phoneme to obtain a state output observation sequence probability average value;
Finally, the state transition probability mean is determined to be the target state transition probability value of the target phoneme, and the probability mean of the state-output observation sequences is determined to be the probability value of the target phoneme's states outputting the observation sequence; this averaging is sketched below.
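The averaging just described can be illustrated with a small sketch. Here each phoneme's HMM is assumed to be summarized by a state transition matrix `A` and a per-state observation-output (emission) matrix `B`; the function and variable names are illustrative, not taken from the application:

```python
import numpy as np

def map_to_context_independent(A_variants, B_variants):
    """Average HMM parameters over a target phoneme and its
    left-/right-context-dependent variants.

    A_variants: list of state transition probability matrices
                [A_target, A_left, A_right, ...].
    B_variants: list of per-state observation output probability
                matrices, in the same order.
    """
    A_mean = np.mean(A_variants, axis=0)  # state transition probability mean
    B_mean = np.mean(B_variants, axis=0)  # state-output observation probability mean
    return A_mean, B_mean

# Usage for one target phoneme with one left- and one right-related phoneme:
# A_tgt, B_tgt = map_to_context_independent([A_target, A_left, A_right],
#                                           [B_target, B_left, B_right])
```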
After the mapping, each modeling unit corresponds to only one state node, i.e., one node of the acoustic model used as the voice back-endpoint discrimination model, and the target state transition probability value and the probability value of the target state output observation sequence of the target phoneme are the probability values attached to that node. For example, after the 26 phonemes (modeling units) above are mapped, the resulting voice back-endpoint discrimination model is a hidden Markov model (HMM) network containing 26 state nodes.
In this embodiment, recognizing the target voice data based on the voice back-endpoint discrimination model may specifically comprise: obtaining a target observation sequence corresponding to the target voice data; recursively calculating, in the order in which the observations occur, the probability of the voice back-endpoint discrimination model outputting the target observation sequence; decoding the target voice data with the model using a token passing algorithm to obtain the target state path with the maximum probability of outputting the target observation sequence; and determining the voice unit sequence corresponding to that state path as the target voice unit sequence.
The process of decoding the target voice data with the voice back-endpoint discrimination model using the token passing algorithm specifically comprises the following steps:
Firstly, the target voice data is preprocessed to obtain audio frames. The process comprises: converting the analog voice signal into a digital signal through A/D conversion; pre-emphasizing the digital signal to boost its high-frequency part; and then framing and windowing, so that the non-stationary voice signal data points are divided into short-time signals within audio frames.
Secondly, features are extracted from the audio frames to obtain target audio features. In this embodiment, the audio features are MFCC speech features, and feature extraction converts the sound signal to the Mel frequency scale.
Finally, the target audio features are input into the voice back-endpoint discrimination model and decoded with a token passing algorithm. The maximum-probability path to each state node of the model is recorded in the corresponding model node, and the incoming arc of that maximum-probability path is recorded in a model node variable. Tokens are passed forward as target audio features arrive; when passing terminates, path priority is judged from the backtracking information stored in each token by computing, for each state path, the likelihood (probability value) of generating the observation sequence of the target voice data: the larger the log-likelihood, the more likely the path is the preferred one. The log-likelihood of a path equals the sum of the log transition probabilities of all hops the path traverses plus the log probability densities (observation output probabilities) of the observations associated with the HMM node states it passes through. Decoding with the token passing algorithm reduces redundancy in the search network of the voice back-endpoint discrimination model, greatly reduces space complexity and computational cost, and improves decoding efficiency. A sketch of this token passing follows.
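A compact sketch of this decoding step, assuming the per-frame features from the MFCC sketch above and the merged single-node-per-phoneme HMM; working in log probabilities and the `Token` structure are simplifications for illustration:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Token:
    log_like: float                            # accumulated path log-likelihood
    path: list = field(default_factory=list)   # state indices traversed

def token_passing(log_A: np.ndarray, log_emit, frames) -> list:
    """Viterbi-style token passing over the back-endpoint HMM.

    log_A[i, j]    : log transition probability from state i to state j.
    log_emit(s, f) : log probability that state s outputs frame f.
    Returns the maximum-probability state path; its states map back
    to the target voice unit (phoneme) sequence.
    """
    n = log_A.shape[0]
    # One token per state; initial state priors are omitted for brevity.
    tokens = [Token(log_emit(s, frames[0]), [s]) for s in range(n)]
    for f in frames[1:]:
        new_tokens = []
        for j in range(n):
            # Pass only the best-scoring incoming token on to state j.
            i = max(range(n), key=lambda k: tokens[k].log_like + log_A[k, j])
            new_tokens.append(Token(
                tokens[i].log_like + log_A[i, j] + log_emit(j, f),
                tokens[i].path + [j]))
        tokens = new_tokens
    # The winning token's backtracking information is its stored path.
    return max(tokens, key=lambda t: t.log_like).path
```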
The implementation order of steps S102 and S103 is not limited; that is, the intermediate speech recognition result generated by the speech recognition decoder after recognizing the target voice data may also be obtained after the target voice data has been decoded with the voice back-endpoint discrimination model and the target voice unit sequence obtained.
And S104, adjusting the voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters.
After the intermediate speech recognition result generated by the speech recognition decoder and the target voice unit sequence obtained by decoding the target voice data with the voice back-endpoint discrimination model are both available, this step adjusts the voice back-endpoint detection parameters applied to the target voice unit sequence according to the intermediate speech recognition result, obtaining the target detection parameters.
In this embodiment, adjusting the voice back-endpoint detection parameters according to the intermediate speech recognition result may specifically mean adjusting the silence detection time of the voice back endpoint to obtain a target silence detection time. For example, if the intermediate speech recognition result is unchanged within a first predetermined time period, the silence detection time of the voice back endpoint is shortened, giving the target silence detection time used to discriminate the back endpoint: if the preset silence detection time is 800 ms and the intermediate recognition result does not change within 300 ms, the silence detection time is shortened to 600 ms. Adjusting the silence detection time according to the intermediate recognition result lets the back-endpoint detection process adapt in real time to the current speech scenario, realizing dynamic back-endpoint detection. For example, if user A habitually pauses while speaking, the silence detection time can be adjusted in real time according to the length of user A's mid-utterance pauses; longer pauses shorten the silence detection time so that the back endpoint is determined quickly. A sketch of this timer adjustment follows.
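For illustration, a minimal sketch of this adjustment using the example values above (800 ms preset, 300 ms stability threshold, 600 ms shortened value); the restore behavior covered in the note below is included as well:

```python
def adjust_silence_ms(stable_ms: int,
                      preset_ms: int = 800,
                      stable_threshold_ms: int = 300,
                      shortened_ms: int = 600) -> int:
    """Return the target silence detection time for the voice back endpoint.

    stable_ms is how long the intermediate speech recognition result has
    been unchanged; the millisecond values mirror the example in the text
    and are not fixed by the application.
    """
    if stable_ms >= stable_threshold_ms:
        return shortened_ms  # stable result: shorten, detect the endpoint sooner
    # Result changed again (voice input not finished): restore the preset
    # time so the back endpoint is not determined prematurely.
    return preset_ms
```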
It should be noted that if the intermediate speech recognition result changes within the interval from 300 ms to 600 ms, the silence detection time of the voice back endpoint may be restored to the preset 800 ms. The reason is that a change in the intermediate recognition result during that interval means the voice input has not ended; restoring the preset silence detection time avoids determining the voice back endpoint prematurely. In this embodiment, after the intermediate speech recognition result generated by the speech recognition decoder is obtained, the method further includes the following:
A. If the intermediate speech recognition result is unchanged within a second predetermined time period, semantic recognition is performed on the target voice data to obtain a target semantic recognition result. Specifically, the result recognized by the speech recognition decoder is semantically analyzed, which may be done with natural language understanding (NLU) technology. The "second" predetermined period merely distinguishes this period from the first predetermined period above, indicating that the intermediate speech recognition result plays a different role in the two scenarios.
In this embodiment, semantic analysis is triggered only after the intermediate speech recognition result has been unchanged for the second predetermined period, rather than frame by frame, so as to save computation: when the decoder's intermediate result has not changed for a predetermined period (the result can then be considered stable), the recognition result is semantically analyzed through natural language understanding (NLU).
B. Judging whether the target semantic recognition result is matched with the target semantic information which is preset and used for judging the voice rear endpoint; for example, the target semantic information for discriminating the voice rear end point may be a navigation command word.
C. If the target semantic recognition result does not match with the predetermined target semantic information for discriminating the voice rear end point, the operation of step S103 is performed, that is, the target voice data is decoded based on the voice rear end point discriminating model, to obtain a target voice unit sequence.
D. If the target semantic recognition result matches the target semantic information for discriminating the voice back endpoint, the current time point is determined to be the voice back endpoint of the target voice data. For example, for a preset navigation command word, once its semantics are detected there is no need to decode the target voice data with the back-endpoint discrimination model; the voice back endpoint is determined directly, reducing interaction latency. In a vehicle-mounted scenario, after the user speaks an offline command word, it is only necessary to wait for the semantic analysis module to check whether a predefined command word is hit; if so, an end-of-speech event is returned immediately. A sketch of this semantic shortcut is given below.
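A minimal sketch of steps A through D; the `nlu_parse` analyzer, the command-word set, and the 500 ms window are hypothetical illustrations, not components named by the application:

```python
COMMAND_WORDS = {"navigate home", "stop navigation"}  # illustrative command words

def nlu_parse(text: str) -> str:
    """Placeholder for the NLU module; here it merely normalizes the text."""
    return text.strip().lower()

def semantic_endpoint_hit(partial_text: str, stable_ms: int,
                          second_window_ms: int = 500) -> bool:
    """Return True if semantics alone justify declaring the back endpoint.

    partial_text: the decoder's intermediate speech recognition result.
    stable_ms: how long that result has been unchanged.
    second_window_ms is an assumed value for the second predetermined period.
    """
    if stable_ms < second_window_ms:
        return False                    # result not yet stable: skip NLU (A)
    meaning = nlu_parse(partial_text)   # semantic recognition (A)
    return meaning in COMMAND_WORDS     # match => endpoint now (B, D);
                                        # no match => fall back to decoding (C)
```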
S105, judging the voice rear end point of the target voice unit sequence according to the target detection parameter, and obtaining a voice rear end point judging result.
After the target detection parameters are obtained, this step discriminates the voice back endpoint of the target voice unit sequence according to them, obtaining a voice back-endpoint discrimination result. For example, the back endpoint of the target voice unit sequence is discriminated against the target silence detection time: when the duration corresponding to the number of trailing silence frames in the target voice unit sequence reaches the target silence detection time, the current time point is determined to be the voice back endpoint of the target voice unit sequence. A sketch follows.
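A small sketch of this silence-duration check; the `"sil"` unit label and the 10 ms frame shift are assumptions for illustration:

```python
def is_back_endpoint(unit_seq, silence_ms: int, frame_ms: int = 10) -> bool:
    """Discriminate the back endpoint from the decoded voice-unit sequence.

    unit_seq: one voice unit per frame, e.g. ["n", "i", "sil", "sil", ...],
    with "sil" marking a silence frame.
    """
    trailing = 0
    for unit in reversed(unit_seq):   # count the run of trailing silence frames
        if unit != "sil":
            break
        trailing += 1
    # Endpoint reached once trailing silence spans the target detection time.
    return trailing * frame_ms >= silence_ms

# Usage: with a 600 ms target and 10 ms frames, 60 consecutive trailing
# silence frames trigger the back-endpoint decision.
```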
In this embodiment, if the voice back-endpoint discrimination result indicates that the voice back endpoint of the target voice data has been detected, identification information corresponding to the back endpoint is output, for example by displaying a punctuation mark at the corresponding position of the text output interface for the target voice data and using the punctuation mark to break the sentence, visually marking the back endpoint. At the same time, voice back-endpoint approval information can be output for the user to confirm whether the visually marked back endpoint is the real one, and the user's feedback on that approval information can be obtained: for example, a prompt such as "Is the current position the end of the speech?" is displayed together with a voice back-endpoint confirmation control and a corresponding denial control, and after the user's touch on either control is detected, the feedback is recorded. This feedback can serve as one of the learning signals for adjusting the voice back-endpoint detection parameters, optimizing the discrimination result.
It should be noted that the executing body of the voice endpoint detection method provided in this embodiment may be a client or a server; the client may be a user's smartphone or a vehicle navigation device. When the executing body is a server, obtaining the target voice data in step S101 may mean receiving the target voice data sent by the voice input end (client) after that input end has failed to detect the voice back endpoint itself. For example, the voice detection process on a mobile phone or vehicle navigation device is limited by its hardware, so its endpoint detection effect may be poor; when such a device does not detect the voice back endpoint, it sends the target voice data to the executing body of this embodiment (for example a server), which performs voice endpoint detection using the acoustic models it maintains (the speech recognition decoder and the voice back-endpoint discrimination model) and its stronger hardware resources, achieving a better detection effect.
With the voice endpoint detection method provided by this embodiment, the back-endpoint detection process runs in parallel with the speech recognition process: the target voice data is decoded with the voice back-endpoint discrimination model to obtain a target voice unit sequence, the back-endpoint detection parameters are adjusted according to the intermediate speech recognition result, and the back endpoint of the target voice unit sequence is discriminated with the adjusted parameters to obtain the voice back-endpoint discrimination result.
In addition, voice endpoint detection places a lower accuracy requirement on the recognition result than speech recognition does: endpoint detection only needs to distinguish sound from silence, with no requirement on the specific content of the sound, so the path-search link in back-endpoint detection can be comparatively simple. On this basis, the modeling units of the voice back-endpoint discrimination model in this embodiment are decoding units obtained by mapping the modeling units of the speech recognition decoder, i.e., context-dependent modeling units are mapped to context-independent ones. Because the number of state nodes corresponding to context-independent modeling units is much smaller than the number corresponding to context-dependent ones, this mapping reduces the number of state nodes in the discrimination model and thereby the computational load of the decoding process.
The second embodiment of the present application further provides a voice endpoint detection method whose executing body is a server on which a speech recognition decoder and a voice back-endpoint discrimination model are built. As shown in fig. 2, the method specifically includes the following steps:
S201, receiving target voice data sent by a client, for example after the client has failed to detect the voice back endpoint of the target voice data itself;
S202, obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
S203, decoding the target voice data based on a voice back-endpoint discrimination model to obtain a target voice unit sequence;
S204, adjusting the voice back-endpoint detection parameters according to the intermediate voice recognition result to obtain target detection parameters;
S205, discriminating the back endpoint of the target voice unit sequence according to the target detection parameters to obtain a voice back-endpoint discrimination result;
S206, sending the voice back-endpoint discrimination result to the client. For example, if the discrimination result indicates that the voice back endpoint of the target voice data has been detected, the result is sent to the client. A schematic server-side handler follows.
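Schematically, the server side of S201 to S206 can be pictured as below, reusing the `adjust_silence_ms` and `is_back_endpoint` sketches above; the `client` object and its methods are hypothetical:

```python
def handle_endpoint_request(client, asr_decoder, tail_model):
    """Server-side handler for the second embodiment (S201 to S206)."""
    speech = client.receive_audio()                    # S201
    partial = asr_decoder.recognize_partial(speech)    # S202: intermediate result
    units = tail_model.decode_units(speech)            # S203: voice unit sequence
    silence_ms = adjust_silence_ms(partial.stable_ms)  # S204: target parameters
    detected = is_back_endpoint(units, silence_ms)     # S205: discriminate endpoint
    if detected:                                       # S206: report only on a hit
        client.send_endpoint_result(detected)
```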
With the voice endpoint detection method provided by this embodiment, voice endpoint detection for the target voice data is achieved through cooperation between the client and the server. The back-endpoint detection process runs in parallel with the speech recognition process; the independently operated endpoint detection link provided by the method decouples endpoint detection from recognition, combines back-endpoint detection with the current speech scenario, and adjusts the back-endpoint detection parameters in real time based on the intermediate recognition results output during recognition, realizing dynamic detection of the voice back endpoint and avoiding the limitation of existing endpoint detection caused by over-dependence on the recognition result.
A third embodiment of the present application provides a voice endpoint detection system, as shown in fig. 3, including: a first speech endpoint detection module 301, a semantic detection module 302, and a second speech endpoint detection module 303;
the first voice endpoint detection module 301 is configured to perform back-endpoint detection on target voice data through a speech recognition decoder and, when the back endpoint of the target voice data is not detected, send the target voice data to the semantic detection module 302. For example, while the decoder recognizes the input target voice data it also performs endpoint detection: when a token (Token) of the decoder reaches a token marking the end of speech, the corresponding recognition result is fetched, i.e., the endpoint decision is made from the backtracking information carried on the state node corresponding to that token; when no recognition result is obtained, the target voice data is sent to the semantic detection module 302.
The semantic detection module 302 is configured to perform semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with the target semantic information which is preset and used for judging the voice rear endpoint; if the target semantic recognition result is matched with the target semantic information for judging the voice back end point, determining the current time point as the voice back end point of the target voice data; and if the target semantic recognition result does not match with the target semantic information which is preset and used for judging the voice rear end point, the target voice data is sent to the second voice end point detection module 303.
The second voice endpoint detection module 303 is configured to obtain the intermediate voice recognition result generated after the speech recognition decoder performs speech recognition on the target voice data; decode the target voice data based on the voice back-endpoint discrimination model to obtain a target voice unit sequence; adjust the voice back-endpoint detection parameters according to the intermediate recognition result to obtain target detection parameters; and discriminate the back endpoint of the target voice unit sequence according to the target detection parameters to obtain a voice back-endpoint discrimination result. For this process, refer to the related content of the first embodiment of the present application; it is not repeated here. A schematic cascade of the three modules follows.
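Schematically, the three modules form a cascade; the module objects below are hypothetical stand-ins for modules 301 through 303:

```python
def cascade_detect(speech, first_vad, semantic, second_vad):
    """Third-embodiment cascade: each stage runs only if the previous fails."""
    result = first_vad.detect(speech)        # decoder-based detection (301)
    if result.endpoint_found:
        return result
    if semantic.matches_command(speech):     # semantic matching (302)
        return semantic.endpoint_now()
    return second_vad.detect(speech)         # discrimination-model detection (303)
```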
In this embodiment, the first voice endpoint detection module may be disposed at a client, and the semantic detection module and the second voice endpoint detection module may be disposed at a server.
According to the voice endpoint detection system provided by the embodiment, the voice endpoint detection can be sequentially carried out on the target voice data by using a plurality of endpoint detection methods, so that the reliability of the voice endpoint detection process is improved.
The fourth embodiment of the present application further provides a voice endpoint detection system, as shown in fig. 4, which includes: a target voice data distribution module 401, a first voice endpoint detection module 402, a semantic detection module 403, a second voice endpoint detection module 404, and a voice post endpoint confirmation module 405;
The target voice data distribution module 401 is configured to distribute target voice data to the first voice endpoint detection module 402, the semantic detection module 403, and the second voice endpoint detection module 404;
the first voice endpoint detection module 402 is configured to perform voice post endpoint detection on the target voice data through the voice recognition decoder, so as to obtain a first voice post endpoint discrimination result.
The semantic detection module 403 is configured to perform semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with the target semantic information which is preset and used for judging the voice rear endpoint or not, and obtaining a semantic matching result; and obtaining a second voice rear endpoint discrimination result according to the semantic matching result.
The second voice endpoint detection module 404 is configured to obtain an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence; according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters; and judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a third voice rear end point judging result.
The voice back-endpoint confirmation module 405 is configured to confirm a target voice back-endpoint discrimination result for the target voice data according to at least two of the first, second and third voice back-endpoint discrimination results. For example, according to the order in which at least two of the discrimination results are obtained, the result obtained first is determined to be the target discrimination result for the target voice data; alternatively, according to a preset priority order, the most reliable of at least two discrimination results obtained at the same time is selected as the target discrimination result. A sketch of the first-result-wins strategy follows.
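The first-result-wins strategy can be sketched with standard concurrency primitives; the detector objects are hypothetical wrappers for modules 402 through 404:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def confirm_endpoint(speech, detectors):
    """Fourth embodiment, schematically: run all detectors in parallel and
    take the earliest voice back-endpoint discrimination result returned."""
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = [pool.submit(d.detect, speech) for d in detectors]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # The remaining detectors finish in the background before the pool
        # closes; only the discrimination result obtained first is returned.
        return next(iter(done)).result()
```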
The voice endpoint detection system provided by this embodiment can apply multiple endpoint detection methods to the target voice data simultaneously, obtain multiple voice rear endpoint discrimination results, and determine the most reliable target discrimination result based on a preset voice rear endpoint confirmation rule. This improves both the efficiency and the reliability of voice rear endpoint detection.
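The two confirmation rules named above can be stated compactly in code. This is a minimal sketch, assuming each module reports a small record of when its result was obtained and its preset reliability rank; the Discrimination type and its field names are invented for illustration.

from typing import List, NamedTuple

class Discrimination(NamedTuple):
    obtained_at: float     # time at which the module produced its result
    priority: int          # preset reliability rank (lower = more reliable)
    endpoint_found: bool   # whether a rear endpoint was detected

def confirm_first_obtained(results: List[Discrimination]) -> Discrimination:
    """Rule 1: the discrimination result obtained first becomes the target result."""
    return min(results, key=lambda r: r.obtained_at)

def confirm_by_priority(results: List[Discrimination]) -> Discrimination:
    """Rule 2: among results obtained at the same time, pick the most reliable."""
    earliest = min(r.obtained_at for r in results)
    simultaneous = [r for r in results if r.obtained_at == earliest]
    return min(simultaneous, key=lambda r: r.priority)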
The first embodiment provides a voice endpoint detection method; correspondingly, the fifth embodiment of the present application provides a voice endpoint detection apparatus. Because the apparatus embodiment is substantially similar to the method embodiment, its description is relatively brief; for details of the relevant technical features, refer to the corresponding descriptions of the method embodiment provided above. The following description of the apparatus embodiment is merely illustrative.
Fig. 5 is a block diagram of the units of the apparatus provided in this embodiment. As shown in fig. 5, the apparatus includes:
a target voice data obtaining unit 501 for obtaining target voice data;
an intermediate voice recognition result obtaining unit 502, configured to obtain an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
a target voice unit sequence obtaining unit 503, configured to decode the target voice data based on a voice rear endpoint discrimination model, to obtain a target voice unit sequence;
a target detection parameter obtaining unit 504, configured to adjust a voice back end point detection parameter according to the intermediate voice recognition result, to obtain a target detection parameter;
and a voice rear endpoint discrimination unit 505, configured to discriminate the voice rear endpoint of the target voice unit sequence according to the target detection parameters, to obtain a voice rear endpoint discrimination result.
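Wired together, the five units form a single pass from raw audio to a discrimination result. The sketch below is illustrative only; every callable is a hypothetical placeholder for the corresponding unit.

from typing import Any, Callable, List

def detect_rear_endpoint(
    get_audio: Callable[[], Any],                       # unit 501: obtain target voice data
    get_intermediate: Callable[[Any], str],             # unit 502: intermediate recognition result
    decode_units: Callable[[Any], List[str]],           # unit 503: target voice unit sequence
    adjust_params: Callable[[str], float],              # unit 504: target detection parameters
    discriminate: Callable[[List[str], float], bool],   # unit 505: rear-endpoint discrimination
) -> bool:
    audio = get_audio()
    partial = get_intermediate(audio)
    units = decode_units(audio)
    params = adjust_params(partial)
    return discriminate(units, params)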
Optionally, the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes:
mapping a context-dependent modeling unit in the speech recognition decoder to a context-independent modeling unit;
establishing the voice rear endpoint discrimination model based on the context-independent modeling units;
and identifying the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence.
Optionally, the modeling unit is a phoneme, and the mapping the context-dependent modeling unit in the speech recognition decoder to the context-independent modeling unit includes:
obtaining left and right correlated phonemes for each target phoneme in the speech recognition decoder;
obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left correlation phoneme and a probability value of each state output observation sequence corresponding to the left correlation phoneme, and obtaining a state transition probability value of the right correlation phoneme and a probability value of each state output observation sequence corresponding to the right correlation phoneme;
calculating the average of the state transition probability values of the target phoneme, the left-related phoneme, and the right-related phoneme to obtain a state transition probability mean; and averaging the probability values of each state output observation sequence corresponding to the target phoneme, the left-related phoneme, and the right-related phoneme to obtain a probability mean of the state output observation sequence;
and determining the state transition probability mean value as a target state transition probability value of the target phoneme, and determining the probability mean value of the state output observation sequence as a probability value of the target state output observation sequence of the target phoneme.
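In HMM terms, this collapses each context-dependent (triphone-style) model into a context-independent one by averaging its parameters with those of its left- and right-context variants. A minimal NumPy sketch, assuming each phoneme HMM is stored as a transition matrix A and per-state output probabilities B of matching shapes (the PhonemeHMM container is invented for illustration):

from dataclasses import dataclass
import numpy as np

@dataclass
class PhonemeHMM:
    A: np.ndarray   # state transition probabilities, shape (S, S)
    B: np.ndarray   # per-state output/observation probabilities, shape (S, M)

def map_to_context_independent(target: PhonemeHMM,
                               left: PhonemeHMM,
                               right: PhonemeHMM) -> PhonemeHMM:
    """Average the target phoneme's parameters with its left- and right-related variants."""
    A_mean = (target.A + left.A + right.A) / 3.0    # target state transition probabilities
    B_mean = (target.B + left.B + right.B) / 3.0    # target state output probabilities
    return PhonemeHMM(A=A_mean, B=B_mean)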
Optionally, the identifying the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes: obtaining a target observation sequence of the target voice data; recursively calculating, in the voice rear endpoint discrimination model, the probability of outputting the target observation sequence, frame by frame in its order of occurrence; decoding the target voice data by using the voice rear endpoint discrimination model with a token passing algorithm, to obtain the target state path corresponding to the maximum probability of outputting the target observation sequence; and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
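A Viterbi-style recursion makes the "maximum-probability state path" concrete; token passing is an equivalent formulation of the same search. The sketch below assumes discrete observations and works in log space for numerical stability, neither of which the description mandates.

import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """log_pi: (S,) initial log-probs; log_A: (S, S) transition log-probs;
    log_B: (S, M) output log-probs; obs: per-frame observation indices.
    Returns the maximum-probability state path and its log-probability."""
    S, T = log_A.shape[0], len(obs)
    delta = np.full((T, S), -np.inf)            # best log-prob ending in each state
    back = np.zeros((T, S), dtype=int)          # backpointers for path recovery
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):                       # frame-by-frame recursion
        scores = delta[t - 1][:, None] + log_A  # (previous state, current state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]            # backtrack the target state path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())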
Optionally, the decoding the target voice data by using the voice rear endpoint discrimination model with a token passing algorithm includes: preprocessing the target voice data to obtain audio frames; extracting features from the audio frames to obtain target audio features; and inputting the target audio features into the voice rear endpoint discrimination model for decoding with the token passing algorithm.
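The framing and feature-extraction steps might look as follows. The 25 ms frame length, 10 ms shift, and log-energy feature are illustrative choices only; the description does not fix the feature type.

import numpy as np

def frame_audio(samples: np.ndarray, sr: int,
                frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Preprocess raw samples into overlapping audio frames."""
    flen = int(sr * frame_ms / 1000)
    fshift = int(sr * shift_ms / 1000)
    n = 1 + max(0, (len(samples) - flen) // fshift)   # number of full frames
    return np.stack([samples[i * fshift:i * fshift + flen] for i in range(n)])

def extract_features(frames: np.ndarray) -> np.ndarray:
    """Per-frame log energy as a placeholder target audio feature."""
    return np.log(np.sum(frames.astype(np.float64) ** 2, axis=1) + 1e-10)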
Optionally, the adjusting the voice rear endpoint detection parameters according to the intermediate voice recognition result to obtain target detection parameters includes: adjusting the silence detection time of the voice rear endpoint according to the intermediate voice recognition result, to obtain a target silence detection time;
correspondingly, the discriminating the rear endpoint of the target voice unit sequence according to the target detection parameters to obtain a voice rear endpoint discrimination result includes: discriminating the rear endpoint of the target voice unit sequence according to the target silence detection time, to obtain the voice rear endpoint discrimination result.
Optionally, the adjusting the silence detection time of the voice rear endpoint according to the intermediate voice recognition result to obtain the target silence detection time includes: if the intermediate voice recognition result remains unchanged within a first preset time period, shortening the silence detection time of the voice rear endpoint, to obtain the target silence detection time for discriminating the voice rear endpoint.
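A sketch of that adjustment rule, assuming a fixed shortening factor (the factor and the threshold handling are illustrative; the description only says the window is shortened):

def target_silence_time(default_s: float, unchanged_for_s: float,
                        t1_s: float, shorten_factor: float = 0.5) -> float:
    """Shorten the rear-endpoint silence window once the intermediate
    recognition result has been stable for the first preset period t1_s."""
    return default_s * shorten_factor if unchanged_for_s >= t1_s else default_s

def rear_endpoint_detected(trailing_silence_s: float, target_s: float) -> bool:
    """Discriminate: the rear endpoint is reached once the trailing silence
    in the target voice unit sequence lasts at least the target time."""
    return trailing_silence_s >= target_s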
Optionally, the method further comprises: if the intermediate voice recognition result remains unchanged within a second preset time period, performing semantic recognition on the target voice data to obtain a target semantic recognition result; and judging whether the target semantic recognition result matches preset target semantic information used for discriminating the voice rear endpoint;
the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes: if the target semantic recognition result does not match the target semantic information for discriminating the voice rear endpoint, decoding the target voice data based on the voice rear endpoint discrimination model to obtain the target voice unit sequence.
Optionally, the method further comprises: and if the target semantic recognition result is matched with the target semantic information for judging the voice back end point, determining the current time point as the voice back end point of the target voice data.
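One way to picture the semantic check: treat the preset target semantic information as a set of complete-intent patterns, and declare the current instant the rear endpoint on a match. The pattern set below is entirely invented; only the match-or-fall-through behavior reflects the description.

import re
import time
from typing import Optional

TARGET_SEMANTICS = [                          # hypothetical complete-intent patterns
    re.compile(r"(play|stop|pause)\s+.+"),
    re.compile(r"set (an )?alarm for .+"),
]

def semantic_rear_endpoint(recognized_text: str) -> Optional[float]:
    """Return the current instant as the rear endpoint on a semantic match,
    or None so the caller falls back to the discrimination-model decode."""
    text = recognized_text.strip().lower()
    if any(p.fullmatch(text) for p in TARGET_SEMANTICS):
        return time.time()
    return None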
Optionally, the obtaining the target voice data includes: if the voice input terminal that captures the target voice data has not detected a voice rear endpoint of the target voice data, receiving the target voice data sent by the voice input terminal.
Optionally, the method further comprises: and if the voice back end point judging result indicates that the voice back end point of the target voice data is detected, outputting identification information corresponding to the voice back end point.
Optionally, the method further comprises: outputting voice back end point approval information, wherein the voice back end point approval information is used for a user to confirm whether the voice back end point is a real voice back end point or not; and obtaining feedback information of the user aiming at the voice post-endpoint approval information.
The foregoing embodiments provide a voice endpoint detection method and a voice endpoint detection apparatus. In addition, the sixth embodiment of the present application provides an electronic device. Because the electronic device embodiment is substantially similar to the method embodiment, its description is relatively brief; for details of the relevant technical features, refer to the corresponding descriptions of the method embodiment provided above. The following description of the electronic device embodiment is merely illustrative.
fig. 6 is a schematic diagram of an electronic device according to the present embodiment.
As shown in fig. 6, the electronic device includes: a processor 601; a memory 602;
The memory 602 is configured to store a program for detecting a voice endpoint, where the program, when read and executed by the processor, performs the following operations:
obtaining target voice data;
obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence;
according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters;
and judging the voice rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
Optionally, the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes:
mapping a context-dependent modeling unit in the speech recognition decoder to a context-independent modeling unit;
establishing the voice rear endpoint discrimination model based on the context-independent modeling units;
and identifying the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence.
Optionally, the modeling unit is a phoneme, and the mapping the context-dependent modeling unit in the speech recognition decoder to the context-independent modeling unit includes:
obtaining left and right correlated phonemes for each target phoneme in the speech recognition decoder;
obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left correlation phoneme and a probability value of each state output observation sequence corresponding to the left correlation phoneme, and obtaining a state transition probability value of the right correlation phoneme and a probability value of each state output observation sequence corresponding to the right correlation phoneme;
calculating the average of the state transition probability values of the target phoneme, the left-related phoneme, and the right-related phoneme to obtain a state transition probability mean; averaging the probability values of each state output observation sequence corresponding to the target phoneme, the left-related phoneme, and the right-related phoneme to obtain a probability mean of the state output observation sequence;
and determining the state transition probability mean as the target state transition probability value of the target phoneme, and determining the probability mean of the state output observation sequence as the probability value of the target state output observation sequence of the target phoneme.
Optionally, the identifying the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes: obtaining a target observation sequence of the target voice data; recursively calculating, in the voice rear endpoint discrimination model, the probability of outputting the target observation sequence, frame by frame in its order of occurrence; decoding the target voice data by using the voice rear endpoint discrimination model with a token passing algorithm, to obtain the target state path corresponding to the maximum probability of outputting the target observation sequence; and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
Optionally, the decoding the target voice data by using the voice rear endpoint discrimination model with a token passing algorithm includes: preprocessing the target voice data to obtain audio frames; extracting features from the audio frames to obtain target audio features; and inputting the target audio features into the voice rear endpoint discrimination model for decoding with the token passing algorithm.
Optionally, the adjusting the voice rear endpoint detection parameters according to the intermediate voice recognition result to obtain target detection parameters includes: adjusting the silence detection time of the voice rear endpoint according to the intermediate voice recognition result, to obtain a target silence detection time;
correspondingly, the discriminating the rear endpoint of the target voice unit sequence according to the target detection parameters to obtain a voice rear endpoint discrimination result includes: discriminating the rear endpoint of the target voice unit sequence according to the target silence detection time, to obtain the voice rear endpoint discrimination result.
Optionally, the adjusting the silence detection time of the voice rear endpoint according to the intermediate voice recognition result to obtain the target silence detection time includes: if the intermediate voice recognition result remains unchanged within a first preset time period, shortening the silence detection time of the voice rear endpoint, to obtain the target silence detection time for discriminating the voice rear endpoint.
Optionally, the method further comprises: if the intermediate voice recognition result remains unchanged within a second preset time period, performing semantic recognition on the target voice data to obtain a target semantic recognition result; and judging whether the target semantic recognition result matches preset target semantic information used for discriminating the voice rear endpoint;
the decoding the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence includes: if the target semantic recognition result does not match the target semantic information for discriminating the voice rear endpoint, decoding the target voice data based on the voice rear endpoint discrimination model to obtain the target voice unit sequence.
Optionally, the method further comprises: and if the target semantic recognition result is matched with the target semantic information for judging the voice back end point, determining the current time point as the voice back end point of the target voice data.
Optionally, the obtaining the target voice data includes: if the voice input terminal that captures the target voice data has not detected a voice rear endpoint of the target voice data, receiving the target voice data sent by the voice input terminal.
Optionally, the method further comprises: and if the voice back end point judging result indicates that the voice back end point of the target voice data is detected, outputting identification information corresponding to the voice back end point.
Optionally, the method further comprises: outputting voice back end point approval information, wherein the voice back end point approval information is used for a user to confirm whether the voice back end point is a real voice back end point or not; and obtaining feedback information of the user aiming at the voice post-endpoint approval information.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, Phase-Change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technologies, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
While preferred embodiments have been described above, they are not intended to limit the invention. Any person skilled in the art may make variations and modifications without departing from the spirit and scope of the present invention, and therefore the scope of the present invention shall be defined by the claims of the present application.

Claims (21)

1. A method for detecting a voice endpoint, comprising:
obtaining target voice data;
obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence;
according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters;
and judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
2. The method of claim 1, wherein decoding the target speech data based on the speech back end point discrimination model to obtain a sequence of target speech units comprises:
mapping a context-dependent modeling unit in the speech recognition decoder to a context-independent modeling unit;
establishing a voice back end point discrimination model based on the context-independent modeling units;
and identifying the target voice data based on the voice rear endpoint discrimination model to obtain a target voice unit sequence.
3. The method of claim 2, wherein the modeling unit is a phoneme, and wherein the mapping the context-dependent modeling unit in the speech recognition decoder to the context-independent modeling unit comprises:
obtaining left and right correlated phonemes for each target phoneme in the speech recognition decoder;
obtaining a state transition probability value of the target phoneme and a probability value of each state output observation sequence corresponding to the target phoneme, obtaining a state transition probability value of the left correlation phoneme and a probability value of each state output observation sequence corresponding to the left correlation phoneme, and obtaining a state transition probability value of the right correlation phoneme and a probability value of each state output observation sequence corresponding to the right correlation phoneme;
calculating average values of the state transition probability value of the target phoneme, the state transition probability value of the left related phoneme and the state transition probability value of the right related phoneme to obtain a state transition probability average value; the probability value of each state output observation sequence corresponding to the target phoneme, the probability value of each state output observation sequence corresponding to the left correlation phoneme and the probability value of each state output observation sequence corresponding to the right correlation phoneme are averaged to obtain a probability average value of the state output observation sequence;
and determining the state transition probability mean value as a target state transition probability value of the target phoneme, and determining the probability mean value of the state output observation sequence as a probability value of the target state output observation sequence of the target phoneme.
4. The method of claim 2, wherein the identifying the target speech data based on the speech back end point discrimination model to obtain a sequence of target speech units comprises:
obtaining a target observation sequence of the target voice data;
recursively calculating, in the voice rear endpoint discrimination model, the probability of outputting the target observation sequence, frame by frame in its order of occurrence;
decoding the target voice data by using the voice rear end point discrimination model with a token passing algorithm, to obtain a target state path corresponding to the maximum probability of outputting the target observation sequence;
and determining the voice unit sequence corresponding to the target state path as the target voice unit sequence.
5. The method of claim 4, wherein said decoding the target speech data using the speech back end point discrimination model using a token passing algorithm comprises:
preprocessing the target voice data to obtain an audio frame;
extracting the characteristics of the audio frames to obtain target audio characteristics;
and inputting the target audio features into the voice rear endpoint discrimination model, and decoding by adopting a token passing algorithm.
6. The method according to claim 1, wherein adjusting the voice post-endpoint detection parameters according to the intermediate voice recognition result to obtain the target detection parameters comprises:
according to the intermediate voice recognition result, adjusting the mute detection time of the voice rear end point to obtain target mute detection time;
correspondingly, the step of determining the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point determination result comprises the following steps:
and judging the rear end point of the target voice unit sequence according to the target silence detection time to obtain a voice rear end point judging result.
7. The method of claim 6, wherein adjusting the silence detection time of the voice rear end point according to the intermediate voice recognition result to obtain the target silence detection time comprises:
and if the intermediate voice recognition result is unchanged in the first preset time period, shortening the silence detection time of the voice rear end point, and obtaining the target silence detection time for distinguishing the voice rear end point.
8. The method as recited in claim 1, further comprising: if the intermediate voice recognition result is unchanged in the second preset time period, carrying out semantic recognition on the target voice data to obtain a target semantic recognition result;
judging whether the target semantic recognition result is matched with the target semantic information which is preset and used for judging the voice rear endpoint;
the decoding of the target voice data based on the voice back endpoint discrimination model to obtain a target voice unit sequence comprises the following steps:
and if the target semantic recognition result is not matched with the target semantic information for distinguishing the voice rear end point, decoding the target voice data based on the voice rear end point distinguishing model to obtain a target voice unit sequence.
9. The method as recited in claim 8, further comprising: and if the target semantic recognition result is matched with the target semantic information for judging the voice back end point, determining the current time point as the voice back end point of the target voice data.
10. The method of claim 1, wherein the obtaining the target voice data comprises:
and if the voice input end of the target voice data does not detect the voice rear end point for obtaining the target voice data, receiving the target voice data sent by the voice input end.
11. The method as recited in claim 1, further comprising:
and if the voice back end point judging result indicates that the voice back end point of the target voice data is detected, outputting identification information corresponding to the voice back end point.
12. The method as recited in claim 11, further comprising:
outputting voice back end point approval information, wherein the voice back end point approval information is used for a user to confirm whether the voice back end point is a real voice back end point or not;
and obtaining feedback information of the user aiming at the voice post-endpoint approval information.
13. A method for detecting a voice endpoint, comprising:
receiving target voice data sent by a client;
obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence;
according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters;
judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result;
and sending the voice back end point discrimination result to the client.
14. The method of claim 13, wherein the sending the post-speech endpoint discrimination result to the client comprises:
and if the voice back end point judging result indicates that the voice back end point of the target voice data is detected, sending the voice back end point judging result to the client.
15. A voice endpoint detection system, comprising: the system comprises a first voice endpoint detection module, a semantic detection module and a second voice endpoint detection module;
the first voice endpoint detection module is used for detecting a voice rear endpoint of target voice data through a voice recognition decoder, and sending the target voice data to the semantic detection module after the voice rear endpoint of the target voice data is not detected;
the semantic detection module is used for carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with the target semantic information which is preset and used for judging the voice rear endpoint; if the target semantic recognition result is matched with the target semantic information for judging the voice back end point, determining the current time point as the voice back end point of the target voice data; if the target semantic recognition result is not matched with the target semantic information which is preset and used for judging the voice back end point, the target voice data is sent to the second voice end point detection module;
the second voice endpoint detection module is used for obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence; according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters; and judging the rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
16. The system of claim 15, wherein the first voice endpoint detection module is disposed at a client and the semantic detection module and the second voice endpoint detection module are disposed at a server.
17. A voice endpoint detection system, comprising: the voice terminal comprises a target voice data distribution module, a first voice terminal detection module, a semantic detection module, a second voice terminal detection module and a voice rear terminal confirmation module;
the target voice data distribution module is used for distributing target voice data to the first voice endpoint detection module, the semantic detection module and the second voice endpoint detection module;
the first voice endpoint detection module is used for performing voice rear end point detection on the target voice data through a voice recognition decoder to obtain a first voice rear end point discrimination result;
the semantic detection module is used for carrying out semantic recognition on the target voice data to obtain a target semantic recognition result; judging whether the target semantic recognition result is matched with target semantic information which is preset and used for judging the voice rear end point or not, and obtaining a semantic matching result; obtaining a second voice rear endpoint discrimination result according to the semantic matching result;
the second voice endpoint detection module is used for: obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data; decoding the target voice data based on a voice rear end point discrimination model to obtain a target voice unit sequence; adjusting voice rear end point detection parameters according to the intermediate voice recognition result to obtain target detection parameters; and discriminating the rear end point of the target voice unit sequence according to the target detection parameters to obtain a third voice rear end point discrimination result;
the voice rear end point confirmation module is used for confirming a target voice rear end point discrimination result for the target voice data according to the first voice rear end point discrimination result, the second voice rear end point discrimination result, and the third voice rear end point discrimination result.
18. The system of claim 17, wherein the confirming a target voice rear end point discrimination result for the target voice data according to the first voice rear end point discrimination result, the second voice rear end point discrimination result, and the third voice rear end point discrimination result comprises: according to the time order in which at least two of the first, second, or third voice rear end point discrimination results are obtained, determining the discrimination result obtained first as the target voice rear end point discrimination result for the target voice data.
19. The system of claim 17, wherein the confirming a target voice rear end point discrimination result for the target voice data according to the first voice rear end point discrimination result, the second voice rear end point discrimination result, and the third voice rear end point discrimination result comprises:
and selecting, from at least two discrimination results obtained at the same time, the discrimination result with the highest reliability according to a preset priority order, as the target voice rear end point discrimination result for the target voice data.
20. A voice back end point detection apparatus, comprising:
a target voice data obtaining unit for obtaining target voice data;
an intermediate voice recognition result obtaining unit, configured to obtain an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
the target voice unit sequence obtaining unit is used for decoding the target voice data based on the voice rear end point discrimination model to obtain a target voice unit sequence;
the target detection parameter obtaining unit is used for adjusting the voice rear end point detection parameter according to the intermediate voice recognition result to obtain a target detection parameter;
and the voice rear end point judging unit is used for judging the voice rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
21. An electronic device, comprising:
a processor;
a memory for storing a post-speech endpoint detection program that, when read and executed by the processor, performs the operations of:
obtaining target voice data;
obtaining an intermediate voice recognition result generated after the voice recognition decoder performs voice recognition on the target voice data;
decoding the target voice data based on a voice rear endpoint discrimination model to obtain a target voice unit sequence;
according to the intermediate voice recognition result, adjusting voice rear end point detection parameters to obtain target detection parameters;
and judging the voice rear end point of the target voice unit sequence according to the target detection parameter to obtain a voice rear end point judging result.
CN201911181820.4A 2019-11-27 2019-11-27 Voice endpoint detection method and device Active CN112863496B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911181820.4A CN112863496B (en) 2019-11-27 2019-11-27 Voice endpoint detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911181820.4A CN112863496B (en) 2019-11-27 2019-11-27 Voice endpoint detection method and device

Publications (2)

Publication Number Publication Date
CN112863496A CN112863496A (en) 2021-05-28
CN112863496B true CN112863496B (en) 2024-04-02

Family

ID=75984670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911181820.4A Active CN112863496B (en) 2019-11-27 2019-11-27 Voice endpoint detection method and device

Country Status (1)

Country Link
CN (1) CN112863496B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314145A (en) * 2021-06-09 2021-08-27 广州虎牙信息科技有限公司 Sample generation method, model training method, mouth shape driving device, mouth shape driving equipment and mouth shape driving medium
CN113345473B (en) * 2021-06-24 2024-02-13 中国科学技术大学 Voice endpoint detection method, device, electronic equipment and storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
JP4130190B2 (en) * 2003-04-28 2008-08-06 富士通株式会社 Speech synthesis system
US10186254B2 (en) * 2015-06-07 2019-01-22 Apple Inc. Context-based endpoint detection
US10339917B2 (en) * 2015-09-03 2019-07-02 Google Llc Enhanced speech endpointing
US10332515B2 (en) * 2017-03-14 2019-06-25 Google Llc Query endpointing based on lip detection
EP4083998A1 (en) * 2017-06-06 2022-11-02 Google LLC End of query detection

Patent Citations (8)

Publication number Priority date Publication date Assignee Title
JPH08115093A (en) * 1994-10-18 1996-05-07 Kokusai Denshin Denwa Co Ltd <Kdd> Method and device for on-hook detection, and method and device for continuous voice recognition
JPH1063289A (en) * 1996-08-20 1998-03-06 Ricoh Co Ltd Device and method for voice recognition, and information storage medium
KR20130101943A (en) * 2012-03-06 2013-09-16 삼성전자주식회사 Endpoints detection apparatus for sound source and method thereof
CN103886871A (en) * 2014-01-28 2014-06-25 华为技术有限公司 Detection method of speech endpoint and device thereof
CN107004407A (en) * 2015-09-03 2017-08-01 谷歌公司 Enhanced sound end is determined
CN105976812A (en) * 2016-04-28 2016-09-28 腾讯科技(深圳)有限公司 Voice identification method and equipment thereof
CN107045870A (en) * 2017-05-23 2017-08-15 南京理工大学 A kind of the Method of Speech Endpoint Detection of feature based value coding
CN107424628A (en) * 2017-08-08 2017-12-01 哈尔滨理工大学 A kind of method that specific objective sound end is searched under noisy environment

Non-Patent Citations (1)

Title
Research on endpoint detection algorithms in Lhasa-Tibetan speech recognition; Pei Chunbao; Journal of Tibet University (Natural Science Edition); Vol. 29, No. 1; pp. 54-58 *

Also Published As

Publication number Publication date
CN112863496A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
US20220156039A1 (en) Voice Control of Computing Devices
US10884701B2 (en) Voice enabling applications
KR102374519B1 (en) Contextual hotwords
US11087739B1 (en) On-device learning in a hybrid speech processing system
US11817094B2 (en) Automatic speech recognition with filler model processing
US10917758B1 (en) Voice-based messaging
US7689420B2 (en) Personalizing a context-free grammar using a dictation language model
JP6812843B2 (en) Computer program for voice recognition, voice recognition device and voice recognition method
US20070239453A1 (en) Augmenting context-free grammars with back-off grammars for processing out-of-grammar utterances
US20120271631A1 (en) Speech recognition using multiple language models
CN106875936B (en) Voice recognition method and device
JP7230806B2 (en) Information processing device and information processing method
CN110875059B (en) Method and device for judging reception end and storage device
US20180301144A1 (en) Electronic device, method for adapting acoustic model thereof, and voice recognition system
US11887605B2 (en) Voice processing
CN112002349B (en) Voice endpoint detection method and device
CN112863496B (en) Voice endpoint detection method and device
US20240013784A1 (en) Speaker recognition adaptation
CN114385800A (en) Voice conversation method and device
CN114155839A (en) Voice endpoint detection method, device, equipment and storage medium
WO2020238341A1 (en) Speech recognition method, apparatus and device, and computer-readable storage medium
CN112767916A (en) Voice interaction method, device, equipment, medium and product of intelligent voice equipment
CN113611316A (en) Man-machine interaction method, device, equipment and storage medium
CN114399992A (en) Voice instruction response method, device and storage medium
CN112037772A (en) Multi-mode-based response obligation detection method, system and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant