CN115331656A - Non-instruction voice rejection method, vehicle-mounted voice recognition system and automobile - Google Patents


Info

Publication number: CN115331656A
Application number: CN202210917096.2A
Authority: CN (China)
Prior art keywords: feature vector, voice, rejection, information, text
Legal status: Pending (assumed; not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 徐高鹏 (Xu Gaopeng)
Current and original assignee: Weilai Automobile Technology Anhui Co Ltd
Priority and filing date: 2022-08-01
Publication date: 2022-11-11

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 Speech to text systems
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation


Abstract

The invention relates to a rejection method for non-instruction voice, a vehicle-mounted voice recognition system, and an automobile. The rejection method for non-instruction voice comprises the following steps: performing feature extraction on input audio data to obtain a voice feature vector; inputting the voice feature vector into a voice enhancement system to obtain a rejection feature vector, a confidence, intention information, and text information; obtaining a multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, and the text information; and performing rejection judgment on the multi-modal fusion feature vector to obtain a recognition result. By fusing multiple kinds of information through multi-modal fusion, voice judgment comprehensively considers multiple dimensions and combines the rejection feature vector, the confidence, the intention information, and the text information, thereby improving the accuracy of instruction-voice judgment.

Description

Non-instruction voice rejection method, vehicle-mounted voice recognition system and automobile
Technical Field
The invention relates to the field of voice recognition, and specifically provides a rejection method for non-instruction voice, a vehicle-mounted voice recognition system, and an automobile.
Background
Human-machine interaction is usually accompanied by a voice recognition process: after the machine receives a segment of audio information, it must judge whether the target audio is instruction voice. Instruction voice is voice with a definite intention issued by a user to a machine.
Judging instruction voice requires multi-dimensional information such as the driving state of the vehicle and the user's intonation and speech rate. In the prior art, however, instruction voice is judged only by simple text recognition; the few dimensions considered and the poor recognition accuracy cause many non-instruction audios to be misrecognized as instruction voice by the machine.
Accordingly, there is a need in the art for a new rejection method for non-instruction voice to solve the above problems.
Disclosure of Invention
In order to overcome the above drawbacks, the present invention provides a solution, or at least a partial solution, to the problem that non-instruction voice is easily misrecognized as instruction voice in voice recognition.
In a first aspect, the present invention provides a rejection method for non-instruction voice, the method comprising:
performing feature extraction on input audio data to obtain a voice feature vector;
inputting the voice feature vector into a voice enhancement system to obtain a rejection feature vector, a confidence, intention information, and text information;
obtaining a multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, and the text information;
and performing rejection judgment on the multi-modal fusion feature vector to obtain a recognition result.
In one technical solution of the above rejection method for non-instruction voice,
the voice enhancement system comprises a voice enhancement model, an intention understanding model, and a text coding model;
the step of inputting the voice feature vector into the voice enhancement system to obtain the rejection feature vector, the confidence, the intention information, and the text information comprises:
inputting the voice feature vector into a trained voice enhancement model to obtain the rejection feature vector, the confidence, and a recognition result text;
inputting the recognition result text into a trained intention understanding model to obtain the intention information;
and inputting the recognition result text into a trained text coding model to obtain the text information.
In one technical solution of the above rejection method for non-instruction voice, the step of obtaining a multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, and the text information comprises:
obtaining the multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, the text information, and in-vehicle information.
In one technical solution of the above rejection method for non-instruction voice, the voice enhancement model comprises a speech encoder and a speech decoder;
the step of inputting the voice feature vector into the trained voice enhancement model to obtain the rejection feature vector, the confidence, and the recognition result text comprises:
inputting the voice feature vector into the speech encoder to obtain the rejection feature vector;
and inputting the rejection feature vector into the speech decoder to obtain the confidence and the recognition result text.
In one technical solution of the above rejection method for non-instruction voice, the speech encoder is composed of M layers of conv1d networks, and the text coding model is composed of N layers of conv1d networks and Y layers of LSTM networks, where M, N, and Y are all natural numbers.
In one technical solution of the above rejection method for non-instruction voice, the step of obtaining the multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, the text information, and the in-vehicle information comprises:
fusing the rejection feature vector, the confidence, the intention information, the text information, and the in-vehicle information to obtain a model set feature vector;
and passing the model set feature vector through a trained attention mechanism model to obtain the multi-modal fusion feature vector.
In one technical solution of the above rejection method for non-instruction voice, the in-vehicle information comprises a vehicle state and/or the sound zone direction of the audio data source.
In one technical solution of the above rejection method for non-instruction voice, the step of performing rejection judgment on the multi-modal fusion feature vector to obtain a recognition result comprises:
inputting the multi-modal fusion feature vector into a trained rejection judging model to obtain the recognition result,
wherein the rejection judging model is:
y = sigmoid(J1), J1 = W1·attn + b1, where sigmoid is the activation function, attn is the multi-modal fusion feature vector, W1 is a weight of the rejection network, and b1 is a bias vector of the rejection network.
In one technical solution of the above rejection method for non-instruction voice, the fusion is a splicing operation;
the attention mechanism model is:

attn = softmax(A·Aᵀ/√d)·A

where attn is the multi-modal fusion feature vector, softmax is the activation function, d is a scaling coefficient, and A is the model set feature vector.
In a second aspect, an electronic device is provided, comprising a processor and a storage device, the storage device being adapted to store a plurality of program codes, the program codes being adapted to be loaded and run by the processor to execute the rejection method for non-instruction voice according to any one of the above technical solutions.
In a third aspect, a computer-readable storage medium is provided, in which a plurality of program codes are stored, the program codes being adapted to be loaded and run by a processor to execute the rejection method for non-instruction voice according to any one of the above technical solutions.
In a fourth aspect, a vehicle-mounted voice recognition system is provided, the system comprising: a voice acquisition device for acquiring audio data; and the above electronic device, operated to execute the rejection method for non-instruction voice.
In a fifth aspect, an automobile is provided, comprising the vehicle-mounted voice recognition system of the fourth aspect.
One or more technical solutions of the invention have at least one or more of the following beneficial effects:
In the technical solutions of the invention, multiple kinds of information are fused by a multi-modal fusion method, so that voice judgment comprehensively considers multiple dimensions and combines the rejection feature vector, the confidence, the intention information, and the text information, improving the accuracy of instruction-voice judgment. Adding the in-vehicle information as one of the judgment inputs in the multi-modal fusion further increases the dimensions considered, making the instruction judgment more reasonable and accurate.
Drawings
The disclosure of the present invention will become more readily understood with reference to the accompanying drawings. Those skilled in the art will readily understand that these drawings are for illustrative purposes only and are not intended to limit the scope of the present invention. Moreover, like numerals in the drawings indicate like parts, in which:
FIG. 1 is a flow chart illustrating the main steps of a rejection method for non-instruction voice according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a voice enhancement system according to an embodiment of the present invention;
FIG. 3 is a sub-flow diagram of a rejection method for non-instruction voice according to an embodiment of the present invention;
FIG. 4 is a sub-flow diagram of a rejection method for non-instruction voice according to an embodiment of the present invention;
FIG. 5 is a sub-flow diagram of a rejection method for non-instruction voice according to an embodiment of the present invention;
FIG. 6 is a sub-flow diagram of a rejection method for non-instruction voice according to an embodiment of the present invention;
FIG. 7 is a sub-flow diagram of a rejection method for non-instruction voice according to an embodiment of the present invention;
FIG. 8 is a sub-flow diagram of a rejection method for non-instruction voice according to an embodiment of the present invention;
FIG. 9 is a sub-flow diagram of a rejection method for non-instruction voice according to an embodiment of the present invention.
Detailed Description
Some embodiments of the invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
In the description of the present invention, a "module" or "processor" may include hardware, software, or a combination of both. A module may comprise hardware circuitry, various suitable sensors, communication ports, and memory; may comprise software components such as program code; or may be a combination of software and hardware. The processor may be a central processing unit, microprocessor, image processor, digital signal processor, or any other suitable processor. The processor has data and/or signal processing functionality and may be implemented in software, hardware, or a combination thereof. Non-transitory computer-readable storage media include any suitable medium that can store program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, and random-access memory. The term "A and/or B" denotes all possible combinations of A and B, such as A alone, B alone, or A and B. The term "at least one A or B" or "at least one of A and B" has a meaning similar to "A and/or B" and may include A alone, B alone, or both A and B. The singular forms "a", "an", and "the" may include the plural forms as well.
Some terms to which the present invention relates are explained first.
Instruction voice: when performing voice operations, the original audio corresponding to an executable voice command received by the system is called instruction voice. In other words, instruction voice is voice that the system receives and must execute or respond to. It should be noted that instruction voice may be a command statement; for example, if an instruction voice is "increase the volume by fifteen percent", the system raises the speaker volume accordingly. It may also be a query for the system to judge and answer; for example, if an instruction voice is "can I get to Xuzhou before three p.m. tomorrow", the system should calculate the time required from the departure point to the destination by common transportation modes and give an answer.
Attention mechanism (attention model): the attention mechanism selects, from the received information, the information most critical to the current task target; that is, it screens the information to pick out the key information, thereby improving the efficiency and accuracy of information processing.
In traditional voice recognition, the voice is first recognized to obtain its corresponding text information, the text information is then matched against instructions in a database to obtain the corresponding voice instruction, and finally the system executes that instruction. This approach has many disadvantages: the judgment relies on text information alone, so the data source is one-dimensional, the final judgment of voice instructions is inaccurate, and judgment accuracy drops sharply under certain kinds of interference.
For ease of understanding, a specific application scenario is given here using the prior art as an example: human-vehicle voice interaction. Human-vehicle voice interaction often faces both scene complexity and personnel complexity.
First, the scene complexity. Unlike a home environment, the scenes in which human-vehicle interaction may take place vary widely, from a quiet underground parking space to a noisy intersection. When judging whether a target voice is instruction voice, the system should consider the special circumstances of different scenes, thereby increasing judgment accuracy. For example, on a clear day on a downtown street, the system receives a voice command whose text content is "turn on the high beam"; a prior-art system would turn on the vehicle's high beam based on the text content alone. If the information that the weather is good, it is daytime, and the vehicle is downtown were collected, however, the system could judge that this voice command should not be executed. Because the prior art cannot comprehensively weigh such specific scene information, a voice command that should be invalidated is judged to be instruction voice.
Next, the personnel complexity. Because the people in a vehicle behave with great randomness, the voice collected by the system can be heavily interfered with during human-vehicle interaction. For example, suppose the primary driver is interfered with by the front passenger while issuing a voice instruction: the primary driver's instruction voice is "it's so hot, open the window", while the front passenger simultaneously says "go turn on the air conditioner". If "open the window" and "go turn on" overlap in time, the text content produced by a prior-art system is "it's so hot &&, turn on the air conditioner" (where && indicates segments whose corresponding text cannot be recognized), and the system then turns on the air conditioner. Clearly, the voice from the front passenger has been treated as instruction voice by the system, while the real instruction voice from the primary driver is neither recognized nor executed. This example shows that, for human-vehicle voice interaction, judging by text content alone is not sufficient.
This embodiment provides a rejection method for non-instruction voice.
Referring to FIGS. 1-9, FIG. 1 is a flow chart illustrating the main steps of a rejection method for non-instruction voice according to an embodiment of the present invention. As shown in FIG. 1, the rejection method for non-instruction voice in this embodiment of the invention mainly includes steps S10-S40, as follows:
Step S10: perform feature extraction on the input audio data to obtain a voice feature vector.
In this embodiment, FBank is used to perform feature extraction on the input audio data to obtain the voice feature vector. The input audio data may be preprocessed before FBank feature extraction: during preprocessing, the input audio is divided into frames, i.e. minimum-unit audio segments of a preset frame length, and FBank feature extraction and subsequent processing are applied to the input audio frame by frame. It should be noted that the frames produced in preprocessing differ from the frames of the time-domain waveform: the frames divided in preprocessing are the samples used for analyzing and extracting the FBank features, whereas the frames of the time-domain waveform are samples obtained by sampling the audio on a time-domain scale. FBank feature extraction is performed after preprocessing.
The FBank feature extraction step in this embodiment includes steps S101-S102, as follows:
Step S101: convert the input audio data from a time-domain signal to a frequency-domain signal.
In this embodiment, the input audio data at this point is the audio data that has been framed during preprocessing. The independent variable of a time-domain signal is time, and the dependent variable is the amplitude of the signal; the independent variable of a frequency-domain signal is frequency, and the dependent variable is the amplitude of the frequency component. In practice, the time-domain representation is generally more visual and intuitive, while frequency-domain analysis is more concise and makes problems easier to analyze. At this point the framed input audio data is still a time-domain signal, and to identify the differences between signals, the time-domain signal must be converted into a frequency-domain signal.
In one embodiment, the input audio data is converted from a time-domain signal to a frequency-domain signal by a Fourier transform; since the input audio data in this embodiment is digital audio, a discrete Fourier transform (DFT) is used. Preferably, this embodiment uses a fast Fourier transform (FFT), thereby reducing the computational complexity of the Fourier transform.
Step S102: pass the input audio data, now a frequency-domain signal, through a mel filter to obtain the voice feature vector.
In this embodiment, the mel filter (mel filterbank) is prior art. To understand its role, note first that human sensitivity to audio is related to audio frequency: humans perceive low-frequency signals more strongly than high-frequency ones. Specifically, below 1 kHz the degree of perception is roughly linear in frequency, while above 1 kHz it is roughly logarithmic in frequency. The mel filter embodies a scale that simulates this human perception of audio signals at different frequencies; input audio data passed through the mel filter is therefore closer to the information a human actually receives. The filtered audio data of each frame is then converted into the voice feature vector, giving X = (X1, X2, ..., XT), where T corresponds to the number of frames after splitting, i.e. the sequence length.
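For illustration, a minimal sketch of steps S101-S102 in Python follows. The librosa framing and mel-filterbank helpers, the frame length of 560 samples (35 ms at 16 kHz, matching the framing example later in this description), the hop size, and the 80 mel bands are assumptions not fixed by this description.

    import numpy as np
    import librosa  # assumed here only for its framing helper and mel filterbank

    def fbank_features(audio, sr=16000, frame_len=560, hop=560, n_mels=80):
        # Preprocessing: split the time-domain signal into frames of a preset length.
        frames = librosa.util.frame(audio, frame_length=frame_len, hop_length=hop).T
        window = np.hanning(frame_len)
        # Step S101: time domain -> frequency domain via FFT (power spectrum per frame).
        spectrum = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
        # Step S102: apply the mel filterbank (roughly linear below 1 kHz,
        # logarithmic above) and take the log, mimicking human perception.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
        return np.log(spectrum @ mel_fb.T + 1e-10)  # X: shape (T, n_mels)

The returned matrix is the voice feature vector X = (X1, X2, ..., XT), one row per frame.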
Step S20: input the voice feature vector into the voice enhancement system to obtain the rejection feature vector, the confidence, the intention information, and the text information.
In this embodiment, the voice enhancement system includes a voice enhancement model, an intention understanding model, and a text coding model. The rejection feature vector is a dense vector converted from the voice feature vector. The text information is obtained from the voice feature vector after feature extraction. The confidence is the confidence in the text information, i.e. an estimate of the probability distribution over the text information; the confidence in this embodiment includes a confidence interval. The intention information is the intention understanding of the text information.
In one embodiment, as shown in FIG. 6, the rejection feature vector, the confidence, the intention information, and the text information are obtained by steps S201-S203, as follows:
Step S201: input the voice feature vector into a trained voice enhancement model to obtain the rejection feature vector, the confidence, and the recognition result text.
In this embodiment, the voice enhancement model adopts a speech encoder (SpeechEncoder) and speech decoder (SpeechDecoder) architecture, where the recognition result text is obtained from the voice feature vector and is presented in natural-language form, i.e. text readable by humans. The confidence is the confidence in the recognition result text.
In one embodiment, a method for obtaining the rejection feature vector, the confidence, and the recognition result text is provided, as shown in FIG. 8, comprising steps S2011-S2012, as follows:
Step S2011: input the voice feature vector into the speech encoder to obtain the rejection feature vector.
In this embodiment, the speech encoder (SpeechEncoder) is built from an M-layer conv1d network (one-dimensional convolutional neural network); using a one-dimensional convolutional network to process the frequency-domain audio is both convenient and accurate. The rejection feature vector is a dense vector converted from the voice feature vector by the speech encoder, and it retains all kinds of information from the original input audio, such as noise.
In one embodiment, the speech encoder uses an 18-layer conv1d network, and the expression for the rejection feature vector is s = SpeechEncoder(X), where s is the rejection feature vector and X is the voice feature vector.
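A minimal sketch of such an encoder follows, assuming PyTorch; the channel width and kernel size are illustrative assumptions, since this description fixes only the depth (18 conv1d layers).

    import torch
    import torch.nn as nn

    class SpeechEncoder(nn.Module):
        """18-layer conv1d stack mapping the FBank frames X to the dense
        rejection feature vector s (one dense vector per frame)."""
        def __init__(self, n_mels=80, channels=256, num_layers=18):
            super().__init__()
            layers, in_ch = [], n_mels
            for _ in range(num_layers):
                layers += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                           nn.ReLU()]
                in_ch = channels
            self.net = nn.Sequential(*layers)

        def forward(self, x):                    # x: (batch, T, n_mels)
            x = x.transpose(1, 2)                # conv1d expects (batch, channels, T)
            return self.net(x).transpose(1, 2)   # s: (batch, T, channels)

    s = SpeechEncoder()(torch.randn(1, 200, 80))  # s = SpeechEncoder(X)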
Step S2012: input the rejection feature vector into the speech decoder to obtain the confidence and the recognition result text.
In this embodiment, the meaning of the confidence is briefly illustrated by example: if the confidence of the recognition result text "departure" is 0.95, the probability that the recognition result text is the true text is 0.95. The higher the confidence, the more accurate and reliable the recognition result text.
In one embodiment, the expression for the recognition result text is text = SpeechDecoder(s), where text is the recognition result text and s is the rejection feature vector.
Step S202: input the recognition result text into the trained intention understanding model to obtain the intention information.
In this embodiment, the expression for the intention information is int = NLU(text), where int is the intention information, text is the recognition result text, and NLU (natural language understanding) is the general term for the model methods or tasks that enable a machine to understand text content.
In one embodiment, the intention information is illustrated here for ease of understanding. For example, "I want to listen to music", "play any song", "let's have some music", and "music on" are different expressions from users that all mean the same thing; through the intention information, the system knows these sentences all carry the same intention, namely playing music. Of course, intention information covers other categories as well, such as understanding a user's fuzzy or ambiguous sentences and matching the user's needs accurately. For example, if the user says "set an alarm for five a.m." at midnight, the user most probably needs an alarm at five a.m. that same day; a system that understood the sentence only literally, by machine logic, might fail to capture the user's real intention.
It should be noted that, although this embodiment uses the recognition result text alone for ease of understanding, the input forming the intention information may also use both the recognition result text and the rejection feature vector as input variables.
Step S203: input the recognition result text into the trained text coding model to obtain the text information.
In this embodiment, converting the recognition result text into text information converts natural language into a computer representation, which facilitates the subsequent fusion interaction. In this embodiment, the text coding model is composed of N layers of conv1d networks and Y layers of LSTM networks, where N and Y are both natural numbers.
In one embodiment, a 4-layer conv1d network and a 2-layer bidirectional LSTM network are used. The expression for the text information is T = TextEncoder(text), where T is the text information and text is the recognition result text.
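A minimal sketch of this text coding model follows, again assuming PyTorch; the token embedding, vocabulary size, and hidden dimensions are illustrative assumptions, since this description fixes only N = 4 conv1d layers and Y = 2 bidirectional LSTM layers.

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """N=4 conv1d layers followed by Y=2 bidirectional LSTM layers."""
        def __init__(self, vocab_size=5000, emb=128, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb)
            self.convs = nn.Sequential(*[
                nn.Sequential(nn.Conv1d(emb, emb, kernel_size=3, padding=1),
                              nn.ReLU())
                for _ in range(4)])
            self.lstm = nn.LSTM(emb, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)

        def forward(self, tokens):                  # tokens: (batch, L) token ids
            x = self.embed(tokens).transpose(1, 2)  # (batch, emb, L) for conv1d
            x = self.convs(x).transpose(1, 2)       # (batch, L, emb)
            T, _ = self.lstm(x)                     # T: (batch, L, 2*hidden)
            return T

    T = TextEncoder()(torch.randint(0, 5000, (1, 12)))  # T = TextEncoder(text)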
Step S30: obtain a multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, and the text information.
In this embodiment, as shown in FIG. 3, the multi-modal fusion feature vector is obtained based on the rejection feature vector, the confidence, the intention information, and the text information. As shown in FIGS. 4 and 7, this includes step S300: obtain the multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, the text information, and the in-vehicle information. Further, as shown in FIG. 5, the rejection feature vector, the confidence, the intention information, the text information, and the in-vehicle information are input into a trained multi-modal fusion model to obtain the multi-modal fusion feature vector, where the in-vehicle information includes the vehicle state and/or the sound zone direction of the audio data source.
In this embodiment, a method for obtaining the multi-modal fusion feature vector is provided, as shown in FIG. 9, comprising steps S3001-S3002, as follows:
Step S3001: fuse the rejection feature vector, the confidence, the intention information, the text information, and the in-vehicle information to obtain a model set feature vector.
In this embodiment, the in-vehicle information includes all other information the in-vehicle system can collect besides the audio information. Specifically, the in-vehicle information may include vehicle GPS positioning information, vehicle driving state information, door and window state information, real time, weather information, array-microphone sound source information, and so on. The in-vehicle information is presented in the form of a feature vector obtained through an in-vehicle information model.
In this embodiment, the voice feature vector, the rejection feature vector, the confidence, the intention information, the text information, and the in-vehicle information are all represented as matrices. Taking the feature vector X = (X1, X2, ..., XT) as an example: if the audio to be recognized is divided into 200 parts at intervals of 35 ms during framing, then T is 200; further assuming the embedding dimension is 100, a 200 x 100 matrix is finally constructed.
Furthermore, during fusion, the intention information, the confidence, and the in-vehicle information are fused together, while the rejection feature vector is fused with the text information, finally yielding the model set feature vector.
In one embodiment, the rejection feature vector, confidence, intention information, text information, and in-vehicle information are spliced using a concat function. Before splicing, the five matrices must be transformed to the same dimension, with the dimensions adjusted by a reshape operation. It should be noted that splicing with a concat function is only one of the splicing methods of this embodiment; other operations, such as add-style splicing, may also be used. The model set feature vector is finally obtained.
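A minimal sketch of step S3001 follows, assuming PyTorch; the common feature dimension of 100 follows the 200 x 100 example above, and the per-source reshaping and row counts are illustrative assumptions.

    import torch

    def fuse_to_model_set(reject_vec, confidence, intent, text, in_vehicle, dim=100):
        """Splice (concat) the five feature matrices into the model set
        feature vector A after reshaping each to a common width."""
        parts = [reject_vec, confidence, intent, text, in_vehicle]
        parts = [p.reshape(-1, dim) for p in parts]  # same width for splicing
        return torch.cat(parts, dim=0)               # A: (total rows, dim)

    A = fuse_to_model_set(torch.randn(200, 100), torch.randn(1, 100),
                          torch.randn(4, 100), torch.randn(12, 100),
                          torch.randn(6, 100))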
Step S3002: pass the model set feature vector through a trained attention mechanism model to obtain the multi-modal fusion feature vector.
In this embodiment, the attention mechanism model (Attention model) is:

attn = softmax(A·Aᵀ/√d)·A

where attn is the multi-modal fusion feature vector, softmax is the activation function, d is the scaling coefficient, and A is the model set feature vector. Aᵀ is the transpose of the model set feature vector A; A·Aᵀ yields an m x m matrix, i.e. the similarity, which is divided by √d and normalized by softmax to obtain a weight matrix in which every value is a weight coefficient greater than 0 and smaller than 1.
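A minimal sketch of this attention step follows, assuming PyTorch and taking the scaling coefficient d to be the feature dimension (an assumption; the description does not fix d).

    import math
    import torch

    def attention_fuse(A):
        """attn = softmax(A·Aᵀ/√d)·A over the model set feature vector A."""
        d = A.shape[-1]                                # scaling coefficient (assumed)
        sim = A @ A.transpose(-1, -2) / math.sqrt(d)   # m x m similarity matrix
        weights = torch.softmax(sim, dim=-1)           # weight coefficients in (0, 1)
        return weights @ A                             # attn, same shape as A

    attn = attention_fuse(A)  # A from the fusion step above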
In one embodiment, to aid understanding of the technical content, the benefit of fusing the in-vehicle information is illustrated with a specific scene. Take the personnel-complexity case above, where the primary driver is interfered with by the front passenger while issuing a voice instruction. After voice recognition starts, the recognition result text is "it's so hot &&, turn on the air conditioner", and the per-frame confidences of the voice in the frequency domain are multiplied together to give the confidence of the whole recognition result text. Suppose the confidence of the final recognition result text is 50%; this correspondingly lowers the weight of the whole recognition result text and of the intention information. Meanwhile, the directional array microphone in the in-vehicle information records the sound zone direction of the audio data source while recording the audio; for example, the in-vehicle information shows that "it's so hot" came from the primary driver's seat while "turn on the air conditioner" came from the front passenger seat. Since the voice audio comes from the primary driver and the front passenger at the same time, and a passenger's command is usually weighted lower, the credibility of the voice audio is further reduced, so the whole voice is judged to be non-instruction voice.
In the above example, another post-fusion judgment logic for the in-vehicle information is possible. After the confidence of the recognition result text has been judged, the audio information in the rejection feature vector can be split apart using the directional-array-microphone information in the in-vehicle information together with the rejection feature vector. This recovers that the primary driver's voice was "it's so hot, open the window" while the front passenger said "go turn on the air conditioner"; judging from this information, the primary driver's "it's so hot, open the window" is instruction voice, and the window-opening instruction is executed.
Taking the scene complexity as an example: when the system receives a voice command whose recognition result text is "turn on the high beam", the intention information is "turn on the high beam"; suppose the confidence of the recognition result text is high, while the in-vehicle information indicates that it is daytime and the weather is clear. If the weights of the various parts were fixed constants, the vehicle's high beam would still be turned on. With multi-modal fusion, however, since the in-vehicle information shows daytime and clear weather, the voice command is judged to be non-instruction voice and is not executed.
These examples show that multi-modal fusion is a process of mutual weighting: at judgment time it can use as many kinds of information as possible and weigh every factor comprehensively, rather than giving each part a fixed weight. Multi-modal fusion thereby increases the accuracy of the judgment.
Step S40: perform rejection judgment on the multi-modal fusion feature vector to obtain the recognition result.
In this embodiment, the recognition result is a value between 0 and 1. When the recognition result is greater than or equal to a preset threshold, the input audio is instruction voice and the vehicle executes the voice instruction; when the recognition result is smaller than the preset threshold, the input audio is non-instruction voice and the vehicle refuses to execute it.
In one embodiment, the multi-modal fusion feature vector is input into a trained rejection judging model whose output passes through a sigmoid function to obtain the recognition result,
wherein the rejection judging model is:
y = sigmoid(J1), J1 = W1·attn + b1, where sigmoid is the activation function, attn is the multi-modal fusion feature vector, W1 is a weight of the rejection network, and b1 is a bias vector of the rejection network.
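A minimal sketch of this rejection judging model follows, assuming PyTorch; pooling attn down to a single vector and the 0.5 threshold are illustrative assumptions.

    import torch
    import torch.nn as nn

    class RejectionHead(nn.Module):
        """y = sigmoid(W1·attn + b1); y >= threshold means instruction voice."""
        def __init__(self, dim=100):
            super().__init__()
            self.linear = nn.Linear(dim, 1)    # holds W1 and b1

        def forward(self, attn, threshold=0.5):
            pooled = attn.mean(dim=0)          # collapse the row dimension (assumed)
            y = torch.sigmoid(self.linear(pooled))
            return y, bool(y >= threshold)     # True: execute the voice instruction

    y, is_instruction = RejectionHead()(attn)  # attn from the attention step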
It should be noted that, although the foregoing embodiments describe each step in a specific sequence, those skilled in the art will understand that, in order to achieve the effect of the present invention, different steps do not necessarily need to be executed in such a sequence, and they may be executed simultaneously (in parallel) or in other sequences, and these changes are all within the protection scope of the present invention.
It will be understood by those skilled in the art that all or part of the flow of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable storage medium may include any entity or device capable of carrying the computer program code, a medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random-access memory, an electrical carrier signal, a telecommunication signal, a software distribution medium, or the like. It should be noted that the content contained in the computer-readable storage medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable storage media do not include electrical carrier signals or telecommunication signals, in accordance with legislation and patent practice.
Furthermore, the invention also provides a control device. In an embodiment of the control device according to the present invention, the control device comprises a processor and a storage device; the storage device may be configured to store a program for executing the rejection method for non-instruction voice of the above method embodiment, and the processor may be configured to execute the program in the storage device, including but not limited to the program for executing the rejection method for non-instruction voice of the above method embodiment. For convenience of explanation, only the parts related to the embodiments of the present invention are shown; specific technical details are not disclosed. The control device may be an apparatus formed of various electronic devices.
Further, the invention also provides a computer-readable storage medium. In an embodiment of the computer-readable storage medium according to the present invention, the computer-readable storage medium may be configured to store a program for executing the rejection method for non-instruction voice of the above method embodiment, and the program may be loaded and executed by a processor to implement the rejection method for non-instruction voice. For convenience of explanation, only the parts related to the embodiments of the present invention are shown; specific technical details are not disclosed. The computer-readable storage medium may be a storage device formed of various electronic devices; optionally, in the embodiment of the present invention, the computer-readable storage medium is a non-transitory computer-readable storage medium.
Further, the present invention also provides a vehicle-mounted voice recognition system, comprising:
the electronic device described above; and a voice acquisition device for acquiring audio data containing noise.
For example, the voice acquisition device is an on-board microphone.
Further, the invention also provides an automobile comprising the above vehicle-mounted voice recognition system.
Further, it should be understood that, since the configuration of each module is only for explaining the functional units of the apparatus of the present invention, the corresponding physical devices of the modules may be the processor itself, or a part of software, a part of hardware, or a part of a combination of software and hardware in the processor. Thus, the number of individual modules in the figures is merely illustrative.
Those skilled in the art will appreciate that the various modules in the apparatus may be adaptively split or combined. Such splitting or combining of specific modules does not cause the technical solutions to deviate from the principle of the present invention, and therefore, the technical solutions after splitting or combining will fall within the protection scope of the present invention.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A rejection method for non-instruction voice, characterized by comprising the following steps:
performing feature extraction on input audio data to obtain a voice feature vector;
inputting the voice feature vector into a voice enhancement system to obtain a rejection feature vector, a confidence, intention information, and text information;
obtaining a multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, and the text information;
and performing rejection judgment on the multi-modal fusion feature vector to obtain a recognition result.
2. The rejection method for non-instruction voice according to claim 1, characterized in that:
the voice enhancement system comprises a voice enhancement model, an intention understanding model, and a text coding model;
the step of inputting the voice feature vector into the voice enhancement system to obtain the rejection feature vector, the confidence, the intention information, and the text information comprises:
inputting the voice feature vector into a trained voice enhancement model to obtain the rejection feature vector, the confidence, and a recognition result text;
inputting the recognition result text into a trained intention understanding model to obtain the intention information;
and inputting the recognition result text into a trained text coding model to obtain the text information.
3. The rejection method for non-instruction voice according to claim 1, characterized in that the step of obtaining a multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, and the text information comprises:
obtaining the multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, the text information, and in-vehicle information.
4. The rejection method for non-instruction voice according to claim 2, characterized in that:
the voice enhancement model comprises a speech encoder and a speech decoder;
the step of inputting the voice feature vector into the trained voice enhancement model to obtain the rejection feature vector, the confidence, and the recognition result text comprises:
inputting the voice feature vector into the speech encoder to obtain the rejection feature vector;
and inputting the rejection feature vector into the speech decoder to obtain the confidence and the recognition result text.
5. The rejection method for non-instruction voice according to claim 4, characterized in that the speech encoder is composed of M layers of conv1d networks and the text coding model is composed of N layers of conv1d networks and Y layers of LSTM networks, where M, N, and Y are all natural numbers.
6. The rejection method for non-instruction voice according to claim 1, characterized in that the step of obtaining the multi-modal fusion feature vector based on the rejection feature vector, the confidence, the intention information, the text information, and the in-vehicle information comprises:
fusing the rejection feature vector, the confidence, the intention information, the text information, and the in-vehicle information to obtain a model set feature vector;
and passing the model set feature vector through a trained attention mechanism model to obtain the multi-modal fusion feature vector.
7. The rejection method for non-instruction voice according to any one of claims 1 to 6, characterized in that:
the in-vehicle information comprises a vehicle state and/or the sound zone direction of the audio data source.
8. An electronic device comprising a processor and a storage device adapted to store a plurality of program codes, characterized in that the program codes are adapted to be loaded and run by the processor to perform the rejection method for non-instruction voice according to any one of claims 1 to 7.
9. A vehicle-mounted voice recognition system, characterized by comprising:
the electronic device of claim 8;
and a voice acquisition device for acquiring audio data.
10. An automobile, characterized in that it comprises the vehicle-mounted voice recognition system according to claim 9.
CN202210917096.2A 2022-08-01 2022-08-01 Non-instruction voice rejection method, vehicle-mounted voice recognition system and automobile Pending CN115331656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210917096.2A CN115331656A (en) 2022-08-01 2022-08-01 Non-instruction voice rejection method, vehicle-mounted voice recognition system and automobile

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210917096.2A CN115331656A (en) 2022-08-01 2022-08-01 Non-instruction voice rejection method, vehicle-mounted voice recognition system and automobile

Publications (1)

Publication Number Publication Date
CN115331656A true CN115331656A (en) 2022-11-11

Family

ID=83918930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210917096.2A Pending CN115331656A (en) 2022-08-01 2022-08-01 Non-instruction voice rejection method, vehicle-mounted voice recognition system and automobile

Country Status (1)

Country Link
CN (1) CN115331656A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117033637A (en) * 2023-08-22 2023-11-10 镁佳(北京)科技有限公司 Invalid conversation refusing model training method, invalid conversation refusing method and device
CN117033637B (en) * 2023-08-22 2024-03-22 镁佳(北京)科技有限公司 Invalid conversation refusing model training method, invalid conversation refusing method and device
CN117238291A (en) * 2023-11-14 2023-12-15 暗物智能科技(广州)有限公司 Multi-mode voice refusing identification method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination