CN112669840A - Voice processing method, device, equipment and storage medium - Google Patents

Voice processing method, device, equipment and storage medium

Info

Publication number
CN112669840A
CN112669840A (application CN202011503332.3A)
Authority
CN
China
Prior art keywords
semantic
voice signal
determining
semantic resource
loading
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011503332.3A
Other languages
Chinese (zh)
Inventor
任伟 (Ren Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wutong Chelian Technology Co Ltd
Original Assignee
Beijing Wutong Chelian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wutong Chelian Technology Co Ltd filed Critical Beijing Wutong Chelian Technology Co Ltd
Priority to CN202011503332.3A priority Critical patent/CN112669840A/en
Publication of CN112669840A publication Critical patent/CN112669840A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of this application disclose a voice processing method, apparatus, device, and storage medium. The method comprises: acquiring a voice signal; determining semantic information of the voice signal according to a loaded first semantic resource; when the semantic information of the voice signal is determined to be abnormal, determining a second semantic resource corresponding to the voice signal according to the voice signal; loading the second semantic resource; and determining the semantic information of the voice signal according to the second semantic resource.

Description

Voice processing method, device, equipment and storage medium
Technical Field
The embodiments of this application relate to the field of speech recognition, and in particular, but not exclusively, to a voice processing method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence, speech recognition technology has been widely applied across industries: a user's intention is determined by analyzing the user's speech, and a corresponding response is made, which greatly simplifies user operations and improves the user's experience of a product.
In vehicle-mounted devices, the memory available to an application differs with the hardware configuration of the vehicle. Generally, for different memory constraints, semantic resources are provided in differentiated high, medium, and low configurations. All of these differentiated resource configurations, based on different performance requirements, must be predefined module by module.
Disclosure of Invention
The embodiment of the application provides a voice processing method, a voice processing device, voice processing equipment and a storage medium.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a speech processing method, where the method includes:
acquiring a voice signal;
determining semantic information of the voice signal according to the loaded first semantic resource;
when the semantic information of the voice signal is determined to be abnormal, determining a second semantic resource corresponding to the voice signal according to the voice signal;
loading the second semantic resource;
and determining semantic information of the voice signal according to the second semantic resource.
Based on the above scheme, the determining, according to the voice signal, the second semantic resource corresponding to the voice signal includes:
inputting the voice signal into a neural network to obtain the intention category of the voice signal;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
Based on the above scheme, the determining, according to the voice signal, the second semantic resource corresponding to the voice signal includes:
determining text information corresponding to the voice signal according to the voice signal;
extracting keywords from the text information, and determining the intention category of the text information;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
Based on the above scheme, the method further comprises:
and loading the first semantic resource when the voice recognition function is started.
Based on the above scheme, the loading the first semantic resource includes at least one of:
loading the first semantic resource according to the loading record of the historical semantic resource;
and loading the first semantic resource according to a preset function requirement.
Based on the above scheme, the method further comprises:
and unloading at least one loaded first semantic resource corresponding to the first domain according to the second domain corresponding to the second semantic resource.
In a second aspect, an embodiment of the present application provides a speech processing apparatus, including: an acquisition unit configured to acquire a voice signal;
the determining unit is used for determining semantic information of the voice signal according to the loaded first semantic resource; when the semantic information of the voice signal is determined to be abnormal, determining a second semantic resource corresponding to the voice signal according to the voice signal; determining semantic information of the voice signal according to the second semantic resource;
and the loading unit is used for loading the second semantic resource.
Based on the above scheme, the determining unit is specifically configured to input the speech signal into a neural network, and obtain an intention category of the speech signal;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
Based on the above scheme, the determining unit is specifically configured to determine, according to the voice signal, text information corresponding to the voice signal;
extracting keywords from the text information, and determining the intention category of the text information;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
Based on the above scheme, the loading unit is further configured to load the first semantic resource when the voice recognition function is started.
Based on the above scheme, the loading unit is specifically configured to at least one of:
loading the first semantic resource according to the loading record of the historical semantic resource;
and loading the first semantic resource according to a preset function requirement.
Based on the above scheme, the apparatus further comprises: and the unloading unit is used for unloading the loaded at least one first semantic resource according to the intention category corresponding to the second semantic resource.
In a third aspect, an embodiment of the present application provides a speech processing device, the device at least comprising a processor and a storage medium storing executable instructions, where the processor is configured to execute the stored executable instructions in order to perform the speech processing method provided by the above embodiments.
In a fourth aspect, the present application provides a computer-readable storage medium, in which computer-executable instructions are stored, and the computer-executable instructions are configured to execute the voice processing method provided by the foregoing embodiment.
In the embodiments of this application, when the semantic information of a speech signal is determined according to a loaded first semantic resource and that determination is found to be abnormal, a second semantic resource corresponding to the speech signal is determined from the speech signal; the second semantic resource is loaded; and the semantic information of the speech signal is then determined according to the second semantic resource. When understanding of the speech signal is abnormal, semantic resources are dynamically loaded to determine the semantics of the speech. On one hand, this releases part of the memory of the voice processing device, lowering its configuration requirements and the cost of speech recognition; on the other hand, dynamic loading solves the abnormal semantic analysis that results when preloading fixed semantic resources limits semantic understanding and cannot support multi-scenario semantic analysis.
Drawings
Fig. 1 is a schematic flowchart of a speech processing method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech processing method based on a deep learning network according to an embodiment of the present application;
FIG. 3 is a diagram of constructing the recognition-text word vector of audio based on an RNN according to an embodiment of the present disclosure;
FIG. 4 is an architecture diagram of a speech processing system according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a deep learning network model according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention, and all other embodiments obtained by persons skilled in the art without inventive effort fall within the scope of protection of the present invention.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
In the following description, the terms "first", "second", and "third" are used only to distinguish similar objects and do not denote a particular order; where permissible, "first", "second", and "third" may be interchanged in a particular order or sequence so that the embodiments of the invention described herein can be practiced in an order other than that shown or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
The technical solution of the present invention is further described in detail below with reference to the drawings and specific embodiments.
The embodiments of this application provide a voice processing method applied to a voice processing device. The device may be an electronic device such as a mobile phone, a desktop computer, a tablet computer, or a server cluster. The functions realized by the method can be implemented by a processor in the device calling program code, and the program code can be stored in a computer storage medium.
Fig. 1 is a schematic flowchart of a speech processing method according to an embodiment of the present application, and as shown in fig. 1, the method includes:
step S110, acquiring a voice signal;
step S120, determining semantic information of the voice signal according to the loaded first semantic resource;
step S130, when the semantic information of the voice signal is determined to be abnormal, determining a second semantic resource corresponding to the voice signal according to the voice signal;
step S140, loading the second semantic resource;
step S150, determining semantic information of the voice signal according to the second semantic resource.
In an embodiment, acquiring the voice signal in step S110 specifically includes: collecting the user's audio signal through an audio collection device and preprocessing the collected signal. The preprocessing includes noise reduction, echo cancellation, and the like. Audio collection devices include, but are not limited to, microphones.
The first semantic resource and the second semantic resource may both be offline semantic resources stored in a storage device and used for semantic analysis of speech signals. The storage device includes, but is not limited to, a magnetic disk, a USB flash drive, or a hard disk. In one embodiment, an abnormality in determining the semantic information of the speech signal may include, but is not limited to: semantic analysis of the speech signal returns empty feedback, i.e., the analysis fails and no semantic information can be obtained; or the determined semantic information does not meet a preset response condition, in which case the determination is judged to be abnormal. The preset response condition is a precondition set in advance for the corresponding functional response. For example, while the vehicle is moving, the semantics of the speech signal are determined through the first semantic resource to be "open the door"; the preset response condition for opening the door is that the vehicle has stopped moving, so "open the door" does not meet the response condition and the semantic information is determined to be abnormal.
In the embodiments of this application, when the loaded first semantic resource cannot determine the semantic information of the speech signal, the second semantic resource for analyzing the speech signal is dynamically loaded according to the speech signal, and the semantic information is then determined according to the second semantic resource. In one embodiment, loading refers to importing data from a storage device outside the memory of the voice processing device into that memory; the loaded first semantic resource is the first semantic resource stored in the memory of the voice processing device.
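As a minimal sketch of this load-on-demand behavior (not from the patent; the class and function names are illustrative assumptions), a resource could be read from disk into an in-memory cache only when it is first requested:

```python
class ResourceLoader:
    """Minimal sketch of dynamic loading: a semantic resource is read from
    disk into an in-memory cache only on first use. Illustrative only."""

    def __init__(self, read_fn):
        self._read = read_fn      # e.g. reads one resource file from disk
        self._memory = {}         # resources currently loaded in memory

    def load(self, name):
        # Import the resource into memory only if it is not already loaded.
        if name not in self._memory:
            self._memory[name] = self._read(name)
        return self._memory[name]

    def unload(self, name):
        # Release the memory held by a loaded resource.
        self._memory.pop(name, None)
```

Calling `load` twice for the same name hits the disk only once, which is the point of keeping loaded resources in memory.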
First, by dynamically loading semantic resources, this embodiment reduces the storage occupied on the voice processing device, lowering its configuration requirements and the cost of voice processing, which helps the method to be widely applied. Second, dynamic loading solves the abnormal semantic analysis that arises when preloading fixed semantic resources limits semantic understanding and cannot support multi-scenario analysis. Third, by using offline semantic resources, the dependence on network signals for obtaining semantic resources, and the influence of network signals during voice processing, are reduced, improving the stability and efficiency of voice processing.
In some embodiments, the determining, according to the speech signal, the second semantic resource corresponding to the speech signal includes:
inputting the voice signal into a neural network to obtain the intention category of the voice signal;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
In one embodiment, the neural network is a deep learning network model constructed based on a recurrent neural network and a self-attention mechanism, used to determine the intent category of a speech signal. Before the speech signal is acquired, the network structure of the neural network to be trained is determined based on the recurrent neural network and the self-attention mechanism, and the constructed network is trained on corpus audio together with its corresponding text and semantic intent, yielding the trained neural network.
In one embodiment, the intent category of the speech signal is the domain to which the intent of the speech signal belongs. The domains include navigation, audio/video playback, vehicle control, and so on. Different intents may use different domain dictionaries, such as book titles, song titles, or product names, and the domain of an intent can be determined from the degree of matching between the intent and a dictionary. For example, when the speech signal is "I want to xxx", the trained neural network determines the intent of the speech signal as: search for a destination, and the second semantic resource corresponding to the speech signal can be determined to be the navigation semantic resource. As another example, when the speech signal is "play Jay Chou's songs", the neural network determines the intent as: play a song, and the second semantic resource corresponding to the speech signal can be determined to be the music semantic resource. As yet another example, when the speech signal is "open the window", the neural network determines the intent as: vehicle control, and the second semantic resource corresponding to the speech signal is the vehicle-control semantic resource.
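The mapping from a predicted intent category to the domain of the second semantic resource could be sketched as follows (a hypothetical illustration; the category and domain names are assumptions, not from the patent):

```python
# Hypothetical registry mapping intent categories, as produced by the
# classifier, to the domain of the offline semantic resource to load.
INTENT_TO_RESOURCE = {
    "search_destination": "navigation",
    "play_song": "music",
    "vehicle_control": "vehicle_control",
}


def select_second_resource(intent_category: str) -> str:
    """Return the semantic-resource domain for a predicted intent category."""
    try:
        return INTENT_TO_RESOURCE[intent_category]
    except KeyError:
        raise ValueError(
            f"no semantic resource registered for intent {intent_category!r}"
        )
```

An unknown category raises an error so the caller can fall back to, for instance, a guided dialogue rather than loading nothing.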
In some embodiments, the determining, according to the speech signal, the second semantic resource corresponding to the speech signal includes:
determining text information corresponding to the voice signal according to the voice signal;
extracting keywords from the text information, and determining the intention category of the text information;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
In an embodiment, the speech signal is converted into its corresponding text, TF-IDF (term frequency-inverse document frequency) may be used to obtain the keywords in that text, and the intent category of the text is determined from the categories of the keywords. The keywords are typically verbs and nouns. For example, if the text corresponding to the speech signal is "I want to go xxx", keyword extraction yields "go" and "xxx", from which the corresponding intent category is known to be navigation.
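A minimal pure-Python sketch of TF-IDF keyword ranking as described above (illustrative only; a production system would use a tuned tokenizer and a real corpus, and the smoothing scheme here is one common choice, not the patent's):

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=2):
    """Rank the tokens of one document by TF-IDF against a small corpus.

    doc_tokens: list of tokens of the document to extract keywords from.
    corpus: list of token lists (the reference documents).
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in corpus if term in d)        # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1     # smoothed IDF
        scores[term] = (count / len(doc_tokens)) * idf
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [term for term, _ in ranked[:top_k]]
```

Tokens that appear in fewer reference documents get higher IDF, so content words like the verb and destination outrank filler words.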
In some embodiments, the method further comprises:
and loading the first semantic resource when the voice recognition function is started.
When the voice recognition function is started, the first semantic resource for semantic analysis of speech signals is extracted from the storage device and loaded into the memory of the voice processing device, where it is used to perform semantic analysis on the speech input by the user.
In some embodiments, the loading the first semantic resource includes at least one of:
loading the first semantic resource according to the loading record of the historical semantic resource;
and loading the first semantic resource according to a preset function requirement.
In an embodiment, loading the first semantic resource according to a loading record of historical semantic resources includes: according to the semantic-resource loading record, loading the one or more semantic resources that have historically been loaded most frequently.
In an embodiment, it further includes: loading one or more semantic resources from the record into the memory of the speech processing device in the order in which they were loaded during the previous voice processing session.
In another embodiment, it further includes: loading the semantic resource that was loaded last during the previous voice processing session. Determining the loaded first semantic resource from the loading record of historical semantic resources improves the efficiency of processing the user's speech through preloading, and reduces the loading of useless semantic resources.
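The frequency-based variant of history loading can be sketched as follows (illustrative names; the history log format is an assumption):

```python
from collections import Counter


def choose_preload(history, k=2):
    """Pick the k semantic resources loaded most frequently in the history
    log, ties broken by first appearance. `history` is a list of resource
    names in the order they were loaded in past sessions."""
    return [name for name, _ in Counter(history).most_common(k)]
```

For the last-loaded variant, one would instead just take `history[-1]`.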
In some embodiments, the preset functional requirement includes a commonly used voice control function set by the user according to the user's own needs. Specifically, the semantic resource of a domain can be loaded according to the domain corresponding to the commonly used function set by the user. For example, if the user sets navigation as a commonly used function, the semantic resource of the navigation domain is loaded; if the user sets music playback as a commonly used function, the semantic resource of music is loaded.
In this embodiment, the preloading operation improves the efficiency of processing the user's speech, and loading the first semantic resource according to the preset functional requirement makes voice processing better match the user's needs, improving the user experience.
In some embodiments, the speech processing method is applied to a vehicle-mounted system, and the method further comprises: loading the first semantic resource according to driving condition information. The driving condition information includes, but is not limited to: vehicle state information, state information of vehicle occupants, weather information, and time information.
The vehicle state information includes, but is not limited to: remaining fuel information, in-vehicle temperature information, and the state of vehicle devices. Specifically, in an embodiment, when the vehicle's remaining fuel is below a preset threshold, the semantic resource corresponding to gas-station navigation is loaded; the preset fuel threshold is typically set to 1/4 of the tank capacity. In an embodiment, when the in-vehicle temperature is outside a preset temperature range, the offline semantic resource corresponding to air-conditioner control is loaded; the preset range is generally set to 15 °C to 30 °C. In one embodiment, when an anomaly in a vehicle device is detected, for example abnormal tire pressure, the semantic resource corresponding to the vehicle-repair domain is loaded.
The state information of vehicle occupants includes, but is not limited to, their mental state and health condition. Specifically, when an occupant is fatigued, the semantic resource corresponding to music playback is loaded; when an occupant is in pain, the semantic resource corresponding to medical services is loaded.
In an embodiment, loading the first semantic resource according to weather information includes: in hazy weather, loading the semantic resource corresponding to headlight control; in rain or snow, loading the semantic resource corresponding to wiper control, and so on.
In one embodiment, loading the first semantic resource according to time information includes: when the current time falls within a preset time interval, loading the semantic resource corresponding to restaurant names. The preset intervals are generally set to 11:00-13:00 and 17:00-19:00.
In an embodiment, loading the first semantic resource according to the driving condition information further includes: when the vehicle is detected to have started, preloading the navigation semantic resource.
In this embodiment, the first semantic resource is loaded according to driving conditions; by predicting the control intention and loading semantic resources in advance, the response efficiency of voice processing is improved.
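The driving-condition rules above can be sketched as a simple rule table (thresholds follow the text; the function signature and domain names are illustrative assumptions):

```python
def resources_for_conditions(fuel_fraction, cabin_temp_c, hour):
    """Map driving-condition signals to semantic-resource domains to preload.

    Thresholds follow the text: fuel below 1/4 of tank capacity, cabin
    temperature outside 15-30 C, meal times 11:00-13:00 and 17:00-19:00.
    """
    domains = []
    if fuel_fraction < 0.25:
        domains.append("gas_station_navigation")
    if not 15 <= cabin_temp_c <= 30:
        domains.append("air_conditioner_control")
    if 11 <= hour < 13 or 17 <= hour < 19:
        domains.append("restaurant_names")
    return domains
```

Additional signals (tire pressure, occupant state, weather) would extend the same rule table.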
In some embodiments, the method further comprises:
and unloading at least one loaded first semantic resource corresponding to the first domain according to the second domain corresponding to the second semantic resource.
The first domain and the second domain are the domains corresponding to the dictionaries in the semantic resources. For example, if the words in a semantic resource are navigation words, the semantic resource corresponds to the navigation domain and is a navigation semantic resource.
In this embodiment, loaded semantic resources that do not belong to the second domain corresponding to the second semantic resource are unloaded, which reduces the memory occupied in the voice processing system and reduces the situation in which semantic analysis of a speech signal fails because a newly required semantic resource cannot be loaded. For example, if the second semantic resource corresponds to the navigation domain, the semantic resources loaded in memory that are not navigation semantic resources are unloaded. In an embodiment, unloading the first semantic resources corresponding to the first domain specifically includes: according to the order in which the semantic resources were loaded, unloading the one or more earliest-loaded first semantic resources whose domain differs from the domain of the second semantic resource.
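Evicting the earliest-loaded resources from other domains can be sketched as follows (illustrative; the `(domain, name)` pair representation is an assumption):

```python
def unload_other_domains(loaded, second_domain, max_unload=None):
    """Given loaded resources as (domain, name) pairs in load order, return
    the resources to evict: those whose domain differs from the domain of
    the second semantic resource, earliest-loaded first."""
    evict = [(d, n) for d, n in loaded if d != second_domain]
    return evict if max_unload is None else evict[:max_unload]
```

With `max_unload=1` this frees only as much memory as needed, matching the "one or more earliest-loaded" wording above.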
One specific example is provided below in connection with the above embodiments:
in a vehicle-mounted system, aiming at different memory limitations, the offline semantic resource configuration adopts high-medium low-difference configuration. These differentiated configuration resources based on different performance requirements all need to count the product function list in advance, and the modules are predefined. Through analysis, the voice processing mainly has the following problems:
1. predefined high, medium and low configuration: the module is predefined in advance, the corpus which can support semantic understanding is limited, and the corpus generalization capability is poor.
2. Loading semantic resources based on user high-frequency corpus classification: the semantics of part of user preferences can be loaded in a differentiated mode, the system performance consumption is reduced, but the system performance consumption is also preloaded, the semantic understanding range is narrowed, and the multi-scenario capability is not provided.
Based on this, the present example provides a speech processing method based on a deep learning network, which is applied to a vehicle-mounted system. The scheme is characterized in that user linguistic data and intention labels with successfully understood semantemes are collected, a deep learning network structure is built, and a linguistic data audio intention classifier is trained. Aiming at the abnormal problem that the user corpus offline resources are not loaded in the memory, so that the user corpus offline resources cannot be understood, the intention classifier is used, the semantic resources can be effectively and dynamically loaded, multiple rounds of conversations are triggered, and the user is guided. The scheme is more intelligent, and the core of the scheme is that no semantic return is realized for solving the problem of voice recognition, the offline semantic capability is enhanced, and the human-computer interaction is completed.
As shown in fig. 2, the present example proposes a speech processing method based on a deep learning network, which includes the following steps:
step S201, the audio collector collects a voice signal of a user.
Specifically, the audio corpus of the user is input through an in-vehicle microphone, and the original data is preprocessed, for example: and noise reduction and echo cancellation are carried out.
Step S202: and the speech recognition engine determines the semantic information of the user speech signal according to the loaded offline semantic resources.
Specifically, the text of the user's speech is determined through ASR (Automatic Speech Recognition), and the semantic information of that text is determined according to the offline semantic resources in the memory of the voice processing device. That is, each corpus spoken by the user is recognized with the speech recognition engine; if a corpus cannot be recognized, the semantic information of the speech signal cannot be determined and empty feedback is returned.
Step S203: analyze the semantic information. Judge whether the semantic information of the user's speech signal was successfully parsed in step S202. If it was, jump to step S200 and finish processing the speech signal; otherwise, execute step S204.
Step S204: triggering the audio intent decoder.
Step S205: a category of control intentions of the user speech signal is predicted.
Specifically, a voice signal of the user is input into the audio control intention classifier, and the intention category of the voice signal of the user is determined. If the intention category of the user voice signal is the search destination, executing step S206; if the intention category of the user voice signal is to play a song, executing step S207; if the intention type of the user voice signal is vehicle control, step S208 is executed.
The audio control intention classifier is a deep learning network model (as shown in Fig. 7) constructed based on an RNN (Recurrent Neural Network) and self-attention before the speech signals are collected.
Specifically, the audio control intention classifier is obtained by training the deep learning network model on corpus audio together with the corresponding ASR (Automatic Speech Recognition) text and semantic intent. Example corpora and annotations:
Audio sample    Recognized text                         Annotated semantic intent
Audio 1         How is the weather today                Weather inquiry
Audio 2         I want to listen to a Liu Dehua song    Play song
Audio 3         Go to xxx                               Search destination
Audio 4         Open the vehicle window                 Vehicle control
Define X as the Fbank features of a segment of audio, i.e., X = {x_1, x_2, …, x_n}.
Define X′ as the recognized text corresponding to a segment of audio, i.e., X′ = {x′_1, x′_2, …, x′_k}; e.g., for the recognized text "open the vehicle window", x′_1 … x′_4 are its four characters.
Define Y as a semantic intent, i.e., Y = {y_1, y_2, …, y_m}; e.g., for the intent "vehicle control", y_1 … y_4 are the four characters of the intent phrase.
emb (x) represents the word vector for word x.
The detailed algorithm is as follows:
(1) The intent text is usually short, so the semantic intent vector y_emb_avg is calculated with a word-vector average model as follows:

y_emb_avg = (1/m) Σ_{i=1}^{m} emb(y_i)

where emb(y_i) is the word vector of y_i.
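The word-vector average can be sketched in a few lines. The embedding table below is a toy stand-in: real word vectors would come from a trained embedding layer, and the two-dimensional vectors are assumptions for illustration only.

```python
import numpy as np

# Sketch of the word-vector average model for the semantic intent vector:
# y_emb_avg = (1/m) * sum_i emb(y_i). The lookup table `emb` is a toy
# stand-in for a trained embedding layer.

emb = {"vehicle": np.array([1.0, 0.0]), "control": np.array([0.0, 1.0])}

def intent_embedding(intent_words):
    """Average the word vectors of the intent's words/characters."""
    return np.mean([emb[w] for w in intent_words], axis=0)

y_emb_avg = intent_embedding(["vehicle", "control"])
# average of [1, 0] and [0, 1] is [0.5, 0.5]
assert np.allclose(y_emb_avg, [0.5, 0.5])
```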
(2) The recognition word vector C of the known audio is constructed based on the RNN (as shown in fig. 3).

Based on the attention mechanism, the similarity a_i between the recognized characters and the semantic intent vector is calculated as follows:

e_i = h_i · y_emb_avg

a_i = exp(e_i) / Σ_j exp(e_j)

where h_i and h_j are the RNN hidden-layer features corresponding to the i-th and j-th characters of the recognized text of the audio.
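The softmax-normalized similarity weights can be sketched as follows. The dot-product scoring and the toy hidden states are assumptions for illustration: the patent's figures do not specify the exact scoring function.

```python
import numpy as np

# Sketch of the attention weights a_i: a softmax over the similarity of
# each RNN hidden state h_i with the semantic intent vector y_emb_avg.
# Dot-product scoring is an assumed, common choice.

def attention_weights(H, y_emb_avg):
    scores = H @ y_emb_avg            # e_i = h_i . y_emb_avg
    scores -= scores.max()            # subtract max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()          # a_i = exp(e_i) / sum_j exp(e_j)

H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy hidden states
a = attention_weights(H, np.array([0.5, 0.5]))
assert np.isclose(a.sum(), 1.0)       # weights form a distribution
assert a[2] == a.max()                # most similar state gets most weight
```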
(3) Construction of an audio intent control prediction model
The training model is constructed as shown in fig. 7 and includes a CNN (Convolutional Neural Network) layer, an RNN (Recurrent Neural Network) layer, an RNN + self-attention layer, and a fully connected layer; the RNN + self-attention layer may be stacked multiple times.
h_i is recalculated as follows:

a_i = exp(h_i · y_emb_avg) / Σ_j exp(h_j · y_emb_avg)

h_i ← a_i × h_i

where h_i and h_j are the RNN hidden-layer features corresponding to the i-th and j-th features of the audio Fbank features, and a_i is the similarity with the semantic intent vector.
s_i is calculated as follows:

s_i = tanh(A h_i + b + C)

where s_i is the output vector corresponding to the i-th feature of the audio Fbank features, A is a coefficient matrix, b is a constant, and tanh is the activation function. C is the recognition word vector of the known audio constructed with the RNN. Introducing C into the activation function reduces training failures caused by vanishing gradients during deep-learning training.
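The output computation with the residual-style C term can be sketched as follows. The dimensions and random initialization are toy assumptions; in the real model A, b, and C would be learned or produced by the RNN branch.

```python
import numpy as np

# Sketch of s_i = tanh(A @ h_i + b + C): the recognition word vector C is
# added inside the activation, which (per the patent) mitigates vanishing
# gradients during training. All shapes and values are illustrative.

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # coefficient matrix (learned in practice)
b = np.zeros(4)               # constant bias
C = rng.normal(size=4)        # recognition word vector from the RNN branch

def output_vector(h_i):
    """Output vector for one Fbank-derived hidden feature h_i."""
    return np.tanh(A @ h_i + b + C)

s = output_vector(np.ones(3))
assert s.shape == (4,)
assert np.all(np.abs(s) < 1.0)  # tanh keeps outputs strictly in (-1, 1)
```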
Finally, the loss between the model output and the intent word vectors y_i of the annotated semantic intent is computed, gradient-descent training determines the network parameters of the neural network, and the audio control intent classifier is obtained.
Step S206: the semantic resource loader loads the semantic resources of the navigation intent into the memory.
Step S207: the semantic resource loader loads the semantic resources of the song intent into the memory.
Step S208: the semantic resource loader loads the semantic resources of the vehicle-control intent into the memory.
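The dispatch from predicted intent category to loaded resource (steps S205 through S208) can be sketched as a simple mapping. The category names, resource names, and loader interface below are illustrative assumptions, not identifiers from the patent.

```python
# Hypothetical sketch of the semantic resource loader dispatch: the intent
# category predicted by the classifier selects which offline semantic
# resource is dynamically loaded into memory.

INTENT_TO_RESOURCE = {
    "search_destination": "navigation_semantics",   # step S206
    "play_song": "music_semantics",                 # step S207
    "vehicle_control": "vehicle_control_semantics", # step S208
}

def load_resource_for_intent(intent, memory):
    """Load the offline semantic resource for `intent` into `memory`."""
    resource = INTENT_TO_RESOURCE[intent]
    memory.add(resource)  # stand-in for the real loader's side effect
    return resource

memory = set()
assert load_resource_for_intent("play_song", memory) == "music_semantics"
assert "music_semantics" in memory
```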
Step S209: trigger the navigation multi-turn dialog. For example, ask "Where would you like to go?". The offline speech engine then further determines the user's semantic information from the user's answer and responds to the user's voice control accordingly. For example, if the user answers "Go to xxx", the offline speech engine determines that the user wants to go to xxx, and the in-vehicle system searches for a route to xxx and starts the navigation route.
Step S210: triggering a song multi-turn conversation. For example: the recommendation query "do you want to hear. The offline speech engine then further determines semantic information of the user based on the user's responses, and thereby responds to the user's speech control. For example: if the user answers the song that the user wants to listen to the Zhou Jieren, the off-line voice engine determines that the user wants to listen to the song of the Zhou Jieren, and the vehicle-mounted system plays the Zhou Jieren song.
Step S211: trigger the vehicle-control multi-turn dialog. For example, ask "Would you like to open the window?". The offline speech engine then further determines the user's semantic information from the user's answer and responds to the user's voice control accordingly. For example, if the user answers to open the window, the offline speech engine determines that the user wants the window opened, and the in-vehicle system controls the window to open.
Fig. 4 is an architecture diagram of the speech processing system provided by this example. Audio collector 401: the user's audio corpus is input through the in-vehicle microphone, and the raw data is preprocessed, e.g., noise reduction and echo cancellation. Speech recognition engine 402: recognizes the user's audio corpus processed by the audio collector; if recognition fails, the audio intent decoder 403 is triggered, which predicts the user's control intent from the audio corpus through the audio intent classification model 404. The semantic resource loader 405 dynamically loads the offline semantic resources belonging to that control intent into memory and triggers the scene multi-turn session trigger 406. Scene multi-turn session trigger 406: triggers multi-turn dialogs for different scenes according to the control intent and guides the user through a secondary confirmation. For example, if the audio corpus is predicted to mean "search for a destination", a two-turn navigation dialog is triggered: the system asks where to go, and the user then speaks the destination.
In this example scheme, in the first aspect, offline semantic resources are used to determine the user's semantic information, which reduces the influence of network signal quality on speech processing and strengthens the offline semantic capability of speech processing. In the second aspect, when the loaded offline semantic resources cannot determine the semantic information of the speech signal, the user's intent category is determined by the audio intent classifier and the corresponding offline semantic resources are dynamically loaded; this reduces the semantic-parsing failures that occur when parsing relies only on preloaded resources, which suffer from poor corpus generalization, a narrow semantic-understanding range, and no multi-scene semantic analysis capability. In the third aspect, triggering multi-turn dialogs with secondary user confirmation improves the accuracy of intent analysis while improving the user experience.
Based on the foregoing embodiments, an embodiment of the present application provides a speech processing apparatus. Each unit included in the apparatus can be implemented by a processor in a speech processing device, or, of course, by logic circuits. In implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 5 is a schematic diagram of a component structure of a speech processing apparatus according to an embodiment of the present application, where the apparatus 500 includes:
an obtaining unit 510, configured to obtain a voice signal;
a determining unit 520, configured to determine semantic information of the speech signal according to the loaded first semantic resource; when the semantic information of the voice signal is determined to be abnormal, determining a second semantic resource corresponding to the voice signal according to the voice signal; determining semantic information of the voice signal according to the second semantic resource;
a loading unit 530, configured to load the second semantic resource.
In some embodiments, the determining unit is specifically configured to input the speech signal into a neural network, and obtain an intention category of the speech signal; and determining a second semantic resource corresponding to the voice signal according to the intention category.
In some embodiments, the determining unit is further specifically configured to determine, according to the voice signal, text information corresponding to the voice signal; extracting keywords from the text information, and determining the intention category of the text information; and determining a second semantic resource corresponding to the voice signal according to the intention category.
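The keyword-extraction path described above can be sketched as a small lookup. The specific keywords and the mapping to categories are illustrative assumptions; a real implementation would use a proper tokenizer and keyword extractor.

```python
# Hypothetical sketch of the keyword path: extract a keyword from the text
# of the speech signal, map it to an intent category, which in turn selects
# the second semantic resource. Keyword lists are illustrative.

KEYWORDS = {
    "window": "vehicle_control",
    "song": "play_song",
    "navigate": "search_destination",
}

def intent_from_text(text):
    """Return the intent category for the first matching keyword, else None."""
    for kw, intent in KEYWORDS.items():
        if kw in text.lower():
            return intent
    return None

assert intent_from_text("Open the window") == "vehicle_control"
assert intent_from_text("hello") is None
```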
In some embodiments, the loading unit is further configured to load the first semantic resource when the speech recognition function is started.
In some embodiments, the loading unit is specifically configured to at least one of:
loading the first semantic resource according to the loading record of the historical semantic resource;
and loading the first semantic resource according to a preset function requirement.
In some embodiments, the apparatus further comprises: and the unloading unit is used for unloading at least one loaded first semantic resource corresponding to the first domain according to the second domain corresponding to the second semantic resource.
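The unloading unit's behavior can be sketched as domain-based eviction: when a second semantic resource belonging to a new domain is loaded, resources of a different first domain are unloaded. The dictionary-based interface is an assumption for illustration.

```python
# Hypothetical sketch of the unloading unit: resources of a first domain
# are unloaded when a second semantic resource from a different (second)
# domain is loaded, keeping the memory footprint bounded.

def load_with_eviction(loaded, new_resource, new_domain):
    """loaded: dict mapping resource -> domain. Evict other-domain entries."""
    for res, dom in list(loaded.items()):
        if dom != new_domain:
            del loaded[res]           # unload first-domain resource
    loaded[new_resource] = new_domain  # load the second semantic resource
    return loaded

loaded = {"navigation_semantics": "navigation"}
load_with_eviction(loaded, "music_semantics", "music")
assert loaded == {"music_semantics": "music"}
```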
It should be noted that, in the embodiment of the present application, if the speech processing method is implemented in the form of a software functional module and sold or used as a standalone product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be essentially or partially implemented in the form of a software product, which is stored in a storage medium and includes several instructions for causing a server to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the voice processing method provided by the above embodiment.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that fig. 6 is a schematic structural diagram of a speech processing apparatus provided in an embodiment of the present application, and as shown in fig. 6, the apparatus 600 at least includes: a processor 610, a communication interface 620, and a memory 630, wherein:
the processor 610 generally controls the overall operation of the device 600.
Communication interface 620 may enable a device to communicate with other devices over a network.
The Memory 630 is configured to store instructions and applications executable by the processor 610, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 610 and modules in the device 600, and may be implemented by a FLASH Memory (FLASH) or a Random Access Memory (RAM).
Of course, the apparatus in the embodiments of the present application may have other similar protocol-interaction implementations, and those skilled in the art can make various corresponding changes and modifications according to the embodiments of the present application without departing from the spirit and scope of the present application; such corresponding changes and modifications shall fall within the scope of the claims appended to the method of the present application.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the embodiments of the present application, the size of the sequence numbers of the above-mentioned processes does not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the inherent logic thereof, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the modules is only one logical functional division, and other division manners may be available in actual implementation, for example: multiple modules or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, indirect coupling or communication connection between devices or modules, and may be electrical, mechanical or other.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules; the network module can be located in one place or distributed on a plurality of network modules; some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the following claims.

Claims (10)

1. A method of speech processing, comprising:
acquiring a voice signal;
determining semantic information of the voice signal according to the loaded first semantic resource;
when the semantic information of the voice signal is determined to be abnormal, determining a second semantic resource corresponding to the voice signal according to the voice signal;
loading the second semantic resource;
and determining semantic information of the voice signal according to the second semantic resource.
2. The method of claim 1, wherein the determining, according to the speech signal, the second semantic resource corresponding to the speech signal comprises:
inputting the voice signal into a neural network to obtain the intention category of the voice signal;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
3. The method of claim 1, wherein the determining, according to the speech signal, the second semantic resource corresponding to the speech signal comprises:
determining text information corresponding to the voice signal according to the voice signal;
extracting keywords from the text information, and determining the intention category of the text information;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
4. The method of claim 1, further comprising:
and loading the first semantic resource when the voice recognition function is started.
5. The method of claim 4, wherein the loading the first semantic resource comprises at least one of:
loading the first semantic resource according to the loading record of the historical semantic resource;
and loading the first semantic resource according to a preset function requirement.
6. The method of claim 1, further comprising:
and unloading at least one loaded first semantic resource corresponding to the first domain according to the second domain corresponding to the second semantic resource.
7. A speech processing apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire a voice signal;
the determining unit is used for determining semantic information of the voice signal according to the loaded first semantic resource; when the semantic information of the voice signal is determined to be abnormal, determining a second semantic resource corresponding to the voice signal according to the voice signal; determining semantic information of the voice signal according to the second semantic resource;
and the loading unit is used for loading the second semantic resource.
8. The apparatus of claim 7,
the determining unit is specifically configured to input the speech signal into a neural network, and obtain an intention category of the speech signal;
and determining a second semantic resource corresponding to the voice signal according to the intention category.
9. A speech processing device, characterized in that it comprises at least: a processor and a storage medium configured to store executable instructions, wherein:
the processor is configured to execute stored executable instructions configured to perform the speech processing method provided by any of the preceding claims 1 to 6.
10. A computer-readable storage medium having computer-executable instructions stored therein, the computer-executable instructions being configured to perform the speech processing method provided by any one of claims 1 to 6.
CN202011503332.3A 2020-12-17 2020-12-17 Voice processing method, device, equipment and storage medium Pending CN112669840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011503332.3A CN112669840A (en) 2020-12-17 2020-12-17 Voice processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112669840A true CN112669840A (en) 2021-04-16

Family

ID=75406338

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011503332.3A Pending CN112669840A (en) 2020-12-17 2020-12-17 Voice processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112669840A (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095176A (en) * 2014-04-29 2015-11-25 华为技术有限公司 Method for extracting feature information of text information by user equipment and user equipment
CN107515857A (en) * 2017-08-31 2017-12-26 科大讯飞股份有限公司 Semantic understanding method and system based on customization technical ability
US20180342233A1 (en) * 2017-05-23 2018-11-29 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for correcting speech recognition error based on artificial intelligence, and storage medium
CN109961780A (en) * 2017-12-22 2019-07-02 深圳市优必选科技有限公司 A kind of man-machine interaction method, device, server and storage medium
CN110136700A (en) * 2019-03-15 2019-08-16 湖北亿咖通科技有限公司 A kind of voice information processing method and device
CN110705267A (en) * 2019-09-29 2020-01-17 百度在线网络技术(北京)有限公司 Semantic parsing method, semantic parsing device and storage medium
CN111399629A (en) * 2018-12-29 2020-07-10 Tcl集团股份有限公司 Operation guiding method of terminal equipment, terminal equipment and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999204A (en) * 2022-07-29 2022-09-02 北京百度网讯科技有限公司 Navigation information processing method, device, equipment and storage medium
CN114999204B (en) * 2022-07-29 2022-11-08 北京百度网讯科技有限公司 Navigation information processing method, device, equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20210416