CN116583820A - Voice interaction method and device - Google Patents

Voice interaction method and device

Info

Publication number
CN116583820A
Authority
CN
China
Prior art keywords
model
voice
text
user
instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180036192.XA
Other languages
Chinese (zh)
Inventor
李宏广
高益
聂为然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN116583820A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 - Sound input; Sound output
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A method (300) and an apparatus (1000) for training a voice interaction response language model. The method (300) comprises: acquiring a first voice instruction of a user (S301); performing feature extraction on the text of the first voice instruction to obtain a first instruction text (S302); and training a first model to be trained according to the text of the first voice instruction and the first instruction text to obtain a voice interaction response language model, wherein the text output by the voice interaction response language model has the expression characteristics of the user, the voice interaction response language model is used for responding to the user's voice instructions, the first instruction text is the input of the first model to be trained, and the text of the first voice instruction is the training label (S303). Because the model is trained on the user's own voice instructions, the trained model can output personalized responses that match the user's expression habits.

Description

Voice interaction method and device
Technical Field
Embodiments of the present application relate to the technical field of human-computer interaction, and in particular to a voice interaction method and apparatus.
Background
The development of technology has brought great changes to human-computer interaction. Users increasingly expect intelligent and personalized voice interaction, and how to make full use of voice interaction to improve the interaction experience has become a current research hotspot. Taking human-computer interaction in an intelligent vehicle as an example, complex traffic conditions during actual driving often prevent users from responding to traditional touch-screen interaction in time, which creates a serious driving risk. In contrast, human-computer interaction based on natural language understanding (NLU) can completely free the driver's hands and enable voice control of all in-vehicle functions, including navigation, music, radio stations, and so on, thereby improving driving safety and user experience. However, the voice responses in existing human-computer interaction are too mechanical and rigid, lack natural language expression, and are highly homogeneous in phrasing, so natural, fluent, and personalized responses cannot be achieved.
Disclosure of Invention
Embodiments of the present application provide a voice interaction method and apparatus that can train a model according to a user's habits of spoken expression, so that the trained model can output personalized responses that match those habits in reply to the user's voice instructions.
In a first aspect, a voice interaction method is provided, including: acquiring a first voice instruction of a user; performing feature extraction on the text of the first voice instruction to obtain a first instruction text; and training a first model to be trained according to the text of the first voice instruction and the first instruction text to obtain a voice interaction response language model, wherein the text output by the voice interaction response language model has the expression characteristics of the user, the voice interaction response language model is used for responding to the user's voice instructions, the first instruction text is the input of the first model to be trained, and the text of the first voice instruction is the training label.
In the voice interaction method of this embodiment, the model is trained according to the user's habits of spoken expression, and the source of its training data is direct: the user's voice instructions can be collected during everyday voice interaction while the user uses the voice interaction system, the input of the model to be trained is obtained by feature extraction, and the user's voice instructions themselves serve as the training labels, so a voice interaction response language model can be obtained without manually writing or collecting training data. In addition, because the model is trained directly on the user's voice instructions, the text output by the trained voice interaction response language model has the user's expression characteristics, that is, it conforms to the user's expression habits, so that during interaction with the user the voice interaction system can output response speech that matches those habits, improving the user experience.
In some implementations, performing feature extraction on the text of the first voice instruction to obtain the first instruction text includes: performing feature extraction on the text of the first voice instruction to obtain intention information and slot information of the first voice instruction; and obtaining the first instruction text according to the intention information, the slot information, and a preset template.
It should be noted that in some implementations, the preset template merely combines the intention information and the slot information into a sentence of text; it does not generate a response text for the intention and slot information of the first voice instruction. This is equivalent to stripping the user's personal expression habits out of the original first voice instruction text and keeping only the most basic features that express the user's intention and slot information.
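By way of illustration only, the following Python sketch shows how a preset template of this kind might combine extracted intention information and slot information into a plain first instruction text; the intent names, slot keys, and template strings are assumptions and are not taken from the present application.

    PRESET_TEMPLATES = {
        # intent -> template with slot placeholders (illustrative entries only)
        "play_music": "play the song {song}",
        "navigate": "navigate to {destination}",
    }

    def build_instruction_text(intent, slots):
        """Combine intention and slot information into a bare-bones sentence text.

        The result keeps only the intention/slot content and drops any
        personalized wording the user's original utterance contained.
        """
        template = PRESET_TEMPLATES.get(intent)
        if template is None:
            # Fall back to a generic rendering when no template is registered.
            return intent + " " + " ".join(str(v) for v in slots.values())
        return template.format(**slots)

    # Example: feature extraction on the user's utterance yields intent
    # "play_music" and slot {"song": "ABC"}; the preset template reduces it
    # to the plain instruction text "play the song ABC".
    print(build_instruction_text("play_music", {"song": "ABC"}))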
In some implementations, the user includes a plurality of users.
In some implementations, the user is a first user, and there is a first mapping between the first user and a first voice interaction response language model, where the first mapping indicates that the first voice interaction response language model corresponds to the first user, and the first voice interaction response language model is obtained by training according to the voice instructions of the first user.
In actual practice, the user here may represent one or more users. Specifically, the voice interaction system in this embodiment can train a separate voice interaction response language model for each user, and the text output by each model conforms to that user's habits of language expression. In this way, responses that match each user's expression habits can be output for different users; for example, responses in a more mature style can be output for parents, and responses in a more natural style can be output for children.
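As a minimal sketch of such a per-user mapping, assuming a simple in-memory registry (not the implementation of the present application), each recognized user can be associated with their own collected instructions and their own trained model:

    user_models = {}        # user_id -> that user's voice interaction response language model
    user_instructions = {}  # user_id -> collected voice instruction texts

    def record_instruction(user_id, instruction_text):
        user_instructions.setdefault(user_id, []).append(instruction_text)

    def model_for_user(user_id, train_fn):
        # The "first mapping": user_id -> the user's own response language model.
        # train_fn is a hypothetical callback that trains a model from a list
        # of instruction texts (e.g. the training step sketched further below).
        if user_id not in user_models:
            user_models[user_id] = train_fn(user_instructions.get(user_id, []))
        return user_models[user_id]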
In some implementations, the first model to be trained includes three sub-models: a marking model, a pointer model, and an insertion model.
In some implementations, training the first model to be trained according to the text of the first voice instruction and the first instruction text includes: inputting the first instruction text into the marking model to obtain a feature tag sequence of the first instruction text, where the feature tag sequence is obtained by feature-tagging the first instruction text; inputting the feature tag sequence into the pointer model to obtain a feature ordering sequence, where the feature ordering sequence is obtained by reordering the features in the feature tag sequence; inputting the feature ordering sequence into the insertion model to obtain an output sequence, where the output sequence is obtained by inserting a first feature into the feature ordering sequence; and updating the parameters of the marking model, the pointer model, and the insertion model with the text of the first voice instruction as the training label.
In some implementations, updating the parameters of the marking model, the pointer model, and the insertion model with the text of the first voice instruction as the training label includes: calculating a first loss function of the marking model, a second loss function of the pointer model, and a third loss function of the insertion model with the text of the first voice instruction as the training label; and updating the parameters of the marking model, the pointer model, and the insertion model according to the first loss function, the second loss function, and the third loss function.
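A toy PyTorch-style sketch of this joint update is given below; the three sub-models are reduced to placeholder linear layers and the losses to mean-squared error purely so the example is self-contained, which is an assumption rather than the actual marking, pointer, and insertion networks of the present application:

    import torch
    import torch.nn as nn

    class SubModel(nn.Module):
        # Placeholder stand-in for the marking / pointer / insertion sub-models.
        def __init__(self, dim=16):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):
            return self.proj(x)

    tagger, pointer, inserter = SubModel(), SubModel(), SubModel()
    optimizer = torch.optim.Adam(
        list(tagger.parameters()) + list(pointer.parameters()) + list(inserter.parameters()),
        lr=1e-3)

    x = torch.randn(2, 16)      # stand-in for the encoded first instruction text
    label = torch.randn(2, 16)  # stand-in for the encoded training label

    tag_seq = tagger(x)            # feature tag sequence
    order_seq = pointer(tag_seq)   # feature ordering sequence
    out_seq = inserter(order_seq)  # output sequence

    criterion = nn.MSELoss()
    loss_tag = criterion(tag_seq, label)    # first loss function
    loss_ptr = criterion(order_seq, label)  # second loss function
    loss_ins = criterion(out_seq, label)    # third loss function

    total = loss_tag + loss_ptr + loss_ins
    optimizer.zero_grad()
    total.backward()
    optimizer.step()  # updates the parameters of all three sub-models together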
In some implementations, the first model to be trained is trained according to a preset training sentence and a preset label of the preset training sentence.
Here the model to be trained has undergone preliminary training beforehand, so that during use it can already output a reasonably natural answer text in reply to the user's voice instruction. For example, the model to be trained may be preliminarily trained before leaving the factory, during an earlier upgrade, or by other methods during earlier use.
In a second aspect, a voice interaction method is provided, including: acquiring a second voice instruction of a user; obtaining a first answer text according to the second voice instruction; and inputting the first answer text into a voice interaction response language model to output a second answer text, where the voice interaction response language model is obtained by training according to the text of a first voice instruction and a first instruction text, the first instruction text is obtained by performing feature extraction on the text of the first voice instruction, and the first voice instruction is a voice instruction of the user.
In the voice interaction method of this embodiment, the answer is generated by a voice interaction response language model trained on the voice instructions the user issues during everyday voice interaction, so the generated answer conforms to the user's habits of language expression. Moreover, matching different users with different voice interaction response language models makes it possible to tailor the answer expression to each individual user, greatly improving the user experience.
In some implementations, obtaining the first answer text from the second speech instruction includes: acquiring intention information and slot position information of a second voice instruction according to the second voice instruction; and acquiring a first answer text according to the intention information, the slot position information and the preset template.
It should be noted that, unlike the first instruction text in the first aspect described above, the first answer text is already an answer to the second voice instruction; it is simply relatively mechanical and does not conform to the user's habits of language expression.
In some implementations, the user includes a plurality of users.
In some implementations, the user is a first user, and inputting the first answer text into the voice interaction response language model includes: obtaining a first voice interaction response language model according to a first mapping, where the first voice interaction response language model is obtained by training according to the voice instructions of the first user, and the first mapping indicates that the first voice interaction response language model corresponds to the first user; and inputting the first answer text into the first voice interaction response language model.
The voice interaction system in the embodiment of the application can train out the voice interaction response language models respectively corresponding to the users according to different users, and the text output by each voice interaction response language model accords with the language expression habit of each user.
In certain implementations, the method further comprises: and filtering out preset language information in the second answer language text.
In actual training, if the user's own language is not always civil, the voice interaction response language model trained on that user's voice instructions may output answer text containing uncivil language. Therefore, before the answer text is output to the user, the second answer text output by the model needs to be filtered to remove the preset (e.g., uncivil) language information.
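A minimal sketch of such filtering, under the assumption that the preset language information is simply a configurable list of disallowed phrases (the entries below are placeholders), could look like this:

    PRESET_FILTER_PHRASES = ["damn", "stupid"]  # assumed placeholder entries

    def filter_answer_text(answer_text):
        # Remove preset (e.g. uncivil) phrases before the answer is synthesized.
        filtered = answer_text
        for phrase in PRESET_FILTER_PHRASES:
            filtered = filtered.replace(phrase, "")
        # Collapse any doubled spaces left behind by the removals.
        return " ".join(filtered.split())

    print(filter_answer_text("playing ABC for you, it is a damn good song"))
    # -> "playing ABC for you, it is a good song"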
In certain implementations, the method further comprises: the second answer language text is input to a speech synthesis engine to generate a second answer language speech.
In some implementations, before obtaining the second voice instruction of the user, the method further includes: acquiring a third voice instruction of a user; and inputting a third voice instruction into a first model to be trained so as to output a third answer text, wherein the first model to be trained is obtained by training according to a preset training sentence and a preset label of the preset training sentence.
The third answer text is a natural answer text, but does not conform to the language expression habit of the user.
In some implementations, the speech interaction answer model and the first model to be trained are non-autoregressive models.
In a third aspect, there is provided a voice interaction apparatus, comprising: the acquisition unit is used for acquiring a first voice instruction of a user; the processing unit is used for extracting the characteristics of the text of the first voice instruction so as to obtain a first instruction text; the processing unit is further used for training the first model to be trained according to the text of the first voice instruction and the first instruction text to obtain a voice interaction response language model, the text output by the voice interaction response language model has the expression characteristics of the user, the voice interaction response language model is used for responding according to the voice instruction of the user, the first instruction text is input of the first model to be trained, and the text of the first voice instruction is a training label.
In some implementations, the processing unit is specifically configured to: extracting features of the text of the first voice instruction to obtain intention information and slot position information of the first voice instruction; and acquiring a first instruction text according to the intention information, the slot position information and the preset template.
In some implementations, the user includes a plurality of users.
In some implementations, the user is a first user, and a first mapping is provided between the first user and a first voice interaction answer model, where the first mapping is used to indicate that the first voice interaction answer model corresponds to the first user, and the first voice interaction answer model is obtained through training according to a voice instruction of the first user.
In some implementations, the first model to be trained includes three sub-models: a marking model, a pointer model, and an insertion model.
In some implementations, the processing unit is specifically configured to: inputting the first instruction text into a marking model to obtain a characteristic marking sequence of the first instruction text, wherein the characteristic marking sequence is obtained by carrying out characteristic marking on the first instruction text; inputting the feature tag sequence into a pointer model to obtain a feature ordering sequence, wherein the feature ordering sequence is obtained by reordering features in the feature tag sequence; inputting the feature ordering sequence into an insertion model to obtain an output sequence, wherein the output sequence is obtained by inserting a first feature into the feature ordering sequence; and updating parameters of the marking model, the pointer model and the insertion model by taking the text of the first voice instruction as a training label.
In some implementations, the processing unit is specifically configured to: calculating a first loss function of the marking model, a second loss function of the pointer model and a third loss function of the insertion model by taking the text of the first voice instruction as a training label; and updating parameters of the marking model, the pointer model and the insertion model according to the first loss function, the second loss function and the third loss function.
In some implementations, the first model to be trained is trained according to a preset training sentence and a preset label of the preset training sentence.
In a fourth aspect, a voice interaction device is provided, including: the acquisition unit is used for acquiring a second voice instruction of the user; the processing unit is used for acquiring a first answer text according to the second voice instruction; the processing unit is further configured to input a first answer text into a voice interaction answer model to output a second answer text, where the voice interaction answer model is obtained by training according to a text of a first voice instruction and the first instruction text, the first instruction text is obtained by extracting features from the text of the first voice instruction, and the first voice instruction is a voice instruction of a user.
In some implementations, the processing unit is specifically configured to: acquiring intention information and slot position information of a second voice instruction according to the second voice instruction; and acquiring a first answer text according to the intention information, the slot position information and the preset template.
In some implementations, the user includes a plurality of users.
In some implementations, the processing unit is specifically configured to: acquiring a first voice interaction response language model according to a first mapping, wherein the first voice interaction response language model is obtained by training according to a voice instruction of a first user, and the first mapping is used for indicating that the first voice interaction response language model corresponds to the first user; the first answer text is input into a first voice interaction answer model.
In some implementations, the processing unit is further to: and filtering out the first language information in the second answer language text, wherein the first language information is preset.
In some implementations, the processing unit is further to: the second answer language text is input to a speech synthesis engine to generate a second answer language speech.
In some implementations, the processing unit is further to: acquiring a third voice instruction of a user; and inputting a third voice instruction into a first model to be trained so as to output a third answer text, wherein the first model to be trained is obtained by training according to a preset training sentence and a preset label of the preset training sentence.
In some implementations, the speech interaction answer model and the first model to be trained are non-autoregressive models.
In a fifth aspect, there is provided a computer readable medium having stored thereon a program code which, when run on a computer, causes the computer to perform the method of any of the first and second aspects above.
In a sixth aspect, a chip is provided, including: at least one processor and a memory, the at least one processor being coupled to the memory for reading and executing instructions in the memory to perform the method of any of the first and second aspects above.
In the voice interaction method of this embodiment, the model is trained according to the user's habits of spoken expression, and the source of its training data is direct: the user's voice instructions can be collected during everyday voice interaction while the user uses the voice interaction system, the input of the model to be trained is obtained by feature extraction, and the user's voice instructions themselves serve as the training labels, so a voice interaction response language model can be obtained without manually writing or collecting training data. In addition, because the model is trained directly on the user's voice instructions, the text output by the trained voice interaction response language model has the user's expression characteristics, that is, it conforms to the user's expression habits, so that during interaction with the user the voice interaction system can output response speech that matches those habits, improving the user experience.
Drawings
FIG. 1 is a schematic diagram of a voice interaction system 100 according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a system architecture 200 according to an embodiment of the application;
FIG. 3 is a schematic flow chart of a method of voice interaction of an embodiment of the application;
FIG. 4 is a schematic diagram of a system architecture for voice interaction in accordance with an embodiment of the present application;
FIG. 5 is a schematic flow chart of a voice interaction method of an embodiment of the application;
FIG. 6 is a schematic block diagram of another voice interactive system of an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of generating generic natural answer text in accordance with an embodiment of the application;
FIG. 8 is a schematic flow chart of training a speech interactive answer model in accordance with an embodiment of the application;
FIG. 9 is a schematic flow chart of generating personalized natural answer text in accordance with an embodiment of the application;
FIG. 10 is a schematic block diagram of a voice interaction apparatus 1000 in accordance with an embodiment of the present application;
FIG. 11 is a schematic block diagram of a voice interaction apparatus 1100 in accordance with an embodiment of the present application;
fig. 12 is a schematic diagram of an apparatus 1200 according to an embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made with reference to the accompanying drawings, in which it is evident that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Compared with traditional touch-screen interaction, voice interaction brings greater convenience to users' lives, and human-computer interaction based on natural language understanding can completely free the user's hands, allowing the user to control the corresponding device by voice. The solution of the present application can be applied to human-computer interaction scenarios, such as human-computer interaction with electronic devices and with in-vehicle systems. The electronic devices may include smart phones, personal digital assistants (PDA), tablet computers, and the like. The in-vehicle system may be one or more of an in-vehicle chip, an in-vehicle device (for example, a head unit, an in-vehicle computer, or a sensor with a speech recognition function), and the like. In the voice interaction method of the embodiments of the present application, during model training the electronic device or in-vehicle system may upload the collected voice instructions of the user to the cloud, the cloud processes the voice instructions and trains the model according to the processed result, and then sends the trained voice interaction response language model to the electronic device or in-vehicle system. Alternatively, the electronic device or in-vehicle system may perform some preprocessing on the collected voice instructions, for example converting a voice instruction into text and performing feature extraction on the text to obtain an instruction text, and then upload the instruction text to the cloud; the cloud trains the model according to the instruction text and sends the trained voice interaction response language model back to the electronic device or in-vehicle system. Alternatively, the electronic device or in-vehicle system may upload the collected voice instructions to the cloud, the cloud performs the preprocessing (for example, converting a voice instruction into text and extracting features from the text to obtain an instruction text) and sends the instruction text back, and the electronic device or in-vehicle system trains the model according to the received instruction text to obtain the trained voice interaction response language model. The trained voice interaction response language model can be applied to human-computer interaction with electronic devices and with in-vehicle systems, including outputting a corresponding answer text according to the user's voice instruction, after which a speech synthesis engine in the electronic device or in-vehicle system generates the corresponding answer speech from the answer text and outputs it to the user.
Two more commonly used application scenarios are briefly described below.
Application scenario 1: application scene of intelligent driving
In an application scenario of intelligent driving, a user may control the intelligent driving device through voice. For example, the user may issue voice instructions to a voice assistant onboard the vehicle to control the intelligent driving device. In some possible examples, the user may adjust the inclination of the seat back, adjust the temperature of the air conditioner in the vehicle, turn on or off the seat heater, turn on or off the vehicle lights, turn on or off the vehicle windows, turn on or off the trunk, plan the navigation route, play the personalized song list, etc. by voice. In an application scenario of intelligent driving, voice interaction is beneficial to providing a convenient driving environment for a user.
Application scenario 2: application scene of intelligent home
In the application scene of the intelligent home, a user can control the intelligent home equipment through voice. For example, a user may issue a voice command to an internet of things device (e.g., a smart home device) or an internet of things control device (e.g., a cell phone, etc.) to control the internet of things device. In some possible examples, the user may control the temperature of the intelligent air conditioner, control the intelligent television to play a user-specified television show, control the intelligent cooking device to start at a user-specified time, control the intelligent window shade to open or close, control the intelligent light fixture to adjust the color temperature, etc. through voice. In the application scene of intelligent home, voice interaction is beneficial to providing a comfortable home environment for users.
FIG. 1 is a schematic diagram of a voice interaction system 100 according to an embodiment of the present application. The system shown in FIG. 1 may be used to perform the voice interaction method of an embodiment of the present application.
The execution device 110 may be a device having voice recognition capabilities, natural language understanding capabilities, etc. The execution device 110 may be, for example, a server. Optionally, the execution device 110 may also cooperate with other computing devices, such as: data storage, routers, load balancers, etc. The execution device 110 may be disposed on one physical site or distributed across multiple physical sites. The execution device 110 may use data in the data storage system 150 or invoke program code in the data storage system 150 to perform at least one of the functions of speech recognition, machine learning, deep learning, model training, etc. The data storage system 150 of fig. 1 may be integrated on the execution device 110, or may be disposed on a cloud or other network server.
The user may operate respective local devices (e.g., local device 101 and local device 102) to interact with execution device 110. The local device shown in fig. 1 may represent, for example, various types of voice interaction terminals, such as the electronic device and the vehicle-mounted system described above. The user sends out a voice command to the local device, the local device sends the voice command of the user to the execution device 110, and the execution device processes the voice command of the user and executes the corresponding command according to the processing result.
The user's local device may interact with the execution device 110 through a wired or wireless communication network; the communication method, system, or standard of the communication network is not limited, and the network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof.
In one implementation, the local device 101 may provide local data or feedback calculations to the executing device 110.
In another implementation, all or part of the functionality of the execution device 110 may be implemented by a local device. For example, the local device 101 implements the functionality of the execution device 110 and provides services to its own user or to the user of the local device 102.
In the voice interaction method provided by the embodiment of the application, the processing result of the voice instruction by the execution device is sent to the local device, so that the local device can respond to the voice instruction of the user correspondingly.
FIG. 2 is a schematic diagram of a system architecture 200 according to an embodiment of the present application. The system shown in FIG. 2 may be used to perform the method of training a voice interaction response language model of an embodiment of the present application.
The data collection device 260 may be used to collect training data; the collected training data may be manually designed training sentences and their labels, or voice instructions issued by the user during use. The data collection device 260 may also be used to store the training data in the database 230. The training device 220 may train the target model/rule 201 based on the training data maintained in the database 230, where the trained target model/rule 201 may be a voice interaction response language model according to an embodiment of the present application. The training device 220 does not necessarily train the target model/rule 201 entirely on the training data maintained in the database 230; it may also obtain training data from the cloud or elsewhere, and this should not be construed as limiting the embodiments of the present application.
The training data maintained in the database 230 need not all be from the collection of the data collection device 260, but may be received from other devices. In one example, training data in database 230 may be obtained by client device 240 or may be obtained by execution device 210. Client device 240 may include, for example, various types of voice interactive terminals. The execution device 210 may be a device having voice recognition capabilities, natural language understanding capabilities, etc. For example, training data such as text features of an input text, phonetic symbol features of a target voice, etc. may be obtained by obtaining voice information through the data collection device 260 and performing related processing; text features of the input text and phonetic symbol features of the target speech may also be acquired by the data acquisition device 260. As another example, the voice information may be directly used as training data. In another example, the same account may be logged onto multiple client devices 240, and data collected by the multiple client devices 240 may be maintained in the database 230.
Alternatively, the training data may include, for example, one or more of speech, corpus, hot words, and the like. Speech may refer to sound that carries linguistic meaning. A corpus, i.e., language material, may refer to language in the real world together with its context, e.g., text and its textual context. A hot word is a lexical phenomenon that may reflect the problems, topics, and things that some people pay particular attention to during a period of time. The hot words of different time periods may differ.
In one possible example, the training data may include, for example, input speech (the input speech may be, for example, from a user or may be speech acquired by another device).
In another possible example, the training data may include, for example, feature vectors of the input speech (e.g., phonetic symbol features, which may, for example, reflect phonetic symbols of the input speech). The feature vector of the input speech may be obtained by feature extraction of the input speech.
In another possible example, the training data may include, for example, target text corresponding to an input voice, or the like.
In yet another possible example, the training data may include text features of target text corresponding to the input speech, for example. The target text can be obtained by performing characteristic preprocessing on input voice. The text features of the target text can be obtained by extracting features of the target text.
It should be appreciated that the input voice may be sent by the client device 240 to the data acquisition device 260, may be read from a storage device by the data acquisition device 260, or may be acquired by real-time acquisition.
The target model/rule 201 is obtained by training with the training device 220 and may be a model built on a neural network, where the neural network may be a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a bidirectional long short-term memory network (BLSTM), a deep convolutional neural network (DCNN), or the like.
The target model/rule 201 obtained by the training device 220 may be applied to different systems or devices. In the system architecture 200 shown in FIG. 2, the execution device 210 may be configured with an input/output (I/O) interface 212, through which the execution device 210 can exchange data with devices external to it. As shown in FIG. 2, a "user" may input data to the I/O interface 212 through the client device 240. For example, the user may input an intermediate prediction result to the I/O interface 212 through the client device 240, and the client device 240 may send the intermediate prediction result obtained after certain processing to the execution device 210 through the I/O interface 212. The intermediate prediction result may be, for example, the target text corresponding to the input voice.
Alternatively, the training device 220 may generate, for different targets or users, a corresponding target model/rule 201 based on different training data, where the corresponding target model/rule 201 may be used to achieve the above targets or to perform the above tasks, thereby providing the user with the desired results.
Alternatively, the target model/rule 201 may be obtained by training on the basis of a basic speech model. During the training process, one portion of the target model/rule 201 may be updated and another portion of the target model/rule 201 may not be updated. The updated portion of the target model/rule 201 may correspond to a personalized speech sub-model. The non-updated portion of the target model/rule 201 may correspond to a generic speech sub-model. The basic speech model may be a pre-trained speech model by the training device 220 using a plurality of voices, corpus, or the like, or may be an existing speech model.
The client device 240 and the computing module 211 may work in concert. They may process data input to the client device 240 and/or data input to the execution device 210 (e.g., intermediate prediction results from the client device 240) according to the personalized speech sub-model and the generic speech sub-model described above. In one example, the client device 240 may process the input user speech to obtain phonetic symbol features or text features corresponding to the user speech and then input those features to the computing module 211. In other examples, the preprocessing module 213 of the execution device 210 may receive the input speech through the I/O interface 212, perform feature preprocessing and feature extraction on the input speech to obtain the text features of the target text, and input those text features to the computing module 211. The computing module 211 may input the phonetic symbol features or text features into the target model/rule 201 to obtain a speech recognition output (e.g., a semantic recognition result or the operation corresponding to a voice instruction), and may return the output result to the client device 240 so that the client device 240 can perform the corresponding operation in response to the user's voice instruction.
The I/O interface 212 may send the input data to the corresponding module of the execution device 210, or may return the output result to the client device 240 for the user. For example, the I/O interface 212 may send the intermediate prediction result corresponding to the input voice to the calculation module 211, or may return the result obtained after recognizing the voice to the client device 240.
In the system architecture 200 shown in fig. 2, a user may input data such as voice and corpus into the client device 240, and may view the result output by the execution device 210 at the client device 240, and a specific presentation form may be a specific manner such as sound or a combination of sound and display. The client device 240 may also be used as a data collection terminal to store collected data such as voice and corpus into the database 230. Of course, the data such as voice and corpus of the user and the output result of the I/O interface 212 may be stored in the database 230 as new sample data by other devices instead of being collected by the client device 240.
In the system architecture 200 shown in fig. 2, the execution device 210 and the data storage system 250 may be integrated in different devices depending on the data processing capabilities of the client device 240. For example, when the data processing capabilities of the client device 240 are strong, the execution device 210 and the data storage system 250 may be integrated in the client device 240; whereas the execution device 210 and the data storage system 250 may be integrated in a dedicated data processing device when the data processing capabilities of the client device 240 are not very strong. The database 230, training device 220, and data collection device 260 of fig. 2 may be integrated into a dedicated data processing device, may be located on a cloud or other server on the network, or may be located in a client device 240 and data processing device, respectively.
It should be noted that FIG. 2 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships between the devices, apparatuses, and modules shown in FIG. 2 are not limited in any way. For example, in FIG. 2 the data storage system 250 is external memory relative to the execution device 210; in other cases the data storage system 250 may be located within the execution device 210. As another example, in some possible examples the execution device 210 may be disposed in the client device 240. The generic speech sub-model of the target model/rule 201 may be the factory-installed speech model of the client device 240; after the client device 240 leaves the factory, the personalized speech sub-model of the target model/rule 201 may be updated based on the data collected by the client device 240.
In voice interaction, besides executing the operation corresponding to the user's voice instruction, the intelligent device may also respond to the voice instruction. For example, in the system shown in FIG. 1, the local device 101 obtains the user's voice instruction and sends it to the execution device 110; the execution device 110 processes the voice instruction to obtain the corresponding execution instruction and at the same time generates a response corresponding to the voice instruction, then sends both to the local device 101, which executes the instruction and outputs the response to the user. Existing voice responses are mainly generated from response templates, with different templates for different voice interaction scenarios. Responses generated from templates are generally too mechanical and rigid, while in practice different users have different expression habits, so template-based response generation can hardly meet the requirements of being natural, personalized, and consistent with the user's expression habits.
Therefore, an embodiment of the present application provides a method for training a voice interaction response language model, in which the model is trained according to the user's voice instructions so that the responses output by the trained model match the user's personal way of expressing themselves. The method may be implemented by the system in FIG. 2: the data collection device 260 in FIG. 2 collects the user's voice instructions and stores them in the database 230, the training device 220 trains the model according to the voice instructions in the database 230 to obtain a trained voice interaction response language model, and the trained model is stored in the data storage system 150 in FIG. 1. In addition, an embodiment of the present application further provides a voice interaction method that processes the user's voice instructions with the voice interaction response language model trained by the above method. Specifically, in the system in FIG. 1, the execution device 110 obtains the user's instruction through the local device 101, processes the voice instruction according to the voice interaction response language model in the data storage system 150 to obtain a natural, personalized response that conforms to the user's expression habits, and outputs the response to the user through the local device, thereby improving the user experience.
FIG. 3 is a schematic flowchart of a method for training a voice interaction response language model according to an embodiment of the present application. The method shown in FIG. 3 may be applied while a user is using a voice interaction system, for example in the human-computer interaction scenarios of electronic devices and in-vehicle systems. When the voice interaction system is an in-vehicle voice interaction system, the system obtains the user's voice instructions during everyday voice interaction and then trains the voice interaction response language model according to those instructions. The trained model can then be applied in the human-computer interaction scenario of the in-vehicle voice interaction system: it outputs the corresponding answer text according to the user's voice instruction, and the speech synthesis engine generates the corresponding answer speech from the answer text and outputs it to the user, so that the responses output by the in-vehicle voice interaction system are more personalized and conform to the user's expression habits. The solution can be implemented by in-vehicle equipment, such as an in-vehicle system, an in-vehicle apparatus, or an in-vehicle processor. Alternatively, the in-vehicle equipment may upload the collected voice instructions of the user to the cloud, the cloud processes the voice instructions and trains the model according to the processed result, and then sends the trained voice interaction response language model to the in-vehicle equipment. Alternatively, the in-vehicle equipment may perform some preprocessing on the collected voice instructions, for example converting a voice instruction into text and performing feature extraction on the text to obtain an instruction text, and then upload the instruction text to the cloud, which trains the model according to the instruction text and sends the trained model back to the in-vehicle equipment. Alternatively, the in-vehicle voice interaction system may upload the collected voice instructions to the cloud, the cloud performs the preprocessing (for example, converting a voice instruction into text and extracting features from the text to obtain an instruction text) and sends the instruction text to the in-vehicle equipment, and the in-vehicle equipment trains the model according to the received instruction text to obtain the trained voice interaction response language model. The method shown in FIG. 3 includes steps S301 to S303, which are described below.
S301, a first voice instruction of a user is acquired.
As described above, the first voice instruction is a voice instruction issued by the user to the voice interaction system while interacting with it. For example, the user may, in their own habitual way of speaking, say to the voice interaction system something like "come on, play that song ABC, the melody is awesome", and the voice interaction system can train the model to be trained according to this first voice instruction. It should be understood that the method of FIG. 3 retrains a model that has already been preliminarily trained, for example before leaving the factory, during an earlier upgrade, or by other methods during earlier use. Therefore, during use, when the user issues a first voice instruction, the voice interaction system produces a response to it according to the preliminarily trained model, completing that round of voice interaction, and then saves the first voice instruction (for example, in the database 230 of the system shown in FIG. 2). The model to be trained is then trained on the user's first voice instructions when appropriate, for example after a preset number of the user's voice instructions have been collected, or after the user has used the voice interaction system for a preset length of time; the specific trigger can be preset in the voice interaction system.
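As an illustrative assumption of how this trigger might be organized (not the actual logic of the present application), the collected instructions can simply be buffered until the preset count is reached:

    TRAIN_THRESHOLD = 100  # preset number of instructions; a preset usage time works the same way

    class InstructionBuffer:
        def __init__(self, train_fn):
            self._texts = []
            self._train_fn = train_fn  # hypothetical callback that retrains the model to be trained

        def on_voice_instruction(self, instruction_text):
            self._texts.append(instruction_text)  # e.g. persisted to the database 230
            if len(self._texts) >= TRAIN_THRESHOLD:
                self._train_fn(self._texts)       # retrain on the collected instructions
                self._texts.clear()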
S302, extracting features of the text of the first voice instruction to obtain a first instruction text.
The first voice instruction is converted into text, and then feature extraction is performed on the text of the first voice instruction to obtain the first instruction text. Specifically, feature extraction is first performed on the text of the first voice instruction to obtain its intention information and slot information, and then the first instruction text is obtained according to the intention information, the slot information, and a preset template. Compared with the text of the first voice instruction, the first instruction text is a concise sentence that retains only the intention information and slot information of the first voice instruction and contains no personalized expression. For example, if the text of the first voice instruction is "come on, play that song ABC, the melody is awesome", the first instruction text is "play the song ABC".
An exemplary specific process is described below. The first voice instruction is first converted into a speech audio signal, and noise reduction, amplification, and the like may be performed on the signal to facilitate subsequent speech recognition. The speech audio signal is then converted into a text signal, an intention decoder extracts the intention information from the text signal, and a semantic slot decoder extracts the slot information from the text signal. Finally, the first instruction text is obtained according to the intention information, the slot information, and the preset template. Note that the preset template merely combines the intention information and the slot information into a sentence of text; it does not generate an answer text for the intention and slot information of the first voice instruction. This is equivalent to removing the personal expression habits from the first voice instruction text originally issued by the user and keeping only the most basic features that express the user's intention and slot information.
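The sketch below illustrates the intention decoder and semantic slot decoder stages, reduced to simple keyword matching and a regular expression so the example is self-contained; real decoders would be trained NLU models, and none of the rules here come from the present application:

    import re

    def intent_decoder(text):
        # Assumed keyword rules standing in for a trained intention decoder.
        if "play" in text or "sing" in text:
            return "play_music"
        if "navigate" in text or "go to" in text:
            return "navigate"
        return "unknown"

    def slot_decoder(text):
        # Assumed pattern standing in for a trained semantic slot decoder.
        slots = {}
        match = re.search(r"song\s+(\w+)", text)
        if match:
            slots["song"] = match.group(1)
        return slots

    asr_text = "come on, play that song ABC, the melody is awesome"
    print(intent_decoder(asr_text), slot_decoder(asr_text))
    # -> play_music {'song': 'ABC'}
    # The intention and slots are then recombined by the preset template into
    # the first instruction text, e.g. "play the song ABC" (see the earlier sketch).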
S303, training a first model to be trained according to the text of the first voice instruction and the first instruction text to obtain a voice interaction response language model, wherein the first instruction text is input of the first model to be trained, and the text of the first voice instruction is a training label.
The obtained first instruction text is used as the input of the first model to be trained, the text of the first voice instruction is used as its training label, and the first model to be trained is trained. The first model to be trained includes three sub-models, namely a marking model, a pointer model, and an insertion model; alternatively, the first model to be trained may be a single model that combines the functions of the three sub-models. The training process is described below using the three sub-models as an example; it should be understood that when the model to be trained is a single model combining the functions of the three sub-models, the following training process may be referred to as well.
The specific training process of the first model to be trained is as follows. The first instruction text is first input into the marking model, which feature-tags the first instruction text to obtain a feature tag sequence. Feature-tagging the first instruction text means deciding which features in it need to be deleted, which need to be kept, at which positions new features need to be inserted and how many, and so on, and marking the corresponding features or positions accordingly. For example, for the first instruction text "the song to play is ABC", the features "to play" and "is" need to be deleted, the features "song" and "ABC" need to be kept, several new features need to be inserted before the feature "song" and several more after the feature "ABC", and the corresponding positions are marked. The feature tag sequence is then input into the pointer model, which orders it to obtain a feature ordering sequence: the pointer model deletes the features marked for deletion and reorders the features marked to be kept. In the example, the features "to play" and "is" are deleted, the feature "song" is arranged before the feature "ABC", and the positions to be filled before "song" and after "ABC" are reserved. Finally, the feature ordering sequence is input into the insertion model, which inserts the first feature into the feature ordering sequence to obtain an output sequence. In the example, the reserved positions before "song" and after "ABC" are filled with new features, yielding an output sequence such as "come on, play that song ABC, the melody is awesome". This output sequence is the output of the first model to be trained during training.
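A toy, token-level walk-through of the three stages is sketched below; the tagging decisions and inserted words are hard-coded for the running example and are assumptions, whereas in the trained model they are predicted by the marking, pointer, and insertion sub-models:

    KEEP, DELETE = "KEEP", "DELETE"

    def tag(tokens):
        # Marking model: per token, keep or delete, plus how many new tokens
        # to insert after it (hard-coded here for the example).
        plan = {
            "play": (DELETE, 0),
            "song": (KEEP, 0),
            "is":   (DELETE, 0),
            "ABC":  (KEEP, 3),
        }
        return [(tok,) + plan.get(tok, (KEEP, 0)) for tok in tokens]

    def reorder(tagged):
        # Pointer model: drop deleted tokens, keep the rest in order, and
        # reserve placeholder slots where new tokens must be inserted.
        seq = []
        for tok, action, n_insert in tagged:
            if action == KEEP:
                seq.append(tok)
                seq.extend(["[SLOT]"] * n_insert)
        return seq

    def insert(seq):
        # Insertion model: fill the reserved slots with new tokens; the fillers
        # below stand in for wording learned from the user's own phrasing.
        fillers = iter(["melody", "is", "awesome"])
        return [next(fillers) if tok == "[SLOT]" else tok for tok in seq]

    tokens = ["play", "song", "is", "ABC"]
    print(" ".join(insert(reorder(tag(tokens)))))
    # -> "song ABC melody is awesome"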
The total loss function of the first model to be trained consists of the loss function of the marking model, the loss function of the pointer model and the loss function of the insertion model. The output sequence is compared with the training label to calculate the loss functions of the three sub-models, which are fed back to the three sub-models so that their parameters are adjusted, thereby training the voice interaction answer language model. The loss function describes how well the model being trained matches the target model, and the model parameters are updated according to a gradient descent algorithm. A sketch of one combined training step is given below.
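A minimal sketch of one training step, assuming PyTorch-style sub-models that each expose a loss() method (an assumed API, not taken from the embodiment): the total loss is the sum of the three sub-model losses, and one gradient-descent step updates all three sub-models.

```python
import torch

def training_step(marking_model, pointer_model, insertion_model, optimizer,
                  instruction_text, label_text):
    # Sum of the marking, pointer and insertion losses (assumed loss() API).
    loss = (marking_model.loss(instruction_text, label_text)
            + pointer_model.loss(instruction_text, label_text)
            + insertion_model.loss(instruction_text, label_text))
    optimizer.zero_grad()
    loss.backward()        # feed the total loss back to the three sub-models
    optimizer.step()       # adjust parameters by gradient descent
    return loss.item()
```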
As described in S301, the method of fig. 3 is a retraining process: the model to be trained has already undergone preliminary training before delivery, or during a previous upgrade, or has been trained by other methods during previous use. Preliminary training means that, before leaving the factory, the model to be trained is trained according to preset training sentences and the preset labels of those sentences; the preset training sentences and labels may be written manually or obtained from history records. After preliminary training, the model can output a relatively natural answer text according to the user's voice instruction during use. For example, for the first voice instruction "come talk song ABC melody bar", the preliminarily trained model outputs the relatively natural answer text "play ABC for you, ha".
The voice interaction answer language model trained through the above steps can output personalized text that conforms to the user's voice expression habits. In practice, "a user" here may represent one or more users. Specifically, the voice interaction system in the embodiment of the application can train a separate voice interaction answer language model for each different user, and the text output by each model conforms to that user's language expression habits. For example, the voice interaction system can judge whether voice instructions come from different users by recognizing the timbre of different users, or can perform facial recognition with other sensors such as a camera to judge which user the current voice instruction comes from, so that the collected voice instructions of each user are stored in different sets of a database, and different voice interaction answer language models are then trained from the different sets. Meanwhile, a mapping relation is established between each user and the corresponding voice interaction answer language model. For example, for a first user, there is a first mapping between the first user and a first voice interaction answer language model; the first mapping indicates that the first voice interaction answer language model corresponds to the first user, and the first voice interaction answer language model is obtained through training according to the voice instructions of the first user. In this way, answers conforming to each user's expression habits can be output for different users; for example, answers in a more mature style can be output for parents, and answers in a more natural, lively style for children. A sketch of this per-user mapping follows below.
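The following sketch illustrates the per-user mapping described above. The identifiers and data structures are assumptions for illustration only: collected instructions are routed into per-user sets, and each user is mapped to the answer model trained on that user's set.

```python
# Hypothetical sketch of the per-user mapping ("first mapping"): collected
# voice instructions are routed into per-user sets, and each user is mapped
# to the answer model trained on that user's set.

user_instruction_sets: dict[str, list[str]] = {}   # user id -> collected instruction texts
user_to_model: dict[str, object] = {}              # user id -> trained answer model

def record_instruction(user_id: str, instruction_text: str) -> None:
    """Store an instruction in the set belonging to the identified user."""
    user_instruction_sets.setdefault(user_id, []).append(instruction_text)

def get_answer_model(user_id: str, default_model):
    """Return the answer model mapped to this user, or a generic default."""
    return user_to_model.get(user_id, default_model)
```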
Optionally, in the embodiment of the application, the same voice interaction answer language model may be trained according to the voice instructions of a plurality of different users who have similar language expression habits.
The training data source of this method for training the voice interaction answer language model is direct: while the user uses the voice interaction system, the user's voice instructions are collected through daily voice interaction, the input of the model to be trained is obtained through the depersonalized feature extraction described above, and the user's voice instructions serve as the training labels, so the voice interaction answer language model is trained without manually writing or collecting training data. In addition, because the model is trained directly on the user's voice instructions, the text output by the trained voice interaction answer language model conforms to the user's expression habits, so the answer speech output by the voice interaction system also conforms to those habits, improving the user experience.
The voice interaction answer language model trained according to the method shown in fig. 3 may be applied to the voice interaction system shown in fig. 4. Fig. 4 shows a schematic diagram of a system architecture for voice interaction of an embodiment of the present application. As shown in fig. 4, the system includes a voice recognition subsystem, a semantic understanding subsystem, a semantic response subsystem, and a voice synthesis subsystem. The voice recognition subsystem converts the voice signal acquired by the audio device into a text signal, the semantic understanding subsystem understands the meaning of the text signal, the semantic response subsystem determines the answer text based on the output of the semantic understanding subsystem, and the voice synthesis subsystem synthesizes the answer text into corresponding speech. The system may further include a preprocessing subsystem for preprocessing the voice signal, such as noise reduction and amplification, before the voice recognition subsystem converts it into a text signal. A minimal sketch of this subsystem chain is given below.
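A minimal sketch of the subsystem chain in fig. 4; each stage is a placeholder callable, and the class name and structure are assumptions, not part of the embodiment.

```python
# Illustrative sketch of the fig. 4 pipeline; preprocessing is optional,
# as in the text. Each stage is a placeholder callable.

class VoiceInteractionSystem:
    def __init__(self, recognize, understand, respond, synthesize, preprocess=None):
        stages = [recognize, understand, respond, synthesize]
        self.stages = ([preprocess] + stages) if preprocess else stages

    def handle(self, audio_signal):
        signal = audio_signal
        for stage in self.stages:
            signal = stage(signal)   # speech -> text -> meaning -> answer text -> speech
        return signal                # synthesized answer speech
```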
Fig. 5 is a schematic flowchart of a voice interaction method according to an embodiment of the present application. The method shown in fig. 5 may be implemented using the voice interaction system shown in fig. 4, and enables the voice interaction system to output personalized answers conforming to the user's language expression habits during voice interaction with the user. The method shown in fig. 5 includes steps S501 to S503, which are described below.
S501, a second voice instruction of a user is acquired.
A single voice interaction between the user and the voice interaction system is taken as an example. The second voice instruction refers to the voice instruction sent by the user to the voice interaction system during this interaction; the following description still takes this second voice instruction as an example.
S502, acquiring a first answer text according to the second voice instruction.
Intention information and slot information of the second voice instruction are first obtained according to the second voice instruction, and the first answer text is then obtained according to the intention information, the slot information and a preset answer template. Specifically, the second voice instruction is first converted into a voice audio signal, and the voice audio signal may be subjected to noise reduction, amplification and other processing to facilitate subsequent voice recognition; the voice audio signal is then converted into a text signal, intention information is extracted from the text signal by an intention decoder, and slot information is extracted by a semantic slot decoder; finally, the first answer text is obtained according to the intention information, the slot information and the preset answer template. Note that this differs from step S302 above: in step S302 a first instruction text is obtained according to the intention information, the slot information and a preset template, and that first instruction text, obtained by depersonalizing the first voice instruction, is essentially an instruction; here the first answer text is an answer to the second voice instruction, but it does not yet conform to the user's language expression habits and is still rather mechanical. For example, when the second voice instruction is "come talk song ABC melody bar", the corresponding first answer text obtained according to the preset answer template is "play song ABC for you". A sketch of this template-based answer generation is given below.
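The sketch below illustrates S502 under assumed template strings and intent names: the first answer text comes from a preset answer template filled with intent and slot information only, so it is still mechanical.

```python
# Assumed sketch of S502: fill a preset answer template with intent and slot
# information. Template strings and intent names are illustrative assumptions.

PRESET_ANSWER_TEMPLATES = {
    "play_music": "play song {song} for you",
    "navigate":   "have navigated to {place} for you",
}

def build_first_answer_text(intent: str, slots: dict) -> str:
    return PRESET_ANSWER_TEMPLATES[intent].format(**slots)

print(build_first_answer_text("play_music", {"song": "ABC"}))
# -> "play song ABC for you"
```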
S503, inputting the first answer text into a voice interaction answer model to output a second answer text, wherein the voice interaction answer model is obtained by training according to a text of a first voice instruction and a first instruction text, the first instruction text is obtained by extracting features of the text of the first voice instruction, and the first voice instruction is a voice instruction of the user.
Since the voice interaction answer language model is trained according to the user's voice instructions through the training method shown in fig. 3, it can output text conforming to the user's language expression habits. Therefore, when the first answer text is input into the trained voice interaction answer language model, the output second answer text is an answer text conforming to the user's language expression habits.
As described above, the voice interaction system in the embodiment of the application can train a voice interaction answer language model for each different user, and the text output by each model conforms to that user's language expression habits. When the user is a first user, the first voice interaction answer language model is obtained according to a first mapping; the first voice interaction answer language model is trained according to the voice instructions of the first user, and the first mapping indicates that the first voice interaction answer language model corresponds to the first user. The first answer text obtained according to the user's second voice instruction is then input into the first voice interaction answer language model, so that an answer conforming to the first user's language expression habits can be output. In practical application, the voice interaction system can judge the user's identity by recognizing the timbre of different users, or perform facial recognition with other sensors such as a camera to judge the identity of the current user, and thereby acquire the voice interaction answer language model corresponding to the user according to the mapping relation.
Optionally, after the second answer text is obtained, the method of the embodiment of the application further includes filtering out preset language information from the second answer text. In actual training, if the user's wording is not very civil, the voice interaction answer language model trained according to the user's voice instructions may output uncivil answer text; therefore, before the answer is output to the user, the second answer text output by the voice interaction answer language model needs to be filtered to remove the uncivil language information. The specific language information to be filtered may be preset by the developer before the voice interaction system leaves the factory, or may be freely set by the user during use, which is not limited in the embodiment of the application. A minimal sketch of such a filter follows below.
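The following sketch, with a placeholder word list and a whitespace-tokenized simplification (real answer text may need proper tokenization), illustrates removing preset words from the second answer text before speech synthesis.

```python
# Hypothetical sketch of the post-filter: words on a preset list are removed
# from the second answer text before it is synthesized for the user.

PRESET_FILTER_WORDS = {"badword"}   # assumed placeholder entries

def filter_answer_text(answer_text: str) -> str:
    kept = [w for w in answer_text.split() if w.lower() not in PRESET_FILTER_WORDS]
    return " ".join(kept)

print(filter_answer_text("play ABC for you badword"))
# -> "play ABC for you"
```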
After the second answer text is obtained according to the above steps, it is input into a voice synthesis engine to generate second answer speech, and the second answer speech is played to the user, thereby completing one voice interaction between the user and the voice interaction system.
It should be understood that, because the voice interaction answer language model is obtained by retraining during the user's use, and the model to be trained has undergone preliminary training before leaving the factory, the method of the embodiment of the application further includes: after delivery and before retraining, acquiring a third voice instruction of the user, and inputting the third voice instruction into the first model to be trained to output a third answer text, where the first model to be trained is trained according to preset training sentences and the preset labels of those sentences. The third answer text is input into the voice synthesis engine to generate third answer speech, which is a natural answer but does not yet conform to the user's language expression habits.
The voice interaction answer language model and the model to be trained in the embodiment of the application are non-autoregressive (non-autoregressive translation, NART) models. An autoregressive (autoregressive translation, ART) model uses the already generated sequence as known information to predict one future word at a time, and finally splices the words generated at each time step into a complete output sequence, so its latency is large. In contrast, in a non-autoregressive model there is no dependency between the words, and all words of the output sequence are predicted synchronously in parallel. A short sketch contrasting the two decoding styles follows below.
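The sketch below contrasts the two decoding styles; predict_next and predict_all stand in for real model calls and are assumptions for illustration.

```python
# Illustrative contrast between autoregressive and non-autoregressive decoding.

def autoregressive_decode(predict_next, length):
    seq = []
    for _ in range(length):              # one step per word -> latency grows with length
        seq.append(predict_next(seq))    # each word depends on the words already generated
    return seq

def non_autoregressive_decode(predict_all, length):
    return predict_all(length)           # all words of the output predicted in parallel
```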
The voice interaction method of the embodiment of the application generates answers using the voice interaction answer language model trained according to the voice instructions sent by the user in daily voice interaction, so the generated answers conform to the user's language expression habits. Moreover, by matching different voice interaction answer language models to different users, personalized answer expression tailored to each user can be realized, greatly improving the user experience.
The method for training the voice interaction answer language model in the embodiment of the application is mainly aimed at in-vehicle voice interaction scenarios, and the targeted products are mainly voice interaction products in the intelligent automobile field. The specific form may be software code, a functional interface, or hardware with a voice interaction function or voice interaction processing function, including but not limited to an in-vehicle head unit, a voice interaction system, an on-board computer, a processor, and the like. The method can further be extended to smart-home related products such as smart speakers and smart televisions, and the related products include but are not limited to processors, computing devices, speakers, televisions, voice interaction systems, and the like.
Fig. 6 shows a schematic block diagram of a more detailed voice interaction system of an embodiment of the present application. As shown in fig. 6, the system includes a preprocessing subsystem, a voice recognition subsystem, a semantic understanding subsystem, a semantic response subsystem, and a voice synthesis subsystem.
The preprocessing subsystem converts the acquired user voice instruction into a voice audio signal and passes it to the voice recognition subsystem. The voice recognition subsystem converts the voice audio signal into a text signal and passes it to the semantic understanding subsystem. The semantic understanding subsystem obtains the corresponding intention and slot information from the text signal and passes them to the semantic response subsystem. The semantic response subsystem generates the answer text corresponding to the user's voice instruction according to the intention and slot information; it carries the voice interaction answer language model of the embodiment of the application, which is first trained offline on manually designed training data, and its later application comprises three different stages. The first stage is a generic natural answer stage: having been trained on generic training data, the model can output a generic natural answer text according to the user's voice instruction, so a natural spoken answer is obtained and the user experience is more natural. The second stage is a personalized learning stage: the user's voice instructions serve as training data, so the voice interaction answer language model continually learns the user's language habits during daily human-machine interaction, strengthening the machine's capability for personalized expression. The third stage is a personalized natural answer stage: after a period of personalized learning, the voice interaction answer language model can output answer text similar to the user's language expression habits, bringing a better experience. The semantic response subsystem passes the generated answer text to the voice synthesis subsystem, which converts it into speech and outputs it to the user.
Based on the above description of the voice interaction system of fig. 6, the three application stages of the voice interaction answer language model of the embodiment of the present application are described below with reference to figs. 7 to 9. The application scenario is exemplified by a human-machine interaction scenario of an in-vehicle system, but it should be understood that the application scenarios also include human-machine interaction scenarios of other electronic devices, such as intelligent terminals and smart-home devices.
When the vehicle is started, the voice interaction answer language model is loaded in the voice interaction system shown in fig. 6. At this point the model has been trained on manually designed training data; for example, the model to be trained may have been preliminarily trained before leaving the factory, or trained during a previous upgrade, or trained by other methods during previous use. It can therefore output generic natural answer text.
Fig. 7 shows a schematic flowchart of generating generic natural answer text with the voice interaction answer language model. First, while the vehicle is started or being driven (mainly during use of the automobile, but not limited to the parked or driving state), the user sends the voice instruction "come talk song ABC melody bar". An audio acquisition device (such as a microphone in the automobile) inputs the acquired voice instruction into the preprocessing module, and the preprocessing module converts it into a voice audio signal T = t_1 t_2 ... t_n, where t_i represents one byte and n represents the length of the voice instruction. The preprocessing module may also perform noise reduction, amplification and other processing on the voice audio signal to facilitate the subsequent recognition, understanding and response operations, and then passes the voice audio signal T to the voice recognition module. After receiving the voice audio signal T, the voice recognition module converts it into a text signal X = x_1 x_2 ... x_n, where x_i represents one byte and n represents the length of the text signal X; here the text signal is machine-readable text converted from the speech, for example the spoken instruction "come talk song ABC melody bar" is converted into the text "come talk song ABC melody bar". The voice recognition module then passes the text signal X to the semantic understanding module. After receiving the text signal X, the semantic understanding module first converts it into a new sequence Z = z_1 z_2 ... z_n, where z_i represents one byte and n represents the length of the sequence Z. A semantic intention decoder then processes the sequence Z to obtain the intention y_1, and a semantic slot decoder processes the sequence Z to obtain the slot information Y = y_2 y_3 ... y_{n+1}. For example, for the input text "navigate to place B, and come talk song ABC melody bar", the semantic understanding module can output the semantic intentions "navigation destination is place B" and "play song ABC", the semantic slots being the navigation destination "place B" and the played song name "ABC". The semantic understanding module then passes the intention y_1 and the slot information Y to the semantic response module.
As shown in fig. 7, the semantic response module obtains a fixed template answer R = r_1 r_2 ... r_m from the existing answer templates, the intention y_1 and the slot information Y, where r_i represents one byte and m represents the length of the template answer R. Here the answer template is a preset voice answer template obtained from an answer model trained on a general-purpose corpus; for example, the preset semantic answer for "playing song is ABC" is "play song ABC for you", and the preset semantic answer for "navigation destination is place B" is "have navigated to place B for you". As shown in fig. 7, the obtained template answer is "play song ABC for you"; it can be seen that the template answer R obtained from the existing answer template is mechanical and stiff. The template answer R is then input into the offline-trained voice interaction answer language model, which in fig. 7 includes three sub-models, namely a marking model, a pointer model and a text insertion model. The marking model performs feature marking on the template answer R according to the following formula:
tag_i = argmax(BERT(r_i))

where tag_i represents the i-th element in the feature mark sequence output by the marking model, argmax() represents the maximum pooling function, BERT() represents the feature extraction function, and r_i is the i-th element in the template answer R. For the template answer "play song ABC for you", the output of the marking model is shown in fig. 7, where the tag "D" indicates delete, the tag "K" indicates keep, and the tag "I^2" indicates inserting two features; the marking model tags each word to indicate whether it is deleted, kept, or followed by an insertion. The feature mark sequence output by the marking model then serves as the input of the pointer model, which marks which words are moved to which positions. As shown in fig. 7, the pointer model deletes the features labeled "D" in the feature mark sequence and reorders the features labeled "K" according to the following formula:
order_i = p(π(tag_1 tag_2 … tag_m))_i

where order_i represents the i-th element in the feature ordering sequence output by the pointer model, p() represents the insertion function, and π() represents the permutation function. The output of the pointer model is shown in fig. 7: the feature "song" is deleted, the features "play for you" and "ABC" are reordered, and two insertion positions are reserved after the feature "play for you". The feature ordering sequence output by the pointer model is then used as the input of the insertion model, which inserts suitable features at the insertion positions of the feature ordering sequence according to the following formula:
ins_i = BERT(order_1 … [MASK] … order_m)_i

where ins_i represents the i-th element of the feature insertion sequence output by the insertion model, and BERT() here denotes feature extraction with a mask added to the sequence in brackets. As shown in fig. 7, the insertion model inserts the feature "ha" after the feature "play for you", so that the generic natural answer text generated by the voice interaction answer language model is "play ABC for you, ha". Finally, the generic natural answer text is input into the voice synthesis module, which converts it into generic natural answer speech and outputs it to the user; compared with the template answer obtained from the answer template, the generic natural answer speech is more colloquial and natural in expression.
However, in daily use, the above generic natural answer speech may still not meet the user's needs; some users may want the answers output by the voice interaction system to be more personalized and more in line with their own expression habits. Therefore, the method of the embodiment of the application further includes training the voice interaction answer language model according to the user's voice instructions during daily use of the voice interaction system.
Fig. 8 is a schematic flowchart of training the voice interaction answer language model according to the user's voice instruction. First, the intention y_1 and slot information Y corresponding to the voice instruction are obtained from the user's voice instruction, still taking the voice instruction "come talk song ABC melody bar" as an example; the specific process of obtaining y_1 and Y through the preprocessing module, the voice recognition module and the semantic understanding module is the same as described for fig. 7 and is not repeated here. As shown in fig. 8, a fixed voice interaction instruction is then generated from the existing semantic parsing template and the intention y_1 and slot information Y corresponding to the voice instruction; in fig. 8 the fixed voice interaction instruction is "playing song is ABC". It can be seen that, compared with the user's voice instruction, the fixed voice interaction instruction only expresses the user's intention and contains the corresponding slot information, but does not contain the user's habitual expression. The fixed voice interaction instruction "playing song is ABC" is used as the training sentence of the voice interaction answer language model, the user's voice instruction "come talk song ABC melody bar" is used as the training label, and the voice interaction answer language model is trained. Specifically, the voice interaction answer language model comprises the marking model, the pointer model and the insertion model: the marking model performs feature marking on the fixed voice interaction instruction to obtain a feature mark sequence, the pointer model reorders the feature mark sequence to obtain a feature ordering sequence, and the insertion model inserts suitable features at the insertion positions of the feature ordering sequence to finally obtain an output sequence. The total loss function of the voice interaction answer language model consists of the loss function of the marking model, the loss function of the pointer model and the loss function of the insertion model; the losses are fed back to the marking model, the pointer model and the insertion model, and the parameters of the three sub-models are adjusted so that the value of the total loss function is minimized, thereby obtaining a trained voice interaction answer language model capable of personalized natural answers. Note that, in the process of training the voice interaction answer language model according to the user's voice instruction, the input of the model is the fixed voice interaction instruction and the output is also an instruction rather than an answer, but the output instruction is identical to the user's voice instruction or conforms to the user's habitual language expression. A sketch of this retraining loop follows below.
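The following sketch illustrates the personalized-learning stage as a retraining loop. The training pairs and the fit_step method are assumptions for illustration; fit_step stands for one combined gradient step over the marking, pointer and insertion sub-models (see the loss sketch above).

```python
# Assumed sketch of the personalized-learning stage: each pair consists of the
# fixed depersonalized instruction (model input) and the user's original voice
# instruction text (training label).

training_pairs = [
    ("playing song is ABC", "come talk song ABC melody bar"),
    # ... further pairs collected from daily voice interaction
]

def retrain(voice_answer_model, pairs, epochs=3):
    for _ in range(epochs):
        for fixed_instruction, user_instruction in pairs:
            voice_answer_model.fit_step(input_text=fixed_instruction,   # assumed API
                                        label_text=user_instruction)
```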
After the voice interaction answer language model is trained, the trained model can be used to realize personalized natural answers. As shown in fig. 9, still taking the voice instruction "come talk song ABC melody bar" as an example, the intention y_1 and slot information Y corresponding to the voice instruction are first obtained from the user's voice instruction; the specific process through the preprocessing module, the voice recognition module and the semantic understanding module is the same as described for fig. 7 and is not repeated here. A fixed template answer R = r_1 r_2 ... r_m is then generated from the existing answer template, the intention y_1 and the slot information Y, where r_i represents one byte and m represents the length of the template answer R; as shown in fig. 9, the template answer is "play song ABC for you". The template answer R is then input into the trained voice interaction answer language model, specifically first into the marking model, which performs feature marking on R to obtain a feature mark sequence; as shown in fig. 9, the tag "D" indicates delete, the tag "K" indicates keep, and the tag "I^6" indicates inserting six features. The feature mark sequence output by the marking model is then used as the input of the pointer model, which deletes the features labeled "D" and reorders the features labeled "K" to obtain a feature ordering sequence; as shown in fig. 9, the feature "song" is deleted, the features "play for you" and "ABC" are reordered, and six insertion positions are reserved after the feature "play for you". The feature ordering sequence output by the pointer model is then used as the input of the insertion model, which inserts suitable features at the insertion positions; as shown in fig. 9, the insertion model inserts the features forming "melody thief stick" after the feature "play for you", so that the voice interaction answer language model generates the personalized natural answer text "play ABC for you, melody thief stick". Finally, the personalized natural answer text is input into the voice synthesis module, which converts it into personalized natural answer speech and outputs it to the user. Compared with the generic natural answer of fig. 7, the personalized natural answer is more personalized and better conforms to the user's expression habits, giving the user a sense of familiarity.
It should be understood that the formulas used by the marking model, the pointer model and the insertion model during data processing in figs. 8 and 9 may refer to the formulas given for fig. 7; for brevity, they are not repeated here.
The training and application of the voice interaction answer language model according to the embodiments of the present application are described in detail above with reference to figs. 7 to 9. The method of fig. 7 trains the semantic response subsystem to obtain a generic natural answer language model; this training stage may be performed before the vehicle is started, in the cloud or locally, and may use a large amount of user data of the same type or data from the vehicle itself. Fig. 8 shows that, after the vehicle is started, the semantic response subsystem is trained according to the user's personalized speech habits to obtain a personalized natural answer language model; this training stage may be performed in the cloud or locally, and may use the speech of one user of the vehicle or of several users, such as their questions and daily communication. Fig. 9 responds to the user's speech according to the trained personalized natural answer language model to obtain personalized answers that conform to the user's expression habits, enhancing the user experience.
The following briefly describes, taking the voice instruction "come talk song ABC melody bar" as an example, how the voice instruction is processed before, during and after model retraining.
Before model retraining, the user sends the voice instruction "come talk song ABC melody bar". The audio acquisition device inputs the acquired voice instruction into the preprocessing module, which converts it into a voice audio signal and may also perform noise reduction, amplification and other processing to facilitate the subsequent recognition, understanding and response operations, and then passes the voice audio signal to the voice recognition module. After receiving the voice audio signal, the voice recognition module converts the speech "come talk song ABC melody bar" into the text signal "come talk song ABC melody bar", where the text signal is machine-readable text converted from the speech, and passes the text signal to the semantic understanding module. After receiving the text signal, the semantic understanding module first converts it into a new sequence; the semantic intention decoder then processes the sequence to obtain the intention information "play song", and the semantic slot decoder processes the sequence to obtain the slot information that the played song name is "ABC". The semantic understanding module passes the intention and slot information to the semantic response module, which obtains the fixed template answer "play song ABC for you" from the existing answer template, the intention and the slot information. The fixed template answer is then input into the previously preliminarily trained model, which outputs the answer text "play ABC for you, ha". Finally, the answer text is input into the voice synthesis engine, which converts it into answer speech and outputs it to the user; compared with the fixed template answer, this answer speech is more colloquial and natural in expression.
During model retraining, the intention and slot information corresponding to the voice instruction are obtained according to the above process, and a fixed voice interaction instruction "playing song is ABC" is then generated from the existing semantic parsing template and the intention and slot information. The fixed voice interaction instruction is used as the input during retraining, the text of the user's voice instruction "come talk song ABC melody bar" is used as the training label, and the model is retrained.
After the model is retrained, for the user's voice instruction, the intention and slot information corresponding to the voice instruction are still obtained according to the above process, and the fixed template answer "play song ABC for you" is generated from the existing answer template, the intention and the slot information. The template answer is then input into the retrained model, which outputs the answer text "play ABC for you, melody thief stick". Finally, the answer text is input into the voice synthesis engine, which converts it into answer speech and outputs it to the user. Compared with the answer speech output by the model before retraining, the answer speech output by the retrained model is more personalized and better conforms to the user's expression habits, giving the user a sense of familiarity.
The method of the embodiment of the present application is described in detail above with reference to the accompanying drawings, and the apparatus of the embodiment of the present application is described below, and it should be understood that the apparatus of the embodiment of the present application can perform each step of the method of the embodiment of the present application, and duplicate descriptions are omitted appropriately when describing the apparatus of the embodiment of the present application.
Fig. 10 is a schematic block diagram of a voice interaction apparatus according to the present application. The apparatus may be a terminal, for example an electronic device or an in-vehicle system as described above, or a chip inside the terminal, for example an in-vehicle chip. As shown in fig. 10, the apparatus for training a voice interaction answer language model includes an obtaining unit 1001 and a processing unit 1002, which are briefly described below.
An obtaining unit 1001 is configured to obtain a first voice instruction of a user.
The processing unit 1002 is configured to perform feature extraction on the text of the first voice instruction to obtain a first instruction text.
The processing unit 1002 is further configured to train the first model to be trained according to the text of the first voice command and the first command text, so as to obtain a voice interaction answer model, where the text output by the voice interaction answer model has the expression characteristics of the user, the voice interaction answer model is used for answering according to the voice command of the user, the first command text is an input of the first model to be trained, and the text of the first voice command is a training tag.
In some implementations, the processing unit 1002 is specifically configured to: extracting features of the text of the first voice instruction to obtain intention information and slot position information of the first voice instruction; and acquiring a first instruction text according to the intention information, the slot position information and the preset template.
In some implementations, the user includes a plurality of users.
In some implementations, the user is a first user, and a first mapping is provided between the first user and a first voice interaction answer model, where the first mapping is used to indicate that the first voice interaction answer model corresponds to the first user, and the first voice interaction answer model is obtained through training according to a voice instruction of the first user.
In some implementations, the first model to be trained includes three sub-models, which are a marker model, a pointer model, and an insert model.
In some implementations, the processing unit 1002 is specifically configured to: inputting the first instruction text into a marking model to obtain a characteristic marking sequence of the first instruction text, wherein the characteristic marking sequence is obtained by carrying out characteristic marking on the first instruction text; inputting the feature tag sequence into a pointer model to obtain a feature ordering sequence, wherein the feature ordering sequence is obtained by reordering features in the feature tag sequence; inputting the feature ordering sequence into an insertion model to obtain an output sequence, wherein the output sequence is obtained by inserting a first feature into the feature ordering sequence; and updating parameters of the marking model, the pointer model and the insertion model by taking the text of the first voice instruction as a training label.
In some implementations, the processing unit 1002 is specifically configured to: calculating a first loss function of the marking model, a second loss function of the pointer model and a third loss function of the insertion model by taking the text of the first voice instruction as a training label; and updating parameters of the marking model, the pointer model and the insertion model according to the first loss function, the second loss function and the third loss function.
In some implementations, the first model to be trained is trained according to a preset training sentence and a preset label of the preset training sentence.
It should be understood that the voice interaction apparatus shown in fig. 10 may be used to implement the voice interaction method 300 described above, where the obtaining unit 1001 is used to implement step S301 and the processing unit 1002 is used to implement steps S302 and S303. The apparatus shown in fig. 10 may also be used to implement the method for training a voice interaction answer language model described in fig. 8; for specific steps and details, reference may be made to the description of fig. 8 above, which is not repeated here for brevity.
Fig. 11 is a schematic block diagram of a voice interaction device according to the present application, which may be a terminal, such as the electronic device or the vehicle-mounted system described above, or a chip inside the terminal, such as a vehicle-mounted chip, or the like. As shown in fig. 11, the voice interaction device includes an acquisition unit 1101, a processing unit 1102, and a brief description will be given below.
An obtaining unit 1101 is configured to obtain a second voice instruction of the user.
The processing unit 1102 is configured to obtain a first answer text according to the second voice instruction.
The processing unit 1102 is further configured to input the first answer text into a voice interaction answer model to output a second answer text, where the voice interaction answer model is obtained by training according to the text of the first voice instruction and the first instruction text, the first instruction text is obtained by extracting features from the text of the first voice instruction, and the first voice instruction is a voice instruction of the user.
In some implementations, the processing unit 1102 is specifically configured to: acquiring intention information and slot position information of a second voice instruction according to the second voice instruction; and acquiring a first answer text according to the intention information, the slot position information and the preset template.
In some implementations, the user includes a plurality of users.
In some implementations, the processing unit 1102 is specifically configured to: acquiring a first voice interaction response language model according to a first mapping, wherein the first voice interaction response language model is obtained by training according to a voice instruction of a first user, and the first mapping is used for indicating that the first voice interaction response language model corresponds to the first user; the first answer text is input into a first voice interaction answer model.
In some implementations, the processing unit 1102 is further configured to: and filtering out the first language information in the second answer language text, wherein the first language information is preset.
In some implementations, the processing unit 1102 is further configured to: the second answer language text is input to a speech synthesis engine to generate a second answer language speech.
In some implementations, the processing unit 1102 is further configured to: acquiring a third voice instruction of a user; and inputting a third voice instruction into a first model to be trained so as to output a third answer text, wherein the first model to be trained is obtained by training according to a preset training sentence and a preset label of the preset training sentence.
In some implementations, the speech interaction answer model and the first model to be trained are non-autoregressive models.
It should be understood that the voice interaction apparatus shown in fig. 11 may be used to implement the voice interaction method 500 described above, where the obtaining unit 1101 is used to implement step S501 and the processing unit 1102 is used to implement steps S502 and S503. The apparatus shown in fig. 11 may also be used to implement the voice interaction method described in fig. 9; for specific steps, reference may be made to the description of fig. 9 above, which is not repeated here for brevity.
It should be understood that the apparatus 1000 and the apparatus 1100 in the embodiments of the present application may be implemented by software, for example, a computer program or instructions having the functions described above may be stored in a memory inside the terminal, and the functions described above may be implemented by a processor reading the corresponding computer program or instructions inside the memory. Alternatively, the apparatus 1000 and the apparatus 1100 in the embodiment of the present application may be implemented by hardware. Wherein the processing units 1002 and 1102 are processors (e.g., neural network processing units (neural network processing unit, NPUs), processors in system-on-a-chip, etc.), the acquisition units 1001 and 1101 are data interfaces. Alternatively, the apparatus 1000 and the apparatus 1100 in the embodiments of the present application may be implemented by a combination of a processor and a software unit. In particular, the acquisition unit 1001 and the acquisition unit 1101 may be interface circuits of a processor, or a microphone of a terminal, or the like. For example, the microphone of the terminal sends the acquired user voice command to the processor interface circuit.
Fig. 12 is a schematic diagram of an apparatus 1200 according to an embodiment of the application. The apparatus 1200 shown in fig. 12 comprises a memory 1201, a processor 1202, a communication interface 1203, and a bus 1204. Wherein the memory 1201, the processor 1202 and the communication interface 1203 are communicatively coupled to each other via a bus 1204.
It should be appreciated that the acquisition unit 1001 and the acquisition unit 1101 in fig. 10 and 11 may correspond to sensors in the apparatus 1200 (the sensors are not shown in fig. 12), and the processing unit 1002 and the processing unit 1102 may correspond to the processor 1202 in the apparatus 1200. The individual elements and units of apparatus 1200 are described in detail below.
The memory 1201 may be a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM). The memory 1201 may store a program, and the processor 1202 is configured to perform the steps of the method of the embodiment of the present application when the program stored in the memory 1201 is executed by the processor 1202.
In particular, the processor 1202 may be configured to perform steps 302, 303 of the method shown in fig. 3 and steps 502, 503 of the method shown in fig. 5. In addition, the processor 1202 may also perform the processes shown in fig. 7-9.
When the processor 1202 performs steps 302, 303, 502, 503, the processor 1202 may obtain the voice command of the user from the sensor of the apparatus 1200 through the communication interface 1203, and train the model according to the voice command of the multiple users or obtain the corresponding answer text by using the model.
The processor 1202 may employ a general-purpose central processing unit (central processing unit, CPU), microprocessor, application specific integrated circuit (application specific integrated circuit, ASIC), graphics processor (graphics processing unit, GPU) or one or more integrated circuits for executing associated programs to implement methods of embodiments of the present application.
The processor 1202 may also be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the method of the present application may be performed by integrated logic circuitry in hardware or by instructions in software in the processor 1202.
The processor 1202 may also be a general purpose processor, a digital signal processor (digital signal processing, DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software elements in a decoding processor. The software elements may be located in a random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in a memory 1201, and the processor 1202 reads information in the memory 1201 and performs functions necessary for the unit included in the apparatus in combination with its hardware, or performs a method according to an embodiment of the method of the present application.
The communication interface 1203 uses a transceiver device, such as, but not limited to, a transceiver, to enable communication between the device 1200 and other devices or communication networks. For example, user voice instructions may be obtained through the communication interface 1203.
The bus 1204 may include a path to transfer information between various components of the device 1200 (e.g., the memory 1201, the processor 1202, the communication interface 1203).
Embodiments of the present application also provide a computer readable medium storing program code which, when run on a computer, causes the computer to perform the methods described above with reference to fig. 3, 5, and 7 to 9.
The embodiment of the application also provides a chip, which comprises: at least one processor and a memory, the at least one processor being coupled to the memory for reading and executing instructions in the memory to perform the methods described above with respect to fig. 3, 5, 7-9.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk, etc.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (34)

  1. A method of voice interaction, comprising:
    acquiring a first voice instruction of a user;
    extracting features of the text of the first voice instruction to obtain a first instruction text;
    training a first model to be trained according to the text of the first voice instruction and the first instruction text to obtain a voice interaction response language model, wherein the text output by the voice interaction response language model has the expression characteristics of the user, the voice interaction response language model is used for responding according to the voice instruction of the user, the first instruction text is input of the first model to be trained, and the text of the first voice instruction is a training label.
  2. The method of claim 1, wherein the feature extracting the text of the first voice command to obtain a first command text comprises:
    extracting characteristics of the text of the first voice instruction to obtain intention information and slot position information of the first voice instruction;
    and acquiring the first instruction text according to the intention information, the slot position information and a preset template.
  3. The method of claim 1 or 2, wherein the user comprises a plurality of users.
  4. The method according to claim 1 or 2, wherein the user is a first user, and a first mapping is provided between the first user and a first voice interaction answer model, the first mapping is used for indicating that the first voice interaction answer model corresponds to the first user, and the first voice interaction answer model is obtained through training according to voice instructions of the first user.
  5. The method according to any one of claims 1 to 4, wherein the first model to be trained comprises three sub-models, namely a marking model, a pointer model and an insertion model.
  6. The method of claim 5, wherein training the first model to be trained based on the text of the first voice instruction and the first instruction text comprises:
    inputting the first instruction text into the marking model to obtain a feature tag sequence of the first instruction text, wherein the feature tag sequence is obtained by performing feature tagging on the first instruction text;
    inputting the feature tag sequence into the pointer model to obtain a feature ordering sequence, wherein the feature ordering sequence is obtained by reordering features in the feature tag sequence;
    inputting the feature ordering sequence into the insertion model to obtain an output sequence, wherein the output sequence is obtained by inserting a first feature into the feature ordering sequence;
    and updating parameters of the marking model, the pointer model and the insertion model by taking the text of the first voice instruction as a training label.
  7. The method of claim 6, wherein the updating parameters of the marking model, the pointer model and the insertion model by taking the text of the first voice instruction as a training label comprises:
    calculating a first loss function of the marking model, a second loss function of the pointer model and a third loss function of the insertion model by taking the text of the first voice instruction as a training label;
    updating parameters of the marking model, the pointer model and the insertion model according to the first loss function, the second loss function and the third loss function.
  8. The method according to any one of claims 1 to 7, wherein the first model to be trained is trained from a preset training sentence and a preset label of the preset training sentence.
  9. A method of voice interaction, comprising:
    acquiring a second voice instruction of a user;
    acquiring a first answer text according to the second voice instruction;
    inputting the first answer text into a voice interaction response language model to output a second answer text, wherein the voice interaction response language model is obtained through training according to a text of a first voice instruction and a first instruction text, the first instruction text is obtained through extracting features of the text of the first voice instruction, and the first voice instruction is a voice instruction of the user.
  10. The method of claim 9, wherein the acquiring a first answer text according to the second voice instruction comprises:
    acquiring intention information and slot position information of the second voice instruction according to the second voice instruction;
    and acquiring the first answer text according to the intention information, the slot position information and a preset template.
  11. The method of claim 9 or 10, wherein the user comprises a plurality of users.
  12. The method of claim 9 or 10, wherein the user is a first user, and the inputting of the first answer text into the voice interaction response language model comprises:
    acquiring a first voice interaction response language model according to a first mapping, wherein the first voice interaction response language model is trained according to voice instructions of the first user, and the first mapping is used for indicating that the first voice interaction response language model corresponds to the first user;
    and inputting the first answer text into the first voice interaction response language model.
  13. The method according to any one of claims 9 to 12, further comprising: filtering out preset language information from the second answer text.
  14. The method according to any one of claims 9 to 13, further comprising:
    and inputting the second answer text into a speech synthesis engine to generate a second answer speech.
  15. The method of any one of claims 9 to 14, wherein before the acquiring of the second voice instruction of the user, the method further comprises:
    acquiring a third voice instruction of the user;
    and inputting the third voice instruction into a first model to be trained so as to output a third answer text, wherein the first model to be trained is obtained by training according to a preset training sentence and a preset label of the preset training sentence.
  16. The method according to any one of claims 9 to 15, wherein the voice interaction response language model and the first model to be trained are non-autoregressive models.
  17. An apparatus for voice interaction, comprising:
    the acquisition unit is used for acquiring a first voice instruction of a user;
    the processing unit is used for extracting features of the text of the first voice instruction so as to obtain a first instruction text;
    the processing unit is further configured to train a first model to be trained according to the text of the first voice instruction and the first instruction text, so as to obtain a voice interaction response language model, wherein the text output by the voice interaction response language model has the expression characteristics of the user, the voice interaction response language model is used for responding according to the voice instruction of the user, the first instruction text is an input of the first model to be trained, and the text of the first voice instruction is a training label.
  18. The apparatus according to claim 17, wherein the processing unit is specifically configured to:
    extracting features of the text of the first voice instruction to obtain intention information and slot position information of the first voice instruction;
    and acquiring the first instruction text according to the intention information, the slot position information and a preset template.
  19. The apparatus of claim 17 or 18, wherein the user comprises a plurality of users.
  20. The apparatus of claim 17 or 18, wherein the user is a first user, and a first mapping is provided between the first user and a first voice interaction response language model, the first mapping is used for indicating that the first voice interaction response language model corresponds to the first user, and the first voice interaction response language model is obtained through training according to voice instructions of the first user.
  21. The apparatus according to any one of claims 17 to 20, wherein the first model to be trained comprises three sub-models, namely a marking model, a pointer model and an insertion model.
  22. The apparatus according to claim 21, wherein the processing unit is specifically configured to:
    inputting the first instruction text into the marking model to obtain a feature tag sequence of the first instruction text, wherein the feature tag sequence is obtained by performing feature tagging on the first instruction text;
    inputting the feature tag sequence into the pointer model to obtain a feature ordering sequence, wherein the feature ordering sequence is obtained by reordering features in the feature tag sequence;
    inputting the feature ordering sequence into the insertion model to obtain an output sequence, wherein the output sequence is obtained by inserting a first feature into the feature ordering sequence;
    and updating parameters of the marking model, the pointer model and the insertion model by taking the text of the first voice instruction as a training label.
  23. The apparatus according to claim 22, wherein the processing unit is specifically configured to:
    calculating a first loss function of the marking model, a second loss function of the pointer model and a third loss function of the insertion model by taking the text of the first voice instruction as a training label;
    updating parameters of the marking model, the pointer model and the insertion model according to the first loss function, the second loss function and the third loss function.
  24. The apparatus according to any one of claims 17 to 23, wherein the first model to be trained is trained from a preset training sentence and a preset label of the preset training sentence.
  25. An apparatus for voice interaction, comprising:
    the acquisition unit is used for acquiring a second voice instruction of a user;
    the processing unit is used for acquiring a first answer text according to the second voice instruction;
    the processing unit is further configured to input the first answer text into a voice interaction response language model to output a second answer text, wherein the voice interaction response language model is obtained by training according to a text of a first voice instruction and a first instruction text, the first instruction text is obtained by extracting features from the text of the first voice instruction, and the first voice instruction is a voice instruction of the user.
  26. The apparatus according to claim 25, wherein the processing unit is specifically configured to:
    acquiring intention information and slot position information of the second voice instruction according to the second voice instruction;
    and acquiring the first answer text according to the intention information, the slot position information and a preset template.
  27. The apparatus of claim 25 or 26, wherein the user comprises a plurality of users.
  28. The apparatus according to claim 25 or 26, wherein the processing unit is specifically configured to:
    acquiring a first voice interaction response language model according to a first mapping, wherein the first voice interaction response language model is trained according to voice instructions of the first user, and the first mapping is used for indicating that the first voice interaction response language model corresponds to the first user;
    and inputting the first answer text into the first voice interaction response language model.
  29. The apparatus of any one of claims 25 to 28, wherein the processing unit is further configured to filter out first language information from the second answer text, the first language information being preset.
  30. The apparatus of any one of claims 25 to 29, wherein the processing unit is further configured to:
    and inputting the second answer text into a speech synthesis engine to generate a second answer speech.
  31. The apparatus of any one of claims 25 to 30, wherein the processing unit is further configured to:
    acquiring a third voice instruction of the user;
    and inputting the third voice instruction into a first model to be trained so as to output a third answer text, wherein the first model to be trained is obtained by training according to a preset training sentence and a preset label of the preset training sentence.
  32. The apparatus of any one of claims 25 to 31, wherein the voice interaction response language model and the first model to be trained are non-autoregressive models.
  33. A computer-readable medium, characterized in that the computer-readable medium stores program code which, when run on a computer, causes the computer to perform the method according to any one of claims 1 to 8 or 9 to 16.
  34. A chip, comprising: at least one processor and a memory, wherein the at least one processor is coupled to the memory and is configured to read and execute instructions in the memory to perform the method according to any one of claims 1 to 8 or 9 to 16.
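
Illustrative code sketches (editorial addition, not part of the patented disclosure). The two Python sketches below are minimal illustrations only: every class name, function name, dimension, and toy data value in them is a hypothetical placeholder, chosen solely to show the structure described in the claims.

The first sketch, assuming a PyTorch-style setup, illustrates the training step of claims 5 to 7: a marking model tags the first instruction text, a pointer model scores a reordering of the tagged features, an insertion model predicts inserted features, and one loss per sub-model is computed against labels derived from the text of the first voice instruction before a joint parameter update.

```python
import torch
import torch.nn as nn

VOCAB = 100     # toy vocabulary size (hypothetical)
HIDDEN = 32     # hidden width shared by the three sub-models (hypothetical)
NUM_TAGS = 3    # e.g. KEEP / DELETE / INSERT-HERE feature tags (hypothetical)

class MarkingModel(nn.Module):
    """Tags each token of the first instruction text (feature tag sequence)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.tag_head = nn.Linear(HIDDEN, NUM_TAGS)
    def forward(self, tokens):                    # (B, T) -> (B, T, NUM_TAGS), (B, T, H)
        h = self.embed(tokens)
        return self.tag_head(h), h

class PointerModel(nn.Module):
    """Scores a new position for every token (feature ordering sequence)."""
    def __init__(self):
        super().__init__()
        self.score = nn.Linear(HIDDEN, HIDDEN)
    def forward(self, h):                          # (B, T, H) -> (B, T, T) position logits
        return torch.matmul(self.score(h), h.transpose(1, 2))

class InsertionModel(nn.Module):
    """Predicts which token to insert at each slot (output sequence)."""
    def __init__(self):
        super().__init__()
        self.insert_head = nn.Linear(HIDDEN, VOCAB)
    def forward(self, h):                          # (B, T, H) -> (B, T, VOCAB)
        return self.insert_head(h)

marking, pointer, insertion = MarkingModel(), PointerModel(), InsertionModel()
params = list(marking.parameters()) + list(pointer.parameters()) + list(insertion.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
ce = nn.CrossEntropyLoss()

# One toy training step: the first instruction text is the input, and the text of the
# user's first voice instruction supplies the three kinds of labels (random stand-ins here).
instruction_tokens = torch.randint(0, VOCAB, (1, 8))   # first instruction text
tag_labels = torch.randint(0, NUM_TAGS, (1, 8))        # feature-marking labels
order_labels = torch.randperm(8).unsqueeze(0)          # reordering labels
insert_labels = torch.randint(0, VOCAB, (1, 8))        # insertion labels

tag_logits, h = marking(instruction_tokens)
order_logits = pointer(h)
insert_logits = insertion(h)

loss = (ce(tag_logits.transpose(1, 2), tag_labels)           # first loss: marking model
        + ce(order_logits.transpose(1, 2), order_labels)     # second loss: pointer model
        + ce(insert_logits.transpose(1, 2), insert_labels))  # third loss: insertion model
opt.zero_grad()
loss.backward()
opt.step()   # joint update of the marking, pointer and insertion model parameters
```

The second sketch illustrates the response flow of claims 9 to 14 with self-contained stand-ins: intent and slot information are extracted from the second voice instruction, a first answer text is filled from a preset template, a per-user model selected through the first mapping rewrites it into the second answer text, and preset language information is filtered out before speech synthesis (omitted here). All functions are placeholders.

```python
def extract_intent_and_slots(text):
    # Placeholder NLU step; a trained intent/slot model would be used in practice.
    return "navigate", {"destination": text.split()[-1]}

def fill_preset_template(intent, slots):
    # First answer text built from a preset template.
    return f"OK, starting {intent} to {slots['destination']}."

def filter_preset_phrases(text, banned=("OK,",)):
    # Filters out preset language information from the second answer text.
    for phrase in banned:
        text = text.replace(phrase, "").strip()
    return text

class EchoModel:
    # Stand-in for a per-user voice interaction response language model.
    def rewrite(self, text):
        return text + " right away!"   # pretend personalization

def respond(second_instruction_text, user_id, models):
    intent, slots = extract_intent_and_slots(second_instruction_text)
    first_answer_text = fill_preset_template(intent, slots)
    model = models.get(user_id, models["default"])   # first mapping: user -> model
    second_answer_text = model.rewrite(first_answer_text)
    return filter_preset_phrases(second_answer_text)

print(respond("navigate to home", "user-1", {"default": EchoModel()}))
```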

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/137038 WO2023102889A1 (en) 2021-12-10 2021-12-10 Voice interaction method and device

Publications (1)

Publication Number Publication Date
CN116583820A true CN116583820A (en) 2023-08-11

Family

ID=86729468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180036192.XA Pending CN116583820A (en) 2021-12-10 2021-12-10 Voice interaction method and device

Country Status (2)

Country Link
CN (1) CN116583820A (en)
WO (1) WO2023102889A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116841523B (en) * 2023-07-19 2023-12-22 上海海启科技有限公司 Online programming method and system based on artificial intelligence

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109284386A (en) * 2018-10-15 2019-01-29 四川长虹电器股份有限公司 Customized intension recognizing method and device
CN109522556B (en) * 2018-11-16 2024-03-12 北京九狐时代智能科技有限公司 Intention recognition method and device
US20200175107A1 (en) * 2018-11-30 2020-06-04 MeVero Inc. method and system for passion identification of a user
CN111193834B (en) * 2019-12-16 2022-04-15 北京淇瑀信息科技有限公司 Man-machine interaction method and device based on user sound characteristic analysis and electronic equipment
CN111611382A (en) * 2020-05-22 2020-09-01 贝壳技术有限公司 Dialect model training method, dialog information generation method, device and system
EP3940693A4 (en) * 2020-05-22 2022-03-23 Baidu Online Network Technology (Beijing) Co., Ltd. Voice interaction-based information verification method and apparatus, and device and computer storage medium

Also Published As

Publication number Publication date
WO2023102889A1 (en) 2023-06-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination