CN112053696A - Voice interaction method and device and terminal equipment - Google Patents

Voice interaction method and device and terminal equipment

Info

Publication number
CN112053696A
CN112053696A (application CN201910485079.4A)
Authority
CN
China
Prior art keywords
voice
instruction
voice instruction
sent
recording
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910485079.4A
Other languages
Chinese (zh)
Inventor
穆培婷
陈晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TCL Corp
TCL Research America Inc
Original Assignee
TCL Research America Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TCL Research America Inc filed Critical TCL Research America Inc
Priority to CN201910485079.4A
Publication of CN112053696A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/22 Interactive procedures; Man-machine interfaces
    • G10L 17/24 Interactive procedures; Man-machine interfaces; the user being prompted to utter a password or a predefined phrase

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of voice interaction and provides a voice interaction method, a voice interaction apparatus, and a terminal device. The method comprises the following steps: after receiving a first voice instruction, responding to the first voice instruction and entering a recording state; in the recording state, receiving a second voice instruction and judging whether the first voice instruction and the second voice instruction were sent by the same user; and if the first voice instruction and the second voice instruction were not sent by the same user, performing response processing according to the second voice instruction alone. The invention solves the problem that, because consecutive utterances in a voice interaction are treated as related, the voice instruction of a following, different user receives incorrect response processing.

Description

Voice interaction method and device and terminal equipment
Technical Field
The invention belongs to the technical field of voice interaction, and particularly relates to a voice interaction method, a voice interaction device and terminal equipment.
Background
With the rapid development of artificial intelligence, voice interaction between users and intelligent terminals through voice assistants has become increasingly common, and the voice assistant has become essential embedded software on many kinds of intelligent terminal devices.
At present, during voice interaction, after a voice instruction sent by a user is received, the instruction is responded to and the device enters a to-be-recorded state; audio information input by the user is then continuously acquired, analyzed, and answered with the corresponding response processing. Because consecutive statements in a voice interaction are treated as related, when two consecutive pieces of audio information are sent by different users, the voice assistant easily issues wrong response processing for the audio information of the later user.
Disclosure of Invention
In view of this, embodiments of the present invention provide a voice interaction method, apparatus, and terminal device, to solve the prior-art problem that, because consecutive statements in a voice interaction are treated as related, an incorrect response is issued to the voice instruction of the following user when two consecutive pieces of audio information are issued by different users.
A first aspect of an embodiment of the present invention provides a method for voice interaction, including:
after receiving a first voice instruction, responding to the first voice instruction and entering a recording state;
in a recording state, receiving a second voice instruction, and judging whether the first voice instruction and the second voice instruction are sent by the same user;
and if the first voice instruction and the second voice instruction are not sent by the same user, performing response processing according to the second voice instruction.
In one embodiment, after receiving a first voice command, responding to the first voice command and entering a recording state includes:
after receiving the first voice instruction, initializing voice data of a voice storage unit, and storing the first voice instruction;
responding to the first voice instruction and entering a state to be recorded;
judging whether recording behavior data are detected within preset time, wherein the recording behavior data comprise voice awakening instructions or touch instructions;
and if the recording behavior data is detected, entering a recording state.
In one embodiment, after entering the state to be recorded, the method further comprises:
and if the recording behavior data is not detected within the preset time, entering a dormant state.
In one embodiment, determining whether the first voice command and the second voice command are issued by the same user includes:
performing voiceprint recognition on the first voice instruction to acquire a first voiceprint feature of the first voice instruction;
performing voiceprint recognition on the second voice instruction to acquire a second voiceprint feature of the second voice instruction;
calculating the similarity of the first voiceprint feature and the second voiceprint feature;
and judging whether the first voice instruction and the second voice instruction are sent by the same user or not according to the similarity.
In one embodiment, if the first voice command and the second voice command are not issued by the same user, performing response processing according to the second voice command includes:
if the similarity is not within the preset threshold range, the first voice instruction and the second voice instruction are not sent by the same user;
if the first voice instruction and the second voice instruction are sent by different users, initializing voice data of a voice storage unit and storing the second voice instruction;
and performing first response processing according to the second voice instruction, wherein the first response processing is a function response executed according to the second voice instruction, which is stored after the voice storage unit deletes the historical voice data.
In one embodiment, after determining whether the first voice command and the second voice command are issued by the same user, the method includes:
if the similarity is within a preset threshold range, the first voice instruction and the second voice instruction are sent by the same user;
if the first voice command and the second voice command are sent by the same user, storing the second voice command to a voice storage unit;
and performing second response processing according to the current voice data of the voice storage unit, wherein the current voice data comprises historical voice data of the current user and the second voice instruction, and the second response processing is a function response executed according to a plurality of voice instructions of the current user in the voice storage unit.
A second aspect of an embodiment of the present invention provides a device for voice interaction, including:
the data receiving module is used for responding to a first voice instruction and entering a recording state after receiving the first voice instruction;
the data processing module is used for receiving a second voice instruction in a recording state and judging whether the first voice instruction and the second voice instruction are sent by the same user or not;
and the data response module is used for performing response processing according to the second voice instruction if the first voice instruction and the second voice instruction are not sent by the same user.
In one embodiment, the data receiving module comprises:
the voice storage unit is used for initializing voice data of the voice storage unit and storing the first voice instruction after receiving the first voice instruction; executing first response processing according to the first voice command, and entering a state to be recorded;
the recording behavior data detection unit is used for judging whether recording behavior data are detected or not within preset time, and the recording behavior data comprise a voice awakening instruction or a touch instruction;
and the recording control unit is used for entering a recording state if the recording behavior data is detected.
A third aspect of the embodiments of the present invention provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method described above when executing the computer program.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as described above.
Compared with the prior art, the embodiment of the invention has the following beneficial effects: after receiving the first voice instruction, the embodiment responds to the first voice instruction and enters a recording state; in the recording state, a second voice instruction is received, and it is judged whether the first and second voice instructions were sent by the same user; if they were not sent by the same user, response processing is performed according to the second voice instruction. This solves the problem that, in the voice interaction process, the assumed relevance between consecutive utterances causes the voice assistant to respond incorrectly to the next user's instruction when the two users differ. The method plans the corresponding function direction according to whether the successive voice instructions come from the same user and makes accurate response processing, so that the voice interaction process is more flexible and smooth, with strong usability and practicability.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those of ordinary skill in the art without inventive effort.
FIG. 1 is a flow chart illustrating an implementation of a method for voice interaction according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating an implementation of a voiceprint recognition process in a voice interaction process according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating an overall implementation process of a voice interaction method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a voice interaction device provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
Example one
Referring to Fig. 1, which is a schematic diagram of the implementation flow of a voice interaction method according to an embodiment of the present invention, the method is applied to an intelligent terminal and implements voice interaction with a user through a voice assistant; the intelligent terminal includes, but is not limited to, terminal devices such as mobile phones, tablets, and computers. In this method, after the voice assistant of the intelligent terminal responds to a previous voice instruction, the terminal enters a recording state, acquires the current voice instruction, and compares it with the previous voice instruction to judge whether the two adjacent voice instructions were sent by the same user. The function direction of the response is then planned accordingly; for example, instructions sent by different users receive different response processing. This avoids responding incorrectly to a voice instruction from a different subsequent user because of the assumed relevance between consecutive instructions, and improves the accuracy and flexibility of the voice response. As shown in the figure, the voice interaction method comprises the following steps:
step S101, after receiving a first voice command, responding to the first voice command and entering a recording state.
In this embodiment, after the voice assistant of the terminal device is turned on, a first voice instruction input by a user is received in the recording state, where the first voice instruction is the instruction sent by the first user while the voice assistant is on. After receiving the first voice instruction, the voice assistant initializes any voice instructions stored before the first voice instruction was received, deletes the previous instructions, stores the current first voice instruction, analyzes it, and makes the corresponding response. For example, if the first voice instruction is "call", the voice assistant makes the inquiry response "call whom?" and then waits for the next voice instruction. After responding to the first voice instruction, the voice assistant enters the recording state again and waits to receive the next voice instruction, which may be sent by the first user or by another user.
Optionally, after receiving the first voice instruction, responding to the first voice instruction and entering a recording state, including:
a1, after receiving the first voice command, initializing voice data of a voice storage unit and storing the first voice command;
and A2, responding to the first voice command, and entering a state to be recorded.
In this embodiment, the voice assistant plans the function direction according to the content stored in the voice storage unit and responds to the voice instruction accordingly; the stored content is generated from the voice instructions sent by users. After the voice assistant receives a first voice instruction sent by a user, the voice storage unit initializes its stored content according to the content of the received first voice instruction, performs comprehensive analysis on the stored content, and makes a response to the first voice instruction. After responding to the first voice instruction, the voice assistant remains in the to-be-recorded state for a fixed time, that is, the recording switch is not turned on.
A3, judging whether recording behavior data are detected or not within preset time, wherein the recording behavior data comprise a voice awakening instruction or a touch instruction;
and A4, if the recording behavior data is detected, entering a recording state.
In this embodiment, after responding to the first voice instruction sent by the first user, the voice assistant enters the to-be-recorded state, in which the recording switch is not turned on. The preset time may be set from the longest time before the voice assistant enters the sleep state and the shortest time needed to detect voice, taking the difference between the two as the preset time. Alternatively, the preset time can be derived from the recorded time information of historical voice interactions, by analyzing the intervals at which the voice assistant moved from the to-be-recorded state to the recording state and setting the preset time according to those historical intervals; statistics of the user's voice interaction habits can thus make the preset time adjustable, as in the sketch below.
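As a minimal illustration of the timing rule just described, the following Python sketch derives the preset time; the function name, parameters, and the history-based fallback are illustrative assumptions rather than values fixed by this embodiment.

    # Hypothetical sketch; concrete values are not specified in this embodiment.
    def preset_time(max_sleep_time_s, min_detect_time_s, history_gaps_s=None):
        """Seconds to wait in the to-be-recorded state before sleeping."""
        # Rule stated above: difference between the longest time before
        # entering the sleep state and the shortest voice-detection time.
        base = max_sleep_time_s - min_detect_time_s
        if history_gaps_s:
            # Optional adjustment from historical to-be-recorded -> recording
            # intervals, as also described above.
            base = max(base, sum(history_gaps_s) / len(history_gaps_s))
        return base

For example, preset_time(10.0, 0.5, [3.2, 4.1]) would wait 9.5 seconds before sleeping.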
The recording behavior data includes voice data, for example a fixed wake-up word set as the voice wake-up instruction. For a terminal device with a touch function, the recording behavior data may also include a trigger signal input through a recording control displayed on the touch screen; for a remote-controlled terminal device, it may include a trigger signal input through a remote-control key. In the to-be-recorded state, the voice assistant detects whether there is voice of sufficient decibels in the surrounding environment or whether a trigger signal input by a user has been received. When recording behavior data is detected, the detected voice may be content sent by the first user that is associated with the first voice instruction, or content sent by a user other than the first user. The voice content sent by other users may be fixed content used to turn on the recording switch, or arbitrary content, depending on the application scenario: for a terminal device used at home or personally, fixed voice content may be set for turning on the recording switch, while in public settings the specific content need not be limited.
It should be noted that the recording behavior data includes a voice wake-up instruction or a touch instruction: the voice assistant detects through the audio acquisition unit whether there is voice input within the preset time, or detects through the touch signal acquisition unit whether a touch instruction was input through the touch screen or a remote control within the preset time. In the to-be-recorded state, the recording switch of the voice assistant can be triggered by voice, so that the assistant enters the recording state without a manual touch operation, which makes voice interaction more convenient and flexible.
Optionally, after entering the state to be recorded, the method further includes:
and if the recording behavior data is not detected within the preset time, entering a dormant state.
In this embodiment, when the voice assistant has remained in the to-be-recorded state for longer than the preset time, it automatically enters the sleep state to save power and hardware consumption. The sleep state may be a state in which the voice assistant is hidden in the background of the terminal device and invisible in its display interface.
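A minimal state-machine sketch of the to-be-recorded logic described above, in Python; detect_wake is an assumed callback standing in for the audio or touch signal acquisition unit, and the poll interval is an arbitrary choice.

    import time
    from enum import Enum, auto

    class State(Enum):
        TO_BE_RECORDED = auto()
        RECORDING = auto()
        SLEEPING = auto()

    def wait_for_recording(detect_wake, preset_time_s):
        """Poll for recording behavior data (a voice wake-up or touch
        instruction) until the preset time elapses."""
        deadline = time.monotonic() + preset_time_s
        while time.monotonic() < deadline:
            if detect_wake():            # wake word, touch, or remote trigger
                return State.RECORDING   # turn on the recording switch
            time.sleep(0.05)             # assumed poll interval
        return State.SLEEPING            # timeout: hide in the background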
Step S102, in the recording state, receiving a second voice command, and judging whether the first voice command and the second voice command are sent by the same user.
In this embodiment, the recording behavior data includes a voice wake-up instruction or a touch instruction; when the audio acquisition unit of the voice assistant detects voice data, it directly triggers the recording switch so that the voice assistant enters the recording state.
In the recording state, the second voice instruction input by the user is received directly through a microphone or another audio acquisition device; the second voice instruction may be a further voice instruction sent by the first user, or a voice instruction sent by another user.

The stored first voice instruction is compared with the received second voice instruction to judge whether the two were sent by the same user. Specifically, voiceprint recognition can be performed on the two adjacent pieces of voice data, and whether the two adjacent voice instructions were sent by the same user is judged according to the voiceprint similarity.
Optionally, as shown in the implementation flow of the voiceprint recognition process in Fig. 2: after the first voice instruction is received, the voice data of the voice storage unit is initialized, the first voice instruction is stored in the voice storage unit, and the first voice instruction is responded to. After the response ends, the voice assistant enters the to-be-recorded state; detected voice then triggers the recording switch, the assistant enters the recording state, the second voice instruction currently input by the user is acquired, voiceprint recognition is performed on the second voice instruction and the previous one, and it is judged whether the two were sent by the same user. Specifically, determining whether the first voice instruction and the second voice instruction were sent by the same user includes:
step S201, performing voiceprint recognition on the first voice command, and acquiring a first voiceprint feature of the first voice command.
Step S202, performing voiceprint recognition on the second voice command to acquire a second voiceprint feature of the second voice command.
In this embodiment, voiceprint recognition is performed on the first voice instruction and the second voice instruction, and each is first subjected to audio preprocessing.

The first and second voice instructions are acquired and pre-emphasis is applied to each, boosting the high-frequency part of the signal, increasing the high-frequency resolution of the voice, and flattening the spectral characteristics of the audio signal. Because short segments of the audio signal are approximately stationary, the signal is divided into frames; after framing, each frame is windowed so that the framed signal is more continuous and each frame behaves like one period of a periodic function. A fast Fourier transform is then applied to the windowed frames to expose the salient features of the signal, and the extracted features are passed through a filter bank to smooth the spectrum; the filter bank can be a MEL filter bank as implemented in MATLAB or Python. A logarithm and a Discrete Cosine Transform (DCT) are applied to the smoothed signal to extract Mel Frequency Cepstrum Coefficients (MFCC), and a per-channel energy-normalized mel spectrogram (PCEN) is obtained. Other speech features are also possible, such as vocal-tract-based Linear Prediction Cepstrum Coefficient (LPCC) features and the auditory-based MFCC features: LPCC features largely eliminate the influence of the excitation signal on voiceprint recognition, while MFCC features are extracted based on the relation between human hearing and frequency.
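The preprocessing chain above maps naturally onto a few lines of numerical Python. The sketch below follows the stated steps (pre-emphasis, framing, windowing, FFT, mel filter bank, log, DCT); the frame sizes, 0.97 pre-emphasis factor, and filter counts are conventional choices assumed for illustration, not values taken from this embodiment.

    import numpy as np
    from scipy.fftpack import dct
    import librosa  # used here only for the mel filter bank

    def mfcc_features(y, sr, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
        """Pre-emphasis -> framing -> windowing -> FFT -> mel filter bank
        -> log -> DCT, as described above. Parameter values are assumed."""
        y = np.asarray(y, dtype=np.float64)
        # 1. Pre-emphasis boosts the high-frequency part of the signal.
        y = np.append(y[0], y[1:] - 0.97 * y[:-1])
        # 2. Framing: short segments are approximately stationary.
        n_frames = 1 + (len(y) - n_fft) // hop
        idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
        frames = y[idx]
        # 3. Windowing makes each frame behave like one period of a signal.
        frames *= np.hamming(n_fft)
        # 4. FFT and power spectrum expose the salient spectral features.
        power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
        # 5. The mel filter bank smooths the spectrum.
        mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
        mel_energy = power @ mel_fb.T
        # 6. Log plus discrete cosine transform yields the MFCCs.
        return dct(np.log(mel_energy + 1e-10), type=2,
                   axis=1, norm='ortho')[:, :n_mfcc]

librosa also exposes librosa.pcen for the per-channel energy normalization mentioned above.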
In this embodiment, the per-channel energy-normalized mel spectrogram (PCEN) features of the first and second voice instructions are input into a neural network model. The model may be a Recurrent Neural Network (RNN), but is not limited to an RNN and may also be a Convolutional Neural Network (CNN). The RNN model stacks the context information of the audio signal over time to obtain high-precision voice features; an attention mechanism and a pooling layer are then added to extract the voiceprint features of the first and second voice instructions respectively.
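A compact PyTorch sketch of the RNN-plus-attention-pooling idea just described; the GRU variant, layer sizes, and normalization are illustrative assumptions, and training (for example with a speaker classification or metric loss) is omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VoiceprintEncoder(nn.Module):
        """Recurrent encoder with attentive pooling over frames."""
        def __init__(self, n_feat=40, hidden=128, emb=64):
            super().__init__()
            # The RNN stacks context information over the frame sequence.
            self.rnn = nn.GRU(n_feat, hidden, batch_first=True,
                              bidirectional=True)
            self.attn = nn.Linear(2 * hidden, 1)    # per-frame attention score
            self.proj = nn.Linear(2 * hidden, emb)  # voiceprint embedding

        def forward(self, x):                       # x: (batch, frames, n_feat)
            h, _ = self.rnn(x)                      # (batch, frames, 2*hidden)
            w = torch.softmax(self.attn(h), dim=1)  # attention weights
            pooled = (w * h).sum(dim=1)             # pooling layer
            return F.normalize(self.proj(pooled), dim=1)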
Step S203, calculating a similarity between the first voiceprint feature and the second voiceprint feature.
And step S204, judging whether the first voice command and the second voice command are sent by the same user or not according to the similarity.
In this embodiment, the extracted first voiceprint feature and the second voiceprint feature are calculated by using a cosine similarity function, so as to obtain a similarity score between the first voiceprint feature and the second voiceprint feature. And setting a similarity threshold, and comparing the obtained similarity score with the similarity threshold to judge whether the previous voice command and the current voice command are sent by the same user.
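A direct rendering of this similarity decision in Python; the 0.75 threshold is an assumed placeholder, since the embodiment leaves the preset threshold range unspecified.

    import numpy as np

    def same_user(emb1, emb2, threshold=0.75):
        """Cosine similarity between two voiceprint features, plus the
        same-user decision against an assumed threshold."""
        score = float(np.dot(emb1, emb2) /
                      (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
        return score >= threshold, score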
By preprocessing the voice instructions, using an efficient machine learning model based on a Recurrent Neural Network (RNN) with an attention mechanism and a pooling layer, and scoring with a similarity function, the voice instructions can be recognized with high precision, and it can be accurately judged whether the two successive voice instructions came from the same user.
Optionally, after determining whether the first voice instruction and the second voice instruction are sent by the same user, the method includes:
b1, if the similarity is within a preset threshold range, the first voice instruction and the second voice instruction are sent by the same user;
b2, if the first voice command and the second voice command are sent by the same user, storing the second voice command in a voice storage unit;
and B3, performing second response processing according to the current voice data of the voice storage unit, wherein the current voice data comprises the historical voice data of the current user and the second voice instruction, and the second response processing is a function response executed according to a plurality of voice instructions of the current user in the voice storage unit.
In this embodiment, if the first voice instruction and the second voice instruction were issued by different users, the previous content of the voice storage unit is cleared and the second voice instruction is stored as the current memory content; the response to the second voice instruction is then made according to this newly stored content. To avoid wrong responses to voice instructions sent by different consecutive users, the content cached for the previous user's voice instruction is emptied, the voice storage unit is reinitialized with the instruction sent by the current user, and the current user's instruction is answered according to the stored content that corresponds to it.
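The memory behaviour described above can be summarized in a few lines; this sketch of a hypothetical voice storage unit accumulates same-user instructions and resets on a speaker change.

    class VoiceStorageUnit:
        """Accumulate context for one user; reset when the user changes."""
        def __init__(self):
            self.instructions = []

        def on_instruction(self, text, is_same_user):
            if not is_same_user:
                # Initialize: drop the previous user's cached context so it
                # cannot bias the function planning for the new instruction.
                self.instructions.clear()
            self.instructions.append(text)
            # The full stored content is what response planning integrates.
            return list(self.instructions)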
Step S103, if the first voice command and the second voice command are not sent by the same user, response processing is carried out according to the second voice command.
In this embodiment, if the second voice instruction and the first voice instruction were issued by different users, the voice data of the voice storage unit is initialized according to the second voice instruction and the stored content is updated: the first voice instruction is deleted, the second voice instruction is stored, and the corresponding function processing is executed according to the second voice instruction. For example, the first voice instruction is "call", the voice assistant responds with the inquiry "call whom?", and the second voice instruction replies "Zhang San". If the second and first voice instructions come from the same user, the call function is invoked according to the normal calling flow: the contact information of Zhang San is read and Zhang San is called. If they come from different users, then for the second voice instruction the voice assistant deletes the original content of the voice storage unit, stores the second voice instruction, performs function planning only on the name contained in the second instruction, makes the response processing, displays information about Zhang San, and waits for the next instruction.
It should be noted that the updated content of the voice storage unit relates only to the voice instruction sent by the current user and is no longer associated with the content of the previous user's instruction, which avoids function planning in a wrong direction; and because the user starts the recording function of the voice assistant by voice, the interaction process is smoother and simpler.

In addition, the function direction is planned according to the new content of the voice storage unit, the response to the voice instruction is completed, and the corresponding operation is executed. For each instruction, the voice assistant synthesizes the contents of the voice storage unit and responds to the current voice instruction; when two adjacent voice instructions are sent by the same user, the response takes both instructions into account. Responding comprehensively to multiple instructions of the same user extends the voice response function, while for instructions sent by different users the function range is re-planned relative to the previous user's instruction, so that an accurate response is made and the voice interaction process is more flexible.

After responding to an instruction, the voice assistant re-enters the to-be-recorded state, and the above operations repeat in a loop.

For each instruction, the voice assistant can thus integrate the content of the voice storage unit and respond to the current voice instruction according to the stored content, re-planning the range of the function for voice instructions sent by different users. Therefore, voiceprint recognition is performed between each voice instruction and the adjacent previous instruction to judge whether they were sent by the same user, so that a wrong response or a wrong function direction for the current instruction is avoided.
Fig. 3 is a diagram illustrating the overall implementation flow of a voice interaction method according to an embodiment of the present invention. The steps in this flow follow the same implementation principles as the corresponding steps in Fig. 1 or Fig. 2 and are not described again here. As shown in the figure, the overall implementation flow of the voice interaction method includes:
step S301, receiving a first voice command sent by a user.
In this embodiment, the first voice instruction sent by the user may be the first instruction received after the voice assistant is turned on.
Step S302, initializing the content stored in the voice storage unit according to the first voice instruction.
In this embodiment, after the first voice instruction sent by the user is received, the voice storage unit is initialized according to that instruction, to prevent previously stored content from affecting the response to the current voice instruction.
Step S303, the content of the voice storage unit is integrated, the first voice instruction is responded, and the state of waiting for recording is entered.
In this embodiment, the contents of the voice storage unit are comprehensively analyzed, a voice command is responded, and after the response of the voice command is finished, the voice assistant enters a state to be recorded.
Step S304, in the to-be-recorded state, judging whether recording behavior data is detected within the preset time.
Step S305, if no recording behavior data is detected within the preset time, entering a sleep state.
In this embodiment, if the time that the voice assistant remains in the to-be-recorded state exceeds the set time, the voice assistant enters the sleep state and is hidden in the background of the terminal device or another invisible state.
Step S306, if the recording behavior data is detected, the recording switch is triggered to acquire a second voice instruction sent by the user.
Step S307, performing voiceprint recognition on the first voice instruction and the second voice instruction to judge whether the two were sent by the same user.
In this embodiment, for each voice instruction after the first voice instruction is issued, the voice assistant recognizes and judges the currently acquired voice instruction and the previous voice instruction through voiceprint recognition, so as to determine whether the voice instructions are issued by the same user.
Step S308, if the second voice command and the adjacent first voice command are sent by the same user, adding the second voice command to a voice storage unit.
In this embodiment, if two adjacent voice instructions were sent by the same user, the current voice instruction is added directly to the voice storage unit of the voice assistant, the contents of the unit are integrated, and the current instruction is answered in combination with the context of the preceding instruction from the same user, extending the function execution for that user. For voice instructions sent by different users, the function range is re-planned relative to the previous user's instruction. After responding, the voice assistant re-enters the to-be-recorded state and the above operations loop.

If the current voice instruction and the adjacent previous instruction were sent by different users, step S302 is executed: the content of the voice storage unit is initialized according to the current voice instruction, the previous content of the unit is emptied, and the response is made according to the content corresponding to the current instruction. After responding, the voice assistant re-enters the to-be-recorded state and the above operations loop.
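Tying the pieces together, the following sketch mirrors the loop of Fig. 3 using the helpers sketched earlier; the assistant object and its methods (record, transcribe, voiceprint, respond, sleep) are assumed glue code for illustration, not an API defined by this embodiment.

    def interaction_loop(assistant):
        """Overall flow: wait, record, compare voiceprints, respond, loop."""
        memory = VoiceStorageUnit()
        prev_emb = None
        while True:
            state = wait_for_recording(assistant.detect_wake,
                                       assistant.preset_time_s)
            if state is State.SLEEPING:        # step S305: no wake-up detected
                assistant.sleep()
                break
            audio = assistant.record()         # step S306: recording switch on
            emb = assistant.voiceprint(audio)  # e.g. MFCC/PCEN -> encoder
            same = (prev_emb is not None
                    and same_user(emb, prev_emb)[0])  # step S307
            context = memory.on_instruction(assistant.transcribe(audio), same)
            assistant.respond(context)         # steps S302/S303/S308
            prev_emb = emb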
It should be noted that, within the technical scope of the present disclosure, other sequencing schemes that can be easily conceived by those skilled in the art should also be within the protection scope of the present disclosure, and detailed description is omitted here.
According to the embodiment of the invention, when the voice assistant is in the to-be-recorded state, it judges within the preset time whether recording behavior data is detected; if recording behavior data is detected, the recording switch is triggered and the current voice instruction sent by the user is acquired; if the current voice instruction and the adjacent previous instruction were sent by different users, the current instruction is stored in the voice storage unit as the voice content of this interaction; and the corresponding function direction is planned according to the content of the voice storage unit.
Generally, when interacting through a voice assistant, before each voice instruction the user must trigger the assistant from the to-be-recorded state into the recording state by touching the screen or using a remote control, which requires many touch operations when several instructions are input; and because consecutive utterances are treated as related, the assistant may respond incorrectly to the next user's instruction when the two users differ. The present embodiment triggers the recording switch of the voice assistant by voice and plans the function direction according to whether the successive voice instructions come from the same user, making the voice interaction process more convenient, flexible, and smooth.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
Example two
Referring to fig. 4, which is a schematic diagram of a voice interaction apparatus according to an embodiment of the present invention, for convenience of description, only the portions related to the embodiment of the present invention are shown.
The voice interaction device comprises:
the data receiving module 41 is configured to respond to a first voice instruction and enter a recording state after receiving the first voice instruction;
the data processing module 42 is configured to receive a second voice instruction in a recording state, and determine whether the first voice instruction and the second voice instruction are sent by the same user;
and a data response module 43, configured to perform response processing according to the second voice instruction if the first voice instruction and the second voice instruction are not issued by the same user.
According to this embodiment, when the voice assistant is in the to-be-recorded state, it judges within the preset time whether recording behavior data is detected; if so, the recording switch is triggered and the current voice instruction is acquired; if the current instruction and the adjacent previous instruction were sent by different users, the current instruction is stored in the voice storage unit as the voice content of this interaction, and the corresponding function direction is planned according to the unit's content. Compared with the usual scheme, in which the user must touch the screen or use a remote control before each voice instruction and in which consecutive instructions from different users can be answered incorrectly because of their assumed relevance, triggering the recording switch by voice and planning the function direction according to whether successive instructions come from the same user makes the voice interaction process more convenient, flexible, and smooth.
It will be apparent to those skilled in the art that, for convenience and simplicity of description, only the above division of functional modules is illustrated; in practical applications, the functions may be allocated to different functional units and modules as needed, that is, the internal structure of the mobile terminal is divided into different functional units or modules to perform all or part of the functions described above. The functional modules in the embodiments may be integrated into one processing unit, may each exist physically alone, or two or more may be integrated into one unit; the integrated unit may be implemented in hardware or as a software functional unit. In addition, the specific names of the functional modules are only used to distinguish them from one another and do not limit the protection scope of the application. For the specific working process of the modules in the mobile terminal, reference may be made to the corresponding process in the foregoing method embodiment, which is not repeated here.
EXAMPLE III
Fig. 5 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in Fig. 5, the terminal device 5 of this embodiment includes: a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps in the above embodiments of the voice interaction method, such as steps S101 to S103 shown in Fig. 1; alternatively, it implements the functions of the modules/units in the above device embodiments, such as the functions of modules 41 to 43 shown in Fig. 4.
Illustratively, the computer program 52 may be partitioned into one or more modules/units that are stored in the memory 51 and executed by the processor 50 to implement the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 52 in the terminal device 5. For example, the computer program 52 may be divided into a data receiving module, a data processing module and a data responding module, and the specific functions of each module are as follows:
the data receiving module is used for responding to a first voice instruction and entering a recording state after receiving the first voice instruction;
the data processing module is used for receiving a second voice instruction in a recording state and judging whether the first voice instruction and the second voice instruction are sent by the same user or not;
and the data response module is used for performing response processing according to the second voice instruction if the first voice instruction and the second voice instruction are not sent by the same user.
The terminal device 5 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 50, a memory 51. Those skilled in the art will appreciate that fig. 5 is merely an example of a terminal device 5 and does not constitute a limitation of terminal device 5 and may include more or fewer components than shown, or some components may be combined, or different components, e.g., the terminal device may also include input-output devices, network access devices, buses, etc.
The processor 50 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the terminal device 5, such as a hard disk or a memory of the terminal device 5. The memory 51 may also be an external storage device of the terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the terminal device 5. Further, the memory 51 may also include both an internal storage unit and an external storage device of the terminal device 5. The memory 51 is used for storing the computer program and other programs and data required by the terminal device. The memory 51 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the method according to the embodiments of the present invention may also be implemented by a computer program, which may be stored in a computer-readable storage medium and which, when executed by a processor, implements the steps of the method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as required by legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunication signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A method of voice interaction, the method comprising:
after receiving a first voice instruction, responding to the first voice instruction and entering a recording state;
in a recording state, receiving a second voice instruction, and judging whether the first voice instruction and the second voice instruction are sent by the same user;
and if the first voice instruction and the second voice instruction are not sent by the same user, performing response processing according to the second voice instruction.
2. The method of voice interaction of claim 1, wherein after receiving a first voice command, responding to the first voice command and entering a recording state, comprises:
after receiving the first voice instruction, initializing voice data of a voice storage unit, and storing the first voice instruction;
responding to the first voice instruction and entering a state to be recorded;
judging whether recording behavior data are detected within preset time, wherein the recording behavior data comprise voice awakening instructions or touch instructions;
and if the recording behavior data is detected, entering a recording state.
3. The method of voice interaction of claim 2, after entering the to-be-recorded state, further comprising:
and if the recording behavior data is not detected within the preset time, entering a dormant state.
4. The method of voice interaction according to claim 1, wherein determining whether the first voice command and the second voice command are issued by the same user comprises:
performing voiceprint recognition on the first voice instruction to acquire a first voiceprint feature of the first voice instruction;
performing voiceprint recognition on the second voice instruction to acquire a second voiceprint feature of the second voice instruction;
calculating the similarity of the first voiceprint feature and the second voiceprint feature;
and judging whether the first voice instruction and the second voice instruction are sent by the same user or not according to the similarity.
5. The method of claim 4, wherein if the first voice command and the second voice command are not issued by the same user, performing response processing according to the second voice command comprises:
if the similarity is not within the preset threshold range, the first voice instruction and the second voice instruction are not sent by the same user;
if the first voice instruction and the second voice instruction are sent by different users, initializing voice data of a voice storage unit and storing the second voice instruction;
and performing first response processing according to the second voice instruction, wherein the first response processing is a function response executed according to the second voice instruction, which is stored after the voice storage unit deletes the historical voice data.
6. The method of voice interaction according to claim 4, wherein determining whether the first voice command and the second voice command are issued by the same user comprises:
if the similarity is within a preset threshold range, the first voice instruction and the second voice instruction are sent by the same user;
if the first voice command and the second voice command are sent by the same user, storing the second voice command to a voice storage unit;
and performing second response processing according to the current voice data of the voice storage unit, wherein the current voice data comprises historical voice data of the current user and the second voice instruction, and the second response processing is a function response executed according to a plurality of voice instructions of the current user in the voice storage unit.
7. A voice interaction apparatus, comprising:
a data receiving module, configured to respond to a first voice instruction and enter a recording state after receiving the first voice instruction;
a data processing module, configured to receive a second voice instruction in the recording state and judge whether the first voice instruction and the second voice instruction are sent by the same user;
and a data response module, configured to perform response processing according to the second voice instruction if the first voice instruction and the second voice instruction are not sent by the same user.
8. The voice interaction apparatus of claim 7, wherein the data receiving module comprises:
a voice storage unit, configured to initialize voice data of the voice storage unit and store the first voice instruction after the first voice instruction is received, execute first response processing according to the first voice instruction, and enter a to-be-recorded state;
a recording behavior data detection unit, configured to judge whether recording behavior data is detected within a preset time, wherein the recording behavior data comprises a voice wake-up instruction or a touch instruction;
and a recording control unit, configured to enter a recording state if the recording behavior data is detected.
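The apparatus claims mirror the method claims one-to-one, so a module-level sketch is mostly wiring. Claim 8's decomposition of the data receiving module might look like this (all interfaces are assumptions, reusing the VoiceStorageUnit sketch above):

```python
class DataReceivingModule:
    """Sketch of the claim 8 structure; method names are invented here."""

    def __init__(self, voice_storage_unit, detection_unit, recording_control_unit):
        self.voice_storage_unit = voice_storage_unit
        self.detection_unit = detection_unit
        self.recording_control_unit = recording_control_unit

    def on_first_instruction(self, first: str, respond) -> None:
        # Voice storage unit: initialize, store, respond, enter the
        # to-be-recorded state.
        self.voice_storage_unit.initialize()
        self.voice_storage_unit.store(first)
        respond([first])
        # Recording behavior data detection unit: watch for a wake-up or
        # touch instruction within the preset time; on detection, the
        # recording control unit enters the recording state.
        if self.detection_unit.detected_within_preset_time():
            self.recording_control_unit.enter_recording_state()
```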
9. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN201910485079.4A 2019-06-05 2019-06-05 Voice interaction method and device and terminal equipment Pending CN112053696A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910485079.4A CN112053696A (en) 2019-06-05 2019-06-05 Voice interaction method and device and terminal equipment

Publications (1)

Publication Number Publication Date
CN112053696A true CN112053696A (en) 2020-12-08

Family

ID=73609312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910485079.4A Pending CN112053696A (en) 2019-06-05 2019-06-05 Voice interaction method and device and terminal equipment

Country Status (1)

Country Link
CN (1) CN112053696A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105847520A (en) * 2016-05-19 2016-08-10 北京小米移动软件有限公司 Recording realization method and device in conversation process
CN106653021A (en) * 2016-12-27 2017-05-10 上海智臻智能网络科技股份有限公司 Voice wake-up control method and device and terminal
CN107066229A (en) * 2017-01-24 2017-08-18 广东欧珀移动通信有限公司 The method and terminal of recording
CN107680591A (en) * 2017-09-21 2018-02-09 百度在线网络技术(北京)有限公司 Voice interactive method, device and its equipment based on car-mounted terminal
CN107958668A (en) * 2017-12-15 2018-04-24 中广热点云科技有限公司 The acoustic control of smart television selects broadcasting method, acoustic control to select broadcast system
CN108694947A (en) * 2018-06-27 2018-10-23 Oppo广东移动通信有限公司 Sound control method, device, storage medium and electronic equipment
CN108766438A (en) * 2018-06-21 2018-11-06 Oppo广东移动通信有限公司 Man-machine interaction method, device, storage medium and intelligent terminal
CN109040444A (en) * 2018-07-27 2018-12-18 维沃移动通信有限公司 A kind of call recording method, terminal and computer readable storage medium
CN109559741A (en) * 2017-09-27 2019-04-02 浙江苏泊尔家电制造有限公司 Cooking methods and device, cooking system

Similar Documents

Publication Publication Date Title
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN102568478B (en) Video play control method and system based on voice recognition
CN108899037B (en) Animal voiceprint feature extraction method and device and electronic equipment
CN107702706B (en) Path determining method and device, storage medium and mobile terminal
CN109522419B (en) Session information completion method and device
CN110020009B (en) Online question and answer method, device and system
CN108346427A (en) Voice recognition method, device, equipment and storage medium
CN110544473B (en) Voice interaction method and device
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN109947971B (en) Image retrieval method, image retrieval device, electronic equipment and storage medium
CN113488024B (en) Telephone interrupt recognition method and system based on semantic recognition
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN114127849A (en) Speech emotion recognition method and device
CN112309365A (en) Training method and device of speech synthesis model, storage medium and electronic equipment
CN109360551B (en) Voice recognition method and device
CN110491373A (en) Model training method, device, storage medium and electronic equipment
CN109977426A (en) A kind of training method of translation model, device and machine readable media
CN108628819A (en) Treating method and apparatus, the device for processing
CN110502648A (en) Recommended models acquisition methods and device for multimedia messages
CN112017670B (en) Target account audio identification method, device, equipment and medium
CN111400463A (en) Dialog response method, apparatus, device and medium
CN108989551B (en) Position prompting method and device, storage medium and electronic equipment
CN114595692A (en) Emotion recognition method, system and terminal equipment
CN109064720B (en) Position prompting method and device, storage medium and electronic equipment
CN112053696A (en) Voice interaction method and device and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20201208