CN114391165A - Voice information processing method, device, equipment and storage medium - Google Patents

Voice information processing method, device, equipment and storage medium

Info

Publication number
CN114391165A
Authority
CN
China
Prior art keywords
voice
target
role
skill
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980099978.9A
Other languages
Chinese (zh)
Inventor
郝杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Shenzhen Huantai Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd, Shenzhen Huantai Technology Co Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Publication of CN114391165A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice information processing method, device, equipment and storage medium. The method includes: acquiring voice information collected by a voice acquisition unit, the voice information including first voice information used for indicating a target skill to be invoked (101); recognizing the first voice information based on a preset skill recognition strategy and determining the target skill that the first voice information indicates is to be invoked (102); then determining a first target role for realizing the target skill (103); and controlling the first target role to perform voice broadcasting for the target skill (104). The user intention expressed in the voice information is determined through intention judgment, and the target skill the user wants to invoke is determined according to that intention, so that the target role corresponding to the target skill is woken up. In this way, role wake-up becomes smoother, voice control becomes more intelligent, and performing the voice broadcasting of different skills through multiple configured roles makes voice control more engaging.

Description

Voice information processing method, device, equipment and storage medium
Technical Field
The present application relates to voice technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing voice information.
Background
Intelligent voice assistants are widely used in products such as mobile phones, vehicle-mounted terminals and smart home devices. They free the user's hands: the user can control and operate the functions of these products simply through voice interaction with the intelligent voice assistant.
At present, the speech synthesis (Text-to-Speech, TTS) system in an intelligent voice solution can only provide a single role with a single tone, so voice information is broadcast to the user in one unvarying tone, and the interaction process lacks interest and anthropomorphism.
Disclosure of Invention
To solve the above technical problem, embodiments of the present application provide a method, an apparatus, a device, and a storage medium for processing voice information.
The technical solutions of the embodiments of the present application are implemented as follows:
in a first aspect, a method for processing voice information is provided, including:
acquiring voice information acquired by a voice acquisition unit; the voice information comprises first voice information, and the first voice information is used for indicating calling target skills;
recognizing the first voice information based on a preset skill recognition strategy, and determining a target skill which is called by the first voice information indication;
determining a first target role corresponding to the target skill from a preset first mapping relation; wherein the first mapping relation comprises mapping relations between at least three roles and skills;
and controlling the first target role to execute voice broadcast aiming at the target skill.
In the foregoing solution, before the controlling the first target character to perform the voice broadcast for the target skill, the method further includes: determining a second target role for currently executing voice broadcast; and when the second target role is different from the first target role, switching the second target role currently executing voice broadcast into the first target role.
In the above scheme, the voice message further includes a second voice message, where the second voice message is used to instruct to wake up the second target role; before determining the second target role currently performing the voice broadcast, the method further includes: identifying the second voice message from the voice messages, and determining a second target role which is awakened and indicated by the second voice message; and controlling the second target role to execute voice broadcast.
In the foregoing solution, the determining that the second voice message indicates the woken second target role includes: determining a wakeup identification in the second voice message; determining the second target role corresponding to the awakening identifier from a preset second mapping relation; and the second mapping relation comprises the mapping relations between at least three roles and the awakening identifier.
In the foregoing solution, the controlling the first target role to perform the voice broadcast for the target skill includes: acquiring tone information and voice text information of the first target role; wherein, different roles correspond to different tone information; synthesizing voice audio information based on the tone information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
In the foregoing solution, after the controlling the first target role to perform the voice broadcast for the target skill, the method further includes: acquiring third voice information; the third voice information is used for indicating to quit the first target role currently executing the voice broadcast; and controlling to exit the first target role based on the third voice information.
In the foregoing solution, the controlling to exit from the first target role based on the third voice information includes: determining an exit identifier in the third voice message; determining a first target role corresponding to the exit identifier from a preset third mapping relation table; wherein, the third mapping relation comprises the mapping relation between at least three roles and the exit identifier; and controlling to exit the first target role.
In a second aspect, there is provided a speech information processing apparatus comprising:
an acquisition section configured to acquire voice information acquired by the voice acquisition unit; the voice information comprises first voice information, and the first voice information is used for indicating calling target skills;
the processing part is configured to recognize the first voice information based on a preset skill recognition strategy, and determine a target skill which is called and indicated by the first voice information;
the processing part is configured to determine a first target role corresponding to the target skill from a preset first mapping relation; wherein the first mapping relation comprises mapping relations between at least three roles and skills;
and the control part is configured to control the first target role to execute voice broadcast aiming at the target skill.
In a third aspect, there is provided a speech information processing apparatus comprising: a processor and a memory configured to store a computer program capable of running on the processor,
wherein the processor is configured to perform the steps of any of the preceding methods when executing the computer program.
In a fourth aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of any one of the foregoing methods.
According to the voice information processing method, device, equipment and storage medium provided above, voice information collected by a voice acquisition unit is acquired, the voice information including first voice information used for indicating the target skill to be invoked; the target skill indicated by the first voice information is identified based on a preset skill recognition strategy; a first target role for realizing the target skill is then determined; and the first target role is controlled to perform voice broadcast for the target skill. The user intention expressed in the voice information is determined through intention judgment, and the target skill the user wants to invoke is determined according to that intention, thereby waking up the target role corresponding to the target skill. In this way, role wake-up is realized more smoothly and voice control becomes more intelligent, and performing the voice broadcast of different skills through multiple configured roles makes voice control more engaging.
Drawings
FIG. 1 is a first flowchart of a method for processing a voice message according to an embodiment of the present application;
FIG. 2 is a second flowchart of a speech information processing method according to an embodiment of the present application;
FIG. 3 is a third flowchart of a method for processing speech information according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a component structure of a speech processing system according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a skill processing system in an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a structure of a speech information processing apparatus according to an embodiment of the present application;
fig. 7 is a schematic diagram of a configuration of a speech information processing apparatus in an embodiment of the present application.
Detailed Description
So that the manner in which the features and elements of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
An embodiment of the present application provides a method for processing voice information, where fig. 1 is a first flowchart of the method for processing voice information in the embodiment of the present application, and as shown in fig. 1, the method may specifically include:
step 101: acquiring voice information acquired by a voice acquisition unit; the voice information comprises first voice information, and the first voice information is used for indicating calling target skills;
step 102: recognizing the first voice information based on a preset skill recognition strategy, and determining a target skill which is called by the first voice information indication;
step 103: determining a first target role corresponding to the target skill from a preset first mapping relation; wherein the first mapping relation comprises mapping relations between at least three roles and skills;
step 104: and controlling the first target role to execute voice broadcast aiming at the target skill.
Here, the execution subject of steps 101 to 104 may be a processor of the voice information processing apparatus. Here, the voice information processing apparatus may be located on the server side or the terminal side. The terminal can be a mobile terminal or a fixed terminal with a voice control function. Such as smart phones, personal computers (e.g., tablet, desktop, notebook, netbook, palmtop), mobile phones, electronic book readers, portable multimedia players, audio/video players, cameras, virtual reality devices, wearable devices, and the like.
Here, the voice collecting means may be a microphone. For example, a microphone on the terminal collects voice information, and the voice information processing method is executed locally on the terminal; or uploading the voice information to a server, executing the steps of the voice information processing method by the server, issuing the processing result to the terminal by the server, and executing the corresponding voice output control operation by the terminal according to the processing result.
In practical application, the preset skill recognition strategy is used for skill recognition, that is, to determine the target skill that the user wants the terminal (or another electronic device to be controlled) to realize through the voice information. The target skill is determined according to the user intention expressed by the first voice information: by recognizing the user intention, the skill the user wants the controlled terminal to realize next is determined. For example, if the first voice information is "How is the weather today?", the user intention recognized from it is to inquire about the weather, so the corresponding target skill is determined to be "weather query" and the query date is "today". In this way, the user does not need to speak a control keyword such as "weather query" but can use more colloquial language and still trigger the voice control function, which makes the control process more intelligent and better matches the user's daily communication habits.
Specifically, the recognizing the first voice information based on the preset skill recognition strategy includes: performing text recognition on the first voice information by using a voice recognition technology to obtain first text information; and performing semantic recognition on the first text information by using a semantic recognition technology to obtain the target skill that the first voice information indicates should be invoked.
That is, text information included in the voice information is recognized by a voice recognition technique, and a target skill indicated by the text information is recognized by a semantic recognition technique.
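As an illustration only, the following is a minimal Python sketch of this two-stage recognition; the transcribe and classify_intent helpers are hypothetical placeholders for a real speech recognition engine and a semantic understanding model, not part of this disclosure.

```python
# Hypothetical sketch of the preset skill recognition strategy:
# text recognition (ASR) followed by semantic recognition (intent -> skill).

def transcribe(first_voice_info: bytes) -> str:
    """Placeholder for a speech recognition engine returning the first text information."""
    return "how is the weather today"

def classify_intent(first_text_info: str) -> str:
    """Placeholder for semantic recognition mapping the text to the invoked target skill."""
    if "weather" in first_text_info:
        return "weather_query"
    if "music" in first_text_info:
        return "music_play"
    return "information_query"

def recognize_target_skill(first_voice_info: bytes) -> str:
    first_text_info = transcribe(first_voice_info)   # step 1: text recognition
    return classify_intent(first_text_info)          # step 2: semantic recognition

print(recognize_target_skill(b"<pcm audio>"))        # -> "weather_query"
```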
In some embodiments, the method further comprises: acquiring at least one skill which can be realized by at least three roles; and establishing a first mapping relation by utilizing the mapping relation between the at least three roles and the at least one skill.
Or acquiring at least one skill which can be realized by at least three roles; establishing a skill set by utilizing at least one skill which can be realized by each role; and establishing a first mapping relation by utilizing the mapping relations of at least three roles and the skill set. One role in the first mapping relation corresponds to at least one skill, and all skills corresponding to one role form a skill set.
That is, the first mapping relationship may include a mapping relationship between a role and a skill, or a mapping relationship between a role and a skill set. The first target role corresponding to the target skill can be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set is determined from the first mapping relationship.
In practical applications, the skills corresponding to different roles may partly overlap or be entirely different; that is, the sets of skills that different roles can realize need not be the same. Here, a voice role configured on the terminal may be a role developed by the terminal manufacturer itself, or a third-party role developed by a third-party manufacturer; a third-party role may be invoked by downloading the third-party application program, or invoked through online access without downloading the third-party application program.
For example, the role a corresponds to a skill set a, and the skills included in the set a include "weather query, broadcast play, audio e-book play, and the like";
the role B corresponds to a skill set B, and the skills included in the set B include "music playing, music video playing, live broadcasts by music hosts, and the like";
the role C corresponds to a skill set C, and skills included in the set C comprise 'information query, information recommendation, information download and the like'.
The role A may be a role developed by the terminal manufacturer itself to implement voice control of its own application program A, while the role B and the role C may be third-party roles developed by other manufacturers to implement voice control of the application program B and the application program C.
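To make the first mapping relation concrete, the following Python sketch (illustrative only) stores each role's skill set and looks up the first target role for a given target skill, reusing the example roles and skills above; the identifiers are assumptions for illustration, not a prescribed data format.

```python
from typing import Optional

# First mapping relation: each role corresponds to a skill set; the first target
# role is the role whose skill set contains the target skill.
FIRST_MAPPING = {
    "role_A": {"weather_query", "broadcast_play", "audio_ebook_play"},
    "role_B": {"music_play", "music_video_play", "music_host_live"},
    "role_C": {"information_query", "information_recommend", "information_download"},
}

def first_target_role(target_skill: str) -> Optional[str]:
    for role, skill_set in FIRST_MAPPING.items():
        if target_skill in skill_set:      # match succeeds: this role realizes the skill
            return role
    return None                            # no configured role provides this skill

print(first_target_role("weather_query"))  # -> "role_A"
```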
In the embodiment of the application, the voice broadcasting with different skills is executed by configuring multiple roles, so that the interestingness of voice control is improved.
Further, a first target role corresponding to the target skill is determined from a preset first mapping relation, where the first mapping relation comprises mapping relations between at least three roles and skills. Specifically, the target skill is matched against the skills in the first mapping relation, and the corresponding first target role is determined when the match succeeds; or the target skill is matched against the skill sets corresponding to the roles, the skill set containing the target skill is determined, and the first target role corresponding to that skill set is then determined.
Further, controlling the first target role to execute voice broadcast aiming at the target skill comprises: acquiring tone information and voice text information of the first target role; wherein, different roles correspond to different tone information; synthesizing voice audio information based on the tone information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
That is, if voice information needs to be output when the target skill is executed, so as to realize voice interaction with the user, the voice text to be output is synthesized with the tone of the first target role into voice audio information, and the voice audio information is output through a voice output unit such as a loudspeaker or an earphone.
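A minimal sketch of this role-specific broadcast, assuming a hypothetical TTSEngine and AudioOutput interface (not a real library API), might look as follows:

```python
# Each role has its own tone information; the reply text is synthesized with the
# first target role's tone and handed to a voice output unit.

class TTSEngine:
    def synthesize(self, voice_text: str, tone: str) -> bytes:
        # A real engine would return waveform audio; here we fake the payload.
        return f"[{tone}] {voice_text}".encode("utf-8")

class AudioOutput:
    def play(self, voice_audio: bytes) -> None:
        print("output unit plays:", voice_audio.decode("utf-8"))

TONES = {"role_A": "tone_A", "role_B": "tone_B", "role_C": "tone_C"}

def broadcast(first_target_role: str, voice_text: str,
              tts: TTSEngine, output_unit: AudioOutput) -> None:
    voice_audio = tts.synthesize(voice_text, TONES[first_target_role])
    output_unit.play(voice_audio)          # loudspeaker or earphone output

broadcast("role_A", "It is sunny today.", TTSEngine(), AudioOutput())
```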
According to the voice information processing method, device, equipment and storage medium described above, voice information collected by a voice acquisition unit is acquired, the voice information including first voice information used for indicating the target skill to be invoked; the target skill indicated by the first voice information is identified based on a preset skill recognition strategy; a first target role for realizing the target skill is then determined; and the first target role is controlled to perform voice broadcast for the target skill. The user intention expressed in the voice information is determined through intention judgment, and the target skill the user wants to invoke is determined according to that intention, thereby waking up the target role corresponding to the target skill. In this way, role wake-up is realized more smoothly and voice control becomes more intelligent, and performing the voice broadcast of different skills through multiple configured roles makes voice control more engaging.
On the basis of the foregoing embodiment, a more detailed speech information processing method is further provided, and fig. 2 is a second flow chart of the speech information processing method in the embodiment of the present application, as shown in fig. 2, the method includes:
step 201: acquiring voice information acquired by a voice acquisition unit; the voice information comprises first voice information, and the first voice information is used for indicating calling target skills;
here, the voice collecting means may be a microphone. For example, a microphone on the terminal collects voice information, and the voice information processing method is executed locally on the terminal; or uploading the voice information to a server, executing the steps of the voice information processing method by the server, issuing the processing result to the terminal by the server, and executing the corresponding voice output control operation by the terminal according to the processing result.
Step 202: identifying first voice information based on a preset skill identification strategy, and determining a target skill which is called by the first voice information instruction;
In practical application, the preset skill recognition strategy is used for skill recognition, that is, to determine the target skill that the user wants the terminal (or another electronic device to be controlled) to realize through the voice information. The target skill is determined according to the user intention expressed by the first voice information: by recognizing the user intention, the skill the user wants the controlled terminal to realize next is determined. For example, if the first voice information is "How is the weather today?", the user intention recognized from it is to inquire about the weather, so the corresponding target skill is determined to be "weather query" and the query date is "today". In this way, the user does not need to speak a control keyword such as "weather query" but can use more colloquial language and still trigger the voice control function, which makes the control process more intelligent and better matches the user's daily communication habits.
Specifically, the recognizing the first voice information based on the preset skill recognition strategy includes: performing text recognition on the first voice information by using a voice recognition technology to obtain first text information; and performing semantic recognition on the first text information by using a semantic recognition technology to obtain the target skill that the first voice information indicates should be invoked.
Step 203: determining a first target role corresponding to the target skill from a preset first mapping relation; the first mapping relation comprises mapping relations between at least three roles and skills;
here, the first mapping relationship may include a mapping relationship between a role and a skill, or a mapping relationship between a role and a skill set. The first target role corresponding to the target skill can be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set is determined from the first mapping relationship.
Correspondingly, step 203 may specifically include: matching the target skill against the skills in the first mapping relation, and determining the corresponding first target role when the match succeeds; or matching the target skill against the skill sets corresponding to the roles, determining the skill set containing the target skill, and then determining the first target role corresponding to that skill set.
In practical applications, the skills corresponding to different roles may partly overlap or be entirely different; that is, the sets of skills that different roles can realize need not be the same. Here, a voice role configured on the terminal may be a role developed by the terminal manufacturer itself, or a third-party role developed by a third-party manufacturer; a third-party role may be invoked by downloading the third-party application program, or invoked through online access without downloading the third-party application program.
Step 204: determining a second target role for currently executing voice broadcast;
here, the second target role is the role that is performing voice broadcast before the first target role is woken up. For example, while the terminal device is using the second target role to hold a voice conversation with the user, it may be determined from the user's speaking intention that the user needs the first target role to execute the target skill, and the first target role can then be summoned through the second target role.
In some embodiments, determining the second target role currently performing voice broadcast includes: detecting which role is currently performing voice broadcast and taking it as the second target role. For example, the role identification bit of the currently broadcasting role is detected, and the second target role that is performing voice broadcast is thereby determined.
In some embodiments, before the first target role for executing the target skill is determined, some role needs to be woken up for initial communication with the user. For example, the user directly wakes up the first target role corresponding to the target skill; or a system default role is woken up; or a role frequently used by the user is woken up.
In some embodiments, the wake-up may be by voice, such as: the voice information further comprises second voice information, and the second voice information is used for indicating to awaken the second target role; correspondingly, before determining the second target role currently performing the voice broadcast, the method further includes: identifying the second voice message from the voice messages, and determining a second target role which is awakened and indicated by the second voice message; and controlling the second target role to execute voice broadcast.
Further, the determining that the second voice message indicates a second target role for waking up includes: determining a wakeup identification in the second voice message; determining the second target role corresponding to the awakening identifier from a preset second mapping relation; and the second mapping relation comprises the mapping relations between at least three roles and the awakening identifier.
In practical applications, the roles and the wake-up identifiers in the second mapping relationship are mapped one-to-one or mapped one-to-many. That is, one role can only be woken up by one wake-up identity, or one role can be woken up by multiple wake-up identities. The wake-up identifiers of different roles can be uniformly specified by the manufacturer or can be set by the user according to habits or preferences.
For example, the wake-up identifiers corresponding to the role A are "classmate A", "hello little A" and "little A, are you there";
the wake-up identifier corresponding to the role B is "fat classmate B";
the wake-up identifiers corresponding to the role C are "hello old C" and "hi old C".
Here, different wake-up identifiers are associated with different roles, and the flexibility of role control is improved.
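For illustration, a sketch of the second mapping relation (wake-up identifier to role, one-to-one or one-to-many) could look like this; the wake phrases are placeholder translations, not the actual configured wake words:

```python
from typing import Optional

# Second mapping relation: wake-up identifiers map to roles; several identifiers
# may wake the same role (one-to-many).
SECOND_MAPPING = {
    "classmate A": "role_A",
    "hello little A": "role_A",
    "fat classmate B": "role_B",
    "hello old C": "role_C",
    "hi old C": "role_C",
}

def woken_role(second_voice_text: str) -> Optional[str]:
    for wake_id, role in SECOND_MAPPING.items():
        if wake_id in second_voice_text:   # wake-up identifier detected in the speech
            return role
    return None                            # no role was addressed

print(woken_role("hello little A, how is the weather today"))  # -> "role_A"
```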
In practical application, acquiring the voice information collected by the voice acquisition unit includes: acquiring the first voice information and the second voice information simultaneously; or first acquiring the second voice information, determining the second target role that the second voice information indicates should be woken up, and controlling the second target role to perform voice broadcast, and then acquiring the first voice information and determining the target skill that the first voice information indicates should be invoked.
Step 205: judging whether the second target role is the same as the first target role, if not, executing the step 206; if yes, go to step 207;
step 206: when the second target role is different from the first target role, switching the second target role currently executing voice broadcast into the first target role;
in some embodiments, the second target role and the first target role are different, the method further comprising: generating switching prompt information; and controlling the first target role or the second target role to play the switching prompt message. Here, the switching prompt information is used to prompt the user that the role switching operation is to be performed immediately or that the role switching operation has been performed.
The controlling the first target role or the second target role to play the switching prompt message includes: before switching, controlling a second target role to play the switching prompt message; and after switching, controlling the first target role to play the switching prompt message.
That is, the switching prompt message may be played before the switching operation to remind the user that the first target character is to be called by the second target character to perform the voice broadcast. For example, a voice assistant that wakes up character a with a wake word for character a, actually judges the target skill as that of character B using a skill classifier, can add a smooth sentence while reporting "a question about XX can ask character B with the tone color of character a", and then feeds back the actual result of character B.
Or, the switching prompt message may be played after the switching operation to remind the user that the first target character performs the voice broadcast now. For example, a voice assistant of a role a is awakened by using an awakening word of the role a, a skill classifier is used for actually judging a target skill as a skill of a role B, a smooth sentence can be added, meanwhile, the tone of the role B is used for reporting that ' the problem role a about XX is unknown, the role B answers to the your bar ' is existed ', and then the actual result of the role B is fed back.
Closed-loop conversation is allowed among different roles, so role switching becomes smoother, better matches the user's daily communication habits, and raises the level of intelligence of voice control.
Here, step 207 is also executed after the switching is completed.
Step 207: and controlling the first target role to execute voice broadcasting aiming at the target skills.
It can be understood that, when the second target role is the same as the first target role, the first target role is the second target role, and the first target role is controlled to perform the voice broadcast, that is, the second target role is controlled to perform the voice broadcast.
The step may specifically include: acquiring tone information and voice text information of the first target role; wherein, different roles correspond to different tone information; synthesizing voice audio information based on the tone information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
That is, if voice information needs to be output when the target skill is executed, so as to realize voice interaction with the user, the voice text to be output is synthesized with the tone of the first target role into voice audio information, and the voice audio information is output through a voice output unit such as a loudspeaker or an earphone.
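Putting steps 204 to 207 together, the switching flow can be sketched as follows (illustrative only); the prompt sentences are placeholders for the transition sentences described above:

```python
# Role switching: if the currently broadcasting role differs from the first target
# role, play a switching prompt either before (old role's tone) or after (new
# role's tone) the switch, then let the first target role answer.

def handle_target_skill(current_role: str, first_target_role: str, question: str,
                        prompt_before_switch: bool = True) -> list:
    utterances = []                         # (role, text) pairs to broadcast in order
    if current_role != first_target_role:   # steps 205/206: a switch is needed
        if prompt_before_switch:
            utterances.append((current_role,
                               f"For questions like this you can ask {first_target_role}."))
        else:
            utterances.append((first_target_role,
                               f"{current_role} does not know this, let me answer it."))
    # step 207: the first target role performs the voice broadcast for the skill
    utterances.append((first_target_role, f"answer to: {question}"))
    return utterances

for role, text in handle_target_skill("role_A", "role_B", "play some music"):
    print(role, "->", text)
```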
On the basis of the foregoing embodiment, a more detailed speech information processing method is further provided, fig. 3 is a third schematic flow chart of the speech information processing method in the embodiment of the present application, and as shown in fig. 3, the method includes:
step 301: acquiring voice information acquired by a voice acquisition unit; the voice information comprises first voice information, and the first voice information is used for indicating calling target skills;
here, the voice collecting means may be a microphone. For example, a microphone on the terminal collects voice information, and the voice information processing method is executed locally on the terminal; or uploading the voice information to a server, executing the steps of the voice information processing method by the server, issuing the processing result to the terminal by the server, and executing the corresponding voice output control operation by the terminal according to the processing result.
Step 302: recognizing the first voice information based on a preset skill recognition strategy, and determining a target skill which is called by the first voice information indication;
In practical application, the preset skill recognition strategy is used for skill recognition, that is, to determine the target skill that the user wants the terminal (or another electronic device to be controlled) to realize through the voice information. The target skill is determined according to the user intention expressed by the first voice information: by recognizing the user intention, the skill the user wants the controlled terminal to realize next is determined. For example, if the first voice information is "How is the weather today?", the user intention recognized from it is to inquire about the weather, so the corresponding target skill is determined to be "weather query" and the query date is "today". In this way, the user does not need to speak a control keyword such as "weather query" but can use more colloquial language and still trigger the voice control function, which makes the control process more intelligent and better matches the user's daily communication habits.
Specifically, the recognizing the first voice information based on the preset skill recognition strategy includes: performing text recognition on the first voice information by using a voice recognition technology to obtain first text information; and performing semantic recognition on the first text information by using a semantic recognition technology to obtain the target skill that the first voice information indicates should be invoked.
Step 303: determining a first target role corresponding to the target skill from a preset first mapping relation; wherein the first mapping relation comprises mapping relations between at least three roles and skills;
here, the first mapping relationship may include a mapping relationship between a role and a skill, or a mapping relationship between a role and a skill set. The first target role corresponding to the target skill can be directly determined from the first mapping relationship, or the skill set where the target skill is located is determined first, and then the first target role corresponding to the skill set is determined from the first mapping relationship.
Correspondingly, step 303 may specifically include: matching the target skill against the skills in the first mapping relation, and determining the corresponding first target role when the match succeeds; or matching the target skill against the skill sets corresponding to the roles, determining the skill set containing the target skill, and then determining the first target role corresponding to that skill set.
In practical applications, the skills corresponding to different roles may partly overlap or be entirely different; that is, the sets of skills that different roles can realize need not be the same. A voice role configured on the terminal may be a role developed by the terminal manufacturer itself, or a third-party role developed by a third-party manufacturer; a third-party role may be invoked by downloading the third-party application program, or invoked through online access without downloading the third-party application program. Invoking third-party roles expands the range of available voice skills and thus improves the processing of the user's voice information.
Step 304: controlling the first target role to execute voice broadcast aiming at the target skill;
the step may specifically include: acquiring tone information and voice text information of the first target role; wherein, different roles correspond to different tone information; synthesizing voice audio information based on the tone information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
That is, if voice information needs to be output when the target skill is executed, so as to realize voice interaction with the user, the voice text to be output is synthesized with the tone of the first target role into voice audio information, and the voice audio information is output through a voice output unit such as a loudspeaker or an earphone.
Step 305: acquiring third voice information; the third voice information is used for indicating to quit the first target role currently executing the voice broadcast;
step 306: and controlling to exit the first target role based on the third voice information.
In some embodiments, the step specifically includes: determining an exit identifier in the third voice message; determining a first target role corresponding to the exit identifier from a preset third mapping relation table; wherein, the third mapping relation comprises the mapping relation between at least three roles and the exit identifier; and controlling to exit the first target role.
In practical applications, the role and exit identifier in the third mapping relationship are mapped one-to-one or mapped one-to-many. That is, one role can be exited by only one exit identifier, or one role can be exited by a plurality of exit identifiers. The exit identifiers of different roles can be uniformly specified by the manufacturer or can be set by the user according to habits or preferences.
For example, the exit identifiers corresponding to the role A are "quit classmate A", "little A quit" and "little A, go away";
the exit identifier corresponding to the role B is "go away, fat classmate B";
the exit identifiers corresponding to the role C are "exit old C" and "old C, bye-bye".
Here, different exit identifiers are associated with different roles, and the flexibility of role control is improved.
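A corresponding sketch of the third mapping relation (exit identifier to role) is shown below; the exit phrases are illustrative placeholders:

```python
# Third mapping relation: exit identifiers map to roles, one-to-one or one-to-many.
THIRD_MAPPING = {
    "quit classmate A": "role_A",
    "little A quit": "role_A",
    "go away, fat classmate B": "role_B",
    "exit old C": "role_C",
    "old C, bye-bye": "role_C",
}

def should_exit(third_voice_text: str, first_target_role: str) -> bool:
    """Return True when the third voice information asks the currently broadcasting role to exit."""
    for exit_id, role in THIRD_MAPPING.items():
        if exit_id in third_voice_text and role == first_target_role:
            return True                    # control exiting the first target role
    return False

print(should_exit("exit old C", "role_C"))  # -> True
```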
On the basis of the foregoing embodiments, a voice information processing scenario is further provided. Fig. 4 is a schematic diagram of the composition structure of a speech processing system in an embodiment of the present application. As shown in fig. 4, the speech processing system includes: a voice assistant client 401, a voice assistant central control server 402, a recognition server 403, a skill classifier 404, a role A server 405, a role B server 406, and a role C server 407.
Here, the voice assistant client 401 faces the user and implements the collection and uploading of voice information (including audio and role wake-up words), the receiving of voice output results, and the output of voice information in the voices of different roles. Here, the audio corresponds to the first voice information, and the role wake-up word corresponds to the second voice information.
The voice assistant central control server 402 through the role C server 407 are used to implement the processing of the voice data.
The voice assistant central control server 402 is configured to receive voice information uploaded by the voice assistant client 401, and perform text recognition on the voice information by using a voice recognition technology to obtain a recognition text; sending the recognition text to the skill classifier 404;
the skill classifier 404 adopts a semantic recognition technology to perform semantic understanding on the text information and determine the target skill; when the target skill is skill A, the skill A service is executed using the role A server 405; when the target skill is skill B, the skill B service is executed using the role B server 406; and when the target skill is skill C, the skill C service is executed using the role C server 407.
The role A server 405 processes the skill A, and sends the obtained skill A intention result, skill A resource service result, response text and role A response audio to the voice assistant central control server 402;
the role B server 406 processes the skill B, and sends the obtained skill B intention result, skill B resource service result, response text and role B response audio to the voice assistant central control server 402;
the role C server 407 processes the skill C, and sends the obtained skill C intention result, skill C resource service result, response text and role C response audio to the voice assistant central control server 402;
the voice assistant central control server 402 performs voice synthesis according to the received processing result to generate a voice output result, and sends the voice output result to the voice assistant client 401; the voice assistant client 401 controls the output of the voice output result.
Fig. 5 is a schematic structural diagram of a skill processing system in an embodiment of the present application, and as shown in fig. 5, the system includes: a skills server 501, a semantic understanding server 502, a resource recall server 503, and a TTS server 504.
The skill server 501 sends the received recognition text to the semantic understanding server 502, the semantic understanding server 502 performs semantic understanding on the recognition text to obtain an intention result of the user, and the intention result is returned to the skill server 501;
the skill server 501 sends the intention result to the resource recall server 503, and the resource recall server 503 determines a resource service result and a response text according to the intention result and sends the result and the response text to the skill server 501;
the skill server 501 sends the response text to the TTS server 504, and the TTS server performs speech synthesis to generate a response audio according to the role tone and the response text, and returns the response audio to the skill server 501;
the skill server 501 sends the voice processing results of the intention result, the resource service result, the response text and the response audio to the voice assistant central control server.
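Inside a single skill server, the Fig. 5 pipeline (semantic understanding, resource recall, then TTS) might be sketched as follows; every helper is a hypothetical placeholder rather than the actual server interface:

```python
# Fig. 5 pipeline: recognition text -> intent result -> resource service result and
# response text -> response audio synthesized with the role's tone.

def semantic_understanding(recognized_text: str) -> dict:
    return {"intent": "weather_query", "date": "today"}          # intent result

def resource_recall(intent_result: dict) -> tuple:
    resource_result = {"weather": "sunny", "temperature": "20C"}  # resource service result
    response_text = "It is sunny today, 20 degrees."              # response text
    return resource_result, response_text

def tts_synthesize(response_text: str, role_tone: str) -> bytes:
    return f"[{role_tone}] {response_text}".encode("utf-8")       # response audio

def skill_server(recognized_text: str, role_tone: str) -> dict:
    intent_result = semantic_understanding(recognized_text)
    resource_result, response_text = resource_recall(intent_result)
    response_audio = tts_synthesize(response_text, role_tone)
    return {"intent": intent_result, "resource": resource_result,
            "response_text": response_text, "response_audio": response_audio}

print(skill_server("how is the weather today", "tone_A")["response_text"])
```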
In an implementation scenario, the voice assistant client is located at a terminal side, and the terminal side further includes a voice acquisition unit for acquiring voice data; other servers for realizing voice information processing are positioned at the server side.
In other implementation scenarios, part or all of the server implementing the voice information processing may also be located at the terminal side.
An embodiment of the present application further provides a speech information processing apparatus, and as shown in fig. 6, the apparatus includes:
an acquisition section 601 configured to acquire voice information acquired by the voice acquisition unit; the voice information comprises first voice information, and the first voice information is used for indicating calling target skills;
the processing part 602 is configured to recognize the first voice information based on a preset skill recognition strategy, and determine that the first voice information indicates a called target skill;
the processing part 602 is configured to determine a first target role corresponding to the target skill from a preset first mapping relationship; wherein the first mapping relation comprises mapping relations between at least three roles and skills;
a control section 603 configured to control the first target character to perform voice broadcast for the target skill.
In some embodiments, before the controlling the first target character to perform the voice broadcasting for the target skill, the processing part is configured to determine a second target character currently performing the voice broadcasting;
correspondingly, the device also comprises: and a switching part configured to switch the second target character currently performing the voice broadcast to the first target character when the second target character is different from the first target character.
In some embodiments, the voice information further comprises second voice information indicating to wake the second target role; the processing part is configured to recognize the second voice message from the voice messages and determine that the second voice message indicates a woken second target role; and controlling the second target role to execute voice broadcast.
In some embodiments, the processing portion is configured to determine a wake-up flag in the second voice message; determining the second target role corresponding to the awakening identifier from a preset second mapping relation; and the second mapping relation comprises the mapping relations between at least three roles and the awakening identifier.
In some embodiments, the control section is configured to acquire timbre information of the first target character, and speech text information; wherein, different roles correspond to different tone information; synthesizing voice audio information based on the tone information of the first target role and the voice text information; and controlling a voice output unit to output the voice audio information.
In some embodiments, the acquisition section is configured to acquire third voice information; the third voice information is used for indicating to quit the first target role currently executing the voice broadcast; the control part is configured to control to exit the first target role based on the third voice information.
In some embodiments, the processing portion is configured to determine an exit flag in the third speech information; determining a first target role corresponding to the exit identifier from a preset third mapping relation table; wherein, the third mapping relation comprises the mapping relation between at least three roles and the exit identifier; the control section is configured to control to exit the first target role.
An embodiment of the present application further provides a speech information processing apparatus, as shown in fig. 7, the apparatus includes: a processor 701 and a memory 702 configured to store a computer program capable of running on the processor; the steps of the method of the previous embodiment are implemented by the processor 701 when executing the computer program in the memory 702.
Of course, in actual practice, the various components in the device are coupled together by a bus system 703, as shown in FIG. 7. It is understood that the bus system 703 is used to enable communications among the components. The bus system 703 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled in fig. 7 as bus system 703.
The embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of the method according to any of the embodiments.
In practical applications, the processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular.
The Memory may be a volatile Memory (volatile Memory), such as a Random-Access Memory (RAM); or a non-volatile Memory (non-volatile Memory), such as a Read-Only Memory (ROM), a flash Memory (flash Memory), a Hard Disk (HDD), or a Solid-State Drive (SSD); or a combination of the above types of memories and provides instructions and data to the processor.
It should be noted that: "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

  1. A method of processing speech information, comprising:
    acquiring voice information acquired by a voice acquisition unit; the voice information comprises first voice information, and the first voice information is used for indicating calling target skills;
    recognizing the first voice information based on a preset skill recognition strategy, and determining a target skill which is called by the first voice information indication;
    determining a first target role corresponding to the target skill from a preset first mapping relation; wherein the first mapping relation comprises mapping relations between at least three roles and skills;
    and controlling the first target role to execute voice broadcast aiming at the target skill.
  2. The method of claim 1, wherein prior to the controlling the first target character to perform voice broadcast for the target skills, the method further comprises:
    determining a second target role for currently executing voice broadcast;
    and when the second target role is different from the first target role, switching the second target role currently executing voice broadcast into the first target role.
  3. The method of claim 2, wherein the voice message further comprises a second voice message indicating to wake up the second target role;
    before determining the second target role currently performing the voice broadcast, the method further includes:
    identifying the second voice message from the voice messages, and determining a second target role which is awakened and indicated by the second voice message;
    and controlling the second target role to execute voice broadcast.
  4. The method of claim 3, wherein the determining that the second voice message indicates a second target role for waking comprises:
    determining a wakeup identification in the second voice message;
    determining the second target role corresponding to the awakening identifier from a preset second mapping relation; and the second mapping relation comprises the mapping relations between at least three roles and the awakening identifier.
  5. The method of claim 1, wherein the controlling the first target role to perform voice broadcast for the target skill comprises:
    acquiring tone information and voice text information of the first target role; wherein, different roles correspond to different tone information;
    synthesizing voice audio information based on the tone information of the first target role and the voice text information;
    and controlling a voice output unit to output the voice audio information.
  6. The method according to any one of claims 1 to 5, wherein after controlling the first target character to perform the voice broadcast for the target skill, the method further comprises:
    acquiring third voice information; the third voice information is used for indicating to quit the first target role currently executing the voice broadcast;
    and controlling to exit the first target role based on the third voice information.
  7. The method of claim 6, wherein the controlling exiting the first target role based on the third voice information comprises:
    determining an exit identifier in the third voice message;
    determining a first target role corresponding to the exit identifier from a preset third mapping relation table; wherein, the third mapping relation comprises the mapping relation between at least three roles and the exit identifier;
    and controlling to exit the first target role.
  8. A speech information processing apparatus comprising:
    an acquisition section configured to acquire voice information acquired by the voice acquisition unit; the voice information comprises first voice information, and the first voice information is used for indicating calling target skills;
    the processing part is configured to recognize the first voice information based on a preset skill recognition strategy, and determine a target skill which is called and indicated by the first voice information;
    the processing part is configured to determine a first target role corresponding to the target skill from a preset first mapping relation; wherein the first mapping relation comprises mapping relations between at least three roles and skills;
    and the control part is configured to control the first target role to execute voice broadcast aiming at the target skill.
  9. A speech information processing apparatus comprising: a processor and a memory configured to store a computer program capable of running on the processor,
    wherein the processor is configured to perform the steps of the method of any one of claims 1 to 7 when running the computer program.
  10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN201980099978.9A 2019-10-29 2019-10-29 Voice information processing method, device, equipment and storage medium Pending CN114391165A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/113943 WO2021081744A1 (en) 2019-10-29 2019-10-29 Voice information processing method, apparatus, and device, and storage medium

Publications (1)

Publication Number Publication Date
CN114391165A true CN114391165A (en) 2022-04-22

Family

ID=75714767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980099978.9A Pending CN114391165A (en) 2019-10-29 2019-10-29 Voice information processing method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114391165A (en)
WO (1) WO2021081744A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113364665B (en) * 2021-05-24 2023-10-24 维沃移动通信有限公司 Information broadcasting method and electronic equipment
CN115001890B (en) * 2022-05-31 2023-10-31 四川虹美智能科技有限公司 Intelligent household appliance control method and device based on response-free

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104464716B (en) * 2014-11-20 2018-01-12 北京云知声信息技术有限公司 A kind of voice broadcasting system and method
US9875081B2 (en) * 2015-09-21 2018-01-23 Amazon Technologies, Inc. Device selection for providing a response
CN107122179A (en) * 2017-03-31 2017-09-01 阿里巴巴集团控股有限公司 The function control method and device of voice
CN108735211A (en) * 2018-05-16 2018-11-02 智车优行科技(北京)有限公司 Method of speech processing, device, vehicle, electronic equipment, program and medium
CN109524010A (en) * 2018-12-24 2019-03-26 出门问问信息科技有限公司 A kind of sound control method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2021081744A1 (en) 2021-05-06

Similar Documents

Publication Publication Date Title
CN107340991B (en) Voice role switching method, device, equipment and storage medium
CN107704275B (en) Intelligent device awakening method and device, server and intelligent device
CN110634483B (en) Man-machine interaction method and device, electronic equipment and storage medium
CN107370649B (en) Household appliance control method, system, control terminal and storage medium
CN109637548A (en) Voice interactive method and device based on Application on Voiceprint Recognition
CN111161714B (en) Voice information processing method, electronic equipment and storage medium
CN111261151B (en) Voice processing method and device, electronic equipment and storage medium
EP3611724A1 (en) Voice response method and device, and smart device
CN110060678B (en) Virtual role control method based on intelligent device and intelligent device
WO2016078214A1 (en) Terminal processing method, device and computer storage medium
CN111312235A (en) Voice interaction method, device and system
CN112634897B (en) Equipment awakening method and device, storage medium and electronic device
CN112912955B (en) Electronic device and system for providing speech recognition based services
CN109377979B (en) Method and system for updating welcome language
CN108899028A (en) Voice awakening method, searching method, device and terminal
CN114391165A (en) Voice information processing method, device, equipment and storage medium
CN109065050A (en) A kind of sound control method, device, equipment and storage medium
CN105529025B (en) Voice operation input method and electronic equipment
CN111429917A (en) Equipment awakening method and terminal equipment
CN108492826B (en) Audio processing method and device, intelligent equipment and medium
CN115497470A (en) Cross-device conversation service continuing method, system, electronic device and storage medium
CN115150501A (en) Voice interaction method and electronic equipment
CN110910100A (en) Event reminding method, device, terminal, storage medium and system
CN109725798B (en) Intelligent role switching method and related device
CN111816168A (en) Model training method, voice playing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination