CN111354350A - Voice processing method and device, voice processing equipment and electronic equipment - Google Patents


Info

Publication number
CN111354350A
CN111354350A
Authority
CN
China
Prior art keywords
candidate
voice
determining
voice information
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911371210.0A
Other languages
Chinese (zh)
Other versions
CN111354350B (en)
Inventor
袁全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201911371210.0A
Publication of CN111354350A
Application granted
Publication of CN111354350B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/953 Querying, e.g. by the use of web search engines
    • G06F 16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/28 Constructional details of speech recognition systems
    • G10L 15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/50 Network services
    • H04L 67/55 Push-based network services
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 2015/088 Word spotting

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a voice processing method and device, voice processing equipment and electronic equipment, wherein the method comprises the following steps: determining keywords in the acquired voice information; searching a target server corresponding to the keyword from a plurality of candidate servers; and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content for the user. According to the embodiment of the application, the application range of the electronic equipment is widened by integrating the multiple servers, and the utilization efficiency of the electronic equipment is improved.

Description

Voice processing method and device, voice processing equipment and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a voice processing method and apparatus, a voice processing device, and an electronic device.
Background
Artificial intelligence is a technical science covering the theories, methods, techniques, and application systems used to simulate, extend, and expand human intelligence. Electronic devices such as smartphones, smart home appliances, and smart speakers are increasingly widely used; they can interact with users and perform the corresponding intelligent processing according to the instructions users issue.
In the prior art, the voice clients on electronic devices such as smart speakers and smartphones can collect a user's voice information. A voice client sends the collected voice information to its corresponding server; the server recognizes the voice information, obtains the recommended content corresponding to it, and sends that content back to the voice client, which then outputs it.
However, the server corresponding to each voice client is fixed, so an electronic device can obtain only the recommended content of the servers behind its installed voice clients and cannot obtain recommended content from the servers corresponding to other voice clients, which results in a low utilization rate of the electronic device.
Disclosure of Invention
The embodiments of the present application provide a voice processing method and apparatus, a voice processing device, and an electronic device, aiming to solve the prior-art technical problem that recommended content can only be obtained from the server corresponding to a voice client installed on the electronic device, resulting in a low utilization rate of the electronic device.
Thus, in one embodiment of the present application, there is provided a speech processing method, the method comprising:
determining keywords in the acquired voice information;
searching a target server corresponding to the keyword from a plurality of candidate servers;
and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content for the user.
In another embodiment of the present application, there is provided a speech processing method including:
determining keywords in the acquired voice information;
searching a target client corresponding to the keyword from the candidate clients;
obtaining the recommended content determined by the target client based on the voice information, so as to output the recommended content for the user.
In another embodiment of the present application, a speech processing method applied to an electronic device is provided, including:
collecting voice information of a user;
sending the voice information to a central server, wherein a keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; the recommended content is fed back to the electronic equipment by the central server side;
and obtaining the recommended content fed back by the central server to output the recommended content for the user.
In still another embodiment of the present application, there is provided a speech processing apparatus including:
the first determining module is used for determining keywords in the acquired voice information;
the first searching module is used for searching a target server corresponding to the keyword from a plurality of candidate servers;
and the first processing module is used for acquiring recommended content corresponding to the voice information fed back by the target server side and outputting the recommended content for a user.
In still another embodiment of the present application, there is provided a speech processing apparatus including:
the second determining module is used for determining keywords in the acquired voice information;
the second searching module is used for searching a target client corresponding to the keyword from the candidate clients;
and the second processing module is used for acquiring recommended content determined by the target client based on the voice information so as to output the recommended content for the user.
In another embodiment of the present application, a speech processing apparatus configured to an electronic device includes:
the voice acquisition module is used for acquiring voice information of a user;
the voice sending module is used for sending the voice information to a central server, wherein a keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; the recommended content is fed back to the electronic equipment by the central server side;
and the third processing module is used for acquiring the recommended content fed back by the central server and outputting the recommended content to the user.
In yet another embodiment of the present application, there is provided a voice processing apparatus including: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing component is to:
determining keywords in the acquired voice information; searching a target server corresponding to the keyword from a plurality of candidate servers; and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content for the user.
In yet another embodiment of the present application, there is provided a voice processing apparatus including: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing component is to:
determining keywords in the acquired voice information; searching a target client corresponding to the keyword from the candidate clients; and acquiring recommended content determined by the target client based on the voice information, and outputting the recommended content to the user.
In yet another embodiment of the present application, there is provided an electronic device including: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing component is to:
collecting voice information of a user; sending the voice information to a central server, wherein a keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; the recommended content is fed back to the electronic equipment by the central server side; and obtaining the recommended content fed back by the central server to output the recommended content for the user.
According to the technical solution provided by the embodiments of the present application, keywords in the collected voice information can be determined, and a target server corresponding to the keywords can be found among multiple candidate servers, so that the recommended content corresponding to the voice information fed back by the target server can be obtained and output for the user. By providing a selection operation over multiple candidate servers, all of the candidate servers can serve the recommendation work related to the user's voice information, which expands the feedback range of the voice service and improves the utilization efficiency of the electronic device.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow diagram illustrating one embodiment of a method of speech processing provided herein;
FIG. 2 is a flow diagram illustrating yet another embodiment of a speech processing method provided herein;
FIG. 3 is a flow chart illustrating a further embodiment of a speech processing method provided herein;
FIG. 4 is a flow chart illustrating a further embodiment of a speech processing method provided herein;
FIG. 5 is a flow chart illustrating a further embodiment of a speech processing method provided herein;
FIG. 6 is a diagram illustrating an example of a speech processing method provided herein;
FIG. 7 is a flow chart illustrating a further embodiment of a speech processing method provided herein;
FIG. 8 is a flow chart illustrating a further embodiment of a speech processing method provided herein;
FIG. 9 is a schematic diagram illustrating an embodiment of a speech processing apparatus provided by the present application;
FIG. 10 is a schematic block diagram illustrating one embodiment of a speech processing device provided herein;
FIG. 11 is a schematic diagram illustrating an architecture of another embodiment of a speech processing apparatus provided by the present application;
FIG. 12 is a schematic diagram illustrating an architecture of yet another embodiment of a speech processing device provided by the present application;
FIG. 13 is a schematic diagram illustrating an architecture of yet another embodiment of a speech processing apparatus provided by the present application;
fig. 14 shows a schematic structural diagram of another embodiment of an electronic device provided in the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Some of the flows described in the specification, the claims, and the above figures include a number of operations that appear in a particular order. It should be clearly understood, however, that these operations may be performed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 are merely used to distinguish different operations, and the numbers themselves do not represent any order of execution. In addition, the flows may include more or fewer operations, and those operations may be performed sequentially or in parallel. It should also be noted that the descriptions of "first", "second", and so on herein are used to distinguish different messages, devices, modules, etc.; they do not represent a sequential order, nor do they require that "first" and "second" be of different types.
The embodiments of the present application can be applied to intelligent voice interaction scenarios. Multiple candidate servers are aggregated for the electronic device, providing the user with seamless switching among the candidate servers, so that the user can use the differentiated services of different systems and the utilization efficiency of the electronic device is improved.
As introduced in the background section, electronic devices such as smartphones typically have candidate clients installed. A candidate client usually communicates with its corresponding candidate server: it collects the voice information uttered by the user and sends it to the candidate server, which recognizes the voice information and obtains the corresponding recommended content. The server feeds the recommended content back to the candidate client, and the candidate client outputs it. However, the candidate server corresponding to a candidate client is usually fixed; for example, a Tmall Genie client usually accesses only the Tmall Genie background system, resulting in low utilization efficiency of the electronic device.
In the embodiments of the present application, multiple candidate servers are integrated. After the keywords of the collected voice information are determined, the target server corresponding to the keywords can be found among the multiple candidate servers, so that the target server can obtain the voice information, look up the recommended content corresponding to it, and feed the obtained recommended content back to the electronic device. The electronic device then obtains the recommended content fed back by the target server and outputs it for the user. By integrating the candidate servers, the electronic device can, after collecting voice information, select a target server from among them and access any one of the candidate servers based on the user's voice information, achieving seamless switching among the candidate servers, enabling multi-system use of the electronic device, and improving its utilization rate.
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of an embodiment of a speech processing method provided in an embodiment of the present application may include:
101: and determining keywords in the acquired voice information.
The voice processing method provided by the embodiments of the present application can be applied to an electronic device capable of interacting directly with the user: when the user needs intelligent voice interaction, recommended content can be obtained through the electronic device. The electronic device may include a mobile phone, a smart speaker, a tablet computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, and the like; the embodiments of the present application do not limit the specific type of electronic device.
The voice processing method can be applied to a center server interacting with electronic equipment, a user uses the electronic equipment to interact with the corresponding center server, and the center server can comprise a high-performance computer, cloud computing equipment and the like. The embodiment of the present application does not set any limit to the specific type of the central server. The central server may correspond to a plurality of candidate servers, and may interact with each candidate server respectively.
The keywords in the voice information may refer to words related to a service system, such as a system, a device, a developer, a client, and a server, included in the voice information.
Taking a smart speaker as the electronic device, in an embodiment of the present application the collected voice information may be, for example, "Tmall Genie, I want to listen to a song by a certain singer", and the keywords determined in the collected voice information may be "Tmall Genie" and the singer's name.
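As a minimal illustration of this keyword-determination step (the keyword registry and the sample sentence below are hypothetical, not taken from the patent), matching recognized text against known system keywords might look like:

```python
# Hedged sketch: determine which system keywords (e.g., an assistant's
# wake name) appear in the text recognized from the voice information.
# SYSTEM_KEYWORDS and the sample utterance are illustrative assumptions.
SYSTEM_KEYWORDS = ["tmall genie", "assistant b"]

def determine_keywords(recognized_text):
    """Return every registered system keyword contained in the text."""
    text = recognized_text.lower()
    return [kw for kw in SYSTEM_KEYWORDS if kw in text]

print(determine_keywords("Tmall Genie, I want to listen to a song"))
# → ['tmall genie']
```

A real implementation would also extract content keywords (such as a singer's name), typically via semantic recognition rather than plain substring matching.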
102: and searching a target server corresponding to the keyword from a plurality of candidate servers.
Optionally, a system keyword for each candidate server may be set. Searching for a target server corresponding to the keyword from a plurality of candidate servers may include: and searching a target server containing the keywords from the system keywords corresponding to the candidate servers. The keyword of each candidate server can include a plurality of keywords, and each keyword can be used for distinguishing the candidate server.
Optionally, the target server corresponding to the keyword may be directly searched from a plurality of candidate servers. At this time, the electronic device or the central server may directly communicate with the plurality of candidate servers to obtain the recommended content corresponding to the voice information from the corresponding candidate servers.
The electronic equipment or the central server can send the voice information to the target server, so that the target server can obtain the voice information and determine the recommended content corresponding to the voice information. After obtaining the recommended content, the target server may send the recommended content to the electronic device or the central server.
The candidate server may be a voice server, and the candidate client may be a voice client. Multiple voice servers can be integrated, and the target server corresponding to the keyword can be found among them. In one possible design, when the electronic device is a smart speaker, the smart speaker provides a voice interaction service for the user. To widen its application range, the smart speaker can integrate the functions of multiple voice servers and select a suitable target server from among them according to the keyword, so as to obtain recommended content from that target server. Taking the multiple candidate servers as multiple intelligent voice servers, for example a "Tmall Genie server" and a "minim server": when the keywords of the voice information are "Tmall Genie" and a singer's name, the "Tmall Genie server" can be selected from the two as the target server, so as to obtain the corresponding recommended content from the "Tmall Genie server".
The smart speaker can thus switch to a different voice server at any time to obtain the corresponding service, which widens the application range of the smart speaker and improves its utilization efficiency.
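The smart-speaker flow described above can be sketched end to end as follows; every name here, including the stubbed fetch function and the server mapping, is an illustrative assumption rather than the patent's implementation:

```python
# Hedged sketch: the spoken keyword selects one of several candidate
# servers, and the recommended content is fetched from that server.
def fetch_recommended_content(server, voice_text):
    return f"content from {server}"    # stand-in for a network call

def process_voice(voice_text, keyword_to_server):
    """Route the voice text to the first matching candidate server."""
    for keyword, server in keyword_to_server.items():
        if keyword in voice_text:      # the keyword identifies the server
            return fetch_recommended_content(server, voice_text)
    return "no matching candidate server"

servers = {"tmall genie": "genie-server"}   # hypothetical mapping
print(process_voice("tmall genie, play a song", servers))
# → content from genie-server
```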
103: and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content for the user.
Optionally, the recommended content sent by the target server may be received. When the recommended content is output for the user, the recommended content can be output through a display screen, and can also be played in voice. Taking the smart sound box as an example, when the smart sound box plays the recommended content for the user, the recommended content can be played in a voice form.
In the embodiments of the present application, multiple candidate servers are integrated. After the keywords of the collected voice information are determined, the target server corresponding to the keywords can be found among the multiple candidate servers, so that the target server can obtain the voice information, look up the recommended content corresponding to it, and feed the obtained recommended content back to the electronic device. The electronic device then obtains the recommended content fed back by the target server and outputs it for the user. By integrating the candidate servers, the electronic device can, after collecting voice information, select a target server from among them and access any one of the candidate servers based on the user's voice information, achieving seamless switching among the candidate servers, enabling multi-system use of the electronic device, and improving its utilization rate.
The target server can receive voice information sent by the electronic equipment. In some embodiments, after searching for the target server corresponding to the keyword from the plurality of candidate servers, the method further includes:
and sending the voice information to the target server.
And the voice information is used for the target server to determine the corresponding recommended content.
After receiving the voice information sent by the electronic device, the target server may determine recommended content corresponding to the voice information. The target server can feed back the recommended content to the electronic equipment.
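On the target-server side, the determination of recommended content from the forwarded voice information can be sketched as below; the lookup table is a stand-in for the server's real recommendation logic, and all entries are hypothetical:

```python
# Hedged sketch: a target server receives the forwarded voice
# information and returns recommended content for it.
RECOMMENDATIONS = {
    "play a song": ["Song A", "Song B"],   # hypothetical catalog entry
}

def determine_recommended_content(voice_text):
    """Map the recognized request text to recommended content, if any."""
    return RECOMMENDATIONS.get(voice_text.lower(), [])

print(determine_recommended_content("Play a song"))
# → ['Song A', 'Song B']
```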
As shown in fig. 2, a flowchart of another embodiment of a speech processing method provided in this embodiment of the present application may include the following steps:
201: and collecting voice information of the user.
The voice processing method provided by the embodiment of the application can be applied to electronic equipment.
The electronic device may be configured with a voice collecting component, such as a microphone, through which voice information of a user may be collected.
202: and determining keywords in the acquired voice information.
The keywords in the voice information can be obtained through text conversion and semantic recognition processing.
203: and searching a target server corresponding to the keyword from a plurality of candidate servers.
204: and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content for the user.
Some steps of the embodiments of the present application are the same as those of the embodiments described above, and are not described herein again.
In the embodiments of the present application, multiple candidate servers are integrated. After the keywords of the collected voice information are determined, the target server corresponding to the keywords can be found among the multiple candidate servers, so that the target server can obtain the voice information, look up the recommended content corresponding to it, and feed the obtained recommended content back to the electronic device. The electronic device then obtains the recommended content fed back by the target server and outputs it for the user. By integrating the candidate servers, the electronic device can, after collecting voice information, select a target server from among them and access any one of the candidate servers based on the user's voice information, achieving seamless switching among the candidate servers, enabling multi-system use of the electronic device, and improving its utilization rate.
As shown in fig. 3, a flowchart of another embodiment of a speech processing method provided in this embodiment of the present application may include the following steps:
301: and acquiring voice information sent by the electronic equipment.
The voice processing method provided by this embodiment can be applied to a central server, where the voice information is collected by an electronic device and sent to the central server.
The electronic device is the device used by the user; it can collect the user's voice information and, after obtaining the recommended content, output it to the user.
The central server may be formed by devices with high processing capability, such as high-performance computers and cloud computing devices. The central server can communicate with the electronic device, and it can also communicate with multiple candidate servers to transmit information between the central server and the candidate servers.
302: and determining the keywords in the received voice information.
303: and searching a target server corresponding to the keyword from a plurality of candidate servers.
304: and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content for the user.
The obtaining of the recommended content corresponding to the voice information fed back by the target server to output the recommended content to the user may include: acquiring recommended content corresponding to the voice information fed back by the target server; and sending the recommended content to the electronic equipment so that the electronic equipment can output the recommended content for the user.
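The central-server relay described in steps 301 to 304 can be sketched as a single function; the four callables below are stand-ins for real speech recognition and network I/O, and all names are illustrative assumptions:

```python
# Hedged sketch of the central-server relay: receive voice text from
# the device, find the target server, fetch its recommended content,
# and push that content back to the electronic device.
def central_server_relay(voice_text, extract_keyword, find_target,
                         fetch_content, push_to_device):
    keyword = extract_keyword(voice_text)          # step 302
    target = find_target(keyword)                  # step 303
    content = fetch_content(target, voice_text)    # target-server feedback
    push_to_device(content)                        # step 304: to the device
    return content

# Usage with trivial stand-ins:
result = central_server_relay(
    "genie play music",
    extract_keyword=lambda t: t.split()[0],
    find_target=lambda k: f"{k}-server",
    fetch_content=lambda s, t: f"content from {s}",
    push_to_device=lambda c: None,
)
print(result)
# → content from genie-server
```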
In this embodiment, the electronic device can collect voice information and send it to the central server, which obtains the voice information sent by the electronic device. The central server can determine the keywords in the voice information and find the target server corresponding to them among the candidate servers. Because the central server communicates with multiple candidate servers and can make use of each of them, once the corresponding target server is found based on the keywords in the voice information, a suitable target server can be selected from the different candidate servers in a targeted way. This widens the application range of the central server, allows the user to use the services provided by the various candidate servers through the electronic device, expands the range of use of the electronic device, and improves its utilization efficiency.
Generally, when processing voice information, the voice information may be converted into text information, and the text information obtained by the conversion is semantically recognized to obtain the keywords in the voice information. Therefore, as an embodiment, the determining the keywords in the collected voice information may include:
converting the voice information into text information;
and performing semantic recognition processing on the text information to obtain the keywords in the text information.
Optionally, the converting the voice information into text information may include: converting the voice information into text information through a speech recognition algorithm. The speech recognition algorithm may include a deep neural network model, a CTC (connectionist temporal classification) algorithm, and the like, which are not described herein again.
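As a minimal illustration of the two-step keyword determination described above, the following Python sketch stubs out the speech recognition step (a real system would run a DNN/CTC recognizer over the audio) and reduces semantic recognition to a vocabulary lookup; all names, the vocabulary, and the transcript are hypothetical and not part of the original disclosure:

```python
# Hypothetical sketch of keyword determination: speech -> text -> keywords.
KEYWORD_VOCABULARY = {"music", "weather", "news"}  # assumed keyword set

def transcribe(audio):
    # Stand-in for a speech recognition algorithm (e.g. DNN + CTC decoding);
    # stubbed here so the example stays self-contained.
    return "play some music for me"

def extract_keywords(text):
    # Minimal stand-in for semantic recognition: keep tokens found in a
    # known keyword vocabulary.
    return [tok for tok in text.lower().split() if tok in KEYWORD_VOCABULARY]

keywords = extract_keywords(transcribe(b"<raw audio>"))
print(keywords)  # ['music']
```

In a real deployment, the stubbed `transcribe` would be replaced by the recognizer's decoding output, and `extract_keywords` by an actual semantic recognition model.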
In some embodiments, the searching for the target server corresponding to the keyword from the candidate servers includes:
determining respective system keywords of the candidate servers;
searching a target system keyword matched with the keyword from the system keywords of the candidate servers;
and determining the candidate server corresponding to the target system keyword as the target server.
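The lookup in the steps above can be sketched as a match of the extracted keyword against each candidate server's system keywords; the server names and keyword sets below are hypothetical:

```python
# Hypothetical registry mapping candidate servers to their system keywords.
CANDIDATE_SERVERS = {
    "music-server": {"music", "song", "playlist"},
    "weather-server": {"weather", "forecast"},
}

def find_target_server(keyword):
    # Return the candidate server whose system keywords contain the keyword.
    for server, system_keywords in CANDIDATE_SERVERS.items():
        if keyword in system_keywords:
            return server
    return None  # no candidate server matched

print(find_target_server("forecast"))  # weather-server
```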
The system keywords of each candidate server may include a plurality of system keywords. In this embodiment, there are a plurality of candidate servers; in order to distinguish different candidate servers, the system keywords of a candidate server may be used to identify that candidate server, and the system keywords may be used to distinguish the corresponding candidate servers.
As a possible implementation manner, the client names of the candidate clients may be used as system keywords, and the determining the system keywords of each of the candidate servers may include:
determining client names of the candidate clients corresponding to the candidate servers respectively;
and taking the client names corresponding to the candidate servers as the respective system keywords of the candidate servers, to obtain a plurality of system keywords.
In addition, in some embodiments, the server name of each candidate server, developer information, system version information, and the like may also be used as system keywords; any word that can distinguish the candidate servers may serve as a system keyword.
As another possible implementation manner, the system keyword of each of the candidate servers may be determined by:
determining at least one candidate word;
sequentially determining the candidate servers respectively corresponding to the at least one candidate word, and obtaining the target candidate words respectively corresponding to the candidate servers;
and determining the target candidate word corresponding to any candidate server as the system keyword of that candidate server, to obtain the system keywords of each candidate server.
A candidate word may be a single word, a short sentence, or a sentence pattern containing both words and sentence structure, and may be used to identify the corresponding candidate server. In some embodiments, the number of candidate words is very large; therefore, the candidate words may be distributed among the candidate servers so that different candidate servers can be distinguished by their candidate words. Each candidate server may correspond to a plurality of candidate words, and the plurality of candidate words corresponding to a candidate server form the system keywords of that server. Each candidate server may thus have multiple system keywords, and the system keywords may be used to distinguish different candidate servers.
In some embodiments, the sequentially determining the candidate servers corresponding to the at least one candidate word, and obtaining the target candidate words respectively corresponding to the candidate servers, includes:
and sequentially determining, according to the functional attributes respectively corresponding to the candidate servers, the candidate servers corresponding to the at least one candidate word, and obtaining the target candidate words respectively corresponding to the candidate servers.
The functional attribute corresponding to a candidate server may describe the network services provided by that candidate server for the user. For example, music software, that is, a music client, may generally provide functional services such as song playing, song library query, and song recommendation for the user, and the functional attributes of the candidate server corresponding to the music software may include audio search, audio recommendation, user feedback, and the like. The at least one candidate word can be distributed among the different candidate servers according to the word meaning of each candidate word, to obtain the target candidate words respectively corresponding to the multiple candidate servers.
Further, optionally, the sequentially determining, according to the functional attributes respectively corresponding to the candidate servers, the candidate servers respectively corresponding to the at least one candidate word, and obtaining the target candidate words respectively corresponding to the candidate servers, may include:
and determining, for the functional attribute of any candidate server, a target candidate word matched with the functional attribute from the at least one candidate word, so as to obtain the target candidate words respectively corresponding to the candidate servers.
Optionally, for the functional attribute of any candidate server, a target candidate word whose word meaning matches the functional attribute of the candidate server may be determined from the at least one candidate word, so as to obtain target candidate words corresponding to the multiple candidate servers respectively.
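A minimal sketch of this word-meaning-to-functional-attribute assignment, assuming a hypothetical mapping from each candidate word to the functional attribute its meaning implies (server names, attributes, and words are all illustrative):

```python
# Illustrative functional attributes of each candidate server.
SERVER_ATTRIBUTES = {
    "music-server": {"audio search", "audio recommendation"},
    "news-server": {"text search", "news recommendation"},
}

# Assumed mapping from a candidate word's meaning to a functional attribute.
WORD_ATTRIBUTE = {
    "song": "audio search",
    "album": "audio recommendation",
    "headline": "news recommendation",
}

def assign_candidate_words(candidate_words):
    # A word becomes a target candidate word (system keyword) of the server
    # whose functional attributes match the word's implied attribute.
    assigned = {server: [] for server in SERVER_ATTRIBUTES}
    for word in candidate_words:
        attribute = WORD_ATTRIBUTE.get(word)
        for server, attributes in SERVER_ATTRIBUTES.items():
            if attribute in attributes:
                assigned[server].append(word)
    return assigned

print(assign_candidate_words(["song", "headline"]))
```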
In one possible design, the at least one candidate word may be determined by:
acquiring historical keywords corresponding to historical voice information;
determining target historical keywords meeting selection conditions from the historical keywords;
and determining the target historical keyword as the at least one candidate word.
The candidate words can be selected based on historical keywords. Specifically, the historical keywords corresponding to historical voice information can be obtained, and at least one candidate word satisfying the selection condition is selected from the historical keywords. Because the historical keywords correspond to the historical servers accessed by the user, the corresponding candidate server can be directly determined through the historical keywords, which can improve the efficiency of word distribution.
Therefore, the sequentially determining the candidate servers corresponding to the at least one candidate word and obtaining the target candidate words corresponding to the candidate servers may include: sequentially determining the historical servers corresponding to the at least one candidate word, where the historical servers are candidate servers; the at least one candidate word can thus be associated with its corresponding candidate server, yielding the target candidate words corresponding to the candidate servers.
After the target server corresponding to the keyword is searched from the plurality of candidate servers, the method may further include: and associating the target server and the keyword, and taking the keyword as a system keyword of the target server.
In some embodiments, the determining, from the history keywords, a target history keyword that satisfies a selection condition may include:
determining the occurrence frequency of each historical keyword;
and determining the history keywords whose occurrence counts are greater than a count threshold as the target history keywords.
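The selection condition above reduces to a frequency filter over the history keywords; a sketch with an illustrative threshold value:

```python
from collections import Counter

def select_target_history_keywords(history_keywords, count_threshold=2):
    # Keep history keywords whose occurrence count exceeds the threshold.
    counts = Counter(history_keywords)
    return [kw for kw, n in counts.items() if n > count_threshold]

history = ["music", "music", "music", "weather", "news", "news"]
print(select_target_history_keywords(history))  # ['music']
```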
In yet another possible design, the at least one candidate word may be determined by:
extracting voice attribute information in the voice information;
determining user attribute information of the user based on the voice attribute information;
and determining at least one candidate word having an association relation with the user attribute information.
Optionally, the voice attribute information may be extracted from the voice information of the user, and the pronunciation characteristics and speaking content of the user may be quantified, so as to identify the user through the voice characteristics and analyze the user's preferences and dialect. Words related to the user can then be determined by using the identity characteristics of the user, to obtain at least one candidate word.
The voice attribute information of the user may include information such as a tone, audio, a tone color, a language used for speaking, a dialect used for speaking, and grammar. The voice attribute information of the user can be used for determining the user attribute information such as the identity, the age, the location of the area, the interested content, the historically accessed server and the like of the user, so that at least one candidate word which has an association relation with the attribute information of the user can be determined.
Optionally, at least one candidate word having an association relationship with the attribute information may be searched from a word library based on the attribute information of the user.
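A minimal sketch of that lookup, with a hypothetical word library keyed by user attribute information (the attributes and words below are invented for illustration):

```python
# Illustrative word library associating user attributes with candidate words.
WORD_LIBRARY = {
    "region:guangdong": ["cantonese opera", "dim sum"],
    "age:youth": ["pop music", "gaming"],
}

def candidate_words_for(user_attributes):
    # Collect all candidate words associated with the user's attributes.
    words = []
    for attribute in user_attributes:
        words.extend(WORD_LIBRARY.get(attribute, []))
    return words

print(candidate_words_for(["region:guangdong", "age:youth"]))
```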
As another possible implementation manner, the system keyword of each of the candidate servers may be determined by:
and detecting system keywords set by the user aiming at any candidate server to obtain the respective system keywords of the multiple candidate servers.
The user can set keywords for the candidate server or the candidate client. The system keywords of the server side are the same as the system keywords of the client side.
Because users in different regions may use different tones, language families, and grammars, as another embodiment, the method may further include:
and determining voice attribute information of the voice information.
The obtaining of the recommended content corresponding to the voice information fed back by the target server to output the recommended content to the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content based on the voice attribute information.
Optionally, the voice attribute information of the voice information may include various information such as the language, dialect, language family, grammar, and/or intonation used in the voice information. The language may refer to the language type used when the user utters the voice, for example, Chinese, English, or Spanish. The dialect may refer to the regional variety of speech used by the user, for example, Cantonese or Mandarin. The language family refers to a grouping of languages by genetic relationship in historical comparative linguistics. Grammar refers to the word classes, inflections, and other means of expressing relationships, as well as the functions and relationships of words in sentences, under an established usage. The intonation may refer to information such as rising tone, falling tone, and level tone in the voice information, and the emotion of the user can be judged through the intonation, for example, when the user utters the voice with a particular emotion.
In some embodiments, the outputting the recommended content based on the voice attribute information may include:
converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
Optionally, the converting the recommended content into the first recommended voice corresponding to the voice attribute information may include: and generating a first recommended voice corresponding to the recommended content by using the voice attribute information as a sound generation parameter.
When the recommended content is converted into the first recommended voice corresponding to the target language, the recommended content may be converted into the first recommended voice according to information such as the language, accent, language family, grammar, and/or related vocabulary in the voice attribute information. For example, when the voice attribute information of the voice information indicates Chinese or Cantonese, the recommended content may be converted into a first recommended voice composed of Chinese as the base language with Cantonese grammar and vocabulary. For another example, if the intonation in the voice attribute information contains many rising tones, the emotion of the user may not be stable enough, and therefore the first recommended voice may be set to a level or falling tone.
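A sketch of turning voice attribute information into sound generation parameters before synthesis; the rules (such as the rising-tone count and its threshold) are assumed heuristics mirroring the examples above, and a real system would hand the resulting parameters to a TTS engine:

```python
def sound_parameters(voice_attributes):
    # Derive synthesis parameters from the voice attribute information.
    params = {
        "language": voice_attributes.get("language", "zh"),
        "dialect": voice_attributes.get("dialect", "mandarin"),
        "tone": "neutral",
    }
    # Many rising tones may indicate an unsettled mood, so flatten the
    # output tone (an assumed heuristic, not specified by the source).
    if voice_attributes.get("rising_tone_count", 0) > 3:
        params["tone"] = "level"
    return params

print(sound_parameters({"language": "zh", "dialect": "cantonese",
                        "rising_tone_count": 5}))
```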
In one possible design, the converting the recommended content into a first recommended voice corresponding to the target language includes:
and if the user is located in a self-service place, converting the recommended content into a first recommended voice corresponding to the target language.
In a self-service place, the electronic device can be a self-service terminal, and the user can interact with the self-service terminal by voice to handle the related products and services of the self-service place by himself or herself.
In certain embodiments, the method may further comprise:
determining user attribute information of the user based on voice attribute information of the voice information;
the candidate servers are determined by:
and determining a plurality of candidate servers having association relation with the user attribute information.
Besides determining at least one candidate word associated with the attribute information of the user, a plurality of candidate servers associated with the user can also be determined according to the attribute information of the user. Candidate servers more strongly associated with the user can thus be obtained, realizing targeted server screening and improving the efficiency of server selection.
In the embodiment of the application, the recommended content is converted into voice and output to the user, so that the user can quickly obtain the recommended content without performing any further operation, realizing imperceptible (seamless) interaction and improving the application efficiency of the electronic device.
In order to improve the experience of voice interaction and improve the application range of the electronic device, as another embodiment, the method may further include:
determining gender attribute information of the voice information;
the obtaining of the recommended content corresponding to the voice information fed back by the target server to output the recommended content to the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content for the user based on the gender attribute information.
The gender attribute information included in the voice information may specifically refer to a gender attribute of the user who uttered the voice, for example, when the voice uttered by the user is determined to be female voice, it may be determined that the gender attribute of the voice is female. When the voice information of the user includes a male name, the gender attribute of the voice can be determined to be male.
As a possible implementation manner, the determining the gender attribute information of the voice information may include:
and determining gender attribute information of the voice information based on the characteristic information of the voice information.
As another possible implementation manner, the determining the gender attribute information of the voice information includes:
converting the voice information into text information;
performing semantic recognition processing on the text information to obtain name keywords in the text information;
and determining gender attribute information of the voice information by using the name keywords.
In some embodiments, the determining gender attribute information of the voice message using the name keyword comprises:
if the name key words are matched with the first class of names, determining the gender attribute information of the voice information as first attribute information;
and if the name key words are matched with the second type of names, determining that the gender attribute information of the voice information is second attribute information.
The first type of name may be a set of female names, and the second type of name may be a set of male names. The name keyword matches the first type of name when the first type contains a name whose similarity to the name keyword exceeds a similarity threshold; likewise, the name keyword matches the second type of name when the second type contains such a name. The similarity threshold may be set according to the similarity requirement for two names: the higher the similarity, the more alike the two names are. The first type of name and the second type of name may be name lexicons obtained based on statistics or the like. By classifying names with different attributes to control the voice output, personalized output of recommended content can be realized, multi-level application of the electronic device is provided, and the application efficiency of the electronic device is improved.
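The matching rule above can be sketched with a string-similarity measure; the name sets, attribute labels, and threshold below are hypothetical, and `difflib.SequenceMatcher` merely stands in for whatever similarity function the system actually uses:

```python
import difflib

FIRST_CLASS_NAMES = {"alice", "mary"}   # e.g. a female-name lexicon (assumed)
SECOND_CLASS_NAMES = {"bob", "david"}   # e.g. a male-name lexicon (assumed)
SIMILARITY_THRESHOLD = 0.8              # illustrative value

def similarity(a, b):
    return difflib.SequenceMatcher(None, a, b).ratio()

def gender_attribute(name_keyword):
    # First attribute information if the keyword resembles a first-class
    # name, second attribute information if it resembles a second-class name.
    name = name_keyword.lower()
    if any(similarity(name, n) > SIMILARITY_THRESHOLD for n in FIRST_CLASS_NAMES):
        return "first"
    if any(similarity(name, n) > SIMILARITY_THRESHOLD for n in SECOND_CLASS_NAMES):
        return "second"
    return None

print(gender_attribute("Alice"))  # first
```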
In one possible design, the outputting the recommended content for the user based on the gender attribute information includes:
generating a second recommended voice corresponding to the recommended content by taking gender attribute information as a sound generation parameter;
and outputting the second recommended voice.
When the gender attribute information is used as a sound generation parameter to generate the second recommended voice corresponding to the recommended content, the output timbre of the second recommended voice may be controlled to match the gender attribute information. For example, when the gender attribute information is male attribute information, the second recommended voice is also output with male attribute information.
As shown in fig. 4, a flowchart of another embodiment of a speech processing method provided in this embodiment of the present application may include the following steps:
401: and collecting voice information of the user.
Some steps of the embodiments of the present application are the same as those of the embodiments described above, and are not described herein again.
402: and identifying key words in the voice information.
403: and searching a target server corresponding to the keyword from a plurality of candidate servers.
404: and sending the voice information to the target server side so that the target server side can determine recommended content corresponding to the voice information and feed back the recommended content.
405: and obtaining the recommended content fed back by the target server.
406: and determining voice attribute information and gender attribute information of the voice information.
407: and generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as sound generation parameters.
408: and outputting the third recommended voice.
In the embodiment of the application, the electronic device can collect the voice information of the user and identify the keywords in the voice information, so that the target server can be found from the multiple candidate servers by using the keywords and the voice information can be sent to the target server. The target server can obtain the recommended content corresponding to the voice information and feed the recommended content back to the electronic device, and the electronic device can obtain the recommended content fed back by the target server. The electronic device can establish connections with the multiple candidate servers, so that systematic use of the multiple candidate servers is achieved and the application efficiency of the electronic device is improved. In addition, the voice attribute information and the gender attribute information of the voice information can be determined and used as sound generation parameters to generate and output a third recommended voice corresponding to the recommended content, realizing personalized output of the recommended content and widening the application range of the electronic device.
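Steps 401-408 above can be tied together in a single sketch in which every component (recognition, server call, synthesis) is stubbed so that only the control flow remains; all names and data are illustrative, not part of the original disclosure:

```python
# Hypothetical candidate servers and their system keywords.
CANDIDATE_SERVERS = {"music-server": {"music"}, "weather-server": {"weather"}}

def recognize_keyword(text):
    # Steps 401-402: identify a keyword in the (already transcribed) input.
    all_keywords = set().union(*CANDIDATE_SERVERS.values())
    return next((tok for tok in text.split() if tok in all_keywords), None)

def find_target_server(keyword):
    # Step 403: look up the candidate server owning the keyword.
    return next((s for s, kws in CANDIDATE_SERVERS.items() if keyword in kws), None)

def fetch_recommendation(server, text):
    # Steps 404-405: stand-in for sending the voice information to the
    # target server and receiving its recommended content.
    return f"recommendation from {server} for '{text}'"

def synthesize(content, voice_attributes, gender_attribute):
    # Steps 406-408: stand-in for generating the third recommended voice
    # with voice and gender attributes as sound generation parameters.
    return {"text": content, "voice": voice_attributes, "gender": gender_attribute}

text = "play some music"
server = find_target_server(recognize_keyword(text))
speech = synthesize(fetch_recommendation(server, text),
                    {"dialect": "mandarin"}, "female")
print(speech["text"])
```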
As shown in fig. 5, a flowchart of another embodiment of a speech processing method provided in this embodiment is applied to an electronic device, and the method may include the following steps:
501: collecting voice information of a user;
502: and sending the voice information to a central server.
The keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; and the recommended content is fed back to the electronic equipment by the central server.
503: and obtaining the recommended content fed back by the central server to output the recommended content for the user.
The embodiment of the application can be applied to an electronic device; the electronic device may include a smart speaker, a smartphone, and the like, and can provide a voice interaction function for the user.
The central server in the embodiment of the present application may execute the voice processing method shown in fig. 3: it obtains the voice information sent by the electronic device, determines the keyword in the voice information, and then searches for a target server corresponding to the keyword from a plurality of candidate servers, so as to obtain the recommended content corresponding to the voice information from the target server. The electronic device collects the voice information of the user and exchanges data with the central server, so that the electronic device can obtain the network services provided by the multiple candidate servers, and the processing and switching of network services is imperceptible to the user. The application range of the electronic device is therefore expanded without affecting the user's experience, and the utilization rate of the electronic device is improved.
For convenience of understanding, the technical solution of the embodiment of the present application is described in detail by taking a smart speaker as an example of the electronic device. As shown in fig. 6, the smart speaker M1 may establish wireless communication with a plurality of candidate servers M2, and the communication relationship between the two is represented by a wireless transmission symbol.
The candidate server can be a server formed by a computer or a server formed by a cloud server.
The smart speaker M1 may collect the voice information 601 of the user U and recognize the keyword 602 in the voice information; the smart speaker M1 may then search for the target server 603 corresponding to the keyword from the candidate servers M2. Assuming that the found target server is M2S in M2, the smart speaker M1 may send the voice information to the target server M2S through 604. The wireless transmission symbol between the smart speaker M1 and the target server M2S is not shown.
Thereafter, the target server M2S may obtain 605 recommended content corresponding to the voice information, and feed back the recommended content.
Then, the smart speaker M1 may obtain the recommended content 606 fed back by the target server M2S, so that the smart speaker M1 outputs the recommended content 607.
The smart speaker M1 is connected with the multiple candidate servers M2, and the collected voice information is analyzed to determine which target server to access, so that access switching among different candidate servers is realized, the application range of the smart speaker is expanded, and the utilization efficiency of the smart speaker is improved.
As shown in fig. 7, a flowchart of another embodiment of a speech processing method provided in this embodiment of the present application may include:
701: and determining keywords in the acquired voice information.
Some steps of the embodiments of the present application are the same as those of the embodiments described above, and are not described herein again.
Optionally, the technical solution of the embodiment of the present application may be applied to an electronic device. The electronic device may integrate a plurality of candidate clients, which may include voice clients; the electronic device may correspond to the plurality of candidate clients simultaneously and obtain the services they provide. For example, a plurality of music applications may be installed in the electronic device to simultaneously use the network services provided by them.
702: and searching a target client corresponding to the keyword from the candidate clients.
703: and acquiring recommended content determined by the target client based on the voice information, and outputting the recommended content to the user.
The obtaining of the recommended content determined by the target client based on the voice information may include: and sending the voice information to a corresponding target server through the target client so that the target server can determine the recommended content corresponding to the voice information and feed back the recommended content to the target client, so that the electronic equipment can obtain the recommended content through the target client.
In the embodiment of the present application, by integrating a plurality of candidate clients, after the voice information of a user is collected, the keywords in the voice information may be identified, so that a target client corresponding to the keywords may be searched from the plurality of candidate clients, and the recommended content determined based on the voice information may be obtained through the target client and output to the user. The electronic device integrates multiple candidate clients to realize their comprehensive application and obtain the voice services they respectively provide. Each candidate client can communicate with its corresponding candidate server, and the electronic device obtains the background services of the multiple candidate servers by using the multiple candidate clients, thereby improving the application efficiency of the electronic device.
Generally, when processing voice information, the voice information may be converted into text information, and the text information obtained by the conversion is semantically recognized to obtain the keywords in the voice information. As an embodiment, the determining the keywords in the collected voice information includes:
converting the voice information into text information;
and performing semantic recognition processing on the text information to obtain the keywords in the text information.
Optionally, the converting the voice information into text information may include: converting the voice information into text information through a speech recognition algorithm. The speech recognition algorithm may include a deep neural network model, a CTC (connectionist temporal classification) algorithm, and the like, which are not described herein again.
In some embodiments, the searching for the target client corresponding to the keyword from the plurality of candidate clients may include:
determining a system keyword of each candidate client;
searching a target system keyword matched with the keyword from the system keywords of the candidate clients;
and determining the candidate client corresponding to the target system keyword as the target client.
The system keywords of each candidate client may include a plurality of system keywords. In this embodiment, there are a plurality of candidate clients; in order to distinguish different candidate clients, the system keywords of different candidate clients may be different, and each system keyword may be used to identify its corresponding candidate client.
As a possible implementation manner, the client names of the candidate clients may be used as system keywords, and the determining the system keywords of each of the candidate clients may include:
and determining the client names of the candidate clients as the system keywords of the candidate clients.
In addition, in some embodiments, the client name of each candidate client, developer information, system version information, and the like may also be used as system keywords; any word that can distinguish the candidate clients may serve as a system keyword.
As another possible implementation manner, the system keyword of each of the candidate clients may be determined by:
determining at least one candidate word;
sequentially determining candidate clients corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the candidate clients respectively;
and confirming that the target candidate word corresponding to any candidate client is the system keyword of the candidate client, and obtaining the respective system keywords of the candidate clients.
The candidate clients in the embodiment of the present application and the candidate servers in the above embodiments have a corresponding relationship, and any one of the candidate clients may have a candidate server corresponding thereto, and both of the candidate clients and the candidate server form a system for providing an interactive service for a user. And the candidate server provides services such as data interaction, data support, data query and the like for the corresponding candidate client. The system keywords of the candidate client may be the same as the system keywords of the corresponding candidate server, that is, the server and the client in a system providing interactive services for the user may use the same system keywords.
It should be noted that, in the embodiment of the present application, the process and steps of determining at least one candidate word for the candidate clients and matching it with the corresponding system keywords are the same as the process and steps of determining at least one candidate word for the candidate servers and matching it with the corresponding system keywords in the above embodiment, and are not described herein again.
In some embodiments, the sequentially determining candidate clients corresponding to the at least one candidate word, and obtaining target candidate words corresponding to the candidate clients may include:
and according to the functional attributes respectively corresponding to the candidate clients, sequentially determining the candidate clients respectively corresponding to the at least one candidate word, and obtaining target candidate words respectively corresponding to the candidate clients.
Further, optionally, the sequentially determining, according to the function attributes respectively corresponding to the multiple candidate clients, the candidate clients respectively corresponding to the at least one candidate word, and obtaining the target candidate words respectively corresponding to the multiple candidate clients may include:
and determining a target candidate word matched with the functional attribute from the at least one candidate word aiming at the functional attribute of any candidate client so as to obtain the target candidate words respectively corresponding to the plurality of candidate clients.
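The matching step above — picking, for each candidate client, the candidate words whose functional attribute matches that client's — can be sketched as follows. This is a minimal illustrative sketch; the function names, the attribute vocabulary, and the dictionary representation are assumptions, not taken from the patent.

```python
# Hypothetical sketch: assign each candidate client the candidate words whose
# functional attribute matches the client's, yielding that client's target
# candidate words (which may then serve as its system keywords).

def match_candidates_by_function(candidate_words, client_functions):
    """candidate_words: dict mapping candidate word -> functional attribute.
    client_functions: dict mapping client name -> functional attribute.
    Returns dict mapping client name -> list of matching target candidate words."""
    keywords = {}
    for client, function in client_functions.items():
        keywords[client] = [w for w, attr in candidate_words.items()
                            if attr == function]
    return keywords

clients = {"music_app": "music", "weather_app": "weather"}
words = {"play a song": "music", "song list": "music", "today's forecast": "weather"}
system_keywords = match_candidates_by_function(words, clients)
```

A candidate word matching no client's functional attribute is simply dropped; a real system would likely also handle words matching several clients.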
As another possible implementation manner, the system keyword of each of the candidate clients is determined by:
and detecting the system keywords set by the user aiming at any candidate client to obtain the respective system keywords of the candidate clients.
In some embodiments, after determining the keywords in the collected voice information, the method may further include:
determining voice attribute information of the voice information;
the obtaining of the recommended content determined by the target client based on the voice information to output the recommended content to the user includes:
acquiring recommended content determined by the target client based on the voice information;
and outputting the recommended content based on the voice attribute information.
As an embodiment, the method may further include:
determining user attribute information of the user based on voice attribute information of the voice information;
the plurality of candidate clients may be determined by:
and determining a plurality of candidate clients having association relation with the user attribute information.
It should be noted that, some steps in the embodiment of the present application are the same as those in the foregoing embodiment, and are not described herein again.
Because users in different regions may use different tones, language families, and grammars, the method may further include, as another embodiment:
and determining voice attribute information of the voice information.
The outputting the recommended content for the user includes:
and outputting the recommended content based on the voice attribute information.
Optionally, the voice attribute information of the voice information may include information such as the language, dialect, language family, grammar, and/or tone used in the voice information. The language may refer to the language type used when the user utters the voice, for example, Chinese, English, Spanish, or another language type. The dialect may refer to the specific regional variety of the speech uttered by the user, for example, Cantonese or Mandarin. A language family is a grouping from historical-comparative linguistics, dividing languages according to their genetic relationships. Grammar refers to word classes, inflection, and other structural means of a language, together with the function and relationship of words in sentences. The tone may refer to information such as rising, falling, and level intonation in the voice information; the user's emotion can be judged from the intonation, for example, the specific emotion with which the user utters the voice.
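One possible container for the voice attribute information just listed is sketched below. The field names and example values are illustrative assumptions only; the patent does not prescribe a data structure.

```python
# Illustrative sketch: a container for voice attribute information
# (language, dialect, language family, grammar, tone). Field names are
# assumptions for illustration, not part of the patent.

from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceAttributes:
    language: Optional[str] = None         # e.g. "Chinese", "English"
    dialect: Optional[str] = None          # e.g. "Cantonese", "Mandarin"
    language_family: Optional[str] = None  # historical-comparative grouping
    grammar: Optional[str] = None          # structural features detected
    tone: Optional[str] = None             # e.g. "rising", "falling", "level"

attrs = VoiceAttributes(language="Chinese", dialect="Cantonese", tone="rising")
```

Fields left at `None` simply mean that attribute was not extracted from the voice information.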
In some embodiments, the outputting the recommended content based on the voice attribute information may include:
converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
Optionally, the converting the recommended content into the first recommended voice corresponding to the voice attribute information may include: and generating a first recommended voice corresponding to the recommended content by using the voice attribute information as a sound generation parameter.
When the recommended content is converted into the first recommended voice, the conversion may be performed according to information such as the language, dialect, language family, grammar, and/or cognate words in the voice attribute information. For example, when the voice attribute information of the voice information indicates Chinese and Cantonese, the recommended content may be converted into a first recommended voice that uses Chinese as the base language with Cantonese grammar and vocabulary. For another example, if the tones in the voice attribute information include many rising tones, the user's emotion may not be stable enough, and therefore the first recommended voice may be set to a level or falling tone.
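The tone-selection rule in the last example can be sketched as follows. This is a minimal sketch under stated assumptions: the threshold, parameter names, and the stand-in for the TTS call are all hypothetical, since the patent specifies only the behavior, not an implementation.

```python
# Hedged sketch: if the user's speech contains many rising tones (possibly
# indicating an agitated emotional state), render the recommended voice in a
# level tone instead of mirroring the user's intonation. The threshold value
# and function names are illustrative assumptions.

def select_output_tone(user_tones, rising_threshold=3):
    """Choose an output intonation from the tones detected in the user's speech."""
    rising = sum(1 for t in user_tones if t == "rising")
    if rising >= rising_threshold:
        return "level"        # calm the interaction down
    return "match_user"       # otherwise mirror the user's intonation

def synthesize_first_recommendation(content, language, dialect, user_tones):
    """Stand-in for a TTS call: bundle content with voice attribute parameters."""
    params = {
        "language": language,  # e.g. base language "Chinese"
        "dialect": dialect,    # e.g. "Cantonese" grammar and vocabulary
        "tone": select_output_tone(user_tones),
    }
    # A real implementation would hand `content` and `params` to a TTS engine.
    return params

params = synthesize_first_recommendation(
    "Nearby restaurants", "Chinese", "Cantonese",
    ["rising", "rising", "rising"])
```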
In one possible design, the converting the recommended content into a first recommended voice corresponding to the voice attribute information includes:
and if the user is located in a self-service place, converting the recommended content into a first recommended voice corresponding to the voice attribute information.
In a self-service place, the electronic device may be a self-service terminal, and the user may interact with the self-service terminal by voice to complete self-service handling of the related products and services in that place.
In this embodiment of the application, the recommended content is converted into voice and output to the user, so that the user can quickly obtain the recommended content without performing additional operations, achieving seamless interaction and improving the application efficiency of the electronic device.
In order to improve the experience of voice interaction and improve the application range of the electronic device, as another embodiment, the method may further include:
determining gender attribute information of the voice information;
the obtaining of the recommended content determined by the target client based on the voice information to output the recommended content to the user may include:
and acquiring recommended content determined by the target client based on the voice information.
And outputting the recommended content for the user based on the gender attribute information.
The gender attribute information of the voice information may specifically refer to the gender attribute of the user who uttered the voice; for example, when the voice uttered by the user is determined to be a female voice, the gender attribute of the voice may be determined to be female. When the voice information of the user includes a male name, the gender attribute of the voice may be determined to be male.
As a possible implementation manner, the determining the gender attribute information of the voice information may include:
and determining gender attribute information of the voice information based on the characteristic information of the voice information.
As another possible implementation manner, the determining the gender attribute information of the voice information includes:
converting the voice information into character information;
performing semantic recognition processing on the text information to obtain name keywords in the text information;
and determining gender attribute information of the voice information by using the name keywords.
In some embodiments, the determining gender attribute information of the voice message using the name keyword comprises:
if the name key words are matched with the first class of names, determining the gender attribute information of the voice information as first attribute information;
and if the name key words are matched with the second type of names, determining that the gender attribute information of the voice information is second attribute information.
The first class of names may be a set of female names, and the second class of names may be a set of male names. By classifying names with different attributes to control the voice output, personalized output of recommended content can be achieved, providing multi-level applications of the electronic device and improving its application efficiency.
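The name-matching rule above can be sketched as follows. The name sets here are tiny illustrative samples and the attribute labels are placeholders; the patent only specifies that a first class (e.g. female names) maps to first attribute information and a second class (e.g. male names) maps to second attribute information.

```python
# Hedged sketch of the name-keyword gender rule: check the name keyword from
# the transcribed speech against a first class of names and a second class.
# The sets and return labels below are illustrative assumptions.

FIRST_CLASS_NAMES = {"Alice", "Mary"}   # stand-in for the female-name set
SECOND_CLASS_NAMES = {"Bob", "David"}   # stand-in for the male-name set

def gender_from_name_keyword(name_keyword):
    """Return the gender attribute information implied by a name keyword."""
    if name_keyword in FIRST_CLASS_NAMES:
        return "first_attribute"    # e.g. female
    if name_keyword in SECOND_CLASS_NAMES:
        return "second_attribute"   # e.g. male
    return None  # no match: gender cannot be inferred from the name alone
```

When no name keyword matches, a real system could fall back on the acoustic feature information of the voice, as described in the other implementation manner above.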
In one possible design, the outputting the recommended content for the user based on the gender attribute information includes:
generating a second recommended voice corresponding to the recommended content by taking gender attribute information as a sound generation parameter;
and outputting the second recommended voice.
When the gender attribute information is used as a sound generation parameter to generate the second recommended voice corresponding to the recommended content, the voice output tone of the second recommended voice may be controlled to match the gender attribute information. For example, when the gender attribute information is male attribute information, the second recommended voice may also be output with male attribute information.
It should be noted that, some steps of the embodiments of the present application have been described in detail in the foregoing embodiments, and are not described herein again.
As shown in fig. 8, which is a flowchart of another embodiment of a speech processing method provided in this embodiment of the present application, the method may include the following steps:
801: and collecting voice information of the user.
Some steps of the embodiments of the present application are the same as those of the embodiments described above, and are not described herein again.
802: and identifying key words in the voice information.
803: and searching a target client corresponding to the keyword from a plurality of candidate clients.
804: and acquiring recommended content determined by the target client based on the voice information.
805: and determining voice attribute information and gender attribute information of the voice information.
806: and generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as sound generation parameters.
807: and outputting the third recommended voice.
In this embodiment of the application, the electronic device can collect the voice information of a user and identify the keywords in it, so that the target client can be found among the multiple candidate clients by using the keywords, and the recommended content corresponding to the voice information can be obtained through the target client. Each candidate client can communicate with its corresponding candidate server, and the electronic device obtains background services of the multiple candidate servers by using the multiple candidate clients, thereby improving the application efficiency of the electronic device. In addition, the voice attribute information and the gender attribute information of the voice information can be determined and used as sound generation parameters to generate and output a third recommended voice corresponding to the recommended content, achieving personalized output of the recommended content and widening the application range of the electronic device.
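The seven steps of fig. 8 (801-807) can be sketched as a single pipeline. Every helper below is a hypothetical stand-in: the patent specifies the steps but not their implementations, so the keyword extraction, attribute analysis, and voice generation here are deliberately toy versions.

```python
# Illustrative pipeline for steps 801-807. All helpers are stand-ins;
# names and data shapes are assumptions, not taken from the patent.

def process_voice(voice_info, candidate_clients):
    """Steps 801-807: collect, recognize, route, recommend, and speak back."""
    keyword = recognize_keyword(voice_info)                      # 802
    target = next(c for c in candidate_clients                   # 803
                  if keyword in c["system_keywords"])
    content = target["recommend"](voice_info)                    # 804
    voice_attrs, gender_attrs = analyze_attributes(voice_info)   # 805
    return generate_voice(content, voice_attrs, gender_attrs)    # 806-807

def recognize_keyword(voice_info):
    return voice_info["text"].split()[0]   # toy keyword extraction

def analyze_attributes(voice_info):
    # Toy stand-in for determining voice and gender attribute information.
    return voice_info.get("language", "Chinese"), voice_info.get("gender", "female")

def generate_voice(content, voice_attrs, gender_attrs):
    # Stand-in for TTS: both attribute kinds act as sound generation parameters.
    return {"content": content, "language": voice_attrs, "voice": gender_attrs}

clients = [{"system_keywords": {"music"}, "recommend": lambda v: "a playlist"}]
out = process_voice({"text": "music please", "gender": "male"}, clients)
```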
As shown in fig. 9, a schematic structural diagram of an embodiment of a speech processing apparatus provided in this embodiment of the present application, the apparatus may include:
the first determining module 901: the method is used for determining keywords in the collected voice information.
The first lookup module 902: and the server is used for searching a target server corresponding to the keyword from a plurality of candidate servers.
The first processing module 903: and the recommendation server is used for acquiring the recommendation content corresponding to the voice information fed back by the target server so as to output the recommendation content for the user.
In the embodiment of the application, after the acquired voice information is acquired, the keywords in the voice information are identified, the keywords can be used for searching the target server from a plurality of candidate servers, the selection of the candidate servers is realized, the recommended content provided by the target server is directly acquired, the switching process does not need user operation, and the application efficiency is improved. The electronic equipment can simultaneously correspond to a plurality of candidate servers, so that the comprehensive application of the multi-voice system is realized, and the application efficiency of the electronic equipment is improved.
In some embodiments, the second determining module may include:
the text conversion unit is used for converting the voice information into text information;
and the word processing unit is used for carrying out semantic recognition processing on the character information to obtain key words in the character information.
As an embodiment, the first lookup module may include:
the first determining unit is used for determining the system keywords of each candidate server;
and the word matching unit is used for searching a target system keyword matched with the keyword from the system keywords of the candidate service terminals.
And the target determining unit is used for determining the candidate server corresponding to the target system keyword as the target server.
In some embodiments, the first determining unit may include:
the first determining subunit is configured to determine client names of the candidate clients corresponding to the multiple candidate servers respectively;
and the first obtaining subunit is configured to use the client names corresponding to the multiple candidate servers as system keywords of the multiple candidate servers.
In some embodiments, the first determining unit may include:
the second determining subunit is used for determining at least one candidate word;
the second obtaining subunit is configured to sequentially determine candidate servers corresponding to the at least one candidate word, and obtain target candidate words corresponding to the candidate servers;
and the third obtaining subunit is configured to determine that a target candidate word corresponding to any one of the candidate servers is a system keyword of the candidate server, and obtain the system keyword of each of the candidate servers.
As a possible implementation manner, the second obtaining subunit may include:
and the word matching module is used for sequentially determining the candidate service ends corresponding to the at least one candidate word according to the functional attributes corresponding to the candidate service ends respectively, and obtaining the target candidate words corresponding to the candidate service ends respectively.
Further, optionally, the term matching module may include:
and the word matching unit is used for determining a target candidate word matched with the functional attribute from the at least one candidate word aiming at the functional attribute of any candidate server so as to obtain the target candidate words respectively corresponding to the candidate servers.
As another possible implementation manner, the second determining subunit may include:
the first acquisition module is used for acquiring historical keywords corresponding to the historical voice information;
the first selection module is used for determining target historical keywords meeting selection conditions from the historical keywords;
a first target module, configured to determine that the target history keyword is the at least one candidate word.
In some embodiments, the first selection module may include:
a first statistical unit for determining the occurrence number of each history keyword;
and the word selection unit is used for determining the historical keywords with the occurrence times larger than the frequency threshold as the target historical keywords.
As another possible implementation manner, the second determining subunit may include:
and the first extraction module is used for extracting the voice attribute information in the voice information.
And the third determining module is used for determining the user attribute information of the user based on the voice attribute information.
And the candidate determining module is used for determining at least one candidate word which has an incidence relation with the user attribute information.
In some embodiments, the first determining unit may include:
the first detection subunit is configured to detect a system keyword set by the user for any candidate server, so as to obtain the system keyword of each of the multiple candidate servers.
As an embodiment, the apparatus may further include:
the fourth determining module is used for determining the voice attribute information of the voice information;
the first processing module comprises:
a content obtaining unit, configured to obtain recommended content corresponding to the voice information fed back by the target server;
a first output unit configured to output the recommended content based on the voice attribute information.
As a possible implementation manner, the first output unit may include:
a first conversion subunit, configured to convert the recommended content into a first recommended voice corresponding to the voice attribute information;
and the first output subunit is used for outputting the first recommended voice.
Further, optionally, the first converting subunit may be specifically configured to:
and if the user is located in a self-service place, converting the recommended content into a first recommended voice corresponding to the voice attribute information.
In some embodiments, the apparatus may further comprise:
and the fifth determining module is used for determining the user attribute information of the user based on the voice attribute information of the voice information.
And the sixth determining module is used for determining a plurality of candidate servers which have an association relation with the user attribute information.
As an embodiment, the apparatus may further include:
a gender determination module, configured to determine gender attribute information of the voice message;
the first processing module comprises:
a third content obtaining unit, configured to obtain recommended content corresponding to the voice information fed back by the target server;
and the second output unit is used for outputting the recommended content for the user based on the gender attribute information.
As a possible implementation manner, the module for determining gender may include:
and the second determining unit is used for determining the gender attribute information of the voice information based on the characteristic information of the voice information.
As another possible implementation manner, the module for determining gender may include:
a text conversion unit, configured to convert the voice information into text information;
a second name extraction unit, configured to perform semantic recognition processing on the text information to obtain a name keyword in the text information;
and the third determining unit is used for determining the gender attribute information of the voice information by using the name key words.
Further, optionally, the third determining unit may include:
the first matching subunit is used for determining the gender attribute information of the voice information as first attribute information if the name keyword is matched with a first class of names;
and the second matching subunit is used for determining the gender attribute information of the voice information as second attribute information if the name keyword is matched with a second type of name.
As another possible implementation manner, the second output unit includes:
the first generation subunit is used for generating a second recommended voice corresponding to the recommended content by taking the gender attribute information as a sound generation parameter;
and the second output subunit is used for outputting the second recommended voice.
As an embodiment, the first determining module may include:
the first acquisition unit is used for acquiring the voice information of the user;
and the fourth determining unit is used for determining the keywords in the acquired voice information.
As yet another embodiment, the first determining module may include:
the first acquisition unit is used for acquiring voice information sent by the electronic equipment; the voice information is acquired by the electronic equipment;
and the fifth determining unit is used for determining the keywords in the received and obtained voice information.
In some possible designs, the apparatus may further include:
the first sending module is used for sending the voice information to the target server; and the voice information is used for the target server to determine the corresponding recommended content.
The second output unit may include:
a second generation subunit: and generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as sound generation parameters.
A voice output subunit: and outputting the third recommended voice.
The speech processing apparatus shown in fig. 9 can execute the speech processing method described in the embodiments shown in fig. 1 to fig. 4; the implementation principle and the technical effect are similar and are not described again. The specific manner of operations performed by each module, unit, and subunit of the speech processing apparatus in the above embodiments has been described in detail in the embodiments related to the method and will not be detailed here.
The speech processing apparatus shown in fig. 9 may be configured as a speech processing device. As shown in fig. 10, a schematic structural diagram of an embodiment of a speech processing device provided in the embodiment of the present application includes: a storage component 1001 and a processing component 1002; the storage component 1001 stores one or more computer instructions that are invoked by the processing component 1002;
the processing component 1002 is configured to:
determining keywords in the acquired voice information; searching a target server corresponding to the keyword from a plurality of candidate servers; and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content for the user.
In the embodiment of the application, after the acquired voice information is acquired, the keywords in the voice information are identified, the keywords can be used for searching the target server from a plurality of candidate servers, the selection of the candidate servers is realized, the recommended content provided by the target server is directly acquired, the switching process does not need user operation, and the application efficiency is improved. The electronic equipment can simultaneously correspond to a plurality of candidate servers, so that the comprehensive application of the multi-voice system is realized, and the application efficiency of the electronic equipment is improved.
As an embodiment, the determining, by the processing component, the keywords in the collected voice information may specifically be: converting the voice information into character information; and carrying out semantic recognition processing on the character information to obtain key words in the character information.
As a possible implementation manner, the searching, by the processing component, a target server corresponding to the keyword from the multiple candidate servers may specifically be:
determining respective system keywords of the candidate servers;
searching a target system keyword corresponding to the keyword from the system keywords of the candidate servers;
and determining the candidate server corresponding to the target system keyword as the target server.
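The three-step lookup above can be sketched as follows. This is a minimal sketch; the server names, the dictionary representation of each server's system keywords, and the exact-match rule are illustrative assumptions.

```python
# Hedged sketch of the target-server lookup: match the recognized keyword
# against each candidate server's system keywords and return the owner.

def find_target_server(keyword, candidate_servers):
    """candidate_servers: dict mapping server name -> set of system keywords.
    Returns the name of the candidate server whose system keywords contain
    the recognized keyword, or None if no candidate server matches."""
    for server, system_keywords in candidate_servers.items():
        if keyword in system_keywords:
            return server
    return None

servers = {
    "shopping_server": {"order", "cart"},
    "navigation_server": {"route", "map"},
}
target = find_target_server("route", servers)
```

A real system might use fuzzy or semantic matching rather than exact set membership, per the semantic recognition processing described earlier.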
In some embodiments, the determining, by the processing component, the system keyword of each of the candidate servers may specifically be:
determining client names of the candidate clients corresponding to the candidate servers respectively;
and taking the client names corresponding to the candidate servers as the system keywords of the candidate servers.
As an example, the processing component may determine the system keyword for each of the plurality of candidate servers by:
determining at least one candidate word;
sequentially determining candidate service ends corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the candidate service ends respectively;
and determining a target candidate word corresponding to any candidate server as the system keyword of the candidate server, and obtaining the system keyword of each candidate server.
As a possible implementation manner, the sequentially determining, by the processing component, candidate service ends corresponding to the at least one candidate word, and obtaining target candidate words corresponding to the candidate service ends may specifically be:
and sequentially determining the candidate service ends corresponding to the at least one candidate word according to the functional attributes corresponding to the candidate service ends respectively, and obtaining target candidate words corresponding to the candidate service ends respectively.
Further, optionally, the determining, by the processing component, the candidate service ends respectively corresponding to the at least one candidate word in sequence according to the functional attributes respectively corresponding to the candidate service ends, and obtaining the target candidate words respectively corresponding to the candidate service ends may specifically be:
and determining a target candidate word matched with the functional attribute from the at least one candidate word aiming at the functional attribute of any candidate server so as to obtain the target candidate words respectively corresponding to the candidate servers.
As yet another possible implementation, the processing component may determine the at least one candidate word by:
acquiring historical keywords corresponding to historical voice information;
determining target historical keywords meeting selection conditions from the historical keywords;
and determining the target historical keyword as the at least one candidate word.
Further, optionally, the step of determining, by the processing component, the target history keyword satisfying the selection condition from the history keywords may specifically be:
determining the occurrence frequency of each historical keyword;
and determining the history keywords with the occurrence times larger than the time threshold value as target history keywords.
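The two steps above — counting each history keyword's occurrences and keeping those exceeding the threshold — can be sketched with `collections.Counter`. The threshold value here is an assumption; the patent leaves it unspecified.

```python
# Sketch of selecting target history keywords as candidate words:
# count occurrences, then keep keywords over the count threshold.

from collections import Counter

def select_candidate_words(history_keywords, count_threshold=2):
    """Keep history keywords whose occurrence count exceeds the threshold."""
    counts = Counter(history_keywords)
    return {kw for kw, n in counts.items() if n > count_threshold}

history = ["music", "music", "music", "weather", "news", "news"]
candidates = select_candidate_words(history)
```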
In some embodiments, the processing component determines the at least one candidate word by:
extracting voice attribute information in the voice information; determining user attribute information of the user based on the voice attribute information; and determining at least one candidate word having an association relation with the user attribute information.
As another possible implementation manner, the processing component may determine the system keyword of each of the candidate servers by:
and detecting system keywords set by the user aiming at any candidate server to obtain the respective system keywords of the multiple candidate servers.
As an embodiment, the processing component may be further to:
determining voice attribute information of the voice information;
the obtaining of the recommended content corresponding to the voice information fed back by the target server to output the recommended content to the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content based on the voice attribute information.
As a possible implementation manner, the outputting, by the processing component, the recommended content based on the voice attribute information may specifically be:
converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
In some embodiments, the converting, by the processing component, the recommended content into the first recommended voice corresponding to the voice attribute information may specifically be:
and if the user is located in a self-service place, converting the recommended content into a first recommended voice corresponding to the voice attribute information.
As another possible implementation, the processing component may be further configured to:
determining user attribute information of the user based on voice attribute information of the voice information;
the processing component may determine a plurality of candidate servers by:
and determining a plurality of candidate servers having association relation with the user attribute information.
As yet another embodiment, the processing component may be further to:
determining gender attribute information of the voice information;
the obtaining of the recommended content corresponding to the voice information fed back by the target server to output the recommended content to the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content for the user based on the gender attribute information.
As a possible implementation manner, the determining, by the processing component, the gender attribute information of the voice information may specifically be:
and determining gender attribute information of the voice information based on the characteristic information of the voice information.
As another possible implementation manner, the determining, by the processing component, the gender attribute information of the voice information may specifically be:
converting the voice information into character information;
performing semantic recognition processing on the text information to obtain name keywords in the text information;
and determining gender attribute information of the voice information by using the name keywords.
In some embodiments, the determining, by the processing component, the gender attribute information of the voice information by using the name keyword may specifically be:
if the name key words are matched with the first class of names, determining the gender attribute information of the voice information as first attribute information;
and if the name key words are matched with the second type of names, determining that the gender attribute information of the voice information is second attribute information.
As a possible implementation manner, the outputting, by the processing component, the recommended content for the user based on the gender attribute information may specifically be:
generating a second recommended voice corresponding to the recommended content by taking the gender attribute information as a sound generation parameter;
and outputting the second recommended voice.
As an embodiment, the processing component may be further to:
and determining voice attribute information and gender attribute information of the voice information.
The step of outputting, by the processing component, the recommended content to the user may specifically be:
and generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as sound generation parameters; and outputting the third recommended voice.
As an embodiment, the determining, by the processing component, the keywords in the collected voice information may specifically be:
collecting voice information of the user;
and determining keywords in the acquired voice information.
As another embodiment, the determining, by the processing component, the keywords in the collected voice information may specifically be:
acquiring voice information sent by the electronic equipment, where the voice information is collected by the electronic equipment;
and determining the keywords in the received voice information.
In some embodiments, the processing component may be further operative to:
and sending the voice information to the target server.
And the voice information is used for the target server to determine the corresponding recommended content.
The speech processing device shown in fig. 10 may execute the speech processing method described in the embodiments shown in fig. 1 to fig. 4, and the implementation principle and the technical effect are not described again. The specific manner in which the processing components of the speech processing device in the above-described embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
As shown in fig. 11, a schematic structural diagram of another embodiment of a speech processing apparatus provided in the embodiment of the present application, the apparatus may include:
the second determination module 1101: the method is used for determining keywords in the collected voice information.
The second lookup module 1102: the method is used for searching the target client corresponding to the keyword from the candidate clients.
The second processing module 1103: the voice information processing device is used for obtaining the recommended content determined by the target client based on the voice information so as to output the recommended content for the user.
In the embodiment of the present invention, by integrating multiple candidate clients, after the voice information of a user is acquired, a keyword in the voice information may be identified, so that a target client corresponding to the keyword may be searched for among the multiple candidate clients, and the recommended content determined by the target client based on the voice information may be obtained, so that the recommended content may be output to the user. The electronic equipment integrates multiple candidate clients to realize the comprehensive application of the candidate clients, so as to obtain the voice services respectively provided by the candidate clients. Each candidate client can communicate with its corresponding candidate server, and the electronic equipment obtains the background services of the multiple candidate servers by using the multiple candidate clients, thereby improving the application efficiency of the electronic equipment.
As one embodiment, the second determining module includes:
the first conversion unit is used for converting the voice information into text information;
and the first processing unit is used for performing semantic recognition processing on the text information to obtain keywords in the text information.
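The two units above form a simple pipeline: speech recognition turns the voice information into text, then semantic recognition extracts keywords. A toy Python sketch, in which `recognize` stands in for a real speech recognizer and a stop-word filter stands in for real semantic recognition (both are assumptions for illustration):

```python
# Toy pipeline sketch: `recognize` is a stand-in for a real ASR component,
# and the stop-word filter is a trivial stand-in for semantic recognition.
def extract_keywords(voice_info, recognize,
                     stop_words=frozenset({"please", "the", "a"})):
    text = recognize(voice_info)    # voice information -> text information
    tokens = text.lower().split()   # crude tokenization
    return [t for t in tokens if t not in stop_words]
```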
As one embodiment, the second lookup module includes:
a sixth determining unit, configured to determine a system keyword of each of the multiple candidate clients;
the first matching unit is used for searching a target system keyword matched with the keyword from the system keywords of the candidate clients;
and the word determining unit is used for determining the candidate client corresponding to the target system keyword as the target client.
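The matching performed by these three units can be sketched as a lookup over a registry of per-client system keywords. The registry contents below are hypothetical examples, not clients named in the application:

```python
# Hypothetical registry of per-client system keywords.
SYSTEM_KEYWORDS = {
    "music_client":   {"music", "song"},
    "weather_client": {"weather", "forecast"},
}

def find_target_client(keywords):
    """Return the first candidate client whose system keywords match a keyword."""
    kw = set(keywords)
    for client, system_kws in SYSTEM_KEYWORDS.items():
        if system_kws & kw:  # a recognized keyword matches a system keyword
            return client
    return None
```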
In some embodiments, the sixth determination unit includes:
and the third determining subunit is used for determining the client names of the candidate clients as the system keywords of the candidate clients.
In some embodiments, the sixth determining unit may include:
the fourth determining subunit is used for determining at least one candidate word;
a fourth obtaining subunit, configured to sequentially determine candidate clients corresponding to the at least one candidate word, and obtain target candidate words corresponding to the multiple candidate clients;
and the fifth obtaining subunit is configured to confirm that the target candidate word corresponding to any one candidate client is the system keyword of the candidate client, and obtain the system keyword of each of the multiple candidate clients.
As a possible implementation manner, the fourth obtaining subunit may include:
and the first matching module is used for sequentially determining the candidate clients corresponding to the at least one candidate word according to the functional attributes corresponding to the candidate clients respectively, and obtaining the target candidate words corresponding to the candidate clients respectively.
As another possible implementation manner, the fourth determining subunit may include:
the second acquisition module is used for acquiring historical keywords corresponding to the historical voice information;
the second selection module is used for determining target historical keywords meeting selection conditions from the historical keywords;
and the second target module is used for determining the target historical keyword as the at least one candidate word.
In some embodiments, the second selection module may include:
the second statistical unit is used for determining the occurrence frequency of each history keyword;
and the word selection unit is used for determining the history keywords whose occurrence counts are greater than a frequency threshold as the target history keywords.
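The frequency-based selection just described can be sketched with a counter; the threshold value is an illustrative assumption:

```python
from collections import Counter

# Keep the history keywords whose occurrence count exceeds a frequency
# threshold; these become the candidate words. The threshold is illustrative.
def target_history_keywords(history_keywords, threshold=2):
    counts = Counter(history_keywords)
    return sorted(k for k, n in counts.items() if n > threshold)
```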
As another possible implementation manner, the fourth determining subunit may include:
the second extraction module is used for extracting the voice attribute information in the voice information;
a seventh determining module, configured to determine user attribute information of the user based on the voice attribute information;
and the candidate determining module is used for determining at least one candidate word having an association relation with the user attribute information.
Further, optionally, the first matching module may include:
and the second matching unit is used for determining, for the functional attribute of any candidate client, a target candidate word matched with the functional attribute from the at least one candidate word, so as to obtain the target candidate words respectively corresponding to the multiple candidate clients.
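This per-client matching against functional attributes can be sketched as an intersection between the candidate words and each client's attribute vocabulary. The vocabularies are hypothetical illustrations:

```python
# Hypothetical per-client functional-attribute vocabularies.
FUNCTIONAL_ATTRIBUTES = {
    "music_client": {"music", "song", "playlist"},
    "nav_client":   {"route", "navigate", "map"},
}

def target_candidate_words(candidate_words):
    """For each candidate client, select the candidate words matching its functional attribute."""
    words = set(candidate_words)
    return {client: sorted(words & attrs)
            for client, attrs in FUNCTIONAL_ATTRIBUTES.items()}
```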
As another possible implementation manner, the fourth determining subunit may include:
and the second detection subunit is used for detecting the system keyword set by the user for any candidate client, so as to obtain the respective system keywords of the multiple candidate clients.
As an embodiment, the apparatus may further include:
an eighth determining module, configured to determine voice attribute information of the voice information;
the second processing module may include:
a content obtaining unit, configured to obtain recommended content determined by the target client based on the voice information;
a third output unit configured to output the recommended content based on the voice attribute information.
As a possible implementation manner, the third output unit may include:
the second conversion subunit is used for converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and the fourth output subunit is used for outputting the first recommended voice.
Further, optionally, the second converting subunit may be specifically configured to:
and if the user is located in a self-service place, converting the recommended content into a first recommended voice corresponding to the voice attribute information.
In some embodiments, the apparatus may further comprise:
a ninth determining module, configured to determine user attribute information of the user based on the voice attribute information of the voice information.
And the tenth determining module is used for determining a plurality of candidate clients having an association relation with the user attribute information.
As an embodiment, the apparatus may further include:
a gender determining module, configured to determine gender attribute information of the voice information;
the second processing module may include:
a content obtaining unit, configured to obtain recommended content determined by the target client based on the voice information;
a fifth output unit, configured to output the recommended content for the user based on the gender attribute information.
As a possible implementation manner, the gender determination module may include:
and the gender determining unit is used for determining gender attribute information of the voice information based on the characteristic information of the voice information.
As another possible implementation manner, the gender determination module may include:
a text conversion unit, configured to convert the voice information into text information;
a second name extraction unit, configured to perform semantic recognition processing on the text information to obtain a name keyword in the text information;
a tenth determining unit configured to determine gender attribute information of the voice information using the name keyword.
Further, optionally, the tenth determining unit may include:
the third matching subunit is used for determining the gender attribute information of the voice information as first attribute information if the name keyword is matched with the first class of names;
and the fourth matching subunit is used for determining the gender attribute information of the voice information as the second attribute information if the name keyword is matched with the second type of names.
As another possible implementation manner, the fifth output unit may include:
a third generating subunit, configured to generate a second recommended voice corresponding to the recommended content by using the gender attribute information as a sound generating parameter;
and the third output subunit is used for outputting the second recommended voice.
As an embodiment, the second determining module may include:
the second acquisition unit is used for acquiring the voice information of the user;
and the eleventh determining unit is used for determining the keywords in the acquired voice information.
As yet another embodiment, the second determining module may include:
the second acquisition unit is used for acquiring the voice information sent by the electronic equipment, where the voice information is collected by the electronic equipment;
and the twelfth determining unit is used for determining the keywords in the received voice information.
In some possible designs, the apparatus may further include:
and the second sending module is used for sending the voice information to the target server.
And the voice information is used for the target server to determine the corresponding recommended content.
In some embodiments, the fourth output unit may include:
a fourth generating subunit, configured to generate a second recommended voice corresponding to the recommended content, with the gender attribute information as a sound generation parameter;
and the fourth output subunit is used for outputting the second recommended voice.
In some embodiments, the fourth output unit may further include:
a fifth generating subunit, configured to generate a third recommended voice corresponding to the recommended content, with the voice attribute information and the gender attribute information as sound generation parameters;
and the fifth output subunit is used for outputting the third recommended voice.
The speech processing apparatus shown in fig. 11 can execute the speech processing method described in the embodiments shown in fig. 7 to fig. 8, and the implementation principle and the technical effect are not described again. The specific manner of operations performed by each module, unit, and sub-unit in the speech processing apparatus in the above embodiments has been described in detail in the embodiments related to the method, and will not be described in detail here.
The speech processing apparatus shown in fig. 11 may be configured as a speech processing device, and as shown in fig. 12, a schematic structural diagram of an embodiment of a speech processing device provided in the embodiment of the present application includes: a storage component 1201 and a processing component 1202; the storage component 1201 stores one or more computer instructions that are invoked by the processing component 1202;
the processing component 1202 is configured to:
determining keywords in the acquired voice information; searching a target client corresponding to the keyword from the candidate clients; and acquiring recommended content determined by the target client based on the voice information, and outputting the recommended content to the user.
The electronic device may integrate multiple voice clients. The electronic equipment can provide voice interaction service for users through a plurality of voice clients.
In the embodiment of the present invention, by integrating multiple candidate clients, after the voice information of a user is acquired, a keyword in the voice information may be identified, so that a target client corresponding to the keyword may be searched for among the multiple candidate clients, and the recommended content determined by the target client based on the voice information may be obtained, so that the recommended content may be output to the user. The electronic equipment integrates multiple candidate clients to realize the comprehensive application of the candidate clients, so as to obtain the voice services respectively provided by the candidate clients. Each candidate client can communicate with its corresponding candidate server, and the electronic equipment obtains the background services of the multiple candidate servers by using the multiple candidate clients, thereby improving the application efficiency of the electronic equipment.
As an embodiment, the determining, by the processing component, the keywords in the collected voice information may specifically be:
converting the voice information into text information;
and performing semantic recognition processing on the text information to obtain keywords in the text information.
As a possible implementation manner, the searching, by the processing component, a target client corresponding to the keyword from the candidate clients may specifically be:
determining a system keyword of each candidate client;
searching a target system keyword matched with the keyword from the system keywords of the candidate clients;
and determining the candidate client corresponding to the target system keyword as the target client.
In some embodiments, the determining, by the processing component, the system keyword of each of the candidate clients may specifically be:
and determining the client names of the candidate clients as the system keywords of the candidate clients.
In some embodiments, the processing component may determine the system keyword for each of the plurality of candidate clients by:
determining at least one candidate word;
sequentially determining candidate clients corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the candidate clients respectively;
and confirming that the target candidate word corresponding to any candidate client is the system keyword of the candidate client, and obtaining the respective system keywords of the candidate clients.
As a possible implementation manner, the sequentially determining, by the processing component, candidate clients corresponding to the at least one candidate word, and obtaining target candidate words corresponding to the multiple candidate clients may specifically be:
and according to the functional attributes respectively corresponding to the candidate clients, sequentially determining the candidate clients respectively corresponding to the at least one candidate word, and obtaining target candidate words respectively corresponding to the candidate clients.
Further, optionally, the determining, by the processing component, the candidate clients corresponding to the at least one candidate word in sequence according to the functional attributes corresponding to the candidate clients, respectively, and obtaining the target candidate words corresponding to the candidate clients may specifically be:
and determining a target candidate word matched with the functional attribute from the at least one candidate word aiming at the functional attribute of any candidate client so as to obtain the target candidate words respectively corresponding to the plurality of candidate clients.
In some embodiments, the processing component may determine the system keyword for each of the plurality of candidate clients by:
and detecting the system keywords set by the user aiming at any candidate client to obtain the respective system keywords of the candidate clients.
As an embodiment, the processing component may be further configured to:
determining voice attribute information of the voice information;
the processing component obtains the recommended content determined by the target client based on the voice information, and outputting the recommended content to the user may specifically be:
acquiring recommended content determined by the target client based on the voice information;
and outputting the recommended content based on the voice attribute information.
As a possible implementation, the processing component may be further configured to:
determining user attribute information of the user based on voice attribute information of the voice information;
the processing component may determine a plurality of candidate clients by:
and determining a plurality of candidate clients having association relation with the user attribute information.
As a possible implementation manner, the outputting, by the processing component, the recommended content based on the voice attribute information may specifically be:
converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
In some embodiments, the converting, by the processing component, the recommended content into the first recommended voice corresponding to the voice attribute information may specifically be:
and if the user is located in a self-service place, converting the recommended content into a first recommended voice corresponding to the voice attribute information.
As yet another embodiment, the processing component may be further configured to:
determining gender attribute information of the voice information;
the processing component obtains the recommended content determined by the target client based on the voice information, and outputting the recommended content to the user may specifically be:
acquiring recommended content determined by the target client based on the voice information;
and outputting the recommended content for the user based on the gender attribute information.
In some embodiments, the determining, by the processing component, the gender attribute information of the voice information may specifically be:
and determining gender attribute information of the voice information based on the characteristic information of the voice information.
As a possible implementation manner, the determining, by the processing component, the gender attribute information of the voice information may specifically be:
converting the voice information into text information;
performing semantic recognition processing on the text information to obtain name keywords in the text information;
and determining gender attribute information of the voice information by using the name keywords.
Further, optionally, the determining, by the processing component, the gender attribute information of the voice information by using the name keyword may specifically be:
if the name keyword matches the first class of names, determining that the gender attribute information of the voice information is the first attribute information;
and if the name keyword matches the second class of names, determining that the gender attribute information of the voice information is the second attribute information.
In some embodiments, the outputting, by the processing component, the recommended content for the user based on the gender attribute information may specifically be:
generating a second recommended voice corresponding to the recommended content by taking the gender attribute information as a sound generation parameter;
and outputting the second recommended voice.
As an embodiment, the processing component may be further configured to:
and determining voice attribute information and gender attribute information of the voice information.
The step of outputting, by the processing component, the recommended content to the user may specifically be:
and generating a third recommended voice corresponding to the recommended content by taking the voice attribute information and the gender attribute information as sound generation parameters, and outputting the third recommended voice.
The speech processing device shown in fig. 12 may execute the speech processing method described in the embodiments shown in fig. 7 to fig. 8, and the implementation principle and the technical effect are not described again. The specific manner in which the processing components of the speech processing device in the above-described embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
As shown in fig. 13, a schematic structural diagram of another embodiment of a speech processing apparatus provided in the embodiment of the present application is configured in an electronic device, and the speech processing apparatus may include:
the voice acquisition module 1301: used for collecting the voice information of the user.
The voice sending module 1302: and the voice message is sent to the central server.
The keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; and the recommended content is fed back to the electronic equipment by the central server.
The third processing module 1303: configured to obtain the recommended content fed back by the central server, so as to output the recommended content to the user.
In the embodiment of the application, the voice information sent by the electronic equipment is obtained, and after the keyword in the voice information is determined, the target server corresponding to the keyword is searched for among a plurality of candidate servers, so that the recommended content corresponding to the voice information is obtained from the target server. The electronic equipment collects the voice information of the user and exchanges data with the central server, so that the electronic equipment can obtain the network services provided by the plurality of candidate servers while the processing and switching of those network services remain imperceptible to the user; the application range of the electronic equipment is thus expanded without affecting the user's experience, and the utilization rate of the electronic equipment is improved.
The speech processing apparatus shown in fig. 13 can execute the speech processing method shown in the embodiment shown in fig. 5, and the implementation principle and the technical effect are not described again. The specific manner of operations performed by each module, unit, and sub-unit in the speech processing apparatus in the above embodiments has been described in detail in the embodiments related to the method, and will not be described in detail here.
The speech processing apparatus shown in fig. 13 may be configured as an electronic device, and as shown in fig. 14, the electronic device according to an embodiment of the present application is configured as a schematic structural diagram, where the electronic device may include: storage component 1401 and processing component 1402; the storage component 1401 stores one or more computer instructions that are invoked by the processing component 1402;
the processing component 1402 can be configured to:
collecting voice information of a user; sending the voice information to a central server, wherein a keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; the recommended content is fed back to the electronic equipment by the central server side; and obtaining the recommended content fed back by the central server to output the recommended content for the user.
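The device-to-central-server flow above can be sketched end to end with in-process stand-ins for the networked services; `recognize` and the candidate-server registry are illustrative assumptions, not components named in the application:

```python
# End-to-end toy sketch of this embodiment: the electronic device forwards
# voice information to a central server, which determines the keyword,
# selects a target server among the candidates, and relays the recommended
# content back to the device.
CANDIDATE_SERVERS = {
    "music":   lambda voice: "recommended: playlist",
    "weather": lambda voice: "recommended: forecast",
}

def central_server(voice_info, recognize):
    keyword = recognize(voice_info)            # keyword determination
    target = CANDIDATE_SERVERS.get(keyword)    # target-server lookup
    return None if target is None else target(voice_info)

def electronic_device(voice_info, recognize):
    # The device collects voice, sends it on, and outputs what comes back.
    return central_server(voice_info, recognize)
```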
In the embodiment of the application, the voice information sent by the electronic equipment is obtained, and after the keyword in the voice information is determined, the target server corresponding to the keyword is searched for among a plurality of candidate servers, so that the recommended content corresponding to the voice information is obtained from the target server. The electronic equipment collects the voice information of the user and exchanges data with the central server, so that the electronic equipment can obtain the network services provided by the plurality of candidate servers while the processing and switching of those network services remain imperceptible to the user; the application range of the electronic equipment is thus expanded without affecting the user's experience, and the utilization rate of the electronic equipment is improved.
The speech processing device shown in fig. 14 may execute the speech processing method shown in the embodiment shown in fig. 5, and the implementation principle and the technical effect are not described again. The specific manner in which the processing components of the speech processing device in the above-described embodiments perform operations has been described in detail in the embodiments related to the method, and will not be described in detail here.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process executed by the processing component of the electronic device described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (41)

1. A method of speech processing, the method comprising:
determining keywords in the acquired voice information;
searching a target server corresponding to the keyword from a plurality of candidate servers;
and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content to the user.
2. The method of claim 1, wherein the determining keywords in the collected voice information comprises:
converting the voice information into character information;
and performing semantic recognition processing on the character information to obtain keywords in the character information.
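Claims 1–2 describe keyword determination as speech-to-text followed by semantic recognition. A minimal Python sketch of that step, where `transcribe` is a placeholder for any real ASR engine and the vocabulary `KNOWN_KEYWORDS` is an invented assumption for illustration, not part of the patent:

```python
# Sketch of the claim-2 pipeline: voice -> text -> keywords.
# KNOWN_KEYWORDS is an assumed domain vocabulary.

KNOWN_KEYWORDS = {"music", "weather", "shopping"}

def transcribe(voice_info: bytes) -> str:
    """Placeholder: a real implementation would call an ASR engine."""
    raise NotImplementedError("plug in a speech-to-text engine here")

def extract_keywords(text: str) -> list:
    """Toy 'semantic recognition': keep tokens present in the vocabulary."""
    return [tok.strip(".,?!").lower() for tok in text.split()
            if tok.strip(".,?!").lower() in KNOWN_KEYWORDS]
```

A production system would replace both the transcription stub and the vocabulary lookup with real ASR and NLU components; the sketch only shows the shape of the two-stage step.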
3. The method of claim 1, wherein the searching for the target server corresponding to the keyword from the plurality of candidate servers comprises:
determining respective system keywords of the candidate servers;
searching a target system keyword matched with the keyword from the system keywords of the candidate servers;
and determining the candidate server corresponding to the target system keyword as the target server.
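The lookup in claim 3 reduces to matching the extracted keyword against each candidate server's system keywords. A hedged sketch (the server identifiers and keyword sets in the usage below are invented for illustration):

```python
def find_target_server(keyword, system_keywords):
    """system_keywords maps a candidate-server id to its set of system
    keywords. Return the first server whose set contains the keyword,
    or None when no candidate matches."""
    for server, words in system_keywords.items():
        if keyword in words:
            return server
    return None
```

Since Python 3.7 dicts preserve insertion order, so "first match" here is deterministic; a real system might instead rank candidates when several servers share a keyword.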
4. The method of claim 3, wherein the determining the respective system keyword of the candidate servers comprises:
determining client names of the candidate clients corresponding to the candidate servers respectively;
and taking the client names corresponding to the candidate servers as the system keywords of the candidate servers.
5. The method of claim 3, wherein the system keyword of each of the candidate servers is determined by:
determining at least one candidate word;
sequentially determining candidate service ends corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the candidate service ends respectively;
and determining a target candidate word corresponding to any candidate server as the system keyword of the candidate server, and obtaining the system keyword of each candidate server.
6. The method of claim 5, wherein the sequentially determining candidate servers corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the candidate servers respectively comprises:
and sequentially determining the candidate service ends corresponding to the at least one candidate word according to the functional attributes corresponding to the candidate service ends respectively, and obtaining target candidate words corresponding to the candidate service ends respectively.
7. The method according to claim 6, wherein the determining, in order according to the functional attributes corresponding to the candidate servers, the candidate servers corresponding to the at least one candidate word, respectively, and the obtaining target candidate words corresponding to the candidate servers, respectively, comprises:
and determining a target candidate word matched with the functional attribute from the at least one candidate word aiming at the functional attribute of any candidate server so as to obtain the target candidate words respectively corresponding to the candidate servers.
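Claims 6–7 assign candidate words to servers by each server's functional attribute. One way to sketch this, representing a functional attribute as an assumed set of topic tags (a modeling choice not specified by the patent):

```python
def assign_target_words(candidate_words, server_attributes):
    """server_attributes maps a candidate-server id to the set of topics
    its functional attribute covers (an assumed representation).
    Returns server id -> list of target candidate words matching it."""
    return {server: [w for w in candidate_words if w in topics]
            for server, topics in server_attributes.items()}
```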
8. The method of claim 5, wherein the at least one candidate word is determined by:
acquiring historical keywords corresponding to historical voice information;
determining target historical keywords meeting selection conditions from the historical keywords;
and determining the target historical keyword as the at least one candidate word.
9. The method of claim 8, wherein the determining, from the history keywords, a target history keyword that satisfies a selection condition comprises:
determining the occurrence frequency of each historical keyword;
and determining the history keywords with the occurrence times larger than the time threshold value as target history keywords.
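The selection condition of claims 8–9 is a simple frequency threshold over historical keywords, which can be sketched directly with a counter:

```python
from collections import Counter

def select_candidate_words(history_keywords, threshold):
    """Keep historical keywords whose occurrence count exceeds the
    threshold, per claim 9. Returns the surviving words as a set."""
    counts = Counter(history_keywords)
    return {word for word, n in counts.items() if n > threshold}
```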
10. The method of claim 5, wherein the at least one candidate word is determined by:
extracting voice attribute information in the voice information;
determining user attribute information of the user based on the voice attribute information;
and determining at least one candidate word having an association relation with the user attribute information.
11. The method of claim 3, wherein the system keyword of each of the candidate servers is determined by:
and detecting system keywords set by the user aiming at any candidate server to obtain the respective system keywords of the multiple candidate servers.
12. The method of claim 1, further comprising:
determining voice attribute information of the voice information;
the obtaining of the recommended content corresponding to the voice information fed back by the target server to output the recommended content to the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content based on the voice attribute information.
13. The method of claim 12, wherein the outputting the recommended content based on the voice attribute information comprises:
converting the recommended content into a first recommended voice corresponding to the voice attribute information;
and outputting the first recommended voice.
14. The method of claim 13, wherein the converting the recommended content into the first recommended voice corresponding to the voice attribute information comprises:
and if the user is located in a self-service place, converting the recommended content into a first recommended voice corresponding to the voice attribute information.
15. The method of claim 12, further comprising:
determining user attribute information of the user based on voice attribute information of the voice information;
the candidate servers are determined by:
and determining a plurality of candidate servers having association relation with the user attribute information.
16. The method of claim 1, further comprising:
determining gender attribute information of the voice information;
the obtaining of the recommended content corresponding to the voice information fed back by the target server to output the recommended content to the user includes:
acquiring recommended content corresponding to the voice information fed back by the target server;
and outputting the recommended content for the user based on the gender attribute information.
17. The method of claim 16, wherein the determining gender attribute information for the voice information comprises:
and determining gender attribute information of the voice information based on the characteristic information of the voice information.
18. The method of claim 16, wherein the determining gender attribute information for the voice information comprises:
converting the voice information into character information;
performing semantic recognition processing on the character information to obtain name keywords in the character information;
and determining gender attribute information of the voice information by using the name keywords.
19. The method of claim 18, wherein determining gender attribute information of the voice message using the name keyword comprises:
if the name key words are matched with the first class of names, determining the gender attribute information of the voice information as first attribute information;
and if the name key words are matched with the second type of names, determining that the gender attribute information of the voice information is second attribute information.
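The two-way match of claim 19 maps a recognized name keyword to gender attribute information via two name lists. A sketch, where the name lists and attribute labels are invented examples (the patent only speaks of a "first class" and "second class" of names without specifying their contents):

```python
# Assumed sample name lists; a real system would use curated lists.
FIRST_CLASS_NAMES = {"alice", "mary"}
SECOND_CLASS_NAMES = {"bob", "john"}

def gender_attribute(name_keyword):
    """Map a name keyword to gender attribute information following
    claim 19; return None when neither list matches."""
    name = name_keyword.lower()
    if name in FIRST_CLASS_NAMES:
        return "first_attribute"
    if name in SECOND_CLASS_NAMES:
        return "second_attribute"
    return None
```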
20. The method of claim 16, wherein outputting the recommended content for the user based on the gender attribute information comprises:
generating a second recommended voice corresponding to the recommended content by taking the gender attribute information as a sound generation parameter;
and outputting the second recommended voice.
21. The method of claim 1, wherein the determining keywords in the collected voice information comprises:
collecting voice information of the user;
and determining keywords in the acquired voice information.
22. The method of claim 1, wherein the determining keywords in the collected voice information comprises:
acquiring voice information sent by an electronic device, wherein the voice information is collected by the electronic device;
and determining the keywords in the received voice information.
23. The method of claim 1, wherein after searching for the target server corresponding to the keyword from the plurality of candidate servers, further comprising:
sending the voice information to the target server; and the voice information is used for the target server to determine the corresponding recommended content.
24. A method of speech processing, the method comprising:
determining keywords in the acquired voice information;
searching a target client corresponding to the keyword from the candidate clients;
and acquiring recommended content determined by the target client based on the voice information, and outputting the recommended content to the user.
25. The method of claim 24, wherein determining keywords in the collected voice information comprises:
converting the voice information into character information;
and performing semantic recognition processing on the character information to obtain keywords in the character information.
26. The method of claim 24, wherein the searching for the target client corresponding to the keyword from the plurality of candidate clients comprises:
determining a system keyword of each candidate client;
searching a target system keyword matched with the keyword from the system keywords of the candidate clients;
and determining the candidate client corresponding to the target system keyword as the target client.
27. The method of claim 26, wherein determining the system keyword for each of the plurality of candidate clients comprises:
and determining the client names of the candidate clients as the system keywords of the candidate clients.
28. The method of claim 26, wherein the system keyword of each of the candidate clients is determined by:
determining at least one candidate word;
sequentially determining candidate clients corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the candidate clients respectively;
and confirming that the target candidate word corresponding to any candidate client is the system keyword of the candidate client, and obtaining the respective system keywords of the candidate clients.
29. The method of claim 28, wherein the sequentially determining candidate clients corresponding to the at least one candidate word respectively, and obtaining target candidate words corresponding to the candidate clients respectively comprises:
and according to the functional attributes respectively corresponding to the candidate clients, sequentially determining the candidate clients respectively corresponding to the at least one candidate word, and obtaining target candidate words respectively corresponding to the candidate clients.
30. The method of claim 29, wherein the determining, in order according to the functional attributes corresponding to the candidate clients, the candidate clients corresponding to the at least one candidate word, respectively, and the obtaining target candidate words corresponding to the candidate clients, respectively, comprises:
and determining a target candidate word matched with the functional attribute from the at least one candidate word aiming at the functional attribute of any candidate client so as to obtain the target candidate words respectively corresponding to the plurality of candidate clients.
31. The method of claim 26, wherein the system keyword of each of the candidate clients is determined by:
and detecting the system keywords set by the user aiming at any candidate client to obtain the respective system keywords of the candidate clients.
32. The method of claim 24, further comprising:
determining voice attribute information of the voice information;
the obtaining of the recommended content determined by the target client based on the voice information to output the recommended content to the user includes:
acquiring recommended content determined by the target client based on the voice information;
and outputting the recommended content based on the voice attribute information.
33. The method of claim 32, further comprising:
determining user attribute information of the user based on voice attribute information of the voice information;
the plurality of candidate clients is determined by:
and determining a plurality of candidate clients having association relation with the user attribute information.
34. The method of claim 24, further comprising:
determining gender attribute information of the voice information;
the obtaining of the recommended content determined by the target client based on the voice information to output the recommended content to the user includes:
acquiring recommended content determined by the target client based on the voice information;
and outputting the recommended content for the user based on the gender attribute information.
35. A speech processing method, applied to an electronic device, the method comprising:
collecting voice information of a user;
sending the voice information to a central server, wherein a keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; the recommended content is fed back to the electronic equipment by the central server side;
and obtaining the recommended content fed back by the central server to output the recommended content for the user.
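The central-server flow of claim 35 — transcribe the audio, route by keyword to a candidate server, and return that server's recommended content — can be sketched end to end. The three injected callables are assumptions standing in for real services:

```python
def central_server_flow(voice_info, asr, keyword_index, fetch_content):
    """End-to-end flow of claim 35, with assumed service stubs:
      asr(audio) -> transcript text,
      keyword_index: keyword -> candidate-server id,
      fetch_content(server, text) -> that server's recommended content.
    Returns the recommended content, or None when no server matches."""
    text = asr(voice_info)
    for token in text.lower().split():
        server = keyword_index.get(token)
        if server is not None:
            return fetch_content(server, text)
    return None
```

Injecting the ASR, routing table, and content fetcher as parameters keeps the sketch testable without any real speech or network stack.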
36. A speech processing apparatus, comprising:
the first determining module is used for determining keywords in the acquired voice information;
the first searching module is used for searching a target server corresponding to the keyword from a plurality of candidate servers;
and the first processing module is used for acquiring recommended content corresponding to the voice information fed back by the target server side and outputting the recommended content for a user.
37. A speech processing apparatus, comprising:
the second determining module is used for determining keywords in the acquired voice information;
the second searching module is used for searching a target client corresponding to the keyword from the candidate clients;
and the second processing module is used for acquiring recommended content determined by the target client based on the voice information so as to output the recommended content for the user.
38. A speech processing apparatus, configured to be provided in an electronic device, comprising:
the voice acquisition module is used for acquiring voice information of a user;
the voice sending module is used for sending the voice information to a central server, wherein a keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; the recommended content is fed back to the electronic equipment by the central server side;
and the third processing module is used for acquiring the recommended content fed back by the central server and outputting the recommended content to the user.
39. A speech processing device, comprising: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing component is to:
determining keywords in the acquired voice information; searching a target server corresponding to the keyword from a plurality of candidate servers; and obtaining recommended content corresponding to the voice information fed back by the target server, and outputting the recommended content for the user.
40. A speech processing device, comprising: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing component is to:
determining keywords in the acquired voice information; searching a target client corresponding to the keyword from the candidate clients; and acquiring recommended content determined by the target client based on the voice information, and outputting the recommended content to the user.
41. An electronic device, comprising: a storage component and a processing component; the storage component stores one or more computer instructions that are invoked by the processing component;
the processing component is to:
collecting voice information of a user; sending the voice information to a central server, wherein a keyword in the voice information is determined by the central server, and the keyword is used for searching a target server corresponding to the keyword from a plurality of candidate servers; the target server is used for feeding back recommended content corresponding to the voice information to the central server; the recommended content is fed back to the electronic equipment by the central server side; and obtaining the recommended content fed back by the central server to output the recommended content for the user.
CN201911371210.0A 2019-12-26 2019-12-26 Voice processing method and device, voice processing equipment and electronic equipment Active CN111354350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911371210.0A CN111354350B (en) 2019-12-26 2019-12-26 Voice processing method and device, voice processing equipment and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911371210.0A CN111354350B (en) 2019-12-26 2019-12-26 Voice processing method and device, voice processing equipment and electronic equipment

Publications (2)

Publication Number Publication Date
CN111354350A true CN111354350A (en) 2020-06-30
CN111354350B CN111354350B (en) 2024-04-05

Family

ID=71197007

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911371210.0A Active CN111354350B (en) 2019-12-26 2019-12-26 Voice processing method and device, voice processing equipment and electronic equipment

Country Status (1)

Country Link
CN (1) CN111354350B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010773A (en) * 2021-02-22 2021-06-22 东风小康汽车有限公司重庆分公司 Information pushing method and equipment
CN115599890A (en) * 2022-11-29 2023-01-13 深圳市人马互动科技有限公司(Cn) Product recommendation method and related device

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007199315A (en) * 2006-01-25 2007-08-09 Ntt Software Corp Content providing apparatus
CN105137789A (en) * 2015-08-28 2015-12-09 青岛海尔科技有限公司 Control method and device of intelligent IoT electrical appliances, and related devices
CN105654950A (en) * 2016-01-28 2016-06-08 百度在线网络技术(北京)有限公司 Self-adaptive voice feedback method and device
WO2017080400A1 (en) * 2015-11-13 2017-05-18 阿里巴巴集团控股有限公司 Information recommendation method and device
CN106847269A (en) * 2017-01-20 2017-06-13 浙江小尤鱼智能技术有限公司 The sound control method and device of a kind of intelligent domestic system
CN107170449A (en) * 2017-06-14 2017-09-15 上海雍敏信息科技有限公司 Intelligent domestic system and its control method
CN107948698A (en) * 2017-12-14 2018-04-20 深圳市雷鸟信息科技有限公司 Sound control method, system and the smart television of smart television
CN108304153A (en) * 2017-03-02 2018-07-20 腾讯科技(深圳)有限公司 Voice interactive method and device
WO2018157721A1 (en) * 2017-03-02 2018-09-07 腾讯科技(深圳)有限公司 Method for acquiring and providing information, device, system and storage medium
CN108540677A (en) * 2017-03-05 2018-09-14 北京智驾互联信息服务有限公司 Method of speech processing and system
CN109325097A (en) * 2018-07-13 2019-02-12 海信集团有限公司 A kind of voice guide method and device, electronic equipment, storage medium


Also Published As

Publication number Publication date
CN111354350B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
US8862615B1 (en) Systems and methods for providing information discovery and retrieval
CN105654950B (en) Adaptive voice feedback method and device
US11600259B2 (en) Voice synthesis method, apparatus, device and storage medium
CN102549653B (en) Speech translation system, first terminal device, speech recognition server device, translation server device, and speech synthesis server device
WO2020238209A1 (en) Audio processing method, system and related device
US11494434B2 (en) Systems and methods for managing voice queries using pronunciation information
JP2019061662A (en) Method and apparatus for extracting information
CN111586469B (en) Bullet screen display method and device and electronic equipment
US20170076716A1 (en) Voice recognition server and control method thereof
CN109582825B (en) Method and apparatus for generating information
CN107145509B (en) Information searching method and equipment thereof
US20210034662A1 (en) Systems and methods for managing voice queries using pronunciation information
CN111354350B (en) Voice processing method and device, voice processing equipment and electronic equipment
CN111639162A (en) Information interaction method and device, electronic equipment and storage medium
JP2019185737A (en) Search method and electronic device using the same
CN115273840A (en) Voice interaction device and voice interaction method
CN117610539A (en) Intention execution method, device, electronic equipment and storage medium
KR102196764B1 (en) Speaker classification apparatus and speaker identifying apparatus
US11410656B2 (en) Systems and methods for managing voice queries using pronunciation information
CN117332062A (en) Data processing method and related device
CN109065018B (en) Intelligent robot-oriented story data processing method and system
Kumala et al. Indonesian speech emotion recognition using cross-corpus method with the combination of MFCC and Teager energy features
CN114822506A (en) Message broadcasting method and device, mobile terminal and storage medium
CN110232911B (en) Singing following recognition method and device, storage medium and electronic equipment
CN112562733A (en) Media data processing method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant