CN107657017A - Method and apparatus for providing voice service - Google Patents

Method and apparatus for providing voice service

Info

Publication number
CN107657017A (application number CN201710882420.0A)
Authority
CN
China
Prior art keywords
information
tone
text
voice
response
Prior art date
Legal status
Granted
Application number
CN201710882420.0A
Other languages
Chinese (zh)
Other versions
CN107657017B (en)
Inventor
Xie Bo (谢波)
Current Assignee
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710882420.0A
Publication of CN107657017A
Application granted
Publication of CN107657017B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/3332: Query translation
    • G06F 16/3334: Selection or weighting of terms from queries, including natural language queries
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G06F 16/334: Query execution
    • G06F 16/3343: Query execution using phonetics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

This application discloses a method and apparatus for providing a voice service. One embodiment of the method comprises: acquiring a voice input signal; recognizing the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, to obtain corresponding tone input information and text input information, where the tone input information represents the tone type of the voice input signal; and performing a voice service data query based on the tone input information and the text input information, generating voice response information according to the query result. This embodiment achieves tone recognition that does not depend on modal particles, can detect the speaker's intention more accurately, and improves the precision of the voice service.

Description

Method and apparatus for providing voice service
Technical field
The present application relates to the field of computer technology, specifically to the field of voice technology, and more particularly to a method and apparatus for providing a voice service.
Background technology
Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. As a branch of computer science, it attempts to understand the essence of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Among these, speech recognition is an important direction in both computer science and the field of artificial intelligence.
The tone with which a speaker talks typically carries demand information such as the speaker's emotion. Existing speech recognition techniques identify the speaker's tone mainly through modal particles, and then infer the speaker's demand. However, this method of tone recognition has strong limitations. On the one hand, the same modal particle may correspond to different tones; for example, the particle appended to "真的" ("really") can express either exclamation or doubt. On the other hand, for speech that contains no modal particles, the speaker's tone cannot be recognized accurately, so the speaker's emotion and intention cannot be judged accurately.
Summary of the invention
To solve one or more of the technical problems mentioned in the Background section, embodiments of the present application provide a method and apparatus for providing a voice service.
In a first aspect, an embodiment of the present application provides a method for providing a voice service, comprising: acquiring a voice input signal; recognizing the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, to obtain corresponding tone input information and text input information, where the tone input information represents the tone type of the voice input signal; and performing a voice service data query based on the tone input information and the text input information, generating voice response information according to the query result.
In some embodiments, performing the voice service data query based on the tone input information and the text input information and generating voice response information according to the query result comprises: determining user demand information based on the tone input information and the text input information; querying voice service data matching the user demand information to generate a text response message; and converting the text response message into voice response information.
In some embodiments, performing the voice service data query based on the tone input information and the text input information and generating voice response information according to the query result further comprises: querying, based on an acquired correspondence between preset tone input information and preset tone output information, the tone output information corresponding to the tone input information, where the tone output information identifies the tone of the voice response information to be generated. Converting the text response message into voice response information then comprises: performing text-to-speech conversion on the text response message in combination with the tone output information, to generate voice response information that carries the tone.
In some embodiments, the method further comprises: acquiring a sample dialogue set, where the sample dialogue set comprises multiple sample dialogues and each sample dialogue comprises the audio data of a request text and the audio data of the corresponding response text; determining the tone information of the request text according to the audio data of the response text; and training the semantic recognition model with a machine learning method, using the audio data of the request text, the request text, and the tone information of the request text as training samples.
In some embodiments, the method further comprises a step of constructing the sample dialogue set, including: collecting dialogue corpora containing audio data of preset request texts; extracting, from each dialogue corpus, the audio data of the response text corresponding to each preset request text; and combining the audio data of each preset request text with the audio data of the corresponding response text to generate multiple sample dialogues, forming the sample dialogue set.
In a second aspect, an embodiment of the present application provides an apparatus for providing a voice service, comprising: an acquiring unit for acquiring a voice input signal; a recognition unit for recognizing the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, to obtain corresponding tone input information and text input information, where the tone input information represents the tone type of the voice input signal; and a response unit for performing a voice service data query based on the tone input information and the text input information and generating voice response information according to the query result.
In some embodiments, the response unit is further configured to generate voice response information as follows: determining user demand information based on the tone input information and the text input information; querying voice service data matching the user demand information to generate a text response message; and converting the text response message into voice response information.
In some embodiments, the response unit is further configured to: query, based on an acquired correspondence between preset tone input information and preset tone output information, the tone output information corresponding to the tone input information, where the tone output information identifies the tone of the voice response information to be generated; and the response unit converts the text response message into voice response information as follows: performing text-to-speech conversion on the text response message in combination with the tone output information, to generate voice response information that carries the tone.
In some embodiments, the apparatus further comprises: a sample acquiring unit for acquiring a sample dialogue set, where the sample dialogue set comprises multiple sample dialogues and each sample dialogue comprises the audio data of a request text and the audio data of the corresponding response text; a determining unit for determining the tone information of the request text according to the audio data of the response text; and a training unit for training the semantic recognition model with a machine learning method, using the audio data of the request text, the request text, and the tone information of the request text as training samples.
In some embodiments, the apparatus further comprises a construction unit for building the sample dialogue set, which builds the set as follows: collecting dialogue corpora containing audio data of preset request texts; extracting, from each dialogue corpus, the audio data of the response text corresponding to each preset request text; and combining the audio data of each preset request text with the audio data of the corresponding response text to generate multiple sample dialogues, forming the sample dialogue set.
The method and apparatus for providing a voice service provided by the embodiments of the present application acquire a voice input signal; recognize the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, obtaining corresponding tone input information and text input information, where the tone input information represents the tone of the voice input signal; and then perform a voice service data query based on the tone input information and the text input information, generating voice response information according to the query result. This achieves tone recognition that does not depend on modal particles, can detect the speaker's intention more accurately, and improves the precision of the voice service.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent upon reading the following detailed description of non-limiting embodiments made with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture in which the present application may be applied;
Fig. 2 is a schematic flowchart of one embodiment of the method for providing a voice service according to the present application;
Fig. 3 is a schematic diagram of an application scenario according to an embodiment of the present application;
Fig. 4 is a schematic flowchart of another embodiment of the method for providing a voice service according to the present application;
Fig. 5 is a schematic structural diagram of one embodiment of the apparatus for providing a voice service of the present application;
Fig. 6 is a schematic structural diagram of a computer system suitable for implementing the server of an embodiment of the present application.
Embodiment
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that, in the absence of conflict, the embodiments in the present application and the features in the embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 in which embodiments of the method for providing a voice service, or of the apparatus for providing a voice service, of the present application may be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101 and 102, a network 103, and a server 104. The network 103 is the medium providing communication links between the terminal devices 101 and 102 and the server 104, and may include various connection types, such as wired or wireless communication links or fiber optic cables.
A user 110 may use the terminal devices 101 and 102 to interact with the server 104 through the network 103 to receive or send messages. Various voice interaction applications may be installed on the terminal devices 101 and 102.
The terminal devices 101 and 102 may be various electronic devices that have an audio input interface and an audio output interface and support Internet access, including but not limited to smartphones, tablet computers, smart watches, e-book readers, and smart speakers.
The server 104 may be a voice server providing support for the voice service. The voice server may receive voice interaction requests sent by the terminal devices 101 and 102, parse those requests, look up the corresponding service data, generate response data, and return the generated response data to the terminal devices 101 and 102.
It should be noted that the method for providing a voice service provided by the embodiments of the present application may be executed by the server 104; accordingly, the apparatus for providing a voice service may be arranged in the server 104.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers as required by the implementation.
With continued reference to Fig. 2, a flow 200 of one embodiment of the method for providing a voice service according to the present application is shown. The method for providing a voice service comprises the following steps:
Step 201: acquire a voice input signal.
In the present embodiment, electronic equipment (such as Fig. 1 institutes of the above-mentioned method operation for being used to providing voice service thereon The server shown) voice input signal that generates of the voice messaging that can be sent by Network Capture according to user.Specifically, on State electronic equipment and can be established by network with the terminal device (such as terminal device shown in Fig. 1) with audio input interface and connected Connect, terminal device can obtain the voice messaging that user sends by audio input interface, and carry out coding generation phonetic entry Signal, then it is transmitted through the network to the above-mentioned electronic equipment of method operation thereon for providing voice service.
Typically, a voice interaction application may be installed on a terminal device with a voice input device (such as a microphone). The user may wake up a voice assistant by a gesture, a specific key, or a specific audio signal; the terminal device then detects the sound uttered by the user and encodes it to generate a voice input signal. Afterwards, to obtain response data for the voice input signal, the terminal device may request a connection to the voice server and send the voice input signal to it. The voice server can then receive, over the network, the voice input signal generated by the terminal device.
Step 202: recognize the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, to obtain corresponding tone input information and text input information.
In this embodiment, the electronic device may use the trained semantic recognition model to recognize the tone and the spoken content in the voice input signal simultaneously; the recognition result for the tone is the tone input information, and the recognition result for the spoken content is the text input information. Here, the tone input information represents the tone type of the voice input signal. Tone types may include declarative, interrogative, rhetorical, exclamatory, and imperative. Optionally, the tone input information may be represented by a corresponding tone type label. For example, the labels for the declarative, interrogative, rhetorical, exclamatory, and imperative tones may be predefined as <ind>, <int>, <rhe>, <exc>, and <ime> respectively, and the tone input information can then be represented with these labels.
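The labeling scheme above can be sketched in a few lines; the label strings come from the text, while the enum and helper function are illustrative:

```python
from enum import Enum

class ToneType(Enum):
    """Tone types named in the text, with the label strings the text defines."""
    DECLARATIVE = "<ind>"    # statement
    INTERROGATIVE = "<int>"  # question
    RHETORICAL = "<rhe>"     # rhetorical question
    EXCLAMATORY = "<exc>"    # exclamation
    IMPERATIVE = "<ime>"     # imperative

def tone_label(tone: ToneType) -> str:
    """Return the label used as tone input information."""
    return tone.value

print(tone_label(ToneType.INTERROGATIVE))  # <int>
```

A recognizer would emit one of these labels alongside the transcribed text, giving downstream components a compact representation of the tone.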
The semantic recognition model may be a model trained in advance using a machine learning algorithm. Specifically, the semantic recognition model may be trained on training samples using machine learning algorithms based on decision trees, support vector machines, neural networks, deep neural networks, and the like. In this embodiment, the input of the semantic recognition model may be a voice signal, and the output may be the tone type corresponding to the voice signal together with the text content converted from the voice signal.
Usually, when a user speaks with different tones, the intonation differs, which is reflected in differing positions of stress and neutral tone. For example, the final syllable of an interrogative sentence usually carries a neutral tone, the pitch of a declarative sentence is relatively even, and the stress of a confirmation question usually falls at the beginning of the sentence. In this embodiment, the semantic recognition model may extract intonation features from the voice input signal and recognize the tone input information based on those features.
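The paragraph above says that tone is reflected in pitch and stress patterns. A trained model would learn such cues from data; purely to illustrate the idea, a toy heuristic over a pitch contour (the feature choice and threshold are invented for this sketch) might look like:

```python
def classify_tone_from_pitch(pitch_contour, rise_threshold=1.1):
    """Toy heuristic: compare the pitch of the final stretch of an
    utterance against its overall mean. A clearly rising ending is
    taken as interrogative, otherwise declarative. Real systems learn
    such cues from labeled speech rather than hard-coding them."""
    if len(pitch_contour) < 4:
        raise ValueError("contour too short")
    mean_pitch = sum(pitch_contour) / len(pitch_contour)
    tail = pitch_contour[-len(pitch_contour) // 4:]
    tail_mean = sum(tail) / len(tail)
    return "<int>" if tail_mean > rise_threshold * mean_pitch else "<ind>"

print(classify_tone_from_pitch([200, 200, 210, 205, 240, 260]))  # rising tail -> <int>
```

The point of the sketch is only that tone type is recoverable from prosodic features of the signal, with no reliance on modal particles in the transcript.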
In some optional implementations of this embodiment, the semantic recognition model is trained on labeled training samples. Voice signals containing different tone types may be collected as sample voice signals, and the text content and tone type corresponding to each sample voice signal are labeled manually. The sample voice signals are then used as the input of the semantic recognition model and the corresponding text content and tone type as its output, and the structure and parameters of the model are continuously adjusted and optimized so that its recognition results approach the manually labeled results.
By recognizing the voice input signal with a semantic recognition model trained with a machine learning method, tone recognition independent of modal particles is achieved. This solves the problem of tone recognition being limited to fixed correspondence rules between preset modal particles and tone types, and broadens the applicability of tone recognition.
Step 203: perform a voice service data query based on the tone input information and the text input information, and generate voice response information according to the query result.
In this embodiment, the electronic device may produce a voice response according to the tone input information and text input information recognized by the semantic recognition model. Specifically, the corresponding response data may be queried in a voice service database. In some optional implementations, the voice service database may include preset response data templates corresponding to different tone input information and text input information; one optional form of such a template includes fixed text and labels. For example, the preset response data template corresponding to the text input information "the weather is quite nice today" in the interrogative tone may include "Today's weather <labelA>, temperature <labelB>". The content to be filled into the response data template may be found through a network search, and the query result of the voice service data is then generated. In the example above, if the query finds that the weather is "sunny" and the temperature is "20°C to 30°C", then "sunny" replaces the label "<labelA>" and "20°C to 30°C" replaces the label "<labelB>", generating the query result "Today's weather is sunny, temperature 20°C to 30°C".
In another scenario, if the text input information recognized by the semantic recognition model is "the weather is quite nice today" and the tone information is declarative, then the preset response data template corresponding to the declarative "the weather is quite nice today" may be found, for example "Nice weather, good for an outing; the scenery at nearby <labelC> is lovely". The name of a scenic spot near the user's current location suitable for an outing, such as "Forest Park", is then found through the Internet, the label "<labelC>" in the preset response data template is replaced, and the query result "Nice weather, good for an outing; the scenery at nearby Forest Park is lovely" is generated.
The association between the preset response data templates and each combination of tone and text input information may be set in advance. After the tone input information and text input information corresponding to the voice input signal have been determined, the corresponding preset response data template can then be found according to the preset association; the content to substitute into the template is found through network data query, analysis, and the like, and the completed voice service data query result is generated.
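The template-filling step described above can be sketched as follows; the template text and placeholder names follow the weather example in the text, and the looked-up values stand in for a real network query:

```python
def fill_response_template(template: str, lookups: dict) -> str:
    """Replace each <label> placeholder in a preset response template
    with the value found for it (e.g. via a network query)."""
    for label, value in lookups.items():
        template = template.replace(label, value)
    return template

# Weather example from the text; the dict stands in for a real web lookup.
template = "Today's weather: <labelA>, temperature <labelB>"
result = fill_response_template(
    template, {"<labelA>": "sunny", "<labelB>": "20-30 degrees C"}
)
print(result)  # Today's weather: sunny, temperature 20-30 degrees C
```

Which template is selected in the first place depends on the (tone input, text input) pair, per the preset association described above.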
In other optional implementations, the voice service data query may be performed as follows. First, the user's emotion information is determined from the recognized tone input information; the emotion information may include the user's emotional state. For example, the emotion implied by an interrogative tone may be a calm mood, the emotion implied by a rhetorical tone may be a bad mood, and the emotion implied by an exclamatory tone may be an excited mood. The electronic device may then determine, from the emotion information, the emotion-related condition that the response data must satisfy, and use that condition as an additional constraint when querying response data according to the text input information. If the queried response data satisfies the additional condition, it can be used as the query result of the voice service data; otherwise, it is not used as the query result.
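This alternative flow, deriving an emotion from the tone and then using it as an extra filter on candidate response data, might be sketched as below; the tone-to-emotion table follows the examples in the text, while the candidate format and filter predicate are illustrative assumptions:

```python
# Illustrative tone -> emotion mapping, following the examples in the text.
TONE_EMOTION = {
    "<int>": "calm",        # interrogative tone -> calm mood
    "<rhe>": "displeased",  # rhetorical question -> bad mood
    "<exc>": "excited",     # exclamatory tone -> excited mood
}

def query_with_emotion(candidates, tone_input):
    """Keep only candidate responses whose tagged emotion matches the
    emotion implied by the tone input information; return None if no
    candidate satisfies the additional condition."""
    required = TONE_EMOTION.get(tone_input)
    for text, emotion in candidates:
        if required is None or emotion == required:
            return text
    return None

candidates = [("Cheer up!", "excited"), ("Here is the route.", "calm")]
print(query_with_emotion(candidates, "<int>"))  # Here is the route.
```

Here the emotion acts as the "additional condition" from the text: a candidate that matches the text query but fails the emotion check is simply not returned.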
The query result of the voice service data is usually data in text form. Text normalization can be used to convert the text data into speech data, generating the voice response information. Text normalization may, for example, be performed by a model based on a deep learning framework.
After the voice response information is generated, it can be output through the audio output interface (loudspeaker) of the terminal device connected to the electronic device (for example, the terminal devices shown in Fig. 1), realizing an intelligent voice service.
In some optional implementations of this embodiment, the step 203 of performing a voice service data query based on the tone input information and the text input information and generating voice response information according to the query result comprises: determining user demand information based on the tone input information and the text input information; querying voice service data matching the user demand information to generate a text response message; and converting the text response message into voice response information. That is, the user's latent demand can be parsed based on the tone input information and the text input information to obtain the user's intent information. A variety of methods can be used to parse the user's intent; for example, a machine learning model may be used. In some optional implementations, when parsing the user's intent information, an intent that corresponds to the tone input information and/or keywords in the text input information may be looked up in a preset intent information set. As an example, the preset intent information set may include the intent "query route", which may correspond to the interrogative tone type together with keywords such as "walking route", "driving route", "bus route", or "how to get to", and may also correspond to the imperative tone type together with the keyword "query", "plan", or "navigate" combined with "route".
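The intent lookup described above, matching the tone type plus keywords against a preset intent set, can be sketched like this; the intent table mirrors the route-query example, and the matching rule (any keyword substring plus an allowed tone) is a simplification:

```python
# Preset intent set, following the "query route" example: an intent matches
# when the tone type is listed for it and one of its keywords appears in the text.
INTENTS = {
    "query_route": {
        "tones": {"<int>", "<ime>"},
        "keywords": ["walking route", "driving route", "bus route",
                     "how to get to", "navigate", "route"],
    },
}

def parse_intent(tone_input: str, text_input: str):
    """Return the first preset intent compatible with the tone input
    information and the keywords found in the text input information."""
    for intent, spec in INTENTS.items():
        if tone_input in spec["tones"] and any(
            kw in text_input.lower() for kw in spec["keywords"]
        ):
            return intent
    return None

print(parse_intent("<int>", "How to get to the Forest Park?"))  # query_route
```

Note how the tone type does real work here: the same keywords under a declarative tone would not trigger the route-query intent.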
In a further implementation, when performing the voice service data query based on the tone input information and the text input information and generating voice response information according to the query result, the tone output information corresponding to the tone input information may also be queried based on an acquired correspondence between preset tone input information and preset tone output information. Here, the tone output information identifies the tone of the voice response information to be generated; when converting the text response message into voice response information, text-to-speech conversion may be performed on the text response message in combination with the tone output information, generating voice response information that carries the tone. That is, the tone type to use for the voice response information is determined according to the correspondence between the different types of preset tone input information and the preset tone output information, so that after the voice service data has been queried and the text response message generated, the tone type is synthesized into the voice response information. The correspondence between preset tone input information and preset tone output information can be preset empirically; for example, when the preset tone input information is the interrogative tone, the corresponding preset tone output information may be the declarative tone, and when the preset tone input information is the exclamatory tone, the corresponding preset tone output information may be the declarative or the exclamatory tone. In this way, tone can be blended into the voice response information, enriching its emotional color and helping to improve the fluency of intelligent voice interaction.
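The preset mapping from tone input to tone output described above amounts to a small lookup table; the entries follow the examples given in the text, and the declarative fallback for unlisted tones is an assumption of this sketch:

```python
# Correspondence between preset tone input information and preset tone
# output information, following the examples in the text.
TONE_OUTPUT = {
    "<int>": "<ind>",  # interrogative in -> declarative out
    "<exc>": "<exc>",  # exclamatory in -> exclamatory out (could also be <ind>)
}

def tone_output_for(tone_input: str) -> str:
    """Look up the tone to synthesize into the voice response; default
    to the declarative tone when no rule is configured (an assumption)."""
    return TONE_OUTPUT.get(tone_input, "<ind>")

print(tone_output_for("<int>"))  # <ind>
```

The returned label would then be passed to text-to-speech conversion so that the synthesized response carries the chosen tone.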
Referring to Fig. 3, a schematic diagram of an application scenario according to an embodiment of the present application is shown. As shown in Fig. 3, after the smart speaker B is woken up, user A can interact with it intelligently. When user A asks about the weather, smart speaker B transmits the collected voice signal of the user to the backend voice server C. Upon receiving the voice signal, voice server C can use the semantic recognition model to recognize that the user's tone is interrogative and that the text input information is "the weather seems quite nice today". From the interrogative tone and the "weather" keyword contained in the text input information, voice server C can judge that the user wants to know today's weather. It may then look up today's weather forecast, "sunny, 19 to 26 degrees", determine from the forecast whether the weather is indeed "nice" (the judgment result being "yes"), use that judgment as the reply to the user's question, and combine it with the forecast found, generating the response text "Yes, today's weather is sunny, with a temperature of 19 to 26 degrees". The response text is converted into a voice response signal through text normalization and returned to smart speaker B, which can decode and play it. In this scenario, although the voice signal uttered by the user contains neither modal particles nor keywords indicating a query intent (such as "what" or "how"), the voice server can still recognize the user's query intent and respond to it.
In the method for providing a voice service of the above embodiments of the present application, a voice input signal is obtained; the tone and spoken content in the voice input signal are then identified using a semantic recognition model trained with a machine learning method, yielding corresponding tone input information and text input information, where the tone input information represents the tone type of the voice input signal; a voice service data query is performed based on the tone input information and the text input information, and voice response information is generated according to the query result. When providing the voice service, the user's tone can be identified without relying on modal particles, so that the user's intention can be detected accurately and a response can be made to the intention conveyed by the user's tone. This improves the match between the voice service and the user's demand and achieves a more accurate voice service.
Referring to Fig. 4, which illustrates a flowchart of another embodiment of the method for providing a voice service according to the present application. As shown in Fig. 4, the flow 400 of the method for providing a voice service of the present embodiment may comprise the following steps:
Step 401: a voice input signal is obtained.
In the present embodiment, the electronic device on which the method for providing a voice service runs (e.g. the server shown in Fig. 1) may establish a connection through a network with a terminal device having an audio input interface (e.g. the terminal device shown in Fig. 1). The terminal device may acquire the voice information uttered by the user through the audio input interface, encode it to generate a voice input signal, and then transmit the signal through the network to the electronic device on which the method for providing a voice service runs.
Step 402: a sample dialogue set is obtained, the sample dialogue set comprising multiple segments of sample dialogue, each sample dialogue comprising audio data of a request text and audio data of a corresponding response text.
In the present embodiment, the electronic device may obtain the sample dialogue set comprising multiple segments of sample dialogue. Each segment of sample dialogue comprises the audio data of both parties in an exchange, where the audio data uttered by the party speaking first is the audio data of the request text, and the audio data uttered by the party speaking afterwards is the audio data of the corresponding response text.
For example, in one segment of sample dialogue, user A says "What time is it now" and user B answers "It is 4 p.m. now". In this segment of sample dialogue, "What time is it now" is the request text and "It is 4 p.m. now" is the response text.
Step 403: tone information of the corresponding request text is determined according to the audio data of the response text.
In the present embodiment, the tone information of the request text may be analyzed according to the response text. Specifically, the various tones that the request text may carry can first be determined as multiple pieces of candidate tone information. In the above example, "What time is it now" may carry the interrogative tone or the exclamatory tone, so "interrogative" and "exclamatory" can serve as two pieces of candidate tone information. The audio data of the response text can then be decoded and semantically analyzed, or the tone information in the audio data of the response text can be extracted, and it can be judged, according to the semantic analysis result or the extracted tone information, which tone type of request text the response text is responding to. In this way, the tone information of the request text can be determined from the candidate tone information according to the response text.
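Step 403 can be sketched as filtering candidate tones by what the response looks like. The candidate lists and the "informative response implies a real question" heuristic below are assumptions made for illustration:

```python
# Sketch of step 403: choose the request text's tone from its candidate tones
# by inspecting the response text. Candidates and heuristic are illustrative.
CANDIDATE_TONES = {
    "What time is it now": ["interrogative", "exclamatory"],
}

def infer_request_tone(request_text: str, response_text: str) -> str:
    candidates = CANDIDATE_TONES.get(request_text, ["declarative"])
    # A response that supplies concrete information (here: digits) suggests
    # the request was a genuine question, so prefer the interrogative reading.
    informative = any(ch.isdigit() for ch in response_text)
    if informative and "interrogative" in candidates:
        return "interrogative"
    return candidates[0]
```

A real implementation would use the semantic analysis of the response audio described in the text rather than a digit check; the sketch only shows the candidate-narrowing structure.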
Step 404: the audio data of the request text, the request text, and the tone information of the request text are used as training samples, and the semantic recognition model is trained using a machine learning method.
Afterwards, the semantic recognition model can be constructed to recognize the audio data of the request text and obtain the request text. The audio data of the request text, the corresponding request text, and the tone information of the request text are then used as training samples to train the semantic recognition model. Specifically, the audio data of the request text can be fed as input to the semantic recognition model, the error between the model's output and the corresponding request text and the tone information of the request text can be calculated, and the model parameters can be adjusted until the error converges. The semantic recognition model may be a machine learning model, including but not limited to: a logistic regression model, a hidden Markov model, a convolutional neural network model, or a recurrent neural network model. After the training is completed, the semantic recognition model trained using the machine learning method is obtained.
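The joint objective of step 404 — map request audio to both the request text and its tone — can be sketched with a toy nearest-neighbour stand-in. The feature vectors and samples are invented; a real system would fit one of the model families named above by minimizing the prediction error:

```python
# Toy stand-in for the semantic recognition model of step 404: a 1-nearest-
# neighbour "model" over invented audio feature vectors. Each training sample
# is an (audio features, request text, tone information) triple, as in the text.
def train(samples):
    # "Training" here just memorizes the samples; a real model would adjust
    # its parameters until the error against the labels converges.
    return list(samples)

def predict(model, features):
    # Return the (request text, tone) pair of the closest stored sample.
    best = min(model,
               key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], features)))
    return best[1], best[2]

samples = [
    ((0.9, 0.1), "What time is it now", "interrogative"),
    ((0.1, 0.9), "Really", "exclamatory"),
]
model = train(samples)
```

The shape of the interface is the point: one input (audio features), two outputs (text, tone), matching the model's dual role in step 405.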
In some optional implementations of the present embodiment, before step 402, the method for providing a voice service may further comprise a step of constructing the sample dialogue set. Specifically, dialogue corpora containing the audio data of preset request texts may be collected, for example dialogues containing the preset request texts in film/television scenes; the audio data of the response text corresponding to each preset request text is then extracted from each dialogue corpus, that is, the audio data of the response text used to respond to the preset request text in the dialogue is extracted; finally, the audio data of each preset request text and the audio data of the corresponding response text are combined to generate multiple segments of sample dialogue, forming the sample dialogue set. In other words, the audio data of a preset request text and the extracted audio data of the response text responding to that preset request text are taken together from a dialogue corpus to form one segment of sample dialogue. After collecting multiple dialogue corpora and extracting the audio data of the preset request texts and the corresponding response texts, multiple segments of sample dialogue can be generated, thereby generating the sample dialogue set. Optionally, the preset request texts may be preset request texts that can be expressed with different tone types, for example "Really" (which may carry the interrogative tone or the declarative tone) or "The men's 100m world record has been broken" (which may carry the exclamatory tone, the interrogative tone, or the declarative tone).
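The construction step above can be sketched as a scan over turn-based corpora; the corpus format (a list of `(text, audio)` turns) and the rule that the next turn is the response are assumptions for illustration:

```python
# Sketch of the optional construction step: scan dialogue corpora for preset
# request texts and pair each with the utterance that answers it.
PRESET_REQUESTS = {"Really", "The men's 100m world record has been broken"}

def build_sample_dialogues(corpora):
    dialogues = []
    for turns in corpora:  # each corpus is a list of (speaker_text, audio) turns
        for i, (text, audio) in enumerate(turns[:-1]):
            if text in PRESET_REQUESTS:
                # The following turn is taken as the response to the request.
                resp_text, resp_audio = turns[i + 1]
                dialogues.append({"request": (text, audio),
                                  "response": (resp_text, resp_audio)})
    return dialogues
```

Each dictionary in the result is one segment of sample dialogue; the full set across corpora forms the sample dialogue set of step 402.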
Step 405: the tone and spoken content in the voice input signal are identified using the semantic recognition model trained with the machine learning method, and corresponding tone input information and text input information are obtained, where the tone input information represents the tone type of the voice input signal.
In the present embodiment, the semantic recognition model trained in step 404 can be used to identify the tone and the spoken content of the voice input signal simultaneously, obtaining the tone input information and the text input information. The tone input information may be represented by a label of the tone type, and the text input information is the text corresponding to the voice input signal.
Step 406: a voice service data query is performed based on the tone input information and the text input information, and voice response information is generated according to the query result.
In the present embodiment, the response data corresponding to the tone input information and the text input information can be queried in a voice service database. For example, a preset response data template corresponding to the tone input information and the text input information can be looked up in the voice service database, and the data to be filled into the preset response data template can then be obtained from the network, thereby generating the response data. As another example, the user's emotion information can be determined according to the identified tone input information, and additional conditions for the voice service data query can be determined according to the emotion information. Afterwards, text regularization can be applied to the query result, converting the retrieved text information into voice response information.
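The template lookup described above can be sketched as a table keyed by tone and keyword, with a fetcher standing in for the network retrieval; the keys, the template text, and the `fetch` callable are all illustrative assumptions:

```python
# Sketch of step 406's template path: the preset response data template is
# keyed by (tone, keyword) and filled with data fetched for the query.
TEMPLATES = {
    ("interrogative", "weather"): "Yes, the weather today is {condition}.",
}

def query_response(tone_input, text_input, fetch):
    for (tone, keyword), template in TEMPLATES.items():
        if tone == tone_input and keyword in text_input:
            # fetch() stands in for obtaining the data to be filled into the
            # preset response data template from the network.
            return template.format(**fetch(keyword))
    return None  # no matching template in the voice service database
```

The tone label thus constrains which template can match, which is how the tone input information participates in the query rather than only in synthesis.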
Step 401, step 405, and step 406 in the above method flow are identical to step 201, step 202, and step 203 in the previous embodiment, respectively. The descriptions above regarding steps 201, 202, and 203 also apply to steps 401, 405, and 406 of the present embodiment, and are not repeated here.
As can be seen from Fig. 4, compared with the embodiment shown in Fig. 2, the present embodiment adds the steps of obtaining a sample dialogue set, determining the tone information of the corresponding request text according to the audio data of the response text in a sample dialogue, and training the semantic recognition model using a machine learning method with the audio data of the request text, the request text, and the tone information of the request text in the sample dialogue as training samples. The method for providing a voice service of the present embodiment thus provides a training approach for the semantic recognition model, enabling the model to better learn the internal logic of interaction in real dialogue scenarios and improving the reliability and accuracy of the semantic recognition model.
With further reference to Fig. 5, as an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for providing a voice service. The apparatus embodiment corresponds to the method embodiment shown in Fig. 2, and the apparatus can be applied in various electronic devices.
As shown in Fig. 5, the apparatus 500 for providing a voice service of the present embodiment may comprise: an acquiring unit 501, a recognition unit 502, and a response unit 503. The acquiring unit 501 is configured to obtain a voice input signal; the recognition unit 502 is configured to identify the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, obtaining corresponding tone input information and text input information, where the tone input information represents the tone type of the voice input signal; the response unit 503 is configured to perform a voice service data query based on the tone input information and the text input information, and to generate voice response information according to the query result.
In the present embodiment, the acquiring unit 501 may establish a connection through a network with a terminal device having an audio input interface (e.g. the terminal device shown in Fig. 1), and receive from the terminal device the user's voice input signal obtained and encoded through the audio input interface.
The recognition unit 502 may use the semantic recognition model to recognize the voice input signal obtained by the acquiring unit 501, deriving the tone input information and the text input information of the voice input signal. The tone input information may be represented by a label of the tone type, and the text input information is the text corresponding to the voice input signal. The semantic recognition model may be obtained by training with machine learning algorithms such as regression models or deep neural networks, and can identify the tone and the spoken content in speech simultaneously.
The response unit 503 may respond according to the text input information and the tone input information identified by the recognition unit 502. Specifically, it may search the voice database for data that matches the tone input information, or that can satisfy both the user's potential demand conveyed by the tone input information and the user's demand contained in the text input information, generate text response information, and then convert the text response information into voice response information using text regularization.
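The three-unit composition of the apparatus can be sketched as follows; the unit implementations passed in are trivial stand-ins, since each unit's real behavior is described in the method embodiments:

```python
# Sketch of the apparatus 500 composition: three units wired in sequence.
class Apparatus:
    def __init__(self, acquire, recognize, respond):
        self.acquire = acquire      # acquiring unit 501
        self.recognize = recognize  # recognition unit 502
        self.respond = respond      # response unit 503

    def serve(self):
        signal = self.acquire()
        tone, text = self.recognize(signal)
        return self.respond(tone, text)

app = Apparatus(
    acquire=lambda: b"pcm-bytes",
    recognize=lambda s: ("interrogative", "what time is it"),
    respond=lambda tone, text: f"[{tone}] answer to: {text}",
)
```

The wiring mirrors the data flow of steps 401, 405, and 406: signal in, (tone, text) pair through the model, response out.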
In some embodiments, the response unit 503 may be further configured to generate the voice response information as follows: determining user demand information based on the tone input information and the text input information; querying voice service data matching the user demand information and generating text response information; and converting the text response information into the voice response information.
In a further embodiment, the response unit 503 may also be configured to query, based on an obtained correspondence between preset tone input information and preset tone output information, the tone output information corresponding to the tone input information, the tone output information identifying the tone of the voice response information to be generated. In this case, the response unit 503 may be further configured to convert the text response information into the voice response information as follows: performing text-to-speech conversion on the text response information in combination with the tone output information, generating voice response information that carries the tone.
In some embodiments, the apparatus 500 may further comprise: a sample acquiring unit, configured to obtain a sample dialogue set, the sample dialogue set comprising multiple segments of sample dialogue, each sample dialogue comprising audio data of a request text and audio data of a corresponding response text; a determining unit, configured to determine the tone information of the corresponding request text according to the audio data of the response text; and a training unit, configured to train the semantic recognition model using a machine learning method, with the audio data of the request text, the request text, and the tone information of the request text as training samples.
In some embodiments, the apparatus 500 may further comprise a construction unit for constructing the sample dialogue set. The construction unit may construct the sample dialogue set as follows: collecting dialogue corpora containing the audio data of preset request texts; extracting from each dialogue corpus the audio data of the response text corresponding to each preset request text; and combining the audio data of each preset request text with the audio data of the corresponding response text to generate multiple segments of sample dialogue, forming the sample dialogue set.
In the apparatus 500 for providing a voice service of the embodiment of the present application, the acquiring unit obtains a voice input signal; the recognition unit identifies the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, obtaining corresponding tone input information and text input information, where the tone input information represents the tone type of the voice input signal; and the response unit performs a voice service data query based on the tone input information and the text input information, generating voice response information according to the query result. When providing the voice service, the user's intention can be detected accurately through tone recognition, and a response can be made to the intention conveyed by the user's tone, improving the match between the voice service and the user's demand and achieving a more accurate voice service.
It should be appreciated that the units described in the apparatus 500 correspond to the respective steps in the methods described with reference to Fig. 2 and Fig. 4. The operations and features described above for the methods are thus equally applicable to the apparatus 500 and the units contained therein, and are not repeated here.
Referring now to Fig. 6, which illustrates a schematic structural diagram of a computer system 600 suitable for implementing a server of the embodiments of the present application. The terminal device or server shown in Fig. 6 is merely an example, and should not impose any limitation on the functions and the scope of use of the embodiments of the present application.
As shown in Fig. 6, the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of the system 600. The CPU 601, the ROM 602, and the RAM 603 are connected to one another through a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, etc.; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, etc.; a storage portion 608 including a hard disk, etc.; and a communication portion 609 including a network interface card such as a LAN card, a modem, etc. The communication portion 609 performs communication processing via a network such as the Internet. A driver 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the driver 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. When the computer program is executed by the central processing unit (CPU) 601, the above functions defined in the method of the present application are performed. It should be noted that the computer-readable medium described herein may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electric, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any appropriate combination of the above. In the present application, the computer-readable storage medium may be any tangible medium containing or storing a program, which may be used by or in combination with an instruction execution system, apparatus, or device. In the present application, the computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any appropriate combination of the above. The computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any appropriate combination of the above.
The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of the systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, the module, program segment, or portion of code containing one or more executable instructions for implementing specified logic functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in a block diagram and/or flowchart, and a combination of blocks in a block diagram and/or flowchart, may be implemented by a dedicated hardware-based system executing specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented by means of software or by means of hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising an acquiring unit, a recognition unit, and a response unit. The names of these units do not, under certain conditions, constitute limitations on the units themselves; for example, the acquiring unit may also be described as "a unit for obtaining a voice input signal".
In another aspect, the present application further provides a computer-readable medium, which may be included in the apparatus described in the above embodiments, or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs, and when the one or more programs are executed by the apparatus, the apparatus: obtains a voice input signal; identifies the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, obtaining corresponding tone input information and text input information, where the tone input information represents the tone type of the voice input signal; and performs a voice service data query based on the tone input information and the text input information, generating voice response information according to the query result.
The above description is merely a preferred embodiment of the present application and an explanation of the applied technical principles. Those skilled in the art should appreciate that the scope of the invention involved in the present application is not limited to the technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the inventive concept, for example technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.

Claims (12)

  1. A method for providing a voice service, characterized in that the method comprises:
    obtaining a voice input signal;
    identifying the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, to obtain corresponding tone input information and text input information, wherein the tone input information represents a tone type of the voice input signal;
    performing a voice service data query based on the tone input information and the text input information, and generating voice response information according to a query result.
  2. The method according to claim 1, characterized in that the performing a voice service data query based on the tone input information and the text input information and generating voice response information according to a query result comprises:
    determining user demand information based on the tone input information and the text input information;
    querying voice service data matching the user demand information, and generating text response information;
    converting the text response information into the voice response information.
  3. The method according to claim 2, characterized in that the performing a voice service data query based on the tone input information and the text input information and generating voice response information according to a query result further comprises:
    querying, based on an obtained correspondence between preset tone input information and preset tone output information, tone output information corresponding to the tone input information, the tone output information being used to identify the tone of the voice response information to be generated;
    and the converting the text response information into the voice response information comprises:
    performing text-to-speech conversion on the text response information in combination with the tone output information, to generate voice response information carrying the tone.
  4. The method according to claim 1, characterized in that the method further comprises:
    obtaining a sample dialogue set, the sample dialogue set comprising multiple segments of sample dialogue, each sample dialogue comprising audio data of a request text and audio data of a corresponding response text;
    determining tone information of the corresponding request text according to the audio data of the response text;
    training the semantic recognition model using a machine learning method, with the audio data of the request text, the request text, and the tone information of the request text as training samples.
  5. The method according to claim 4, characterized in that the method further comprises a step of constructing the sample dialogue set, comprising:
    collecting dialogue corpora containing audio data of preset request texts;
    extracting, from each dialogue corpus, audio data of the response text corresponding to each preset request text;
    combining the audio data of each preset request text with the audio data of the corresponding response text to generate multiple segments of sample dialogue, to form the sample dialogue set.
  6. An apparatus for providing a voice service, characterized in that the apparatus comprises:
    an acquiring unit, configured to obtain a voice input signal;
    a recognition unit, configured to identify the tone and spoken content in the voice input signal using a semantic recognition model trained with a machine learning method, to obtain corresponding tone input information and text input information, wherein the tone input information represents a tone type of the voice input signal;
    a response unit, configured to perform a voice service data query based on the tone input information and the text input information, and to generate voice response information according to a query result.
  7. The apparatus according to claim 6, characterized in that the response unit is further configured to generate the voice response information as follows:
    determining user demand information based on the tone input information and the text input information;
    querying voice service data matching the user demand information, and generating text response information;
    converting the text response information into the voice response information.
  8. The apparatus according to claim 7, characterized in that the response unit is further configured to:
    query, based on an obtained correspondence between preset tone input information and preset tone output information, tone output information corresponding to the tone input information, the tone output information being used to identify the tone of the voice response information to be generated; and
    the response unit is further configured to convert the text response information into the voice response information as follows:
    performing text-to-speech conversion on the text response information in combination with the tone output information, to generate voice response information carrying the tone.
  9. The apparatus according to claim 6, characterized in that the apparatus further comprises:
    a sample acquiring unit, configured to obtain a sample dialogue set, the sample dialogue set comprising multiple segments of sample dialogue, each sample dialogue comprising audio data of a request text and audio data of a corresponding response text;
    a determining unit, configured to determine tone information of the corresponding request text according to the audio data of the response text;
    a training unit, configured to train the semantic recognition model using a machine learning method, with the audio data of the request text, the request text, and the tone information of the request text as training samples.
  10. The apparatus according to claim 9, characterized in that the apparatus further comprises a construction unit for constructing the sample dialogue set, the construction unit constructing the sample dialogue set as follows:
    collecting dialogue corpora containing audio data of preset request texts;
    extracting, from each dialogue corpus, audio data of the response text corresponding to each preset request text;
    combining the audio data of each preset request text with the audio data of the corresponding response text to generate multiple segments of sample dialogue, to form the sample dialogue set.
  11. A server, characterized by comprising:
    one or more processors; and
    a storage device, configured to store one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1-5.
  12. A computer-readable storage medium, on which a computer program is stored, characterized in that, when the program is executed by a processor, the method according to any one of claims 1-5 is implemented.
CN201710882420.0A 2017-09-26 2017-09-26 Method and apparatus for providing voice service Active CN107657017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710882420.0A CN107657017B (en) 2017-09-26 2017-09-26 Method and apparatus for providing voice service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710882420.0A CN107657017B (en) 2017-09-26 2017-09-26 Method and apparatus for providing voice service

Publications (2)

Publication Number Publication Date
CN107657017A true CN107657017A (en) 2018-02-02
CN107657017B CN107657017B (en) 2020-11-13

Family

ID=61131289

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710882420.0A Active CN107657017B (en) 2017-09-26 2017-09-26 Method and apparatus for providing voice service

Country Status (1)

Country Link
CN (1) CN107657017B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654950A (en) * 2016-01-28 2016-06-08 百度在线网络技术(北京)有限公司 Self-adaptive voice feedback method and device
CN105929964A (en) * 2016-05-10 2016-09-07 海信集团有限公司 Method and device for human-computer interaction
CN106503805A (en) * 2016-11-14 2017-03-15 合肥工业大学 A kind of bimodal based on machine learning everybody talk with sentiment analysis system and method
CN106683672A (en) * 2016-12-21 2017-05-17 竹间智能科技(上海)有限公司 Intelligent dialogue method and system based on emotion and semantics
CN107003997A (en) * 2014-12-04 2017-08-01 微软技术许可有限责任公司 Type of emotion for dialog interaction system is classified

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563633B (en) * 2018-03-29 2021-05-14 腾讯科技(深圳)有限公司 Voice processing method and server
CN108563633A (en) * 2018-03-29 2018-09-21 腾讯科技(深圳)有限公司 A kind of method of speech processing and server
CN112204656A (en) * 2018-05-29 2021-01-08 简单对话合同会社 Efficient dialog configuration
CN108882111A (en) * 2018-06-01 2018-11-23 四川斐讯信息技术有限公司 A kind of exchange method and system based on intelligent sound box
CN108877795A (en) * 2018-06-08 2018-11-23 百度在线网络技术(北京)有限公司 The method and apparatus of information for rendering
CN110634486A (en) * 2018-06-21 2019-12-31 阿里巴巴集团控股有限公司 Voice processing method and device
CN110719544A (en) * 2018-07-11 2020-01-21 惠州迪芬尼声学科技股份有限公司 Method for providing VUI specific response and application thereof in intelligent sound box
CN108962228A (en) * 2018-07-16 2018-12-07 北京百度网讯科技有限公司 model training method and device
CN108962228B (en) * 2018-07-16 2022-03-15 北京百度网讯科技有限公司 Model training method and device
CN109190107A (en) * 2018-07-17 2019-01-11 湖南优浪语音科技有限公司 Intelligent dialogue method and apparatus
CN110930994B (en) * 2018-09-19 2024-04-09 三星电子株式会社 System and method for providing voice assistant service
US11848012B2 (en) 2018-09-19 2023-12-19 Samsung Electronics Co., Ltd. System and method for providing voice assistant service
CN110930994A (en) * 2018-09-19 2020-03-27 三星电子株式会社 System and method for providing voice assistant service
CN110019748A (en) * 2018-09-27 2019-07-16 联想(北京)有限公司 A kind of data processing method and electronic equipment
CN110019748B (en) * 2018-09-27 2021-12-24 联想(北京)有限公司 Data processing method and electronic equipment
CN109346076A (en) * 2018-10-25 2019-02-15 三星电子(中国)研发中心 Interactive voice, method of speech processing, device and system
CN109671435A (en) * 2019-02-21 2019-04-23 三星电子(中国)研发中心 Method and apparatus for waking up smart machine
CN109671435B (en) * 2019-02-21 2020-12-25 三星电子(中国)研发中心 Method and apparatus for waking up smart device
CN110033659A (en) * 2019-04-26 2019-07-19 北京大米科技有限公司 A kind of remote teaching interactive approach, server, terminal and system
CN112242135A (en) * 2019-07-18 2021-01-19 北京声智科技有限公司 Voice data processing method and intelligent customer service device
CN110489458A (en) * 2019-07-31 2019-11-22 广州竞德信息技术有限公司 Analysis method based on semantics recognition
CN110442701B (en) * 2019-08-15 2022-08-05 思必驰科技股份有限公司 Voice conversation processing method and device
CN110442701A (en) * 2019-08-15 2019-11-12 苏州思必驰信息科技有限公司 Voice dialogue processing method and device
CN110830661A (en) * 2019-11-11 2020-02-21 科大国创软件股份有限公司 Automatic dial testing method for intelligent voice customer service
CN111031386A (en) * 2019-12-17 2020-04-17 腾讯科技(深圳)有限公司 Video dubbing method and device based on voice synthesis, computer equipment and medium
CN111104506B (en) * 2019-12-30 2024-02-20 深圳追一科技有限公司 Method and device for determining reply result of man-machine interaction and electronic equipment
CN111104506A (en) * 2019-12-30 2020-05-05 深圳追一科技有限公司 Method and device for determining reply result of human-computer interaction and electronic equipment
CN111324713A (en) * 2020-02-18 2020-06-23 腾讯科技(深圳)有限公司 Automatic replying method and device for conversation, storage medium and computer equipment
CN112071304A (en) * 2020-09-08 2020-12-11 深圳市天维大数据技术有限公司 Semantic analysis method and device
CN112071304B (en) * 2020-09-08 2024-03-15 深圳市天维大数据技术有限公司 Semantic analysis method and device
CN112908314A (en) * 2021-01-29 2021-06-04 深圳通联金融网络科技服务有限公司 Intelligent voice interaction method and device based on tone recognition
CN113254613A (en) * 2021-05-24 2021-08-13 深圳壹账通智能科技有限公司 Dialogue question-answering method, device, equipment and storage medium
TWI776589B (en) * 2021-07-13 2022-09-01 國立臺灣師範大學 Emotional Reply System
CN114462364B (en) * 2022-02-07 2023-01-31 北京百度网讯科技有限公司 Method and device for inputting information
CN114462364A (en) * 2022-02-07 2022-05-10 北京百度网讯科技有限公司 Method and device for inputting information
CN115766947A (en) * 2023-01-09 2023-03-07 广东电网有限责任公司 Intelligent management and control method and system for power grid customer service center

Also Published As

Publication number Publication date
CN107657017B (en) 2020-11-13

Similar Documents

Publication Publication Date Title
CN107657017A (en) Method and apparatus for providing voice service
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN108962255B (en) Emotion recognition method, emotion recognition device, server and storage medium for voice conversation
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN108428446A (en) Audio recognition method and device
CN107767869A (en) Method and apparatus for providing voice service
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN107623614A (en) Method and apparatus for pushed information
US20200075024A1 (en) Response method and apparatus thereof
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN109272984A (en) Method and apparatus for interactive voice
CN107707745A (en) Method and apparatus for extracting information
CN111798821B (en) Sound conversion method, device, readable storage medium and electronic equipment
CN109754783A (en) Method and apparatus for determining the boundary of audio sentence
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
CN107705782A (en) Method and apparatus for determining phoneme pronunciation duration
CN109102800A (en) A kind of method and apparatus that the determining lyrics show data
CN110600014A (en) Model training method and device, storage medium and electronic equipment
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
US11615787B2 (en) Dialogue system and method of controlling the same
CN108877795B (en) Method and apparatus for presenting information
CN111415662A (en) Method, apparatus, device and medium for generating video
CN109887490A (en) The method and apparatus of voice for identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant