CN111145721B - Personalized prompt generation method, device and equipment - Google Patents


Info

Publication number
CN111145721B
Authority
CN
China
Prior art keywords
personalized
information
prompt
user
target prompt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911276510.0A
Other languages
Chinese (zh)
Other versions
CN111145721A (en)
Inventor
李深安
章承伟
陈琛
张宏斌
谈焱
王兴宝
雷琴辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911276510.0A
Publication of CN111145721A
Application granted
Publication of CN111145721B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a personalized prompt generation method, apparatus, and device. The method comprises: receiving a user interaction instruction and personalized comprehensive information; determining a target prompt persona category according to the user interaction instruction, the personalized comprehensive information, and a corresponding preset strategy; generating a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a corresponding preset strategy; and finally synthesizing the target prompt from the target prompt persona category, the target prompt text, and a corresponding preset strategy. By repeatedly combining the multi-dimensional input information across every task stage of the prompt generation process, each stage can satisfy the requirement of personalized output, so that the content and style of the finally synthesized speech are more varied and human-like. Applying the invention in different scene environments therefore also helps to increase the interaction efficiency of related applications and thereby improve the user experience of interactive products.

Description

Personalized prompt generation method, device and equipment
Technical Field
The present invention relates to the field of human-machine interaction, and in particular to a method, an apparatus, and a device for generating a personalized prompt.
Background
In the human-machine interaction process, a machine performs corresponding processing according to an interaction request instruction received from the user via voice, text, touch, or other modalities, and feeds the processing result back to the user. Playing a prompt message is one of the friendlier, more direct, and more efficient means by which a machine feeds information back to the user; accordingly, among existing human-machine interaction products, application scenarios involving prompt playback are the most common. For example, with an in-vehicle voice assistant, the user can receive broadcast information fed back by the head unit audibly while driving, which helps ensure driving safety.
Most existing prompt generation schemes provide a fixed template text based on specific rules when the machine's processing result meets a certain condition, and then synthesize the corresponding prompt speech using Text-To-Speech (TTS) technology.
However, the output of existing prompt generation schemes is monotonous and dull, can hardly match users' personalized needs, and cannot adapt itself; as the number of interactions grows, users easily become bored and fatigued. Even when a user wishes to personalize the prompts, existing solutions require a customized upgrade of the entire interactive system, resulting in high operating costs.
Disclosure of Invention
Aiming at the above defects of the prior art, the present invention provides a personalized prompt generation method, apparatus, and device, and correspondingly provides a computer-readable storage medium.
The technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides a personalized prompt generation method, including:
receiving user interaction instructions and personalized comprehensive information;
determining a target prompt persona category according to the user interaction instruction, the personalized comprehensive information, and a first preset strategy;
generating a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a second preset strategy;
and synthesizing the target prompt using the target prompt persona category, the target prompt text, and a third preset strategy.
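The three-stage method above can be sketched as a minimal pipeline. The following Python sketch is purely illustrative: the function names and the rule-based logic inside them are assumptions standing in for the model-driven preset strategies described in this patent, not an implementation of them.

```python
# Illustrative three-stage pipeline: persona selection -> text generation -> synthesis.
# All rules and names below are hypothetical placeholders for the preset strategies.

def determine_persona(instruction: str, profile: dict) -> str:
    """Stage 1: pick a target prompt persona category (first preset strategy)."""
    if profile.get("age", 30) < 12:
        return "child_friendly"
    if "urgent" in instruction:
        return "concise_formal"
    return "default_assistant"

def generate_prompt_text(instruction: str, profile: dict, persona: str) -> str:
    """Stage 2: produce the target prompt text (second preset strategy),
    reusing the same inputs plus the stage-1 result."""
    name = profile.get("name", "there")
    if persona == "child_friendly":
        return f"Hi {name}! Here is what I found for you."
    return f"Hello {name}, here is the result of your request."

def synthesize_prompt(persona: str, text: str) -> dict:
    """Stage 3: stand-in for TTS synthesis (third preset strategy)."""
    return {"voice_style": persona, "text": text}

def generate_personalized_prompt(instruction: str, profile: dict) -> dict:
    persona = determine_persona(instruction, profile)
    text = generate_prompt_text(instruction, profile, persona)
    return synthesize_prompt(persona, text)
```

Note how the stage-1 output feeds stage 2, and both feed stage 3, mirroring the sharing of results between task stages that the method describes.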
In one possible implementation manner, the personalized comprehensive information includes at least one of the following:
multi-dimensional user information characterizing the user's state, wherein the multi-dimensional user information further comprises user basic information and/or user advanced information;
multi-dimensional environment information characterizing the state of the user's environment;
external information based on network hotspots; and
feedback information from the user regarding the interaction process.
In one possible implementation manner, the user advanced information includes at least one of the following: user dialogue information, user relationship information, and user interest information predicted from the processed user basic information;
the feedback information includes: operation information of the user interrupting the target prompt, collected in real time during interaction, and/or scoring information of the target prompt by the user, collected by sampling.
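As an illustration only, the four categories of personalized comprehensive information enumerated above could be grouped into one container. The field names below are hypothetical, chosen to mirror the categories in the text rather than taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class PersonalizedInfo:
    """Hypothetical container for the personalized comprehensive information."""
    # (1) Multi-dimensional user information
    basic: dict = field(default_factory=dict)        # name, age, gender, dialect...
    advanced: dict = field(default_factory=dict)     # predicted dialogue/relationship/interest info
    # (2) Multi-dimensional environment information
    environment: dict = field(default_factory=dict)  # vehicle speed, route, temperature...
    # (3) External information based on network hotspots
    external: dict = field(default_factory=dict)     # news, weather, trending topics...
    # (4) Feedback information: interruptions and scores
    interruptions: list = field(default_factory=list)
    scores: list = field(default_factory=list)
```

Every field is optional with an empty default, matching the "at least one of the following" wording above.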
In one possible implementation manner, the first preset strategy includes:
parsing the user interaction instruction;
selecting corresponding information from the personalized comprehensive information according to the parsed user interaction instruction;
and inputting the parsed user interaction instruction and the selected personalized comprehensive information into a pre-built personalized persona generation model;
the second preset strategy includes:
inputting the parsed user interaction instruction, the selected personalized comprehensive information, and the target prompt persona category into a pre-built personalized prompt text generation model.
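The parse-then-select steps of the strategies above can be rendered as a toy sketch. The intent keywords and the intent-to-dimension mapping below are invented for illustration; a real system would use semantic understanding models rather than keyword matching:

```python
# Hypothetical sketch of "parse the instruction, then select the relevant
# slice of personalized comprehensive information" from the first preset strategy.

def parse_instruction(raw: str) -> dict:
    """Toy parser: extract an intent keyword from the raw instruction."""
    intents = {"play": "media", "navigate": "navigation", "call": "phone"}
    for word, intent in intents.items():
        if word in raw.lower():
            return {"intent": intent, "raw": raw}
    return {"intent": "unknown", "raw": raw}

def select_relevant_info(parsed: dict, info: dict) -> dict:
    """Keep only the information dimensions relevant to the parsed intent."""
    relevant_keys = {
        "media": ["interests", "basic"],
        "navigation": ["environment", "basic"],
    }.get(parsed["intent"], ["basic"])
    return {k: info[k] for k in relevant_keys if k in info}
```

The selected subset, together with the parsed instruction, would then be the input to the pre-built models named above.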
In one possible implementation manner, the method further includes:
recording interaction history data of the user, wherein the interaction history data comprises feedback information and/or usage habits;
determining a reward-and-punishment mechanism for the personalized persona generation model and/or the personalized prompt text generation model according to the interaction history data;
adaptively updating the parameters of the model, based on a general model architecture, in combination with the reward-and-punishment mechanism.
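A minimal sketch of such a reward-and-punishment update, under invented assumptions: interruptions punish, high survey scores reward, and the "parameter" updated is a simple per-persona preference weight rather than the parameters of a real model.

```python
from typing import Optional

def feedback_reward(interrupted: bool, score: Optional[int]) -> float:
    """Map recorded feedback to a scalar reward (interruptions are punished).
    The weighting constants are illustrative assumptions."""
    reward = -1.0 if interrupted else 0.5
    if score is not None:
        reward += (score - 3.0) / 2.0   # 1-5 star scale centred at 3
    return reward

def update_preference(weights: dict, persona: str, reward: float,
                      lr: float = 0.1) -> dict:
    """Adaptively nudge the preference for the persona that was used."""
    new = dict(weights)
    new[persona] = new.get(persona, 0.0) + lr * reward
    return new
```

Over many interactions, personas that get interrupted drift down in preference while well-scored personas drift up, which is the adaptive behavior the implementation describes.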
In one possible implementation manner, the third preset strategy includes:
performing acoustic processing on the user interaction instruction in audio form;
determining a noise-reduction signal based on the acoustic processing result and noise-related information of the user's current environment;
inputting the target prompt persona category and the target prompt text into a pre-built prompt synthesis model to synthesize an initial prompt;
and fusing the noise-reduction signal with the initial prompt to obtain the target prompt.
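The noise-estimation and fusion steps above can be caricatured numerically. This sketch makes a strong simplifying assumption: the "noise-reduction signal" is reduced to a single gain factor applied to the synthesized samples, which only stands in for the real acoustic processing.

```python
# Toy version of: estimate ambient noise from the instruction audio,
# derive a compensation signal, and fuse it with the initial prompt.

def estimate_noise_profile(mic_samples: list) -> float:
    """Crude stand-in for front-end acoustic processing: mean absolute level."""
    return sum(abs(s) for s in mic_samples) / max(len(mic_samples), 1)

def noise_reduction_signal(noise_level: float) -> float:
    """Derive a gain that compensates the estimated ambient noise (capped)."""
    return min(1.0 + noise_level, 2.0)

def fuse(initial_prompt: list, gain: float) -> list:
    """Fuse the compensation with the initially synthesized prompt samples."""
    return [s * gain for s in initial_prompt]
```

In a noisier cabin the estimated level rises, the gain rises, and the fused target prompt is louder, which is the intuition behind superimposing noise reduction on the synthesized prompt.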
In a second aspect, the present invention provides a personalized prompt generation apparatus, including:
a receiving module, configured to receive the user interaction instruction and the personalized comprehensive information;
a target persona determination module, configured to determine a target prompt persona category according to the user interaction instruction, the personalized comprehensive information, and a first preset strategy;
a prompt text generation module, configured to generate a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a second preset strategy;
and a prompt synthesis module, configured to synthesize the target prompt using the target prompt persona category, the target prompt text, and a third preset strategy.
In one possible implementation,
the receiving module specifically comprises:
an instruction parsing unit, configured to parse the user interaction instruction;
an information selection unit, configured to select corresponding information from the personalized comprehensive information according to the parsed user interaction instruction;
the target persona determination module specifically comprises:
a first model processing unit, configured to input the parsed user interaction instruction and the selected personalized comprehensive information into a pre-built personalized persona generation model;
the prompt text generation module specifically comprises:
a second model processing unit, configured to input the parsed user interaction instruction, the selected personalized comprehensive information, and the target prompt persona category into a pre-built personalized prompt text generation model.
In one possible implementation manner, the apparatus further includes:
an interaction history collection module, configured to record interaction history data of the user, wherein the interaction history data comprises feedback information and/or usage habits;
the model iteration module specifically comprises:
a reward-and-punishment mechanism setting unit, configured to determine a reward-and-punishment mechanism for the personalized persona generation model and/or the personalized prompt text generation model according to the interaction history data;
and an adaptive updating unit, configured to adaptively update the parameters of the model, based on a general model architecture, in combination with the reward-and-punishment mechanism.
In one possible implementation manner, the apparatus further includes:
a front-end acoustic processing module, configured to perform acoustic processing on the user interaction instruction in audio form;
a noise-reduction signal generation module, configured to determine a noise-reduction signal based on the acoustic processing result and noise-related information of the user's current environment;
the prompt synthesis module specifically comprises:
an initial prompt synthesis unit, configured to input the target prompt persona category and the target prompt text into a pre-built prompt synthesis model to synthesize an initial prompt;
and a target prompt synthesis unit, configured to fuse the noise-reduction signal with the initial prompt to obtain the target prompt.
In a third aspect, the present invention provides a personalized prompt generation device, including:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions that, when executed by the device, cause the device to perform the personalized prompt generation method of the first aspect or any of its possible implementations.
In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the personalized prompt generation method of the first aspect or any of its possible implementations.
The key concept of the invention is to repeatedly combine and share the key task stages and the multi-dimensional input information throughout the prompt generation process, ensuring that every task stage can satisfy the requirement of personalized output, so that the content and style of the final synthesized prompt speech, which integrates the processing results of every stage, are richer and more human-like. On this basis, the invention incorporates automated processing strategies such as modeling, making the prompt generation scheme more automatic and intelligent, and also performs self-iteration and updating of the models using a reward-and-punishment mechanism from the machine learning field, improving the adaptability and flexible matching capability of the output personalized prompt. Furthermore, the invention superimposes acoustic optimization effects such as noise reduction on the synthesized prompt, improving the perceived audio quality of the prompt for the user.
Therefore, deploying the scheme of the invention in different scene environments helps to improve the interaction efficiency of related applications, and in specific application scenarios can also improve the safety, reliability, friendliness, and even user satisfaction of the interactive product.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of an embodiment of the personalized prompt generation method provided by the present invention;
FIG. 2 is a flowchart of an embodiment of selecting personalized comprehensive information provided by the present invention;
FIG. 3 is a flowchart of the third preset strategy according to a preferred embodiment of the present invention;
FIG. 4 is a flowchart of a method for updating the personalized persona generation model and/or the personalized prompt text generation model provided by the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of the personalized prompt generation apparatus provided by the present invention;
FIG. 6 is a schematic structural diagram of the personalized prompt generation apparatus according to a preferred embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an integrated embodiment of the personalized prompt generation device provided by the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the invention.
Before the scheme is introduced, it should also be noted that:
On the one hand, with respect to the defects of existing prompt generation schemes mentioned in the Background, the invention first carried out exploratory root-cause analysis. Research shows that most existing prompt generation schemes derive from technical measures such as state machines, knowledge bases, and QA question answering. Taking the state machine as an example, all automaton states and state transition conditions must be designed manually in advance; during interaction, a valid request input by the user is then matched against the pre-designed automaton to yield a fixed prompt text for subsequent speech synthesis. Although other prior schemes can to some extent omit manually enumerating all cases, the generated prompt text is still fixed; even in favorable operating scenarios, the text varies only within a very limited range and essentially cannot undergo flexible, diverse personalized adjustment.
On the other hand, the technical object of the invention is the "prompt speech" of the human-machine interaction field, which differs essentially from other speech synthesis scenarios (such as audio e-books or news broadcasting): its goal is to keep the dialogue between user and machine clear, friendly, natural, and able to adapt flexibly at any moment, i.e., not only to reflect but to surpass the characteristics and advantages of natural conversation between people. When designing the solution, the information, corpora, and processing strategies associated with prompt speech must therefore be fully considered so as to satisfy the technical requirements of wide coverage, flexibility and diversity, and convenient switching and adjustment; for example, which processing stages need to be configured, what input data is selected as the processing condition of each stage, and how the processing results of each stage are applied to achieve the final expected target.
On this basis, the invention proposes a personalized prompt generation scheme based on personalized influence factors and the current interaction instruction. The concept of the scheme still rests on the existing speech synthesis framework and, like most prior art, comprises a stage of generating the text to be synthesized and a stage of acoustic synthesis from that text; however, the technical concept of the invention differs from prior schemes in at least two characteristics. First, the determination of the target prompt persona category is introduced. The target prompt persona category mainly guides the pronunciation style, accent, timbre, and prosody of the specific virtual character in the subsequently synthesized speech, and can effectively fit the user's personalized needs and match the product's application scenario. It should be understood that as the user's instruction, state, and identity change, and as different scene situations require, the personalized requirement for the acoustic rendering of the synthesized speech will also change; the essence of this improvement is therefore to add to the prompt generation scheme a synthesized-pronunciation guidance stage that satisfies the need for personalized adjustment.
Second, multi-dimensional comprehensive information, different from the rule templates of the prior art, is collected in this so-called synthesized-pronunciation guidance stage. The comprehensive information is not used once and discarded: it is integrated again in the later text generation stage, and the processing result of the synthesized-pronunciation guidance stage is shared with the text generation stage as well. That is, the multi-dimensional comprehensive information is reused, combined with the processing results of the earlier steps, to guide the content of the finally synthesized prompt speech in a personalized way. In summary, the guidance at the acoustic level and the guidance at the content level are mutually independent yet associated. They are "independent" because, in the concept of the invention, each of the two stages determines, through its own input conditions and corresponding strategy, the fit between its processing result and the personalized requirement; they are "associated" because the subsequent text generation stage both reuses the input information of the earlier persona determination stage and incorporates its processing result. The final speech synthesis stage is thus built upon this "mutually independent yet associated" technical connotation of the earlier stages, so the effect exhibited at the final stage cannot be separated from the preceding process. In other words, in the complete prompt generation process of the invention, the key task stages complement each other and merge into a whole.
Various embodiments of the present invention are described specifically herein. For at least one embodiment of the foregoing personalized prompt generation method, reference may be made to FIG. 1; the method specifically includes the following steps:
S1: receiving a user interaction instruction and personalized comprehensive information.
The invention is premised on a human-machine dialogue scenario, so the application takes effect as interaction between the user and a computer. During the interaction, however, the invention focuses only on how the computer gives a speech response and places no particular limit on the user's mode of interaction, including the mode of issuing the interaction instruction mentioned in this step. For example, the user may interact via voice, touch, handwriting, motion sensing, keyboard entry, or physical keys. Moreover, the implementation and expression of the interactive content may differ across product application environments and scenarios: a user may interact with content such as driving operations, in-vehicle equipment, external calls, and video entertainment within the limited space of a carrier such as a vehicle or an aircraft; interaction between users and robots or intelligent interactive devices deployed in open public places may be driven by service and office needs; and users may perform corresponding operational interaction with application software and operating platforms on portable mobile terminals and intelligent terminals.
From the above description, at least the following can be further explained:
First, regardless of the interaction mode adopted by the user, front-end processing of the interaction instruction issued by the user may be performed in advance to facilitate the subsequent operations of text generation and speech synthesis. This front-end processing may differ by interaction mode, e.g., semantic understanding, behavior perception analysis, signal decoding, or electrical signal conversion, so that the computer "knows" the user's interaction intention. It should be particularly noted that the interaction intention mentioned here is not limited to instructions the user actively requests, such as personalized operations or queries; it may also include conventional wake-up instructions for the device, passive response instructions (such as answering an incoming call or responding to a fault alarm), and the like. Those skilled in the art will also understand that front-end processing does not necessarily mean parsing the interaction instruction; for a simple key press and the like, it does not carry the technical meaning of "parsing" in the art.
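As a purely hypothetical sketch, modality-dependent front-end processing could be modeled as a dispatch: voice input is "parsed" into an intent, while simple key or touch events pass through without parsing. The modality names and return shapes here are invented:

```python
# Toy front-end processing dispatch over input modalities. Only the voice
# branch involves anything resembling parsing; key/touch events do not.

def front_end_process(modality: str, payload: str) -> dict:
    if modality == "voice":
        # Stand-in for semantic understanding of a recognized utterance.
        return {"type": "semantic", "intent": payload.strip().lower()}
    if modality in ("touch", "key"):
        # A button or touch event needs no parsing; forward it as-is.
        return {"type": "event", "intent": payload}
    raise ValueError(f"unsupported modality: {modality}")
```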
Second, the personalized comprehensive information varies with the interaction scene and environment, but whether for a vehicle, a robot, an intelligent device, or a mobile terminal, it is characterized as information related to the individual currently interacting, and is a comprehensive representation of multi-dimensional information. The way the personalized comprehensive information is acquired may vary in practice: acquisition and uploading may start when the user interacts with the computer; the information may already have been acquired and uploaded while the user was using an object such as a vehicle, robot, intelligent device, or mobile terminal; it may be pre-stored history information from user registration, login, and past operations; or it may include information acquired in all of the above stages.
Those skilled in the art will appreciate that the actual coverage of the personalized comprehensive information is not limited: any information that can be obtained or manually set by means of the prior art and that characterizes the user's personalized needs may be considered. For convenience of explanation and practical operation, the present invention provides, according to some preferred embodiments, various examples of the personalized comprehensive information; that is, the personalized comprehensive information may include at least one of the following:
(1) Multi-dimensional user information characterizing the user's state, wherein the multi-dimensional user information further comprises user basic information and/or user advanced information.
The user basic information here may include, but is not limited to, dimensions of the currently interacting user such as name, ethnicity, contact details, age, physique, gender, birthplace, emotional and mental state, dialect and language, education and occupation, handwriting, worn accessories, behavior and activities, operating habits, geographic coordinates, and position in space.
The user advanced information here refers to further information related to the current interacting user. It should be noted that, to meet the user's personalized needs, the invention does not limit the collected "person"-related information to the interacting user alone, because information about a person, especially personalized information, also depends on the state of other participants within a given time and space. Therefore, in some embodiments of the invention, state information of other participants related to the current interacting user is also considered, such as multiple people in the same space, or non-current users who are related to the user and belong to other time nodes of the interactive product. Accordingly, the aforementioned user advanced information may in some embodiments include, but is not limited to, one or more of the following: user dialogue information, user relationship information, and user interest information predicted after processing the user basic information.
Taking an in-vehicle voice assistant as an example, information about the current user and the other occupants of the vehicle can be acquired through microphones, cameras, sensors, and the like. On the one hand, the basic information of the current user can be determined; on the other hand, based on speaker diarization, voiceprint recognition, image processing, and similar technologies, the total number of other occupants and basic information such as their positions and identities can be identified. Then, by means of speech recognition, semantic extraction, and information mining, the dialogue content among the users can be analyzed and the relationships among them predicted, and the users' basic information can be mined and supplemented, so that each user's interests and hobbies, such as favored music genres, singers, and stars, can be analyzed and predicted.
It should be further noted that, for the multi-dimensional user information, the acquisition and analysis results for the user basic information, other-user information, and user advanced information need not be completely isolated and independent; each kind of information can be shared for more accurate analysis. For example, the predicted result of a user relationship may be shared into the analysis of dialogue content, making it possible to determine whether a request issued by a user is the user's own need or the need of another related user, such as a child helping a parent play a children's song; and the analysis results of images and video may be shared into the prediction of user relationships, so that multi-dimensional factors such as image processing, voiceprint recognition, and user dialogue content are fused to predict the relationships among users more accurately. Taking the dialogue content information as an example, the analysis of dialogue content can be built on a multi-dimensional information generation model architecture: speech data from several users is used to obtain context-based semantic information for each, and the contextual association of the users' dialogue content is then judged from the several users' semantic understanding results, thereby determining the dialogue content among the users for use as one of the auxiliary features for the subsequent personalized persona determination and text generation.
(2) Multi-dimensional environment information characterizing the state of the user's environment.
The multi-dimensional environment information here is information other than the user established in the various scenes where the interactive application occurs, such as related field devices, time-space attributes, and interactive environment conditions. Since the application scenarios of the invention are broad, only a vehicle scenario is described here by way of example: the multi-dimensional environment information of a vehicle application may include, but is not limited to, wide-area information such as vehicle condition, vehicle speed, driving route, geographic position, instrument panel information, vehicle system configuration, the state of the vehicle APP on the intelligent terminal, in-vehicle spatial layout, in-vehicle temperature and humidity, and even the vehicle model, age, and price. This information may be acquired in real time by technical means, or set in advance or pre-stored; this embodiment imposes no limitation.
(3) External information such as network hotspots.
In order to expand the types and contents of the personalized person settings and target prompt texts, the external information refers to network resources related to the interaction, such as real-time news, popular network terms, hot search events, stock market quotations, weather forecasts, sports events, microblog headlines or e-commerce platform content, crawled as being related to the user and/or environment according to the multidimensional user information and/or multidimensional environment information.
(4) Feedback information of the user for the interaction process.
The feedback information referred to here is targeted; that is, as described above, it refers to feedback made by the user on the interaction itself. It should be pointed out that feedback here does not mean a one-to-one reply in the interaction process, but characterizes the user's degree of satisfaction with the interaction itself, for example, subjective evaluation of the synthesized target prompt in terms of the listening quality of the sound effect, the content length, and the timeliness, effectiveness and accuracy of the machine response. Of course, although the user's subjective initiative is involved, in actual operation the feedback information may be collected by technical means and from specific survey angles. For example, in some embodiments of the present invention, the feedback information may include: the user's interruption operation information for the target prompt, collected in real time or at a preset period, and/or the user's scoring information for the target prompt, collected randomly or at a preset period.
It should be noted that, on the one hand, in order to make timely adjustment, switching or improvement to the target prompt output by the scheme of the present invention, the feedback information is mainly directed at the prompt voice synthesized by the scheme of the present invention; in other words, the feedback information can be understood as forming a closed-loop control system for the technical scheme herein. On the other hand, the interruption operation information means that, during the broadcasting of the target prompt, the user breaks off the broadcast and gives an additional instruction; the interrupting instruction may take the forms described above, such as voice, keys, touch control and the like. The scoring information means that the user evaluates or makes suggestions on multiple aspects of the voice synthesized according to the invention through a preset survey mode, such as a questionnaire in voice form or graphical-interface form, a scoring mechanism such as a short-message reply via a customer service platform after the interaction ends, or virtual or physical smiling faces, star-level keys and the like; results from manual surveys can also be counted. The above user feedback can further be used as an input feature to improve or adjust the subsequent determination of the personalized person setting category and the generation of reference information for the target prompt text.
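As a purely illustrative aside, the two kinds of feedback described above (interruption operations during broadcast and survey-based scores) could be recorded in a structure along the following lines; all field names here are hypothetical and are not part of the described scheme:

```python
# Sketch of a feedback log: interruption operations and survey scores per
# synthesized prompt. Field names are invented for illustration only.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FeedbackRecord:
    prompt_id: str
    interrupted: bool = False                 # user broke off the broadcast
    interrupt_channel: Optional[str] = None   # "voice" / "key" / "touch"
    score: Optional[int] = None               # survey score, e.g. 1-5 stars

log: List[FeedbackRecord] = []
log.append(FeedbackRecord("p001", interrupted=True, interrupt_channel="voice"))
log.append(FeedbackRecord("p002", score=5))

# A simple satisfaction summary usable as an input feature for later steps.
interrupt_rate = sum(r.interrupted for r in log) / len(log)
print(interrupt_rate)  # 0.5
```

Such records would then feed the model-adjustment steps described further below.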
The above four kinds of personalized comprehensive information are only introduced schematically, and those skilled in the art can expand or change the related personalized comprehensive information in light of the disclosure herein. As input conditions of the subsequent steps, the personalized comprehensive information may be combined in various ways, for example, a single kind used alone or several kinds used together. Of course, the more kinds and the wider the dimensions of the personalized comprehensive information, the more obvious the beneficial influence on subsequent processing results.
Step S2: determining the target prompt person setting category according to the user interaction instruction, the personalized comprehensive information and a first preset strategy.
As mentioned above, the target prompt person setting category is essentially an acoustically determined category of prompt style, mood and the like that can better match the current user's personalized needs, such as normal, aloof, gentle, serious and harmonious, as well as martial-arts, anime, talk show, news broadcast, singing, Hong Kong accent, dialect and the like; these person setting categories may also be combined into composite categories with multiple effects, such as a gentle martial-arts type, a serious talk-show type, etc.
In order to determine the target prompt person setting category, the user interaction instruction and the personalized comprehensive information adopted in this step can be regarded as the input conditions for obtaining the person setting category result, while the first preset strategy characterizes the processing by which the user interaction instruction and the personalized comprehensive information are converted into the determined person setting category.
Therefore, for those skilled in the art, the first preset strategy may take multiple processing forms in actual operation. For example, a corresponding condition information table is designed for each preset person setting category; keywords are then parsed from the obtained user interaction instruction and personalized comprehensive information, matched against the condition information tables, and the distribution of the matching results is counted to determine the most suitable person setting category. Taking a vehicle application as an example, keywords such as a fuel consumption inquiry instruction, highway, high speed and low fuel level are extracted from the user interaction instruction and the personalized comprehensive information; since the preset condition information table of the serious standard type covers these keywords, the target prompt person setting category can be determined as serious standard. Of course, this is merely a simple illustration, and the actual matching strategy may be more complex and accurate, which is not limited in this embodiment.
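As a non-authoritative sketch of the condition-information-table form of the first preset strategy described above, keyword matching against per-category tables might look like the following; the table contents and category names are invented for illustration and are not the patented implementation:

```python
# Illustrative only: keyword/condition-table matching for choosing a
# person setting category. Tables and keywords are hypothetical examples.
CONDITION_TABLES = {
    "serious_standard": {"fuel query", "highway", "high speed", "low fuel"},
    "gentle_anime":     {"music", "children's song", "cartoon"},
}

def match_persona(keywords):
    """Count keyword hits against each category's condition table and
    return the best-matching person setting category."""
    scores = {
        category: len(table & set(keywords))
        for category, table in CONDITION_TABLES.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to a default category when nothing matches.
    return best if scores[best] > 0 else "normal_standard"

print(match_persona(["fuel query", "highway", "low fuel"]))  # serious_standard
```

In practice the counted distribution of matches, rather than a single argmax, could be used, as the text notes.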
In addition, the first preset strategy can be implemented based on a model processing concept, and the model may be used in various ways. For example, the entire content of the received user interaction instruction and personalized comprehensive information is input into a pre-constructed classification model, which is not limited to a neural network or other structures; in actual operation, the received information can be converted into feature vector form and then classified by a multi-information joint decision model, mapping the feature vector to the corresponding target prompt person setting category. Furthermore, corresponding labels can be preset for the output person setting categories, for example: <GO1> (normal standard), <GO2> (warm-soft animation), <GO3> (martial arts), etc. The model can then output a probability value for each category; a better label can be selected according to the probability ranking of the labels, or a label can be randomly drawn from the labels meeting a probability threshold to serve as the person setting category of the current target prompt. This is only a schematic introduction, and the invention is not limited thereto.
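The label-selection step just described (taking the top-probability label, or drawing randomly among labels above a threshold) can be sketched as follows; the labels follow the <GO1>-style examples above, while the probability values are made up:

```python
# Illustrative sketch: choosing a person setting label from classifier output
# probabilities, either top-1 or randomly among above-threshold labels.
import random

def select_label(probs, threshold=None, rng=random):
    """probs: dict mapping label -> probability from the classification model."""
    if threshold is None:
        return max(probs, key=probs.get)           # highest-probability label
    candidates = [label for label, p in probs.items() if p >= threshold]
    # If no label clears the threshold, fall back to the top-1 choice.
    return rng.choice(candidates) if candidates else max(probs, key=probs.get)

probs = {"<GO1>": 0.55, "<GO2>": 0.30, "<GO3>": 0.15}
print(select_label(probs))  # <GO1>
```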
Step S3: generating a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt person setting category and a second preset strategy.
In order to determine the target prompt text to be synthesized into voice, the user interaction instruction, the personalized comprehensive information and the target prompt person setting category adopted in this step can be regarded as the input conditions for generating the target prompt text. As mentioned above, the condition information of step S2 is reused, and at the same time the generation of the text also depends on the processing result of the person-setting determination step. This design has at least the following advantages: on the one hand, more condition superposition effects can be produced, so that the subsequent step fully adopts the processing result of the previous step; on the other hand, when the second preset strategy is processed, targeted information can be selected from the user interaction instruction and personalized comprehensive information based on the determined person setting category (the selection of condition information is described later and is not repeated here). It should be noted that the foregoing description of the first preset strategy also applies to the second preset strategy, and a better solution is to dynamically adjust and generate the target prompt text by means of machine learning.
For example, a large amount of text corpus may be collected, and a vector representation of the determined person setting label may be obtained by means such as word2vec, serving as an input feature vector of a personalized prompt text generation model based on, but not limited to, seq2seq; the model then automatically generates the corresponding prompt text. Of course, those skilled in the art will appreciate that the personalized prompt text generation model also requires training with a large and diverse corpus. For ease of implementation, a preferred personalized prompt text generation model structure is schematically illustrated here:
the model is mainly divided into two parts, an encoder and a decoder. The encoding part can encode the textual user interaction instruction and personalized comprehensive information through a BLSTM (bidirectional long short-term memory network) to obtain an encoded feature vector expression; the feature vector is then used as the input of the decoding part and decoded through an attention layer and an LSTM (unidirectional long short-term memory network) to obtain the target prompt text. The decoder output may be a multidimensional vector whose size corresponds to the word stock (the word stock is the word set contained in the prompt corpus); the vector represents the generation probability of each word in the text. The word with the highest probability is taken as the generated word, and the previously generated word is then taken as the input to decode the next word in the sequence, until the whole target prompt sentence is generated. As described above, the processing result of the previous step is used as an input condition; for example, the label <GO1> is used as the initial vector of the target prompt person setting category. Accordingly, in the training stage of the personalized prompt text generation model, the target prompt corpus corresponding to the <GO1> type can be labeled in combination with the instruction samples and the personalized comprehensive information, so that the obtained target prompt text is accurate. For example, when the vehicle is running at high speed or a fault occurs and the user instruction is to inquire about the fuel or battery level, the person setting category is "standard serious", so that the content of the generated target prompt text is more concise and direct and does not distract the user.
Furthermore, this step can be further optimized based on the user interaction instruction and the personalized comprehensive information; that is, the generated target prompt text can not only directly respond to the current instruction intention, but also learn the extended intention behind the instruction. For example, when the user inquires about parking lots, the automatically generated target text not only contains the position information of the surrounding parking lots, but can also further contain, according to the user's destination, the positions of nearby parking lots with sufficient spaces and the number of their vacancies. This embodiment is not limited thereto.
Regarding the preferred scheme of selecting the condition information for the first preset policy and the second preset policy, in combination with the model processing concept described above, reference may be made to fig. 2 as follows:
Step S10: analyzing the user interaction instruction;
the analysis refers to grasping the intention of the user interaction instruction, including but not limited to semantic analysis of a voice instruction, analysis of action behavior, and the like. Of course, depending on the front-end processing, in some embodiments the user interaction instruction can be used directly as the input condition of a subsequent step without "analysis", such as a high-level signal given by a switch button; the related description is omitted here.
Step S20, selecting corresponding information from personalized comprehensive information according to the analyzed user interaction instruction;
in this preferred embodiment, it is emphasized that, for the multidimensional and wide-area personalized comprehensive information, in order to reduce the amount of computation and obtain personalized comprehensive information more relevant to the user's intention, the personalized comprehensive information may be filtered before the subsequent steps are processed, and the basis of the selection depends on the user intention conveyed by the user interaction instruction.
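A minimal sketch of this selection step, assuming a hypothetical mapping from parsed intents to the information fields relevant to them (the mapping and field names are invented for illustration):

```python
# Illustrative only: filtering the wide-area personalized comprehensive
# information down to the fields relevant to the parsed user intent.
INTENT_FIELDS = {
    "fuel_query":    ["vehicle_condition", "vehicle_speed", "driving_route"],
    "music_request": ["user_age", "listening_history"],
}

def select_info(intent, comprehensive_info):
    """Keep only the fields that the intent declares relevant."""
    wanted = INTENT_FIELDS.get(intent, [])
    return {k: v for k, v in comprehensive_info.items() if k in wanted}

info = {"vehicle_speed": 110, "listening_history": ["songA"], "vehicle_condition": "ok"}
print(select_info("fuel_query", info))
```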
Step S31 (aiming at a first preset strategy), inputting the analyzed user interaction instruction and the selected personalized comprehensive information into a pre-built personalized personal setting generation model;
step S32 (aiming at a second preset strategy), inputting the analyzed user interaction instruction, the selected personalized comprehensive information and the target prompt language personal setting category into a pre-constructed personalized prompt language text generation model.
Reference is made to the foregoing for both steps, and no further description is given here.
Step S4: synthesizing the target prompt by utilizing the target prompt person setting category, the target prompt text and a third preset strategy.
In order to synthesize the final target prompt, this step combines the processing results of step S2 and step S3, that is, the determined personalized target prompt person setting category and the target prompt text; the third preset strategy may therefore refer to invoking a TTS engine adapted in speech rate, tone and the like based on the results of the foregoing steps. On this basis, in order to improve the adaptation effect, a plurality of TTS engines can be customized in advance based on the user's historical interaction behavior, combined with a large number of related corpora within a preset range; that is, a plurality of customized TTS synthesis models based on different timbres, rhythms, speech rates and the like are trained in advance. This can improve synthesis efficiency while giving the user an immersive synthesis effect close to everyday communication.
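The dispatch to one of several pre-customized TTS engines might be sketched as follows; the engine registry and parameter names are assumptions for illustration, standing in for actual pre-trained synthesis models:

```python
# Illustrative sketch: selecting synthesis parameters (a stand-in for a
# customized TTS engine) based on the determined person setting category.
TTS_ENGINES = {
    "serious_standard": {"voice": "calm_male", "rate": 1.0, "pitch": 0.9},
    "gentle_anime":     {"voice": "soft_female", "rate": 1.1, "pitch": 1.2},
}
DEFAULT_ENGINE = {"voice": "neutral", "rate": 1.0, "pitch": 1.0}

def pick_engine(category):
    """Return synthesis parameters for the category, falling back to default."""
    return TTS_ENGINES.get(category, DEFAULT_ENGINE)

print(pick_engine("serious_standard")["voice"])  # calm_male
```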
Further, in at least one embodiment of the present invention, the synthesized target prompt may be further modulated by acoustic processing during voice synthesis. As shown in fig. 3, the third preset strategy may include the following steps, whose execution order is not limited (that is, the third preset strategy in this example is not restricted to the voice synthesis stage, and the order indicated by the arrows in fig. 3 is only for convenience of illustration):
step S100, carrying out acoustic processing on the user interaction instruction in an audio form;
step S200, determining a noise reduction signal based on an acoustic processing result and noise related information of the current environment of the user;
step S300, inputting target prompt person setting categories and target prompt texts into a pre-constructed prompt synthesis model to synthesize an initial prompt;
Step S400: fusing the noise reduction signal and the initial prompt to obtain the target prompt.
In specific operation, the acoustic processing applied to the user interaction instruction in voice form during front-end processing, such as echo cancellation, noise reduction and voice-zone identification, together with the noise-related information about the user's environment obtained from the personalized comprehensive information, such as vehicle speed, open windows, tire noise and other noise conditions, can be used to predict the noise the user will hear. A noise reduction signal with the same amplitude and opposite phase can then be generated based on the predicted noise and superimposed onto the initial TTS audio synthesized from the target prompt person setting category and the target prompt text, so that the user hears the target prompt more clearly. This can greatly improve the user experience and, in some special application scenarios, bring additional benefits to the interaction process, such as driving safety.
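The same-amplitude, opposite-phase superposition described above can be illustrated on plain sample lists as follows; a real system would of course operate on streaming audio frames with latency compensation:

```python
# Illustrative sketch of the anti-noise idea: invert the predicted noise and
# superimpose it onto the synthesized TTS audio. Samples are plain floats.
def anti_noise(predicted_noise):
    """Generate a signal with equal amplitude and opposite phase."""
    return [-s for s in predicted_noise]

def mix(tts_audio, noise_reduction):
    """Superimpose the noise-reduction signal onto the synthesized prompt."""
    return [a + b for a, b in zip(tts_audio, noise_reduction)]

noise = [0.2, -0.1, 0.05]     # predicted noise samples (fabricated values)
tts = [0.5, 0.4, 0.3]         # initial synthesized prompt samples
target = mix(tts, anti_noise(noise))
print([round(x, 2) for x in target])  # [0.3, 0.5, 0.25]
```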
Based on the foregoing implementation examples and the preferred solutions thereof, in other preferred solutions of the present invention, embodiments of model processing concepts are further optimized and improved for the first and second preset strategies, for example, but not limited to, iterative updating of the algorithm of the used model in combination with the artificial intelligence field, as shown in fig. 4, where the method further includes:
Step M1: recording interaction history data of the user;
Step M2: determining a reward and punishment mechanism of the personalized personal setting generation model and/or the personalized prompt text generation model according to the interaction history data;
Step M3: adaptively updating the parameters of the model, based on the general model architecture, in combination with the reward and punishment mechanism.
Here, an additional description is given for the foregoing personalized personal setting generation model and personalized prompt text generation model: in some embodiments, models with generic CNN, LSTM or similar architectures may be trained offline based on a large amount of training data. As the number of user interactions grows, the user's feedback information and/or usage habits during interaction can be continuously recorded to form interaction history data. Based on the general model, the interaction history data are used as reward and punishment parameters, and the parameters of the model are adaptively updated by reinforcement learning, so that the model conforms more and more to the user's expectations as the interaction history data grow, achieving a better subsequent synthesis effect. It should be noted that, regarding the feedback information, reference may be made to the above-mentioned interruption operation of the user on the synthesized target prompt; this information can be used as a reinforcement-learning punishment measure to continuously optimize and reduce the text length of the target prompt in the corresponding scene. Multidimensional scoring evaluations of the synthesized target prompt can also be collected from the user periodically or aperiodically, so that the model can adapt based on the satisfaction expressed by the user. The term "usage habit" may refer to the user's counted common interaction mode (e.g. voice or touch control), the user's common interaction scenario (e.g. single person or multiple persons), and application scenarios with a high probability of interruption.
For ease of implementation, an adaptive update scheme for the personalized personal setting generation model based on feedback information is described here. Each personalized person setting category label obtained by the personalized personal setting generation model has a probability value, and a reward mechanism can be set through the recorded feedback information; for example, normal playing is recorded as 0 points, each interruption as -1 point, positive feedback obtained through active interaction as +10 points, negative feedback as -10 points, and so on. A reinforcement learning algorithm then uses these scores to adjust the classification result of the personalized personal setting generation model, that is, the probability value of each personalized person setting category label, so that the category predicted by the model better fits the style the user expects.
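The scoring-based adjustment described above can be sketched with a simple re-weighting rule; this is an illustrative stand-in under stated assumptions, not the actual reinforcement-learning algorithm:

```python
# Sketch of the reward/punishment adjustment: feedback events are scored
# (normal play 0, interruption -1, positive +10, negative -10) and used to
# nudge the probability of the person setting label that produced the prompt.
# The learning rate and update rule are fabricated for illustration.
FEEDBACK_SCORES = {"normal": 0, "interrupt": -1, "positive": 10, "negative": -10}

def apply_feedback(label_probs, label, event, lr=0.01):
    """Shift the chosen label's weight by lr * score, then renormalize."""
    probs = dict(label_probs)
    probs[label] = max(1e-6, probs[label] + lr * FEEDBACK_SCORES[event])
    total = sum(probs.values())
    return {l: p / total for l, p in probs.items()}

probs = {"<GO1>": 0.5, "<GO2>": 0.3, "<GO3>": 0.2}
probs = apply_feedback(probs, "<GO2>", "positive")
print(max(probs, key=probs.get))  # <GO1> still leads, but <GO2> moved up
```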
In summary, the core concept of the invention is to repeatedly combine and share each key task stage and the multidimensional input information in the prompt generation process, so as to ensure that each task stage can meet the requirement of personalized output, making the content and style of the final synthesized prompt voice, which integrates the processing results of each stage, richer and more humanized. Based on this conception, the invention combines automatic processing strategies such as modeling, making the prompt generation scheme more automatic and intelligent, and also performs self-iteration and updating of the models by means of reward and punishment mechanisms from the machine learning field, improving the adaptability and flexible matching capability of the output personalized prompt. Furthermore, the invention superimposes acoustic optimization effects such as noise reduction onto the synthesized prompt, improving the perceived quality of the prompt for the user. Therefore, deploying the scheme of the invention in different scene environments can help improve the interaction efficiency of related applications, and in specific application scenarios can improve the safety, reliability, friendliness and even user satisfaction of the interactive product.
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of a personalized hint generating device, as shown in fig. 5, which may specifically include the following components:
the receiving module 1 is used for receiving user interaction instructions and personalized comprehensive information;
the target person setting determining module 2 is used for determining a target prompt person setting category according to the user interaction instruction, the personalized comprehensive information and a first preset strategy;
the prompt language text generation module 3 is used for generating a target prompt language text based on the user interaction instruction, the personalized comprehensive information, the target prompt language person setting category and a second preset strategy;
and the prompt synthesis module 4 is used for synthesizing the target prompt by utilizing the target prompt person setting category, the target prompt text and a third preset strategy.
The related contents such as the personalized comprehensive information and the like can be referred to the above description, and are not repeated here.
Further, in one possible implementation,
the receiving module specifically comprises:
the instruction analysis unit is used for analyzing the user interaction instruction;
the information selection unit is used for selecting corresponding information from the personalized comprehensive information according to the analyzed user interaction instruction;
The target person setting determining module specifically includes:
the first model processing unit is used for inputting the analyzed user interaction instruction and the selected personalized comprehensive information into a pre-constructed personalized personal setting generation model;
the prompt text generation module specifically comprises:
the second model processing unit is used for inputting the analyzed user interaction instruction, the selected personalized comprehensive information and the target prompt language personal setting category into a pre-built personalized prompt language text generation model.
Further, in one possible implementation manner, the apparatus further includes:
the interaction history acquisition module is used for recording interaction history data of a user, wherein the interaction history data comprises feedback information and/or use habits;
the model iteration module specifically comprises:
the reward and punishment mechanism setting unit is used for determining a reward and punishment mechanism of the personalized personal setting generation model and/or the personalized prompt language text generation model according to the interaction historical data;
and the self-adaptive updating unit is used for self-adaptively updating the parameters of the model by combining the reward and punishment mechanism on the basis of the general model architecture.
Further, in one possible implementation manner, as shown in fig. 6, the apparatus further includes:
the front-end acoustic processing module 5 is used for performing acoustic processing on the user interaction instruction in the audio form;
the noise reduction signal generation module 6 is used for determining a noise reduction signal based on the acoustic processing result and the noise related information of the current environment of the user;
the prompt synthesis module specifically comprises:
an initial prompt synthesis unit 41, configured to input the target prompt person set category and the target prompt text into a preset prompt synthesis model, and synthesize an initial prompt;
and a target prompt synthesizing unit 42, configured to fuse the noise reduction signal and the initial prompt to obtain the target prompt.
It should be understood that the division of the components of the personalized prompt generation device shown in fig. 5 and 6 is merely a division of logical functions; in practice they may be fully or partially integrated into one physical entity or physically separated. These components may all be implemented as software invoked by a processing element, or all in hardware, or partly as software invoked by a processing element and partly as hardware. For example, some of the above modules may be separately established processing elements, or may be integrated in a chip of the electronic device; the implementation of the other components is similar. In addition, all or part of the components can be integrated together or implemented independently. In implementation, each step of the above method, or each of the above components, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as: one or more application specific integrated circuits (Application Specific Integrated Circuit; hereinafter ASIC), or one or more digital signal processors (Digital Signal Processor; hereinafter DSP), or one or more field programmable gate arrays (Field Programmable Gate Array; hereinafter FPGA), etc. For another example, these components may be integrated together and implemented in the form of a System-On-a-Chip (SOC).
In view of the foregoing examples and their preferred embodiments, those skilled in the art will appreciate that in practice the present invention is applicable to a variety of embodiments, and the present invention is schematically illustrated by the following carriers:
(1) A personalized cue generation device may include:
one or more processors, memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the device, cause the device to perform the steps/functions of the foregoing embodiments or equivalent implementations.
Fig. 7 is a schematic structural diagram of an embodiment of the personalized prompt generating apparatus of the present invention, where the apparatus may be an electronic apparatus or a circuit apparatus built into an electronic apparatus. Depending on the interactive application scene, the electronic equipment may be a cloud server, a mobile terminal (mobile phone), an intelligent screen, an unmanned aerial vehicle, an ICV (intelligent connected vehicle), a smart car or vehicle-mounted equipment, etc. The specific form of the personalized prompt generating apparatus is not limited in this embodiment.
As specifically shown in fig. 7, the personalized prompt generating device 900 includes a processor 910 and a memory 930. The processor 910 and the memory 930 may communicate with each other via an internal connection and transfer control and/or data signals; the memory 930 is configured to store a computer program, and the processor 910 is configured to call and execute the computer program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device or, more commonly, be separate components, with the processor 910 executing the program code stored in the memory 930 to realize the functions described above. In a specific implementation, the memory 930 may also be integrated within the processor 910 or be separate from the processor 910.
In addition, in order to further enrich the functionality of the personalized prompt generating device 900, the device 900 may in some scenarios further comprise one or more of an input unit 960, a display unit 970, audio circuitry 980, a camera 990, a sensor 901, etc., where the audio circuitry may further comprise a speaker 982, a microphone 984, etc., and the display unit 970 may include a display screen.
Further, the personalized prompt generating apparatus 900 may also include a power supply 950 for providing electrical power to the various devices or circuits in the apparatus 900.
It should be appreciated that the personalized prompt generating apparatus 900 shown in fig. 7 is capable of implementing the various processes of the methods provided by the foregoing embodiments. The operations and/or functions of the various components in the device 900 serve to implement the corresponding flows in the method embodiments described above. Reference is made to the foregoing descriptions of the method embodiments; detailed descriptions are appropriately omitted here to avoid redundancy.
It should be understood that the processor 910 in the personalized prompt generation device 900 illustrated in fig. 7 may be a system-on-chip (SOC); the processor 910 may include a central processing unit (Central Processing Unit; hereinafter referred to as CPU) and may further include other types of processors, for example a graphics processor (Graphics Processing Unit; hereinafter referred to as GPU), etc.
In general, portions of the processors or processing units within the processor 910 may cooperate to implement the preceding method flows, and corresponding software programs for the portions of the processors or processing units may be stored in the memory 930.
(2) A readable storage medium having stored thereon a computer program or the above-mentioned means, which when executed, causes a computer to perform the steps/functions of the foregoing embodiments or equivalent implementations.
In the several embodiments provided by the present invention, any of the functions, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention, in essence, or the part thereof that contributes to the prior art, may be embodied in the form of a software product as described below.
(3) A computer program product (which may comprise the above-described apparatus) which, when run on a terminal device, causes the terminal device to perform the steps/functions of the preceding embodiments or equivalent implementations.
From the above description of the embodiments, it will be apparent to those skilled in the art that all or part of the steps of the above-described methods may be implemented by software plus a necessary general-purpose hardware platform. Based on such an understanding, the above-described computer program product may include, but is not limited to, an APP; the device/terminal may be a computer device (e.g., a mobile phone, a PC terminal, a cloud platform, a server, a server cluster, or a network communication device such as a media gateway). Moreover, the hardware structure of the computer device may specifically include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus; the processor, the communication interface, and the memory may all communicate with one another through the communication bus. The processor may be a central processing unit (CPU), a microcontroller, or a digital signal processor (DSP), and may further include a GPU, an embedded neural-network processing unit (NPU), and an image signal processor (ISP); the processor may further include an ASIC (application-specific integrated circuit) or one or more integrated circuits configured to implement embodiments of the present invention. In addition, the processor may be capable of running one or more software programs, which may be stored in a storage medium such as the memory; the aforementioned memory/storage medium may include: a non-volatile memory, such as a non-removable magnetic disk, a USB flash disk, a removable hard disk, or an optical disk, as well as a read-only memory (ROM), a random access memory (RAM), and the like.
In the embodiments of the present invention, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, or B alone, where A and B may each be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single items or plural items. For example, "at least one of a, b, and c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where each of a, b, and c may be singular or plural.
Those of skill in the art will appreciate that the various modules, units, and method steps described in the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The embodiments in this specification are described in a progressive manner, and for the same or similar parts of the embodiments, reference may be made to one another. In particular, since the apparatus and device embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the description of the method embodiments for relevant details. The above-described apparatus and device embodiments are merely illustrative; the modules and units illustrated as separate components may or may not be physically separate, i.e., they may be located in one place or distributed across multiple places, e.g., across nodes of a system network. Some or all of the modules and units may be selected according to actual needs to achieve the purpose of the embodiment. Those skilled in the art can understand and practice the invention without undue effort.
The construction, features, and effects of the present invention have been described in detail with reference to the embodiments shown in the drawings. The above, however, are only preferred embodiments of the present invention, and the technical features of the above embodiments and their preferred modes can be reasonably combined into various equivalent schemes by those skilled in the art without departing from or changing the design concept and technical effects of the present invention. Therefore, the invention is not limited to the embodiments shown in the drawings; any change made according to the concept of the invention, or any modification to an equivalent embodiment that does not depart from the spirit of the invention covered by the specification and drawings, falls within the scope of the invention.

Claims (12)

1. A personalized prompt generation method, comprising:
receiving a user interaction instruction and personalized comprehensive information;
determining a target prompt persona category according to the user interaction instruction, the personalized comprehensive information, and a first preset strategy, wherein the first preset strategy represents a process of converting the user interaction instruction and the personalized comprehensive information into the determined persona category;
generating a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a second preset strategy, wherein the second preset strategy is used for selecting targeted information from the user interaction instruction and the personalized comprehensive information based on the determined persona category; and
synthesizing the target prompt by using the target prompt persona category, the target prompt text, and a third preset strategy, wherein the third preset strategy is a TTS engine call.
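The three-stage flow recited in claim 1 (instruction and profile → persona category → prompt text → synthesized prompt) can be sketched as follows. This is only an illustrative outline under assumed inputs: all function names, the persona labels, and the selection rules are hypothetical, and the TTS engine call is stubbed; none of this is the patented implementation.

```python
# Hypothetical sketch of the claimed three-stage pipeline.

def determine_persona(instruction: str, profile: dict) -> str:
    """First preset strategy (illustrative): map the interaction instruction
    and the personalized comprehensive information to a persona category."""
    if profile.get("age", 30) < 18:
        return "playful"
    if "navigate" in instruction:
        return "concise"
    return "friendly"

def generate_prompt_text(instruction: str, profile: dict, persona: str) -> str:
    """Second preset strategy (illustrative): pick targeted information
    from the instruction and profile for the chosen persona."""
    name = profile.get("name", "there")
    if persona == "concise":
        return f"Route ready for {name}."
    return f"Hi {name}, here is what I found for '{instruction}'."

def synthesize(persona: str, text: str) -> bytes:
    """Third preset strategy: a TTS engine call (stubbed as bytes here)."""
    return f"[{persona}] {text}".encode("utf-8")

def generate_personalized_prompt(instruction: str, profile: dict) -> bytes:
    persona = determine_persona(instruction, profile)
    text = generate_prompt_text(instruction, profile, persona)
    return synthesize(persona, text)
```

In a real system the two rule-based functions would be replaced by the pre-built persona generation and prompt text generation models described in claim 4.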
2. The personalized prompt generation method according to claim 1, wherein the personalized comprehensive information comprises at least one of:
multi-dimensional user information characterizing a user state, wherein the multi-dimensional user information comprises user basic information and/or user advanced information;
multi-dimensional environment information characterizing the state of the environment the user is in;
external information based on network hot spots; and
feedback information of the user on the interaction process.
3. The personalized prompt generation method according to claim 2, wherein:
the user advanced information comprises at least one of: user conversation-topic information, user relationship information, and user interest information predicted from the processed user basic information; and
the feedback information comprises: operation information of the user interrupting the target prompt, acquired in real time during the interaction process, and/or scoring information of the target prompt by the user, acquired by random sampling.
4. The personalized prompt generation method according to claim 1, wherein
the first preset strategy comprises:
parsing the user interaction instruction;
selecting corresponding information from the personalized comprehensive information according to the parsed user interaction instruction; and
inputting the parsed user interaction instruction and the selected personalized comprehensive information into a pre-built personalized persona generation model; and
the second preset strategy comprises:
inputting the parsed user interaction instruction, the selected personalized comprehensive information, and the target prompt persona category into a pre-built personalized prompt text generation model.
5. The personalized prompt generation method of claim 4, wherein the method further comprises:
recording interaction history data of a user, wherein the interaction history data comprises feedback information and/or usage habits;
determining a reward-and-punishment mechanism of the personalized persona generation model and/or the personalized prompt text generation model according to the interaction history data; and
adaptively updating the parameters of the model based on the general model architecture in combination with the reward-and-punishment mechanism.
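The reward-and-punishment update in claim 5 can be illustrated minimally: interruption feedback punishes a persona, sampled scores reward it, and the weights are updated on top of a fixed architecture. The class name, the weight representation, and all numeric constants below are assumptions for illustration only, not the patented mechanism.

```python
# Hypothetical sketch of a reward-and-punishment model update:
# user feedback (interruptions, sampled ratings) adjusts per-persona
# preference weights kept on top of a fixed "general architecture"
# (represented here by a plain dict of weights).

class PersonaModel:
    def __init__(self, personas):
        # uniform initial weights stand in for the general model architecture
        self.weights = {p: 1.0 for p in personas}

    def update(self, persona, interrupted, rating=None):
        # punishment: the user cut the prompt off mid-playback
        if interrupted:
            self.weights[persona] -= 0.5
        # reward: shift by the optional sampled user score in [0, 5],
        # centered so scores below 2.5 act as punishment
        if rating is not None:
            self.weights[persona] += (rating - 2.5) * 0.2

    def best(self):
        """Persona category currently preferred by the adapted model."""
        return max(self.weights, key=self.weights.get)
```

A production system would instead backpropagate such reward signals into the parameters of the persona and prompt text generation models.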
6. The personalized prompt generation method according to any one of claims 1 to 5, wherein the third preset strategy comprises:
performing acoustic processing on the user interaction instruction in audio form;
determining a noise reduction signal based on the acoustic processing result and noise-related information of the user's current environment;
inputting the target prompt persona category and the target prompt text into a pre-built prompt synthesis model to synthesize an initial prompt; and
fusing the noise reduction signal and the initial prompt to obtain the target prompt.
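The final fusion step of claim 6 can be pictured as a sample-wise mix of two aligned audio frames: the environment-derived noise reduction signal is added to the initial prompt so the target prompt remains intelligible in the current noise. Representing frames as plain lists of floats and fusing by weighted addition are simplifying assumptions; the patent does not specify the fusion operation.

```python
# Hypothetical sketch of fusing a noise reduction signal with the
# initial prompt to obtain the target prompt.

def fuse(initial_prompt, noise_reduction, gain=1.0):
    """Sample-wise fusion of two equal-length PCM frames
    (frames are plain lists of float samples here)."""
    if len(initial_prompt) != len(noise_reduction):
        raise ValueError("frames must be aligned")
    return [s + gain * n for s, n in zip(initial_prompt, noise_reduction)]
```

Real pipelines would operate on buffered audio (e.g. 16-bit PCM arrays) and might apply spectral rather than time-domain mixing.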
7. A personalized prompt generation device, comprising:
a receiving module, configured to receive a user interaction instruction and personalized comprehensive information;
a target persona determining module, configured to determine a target prompt persona category according to the user interaction instruction, the personalized comprehensive information, and a first preset strategy, wherein the first preset strategy represents a process of converting the user interaction instruction and the personalized comprehensive information into the determined persona category;
a prompt text generation module, configured to generate a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a second preset strategy, wherein the second preset strategy is used for selecting targeted information from the user interaction instruction and the personalized comprehensive information based on the determined persona category; and
a prompt synthesis module, configured to synthesize the target prompt by using the target prompt persona category, the target prompt text, and a third preset strategy, wherein the third preset strategy is a TTS engine call.
8. The personalized prompt generation device according to claim 7, wherein
the receiving module specifically comprises:
an instruction parsing unit, configured to parse the user interaction instruction; and
an information selection unit, configured to select corresponding information from the personalized comprehensive information according to the parsed user interaction instruction;
the target persona determining module specifically comprises:
a first model processing unit, configured to input the parsed user interaction instruction and the selected personalized comprehensive information into a pre-built personalized persona generation model; and
the prompt text generation module specifically comprises:
a second model processing unit, configured to input the parsed user interaction instruction, the selected personalized comprehensive information, and the target prompt persona category into a pre-built personalized prompt text generation model.
9. The personalized prompt generation device of claim 8, wherein the device further comprises:
an interaction history acquisition module, configured to record interaction history data of a user, wherein the interaction history data comprises feedback information and/or usage habits; and
a model iteration module, which specifically comprises:
a reward-and-punishment mechanism setting unit, configured to determine a reward-and-punishment mechanism of the personalized persona generation model and/or the personalized prompt text generation model according to the interaction history data; and
an adaptive updating unit, configured to adaptively update the parameters of the model based on the general model architecture in combination with the reward-and-punishment mechanism.
10. The personalized prompt generation device according to any one of claims 7 to 9, further comprising:
a front-end acoustic processing module, configured to perform acoustic processing on the user interaction instruction in audio form; and
a noise reduction signal generation module, configured to determine a noise reduction signal based on the acoustic processing result and noise-related information of the user's current environment;
wherein the prompt synthesis module specifically comprises:
an initial prompt synthesis unit, configured to input the target prompt persona category and the target prompt text into a pre-built prompt synthesis model to synthesize an initial prompt; and
a target prompt synthesis unit, configured to fuse the noise reduction signal and the initial prompt to obtain the target prompt.
11. A personalized prompt generation apparatus, comprising:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the apparatus, cause the apparatus to perform the personalized prompt generation method of any one of claims 1 to 6.
12. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, which, when run on a computer, causes the computer to perform the personalized prompt generation method of any one of claims 1 to 6.
CN201911276510.0A 2019-12-12 2019-12-12 Personalized prompt generation method, device and equipment Active CN111145721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911276510.0A CN111145721B (en) 2019-12-12 2019-12-12 Personalized prompt generation method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911276510.0A CN111145721B (en) 2019-12-12 2019-12-12 Personalized prompt generation method, device and equipment

Publications (2)

Publication Number Publication Date
CN111145721A CN111145721A (en) 2020-05-12
CN111145721B true CN111145721B (en) 2024-02-13

Family

ID=70518055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911276510.0A Active CN111145721B (en) 2019-12-12 2019-12-12 Personalized prompt generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN111145721B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349271A (en) * 2020-11-06 2021-02-09 北京乐学帮网络技术有限公司 Voice information processing method and device, electronic equipment and storage medium
CN112614484B (en) * 2020-11-23 2022-05-20 北京百度网讯科技有限公司 Feature information mining method and device and electronic equipment
CN112735423B (en) * 2020-12-14 2024-04-05 美的集团股份有限公司 Voice interaction method and device, electronic equipment and storage medium
CN112948624A (en) * 2021-02-02 2021-06-11 武汉小安科技有限公司 Voice configuration method and device, electronic equipment and storage medium
CN113722610B (en) * 2021-08-13 2024-03-19 支付宝(杭州)信息技术有限公司 Method, device and equipment for interaction between users based on search scene
CN114070674A (en) * 2021-11-12 2022-02-18 上海顺舟智能科技股份有限公司 Multifunctional internet of things control gateway
CN115086283B (en) * 2022-05-18 2024-02-06 阿里巴巴(中国)有限公司 Voice stream processing method and device
CN117010725B (en) * 2023-09-26 2024-02-13 科大讯飞股份有限公司 Personalized decision method, system and related device
CN117238281B (en) * 2023-11-09 2024-03-15 摩斯智联科技有限公司 Voice guide word arbitration method and device for vehicle-mounted system, vehicle-mounted system and storage medium
CN117933195A (en) * 2024-03-25 2024-04-26 腾讯科技(深圳)有限公司 Navigation broadcasting data processing method, device, computer equipment and storage medium

Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6144938A (en) * 1998-05-01 2000-11-07 Sun Microsystems, Inc. Voice user interface with personality
CN1379391A (en) * 2001-04-06 2002-11-13 国际商业机器公司 Method of producing individual characteristic speech sound from text
CN1589588A (en) * 2001-09-20 2005-03-02 声音识别公司 Sound enhancement for mobile phones and other products producing personalized audio for users
CN1688154A (en) * 2005-06-17 2005-10-26 北京蓝宇创杰科技发展有限公司 Dynamic information interactive system and method based on personalized ring back tone service
CN101170609A (en) * 2007-10-22 2008-04-30 中兴通讯股份有限公司 A seal personalized work number reporting system and method
CN101370055A (en) * 2008-09-24 2009-02-18 中国电信股份有限公司 Propelling movement method, platform and system for recommending information of personalized ring back tone
CN101521853A (en) * 2008-02-29 2009-09-02 丰达软件(苏州)有限公司 Method for converting multimedia with personalized speech and service end
CN101927766A (en) * 2009-06-25 2010-12-29 现代自动车株式会社 System for providing a personalized driving sound
CN101989451A (en) * 2009-07-30 2011-03-23 索尼公司 Mobile audio player with individualized radio program
CN102501823A (en) * 2011-10-28 2012-06-20 深圳市赛格导航科技股份有限公司 Automobile anti-theft alarm system with adjustable alarm promote tone
CN102654860A (en) * 2011-03-01 2012-09-05 北京彩云在线技术开发有限公司 Personalized music recommendation method and system
CN103885987A (en) * 2012-12-21 2014-06-25 ***通信集团公司 Music recommendation method and system
CN104006820A (en) * 2014-04-25 2014-08-27 南京邮电大学 Personalized dynamic real time navigation method and navigation system
EP2800017A2 (en) * 2013-04-30 2014-11-05 Orange Generation of a personalised sound related to an event
CN106303010A (en) * 2016-08-10 2017-01-04 康子纯 The communication apparatus personalized speech such as mobile phone/software icon button coupling combination controls
CN106652996A (en) * 2016-12-23 2017-05-10 北京奇虎科技有限公司 Prompt tone generating method and device and mobile terminal
CN106686267A (en) * 2015-11-10 2017-05-17 ***通信集团公司 Method and system for implementing personalized voice service
CN107044859A (en) * 2016-02-08 2017-08-15 通用汽车环球科技运作有限责任公司 Personalized Navigation route for conveying arrangement
CN107543557A (en) * 2016-06-29 2018-01-05 百度在线网络技术(北京)有限公司 A kind of method and apparatus for carrying out Personalized Navigation
CN107727109A (en) * 2017-09-08 2018-02-23 阿里巴巴集团控股有限公司 Personalized speech reminding method and device and electronic equipment
CN109218843A (en) * 2018-09-27 2019-01-15 四川长虹电器股份有限公司 Individualized intelligent phonetic prompt method based on television equipment
CN109256133A (en) * 2018-11-21 2019-01-22 上海玮舟微电子科技有限公司 A kind of voice interactive method, device, equipment and storage medium
CN109818737A (en) * 2018-12-24 2019-05-28 科大讯飞股份有限公司 Personalized password generated method and system


Also Published As

Publication number Publication date
CN111145721A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111145721B (en) Personalized prompt generation method, device and equipment
CN108962217B (en) Speech synthesis method and related equipment
CN109410927B (en) Voice recognition method, device and system combining offline command word and cloud analysis
CN101297355B (en) Systems and methods for responding to natural language speech utterance
EP3259754B1 (en) Method and device for providing information
KR20190028793A (en) Human Machine Interactive Method and Device Based on Artificial Intelligence
CN108804698A (en) Man-machine interaction method, system, medium based on personage IP and equipment
JP7171911B2 (en) Generate interactive audio tracks from visual content
CN110209774A (en) Handle the method, apparatus and terminal device of session information
CN109754792A (en) Voice interface device and the voice interface method for applying it
CN109671435A (en) Method and apparatus for waking up smart machine
CN115062627A (en) Method and apparatus for computer-aided uniform system based on artificial intelligence
CN111862938A (en) Intelligent response method, terminal and computer readable storage medium
CN115640398A (en) Comment generation model training method, comment generation device and storage medium
CN114341922A (en) Information processing system, information processing method, and program
CN114283820A (en) Multi-character voice interaction method, electronic equipment and storage medium
KR20180105501A (en) Method for processing language information and electronic device thereof
US20210407504A1 (en) Generation and operation of artificial intelligence based conversation systems
CN117271745A (en) Information processing method and device, computing equipment and storage medium
CN116052646B (en) Speech recognition method, device, storage medium and computer equipment
CN110781327A (en) Image searching method and device, terminal equipment and storage medium
Cervone et al. Roving mind: a balancing act between open–domain and engaging dialogue systems
JP6858721B2 (en) Dialogue controls, programs and methods capable of conducting content dialogue
Singh Analysis of Currently Open and Closed-source Software for the Creation of an AI Personal Assistant
CN117435704A (en) Method and device for generating reply language of vehicle-mounted dialogue system, vehicle and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant