CN111145721A - Personalized prompt language generation method, device and equipment - Google Patents


Info

Publication number
CN111145721A
CN111145721A
Authority
CN
China
Prior art keywords
personalized
information
user
language
prompt
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911276510.0A
Other languages
Chinese (zh)
Other versions
CN111145721B (en)
Inventor
李深安
章承伟
陈琛
张宏斌
谈焱
王兴宝
雷琴辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911276510.0A priority Critical patent/CN111145721B/en
Publication of CN111145721A publication Critical patent/CN111145721A/en
Application granted granted Critical
Publication of CN111145721B publication Critical patent/CN111145721B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 — Machine learning
    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 — Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a personalized prompt generation method, device and equipment. The method comprises: receiving a user interaction instruction and personalized comprehensive information; determining the persona category of the target prompt according to the user interaction instruction, the personalized comprehensive information and a corresponding preset strategy; generating the target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category and a corresponding preset strategy; and finally synthesizing the target prompt from the target prompt persona category, the target prompt text and a corresponding preset strategy. According to the invention, each task stage in the prompt generation process repeatedly combines and shares the multi-dimensional input information, so that every task stage can meet the requirement of personalized output and the content and style of the finally synthesized voice are more diverse and humanized. Applying the invention in different scene environments can therefore also help improve the interaction efficiency of related applications and thus the user experience of interactive products.

Description

Personalized prompt language generation method, device and equipment
Technical Field
The invention relates to the field of human-computer interaction, in particular to a method, a device and equipment for generating a personalized prompt.
Background
In human-computer interaction, the machine processes the interaction request instructions that users issue by voice, text, touch and other means, and feeds the processing result back to the user. Broadcasting a prompt is one of the friendlier, more direct and more efficient means for the machine to feed information back to the user, so application scenes involving prompt broadcasting are the most common among existing human-computer interaction products. For example, with a vehicle-mounted voice assistant, to ensure driving safety, the user can receive the voice broadcast information fed back by the head unit in audible form while driving.
In most existing prompt generation schemes, when the machine's processing result meets a certain condition, a fixed-pattern text based on a specific rule is produced, and the corresponding prompt is then synthesized by combining Text-To-Speech (TTS) technology.
However, the output of existing prompt generation schemes is monotonous and dull: it is difficult to match users' personalized requirements or adapt over time, and users easily feel tired and bored as the number of interactions increases. Even when a user wishes to personalize the prompts, current solutions require a custom upgrade of the entire interactive system, resulting in high operating costs.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a personalized prompt generation method, device and equipment, and correspondingly provides a computer-readable storage medium, so that the finally obtained target prompt can break away from a fixed template and meet users' personalized demands more flexibly and conveniently.
The technical scheme adopted by the invention is as follows:
in a first aspect, the present invention provides a personalized prompt generation method, including:
receiving a user interaction instruction and personalized comprehensive information;
determining the persona category of the target prompt according to the user interaction instruction, the personalized comprehensive information and a first preset strategy;
generating the target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category and a second preset strategy;
and synthesizing the target prompt using the target prompt persona category, the target prompt text and a third preset strategy.
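The four steps of the first aspect can be sketched end to end as follows. This is a minimal illustration only: all function names, the rule-based stand-ins for the preset strategies, and the data structures are assumptions for clarity, not the patent's actual models.

```python
# Hypothetical sketch of the four-step pipeline: receive inputs,
# determine a persona category, generate the prompt text, synthesize.

def determine_persona(instruction, profile):
    """First preset strategy (stand-in): pick a persona category from
    the instruction and the personalized comprehensive information."""
    if profile.get("age", 30) < 12:
        return "playful_child_voice"
    if instruction.get("scene") == "driving":
        return "calm_assistant"
    return "neutral_assistant"

def generate_prompt_text(instruction, profile, persona):
    """Second preset strategy (stand-in): produce the target prompt
    text, reusing both the inputs and the persona decided above."""
    name = profile.get("name", "there")
    if persona == "calm_assistant":
        return f"Sure, {name}. Navigation is set; drive safely."
    return f"Hi {name}, here is what you asked for."

def synthesize_prompt(persona, text):
    """Third preset strategy (stubbed): persona + text -> audio.
    A real system would call a TTS engine conditioned on the persona."""
    return {"persona": persona, "text": text, "audio": b"..."}

# Step S1: receive the interaction instruction and comprehensive info.
instruction = {"scene": "driving", "intent": "navigate"}
profile = {"name": "Li", "age": 35}

persona = determine_persona(instruction, profile)
prompt = synthesize_prompt(persona, generate_prompt_text(instruction, profile, persona))
```

Note how the persona decided in the second step flows into both the text generation and the final synthesis, mirroring the sharing of stage results described later in the description.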
In one possible implementation manner, the personalized comprehensive information at least includes one of the following:
multi-dimensional user information representing the user state, wherein the multi-dimensional user information further comprises user basic information and/or user advanced information;
multi-dimensional environment information representing the state of the environment where the user is located;
external information based on network hotspots; and
feedback information from the user on the interaction process.
In one possible implementation manner, the user advanced information includes at least one of the following: user conversation information, user relationship information and user interest information predicted after processing the user basic information;
the feedback information includes: user operation information interrupting and stopping the target prompt, collected in real time during the interaction, and/or user rating information on the target prompt, collected at random.
In one possible implementation manner, the first preset strategy includes:
parsing the user interaction instruction;
selecting corresponding information from the personalized comprehensive information according to the parsed user interaction instruction;
inputting the parsed user interaction instruction and the selected personalized comprehensive information into a pre-built persona generation model;
the second preset strategy includes:
inputting the parsed user interaction instruction, the selected personalized comprehensive information and the target prompt persona category into a pre-built personalized prompt text generation model.
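The two strategies above can be sketched as a parse-select-feed chain. The parsing rule, the selection table and the two "models" below are toy stand-ins (assumptions), since the patent does not specify the model internals.

```python
# Illustrative sketch of the first and second preset strategies:
# parse the instruction, select the relevant slice of the personalized
# comprehensive information, then feed both into the pre-built models.

def parse_instruction(raw):
    # e.g. "play a children's song" -> a structured intent
    intent = "play_music" if "song" in raw else "unknown"
    return {"intent": intent, "raw": raw}

def select_info(parsed, comprehensive):
    # keep only the information dimensions relevant to the parsed intent
    keys = {"play_music": ["user_interest", "passengers"]}.get(parsed["intent"], [])
    return {k: comprehensive[k] for k in keys if k in comprehensive}

def persona_model(parsed, selected):            # stand-in persona model
    return "storyteller" if "child" in selected.get("passengers", []) else "default"

def text_model(parsed, selected, persona):      # stand-in text model
    return f"[{persona}] Playing a song for you!"

comprehensive = {"user_interest": ["pop"], "passengers": ["driver", "child"]}
parsed = parse_instruction("play a children's song")
selected = select_info(parsed, comprehensive)
persona = persona_model(parsed, selected)
text = text_model(parsed, selected, persona)    # second strategy also reuses the persona
```

The same parsed instruction and selected information reach both models, and the persona result is forwarded to the text model, matching the sharing described in the two strategies.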
In one possible implementation manner, the method further includes:
recording the user's interaction history data, wherein the interaction history data comprises feedback information and/or usage habits;
determining a reward-and-punishment mechanism for the persona generation model and/or the personalized prompt text generation model according to the interaction history data;
and adaptively updating the parameters of the models on the basis of a general model architecture in combination with the reward-and-punishment mechanism.
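A minimal sketch of such a reward-and-punishment update, assuming the models expose a tunable weight: interruptions are punished, high ratings rewarded. The scoring and the scalar update rule are illustrative assumptions, not the patent's actual mechanism.

```python
# Toy reward-and-punishment loop over recorded interaction history.

def reward_from_history(history):
    # +1 for each high rating, -1 for each barge-in interruption
    return sum(1 for e in history if e == "rating_high") \
         - sum(1 for e in history if e == "interrupt")

def update_weight(weight, reward, lr=0.1):
    # nudge a model parameter toward behaviors that earned reward
    return weight + lr * reward

history = ["rating_high", "interrupt", "rating_high"]
r = reward_from_history(history)   # 2 rewards - 1 punishment = 1
w = update_weight(1.0, r)          # 1.0 + 0.1 * 1 = 1.1
```

In a real system the "weight" would be the parameters of the persona or text generation model, updated by a proper reinforcement-learning algorithm rather than a scalar nudge.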
In one possible implementation manner, the third preset strategy includes:
performing acoustic processing on the user interaction instruction in audio form;
determining a noise-reduction signal based on the acoustic processing result and noise-related information of the user's current environment;
inputting the target prompt persona category and the target prompt text into a pre-built prompt synthesis model to synthesize an initial prompt;
and fusing the noise-reduction signal with the initial prompt to obtain the target prompt.
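The third strategy can be sketched as follows. The noise estimate, the stubbed synthesis model and the gain-based "fusion" are all simplifying assumptions; a real implementation would use proper noise suppression or loudness compensation rather than a flat gain.

```python
# Hedged sketch: estimate noise from the captured audio instruction,
# synthesize the initial prompt, then fuse the two for playback.

def estimate_noise_level(samples):
    # crude noise proxy: mean absolute amplitude of the captured audio
    return sum(abs(s) for s in samples) / len(samples)

def synthesize_initial_prompt(persona, text):
    # stand-in for the prompt synthesis model: constant-amplitude "audio";
    # persona and text would condition a real TTS model
    return [0.5] * 8

def fuse(noise_level, prompt_audio):
    # raise playback gain in noisier environments so the prompt stays audible
    gain = 1.0 + noise_level
    return [s * gain for s in prompt_audio]

mic = [0.2, -0.2, 0.2, -0.2]               # audio-form interaction instruction
noise = estimate_noise_level(mic)           # 0.2
target = fuse(noise, synthesize_initial_prompt("calm", "Turn left ahead."))
```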
In a second aspect, the present invention provides a personalized prompt generation device, including:
a receiving module, used for receiving the user interaction instruction and the personalized comprehensive information;
a target persona determining module, used for determining the target prompt persona category according to the user interaction instruction, the personalized comprehensive information and a first preset strategy;
a prompt text generation module, used for generating the target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category and a second preset strategy;
and a prompt synthesis module, used for synthesizing the target prompt using the target prompt persona category, the target prompt text and a third preset strategy.
In one possible implementation manner,
the receiving module specifically includes:
an instruction parsing unit, used for parsing the user interaction instruction;
an information selecting unit, used for selecting corresponding information from the personalized comprehensive information according to the parsed user interaction instruction;
the target persona determining module specifically includes:
a first model processing unit, used for inputting the parsed user interaction instruction and the selected personalized comprehensive information into a pre-built persona generation model;
the prompt text generation module specifically includes:
a second model processing unit, used for inputting the parsed user interaction instruction, the selected personalized comprehensive information and the target prompt persona category into a pre-built personalized prompt text generation model.
In one possible implementation manner, the apparatus further includes:
an interaction history acquisition module, used for recording the user's interaction history data, the interaction history data comprising feedback information and/or usage habits;
a model iteration module, which specifically includes:
a reward-and-punishment mechanism setting unit, used for determining the reward-and-punishment mechanism of the persona generation model and/or the personalized prompt text generation model according to the interaction history data;
and an adaptive updating unit, used for adaptively updating the parameters of the models on the basis of a general model architecture in combination with the reward-and-punishment mechanism.
In one possible implementation manner, the apparatus further includes:
a front-end acoustic processing module, used for performing acoustic processing on the user interaction instruction in audio form;
a noise-reduction signal generation module, used for determining a noise-reduction signal based on the acoustic processing result and noise-related information of the user's current environment;
the prompt synthesis module specifically includes:
an initial prompt synthesis unit, used for inputting the target prompt persona category and the target prompt text into a pre-built prompt synthesis model to synthesize an initial prompt;
and a target prompt synthesis unit, used for fusing the noise-reduction signal with the initial prompt to obtain the target prompt.
In a third aspect, the present invention provides a personalized prompt generation device, including:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions that, when executed by the device, cause the device to perform the personalized prompt generation method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to execute the personalized prompt generation method of the first aspect or any one of its possible implementations.
It can be seen from the above aspects that the core concept of the present invention is to have each key task stage in the prompt generation process repeatedly combine and share the multi-dimensional input information, so as to ensure that every task stage can meet the requirement of personalized output, and to make the content and style of the final synthesized voice, which integrates the processing results of every stage, richer and more humanized. On this basis, the prompt generation scheme is made more automatic and intelligent by combining automated processing strategies such as modeling, and the models self-iterate and update using a reward-and-punishment mechanism from the field of machine learning, so as to improve the adaptability and flexible matching capability of the output personalized prompts. Furthermore, the invention superimposes acoustic optimizations such as noise reduction on the synthesized prompt, qualitatively improving how the prompt sounds to the user.
Therefore, deploying the scheme of the invention in different scene environments also helps to improve the interaction efficiency of related applications, and thus the safety, reliability, friendliness and even user satisfaction of interactive products in specific application scenes.
Drawings
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of an embodiment of the personalized prompt generation method provided by the present invention;
FIG. 2 is a flowchart of an embodiment of selecting personalized comprehensive information provided by the present invention;
FIG. 3 is a flowchart of an embodiment of the third preset strategy of the present invention;
FIG. 4 is a schematic flowchart of an updating method for the persona generation model and/or the personalized prompt text generation model according to the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of the personalized prompt generation apparatus provided by the present invention;
FIG. 6 is a schematic structural diagram of a personalized prompt generation device according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an integrated embodiment of the personalized prompt generation device provided by the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
Before introducing the scheme, the following should also be pointed out:
on the one hand, aiming at the defects of the existing cue language generation scheme mentioned in the background, the invention initially performs exploratory problem root mining work in design. Research shows that most of the existing prompt generating schemes are derived from technical measures such as a state machine, a knowledge base, a QA question and answer and the like. Taking a state machine as an example, all the automaton states and state transition conditions need to be manually designed in advance, and then the pre-designed automaton is matched according to an effective request input by a user in interaction so as to provide a fixed prompt language text for subsequent voice synthesis. In other prior art schemes, although manual enumeration of all cases can be avoided to a certain extent, the generated prompt text is still fixed, and even in a better operation scenario, text change is only performed within a very limited range, and flexible and various personalized adjustments cannot be performed substantially.
On the other hand, the technical object of the invention is the "prompt voice" in the field of human-computer interaction, which differs essentially from other speech synthesis scenes (such as audio e-books or news broadcasts): its goal is to keep the conversation between user and machine clear, friendly, natural and freely variable, i.e. to exhibit the characteristics and advantages of natural conversation, even beyond what humans offer. The design of the solution therefore has to fully consider the information, corpora and processing strategies associated with the prompt voice, which must satisfy the technical requirements of wide coverage, flexibility and easy switching and adjustment, for example which processing stages to configure, what input data to select as the processing conditions of each stage, and how to use the processing results of each stage to achieve the final intended goal.
On this basis, the personalized prompt generation scheme builds on personalized influence factors and the current interaction instruction. The technical concept of the present invention still rests on the existing speech synthesis framework, which, like most prior art, includes a generation stage for the text to be synthesized and an acoustic synthesis stage that uses that text, but it has at least two characteristics that differ from existing schemes. First, the determination of the target prompt persona category is introduced; its main function is to guide the pronunciation style, intonation, tone and even the rhythm of a specific virtual character in the subsequently synthesized voice, so that the user's personalized requirements can be effectively met and the product's application scene matched. It should be understood that as the user's instructions, status and identity change, and as different scene environments impose different requirements, the personalized requirement on the acoustic pronunciation effect of the synthesized voice will change as well; the essence of this improvement is therefore to add to the prompt generation scheme a synthesized-pronunciation guidance stage that meets personalized adjustment needs.
Second, multi-dimensional comprehensive information, unlike prior-art rule templates, is gathered in this so-called synthesized-pronunciation guidance stage; notably, the comprehensive information is not used only once but is merged again in the subsequent text generation stage, and the processing result of the synthesized-pronunciation guidance stage is likewise shared with the text generation stage. That is, the content of the finally synthesized prompt voice is personalized and guided by combining the multi-dimensional comprehensive information with the processing result of the previous step. In summary, the guidance at the acoustic level and the guidance at the content level are both independent of and related to each other: "independent" because, in the concept of the present invention, the two stages are determined by their respective input conditions and corresponding strategies to suit their respective processing results and personalization requirements; "related" because the subsequent text generation stage not only reuses the input information of the preceding persona determination stage but also blends in its processing results. The final speech synthesis stage is thus established within the technical meaning of each stage being "independent yet related", so the effect exhibited by the final stage cannot be separated from the foregoing process.
For at least one embodiment of the above personalized prompt generation method, see FIG. 1; it specifically includes the following steps:
Step S1: receive the user interaction instruction and the personalized comprehensive information.
The invention is based on a human-computer conversation scene, so its application effect lies in the interaction between the user and the computer; however, throughout the interaction process the invention only concerns how the computer gives a voice response, and does not particularly limit the way the user interacts, including how the interaction instruction mentioned in this step is issued. For example, the user may interact by voice, touch, handwriting, motion sensing, keyboard entry or physical keys. In addition, the interactive content may manifest differently in different product application environments and scenes: within the limited space of a vehicle or an airplane, the user may interact with the head unit, the entertainment system and other content related to driving operations, in-vehicle devices, external calls and audio-visual entertainment; the user may interact with various robots or intelligent interactive devices installed in open public places based on service and business requirements; or the user may interact with various applications and operating platforms on portable mobile terminals and intelligent terminals.
From the above description, at least the following may be further explained:
First, no matter which interaction mode the user adopts, the interaction instruction the user issues may be front-end processed in advance to facilitate the subsequent text generation and speech synthesis operations. The front-end processing differs with the form of interaction and may include, for example, semantic understanding, behavior perception analysis, signal decoding and electrical signal conversion, so that the computer can "know" the user's interaction intention. The interaction intention mentioned here is, in particular, not limited to personalized operations and queries actively requested by the user; it may also include, for example, a conventional wake-up instruction for the device, a passive response instruction (e.g. answering an incoming call or responding to a fault alarm), and so on. Those skilled in the art will understand, however, that for some interaction instructions, such as a simple key press, front-end processing does not carry the technical meaning of "parsing" as referred to in the art.
Second, corresponding to differences in interaction scenes and environments, the personalized comprehensive information may also vary as needed; but whether it serves a vehicle, a robot, a smart device or a mobile terminal, it represents information related to the user who is currently interacting and is a comprehensive embodiment of multi-dimensional information. The personalized comprehensive information may be acquired in various ways in practice: it may be collected and uploaded while the user holds an interactive session with the computer, or while the user uses an object such as a vehicle, robot, smart device or mobile terminal; or it may be pre-stored history information from the user's registration, logins and past operations; of course, it may also combine the information acquired at each of these stages.
Those skilled in the art can understand that the actual content of the personalized comprehensive information is not limited: any information that can be obtained by means of the prior art or set manually and that characterizes the user's personalized requirements may be taken into consideration. For convenience of illustration and practical operation, however, the present invention provides various examples of the personalized comprehensive information based on some preferred embodiments; that is, the personalized comprehensive information may include at least one of the following:
(1) Multi-dimensional user information representing the user state, wherein the multi-dimensional user information further comprises user basic information and/or user advanced information.
The user basic information here may include, but is not limited to, dimensions such as the current interacting user's name, nationality, contact information, age, looks, gender, native place, emotional and psychological state, dialect and language, education and occupation, handwriting font, clothing and accessories, behavior and activities, operation habits, geographic coordinates and position in space.
The user advanced information here refers to further information related to the current interacting user. It should be noted that, to realize the user's personalization requirements, the collected information about the "person" need not come only from the interacting user, because personal information, especially personalized information, also depends on the states of other participants in a given time and space. Therefore, in some embodiments of the present invention, the state information of other participants related to the current interacting user may be considered, such as multiple people in the same space, or non-current users related to the user at other time nodes of the interactive product. Thus, in some embodiments, the aforementioned user advanced information may include, but is not limited to, one or more of the following: user conversation information, user relationship information and user interest information predicted after processing the user basic information.
Taking the vehicle-mounted voice assistant as an example, it can acquire relevant information about the current user and the other users in the vehicle through microphones, cameras, sensors and the like: on the one hand determining the current user's basic information, and on the other hand identifying basic information such as the number of people in the vehicle and the positions and identities of the other users based on role separation, voiceprint recognition, image processing and similar technologies. The conversation content among the users can then be analyzed and the relationships among them predicted by means of speech recognition, semantic extraction and information mining; the users' basic information can also be mined and supplemented, so that their interests and hobbies, such as favorite music genres, singers and stars, can be analyzed and predicted.
It should be further noted that, for the multi-dimensional user information, the results obtained and analyzed from the user basic information, other users' information and the user advanced information need not be completely isolated and independent; the results may be shared among them for more accurate analysis. For example, the predicted user relationship may be shared with the analysis of the conversation content to determine whether a request instruction issued by the user expresses the user's own need or that of another related user, e.g. a child asking a parent to help play a children's song; conversely, the analysis results of images and videos may be shared with user-relationship prediction, so that multi-dimensional factors such as image processing, voiceprint recognition and user conversation content are fused to predict the relationships among users more accurately. Taking the conversation content information as an example, its analysis can be established under a multi-dimensional information generation model architecture: context-based semantic information is obtained separately from the voice data of several users, and the contextual correlation of their conversation content is judged from the semantic understanding results, so that the conversation content among the users is determined and serves as one of the auxiliary features for the subsequent personalized persona determination and text generation.
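The multi-cue fusion described above can be sketched as a simple vote over voiceprint, vision and dialogue signals. The cue names and the scoring threshold are illustrative assumptions; a real system would fuse these modalities with a learned model.

```python
# Toy fusion of multi-dimensional cues into a user-relationship prediction.

def predict_relationship(cues):
    score = 0
    if cues.get("voiceprint") == "child_present":       # voiceprint recognition
        score += 1
    if cues.get("vision") == "child_seat_occupied":     # image processing
        score += 1
    if "dad" in cues.get("dialogue", ""):               # conversation content
        score += 1
    # require agreement between at least two cues before committing
    return "parent_child" if score >= 2 else "unknown"

cues = {"voiceprint": "child_present",
        "vision": "child_seat_occupied",
        "dialogue": "dad, play my song"}
rel = predict_relationship(cues)
```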
(2) Multi-dimensional environment information representing the state of the environment where the user is located.
The multi-dimensional environment information here is information other than the user established in the various scenes where the interactive application occurs, such as devices in the relevant field, temporal and spatial attributes, and interaction environment conditions. Since the applicable scenarios of the present invention are wide, only the vehicle scene is described here as an example: the multi-dimensional environment information for a vehicle application may be, but is not limited to, the vehicle condition, vehicle speed, driving route, geographic location, instrument panel information, vehicle system configuration, the status of the vehicle APP on the intelligent terminal, the in-vehicle space layout, the temperature and humidity inside and outside the vehicle, and even wide-area information such as the vehicle model, age and price. This information may be acquired in real time by technical means, or set in advance or pre-stored, which this embodiment does not limit.
(3) External information based on network hotspots.
In order to expand the variety and content of the personalized personas and target prompt texts, the external information refers to interaction-related network resources crawled according to the multidimensional user information and/or the multidimensional environment information, such as real-time news, popular network phrases, trending search events, stock market conditions, weather forecasts, sports events, microblog headlines, e-commerce platform information, and other network information related to the user and/or the user's environment.
(4) Feedback information of the user on the interaction process.
The feedback information is targeted; that is, as mentioned above, it is feedback made by the user on the interaction itself. It should be pointed out that the feedback here does not refer to the question-and-answer replies exchanged during the interaction, but represents the user's satisfaction with the interaction itself, such as subjective evaluation of the listening experience of the synthesized target prompt, the length of its content, and the timeliness, validity, and accuracy of the machine response. Although the user's subjective initiative is involved, in actual operation such feedback can still be collected through technical means and specific survey angles. For example, in some embodiments of the present invention, the feedback information may include: user interruption operations on the target prompt, collected in real time or at a preset period during the interaction, and/or user scoring information on the target prompt, collected randomly or at a preset period.
It should be noted here that, on one hand, in order to adjust, switch, or improve the target prompt output by the solution of the present invention in a timely manner, the feedback information mainly serves the prompt voice synthesized by the present solution; in other words, it can be understood as a closed-loop control mechanism for the present solution. On the other hand, the interruption operation information means that, during the broadcasting of the target prompt, the user interrupts the broadcast to give an additional instruction, which may take the forms described above, such as voice, key press, or touch. The scoring information means that the user's evaluations of or suggestions on various aspects of the synthesized voice are collected in a predetermined survey manner, for example questionnaires in voice or graphical-interface form, scoring mechanisms such as the short-message replies of customer service platforms after the interaction ends, virtual or physical smiley-face or star-rating keys, or statistics based on manual surveys. The user feedback may also be used as an input feature, namely as reference information for improving or adjusting the subsequent determination of the personalized persona type and the generation of the target prompt text.
The above four kinds of personalized comprehensive information are only schematic descriptions, and those skilled in the art can expand or change them under the teaching disclosed herein. As input conditions of the subsequent steps, the personalized comprehensive information may be combined in various ways, for example by selecting a single kind or combining several kinds; of course, the more kinds and the wider the dimensions of the personalized comprehensive information, the more obvious its benign influence on the subsequent processing results.
Step S2: determining the persona type of the target prompt according to the user interaction instruction, the personalized comprehensive information, and a first preset strategy.
As mentioned above, the target prompt persona category essentially determines, at the acoustic level, a category of prompt style and tone that better matches the current user's personalized needs, such as normal, cool, gentle, serious, or witty, as well as swordsman (wuxia), two-dimensional (anime), talk show, news broadcast, song, Hong Kong-Taiwan accent, dialect, and so on. These persona categories can also be combined into composite categories with multiple effects, such as a gentle swordsman type or a serious talk-show type.
In order to determine the persona type of the target prompt, the user interaction instruction and the personalized comprehensive information adopted in this step can be regarded as input conditions for obtaining the persona type result, and the first preset strategy represents the processing procedure that converts the user interaction instruction and the personalized comprehensive information into the determined persona type.
Therefore, for those skilled in the art, the first preset strategy may take multiple processing forms in actual operation. For example, a condition information table is designed for each preset persona type, keywords are parsed from the obtained user interaction instruction and personalized comprehensive information, the keyword fields are matched against the condition information tables, and the distribution of the matching results is counted, so as to determine the most suitable persona type. Taking a vehicle application as an example: if keywords such as a fuel consumption query command are parsed, and the condition information table of a preset serious-standard persona covers those keywords, the target prompt persona is determined to be the serious-standard type. Of course, this is only a simple illustration, and the actual matching strategy may be more complex and precise, which is not limited in this embodiment.
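The table-matching form of the first preset strategy can be sketched in a few lines. All persona names, condition tables, and keywords below are illustrative assumptions, not values specified by the patent:

```python
# Hypothetical sketch: each persona category has a condition table of
# keywords; keywords parsed from the instruction and personalized
# comprehensive information are matched against every table, and the
# category with the most hits wins.
PERSONA_CONDITION_TABLES = {
    "serious-standard": {"fuel", "battery", "fault", "speed", "navigation"},
    "gentle-animation": {"child", "song", "story", "cartoon"},
    "witty-talkshow":   {"joke", "music", "chat", "weather"},
}

def match_persona(keywords):
    """Count keyword hits per persona table and return the best match."""
    scores = {
        persona: len(table & set(keywords))
        for persona, table in PERSONA_CONDITION_TABLES.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to a default persona when nothing matches.
    return best if scores[best] > 0 else "serious-standard"
```

For instance, a fuel consumption query parsed into `["fuel", "query"]` would land in the serious-standard table; a real system would of course use richer tables and a finer matching statistic.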
In addition, the first preset strategy can be implemented based on a model processing concept, and the model may be used in various ways. For example, all of the received user interaction instruction and personalized comprehensive information is input into a pre-constructed classification model, which is not limited to a neural network or other architectures; in actual operation the received information can be converted into feature vectors, which are then classified by a multi-information joint decision model, so that each feature vector corresponds to a target prompt persona category. Furthermore, the output persona categories can be assigned corresponding labels, such as: <GO1> (general standard), <GO2> (gentle animation), <GO3> (swordsman), and so on. The model can thus output a probability value for each persona category, after which a better label can be selected from the probability ranking, or a label can be drawn at random from those meeting a probability threshold, as the personalized persona category of the current target prompt. This is only a schematic introduction, and the invention is not limited thereto.
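The label-selection step at the end of that paragraph can be illustrated as follows; the label probabilities and the threshold are made-up examples, and the classifier producing them is out of scope here:

```python
import random

# Given the classifier's probability per persona label, either take the
# most probable label (threshold=None), or sample uniformly among the
# labels that clear a probability threshold.
def select_persona(label_probs, threshold=None, rng=random):
    if threshold is None:
        return max(label_probs, key=label_probs.get)
    candidates = [lbl for lbl, p in label_probs.items() if p >= threshold]
    if not candidates:
        return max(label_probs, key=label_probs.get)
    return rng.choice(candidates)

# Invented example distribution over the three labels named above.
probs = {"<GO1>": 0.55, "<GO2>": 0.30, "<GO3>": 0.15}
```

With `threshold=None` this always returns `<GO1>`; with `threshold=0.25` it randomly returns `<GO1>` or `<GO2>`, which is one way to keep the prompt style from becoming monotonous.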
Step S3: generating a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a second preset strategy.
In order to determine the target prompt text to be synthesized into voice, the user interaction instruction, the personalized comprehensive information, and the target prompt persona category adopted in this step may be regarded as input conditions for obtaining the target prompt text. As mentioned above, the condition information of step S2 is reused, and the generation conditions of the text also involve the processing result of the persona-determination step, so this design has at least the following advantages: on one hand, a multiple condition-superposition effect can be produced, so that the processing result of the previous step is fully used by the subsequent step; on the other hand, when the second preset strategy is processed, targeted condition information can be selected from the user interaction instruction and the personalized comprehensive information based on the determined persona type, for which a preferred scheme is provided below and not detailed here. It should also be noted that the second preset strategy may adopt the forms described above for the first preset strategy, and a better scheme is to use a machine learning concept to dynamically adjust and generate the target prompt text.
Here, for example, a large amount of text corpora may be collected, and a vector representation of the determined persona type label may be obtained by means such as word2vec and used as an input feature vector of a personalized prompt text generation model based on, but not limited to, seq2seq. The model then automatically generates the corresponding prompt text. Of course, it will be understood by those skilled in the art that the personalized prompt text generation model also requires training on a large and diverse corpus. For convenience of implementation, a preferred structure of the personalized prompt text generation model is schematically illustrated here:
The model is mainly divided into an encoder part and a decoder part. The encoder encodes the textual user interaction instruction and personalized comprehensive information through a BLSTM (bidirectional long short-term memory network) to obtain an encoded feature vector representation; the feature vector is then used as the input of the decoder and decoded through an attention layer and an LSTM (unidirectional long short-term memory network) to obtain the target prompt text. The decoder output may be a multidimensional vector whose size corresponds to the lexicon (the word set contained in the prompt corpus), representing the generation probability of each word; the word with the highest probability is taken as the generated word, which is then fed back as input to decode the next word in the sequence, until the whole target prompt sentence is generated. As described above, the processing result of the previous step is used as an input condition; for example, the label <GO1> is used as the initial vector of the target prompt persona category, so that in the training stage the target prompt text corpus corresponding to the <GO1> type is labeled in combination with instruction samples and personalized comprehensive information. The target prompt text obtained in this way is more accurate: for example, when the vehicle is running at high speed or a fault occurs, and the user instruction is to query the fuel or battery level, the serious-standard persona category is matched, so that the generated target prompt text is more concise and direct and does not distract the user's attention.
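The decoding loop described above (argmax over the lexicon, feed the word back, stop at the end token) can be shown in miniature. The tiny "model" here is a hard-coded lookup standing in for the attention+LSTM decoder, and the lexicon and word transitions are invented for illustration:

```python
# Toy greedy-decoding loop: at each step the decoder emits a probability
# distribution over the lexicon; the highest-probability word is kept and
# fed back as input until an end-of-sentence token is produced.
LEXICON = ["<GO1>", "fuel", "level", "is", "low", "<EOS>"]

def decoder_step(prev_word):
    """Stand-in for the decoder: map the previous word to a distribution."""
    transitions = {
        "<GO1>": "fuel", "fuel": "level", "level": "is",
        "is": "low", "low": "<EOS>",
    }
    nxt = transitions[prev_word]
    # Peaked distribution over the lexicon, 0.9 mass on the next word.
    return [0.9 if w == nxt else 0.1 / (len(LEXICON) - 1) for w in LEXICON]

def greedy_decode(start_token="<GO1>", max_len=10):
    words, prev = [], start_token
    for _ in range(max_len):
        probs = decoder_step(prev)
        word = LEXICON[probs.index(max(probs))]  # argmax over the lexicon
        if word == "<EOS>":
            break
        words.append(word)
        prev = word
    return " ".join(words)
```

Starting from the persona label `<GO1>` as the initial token, the loop emits "fuel level is low" and stops at `<EOS>`, mirroring how the real decoder would produce a sentence word by word.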
Moreover, this step can be further optimized based on the user interaction instruction and the personalized comprehensive information; that is, the generated target prompt text can not only directly respond to the current instruction intent but also learn the "extended intent" behind the instruction. For example, when the user queries for a parking lot, the automatically generated target prompt text may contain not only the position information of surrounding parking lots, but also, according to the user's destination, the position of a nearby parking lot with sufficient spaces and the number of vacant spaces. This embodiment is not limited thereto.
Regarding the preferred scheme of selecting condition information for the first and second preset strategies, in combination with the model processing concept described above, reference may be made to fig. 2:
Step S10: parsing the user interaction instruction;
the term "parsing" refers to performing intent understanding on a user interactive instruction, including but not limited to semantic parsing on a voice instruction, or analyzing an action behavior, and certainly, based on different front-end processing, in some embodiments, the user interactive instruction may be directly used as an input condition of a subsequent step without "parsing", for example, a high-level signal given by a switch key, and the foregoing description is omitted here for brevity.
Step S20: selecting corresponding information from the personalized comprehensive information according to the parsed user interaction instruction;
This preferred embodiment emphasizes that, for the multidimensional and wide-area personalized comprehensive information, in order to reduce the amount of computation and obtain information more targeted at the user's intent, the personalized comprehensive information may be filtered before the subsequent steps, with the selection criterion depending on the user intent conveyed by the user interaction instruction.
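A minimal sketch of this selection step, assuming an invented intent-to-field mapping (the field names and intents are illustrative, not part of the patent):

```python
# The parsed intent determines which slices of the multidimensional
# personalized comprehensive information are passed to the downstream
# models, reducing computation on irrelevant fields.
INTENT_FIELDS = {
    "query_fuel": ["vehicle_condition", "dashboard", "driving_route"],
    "play_media": ["user_profile", "interaction_history"],
}

def select_information(intent, comprehensive_info):
    fields = INTENT_FIELDS.get(intent, [])
    return {k: v for k, v in comprehensive_info.items() if k in fields}

# Example: a fuel query keeps only vehicle-related fields.
info = {"vehicle_condition": "ok", "dashboard": {"fuel": 12}, "weather": "sunny"}
selected = select_information("query_fuel", info)
```

An unknown intent simply selects nothing here; a real system would more likely fall back to a broader default slice.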
Step S31 (for the first preset strategy): inputting the parsed user interaction instruction and the selected personalized comprehensive information into a pre-constructed personalized persona generation model;
Step S32 (for the second preset strategy): inputting the parsed user interaction instruction, the selected personalized comprehensive information, and the target prompt persona category into a pre-constructed personalized prompt text generation model.
For these two steps, reference may be made to the foregoing description, which is not repeated here.
Step S4: synthesizing the target prompt by using the target prompt persona type, the target prompt text, and a third preset strategy.
In order to synthesize the final target prompt, this step combines the processing results of step S2 and step S3, namely the determined personalized target prompt persona category and the target prompt text, so the third preset strategy may be to invoke a TTS engine with adapted speech rate, pitch, and so on based on the results of the foregoing steps. On this basis, in order to improve the adaptation effect, various TTS engines can be customized in advance based on the user's historical interaction behavior; of course, various customized TTS synthesis models based on different timbres, rhythms, speech rates, and so on can also be trained in advance within a preset range, which improves synthesis efficiency and lets the user experience an immersive synthesis effect, even one resembling daily-life communication.
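One simple reading of the third preset strategy is a dispatch table from persona category to pre-customized synthesis parameters. The profile names and parameter values below are invented placeholders, and the actual TTS engine call is abstracted into a returned request:

```python
# Each persona category maps to pre-customized synthesis parameters
# (voice/timbre, speech rate, pitch); synthesis consists of invoking the
# matching engine with the generated prompt text.
TTS_PROFILES = {
    "serious-standard": {"voice": "newsreader", "rate": 1.0, "pitch": 0.0},
    "gentle-animation": {"voice": "storyteller", "rate": 0.9, "pitch": 2.0},
}

def synthesize(persona, text):
    profile = TTS_PROFILES.get(persona, TTS_PROFILES["serious-standard"])
    # A real implementation would call a TTS engine here; this sketch
    # returns the request that would be sent to it.
    return {"text": text, **profile}
```

Unknown personas fall back to the serious-standard profile, matching the idea that a safe default style should always be available.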
Further, in at least one embodiment of the present invention, the synthesized target prompt may be further acoustically "purified" during voice synthesis. As shown in fig. 3, the third preset strategy may include the following steps, without limiting their execution order (i.e. the third preset strategy in this example is not limited to the voice synthesis stage only, and the order indicated by the arrows in fig. 3 is merely for convenience of illustration):
Step S100: performing acoustic processing on the user interaction instruction in audio form;
Step S200: determining a noise reduction signal based on the acoustic processing result and noise-related information of the user's current environment;
Step S300: inputting the target prompt persona type and the target prompt text into a pre-constructed prompt synthesis model to synthesize an initial prompt;
Step S400: fusing the noise reduction signal and the initial prompt to obtain the target prompt.
In specific operation, the acoustic processing already applied by the front end to a voice-form user interaction instruction, such as echo cancellation, noise reduction, and sound-zone recognition, can be utilized, and the noise heard by the user can be predicted from the noise-related information of the user's environment obtained from the personalized comprehensive information, such as vehicle speed, open windows, tire noise, and other noise. A noise reduction signal with the same amplitude and opposite phase can then be generated from the predicted noise and superposed on the initial TTS audio synthesized from the target prompt persona category and the target prompt text, so that the user hears the target prompt more clearly. This can greatly improve user experience, and in some special application scenarios it also brings other benefits to the interaction process, such as driving safety.
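A numeric illustration of the equal-amplitude, opposite-phase idea: when the inverted copy of the predicted noise is superposed on the prompt audio, the noise cancels at the listener and only the prompt remains. The sine-wave "noise", amplitudes, and sample count are arbitrary assumptions:

```python
import math

N = 64
# Predicted cabin noise and the initial synthesized TTS audio (toy signals).
noise = [0.3 * math.sin(2 * math.pi * 5 * i / N) for i in range(N)]
prompt = [0.8 * math.sin(2 * math.pi * 2 * i / N) for i in range(N)]

# Noise reduction signal: same amplitude, inverted phase.
anti_noise = [-s for s in noise]

# What reaches the user's ear: prompt + ambient noise + anti-noise.
heard = [p + n + a for p, n, a in zip(prompt, noise, anti_noise)]

# The noise and its inverse cancel, leaving only the prompt signal.
residual = max(abs(h - p) for h, p in zip(heard, prompt))
```

In practice the cancellation is only as good as the noise prediction and timing alignment; this sketch assumes both are perfect.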
Based on the above implementation example and its preferred solutions, other preferred solutions of the present invention further optimize and improve the embodiments of the first and second preset strategies that use the model processing concept, for example, but not limited to, iteratively updating the models in combination with algorithms from the field of artificial intelligence. As shown in fig. 4, the method further includes:
Step M1: recording the user's interaction history data;
Step M2: determining a reward and punishment mechanism for the personalized persona generation model and/or the personalized prompt text generation model according to the interaction history data;
Step M3: adaptively updating the parameters of the model on the basis of the general model architecture, in combination with the reward and punishment mechanism.
Here, the personalized persona generation model and the personalized prompt text generation model are described as follows: in some embodiments, a model with a general CNN, LSTM, or similar architecture may be trained offline based on a large amount of training data. As the number of user interactions increases, the user's feedback information and/or usage habits regarding the interaction process can be continuously recorded to form interaction history data. On the basis of the general model, the interaction history data is used as reward and punishment parameters, and the parameters of the model are updated adaptively, for example by reinforcement learning, so that as the interaction history data grows the model meets the user's expectation better and better, achieving a better subsequent synthesis effect. It should be noted that, with reference to the aforementioned interruption operations of the user on the synthesized target prompt, this information can be used as a reinforcement-learning punishment measure to continuously optimize and shorten the target prompt text for the corresponding scene; and multidimensional scoring evaluations of the synthesized target prompt can be collected from the user periodically or aperiodically, so that the model adapts based on the satisfaction given by the user. The usage habits may refer to statistics such as the interaction manner the user commonly employs (e.g. voice or touch), the interaction scene the user commonly occupies (e.g. single person or multiple persons), the application scenes with a high interruption probability, and so on.
For convenience of implementation, an adaptive update scheme for the personalized persona generation model based on the feedback information is described as an example. Each personalized persona category label obtained by the model has a probability value, and a reward mechanism can be set through the recorded feedback information, for example: normal playback is scored 0, an interruption is scored -1, or a score of +10 for positive feedback and -10 for negative feedback is obtained through active interaction. A reinforcement learning algorithm is then used to intervene in the classification result of the personalized persona generation model, that is, to influence the probability value of each persona category label, so that the category predicted by the model better fits the style type expected by the user.
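The scoring scheme above can be sketched as a direct nudge on label weights, which are then renormalized into probabilities. The learning rate and starting weights are illustrative; a real system would use a proper reinforcement-learning update rather than this bare heuristic:

```python
# Feedback events and their scores, as enumerated in the text:
# normal playback 0, interruption -1, explicit positive +10, negative -10.
FEEDBACK_SCORES = {"normal": 0, "interrupt": -1, "positive": 10, "negative": -10}

def apply_feedback(weights, label, event, lr=0.01):
    """Nudge the weight of the persona label the feedback refers to,
    then renormalize all weights into a probability distribution."""
    scored = dict(weights)
    scored[label] = max(1e-6, scored[label] + lr * FEEDBACK_SCORES[event])
    total = sum(scored.values())
    return {k: v / total for k, v in scored.items()}

# The user interrupted a <GO2> prompt, so <GO2> is punished slightly.
w = {"<GO1>": 1.0, "<GO2>": 1.0, "<GO3>": 1.0}
w = apply_feedback(w, "<GO2>", "interrupt")
```

After repeated interruptions the `<GO2>` style is selected less often, which matches the closed-loop behavior the patent describes.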
In summary, the core concept of the present invention is to repeatedly combine and share the key task stages and the multidimensional input information in the prompt generation process, so as to ensure that each task stage can meet the requirement of personalized output, and to make the content and style of the finally synthesized voice prompt richer and more humanized. On this basis, the prompt generation scheme is made more automatic and intelligent by combining automated processing strategies such as modeling, and the models self-iterate and update using a reward and punishment mechanism from the field of machine learning, so as to improve the adaptability and flexible matching capability of the output personalized prompt. Furthermore, the invention superposes acoustic optimization effects such as noise reduction on the synthesized prompt, qualitatively changing the user's listening experience. Therefore, deploying the solution of the invention in different scene environments also helps to improve the interaction efficiency of related applications, and thus the safety, reliability, friendliness, and even user satisfaction of interactive products in specific application scenarios.
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of a personalized hint generating device, which may specifically include the following components as shown in fig. 5:
A receiving module 1, configured to receive a user interaction instruction and personalized comprehensive information;
A target persona determining module 2, configured to determine the target prompt persona type according to the user interaction instruction, the personalized comprehensive information, and a first preset strategy;
A prompt text generating module 3, configured to generate a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a second preset strategy;
A prompt synthesis module 4, configured to synthesize the target prompt by using the target prompt persona type, the target prompt text, and a third preset strategy.
The related content such as the personalized integrated information can be referred to the above explanation, and will not be described herein again.
Further, in one of the possible implementations,
The receiving module specifically includes:
An instruction parsing unit, configured to parse the user interaction instruction;
An information selecting unit, configured to select corresponding information from the personalized comprehensive information according to the parsed user interaction instruction.
The target persona determining module specifically includes:
A first model processing unit, configured to input the parsed user interaction instruction and the selected personalized comprehensive information into a pre-constructed personalized persona generation model.
The prompt text generating module specifically includes:
A second model processing unit, configured to input the parsed user interaction instruction, the selected personalized comprehensive information, and the target prompt persona category into a pre-constructed personalized prompt text generation model.
Further, in one possible implementation manner, the apparatus further includes:
An interaction history collection module, configured to record the user's interaction history data, where the interaction history data includes feedback information and/or usage habits;
A model iteration module, which specifically includes:
A reward and punishment mechanism setting unit, configured to determine the reward and punishment mechanism of the personalized persona generation model and/or the personalized prompt text generation model according to the interaction history data;
An adaptive updating unit, configured to adaptively update the parameters of the model on the basis of a general model architecture, in combination with the reward and punishment mechanism.
Further, in one possible implementation manner, as shown in fig. 6, the apparatus further includes:
A front-end acoustic processing module 5, configured to perform acoustic processing on the user interaction instruction in audio form;
A noise reduction signal generation module 6, configured to determine a noise reduction signal based on the acoustic processing result and noise-related information of the user's current environment.
The prompt synthesis module specifically includes:
An initial prompt synthesis unit 41, configured to input the target prompt persona category and the target prompt text into a pre-constructed prompt synthesis model to synthesize an initial prompt;
A target prompt synthesis unit 42, configured to fuse the noise reduction signal and the initial prompt to obtain the target prompt.
It should be understood that the division of the components of the personalized prompt generation apparatus shown in fig. 5 and fig. 6 is only a logical division; in actual implementation they can be wholly or partially integrated into one physical entity or physically separated. These components may all be implemented as software invoked by a processing element, or entirely in hardware, or partly as software invoked by a processing element and partly in hardware. For example, a certain module may be a separately established processing element, or may be integrated into a chip of the electronic device; the other components are implemented similarly. In addition, all or part of the components can be integrated together or implemented independently. In implementation, each step of the above method, or each of the above components, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more Application-Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), one or more Field-Programmable Gate Arrays (FPGAs), and so on. For another example, these components may be integrated together and implemented in the form of a System-on-a-Chip (SoC).
In view of the foregoing examples and their preferred embodiments, those skilled in the art will appreciate that, in practice, the present invention may be implemented based on various carriers, schematically including:
(1) a personalized cue generation device, which may comprise:
one or more processors, memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory, the one or more computer programs comprising instructions, which when executed by the apparatus, cause the apparatus to perform the steps/functions of the foregoing embodiments or equivalent implementations.
Fig. 7 is a schematic structural diagram of an embodiment of a personalized prompt generation device provided by the present invention, where the device may be an electronic device or a circuit device built into an electronic device. Depending on the interactive application scene, the electronic device may be a cloud server, a mobile terminal (mobile phone), a smart screen, an unmanned aerial vehicle, an intelligent connected vehicle (ICV), a smart car, a vehicle-mounted device, and so on. This embodiment does not limit the specific form of the personalized prompt generation device.
As shown in fig. 7 in particular, the personalized hint generator 900 includes a processor 910 and a memory 930. Wherein, the processor 910 and the memory 930 can communicate with each other and transmit control and/or data signals through the internal connection path, the memory 930 is used for storing computer programs, and the processor 910 is used for calling and running the computer programs from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device, or more generally, separate components, and the processor 910 is configured to execute the program code stored in the memory 930 to implement the functions described above. In particular implementations, the memory 930 may be integrated with the processor 910 or may be separate from the processor 910.
In addition, to further enhance the functionality of the personalized cue generating device 900, in some scenarios the device 900 may further comprise one or more of an input unit 960, a display unit 970, an audio circuit 980, a camera 990, a sensor 901, etc., which may further comprise a speaker 982, a microphone 984, etc. The display unit 970 may include a display screen, among others.
Further, the personalized cue generating device 900 may also include a power supply 950 for providing power to various components or circuits within the device 900.
It should be understood that the personalized hint generator 900 shown in FIG. 7 can implement the processes of the methods provided by the foregoing embodiments. The operations and/or functions of the various components of the apparatus 900 may each be configured to implement the corresponding flow in the above-described method embodiments. Reference is made in detail to the foregoing description of embodiments of the method, apparatus, etc., and a detailed description is omitted here as appropriate to avoid redundancy.
It should be understood that the processor 910 in the personalized prompt generation device 900 shown in fig. 7 may be a system on chip (SoC). The processor 910 may include a Central Processing Unit (CPU) and may further include other types of processors, such as a Graphics Processing Unit (GPU).
In summary, various portions of the processors or processing units within the processor 910 may cooperate to implement the foregoing method flows, and corresponding software programs for the various portions of the processors or processing units may be stored in the memory 930.
(2) A readable storage medium, on which a computer program or the above-mentioned apparatus is stored, which, when executed, causes the computer to perform the steps/functions of the above-mentioned embodiments or equivalent implementations.
In the several embodiments provided by the present invention, any function, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer-readable storage medium. Based on this understanding, the aspects of the present invention that substantially contribute to the prior art, or portions thereof, may be embodied in the form of a software product, as described below.
(3) A computer program product (which may comprise the above-mentioned means) which, when run on a terminal device, causes the terminal device to perform the steps/functions of the preceding embodiments or equivalent implementations.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps of the above methods can be implemented by software plus a necessary general-purpose hardware platform. With this understanding, the above-described computer program products may include, but are not limited to, an app; further, the aforementioned device/terminal may be a computer device (e.g., a mobile phone, a PC terminal, a cloud platform, a server cluster, or a network communication device such as a media gateway). Moreover, the hardware structure of the computer device may specifically include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus; the processor, the communication interface, and the memory can all communicate with one another through the communication bus. The processor may be a central processing unit (CPU), a microcontroller, or a digital signal processor (DSP), and may further include a GPU, an embedded neural-network processor (NPU), and an image signal processor (ISP); it may further include an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention. The processor may run one or more software programs, which may be stored in a storage medium such as the memory. The aforementioned memory/storage media may comprise non-volatile memories such as non-removable magnetic disks, USB flash drives, removable hard disks, and optical disks, as well as read-only memory (ROM) and random-access memory (RAM).
In the embodiments of the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean that A exists alone, that A and B exist simultaneously, or that B exists alone, where A and B may each be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, and c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where each of a, b, and c may be single or multiple.
Those of skill in the art will appreciate that the various modules, elements, and method steps described in the embodiments disclosed in this specification can be implemented as electronic hardware, or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In addition, the embodiments in this specification are described in a progressive manner, and the same and similar parts among the embodiments may be cross-referenced. In particular, for the embodiments of devices, apparatuses, and the like, which are substantially similar to the method embodiments, reference may be made to the relevant portions of the method embodiment descriptions. The above-described embodiments of devices, apparatuses, etc. are merely illustrative; modules and units described as separate components may or may not be physically separate, and may be located in one place or distributed across multiple places, for example, on nodes of a system network. Some or all of the modules and units can be selected according to actual needs to achieve the purposes of the above embodiments. They can be understood and implemented by those skilled in the art without inventive effort.
The structure, features, and effects of the present invention have been described in detail above with reference to the embodiments shown in the drawings. The above embodiments are, however, merely preferred embodiments of the present invention, and it should be understood that the technical features of the above embodiments and their preferred modes can be reasonably combined and configured into various equivalent schemes by those skilled in the art without departing from or changing the design idea and technical effects of the present invention. Therefore, the invention is not limited to the embodiments shown in the drawings; all modifications and equivalent embodiments conceived according to the idea of the invention fall within its scope, provided they do not depart from the spirit of the description and the drawings.

Claims (12)

1. A personalized prompt generation method, characterized by comprising the following steps:
receiving a user interaction instruction and personalized comprehensive information;
determining a persona category of a target prompt according to the user interaction instruction, the personalized comprehensive information, and a first preset strategy;
generating a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a second preset strategy;
and synthesizing the target prompt by using the target prompt persona category, the target prompt text, and a third preset strategy.
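The three-stage flow recited in claim 1 (persona selection, prompt-text generation, speech synthesis) can be illustrated with a minimal sketch. All function names, field names, persona labels, and mapping rules below are hypothetical assumptions for illustration only; the patent does not specify them, and each stub stands in for a trained model:

```python
from dataclasses import dataclass


@dataclass
class PromptRequest:
    interaction_instruction: str  # e.g. a parsed voice command
    personalized_info: dict       # multi-dimensional user/environment/feedback data


def determine_persona_category(request: PromptRequest) -> str:
    """First preset strategy (stub): map the request to a persona category."""
    # A real system would query a pre-constructed persona-generation model here.
    if request.personalized_info.get("mood") == "tired":
        return "warm_female_voice"
    return "neutral"


def generate_prompt_text(request: PromptRequest, persona: str) -> str:
    """Second preset strategy (stub): produce the target prompt text."""
    return f"[{persona}] Response to: {request.interaction_instruction}"


def synthesize_prompt(persona: str, text: str) -> bytes:
    """Third preset strategy (stub): placeholder for TTS audio output."""
    return f"{persona}|{text}".encode("utf-8")


def generate_personalized_prompt(request: PromptRequest) -> bytes:
    # The three claim-1 stages run in sequence, each feeding the next.
    persona = determine_persona_category(request)
    text = generate_prompt_text(request, persona)
    return synthesize_prompt(persona, text)
```

The point of the sketch is only the staged data flow: the persona category chosen in stage one conditions both the text generated in stage two and the synthesis in stage three.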
2. The personalized prompt generation method according to claim 1, wherein the personalized comprehensive information comprises at least one of the following:
multi-dimensional user information representing the user state, wherein the multi-dimensional user information further comprises user basic information and/or user advanced information;
multi-dimensional environment information representing the environment state of the user;
external information based on network hotspots; and
feedback information of the user on the interaction process.
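The four information dimensions enumerated in claim 2 can be pictured as a simple container type; the class name, field names, and default shapes below are illustrative assumptions, not terms from the patent:

```python
from dataclasses import dataclass, field


@dataclass
class PersonalizedInfo:
    """Hypothetical container for the personalized comprehensive information."""
    user_info: dict = field(default_factory=dict)         # basic and/or advanced user information
    environment_info: dict = field(default_factory=dict)  # multi-dimensional environment state
    external_info: list = field(default_factory=list)     # items derived from network hotspots
    feedback_info: list = field(default_factory=list)     # user feedback on past interactions
```

Using `field(default_factory=...)` rather than mutable defaults keeps each instance's dimensions independent, which matters if the structure is populated incrementally per user.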
3. The personalized prompt generation method according to claim 2, wherein:
the user advanced information comprises at least one of the following: predicted user conversation information, user relationship information, and user interest information obtained by processing the user basic information;
the feedback information comprises: stop operation information generated when the user interrupts the target prompt in real time during the interaction process, and/or rating information on the target prompt collected from the user at random.
4. The personalized prompt generation method according to claim 1, wherein
the first preset strategy comprises:
parsing the user interaction instruction;
selecting corresponding information from the personalized comprehensive information according to the parsed user interaction instruction;
inputting the parsed user interaction instruction and the selected personalized comprehensive information into a pre-constructed personalized persona generation model;
and the second preset strategy comprises:
inputting the parsed user interaction instruction, the selected personalized comprehensive information, and the target prompt persona category into a pre-constructed personalized prompt text generation model.
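The first and second preset strategies in claim 4 (parse the instruction, select the matching slices of personalized information, feed both to pre-built models) can be sketched as follows. The one-token parsing rule, the relevance table, and both model stubs are assumptions made for illustration; real systems would use trained NLU and generation models:

```python
def parse_instruction(raw: str) -> dict:
    """Very rough intent parse: first token as the action, rest as arguments."""
    action, _, args = raw.partition(" ")
    return {"action": action, "args": args}


def select_relevant_info(parsed: dict, info: dict) -> dict:
    """Keep only the information dimensions relevant to the parsed action."""
    relevance = {"navigate": ["environment_info"], "play": ["user_info"]}  # assumed table
    keys = relevance.get(parsed["action"], list(info))
    return {k: info[k] for k in keys if k in info}


def persona_model(parsed: dict, selected: dict) -> str:
    """Stub for the pre-constructed personalized persona-generation model."""
    return "calm_navigator" if parsed["action"] == "navigate" else "default"


def prompt_text_model(parsed: dict, selected: dict, persona: str) -> str:
    """Stub for the pre-constructed personalized prompt-text generation model."""
    return f"({persona}) {parsed['action']}: {parsed['args']}"
```

Selecting information before model input, rather than passing everything, mirrors the claim's two-step structure and keeps each model's input focused on the current intent.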
5. The method according to claim 4, further comprising:
recording interaction history data of the user, wherein the interaction history data comprises feedback information and/or usage habits;
determining a reward-and-punishment mechanism for the personalized persona generation model and/or the personalized prompt text generation model according to the interaction history data;
and adaptively updating parameters of the model(s) on the basis of a general model architecture in combination with the reward-and-punishment mechanism.
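The reward-and-punishment update in claim 5 can be illustrated with a deliberately simplified scheme in which real-time interruptions punish and high ratings reward, and a single scalar weight stands in for the model parameters. The scoring rule, event fields, and learning rate are all assumptions, not the patent's mechanism:

```python
def reward_from_history(history: list) -> float:
    """Score interaction history: interruptions punish, ratings (0-5) reward."""
    score = 0.0
    for event in history:
        if event.get("interrupted"):
            score -= 1.0                      # user cut the prompt off: punishment
        score += event.get("rating", 0) / 5.0  # normalized rating: reward
    return score


def update_weight(weight: float, reward: float, lr: float = 0.1) -> float:
    """Adapt the stand-in parameter toward behavior the user rewarded."""
    return weight + lr * reward
```

In a full system the same signal would drive gradient-style updates of the persona and prompt-text models on top of a shared general architecture, as the claim describes.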
6. The personalized prompt generation method according to any one of claims 1 to 5, wherein the third preset strategy comprises:
performing acoustic processing on the user interaction instruction in audio form;
determining a noise reduction signal based on the acoustic processing result and noise-related information of the current environment in which the user is located;
inputting the target prompt persona category and the target prompt text into a pre-constructed prompt synthesis model to synthesize an initial prompt;
and fusing the noise reduction signal and the initial prompt to obtain the target prompt.
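Claim 6's third preset strategy (acoustic processing of the incoming audio, a noise-aware signal, synthesis, then fusion) can be sketched as follows. The RMS noise estimate, the gain-based "fusion" rule, and the placeholder waveform are simplifying assumptions; the patent's actual fusion of the noise reduction signal with the initial prompt is not specified at this level:

```python
def estimate_noise_level(samples):
    """Crude RMS estimate of ambient noise from the audio-form instruction."""
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def synthesize_initial_prompt(persona, text):
    """Stub TTS: fixed-amplitude placeholder waveform for the prompt."""
    return [0.1] * 8


def fuse(noise_level, prompt, margin=2.0):
    """Assumed fusion rule: scale the prompt `margin`x above the noise floor."""
    target = max(noise_level * margin, 0.1)   # never drop below a base level
    peak = max(abs(s) for s in prompt)
    gain = target / peak
    return [s * gain for s in prompt]
```

The practical effect sketched here is that a prompt played in a noisy cabin comes out louder than the same prompt in a quiet room, which is one plausible reading of fusing a noise-derived signal with the synthesized audio.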
7. A personalized prompt generation device, characterized by comprising:
a receiving module for receiving a user interaction instruction and personalized comprehensive information;
a target persona determination module for determining a persona category of a target prompt according to the user interaction instruction, the personalized comprehensive information, and a first preset strategy;
a prompt text generation module for generating a target prompt text based on the user interaction instruction, the personalized comprehensive information, the target prompt persona category, and a second preset strategy;
and a prompt synthesis module for synthesizing the target prompt by using the target prompt persona category, the target prompt text, and a third preset strategy.
8. The personalized prompt generation device according to claim 7, wherein
the receiving module specifically comprises:
an instruction parsing unit for parsing the user interaction instruction;
an information selection unit for selecting corresponding information from the personalized comprehensive information according to the parsed user interaction instruction;
the target persona determination module specifically comprises:
a first model processing unit for inputting the parsed user interaction instruction and the selected personalized comprehensive information into a pre-constructed personalized persona generation model;
and the prompt text generation module specifically comprises:
a second model processing unit for inputting the parsed user interaction instruction, the selected personalized comprehensive information, and the target prompt persona category into a pre-constructed personalized prompt text generation model.
9. The device according to claim 8, further comprising:
an interaction history collection module for recording interaction history data of the user, the interaction history data comprising feedback information and/or usage habits;
and a model iteration module, which specifically comprises:
a reward-and-punishment mechanism setting unit for determining a reward-and-punishment mechanism for the personalized persona generation model and/or the personalized prompt text generation model according to the interaction history data;
and an adaptive updating unit for adaptively updating parameters of the model(s) on the basis of a general model architecture in combination with the reward-and-punishment mechanism.
10. The personalized prompt generation device according to any one of claims 7 to 9, wherein the device further comprises:
a front-end acoustic processing module for performing acoustic processing on the user interaction instruction in audio form;
a noise reduction signal generation module for determining a noise reduction signal based on the acoustic processing result and noise-related information of the current environment in which the user is located;
and the prompt synthesis module specifically comprises:
an initial prompt synthesis unit for inputting the target prompt persona category and the target prompt text into a pre-constructed prompt synthesis model to synthesize an initial prompt;
and a target prompt synthesis unit for fusing the noise reduction signal and the initial prompt to obtain the target prompt.
11. A personalized prompt generation device, comprising:
one or more processors, a memory, and one or more computer programs, wherein the one or more computer programs are stored in the memory and comprise instructions which, when executed by the device, cause the device to perform the personalized prompt generation method of any one of claims 1 to 6.
12. A computer-readable storage medium in which a computer program is stored, wherein the computer program, when run on a computer, causes the computer to execute the personalized prompt generation method of any one of claims 1 to 6.
CN201911276510.0A 2019-12-12 2019-12-12 Personalized prompt generation method, device and equipment Active CN111145721B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911276510.0A CN111145721B (en) 2019-12-12 2019-12-12 Personalized prompt generation method, device and equipment

Publications (2)

Publication Number Publication Date
CN111145721A true CN111145721A (en) 2020-05-12
CN111145721B CN111145721B (en) 2024-02-13

Family

ID=70518055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911276510.0A Active CN111145721B (en) 2019-12-12 2019-12-12 Personalized prompt generation method, device and equipment

Country Status (1)

Country Link
CN (1) CN111145721B (en)

Citations (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6144938A (en) * 1998-05-01 2000-11-07 Sun Microsystems, Inc. Voice user interface with personality
CN1379391A (en) * 2001-04-06 2002-11-13 国际商业机器公司 Method of producing individual characteristic speech sound from text
CN1589588A (en) * 2001-09-20 2005-03-02 声音识别公司 Sound enhancement for mobile phones and other products producing personalized audio for users
CN1688154A (en) * 2005-06-17 2005-10-26 北京蓝宇创杰科技发展有限公司 Dynamic information interactive system and method based on personalized ring back tone service
CN101170609A (en) * 2007-10-22 2008-04-30 中兴通讯股份有限公司 A seal personalized work number reporting system and method
CN101370055A (en) * 2008-09-24 2009-02-18 中国电信股份有限公司 Propelling movement method, platform and system for recommending information of personalized ring back tone
CN101521853A (en) * 2008-02-29 2009-09-02 丰达软件(苏州)有限公司 Method for converting multimedia with personalized speech and service end
CN101927766A (en) * 2009-06-25 2010-12-29 现代自动车株式会社 System for providing a personalized driving sound
CN101989451A (en) * 2009-07-30 2011-03-23 索尼公司 Mobile audio player with individualized radio program
CN102501823A (en) * 2011-10-28 2012-06-20 深圳市赛格导航科技股份有限公司 Automobile anti-theft alarm system with adjustable alarm promote tone
CN102654860A (en) * 2011-03-01 2012-09-05 北京彩云在线技术开发有限公司 Personalized music recommendation method and system
CN103885987A (en) * 2012-12-21 2014-06-25 ***通信集团公司 Music recommendation method and system
CN104006820A (en) * 2014-04-25 2014-08-27 南京邮电大学 Personalized dynamic real time navigation method and navigation system
EP2800017A2 (en) * 2013-04-30 2014-11-05 Orange Generation of a personalised sound related to an event
CN106303010A (en) * 2016-08-10 2017-01-04 康子纯 The communication apparatus personalized speech such as mobile phone/software icon button coupling combination controls
CN106652996A (en) * 2016-12-23 2017-05-10 北京奇虎科技有限公司 Prompt tone generating method and device and mobile terminal
CN106686267A (en) * 2015-11-10 2017-05-17 ***通信集团公司 Method and system for implementing personalized voice service
CN107044859A (en) * 2016-02-08 2017-08-15 通用汽车环球科技运作有限责任公司 Personalized Navigation route for conveying arrangement
CN107543557A (en) * 2016-06-29 2018-01-05 百度在线网络技术(北京)有限公司 A kind of method and apparatus for carrying out Personalized Navigation
CN107727109A (en) * 2017-09-08 2018-02-23 阿里巴巴集团控股有限公司 Personalized speech reminding method and device and electronic equipment
CN109218843A (en) * 2018-09-27 2019-01-15 四川长虹电器股份有限公司 Individualized intelligent phonetic prompt method based on television equipment
CN109256133A (en) * 2018-11-21 2019-01-22 上海玮舟微电子科技有限公司 A kind of voice interactive method, device, equipment and storage medium
CN109818737A (en) * 2018-12-24 2019-05-28 科大讯飞股份有限公司 Personalized password generated method and system

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349271A (en) * 2020-11-06 2021-02-09 北京乐学帮网络技术有限公司 Voice information processing method and device, electronic equipment and storage medium
CN112614484A (en) * 2020-11-23 2021-04-06 北京百度网讯科技有限公司 Feature information mining method and device and electronic equipment
CN112735423A (en) * 2020-12-14 2021-04-30 美的集团股份有限公司 Voice interaction method and device, electronic equipment and storage medium
CN112735423B (en) * 2020-12-14 2024-04-05 美的集团股份有限公司 Voice interaction method and device, electronic equipment and storage medium
CN112948624A (en) * 2021-02-02 2021-06-11 武汉小安科技有限公司 Voice configuration method and device, electronic equipment and storage medium
CN113722610B (en) * 2021-08-13 2024-03-19 支付宝(杭州)信息技术有限公司 Method, device and equipment for interaction between users based on search scene
CN113722610A (en) * 2021-08-13 2021-11-30 支付宝(杭州)信息技术有限公司 Search scene-based interaction method, device and equipment between users
CN114070674A (en) * 2021-11-12 2022-02-18 上海顺舟智能科技股份有限公司 Multifunctional internet of things control gateway
CN115086283A (en) * 2022-05-18 2022-09-20 阿里巴巴(中国)有限公司 Voice stream processing method and unit
CN115086283B (en) * 2022-05-18 2024-02-06 阿里巴巴(中国)有限公司 Voice stream processing method and device
CN117010725B (en) * 2023-09-26 2024-02-13 科大讯飞股份有限公司 Personalized decision method, system and related device
CN117010725A (en) * 2023-09-26 2023-11-07 科大讯飞股份有限公司 Personalized decision method, system and related device
CN117238281B (en) * 2023-11-09 2024-03-15 摩斯智联科技有限公司 Voice guide word arbitration method and device for vehicle-mounted system, vehicle-mounted system and storage medium
CN117238281A (en) * 2023-11-09 2023-12-15 摩斯智联科技有限公司 Voice guide word arbitration method and device for vehicle-mounted system, vehicle-mounted system and storage medium
CN117933195A (en) * 2024-03-25 2024-04-26 腾讯科技(深圳)有限公司 Navigation broadcasting data processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111145721B (en) 2024-02-13

Similar Documents

Publication Publication Date Title
CN111145721B (en) Personalized prompt generation method, device and equipment
CN109410927B (en) Voice recognition method, device and system combining offline command word and cloud analysis
US11645547B2 (en) Human-machine interactive method and device based on artificial intelligence
CN108962217B (en) Speech synthesis method and related equipment
CN107846350B (en) Method, computer readable medium and system for context-aware network chat
US20200395008A1 (en) Personality-Based Conversational Agents and Pragmatic Model, and Related Interfaces and Commercial Models
US20210193116A1 (en) Data driven dialog management
CN108536802A Exchange method based on children's mood and device
CN107818781A (en) Intelligent interactive method, equipment and storage medium
CN109189980A (en) The method and electronic equipment of interactive voice are carried out with user
CN108573702A (en) System with the enabling phonetic function that domain ambiguity is eliminated
CN108804698A (en) Man-machine interaction method, system, medium based on personage IP and equipment
US20230282201A1 (en) Dynamic system response configuration
US11574637B1 (en) Spoken language understanding models
CN111737444A (en) Dialog generation method and device and electronic equipment
JP7171911B2 (en) Generate interactive audio tracks from visual content
Pittermann et al. Handling emotions in human-computer dialogues
CN115062627A (en) Method and apparatus for computer-aided uniform system based on artificial intelligence
CN117216212A (en) Dialogue processing method, dialogue model training method, device, equipment and medium
JP2022531994A (en) Generation and operation of artificial intelligence-based conversation systems
US20240095987A1 (en) Content generation
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
CN112150103B (en) Schedule setting method, schedule setting device and storage medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant