CN114694224A - Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product - Google Patents

Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product

Info

Publication number
CN114694224A
Authority
CN
China
Prior art keywords
data
question
emotion
digital person
animation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210323198.1A
Other languages
Chinese (zh)
Inventor
李东泽
孟凡亮
何恺
蒋琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202210323198.1A priority Critical patent/CN114694224A/en
Publication of CN114694224A publication Critical patent/CN114694224A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 - Sound input; Sound output
    • G06F 3/167 - Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 40/00 - Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q 40/02 - Banking, e.g. interest calculation or account maintenance
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/80 - 2D [Two Dimensional] animation, e.g. using sprites

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application relates to a customer service question-answering method, a customer service question-answering device, a computer device, a storage medium and a computer program product, which relate to the field of artificial intelligence and can be used in digital finance and other fields. The method comprises the following steps: acquiring reply data matched with a question input by a user on a question-answering interface, the reply data comprising emotion reply data and question reply data; inputting the reply data into a preset animation model for processing to generate the expression and facial animation of a digital person; and synthesizing the expression of the digital person with the facial animation of the digital person to generate the animation of the digital person, and displaying the animation of the digital person on the question-answering interface. The animation of the digital person is used for displaying the expression corresponding to the expression data and for voice broadcasting of the question reply data. By adopting the method, the convenience with which customers handle banking business online can be improved.

Description

Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a computer device, a storage medium, and a computer program product for customer service question answering.
Background
As online banking becomes more and more popular, the importance of online customer service becomes more and more prominent. When a customer encounters a business problem, the customer hopes to get an answer and have the confusion resolved at the first opportunity. As the volume of enquiries increases, the related art introduces an intelligent customer service to handle customers' problems: specifically, the customer can input the current problem to be consulted through a dialog box, and the intelligent customer service matches the corresponding answer to the problem input by the customer and outputs the answer to the customer through the dialog box.
However, existing intelligent customer service replies to customer problems remain at the level of text replies through a dialog box, and the efficiency of text replies is low, which reduces the convenience of handling banking business online for the customer.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a customer service question and answer method, apparatus, computer device, computer readable storage medium and computer program product capable of improving convenience of handling banking business online by customers.
In a first aspect, the present application provides a customer service question and answer method. The method comprises the following steps:
acquiring reply data matched with a question input by a user on a question-answering interface; the reply data comprises emotion reply data and question reply data; inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person; synthesizing the expression of the digital person and the facial animation of the digital person to generate the animation of the digital person, and displaying the animation of the digital person on a question-answering interface; the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
In one embodiment, the facial animation includes a plurality of sets of mouth image frames; inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person, wherein the method comprises the following steps:
matching the expressions of the digital people corresponding to the emotion reply data from a preset expression library; the corresponding relation between the emotion reply data and the expressions of the digital people is stored in a preset expression library in advance; inputting the question reply data into a preset animation model for processing to generate a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data; the mouth bone point graph comprises a plurality of mouth bone points at target positions; a plurality of mouth image frames of the digital person are generated from the plurality of sets of mouth bone point patterns of the digital person.
In one embodiment, if the question reply data is text data, inputting the question reply data into a preset animation model for processing, and generating a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data, including:
carrying out format conversion on the question reply data to generate audio data corresponding to the question reply data; adopting a preset audio segmentation rule to sequentially divide audio data into a plurality of audio data units; for each audio data unit, determining target positions of a plurality of mouth skeleton points of the digital person when the audio data unit is sent out; based on the plurality of mouth bone points at the target location, a plurality of sets of mouth bone point patterns are generated.
In one embodiment, synthesizing the expression of the digital person with facial animation of the digital person to generate the animation of the digital person comprises:
arranging a plurality of mouth image frames of the digital person corresponding to the audio data units according to the sequence of the appearance time points of the audio data units in the audio data to generate mouth animation of the digital person corresponding to the question reply data; and synthesizing the expression of the digital person and the mouth animation of the digital person to generate the animation of the digital person.
In one embodiment, acquiring reply data matched with a question input by a user on a question and answer interface comprises the following steps:
inputting questions input by a user on a question-answering interface into a preset emotion recognition model for emotion recognition, generating an emotion recognition result of the user, and determining emotion reply data corresponding to the emotion recognition result of the user; generating question reply data according to the emotion reply data and the questions input by the user on the question-answering interface; reply data is generated based on the emotion reply data and the question reply data.
In one embodiment, inputting questions input by a user on a question and answer interface into a preset emotion recognition model for processing, and generating an emotion recognition result of the user, the method includes:
inputting questions input by a user on a question-answering interface into a sentence-level semantic feature extraction model and a word-level semantic feature extraction model respectively for feature extraction, and generating sentence-level semantic features and word-level semantic features of the questions; generating first emotion data of a user based on sentence-level semantic features, and generating second emotion data of the user based on multi-phrase semantic features; and determining the emotion recognition result of the user from the first emotion data and the second emotion data by adopting a preset screening rule.
In one embodiment, the determining the current emotion recognition result of the user from the first emotion data and the second emotion data by using a preset screening rule includes:
judging whether the first emotion data belongs to a positive type or not and whether the second emotion data belongs to a negative type or not; if not, determining the first emotion data as the current emotion recognition result of the user.
In one embodiment, the method further includes:
and if the first emotion data belong to the positive type and the second emotion data belong to the negative type, determining the second emotion data as the current emotion recognition result of the user.
In a second aspect, the application further provides a customer service question and answer device. The device comprises:
the acquisition module is used for acquiring reply data matched with the questions input by the user on the question-answering interface; the reply data comprises emotion reply data and question reply data;
the first generation module is used for inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person;
the second generation module is used for synthesizing the expression of the digital person and the facial animation of the digital person, generating the animation of the digital person and displaying the animation of the digital person on a question-answering interface; the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the method steps in any of the embodiments of the first aspect described above when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the method steps of any of the embodiments of the first aspect described above.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprising a computer program that when executed by a processor performs the method steps of any of the embodiments of the first aspect described above.
The customer service question and answer method, the customer service question and answer device, the computer device, the storage medium and the computer program product are provided. In the technical scheme provided by the embodiments of the application, reply data matched with the question input by the user on the question-answering interface is obtained; the reply data is input into a preset animation model for processing to generate the expression and facial animation of the digital person; the expression of the digital person and the facial animation of the digital person are synthesized to generate the animation of the digital person, and the animation of the digital person is displayed on the question-answering interface; the reply data comprises emotion reply data and question reply data; and the animation of the digital person is used for displaying the expression corresponding to the expression data and for voice broadcasting of the question reply data. Compared with the prior art, the method does not rely on replying to the user through a traditional dialog box; instead, a digital person is adopted to broadcast by voice the reply data matched with the user's question, and the reply data is presented through the animation of the digital person, so that more intelligent and vivid interaction with the user can be realized, the sense of immersion of an offline face-to-face conversation is achieved, and the convenience and flexibility of handling banking business online are improved for the user. Moreover, corresponding expressions are matched to the digital person according to the question input by the user, so that different emotions of the user can be better accommodated, further improving the intelligence and flexibility of the customer service question answering.
Drawings
FIG. 1 is a diagram illustrating an internal structure of a computer device according to an embodiment;
FIG. 2 is a schematic flow chart diagram illustrating a customer service question and answer method in one embodiment;
FIG. 3 is a schematic flow chart illustrating the generation of expressions and facial animation of a digital person, according to one embodiment;
FIG. 4 is a schematic flow chart diagram that illustrates the generation of multiple groups of mouth bone point patterns for a digital person, under one embodiment;
FIG. 5 is a schematic flow chart illustrating the generation of an animation of a digital person in one embodiment;
FIG. 6 is a flow diagram illustrating an embodiment of obtaining reply data;
FIG. 7 is a flow diagram that illustrates emotion data for a live user in one embodiment;
FIG. 8 is a flow diagram illustrating the generation of emotion recognition results for a user in one embodiment;
FIG. 9 is a schematic flow chart diagram illustrating a customer service question and answer method in accordance with yet another embodiment;
FIG. 10 is a block diagram showing the construction of a customer service question and answer apparatus according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The customer service question-answering method provided by the application can be applied to a computer device. The computer device can be a server or a terminal; the server can be a single server or a server cluster consisting of multiple servers, which is not specifically limited in the embodiments of the application; and the terminal can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers and portable wearable devices.
Taking the example of a computer device being a server, fig. 1 shows a block diagram of a server, which, as shown in fig. 1, comprises a processor, a memory and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer equipment is used for storing customer service question and answer data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a customer service question and answer method.
Those skilled in the art will appreciate that the architecture shown in fig. 1 is a block diagram of only a portion of the architecture associated with the subject application, and does not constitute a limitation on the servers to which the subject application applies, and that servers may alternatively include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
The execution subject of the embodiments of the present application may be a computer device, or may be a customer service question and answer apparatus, and the following method embodiments will be described with reference to the computer device as the execution subject.
In one embodiment, as shown in fig. 2, which shows a flowchart of a customer service question and answer provided by an embodiment of the present application, the method may include the following steps:
step 220, acquiring reply data matched with the questions input by the user on the question-answering interface; the response data includes emotion response data and question response data.
As more and more banking business moves online, users can handle different businesses more conveniently and quickly through online banking. When a customer encounters a business problem, the business problem to be consulted can be input on a question-answering interface in different modes such as voice and text, and corresponding reply data is then obtained on the question-answering interface. After the question input by the user on the question-answering interface is processed and analysed, reply data matched with the question can be obtained, and the reply data comprises emotion reply data and question reply data.
The emotion reply data is data for responding to the current emotion of the user; for example, when the current emotion of the user is angry, the emotion reply data may be voice, text, expressions, actions and the like that convey an apology, so that the user can be comforted. The question reply data is data for replying to the business question of the user; for example, for a business question consulted by the user such as how to set up mobile banking fingerprint payment, the question reply data may be the method steps for setting up mobile banking fingerprint payment, replied in different modes such as voice, text and pictures.
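As an illustration only, the two kinds of reply data described above could be carried in a simple structure such as the following sketch; the field names and example values are assumptions, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class ReplyData:
    """Container for the two parts of a matched reply (illustrative only)."""
    emotion_reply: str   # e.g. a soothing sentence chosen for the user's detected emotion
    question_reply: str  # e.g. the business answer, such as steps to set up fingerprint payment

# Hypothetical example corresponding to the scenarios described above
reply = ReplyData(
    emotion_reply="We are sorry for the inconvenience and are working to improve.",
    question_reply="To enable fingerprint payment: open Settings, select Fingerprint Pay and verify your fingerprint.",
)
```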
And 240, inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person.
The preset animation model is obtained by training based on historical reply data and the reference expressions and facial animations of the digital person corresponding to the historical reply data. The historical reply data is input into an initial animation model for processing to generate predicted expressions and facial animations corresponding to the historical reply data; the predicted and reference expressions and facial animations are substituted into a preset loss function for calculation, and the model parameters of the initial animation model are updated according to the loss function value until a preset convergence condition is reached; finally, the preset animation model is generated according to the updated model parameters. In actual use, the acquired reply data is input into the preset animation model for processing, thereby generating the expression and facial animation of the digital person. The expressions of the digital person may include facial expression, voice expression, body posture expression and the like; for example, the facial expression may be a smile. The facial animation is the dynamic change process of the digital person's face when producing speech; for example, the face of the digital person changes correspondingly when producing different speech sounds.
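A minimal sketch of the training procedure just described, assuming PyTorch and assuming that reply data, expressions and facial animations are already encoded as tensors; the model class, the choice of MSE as the "preset loss function" and the epoch-based convergence test are illustrative assumptions.

```python
import torch
from torch import nn

def train_animation_model(model: nn.Module, dataset, epochs: int = 10, lr: float = 1e-4):
    loss_fn = nn.MSELoss()                      # stand-in for the "preset loss function"
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                     # until a preset convergence condition is reached
        for reply_batch, ref_expr, ref_anim in dataset:
            pred_expr, pred_anim = model(reply_batch)   # predicted expression / facial animation
            loss = loss_fn(pred_expr, ref_expr) + loss_fn(pred_anim, ref_anim)
            optimizer.zero_grad()
            loss.backward()                     # update the model parameters from the loss value
            optimizer.step()
    return model
```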
The digital person is a virtual human which, based on artificial intelligence technologies such as image recognition, speech recognition and synthesis, semantic understanding and human image modelling, has the ability to perceive, understand and express with respect to the physical world, and realises human-computer interaction using electronic screens, VR (virtual reality) and other devices as carriers. Digital persons provide brand-new intelligent customer service for industries such as finance, broadcasting, education, marketing, medical treatment, retail and games, reducing labour cost and improving service quality and efficiency. A digital person constructs a virtual portrait or a cartoon image through computer technology, can interact naturally with the user, and provides the user with a service that has a human touch. Through the visual dimension, the use and interaction experience of the user can be enriched, and the user can be understood better in the service process, so that the user can be served better.
Step 260, synthesizing the expression of the digital person and the facial animation of the digital person to generate the animation of the digital person, and displaying the animation of the digital person on a question-answering interface; the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
The digital human facial animation is composed of a plurality of continuous image frames, and an image corresponding to the expression of the digital human and each frame of image in the facial animation can be superimposed, so that the expression of the digital human and the facial animation of the digital human can be synthesized, and other synthesis methods can be adopted, which is not specifically limited in this embodiment. Therefore, the obtained animation of the digital person simultaneously comprises the expression and the facial animation of the digital person, the voice broadcast can be carried out on the question reply data through the facial animation, the face has the corresponding expression during the voice broadcast, and therefore the user can obtain the reply of the consulted service by displaying the animation of the digital person on the question-answering interface.
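One possible way of performing the superposition described above is simple per-pixel alpha blending of an expression image onto every frame of the facial animation, as in the sketch below; the array shapes and the blending rule are assumptions, not the patent's specified implementation.

```python
import numpy as np

def composite_animation(expression_rgba: np.ndarray, face_frames: list[np.ndarray]) -> list[np.ndarray]:
    alpha = expression_rgba[..., 3:4] / 255.0           # per-pixel opacity of the expression layer
    overlay = expression_rgba[..., :3].astype(np.float32)
    out = []
    for frame in face_frames:                           # each frame: H x W x 3, uint8
        blended = overlay * alpha + frame.astype(np.float32) * (1.0 - alpha)
        out.append(blended.astype(np.uint8))
    return out
```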
In this embodiment, reply data matched with the question input by the user on the question-answering interface is obtained; the reply data is input into a preset animation model for processing to generate the expression and facial animation of the digital person; the expression of the digital person and the facial animation of the digital person are synthesized to generate the animation of the digital person, and the animation of the digital person is displayed on the question-answering interface; the reply data comprises emotion reply data and question reply data; and the animation of the digital person is used for displaying the expression corresponding to the expression data and for voice broadcasting of the question reply data. Compared with the prior art, the method does not rely on replying to the user through a traditional dialog box; instead, a digital person is adopted to broadcast by voice the reply data matched with the user's question, and the reply data is presented through the animation of the digital person, so that more intelligent and vivid interaction with the user can be realized, the sense of immersion of an offline face-to-face conversation is achieved, and the convenience and flexibility of handling banking business online are improved for the user. Moreover, corresponding expressions are matched to the digital person according to the question input by the user, so that different emotions of the user can be better accommodated, further improving the intelligence and flexibility of the customer service question answering.
In one embodiment, as shown in fig. 3, which shows a flow chart of a customer service question and answer provided by the embodiment of the present application, and particularly relates to a possible process of generating expressions and facial animations of digital people, the method may include the following steps:
step 320, matching the expressions of the digital people corresponding to the emotion reply data from a preset expression library; the preset expression library stores the corresponding relation between the emotion reply data and the expressions of the digital persons in advance.
The preset expression library stores the correspondence between emotion reply data and the expressions of the digital person in advance, and the correspondence in the preset expression library can be set according to human experience or obtained by analysing a large amount of historical data. The expression of the digital person corresponding to the emotion reply data can therefore be looked up from the preset expression library according to this correspondence. For example, if the emotion reply data is "Dear customer, we are sorry; we are always working to improve, and we apologise for the poor experience this has brought you", the matched expressions of the digital person are apologetic, polite and similar expressions. The emotion reply data may correspond one-to-one to an expression of the digital person, or may correspond to multiple expressions of the digital person; one of the multiple expressions may be selected arbitrarily, or, after multiple expressions are selected at the same time, the animation may switch between the different expressions, which is not specifically limited in this embodiment.
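A minimal sketch of such a preset expression library, represented here as a plain mapping from emotion-reply categories to digital-person expression assets; the keys and asset names are invented for illustration and the patent does not prescribe this representation.

```python
# Hypothetical preset expression library: emotion reply category -> expression assets
EXPRESSION_LIBRARY = {
    "apology": ["expr_sorry.png", "expr_polite_bow.png"],   # several expressions may match
    "greeting": ["expr_smile.png"],
    "reassurance": ["expr_gentle_nod.png"],
}

def match_expression(emotion_reply_category: str) -> str:
    candidates = EXPRESSION_LIBRARY.get(emotion_reply_category, ["expr_neutral.png"])
    return candidates[0]    # any matched expression may be chosen, or the animation may switch between them
```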
Step 340, inputting the question reply data into a preset animation model for processing, and generating a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data; the mouth bone point pattern includes a plurality of mouth bone points at a target location.
The facial animation of the digital person can comprise a plurality of groups of mouth image frames. The question reply data is input into the preset animation model for processing, so that a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data are generated, and each mouth bone point pattern can comprise a plurality of mouth bone points at target positions. The mouth bone point pattern can be formed according to the positions to which the mouth bone points move for different pronunciations; the more mouth bone points are set for the different pronunciations in the mouth bone point pattern, the better, so that the mouth movement during pronunciation is finer and fits the pronunciation more closely.
And step 360, generating a plurality of mouth image frames of the digital person according to the plurality of groups of mouth bone point patterns of the digital person.
The method includes the steps of performing operations such as sorting and rendering on a plurality of groups of mouth bone point patterns of a digital person to form corresponding images, and performing image processing operations on the generated images to generate a plurality of final mouth image frames of the digital person, wherein the image processing operations may include but are not limited to image smoothing, denoising, size scaling and other processing.
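One possible realisation of the rendering and image-processing operations just listed is sketched below using OpenCV: the bone points of one pattern are drawn onto a blank canvas, then smoothed and scaled. The canvas size, drawing style and parameter values are assumptions.

```python
import cv2
import numpy as np

def render_mouth_frame(bone_points: list[tuple[int, int]], size=(128, 128)) -> np.ndarray:
    frame = np.zeros((size[1], size[0], 3), dtype=np.uint8)
    pts = np.array(bone_points, dtype=np.int32)
    cv2.polylines(frame, [pts], isClosed=True, color=(255, 255, 255), thickness=1)
    for x, y in bone_points:
        cv2.circle(frame, (x, y), 2, (0, 0, 255), -1)        # mark each mouth bone point
    frame = cv2.GaussianBlur(frame, (3, 3), 0)               # smoothing / denoising
    return cv2.resize(frame, (256, 256))                     # size scaling
```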
In the embodiment, the expressions of the digital people corresponding to the emotion reply data are matched from the preset expression library; inputting the question reply data into a preset animation model for processing to generate a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data; according to the multiple groups of mouth skeleton point patterns of the digital person, a plurality of mouth image frames of the digital person are generated, and therefore the expressions and facial animations of the digital person can be acquired more accurately and rapidly.
In one embodiment, as shown in fig. 4, which illustrates a flow chart of customer service questions and answers provided by an embodiment of the present application, and in particular relates to a possible process of generating a plurality of groups of mouth bone point patterns of a digital person, the method may include the following steps:
step 420, converting the format of the question reply data to generate audio data corresponding to the question reply data.
If the question reply data is text data, the question reply data can be processed through a set text-to-speech tool, and therefore audio data corresponding to the question reply data is obtained. When the text-to-speech tool is used for processing, the text data input by the user may be obtained in real time to be converted, or the conversion may be performed after the user has input all the text data, which is not specifically limited in this embodiment.
Step 440, sequentially dividing the audio data into a plurality of audio data units by using a preset audio division rule.
The audio data includes audio data corresponding to each word in the question reply data, and the audio data may be sequentially divided into a plurality of audio data units by using a preset audio division rule for the audio data of each word. The preset audio segmentation rule may be set according to the pronunciation time, for example, if the pronunciation time corresponding to one word is 0.5 seconds in total, then 0.5 seconds may be divided into three time segments, each corresponding to a different audio data unit. It should be noted that the pronunciation time corresponding to each word may be divided into different number of time periods according to actual requirements, or may be divided into the same number of time periods, which is not specifically limited in this embodiment.
Step 460 determines, for each audio data unit, a target location of a plurality of skeletal points of the mouth of the digital person at the time of the issuance of the audio data unit.
According to the corresponding relation between the audio data units and the target positions of the mouth skeleton points, the target positions of the plurality of mouth skeleton points of the digital person when the audio data units are sent out can be matched for each audio data unit. The target positions of the mouth skeleton points corresponding to different audio data units are different, and the target positions of a plurality of mouth skeleton points when the audio data units are sent out can be represented by two-dimensional coordinate information.
Step 480, generating a plurality of groups of mouth bone point graphs based on the plurality of mouth bone points at the target position.
Thereby, a plurality of mouth skeleton points at the target position when the audio data unit is emitted can be determined according to the two-dimensional coordinate information of the mouth skeleton points, and a mouth skeleton point pattern of the audio data corresponding to each character can be generated. And integrating the mouth bone point patterns of all characters in the audio data to obtain a plurality of groups of mouth bone point patterns.
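A minimal sketch of the flow of this embodiment: the audio is divided into units of roughly equal duration, and each unit is mapped through a stored correspondence table to the target positions of the mouth bone points. The correspondence table, its keys and the coordinate values are hypothetical placeholders; the patent only requires that such a correspondence exists.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]            # 2-D coordinate of one mouth bone point

# Hypothetical correspondence: audio data unit label -> target positions of the mouth bone points
VISEME_TABLE: Dict[str, List[Point]] = {
    "unit_a": [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0), (1.0, 2.0)],
    "unit_o": [(0.3, 1.0), (1.0, 0.3), (1.7, 1.0), (1.0, 1.7)],
}
DEFAULT_PATTERN: List[Point] = [(0.5, 1.0), (1.0, 0.8), (1.5, 1.0), (1.0, 1.2)]

def segment_audio(samples: List[float], sample_rate: int, unit_seconds: float) -> List[List[float]]:
    """Divide the audio samples into units of equal duration (one simple segmentation rule)."""
    step = max(1, int(sample_rate * unit_seconds))
    return [samples[i:i + step] for i in range(0, len(samples), step)]

def pattern_for_unit(unit_label: str) -> List[Point]:
    """Look up the mouth bone point pattern for one audio data unit."""
    return VISEME_TABLE.get(unit_label, DEFAULT_PATTERN)
```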
In this embodiment, the audio data corresponding to the question reply data is generated by performing format conversion on the question reply data; adopting a preset audio segmentation rule to sequentially divide audio data into a plurality of audio data units; for each audio data unit, determining target positions of a plurality of mouth skeleton points of the digital person when the audio data unit is sent out; based on the plurality of mouth bone points at the target location, a plurality of sets of mouth bone point patterns are generated. After the audio data are divided, the mouth bone point position corresponding to each audio data unit can be accurately obtained, and therefore the finally generated mouth bone point pattern is more accurate.
In one embodiment, as shown in fig. 5, which shows a flow chart of a customer service question and answer provided by the embodiment of the present application, and particularly relates to a possible process of generating animation of a digital person, the method may include the following steps:
and 520, arranging a plurality of mouth image frames of the digital person corresponding to the audio data units according to the sequence of the appearance time points of the audio data units in the audio data, and generating mouth animation of the digital person corresponding to the question reply data.
When the question reply data is converted into the audio data, the conversion is required to be performed according to the sequence of the text data, so that the sequence of the obtained audio data is consistent with the sequence of the text data. A plurality of mouth bone point patterns of the digital person corresponding to the audio data unit may be acquired first, and a plurality of mouth image frames of the digital person corresponding to the audio data unit may be generated by performing corresponding processing on the plurality of mouth bone point patterns. Therefore, the plurality of mouth image frames of the digital person corresponding to the audio data units can be arranged according to the sequence of the appearance time points of the plurality of audio data units in the audio data, and finally, after the arranged plurality of mouth image frames are played, the mouth animation of the digital person corresponding to the problem reply data is generated.
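As a sketch of the ordering step described above, each audio data unit can carry its appearance time point, and the mouth image frames are concatenated in that order to form the frame sequence of the mouth animation; the (timestamp, frames) pairing is an assumed representation.

```python
from typing import List, Tuple
import numpy as np

def build_mouth_animation(units: List[Tuple[float, List[np.ndarray]]]) -> List[np.ndarray]:
    ordered = sorted(units, key=lambda item: item[0])   # sort by appearance time point in the audio
    frames: List[np.ndarray] = []
    for _, unit_frames in ordered:
        frames.extend(unit_frames)                      # concatenate each unit's mouth image frames
    return frames
```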
And 540, synthesizing the expression of the digital person and the mouth animation of the digital person to generate the animation of the digital person.
The image corresponding to the expression of the digital person and each mouth image frame in the mouth animation are superimposed, so that the expression of the digital person and the mouth animation of the digital person are synthesized, and other synthesis methods may also be adopted, which is not specifically limited in this embodiment.
In this embodiment, the mouth animation of the digital person corresponding to the question reply data is generated by arranging the plurality of mouth image frames of the digital person corresponding to the audio data units according to the sequence of the appearance time points of the plurality of audio data units in the audio data; and synthesizing the expression of the digital person and the mouth animation of the digital person to generate the animation of the digital person. Because the mouth animation of the corresponding digital person can be generated according to the sequence of the audio data units, the accuracy of voice broadcasting of the problem reply data is ensured.
In one embodiment, as shown in fig. 6, which illustrates a flowchart of a customer service question and answer provided by the embodiment of the present application, and particularly relates to a possible process of acquiring reply data, the method may include the following steps:
step 620, inputting the questions input by the user on the question and answer interface into a preset emotion recognition model for emotion recognition, generating emotion recognition results of the user, and determining emotion reply data corresponding to the emotion recognition results of the user.
The preset emotion recognition model is obtained by training based on historical problem data and a reference emotion recognition result corresponding to the historical problem data. Historical problem data are input into the initial emotion recognition model to be processed, a predicted emotion recognition result corresponding to the historical problem data is generated, the predicted emotion recognition result and a reference emotion recognition result are substituted into a preset loss function to be calculated, model parameters of the initial emotion recognition model are updated according to the loss function value until a preset convergence condition is achieved, and finally the preset emotion recognition model is generated according to the updated model parameters. In the actual use process, the questions input by the user on the question and answer interface are input into a preset emotion recognition model for emotion recognition, and an emotion recognition result of the user is generated. And matching emotion reply data corresponding to the emotion recognition result of the user from a preset question bank, or generating emotion reply data in real time according to the emotion recognition result of the user.
And step 640, generating question reply data according to the emotion reply data and the questions input by the user on the question and answer interface.
The data for replying to the user's business question is obtained by matching according to the question input by the user on the question-answering interface. Specifically, the data can be matched from a preset question bank: keywords are extracted from the question input by the user, the preset question bank is searched according to the keywords for the corresponding answers, and these answers have been compiled and checked by experts in advance.
The emotion reply data and the data replying to the user's business question are then combined into a complete sentence to be used as the answer of the question reply, that is, the question reply data is generated. For example, the user inputs an angry complaint about a transfer operation, such as "What rubbish, why can't I select the person I want to transfer to?"; the obtained emotion reply data is then "Dear customer, we are sorry; we are always working to improve, and we apologise for the poor experience this has brought you", and the data replying to the user's business question is the transfer-related service content and operation method. The two parts of data are thus combined into a complete sentence, and the question reply data is generated.
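A minimal sketch of the keyword matching and sentence composition just described; the keyword extraction, the question bank contents and the joining rule are illustrative assumptions.

```python
# Hypothetical preset question bank: keyword -> expert-checked business answer
QUESTION_BANK = {
    "transfer": "To transfer money, open Transfers, choose a payee from Contacts, enter the amount and confirm.",
    "fingerprint payment": "To enable fingerprint payment, open Settings, select Fingerprint Pay and verify your fingerprint.",
}

def generate_question_reply(emotion_reply: str, user_question: str) -> str:
    business_answer = "Please tell us more about the problem."        # fallback answer
    for keyword, answer in QUESTION_BANK.items():                     # keyword matching
        if keyword in user_question.lower():
            business_answer = answer
            break
    return f"{emotion_reply} {business_answer}"                       # combine into one complete reply
```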
Step 660, generating response data based on the emotion response data and the question response data.
The emotion reply data can be used for generating the expression and facial animation of the digital person, and the final reply data can be generated based on the expression and facial animation of the digital person and the question reply data. The specific processing procedure is similar to the above embodiment, and is not described herein again.
In this embodiment, the question input by the user on the question-answering interface is input into the preset emotion recognition model for emotion recognition, the emotion recognition result of the user is generated, and emotion reply data corresponding to the emotion recognition result of the user is determined; question reply data is generated according to the emotion reply data and the question input by the user on the question-answering interface; and reply data is generated based on the emotion reply data and the question reply data. By dividing the reply into these two parts, more intelligent and vivid interaction with the user can be realized, and the wording used when replying to the user is more flexible and accurate.
In one embodiment, as shown in fig. 7, which illustrates a flow chart of a customer service question and answer provided by the embodiment of the present application, specifically, a possible process of generating emotion data of a user is provided, and the method may include the following steps:
and 720, inputting the questions input by the user on the question and answer interface into the sentence-level semantic feature extraction model and the word-level semantic feature extraction model respectively for feature extraction, and generating sentence-level semantic features and word-level semantic features of the questions.
The sentence-level semantic feature extraction model is SWEM-aver, a simple model based on word vectors: specifically, sentence-level semantic features are extracted by averaging the word vectors element-wise with average pooling, yielding the sentence-level semantic features of the user's utterance. The word-level semantic feature extraction model is an improved CNN model: the traditional CNN extracts n-gram phrase semantic features, where n is the size of the convolution window, which can be set to 2, 3 and 4 respectively, with 14 convolution kernels for each window size, so as to extract rich n-gram phrase semantic information from the original word vector matrix.
The set hyper-parameters are applied during CNN model training: Dropout is added to the fully connected layer so that some connections are randomly dropped, which effectively prevents overfitting, and the pooling layer is changed to k-Max pooling to retain more features, resulting in an improved CNN model. The question input by the user on the question-answering interface is input into the sentence-level semantic feature extraction model and the word-level semantic feature extraction model respectively for feature extraction, so that the sentence-level semantic features and word-level semantic features of the question are extracted.
And 740, generating first emotion data of the user based on the sentence-level semantic features, and generating second emotion data of the user based on the multi-phrase semantic features.
After the extracted features are classified, first emotion data of the user can be generated based on sentence-level semantic features, and second emotion data of the user can be generated based on multi-phrase semantic features. Specifically, the sentence-level semantic features can be respectively input into a classifier of the sentence-level semantic feature extraction model, and first emotion data of the user is generated; and inputting the word-level semantic features into a classifier of the word-level semantic feature extraction model, and generating second emotion data of the user.
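A sketch of the two feature extractors and their classifiers described above, assuming PyTorch and assuming that a question has already been converted into a word-vector matrix of shape (batch, seq_len, embed_dim). The window sizes 2/3/4 and the 14 kernels per window follow the description; the k of k-Max pooling, the dropout rate and the number of emotion classes are assumptions.

```python
import torch
from torch import nn

class SwemAverClassifier(nn.Module):
    """Sentence-level features: element-wise average pooling over word vectors (SWEM-aver)."""
    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        sentence_feature = word_vectors.mean(dim=1)          # average pooling over the sequence
        return self.classifier(sentence_feature)

class ImprovedCnnClassifier(nn.Module):
    """Word-level (n-gram phrase) features: windows 2/3/4, 14 kernels each, k-Max pooling, Dropout."""
    def __init__(self, embed_dim: int, num_classes: int, k: int = 2, dropout: float = 0.5):
        super().__init__()
        self.k = k
        self.convs = nn.ModuleList(nn.Conv1d(embed_dim, 14, kernel_size=n) for n in (2, 3, 4))
        self.dropout = nn.Dropout(dropout)                   # randomly drops connections before the FC layer
        self.classifier = nn.Linear(3 * 14 * k, num_classes)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        x = word_vectors.transpose(1, 2)                     # (batch, embed_dim, seq_len)
        pooled = [torch.topk(torch.relu(conv(x)), self.k, dim=2).values.flatten(1)
                  for conv in self.convs]                    # k-Max pooling keeps the k largest responses
        features = torch.cat(pooled, dim=1)
        return self.classifier(self.dropout(features))
```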
And 760, determining the emotion recognition result of the user from the first emotion data and the second emotion data by adopting a preset screening rule.
The preset screening rule can be preset according to actual requirements, and the emotion recognition result of the user is determined from the first emotion data and the second emotion data by analyzing and comparing the first emotion data and the second emotion data.
In the embodiment, questions input by a user on a question-answering interface are respectively input into a sentence-level semantic feature extraction model and a word-level semantic feature extraction model for feature extraction, and sentence-level semantic features and word-level semantic features of the questions are generated; generating first emotion data of a user based on sentence-level semantic features, and generating second emotion data of the user based on multi-phrase semantic features; and determining the emotion recognition result of the user from the first emotion data and the second emotion data by adopting a preset screening rule. By adopting different models to generate the emotion data of the user, the emotion recognition result of the final user can be screened from the first emotion data and the second emotion data by combining the advantages of different models, so that the obtained emotion recognition result of the user is more accurate.
In one embodiment, as shown in fig. 8, which illustrates a flowchart of a customer service question and answer provided by an embodiment of the present application, specifically, a possible process of generating an emotion recognition result of a user, the method may include the following steps:
and step 820, judging whether the first emotion data belongs to the positive type or not and whether the second emotion data belongs to the negative type or not.
Experiments show that the classification accuracy of SWEM is equal to or slightly higher than that of the CNN model, so the SWEM classification result is mainly used. Based on this, after the first emotion data and the second emotion data are acquired, the current emotion recognition result of the user can be determined by judging the types of the first emotion data and the second emotion data. Specifically, it is determined whether the first emotion data belongs to a positive type, which may include but is not limited to calm, happy, excited, etc., and whether the second emotion data belongs to a negative type, which may include but is not limited to angry, nervous, fearful, etc.
And step 840, if not, determining the first emotion data as the current emotion recognition result of the user.
Here, "if not" covers the following cases: the first emotion data does not belong to the positive type and the second emotion data does not belong to the negative type; the first emotion data does not belong to the positive type and the second emotion data belongs to the negative type; or the first emotion data belongs to the positive type and the second emotion data does not belong to the negative type. In these cases, the first emotion data is determined as the current emotion recognition result of the user.
For example, if the first emotion data is nervous and the second emotion data is happy, the current emotion recognition result of the user is nervous; if the first emotion data is nervous and the second emotion data is nervous, the current emotion recognition result of the user is nervous; and if the first emotion data is happy and the second emotion data is calm, the current emotion recognition result of the user is happy. These examples are only used to explain the above process of determining the current emotion recognition result of the user, and other cases are not enumerated here.
And 860, if the first emotion data belongs to the positive type and the second emotion data belongs to the negative type, determining the second emotion data as the current emotion recognition result of the user.
The CNN model result is taken as the final result only when the SWEM classification result is a positive emotion and the CNN result is a negative emotion, so that the customer's emotion can be soothed in as many cases as possible. Specifically, if the first emotion data belongs to the positive type and the second emotion data belongs to the negative type, the second emotion data is determined as the current emotion recognition result of the user. For example, if the first emotion data is calm and the second emotion data is angry, the current emotion recognition result of the user is angry.
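The screening rule of this embodiment can be written directly as a small function; only the label strings used for the positive and negative types are assumptions.

```python
POSITIVE = {"calm", "happy", "excited"}
NEGATIVE = {"angry", "nervous", "fearful"}

def select_emotion(first_emotion: str, second_emotion: str) -> str:
    """Return the current emotion recognition result of the user."""
    if first_emotion in POSITIVE and second_emotion in NEGATIVE:
        return second_emotion      # only then is the CNN (word-level) result taken as final
    return first_emotion           # otherwise the SWEM (sentence-level) result is used
```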
In this embodiment, it is judged whether the first emotion data belongs to the positive type and whether the second emotion data belongs to the negative type; if not, the first emotion data is determined as the current emotion recognition result of the user; and if the first emotion data belongs to the positive type and the second emotion data belongs to the negative type, the second emotion data is determined as the current emotion recognition result of the user. Analysing and judging the two pieces of emotion data makes the obtained current emotion recognition result of the user more accurate; and determining the second emotion data as the current emotion recognition result when the first emotion data belongs to the positive type and the second emotion data belongs to the negative type means that the customer's emotion can be soothed in as many cases as possible, further improving the intelligence and flexibility of the customer service question answering.
In one embodiment, if negative emotions such as long-term dissatisfaction of a certain user are detected, early warning information can be sent to human customer service, so that a dedicated human agent intervenes in the related service and genuinely solves the customer's problem. The service dialogue is also recorded as a problem that is difficult for the intelligent customer service to handle and is submitted to an expert for analysis, so that the customer service question-answering process can be optimised. Further, if the emotion recognition result of the user obtained according to the preset emotion recognition model is of the nervous or fearful type, the early warning information may be sent directly to human customer service, since the user may currently be in a dangerous state; other early warning methods may also be used, which is not specifically limited in this embodiment. In addition, after each use of the intelligent customer service, the user can be asked to score the intelligent customer service; for low scores, the related conversation is recorded and analysed by an expert to check whether an abnormal situation exists, and the intelligent customer service is upgraded and improved accordingly.
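A sketch of the escalation behaviour described in this embodiment; the thresholds, the emotion sets and the notification hook are all assumptions rather than values given in the disclosure.

```python
from collections import defaultdict

DANGER_EMOTIONS = {"nervous", "fearful"}
NEGATIVE_EMOTIONS = {"angry", "nervous", "fearful"}
NEGATIVE_LIMIT = 3                                  # assumed threshold for "long-term" dissatisfaction

negative_counts: dict[str, int] = defaultdict(int)

def maybe_escalate(user_id: str, emotion: str, notify_human) -> bool:
    """Send an early warning to human customer service when escalation is warranted."""
    if emotion in DANGER_EMOTIONS:                  # user may currently be in a dangerous state
        notify_human(user_id, reason="danger")
        return True
    if emotion in NEGATIVE_EMOTIONS:
        negative_counts[user_id] += 1
        if negative_counts[user_id] >= NEGATIVE_LIMIT:   # long-term dissatisfaction detected
            notify_human(user_id, reason="persistent_negative")
            return True
    return False
```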
In this embodiment, the intelligent customer service is assisted by human customer service, which improves the user experience. When the user may be in danger, an early warning operation can be carried out, further improving the degree of intelligence of the customer service question answering. In addition, the intelligent customer service can be continuously upgraded and improved, improving the accuracy and reliability of the customer service question answering.
In one embodiment, as shown in fig. 9, which shows a flowchart of a customer service question and answer provided by an embodiment of the present application, the method may include the following steps:
and step 901, carrying out voice input on the problems of the user.
Audio data of a question input by a user is acquired.
And step 902, performing voice recognition and converting the voice recognition into characters.
And carrying out voice recognition on the audio data to obtain text data.
And step 903, detecting the emotion of the user.
And inputting the questions input by the user on the question and answer interface into a preset emotion recognition model for emotion recognition, and generating an emotion recognition result of the user.
Step 904, question answer matching.
And acquiring reply data matched with the questions input by the user on the question-answering interface.
And step 905, abnormal emotion early warning.
If the emotion recognition result of the user obtained according to the preset emotion recognition model is of the nervous or fearful type, early warning information can be sent directly to human customer service, since the user may currently be in a dangerous state.
And step 906, judging whether the abnormality occurs for multiple times.
And detecting whether a certain user expresses discontented negative emotions for a long time.
And step 907, if yes, human customer service intervention is carried out.
Early warning information is sent to human customer service so that a dedicated human agent intervenes in the related service and genuinely solves the customer's problem.
Step 908, answer processing.
Generating the response data based on the emotional response data and the question response data.
And step 909, broadcasting by the digital person.
Inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person; synthesizing the expression of the digital person and the facial animation of the digital person to generate the animation of the digital person, and displaying the animation of the digital person on the question-answering interface; and the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
And step 910, satisfaction degree collection.
After each use of the intelligent customer service, the user can be asked to score the intelligent customer service; for low scores, the related conversations are recorded and analysed by experts to check whether abnormal situations exist, and the intelligent customer service is upgraded and improved.
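The end-to-end flow of steps 901 to 910 can be summarised in the sketch below. Every helper called here (speech_to_text, recognize_emotion, match_reply, render_digital_person, collect_satisfaction and so on) is a hypothetical placeholder for the corresponding step, not a real API.

```python
def handle_customer_question(audio_in, user_id, notify_human, helpers):
    text = helpers.speech_to_text(audio_in)                              # steps 901-902
    emotion = helpers.recognize_emotion(text)                            # step 903
    emotion_reply, question_reply = helpers.match_reply(text, emotion)   # step 904
    if helpers.maybe_escalate(user_id, emotion, notify_human):           # steps 905-907
        return helpers.handoff_to_human(user_id, text)
    reply = helpers.compose_reply(emotion_reply, question_reply)         # step 908
    animation = helpers.render_digital_person(reply)                     # step 909: expression + facial animation
    helpers.collect_satisfaction(user_id)                                # step 910
    return animation
```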
Compared with the prior art, this embodiment does not rely on replying to the user's questions through a traditional dialog box; instead, a digital person is adopted to broadcast by voice the reply data matched with the user's question, and the reply data is presented through the animation of the digital person, so that more intelligent and vivid interaction with the user can be realized, the sense of immersion of an offline face-to-face conversation is achieved, and the convenience and flexibility of handling banking business online are improved for the user. Moreover, corresponding expressions are matched to the digital person according to the question input by the user, so that different emotions of the user can be better accommodated, further improving the intelligence and flexibility of the customer service question answering.
It should be understood that, although the steps in the flowcharts of the embodiments described above are displayed sequentially as indicated by the arrows, these steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the embodiments described above may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a customer service question-and-answer apparatus for implementing the customer service question-and-answer method described above. The solution provided by the apparatus is similar to that described for the method, so for the specific limitations in the one or more apparatus embodiments below, reference may be made to the limitations of the customer service question-and-answer method above; they are not repeated here.
In one embodiment, as shown in fig. 10, a customer service question-and-answer apparatus 1000 is provided, comprising: an obtaining module 1002, a first generating module 1004, and a second generating module 1006, wherein:
an obtaining module 1002, configured to obtain reply data matched with a question input by a user on a question-and-answer interface; the reply data includes emotion reply data and question reply data.
A first generating module 1004, configured to input the reply data into a preset animation model for processing, so as to generate the expression and facial animation of the digital person.
A second generating module 1006, configured to synthesize the expression of the digital person and the facial animation of the digital person, generate the animation of the digital person, and display the animation of the digital person on the question-answering interface; the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
In one embodiment, the facial animation includes a plurality of sets of mouth image frames. The first generating module 1004 is specifically configured to: match the expression of the digital person corresponding to the emotion reply data from a preset expression library, in which the corresponding relation between emotion reply data and expressions of the digital person is stored in advance; input the question reply data into a preset animation model for processing to generate a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data, each mouth bone point pattern comprising a plurality of mouth bone points at target positions; and generate a plurality of mouth image frames of the digital person from the plurality of groups of mouth bone point patterns of the digital person.
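As a rough illustration of this module (not the patented implementation), the expression lookup against a preset expression library and the conversion of bone-point patterns into mouth image frames might be organized as follows; EXPRESSION_LIBRARY, render_mouth_frame, and the emotion keys are hypothetical names.

```python
# Sketch of the first generating module: a preset expression library maps
# emotion reply data to an expression of the digital person, and each group
# of mouth bone points at target positions is rendered into one image frame.
EXPRESSION_LIBRARY = {
    "comfort": "expression_soft_smile",
    "apology": "expression_concerned",
    "neutral": "expression_neutral",
}


def match_expression(emotion_reply_key):
    """Look up the digital person's expression for the given emotion reply data."""
    return EXPRESSION_LIBRARY.get(emotion_reply_key, "expression_neutral")


def render_mouth_frame(bone_points):
    """Stand-in renderer: one group of (x, y) mouth bone points -> one image frame."""
    # A real system would deform a face mesh or rig; here the points are kept as-is.
    return {"points": list(bone_points)}


def mouth_frames_from_patterns(bone_point_patterns):
    """Turn every group of mouth bone points into a mouth image frame."""
    return [render_mouth_frame(points) for points in bone_point_patterns]


frames = mouth_frames_from_patterns([[(0.0, 0.0), (0.5, -0.2)], [(0.0, 0.0), (0.5, -0.4)]])
```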
In one embodiment, if the question reply data is text data, the first generating module 1004 is further configured to: perform format conversion on the question reply data to generate audio data corresponding to the question reply data; divide the audio data sequentially into a plurality of audio data units according to a preset audio segmentation rule; for each audio data unit, determine the target positions of a plurality of mouth bone points of the digital person at the moment the audio data unit is uttered; and generate the plurality of sets of mouth bone point patterns based on the plurality of mouth bone points at the target positions.
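The text-data branch described above could, under simplifying assumptions, look like the sketch below; text_to_speech, the fixed-length segmentation rule, and predict_mouth_points are placeholders for whatever TTS engine and preset animation model are actually used.

```python
# Sketch of the text-data branch: question reply text -> audio, audio -> units
# under a (here fixed-length) segmentation rule, and one group of target mouth
# bone point positions per unit. Both models below are trivial placeholders.

def text_to_speech(text):
    """Placeholder format conversion: returns silent PCM-like samples."""
    return [0.0] * (len(text) * 160)


def split_audio(samples, unit_length=1600):
    """Assumed preset segmentation rule: cut the audio into fixed-length units."""
    return [samples[i:i + unit_length] for i in range(0, len(samples), unit_length)]


def predict_mouth_points(audio_unit):
    """Placeholder animation model: one audio unit -> mouth bone points at target positions."""
    openness = min(1.0, sum(abs(s) for s in audio_unit) / (len(audio_unit) or 1))
    return [(0.0, 0.0), (1.0, 0.0), (0.5, -openness), (0.5, openness)]


def bone_point_patterns_for_text(reply_text):
    """Text reply data -> plural groups of mouth bone point patterns, one per audio unit."""
    samples = text_to_speech(reply_text)
    return [predict_mouth_points(unit) for unit in split_audio(samples)]


patterns = bone_point_patterns_for_text("Your card has been activated.")
```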
In one embodiment, the second generating module 1006 is specifically configured to: arrange the mouth image frames of the digital person corresponding to the audio data units according to the order of the time points at which the audio data units occur in the audio data, generating the mouth animation of the digital person corresponding to the question reply data; and synthesize the expression of the digital person and the mouth animation of the digital person to generate the animation of the digital person.
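For illustration, ordering the mouth image frames by the time points of their audio data units and combining them with the chosen expression could be expressed as the following sketch; the data structures are assumptions, not the patented format.

```python
# Sketch of the second generating module: order the mouth image frames by the
# time point at which their audio data unit occurs, then combine the ordered
# frames with the chosen expression and the audio into one animation record.

def build_mouth_animation(frames_with_times):
    """frames_with_times: iterable of (start_time_seconds, frame) pairs."""
    return [frame for _, frame in sorted(frames_with_times, key=lambda p: p[0])]


def synthesize_animation(expression, mouth_frames, audio):
    """Combine the expression, the ordered mouth frames and the audio for playback."""
    return {"expression": expression, "mouth_frames": mouth_frames, "audio": audio}


# Usage with dummy frames arriving out of order.
animation = synthesize_animation(
    expression="expression_soft_smile",
    mouth_frames=build_mouth_animation([(0.2, "frame_b"), (0.0, "frame_a")]),
    audio=[0.0] * 1600,
)
```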
In one embodiment, the obtaining module 1002 is specifically configured to: input the question input by the user on the question-answering interface into a preset emotion recognition model for emotion recognition to generate the emotion recognition result of the user, and determine the emotion reply data corresponding to the emotion recognition result; generate the question reply data according to the emotion reply data and the question input by the user on the question-answering interface; and generate the reply data based on the emotion reply data and the question reply data.
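A minimal sketch of this acquisition flow is given below, assuming the emotion recognition model and the question-answering backend are injected as callables; the EMOTION_REPLIES table and all names are illustrative.

```python
# Sketch of the acquisition step: recognise the user's emotion, choose emotion
# reply data, generate the question reply with that tone, and bundle both.
# The two injected callables stand in for the preset models of the method.
EMOTION_REPLIES = {
    "negative": "I'm sorry for the trouble. Let me sort this out for you.",
    "positive": "Glad to help!",
}


def build_reply_data(question, recognize_emotion, answer_question):
    emotion = recognize_emotion(question)                      # emotion recognition result
    emotion_reply = EMOTION_REPLIES.get(emotion, "")           # emotion reply data
    question_reply = answer_question(question, emotion_reply)  # question reply data
    return {"emotion_reply": emotion_reply, "question_reply": question_reply}


# Usage with trivial stand-ins for the emotion recognition and answering models.
reply = build_reply_data(
    "My transfer failed, what should I do?",
    recognize_emotion=lambda q: "negative",
    answer_question=lambda q, prefix: prefix + " Please check the payee account and retry.",
)
```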
In one embodiment, the obtaining module 1002 is further configured to: input the question input by the user on the question-answering interface into a sentence-level semantic feature extraction model and a word-level semantic feature extraction model respectively for feature extraction, generating sentence-level semantic features and word-level semantic features of the question; generate first emotion data of the user based on the sentence-level semantic features and second emotion data of the user based on the word-level semantic features; and determine the emotion recognition result of the user from the first emotion data and the second emotion data by using a preset screening rule.
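As an illustration, the two-granularity recognition could be organized as below, with trivial stand-ins for the sentence-level and word-level semantic feature extraction models; the preset screening rule applied to the resulting pair is sketched a little further below.

```python
# Sketch of the two-granularity recognition: trivial stand-ins for the
# sentence-level and word-level semantic feature extraction models each yield
# an emotion label (first and second emotion data) for the same question.

def sentence_level_emotion(question):
    """Placeholder for the sentence-level semantic feature extraction model."""
    return "negative" if "not working" in question.lower() else "positive"


def word_level_emotion(question):
    """Placeholder for the word-level semantic feature extraction model."""
    negative_words = {"angry", "terrible", "scam", "complaint"}
    return "negative" if negative_words & set(question.lower().split()) else "positive"


def emotion_candidates(question):
    first = sentence_level_emotion(question)   # first emotion data
    second = word_level_emotion(question)      # second emotion data
    return first, second
```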
In one embodiment, the obtaining module 1002 is further configured to judge whether the first emotion data belongs to the positive type and the second emotion data belongs to the negative type; if not, the first emotion data is determined as the current emotion recognition result of the user.
In one embodiment, the obtaining module 1002 is further configured to determine the second emotion data as the current emotion recognition result of the user if the first emotion data belongs to the positive type and the second emotion data belongs to the negative type.
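Correspondingly, the preset screening rule described in the two embodiments above might reduce to the following small function; the label names "positive" and "negative" are assumptions about how the emotion types are encoded.

```python
# Sketch of the preset screening rule: the second (word-level) result is used
# only when it flags a negative emotion that the first (sentence-level) result
# classified as positive; otherwise the first result is kept.

def screen_emotions(first, second):
    if first == "positive" and second == "negative":
        return second
    return first


assert screen_emotions("positive", "negative") == "negative"
assert screen_emotions("positive", "positive") == "positive"
assert screen_emotions("negative", "negative") == "negative"
```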
Each module in the above customer service question-and-answer apparatus may be implemented wholly or partly by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:
acquiring reply data matched with a question input by a user on a question-answering interface; the reply data comprises emotion reply data and question reply data; inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person; synthesizing the expression of the digital person and the facial animation of the digital person to generate the animation of the digital person, and displaying the animation of the digital person on a question-answering interface; the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
In one embodiment, the facial animation includes a plurality of sets of mouth image frames;
the processor, when executing the computer program, further performs the steps of:
matching the expressions of the digital people corresponding to the emotion reply data from a preset expression library; the corresponding relation between the emotion reply data and the expressions of the digital people is stored in a preset expression library in advance; inputting the question reply data into a preset animation model for processing to generate a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data; the mouth bone point graph comprises a plurality of mouth bone points at target positions; a plurality of mouth image frames of the digital person are generated from the plurality of sets of mouth bone point patterns of the digital person.
In one embodiment, if the question reply data is text data, the processor, when executing the computer program, further performs the steps of:
carrying out format conversion on the question reply data to generate audio data corresponding to the question reply data; adopting a preset audio segmentation rule to sequentially divide audio data into a plurality of audio data units; for each audio data unit, determining target positions of a plurality of mouth skeleton points of the digital person when the audio data unit is sent out; based on the plurality of mouth bone points at the target location, a plurality of sets of mouth bone point patterns are generated.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
arranging a plurality of mouth image frames of the digital person corresponding to the audio data units according to the sequence of the appearance time points of the audio data units in the audio data to generate mouth animation of the digital person corresponding to the question reply data; and synthesizing the expression of the digital person and the mouth animation of the digital person to generate the animation of the digital person.
In one embodiment, the processor when executing the computer program further performs the steps of:
inputting questions input by a user on a question-answering interface into a preset emotion recognition model for emotion recognition, generating an emotion recognition result of the user, and determining emotion reply data corresponding to the emotion recognition result of the user; generating question reply data according to the emotion reply data and the questions input by the user on the question-answering interface; reply data is generated based on the emotion reply data and the question reply data.
In one embodiment, the processor when executing the computer program further performs the steps of:
inputting questions input by a user on a question-answering interface into a sentence-level semantic feature extraction model and a word-level semantic feature extraction model respectively for feature extraction, and generating sentence-level semantic features and word-level semantic features of the questions; generating first emotion data of the user based on the sentence-level semantic features, and generating second emotion data of the user based on the word-level semantic features; and determining the emotion recognition result of the user from the first emotion data and the second emotion data by adopting a preset screening rule.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
judging whether the first emotion data belongs to a positive type and the second emotion data belongs to a negative type; if not, determining the first emotion data as the current emotion recognition result of the user.
In one embodiment, the processor, when executing the computer program, further performs the steps of:
and if the first emotion data belong to the positive type and the second emotion data belong to the negative type, determining the second emotion data as the current emotion recognition result of the user.
The implementation principle and technical effect of the computer device provided in the embodiment of the present application are similar to those of the method embodiment described above, and are not described herein again.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:
acquiring reply data matched with a question input by a user on a question-answering interface; the reply data comprises emotion reply data and question reply data; inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person; synthesizing the expression of the digital person and the facial animation of the digital person to generate the animation of the digital person, and displaying the animation of the digital person on a question-answering interface; the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
In one embodiment, if the question reply data is text data, the computer program, when executed by the processor, further performs the steps of:
carrying out format conversion on the question reply data to generate audio data corresponding to the question reply data; adopting a preset audio segmentation rule to sequentially divide audio data into a plurality of audio data units; for each audio data unit, determining target positions of a plurality of mouth skeleton points of the digital person when the audio data unit is sent out; based on the plurality of mouth bone points at the target location, a plurality of sets of mouth bone point patterns are generated.
In one embodiment, the computer program when executed by the processor further performs the steps of:
arranging a plurality of mouth image frames of the digital person corresponding to the audio data units according to the sequence of the appearance time points of the audio data units in the audio data to generate mouth animation of the digital person corresponding to the question reply data; and synthesizing the expression of the digital person and the mouth animation of the digital person to generate the animation of the digital person.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting questions input by a user on a question-answering interface into a preset emotion recognition model for emotion recognition, generating an emotion recognition result of the user, and determining emotion reply data corresponding to the emotion recognition result of the user; generating question reply data according to the emotion reply data and the questions input by the user on the question-answering interface; reply data is generated based on the emotion reply data and the question reply data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting questions input by a user on a question-answering interface into a sentence-level semantic feature extraction model and a word-level semantic feature extraction model respectively for feature extraction, and generating sentence-level semantic features and word-level semantic features of the questions; generating first emotion data of the user based on the sentence-level semantic features, and generating second emotion data of the user based on the word-level semantic features; and determining the emotion recognition result of the user from the first emotion data and the second emotion data by adopting a preset screening rule.
In one embodiment, the computer program when executed by the processor further performs the steps of:
judging whether the first emotion data belongs to a positive type and the second emotion data belongs to a negative type; if not, determining the first emotion data as the current emotion recognition result of the user.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and if the first emotion data belong to the positive type and the second emotion data belong to the negative type, determining the second emotion data as the current emotion recognition result of the user.
The implementation principle and technical effect of the computer-readable storage medium provided by this embodiment are similar to those of the above-described method embodiment, and are not described herein again.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of:
acquiring reply data matched with a question input by a user on a question-answering interface; the reply data comprises emotion reply data and question reply data; inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person; synthesizing the expression of the digital person and the facial animation of the digital person to generate the animation of the digital person, and displaying the animation of the digital person on a question-answering interface; the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
In one embodiment, if the question reply data is text data, the computer program, when executed by the processor, further performs the steps of:
carrying out format conversion on the question reply data to generate audio data corresponding to the question reply data; adopting a preset audio segmentation rule to sequentially divide audio data into a plurality of audio data units; for each audio data unit, determining target positions of a plurality of mouth skeleton points of the digital person when the audio data unit is sent out; based on the plurality of mouth bone points at the target location, a plurality of sets of mouth bone point patterns are generated.
In one embodiment, the computer program when executed by the processor further performs the steps of:
arranging a plurality of mouth image frames of the digital person corresponding to the audio data units according to the sequence of the appearance time points of the audio data units in the audio data to generate mouth animation of the digital person corresponding to the question reply data; and synthesizing the expression of the digital person and the mouth animation of the digital person to generate the animation of the digital person.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting questions input by a user on a question-answering interface into a preset emotion recognition model for emotion recognition, generating an emotion recognition result of the user, and determining emotion reply data corresponding to the emotion recognition result of the user; generating question reply data according to the emotion reply data and the questions input by the user on the question-answering interface; reply data is generated based on the emotion reply data and the question reply data.
In one embodiment, the computer program when executed by the processor further performs the steps of:
inputting questions input by a user on a question-answering interface into a sentence-level semantic feature extraction model and a word-level semantic feature extraction model respectively for feature extraction, and generating sentence-level semantic features and word-level semantic features of the questions; generating first emotion data of the user based on the sentence-level semantic features, and generating second emotion data of the user based on the word-level semantic features; and determining the emotion recognition result of the user from the first emotion data and the second emotion data by adopting a preset screening rule.
In one embodiment, the computer program when executed by the processor further performs the steps of:
judging whether the first emotion data belongs to a positive type and the second emotion data belongs to a negative type; if not, determining the first emotion data as the current emotion recognition result of the user.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and if the first emotion data belong to the positive type and the second emotion data belong to the negative type, determining the second emotion data as the current emotion recognition result of the user.
The computer program product provided in this embodiment has similar implementation principles and technical effects to those of the method embodiments described above, and is not described herein again.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include a Read-Only Memory (ROM), a magnetic tape, a floppy disk, a flash memory, an optical memory, a high-density embedded non-volatile memory, a Resistive Random Access Memory (ReRAM), a Magnetoresistive Random Access Memory (MRAM), a Ferroelectric Random Access Memory (FRAM), a Phase Change Memory (PCM), a graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like, without limitation.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present application, and although their descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (12)

1. A customer service question-answering method is characterized by comprising the following steps:
acquiring reply data matched with a question input by a user on a question-answering interface; the reply data comprises emotion reply data and question reply data;
inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person;
synthesizing the expression of the digital person and the facial animation of the digital person to generate the animation of the digital person, and displaying the animation of the digital person on the question-answering interface; and the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
2. The method of claim 1, wherein the facial animation comprises a plurality of sets of mouth image frames; inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person, wherein the processing comprises the following steps:
matching the expressions of the digital people corresponding to the emotion reply data from a preset expression library; the preset expression library stores the corresponding relation between the emotion reply data and the expressions of the digital people in advance;
inputting the question reply data into a preset animation model for processing, and generating a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data; the mouth bone point graph comprises a plurality of mouth bone points at target positions;
generating a plurality of mouth image frames of the digital person from the plurality of groups of mouth bone point patterns of the digital person.
3. The method according to claim 2, wherein if the question reply data is text data, the inputting the question reply data into a preset animation model for processing to generate a plurality of groups of mouth bone point patterns of the digital person corresponding to the question reply data comprises:
carrying out format conversion on the question reply data to generate audio data corresponding to the question reply data;
adopting a preset audio segmentation rule to sequentially divide the audio data into a plurality of audio data units;
for each of the audio data units, determining target locations of a plurality of skeletal points of the mouth of the digital person at the time of the emission of the audio data unit;
generating the plurality of sets of mouth bone point patterns based on a plurality of mouth bone points at the target location.
4. The method of claim 3, wherein the synthesizing the expression of the digital person with the facial animation of the digital person to generate the animation of the digital person comprises:
arranging a plurality of mouth image frames of the digital person corresponding to the audio data units according to the sequence of the appearance time points of the audio data units in the audio data, and generating mouth animation of the digital person corresponding to the question reply data;
and synthesizing the expression of the digital person and the mouth animation of the digital person to generate the animation of the digital person.
5. The method of claim 1, wherein the obtaining reply data matching a question entered by a user on a question and answer interface comprises:
inputting the questions input by the user on a question-answering interface into a preset emotion recognition model for emotion recognition, generating an emotion recognition result of the user, and determining emotion reply data corresponding to the emotion recognition result of the user;
generating question reply data according to the emotion reply data and the questions input by the user on a question-answering interface;
generating the response data based on the emotional response data and the question response data.
6. The method according to claim 5, wherein the inputting the question input by the user on the question-answering interface into a preset emotion recognition model for emotion recognition, and generating the emotion recognition result of the user comprises:
inputting the questions input by the user on a question-answering interface into a sentence-level semantic feature extraction model and a word-level semantic feature extraction model respectively for feature extraction, and generating sentence-level semantic features and word-level semantic features of the questions;
generating first emotion data of the user based on the sentence-level semantic features, and generating second emotion data of the user based on the word-level semantic features;
and determining the emotion recognition result of the user from the first emotion data and the second emotion data by adopting a preset screening rule.
7. The method of claim 6, wherein the determining the current emotion recognition result of the user from the first emotion data and the second emotion data by using a preset screening rule comprises:
judging whether the first emotion data belongs to a positive type and the second emotion data belongs to a negative type;
if not, determining the first emotion data as the current emotion recognition result of the user.
8. The method of claim 7, further comprising:
and if the first emotion data belong to a positive type and the second emotion data belong to a negative type, determining the second emotion data as a current emotion recognition result of the user.
9. A customer service question answering apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring reply data matched with the questions input by the user on the question-answering interface; the reply data comprises emotion reply data and question reply data;
the first generation module is used for inputting the reply data into a preset animation model for processing to generate the expression and facial animation of the digital person;
the second generation module is used for synthesizing the expression of the digital person and the facial animation of the digital person, generating the animation of the digital person and displaying the animation of the digital person on the question-answering interface; and the animation of the digital person is used for displaying the expression corresponding to the expression data and carrying out voice broadcasting on the question reply data.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 8.
12. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 8 when executed by a processor.
CN202210323198.1A 2022-03-30 2022-03-30 Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product Pending CN114694224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210323198.1A CN114694224A (en) 2022-03-30 2022-03-30 Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210323198.1A CN114694224A (en) 2022-03-30 2022-03-30 Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product

Publications (1)

Publication Number Publication Date
CN114694224A true CN114694224A (en) 2022-07-01

Family

ID=82141842

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210323198.1A Pending CN114694224A (en) 2022-03-30 2022-03-30 Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product

Country Status (1)

Country Link
CN (1) CN114694224A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563262A (en) * 2022-11-10 2023-01-03 深圳市人马互动科技有限公司 Processing method and related device for dialogue data in machine voice call-out scene
CN115563262B (en) * 2022-11-10 2023-03-24 深圳市人马互动科技有限公司 Processing method and related device for dialogue data in machine voice call-out scene
CN116741143A (en) * 2023-08-14 2023-09-12 深圳市加推科技有限公司 Digital-body-based personalized AI business card interaction method and related components
CN116741143B (en) * 2023-08-14 2023-10-31 深圳市加推科技有限公司 Digital-body-based personalized AI business card interaction method and related components

Similar Documents

Publication Publication Date Title
CN110647636B (en) Interaction method, interaction device, terminal equipment and storage medium
US10504268B1 (en) Systems and methods for generating facial expressions in a user interface
CN109446430B (en) Product recommendation method and device, computer equipment and readable storage medium
Wen et al. Dynamic interactive multiview memory network for emotion recognition in conversation
CN112395979B (en) Image-based health state identification method, device, equipment and storage medium
CN114694224A (en) Customer service question and answer method, customer service question and answer device, customer service question and answer equipment, storage medium and computer program product
EP3743925A1 (en) Interactive systems and methods
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CA3109186A1 (en) Short answer grade prediction
Sadoughi et al. Meaningful head movements driven by emotional synthetic speech
CN110705490A (en) Visual emotion recognition method
CN111222854B (en) Interview robot-based interview method, interview device, interview equipment and storage medium
Pande et al. Development and deployment of a generative model-based framework for text to photorealistic image generation
Yang et al. Fast RF-UIC: A fast unsupervised image captioning model
Agrawal et al. Multimodal personality recognition using cross-attention transformer and behaviour encoding
Papaioannou et al. Mimicme: A large scale diverse 4d database for facial expression analysis
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN111862279A (en) Interaction processing method and device
CN116580704A (en) Training method of voice recognition model, voice recognition method, equipment and medium
CN115690276A (en) Video generation method and device of virtual image, computer equipment and storage medium
CN110689052B (en) Session message processing method, device, computer equipment and storage medium
CN113688222A (en) Insurance sales task conversational recommendation method, system and equipment based on context semantic understanding
CN112307269A (en) Intelligent analysis system and method for human-object relationship in novel
Chen et al. Texture deformation based generative adversarial networks for multi-domain face editing
US12050632B2 (en) Question answering apparatus and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination