CN110196930A

CN110196930A - Multi-modal customer service automatic reply method and system

Info

Publication number: CN110196930A
Application number: CN201910430832.XA
Authority: CN
Inventors: 聂礼强; 王文杰; 王英龙; 姚一杨; 张化祥; 宋雪萌
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2019-05-22
Filing date: 2019-05-22
Publication date: 2019-09-03
Anticipated expiration: 2039-05-22
Also published as: CN110196930B

Abstract

The invention discloses a multi-modal customer service automatic reply method and system. The method comprises the following steps: receiving utterances and encoding them to obtain a context vector; determining the corresponding intent category from the context vector using a pre-trained intent category recognition model; determining the reply category corresponding to the intent according to a set rule; and, according to the reply category, generating the corresponding reply with a pre-trained reply model that takes the context vector as input. The invention can automatically identify a user's intent from the user's utterances and adaptively generate replies in a variety of forms.

Description

A multi-modal customer service automatic reply method and system

Technical field

The invention belongs to the field of artificial intelligence, and in particular relates to a multi-modal customer service automatic reply method and system.

Background art

The statements in this section merely provide background information related to the present disclosure and do not necessarily constitute prior art.

Multi-modal dialogue systems are built on text-based dialogue systems and have recently received increasing attention in various fields, especially retail. Although existing task-oriented multi-modal dialogue systems have shown promising performance, they still have the following problems:

A chatbot's replies convey different kinds of information, such as product display, shopping guidance and everyday greetings, in different media formats, usually as text alone or as a combination of text and images. Existing methods treat text generation and image selection in a multi-modal dialogue system as two independent tasks, and assemble text and images into a reply by manual selection;

The image selection task is essentially a product recommendation problem. A recommendation model ranks products according to the preferences the user conveys in the context and returns the top-ranked products. Existing methods consider only visual images during selection and completely ignore the rich attribute information associated with products, such as price, material, size and style;

Dialogue between a buyer and a chatbot usually involves many kinds of knowledge, including style matching, product attributes and the popularity of products among celebrities. Nevertheless, only one existing multi-modal dialogue method considers style matching, and the other methods use no knowledge of any kind.

Summary of the invention

To overcome the above deficiencies of the prior art, the present invention provides a multi-modal customer service automatic reply method and system. A context encoder first embeds the multi-modal context into a context vector; the classification model of an intent understanding module then classifies the user's intent, and replies of different forms are generated for different user intents.

To achieve the above object, one or more embodiments of the invention provide the following technical solutions:

A multi-modal customer service automatic reply method, comprising the following steps:

receiving utterances and encoding them to obtain a context vector;

determining the corresponding intent category from the context vector using a pre-trained intent category recognition model;

determining the reply category corresponding to the intent according to a set rule;

according to the reply category, generating the corresponding reply with a pre-trained reply model, using the context vector as input.

One or more embodiments provide a multi-modal customer service automatic reply system, comprising:

a context encoder, which receives utterances and encodes them to obtain a context vector;

an intent category recognition module, which determines the corresponding intent category from the context vector using a pre-trained intent category recognition model;

a reply category determination module, which determines the reply category corresponding to the intent according to a set rule;

a reply generation module, which, according to the reply category, generates the corresponding reply with a pre-trained reply model, using the context vector as input.

One or more embodiments provide an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the multi-modal customer service automatic reply method when executing the program.

One or more embodiments provide a computer-readable storage medium on which a computer program is stored, wherein the program implements the multi-modal customer service automatic reply method when executed by a processor.

The one or more technical solutions above have the following beneficial effects:

The invention takes full account of the user's intent: it performs intent recognition on the utterance entered by the user, that is, it determines what type of utterance it is (for example a greeting, a statement of product requirements to be met, or a decision to purchase), and generates replies in multiple forms based on the intent, so that the reply satisfies the user's intent as far as possible.

Brief description of the drawings

The accompanying drawings, which form a part of the invention, are provided for further understanding of the invention. The exemplary embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it.

Fig. 1 is the overall flowchart of the multi-modal customer service automatic reply method in one or more embodiments of the invention;

Fig. 2 is the framework diagram of the multi-modal customer service automatic reply system in one or more embodiments of the invention;

Fig. 3 is a schematic diagram of the context encoder model in one or more embodiments of the invention;

Fig. 4 is a schematic diagram of the knowledge-aware decoder model in one or more embodiments of the invention;

Fig. 5 is a schematic diagram of the recommendation model in one or more embodiments of the invention.

Detailed description of the embodiments

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs.

It should be noted that the terminology used herein is for describing specific embodiments only and is not intended to limit the exemplary embodiments of the invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

Where no conflict arises, the embodiments of the invention and the features in those embodiments may be combined with one another.

General idea of the invention:

Given a multi-modal context, the system first understands the intent in order to determine the type of reply and the corresponding media format. It then generates the required reply with adaptive decoders: a simple recurrent neural network generates general replies; a knowledge-aware recurrent neural network decoder incorporates domain knowledge of various forms to generate more informative replies; and the multi-modal reply decoder includes an image recommendation model that recommends products to the user by jointly considering text attributes and visual images.

Embodiment one

This embodiment discloses a multi-modal dialogue system with adaptive decoders. Its operation comprises the following steps:

Step 1: receive the dialogue and encode it to obtain a context vector;

The encoding uses a context encoder. The context encoder comprises, at the low level (the word level), a recurrent neural network and a residual network enhanced with soft visual attention, and, at the high level (the sentence level), a recurrent neural network.

Specifically, at the low level, the word-level recurrent neural network encodes the input text utterance word by word, and its final hidden state, which embeds the information of the whole utterance, is taken as the representation of the input text utterance. Note that an utterance may be plain text or multi-modal. As for visual feature extraction, because users attend differently to different image regions, the visual features of a product are extracted from its image utterance by a residual network enhanced with soft visual attention.

At the high level, if an input utterance contains only text, the high-level recurrent neural network takes only the text feature as input. For multi-modal utterances, the text feature, or the concatenation of the text and visual features, of each utterance is fed into the sentence-level recurrent neural network, and the final hidden state is output as the context vector. If an utterance contains several images, these images are unfolded into a sequence of visual utterances, each of which is concatenated with the text feature and fed one by one into the high-level recurrent neural network. From the high-level perspective, therefore, the recurrent neural network processes the utterances iteratively, progressively characterizing the user-related information in the dialogue, and outputs the final hidden state as the context vector. The context vector then serves as the multi-modal context representation that is input to the intent understanding module, the recommendation model and the two recurrent neural network decoders.
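The two-level encoding described above can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the source: a plain Elman RNN stands in for the recurrent networks, image features are passed in as precomputed vectors (standing in for the output of the soft-attention residual network), and text-only utterances are zero-padded in the visual slot so that the sentence-level input has a fixed width. The names `TinyRNN` and `encode_context` are hypothetical.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (illustrative initialisation only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyRNN:
    """Plain Elman RNN: h_t = tanh(W_x x_t + W_h h_{t-1})."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.w_x = make_matrix(hidden_dim, input_dim, rng)
        self.w_h = make_matrix(hidden_dim, hidden_dim, rng)

    def step(self, x, h):
        return [
            math.tanh(
                sum(w * xi for w, xi in zip(self.w_x[i], x))
                + sum(w * hj for w, hj in zip(self.w_h[i], h))
            )
            for i in range(self.hidden_dim)
        ]

    def run(self, inputs):
        """Return the final hidden state after consuming all inputs."""
        h = [0.0] * self.hidden_dim
        for x in inputs:
            h = self.step(x, h)
        return h


def encode_context(utterances, word_rnn, sentence_rnn, visual_dim):
    """utterances: list of (word_vectors, image_feature_vectors) pairs."""
    sentence_inputs = []
    for word_vectors, image_features in utterances:
        # Low level: the final word-level hidden state represents the text.
        text_feature = word_rnn.run(word_vectors)
        if not image_features:
            # Text-only utterance: zero-pad the visual slot (assumption).
            sentence_inputs.append(text_feature + [0.0] * visual_dim)
        else:
            # Several images are unfolded into one sentence-level step each.
            for visual in image_features:
                sentence_inputs.append(text_feature + list(visual))
    # High level: the final sentence-level hidden state is the context vector.
    return sentence_rnn.run(sentence_inputs)


rng = random.Random(0)
word_rnn = TinyRNN(input_dim=4, hidden_dim=3, rng=rng)
sentence_rnn = TinyRNN(input_dim=3 + 2, hidden_dim=5, rng=rng)
dialogue = [
    ([[0.1] * 4, [0.2] * 4], []),                  # text-only utterance
    ([[0.3] * 4], [[1.0, 0.0], [0.0, 1.0]]),       # text plus two images
]
context_vector = encode_context(dialogue, word_rnn, sentence_rnn, visual_dim=2)
```

In this sketch the second utterance contributes two sentence-level steps, one per image, which mirrors the unfolding of multi-image utterances described above.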

Step 2: determine the corresponding intent category from the context vector using a multi-layer perceptron network;

Given the context vector, the intent understanding component aims to understand the user's intent and then determine the corresponding decoder for reply generation. Here the system uses a multi-layer perceptron network to predict, from the context vector produced by the context encoder, a probability distribution over 15 intents. The network is optimized with a cross-entropy loss function.

The varied intents that users convey in the multi-modal context are divided into 15 categories: "greeting", "self-introduction", "giving criteria", "expressing a liking for a product and wishing to see more similar products", "disliking a class of products and wishing to see other products", "wishing to view a product from different angles", "showing products similar to a given product", "sorting the results", "filtering products", "asking about style matching", "asking about product attributes", "asking about a product's popularity among celebrities", "viewing again a product seen earlier", "purchase" and "end of dialogue".
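As an illustration of the intent classifier, the sketch below runs a forward pass of a one-hidden-layer perceptron over a context vector, applies a softmax over the 15 intents and evaluates the cross-entropy loss for one labelled example. The layer sizes, the ReLU activation, the random weights and the English intent labels are assumptions made for this example; the source only specifies a multi-layer perceptron trained with cross-entropy.

```python
import math
import random

# Illustrative English labels for the 15 intent categories listed above.
INTENTS = [
    "greeting", "self_introduction", "give_criteria", "like_show_more",
    "dislike_show_others", "show_other_angles", "show_similar", "sort_results",
    "filter_products", "ask_style_matching", "ask_attributes",
    "ask_celebrity_popularity", "show_earlier_product", "purchase",
    "end_dialogue",
]


def linear(weights, bias, x):
    """Affine map producing one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(context_dim, hidden_dim, rng):
    """Random, untrained parameters (for illustration only)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "w1": matrix(hidden_dim, context_dim), "b1": [0.0] * hidden_dim,
        "w2": matrix(len(INTENTS), hidden_dim), "b2": [0.0] * len(INTENTS),
    }


def intent_probabilities(context, params):
    """Hidden layer with ReLU (assumed), then softmax over the 15 intents."""
    hidden = [max(0.0, v) for v in linear(params["w1"], params["b1"], context)]
    return softmax(linear(params["w2"], params["b2"], hidden))


def cross_entropy(probs, target_index):
    """Loss for a single example whose true intent is target_index."""
    return -math.log(probs[target_index])


rng = random.Random(0)
params = init_params(context_dim=8, hidden_dim=16, rng=rng)
probs = intent_probabilities([0.5] * 8, params)
predicted = INTENTS[max(range(len(probs)), key=probs.__getitem__)]
loss = cross_entropy(probs, INTENTS.index("greeting"))
```

With untrained weights the predicted intent is arbitrary; training would adjust the parameters to reduce the cross-entropy loss on labelled dialogues.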

Step 3: determine the reply category corresponding to the intent according to a set rule;

Three types of reply are designed for these 15 intents: general text replies, knowledge-rich text replies, and multi-modal replies combining text and images.

The replies to the 15 intents can be divided into three types: general replies, knowledge-aware replies and multi-modal replies. Multi-modal replies are expressed with text and images; all other replies are text replies. Motivated by this, the invention designs a lookup table containing a number of triples of the form (intent category, reply type, media format). Once the intent category of the multi-modal context is given, the model looks up the predefined table of triples and automatically determines the reply type and its corresponding media format. The model can then select the correct decoder and generate the reply in the appropriate media format.
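The lookup table of (intent category, reply type, media format) triples might be represented as a simple mapping, as in the sketch below. The source does not publish the full table, so the five rows shown are assumptions chosen for illustration, and the names `ReplyPlan`, `LOOKUP_TABLE` and `plan_reply` are hypothetical.

```python
from typing import NamedTuple


class ReplyPlan(NamedTuple):
    reply_type: str     # "general", "knowledge_aware" or "multimodal"
    media_format: str   # "text" or "text+image"


# Illustrative subset of the table; the actual intent-to-type assignments
# are not given in the source and are assumed here.
LOOKUP_TABLE = {
    "greeting": ReplyPlan("general", "text"),
    "ask_style_matching": ReplyPlan("knowledge_aware", "text"),
    "ask_attributes": ReplyPlan("knowledge_aware", "text"),
    "ask_celebrity_popularity": ReplyPlan("knowledge_aware", "text"),
    "show_similar": ReplyPlan("multimodal", "text+image"),
}


def plan_reply(intent):
    """Return the reply type and media format predefined for an intent."""
    return LOOKUP_TABLE[intent]


plan = plan_reply("ask_attributes")
```

Keeping the rule as data rather than code means the mapping can be changed without retraining the intent classifier or the decoders.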

Step 4: according to the reply category, generate the corresponding reply with a pre-trained reply model, using the context vector as input.

The system uses three parallel components (a simple recurrent neural network decoder, a knowledge-aware recurrent neural network decoder and a recommendation model) to generate the three types of reply: general replies, knowledge-aware replies and multi-modal replies. General replies are high-frequency replies that keep the dialogue flowing smoothly and carry no informative content, such as "How can I help you today?" Knowledge-aware replies incorporate domain knowledge of various forms to meet the user's specific needs, for example when answering the question "Does this T-shirt go with these sandals?" Multi-modal replies comprise a polite general reply, visual images of the recommended products, and a knowledge-aware reply introducing the product attributes.

1. General replies are generated with a simple recurrent neural network;

The simple recurrent neural network decoder generates general replies from the context vector; these replies are common and require no domain knowledge. Because generating general replies with the knowledge-aware recurrent neural network decoder could add computational overhead, introduce noise and mislead model optimization, a simple recurrent neural network is introduced to generate them. The hidden state of the simple recurrent neural network is initialized with the context vector and updated iteratively. At each step the model linearly projects the hidden state onto a one-dimensional vector of vocabulary size and outputs the predicted probability distribution over words, where the vocabulary is the ordered list of all words in the dataset. Finally, a cross-entropy error function is used to maximize the predicted probability of the words in the target reply.
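One way to write down the decoder just described is given below. The notation is assumed for illustration and does not appear in the source: c is the context vector, e(w) the embedding of word w, w_0 the start token, W_o and b_o the output projection, |V| the vocabulary size and T the length of the target reply.

```latex
\begin{aligned}
h_0 &= c, \\
h_t &= \mathrm{RNN}\bigl(h_{t-1},\, e(w_{t-1})\bigr), \qquad t = 1, \dots, T, \\
p_t &= \mathrm{softmax}\bigl(W_o h_t + b_o\bigr) \in \mathbb{R}^{|V|}, \\
\mathcal{L} &= -\sum_{t=1}^{T} \log p_t(w_t).
\end{aligned}
```

Minimizing the cross-entropy loss L is equivalent to maximizing the predicted probability of each word w_t of the target reply.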

2. The knowledge-aware recurrent neural network decoder embeds domain knowledge of various forms into knowledge vectors in a high-dimensional space by means of a memory network and a key-value memory network, and then introduces the knowledge vectors into a unified recurrent neural network decoder to generate richer replies.

Considering that buyers tend to state their requirements and gather enough product information before the final purchase, the invention introduces three kinds of domain knowledge: style matching, product attributes, and product popularity among celebrities. More specifically: 1) style matching describes which products go together, for example a tie matches a white shirt; 2) product attributes are organized as key-value tables that record common attributes of a product such as price, brand and material; 3) product popularity among celebrities gives each celebrity's preference distribution over the various products, for example some celebrities prefer black trousers to blue trousers. From the result of intent understanding, the system can determine which kind of domain knowledge to include. Specifically, if the user's intent is to ask about style matching, product attributes or product popularity, the system embeds the corresponding domain knowledge into a knowledge vector and incorporates it into the recurrent neural network decoder.

Specifically, style matching in the retail domain is naturally represented as an undirected graph, in which an edge between two products means that they go together. The graph can therefore be described by pairs of product names: each product name in a pair is first embedded into a vector, and the two vectors are then concatenated to obtain a knowledge vector. All these knowledge vectors are stored in a single-layer memory network, and the knowledge vector needed for the reply is computed from a given query. For product attributes, a key-value memory network is used to obtain the knowledge vector, because attributes are always expressed as key-value pairs; the knowledge vector is again computed from a given query. A celebrity's preference distribution over all products is treated as a knowledge vector, and the knowledge vectors of all celebrities are stored in a memory network; at query time the knowledge vector is obtained in the same way as for style matching.

This embodiment obtains knowledge vectors from the various knowledge bases by using the hidden state of the previous step of the recurrent neural network as the query. Note that the knowledge vector of the first step is obtained by using the dialogue-history context vector as the query, and the first word input to the recurrent neural network decoder is the special token <start>. In this way, paired products are introduced into the recurrent neural network decoder through the memory network.
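The memory read used to obtain a knowledge vector can be sketched as attention over stored entries. The source does not give the scoring function, so dot-product scores followed by a softmax are assumed here, as is common for memory networks; `read_memory` and the toy vectors are hypothetical. For a key-value memory the keys embed attribute names and the values embed attribute values, while a single-layer memory uses the same vectors for both roles.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_memory(query, keys, values):
    """Attend over keys with the query and return the weighted sum of values."""
    weights = softmax([dot(query, key) for key in keys])
    width = len(values[0])
    knowledge = [
        sum(w * value[i] for w, value in zip(weights, values))
        for i in range(width)
    ]
    return knowledge, weights


# Key-value memory for product attributes (toy embeddings).
attribute_keys = [[10.0, 0.0], [0.0, 10.0]]              # e.g. "price", "material"
attribute_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # embedded attribute values

# The query is the decoder's previous hidden state, or the context vector
# at the first decoding step.
query = [1.0, 0.0]
knowledge, weights = read_memory(query, attribute_keys, attribute_values)

# Single-layer memory (style matching or celebrity preferences): the stored
# knowledge vectors act as both keys and values.
pair_vectors = [[1.0, 0.0], [0.0, 1.0]]
pair_knowledge, _ = read_memory(query, pair_vectors, pair_vectors)
```

The returned knowledge vector would then be fed into the decoder step together with the previous word, which is how the knowledge reaches the generated text.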

3. The recommendation model is a neural model optimized with a max-margin loss that jointly considers text attributes and visual images to learn product representations. The recommendation model then ranks candidate products by the similarity between the product representation and the embedding of the dialogue history.

The recommendation model ranks candidate products according to the similarity between the product representation and the dialogue-history context vector. In a manner similar to the context encoder, the attribute information and image information of a product are projected into the same high-dimensional space as the context vector; the similarity is then computed and used to score the products for recommendation. Unlike existing methods that rank product images by considering visual features alone, visual features and auxiliary information are both fused into the recommendation model. In particular, for each product, its attributes are first sorted alphabetically, and each attribute's key-value pair is represented by concatenating the encoding vectors of its key and its value into a single vector. The ordered attribute encoding vectors are then fed step by step into a recurrent neural network, whose final hidden state is taken as the representation of the text attributes; the visual representation is extracted by a pre-trained residual network. Finally, the text and visual representations are concatenated and linearly projected into the same high-dimensional space as the context vector. A max-margin loss is used as the loss function, and the model parameters are optimized by backpropagation to obtain better recommendations.

Note that multi-modal replies are complex: they combine a general reply, a knowledge-aware reply introducing the product, and product images. They therefore integrate the outputs of the simple recurrent neural network decoder, the knowledge-aware recurrent neural network decoder and the recommendation model.
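The ranking and training objective of the recommendation model can be illustrated as follows. The sketch assumes that product representations have already been projected into the context-vector space, uses cosine similarity as the similarity measure and a margin of 1.0; the source names only "similarity" and a max-margin loss, so these choices and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_products(context, product_vectors):
    """Return product ids sorted by similarity to the context, best first."""
    return sorted(
        product_vectors,
        key=lambda pid: cosine(context, product_vectors[pid]),
        reverse=True,
    )


def max_margin_loss(context, positive, negatives, margin=1.0):
    """Hinge loss pushing the positive product above each negative by margin."""
    positive_score = cosine(context, positive)
    return sum(
        max(0.0, margin - positive_score + cosine(context, negative))
        for negative in negatives
    )


context = [1.0, 0.0]
products = {"shirt": [1.0, 0.0], "sandals": [0.0, 1.0], "scarf": [-1.0, 0.0]}
ranking = rank_products(context, products)
loss = max_margin_loss(context, products["shirt"], [products["sandals"]])
```

In training, the positive would be the product actually shown in the reference reply and the negatives would be other candidates; the loss reaches zero once the positive leads every negative by at least the margin.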
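How the three components are combined according to the reply type might look like the sketch below. The decoders are replaced by stub functions that return fixed strings, and the dictionary layout of the reply is an assumption; only the dispatch logic reflects the description above.

```python
def generate_reply(reply_type, context, simple_decoder, knowledge_decoder, recommender):
    """Select or combine component outputs according to the reply type."""
    if reply_type == "general":
        return {"text": [simple_decoder(context)], "images": []}
    if reply_type == "knowledge_aware":
        return {"text": [knowledge_decoder(context)], "images": []}
    if reply_type == "multimodal":
        # A multi-modal reply integrates all three components.
        return {
            "text": [simple_decoder(context), knowledge_decoder(context)],
            "images": recommender(context),
        }
    raise ValueError(f"unknown reply type: {reply_type}")


# Stub components standing in for the trained models (hypothetical outputs).
def stub_simple(context):
    return "Sure, here are some options."


def stub_knowledge(context):
    return "This shirt is made of cotton."


def stub_recommender(context):
    return ["image_001.jpg", "image_002.jpg"]


reply = generate_reply("multimodal", [0.0], stub_simple, stub_knowledge, stub_recommender)
```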

Embodiment two

The purpose of this embodiment is to provide a multi-modal customer service automatic reply system.

To achieve the above purpose, this embodiment provides a multi-modal customer service automatic reply system, comprising:

Embodiment three

The purpose of this embodiment is to provide an electronic device.

To achieve the above purpose, this embodiment provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the following when executing the program:

receiving utterances and encoding them to obtain a context vector;

Embodiment four

The purpose of this embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium on which a computer program is stored, wherein the program performs the following steps when executed by a processor:

receiving utterances and encoding them to obtain a context vector;

The steps involved in Embodiments two, three and four correspond to those of Embodiment one; for specific implementations, see the relevant description of Embodiment one. The term "computer-readable storage medium" should be understood to include a single medium or multiple media containing one or more instruction sets; it should also be understood to include any medium capable of storing, encoding or carrying an instruction set for execution by a processor that causes the processor to perform any of the methods of the invention.

The one or more embodiments above have the following technical effects:

In the reply generation stage, the invention provides several decoders that adaptively generate replies of different forms. These include a simple recurrent neural network for generating simple replies and a knowledge-aware decoder that introduces knowledge about style matching, attributes and popularity to generate content-rich replies. In addition, a product recommender is designed that takes full account of the visual and auxiliary information of products to recommend the products that best match the user's requirements.

The invention fully mines the information in the user dialogue and, through adaptive decoders, adaptively generates replies to the various kinds of content in the dialogue, so that the replies are well targeted and meet user needs.

Those skilled in the art will understand that the modules or steps of the invention described above can be implemented with a general-purpose computing device. Alternatively, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be fabricated as an individual integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. The invention is not limited to any specific combination of hardware and software.

The above are only preferred embodiments of the invention and are not intended to limit it. Those skilled in the art may make various modifications and changes to the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Although the specific embodiments of the invention have been described above with reference to the accompanying drawings, they do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications or variations that can be made on the basis of the technical solutions of the invention without creative effort remain within its scope of protection.

Claims

1. A multi-modal customer service automatic reply method, characterized by comprising the following steps:

receiving utterances and encoding them to obtain a context vector;

according to the reply category, generating the corresponding reply with a pre-trained reply model, using the context vector as input.

2. The multi-modal customer service automatic reply method according to claim 1, characterized in that the encoding uses a context encoder; the context encoder comprises: at the low level, a word-level recurrent neural network and a residual network enhanced with soft visual attention, and, at the high level, a sentence-level recurrent neural network.

3. The multi-modal customer service automatic reply method according to claim 2, characterized in that the utterances comprise only text utterances, or comprise both text and image utterances; encoding the utterances comprises:

at the low level, encoding the input text utterance word by word with the word-level recurrent neural network and taking the final hidden state as the feature representation of the input text utterance, and extracting visual features from the image utterance with the residual network enhanced with soft visual attention; at the high level, if only text utterances are included, inputting the text features into the sentence-level recurrent neural network, the final hidden state being the context vector; if both text and image utterances are included, concatenating the text features with the visual features and inputting them into the sentence-level recurrent neural network, the final hidden state being the context vector.

4. The multi-modal customer service automatic reply method according to claim 1, characterized in that the intent category recognition model is a multi-layer perceptron network.

5. The multi-modal customer service automatic reply method according to claim 1, characterized in that the set rule defines the correspondence among intent categories, reply types and media formats;

preferably, the set rule takes the form of a lookup table containing multiple triples, each of the form (intent category, reply type, media format).

6. The multi-modal customer service automatic reply method according to claim 1, characterized in that the reply model comprises:

for a context vector whose reply type is a general reply, generating the reply with a simple recurrent neural network;

for a context vector whose reply type is a knowledge-aware reply, generating the reply with a knowledge-aware recurrent neural network;

for a context vector whose reply type is a multi-modal reply, generating the reply by combining the simple recurrent neural network, the knowledge-aware recurrent neural network and a recommendation model.

7. The multi-modal customer service automatic reply method according to claim 6, characterized in that:

the hidden state of the simple recurrent neural network is initialized with the context vector and updated iteratively; the model linearly projects the hidden state of each step onto a one-dimensional vector whose size is the total number of words in the reply vocabulary, outputs the predicted probability distribution over words, and uses a cross-entropy error function to maximize the predicted probability of the words in the target reply;

the knowledge-aware recurrent neural network embeds domain knowledge into knowledge vectors in a high-dimensional space by means of a memory network, and then introduces the knowledge vectors into a unified recurrent neural network;

the recommendation model is a neural model optimized with a max-margin loss that jointly considers text attributes and visual images to learn product representations, and ranks candidate products by the similarity between the product representation and the embedding of the dialogue history.

8. A multi-modal customer service automatic reply system, characterized by comprising:

an intent category recognition module, which determines the corresponding intent category from the context vector using a pre-trained intent category recognition model;

a reply generation module, which, according to the reply category, generates the corresponding reply with a pre-trained reply model, using the context vector as input.

9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the multi-modal customer service automatic reply method according to any one of claims 1-7.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the multi-modal customer service automatic reply method according to any one of claims 1-7.