CN115733925A - Business voice intention presenting method, device, medium and electronic equipment


Info

Publication number
CN115733925A
Authority
CN
China
Prior art keywords
transcription
result
business
audio data
marketing
Prior art date
Legal status
Pending
Application number
CN202110987221.2A
Other languages
Chinese (zh)
Inventor
张超
洪沛
杨国锋
徐虎
李冠华
张国成
戴胜林
程诚
马亮
陈天池
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202110987221.2A
Publication of CN115733925A
Legal status: Pending (current)

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the field of intent recognition, and discloses a business voice intent presenting method, apparatus, medium and electronic device. The method comprises the following steps: acquiring outbound audio data; transcribing the outbound audio data with at least two speech transcription engines to obtain corresponding text transcription results; verifying the text transcription results of the respective speech transcription engines, and determining the final text transcription result of the outbound audio data based on the verification result; acquiring a first business-related keyword in the final text transcription result based on a preset business-related keyword recognition model, wherein the model comprises a feature vector extraction module and an XGBOOST model connected in sequence; determining the intent information corresponding to the first business-related keyword; and outputting the intent information. The method improves the accuracy and efficiency of intent recognition and the effect of marketing assistance; moreover, the whole method can be completed automatically, greatly reducing labor cost.

Description

Business voice intention presenting method, device, medium and electronic equipment
Technical Field
The present application relates to the field of intent recognition technologies, and in particular to a method, apparatus, medium, and electronic device for presenting a business voice intent.
Background
With the development of the internet and information technology, the amount of data generated by humans is growing exponentially, and the big data age has arrived. In the marketing field, big data technology is widely used.
Big data technology can process structured data to some extent; however, it cannot efficiently handle unstructured data such as speech. Voice data is key data in the marketing field, so the intent behind it is still analyzed mainly by hand, which is inefficient and costly, makes the analysis result heavily dependent on human experience, and limits its usefulness as a marketing aid.
Disclosure of Invention
To solve the above problems in the technical field of intent recognition, the present application aims to provide a business voice intent presenting method, apparatus, medium and electronic device.
According to an aspect of the present application, there is provided a service voice intention presenting method, the method including:
acquiring outbound audio data;
transcribing the outbound audio data with at least two speech transcription engines to obtain corresponding text transcription results;
verifying the text transcription results of the respective speech transcription engines, and determining the final text transcription result of the outbound audio data based on the verification result;
acquiring a first business-related keyword in the final text transcription result based on a preset business-related keyword recognition model, wherein the business-related keyword recognition model comprises a feature vector extraction module and an XGBOOST model connected in sequence, an attention mechanism layer is added to the feature vector extraction module, the feature vector extraction module is used for extracting discrete feature vectors of the keywords, and the XGBOOST model is used for outputting classification results of the keywords;
determining intent information corresponding to the first business-related keyword according to the first business-related keyword;
and outputting the intention information.
According to another aspect of the present application, there is provided a service voice intention presenting apparatus, the apparatus including:
an acquisition module configured to acquire outbound audio data;
the transcription module is configured to transcribe the outbound audio data with at least two speech transcription engines to obtain corresponding text transcription results;
the verification module is configured to verify the text transcription results of the respective speech transcription engines and determine the final text transcription result of the outbound audio data based on the verification result;
the keyword obtaining module is configured to obtain a first business-related keyword in the final text transcription result based on a preset business-related keyword recognition model, wherein the business-related keyword recognition model comprises a feature vector extraction module and an XGBOOST model connected in sequence, an attention mechanism layer is added to the feature vector extraction module, the feature vector extraction module is used for extracting discrete feature vectors of the keywords, and the XGBOOST model is used for outputting classification results of the keywords;
the determining module is configured to determine intention information corresponding to the first business related keyword according to the first business related keyword;
an output module configured to output the intention information.
According to another aspect of the present application, there is provided a computer readable program medium storing computer program instructions which, when executed by a computer, cause the computer to perform the method as previously described.
According to another aspect of the present application, there is provided an electronic device including:
a processor;
a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method as previously described.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
the service voice intention presenting method provided by the application comprises the following steps: acquiring outbound audio data; transferring the outbound audio data according to at least two voice transfer engines to obtain corresponding text transfer results; verifying the text transcription results of the voice transcription engines of all paths, and determining the final text transcription result of the outbound audio data based on the verification result; acquiring a first business related keyword in the final text transcription result based on a preset business related keyword identification model, wherein the business related keyword identification model comprises a feature vector extraction module and an XGBOOST model which are sequentially connected, an attention mechanism layer is added in the feature vector extraction module, the feature vector extraction module is used for extracting discrete feature vectors of the keyword, and the XGBOOST model is used for outputting a classification result of the keyword; determining intention information corresponding to the first business related keywords according to the first business related keywords; and outputting the intention information.
In this method, the outbound audio data is first obtained; the speech transcription engines then transcribe it into text transcription results; the business-related keywords are determined by the business-related keyword recognition model; and finally the intent information is determined from the business-related keywords and output. This realizes processing and value mining of the unstructured outbound audio data: intent information is obtained from the outbound audio data, and this intent information can assist marketing activities and improve marketing effect, so the outbound audio data is used effectively and the efficiency of intent recognition on audio data is improved. Meanwhile, the feature vector extraction module with an added attention mechanism layer and the XGBOOST model together realize accurate recognition of business-related keywords and self-growth of those keywords, improving both the accuracy of intent recognition and the effect of marketing assistance. Verifying the text transcription results of the multiple speech transcription engines and determining the final text transcription result from the verification improves the accuracy of text transcription and thus, to a certain extent, the accuracy of intent recognition. In addition, the whole method can be completed automatically, massive unstructured data can be mined rapidly, and labor cost is greatly reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 is a system architecture diagram illustrating a business voice intent presentation method in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a business voice intent presentation method in accordance with an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a storage scheme for the data and the voice transcription engine in accordance with an exemplary embodiment;
FIG. 4 is a diagram illustrating a process for transcribing outbound audio data based on a two-way speech transcription engine in accordance with an exemplary embodiment;
FIG. 5 is a block diagram illustrating a business related keyword recognition model in accordance with an exemplary embodiment;
FIG. 6 is a diagram illustrating a data processing process in training a business related keyword recognition model in accordance with an exemplary embodiment;
FIG. 7 is a diagram illustrating role recognition of a final text transcription result in accordance with an illustrative embodiment;
FIG. 8 is a diagram illustrating intent recognition with respect to role segments in accordance with an illustrative embodiment;
FIG. 9 is a schematic diagram illustrating an interface showing excellent marketing scripts, according to an exemplary embodiment;
FIG. 10 is a schematic diagram illustrating an interface displaying statistics of reasons for failure in accordance with an exemplary embodiment;
FIG. 11 is a block diagram illustrating a business voice intent presenting apparatus in accordance with an exemplary embodiment;
FIG. 12 is a block diagram illustrating an example of an electronic device implementing the business voice intent presentation method described above, according to an example embodiment;
fig. 13 is a program product for implementing the above-described service voice intention presenting method according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
Furthermore, the drawings are merely schematic illustrations of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities.
The application first provides a business voice intent presenting method. Business voice is the voice data generated by conversations between marketers and customers during marketing, where the marketer may be an outbound marketer or an offline marketer and the customer is a user who needs a product or service. The intent of business voice is the underlying information reflected in the voice data by the speakers' dialogue behavior. The method can efficiently and accurately recognize the intent information of the speakers in the voice data, and can display that information in various ways. The scheme in the embodiments of the application can be applied wherever marketing and customer service are needed, for example in service industries with outbound marketing scenarios such as telecommunications, catering, real estate and insurance, improving service and marketing levels. When applied to the telecommunications field, the scheme is particularly suited to assisting 5G service outbound marketing.
The implementation terminal of the present application may be any device having an operation function, which may be connected to an external device for receiving or sending data, and specifically may be a portable mobile device, such as a smart phone, a tablet computer, a notebook computer, a PDA (Personal Digital Assistant), or the like, or may be a fixed device, such as a computer device, a field terminal, a desktop computer, a server, a workstation, or the like, or may be a set of multiple devices, such as a physical infrastructure of cloud computing or a server cluster.
Optionally, the implementation terminal of the present application may be a server or a physical infrastructure of cloud computing.
Fig. 1 is a system architecture diagram illustrating a business voice intent presenting method according to an exemplary embodiment. As shown in fig. 1, the system architecture includes a data layer, an algorithm layer, and an application layer. Specifically, the data layer obtains outbound voice files, which may include voice files from both marketing campaigns and customer service. The algorithm layer comprises a speech transcription engine and an NLP module: the speech transcription engine converts outbound audio into text; after the transcribed text undergoes processing operations such as data cleaning, classification and arrangement, the processed data is input into the NLP module, which performs operations such as business dictionary construction, agent recognition, similarity calculation and dynamic window detection, finally realizing intent recognition, and the intent recognition result is output through the application layer. The application layer can comprise modules for business analysis, marketing assistance, business opportunity insight, outbound reminders, excellent marketing scripts, and business trend following. The outbound reminder module can feed a customer's intent information back to the outbound agent in real time, so that the agent understands the user's intent in depth and can make better marketing decisions. The excellent-scripts module collates and outputs excellent marketing scripts: for example, if an outbound agent successfully closes a sale through a voice conversation, the script information can be extracted from that agent's voice data as an excellent marketing script, so that other marketers can learn from it and continuously improve their marketing ability.
FIG. 2 is a flow chart illustrating a business voice intent presentation method in accordance with an exemplary embodiment. The service voice intention presenting method provided by this embodiment may be executed by a server, as shown in fig. 2, and includes the following steps:
step 210, call-out audio data is obtained.
The outbound audio data is audio data generated by conversations between customer-service or agent personnel and customers in an outbound scenario, and it may be generated through various channels.
The outbound audio data may be audio data previously accumulated and saved prior to the current time, or may be currently acquired in real time.
In one embodiment, the obtaining outbound audio data comprises: and acquiring the outbound audio data received in real time.
Specifically, suppose an agent is currently in a conversation with a customer, recommending a product; at this moment the outbound audio data is acquired in real time and intent recognition is performed on it.
In this embodiment, the outbound audio data of an agent's conversation with a customer is obtained in real time, intent recognition can be performed in real time, and the customer's intent information can be fed back to the agent in real time, so that the agent can adjust the script in a targeted manner and the marketing success rate can be improved.
Step 220, transcribe the outbound audio data with at least two speech transcription engines to obtain corresponding text transcription results.
A speech transcription engine can transcribe outbound audio data into text data; that is, the speech transcription engine includes a speech recognition model.
The at least two speech transcription engines are multiple distinct engines, each of which can independently output a transcription result for the outbound audio data. The engines may include the same or different speech recognition models; when the models differ they may use the same or different algorithms, and even with the same algorithm the model parameters may differ. The speech recognition model may be a Long Short-Term Memory network (LSTM), a DFSMN model, or the like.
In one embodiment, the outbound audio data and the text transcription result are both stored in a cloud platform, and the speech transcription engine is deployed on the cloud platform.
Specifically, a voice transcription engine can be deployed on a private cloud platform in advance by using a system local development server, outbound audio data can be directly uploaded to the cloud platform by the outbound system, and the obtained text transcription result is still stored on the cloud platform after the voice transcription engine on the cloud platform transcribes the outbound audio data.
In this embodiment, the outbound audio data, the text transcription results and the speech transcription engine are all stored on the cloud platform, so they are not directly accessible to terminals, effectively safeguarding data security.
FIG. 3 is a schematic diagram illustrating a manner of storage for a data and voice transcription engine in accordance with an exemplary embodiment. As shown in fig. 3, the private cloud platform includes outbound cloud data, a voice transcription engine, and outbound transcription data, where the outbound cloud data includes campaign marketing outbound cloud data and customer service outbound cloud data, where the campaign marketing outbound cloud data may be outbound audio data generated when an agent actively makes a call to a customer, and the customer service outbound cloud data may be outbound audio data generated when the agent receives a consultation call made by the customer; the external call transcription data comprise activity marketing external call transcription data and customer service external call transcription data which are text transcription results corresponding to the activity marketing external call cloud data and the customer service external call cloud data respectively. The outbound audio data can belong to any outbound cloud data.
Fig. 4 is a diagram illustrating a process for transcribing outbound audio data with two speech transcription engines according to an exemplary embodiment. As shown in fig. 4, when two speech transcription engines are deployed, a large amount of outbound audio data is transcribed by an X transcription engine and a Y transcription engine respectively, and the resulting transcriptions fall into two types: campaign marketing outbound transcription data and 10000-hotline service outbound transcription data, the latter being the customer service outbound transcription data described above.
The embodiment shown in fig. 3 and 4 is actually a scheme of firstly performing transcription by using pre-obtained outbound audio data to obtain transcribed data, and then constructing a business keyword library. The relevant service keyword library will be described in the following embodiments.
Step 230, verify the text transcription results of the respective speech transcription engines, and determine the final text transcription result of the outbound audio data based on the verification result.
Verifying the text transcription results of the speech transcription engines is the process of determining the differences between the text transcription results; the verification result records whether, and where, the text transcription results are inconsistent.
In one embodiment, transcribing the outbound audio data with at least two speech transcription engines to obtain corresponding text transcription results includes: transcribing the outbound audio data with two speech transcription engines to obtain a first text transcription result and a second text transcription result. Verifying the text transcription results of the respective engines and determining the final text transcription result of the outbound audio data based on the verification result then includes: discarding both a first transcription sentence in the first text transcription result and the corresponding second transcription sentence in the second text transcription result when the number of consecutively inconsistent words between them reaches a predetermined number; retaining either of the two when the number of consecutively inconsistent words is smaller than the predetermined number; and taking all retained transcription sentences as the final text transcription result of the outbound audio data.
Specifically, the first transcription sentence corresponds to the second transcription sentence; that is, they are transcription sentences at the same position, or transcriptions of the same piece of the outbound audio data. The predetermined number can be set as needed; for example, it may be 10. When the number of consecutively inconsistent words in the first and second transcription sentences is greater than or equal to 10, both sentences are discarded. When it is less than 10, there are two cases. If the first and second transcription sentences are identical, i.e. no inconsistent characters exist between them, either sentence is retained. If they are inconsistent but the number of consecutively inconsistent words is less than 10, then in addition to retaining either sentence, the differing words can be extracted from the inconsistent portions to build a difference-word dictionary; since differing words at the same position are often homophones, this dictionary can later be used for text repair.
In one embodiment, retaining either transcription sentence when the number of consecutively inconsistent words is below the predetermined number includes: retaining either of the two when the first transcription sentence in the first text transcription result is identical to the corresponding second transcription sentence in the second text transcription result.
All retained transcription sentences can be stored as the final text transcription result.
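As an illustration only, and not the claimed implementation, the retention rule above can be sketched as follows; the function names are hypothetical, and characters stand in for the "words" of the Chinese transcripts.

```python
# Minimal sketch of the dual-engine verification rule described above.

def longest_mismatch_run(sent_a: str, sent_b: str) -> int:
    """Length of the longest run of consecutive differing characters
    between two same-position transcription sentences."""
    longest = run = 0
    for ch_a, ch_b in zip(sent_a, sent_b):
        run = run + 1 if ch_a != ch_b else 0
        longest = max(longest, run)
    # Treat any trailing length difference as an additional mismatch run.
    return max(longest, run + abs(len(sent_a) - len(sent_b)))

def verify_transcripts(engine_x: list[str], engine_y: list[str],
                       threshold: int = 10) -> list[str]:
    """Discard both versions of a sentence when the engines disagree on
    `threshold` or more consecutive words; otherwise keep either one."""
    final = []
    for sent_x, sent_y in zip(engine_x, engine_y):
        if longest_mismatch_run(sent_x, sent_y) >= threshold:
            continue            # both transcriptions discarded
        final.append(sent_x)    # near-identical, so either may be retained
    return final
```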
Referring to fig. 4, after the campaign marketing outbound transcription data and the 10000-hotline outbound transcription data are obtained, the transcription results of the two speech transcription engines are compared, errors are corrected or erroneous sentences discarded, and the transcription text is finally stored; the mutually inconsistent original vocabulary and check vocabulary are fed back into the two speech transcription engines, which are retrained to improve their performance.
And 240, acquiring a first service related keyword in the final text transcription result based on a preset service related keyword identification model.
The business-related keyword recognition model comprises a feature vector extraction module and an XGBOOST model connected in sequence, with an attention mechanism layer added to the feature vector extraction module; the feature vector extraction module extracts discrete feature vectors of the keywords, and the XGBOOST model outputs the classification results of the keywords.
In one embodiment, the feature vector extraction module includes a two-way long-short term memory network module and a deep semantic matching model.
A Long Short-Term Memory (LSTM) network is a type of recurrent neural network. RNN and LSTM algorithms compress all past information into a single vector that is passed forward, so when a sentence is too long the vector's dimensionality is too small, easily causing information compression and even loss. The bidirectional long short-term memory network module is a bidirectional LSTM. The deep semantic matching model is the Deep Structured Semantic Model (DSSM), which can output low-dimensional vector representations. The feature vector extraction module is therefore a bidirectional LSTM-DSSM with an added attention mechanism layer: because the LSTM-DSSM algorithm suffers from vanishing gradients and long-distance dependency problems in the neural network, the attention mechanism layer extends it with an external memory mechanism, avoiding the information compression caused by passing backward only a single hidden-state vector, and improving prediction accuracy and computational efficiency.
The XGBOOST model is an eXtreme Gradient Boosting model: a strong classifier composed of multiple weak classifiers, i.e. a boosting tree model.
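As a rough illustration of a feature extractor of this shape, the following PyTorch sketch pools bidirectional LSTM hidden states with an additive attention layer instead of keeping only the final hidden state. It is a sketch under stated assumptions, not the patent's model: the DSSM projection layers are omitted, and all dimensions and layer choices are illustrative.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionExtractor(nn.Module):
    """Illustrative feature-vector extractor: a bidirectional LSTM whose
    hidden states are pooled by an additive attention layer."""

    def __init__(self, vocab_size: int, embed_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, bidirectional=True,
                            batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)   # additive attention scores

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        states, _ = self.lstm(self.embed(token_ids))        # (B, T, 2H)
        weights = torch.softmax(self.attn(states), dim=1)   # (B, T, 1)
        return (weights * states).sum(dim=1)                # (B, 2H) features

extractor = BiLSTMAttentionExtractor(vocab_size=5000)
features = extractor(torch.randint(0, 5000, (2, 12)))       # shape (2, 128)
```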
The service related keyword identification model can directly output corresponding first service related keywords according to the input of the final text transcription result; the service-related keyword recognition model may also output a corresponding first service-related keyword according to the input of the keyword sequence.
In one embodiment, before obtaining the first service-related keyword in the final text transcription result based on a preset service-related keyword recognition model, the method further includes: extracting keywords from the final text transcription result to obtain a keyword sequence corresponding to the final text transcription result; the obtaining of the first service related keyword in the final text transcription result based on the preset service related keyword recognition model includes: and inputting the keyword sequence into the service related keyword recognition model to obtain a first service related keyword in the final text transcription result.
FIG. 5 is a block diagram illustrating a business-related keyword recognition model in accordance with an exemplary embodiment. As shown in FIG. 5, in addition to the bidirectional LSTM-DSSM and XGBOOST, the model includes an N-gram layer and a Max-pooling layer located between them; the XGBOOST stage comprises decision trees constructed from multiple tree splits (Tree Splits).
N-gram is an algorithm based on a statistical language model. Its basic idea is to slide a window of size N over the text content byte by byte, forming a sequence of byte fragments of length N. Each byte fragment is called a gram; the occurrence frequency of all grams is counted and filtered against a preset threshold to form a key-gram list, i.e. the vector feature space of the text, in which each gram is one feature-vector dimension.
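A minimal sketch of this sliding window at character level (an assumed granularity; the corpus and threshold are illustrative):

```python
from collections import Counter

def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Slide a window of size n over the text, yielding one gram per position."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

corpus = ["想开通5G套餐", "5G套餐太贵", "宽带提速"]
grams = Counter(g for text in corpus for g in char_ngrams(text))
key_grams = [g for g, freq in grams.items() if freq >= 2]  # e.g. "5G", "套餐"
```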
After a keyword sequence consisting of keywords such as '5G' and 'broadband' is obtained, it is input into the business-related keyword recognition model: word vectors are output by the N-gram layer, and the bidirectional LSTM-DSSM then weights and combines all features of the word vectors to output hidden-layer feature vectors. The network's memory automatically screens and combines features, generating new discrete feature vectors that serve as input to the XGBOOST model, which can output the prediction probability of each keyword for each class, yielding the model's classification result for the keyword.
FIG. 6 is a diagram illustrating the data processing procedure for training the business-related keyword recognition model in accordance with an exemplary embodiment. Training and use of the model may employ the Spark 2.0 tool and the PySpark data framework. The data processing procedure is as follows. First, the modeling samples are sorted and their keywords vectorized into word vectors; the modeling samples comprise the keywords and the corresponding business-keyword labels, and can be divided into a training set of 7562 samples and a test set of 1687 samples. Next, the word vectors are fed to the bidirectional LSTM-DSSM model: the first-layer model processes them and converts them, through its hidden layer, into 1028-dimensional word vectors. These 1028-dimensional vectors are then input to the second-layer model of the bidirectional LSTM-DSSM and mapped into a 300-dimensional vector space, yielding 300-dimensional word vectors. The 300-dimensional vectors are then processed into discrete feature vectors and input to the XGBOOST model, which outputs the probability that each keyword is a business keyword. Finally, a threshold is set as the screening condition: according to the prediction probability, a value greater than 0.5 is defined as a telecom professional word, otherwise as a non-professional word, thereby realizing recognition of business keywords. Throughout training, the model parameters are adjusted according to the keyword recognition results, the business-keyword labels corresponding to the keywords, and the loss function; training iterates until the model meets the stop condition.
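The classification stage can be sketched with the open-source xgboost package. The random matrices below merely stand in for the 300-dimensional discrete feature vectors produced by the extractor; only the sample counts, the feature size and the 0.5 screening threshold come from the description above, and the XGBoost hyperparameters are arbitrary.

```python
import numpy as np
import xgboost as xgb

# Hypothetical stand-ins for the extracted feature vectors and their labels
# (1 = telecom business term, 0 = non-business term).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(7562, 300)), rng.integers(0, 2, 7562)
X_test = rng.normal(size=(1687, 300))

clf = xgb.XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
clf.fit(X_train, y_train)

# Screen keywords: probability > 0.5 counts as a telecom business term.
proba = clf.predict_proba(X_test)[:, 1]
is_business_term = proba > 0.5
```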
After training is completed, the performance of the model can also be tested using a test set.
Comparative verification on the test set shows that the traditional LSTM-DSSM algorithm classifies telecom business words/non-business words with an accuracy of 0.67 and a recall of 0.87, whereas the improved bidirectional LSTM-DSSM-XGBOOST algorithm achieves an accuracy of 0.81 and a recall of 0.84, a marked improvement.
Step 250, determining intention information corresponding to the first business related keyword according to the first business related keyword.
In one embodiment, the outbound audio data is generated during a business marketing process, and the determining intent information corresponding to the first business-related keyword according to the first business-related keyword includes: and inquiring marketing result reason information matched with the first business related keyword from a business keyword library according to the first business related keyword, and taking the marketing result reason information as intention information corresponding to the first business related keyword.
Specifically, the marketing result reason information may include marketing success reason information and marketing failure reason information.
In one embodiment, before querying the business keyword library for marketing-result reason information matching the first business-related keyword, the method further comprises: obtaining pre-stored target outbound audio data and the marketing result information corresponding to it; transcribing the target outbound audio data with at least two speech transcription engines to obtain corresponding text transcription results; verifying the text transcription results of the respective engines, and determining the final text transcription result of the target outbound audio data based on the verification result; acquiring a second business-related keyword in the final text transcription result based on the preset business-related keyword recognition model; and pushing the second business-related keyword and the marketing result information to a reason induction end, so that a user of the reason induction end induces the marketing-result reason information from them, and the second business-related keyword together with the marketing-result reason information is added to the business keyword library.
In other embodiments, the second service related keyword may also be directly added to the service keyword library after being acquired.
In the embodiment of the application, the self-growth of the service key words is realized by continuously performing text transcription and service related key word identification.
The second service related keyword may be the same as or different from the first service related keyword, and the second service related keyword is different from the first service related keyword in that the second service related keyword is a keyword at a service keyword library establishment stage.
In one embodiment, the method further comprises: if the intention information corresponding to the first business related keyword cannot be determined, pushing the first business related keyword to a reason induction end so that a user of the reason induction end induces marketing result reason information corresponding to the first business related keyword, and adding the first business related keyword and the marketing result reason information to a business keyword library.
The reason induction end can be a terminal used by a business expert, and when the corresponding marketing result information is marketing success, the business expert can define marketing success reason information according to the second business related key words; when the corresponding marketing result information is marketing failure, the service expert can define the marketing failure reason information according to the second service related keyword. Therefore, the marketing success reason information or the marketing failure reason information summarized by experts can be obtained by searching in the business keyword library, and the intention identification is further realized.
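A toy sketch of the library lookup and the fallback to the reason induction end follows; the keyword entries and function names are invented for illustration and are not from the patent.

```python
# Hypothetical in-memory stand-in for the business keyword library.
keyword_library = {
    "套餐够用": "marketing failure: current package already sufficient",
    "资费太贵": "marketing failure: tariff considered too expensive",
}

def push_to_reason_induction_end(keyword: str) -> None:
    # Stand-in: in the method above this pushes the keyword to a business
    # expert, whose summarized reason is then added back into the library.
    print(f"unmatched keyword pushed for expert reason induction: {keyword}")

def lookup_intent(keyword: str) -> str | None:
    """Return matched marketing-result reason info as the intent information,
    or fall back to the reason induction end when no match exists."""
    reason = keyword_library.get(keyword)
    if reason is None:
        push_to_reason_induction_end(keyword)
    return reason

lookup_intent("套餐够用")  # -> 'marketing failure: current package already sufficient'
```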
Step 260, outputting the intention information.
The intent information can be output in multiple visual forms, and the analysis results of the intent information can also be presented on a multi-form visualization WEB front end.
In one embodiment, determining the intent information corresponding to the first business-related keyword further includes: performing role recognition on the final text transcription result based on a logistic regression model to obtain the text transcription results corresponding to the customer role and the agent role respectively; and determining the intent information corresponding to the first business-related keyword from the text transcription result of the customer role to which the keyword belongs. Outputting the intent information then comprises: returning the intent information to the terminal of the agent role.
Role recognition of the final text transcription result based on the logistic regression model is in effect the process of locating the two parties of the conversation (agent and customer) within the final text transcription result. FIG. 7 is a diagram illustrating role recognition of a final text transcription result, according to an example embodiment. As shown in fig. 7, role recognition proceeds as follows. First, the conversation content of the two roles, Spk0 and Spk1, is segmented into words and vectorized, where Spk0 and Spk1 are the label data corresponding to the conversation content. Then, the business words and the words appearing in the conversation content are added to the word vector to obtain a text vector. Finally, the text vectors and the corresponding label data are input as a manually labelled data set into a Logistic Regression (LR) model for training. The logistic regression model can output the category of each sentence in the text transcription result. For example, if the model predicts a sentence to be from an agent with probability 88% and from a customer with probability 12%, it labels the sentence Spk0: agent; if it predicts another sentence to be from a customer with probability 88% and from an agent with probability 12%, it labels it Spk1: customer.
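A minimal role-recognition sketch with scikit-learn; the TF-IDF features, the four labelled sentences and the query are illustrative stand-ins for the text vectors and the manually labelled data set described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labelled set: 0 = Spk0 (agent), 1 = Spk1 (customer); text pre-segmented.
sentences = ["您好 这边 是 电信 客服 给 您 推荐 5G 套餐",
             "我 现在 的 套餐 够用 了 不 需要",
             "这个 套餐 每月 只要 99 元 还 送 宽带",
             "太贵 了 我 再 考虑 一下"]
labels = [0, 1, 0, 1]

role_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
role_model.fit(sentences, labels)

proba = role_model.predict_proba(["给 您 推荐 一个 5G 套餐"])[0]
role = "Spk0: agent" if proba[0] >= 0.5 else "Spk1: customer"
```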
In other embodiments of the present application, the role recognition may also be performed based on other classification models.
In one embodiment, the method further comprises: when the marketing result information corresponding to the outbound audio data is obtained and indicates a successful sale, extracting marketing script information from the text transcription result corresponding to the agent role, and storing the marketing script information.
The marketing script information likewise reflects the agent's intent; extracting it therefore amounts to intent recognition for the agent.
FIG. 8 is a diagram illustrating intent recognition by role in accordance with an exemplary embodiment. As shown in fig. 8, on one hand, based on the lexicon output by the keyword self-growth model algorithm, business experts summarize it to define the key reasons for marketing success/failure; the customer's original corpus is matched against the keyword lexicon, and the expert-summarized success/failure reasons are output, realizing recognition of the intent expressed by the customer. On the other hand, based on the marketing results and the fluency of the transcribed dialogue content, the key marketing scripts of different services are selected, realizing intelligent extraction of excellent customer-service scripts. In fig. 8, business-related keywords in the final text transcription result are continuously extracted through a sliding window, capturing the customer's keywords and yielding the keywords by which customer intent identifies marketing success; meanwhile, excellent agent scripts are intelligently extracted based on the agent semantic recognition model. Both the success keywords and the excellent agent scripts can be output on a page; as shown in fig. 8, the page can display a button or hyperlink for listening to the original recording, so that an agent can hear the original recording corresponding to an excellent script, learn its tone, and further improve marketing ability.
Since an outbound call is a conversation between agent and customer, their utterances are interleaved in the transcribed text; the embodiment therefore captures keyword content from the agent-customer dialogue text based on agent semantic recognition, realizing intelligent extraction of customer-service scripts and intent analysis of the customer's language.
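A minimal sketch of the sliding-window keyword capture, assuming a fixed character window over the customer transcript (the window size and names are illustrative):

```python
def capture_keywords(transcript: str, keyword_set: set[str],
                     window: int = 4) -> list[str]:
    """Slide a fixed-size window over the transcript and collect every
    business keyword that appears inside any window position."""
    hits = []
    for i in range(max(len(transcript) - window + 1, 1)):
        segment = transcript[i:i + window]
        hits.extend(k for k in keyword_set if k in segment and k not in hits)
    return hits

capture_keywords("我觉得5G套餐太贵了", {"5G", "套餐", "太贵"})
# -> ['5G', '套餐', '太贵']
```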
In one embodiment, the method further comprises: aggregating the intent information and the marketing script information, and outputting the aggregated result in a visual form.
Visualization may be implemented with Tableau, and the aggregated results may be sent as an outbound assisted-analysis report.
After the intent information and marketing script information are intelligently acquired as above, they can be displayed in various forms, and can be further processed for display, for example presented as word-cloud dot diagrams, sequence diagrams, ring-scale (donut) charts and the like based on the polymorphic ECharts graphical tool. FIG. 9 is a schematic diagram illustrating an interface showing excellent marketing scripts according to an exemplary embodiment. As shown in fig. 9, the page shows not only successful excellent scripts but also success keywords, each displayed in a word cloud in which size represents the keyword's frequency of occurrence, so that agents can more quickly grasp which success keywords ultimately improve marketing effectiveness.
FIG. 10 is a diagram illustrating an interface displaying failure-reason statistics, according to an example embodiment. Referring to fig. 10, the page shows the statistical distribution of failure reasons, presented as a donut chart detailing each reason's share of failed marketing attempts; for example, the share of the reason "current package is sufficient" might be 30.31%, and so on. FIG. 10 also shows a failure-keyword top 10, i.e. the 10 keywords mentioned most frequently in the voice data of failed marketing; the top 10 can be presented dynamically in real time according to the keywords/content.
In summary, the embodiment of the application is implemented based on outbound data, a speech transcription engine, NLP algorithms and application software programming; deployment is completed on general-purpose servers, the application runs on the company's big-data capability platform, and intent recognition of outbound marketing recordings has been successfully realized. By introducing outbound-call analysis capability and mining and outputting marketing scripts that customers readily accept, the number of successful orders from active outbound marketing in Hefei increased by more than 150% compared with Putonghua (standard Mandarin) calls, and the resulting business uplift is expected to add over one hundred million yuan in revenue. Intent recognition on customers' speech-to-text data can pinpoint the real reasons for marketing failure and screen out customers with 5G needs for secondary marketing.
The application also provides a service voice intention presenting device, and the following device embodiment is provided in the application.
FIG. 11 is a block diagram illustrating a business voice intent presenting apparatus in accordance with an exemplary embodiment. As shown in fig. 11, the apparatus 1100 includes:
an obtaining module 1110 configured to obtain outbound audio data;
the transcription module 1120 is configured to transcribe the outbound audio data according to at least two voice transcription engines to obtain a corresponding text transcription result;
a verification module 1130 configured to verify the text transcription results of the respective speech transcription engines, and determine a final text transcription result of the outbound audio data based on the verification result;
a keyword obtaining module 1140, configured to obtain a first business related keyword in the final text transcription result based on a preset business related keyword recognition model, where the business related keyword recognition model includes a feature vector extraction module and an XGBOOST model, which are connected in sequence, and an attention mechanism layer is added to the feature vector extraction module, the feature vector extraction module is used to extract discrete feature vectors of the keyword, and the XGBOOST model is used to output a classification result of the keyword;
a determining module 1150 configured to determine intention information corresponding to the first business-related keyword according to the first business-related keyword;
an output module 1160 configured to output the intent information.
According to a third aspect of the present application, there is also provided an electronic device capable of implementing the above method.
As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module" or "system."
An electronic device 1200 according to this embodiment of the present application is described below with reference to fig. 12. The electronic device 1200 shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 12, the electronic device 1200 is embodied in the form of a general purpose computing device. Components of the electronic device 1200 may include, but are not limited to: the at least one processing unit 1210, the at least one memory unit 1220, and a bus 1230 connecting various system components including the memory unit 1220 and the processing unit 1210.
Wherein the storage unit stores program code, which can be executed by the processing unit 1210, to cause the processing unit 1210 to perform the steps according to various exemplary embodiments of the present application described in the section "example methods" above in this specification.
The storage unit 1220 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 1221 and/or a cache memory unit 1222, and may further include a read only memory unit (ROM) 1223.
Storage unit 1220 may also include a program/utility 1224 having a set (at least one) of program modules 1225, such program modules 1225 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Bus 1230 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1200 may also communicate with one or more external devices 1400 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 1200, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 1200 to communicate with one or more other computing devices. Such communication can occur via input/output (I/O) interfaces 1250, such as to communicate with a display unit 1240. Also, the electronic device 1200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 1260. As shown, the network adapter 1260 communicates with the other modules of the electronic device 1200 via a bus 1230. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 1200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present application.
According to a fourth aspect of the present application, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the present application may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present application described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 13, a program product 1300 for implementing the above method according to an embodiment of the present application is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present application is not so limited, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the present application, and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed, for example, synchronously or asynchronously in multiple modules.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for presenting a business voice intention, the method comprising:
acquiring outbound audio data;
transcribing the outbound audio data according to at least two speech transcription engines to obtain corresponding text transcription results;
verifying the text transcription result of each speech transcription engine, and determining a final text transcription result of the outbound audio data based on the verification results;
acquiring a first business-related keyword in the final text transcription result based on a preset business-related keyword recognition model, wherein the business-related keyword recognition model comprises a feature vector extraction module and an XGBoost model connected in sequence, an attention mechanism layer is included in the feature vector extraction module, the feature vector extraction module is configured to extract discrete feature vectors of keywords, and the XGBoost model is configured to output a classification result for each keyword;
determining, according to the first business-related keyword, intention information corresponding to the first business-related keyword;
and outputting the intention information.
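
Read as a pipeline, claim 1 has five steps: transcribe with multiple engines, cross-check, recognize keywords, look up intentions, output. The following sketch shows one way the steps could be wired together; every name in it (the engine, verify, and keyword-extraction callables, and the intention library) is a hypothetical stand-in for illustration, not the patent's actual implementation.

```python
from typing import Callable, Dict, List

def present_business_intent(
    audio: bytes,
    engines: List[Callable[[bytes], List[str]]],          # at least two speech transcription engines
    verify: Callable[[List[str], List[str]], List[str]],  # cross-check step, sketched after claim 2
    extract_keywords: Callable[[List[str]], List[str]],   # business-related keyword recognition model
    intent_library: Dict[str, str],                       # business keyword -> intention information
) -> List[str]:
    transcripts = [engine(audio) for engine in engines]        # transcribe with every engine
    final_transcript = verify(transcripts[0], transcripts[1])  # verify and keep a final result
    keywords = extract_keywords(final_transcript)              # recognize business-related keywords
    return [intent_library[k] for k in keywords if k in intent_library]  # map keywords to intentions
```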
2. The method of claim 1, wherein transcribing the outbound audio data according to at least two speech transcription engines to obtain corresponding text transcription results comprises:
transcribing the outbound audio data according to two speech transcription engines to obtain a first text transcription result and a second text transcription result;
wherein verifying the text transcription result of each speech transcription engine and determining the final text transcription result of the outbound audio data based on the verification results comprises:
discarding both a first transcription sentence in the first text transcription result and the corresponding second transcription sentence in the second text transcription result when the number of continuously inconsistent words between them reaches a preset number;
retaining either of the first transcription sentence and the second transcription sentence when the number of continuously inconsistent words between them is less than the preset number;
and taking all retained transcription sentences as the final text transcription result of the outbound audio data.
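
Claim 2's verification rule can be made concrete. The sketch below assumes whitespace-tokenized sentences (real Chinese transcripts would need a word segmenter) and a threshold of 3 consecutive mismatches; both are illustrative assumptions, since the claims leave the tokenization and the preset number unspecified.

```python
def max_consecutive_mismatch(sent_a: str, sent_b: str) -> int:
    """Longest run of position-wise inconsistent words between two
    transcriptions of the same sentence (whitespace tokenization assumed)."""
    words_a, words_b = sent_a.split(), sent_b.split()
    longest = run = 0
    for wa, wb in zip(words_a, words_b):
        run = run + 1 if wa != wb else 0
        longest = max(longest, run)
    # Any length difference counts as a trailing run of mismatches.
    return max(longest, run + abs(len(words_a) - len(words_b)))

def verify(first: list, second: list, threshold: int = 3) -> list:
    """Discard a sentence pair once its consecutive mismatch count reaches
    the preset number; otherwise retain either sentence (the first, here).
    threshold=3 is an assumed placeholder, not a value from the patent."""
    return [
        sent_a
        for sent_a, sent_b in zip(first, second)
        if max_consecutive_mismatch(sent_a, sent_b) < threshold
    ]
```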
3. The method of claim 1, wherein the outbound audio data and the text transcription results are both stored on a cloud platform, and wherein the speech transcription engines are deployed on the cloud platform.
4. The method of claim 1, wherein the outbound audio data is generated during a business marketing process, and wherein determining the intention information corresponding to the first business-related keyword according to the first business-related keyword comprises:
querying a business keyword library, according to the first business-related keyword, for marketing result reason information matching the keyword, and taking the marketing result reason information as the intention information corresponding to the first business-related keyword.
5. The method of claim 4, wherein before querying the business keyword library for the marketing result reason information matching the first business-related keyword, the method further comprises:
acquiring prestored target outbound audio data and marketing result information corresponding to the target outbound audio data;
transcribing the target outbound audio data according to at least two speech transcription engines to obtain corresponding text transcription results;
verifying the text transcription result of each speech transcription engine, and determining a final text transcription result of the target outbound audio data based on the verification results;
acquiring a second business-related keyword in the final text transcription result based on the preset business-related keyword recognition model;
and pushing the second business-related keyword and the marketing result information to a reason induction terminal, so that a user of the reason induction terminal summarizes the marketing result reason information from the second business-related keyword and the marketing result information, and adding the second business-related keyword and the marketing result reason information to the business keyword library.
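
Claims 4 and 5 together describe a lookup table: it is populated with human-summarized reason information mined from historical calls, then queried at intent-determination time. A minimal sketch, assuming an in-memory store and illustrative entries (the claims do not specify a storage backend):

```python
from typing import Dict, Optional

class BusinessKeywordLibrary:
    """In-memory stand-in for the business keyword library of claims 4-5."""

    def __init__(self) -> None:
        self._reasons: Dict[str, str] = {}  # keyword -> marketing result reason

    def add(self, keyword: str, reason: str) -> None:
        # Claim 5: store a mined keyword together with the reason
        # information summarized by the reason induction terminal's user.
        self._reasons[keyword] = reason

    def query(self, keyword: str) -> Optional[str]:
        # Claim 4: the matching marketing result reason information doubles
        # as the intention information for the keyword.
        return self._reasons.get(keyword)

library = BusinessKeywordLibrary()
library.add("too expensive", "customer declined on price")  # illustrative entry only
print(library.query("too expensive"))
```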
6. The method of claim 4, wherein determining the intention information corresponding to the first business-related keyword according to the first business-related keyword further comprises:
performing role recognition on the final text transcription result based on a logistic regression model to obtain the text transcription results corresponding to the customer role and to the agent role, respectively;
determining the intention information corresponding to the first business-related keyword according to the text transcription result of the customer role in which the first business-related keyword appears;
wherein outputting the intention information comprises:
and returning the intention information to the terminal of the agent role.
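
Claim 6 names only "a logistic regression model" for role recognition. One plausible reading, sketched with scikit-learn and TF-IDF features (both assumptions, as the claims specify neither the features nor the training data), classifies each transcribed sentence as customer or agent speech:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative labelled sentences (0 = customer role, 1 = agent role).
sentences = [
    "I already have a broadband plan at home",
    "May I introduce our new package to you",
    "How much does that cost per month",
    "This offer includes unlimited data for a year",
]
labels = [0, 1, 0, 1]

role_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
role_model.fit(sentences, labels)

# Sentences predicted as customer speech are the ones whose keywords drive
# the intention information; the result goes back to the agent's terminal.
print(role_model.predict(["That is still too expensive for me"]))  # -> [0]
```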
7. The method of claim 6, further comprising:
and when the marketing result information corresponding to the outbound audio data indicates that the marketing succeeded, extracting marketing communication information from the text transcription result corresponding to the agent role, and storing the marketing communication information.
8. A business voice intention presenting apparatus, the apparatus comprising:
an acquisition module configured to acquire outbound audio data;
a transcription module configured to transcribe the outbound audio data according to at least two speech transcription engines to obtain corresponding text transcription results;
a verification module configured to verify the text transcription result of each speech transcription engine and to determine a final text transcription result of the outbound audio data based on the verification results;
a keyword obtaining module configured to obtain a first business-related keyword in the final text transcription result based on a preset business-related keyword recognition model, wherein the business-related keyword recognition model comprises a feature vector extraction module and an XGBoost model connected in sequence, an attention mechanism layer is included in the feature vector extraction module, the feature vector extraction module is configured to extract discrete feature vectors of keywords, and the XGBoost model is configured to output a classification result for each keyword;
a determining module configured to determine, according to the first business-related keyword, intention information corresponding to the first business-related keyword;
an output module configured to output the intention information.
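
Claims 1 and 8 both describe the keyword recognition model as a feature vector extractor containing an attention mechanism layer, feeding an XGBoost classifier. The sketch below is one plausible reading, pairing a tiny attention-pooling layer in PyTorch with an XGBoost classifier on toy data; the embedding dimension, attention design, and training data are all assumptions, since the claims do not specify them.

```python
import numpy as np
import torch
import torch.nn as nn
from xgboost import XGBClassifier

class AttentionPooling(nn.Module):
    """Collapses a sequence of token embeddings into one discrete feature
    vector via a learned attention weighting (the claimed attention layer)."""
    def __init__(self, dim: int = 32):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.score(token_embeddings), dim=0)  # (seq, 1)
        return (weights * token_embeddings).sum(dim=0)                # (dim,)

extractor = AttentionPooling(dim=32)
# Toy candidate keywords, each represented by random token embeddings.
features = np.stack(
    [extractor(torch.randn(5, 32)).detach().numpy() for _ in range(8)]
)
labels = np.array([1, 0, 1, 0, 0, 1, 0, 1])  # 1 = business-related keyword

clf = XGBClassifier(n_estimators=10)  # outputs the keyword classification result
clf.fit(features, labels)
print(clf.predict(features[:2]))
```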
9. A computer-readable program medium, characterized in that it stores computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1 to 7.
10. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory having stored thereon computer readable instructions which, when executed by the processor, implement the method of any one of claims 1 to 7.
CN202110987221.2A 2021-08-26 2021-08-26 Business voice intention presenting method, device, medium and electronic equipment Pending CN115733925A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110987221.2A CN115733925A (en) 2021-08-26 2021-08-26 Business voice intention presenting method, device, medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110987221.2A CN115733925A (en) 2021-08-26 2021-08-26 Business voice intention presenting method, device, medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115733925A 2023-03-03

Family

ID=85289931

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110987221.2A Pending CN115733925A (en) 2021-08-26 2021-08-26 Business voice intention presenting method, device, medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115733925A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117391822A (en) * 2023-12-11 2024-01-12 中汽传媒(天津)有限公司 VR virtual reality digital display method and system for automobile marketing
CN117391822B (en) * 2023-12-11 2024-03-15 中汽传媒(天津)有限公司 VR virtual reality digital display method and system for automobile marketing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination