CN111309905A - Clustering method and device for conversation sentences, electronic equipment and storage medium

Info

Publication number
CN111309905A
Authority
CN
China
Prior art keywords
dialogue
word vector
sentences
word
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010081669.3A
Other languages
Chinese (zh)
Inventor
罗华刚
张�杰
李犇
徐世超
吴涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co ltd
Priority to CN202010081669.3A
Publication of CN111309905A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a method and device for clustering dialogue sentences, an electronic device, and a storage medium. Dialogue recording data about a target product between a salesperson and a customer is acquired, and the corresponding dialogue text data is determined. Word vector sequences are then determined for a plurality of dialogue sentences extracted from the dialogue text data, and the similarity between any two dialogue sentences is determined according to the time series distance metric between their word vector sequences. The dialogue sentences are then clustered based on the similarity to determine the dialogue sentences under each sentence type. In this way, dialogue sentences of different sentence types can be clustered from the dialogue recording data without manual summarization, and salespeople can be trained on the dialogue sentences of the different sentence types, which can improve their sales level.

Description

Clustering method and device for conversation sentences, electronic equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for clustering dialog statements, an electronic device, and a storage medium.
Background
With rapid economic development, consumer demand has grown and the retail industry has expanded quickly. In retail, the sales staff generally influence a product's sales volume, but physical stores face difficulties in this respect: on the one hand, new sales staff lack the necessary sales skills; on the other hand, staff turnover is high. How to efficiently improve the sales level of sales staff is therefore a technical problem that urgently needs to be solved.
To address this problem, business experts currently listen to sales recordings and summarize sales dialogue sentences based on their own experience, and sales staff are then trained on the results.
Disclosure of Invention
In view of this, the embodiments of the present application at least provide a method, an apparatus, an electronic device, and a storage medium for clustering dialog sentences, so as to achieve training of sales personnel and improve the sales level of the sales personnel.
The application mainly comprises the following aspects:
in a first aspect, an embodiment of the present application provides a method for clustering conversational sentences, where the method includes:
acquiring dialogue recording data of a target product between a salesman and a client, and determining dialogue text data corresponding to the dialogue recording data;
extracting a plurality of dialogue sentences from the dialogue text data, and determining a word vector sequence corresponding to each dialogue sentence; the word vector sequence comprises word vectors corresponding to a plurality of words contained in the dialogue statement;
determining the similarity between any two dialogue sentences in the plurality of dialogue sentences according to the time sequence distance metric value between any two word vector sequences in the plurality of word vector sequences;
and clustering the plurality of dialogue sentences based on the similarity, and determining the dialogue sentences in each sentence type in the plurality of sentence types.
In a possible implementation, the clustering method further includes determining dialog text data corresponding to the dialog recording data according to the following steps:
and inputting the dialogue recording data into a trained voice recognition model, and determining dialogue text data corresponding to the dialogue recording data.
In a possible implementation manner, the extracting a plurality of conversational sentences from the conversational text data and determining a word vector sequence corresponding to each conversational sentence includes:
for each dialogue sentence extracted from the dialogue text data, performing word segmentation on each dialogue sentence, and determining a plurality of words corresponding to each dialogue sentence;
converting a plurality of words in each dialogue statement into corresponding word vectors respectively according to the word and word vector mapping table;
and determining a word vector sequence corresponding to each dialogue statement according to the plurality of word vectors corresponding to each dialogue statement.
In a possible implementation, the clustering method further includes establishing the word-to-word vector mapping table according to the following steps:
acquiring field text data of the field to which the target product belongs;
generating word vectors corresponding to a plurality of words according to the field text data and a trained word vector generation model;
and establishing a word and word vector mapping table according to the mutual corresponding relation between each word and the corresponding word vector.
In a possible implementation, the clustering method further includes calculating the time series distance metric between any two word vector sequences according to the following steps:
calculating a plurality of Euclidean distances between each word vector in the first word vector sequence and each word vector in the second word vector sequence; the any two word vector sequences comprise the first word vector sequence and the second word vector sequence;
and determining a time series distance metric value between any two word vector sequences according to the Euclidean distances.
In a possible implementation manner, the clustering the plurality of conversational sentences based on the similarity to determine a conversational sentence under each sentence type of a plurality of sentence types includes:
and clustering the plurality of dialogue sentences according to the similarity and a preset clustering algorithm, and determining the dialogue sentences in each sentence type in the plurality of sentence types.
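The application does not name the preset clustering algorithm. As a minimal sketch only, assuming a precomputed symmetric distance matrix between dialogue sentences and a hypothetical distance threshold, the grouping could be illustrated with single-linkage threshold clustering. The function name `cluster_by_threshold` and the `threshold` parameter are assumptions introduced for illustration and do not come from the application.

```python
from typing import List


def cluster_by_threshold(dist: List[List[float]], threshold: float) -> List[List[int]]:
    """Group sentence indices whose pairwise distance is <= threshold.

    Single-linkage grouping via union-find; an illustrative stand-in for the
    unspecified "preset clustering algorithm".
    """
    n = len(dist)
    parent = list(range(n))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri  # merge the two groups

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy distances: sentences 0 and 1 are close, sentence 2 is far from both.
toy = [[0.0, 0.2, 5.0],
       [0.2, 0.0, 4.8],
       [5.0, 4.8, 0.0]]
groups = cluster_by_threshold(toy, threshold=1.0)
```

Each resulting group would correspond to one sentence type; any algorithm that accepts precomputed pairwise distances could be substituted.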
In a possible implementation manner, after the clustering the plurality of conversational sentences based on the similarity and determining the conversational sentences under each sentence type of a plurality of sentence types, the clustering method further includes:
and analyzing the conversation sentences in each sentence type in the plurality of sentence types to determine the sales characteristics of the salesman and the interest points of the customers.
In a second aspect, an embodiment of the present application further provides a clustering device for conversational sentences, where the clustering device includes:
a first determining module, configured to acquire dialogue recording data about a target product between a salesperson and a customer, and determine dialogue text data corresponding to the dialogue recording data;
the second determining module is used for extracting a plurality of dialogue sentences from the dialogue text data and determining a word vector sequence corresponding to each dialogue sentence; the word vector sequence comprises word vectors corresponding to a plurality of words contained in the dialogue statement;
a third determining module, configured to determine a similarity between any two conversational sentences in the multiple conversational sentences according to a time series distance metric between any two word vector sequences in the multiple word vector sequences;
and the fourth determining module is used for clustering the plurality of dialogue sentences based on the similarity and determining the dialogue sentences in each sentence type in the plurality of sentence types.
In a possible implementation manner, the first determining module is configured to determine dialog text data corresponding to the dialog sound recording data according to the following steps:
and inputting the dialogue recording data into a trained voice recognition model, and determining dialogue text data corresponding to the dialogue recording data.
In one possible implementation, the second determining module includes:
a first determining unit, configured to perform word segmentation on each dialogue sentence extracted from the dialogue text data, and determine a plurality of words corresponding to each dialogue sentence;
the conversion unit is used for respectively converting a plurality of words in each dialogue statement into corresponding word vectors according to the word and word vector mapping table;
and the second determining unit is used for determining a word vector sequence corresponding to each dialogue statement according to the plurality of word vectors corresponding to each dialogue statement.
In a possible implementation, the clustering device further includes an establishing module; the establishing module is used for establishing the word and word vector mapping table according to the following steps:
acquiring field text data of the field to which the target product belongs;
generating word vectors corresponding to a plurality of words according to the field text data and a trained word vector generation model;
and establishing a word and word vector mapping table according to the mutual corresponding relation between each word and the corresponding word vector.
In one possible implementation, the third determining module includes:
the calculation unit is used for calculating a plurality of Euclidean distances between each word vector in the first word vector sequence and each word vector in the second word vector sequence; the any two word vector sequences comprise the first word vector sequence and the second word vector sequence;
and a third determining unit, configured to determine a time series distance metric value between any two word vector sequences according to the euclidean distances.
In a possible implementation, the fourth determining module is configured to determine the conversational sentence in each of a plurality of sentence types according to the following steps:
and clustering the plurality of dialogue sentences according to the similarity and a preset clustering algorithm, and determining the dialogue sentences in each sentence type in the plurality of sentence types.
In a possible implementation, the clustering device further includes:
and the analysis module is used for analyzing the conversation sentences in each sentence type in the plurality of sentence types to determine the sales characteristics of the salesman and the interest points of the customers.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor, when the electronic device runs, the processor and the memory communicate with each other through the bus, and the machine-readable instructions are executed by the processor to perform the steps of the method for clustering conversational utterances according to the first aspect or any one of the possible embodiments of the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for clustering conversational utterances as described in the first aspect or any one of the possible implementations of the first aspect is performed.
In the embodiments of the present application, dialogue text data is determined from acquired dialogue recording data about a target product between a salesperson and a customer. Word vector sequences are then determined for a plurality of dialogue sentences extracted from the dialogue text data, and the similarity between any two dialogue sentences is determined according to the time series distance metric between their word vector sequences. The dialogue sentences are then clustered based on the similarity to determine the dialogue sentences under different sentence types. In this way, dialogue sentences of different sentence types can be clustered from the dialogue recording data without manual summarization, and salespeople can be trained on the dialogue sentences of the different sentence types, which can improve their sales level.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flowchart of a method for clustering conversational sentences according to an embodiment of the present application;
FIG. 2 is a flowchart of another method for clustering conversational sentences according to an embodiment of the present application;
FIG. 3 is a functional block diagram of a clustering device for conversational sentences according to an embodiment of the present application;
FIG. 4 is a functional block diagram of the second determining module of FIG. 3;
FIG. 5 is a second functional block diagram of a clustering device for conversational sentences according to an embodiment of the present application;
FIG. 6 is a functional block diagram of the third determining module of FIG. 3;
FIG. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Description of the main element symbols:
In the figures: 300 - clustering device for conversational sentences; 310 - first determining module; 320 - second determining module; 322 - first determining unit; 324 - conversion unit; 326 - second determining unit; 330 - third determining module; 332 - calculation unit; 334 - third determining unit; 340 - fourth determining module; 350 - establishing module; 360 - analysis module; 700 - electronic device; 710 - processor; 720 - memory; 730 - bus.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. It should be understood that the drawings in the present application are for illustration and description only and are not intended to limit the scope of protection of the present application. Additionally, it should be understood that the schematic drawings are not necessarily drawn to scale. The flowcharts used in this application illustrate operations implemented according to some embodiments of the present application. It should be understood that the operations of the flowcharts may be performed out of order, and that steps with no logical dependency on one another may be performed in reverse order or concurrently. One skilled in the art, under the guidance of this application, may add one or more other operations to a flowchart, or remove one or more operations from it.
In addition, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
To enable those skilled in the art to utilize the present disclosure, the following embodiments are presented in conjunction with a specific application scenario "clustering of dialog statements," and it will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and application scenarios without departing from the spirit and scope of the present disclosure.
The method, apparatus, electronic device or computer-readable storage medium described in the embodiments of the present application may be applied to any scenario in which conversational sentences need to be clustered, and the embodiments of the present application do not limit a specific application scenario, and any scheme using the method and apparatus for clustering conversational sentences provided in the embodiments of the present application is within the scope of protection of the present application.
It is worth noting that, before the present application, business experts listened to sales recordings and summarized sales dialogue sentences based on their own experience in order to train sales personnel, but this approach incurs high labor and material costs.
In view of the above problem, in the embodiments of the present application, dialogue text data is determined from acquired dialogue recording data about a target product between a salesperson and a customer. Word vector sequences are then determined for a plurality of dialogue sentences extracted from the dialogue text data, and the similarity between any two dialogue sentences is determined according to the time series distance metric between their word vector sequences. The dialogue sentences are then clustered based on the similarity to determine the dialogue sentences under different sentence types. In this way, dialogue sentences of different sentence types can be clustered from the dialogue recording data without manual summarization, and salespeople can be trained on the dialogue sentences of the different sentence types, which can improve their sales level.
For the convenience of understanding of the present application, the technical solutions provided in the present application will be described in detail below with reference to specific embodiments.
Fig. 1 is a flowchart of a method for clustering conversational sentences according to an embodiment of the present disclosure. As shown in fig. 1, the method for clustering conversational sentences provided in the embodiment of the present application includes the following steps:
S101: acquiring dialogue recording data about a target product between a salesperson and a customer, and determining dialogue text data corresponding to the dialogue recording data.
In a specific implementation, a recording device may be used to record the conversation between the salesperson and the customer, so that dialogue recording data about the target product can be acquired from the recording device. The recording data is acquired so that dialogue sentences related to selling the target product can be extracted from it; to that end, the dialogue recording data may be converted into dialogue text data.
Further, the dialogue text data corresponding to the dialogue recording data is determined according to the following step: inputting the dialogue recording data into a trained speech recognition model, and determining the dialogue text data corresponding to the dialogue recording data.
In a specific implementation, the dialogue recording data about the target product between the salesperson and the customer can be input into a trained speech recognition model to obtain the dialogue text data corresponding to the dialogue recording data.
It should be noted that the speech recognition model relies on automatic speech recognition (ASR) technology to convert the dialogue recording data into the corresponding dialogue text data. Speech recognition aims to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences. It differs from speaker recognition and speaker verification, which attempt to recognize or verify the speaker who produced the speech rather than the lexical content it contains. Common approaches to speech recognition include methods based on linguistics and acoustics, stochastic modeling methods, methods using artificial neural networks, and probabilistic parsing.
Here, the dialogue text data may be presented in the form of a salesperson and customer dialogue, for example:
salesperson: text sentence 1;
customer: text sentence 2;
salesperson: text sentence 3;
......
customer: text sentence n.
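As a minimal sketch, assuming the dialogue text data is stored one turn per line in a "speaker: sentence" layout like the one above, the dialogue sentences could be separated from their speaker labels as follows. The function name `parse_dialogue` and the exact line format are assumptions made for illustration.

```python
from typing import List, Tuple


def parse_dialogue(text: str) -> List[Tuple[str, str]]:
    """Split speaker-labelled dialogue text into (speaker, sentence) pairs."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a speaker label
        speaker, sentence = line.split(":", 1)
        # Drop surrounding whitespace and the trailing list separator.
        turns.append((speaker.strip(), sentence.strip().rstrip(";")))
    return turns


sample = """salesperson: text sentence 1;
customer: text sentence 2;
salesperson: text sentence 3;"""
turns = parse_dialogue(sample)
sentences = [sentence for _, sentence in turns]
```

Keeping the speaker label alongside each sentence makes it possible to cluster salesperson sentences and customer sentences separately if desired.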
S102: extracting a plurality of dialogue sentences from the dialogue text data, and determining a word vector sequence corresponding to each dialogue sentence; the word vector sequence comprises word vectors corresponding to a plurality of words contained in the dialogue sentence.
In a specific implementation, after converting dialogue recording data about a target product between a salesman and a customer into corresponding dialogue text data, extracting a plurality of dialogue sentences from the dialogue text data, and further determining a word vector sequence corresponding to each dialogue sentence for each dialogue sentence in the extracted plurality of dialogue sentences, wherein the word vector sequence corresponding to each dialogue sentence comprises word vectors corresponding to a plurality of words contained in the dialogue sentence.
Further, the step S102 of extracting a plurality of dialogue sentences from the dialogue text data and determining a word vector sequence corresponding to each dialogue sentence includes the following steps:
Step a1: for each dialogue sentence extracted from the dialogue text data, performing word segmentation on the dialogue sentence and determining the plurality of words it contains.
In a specific implementation, to determine the word vector sequence corresponding to each of the plurality of dialogue sentences extracted from the dialogue text data, each dialogue sentence is first segmented into words, which yields the plurality of words corresponding to that dialogue sentence.
Step a2: converting the plurality of words in each dialogue sentence into corresponding word vectors according to the word and word vector mapping table.
In a specific implementation, for each of the words in a dialogue sentence, the corresponding word vector is looked up in the word and word vector mapping table, which yields the word vectors for all the words in the dialogue sentence. Here, the word and word vector mapping table stores a plurality of words and a plurality of word vectors, with each word stored in association with its corresponding word vector.
Step a3: determining the word vector sequence corresponding to each dialogue sentence according to the plurality of word vectors corresponding to that dialogue sentence.
In a specific implementation, the word vectors corresponding to the words of a dialogue sentence are concatenated in the order in which the words appear in the sentence, which yields the word vector sequence corresponding to that dialogue sentence.
In one example, a dialogue sentence asking about the price of car A is segmented into six words, "index / of / car A / is / what / price". The word vector corresponding to the word "index" is x1, the word vector corresponding to the word "of" is x2, the word vector corresponding to the word "car A" is x3, the word vector corresponding to the word "is" is x4, the word vector corresponding to the word "what" is x5, and the word vector corresponding to the word "price" is x6. The word vector sequence corresponding to the dialogue sentence is then (x1, x2, x3, x4, x5, x6).
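Steps a2 and a3 can be sketched as a table lookup that preserves word order. The following is a minimal illustration, assuming the words have already been produced by a separate segmentation step and that the mapping table is an in-memory dictionary; the name `to_word_vector_sequence`, the toy two-dimensional vectors, and the choice to skip words missing from the table are all assumptions, not details from the application.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def to_word_vector_sequence(
    words: Sequence[str], table: Dict[str, Vector]
) -> List[Vector]:
    """Look up each word in order; words missing from the table are skipped."""
    return [table[w] for w in words if w in table]


# Hypothetical 2-dimensional vectors; real word vectors have many more dimensions.
table = {
    "car A": [0.9, 0.1],
    "what": [0.2, 0.7],
    "price": [0.4, 0.8],
}
words = ["car A", "what", "price"]  # output of a separate word segmentation step
sequence = to_word_vector_sequence(words, table)
```

The resulting list keeps one vector per word in sentence order, which is the form the time series distance metric operates on.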
Further, the word and word vector mapping table is established according to the following steps:
Step b1: acquiring field text data of the field to which the target product belongs.
In a specific implementation, before dialogue sentences related to the target product are extracted, a word and word vector mapping table for the field to which the target product belongs may be established. To establish the table, field text data of the field to which the target product belongs must first be acquired; specifically, it may be acquired from channels such as microblogs, product reviews, and forums.
Step b2: generating word vectors corresponding to a plurality of words according to the field text data and the trained word vector generation model.
In a specific implementation, after the field text data of the field to which the target product belongs is acquired, the field text data is segmented to obtain the plurality of words it contains, and these words are input into the trained word vector generation model to generate the word vector corresponding to each word. Here, the word vector generation model is a model that generates word vectors.
It should be noted that the word vector generation model may be a Word2vec model. Word2vec is a family of models for generating word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words: the network takes a word as input and must predict the words at adjacent positions, and under the bag-of-words assumption in Word2vec the order of the words is unimportant. After training, the Word2vec model can be used to map each word to a word vector, which can be used to represent the relationships between words.
Step b3: establishing the word and word vector mapping table according to the correspondence between each word and its word vector.
In a specific implementation, after the word vectors of the words contained in the field text data are determined, each word is stored in association with its corresponding word vector; the word and word vector mapping table is thus established according to the correspondence between each word and its word vector.
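Steps b1 to b3 can be sketched as follows, under the assumption that the word segmenter and the trained word vector generation model are supplied as callables. The names `build_mapping_table`, `toy_segment`, and `toy_embed` are hypothetical, and the toy components below merely stand in for a real segmenter and a trained model such as Word2vec; they are not trained on anything.

```python
from typing import Callable, Dict, Iterable, List

Vector = List[float]


def build_mapping_table(
    domain_texts: Iterable[str],
    segment: Callable[[str], List[str]],
    embed: Callable[[str], Vector],
) -> Dict[str, Vector]:
    """Segment each text and store one vector per distinct word."""
    table: Dict[str, Vector] = {}
    for text in domain_texts:
        for word in segment(text):
            if word not in table:
                table[word] = embed(word)
    return table


# Placeholder components, for illustration only.
def toy_segment(text: str) -> List[str]:
    return text.split()  # whitespace split stands in for a real word segmenter


def toy_embed(word: str) -> Vector:
    return [float(len(word)), float(word.count("a"))]  # not a trained model


table = build_mapping_table(["car price", "car warranty"], toy_segment, toy_embed)
```

Passing the segmenter and the model in as parameters keeps the table-building logic independent of whichever word vector generation model is actually used.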
S103: determining the similarity between any two dialogue sentences in the plurality of dialogue sentences according to the time series distance metric between any two word vector sequences in the plurality of word vector sequences.
In a specific implementation, after the plurality of dialogue sentences are extracted from the dialogue text data and their word vector sequences are determined, for any two of the dialogue sentences, the time series distance metric between their two word vector sequences can be determined and used as the similarity between the two dialogue sentences.
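As a minimal sketch of step S103 at the level of the whole sentence set, the metric can be evaluated once for every unordered pair of word vector sequences and stored in a symmetric matrix. The function `pairwise_distance_matrix` and its `seq_distance` parameter are assumptions for illustration; the placeholder metric `length_gap` used in the demo is deliberately trivial and is not the time series distance metric itself.

```python
from typing import Callable, List, Sequence

Vector = List[float]
WordVectorSequence = List[Vector]


def pairwise_distance_matrix(
    sequences: Sequence[WordVectorSequence],
    seq_distance: Callable[[WordVectorSequence, WordVectorSequence], float],
) -> List[List[float]]:
    """Fill a symmetric matrix with the distance for every pair of sequences."""
    n = len(sequences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = seq_distance(sequences[i], sequences[j])
            matrix[i][j] = d
            matrix[j][i] = d  # the metric is assumed symmetric
    return matrix


# Placeholder metric for the demo only: difference in sequence length.
def length_gap(a: WordVectorSequence, b: WordVectorSequence) -> float:
    return float(abs(len(a) - len(b)))


seqs = [[[0.0]], [[0.0], [1.0]], [[0.0], [1.0], [2.0]]]
matrix = pairwise_distance_matrix(seqs, length_gap)
```

In practice `seq_distance` would be the time series distance metric, and the resulting matrix is the input to the clustering step.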
Here, it should be noted that the time series distance metric is computed by dynamic time warping (DTW), a method for measuring the similarity between two time series whose lengths may differ. A time series is a common representation of data, and comparing the similarity of two sequences is a common task in time series processing. The two time series to be compared may not have equal lengths. In speech recognition, for example, different people speak at different speeds, and because speech signals are considerably random, even the same person pronouncing the same sound at different moments may not produce exactly the same duration; the different phonemes within a single word are also pronounced at different speeds. Under these conditions, the traditional Euclidean distance alone cannot effectively measure the distance between the two time series.
Further, a time series distance metric between any two word vector sequences is calculated according to the following steps:
Step c1: calculating a plurality of Euclidean distances between each word vector in a first word vector sequence and each word vector in a second word vector sequence; the any two word vector sequences comprise the first word vector sequence and the second word vector sequence.
In a specific implementation, when calculating the time series distance metric value between any two word vector sequences, the Euclidean distance between each word vector in the first word vector sequence and each word vector in the second word vector sequence is calculated first.
It should be noted that the Euclidean distance, also called the Euclidean metric, is a commonly used definition of distance. It refers to the true distance between two points in m-dimensional space, or the natural length of a vector (i.e., the distance from the point to the origin); in two-dimensional and three-dimensional space, the Euclidean distance is the actual distance between two points.
In one example, given two word vectors x = (x1, x2, …, xn) and y = (y1, y2, …, yn), the Euclidean distance between word vector x and word vector y is equal to
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
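The pairwise Euclidean distances between two word vector sequences can be sketched as follows. The function names `euclidean` and `distance_matrix` are hypothetical, and the sketch assumes that all word vectors share the same dimension.

```python
import math
from typing import List, Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two word vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("word vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(
    q: Sequence[Sequence[float]], c: Sequence[Sequence[float]]
) -> List[List[float]]:
    """Entry [i][j] is the distance between q[i] and c[j]."""
    return [[euclidean(qi, cj) for cj in c] for qi in q]
```

The two sequences may have different lengths; only the dimension of the individual word vectors has to agree.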
Step c2: determining the time series distance metric value between the any two word vector sequences according to the plurality of Euclidean distances.
In a specific implementation, the time series distance metric value between any two word vector sequences may be determined according to the plurality of Euclidean distances between each word vector of the first word vector sequence and each word vector of the second word vector sequence.
In one example, the first word vector sequence is Q = (q1, q2, …, qi, …, qn) and the second word vector sequence is C = (c1, c2, …, cj, …, cm). To calculate the distance between them, an n × m matrix grid is first constructed, in which matrix element (i, j) holds the Euclidean distance d(qi, cj) between qi and cj (i.e., the Euclidean distance between each point of sequence Q and each point of sequence C) and represents the alignment of points qi and cj. The matching of the two sequences Q and C is required to proceed from (0, 0) to (n, m) without ever moving backward, that is, the indices i and j never decrease. Many complete matching paths satisfy this requirement. For each matching path, the distances of all points on the path are accumulated, and the minimum cumulative distance over all matching paths is selected. This minimum cumulative distance is the time series distance metric value between the first word vector sequence and the second word vector sequence, i.e., the similarity between the first word vector sequence Q and the second word vector sequence C.
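The minimum cumulative distance over all matching paths can be computed with dynamic programming rather than by enumerating the paths. The following is a minimal sketch of the standard dynamic time warping recurrence, not code from the application; the function names are hypothetical, and the step pattern (one step in i, in j, or in both) is an assumption, since the text only requires that the indices never decrease.

```python
import math
from typing import Sequence


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two word vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def dtw_distance(
    q: Sequence[Sequence[float]], c: Sequence[Sequence[float]]
) -> float:
    """Minimum cumulative distance over all monotonic alignments of q and c."""
    n, m = len(q), len(c)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # acc[i][j]: minimum cumulative distance aligning q[:i] with c[:j].
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = euclidean(q[i - 1], c[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]
```

Note that the value returned is a distance: a smaller value means the two dialogue sentences are more alike, so any clustering step that consumes it should treat it accordingly.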
S104: clustering the plurality of dialogue sentences based on the similarity, and determining the dialogue sentences under each sentence type of a plurality of sentence types.
In a specific implementation, after the similarity between any two of the plurality of dialogue sentences is determined, the plurality of dialogue sentences may be clustered according to these similarities to determine the dialogue sentences under each of the plurality of sentence types. That is, the dialogue sentences are divided into different sentence types, and each sentence type corresponds to at least 2 different dialogue sentences.
Further, in step S104, clustering the plurality of conversational sentences based on the similarity, and determining a conversational sentence in each of a plurality of sentence types, includes the following steps:
and clustering the plurality of dialogue sentences according to the similarity and a preset clustering algorithm, and determining the dialogue sentences in each sentence type in the plurality of sentence types.
In a specific implementation, the plurality of dialogue sentences may be clustered according to the similarity between any two of the dialogue sentences and a preset clustering algorithm, so as to determine the dialogue sentences under each of the plurality of sentence types. Here, dialogue sentences belonging to the same sentence type have a high similarity to one another.
It should be noted that the clustering algorithm may be the K-means clustering algorithm, an iteratively solved cluster analysis algorithm. Its steps are: randomly select K objects as the initial cluster centers; calculate the distance between each object and each cluster center; and assign each object to the cluster center nearest to it. A cluster center together with the objects assigned to it represents one cluster. Each time a sample is assigned, the center of the cluster is recalculated from the objects currently in that cluster. This process is repeated until a termination condition is met. The termination condition may be that no (or a minimum number of) objects are reassigned to different clusters, that no (or a minimum number of) cluster centers change, or that the sum of squared errors reaches a local minimum. In this way, the plurality of dialogue sentences are clustered, and the dialogue sentences under each of the plurality of sentence types are determined.
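K-means recalculates each cluster center as a mean, which is not directly defined for word vector sequences of different lengths that are compared only through a pairwise distance. As one hedged illustration, the sketch below swaps in a k-medoids variant, where each center is an actual dialogue sentence; this is a substituted technique for illustration, not the algorithm stated in the application. The function name `k_medoids`, the fixed initial medoids (used instead of random selection so the run is repeatable), and the toy distance matrix are all assumptions.

```python
from typing import List, Sequence


def k_medoids(
    dist: Sequence[Sequence[float]],
    initial_medoids: Sequence[int],
    max_iter: int = 100,
) -> List[int]:
    """Cluster items from a precomputed pairwise distance matrix.

    Returns, for each item, the index of the cluster it belongs to.
    """
    n = len(dist)
    medoids = list(initial_medoids)
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: each item joins the cluster of its nearest medoid.
        labels = [
            min(range(len(medoids)), key=lambda k: dist[i][medoids[k]])
            for i in range(n)
        ]
        # Update step: the new medoid of a cluster is the member with the
        # smallest total distance to the other members.
        new_medoids = []
        for k in range(len(medoids)):
            members = [i for i in range(n) if labels[i] == k]
            if not members:
                new_medoids.append(medoids[k])  # keep an empty cluster's medoid
                continue
            new_medoids.append(
                min(members, key=lambda i: sum(dist[i][j] for j in members))
            )
        if new_medoids == medoids:  # termination: no cluster center changed
            break
        medoids = new_medoids
    return labels


# Toy distances: items 0 and 1 are close, items 2 and 3 are close.
dist = [
    [0, 1, 9, 10],
    [1, 0, 8, 9],
    [9, 8, 0, 1],
    [10, 9, 1, 0],
]
labels = k_medoids(dist, initial_medoids=[0, 1])  # -> [0, 0, 1, 1]
```

The termination test mirrors one of the conditions named above (no cluster center changes); the distance matrix would in practice hold the time series distance metric values between the word vector sequences.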
In the embodiment of the application, the dialogue text data of the dialogue recording data can be determined through the obtained dialogue recording data about the target product between the salesman and the client, further, word vector sequences corresponding to a plurality of dialogue sentences extracted from the dialogue text data are determined, the similarity between any two dialogue sentences can be determined according to the time sequence distance metric value between any two word vector sequences in the word vector sequences, and further, the dialogue sentences under different sentence types can be determined by clustering the dialogue sentences based on the similarity. Based on the mode, the dialogue sentences of different sentence types can be clustered from the dialogue recording data without manual summarization, and then the training of salesmen can be realized according to the dialogue sentences of different sentence types, so that the sales level of the salesmen can be improved.
Fig. 2 is a flowchart of another method for clustering conversational sentences according to an embodiment of the present disclosure. As shown in fig. 2, the method for clustering conversational sentences provided in the embodiment of the present application includes the following steps:
S201: obtaining dialogue recording data about a target product between a salesman and a client, and determining dialogue text data corresponding to the dialogue recording data.
S202: extracting a plurality of dialogue sentences from the dialogue text data, and determining a word vector sequence corresponding to each dialogue sentence; the word vector sequence comprises word vectors corresponding to a plurality of words contained in the dialogue sentence.
S203: determining the similarity between any two dialogue sentences in the plurality of dialogue sentences according to the time series distance metric value between any two word vector sequences in the plurality of word vector sequences.
S204: clustering the plurality of dialogue sentences based on the similarity, and determining the dialogue sentences under each sentence type of a plurality of sentence types.
For S201 to S204, refer to the descriptions of S101 to S104; the same technical effects can be achieved, and the details are not repeated here.
S205: analyzing the dialogue sentences under each sentence type of the plurality of sentence types to determine the sales characteristics of the salesman and the interest points of the customer.
In the specific implementation, after the conversation sentence in each sentence type of the plurality of sentence types is determined, the conversation sentence in each sentence type can be analyzed, the sales characteristics of the salesman and the interest points of the customers can be determined, and the salesman can be assisted in analyzing the sales condition, so that the sales level of the salesman is improved.
It should be noted that, with the help of a large amount of field text data, the method and system automatically extract dialogue sentences of different sentence types from the dialogue recording data. This can help the salesperson quickly identify the customer's points of attention (i.e., the interest points) and determine the salesperson's own sales characteristics, which can improve training efficiency and the salesperson's sales level.
In the embodiment of the application, the dialogue text data of the dialogue recording data can be determined through the obtained dialogue recording data about the target product between the salesman and the client, further, word vector sequences corresponding to a plurality of dialogue sentences extracted from the dialogue text data are determined, the similarity between any two dialogue sentences can be determined according to the time sequence distance metric value between any two word vector sequences in the word vector sequences, and further, the dialogue sentences under different sentence types can be determined by clustering the dialogue sentences based on the similarity. Based on the mode, the dialogue sentences of different sentence types can be clustered from the dialogue recording data without manual summarization, and then the training of salesmen can be realized according to the dialogue sentences of different sentence types, so that the sales level of the salesmen can be improved.
Based on the same application concept, the embodiment of the present application further provides a device for clustering dialog sentences corresponding to the method for clustering dialog sentences provided in the above embodiment, and because the principle of solving the problems of the device in the embodiment of the present application is similar to the method for clustering dialog sentences provided in the above embodiment of the present application, the implementation of the device can refer to the implementation of the method, and repeated details are omitted.
As shown in fig. 3 to 6, fig. 3 is a first functional block diagram of a clustering apparatus 300 for dialogue sentences according to an embodiment of the present application, fig. 4 is a functional block diagram of the second determining module 320 in fig. 3, fig. 5 is a second functional block diagram of the clustering apparatus 300 for dialogue sentences according to an embodiment of the present application, and fig. 6 is a functional block diagram of the third determining module 330 in fig. 3.
As shown in fig. 3, the apparatus 300 for clustering conversational sentences includes:
the first determining module 310 is configured to obtain conversation recording data about a target product between a salesman and a customer, and determine conversation text data corresponding to the conversation recording data;
a second determining module 320, configured to extract a plurality of dialog sentences from the dialog text data, and determine a word vector sequence corresponding to each dialog sentence; the word vector sequence comprises word vectors corresponding to a plurality of words contained in the dialogue statement;
a third determining module 330, configured to determine a similarity between any two conversational sentences in the multiple conversational sentences according to a time series distance metric between any two word vector sequences in the multiple word vector sequences;
a fourth determining module 340, configured to cluster the multiple dialog statements based on the similarity, and determine a dialog statement in each of multiple statement types.
In a possible implementation manner, as shown in fig. 3, the first determining module 310 is configured to determine the dialog text data corresponding to the dialog recording data according to the following steps:
and inputting the dialogue recording data into a trained voice recognition model, and determining dialogue text data corresponding to the dialogue recording data.
In one possible implementation, as shown in fig. 4, the second determining module 320 includes:
a first determining unit 322, configured to perform word segmentation on each dialogue sentence extracted from the dialogue text data, and determine a plurality of words corresponding to each dialogue sentence;
a conversion unit 324, configured to convert, according to the word-word vector mapping table, a plurality of words in each dialog statement into corresponding word vectors respectively;
the second determining unit 326 is configured to determine a word vector sequence corresponding to each conversational sentence according to the plurality of word vectors corresponding to each conversational sentence.
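The three units above can be sketched as a single function that segments a dialogue sentence, looks up each word in the mapping table, and returns the resulting word vector sequence. The name `sentence_to_vector_sequence` is hypothetical. The default segmenter splits on whitespace purely as a stand-in (a real system for Chinese text would plug in a proper word segmenter), and skipping words that are missing from the table is an assumption, since the text does not say how such words are handled.

```python
from typing import Callable, Dict, List, Tuple

Vector = Tuple[float, ...]


def sentence_to_vector_sequence(
    sentence: str,
    table: Dict[str, Vector],
    segment: Callable[[str], List[str]] = str.split,
) -> List[Vector]:
    """Convert one dialogue sentence into its word vector sequence."""
    words = segment(sentence)  # word segmentation (first determining unit)
    # Mapping-table lookup (conversion unit); unknown words are skipped,
    # which is an assumption of this sketch.
    return [table[word] for word in words if word in table]
```

Because the output length follows the number of words in the sentence, different sentences yield sequences of different lengths, which is why a length-tolerant distance is used in the next module.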
In a possible implementation manner, as shown in fig. 5, the clustering device 300 for conversational sentences further includes an establishing module 350; the establishing module 350 is configured to establish the word and word vector mapping table according to the following steps:
acquiring field text data of the field to which the target product belongs;
generating word vectors corresponding to a plurality of words according to the field text data and a trained word vector generation model;
and establishing a word and word vector mapping table according to the mutual corresponding relation between each word and the corresponding word vector.
In one possible implementation, as shown in fig. 6, the third determining module 330 includes:
a calculating unit 332, configured to calculate a plurality of euclidean distances between each word vector in the first word vector sequence and each word vector in the second word vector sequence; the any two word vector sequences comprise the first word vector sequence and the second word vector sequence;
a third determining unit 334, configured to determine a time-series distance metric value between any two word vector sequences according to the euclidean distances.
In one possible implementation, as shown in fig. 3, the fourth determining module 340 is configured to determine the conversational sentence under each of a plurality of sentence types according to the following steps:
and clustering the plurality of dialogue sentences according to the similarity and a preset clustering algorithm, and determining the dialogue sentences in each sentence type in the plurality of sentence types.
In a possible implementation manner, as shown in fig. 5, the apparatus 300 for clustering conversational sentences further includes:
the analysis module 360 is configured to analyze the conversational sentence in each of the plurality of sentence types to determine the sales characteristic of the salesman and the interest point of the customer.
In this embodiment of the application, through the obtained dialogue recording data about the target product between the salesman and the customer, the dialogue text data of the dialogue recording data may be determined by the first determining module 310, further, the word vector sequences corresponding to a plurality of dialogue sentences extracted from the dialogue text data are determined by the second determining module 320, and according to the time sequence distance metric between any two word vector sequences in the word vector sequences, the similarity between any two dialogue sentences may be determined by the third determining module 330, further, based on the similarity, the plurality of dialogue sentences are clustered, and the dialogue sentences in different sentence types may be determined by the fourth determining module 340. Based on the mode, the dialogue sentences of different sentence types can be clustered from the dialogue recording data without manual summarization, and then the training of salesmen can be realized according to the dialogue sentences of different sentence types, so that the sales level of the salesmen can be improved.
Based on the same application concept, referring to fig. 7, a schematic structural diagram of an electronic device 700 provided in the embodiment of the present application includes: a processor 710, a memory 720 and a bus 730, wherein the memory 720 stores machine-readable instructions executable by the processor 710, when the electronic device 700 is operated, the processor 710 communicates with the memory 720 through the bus 730, and the machine-readable instructions are executed by the processor 710 to perform the steps of the method for clustering conversational sentences as in any one of the above embodiments.
In particular, the machine readable instructions, when executed by the processor 710, may perform the following:
acquiring dialogue recording data of a target product between a salesman and a client, and determining dialogue text data corresponding to the dialogue recording data;
extracting a plurality of dialogue sentences from the dialogue text data, and determining a word vector sequence corresponding to each dialogue sentence; the word vector sequence comprises word vectors corresponding to a plurality of words contained in the dialogue statement;
determining the similarity between any two dialogue sentences in the plurality of dialogue sentences according to the time sequence distance metric value between any two word vector sequences in the plurality of word vector sequences;
and clustering the plurality of dialogue sentences based on the similarity, and determining the dialogue sentences in each sentence type in the plurality of sentence types.
In the embodiment of the application, the dialogue text data of the dialogue recording data can be determined through the obtained dialogue recording data about the target product between the salesman and the client, further, word vector sequences corresponding to a plurality of dialogue sentences extracted from the dialogue text data are determined, the similarity between any two dialogue sentences can be determined according to the time sequence distance metric value between any two word vector sequences in the word vector sequences, and further, the dialogue sentences under different sentence types can be determined by clustering the dialogue sentences based on the similarity. Based on the mode, the dialogue sentences of different sentence types can be clustered from the dialogue recording data without manual summarization, and then the training of salesmen can be realized according to the dialogue sentences of different sentence types, so that the sales level of the salesmen can be improved.
Based on the same application concept, embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for clustering dialog statements provided in the foregoing embodiments are performed.
Specifically, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is executed, the clustering method for dialogue sentences described above can be performed, so that salespersons can be trained according to the dialogue sentences under different sentence types, and the sales level of the salespersons can be improved.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed in the present application shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A clustering method of conversational sentences, the clustering method comprising:
acquiring dialogue recording data of a target product between a salesman and a client, and determining dialogue text data corresponding to the dialogue recording data;
extracting a plurality of dialogue sentences from the dialogue text data, and determining a word vector sequence corresponding to each dialogue sentence; the word vector sequence comprises word vectors corresponding to a plurality of words contained in the dialogue statement;
determining the similarity between any two dialogue sentences in the plurality of dialogue sentences according to the time sequence distance metric value between any two word vector sequences in the plurality of word vector sequences;
and clustering the plurality of dialogue sentences based on the similarity, and determining the dialogue sentences in each sentence type in the plurality of sentence types.
2. The clustering method according to claim 1, further comprising determining dialogue text data corresponding to the dialogue record data according to the following steps:
and inputting the dialogue recording data into a trained voice recognition model, and determining dialogue text data corresponding to the dialogue recording data.
3. The clustering method according to claim 1, wherein the extracting a plurality of conversational sentences from the conversational text data and determining a word vector sequence corresponding to each conversational sentence comprises:
for each dialogue sentence extracted from the dialogue text data, performing word segmentation on each dialogue sentence, and determining a plurality of words corresponding to each dialogue sentence;
converting a plurality of words in each dialogue statement into corresponding word vectors respectively according to the word and word vector mapping table;
and determining a word vector sequence corresponding to each dialogue statement according to the plurality of word vectors corresponding to each dialogue statement.
4. The clustering method according to claim 3, further comprising building the word to word vector mapping table according to the following steps:
acquiring field text data of the field to which the target product belongs;
generating word vectors corresponding to a plurality of words according to the field text data and a trained word vector generation model;
and establishing a word and word vector mapping table according to the mutual corresponding relation between each word and the corresponding word vector.
5. The clustering method according to claim 1, further comprising calculating a time-series distance metric value between any two word vector sequences according to the following steps:
calculating a plurality of Euclidean distances between each word vector in the first word vector sequence and each word vector in the second word vector sequence; the any two word vector sequences comprise the first word vector sequence and the second word vector sequence;
and determining a time series distance metric value between any two word vector sequences according to the Euclidean distances.
6. The clustering method according to claim 1, wherein the clustering the plurality of conversational sentences based on the similarity to determine the conversational sentences under each sentence type of a plurality of sentence types comprises:
and clustering the plurality of dialogue sentences according to the similarity and a preset clustering algorithm, and determining the dialogue sentences in each sentence type in the plurality of sentence types.
7. The clustering method according to claim 1, wherein after clustering the plurality of conversational utterances based on the similarity and determining the conversational utterances under each of a plurality of utterance types, the clustering method further comprises:
and analyzing the conversation sentences in each sentence type in the plurality of sentence types to determine the sales characteristics of the salesman and the interest points of the customers.
8. An apparatus for clustering conversational utterances, the apparatus comprising:
a first determining module, configured to obtain dialogue recording data about a target product between a salesman and a client, and determine dialogue text data corresponding to the dialogue recording data;
a second determining module, configured to extract a plurality of dialogue sentences from the dialogue text data, and determine a word vector sequence corresponding to each dialogue sentence; the word vector sequence comprises word vectors corresponding to a plurality of words contained in the dialogue sentence;
a third determining module, configured to determine the similarity between any two dialogue sentences in the plurality of dialogue sentences according to the time series distance metric value between any two word vector sequences in the plurality of word vector sequences;
and a fourth determining module, configured to cluster the plurality of dialogue sentences based on the similarity, and determine the dialogue sentences under each sentence type of a plurality of sentence types.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when an electronic device is operated, the machine-readable instructions being executed by the processor to perform the steps of the method for clustering conversational utterances according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, performs the steps of the method for clustering conversational utterances as recited in any one of claims 1 to 7.
CN202010081669.3A 2020-02-06 2020-02-06 Clustering method and device for conversation sentences, electronic equipment and storage medium Pending CN111309905A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010081669.3A CN111309905A (en) 2020-02-06 2020-02-06 Clustering method and device for conversation sentences, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010081669.3A CN111309905A (en) 2020-02-06 2020-02-06 Clustering method and device for conversation sentences, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111309905A true CN111309905A (en) 2020-06-19

Family

ID=71144975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010081669.3A Pending CN111309905A (en) 2020-02-06 2020-02-06 Clustering method and device for conversation sentences, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111309905A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133308A (en) * 2020-09-17 2020-12-25 中国建设银行股份有限公司 Method and device for multi-label classification of voice recognition text
CN112562678A (en) * 2020-11-26 2021-03-26 携程计算机技术(上海)有限公司 Intelligent dialogue method, system, equipment and storage medium based on customer service recording
CN112651787A (en) * 2021-01-06 2021-04-13 上海明略人工智能(集团)有限公司 Payment method and device for orally-played advertisement
CN112765976A (en) * 2020-12-30 2021-05-07 北京知因智慧科技有限公司 Text similarity calculation method, device and equipment and storage medium
CN113268577A (en) * 2021-06-04 2021-08-17 厦门快商通科技股份有限公司 Training data processing method and device based on dialogue relation and readable medium
CN114155024A (en) * 2021-11-30 2022-03-08 北京京东振世信息技术有限公司 Method, device, equipment and medium for determining target object
CN114429134A (en) * 2021-11-25 2022-05-03 北京容联易通信息技术有限公司 Hierarchical high-quality speech mining method and device based on multivariate semantic representation
CN114692574A (en) * 2022-04-19 2022-07-01 中国银行股份有限公司 News conversion method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107992596A (en) * 2017-12-12 2018-05-04 百度在线网络技术(北京)有限公司 A kind of Text Clustering Method, device, server and storage medium
CN109101479A (en) * 2018-06-07 2018-12-28 苏宁易购集团股份有限公司 A kind of clustering method and device for Chinese sentence
US20190147853A1 (en) * 2017-11-15 2019-05-16 International Business Machines Corporation Quantized dialog language model for dialog systems
CN110704586A (en) * 2019-09-30 2020-01-17 支付宝(杭州)信息技术有限公司 Information processing method and system

Similar Documents

Publication Publication Date Title
CN111309905A (en) Clustering method and device for conversation sentences, electronic equipment and storage medium
CN106683680B (en) Speaker recognition method and device, computer equipment and computer readable medium
CN109493850B (en) Growing type dialogue device
CN110033760B (en) Modeling method, device and equipment for speech recognition
CN109887497B (en) Modeling method, device and equipment for speech recognition
CN108388674B (en) Method and device for pushing information
CN111368085A (en) Recognition method and device of conversation intention, electronic equipment and storage medium
CN103714048B (en) Method and system for correcting text
CN104598644B (en) Favorite label mining method and device
CN109686383B (en) Voice analysis method, device and storage medium
CN109192225B (en) Method and device for recognizing and marking speech emotion
CN111613212A (en) Speech recognition method, system, electronic device and storage medium
CN108108347B (en) Dialogue mode analysis system and method
CN108304387B (en) Method, device, server group and storage medium for recognizing noise words in text
CN113435196B (en) Intention recognition method, device, equipment and storage medium
CN113807103B (en) Recruitment method, device, equipment and storage medium based on artificial intelligence
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN113990352B (en) User emotion recognition and prediction method, device, equipment and storage medium
CN111462761A (en) Voiceprint data generation method and device, computer device and storage medium
CN110222331A (en) Lie recognition methods and device, storage medium, computer equipment
CN113420556A (en) Multi-mode signal based emotion recognition method, device, equipment and storage medium
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN111274390A (en) Emotional reason determining method and device based on dialogue data
CN113903361A (en) Speech quality detection method, device, equipment and storage medium based on artificial intelligence
CN113609865A (en) Text emotion recognition method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination