CN110489531B - Method and device for determining high-frequency problem - Google Patents

Info

Publication number
CN110489531B
Authority
CN
China
Prior art keywords
sentence
phrase
sentences
category
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810448748.6A
Other languages
Chinese (zh)
Other versions
CN110489531A (en
Inventor
李凤麟
郭依昆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201810448748.6A
Publication of CN110489531A
Application granted
Publication of CN110489531B

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for determining a high-frequency problem. The method comprises the following steps: acquiring a question set in a period of time, wherein the question set comprises a plurality of sentences asked by at least one user; extracting keywords of each sentence in the question set to obtain a plurality of phrases; establishing an index of each phrase and its corresponding sentences, and clustering the sentences indexed by each phrase to obtain a plurality of clustering results; and determining the high-frequency problem in the question set according to the phrases and the clustering results. The method solves the technical problem of poor clustering effect in the process of determining the high-frequency problem.

Description

Method and device for determining high-frequency problem
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method and a device for determining a high-frequency problem.
Background
Intelligent assistant and conversational robot products are widely used in many fields. These intelligent question-answering systems face an important task: to keep improving their problem-solving capability, the unresolved questions need to be analyzed, the high-frequency problems extracted, and the main scenarios that the robot fails to handle identified.
High-frequency problem analysis serves internal artificial intelligence trainers, helping them find the high-frequency unresolved problems in a given category (industry). It also serves shops, helping each shop learn its own high-frequency unresolved problems so that it can configure the corresponding knowledge and raise the robot's resolution rate. As another example, during a merchandise promotion campaign, users often focus on a few hot topics, and service personnel need to know what users currently care about most, regarding the campaign or the merchandise, so that they can respond accordingly. In these scenarios, discovering high-frequency problems is of great importance.
Similarity calculation and clustering are key techniques for discovering high-frequency problems. Similarity calculation includes the traditional N-gram (character/word) based TF-IDF (Term Frequency-Inverse Document Frequency) similarity and, more recently, similarity based on sentence vectors (a semantic space). For clustering, there are many classical algorithms, such as K-means, density-based clustering, and hierarchical clustering.
However, in industrial practice, combining similarity calculation with clustering faces a difficult trade-off. If cluster purity is maximized, a very large number of clusters is easily produced; for example, millions of data items often yield hundreds of thousands of clusters after precise clustering. If the number of clusters is minimized, the clustering result is often poor: the data within each output cluster are mixed and do not share the same or a similar meaning.
An alternative method is to extract the high-frequency keywords in the sentence set as indexes and then cluster the sentences corresponding to each keyword. The problem with this approach is that a single keyword carries too little information to be unambiguous.
Another fast clustering method is grouping based on keyword sets. For each sentence in the sentence set, the few most important keywords are obtained by means of TF-IDF, TextRank, or a deep-learning attention mechanism, and these keywords are then used as the key of a group. Each sentence in the sentence set is traversed, and sentences with the same key are placed in the same class. For example, if the sentences "hello, I want to look up the logistics information, how do I do that" and "how do I look up the logistics information?" yield the same keyword set (e.g., "how", "look-up", "logistics information"), they can be put into the same class. Note that normalization may also be required when extracting keywords; for example, different words that both mean "how" may be normalized to "how". One disadvantage of this approach is that, after grouping, further clustering across groups is still required; the computation is relatively complex and time-consuming, and it is not suitable for near-real-time settings (e.g., where the high-frequency problems of the preceding 15 or 30 minutes must be viewable at any moment).
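The keyword-set grouping described above can be sketched as follows. This is a minimal illustration, not the prior-art implementation itself: the keywords are assumed to have been extracted already, and the `NORMALIZE` table and the function name `group_by_keyword_set` are hypothetical.

```python
from collections import defaultdict

# Hypothetical normalization table: variant keyword -> canonical keyword.
NORMALIZE = {"check": "look-up"}


def group_by_keyword_set(sentence_keywords):
    """Group sentences whose normalized keyword sets are identical.

    sentence_keywords maps each sentence to its extracted keywords.
    Returns a dict mapping frozenset(keywords) -> list of sentences.
    """
    groups = defaultdict(list)
    for sentence, keywords in sentence_keywords.items():
        # The normalized, order-independent keyword set is the group key.
        key = frozenset(NORMALIZE.get(word, word) for word in keywords)
        groups[key].append(sentence)
    return dict(groups)


# Keywords are assumed to come from TF-IDF, TextRank, or similar.
extracted = {
    "hello, I want to look up the logistics information, how do I do that":
        ["how", "look-up", "logistics information"],
    "how do I check the logistics information?":
        ["how", "check", "logistics information"],
    "where can I get coupons?":
        ["where", "get", "coupons"],
}
groups = group_by_keyword_set(extracted)
```

Here the first two sentences share one key after normalization and land in the same group, while the coupon question forms its own group.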
Hierarchical clustering can be used for precise clustering, but its biggest problems are that computation is too slow and parallelization is costly (framework support similar to a parameter server is needed, and performance was still found to be a bottleneck in practical testing).
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining a high-frequency problem, which are used for at least solving the technical problem of poor clustering effect in the process of determining the high-frequency problem.
According to an aspect of an embodiment of the present invention, there is provided a method for determining a high frequency problem, including: acquiring a question set in a period of time, wherein the question set comprises a plurality of sentences of at least one user question; extracting keywords of each sentence in the question set to obtain a plurality of phrases; establishing an index of each phrase and a sentence corresponding to the phrase, and clustering sentences indexed by each phrase to obtain a plurality of clustering results; and determining the high-frequency problem in the question set according to the phrases and the clustering results.
According to another aspect of the embodiment of the present invention, there is also provided a device for determining a high frequency problem, including: an acquisition unit, configured to acquire a question set in a period of time, wherein the question set comprises a plurality of sentences asked by at least one user; an extraction unit, configured to extract the keywords of each sentence in the question set to obtain a plurality of phrases; an indexing unit, configured to establish an index of each phrase and its corresponding sentences, and to cluster the sentences indexed by each phrase to obtain a plurality of clustering results; and a determining unit, configured to determine the high-frequency problems in the question set according to the phrases and the clustering results.
According to another aspect of the embodiment of the present invention, there is also provided a processor, configured to execute a program, where the program executes the method for determining a high frequency problem according to any one of the above-mentioned embodiments.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein the program performs the method of determining a high frequency problem of any one of the above.
In the embodiment of the invention, phrases are extracted from sentences and the sentences indexed by each phrase are then clustered. Specifically, a question set in a period of time is acquired, wherein the question set comprises a plurality of sentences asked by at least one user; keywords of each sentence in the question set are extracted to obtain a plurality of phrases; an index of each phrase and its corresponding sentences is established, and the sentences indexed by each phrase are clustered to obtain a plurality of clustering results; and the high-frequency problems in the question set are determined according to the phrases and the clustering results. This improves both the accuracy of the clustering results and the clustering speed, achieves the technical effect of improving user experience, and thereby solves the technical problem of poor clustering effect in the process of determining high-frequency problems.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of a method of determining a high frequency problem according to an embodiment of the present invention;
FIG. 2 is a flow chart of an alternative method of determining a high frequency problem according to an embodiment of the present invention;
FIG. 3 is a flow chart of an alternative method of determining a high frequency problem according to an embodiment of the present invention;
FIG. 4 is a flow chart of an alternative method of determining a high frequency problem according to an embodiment of the present invention;
FIG. 5 is a flow chart of an alternative method of determining a high frequency problem according to an embodiment of the present invention;
FIG. 6 is a flow chart of an alternative method of determining a high frequency problem according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a high frequency problem determination apparatus according to an embodiment of the present invention;
FIG. 8 is a block diagram of a hardware structure of a computer terminal according to an embodiment of the present invention; and
FIG. 9 is a block diagram of an alternative computer terminal according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms appearing in the description of the embodiments of the present application are explained as follows:
High-frequency problem: for a certain product (set) or service (set) within a certain time period, a similar or identical problem raised repeatedly or at high frequency by the user group. High-frequency problems are generally expressed as sentences.
Sentence: a complete expression, which may be a question posed by a user. For example, if the user asks "where can I get coupons?", the sentence is: where can I get coupons?
Sentence vector: a fixed-dimension vector representation into which a sentence is encoded in some way; the Euclidean distance between two such vectors approximates the degree of similarity of the corresponding sentences.
Example 1
The present embodiment provides a method for determining a high-frequency problem, which helps an information supply platform focus on the problems users care about most, that is, the high-frequency problems, and then make corresponding decisions. For example, when an e-commerce seller sells products, customer service receives a large number of user questions about commodity information. The question sentences received in a period of time are acquired, and their keywords are extracted to obtain a plurality of phrases. An index of each phrase and its corresponding sentences is established, and the sentences indexed by each phrase are clustered to obtain a plurality of clustering results. The high-frequency problems in the question set are then determined according to the plurality of phrases and the plurality of clustering results, so that the problems users care about most can be identified quickly, the corresponding answer information can be configured, and user experience is improved.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Specifically, the present application provides a method of determining a high frequency problem as shown in fig. 1. Fig. 1 is a flowchart of a method of determining a high frequency problem according to an embodiment of the present invention.
Step S102, acquiring a question set in a period of time, wherein the question set comprises a plurality of sentences which are asked by at least one user.
When shopping online, a consumer usually asks about the quality, specifications, usage precautions, and similar aspects of a commodity when the displayed commodity details do not meet the consumer's individual needs. During a promotional activity, consumers also ask customer service about any activity rules they are unsure of, so customer service receives a large volume of inquiries. To better answer users' questions and better understand their actual needs, the questions posed by users over a period of time need to be collected.
For example, on April 1 the seller of a Taobao store launches a one-month end-of-spring sales promotion. The seller wants to know what questions users have about the promotion rules and the promoted commodities, in order to answer those questions better, adjust the promotion rules, and add to the displayed commodity information. The seller therefore collects the questions received by customer service over the past month, together with the questions raised by users in comments, to form a question set.
Step S104, extracting keywords of each sentence in the question set to obtain a plurality of phrases;
It should be noted that the acquired question set contains a large number of questions posed by users. Because different users ask about different things, and choose different words and phrasings even when describing the same problem, the questions are highly varied. To obtain the high-frequency problems accurately and efficiently, so as to better resolve buyers' questions, help configure the corresponding knowledge, and inform the next activity decision, the keywords of each sentence in the question set are extracted and combined to obtain a plurality of phrases.
For example, when a commodity goes on a price-reduction promotion, some buyers have purchased it before the promotion began and have not yet received it when the promotion starts. These buyers may ask customer service "how to refund the price difference", "how can the price difference be refunded", and "please refund the price difference". These user questions form a question set; the keywords "refund" and "price difference" are extracted from each sentence in the question set and combined to obtain the phrase "refund price difference".
Step S106, establishing indexes of each phrase and the corresponding sentence, and clustering the sentences indexed by each phrase to obtain a plurality of clustering results;
A phrase may be extracted from multiple sentences. So that the corresponding sentences can be found through the phrase, an index of each phrase and its corresponding sentences is established. For example, keywords are extracted from the sentences "can the price difference be refunded", "please refund the price difference", "refund the price difference", and "how to refund the price difference" and combined to obtain the phrase "refund price difference"; an index of the phrase "refund price difference" and its corresponding sentences is established, so that the phrase indexes those sentences.
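The phrase-to-sentence index can be pictured as an inverted index. The sketch below is one possible reading, assuming each sentence already comes with its extracted phrases; the function name `build_phrase_index` and the `#`-joined phrase strings are illustrative choices, not part of the described method.

```python
from collections import defaultdict


def build_phrase_index(sentence_phrases):
    """Map each phrase to the list of sentences it was extracted from.

    sentence_phrases is an iterable of (sentence, [phrases]) pairs.
    """
    index = defaultdict(list)
    for sentence, phrases in sentence_phrases:
        for phrase in phrases:
            # Avoid listing the same sentence twice under one phrase.
            if sentence not in index[phrase]:
                index[phrase].append(sentence)
    return dict(index)


# Phrases are assumed to have been extracted in step S104.
question_set = [
    ("can the price difference be refunded", ["refund#price difference"]),
    ("please refund the price difference", ["refund#price difference"]),
    ("refund the price difference", ["refund#price difference"]),
    ("how to refund the price difference", ["refund#price difference"]),
    ("where can I get coupons", ["coupons#get"]),
]
index = build_phrase_index(question_set)
```

Looking up `index["refund#price difference"]` then returns the four refund-related sentences, which are the input to the per-phrase clustering.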
It should be noted that, to better find the class of problems corresponding to each phrase, after the index of each phrase and its corresponding sentences is established, the sentences indexed by each phrase are clustered to obtain a plurality of clustering results.
For example, the phrase "refund price difference" indexes the sentences "can the price difference be refunded", "please refund the price difference", "refund the price difference", and "how to refund the price difference". Clustering these sentences yields a plurality of clustering results: one is a request to refund the price difference, and the other asks how to carry out the refund.
For another example, the phrase "Taobao shop opening" indexes the sentences "how to open a shop", "how to operate Taobao shop opening", and "handling scheme for Taobao shop-opening violations". Clustering these sentences yields a plurality of clustering results: one concerns how to open a shop, and the other concerns the handling of violations. Different answer schemes are provided for different clustering results.
Because there are multiple phrases obtained according to the sentences in the question set, there may be relevance between different phrases, in order to further improve the clustering effect, optionally, after extracting the keywords of each sentence in the question set to obtain multiple phrases, the method further includes: clustering the phrases to obtain phrase categories; establishing an index of each phrase and the corresponding sentence, clustering the sentences indexed by each phrase, and obtaining a plurality of clustering results comprises the following steps: establishing indexes of each phrase category and the corresponding sentence, and clustering the sentences indexed by each phrase category to obtain a plurality of clustering results.
It should be noted that, extracting the keywords of each sentence in the question set, and the obtained multiple phrases may include phrases with similar meaning, i.e. synonymous phrases. In order to quickly acquire the high-frequency problem, a representative expression needs to be selected for the synonymous phrases, specifically, a plurality of phrases are clustered, and different phrase categories are obtained.
For example, the phrases "Taobao" and "Alibao" have similar meanings and may be clustered into the same phrase category. The phrase that occurs most frequently within a class of synonymous phrases may be selected as the representative expression of that class: if, among the obtained phrases, "Taobao" occurs more often than "Alibao", that is, users mention the former more frequently, "Taobao" is selected as the representative phrase of the class.
An index of the phrase category "Taobao shop opening" and its corresponding sentences is then established. For example, the phrase category indexes the sentences "how to open a shop on Taobao", "how to open a shop on Alibao", "handling scheme for Taobao shop-opening violations", and "handling scheme for Alibao shop-opening violations". Clustering the indexed sentences may yield two clustering results: one asks how to open a shop, and the other asks how violations are handled.
According to the scheme, the extracted phrases are clustered, the phrases with similar meanings are clustered into one type, and a plurality of phrase categories are obtained, so that corresponding sentence clustering results can be searched through the phrase categories, and the clustering effect is improved.
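Choosing a representative for each phrase category can be sketched as below. The text does not say how synonymous phrases are detected, so the synonym groups are simply assumed to be given; the counts and the function name `representative_phrases` are hypothetical.

```python
def representative_phrases(phrase_counts, synonym_groups):
    """Map every phrase to the most frequent phrase in its synonym group.

    phrase_counts maps phrase -> number of occurrences in the question set.
    synonym_groups is a list of lists of phrases with similar meaning
    (assumed to come from a separate phrase-clustering step).
    """
    mapping = {}
    for group in synonym_groups:
        # The most frequently mentioned phrase represents the category.
        representative = max(group, key=lambda phrase: phrase_counts.get(phrase, 0))
        for phrase in group:
            mapping[phrase] = representative
    return mapping


# Hypothetical counts for illustration only.
phrase_counts = {"Taobao": 40, "Alibao": 15, "coupons": 9}
synonym_groups = [["Taobao", "Alibao"], ["coupons"]]
mapping = representative_phrases(phrase_counts, synonym_groups)
```

With this mapping, sentences extracted under either synonym can be indexed under the single representative phrase of the category.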
In general, each phrase indexes a plurality of sentences, when the sentences indexed by each phrase are directly clustered, the clustering speed is low, and optionally, before the sentences indexed by each phrase are clustered to obtain a plurality of clustering results, the method further comprises: clustering each sentence in the question set to obtain a plurality of sentence categories; clustering sentences indexed by each phrase to obtain a plurality of clustering results, wherein the clustering results comprise: marking sentences indexed by each phrase according to a plurality of sentence categories to obtain categories corresponding to each sentence indexed by each phrase, and obtaining a plurality of clustering results.
For example, 10000 sentences are in the question set, 200 phrases are extracted, and if question subsets corresponding to the 200 phrases are clustered directly, the clustering speed is slower. Optionally, 10000 sentences are clustered in advance to obtain the category to which each sentence belongs, and the sentences indexed by each phrase are marked based on the category to which each sentence belongs, so that the sentences indexed by each phrase can be clustered rapidly to obtain a plurality of clustering results.
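The speed-up comes from clustering the whole question set once and then only looking up labels per phrase. A minimal sketch, assuming the per-sentence categories have already been computed; the category numbers and the function name `label_indexed_sentences` are hypothetical.

```python
from collections import defaultdict


def label_indexed_sentences(phrase_index, sentence_category):
    """Split the sentences under each phrase by their precomputed category.

    phrase_index maps phrase -> list of sentences.
    sentence_category maps sentence -> category obtained in advance.
    Returns phrase -> {category: [sentences]}.
    """
    results = {}
    for phrase, sentences in phrase_index.items():
        clusters = defaultdict(list)
        for sentence in sentences:
            # A dictionary lookup replaces a per-phrase clustering run.
            clusters[sentence_category[sentence]].append(sentence)
        results[phrase] = dict(clusters)
    return results


# Hypothetical precomputed categories for three sentences.
sentence_category = {
    "can the price difference be refunded": 7,
    "please refund the price difference": 7,
    "how to refund the price difference": 12,
}
phrase_index = {"refund#price difference": list(sentence_category)}
clusters = label_indexed_sentences(phrase_index, sentence_category)
```

Each phrase then costs one pass over its indexed sentences, regardless of how many phrases there are.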
Fig. 2 is a flowchart of an alternative method for determining a high frequency problem according to an embodiment of the present invention, and fig. 2 shows a flow of clustering each sentence in a question set to obtain a plurality of sentence categories in the above disclosed technical solution. As shown in fig. 2, the method specifically includes the following steps:
Step S202, obtaining a sentence vector of each sentence in the question set.
A sentence vector is obtained for each sentence in the sentence set through word segmentation, lookup of pre-trained fastText word vectors, summing and averaging of the word vectors, and the like.
For example, for the sentence "how to open a shop on Taobao", the sentence vector [0.1,0.3,0.2,0.6,0.7,0.2,0.1,0.8] is obtained.
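Averaging word vectors into a sentence vector can be sketched as follows. The toy two-dimensional vectors stand in for pre-trained word vectors and are invented for illustration; skipping out-of-vocabulary words is an assumption, since the text does not say how they are handled.

```python
def sentence_vector(words, word_vectors):
    """Average the vectors of the known words in a segmented sentence."""
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    # Component-wise sum divided by the number of contributing words.
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy vectors (hypothetical); real ones would be loaded from a pre-trained model.
toy_vectors = {"how": [0.0, 1.0], "open": [1.0, 0.0], "shop": [0.5, 0.5]}
vec = sentence_vector(["how", "open", "shop", "Taobao"], toy_vectors)
```

Because every sentence maps to a vector of the same dimension, the later segmentation into M sub-vectors is well defined.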
In step S204, the sentence vector of each sentence is divided into M segments, so as to obtain M segment sub-vectors, where M is a natural number greater than 1.
For example, with M = 2, each sentence vector is divided into 2 segments, giving 2 sub-vectors per sentence vector. The sentence vector [0.1,0.3,0.2,0.6,0.7,0.2,0.1,0.8] is divided into [0.1,0.3,0.2,0.6] and [0.7,0.2,0.1,0.8], with [0.1,0.3,0.2,0.6] taken as the first segment of the sentence and [0.7,0.2,0.1,0.8] as the second segment. The same operation is performed on the sentence vectors of all sentences in the question set to obtain a first segment set and a second segment set.
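The segmentation of step S204 can be sketched as below, assuming equal-length segments (the text does not state how a dimension that is not divisible by M would be handled). The function name `split_into_segments` is illustrative.

```python
def split_into_segments(vector, m):
    """Split a sentence vector into m equal-length sub-vectors."""
    if m <= 1:
        raise ValueError("m must be a natural number greater than 1")
    if len(vector) % m != 0:
        raise ValueError("vector length must be divisible by m")
    size = len(vector) // m
    return [vector[i * size:(i + 1) * size] for i in range(m)]


segments = split_into_segments([0.1, 0.3, 0.2, 0.6, 0.7, 0.2, 0.1, 0.8], 2)
# segments == [[0.1, 0.3, 0.2, 0.6], [0.7, 0.2, 0.1, 0.8]]
```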
Step S206, obtaining a category corresponding to each segment of sub-vector in the M segments of sub-vectors of each sentence through a pre-training category set, wherein the pre-training category set comprises: the class to which each segment of subvector of all statements corresponds.
For example, for each segment, K = 32 categories are obtained by K-means clustering, so the number of possible categories for the whole sentence set is 32 × 32 = 1024.
Step S208, splicing or accumulating the categories corresponding to the sub-vectors of each segment to obtain the category to which each sentence belongs.
For example, the category of the first segment and the category of the second segment of each sentence vector are spliced or accumulated to obtain the category to which the sentence belongs.
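Two ways of combining the per-segment categories into one sentence category are sketched below. Both are interpretations: "splicing" is read as string concatenation of the segment categories, and "accumulating" is read as a base-K positional sum so that distinct combinations stay distinct. The function names are hypothetical.

```python
def splice_codes(codes):
    """Concatenate per-segment category numbers into one label."""
    return "-".join(str(code) for code in codes)


def accumulate_codes(codes, k):
    """Combine per-segment category numbers into a single integer.

    Each code must lie in range(k); the result lies in range(k ** len(codes)).
    """
    category = 0
    for code in codes:
        if not 0 <= code < k:
            raise ValueError("segment category out of range")
        category = category * k + code
    return category


# First segment in category 3, second segment in category 17, K = 32.
spliced = splice_codes([3, 17])
accumulated = accumulate_codes([3, 17], 32)
```

With K = 32 and two segments, the accumulated value ranges over 0 to 1023, matching the 1024 possible sentence categories.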
Step S210, the category to which all sentences in the question set belong is taken as a plurality of sentence categories.
Through the steps, the categories obtained through pre-training are used for clustering new data, each sentence in the question set is clustered, a plurality of sentence categories are obtained, and the clustering speed is improved.
Fig. 3 is a flowchart of an alternative method for determining a high frequency problem according to an embodiment of the present invention, and fig. 3 shows a flow of acquiring a category corresponding to each of M-segment subvectors of each sentence through a pre-training category set in the technical solution disclosed in step S206. As shown in fig. 3, the method specifically includes the following steps:
Step S302, calculating the Euclidean distance between each sub-vector in the M segments of sub-vectors and each sub-vector in the pre-training category set.
Step S304, determining, in the pre-training category set, the sub-vector with the minimum Euclidean distance to the sub-vector in the M segments of sub-vectors, to obtain a target sub-vector.
Step S306, taking the category to which the target sub-vector belongs as the category corresponding to the sub-vector in the M segments of sub-vectors.
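Steps S302 to S306 amount to a nearest-neighbour lookup. The sketch below assumes the pre-training category set for one segment is stored as (reference sub-vector, category) pairs, for example cluster centers; that representation and the sample values are assumptions.

```python
import math


def assign_category(sub_vector, pretrained):
    """Return the category of the pre-trained sub-vector closest to sub_vector.

    pretrained is a list of (reference_sub_vector, category) pairs.
    """
    # S302/S304: the pair with the minimum Euclidean distance is the target.
    _, category = min(pretrained, key=lambda item: math.dist(sub_vector, item[0]))
    # S306: the target's category becomes the category of sub_vector.
    return category


# Hypothetical pre-trained references for the first segment.
pretrained_first_segment = [
    ([0.0, 0.0, 0.0, 0.0], 0),
    ([0.1, 0.3, 0.2, 0.5], 1),
    ([0.9, 0.9, 0.9, 0.9], 2),
]
category = assign_category([0.1, 0.3, 0.2, 0.6], pretrained_first_segment)
```

New sentences are therefore labeled without re-running the clustering, which is where the speed gain of the pre-trained categories comes from.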
Step S108, determining high-frequency problems in the question set according to the phrases and the clustering results.
It should be noted that the phrase categories can be displayed on an interface. When a worker selects a phrase category, the sentence categories clustered under that phrase category are retrieved through the index, and the high-frequency problem is then determined by analyzing the sentence category with the highest sentence frequency.
Fig. 4 is a flowchart of an alternative method for determining a high-frequency question according to an embodiment of the present invention, and fig. 4 shows a flow of determining a high-frequency question in a question set according to a plurality of phrases and a plurality of clustering results in the technical solution disclosed in step S108. As shown in fig. 4, the method specifically includes the following steps:
Step S402, counting the frequency with which each phrase occurs in the sentences of the question set, and taking the phrase with the highest frequency as the target phrase.
For example, in a clothing sales promotion, a question set is acquired, and keywords are extracted and combined to obtain a plurality of phrases together with the frequency with which each phrase occurs in the sentences of the question set. As shown in Table 1, the phrase "height # weight" occurs 124 times, "click # link" 114 times, "refund # price difference" 99 times, "whole net # communication" 93 times, "size # introduction" 70 times, "view # video file" 68 times, and "link # view" 68 times; a phrase in brackets is a synonymous phrase of the phrase before it. The phrase with the highest frequency, "height # weight", is taken as the target phrase.
TABLE 1
Phrase (synonymous phrase) | Frequency
height # weight | 124
click # link | 114
refund # price difference (refund # price difference) | 99
whole net # communication | 93
size # introduction (size # profile) | 70
view # video file | 68
link # view | 68
Step S404, determining the frequency of occurrence of all sentences included in each category under the target phrase.
It should be noted that the category mentioned in the present invention is the category to which each sentence belongs. For example, the sentence "with my height and a weight of 137, which size should I buy" indexed under the target phrase belongs to a category under "height # weight".
For example, the sentences under "height # weight" belong to the categories C1, C2, and C3. The frequency of occurrence of all sentences under C1 is 28, under C2 is 13, and under C3 is 24.
Step S406, sorting the categories according to the frequency of occurrence of all sentences included in each category from high to low.
For example, the frequency of occurrence of all sentences under C1 is 28, the frequency of occurrence of all sentences under C2 is 13, the frequency of occurrence of all sentences under C3 is 24, and the classification is sorted from high to low to obtain C1, C3 and C2.
In step S408, the top K categories are set as target categories, and the high-frequency problem is determined according to the target categories.
For example, K is 1, C1 arranged in the first 1 is set as a target category, that is, category C1 under the height # weight is set as a target category, and the high-frequency problem posed by the user is determined according to C1.
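Steps S402 to S408 can be sketched with the figures from the examples above. The function name `top_k_categories` is illustrative, and the phrase frequencies are a subset of Table 1.

```python
def top_k_categories(category_frequencies, k):
    """Return the k categories with the highest total sentence frequency."""
    ranked = sorted(category_frequencies.items(), key=lambda kv: kv[1], reverse=True)
    return [category for category, _ in ranked[:k]]


# S402: the most frequent phrase becomes the target phrase.
phrase_frequencies = {"height#weight": 124, "click#link": 114, "view#video file": 68}
target_phrase = max(phrase_frequencies, key=phrase_frequencies.get)

# S404: total frequency of the sentences in each category under the target phrase.
category_frequencies = {"C1": 28, "C2": 13, "C3": 24}

# S406/S408: sort from high to low and keep the top K categories.
ranking = top_k_categories(category_frequencies, 3)
target_categories = top_k_categories(category_frequencies, 1)
```

With K = 1 the single target category is C1, and its sentences are the candidates for the high-frequency problem.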
For example, clicking the row of the target phrase "height # weight" displays the cluster summary information and the cluster detail information of the sentences indexed by the target phrase. As shown in Table 2, the first column gives the number of the category, C1, under the target phrase "height # weight"; the second column gives the summary information of category C1, for which a representative sentence of the category is selected, so that the representative sentence can be seen at a glance; and the third column gives the sum of the frequencies of all sentences in the category, which is 28.
TABLE 2
[Image of Table 2: columns are category number, summary information, and total frequency]
Table 3 shows the cluster detailed information of the sentences indexed by the target phrase, that is, all the sentences included in the sentence category. The first column is each sentence, the second column is the frequency of occurrence of the sentence, and the third column is the number of the category to which the sentence belongs; all of the sentences belong to category C1. The sentences "height [n3] weight 50", "height [n3] weight 80", "height [n3] weight [n3] proper waist size" and "height [n3] weight 49" have the highest frequency, each occurring 4 times. The category corresponding to these sentences is taken as the target category, and the high-frequency problem is then determined according to the target category.
TABLE 3

| Sentence | Frequency | Category |
| --- | --- | --- |
| Height [n3] weight [n3] jin | 3 | C1 |
| Height [n3] weight [n3] how big to wear | 3 | C1 |
| Height [n3] weight [n3] | 3 | C1 |
| My height [n3] weight 81, what code to buy | 3 | C1 |
| Height [n3] weight 50 | 4 | C1 |
| Height [n3] weight 80 | 4 | C1 |
| Height [n3] weight [n3] proper waist size | 4 | C1 |
| Height [n3] weight 49 | 4 | C1 |
Fig. 5 is a flowchart of an alternative method for determining a high frequency problem according to an embodiment of the present invention, and fig. 5 shows a process of extracting keywords of each sentence in a question set to obtain a plurality of phrases in the technical solution disclosed in step S104. The method specifically comprises the following steps as shown in fig. 5:
Step S502, performing word segmentation on each sentence to obtain a plurality of words.
For example, for the sentence "me height 175 weight 81 buy what code" in the sentence set, word segmentation yields the words "me", "height", "175", "weight", "81", "buy", "what" and "code".
Step S504, filtering the plurality of words according to a first preset condition to obtain filtered words.
For example, the first preset condition is that verbs, nouns, adjectives and adverbs are retained. The words "me", "height", "175", "weight", "81", "buy", "what" and "code" are filtered according to the first preset condition to obtain the filtered words "height", "weight", "buy" and "code".
Step S506, the filtered words are combined according to a second preset condition to obtain candidate phrases corresponding to each sentence.
For example, the second preset condition is: a window of size D=2 is set to the right of each word, and the word is paired in turn with each word in the window to form D two-word phrases (double keywords). The candidate phrases that may be formed by combining the words "height", "weight", "buy" and "code" according to the second preset condition include "height#weight", "height#buy", "height#code", "weight#buy", "weight#code" and "buy#code".
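Steps S502 to S506 can be sketched in Python as below. The sketch assumes the sentence has already been segmented and part-of-speech tagged by an external tool; the tag names (`n`, `v`, `a`, `d`, `r`, `m`) and the helper names `filter_words` and `candidate_phrases` are hypothetical. Note that with a window of D=2 the pairing yields five candidates, and "height#code", which the example above also lists, only appears when the window is widened to 3.

```python
# Assumed tag names: v = verb, n = noun, a = adjective, d = adverb.
KEPT_TAGS = {"v", "n", "a", "d"}


def filter_words(tagged_words, kept_tags=KEPT_TAGS):
    """First preset condition: keep only verbs, nouns, adjectives and adverbs."""
    return [word for word, tag in tagged_words if tag in kept_tags]


def candidate_phrases(words, window=2):
    """Second preset condition: pair each word with the words in a right-hand window."""
    phrases = []
    for i, left in enumerate(words):
        for right in words[i + 1:i + 1 + window]:
            phrases.append(f"{left}#{right}")
    return phrases


# Hypothetical tagger output for the example sentence (r = pronoun, m = numeral).
tagged = [
    ("me", "r"), ("height", "n"), ("175", "m"), ("weight", "n"),
    ("81", "m"), ("buy", "v"), ("what", "r"), ("code", "n"),
]

filtered = filter_words(tagged)
pairs_d2 = candidate_phrases(filtered, window=2)
pairs_d3 = candidate_phrases(filtered, window=3)
```

Making the window size a parameter keeps the sketch neutral on which value the example intended.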
Step S508, obtaining a plurality of phrases according to the candidate phrases corresponding to each sentence.
It should be noted that, for the candidate phrases obtained by combination, screening is performed according to the association degree between the words in each phrase, so as to obtain an effective phrase.
Fig. 6 is a flowchart of an alternative method for determining a high frequency problem according to an embodiment of the present invention, and fig. 6 shows a flow of obtaining a plurality of phrases according to candidate phrases corresponding to each sentence in the technical solution disclosed in step S508. As shown in fig. 6, the method specifically includes the following steps:
Step S602, judging whether the candidate phrase corresponding to each sentence meets a third preset condition.
For example, the third preset condition is a condition of a legal phrase, and since there is a candidate phrase that is not a legal phrase, it is necessary to determine whether the candidate phrase corresponding to each sentence is a legal phrase.
Step S604, the candidate phrases meeting the third preset condition are composed into a plurality of phrases.
For example, "height#buy" is not a legal phrase. Candidate phrases that are not legal phrases are filtered out, and the remaining legal candidate phrases form the plurality of phrases.
In an alternative embodiment, determining whether the candidate phrase corresponding to each sentence meets the third preset condition includes: judging whether the candidate phrase corresponding to each sentence accords with a third preset condition or not through a pre-trained judging model.
A variety of pre-trained judgment models may be used. In an alternative embodiment, the training dimensions of the pre-trained judgment model include at least one of: dependency relationship, mutual information, left and right entropy, phrase context, frequency, and class distinction value. The dependency relationship indicates whether a dependency relationship exists between two words in a corpus; the mutual information is used to check the degree of co-occurrence of two words in the corpus; the left and right entropy is used to check the degree of freedom of each word in the corpus; the phrase context indicates the relationship between each word in the corpus and the indexed sentences whose frequency meets a preset frequency; the frequency indicates the number of occurrences of a phrase in the corpus; and the class distinction value indicates the inverse text frequency index value of each word in the corpus.
Specifically, the training dimensions of the pre-trained judgment model are as follows:
Dependency relationship, i.e., whether a dependency relationship exists between two words. In the above example, an attributive relationship (ATT) exists between "height" and "weight", a subject-verb relationship (SBV) exists between "weight" and "buy", and a verb-object relationship (VOB) exists between "buy" and "code". This means that "height#weight", "weight#buy" and "buy#code" may be phrases (double keywords), while "height#buy", "height#code" and "weight#code" may not be phrases.
Mutual information is used to check the degree of co-occurrence of two words. It is calculated as $\mathrm{MI} = \frac{P(x, y)}{P(x)\,P(y)}$, where P(x) and P(y) are the probabilities that the words x and y occur independently, and P(x, y) is the probability that x and y co-occur. For example, assume that in a corpus of 1000 sentences the word x appears 100 times, the word y appears 50 times, and x and y appear together 25 times (i.e., y falls within the window of size D=2 to the right of x). The probabilities are then P(x) = 100/1000 = 0.1, P(y) = 50/1000 = 0.05 and P(x, y) = 25/1000 = 0.025, so MI = 0.025 / (0.1 × 0.05) = 5. Note that, for ease of calculation, the logarithm of the probability values is typically taken. In general, the larger the mutual information, the more likely the two words are to be combined. For example, word pairs such as "payroll#hierarchy" and "general#general" have larger mutual information.
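The mutual information calculation above can be checked with a short Python sketch. The function name `mutual_information` and the `use_log` flag are hypothetical, and the base-2 logarithm is an assumption, since the text does not state a base. The ratio is computed from raw counts, which is algebraically the same as P(x, y) / (P(x) P(y)).

```python
import math


def mutual_information(total, count_x, count_y, count_xy, use_log=False):
    """MI = P(x, y) / (P(x) * P(y)), computed directly from raw counts.

    (count_xy / total) / ((count_x / total) * (count_y / total))
    simplifies to (count_xy * total) / (count_x * count_y).
    """
    ratio = (count_xy * total) / (count_x * count_y)
    # The base-2 logarithm is an assumption; the source only says a log is taken.
    return math.log2(ratio) if use_log else ratio


# Counts from the example: 1000 sentences, x in 100, y in 50, together in 25.
mi = mutual_information(1000, 100, 50, 25)
log_mi = mutual_information(1000, 100, 50, 25, use_log=True)
```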
Left and right entropy is used to check the degree of freedom of a phrase (double keyword), i.e., the diversity of the words appearing to the left or to the right of the phrase. The degree of freedom can be defined as the entropy $H = -\sum_i P_i \log P_i$, where i traverses each neighboring word. For example, assume that in a certain corpus four words, such as "how", "what to do not want" and "where", appear to the left of "upload#picture", each occurring once. Taking the logarithm to base 2, its left entropy is $-(\tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4} + \tfrac{1}{4}\log\tfrac{1}{4}) = 2$. In general, the greater the degree of freedom of a phrase, the more likely the phrase is valid. It should be noted that an excessive degree of freedom may result in meaningless phrases such as "also #" and "talent #", and that, for a phrase, the degrees of freedom on both the left and the right need to be considered at the same time.
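The left entropy example can be reproduced with the following Python sketch. The function name `neighbor_entropy` is hypothetical, the neighbor labels `w1` to `w4` are placeholders for the four left-hand words, and the base-2 logarithm is chosen because it is the base that gives the value 2 stated in the text.

```python
import math
from collections import Counter


def neighbor_entropy(neighbor_counts):
    """Entropy -sum(P_i * log2(P_i)) over the words adjacent to a phrase."""
    total = sum(neighbor_counts.values())
    return -sum(
        (count / total) * math.log2(count / total)
        for count in neighbor_counts.values()
    )


# Four distinct words to the left of "upload#picture", each seen once.
left_neighbors = Counter({"w1": 1, "w2": 1, "w3": 1, "w4": 1})
left_entropy = neighbor_entropy(left_neighbors)

# A phrase that always follows the same word has no freedom on that side.
fixed_entropy = neighbor_entropy(Counter({"w1": 4}))
```

The same function can be applied to the right-hand neighbors to obtain the right entropy.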
The phrase context refers to the relationship between a phrase and the top three sentences it indexes. Word vectors are trained in advance using FastText, and the vector of a phrase and the vector of each sentence in which the phrase occurs are obtained by summing and averaging the word vectors. The relationship between the phrase and the sentence is then defined as the product of the two vectors in each dimension. Further, the results of the product are bucketed, i.e., only one or two decimal places are retained in each dimension.
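A minimal Python sketch of the phrase context feature is given below. It does not call FastText; the two-dimensional `toy_vectors` stand in for pre-trained word vectors and are invented for illustration, as are the function names. The sketch reads "dot product in each dimension" as an element-wise product, which is an interpretation of the text.

```python
def mean_vector(words, word_vectors):
    """Average the word vectors of the given words, dimension by dimension."""
    dims = len(next(iter(word_vectors.values())))
    return [
        sum(word_vectors[word][d] for word in words) / len(words)
        for d in range(dims)
    ]


def context_features(phrase_words, sentence_words, word_vectors, decimals=2):
    """Element-wise product of phrase and sentence vectors, bucketed by rounding."""
    phrase_vec = mean_vector(phrase_words, word_vectors)
    sentence_vec = mean_vector(sentence_words, word_vectors)
    return [round(p * s, decimals) for p, s in zip(phrase_vec, sentence_vec)]


# Toy vectors (assumed values, not real FastText output).
toy_vectors = {
    "height": [0.8, 0.2],
    "weight": [0.6, 0.4],
    "buy": [0.1, 0.9],
}

features = context_features(
    ["height", "weight"], ["height", "weight", "buy"], toy_vectors
)
```

Rounding to a fixed number of decimals turns each continuous value into one of a small number of buckets, which is one simple way to realize the bucketing described above.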
Frequency, i.e., the number of occurrences of a phrase in the corpus.
TFIDF (class distinction value), i.e., the TFIDF value of a phrase in the corpus.
Based on the above dimensions, a GBDT model (corresponding to the judgment model) is constructed, and its prediction accuracy can reach 69.5%.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
Example 2
According to an embodiment of the present invention, there is also provided an apparatus for implementing the above method for determining a high-frequency problem. As shown in fig. 7, the apparatus includes: an obtaining unit 702, an extracting unit 704, an indexing unit 706, and a determining unit 708.
An obtaining unit 702, configured to obtain a question set in a period of time, where the question set includes a plurality of sentences asked by at least one user.
An extracting unit 704, configured to extract keywords of each sentence in the question set to obtain a plurality of phrases.
An indexing unit 706, configured to establish an index of each phrase and the sentences corresponding to the phrase, and to cluster the sentences indexed by each phrase to obtain a plurality of clustering results.
A determining unit 708, configured to determine a high-frequency problem in the question set according to the plurality of phrases and the plurality of clustering results.
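The index established by the indexing unit can be pictured as a mapping from each phrase to the sentences it was extracted from. The Python sketch below is illustrative only: the function name `build_phrase_index`, the input layout, and the phrase lists assigned to the two sample sentences are assumptions, and the clustering step is omitted.

```python
from collections import defaultdict


def build_phrase_index(sentence_phrases):
    """Map each phrase to the list of sentences it was extracted from."""
    index = defaultdict(list)
    for sentence, phrases in sentence_phrases.items():
        for phrase in phrases:
            index[phrase].append(sentence)
    return dict(index)


# Hypothetical extraction results for two sentences.
sentence_phrases = {
    "height [n3] weight 50": ["height#weight"],
    "my height [n3] weight 81 what code to buy": ["height#weight", "buy#code"],
}

index = build_phrase_index(sentence_phrases)

# The phrase that indexes the most sentences is a natural candidate target phrase.
most_indexed = max(index, key=lambda phrase: len(index[phrase]))
```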
Because multiple phrases are obtained from the sentences in the question set, different phrases may be correlated with one another. In order to further increase the clustering speed, optionally, the apparatus for determining a high-frequency problem provided in the embodiment of the present invention further includes: a first clustering unit, configured to cluster the plurality of phrases, after the keywords of each sentence in the question set are extracted to obtain the plurality of phrases, to obtain a plurality of phrase categories. The indexing unit 706 is further configured to: establish an index of each phrase category and the sentences corresponding to it, and cluster the sentences indexed by each phrase category to obtain a plurality of clustering results.
In general, each phrase indexes a plurality of sentences, and clustering the sentences indexed by each phrase directly is slow. Optionally, the apparatus for determining a high-frequency problem provided in the embodiment of the present invention further includes: a second clustering unit, configured to cluster each sentence in the question set, before the sentences indexed by each phrase are clustered to obtain the plurality of clustering results, to obtain a plurality of sentence categories. The indexing unit is further configured to: mark the sentences indexed by each phrase according to the plurality of sentence categories to obtain the category corresponding to each sentence indexed by each phrase, thereby obtaining the plurality of clustering results.
Optionally, in the apparatus for determining a high-frequency problem provided in the embodiment of the present invention, the second clustering unit includes: a first obtaining module, a vector generation module, a second obtaining module, a merging module, and a construction module.
Specifically, the first obtaining module is configured to obtain a sentence vector of each sentence in the question set; the vector generation module is configured to divide the sentence vector of each sentence into M segments to obtain M segments of sub-vectors, where M is a natural number greater than 1; the second obtaining module is configured to obtain, through a pre-training category set, the category corresponding to each segment of sub-vector in the M segments of sub-vectors of each sentence, where the pre-training category set includes the category corresponding to each segment of sub-vector of all sentences; the merging module is configured to splice or accumulate the categories corresponding to each segment of sub-vector to obtain the category to which each sentence belongs; and the construction module is configured to take the categories to which all the sentences in the question set belong as the plurality of sentence categories. Through the above modules, each sentence in the question set is clustered to obtain the plurality of sentence categories.
Optionally, in the apparatus for determining a high-frequency problem provided in the embodiment of the present invention, the second obtaining module includes: a computing sub-module, a determining sub-module, and a construction sub-module.
Specifically, the computing sub-module is used for computing the Euclidean distance between each segment of sub-vector in the M segments of sub-vectors and each segment of sub-vector in the pre-training class set; the determining sub-module is used for determining a sub-vector with the minimum Euclidean distance with a sub-vector in M segments of sub-vectors in the pre-training class set to obtain a target sub-vector; and the construction sub-module is used for taking the category to which the target sub-vector belongs as the category corresponding to the sub-vector in the M-segment sub-vectors.
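The segment-wise category lookup described above can be sketched in Python as follows. This is a sketch under assumptions rather than the disclosed implementation: the pre-training category set is modeled as one dictionary per segment that maps a category label to a reference sub-vector, "splicing" is read as joining the per-segment labels with a hyphen, and all function names and sample values are hypothetical.

```python
import math


def split_vector(vector, m):
    """Divide a sentence vector into m equally sized sub-vectors."""
    if len(vector) % m != 0:
        raise ValueError("vector length must be divisible by m")
    size = len(vector) // m
    return [vector[i * size:(i + 1) * size] for i in range(m)]


def nearest_category(sub_vector, codebook):
    """Return the label whose reference sub-vector has the smallest Euclidean distance."""
    return min(codebook, key=lambda label: math.dist(sub_vector, codebook[label]))


def sentence_category(vector, codebooks):
    """Look up a category per segment, then splice the labels into one category."""
    sub_vectors = split_vector(vector, len(codebooks))
    labels = [
        nearest_category(sub, codebook)
        for sub, codebook in zip(sub_vectors, codebooks)
    ]
    return "-".join(labels)


# Hypothetical pre-training category set for M = 2 segments of length 2.
codebooks = [
    {"A": [0.0, 0.0], "B": [1.0, 1.0]},
    {"A": [0.0, 0.0], "B": [1.0, 1.0]},
]

category = sentence_category([0.1, 0.2, 0.9, 0.8], codebooks)
```

Sentences that receive the same spliced label fall into the same sentence category, so no pairwise comparison between sentences is needed at this stage.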
In order to obtain an effective phrase, optionally, in the apparatus for determining a high-frequency problem provided in the embodiment of the present invention, the extracting unit 704 includes: the device comprises a processing module, a filtering module, a combining module and a selecting module.
Specifically, the processing module is used for carrying out word segmentation processing on each sentence to obtain a plurality of words; the filtering module is used for filtering the plurality of words according to a first preset condition to obtain filtered words; the combination module is used for combining the filtered words according to a second preset condition to obtain candidate phrases corresponding to each sentence; and the selection module is used for obtaining a plurality of phrases according to the candidate phrases corresponding to each sentence.
Optionally, in the apparatus for determining a high frequency problem provided in the embodiment of the present invention, the selecting module includes: the device comprises a judging sub-module and a filtering sub-module.
Specifically, the judging submodule is used for judging whether the candidate phrase corresponding to each sentence accords with a third preset condition or not; and the filtering sub-module is used for forming a plurality of phrases from the candidate phrases meeting the third preset condition.
Optionally, in the high-frequency problem determining apparatus provided in the embodiment of the present invention, the judging submodule is further configured to: judging whether the candidate phrase corresponding to each sentence accords with a third preset condition or not through a pre-trained judging model.
In the apparatus for determining a high-frequency problem provided in the embodiment of the present invention, the training dimensions of the pre-trained judgment model include at least one of: dependency relationship, mutual information, left and right entropy, phrase context, frequency, and class distinction value. The dependency relationship indicates whether a dependency relationship exists between two words in a corpus; the mutual information is used to check the degree of co-occurrence of two words in the corpus; the left and right entropy is used to check the degree of freedom of each word in the corpus; the phrase context indicates the relationship between each word in the corpus and the indexed sentences whose frequency meets a preset frequency; the frequency indicates the number of occurrences of a phrase in the corpus; and the class distinction value indicates the inverse text frequency index value of each word in the corpus.
In order to determine the high-frequency problem according to the target phrase after the target phrase is obtained, optionally, in the apparatus for determining a high-frequency problem provided in the embodiment of the present invention, the determining unit 708 includes: a statistics module, a first determining module, a second determining module, and a third determining module.
Specifically, a statistics module is used for counting the occurrence frequency of each phrase in the sentences of the question set, and taking the phrase with the highest frequency as a target phrase; a first determining module, configured to determine a frequency of occurrence of all sentences included in each category under the target phrase; the second determining module is used for sequencing the categories according to the sequence from high to low according to the occurrence frequency of all sentences included in each category; and the third determining module is used for taking the top K categories as target categories and determining the high-frequency problems according to the target categories.
It should be noted here that the obtaining unit 702, the extracting unit 704, the indexing unit 706 and the determining unit 708 correspond to steps S102 to S108 in Embodiment 1. The four units implement the same examples and application scenarios as the corresponding steps, but are not limited to what is disclosed in Embodiment 1. It should also be noted that the above units may run as part of the apparatus in the computer terminal 10 provided in Embodiment 1.
According to this embodiment, the obtaining unit 702 obtains a question set in a period of time; the extracting unit 704 extracts keywords of each sentence in the question set to obtain a plurality of phrases; the indexing unit 706 establishes an index of each phrase and the sentences corresponding to the phrase, and clusters the sentences indexed by each phrase to obtain a plurality of clustering results; and the determining unit 708 determines the high-frequency problem in the question set according to the plurality of phrases and the plurality of clustering results. This solves the technical problem of poor clustering effect in the process of determining high-frequency problems, and allows the problems users care about most to be located quickly from the plurality of clustering results, so that corresponding answer information can be configured and the user experience is improved.
Example 3
The embodiment of the method for determining a high-frequency problem provided by the embodiment of the present invention may be executed in a mobile device, a computer terminal, or a similar computing device. Fig. 8 shows a hardware block diagram of a computer terminal for implementing the method for determining a high-frequency problem. As shown in fig. 8, the computer terminal 10 (or mobile device 10) may include one or more processors 102 (shown as 102a, 102b, ..., 102n), which may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA), a memory 104 for storing data, and a transmission module 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 8 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in fig. 8, or have a different configuration than shown in fig. 8.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or may be incorporated, in whole or in part, into any of the other elements in the computer terminal 10 (or mobile device). As referred to in the embodiments of the present invention, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as program instructions/data storage devices corresponding to the method for determining a high frequency problem in the embodiment of the present invention, and the processor 102 executes the software programs and modules stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the method for determining a high frequency problem described above. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 106 is configured to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission module 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station so as to communicate with the internet. In one example, the transmission module 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10 (or mobile device).
It should be noted here that in some alternative embodiments, the computer device (or mobile device) shown in fig. 8 may include hardware elements (including circuitry), software elements (including computer code stored on a computer-readable medium), or a combination of both hardware and software elements. It should be noted that fig. 8 is only one specific example, and is intended to illustrate the types of components that may be present in the computer device (or mobile device) described above.
Embodiments of the present invention may provide a computer terminal, which may be any one of a group of computer terminals. Alternatively, in this embodiment, the above-mentioned computer terminal may be replaced with a terminal device such as a mobile device.
Alternatively, in this embodiment, the above-mentioned computer terminal may be located in at least one network device among a plurality of network devices of the computer network.
Fig. 9 is a block diagram of an alternative computer terminal according to an embodiment of the present invention. As shown in fig. 9, the computer terminal 10 may include: one or more (only one shown) processors and memory.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the method and apparatus for determining a high frequency problem in the embodiments of the present invention, and the processor executes the software programs and modules stored in the memory, thereby performing various functional applications and data processing, that is, implementing the method for determining a high frequency problem described above. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory located remotely from the processor, which may be connected to the terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring a question set in a period of time, wherein the question set comprises a plurality of sentences of at least one user question; extracting keywords of each sentence in the question set to obtain a plurality of phrases; establishing an index of each phrase and a sentence corresponding to the phrase, and clustering sentences indexed by each phrase to obtain a plurality of clustering results; and determining the high-frequency problem in the question set according to the phrases and the clustering results.
Optionally, the above processor may further execute program code for the following steps: after extracting keywords of each sentence in the question set to obtain a plurality of phrases, clustering the plurality of phrases to obtain a plurality of phrase categories. Establishing an index of each phrase and the sentences corresponding to the phrase and clustering the sentences indexed by each phrase to obtain a plurality of clustering results then includes: establishing an index of each phrase category and the sentences corresponding to it, and clustering the sentences indexed by each phrase category to obtain a plurality of clustering results.
Optionally, the above processor may further execute program code for the following steps: before clustering the sentences indexed by each phrase to obtain a plurality of clustering results, clustering each sentence in the question set to obtain a plurality of sentence categories. Clustering the sentences indexed by each phrase to obtain a plurality of clustering results then includes: marking the sentences indexed by each phrase according to the plurality of sentence categories to obtain the category corresponding to each sentence indexed by each phrase, thereby obtaining the plurality of clustering results.
Optionally, the above processor may further execute program code for the following steps, where clustering each sentence in the question set to obtain a plurality of sentence categories includes: obtaining a sentence vector of each sentence in the question set; dividing the sentence vector of each sentence into M segments to obtain M segments of sub-vectors, where M is a natural number greater than 1; obtaining, through a pre-training category set, the category corresponding to each segment of sub-vector in the M segments of sub-vectors of each sentence, where the pre-training category set includes the category corresponding to each segment of sub-vector of all sentences; splicing or accumulating the categories corresponding to each segment of sub-vector to obtain the category to which each sentence belongs; and taking the categories to which all the sentences in the question set belong as the plurality of sentence categories.
Optionally, the above processor may further execute program code for the following steps, where obtaining, through the pre-training category set, the category corresponding to each segment of sub-vector in the M segments of sub-vectors of each sentence includes: calculating the Euclidean distance between each segment of sub-vector in the M segments of sub-vectors and each segment of sub-vector in the pre-training category set; determining the sub-vector in the pre-training category set that has the minimum Euclidean distance to the sub-vector in the M segments of sub-vectors, to obtain a target sub-vector; and taking the category to which the target sub-vector belongs as the category corresponding to the sub-vector in the M segments of sub-vectors.
Optionally, the above processor may further execute program code for: extracting keywords of each sentence in the question set, and obtaining a plurality of phrases comprises: word segmentation processing is carried out on each sentence to obtain a plurality of words; filtering the plurality of words according to a first preset condition to obtain filtered words; combining the filtered words according to a second preset condition to obtain candidate phrases corresponding to each sentence; and obtaining a plurality of phrases according to the candidate phrases corresponding to each sentence.
Optionally, the above processor may further execute program code for: obtaining a plurality of phrases from the candidate phrases corresponding to each sentence includes: judging whether the candidate phrase corresponding to each sentence accords with a third preset condition; and forming the candidate phrases meeting the third preset condition into a plurality of phrases.
Optionally, the above processor may further execute program code for the following steps, where judging whether the candidate phrase corresponding to each sentence meets the third preset condition includes: judging, through a pre-trained judgment model, whether the candidate phrase corresponding to each sentence meets the third preset condition.
Optionally, in the program code executed by the above processor, the training dimensions of the pre-trained judgment model include at least one of: dependency relationship, mutual information, left and right entropy, phrase context, frequency, and class distinction value. The dependency relationship indicates whether a dependency relationship exists between two words in a corpus; the mutual information is used to check the degree of co-occurrence of two words in the corpus; the left and right entropy is used to check the degree of freedom of each word in the corpus; the phrase context indicates the relationship between each word in the corpus and the indexed sentences whose frequency meets a preset frequency; the frequency indicates the number of occurrences of a phrase in the corpus; and the class distinction value indicates the inverse text frequency index value of each word in the corpus.
Optionally, the above processor may further execute program code for the following: determining the high-frequency problems in the question set according to the plurality of phrases and the plurality of clustering results comprises: counting the frequency with which each phrase occurs in the sentences of the question set, and taking the phrase with the highest frequency as a target phrase; determining the occurrence frequency of all sentences included in each category under the target phrase; sorting the categories from high to low by the occurrence frequency of all sentences included in each category; and taking the top K categories as target categories and determining the high-frequency problems according to the target categories.
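The selection of the target phrase and the top K categories can be sketched as below. The data structures are assumptions made for illustration: `phrase_index` maps each phrase to the ids of the sentences it indexes, `sentence_category` maps each sentence id to its cluster label, and every sentence occurrence counts once toward its category.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_k_categories(phrase_index: Dict[str, List[int]],
                     sentence_category: Dict[int, str],
                     k: int) -> Tuple[str, List[Tuple[str, int]]]:
    """Pick the most frequent phrase, then rank its categories by sentence count."""
    # Target phrase: the phrase that indexes the most sentences.
    target = max(phrase_index, key=lambda p: len(phrase_index[p]))
    # Count the sentences of each category under the target phrase.
    counts = Counter(sentence_category[s] for s in phrase_index[target])
    # Sort from high to low and keep the first k categories.
    return target, counts.most_common(k)


# Hypothetical index and cluster labels.
phrase_index = {"cancel order": [0, 1, 2, 3, 4, 5], "refund": [6, 7]}
sentence_category = {0: "A", 1: "A", 2: "B", 3: "A", 4: "B", 5: "C", 6: "D", 7: "D"}
print(top_k_categories(phrase_index, sentence_category, k=2))
```

How a representative question is then chosen from each target category is not specified in this passage, so the sketch stops at the ranked categories.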
This embodiment of the invention provides a computer terminal scheme for implementing the method for determining a high-frequency problem. A question set within a period of time is acquired, wherein the question set comprises a plurality of sentences asked by at least one user; keywords of each sentence in the question set are extracted to obtain a plurality of phrases; an index between each phrase and its corresponding sentences is established, and the sentences indexed by each phrase are clustered to obtain a plurality of clustering results; and the high-frequency problems in the question set are determined according to the plurality of phrases and the plurality of clustering results. This improves both the accuracy of the clustering results and the clustering speed, achieves the technical effect of improving user experience, and thereby solves the technical problem of poor clustering effect in the process of determining high-frequency problems.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program that instructs hardware associated with a terminal device. The program may be stored in a computer-readable storage medium, which may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.
Example 4
The embodiment of the invention also provides a storage medium. Optionally, in this embodiment, the storage medium may be used to store the program code executed by the method for determining a high-frequency problem provided in Example 1.
Optionally, in this embodiment, the storage medium may be located in any computer terminal of a computer terminal group in a computer network, or in any mobile terminal of a mobile terminal group.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: acquiring a question set within a period of time, wherein the question set comprises a plurality of sentences asked by at least one user; extracting keywords of each sentence in the question set to obtain a plurality of phrases; establishing an index between each phrase and its corresponding sentences, and clustering the sentences indexed by each phrase to obtain a plurality of clustering results; and determining the high-frequency problems in the question set according to the plurality of phrases and the plurality of clustering results.
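The index between phrases and sentences can be pictured as an inverted index. The sketch below is one possible reading, not the patented implementation: it assumes that a phrase indexes a sentence when the phrase text occurs in the sentence, and the function name and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def build_phrase_index(sentences: List[str], phrases: List[str]) -> Dict[str, List[int]]:
    """Map each phrase to the positions of the sentences that contain it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for i, sentence in enumerate(sentences):
        for phrase in phrases:
            # Assumption: plain substring match decides membership.
            if phrase in sentence:
                index[phrase].append(i)
    return dict(index)


sentences = ["how to cancel order", "cancel order please", "refund status"]
phrases = ["cancel order", "refund"]
print(build_phrase_index(sentences, phrases))
```

Clustering is then run separately on each phrase's sentence list, so each clustering job only sees sentences that already share a keyword phrase.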
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: after extracting the keywords of each sentence in the question set to obtain the plurality of phrases, clustering the plurality of phrases to obtain a plurality of phrase categories; establishing the index between each phrase and its corresponding sentences and clustering the sentences indexed by each phrase to obtain the plurality of clustering results comprises: establishing an index between each phrase category and its corresponding sentences, and clustering the sentences indexed by each phrase category to obtain the plurality of clustering results.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: before clustering the sentences indexed by each phrase to obtain the plurality of clustering results, clustering each sentence in the question set to obtain a plurality of sentence categories; clustering the sentences indexed by each phrase to obtain the plurality of clustering results comprises: labeling the sentences indexed by each phrase according to the plurality of sentence categories to obtain the category corresponding to each sentence indexed by each phrase, thereby obtaining the plurality of clustering results.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: clustering each sentence in the question set to obtain the plurality of sentence categories comprises: acquiring a sentence vector of each sentence in the question set; dividing the sentence vector of each sentence into M segments to obtain M sub-vectors, wherein M is a natural number greater than 1; obtaining, through a pre-training category set, the category corresponding to each of the M sub-vectors of each sentence, wherein the pre-training category set comprises the category corresponding to each sub-vector of all sentences; splicing or accumulating the categories corresponding to the sub-vectors to obtain the category to which each sentence belongs; and taking the categories to which all sentences in the question set belong as the plurality of sentence categories.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining, through the pre-training category set, the category corresponding to each of the M sub-vectors of each sentence comprises: calculating the Euclidean distance between each of the M sub-vectors and each sub-vector in the pre-training category set; determining the sub-vector in the pre-training category set that has the minimum Euclidean distance to a given sub-vector among the M sub-vectors, to obtain a target sub-vector; and taking the category to which the target sub-vector belongs as the category corresponding to that sub-vector.
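The sub-vector lookup described above resembles product quantization, and a minimal sketch under that reading is given here. Several points are assumptions rather than statements from the source: the pre-training category set is modeled as one list of reference sub-vectors per segment (`codebooks`, a hypothetical name), the position of a reference sub-vector in its list serves as its category id, segments have equal length, and "splicing" is modeled as concatenating the per-segment category ids into a tuple.

```python
import math
from typing import List, Sequence, Tuple


def split_vector(vec: Sequence[float], m: int) -> List[Sequence[float]]:
    """Divide a sentence vector into m contiguous sub-vectors of equal length."""
    if m < 2 or len(vec) % m != 0:
        raise ValueError("m must be > 1 and divide the vector length")
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def sentence_category(vec: Sequence[float],
                      codebooks: List[List[Sequence[float]]],
                      m: int) -> Tuple[int, ...]:
    """Assign each sub-vector to its nearest reference sub-vector, then splice ids."""
    codes = []
    for sub, book in zip(split_vector(vec, m), codebooks):
        # Nearest reference sub-vector by Euclidean distance; its position
        # in the codebook is used as the category of this segment.
        nearest = min(range(len(book)), key=lambda c: math.dist(sub, book[c]))
        codes.append(nearest)
    return tuple(codes)


# Hypothetical codebooks for M = 2 segments of a 4-dimensional sentence vector.
codebooks = [[[0.0, 0.0], [1.0, 1.0]],
             [[0.0, 0.0], [5.0, 5.0]]]
print(sentence_category([0.9, 1.2, 4.0, 4.5], codebooks, m=2))
print(sentence_category([0.1, 0.0, 0.2, 0.1], codebooks, m=2))
```

Sentences that receive the same spliced tuple fall into the same sentence category, so no pairwise comparison between sentences is needed at assignment time.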
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: extracting the keywords of each sentence in the question set to obtain a plurality of phrases comprises: performing word segmentation on each sentence to obtain a plurality of words; filtering the plurality of words according to a first preset condition to obtain filtered words; combining the filtered words according to a second preset condition to obtain the candidate phrases corresponding to each sentence; and obtaining the plurality of phrases from the candidate phrases corresponding to each sentence.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: obtaining the plurality of phrases from the candidate phrases corresponding to each sentence comprises: judging whether the candidate phrase corresponding to each sentence meets a third preset condition; and forming the plurality of phrases from the candidate phrases that meet the third preset condition.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: judging whether the candidate phrase corresponding to each sentence meets the third preset condition comprises: making the judgment through a pre-trained judgment model.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: the training dimensions of the pre-trained judgment model include at least one of: dependency relationship, mutual information, left and right entropy, phrase context, frequency, and category distinction value. The dependency relationship indicates whether a dependency exists between two words in a corpus; the mutual information measures the degree of co-occurrence of two words in the corpus; the left and right entropy measures the degree of freedom of each word in the corpus; the phrase context represents the relationship between each word in the corpus and the sentences whose index frequency meets a preset frequency; the frequency represents the number of times the phrase occurs in the corpus; and the category distinction value represents the inverse document frequency (IDF) value of each word in the corpus.
Optionally, in this embodiment, the storage medium is configured to store program code for performing the following steps: determining the high-frequency problems in the question set according to the plurality of phrases and the plurality of clustering results comprises: counting the frequency with which each phrase occurs in the sentences of the question set, and taking the phrase with the highest frequency as a target phrase; determining the occurrence frequency of all sentences included in each category under the target phrase; sorting the categories from high to low by the occurrence frequency of all sentences included in each category; and taking the top K categories as target categories and determining the high-frequency problems according to the target categories.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for any part not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be in electrical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing program code.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of the present invention.

Claims (16)

1. A method for determining a high frequency problem, comprising:
acquiring a question set in a period of time, wherein the question set comprises a plurality of sentences which are asked by at least one user;
extracting keywords of each sentence in the question set to obtain a plurality of phrases;
establishing an index of each phrase and a sentence corresponding to the phrase, and clustering sentences indexed by each phrase to obtain a plurality of clustering results;
determining high-frequency problems in the question set according to the phrases and the clustering results;
wherein determining the high-frequency questions in the question set according to the phrases and the clustering results comprises:
counting the occurrence frequency of each phrase in the sentences of the question set, and taking the phrase with the highest frequency as a target phrase;
determining the frequency of occurrence of all sentences included in each category under the target phrase;
sorting the categories from high to low according to the occurrence frequency of all sentences included in each category;
and taking the category ranked in the top K as a target category, and determining the high-frequency problem according to the target category.
2. The method for determining according to claim 1, wherein,
after extracting the keywords of each sentence in the question set to obtain the plurality of phrases, the method further comprises: clustering the plurality of phrases to obtain a plurality of phrase categories;
establishing the index of each phrase and the corresponding sentence and clustering the sentences indexed by each phrase to obtain the plurality of clustering results comprises: establishing an index of each phrase category and the corresponding sentences, and clustering the sentences indexed by each phrase category to obtain the plurality of clustering results.
3. The method for determining according to claim 1, wherein,
before clustering the sentences indexed by each phrase to obtain a plurality of clustering results, the method further comprises: clustering each sentence in the question set to obtain a plurality of sentence categories;
clustering the sentences indexed by each phrase to obtain the plurality of clustering results comprises: labeling the sentences indexed by each phrase according to the sentence categories to obtain the category corresponding to each sentence indexed by each phrase, thereby obtaining the clustering results.
4. The method of determining according to claim 3, wherein clustering each sentence in the set of question sentences to obtain a plurality of sentence categories comprises:
acquiring sentence vectors of each sentence in the question set;
dividing sentence vectors of each sentence into M sections to obtain M-section sub-vectors, wherein M is a natural number greater than 1;
acquiring a category corresponding to each segment of sub-vector in M segments of sub-vectors of each sentence through a pre-training category set, wherein the pre-training category set comprises: the category corresponding to each segment of subvector of all sentences;
splicing or accumulating the categories corresponding to the sub-vectors of each segment to obtain the category to which each sentence belongs;
and taking the category to which all sentences in the question set belong as the plurality of sentence categories.
5. The method of determining according to claim 4, wherein obtaining the category corresponding to each of the M-segment subvectors of each sentence through the pre-training category set comprises:
calculating the Euclidean distance between each segment of sub-vector in the M segments of sub-vectors and each segment of sub-vector in the pre-training class set;
determining a sub-vector with the minimum Euclidean distance with a sub-vector in the M-segment sub-vectors in the pre-training class set to obtain a target sub-vector;
and taking the category to which the target sub-vector belongs as the category corresponding to the sub-vector in the M-segment sub-vectors.
6. The method of determining of claim 1, wherein extracting keywords for each sentence in the set of question sentences to obtain a plurality of phrases comprises:
word segmentation processing is carried out on each sentence to obtain a plurality of words;
filtering the plurality of words according to a first preset condition to obtain filtered words, wherein the first preset condition is that non-target words are filtered, and the target words comprise: verbs, nouns, adjectives, adverbs;
combining the filtered words according to a second preset condition to obtain the candidate phrases corresponding to each sentence, wherein the second preset condition is the sequence relation of the words in each sentence;
and obtaining the phrases according to the candidate phrases corresponding to each sentence.
7. The method of determining according to claim 6, wherein obtaining the plurality of phrases from the candidate phrases for each sentence comprises:
judging whether the candidate phrase corresponding to each sentence accords with a third preset condition or not;
and combining the candidate phrases meeting the third preset condition into the phrases.
8. The method of determining according to claim 7, wherein determining whether the candidate phrase corresponding to each sentence meets a third preset condition comprises: judging whether the candidate phrase corresponding to each sentence accords with a third preset condition or not through a pre-trained judging model.
9. The method of determining of claim 8, wherein the training dimensions of the pre-trained judgment model comprise at least one of: dependency relationship, mutual information, left and right entropy, phrase context, frequency, and category distinction value, wherein the dependency relationship indicates whether a dependency exists between two words in a corpus, the mutual information measures the degree of co-occurrence of two words in the corpus, the left and right entropy measures the degree of freedom of each word in the corpus, the phrase context represents the relationship between each word in the corpus and the sentences whose index frequency meets a preset frequency, the frequency represents the number of times the phrase occurs in the corpus, and the category distinction value represents the inverse document frequency (IDF) value of each word in the corpus.
10. A high frequency problem determination apparatus, comprising:
the acquisition unit is used for acquiring a question set within a period of time, wherein the question set comprises a plurality of sentences asked by at least one user;
the extraction unit is used for extracting the keywords of each sentence in the question set to obtain a plurality of phrases;
the indexing unit is used for establishing an index of each phrase and the corresponding sentence, and clustering the sentences indexed by each phrase to obtain a plurality of clustering results;
a determining unit, configured to determine a high-frequency problem in the question set according to the plurality of phrases and the plurality of clustering results;
wherein the determining unit includes: a statistics module, a first determination module, a second determination module and a third determination module,
the statistics module is used for counting the occurrence frequency of each phrase in the sentences of the question set, and taking the phrase with the highest frequency as a target phrase; a first determining module, configured to determine a frequency of occurrence of all sentences included in each category under the target phrase; the second determining module is used for sequencing the categories according to the sequence from high to low according to the occurrence frequency of all sentences included in each category; and the third determining module is used for taking the top K categories as target categories and determining the high-frequency problems according to the target categories.
11. The determination apparatus according to claim 10, wherein the apparatus further comprises:
the first clustering unit is used for clustering the plurality of phrases, after the keywords of each sentence in the question set are extracted to obtain the plurality of phrases, so as to obtain a plurality of phrase categories;
the index unit is further configured to: establishing indexes of each phrase category and the corresponding sentence, and clustering the sentences indexed by each phrase category to obtain a plurality of clustering results.
12. The determination apparatus according to claim 10, wherein the apparatus further comprises:
the second clustering unit is used for clustering each sentence in the question set, before the sentences indexed by each phrase are clustered to obtain the plurality of clustering results, so as to obtain a plurality of sentence categories;
the index unit is further configured to: marking sentences indexed by each phrase according to the sentence categories to obtain the category corresponding to each sentence indexed by each phrase, and obtaining the clustering results.
13. The determining apparatus of claim 12, wherein the second clustering unit comprises:
the first acquisition module is used for acquiring sentence vectors of each sentence in the question set;
the vector generation module is used for dividing the sentence vector of each sentence into M segments to obtain M segment sub-vectors, wherein M is a natural number larger than 1;
the second obtaining module is configured to obtain, through a pre-training class set, a class corresponding to each segment of subvector in the M segments of subvectors of each sentence, where the pre-training class set includes: the category corresponding to each segment of subvector of all sentences;
the merging module is used for splicing or accumulating the categories corresponding to the sub-vectors of each segment to obtain the category to which each sentence belongs;
and the construction module is used for taking the category to which all sentences in the question set belong as the sentence categories.
14. The determining device of claim 13, wherein the second acquisition module comprises:
the computing sub-module is used for computing Euclidean distance between each segment of sub-vector in the M segments of sub-vectors and each segment of sub-vector in the pre-training class set;
the determining sub-module is used for determining a sub-vector with the minimum Euclidean distance with the sub-vector in the M-segment sub-vectors in the pre-training class set to obtain a target sub-vector;
and the construction sub-module is used for taking the category to which the target sub-vector belongs as the category corresponding to the sub-vector in the M-segment sub-vectors.
15. A processor for running a program, wherein the program runs to perform the method of determining a high frequency problem according to any one of claims 1 to 9.
16. A storage medium comprising a stored program, wherein the program performs the method of determining a high frequency problem according to any one of claims 1 to 9.
CN201810448748.6A 2018-05-11 2018-05-11 Method and device for determining high-frequency problem Active CN110489531B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810448748.6A CN110489531B (en) 2018-05-11 2018-05-11 Method and device for determining high-frequency problem

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810448748.6A CN110489531B (en) 2018-05-11 2018-05-11 Method and device for determining high-frequency problem

Publications (2)

Publication Number Publication Date
CN110489531A CN110489531A (en) 2019-11-22
CN110489531B true CN110489531B (en) 2023-05-30

Family

ID=68543206

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810448748.6A Active CN110489531B (en) 2018-05-11 2018-05-11 Method and device for determining high-frequency problem

Country Status (1)

Country Link
CN (1) CN110489531B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113127611B (en) * 2019-12-31 2024-05-14 北京中关村科金技术有限公司 Method, device and storage medium for processing question corpus
CN111341312A (en) * 2020-02-24 2020-06-26 百度在线网络技术(北京)有限公司 Data analysis method and device based on intelligent voice device and storage medium
CN112183089A (en) * 2020-09-25 2021-01-05 中国建设银行股份有限公司 Corpus analysis method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7426507B1 (en) * 2004-07-26 2008-09-16 Google, Inc. Automatic taxonomy generation in search results using phrases
US8554769B1 (en) * 2008-06-17 2013-10-08 Google Inc. Identifying gibberish content in resources
CN107944027A (en) * 2017-12-12 2018-04-20 苏州思必驰信息科技有限公司 Create the method and system of semantic key index

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2001188678A (en) * 2000-01-05 2001-07-10 Mitsubishi Electric Corp Language case inferring device, language case inferring method, and storage medium on which language case inference program is described
US20020065857A1 (en) * 2000-10-04 2002-05-30 Zbigniew Michalewicz System and method for analysis and clustering of documents for search engine
JP2011227758A (en) * 2010-04-21 2011-11-10 Sony Corp Information processing apparatus, information processing method and program
JP5825676B2 (en) * 2012-02-23 2015-12-02 国立研究開発法人情報通信研究機構 Non-factoid question answering system and computer program
CN104142918B (en) * 2014-07-31 2017-04-05 天津大学 Short text clustering and focus subject distillation method based on TF IDF features
US10025773B2 (en) * 2015-07-24 2018-07-17 International Business Machines Corporation System and method for natural language processing using synthetic text
CN107085581B (en) * 2016-02-16 2020-04-07 腾讯科技(深圳)有限公司 Short text classification method and device
CN105955965A (en) * 2016-06-21 2016-09-21 上海智臻智能网络科技股份有限公司 Question information processing method and device
CN107679144B (en) * 2017-09-25 2021-07-16 平安科技(深圳)有限公司 News sentence clustering method and device based on semantic similarity and storage medium
CN110442718B (en) * 2019-08-08 2023-12-08 腾讯科技(深圳)有限公司 Statement processing method and device, server and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
《KP-Miner: A Simple System for Effective Keyphrase Extraction》;Samhaa R. El-beltagy;2006 Innovations in Information Technology;1-5 *
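Research on message text clustering based on frequent patterns (基于频繁模式的消息文本聚类研究); 胡吉祥; 信息科技 (No. 10); 59-67 *
Research on an automatic summarization method for Chinese single documents combining Doc2Vec and an improved clustering algorithm (结合Doc2Vec与改进聚类算法的中文单文档自动摘要方法研究); 贾晓婷; 王名扬; 曹宇; 数据分析与知识发现 (No. 02); 90-99 *
Research on question understanding techniques for collaborative question answering (面向协作式问答的问题理解技术研究); 张宇; 赵鑫; 刘挺; 中文信息学报 (No. 02); 30-35 *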

Also Published As

Publication number Publication date
CN110489531A (en) 2019-11-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant