CN106372956B - Method and system for identifying intention entity based on user search log

Info

Publication number: CN106372956B
Application number: CN201510440013.5A
Authority: CN
Inventors: 孙鹏飞; 李春生; 金阳春
Original assignee: Suning Cloud Computing Co Ltd
Current assignee: NANJING SUNING ELECTRONIC INFORMATION TECHNOLOGY Co.,Ltd.
Priority date: 2015-07-23
Filing date: 2015-07-23
Publication date: 2020-03-24
Anticipated expiration: 2035-07-23
Also published as: CN106372956A

Abstract

The invention relates to the field of electronic commerce and discloses a method for identifying an intention entity based on a user search log, comprising the following steps: analyzing and extracting the original logs to form a session query set, a co-click query set and a co-query commodity set; processing the data to obtain mutually disjoint subsets (I, Q), establishing a bipartite graph through the click relation, and clustering words; performing word segmentation on each query word according to the word clustering result to obtain a plurality of word segments; and calculating a weight score for each word segment, the word segment with the highest score being used as the entity corresponding to the query word. By analyzing the original log and performing word clustering and weight analysis on the query words, the invention realizes intention entity recognition that incorporates the context and the semantic background, improves search accuracy, achieves a better intention recognition effect, and saves the overhead of on-line calculation.

Description

Method and system for identifying intention entity based on user search log

Technical Field

The invention relates to the field of electronic commerce, in particular to a method for identifying an intention entity based on a user search log.

Background

With the spread of the Internet, people's shopping habits are gradually shifting from physical stores to e-commerce websites, a change that offers both a wider selection and a more convenient shopping experience. However, users differ in educational background, culture and region, so the query words they enter to describe the same commodity can differ greatly; intention entity identification therefore needs to be performed on the query words.

Existing intention entity recognition models mainly rely on machine learning: learning samples are labeled manually, an entity recognition model is trained, the query words are labeled, and high-priority core words are then obtained according to priority rules. Such models solve the problem of intention entity identification to a certain extent. However, some existing methods do not analyze the user's context or semantic background, which can lead to errors in the search results. For example, one query word is "Samsung mobile phone" and another is "Xiaomi mobile phone". Both contain the words "mobile phone", but analyzing the user's context and semantic background in the search log shows that both queries are more concerned with brand information, which existing intention entity recognition models cannot capture.

Disclosure of Invention

The invention aims to solve the technical problem of providing a method for identifying an intention entity based on a user search log so as to solve the problem that the search result has errors due to missing analysis on the context or semantic background of a user.

The technical scheme adopted by the invention for solving the technical problems is to provide a method for identifying an intention entity based on a user search log, which comprises the following steps:

S1, analyzing and extracting the original search log to obtain the query word of each click and the corresponding commodity information;

S2, forming a session query set S_session, a co-click query set S_query and a co-query commodity set S_item;

S3, representing the co-click query set S_query and the co-query commodity set S_item by a bipartite graph, and processing the vertices of the bipartite graph to obtain out-degree sets A1 and A2;

S4, merging the set A1 and the set A2 to obtain a word clustering result A;

S5, performing word segmentation on each query word in the word clustering result A by a word segmentation technique, and calculating the weight of each word segment;

and S6, selecting the word segment with the highest score as the entity corresponding to the query word.
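Step S1 can be illustrated with a minimal parsing sketch. The patent does not specify the raw log format, so the tab-separated layout (session id, query word, commodity id) and the names `ClickRecord` and `parse_log` are assumptions made only for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class ClickRecord(NamedTuple):
    """One click event: a query word and the commodity clicked for it."""
    session_id: str
    query: str
    item_id: str


def parse_log(lines: Iterable[str]) -> Iterator[ClickRecord]:
    """Yield one ClickRecord per well-formed line and skip malformed lines.

    ASSUMED line layout (hypothetical): session_id<TAB>query<TAB>item_id
    """
    for line in lines:
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) != 3 or not all(fields):
            continue  # malformed or incomplete record
        yield ClickRecord(*fields)


raw = [
    "s1\tsamsung phone\ti100\n",
    "s1\tsamsung galaxy\ti100\n",
    "broken line\n",
    "s2\txiaomi phone\ti200\n",
]
records = list(parse_log(raw))
```

Only the three well-formed lines survive; each record carries the query word together with the commodity that was clicked for it, which is the input that the later steps need.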

Preferably, the step S2 includes the steps of:

S201, processing each session unit of a user to obtain the query word set corresponding to each session unit;

S202, acquiring the set of query words for which users clicked the same commodity;

S203, acquiring the set of different commodities clicked by users under the same query term;

S204, merging and de-duplicating the query term set of the session units, the query term set of the co-click commodities and the commodity click data set of the co-click commodities to obtain the session query set S_session, the co-click query set S_query and the co-query commodity set S_item.
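The set construction in steps S201 to S204 can be sketched as follows. This is a minimal illustration under assumed inputs: the record layout `(session_id, query_word, commodity_id)` and the function name `build_sets` are hypothetical, and each of the three sets is modeled as a dictionary of Python sets so that de-duplication happens automatically.

```python
from collections import defaultdict

# Hypothetical click records: (session_id, query_word, commodity_id).
records = [
    ("s1", "samsung phone", "i100"),
    ("s1", "samsung galaxy", "i100"),
    ("s1", "samsung phone", "i101"),
    ("s2", "xiaomi phone", "i200"),
    ("s2", "xiaomi phone", "i200"),  # repeated click, removed by the sets
]


def build_sets(records):
    """Return (S_session, S_query, S_item) as dicts of de-duplicated sets."""
    s_session = defaultdict(set)  # session    -> query words issued in it (S201)
    s_query = defaultdict(set)    # commodity  -> query words that led to a click on it (S202)
    s_item = defaultdict(set)     # query word -> commodities clicked under it (S203)
    for session_id, query, item_id in records:
        s_session[session_id].add(query)
        s_query[item_id].add(query)
        s_item[query].add(item_id)
    # Adding to sets merges repeated observations and removes duplicates (S204).
    return dict(s_session), dict(s_query), dict(s_item)


s_session, s_query, s_item = build_sets(records)
```

In this toy data, "samsung phone" and "samsung galaxy" end up in the same co-click entry because both led to a click on commodity `i100`.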

Preferably, the step S3 includes the steps of:

S301, representing the co-click query set S_query and the co-query commodity set S_item by a bipartite graph G = (V, E), wherein the vertex set V can be divided into two mutually disjoint subsets (I, Q), I and Q being the commodity information set and the query term set respectively, and an edge in E represents a click relation between a commodity and a query term;

S302, classifying the vertices V = (I, Q) of the bipartite graph G = (V, E), and calculating, respectively, an out-degree set A1 with the commodity information set I as arc head and the query term set Q as arc tail, and an out-degree set A2 with the query term set Q as arc head and the commodity information set I as arc tail.
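One possible reading of steps S301 and S302 is sketched below. The patent does not define the data layout of A1 and A2, so this sketch assumes that A1 groups the arcs running from a query word (arc tail) to a commodity (arc head) and that A2 groups the arcs running in the opposite direction; the edge list and the function name `out_sets` are hypothetical.

```python
from collections import defaultdict

# Hypothetical click edges E between query words (Q) and commodities (I).
edges = [
    ("samsung phone", "i100"),
    ("samsung galaxy", "i100"),
    ("samsung phone", "i101"),
    ("xiaomi phone", "i200"),
]


def out_sets(edges):
    """Return (A1, A2) for the bipartite click graph.

    A1: arcs with tail in Q and head in I, grouped by query word.
    A2: arcs with tail in I and head in Q, grouped by commodity.
    (ASSUMPTION: this grouping is one interpretation of the out-degree sets.)
    """
    a1 = defaultdict(set)
    a2 = defaultdict(set)
    for query, item in edges:
        a1[query].add(item)   # arc query -> commodity
        a2[item].add(query)   # arc commodity -> query
    return dict(a1), dict(a2)


a1, a2 = out_sets(edges)

# The two vertex classes of a bipartite graph must not overlap.
assert set(a1).isdisjoint(a2)

out_degree_q = {q: len(items) for q, items in a1.items()}
out_degree_i = {i: len(queries) for i, queries in a2.items()}
```

Here `out_degree_q["samsung phone"]` is 2 because that query word led to clicks on two commodities, while `a2["i100"]` collects the two query words that share the commodity `i100`.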

Preferably, in step S4, the sets A1 and A2 are merged as A' = (A1 ∪ A2) - (A1 ∩ A2), and the word clustering result A = A' ∪ A1 is then obtained.
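Read purely as set algebra, the two-stage merge can be simplified: A' is the symmetric difference of A1 and A2, and the subsequent union with A1 restores the elements of the intersection that the difference removed. The short derivation below is a set-theoretic observation that holds when A1 and A2 are ordinary sets; it is not a statement from the patent about how the sets are stored or computed.

```latex
\begin{aligned}
A' &= (A_1 \cup A_2) \setminus (A_1 \cap A_2)
    = (A_1 \setminus A_2) \cup (A_2 \setminus A_1), \\
A  &= A' \cup A_1
    = (A_2 \setminus A_1) \cup A_1
    = A_1 \cup A_2 .
\end{aligned}
```

The second line uses the fact that A1 \ A2 is already contained in A1.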

Preferably, in step S5, the weight TF(T_i) of each word segment is calculated according to the number of occurrences of that word segment:

wherein T_i is the i-th word segment of the query term, and N_i is the number of times T_i occurs in a subset of the set A.
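The weight formula itself is not reproduced in this text, so the following is only a sketch under a stated assumption: the weight is taken to be the relative frequency N_i divided by the total number of segment occurrences in the subset. Whitespace splitting stands in for a real word segmenter, and the function name `term_weights` is hypothetical.

```python
from collections import Counter


def term_weights(subset_queries):
    """Weight each word segment by its relative frequency in one subset.

    ASSUMPTION: TF(T_i) = N_i / sum_k N_k, where N_i counts occurrences of
    segment T_i in the subset. The exact formula is not given in the text.
    ASSUMPTION: whitespace splitting replaces a real word segmenter.
    """
    counts = Counter(seg for query in subset_queries for seg in query.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {seg: n / total for seg, n in counts.items()}


subset = ["samsung phone", "samsung galaxy", "samsung phone case"]
tf = term_weights(subset)
top_segment = max(tf, key=tf.get)
```

With these three query words the segment counts are samsung 3, phone 2, galaxy 1 and case 1, so `tf["samsung"]` equals 3/7 and the brand name is the most frequent segment.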

Preferably, the set A includes m subsets, and the word segments in each subset of the set A are weighted and combined:

score(T_i) = Σ TF(T_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m)

wherein α_i is the similarity of the i-th subset of the set A, whose value is the reciprocal of the bipartite graph path length, and β_i is the weight of the i-th subset of the set A.
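The weighted combination can be sketched as below. The formula leaves the scope of the summation open, so this sketch assumes one reading: for each subset j the segment weight in that subset is multiplied by α_j × β_j, and the products are summed over the m subsets. All input values and the function name `score_segments` are hypothetical.

```python
def score_segments(subsets):
    """Combine per-subset segment weights into one score per segment.

    `subsets` is a list of (tf, alpha, beta) triples, one per subset of A:
      tf    -- dict mapping word segment -> weight within that subset
      alpha -- similarity of the subset (reciprocal of a path length)
      beta  -- weight of the subset
    ASSUMPTION: score(T) = sum over subsets j of tf_j(T) * alpha_j * beta_j.
    """
    scores = {}
    for tf, alpha, beta in subsets:
        for segment, weight in tf.items():
            scores[segment] = scores.get(segment, 0.0) + weight * alpha * beta
    return scores


# Hypothetical inputs for the query word "samsung phone".
subsets = [
    ({"samsung": 0.5, "phone": 0.5}, 1.0, 0.5),    # path length 1 -> alpha 1.0
    ({"samsung": 0.75, "phone": 0.25}, 0.5, 1.0),  # path length 2 -> alpha 0.5
]
scores = score_segments(subsets)
entity = max(scores, key=scores.get)  # step S6: highest-scoring segment
```

In this toy input "samsung" scores 0.25 + 0.375 = 0.625 and "phone" scores 0.25 + 0.125 = 0.375, so the brand segment is selected as the entity.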

In another aspect, the present invention provides a system for identifying an intended entity based on a search log of a user, the system comprising:

an analysis extraction unit for analyzing and extracting the original search log to obtain the query word of each click and the corresponding commodity information;

a query term forming unit for forming a session query set S_session, a co-click query set S_query and a co-query commodity set S_item;

a word clustering unit for obtaining a word clustering result A from the bipartite graph formed by the co-click query set S_query and the co-query commodity set S_item;

a weighting unit for performing word segmentation on each query word in the word clustering result A by a word segmentation technique and calculating the weight of each word segment;

and a comparison unit for comparing the weights of the word segments and selecting the word segment with the highest score as the entity corresponding to the query word.

Preferably, the word clustering unit includes a bipartite graph forming unit, an out-degree unit and a merging processing unit, wherein

the bipartite graph forming unit is configured to represent the co-click query set S_query and the co-query commodity set S_item by a bipartite graph G = (V, E), wherein the vertex set V can be divided into two mutually disjoint subsets (I, Q), I and Q being the commodity information set and the query term set respectively, and an edge in E represents a click relation between a commodity and a query term;

the out-degree unit is configured to classify the vertices V = (I, Q) of the bipartite graph G = (V, E) and to calculate, respectively, an out-degree set A1 with the commodity information set I as arc head and the query term set Q as arc tail, and an out-degree set A2 with the query term set Q as arc head and the commodity information set I as arc tail;

the merging processing unit is configured to merge the sets A1 and A2 as A' = (A1 ∪ A2) - (A1 ∩ A2) and then obtain the word clustering result A = A' ∪ A1.

Preferably, the weighting unit includes a weight calculation unit and a weighted combination unit, wherein

the weight calculation unit is configured to calculate the weight TF(T_i) of each word segment by a formula based on the number of occurrences of that word segment, wherein T_i is the i-th word segment of the query word and N_i is the number of occurrences of T_i in a subset of the set A;

the weighted combination unit is configured to weight and combine the word segments in each subset of the set A according to the formula score(T_i) = Σ TF(T_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m), wherein α_i is the similarity of the i-th subset of the set A, whose value is the reciprocal of the bipartite graph path length, and β_i is the weight of the i-th subset of the set A.

According to the method, clicked commodity information and query words are obtained from the original search log to form a candidate set; a bipartite graph of the candidate set is established and clustered, and similarity is calculated. This addresses the missing analysis of context and semantic background, improves the accuracy of searches, reduces search errors caused by differences in users' educational background, culture and region, and improves the user experience of shopping on e-commerce websites.

Drawings

FIG. 1 is a flow diagram of a method for intent entity identification based on a user search log in a preferred embodiment of the present invention;

FIG. 2 is a block diagram of a system for intent entity identification based on user search logs in accordance with a preferred embodiment of the present invention.

Detailed Description

The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby. The following description is of the preferred embodiment for carrying out the invention, and is made for the purpose of illustrating the general principles of the invention and not for the purpose of limiting the scope of the invention. The scope of the present invention is defined by the appended claims.

The invention is described in further detail below with reference to the figures and specific embodiments.

As shown in FIG. 1, a preferred embodiment of the present invention discloses a method for identifying an intention entity based on a user search log. The method comprises steps S1 to S6 set out above, including:

S4, merging the sets A1 and A2 to obtain a word clustering result A;

In this embodiment, a word clustering result is obtained from the bipartite graph formed by the co-click query set S_query and the co-query commodity set S_item, a weight is calculated for each word segment in the word clustering result, and the word segment with the highest score is selected as the entity corresponding to the query word. This avoids, to a certain extent, the prior-art problem of ignoring the context or semantic background, and improves the accuracy of the search results.

Further, step S2 includes steps S201 to S204 set out above.

In this embodiment, the session query set S_session, the co-click query set S_query and the co-query commodity set S_item are obtained by preprocessing the original log; the query words are determined from the click volume and the query volume, and the core words are extracted. This addresses the problem that users influenced by factors such as educational background, culture and region enter widely differing query words to express the same need.

Further, step S3 includes the steps of:

S301, representing the co-click query set S_query and the co-query commodity set S_item by a bipartite graph G = (V, E), wherein the vertex set V can be divided into two mutually disjoint subsets (I, Q), I and Q being the commodity information set and the query word set respectively, and an edge in E represents a click relation between a commodity and a query word;

S302, classifying the vertices V = (I, Q) of the bipartite graph G = (V, E), and calculating an out-degree set A1 with the commodity information set I as arc head and the query word set Q as arc tail, and an out-degree set A2 with the query word set Q as arc head and the commodity information set I as arc tail.

Further, in step S4, the sets A1 and A2 are merged as A' = (A1 ∪ A2) - (A1 ∩ A2), and the word clustering result A = A' ∪ A1 is then obtained.

In this embodiment, the out-degree set A1, with the commodity information set I as arc head and the query word set Q as arc tail, and the out-degree set A2, with the query word set Q as arc head and the commodity information set I as arc tail, are obtained from the bipartite graph, and A1 and A2 are merged by word clustering. This ensures that the word clustering result incorporates the user's context information and semantic background, and improves the accuracy of the query words.

Further, in step S5, the weight TF(T_i) of each word segment is calculated according to the number of occurrences of that word segment:

wherein T_i is the i-th word segment of the query word, and N_i is the number of occurrences of T_i in a subset of the set A.

Further, the set A includes m subsets, and the word segments in each subset of the set A are weighted and combined:

score(T_i) = Σ TF(T_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m)

wherein α_i is the similarity of the i-th subset of the set A, whose value is the reciprocal of the bipartite graph path length, and β_i is the weight of the i-th subset of the set A.
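The similarity is described as the reciprocal of a bipartite graph path length, but the text does not say between which vertices the path is measured. The sketch below assumes the shortest path, counted in edges, between two query words in the undirected click graph, found by breadth-first search. The graph contents, the function names, and the conventions for unreachable vertices (similarity 0.0) and identical vertices (similarity 1.0) are all assumptions.

```python
from collections import deque


def shortest_path_length(adjacency, source, target):
    """Breadth-first search; return the number of edges, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


def similarity(adjacency, source, target):
    """ASSUMPTION: alpha = 1 / path length; 0.0 when no path exists."""
    length = shortest_path_length(adjacency, source, target)
    if length is None:
        return 0.0
    return 1.0 if length == 0 else 1.0 / length


# Hypothetical undirected click graph: query words <-> commodities.
adjacency = {
    "samsung phone": {"i100", "i101"},
    "samsung galaxy": {"i100"},
    "xiaomi phone": {"i200"},
    "i100": {"samsung phone", "samsung galaxy"},
    "i101": {"samsung phone"},
    "i200": {"xiaomi phone"},
}
alpha = similarity(adjacency, "samsung phone", "samsung galaxy")
```

The two Samsung query words are joined through commodity `i100` by a path of two edges, so `alpha` is 0.5; "xiaomi phone" shares no clicked commodity with them and receives similarity 0.0.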

In this embodiment, word segmentation is performed on the word clustering result to obtain a plurality of word segments, a weight is calculated for each word segment and the weights are combined to obtain a score for each word segment, the scores are compared, and the word segment with the highest score is taken as the entity corresponding to the query word. Because the calculation is performed off-line, the cost of on-line calculation is saved.
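The split between off-line calculation and on-line use can be illustrated with a small sketch: the scores are reduced to a query-to-entity table ahead of time, and the on-line path only performs a dictionary lookup. The JSON serialization, the score values and the function names `build_entity_table` and `lookup_entity` are assumptions for illustration; the patent does not describe a storage format.

```python
import json


def build_entity_table(scored_queries):
    """Off-line step: keep only the top-scoring segment for every query word."""
    return {
        query: max(scores, key=scores.get)
        for query, scores in scored_queries.items()
        if scores
    }


def lookup_entity(table, query, default=None):
    """On-line step: a plain dictionary lookup, with no graph processing."""
    return table.get(query, default)


# Hypothetical off-line scores per query word.
scored = {
    "samsung phone": {"samsung": 0.625, "phone": 0.375},
    "xiaomi phone": {"xiaomi": 0.7, "phone": 0.3},
}
table = build_entity_table(scored)

# Persist off-line, then load in the (hypothetical) search service.
serialized = json.dumps(table, ensure_ascii=False, sort_keys=True)
online_table = json.loads(serialized)
entity = lookup_entity(online_table, "samsung phone")
```

A query word that was never seen off-line simply returns the default, so the on-line service can fall back to another strategy for it.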

It will be understood by those skilled in the art that all or part of the steps of the method of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method of the above embodiments; the storage medium may be a ROM/RAM, a magnetic disk, an optical disk, a memory card, and the like. Accordingly, the present invention also includes a system for identifying an intention entity based on a user search log, corresponding to the method of the present invention. Referring to FIG. 2, the system includes the units set out above in one-to-one correspondence with the method steps, including:

a word clustering unit for obtaining a word clustering result A from the bipartite graph formed by the co-click query set S_query and the co-query commodity set S_item;

In this embodiment, the word clustering unit obtains the query words that best reflect the context and semantic background, the weighting unit calculates the weight and score of each word segment of the query word, and the comparison unit selects the word segment with the highest score as the entity corresponding to the query word. The context and the semantic background are thereby integrated, search accuracy is improved, a better intention recognition effect is achieved, and calculation cost is saved so that the appropriate word segment can be obtained quickly.

Further, the word clustering unit comprises a bipartite graph forming unit, an out-degree unit and a merging processing unit, wherein

the bipartite graph forming unit is configured to represent the co-click query set S_query and the co-query commodity set S_item by a bipartite graph G = (V, E), wherein the vertex set V can be divided into two mutually disjoint subsets (I, Q), I and Q being the commodity information set and the query word set respectively, and an edge in E represents a click relation between a commodity and a query word;

the out-degree unit is configured to classify the vertices V = (I, Q) of the bipartite graph G = (V, E) and to calculate, respectively, an out-degree set A1 with the commodity information set I as arc head and the query word set Q as arc tail, and an out-degree set A2 with the query word set Q as arc head and the commodity information set I as arc tail;

and the merging processing unit is configured to merge the sets A1 and A2 as A' = (A1 ∪ A2) - (A1 ∩ A2) and then obtain the word clustering result A = A' ∪ A1.

In this embodiment, the word clustering unit obtains the word clustering result through the bipartite graph forming unit, the out-degree unit and the merging processing unit: the bipartite graph and the out-degree sets establish the association between commodities and query words, and the results are then merged, so that the word clustering result is more accurate and reflects both the semantic background and the context.

Further, the weighting unit includes a weight calculation unit and a weighted combination unit, wherein

the weight calculation unit is configured to calculate the weight TF(T_i) of each word segment by a formula based on the number of occurrences of that word segment, wherein T_i is the i-th word segment of the query word and N_i is the number of occurrences of T_i in a subset of the set A;

the weighted combination unit is configured to weight and combine the word segments in each subset of the set A according to the formula score(T_i) = Σ TF(T_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m), wherein α_i is the similarity of the i-th subset of the set A, whose value is the reciprocal of the bipartite graph path length, and β_i is the weight of the i-th subset of the set A.

In this embodiment, the weight calculation unit and the weighted combination unit calculate the weights and scores of all word segments of the query word off-line, which saves the cost of on-line calculation; the word segment corresponding to the entity can then be obtained directly by taking the highest score.
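The unit structure of the system can be sketched as a small composition of injected callables. This is an illustrative arrangement only: the class name `IntentEntitySystem`, the type aliases and the three placeholder functions are hypothetical, and the placeholders use deliberately simplified logic (grouping query words by shared commodity and counting segment occurrences) rather than the out-degree merge and weighted score described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Record = Tuple[str, str]    # (query word, commodity id)
Clusters = List[List[str]]  # subsets of the word clustering result A
Scores = Dict[str, float]   # word segment -> score


@dataclass
class IntentEntitySystem:
    """Wires the units together; each unit is an injected callable."""
    extract: Callable[[Iterable[str]], List[Record]]  # analysis extraction unit
    cluster: Callable[[List[Record]], Clusters]       # query term forming and word clustering units
    weigh: Callable[[str, Clusters], Scores]          # weighting unit

    def identify(self, log_lines: Iterable[str], query: str) -> Optional[str]:
        records = self.extract(log_lines)
        clusters = self.cluster(records)
        scores = self.weigh(query, clusters)
        # Comparison unit: the highest-scoring segment is the entity.
        return max(scores, key=scores.get) if scores else None


# Placeholder units (hypothetical, simplified logic).
def extract(lines):
    return [tuple(line.split("\t")) for line in lines if line.count("\t") == 1]


def cluster(records):
    by_item = {}
    for query, item in records:
        by_item.setdefault(item, []).append(query)
    return list(by_item.values())


def weigh(query, clusters):
    wanted = set(query.split())
    scores = {}
    for subset in clusters:
        if query not in subset:
            continue
        for other in subset:
            for segment in other.split():
                if segment in wanted:
                    scores[segment] = scores.get(segment, 0.0) + 1.0
    return scores


system = IntentEntitySystem(extract, cluster, weigh)
log = ["samsung phone\ti100", "samsung galaxy\ti100", "xiaomi phone\ti200"]
entity = system.identify(log, "samsung phone")
```

Because each unit is passed in as a callable, a simplified placeholder can later be swapped for a fuller implementation without changing the surrounding wiring.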

Compared with the prior art, the invention provides a method for identifying an intention entity based on a user search log, in which clicked commodity information and query words are obtained from the original search log to form a candidate set; a bipartite graph of the candidate set is established and clustered, and similarity is calculated. This addresses the missing analysis of context and semantic background, improves the accuracy of searches, reduces search errors caused by differences in users' educational background, culture and region, and improves the user experience of shopping on e-commerce websites.

It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and the scope of the present invention is not limited thereby, and the present invention may be modified in materials and structures, or replaced with technical equivalents, in the constructions of the above-mentioned various components. Therefore, structural equivalents made by using the description and drawings of the present invention or by directly or indirectly applying to other related arts are also encompassed within the scope of the present invention.

Claims

1. A method for intent entity identification based on a user search log, the method comprising the steps of:

S2, obtaining the query word set of the session units, the query word set of the co-click commodities and the commodity click data set of the co-click commodities, and merging and de-duplicating them to form a session query set S_session, a co-click query set S_query and a co-query commodity set S_item;

S3, representing the co-click query set S_query and the co-query commodity set S_item by a bipartite graph, processing the vertices of the bipartite graph, dividing the vertices of the bipartite graph into a commodity information set and a query word set, and calculating out-degree sets A1 and A2, with the commodity information set as arc head and the query word set as arc tail, and with the query word set as arc head and the commodity information set as arc tail, respectively;

S4, merging the sets A1 and A2 to obtain A' = (A1 ∪ A2) - (A1 ∩ A2), and further obtaining a word clustering result A = A' ∪ A1;

2. The method of claim 1, wherein the step S2 includes the steps of:

S204, merging and de-duplicating the query term set of the session units, the query term set of the co-click commodities and the commodity click data set of the co-click commodities to obtain the session query set S_session, the co-click query set S_query and the co-query commodity set S_item.

3. The method of claim 1, wherein the step S3 includes the steps of:

S301, representing the co-click query set S_query and the co-query commodity set S_item by a bipartite graph G = (V, E), wherein the vertex set V can be partitioned into two mutually disjoint subsets (I, Q), which are the commodity information set and the query term set respectively, and an edge in E represents the click relation between a commodity and a query term;

4. The method according to claim 1, wherein in step S5, the weight TF(T_i) of each word segment is calculated according to the number of occurrences of that word segment:

wherein T_i is the i-th word segment of the query term, and N_i is the number of times T_i appears in a subset of the word clustering result A.

5. The method of claim 4, wherein the word clustering result A comprises m subsets, and the word segments in each subset of the word clustering result A are weighted and combined:

score(T_i) = Σ TF(T_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m)

wherein α_i is the similarity of the i-th subset of the word clustering result A, whose value is the reciprocal of the path length of the bipartite graph, and β_i is the weight of the i-th subset of the word clustering result A.

6. A system for intent entity identification based on user search logs, the system comprising:

a query term forming unit for obtaining the query word set of the session units, the query word set of the co-click commodities and the commodity click data set of the co-click commodities, and merging and de-duplicating them to form a session query set S_session, a co-click query set S_query and a co-query commodity set S_item;

a word clustering unit for performing vertex processing on the bipartite graph formed by the co-click query set S_query and the co-query commodity set S_item, dividing the vertices of the bipartite graph into a commodity information set and a query word set, calculating out-degree sets A1 and A2, with the commodity information set as arc head and the query word set as arc tail, and with the query word set as arc head and the commodity information set as arc tail, respectively, and merging the out-degree sets, wherein A' = (A1 ∪ A2) - (A1 ∩ A2), to obtain a word clustering result A = A' ∪ A1;

and a comparison unit for comparing the weights of the word segments and selecting the word segment with the highest score as the entity corresponding to the query word.

7. The system of claim 6, wherein the word clustering unit comprises a bipartite graph forming unit, an out-degree unit and a merging processing unit, wherein

the bipartite graph forming unit is configured to represent the co-click query set S_query and the co-query commodity set S_item by a bipartite graph G = (V, E), wherein the vertex set V can be divided into two mutually disjoint subsets (I, Q), I and Q being the commodity information set and the query term set respectively, and an edge in E represents a click relationship between a commodity and a query term;

8. The system of claim 6, wherein the weighting unit comprises a weight calculation unit and a weighted combination unit, wherein

calculating the weight TF(T_i) of each word segment, wherein T_i is the i-th word segment of the query word and N_i is the number of times T_i appears in a subset of the word clustering result A;

the weighted combination unit is configured to weight and combine the word segments in each subset of the word clustering result A according to the formula score(T_i) = Σ TF(T_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m), wherein α_i is the similarity of the i-th subset of the word clustering result A, whose value is the reciprocal of the path length of the bipartite graph, and β_i is the weight of the i-th subset of the word clustering result A.