CN111079035A

CN111079035A - Domain search ordering method based on dynamic map link analysis

Info

Publication number: CN111079035A
Application number: CN201911146865.8A
Authority: CN
Inventors: 鲍家坤; 刘思培; 高天成; 曹玲玲; 张志虎; 袁鸯; 宋春林; 侯海婷; 邹媛媛; 童安玲; 李金龙; 李香亭; 王娟; 杨磊
Original assignee: North Information Control Institute Group Co ltd
Current assignee: North Information Control Institute Group Co ltd
Priority date: 2019-11-21
Filing date: 2019-11-21
Publication date: 2020-04-28
Anticipated expiration: 2039-11-21
Also published as: CN111079035B

Abstract

The invention belongs to the field of internet search, and particularly relates to a domain search ordering method based on dynamic map link analysis. The invention first establishes semantic-level link relations among the file resources in a domain search, then performs calculation from the two aspects of authority and relevance, and finally realizes fused ranking of the search results. The method comprises the following steps: dynamically constructing a domain graph oriented to search ranking; performing offline incremental calculation of the authority of file nodes based on the full graph; performing online calculation of the relevance of file nodes based on the search subgraph; and fusing and ranking the search results based on authority and relevance. By taking the entities and relations in the text content of files as links, the method associates originally isolated files at the semantic level, overcomes the information-silo problem of individual files in search ranking, analyzes file nodes at the two levels of authority and relevance, and finally achieves fused ranking of the search results.

Description

Domain search ordering method based on dynamic map link analysis

Technical Field

The invention belongs to the field of internet search, and particularly relates to a domain search ordering method based on dynamic map link analysis.

Background

Helping users locate the resources they need accurately and quickly is a constant goal of search engines. However, as information is continuously generated and accumulated, a single search often returns a large number of results. The search engine must therefore rely on an effective ranking method to return the results the user wants and present them first. Compared with general internet search, users of domain search are more specialized and purposeful, which places higher requirements on search ranking.

Traditional ranking methods based on word frequency and word position rely on too narrow a ranking basis and cannot take the quality of file resources into account. Existing ranking methods based on web page link analysis (such as PageRank and HillTop) cannot be applied directly to domain search, which lacks web page link relations. Existing ranking methods based on learning user browsing preferences (such as RankSVM) generally train on "user-query" records as an isolated sample set; although they handle the historical search requests of historical users well, they have difficulty providing an effective ranking for new users or new requests, and even when improved by using similar "user-query" records they cannot be applied to domain search scenarios with a small number of users. The paid (bidding) ranking used by internet search engines runs contrary to the principles of professionalism and authority in domain search and is not applicable.

Disclosure of Invention

The invention aims to provide a domain searching and sequencing method based on dynamic map link analysis.

The technical solution for realizing the purpose of the invention is as follows:

The method first establishes semantic-level link relations among the file resources in a domain search, then performs calculation from the two aspects of authority and relevance, and finally realizes fused ranking of the search results; the specific steps are as follows:

step (1): dynamically constructing a domain graph oriented to search ranking; the domain graph is established by taking the various file sets in the domain as input;

step (2): offline incremental calculation of the authority of file nodes based on the full graph; taking the domain graph of step (1) as input, the authority of every file node in the domain graph is calculated;

step (3): online calculation of the relevance of file nodes based on the search subgraph; taking the domain graph and the user's search terms as input, a search subgraph related to the search is extracted from the whole domain graph, and the relevance of each file node in the subgraph is calculated;

step (4): fused ranking of the search results based on authority and relevance; taking the authority and the relevance of each file node in the search subgraph of step (3) as input, the ranking degree of the file nodes is calculated comprehensively, and the results are sorted by ranking degree and returned to the user.

Compared with the prior art, the invention has the remarkable advantages that:

(1) The domain graph construction method oriented to search ranking takes the entities and relations in the text content of files as links, so that originally isolated files are associated at the semantic level. This overcomes the information-silo problem of individual files in search ranking, brings all files of the domain into one association system for evaluation, and lays the foundation for analyzing the authority and relevance of each file node.

(2) The definitions and calculation methods for file node authority and relevance based on the domain graph, given in step (2) and step (3), make it possible to evaluate quantitatively the authority of a file node in the whole domain graph and the relevance between a file node and the user's search keywords in the search subgraph, which in turn enables the ranking method of step (4) that fuses authority and relevance.

(3) The dynamic construction method and the incremental calculation method given in step (1) and step (2) construct the domain graph dynamically and calculate the authority of file nodes incrementally as files to be retrieved are added, deleted or modified, which reduces the amount of computation and improves the computational efficiency and practicality of the system.

Drawings

FIG. 1 is a flow chart of a search ranking method of the present invention.

Fig. 2 is a partial schematic view of a domain map of the search ranking method of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar function. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting it.

As shown in fig. 1, an overall flowchart of a domain search ranking method based on dynamic graph link analysis according to an embodiment of the present invention includes the following steps:

Step S1, dynamically constructing the domain graph oriented to search ranking. Fig. 2 is a partial schematic view of a domain graph. The domain graph is composed of 4 kinds of elements: entity nodes, file nodes, association edges and link edges. The file nodes correspond to all files to be retrieved in the domain search; the file types include but are not limited to text files, multimedia files and database files. The entity nodes correspond to the entities described in the file content and are obtained through entity extraction, entity disambiguation, coreference resolution and similar steps. The association edges describe the association relations between entity nodes; these relations are extracted once the entity nodes have been extracted, and new potential relations between entities are discovered through relation reasoning. Each association edge has a weight, which represents how close the relation between the two entity nodes is. A link edge connects an entity node and a file node and indicates that the entity node was extracted from the description in that file. Each link edge has a weight, which represents how closely the file and the entity are related.

Step S101, construction of entity nodes and association edges and calculation of association weights. Combining domain prior knowledge, entity recognition, entity disambiguation (resolving the same name referring to different things) and coreference resolution (resolving different names referring to the same thing) are performed on the text content of the files to obtain named entities that are as accurate and unambiguous as possible; these entities form the entity nodes of the domain graph. The association relations among the entities in the text content are then identified to obtain candidate relations, and the relations are disambiguated and resolved to obtain accurate and unambiguous association relations, which form the association edges of the domain graph. Note that association edges are directed and appear in pairs. Each association edge has an association weight representing how close the relation between the entities is; it can be expressed by, but is not limited to, the co-occurrence of the entities.

The calculation of the association weight comprises two steps, initial association weight calculation and normalization. If the entities at the two ends of an association edge appear together in k files in total, the initial association weight of the edge is corrValue'(i, j) = k. After the initial weights of all association edges have been calculated, the initial weights corrValue'(i, j) of the edges leaving the same entity node i are normalized in proportion to their values, which gives the association weight corrValue(i, j) of each edge.
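The two-step weight calculation can be sketched in Python as follows. This is only an illustration under stated assumptions: the input format (a mapping from file to the set of entities extracted from it) and the function name `association_weights` are hypothetical, and every co-occurring pair is treated as an association edge, whereas the method itself first extracts relations and then weights them.

```python
from collections import defaultdict
from itertools import combinations

def association_weights(file_entities):
    """file_entities: dict file_id -> set of entity names (assumed input format)."""
    # Initial weight corrValue'(i, j) = number of files in which i and j co-occur.
    initial = defaultdict(lambda: defaultdict(int))
    for entities in file_entities.values():
        for a, b in combinations(sorted(entities), 2):
            initial[a][b] += 1  # association edges are directed and appear in pairs
            initial[b][a] += 1
    # Normalize the weights of the edges leaving the same entity node
    # in proportion to their values, so that they sum to 1.
    corr = {}
    for src, targets in initial.items():
        total = sum(targets.values())
        corr[src] = {dst: k / total for dst, k in targets.items()}
    return corr

files = {"f1": {"A", "B"}, "f2": {"A", "B", "C"}, "f3": {"A", "C"}}
corr = association_weights(files)
```

In this toy input, A co-occurs with B in two files and with C in two files, so the two edges leaving A each receive half of A's outgoing weight.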

Step S102, construction of file nodes and link edges and calculation of link weights. The file nodes of the domain graph are in one-to-one correspondence (a bijection) with the files to be retrieved and can be constructed directly: each file node in the graph represents one file to be retrieved. If an entity node was extracted from the file content corresponding to a file node, a link edge exists between that entity node and that file node. The calculation of the link weight comprises two processes, initial link weight calculation and normalization. The initial link weight takes two aspects into account: the association degree α of the entity node with the file node, and the importance degree β of the file node with respect to the entity node.

① When the importance of file nodes to entity nodes is difficult to classify or evaluate manually, β = 1 is used for all file nodes, and the initial link weight is linkValue' = α. After the initial weight of every link edge has been calculated, the initial weights of the link edges connected to the same file node are normalized to obtain the link weight linkValue. α may be calculated by, but is not limited to, the following method:

α＝TF(t,d)·IDF(t,d)·α₁(t,d)

where t is the entity name of the entity node, d is the file to be retrieved, TF(t, d) is the frequency with which t appears in d, and IDF(t, d) = log(N/(n_{t,d} + γ)), where N is the number of files in the file set to be retrieved, n_{t,d} is the number of files containing entity t, and γ is usually taken as 0.01 to ensure that the denominator is not zero. α₁(t, d) is a position coefficient: it takes a value greater than 1 when the entity name t appears at a special position such as the title, abstract or keywords, and 1 otherwise.

② When entities and files can be classified and scored manually according to the domain (for example, files in the financial domain are classified into report, account, financial news and other types; files in the mechanical domain into manual, reference material and other types; files in the software domain into software test instruction, software development manual, software test report and other types), a β value is set according to the importance of the different file types in each domain. The initial link weight is then linkValue' = α·β (α is calculated as in case ①).
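The link weight calculation of cases ① and ② can be sketched as below. The function names, the position coefficient of 2.0 for a title occurrence and the sample TF values are all assumptions made for illustration; the text only requires a coefficient greater than 1 at special positions and does not fix how TF is measured.

```python
import math

def initial_link_weight(tf, n_files, n_files_with_entity,
                        position_coeff=1.0, beta=1.0, gamma=0.01):
    """linkValue' = alpha * beta, with alpha = TF * IDF * position coefficient."""
    idf = math.log(n_files / (n_files_with_entity + gamma))
    alpha = tf * idf * position_coeff
    return alpha * beta  # beta = 1.0 reproduces case 1 (no manual scoring)

def normalize_links(initial_by_entity):
    """Normalize the initial weights of all link edges attached to one file node.

    Assumes the weights are positive, so their sum is non-zero.
    """
    total = sum(initial_by_entity.values())
    return {entity: w / total for entity, w in initial_by_entity.items()}

# One file in a set of N = 100 files; entity "A" appears in the title
# (position coefficient 2.0 is an assumed value), entity "B" only in the body.
raw = {
    "A": initial_link_weight(tf=0.05, n_files=100, n_files_with_entity=10,
                             position_coeff=2.0),
    "B": initial_link_weight(tf=0.02, n_files=100, n_files_with_entity=40),
}
link_values = normalize_links(raw)
```

Because "A" is both rarer across the file set and placed in the title, it ends up with the larger share of this file's link weight.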

Step S103, incremental dynamic updating of the graph. In a domain search scenario the files to be retrieved may be updated and changed, so a corresponding incremental update mechanism for the domain graph is needed to avoid rebuilding the whole graph because of local file changes. The file set to be retrieved can change in 3 ways: files are added, deleted or modified. For a newly added file, the entity nodes, file node, association edges and link edges corresponding to the new file are extracted according to the methods of steps S101 and S102, and the weights of the affected association edges and link edges are updated. For a deleted file, the corresponding file node and the edges connected to it are deleted first; if an entity node then has no link edge left, that entity node and its association edges are deleted; and the weights of the affected association edges and link edges are updated. For a modified file, the domain graph is updated by the equivalent operations of deleting the file and then adding it.
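A minimal sketch of the three update operations is shown below. The `DomainGraph` class and its attribute names are hypothetical; for brevity it stores only co-occurrence counts (the initial association weights) once per unordered pair and leaves out link weights and normalization.

```python
from collections import Counter
from itertools import combinations

class DomainGraph:
    """Toy domain graph that is updated per file instead of being rebuilt."""

    def __init__(self):
        self.file_entities = {}   # file node -> set of linked entity nodes
        self.entity_files = {}    # entity node -> set of linked file nodes
        self.cooccur = Counter()  # (entity, entity) -> initial association weight

    def add_file(self, file_id, entities):
        entities = set(entities)
        self.file_entities[file_id] = entities
        for e in entities:
            self.entity_files.setdefault(e, set()).add(file_id)
        for pair in combinations(sorted(entities), 2):
            self.cooccur[pair] += 1  # only edges touched by this file change

    def delete_file(self, file_id):
        entities = self.file_entities.pop(file_id)  # drop file node and its link edges
        for pair in combinations(sorted(entities), 2):
            self.cooccur[pair] -= 1
            if self.cooccur[pair] == 0:
                del self.cooccur[pair]  # association edge no longer supported
        for e in entities:
            self.entity_files[e].discard(file_id)
            if not self.entity_files[e]:  # entity node has no link edge left
                del self.entity_files[e]

    def modify_file(self, file_id, entities):
        # A modification is treated as a deletion followed by an addition.
        self.delete_file(file_id)
        self.add_file(file_id, entities)

g = DomainGraph()
g.add_file("f1", {"A", "B"})
g.add_file("f2", {"B", "C"})
g.delete_file("f1")               # entity A loses its only link edge
g.modify_file("f2", {"C", "D"})   # B disappears, D appears
```

After these calls the graph holds only the entities C and D, linked to file f2, with a single co-occurrence edge between them.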

Step S2, offline incremental calculation of the authority of file nodes based on the full graph. The invention takes the entity nodes of the domain graph as the reachable states of a system, with the transition probabilities between states determined by the association edge weights between entity nodes. The whole system forms a Markov chain, and the stationary distribution of this Markov chain is the authority of the entity nodes. If the total number of entity nodes is N, the transition matrix is B_{N×N} (a matrix B with N rows and N columns) and the authority of the N entity nodes is the vector x_{N×1} (N rows, 1 column), then Bx = x. Based on the Monte Carlo method, the invention simulates the behavior of users accessing entity nodes by random walks; when the domain graph changes, the random walk process can be updated incrementally for the affected entity nodes only, which realizes incremental calculation of entity node authority. The authority of a file node is equal to the sum, over its link edges, of the link weight multiplied by the authority of the linked entity node.

Step S201, design of the entity node authority.

If entity node i has an association edge pointing to entity node j, then A_{ji} = corrValue(i, j); otherwise A_{ji} = 0. Because the weights of the association edges leaving the same entity node were normalized in step S101, the i-th column of matrix A sums to 1 whenever entity node i has an association edge pointing to another entity node. If entity node i does not point to any other entity node, A_{ii} = 1 is forced. This ensures that A is a transition matrix whose column sums are all 1.

Considering that the user has a certain probability 1−δ of skipping the link relations and directly accessing a new node (estimated as the number of times users directly access a new node divided by the total number of node accesses), the following is obtained according to the Markov model:

As defined in step S2, the authority vector x of the N entity nodes is the stationary distribution of the Markov chain. In the above formula, x_n and x_{n+1} are successive iterates in the calculation of x (x can be regarded as the limit of x_n as n → ∞).

Let

Then the column sums of B are also all 1, so calculating the entity node authority is equivalent to solving for the stationary distribution of the Markov chain with transition matrix B, i.e. the authority vector satisfies x = Bx (equivalently, x is the limit of x_n as n → ∞).
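The iteration formula and the definition of B referred to in this step are not reproduced in the text. Assuming that a direct jump lands on each of the N entity nodes with equal probability (the usual PageRank form, which is an assumption here and not stated in the text), a reconstruction consistent with the surrounding definitions would be:

```latex
% Assumed uniform jump over the N entity nodes; \mathbf{1} is the all-ones column vector.
x_{n+1} = \delta A x_n + \frac{1-\delta}{N}\,\mathbf{1},
\qquad
B = \delta A + \frac{1-\delta}{N}\,\mathbf{1}\mathbf{1}^{\mathsf{T}} .
% Since the entries of x_n sum to 1, \mathbf{1}^{\mathsf{T}} x_n = 1 and therefore B x_n = x_{n+1}.
% Each column of A sums to 1, so each column of B sums to \delta + (1-\delta) = 1.
```

Under this assumption the two properties stated above follow directly: B is a transition matrix whose column sums are all 1, and the limit of the iteration satisfies x = Bx.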

Step S202, incremental calculation of entity node authority based on the Monte Carlo method. The behavior of a user accessing entity nodes is simulated by random walks, and the stationary distribution of the Markov chain in step S201, i.e. the authority of each entity node, is estimated by counting the number of times each node is visited.

The method uses cyclic starting points: M random walks are started from each of the N entity nodes (N×M random walks in total). At each step, the walk directly accesses a new node with probability (1−α), which can be regarded as the end of that walk, and moves from entity node i to entity node j with probability α·corrValue(i, j). Here α denotes the probability of following an association edge (written δ in step S201), not the association degree α of step S102. Finally, the number of times v(i) that each entity node i was visited is counted, and v(i) divided by the total number of visits to all entity nodes gives the average visit probability of node i, i.e. the authority authorityEntity(i) of entity node i.
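The random walk estimate can be sketched as follows. The function names, the parameter name `follow_prob` (standing for the probability of following an association edge) and the three-node example graph are assumptions for illustration; a node without outgoing edges keeps the walker in place, mirroring the forced self-loop of step S201.

```python
import random
from collections import Counter

def random_walks(corr, m, follow_prob, rng):
    """Start m walks from every entity node and return the visited paths."""
    if not 0 <= follow_prob < 1:
        raise ValueError("follow_prob must be in [0, 1) so that walks terminate")
    walks = []
    for start in corr:
        for _ in range(m):
            path, node = [start], start
            # Follow an association edge with probability follow_prob;
            # otherwise the walker jumps to a new node, which ends this walk.
            while rng.random() < follow_prob:
                out = corr.get(node)
                if out:
                    targets = list(out)
                    node = rng.choices(targets, weights=[out[t] for t in targets])[0]
                path.append(node)
            walks.append(path)
    return walks

def authority_from_walks(walks):
    """authorityEntity(i) = v(i) / total number of visits over all walks."""
    visits = Counter(node for path in walks for node in path)
    total = sum(visits.values())
    return {node: count / total for node, count in visits.items()}

corr = {"A": {"B": 0.5, "C": 0.5}, "B": {"A": 1.0}, "C": {"A": 1.0}}
walks = random_walks(corr, m=200, follow_prob=0.85, rng=random.Random(0))
authority = authority_from_walks(walks)
```

Storing the full paths rather than only the visit counts costs memory, but it is what makes the incremental update of the next paragraph possible.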

The incremental update works as follows. First, the random walks of each round are recorded before the graph structure changes. The entity nodes that change in the current round (additions and deletions of entity nodes) are collected as a set X, and the association edges that change (additions and deletions of association edges, or changes in their weights) are collected as a set Y. The entity nodes that have an association relation with a node in X, or that are connected by an edge in Y, are recorded as a set Z. Then X ∪ Z is the set of trigger nodes at which the random walks need to be updated. The update examines the N×M random walks of the previous round, finds the first trigger node in each walk, keeps the part of the walk before that trigger node, continues the walk from there according to the new domain graph, and recalculates the authority of each entity node.
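One possible reading of this update is sketched below. The helper names are hypothetical, and several details are assumptions not fixed by the text: the walk is truncated just after its first trigger node when that node still exists, walks that started at a deleted node are dropped, and newly added nodes receive M fresh walks.

```python
import random

def continue_walk(path, corr, follow_prob, rng):
    """Continue a walk from its last node on the graph described by corr."""
    node = path[-1]
    while rng.random() < follow_prob:
        out = corr.get(node)
        if out:  # a node without outgoing edges keeps the walker in place
            targets = list(out)
            node = rng.choices(targets, weights=[out[t] for t in targets])[0]
        path.append(node)
    return path

def update_walks(old_walks, triggers, new_corr, m, follow_prob, rng):
    """Reuse the stored walks of the previous round after a graph change."""
    walks = []
    for path in old_walks:
        hit = next((i for i, n in enumerate(path) if n in triggers), None)
        if hit is None:
            walks.append(path)  # untouched walk, reused unchanged
        elif path[hit] in new_corr:
            # keep the walk up to its first trigger node, redo the rest
            walks.append(continue_walk(path[:hit + 1], new_corr, follow_prob, rng))
        elif hit > 0:
            walks.append(continue_walk(path[:hit], new_corr, follow_prob, rng))
        # a walk that started at a deleted node is dropped
    started = {path[0] for path in walks}
    for node in new_corr:  # newly added entity nodes get m fresh walks
        if node not in started:
            walks.extend(continue_walk([node], new_corr, follow_prob, rng)
                         for _ in range(m))
    return walks

# Previous round: graph A <-> B, m = 2 walks per node (paths as recorded).
old_walks = [["A", "B", "A"], ["A"], ["B", "A"], ["B"]]
# Change: entity C is added with edges B <-> C, so X = {C} and Z = {B}.
new_corr = {"A": {"B": 1.0}, "B": {"A": 0.5, "C": 0.5}, "C": {"B": 1.0}}
new_walks = update_walks(old_walks, {"B", "C"}, new_corr, m=2,
                         follow_prob=0.85, rng=random.Random(1))
```

Only walks that touch a trigger node are re-simulated; the walk `["A"]` never reaches B or C and is carried over as is.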

Step S203, authority calculation of the file nodes. The authority authorityFile of a file node is equal to the sum, over its link edges, of the link weight linkValue multiplied by the authority of the linked entity node. That is,

authorityFile(i) = Σ_k linkValue(i, k)·authorityEntity(k)

where authorityFile(i) represents the authority of file node i; authorityEntity(k) represents the authority of an entity node k that has a link edge to file node i; and linkValue(i, k) represents the weight of the link edge between file node i and entity node k.
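As a small worked example of this weighted sum (the function name and the numbers are made up for illustration):

```python
def file_authority(link_values, entity_authority):
    """authorityFile(i) = sum over linked entities k of linkValue(i, k) * authorityEntity(k)."""
    return {
        f: sum(w * entity_authority[e] for e, w in links.items())
        for f, links in link_values.items()
    }

# Hypothetical link weights (normalized per file node) and entity authorities.
link_values = {"f1": {"A": 0.7, "B": 0.3}, "f2": {"B": 1.0}}
entity_authority = {"A": 0.5, "B": 0.25, "C": 0.25}
authority_file = file_authority(link_values, entity_authority)
```

Here f1 scores 0.7·0.5 + 0.3·0.25 = 0.425 and f2 scores 1.0·0.25 = 0.25, so the file linked mainly to the more authoritative entity A ranks higher on authority.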

Step S3, online calculation of the relevance of file nodes based on the search subgraph. A search subgraph is extracted from the domain graph according to the file nodes contained in the search results. The relevance of an entity node is determined by the number of file nodes linked to it. The relevance of a file node is determined by the products of the weight of each of its link edges and the relevance of the entity node at the other end of that edge.

Step S301, constructing the search subgraph. The search subgraph is constructed from the results of each search and is a subgraph of the domain graph. Each result obtained by the search engine through keyword matching or similar means corresponds to a file node, and these file nodes form the file nodes of the search subgraph. The link edges of these file nodes in the domain graph, and the entity nodes they link to, form the link edges and entity nodes of the search subgraph. The entity nodes in the search subgraph keep the association relations they have in the domain graph, which form the association edges of the search subgraph.

Step S302, relevance calculation of the entity nodes. The relevance of an entity node is determined by the number of file nodes linked to it: the relevance of each entity node in the search subgraph is equal to the number of file nodes it is linked to. Assuming that Fig. 2 is a search subgraph, the relevance of entity node A and of entity node B is 3.

Step S303, relevance calculation of the file nodes. The relevance of a file node is determined by the product of the weight of each of its link edges and the relevance of the entity node linked by that edge. When a file node has several link edges, the product is calculated for each link edge and the products are summed.

The calculation rule is illustrated with Fig. 2. Assume that Fig. 2 is a search subgraph, that the relevance values of entity node A and entity node B are relavancyEntityA and relavancyEntityB, that linkValue3 is the weight of the link edge between entity node A and file node C, and that linkValue4 is the weight of the link edge between entity node B and file node C. The relevance relavancyFileC of file node C is then calculated as:

relavancyFileC = relavancyEntityA·linkValue3 + relavancyEntityB·linkValue4.
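Steps S301 to S303 can be sketched together as below. The function names and the sample link weights are hypothetical, and the search subgraph is reduced to the link edges of the retrieved files, since association edges are not needed for the relevance values.

```python
def search_subgraph(link_values, hits):
    """Keep the retrieved file nodes with their link edges and linked entity nodes."""
    return {f: link_values[f] for f in hits if f in link_values}

def entity_relevance(sub):
    """Relevance of an entity node = number of file nodes in the subgraph linked to it."""
    counts = {}
    for links in sub.values():
        for e in links:
            counts[e] = counts.get(e, 0) + 1
    return counts

def file_relevance(sub):
    """Relevance of a file node = sum of linkValue * relevance of the linked entity."""
    rel = entity_relevance(sub)
    return {f: sum(w * rel[e] for e, w in links.items()) for f, links in sub.items()}

# Hypothetical link weights for four files; the search returns three of them.
link_values = {
    "f1": {"A": 0.6, "B": 0.4},
    "f2": {"A": 1.0},
    "f3": {"A": 0.5, "B": 0.5},
    "f4": {"C": 1.0},
}
sub = search_subgraph(link_values, hits=["f1", "f2", "f3"])
entity_rel = entity_relevance(sub)
file_rel = file_relevance(sub)
```

In this example entity A is linked to all three retrieved files and B to two, so f2, which is linked only to A, obtains the highest relevance (3.0).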

Step S4, fused ranking of the search results based on authority and relevance.

In the invention, the ranking of search results must take both authority and relevance into account. The ranking degree of each file node is therefore rankValue = Ω·authorityFile + (1−Ω)·λ·relavancyFile, where λ is introduced so that authority and relevance have similar magnitudes, and Ω determines the relative weight of authority and relevance in the ranking of file nodes. Only the files retrieved in the current search are considered as file nodes here.

If the median of authorityFile is a and the median of relavancyFile is b, then λ may be taken as a/b. Manually ranked samples are constructed for m searches, the i-th search having n_i manually ranked results; for each given Ω, the n_i automatically ranked results of the i-th search are obtained. Treating the manually ranked samples as the correct ranking and taking minimization of the error rate of the automatic ranking as the optimization target, the value of Ω can be obtained by equidistant sampling (Ω is stepped from 0 to 1 with step length Δ, which is determined by the required precision, e.g. 0.01) or by a one-dimensional search algorithm (such as Newton's method).
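The fused ranking and the equidistant sampling of Ω can be sketched as follows. The error rate is not defined in the text; the fraction of wrongly ordered file pairs used here is an assumption, as are the function names and the sample values. Each manual ranking is assumed to contain at least two files.

```python
from statistics import median

def ranking(authority, relevance, omega):
    """Sort file nodes by rankValue = omega*authorityFile + (1-omega)*lam*relavancyFile."""
    lam = median(authority.values()) / median(relevance.values())  # lambda = a / b
    scores = {f: omega * authority[f] + (1 - omega) * lam * relevance[f]
              for f in authority}
    return sorted(scores, key=scores.get, reverse=True)

def pairwise_error(auto, manual):
    """Fraction of file pairs ordered differently from the manual ranking (assumed metric)."""
    pos = {f: i for i, f in enumerate(auto)}
    pairs = [(a, b) for i, a in enumerate(manual) for b in manual[i + 1:]]
    return sum(pos[a] > pos[b] for a, b in pairs) / len(pairs)

def fit_omega(samples, step=0.01):
    """samples: list of (authority, relevance, manual_ranking), one entry per search."""
    grid = [i * step for i in range(int(round(1 / step)) + 1)]

    def total_error(omega):
        return sum(pairwise_error(ranking(a, r, omega), manual)
                   for a, r, manual in samples)

    return min(grid, key=total_error)  # first grid value with the smallest error

authority = {"f1": 0.4, "f2": 0.1, "f3": 0.2}
relevance = {"f1": 1.0, "f2": 3.0, "f3": 2.0}
manual = ["f1", "f3", "f2"]
omega = fit_omega([(authority, relevance, manual)])
best = ranking(authority, relevance, omega)
```

With these numbers relevance alone (Ω = 0) would rank f2 first, while the manual sample prefers the authoritative f1, so the search settles on the smallest sampled Ω at which the automatic ranking matches the manual one.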

Claims

1. A domain search ordering method based on dynamic map link analysis, characterized in that semantic-level link relations are first established among the file resources in the domain search, calculation is then performed from the two aspects of authority and relevance, and fused ranking of the search results is finally realized; the method comprises the following specific steps:

2. The method of claim 1, wherein the step (1) comprises the steps of:

step (11): the construction of the entity node and the association edge and the calculation of the association weight are carried out;

the method comprises the steps of carrying out entity identification, entity disambiguation and coreference resolution on text contents of a file to obtain accurate and unambiguous named entities, wherein the entities form entity nodes in a domain map; identifying incidence relations among entities in text content to obtain potential candidate relations, disambiguating and resolving the relations to obtain accurate and unambiguous incidence relations, wherein the incidence relations form 'incidence edges' in a domain map; each association edge has an association weight, and the weight represents the closeness degree of the relationship between the entities;

step (12): the construction of the file nodes and the linking edges and the calculation of the linking weight values are carried out; the 'file node' in the domain map and the file to be retrieved are bijective to each other and are directly constructed, and each file node in the map represents one file to be retrieved; if a certain entity node is extracted from the file content corresponding to a certain file node, a link edge exists between the entity node and the file node; the calculation of the link weight comprises two processes of initial link weight calculation and normalization calculation;

step (13): dynamically updating the map increment;

the change form of the file set to be retrieved comprises a newly added file, a deleted file and a modified file, and in the case of the newly added file, the extraction of the entity node, the file node, the associated edge and the link edge corresponding to the newly added file is required to be completed according to the steps (11) and (12); updating the weight values of the affected associated edges and the link edges; in case of deleting a file, the corresponding file node and its associated edge need to be deleted first; if the entity node does not have a connected link edge, deleting the entity node and the associated edge thereof; updating the weight values of the affected associated edges and the link edges; and in the case of modifying the file, updating the domain map according to the equivalent operation of deleting and adding the domain map.

3. The method according to claim 2, wherein the calculation of the association weight in step (11) includes two steps of initial association weight calculation and normalization; the method specifically comprises the following steps: if the entities at the two ends of the associated edge are commonly present in k files in total, the initial associated weight corrValue' (i, j) of the associated edge is equal to k; after the association weights of all the association edges are calculated, normalizing the initial association weight corrValue' (i, j) sent by the same entity node according to the numerical proportion to obtain the association weight corrValue (i, j) of the association edge.

4. The method of claim 2, wherein the initial link weight calculation in step (12) takes into account two aspects, namely the degree of association α of the entity node with the file node and the degree of importance β of the file node with respect to the entity node, and specifically:

① when the importance of file nodes to entity nodes is difficult to classify or evaluate manually, β = 1 for all file nodes and the initial link weight linkValue' = α; after the initial weight of each link edge is calculated, the initial weights of the link edges connected to the same file node are normalized to obtain the link weight linkValue; α adopts the following calculation method:

α＝TF(t,d)·IDF(t,d)·α₁(t,d)

wherein t is the entity name of the entity node, d is the file to be retrieved, TF(t, d) is the frequency with which t appears in d, IDF(t, d) = log(N/(n_{t,d} + γ)), N is the number of files in the file set to be retrieved, n_{t,d} is the number of files containing entity t, γ is usually taken as 0.01 to ensure that the denominator is not zero, and α₁(t, d) takes a coefficient greater than 1 when the entity name t is in the title, abstract or keywords, and 1 otherwise;

② when entities and files can be classified and scored manually according to the domain, a β value is set according to the importance of the different file types in each domain, and the initial link weight is then linkValue' = α·β.

5. The method according to claim 1, wherein in the step (2), all the entity nodes of the domain graph are taken as the reachable states of the system, the transition probability between each state is determined by the associated edge weights between the entity nodes, the whole system forms a markov chain, and the stationary distribution of the markov chain is the authority of the entity nodes, which comprises the following steps:

step (21): designing authority degree of an entity node;

step (22): performing authority degree increment calculation on the entity nodes; based on a Monte Carlo method, the behavior of a user accessing the entity node is simulated by random walk, and when the domain map changes, the random walk process is updated in an incremental mode aiming at the affected entity node, so that the incremental calculation of the authority degree of the entity node is realized;

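step (23): authority calculation of the file nodes; the authority of a file node is equal to the sum, over its link edges, of the link weight linkValue multiplied by the authority of the linked entity node, namely authorityFile(i) = Σ_k linkValue(i, k)·authorityEntity(k), wherein authorityFile(i) is the authority of file node i, authorityEntity(k) is the authority of an entity node k having a link edge to file node i, and linkValue(i, k) is the weight of that link edge.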

6. The method of claim 5, wherein the incremental calculation of the authority of the entity nodes uses cyclic starting points: M random walks are started from each of the N entity nodes, giving N×M random walks in total; at each step, the walk directly accesses a new node with probability (1−α), and moves from entity node i to entity node j with probability α·corrValue(i, j); finally, the number of times v(i) that each entity node i is visited is counted, and v(i) divided by the total number of visits to all entity nodes gives the average visit probability of node i, i.e. the authority authorityEntity of entity node i;

the random walks of each round are first recorded before the graph structure changes; the entity nodes that change in the current round, including added and deleted entity nodes, are collected as a set X; the association edges that change, including added and deleted association edges and association edges whose weights change, are collected as a set Y; the entity nodes that have an association relation with a node in X, or that are connected by an edge in Y, are recorded as a set Z; X ∪ Z is then the set of trigger nodes at which the random walks need to be updated; the update examines the N×M random walks of the previous round, finds the first trigger node in each random walk, keeps the part of the walk before that trigger node, continues the subsequent walk according to the new domain graph, and calculates the authority of each entity node.

7. The method according to claim 1, wherein the step (3) comprises in particular the steps of:

step (31): constructing a search subgraph; the search subgraph is constructed according to the related result obtained by each search and is a subgraph of the domain map; each related result obtained by the search engine through the keyword matching mode corresponds to a certain file node, and the file nodes form a 'file node' of the search subgraph; the link edges of the file nodes in the domain graph and the linked entity nodes respectively form the link edges and the entity nodes of the search subgraph; the entity nodes in the search subgraph keep the incidence relation among the entity nodes according to the structure of the domain map to form the incidence edges of the search subgraph;

step (32): calculating the relevance of the entity node of the search subgraph; the entity node relevance is determined by the number of file nodes linked by the entity node, and the relevance of each entity node in the search subgraph is equal to the number of file nodes linked by the entity node;

step (33): calculating the relevance of the 'file node' of the search subgraph; the file node relevancy is determined by the product of each link edge weight of the file node and the relevancy of the link to the entity node; when the file node has a plurality of link edges, the product of each link is calculated and then summed.

8. The method according to claim 1, characterized in that said step (4) is in particular:

the search result ordering needs to comprehensively consider two influences of authority and relevance, so that the ranking degree of each file node is as follows:

rankValue＝Ω·authorityFile+(1-Ω)·λ·relavancyFile，

wherein λ is introduced so that authority and relevance have similar magnitudes, and Ω determines the weight of authority and relevance in the ranking of file nodes; only the files retrieved in each search are considered as file nodes; if the median of authorityFile is a and the median of relavancyFile is b, λ may be taken as a/b; manually ranked samples are constructed for m searches, the i-th search having n_i manually ranked results, and for each given Ω the n_i automatically ranked results of the i-th search are obtained; treating the manually ranked samples as the correct ranking and taking minimization of the error rate of the automatic ranking as the optimization target, the value of Ω can be obtained by equidistant sampling, in which Ω is stepped from 0 to 1 with step length Δ, or by a one-dimensional search algorithm.