CN109857856B

CN109857856B - Text retrieval sequencing determination method and system

Info

Publication number: CN109857856B
Application number: CN201910082601.4A
Authority: CN
Inventors: 郭永红
Original assignee: Beijing Hexiang Wisdom Technology Co ltd
Current assignee: Beijing Hexiang Wisdom Technology Co ltd
Priority date: 2019-01-28
Filing date: 2019-01-28
Publication date: 2020-05-22
Anticipated expiration: 2039-01-28
Also published as: CN109857856A

Abstract

The invention discloses a method and a system for determining retrieval sequencing of texts, wherein the method comprises the following steps: acquiring a target text to be retrieved and a candidate text set; acquiring an association metric value of the target text and each text in the candidate text set; sorting each text in the candidate text set according to a first preset rule by using the association metric value, and constructing a first text set according to a first preset screening condition; and sequencing each text in the first text set according to a second preset rule to obtain a retrieval sequencing result of the target text. The embodiment of the invention integrates the advantages of various algorithms, improves the accuracy of the patent retrieval result and improves the retrieval efficiency of users.

Description

Text retrieval sequencing determination method and system

Technical Field

The invention relates to the field of data processing, in particular to a method and a system for determining retrieval sequencing of texts.

Background

In the prior art, when documents (such as journal articles and patents) are searched, sorting the candidate documents with different kinds of existing similarity calculation methods (such as structural analysis, semantic analysis and keyword analysis) yields different sorting results. Even a single type of similarity calculation method can give different results: in semantic analysis, for example, the similarity calculated between a pair of original patent documents differs from the similarity calculated between their translations. Therefore, for the same target patent, different solutions rank the candidate patents by similarity in different ways, each with its own ordering rule, and the resulting rankings may differ greatly. For example, the most relevant patent that the user actually needs may be ranked within the top 10 in one solution and beyond position 1000 in another. In this case the user cannot tell which retrieval result is best, and browsing the various rankings one by one greatly reduces retrieval efficiency.

Disclosure of Invention

Therefore, the invention provides a method and a system for determining the retrieval ranking of texts, which overcome the defect in the prior art that the best retrieval result cannot be obtained because document retrieval results are ordered in different ways.

In a first aspect, an embodiment of the present invention provides a method for determining a search ranking of a text, including the following steps: acquiring a target text to be retrieved and a candidate text set; acquiring an association metric value of the target text and each text in the candidate text set; sorting each text in the candidate text set according to a first preset rule by using the association metric value, and constructing a first text set according to a first preset screening condition; and sequencing each text in the first text set according to a second preset rule to obtain a retrieval sequencing result of the target text.

In an embodiment, the step of sorting each text in the first text set according to a second preset rule to obtain a retrieval sorting result of the target text includes: sequencing each text in the first text set according to a third preset rule, eliminating noise texts according to a second preset screening condition, and constructing a second text set; and sequencing each text in the second text set according to a second preset rule to obtain a retrieval sequencing result of the target text.

In an embodiment, the step of obtaining the association metric value of the target text and each text in the candidate text set includes calculating the association metric value of the target text and each text in the candidate text set by using preset N association metric algorithms, where N is a positive integer greater than or equal to 2.

In an embodiment, the step of sorting each text in the candidate text set according to a first preset rule by using the association metric value and constructing a first text set according to a first preset screening condition includes: respectively sorting each text in the candidate text set according to the association metric values obtained by the preset N association metric algorithms to obtain N sort sets; and comprehensively sorting the N sort sets according to a first preset rule, and constructing a first text set according to a first preset screening condition. Preferably, the step of constructing the first text set according to the first preset screening condition includes: analyzing, according to a preset strategy, the association metric values of the target text and each text in the candidate text set calculated by the N association metric algorithms to obtain an analysis result; and judging, according to the analysis result, whether each text in the candidate text set meets a preset condition, and selecting the texts in the candidate text set that meet the preset condition into the first text set.

In an embodiment, the step of comprehensively sorting the N sort sets according to a first preset rule and constructing a first text set according to a first preset screening condition includes: respectively assigning weights, according to a first preset rule, to the association metric values obtained by the preset N metric algorithms; multiplying each association metric value by its corresponding weight and summing the products to obtain a comprehensive association metric value; determining a comprehensive sorting result according to the magnitude of the comprehensive association metric value; and selecting the texts whose comprehensive association metric value is larger than a first preset comprehensive association metric value threshold into the first text set.

In an embodiment, the step of sorting each text in the candidate text set according to a first preset rule by using the association metric value and constructing a first text set according to a first preset screening condition includes: respectively sorting the association metric values obtained by the preset N metric algorithms by magnitude to obtain N sort sets; and selecting into the first text set the texts in the N sort sets whose association metric value is larger than a first association metric value threshold and/or whose ranking position is smaller than a first sorting order threshold.

In an embodiment, the step of sorting each text in the first text set according to a third preset rule, excluding noise texts according to a second preset screening condition, and constructing a second text set includes: for the texts in the first text set, respectively assigning weights, according to a third preset rule, to the association metric values obtained by the preset N metric algorithms; multiplying each association metric value by its corresponding weight and summing the products to obtain a comprehensive association metric value; determining a comprehensive sorting result according to the magnitude of the comprehensive association metric value; taking the texts whose comprehensive association metric value is smaller than a second preset comprehensive association metric value threshold as noise texts; and removing the noise texts from the first text set to construct the second text set.

In an embodiment, the step of sorting each text in the first text set according to a third preset rule, excluding noise texts according to a second preset screening condition, and constructing a second text set includes: acquiring second association metric values between the texts in the first text set and the target text according to the preset N association metric algorithms; sorting according to the second association metric values respectively to obtain N sort sets; taking as noise texts the texts in the N sort sets whose association metric value is smaller than a second association metric value threshold and/or whose ranking position is larger than a second sorting order threshold; and removing the noise texts from the first text set to construct the second text set.

In an embodiment, the second preset rule of the texts in the first text set is set according to the magnitude of the associated metric value of the target text or the ranking order of the associated metric value, so as to obtain the retrieval ranking result of the target text; preferably, N kinds of association metric algorithms are used for obtaining an association metric value of each text in a preset sample and a candidate text set, the recall rate of the association metric value of the preset sample in a preset section is obtained, corresponding weights are set for the N kinds of association metric algorithms according to the recall rate in the preset section, a comprehensive ranking value of each text in the candidate text set is obtained, and a retrieval ranking result of a target text is obtained according to the comprehensive ranking value; preferably, N ranking orders of the relevance metric value of the target text are obtained according to N relevance metric algorithms, a comprehensive ranking value of each text in the candidate text set is obtained according to the N ranking orders, and a retrieval ranking result of the target text is obtained according to the comprehensive ranking value; preferably, N kinds of association metric algorithms are used for obtaining an association metric value of a preset sample and each text in a candidate text set, obtaining a ranking rank of the most relevant text corresponding to the preset sample in the candidate text set according to the association metric value, setting corresponding weights for the N kinds of association metric algorithms according to an average recall rate of the ranking ranks of the preset sample or a recall rate of the preset sample in a preset section, obtaining a comprehensive ranking value of each text in the candidate text set, and obtaining a retrieval ranking result of a target text according to the comprehensive ranking value.

In an embodiment, the second preset rule of the texts in the second text set is set according to the magnitude of the association metric value of the target text or the ranking order of the association metric value, and the ranking is performed to obtain the retrieval ranking result of the target text; preferably, N kinds of association metric algorithms are used for obtaining an association metric value of each text in a preset sample and a candidate text set, the recall rate of the association metric value of the preset sample in a preset section is obtained, corresponding weights are set for the N kinds of association metric algorithms according to the recall rate in the preset section, a comprehensive ranking value of each text in the candidate text set is obtained, and a retrieval ranking result of a target text is obtained according to the comprehensive ranking value; preferably, N ranking orders of the relevance metric value of the target text are obtained according to N relevance metric algorithms, a comprehensive ranking value of each text in the candidate text set is obtained according to the N ranking orders, and a retrieval ranking result of the target text is obtained according to the comprehensive ranking value; preferably, N kinds of association metric algorithms are used for obtaining an association metric value of a preset sample and each text in a candidate text set, obtaining a ranking rank of the most relevant text corresponding to the preset sample in the candidate text set according to the association metric value, setting corresponding weights for the N kinds of association metric algorithms according to an average recall rate of the ranking ranks of the preset sample or a recall rate of the preset sample in a preset section, obtaining a comprehensive ranking value of each text in the candidate text set, and obtaining a retrieval ranking result of a target text according to the comprehensive ranking value.

In an embodiment, the step of obtaining an association metric value between the target text and each text in the candidate text set includes: acquiring, by using one preset association metric algorithm or the preset N association metric algorithms and according to the deformed text corresponding to the target text, the association metric value between the target text and each text in the candidate text set, or between the target text and the deformed text corresponding to each text in the candidate text set.

In a second aspect, an embodiment of the present invention provides a system for determining a search ranking of a text, including: the target text and candidate text set acquisition module is used for acquiring a target text and a candidate text set to be retrieved; an association metric value obtaining module, configured to obtain an association metric value between the target text and each text in the candidate text set; the first text set building module is used for sequencing each text in the candidate text set according to a first preset rule by using the association metric value and building a first text set according to a first preset screening condition; and the retrieval ordering result acquisition module is used for ordering each text in the first text set according to a second preset rule to acquire a retrieval ordering result of the target text.

In a third aspect, an embodiment of the present invention provides a computer device, including: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to cause the at least one processor to perform the text retrieval sequencing determination method provided in the first aspect of the present invention.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause the computer to execute the method for determining the search ranking of text provided in the first aspect of the present invention.

The technical scheme of the invention has the following advantages:

The invention provides a method and a system for determining retrieval ordering of texts. A target text to be retrieved and a candidate text set are first acquired, where the target text may be a patent. An association metric value of the target text and each text in the candidate text set is then acquired, where the association may be a similarity. Each text in the candidate text set is then sorted according to a first preset rule by using the association metric value, and a first text set is constructed according to a first preset screening condition. Finally, each text in the first text set is sorted according to a second preset rule to obtain a retrieval sequencing result of the target text. In the prior art, the user cannot know the optimal retrieval result and has to browse the various arrangement modes one by one, so retrieval efficiency is low. By contrast, the method provided by the embodiment of the application integrates the advantages of various algorithms, improves the accuracy of the patent retrieval result and improves the retrieval efficiency of the user.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of a specific example of a text retrieval sequencing determination method according to an embodiment of the present invention;

Fig. 2 is a flowchart of the steps of constructing a first text set in the text retrieval sequencing determination method according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the accuracy of the three algorithms in each ranking section according to an embodiment of the present invention;

Fig. 4 is a flowchart of another specific example of a text retrieval sequencing determination method according to an embodiment of the present invention;

Fig. 5 is a flowchart of the steps of constructing a second text set according to an embodiment of the present invention;

Fig. 6 is a flowchart of the steps of another way of constructing a second text set according to an embodiment of the present invention;

Fig. 7 is a block diagram of a specific example of a text retrieval sequencing determination system according to an embodiment of the present invention;

Fig. 8 is a block diagram of a specific example of a computer device according to an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

The embodiment of the invention provides a method for determining retrieval ordering of a text, which can be applied to electronic equipment, wherein the electronic equipment can be a server or a terminal, and as shown in fig. 1, the method comprises the following steps:

Step S1: acquiring a target text to be retrieved and a candidate text set.

In practical applications, the target text to be retrieved includes, but is not limited to, technical documents, patents and academic papers. The server may receive a target patent to be retrieved that a user inputs at a user terminal, and acquire a candidate patent set from a patent database. The candidate patent set may be all patents in the database or a patent set customized in another manner according to the usage scenario; for example, it may include only Chinese patents, or it may be the subset of patents in the database that belong to one technical field, containing, say, ten thousand patents. It should be noted that the number of patents in the candidate patent set is given only as an example and is not limiting.

Step S2: acquiring the association metric value of the target text and each text in the candidate text set.

In practical application, the association metric value of the target text and each text in the candidate text set may be any metric that can represent the degree of association between them, such as similarity, novelty, importance or value. Similarity is taken as the example in the embodiment of the present invention: the similarity between the target patent and each patent in the candidate patent set can be obtained by using N similarity algorithms, where N is not less than 2. In the embodiment of the present invention, the similarity values are obtained by three similarity calculation methods, namely structure analysis, keyword analysis and semantic analysis, but the present invention is not limited thereto, and any two or more similarity calculation methods may be selected in other embodiments.
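As a rough illustration of computing association metric values with N = 2 algorithms, the following sketch uses two simple stand-in measures (token-set Jaccard overlap and the standard-library `difflib` sequence ratio). These stand-ins, the function names and the sample texts are assumptions made for illustration only; they are not the structure, keyword or semantic analysis referred to in this embodiment.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict


def jaccard_similarity(a: str, b: str) -> float:
    """Token-set overlap: |A & B| / |A | B| (stand-in for a keyword-style measure)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def sequence_similarity(a: str, b: str) -> float:
    """Character-sequence ratio from difflib (stand-in for a second measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical registry of N = 2 association metric algorithms.
ALGORITHMS: Dict[str, Callable[[str, str], float]] = {
    "token_overlap": jaccard_similarity,
    "char_sequence": sequence_similarity,
}


def association_metrics(target: str, candidates: Dict[str, str]) -> Dict[str, Dict[str, float]]:
    """Return {algorithm name: {candidate id: similarity to the target}}."""
    return {
        name: {doc_id: fn(target, text) for doc_id, text in candidates.items()}
        for name, fn in ALGORITHMS.items()
    }


# Hypothetical target and candidate texts.
target_text = "robot vacuum cleaner with side brush"
candidate_texts = {
    "P1": "robot vacuum cleaner with side brush",
    "P2": "handheld vacuum cleaner with dust cup",
}
metrics = association_metrics(target_text, candidate_texts)
```

Each inner dictionary can then be sorted on its own to give one ranking per algorithm.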

In practical application, the step of obtaining the association metric value between the target text and each text in the candidate text set includes: acquiring, by using one preset association metric algorithm or the preset N association metric algorithms and according to the deformed text corresponding to the target text, the association metric value between the target text and each text in the candidate text set, or between the target text and the deformed text corresponding to each text in the candidate text set.

The deformed text in the embodiment of the present invention is a text in another form of expression that is associated with the original text, for example: a translation of the original text into another language; an abbreviation, rewrite or summary made from the content of the original text; part of the text content contained in the original text (for a patent text, for example, the abstract, the claims, or all or part of the description may be selected); or another text corresponding to the original text (for a patent text, for example, a patent text in the same patent family as the original patent text). These are given only as examples and are not limiting.

In a specific embodiment, in the process of obtaining the association metric value between the target text and each text in the candidate text set, the association metric values may be obtained by using the preset N association metric algorithms based on either the Chinese text of the target text or its English translation. For example, the similarity between the English text of the target patent and each patent in the candidate text set may be calculated by using the preset N similarity algorithms, or the similarity between the Chinese translation of the English patent and each patent in the candidate text set may be calculated, so as to obtain different sorting modes.

In a specific embodiment, the process of obtaining the association metric value between the target text and each text in the candidate text set may be: using an association metric algorithm to obtain the similarity between the target patent and each patent in the candidate patent set, as well as the similarity between the target patent and the translations in other languages corresponding to the patents in the candidate patent set, so as to obtain different sorting modes.

In another specific embodiment, the process of obtaining the association metric value between the target text and each text in the candidate text set may also be performed on different parts of the target text by using the preset N association metric algorithms. For example, the similarity between the abstract, the claims or the description of the target patent and the abstract, the claims, or all or part of the description of each text in the candidate patent set can be calculated by using the preset N similarity algorithms to obtain different sorting modes.

Step S3: sorting each text in the candidate text set according to a first preset rule by using the association metric value, and constructing a first text set according to a first preset screening condition.

In the embodiment of the invention, the first text set is an initially selected patent set constructed by preliminarily screening the patents in the candidate patent set.

In an embodiment, as shown in fig. 2, the process of constructing the initially selected patent set in step S3 may specifically include the following steps:

Step S31: respectively sorting each text in the candidate text set according to the association metric values obtained by the preset N association metric algorithms to obtain N sort sets.

In a specific embodiment, the similarities between the target patent and each patent in the candidate patent set, as calculated by the three similarity algorithms (structure analysis, semantic analysis and keyword analysis), are each sorted from largest to smallest, giving three sorted patent sets X, Y and Z that correspond to the three similarity algorithms.
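The construction of the sorted sets X, Y and Z can be sketched as follows. This is a minimal sketch under stated assumptions: the similarity values, the candidate identifiers `P1` to `P3` and the helper names `build_sort_sets` and `rank_positions` are hypothetical and do not come from the patent.

```python
from typing import Dict, List


def build_sort_sets(scores: Dict[str, Dict[str, float]]) -> Dict[str, List[str]]:
    """For each algorithm, order candidate ids by similarity, largest first."""
    return {
        algo: sorted(per_doc, key=lambda doc: per_doc[doc], reverse=True)
        for algo, per_doc in scores.items()
    }


def rank_positions(sort_set: List[str]) -> Dict[str, int]:
    """Map each candidate id to its 1-based position in a sorted set."""
    return {doc: pos for pos, doc in enumerate(sort_set, start=1)}


# Hypothetical similarity values for three candidates under three algorithms.
scores = {
    "structure": {"P1": 0.91, "P2": 0.40, "P3": 0.75},
    "semantic": {"P1": 0.55, "P2": 0.80, "P3": 0.60},
    "keyword": {"P1": 0.30, "P2": 0.35, "P3": 0.90},
}
sort_sets = build_sort_sets(scores)
X, Y, Z = sort_sets["structure"], sort_sets["semantic"], sort_sets["keyword"]
```

The 1-based positions returned by `rank_positions` correspond to the values written Kx, Ky and Kz for a patent K in the screening variants of this embodiment.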

Step S32: comprehensively sorting the N sort sets according to a first preset rule, and constructing a first text set according to a first preset screening condition.

In one embodiment, weights are respectively assigned, according to a first preset rule, to the association metric values obtained by the preset N metric algorithms; each association metric value is multiplied by its corresponding weight and the products are summed to obtain a comprehensive association metric value; a comprehensive sorting result is determined according to the magnitude of the comprehensive association metric value; and the texts whose comprehensive association metric value is larger than a first preset comprehensive association metric value threshold are selected into the first text set.

In a specific embodiment, the effectiveness of the three similarity algorithms can be compared and verified against a known most relevant text (for example, the X document cited during examination), so as to identify the ranking sections in which each calculation method performs best and, in turn, to decide how many patents from each calculation method are included in the initially selected set. For example, 100 patents are sampled and ranked with each of the three similarity algorithms for comparison; only 3 sets of data are shown as examples in Table 1:

TABLE 1

Table 1 shows, for each sampled target patent (e.g., CN104983351A) and its most relevant patent (that is, the X document given by the examiner for that patent; for CN104983351A the most relevant patent is CN203247669U), the ranking position of the most relevant patent in the whole database as obtained by keyword analysis (algorithm 1), structural analysis (algorithm 2) and semantic analysis (algorithm 3), respectively.

For the 100 sampled patents, the number of most relevant documents of the target patents that fall in each ranking section is counted. The statistics are shown in Table 2, and the accuracy comparison curves of the three similarity algorithms based on the data in Table 2 are shown in Fig. 3. The recall rate is the ratio of the number of relevant documents retrieved to the number of all relevant documents in the document library, and measures how completely the retrieval system finds the relevant documents. Based on the data in Table 2, the recall rate of the most relevant documents of the target patents in each ranking section is shown in Table 3:

TABLE 2

TABLE 3
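The per-section recall statistics can be computed along the following lines. This is a sketch under the assumption that each sampled target patent has exactly one most relevant document, so that the recall of a ranking section is the share of sampled targets whose most relevant document is ranked inside that section. The ranks and section boundaries below are hypothetical and are not the values of Tables 2 and 3.

```python
from typing import Dict, List, Tuple

Section = Tuple[int, int]  # inclusive (low, high) rank bounds


def recall_by_section(ranks: List[int], sections: List[Section]) -> Dict[Section, float]:
    """Share of sampled targets whose most relevant document falls in each section."""
    total = len(ranks)
    return {
        (low, high): sum(1 for r in ranks if low <= r <= high) / total
        for low, high in sections
    }


# Hypothetical ranks of the most relevant document for five sampled targets
# under one similarity algorithm.
ranks_algo1 = [3, 12, 48, 7, 230]
sections = [(1, 10), (11, 50), (51, 1000)]
recall = recall_by_section(ranks_algo1, sections)
```

Running the same function on the ranks produced by each algorithm gives one recall profile per algorithm, which is the kind of data from which per-section weights could be derived.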

According to the above statistical results, if the similarities of a patent J in the candidate patent set to the target patent O calculated by the three algorithms are Rx, Ry and Rz respectively, the recall rates of the sections in which Rx, Ry and Rz are located can be obtained and used to determine the weights of those sections, W1, W2 and W3 respectively. The comprehensive similarity value J' of patent J is then calculated for reordering: J' = Rx × W1 + Ry × W2 + Rz × W3. Patents whose comprehensive similarity value is larger than the preset value are selected into the initially selected patent set.
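The weighted combination and threshold screening can be sketched as below. For simplicity the sketch assumes one fixed weight per algorithm; in the embodiment the weights depend on the section in which each value lies, and that lookup is omitted here. The similarity triples, weights, threshold and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def composite_similarity(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted sum Rx*W1 + Ry*W2 + Rz*W3, written for any number of algorithms."""
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per similarity value")
    return sum(v * w for v, w in zip(values, weights))


def preselect(
    similarities: Dict[str, Sequence[float]],
    weights: Sequence[float],
    threshold: float,
) -> List[str]:
    """Return ids whose composite value exceeds the threshold, best first."""
    composite = {
        doc: composite_similarity(vals, weights) for doc, vals in similarities.items()
    }
    kept = [doc for doc, value in composite.items() if value > threshold]
    return sorted(kept, key=lambda doc: composite[doc], reverse=True)


# Hypothetical (Rx, Ry, Rz) triples and weights (W1, W2, W3).
similarities = {
    "J1": (0.8, 0.6, 0.5),
    "J2": (0.2, 0.3, 0.9),
    "J3": (0.7, 0.7, 0.7),
}
weights = (0.5, 0.3, 0.2)
initial_set = preselect(similarities, weights, threshold=0.5)
```

With these sample numbers, J2 falls below the threshold and is left out, while J3 is ranked ahead of J1.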

In one embodiment, the association metric values obtained by the preset N metric algorithms are respectively sorted by magnitude to obtain N sort sets; and the texts in the N sort sets whose association metric value is larger than a first association metric value threshold and/or whose ranking position is smaller than a first sorting order threshold are selected into the first text set.
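A possible reading of this value-and/or-rank screening is sketched below. It assumes that a text is selected as soon as it qualifies in any one of the N sort sets; the source does not say whether any or all sets must agree, so that choice, like the thresholds, sample scores and the `require_both` switch, is an assumption.

```python
from typing import Dict, Set


def select_first_set(
    scores: Dict[str, Dict[str, float]],
    value_threshold: float,
    rank_threshold: int,
    require_both: bool = False,
) -> Set[str]:
    """Select texts by metric value and/or 1-based ranking position in any sort set."""
    selected: Set[str] = set()
    for per_doc in scores.values():
        ordered = sorted(per_doc, key=lambda doc: per_doc[doc], reverse=True)
        for position, doc in enumerate(ordered, start=1):
            by_value = per_doc[doc] > value_threshold
            by_rank = position < rank_threshold
            qualifies = (by_value and by_rank) if require_both else (by_value or by_rank)
            if qualifies:
                selected.add(doc)
    return selected


# Hypothetical similarity values under two algorithms.
sample_scores = {
    "algo_a": {"P1": 0.9, "P2": 0.5, "P3": 0.2},
    "algo_b": {"P1": 0.7, "P2": 0.6, "P3": 0.1},
}
either = select_first_set(sample_scores, value_threshold=0.8, rank_threshold=3)
both = select_first_set(sample_scores, value_threshold=0.8, rank_threshold=3, require_both=True)
```

The "or" variant is the more permissive screen; the "and" variant keeps only texts that are both highly scored and highly ranked in at least one sort set.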

In one embodiment, the screening may be performed according to the sum of ranking positions: the ranking positions Kx, Ky and Kz of patent K in the three sets X, Y and Z are obtained and summed, and if the sum is smaller than a preset threshold, the patent is selected into the initially selected patent set. For example, for a patent K in the candidate text set, if Σ(Kx, Ky, Kz) is less than 500 (this value is an empirical default and can also be set by the user), the patent is selected into the initially selected patent set.

In one embodiment, the screening may be based on the mean ranking position: the average mean(Kx, Ky, Kz) of the ranking positions of patent K in the three sets X, Y and Z is obtained, and if the average is smaller than a preset threshold, the patent is selected into the initially selected patent set. For example, for a patent K in the candidate text set, if mean(Kx, Ky, Kz) is less than 100 (this value is an empirical default and can also be set by the user), the patent is selected into the initially selected patent set.

In one embodiment, the screening may be based on the minimum ranking position: the ranking positions Kx, Ky and Kz of patent K in the three sets X, Y and Z are obtained, the smallest of them, min(Kx, Ky, Kz), is found, and if it is smaller than a preset threshold, the patent is selected into the initially selected patent set. For example, for a patent K in the candidate text set, if min(Kx, Ky, Kz) is within the top 50 (this value is an empirical default and can also be set by the user), the patent is selected into the initially selected patent set.
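The sum, mean and minimum screens differ only in how the ranking positions are aggregated, so they can share one helper. The sketch below uses a strict less-than comparison throughout; reading "within the top 50" as "at most 50" would need a threshold of 51. The sample ranks and the name `passes_rank_screen` are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence


def passes_rank_screen(
    ranks: Sequence[int],
    aggregate: Callable[[Sequence[int]], float],
    threshold: float,
) -> bool:
    """True when the aggregated ranking position is below the threshold."""
    return aggregate(ranks) < threshold


# Hypothetical ranking positions (Kx, Ky, Kz) of one patent K.
k_ranks = (40, 120, 15)

by_sum = passes_rank_screen(k_ranks, sum, 500)    # sum of positions
by_mean = passes_rank_screen(k_ranks, mean, 100)  # mean position
by_min = passes_rank_screen(k_ranks, min, 50)     # best single position
```

Passing the aggregation function as an argument keeps the thresholds (500, 100 and 50 in the examples of this embodiment) configurable by the caller.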

In one embodiment, the screening may require that the ranking positions in several sorting modes are simultaneously smaller than a preset ranking threshold: the ranking positions Kx, Ky and Kz of patent K in the three sets X, Y and Z are obtained, and if two or more of Kx, Ky and Kz fall within the top n positions, the patent is selected into the initially selected patent set. In practical application, the preset threshold n can be increased appropriately as the number of ranking positions that satisfy the condition increases.
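This "several rankings at once" screen amounts to counting how many of the ranking positions lie in the top n. A minimal sketch follows; the value of n, the required hit count and the sample ranks are hypothetical.

```python
from typing import Sequence


def count_in_top_n(ranks: Sequence[int], n: int) -> int:
    """Number of sort sets in which the patent is ranked within the top n."""
    return sum(1 for r in ranks if r <= n)


def passes_multi_rank_screen(ranks: Sequence[int], n: int, min_hits: int = 2) -> bool:
    """True when at least min_hits ranking positions are within the top n."""
    return count_in_top_n(ranks, n) >= min_hits


# Hypothetical ranking positions (Kx, Ky, Kz) of one patent K.
k_ranks = (40, 120, 15)
selected_n50 = passes_multi_rank_screen(k_ranks, n=50)
selected_n30 = passes_multi_rank_screen(k_ranks, n=30)
```

With these sample ranks, two positions are within the top 50 but only one is within the top 30, so the patent passes the first screen and fails the second.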

In an embodiment, the ranking positions Kx, Ky and Kz of patent K in the three sets X, Y and Z may be obtained and summed in pairs to give Σ(Kx, Ky), Σ(Kx, Kz) and Σ(Ky, Kz); the minimum of these three pairwise sums is taken, and if it is smaller than a preset threshold, the patent is selected into the initially selected patent set. For example, for a patent K in the candidate text set, if min(Σ(Kx, Ky), Σ(Kx, Kz), Σ(Ky, Kz)) is within 100 (this value is an empirical default and can also be set by the user), the patent is selected into the initially selected patent set.
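The minimum over pairwise rank sums can be written compactly with `itertools.combinations`. The sketch below is illustrative only; the sample ranks, the threshold and the function name are assumptions.

```python
from itertools import combinations
from typing import Sequence


def min_pairwise_rank_sum(ranks: Sequence[int]) -> int:
    """Smallest sum over all pairs of ranking positions."""
    return min(a + b for a, b in combinations(ranks, 2))


# Hypothetical ranking positions (Kx, Ky, Kz) of one patent K.
k_ranks = (40, 120, 15)
best_pair_sum = min_pairwise_rank_sum(k_ranks)
selected = best_pair_sum < 100
```

For three ranking positions this is the same as summing the two best positions, which is also what the variant that discards the largest position computes.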

In an embodiment, the ranking positions Kx, Ky and Kz of patent K in the three sets X, Y and Z are obtained, the largest of them is removed, and the remaining values are summed; if the sum is smaller than a preset threshold, the patent is selected into the initially selected patent set. For example, for a patent K in the candidate set, if Σ(Kx, Ky, Kz) - max(Kx, Ky, Kz) is less than 70 (this value is an empirical default and can also be set by the user), the patent is selected into the initially selected patent set.

In an embodiment, the ranking positions Kx, Ky and Kz of patent K in the three sets X, Y and Z are obtained; after the largest of them, max(Kx, Ky, Kz), is removed, the average of the ranking positions in the other two sorting modes is calculated, and if the average is smaller than a preset threshold, the patent is selected into the initially selected patent set. For example, for a patent K in the candidate set, if (Σ(Kx, Ky, Kz) - max(Kx, Ky, Kz)) / 2 is less than 70 (this value is an empirical default and can also be set by the user), the patent is selected into the initially selected patent set.
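Both variants that discard the worst ranking position can be sketched as follows. The sample ranks and helper names are hypothetical, and the threshold of 70 is taken from the examples only as an illustrative default.

```python
from typing import Sequence


def trimmed_rank_sum(ranks: Sequence[int]) -> int:
    """Sum of ranking positions after removing the single largest one."""
    return sum(ranks) - max(ranks)


def trimmed_rank_mean(ranks: Sequence[int]) -> float:
    """Average of the ranking positions that remain after removing the largest."""
    return trimmed_rank_sum(ranks) / (len(ranks) - 1)


# Hypothetical ranking positions (Kx, Ky, Kz) of one patent K.
k_ranks = (40, 120, 15)
selected_by_sum = trimmed_rank_sum(k_ranks) < 70
selected_by_mean = trimmed_rank_mean(k_ranks) < 70
```

Dropping the largest position makes the screen tolerant of one algorithm ranking an otherwise well-placed patent very low.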

In a specific embodiment, the association metric values of the target text and each text in the candidate text set are calculated respectively by the preset N association metric algorithms and analyzed according to a preset strategy to obtain an analysis result; whether each text in the candidate text set meets a preset condition is judged according to the analysis result, and the texts meeting the preset condition are selected into the first text set. For example, screening can be performed by difference analysis: the ranks Kx, Ky and Kz of patent K in the three sets X, Y, Z are obtained respectively, and the maximum rank max(Kx, Ky, Kz) and the minimum rank min(Kx, Ky, Kz) are used to calculate a rank contrast coefficient, which can be calculated by any of the following four alternatives:

alternative 1: C1 = (max(Kx, Ky, Kz) - min(Kx, Ky, Kz)) / max(Kx, Ky, Kz);

alternative 2: C2 = (max(Kx, Ky, Kz) - min(Kx, Ky, Kz)) / min(Kx, Ky, Kz);

alternative 3: C3 = max(Kx, Ky, Kz) / min(Kx, Ky, Kz);

alternative 4: C4 = min(Kx, Ky, Kz) / max(Kx, Ky, Kz).

For the rank contrast coefficient obtained by any of the above four alternatives (given only by way of example and not limitation), whether the patent is a high-disparity patent (one whose ranks under two different sorting manners differ greatly) is determined according to a preset threshold; if so, whether the patent is imported into the initially selected patent set is determined according to a preset strategy. The preset strategy may be a patent selection scheme derived from big data statistical analysis results and practical experience. For example, according to such results and experience, when the rank Kx of patent K in set X is far smaller than its rank Ky in set Y, patent K is selected into the initially selected patent set if it satisfies condition 1 (e.g., its technology belongs to technical field F1), and is not selected if it satisfies condition 2 (e.g., its technology belongs to technical field F2).
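The four alternative coefficients can be computed together, as in the sketch below. The flagging rule (C3 at or above a ratio of 5) is an assumed threshold chosen only to illustrate how a patent whose ranks differ greatly between sorting manners might be detected; the text leaves the threshold and the follow-up selection strategy open.

```python
def contrast_coefficients(ranks):
    """Return the four alternative rank contrast coefficients for one patent."""
    hi, lo = max(ranks), min(ranks)
    return {
        "C1": (hi - lo) / hi,
        "C2": (hi - lo) / lo,
        "C3": hi / lo,
        "C4": lo / hi,
    }


def has_large_rank_gap(ranks, ratio_threshold=5.0):
    """Assumed rule: flag the patent when max rank / min rank >= threshold."""
    return contrast_coefficients(ranks)["C3"] >= ratio_threshold


# Hypothetical ranks (Kx, Ky, Kz).
coeffs = contrast_coefficients((5, 50, 20))
print(coeffs)                            # C1=0.9, C2=9.0, C3=10.0, C4=0.1
print(has_large_rank_gap((5, 50, 20)))   # True
print(has_large_rank_gap((10, 12, 11)))  # False
```

Ranks start at 1, so the divisions by the minimum rank are well defined.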

In a specific embodiment, preset minimum correlation thresholds Rtx, Rty, Rtz are given for the respective correlation calculation methods, and only patents whose correlation is higher than the minimum threshold can be selected into the initially selected patent set.

In a specific embodiment, a comprehensive threshold Rt1 is preset, and for a patent K, the similarities Rx, Ry, Rz of patent K relative to a target patent O are obtained by the three correlation calculation methods respectively; the maximum similarity max(Rx, Ry, Rz) is compared with the comprehensive threshold Rt1, and if max(Rx, Ry, Rz) is greater than Rt1, patent K is selected into the initially selected patent set.

In a specific embodiment, a comprehensive threshold Rt2 is preset, and for a patent K, the similarities Rx, Ry, Rz of patent K relative to a target patent O are obtained by the three correlation calculation methods respectively; the mean similarity mean(Rx, Ry, Rz) is compared with the comprehensive threshold Rt2, and if mean(Rx, Ry, Rz) is greater than Rt2, patent K of the candidate text set is selected into the initially selected patent set.

In one embodiment, minimum correlation thresholds Rtx, Rty, Rtz are set for the respective similarity algorithms, and if two or more of the similarities Rx, Ry, Rz of a patent K are greater than their preset thresholds, patent K is imported into the initially selected patent set.
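The similarity-threshold embodiments above can be compared side by side. The sketch below is illustrative only: the function names and threshold values are assumptions, and the per-algorithm rule is read here as requiring every similarity to exceed its own minimum, which the text does not state explicitly.

```python
from statistics import mean


def passes_every_threshold(sims, thresholds):
    """Assumed reading: every similarity must exceed its own minimum."""
    return all(s > t for s, t in zip(sims, thresholds))


def passes_max_rule(sims, rt1):
    """max(Rx, Ry, Rz) must exceed the comprehensive threshold Rt1."""
    return max(sims) > rt1


def passes_mean_rule(sims, rt2):
    """mean(Rx, Ry, Rz) must exceed the comprehensive threshold Rt2."""
    return mean(sims) > rt2


def passes_at_least(sims, thresholds, k=2):
    """At least k similarities must exceed their own minimum."""
    return sum(s > t for s, t in zip(sims, thresholds)) >= k


sims = (0.90, 0.85, 0.96)          # Rx, Ry, Rz of a patent K
thresholds = (0.88, 0.88, 0.88)    # hypothetical Rtx, Rty, Rtz

print(passes_every_threshold(sims, thresholds))  # False (Ry is too low)
print(passes_at_least(sims, thresholds))         # True
print(passes_max_rule(sims, rt1=0.95))           # True
print(passes_mean_rule(sims, rt2=0.90))          # True
```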

In other embodiments, the patents in the candidate patent set can be selected by two or more selection methods to construct the initially selected patent set as long as the two or more selection methods are not mutually conflicting.

Step S4: sorting each text in the first text set according to a second preset rule to obtain a retrieval ranking result of the target text. In the embodiment of the present invention, the second preset rule is set according to the magnitude of the association metric value with the target text or the rank of the association metric value.

In an embodiment, as shown in fig. 4, the step S4 may specifically include the following steps:

Step S41: sorting each text in the first text set according to a third preset rule, eliminating noise texts according to a second preset screening condition, and constructing a second text set.

In the embodiment of the invention, the second text set is the similar patent set obtained after the initially selected patent set is further screened and denoised.

In an embodiment, as shown in fig. 5, the process of constructing a set of similar patents may specifically include the following steps:

Step S411: assigning weights, according to a third preset rule, to the association metric values obtained for the texts in the first text set by the preset N metric algorithms.

In practical applications, the preset N metric algorithms are used to obtain, for each patent in the initially selected patent set, the novelty, similarity and the like relative to the target. The third preset rule may follow the first preset rule used to construct the first text set, with the preset values adjusted adaptively, or other preset rules may be adopted, for example rules set manually according to experience; the invention is not limited thereto.

Step S412: multiplying each association metric value by its corresponding weight and adding the products to obtain a comprehensive association metric value.

In the embodiment of the present invention, the weight corresponding to each association metric value may be determined section by section, according to the recall rate of each analysis algorithm in that section, as shown in fig. 3.

Step S413: determining a comprehensive sorting result according to the magnitude of the comprehensive association metric value.

Step S414: taking texts whose comprehensive association metric value is smaller than the second preset comprehensive association metric value threshold as noise texts.

In the embodiment of the present invention, a patent whose comprehensive association metric value is smaller than a preset value, or a patent whose rank is greater than a preset value, may be treated as a noise patent; this is only an example and not a limitation.

Step S415: removing the noise texts from the first text set to construct the second text set.

According to the embodiment of the invention, the similar patent set is constructed by removing the noise patents from the constructed initially selected patent set.
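Steps S411 to S415 amount to a weighted sum followed by a cut-off. A minimal sketch under assumed inputs is given below; the weights, the threshold of 0.6, and the document labels are hypothetical, and the function name is not from the source.

```python
def remove_noise_by_weighted_score(metrics, weights, threshold):
    """Score each text by a weighted sum of its N metric values (S411-S412),
    sort by score (S413), and drop texts scoring below the threshold
    (S414-S415). Returns the remaining texts, best first."""
    scores = {
        doc: sum(w * v for w, v in zip(weights, values))
        for doc, values in metrics.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc for doc in ranked if scores[doc] >= threshold]


# Hypothetical similarity values from three algorithms for the first text set.
first_set_metrics = {
    "A": (0.90, 0.85, 0.96),  # weighted score 0.915
    "B": (0.40, 0.50, 0.30),  # weighted score 0.380 -> noise
    "C": (0.70, 0.80, 0.75),  # weighted score 0.755
}

second_set = remove_noise_by_weighted_score(
    first_set_metrics, weights=(0.2, 0.3, 0.5), threshold=0.6
)
print(second_set)  # ['A', 'C']
```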

In another embodiment, as shown in fig. 6, the process of constructing a similar patent set may specifically include the following steps:

Step S416: acquiring second association metric values between the texts in the first text set and the target text according to the N preset association metric algorithms.

Step S417: sorting by the magnitude of the second association metric values respectively to obtain N sorted sets.

Step S418: taking as noise texts those texts whose association metric value in the N sorted sets is smaller than a second association metric value threshold and/or whose rank is greater than a second rank threshold.

Step S419: removing the noise texts from the first text set to construct the second text set.

In the embodiment of the present invention, the similar patent set may be constructed by setting appropriate thresholds to remove noise patents, in the same manner as the similarity thresholds and/or the rankings obtained by each similarity algorithm were used in constructing the initially selected patent set; details are not repeated here.
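Steps S416 to S419 can be sketched as below. The text does not say how the N sorted sets are combined, so this example assumes a text is noise only if it fails the value-or-rank test in every one of the N sorted sets; other combinations (for example, failing in any one set) are equally consistent with the description. All names and numbers are hypothetical.

```python
def find_noise_texts(metrics, value_threshold, rank_threshold):
    """Return the texts that, under every algorithm, have a metric value
    below `value_threshold` or a rank worse than `rank_threshold`."""
    n_algorithms = len(next(iter(metrics.values())))
    failures = {doc: 0 for doc in metrics}
    for i in range(n_algorithms):
        # S417: one sorted set per algorithm, highest metric value first.
        ordered = sorted(metrics, key=lambda d: metrics[d][i], reverse=True)
        for rank, doc in enumerate(ordered, start=1):
            # S418: test the value and/or the rank against the thresholds.
            if metrics[doc][i] < value_threshold or rank > rank_threshold:
                failures[doc] += 1
    return {doc for doc, count in failures.items() if count == n_algorithms}


first_set_metrics = {
    "A": (0.90, 0.85, 0.96),
    "B": (0.40, 0.50, 0.30),
    "C": (0.70, 0.80, 0.75),
    "D": (0.65, 0.30, 0.20),
}

noise = find_noise_texts(first_set_metrics, value_threshold=0.6, rank_threshold=3)
# S419: remove the noise texts to build the second text set.
second_set = [doc for doc in first_set_metrics if doc not in noise]
print(noise)       # {'B'}
print(second_set)  # ['A', 'C', 'D']
```

Text D survives because one algorithm still ranks it third with a value above the threshold.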

Step S42: sorting each text in the second text set according to a second preset rule to obtain a retrieval ranking result of the target text.

In the embodiment of the present invention, the second preset rule is set according to the magnitude of the association metric value with the target text or the rank of the association metric value. In one embodiment, weights are distributed evenly (the three algorithms have the same weight), that is, Wx = Wy = Wz = 1/3. For example, if the similarities of a patent J relative to the target patent O obtained by the three algorithms are Rx = 90%, Ry = 85% and Rz = 96% respectively, the simple weighted average similarity of candidate patent J relative to the target patent is R = 90% × 1/3 + 85% × 1/3 + 96% × 1/3 ≈ 90.3%, and the retrieval ranking result of the target patent is obtained according to the weighted average similarity of each patent.

In one implementation, a weight may be assigned to each algorithm manually based on experience, e.g., Wx = 20%, Wy = 30% and Wz = 50%, and the retrieval ranking result of the target patent is obtained according to the weighted value of each patent.
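The even and manual weightings above are the same formula with different weights. A short sketch using the figures from the text (the function name is an assumption):

```python
def weighted_similarity(sims, weights):
    """R = Rx*Wx + Ry*Wy + Rz*Wz for one candidate patent."""
    return sum(s * w for s, w in zip(sims, weights))


sims = (0.90, 0.85, 0.96)  # Rx, Ry, Rz of patent J relative to target O

even = weighted_similarity(sims, (1 / 3, 1 / 3, 1 / 3))
manual = weighted_similarity(sims, (0.20, 0.30, 0.50))

print(round(even, 3))    # 0.903
print(round(manual, 3))  # 0.915
```

Sorting the candidates by this value in descending order yields the retrieval ranking result.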

In a specific embodiment, the N association metric algorithms are used to obtain association metric values between preset samples and each text in the candidate text set, the recall rate of the preset samples' association metric values in each preset section is obtained, corresponding weights are set for the N association metric algorithms according to the recall rate in the preset section, a comprehensive ranking value of each text in the candidate text set is obtained, and the retrieval ranking result of the target text is obtained according to the comprehensive ranking value. For example: the relevancy of each calculation method is divided into a plurality of sections, and the recall rate in each section is calculated from the number of X documents recalled in that section and the total number of patents in that relevancy section. The relevancy may, for instance, be divided into the following 6 sections:

Statistics for algorithm 1 (each value is the number of recalled X documents / total number):

greater than 95%: Z11 = 5%

95%-90%: Z12 = 10%

90%-80%: Z13 = 11%

80%-70%: Z14 = 13%

70%-60%: Z15 = 19%

below 60%: Z16 = 42%

Statistics for algorithm 2 (each value is the number of recalled X documents / total number):

greater than 95%: Z21 = 3%

95%-90%: Z22 = 12%

90%-80%: Z23 = 17%

80%-70%: Z24 = 15%

70%-60%: Z25 = 23%

below 60%: Z26 = 30%

Statistics for algorithm 3 (each value is the number of recalled X documents / total number):

greater than 95%: Z31 = 7%

95%-90%: Z32 = 9%

90%-80%: Z33 = 18%

80%-70%: Z34 = 19%

70%-60%: Z35 = 15%

below 60%: Z36 = 32%

The weight assignment scheme is determined according to the above statistical data. For example: for a patent J, if the similarities relative to the target patent O calculated by the three algorithms are Rx, Ry, Rz, the sections in which Rx, Ry, Rz fall can be determined, and the corresponding weight ratios, denoted W1, W2, W3, are found from the above statistical results. The similarities are weighted and the patents re-sorted, the comprehensive similarity value J' being J' = Rx × W1 + Ry × W2 + Rz × W3; the retrieval ranking result of the target patent is obtained according to the comprehensive similarity value of each patent.
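A sketch of the section-based weighting follows, using the recall statistics listed above. Two points are assumptions, since the text does not fix them: the weights W1, W2, W3 are taken as the recall rates of the matched sections normalized to sum to 1, and a similarity lying exactly on a boundary is placed in the higher section.

```python
import bisect

# Lower bounds of the sections: <60%, 60-70%, 70-80%, 80-90%, 90-95%, >95%.
SECTION_BOUNDS = [0.60, 0.70, 0.80, 0.90, 0.95]

# Recall rates (%) per section, lowest section first, taken from the text.
RECALL = [
    [42, 19, 13, 11, 10, 5],  # algorithm 1: Z16 ... Z11
    [30, 23, 15, 17, 12, 3],  # algorithm 2: Z26 ... Z21
    [32, 15, 19, 18, 9, 7],   # algorithm 3: Z36 ... Z31
]


def section_index(similarity):
    """Index of the section containing the similarity (0 = below 60%)."""
    return bisect.bisect_right(SECTION_BOUNDS, similarity)


def comprehensive_similarity(sims):
    """J' = Rx*W1 + Ry*W2 + Rz*W3 with weights from the matched sections."""
    raw = [RECALL[i][section_index(s)] for i, s in enumerate(sims)]
    total = sum(raw)
    weights = [r / total for r in raw]  # assumed normalization
    return sum(s * w for s, w in zip(sims, weights)), weights


j_prime, weights = comprehensive_similarity((0.90, 0.85, 0.96))
print([round(w, 3) for w in weights])  # [0.294, 0.5, 0.206]
print(round(j_prime, 4))               # 0.8874
```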

In one embodiment, the comprehensive similarity takes the highest of the similarities obtained by the three algorithms, namely max(Rx, Ry, Rz). For example, if the similarities of a patent relative to the target patent obtained by the three algorithms are Rx = 90%, Ry = 85% and Rz = 96%, the similarity directly assigned to the patent relative to the target patent is R = 96%.

In an embodiment, interval sorting may be used. For example, the similar patent set may be sorted in three sorting manners to obtain three sorted sets X, Y, Z, and the final ranking interleaves them in the order X1, Y1, Z1, X2, Y2, Z2, X3, Y3, Z3, and so on. If a patent appears at X2, Y6 and Z53 at the same time, it is placed at the position of X2 first and is skipped when position Y6 is reached, the subsequent patent Y7 being selected instead (if Y7 has also already been placed, the next one is taken in turn); Z53 is handled similarly.
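The interval sorting described above is a round-robin merge that skips patents already placed. A minimal sketch, with a hypothetical function name and toy rankings:

```python
def interleave_rankings(*rankings):
    """Merge several ranked lists in round-robin order. In each round every
    list contributes its next patent that has not been placed yet."""
    result, placed = [], set()
    iterators = [iter(r) for r in rankings]
    progressed = True
    while progressed:
        progressed = False
        for it in iterators:
            for doc in it:  # advance past patents that were already placed
                if doc not in placed:
                    placed.add(doc)
                    result.append(doc)
                    progressed = True
                    break
    return result


# Hypothetical sorted sets X, Y, Z over the same four patents.
x = ["a", "b", "c"]
y = ["b", "a", "d"]
z = ["c", "d", "a"]
print(interleave_rankings(x, y, z))  # ['a', 'b', 'c', 'd']
```

In the second round, list Y skips "a" (already placed from X1) and contributes "d", which mirrors the Y6-to-Y7 skip in the text.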

In an embodiment, N ranks of each text are obtained from its association metric values with the target text under the N association metric algorithms, a comprehensive ranking value of each text in the candidate text set is obtained from the N ranks, and the retrieval ranking result of the target text is obtained according to the comprehensive ranking value. For example: the user can sort the similar patent set in three sorting manners to obtain three sorted sets X, Y, Z. If the ranks of a patent C in the three sets are Cx, Cy and Cz, a combined value can be calculated from the three and the patents re-sorted; for example, the comprehensive ranking value may be set as C' = Cx + Cy + Cz, and the final ranking follows the size of C'. If several patents have equal values, they can be ordered by a preset rule: for example, the minimum of Cx, Cy, Cz for each tied C' can be compared and the patent with the smallest min(Cx, Cy, Cz) placed first, or the maximum can be compared and the patent with the smallest max(Cx, Cy, Cz) placed first.
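The rank-sum ordering with its two tie-breaking options can be expressed as a composite sort key. The sketch below uses hypothetical patents; `tie_break` is an assumed parameter name.

```python
def order_by_rank_sum(ranks_by_patent, tie_break=min):
    """Sort patents by C' = Cx + Cy + Cz, smallest first. Ties are broken by
    the smaller min(Cx, Cy, Cz) or, with tie_break=max, the smaller max."""
    return sorted(
        ranks_by_patent,
        key=lambda p: (sum(ranks_by_patent[p]), tie_break(ranks_by_patent[p])),
    )


ranks_by_patent = {
    "C1": (1, 5, 6),  # C' = 12, min 1, max 6
    "C2": (4, 4, 4),  # C' = 12, min 4, max 4
    "C3": (2, 3, 1),  # C' = 6
}

print(order_by_rank_sum(ranks_by_patent, tie_break=min))  # ['C3', 'C1', 'C2']
print(order_by_rank_sum(ranks_by_patent, tie_break=max))  # ['C3', 'C2', 'C1']
```

The min rule favors a patent that at least one algorithm ranks very highly; the max rule favors a patent that no algorithm ranks poorly.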

In an embodiment, the step of ranking according to the second preset rule to obtain the retrieval ranking result of the target text includes: acquiring association metric values between preset samples and each text in the candidate text set by using the N association metric algorithms; acquiring, according to the association metric values, the rank in the candidate text set of the most relevant text corresponding to each preset sample; setting corresponding weights for the N association metric algorithms according to the average recall rate at the rank of the preset samples or the recall rate of the preset samples in a preset section; acquiring a comprehensive ranking value of each text in the candidate text set; and acquiring the retrieval ranking result of the target text according to the comprehensive ranking value. Specifically:

Weights are assigned according to the distribution of the ranking results. A batch of patent samples (for example, 100 patents having X documents) is taken from a preset patent set, and the most relevant document of each patent is found (for example, using the X document information provided in patent examination information, the X-category reference document may be defined as the closest reference document of the patent). A mapping relation between the most relevant documents and the candidate patents is established, and the similarity between each sample patent and its closest reference patent is calculated by the different similarity calculation methods. For each similarity calculation method, the rank, within the whole candidate patent set, of the correlation of each sample patent's X document relative to the target patent is calculated (if one sample patent corresponds to several X documents, the highest-ranked X document is used). In this way the ranks of the X document Pi corresponding to each sample patent under the three algorithms are obtained: Pix, Piy and Piz (i = 1 to 100). The data are analyzed to obtain the rank distribution of each algorithm, and the accuracy trends of the three algorithms in each section are compared, for example as shown in fig. 3, so that the advantageous sections of each algorithm are identified from the rank distribution.
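The rank-distribution statistic can be illustrated with a small sketch. The section boundaries, the ten sample ranks per algorithm, and all names below are invented for illustration and do not reproduce the data behind fig. 3.

```python
def rank_distribution(ranks, sections):
    """Fraction of sample patents whose closest reference document is ranked
    inside each (low, high) section by one algorithm."""
    counts = [0] * len(sections)
    for rank in ranks:
        for i, (low, high) in enumerate(sections):
            if low <= rank <= high:
                counts[i] += 1
                break
    return [count / len(ranks) for count in counts]


SECTIONS = [(1, 10), (11, 100), (101, 1000)]  # assumed section boundaries

# Hypothetical ranks Pi of the X document for ten sample patents.
sample_ranks = {
    "algorithm_1": [3, 8, 45, 150, 7, 600, 12, 2, 90, 400],
    "algorithm_3": [15, 30, 45, 9, 70, 600, 12, 22, 90, 55],
}

distribution = {
    name: rank_distribution(ranks, SECTIONS)
    for name, ranks in sample_ranks.items()
}
print(distribution["algorithm_1"])  # [0.4, 0.3, 0.3]
print(distribution["algorithm_3"])  # [0.1, 0.8, 0.1]
```

Comparing the rows section by section shows where each algorithm recalls the closest document most often, which is the information the weighting methods rely on.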

According to the statistical results shown in fig. 3, the recall rates of algorithm 1 and algorithm 2 are higher in the front section (top 10 positions) and the rear section (positions 101-1000), with no obvious advantage over the other calculation methods in positions 10-100; algorithm 3 exhibits the opposite trend. Based on these statistics, the weights in the comprehensive ranking formula can be assigned and adjusted accordingly; according to the statistical patent recall rates, the weights can be assigned by either of the following two methods:

Method 1: the average recall rate of each algorithm at each rank position is counted, and weights are assigned to the different algorithms according to the recall rate. For example, suppose the statistics show that the proportions in which the three algorithms rank the closest reference document at position 6 are 1.5%, 2.3% and 0.6% respectively; the relative proportions of the three algorithms for position 6 are then calculated as follows:

algorithm 1: 1.5/(1.5+2.3+0.6) × 100% ≈ 34%,

algorithm 2: 2.3/(1.5+2.3+0.6) × 100% ≈ 52%,

algorithm 3: 0.6/(1.5+2.3+0.6) × 100% ≈ 14%;

Then, for a rank at position 6, the weights 34%, 52% and 14% are given. For a patent C whose ranks in the three sets are Cx, Cy and Cz respectively, the corresponding weight ratios are found by the above method and denoted W1, W2 and W3; the ranks are weighted and the patents re-sorted, the comprehensive ranking value being set as C' = Cx × W1 + Cy × W2 + Cz × W3.
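The arithmetic of Method 1 is a normalization of the per-position recall rates followed by a weighted sum of ranks. A minimal sketch with assumed function names, using the figures from the text:

```python
def relative_weights(recall_rates):
    """Normalize per-algorithm recall rates so that the weights sum to 1."""
    total = sum(recall_rates)
    return [rate / total for rate in recall_rates]


def comprehensive_rank(ranks, weights):
    """C' = Cx*W1 + Cy*W2 + Cz*W3."""
    return sum(rank * w for rank, w in zip(ranks, weights))


# Recall rates (%) of the three algorithms at rank position 6.
weights = relative_weights([1.5, 2.3, 0.6])
print([round(w * 100) for w in weights])  # [34, 52, 14]

# A patent ranked 6th by all three algorithms keeps a combined value of 6.
print(round(comprehensive_rank((6, 6, 6), weights), 6))  # 6.0
```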

Method 2: the hit positions of the retrieval results are divided into several sections, the recall rate of each algorithm in each section is counted, and weights are assigned to the different algorithms according to the recall rate. For example, suppose the proportions in which the three algorithms rank the closest patent at positions 6 to 10 are 5%, 3% and 11% respectively; the relative proportions of the three algorithms for positions 6 to 10 are then calculated as follows:

algorithm 1: 5/(5+3+11) × 100% ≈ 26%,

algorithm 2: 3/(5+3+11) × 100% ≈ 16%,

algorithm 3: 11/(5+3+11) × 100% ≈ 58%;

Then, for a rank within positions 6 to 10, the weights 26%, 16% and 58% are given. For a patent C whose ranks in the three sets are Cx, Cy and Cz respectively, the sections in which Cx, Cy, Cz fall can be determined, and the corresponding weight ratios are found by the above method and denoted W1, W2 and W3; the ranks are weighted and the patents re-sorted, the comprehensive ranking value being C' = Cx × W1 + Cy × W2 + Cz × W3.
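Method 2 adds a section lookup before the normalization. The sketch below contains only the 6-10 section given in the text; other sections would be filled in from real statistics. It assumes that algorithm i contributes the recall rate of the section containing its own rank and that the three values are then normalized, which is one plausible reading of the description.

```python
# Recall rates (%) of algorithms 1-3 per rank section. Only the 6-10 section
# comes from the text; further sections would come from real statistics.
RECALL_BY_SECTION = {
    (6, 10): (5.0, 3.0, 11.0),
}


def section_recall(rank, algorithm_index):
    """Recall rate of one algorithm in the section that contains the rank."""
    for (low, high), rates in RECALL_BY_SECTION.items():
        if low <= rank <= high:
            return rates[algorithm_index]
    raise KeyError(f"no statistics for rank {rank}")


def comprehensive_rank(ranks):
    """C' = Cx*W1 + Cy*W2 + Cz*W3 with section-based, normalized weights."""
    raw = [section_recall(rank, i) for i, rank in enumerate(ranks)]
    total = sum(raw)
    weights = [value / total for value in raw]
    return sum(rank * w for rank, w in zip(ranks, weights)), weights


# Hypothetical patent C ranked 7th, 9th and 6th by the three algorithms.
c_prime, weights = comprehensive_rank((7, 9, 6))
print([round(w * 100) for w in weights])  # [26, 16, 58]
print(round(c_prime, 3))                  # 6.737
```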

The above embodiments are merely exemplary, not restrictive, and other variations and modifications may be made on the basis of the above descriptions in practical applications.

The retrieval ordering determination method provided by the embodiment of the invention comprises the steps of firstly obtaining a target text to be retrieved and a candidate text set, wherein the target text can be a patent; further acquiring an association metric of the target text and each text in the candidate text set, where the association may be a similarity; then, sorting each text in the candidate text set according to a first preset rule by using the association metric value, and constructing a first text set according to a first preset screening condition; and finally, sequencing each text in the first text set according to a second preset rule to obtain a retrieval sequencing result of the target text. The method provided by the embodiment of the application integrates the advantages of multiple algorithms, improves the accuracy of the patent retrieval result, and improves the retrieval efficiency of a user.

Example 2

An embodiment of the present invention provides a system for determining search ranking of a text, as shown in fig. 7, the system includes:

a target text and candidate text set obtaining module 1, configured to obtain a target text to be retrieved and a candidate text set. This module executes the method described in step S1 in embodiment 1, and is not described herein again.

an association metric value acquisition module 2, configured to acquire an association metric value between the target text and each text in the candidate text set. This module executes the method described in step S2 in embodiment 1, and is not described herein again.

The first text set building module 3 is configured to rank each text in the candidate text set according to a first preset rule by using the association metric value, and build a first text set according to a first preset screening condition; this module executes the method described in step S3 in embodiment 1, and is not described herein again.

a retrieval ordering result obtaining module 4, configured to sort each text in the first text set according to a second preset rule and obtain a retrieval ranking result of the target text. This module executes the method described in step S4 in embodiment 1, and is not described herein again.
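The four modules can be pictured as four methods of one class. The sketch below is a toy arrangement under stated assumptions: the class and method names are invented, the two metrics (word and character Jaccard overlap) are simple stand-ins for real similarity algorithms, the first screening condition is a threshold on the maximum similarity, and the second rule is a plain average.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Metric = Callable[[str, str], float]


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


def word_overlap(x: str, y: str) -> float:  # toy metric 1
    return jaccard(set(x.split()), set(y.split()))


def char_overlap(x: str, y: str) -> float:  # toy metric 2
    return jaccard(set(x), set(y))


@dataclass
class RankingSystem:
    metrics: Sequence[Metric]
    first_threshold: float

    def acquire(self, target, candidates):   # module 1 (step S1)
        return target, list(candidates)

    def measure(self, target, candidates):   # module 2 (step S2)
        return {c: tuple(m(target, c) for m in self.metrics) for c in candidates}

    def build_first_set(self, scores):       # module 3 (step S3)
        return [c for c, s in scores.items() if max(s) > self.first_threshold]

    def rank(self, first_set, scores):       # module 4 (step S4)
        return sorted(first_set,
                      key=lambda c: sum(scores[c]) / len(scores[c]),
                      reverse=True)

    def run(self, target, candidates):
        target, candidates = self.acquire(target, candidates)
        scores = self.measure(target, candidates)
        return self.rank(self.build_first_set(scores), scores)


system = RankingSystem(metrics=[word_overlap, char_overlap], first_threshold=0.6)
result = system.run(
    "text retrieval ranking method",
    ["image compression codec", "text ranking method", "text retrieval ranking method"],
)
print(result)  # ['text retrieval ranking method', 'text ranking method']
```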

The system for determining the retrieval sequence of the texts, provided by the embodiment of the invention, comprises the steps of firstly obtaining a target text to be retrieved and a candidate text set, wherein the target text can be a patent; further acquiring an association metric of the target text and each text in the candidate text set, where the association may be a similarity; then, sorting each text in the candidate text set according to a first preset rule by using the association metric value, and constructing a first text set according to a first preset screening condition; and finally, sequencing each text in the first text set according to a second preset rule to obtain a retrieval sequencing result of the target text. The system provided by the embodiment of the application integrates the advantages of various algorithms, improves the accuracy of patent retrieval results, and improves the retrieval efficiency of users.

Example 3

An embodiment of the present invention provides a computer device, as shown in fig. 8, including: at least one processor 401, such as a CPU (Central Processing Unit), at least one communication interface 403, a memory 404, and at least one communication bus 402. The communication bus 402 is used to enable connection and communication between these components. The communication interface 403 may include a display and a keyboard; optionally, the communication interface 403 may also include a standard wired interface and a standard wireless interface. The memory 404 may be a RAM (Random Access Memory) or a non-volatile memory, such as at least one disk memory. Optionally, the memory 404 may be at least one storage device located remotely from the processor 401. The processor 401 may execute the method for determining retrieval order of text described in fig. 1; the memory 404 stores a set of program code, and the processor 401 calls the program code stored in the memory 404 to execute the method for determining retrieval order of text in embodiment 1.

The communication bus 402 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The communication bus 402 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one line is shown in FIG. 8, but this does not represent only one bus or one type of bus.

The memory 404 may include a volatile memory, such as a random-access memory (RAM); the memory may also include a non-volatile memory, such as a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 404 may also comprise a combination of the above kinds of memory.

The processor 401 may be a Central Processing Unit (CPU), a Network Processor (NP), or a combination of a CPU and an NP.

The processor 401 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The aforementioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.

Optionally, the memory 404 is also used to store program instructions. Processor 401 may invoke program instructions to implement a method for determining a search ranking for text as provided in embodiment 1 of the present application.

The embodiment of the present invention further provides a computer-readable storage medium, where a computer-executable instruction is stored on the computer-readable storage medium, and the computer-executable instruction can execute the method for determining the retrieval sequence of the text in embodiment 1. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard disk (Hard disk Drive, abbreviated as HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kind described above.

It should be understood that the above examples are given only for clarity of illustration and do not limit the embodiments. Other variations and modifications in different forms will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Obvious variations or modifications may be made without departing from the spirit or scope of the invention.

Claims

1. A method for determining the search ranking of a text, comprising the steps of:

acquiring a target text to be retrieved and a candidate text set;

respectively calculating an association metric value of the target text and each text in the candidate text set by utilizing preset N association metric algorithms, wherein the association metric value represents a metric value of the degree of association between the target text and each text in the candidate text set, and N is a positive integer greater than or equal to 2;

sorting each text in the candidate text set according to a first preset rule by using the association metric value, and constructing a first text set according to a first preset screening condition;

sequencing each text in the first text set according to a second preset rule to obtain a retrieval sequencing result of a target text, wherein the second preset rule is set according to a sequencing order of an association metric value of the target text to obtain the retrieval sequencing result of the target text, and the method comprises the following steps: acquiring an association metric value of a preset sample and each text in a candidate text set by using N association metric algorithms, acquiring a ranking order of the most relevant text corresponding to the preset sample in the candidate text set according to the association metric value, setting corresponding weights for the N association metric algorithms according to an average recall rate of the ranking order of the preset sample or a recall rate in a preset section, acquiring a comprehensive ranking value of each text in the candidate text set, and acquiring a retrieval ranking result of a target text according to the comprehensive ranking value;

the step of sorting each text in the candidate text set according to a first preset rule by using the association metric value and constructing a first text set according to a first preset screening condition comprises the following steps:

respectively sorting each text in the candidate text set according to an association metric value obtained by presetting N association metric algorithms to obtain N sort sets;

and comprehensively sequencing the N sequencing sets according to a first preset rule, and constructing a first text set according to a first preset screening condition.

2. The method for determining search ranking of text according to claim 1, wherein the step of ranking each text in the first text set according to a second preset rule to obtain a search ranking result of a target text includes:

sequencing each text in the first text set according to a third preset rule, eliminating noise texts according to a second preset screening condition, and constructing a second text set;

and sequencing each text in the second text set according to a second preset rule to obtain a retrieval sequencing result of the target text.

3. The method for determining retrieval ranking of text according to claim 1, wherein the step of comprehensively ranking the N ranking sets according to a first preset rule and constructing a first text set according to a first preset screening condition includes:

respectively calculating the association metric values of the target text and each text in the candidate text set by N association metric algorithms according to a preset strategy, and analyzing to obtain an analysis result; and judging whether each text in the candidate text set meets a preset condition according to an analysis result, and selecting the text meeting the preset condition in the candidate text set into the first text set.

4. The method for determining retrieval ranking of text according to claim 1, wherein the step of comprehensively ranking the N ranking sets according to a first preset rule and constructing a first text set according to a first preset screening condition includes:

respectively distributing weights to the associated metric values obtained by utilizing the preset N metric algorithms according to a first preset rule, multiplying and adding the associated metric values and the corresponding weights to obtain a comprehensive associated metric value, determining a comprehensive ordering result according to the size of the comprehensive associated metric value, and selecting the texts larger than a threshold value of the first preset comprehensive associated metric value into a first text set.

5. The method of claim 1, wherein the step of ranking each text in the candidate text set according to a first predetermined rule by using the relevance metric value, and constructing a first text set according to a first predetermined filtering condition comprises:

respectively sorting the association metric values obtained by using a preset N metric algorithms according to the sizes to obtain N sort sets;

and selecting the texts of which the association metric values are larger than a first association metric value threshold and/or smaller than a first sorting order threshold from the association metric values of the texts in the N sorting sets into the first text set.

6. The method for determining retrieval ranking of text according to claim 2, wherein the step of ranking each text in the first text set according to a third preset rule, excluding noise texts according to a second preset screening condition, and constructing a second text set comprises:

for the texts in the first text set, assigning, according to the third preset rule, a weight to each of the association metric values obtained by using the N preset metric algorithms;

multiplying each association metric value by its corresponding weight and summing the products to obtain a comprehensive association metric value;

determining a comprehensive ranking result according to the magnitude of the comprehensive association metric value;

taking the texts whose comprehensive association metric value is smaller than a second preset comprehensive association metric value threshold as noise texts;

and removing the noise text from the first text set to construct the second text set.
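The noise-exclusion step of claim 6 can be sketched in the same style. The function name `remove_noise`, the data layout, and the sample values in the note below are hypothetical; the claim leaves the third preset rule (the weights) and the second threshold unspecified.

```python
from typing import Dict, List


def remove_noise(
    first_text_set: List[str],
    scores: Dict[str, List[float]],
    weights: List[float],
    noise_threshold: float,
) -> List[str]:
    """Return the second text set: the first text set minus noise texts.

    A text is treated as noise when its weighted sum of metric values is
    strictly smaller than noise_threshold. Surviving texts are returned
    in descending order of that weighted sum.
    """
    kept = []
    for text_id in first_text_set:
        combined = sum(w * v for w, v in zip(weights, scores[text_id]))
        if combined < noise_threshold:
            continue  # noise text: excluded from the second text set
        kept.append((text_id, combined))
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [text_id for text_id, _ in kept]
```

For example, with weights `[0.7, 0.3]` and a threshold of `0.5`, a text scored `[0.3, 0.1]` has a comprehensive value of about 0.24 and is dropped as noise, while a text scored `[0.6, 0.8]` has a value of about 0.66 and is kept.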

7. The method for determining retrieval ranking of text according to claim 2, wherein the step of ranking each text in the first text set according to a third preset rule, excluding noise texts according to a second preset screening condition, and constructing a second text set comprises:

acquiring second association metric values between the target text and each text in the first text set according to the N preset association metric algorithms;

sorting according to the second association metric values obtained by each algorithm to obtain N ranking sets;

taking as noise texts those texts in the N ranking sets whose second association metric value is smaller than a second association metric value threshold and/or whose ranking position is larger than a second ranking order threshold;

8. The method for determining retrieval ranking of text according to claim 1, wherein the step of calculating the association metric value of the target text and each text in the candidate text set by using N preset association metric algorithms comprises:

acquiring, by using one or N preset association metric algorithms and according to the deformed text corresponding to the target text, an association metric value between the target text and each text in the candidate text set or the deformed text corresponding to each text in the candidate text set, wherein a deformed text is a text in another form of expression that is associated with the original text.

9. The method for determining retrieval ranking of text according to claim 8, wherein the step of obtaining an association metric value between the target text and each text in the candidate text set or the deformed text corresponding to each text in the candidate text set according to the deformed text corresponding to the target text by using one or N preset association metric algorithms includes:

acquiring, by using the N preset association metric algorithms, the association metric value between the target text and each text in the candidate text set according to the Chinese text or the English translation of the target text.

10. The method for determining retrieval ranking of text according to claim 8, wherein the step of obtaining an association metric value between the target text and each text in the candidate text set or the deformed text corresponding to each text in the candidate text set according to the deformed text corresponding to the target text by using one or N preset association metric algorithms includes:

acquiring, by using an association metric algorithm, an association metric value between the target patent and each patent in the candidate patent set, and a similarity value with respect to the corresponding translations in other languages in the candidate patent text set.

11. A computer device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for determining retrieval ranking of text according to any one of claims 1 to 10.

12. A computer-readable storage medium storing computer instructions for causing a computer to execute the method for determining retrieval ranking of text according to any one of claims 1 to 10.