CN101625680A

CN101625680A - Document retrieval method in patent field

Info

Publication number: CN101625680A
Application number: CN200810012248A
Authority: CN
Inventors: 朱靖波; 王会珍; 曹菲菲; 肖桐; 李天宁; 宋国龙
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2008-07-09
Filing date: 2008-07-09
Publication date: 2010-01-13
Anticipated expiration: 2028-07-09
Also published as: CN101625680B

Abstract

The invention relates to a document retrieval method for the patent field, comprising the following steps: preprocessing a query text and patent texts; retrieving the patent texts relevant to the query text, computing several similarity values with different similarity measures, combining those values into a new similarity, and ranking the patent texts by the new similarity; applying several decision methods to map the similarity ranking of the patent texts into different rankings of patent category relevance; integrating the resulting category relevance rankings and re-ranking to obtain a new category relevance ranking; and selecting from the new ranking the patent category most relevant to the query text. The method uses several similarity measures together to assess the relevance between the query text and the patent texts, exploits feature information from multiple angles, and combines multiple systems so that they complement one another and improve system performance.

Description

Document retrieval method for the patent field

Technical field

The present invention relates to a data retrieval method, and in particular to a document retrieval method for the patent field.

Background technology

With the rapid development of science and technology, the volume of documents recording scientific and technological achievements has grown dramatically, and patents, as one of the most important means of protecting intellectual property, are receiving more and more attention. Patent texts record the technical solutions of the newest inventions. Besides patents, however, scientific and technological achievements are also recorded in non-patent texts such as research papers and technical reports. Patent and non-patent documents are related; for example, studying the relationship between research papers and patents makes it possible to predict technology trends. Studying patent documents together with non-patent research documents makes it possible to follow the latest technology in each field, and thereby to avoid duplicated development and infringement, and even to analyze the development of an entire technology industry; to analyze competitors' technical research status and strategy; and to carry out invalidity searches against patents. Retrieval across patent and non-patent literature is a relatively new problem in patent research.

Patent texts usually cite related patents or research papers, but studying the relationship between non-patent literature and patent texts solely through the citation relations between patents and research papers is very limited. Moreover, a patent database holds millions of patent documents, so handling patents purely by hand is time-consuming and laborious. How to retrieve relevant patents from a huge patent database and obtain useful patent information is a difficult problem in patent research.

Current patent retrieval and classification techniques are of two kinds: retrieval based on an already classified patent database, and retrieval based on natural language processing techniques.

Most early patent retrieval methods are based on patent databases. For example, the patent with publication number CN1996290A mainly uses the structured text information of patents to extract patent citation relations and build a patent association graph. Then, given a patent query condition such as application number, patent number, filing date, publication date, inventor or patentee, it retrieves patents together with the patents associated with them in the patent association graph. This method depends on the fixed structured text of the patent itself; it is not intelligent enough and does not analyze the patent content.

Methods based on natural language processing apply natural language processing techniques to analyze the content of patent text: from the title, abstract, description, claims and other text of a patent they obtain useful features that characterize the patent, assign weights to the features, and retrieve relevant patent texts. For example, the article "Some Issues in the Automatic Classification of U.S. Patents" (by Leah S. Larkey, a report at the AAAI-98 text classification workshop) describes a method of patent classification using natural language processing techniques. The article "POSTECH at NTCIR-5 Patent Retrieval: Smoothing Experiments in a Language Modeling Approach to Patent Retrieval" (by In-Su Kang, Seung-Hoon Na, Jun-Ki Kim and Jong-Hyeok Lee, published in Proceedings of NTCIR-5 Workshop Meeting, December 6-9, 2005, Tokyo, Japan) applies natural language processing techniques to patent retrieval.

However, existing methods are limited to keyword retrieval and address only retrieval among patent texts. They do not consider the relationship between non-patent texts and patent texts, or between non-patent texts and patent categories, and therefore cannot provide intelligent full-text retrieval across non-patent and patent texts.

Summary of the invention

In view of the shortcomings of prior-art document retrieval for the patent field, which does not consider the relationship between non-patent texts and patent texts or between non-patent texts and patent categories and cannot provide intelligent full-text retrieval across non-patent and patent texts, the technical problem to be solved by the present invention is to provide a patent retrieval method that represents patent texts as feature vectors, computes the similarity between a non-patent text and the relevant patent texts, and retrieves the most relevant patent texts.

To solve the above technical problem, the patent retrieval method based on natural language processing adopted by the present invention comprises the following steps:

preprocessing the query text and the patent texts;

retrieving the patent texts relevant to the query text, computing different similarity values with several different similarity measures, combining those values to recompute the similarity, and ranking the patent texts by the new similarity value;

applying several different decision methods to convert the similarity ranking of the patent texts into different rankings of patent category relevance; integrating the different patent category relevance rankings and re-ranking to obtain a new patent category relevance ranking;

selecting from the new patent category relevance ranking the patent category most relevant to the query text.

The text processing comprises preprocessing the text, obtaining feature word candidates, collecting statistics on the feature words, selecting features with a feature selection method, and converting the text into a vector representation. Specifically: tags that are not part of the patent text are removed and the patent text information is extracted, namely the patent number, the IPC classification of the patent, the patent title, the abstract, the claims and the description; in English text, all-caps words are kept as they are; words containing digits are removed; stop words are removed; English text is lemmatized, yielding the candidate feature vocabulary; statistics are computed over the candidate feature vocabulary to obtain the term frequency, document frequency and category frequency of each word; the feature vocabulary is selected from the feature candidates, the feature weight of each feature word in the feature vocabulary is computed, and the patent texts and the query text are converted into computable vectors according to the feature words and their weights.

The several different similarity measures yield similarity values between the query text and a patent text, and these values are integrated with a log-linear model according to the following formula:

$$Sim(\vec{D}_1, \vec{D}_2) = \frac{\exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{D}_2))}{\sum_{k=0}^{n} \exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{d}_k))}$$

where $\vec{S}(\vec{D}_1, \vec{D}_2)$ is the feature vector formed by the similarity values of the query text $\vec{D}_1$ and the patent text $\vec{D}_2$ obtained with the different similarity measures, $\vec{\theta}$ is the weight vector of those similarity values, $n$ is the total number of patent texts relevant to the query text, and $\vec{d}_k$ denotes the k-th relevant patent text vector.

The several different decision methods comprise a similarity sum weighted by patent category, a similarity sum weighted by the position of the patent text in the similarity ranking, and a plain sum of patent text similarities. The similarity sum weighted by patent category is computed as follows:

$$score(x) = \sum_{i=1}^{k} (k_r)^{c_i} \times ICF \times score_{d_i} \times role(x, i)$$

$$ICF = \log\left(\frac{N + 0.5}{C_x + 0.5}\right)$$

where $k_r$ is a penalty factor constant, $k$ is the number of candidate patent texts in the similarity ranking, $c_i$ is the position that the patent category of candidate patent text $i$ obtains in the similarity ranking, $score_{d_i}$ is the similarity value of the query text and patent text $d_i$, $ICF$ is the inverse category frequency, $C_x$ is the number of texts in category $x$, $N$ is the total number of texts, $score(x)$ is the relevance value of the query text and patent category $x$, and $role(x, i)$ indicates whether patent text $d_i$ belongs to patent category $x$.

The similarity sum weighted by the position of the patent text in the similarity ranking is computed as follows:

$$score(x) = \sum_{i=1}^{k} (k_t)^{i} \times score_{d_i} \times role(x, i)$$

The different patent category relevance rankings are integrated as follows: the category relevance rankings obtained by combining the several similarity values with the several category decision methods are used as features of the category position, and the rankings are combined with a Rank-SVM model.

Alternatively, the different patent category relevance rankings are integrated by summing, over the different category relevance results, the position values at which each category appears, which gives the new patent category relevance value.

The present invention has the following beneficial effects and advantages:

1. The method uses natural language processing techniques and combines several similarity measures to produce the final measure of relevance between the query text and the patent texts, making full use of feature information from multiple angles. Finally, multiple systems are combined so that they complement one another, which improves system performance.

Description of drawings

Fig. 1 is a flowchart of the method of the invention;

Fig. 2 is a flowchart of text preprocessing;

Fig. 3 is a flowchart of the similarity calculation between query text and patent text;

Fig. 4 is a flowchart of the relevance calculation between query text and patent category.

Embodiment

The method of the present invention is further described below with reference to an embodiment and the accompanying drawings:

As shown in Fig. 1, a document retrieval method for the patent field comprises the following steps:

The query text and the patent texts are preprocessed. The patent texts relevant to the query text are retrieved, different similarity values are computed with several different similarity measures, those values are combined to recompute the similarity, and the patent texts are ranked by the new similarity value. Several different decision methods are applied to convert the similarity ranking of the patent texts into different rankings of patent category relevance, and the different category relevance rankings are integrated and re-ranked to obtain a new patent category relevance ranking. The patent category most relevant to the query text is selected from the new patent category relevance ranking.

As shown in Fig. 2, preprocessing the query text and the patent texts comprises the following steps:

A) Tags that are not part of the patent text are removed and the patent text information is extracted: the patent number, the IPC classification of the patent, the patent title, the abstract, the claims and the description. In the extracted information, non-letter or non-Chinese symbols inside words are removed, for example '-', ',', '(' and ')'. In English text, all-caps words are kept as they are. Words containing digits are removed. Stop words are removed, for example "claim" and "said" in English patents and "step" and "feature" in Chinese patents, as well as prepositions, adverbs, articles and the like. English text is lemmatized, yielding the candidate feature vocabulary;

B) Statistics are computed over the candidate feature vocabulary to obtain the term frequency, document frequency and category frequency of each word;

C) The feature vocabulary is selected from the feature candidates, the feature weight of each feature word in the feature vocabulary is computed, and the patent texts and the query text are converted into computable vectors according to the feature words and their weights;

D) With the feature words of the patents as index terms, an inverted index is built and stored for the patent documents and the patent text vectors.
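The preprocessing steps A) to D) can be sketched roughly as follows. This is a minimal illustration under several assumptions that are not in the source: the stop-word list is a tiny hypothetical one, symbols are replaced by spaces, lemmatization and feature selection are omitted, and TF-IDF is used as the feature weight because the text does not name a weighting scheme at this point. The function names `tokenize` and `build_vectors` are invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the source only names "claim" and "said" as examples.
STOP_WORDS = {"claim", "said", "the", "a", "of", "in", "and", "is"}


def tokenize(text):
    """Turn raw English text into feature-word candidates (step A, simplified)."""
    # Assumption: symbols such as '-', ',', '(' and ')' are replaced by spaces.
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    tokens = []
    for word in cleaned.split():
        if any(ch.isdigit() for ch in word):
            continue  # drop words that contain digits
        if not word.isupper():
            word = word.lower()  # all-caps words (e.g. acronyms) are kept unchanged
        if word.lower() in STOP_WORDS:
            continue  # drop stop words
        tokens.append(word)  # lemmatization is omitted in this sketch
    return tokens


def build_vectors(docs):
    """Return (vectors, inverted_index) for a dict of doc_id -> raw text (steps B to D)."""
    term_freqs = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for tf in term_freqs.values():
        doc_freq.update(tf.keys())  # document frequency of each word
    n_docs = len(docs)
    vectors, inverted_index = {}, {}
    for doc_id, tf in term_freqs.items():
        # Assumed weighting: term frequency times a smoothed inverse document frequency.
        vectors[doc_id] = {
            t: f * math.log((n_docs + 1) / doc_freq[t]) for t, f in tf.items()
        }
        for t in tf:
            inverted_index.setdefault(t, set()).add(doc_id)
    return vectors, inverted_index
```

Storing each vector as a sparse dictionary keeps the later similarity calculations proportional to the number of feature words a text actually contains.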

As shown in Fig. 3, the calculation with several different similarity measures comprises the following steps:

The patent texts that share at least one feature word with the query text are found in the patent text collection and form the set of relevant patent texts.
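Finding the set of relevant patent texts amounts to a union of inverted-index posting sets. The sketch below assumes the index is a dictionary from feature word to a set of patent identifiers; the name `find_candidates` and the sample data are hypothetical.

```python
def find_candidates(query_terms, inverted_index):
    """Return the ids of all patent texts sharing at least one feature word with the query."""
    candidates = set()
    for term in query_terms:
        # Feature words missing from the index simply contribute nothing.
        candidates |= inverted_index.get(term, set())
    return candidates


# Hypothetical index for illustration only.
sample_index = {"engine": {"p1", "p2"}, "valve": {"p1"}, "antenna": {"p3"}}
relevant = find_candidates(["engine", "valve", "rotor"], sample_index)
```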

The similarity between each relevant patent in the set and the query text is calculated. Several similarity measures are used in this embodiment, namely vector cosine, BM25 and SMART, computed as follows:

1. Vector cosine

The query text $\vec{D}_1$ and the patent text $\vec{D}_2$ are represented in the vector space model, and the cosine of the two vectors is:

$$\cos(\vec{D}_1, \vec{D}_2) = \frac{\vec{D}_1 \cdot \vec{D}_2}{||\vec{D}_1|| \cdot ||\vec{D}_2||}$$
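The cosine formula can be sketched over sparse vectors as follows, assuming each text vector is a dictionary from feature word to weight. Returning 0.0 for an empty vector is a choice made for this example, not something the source specifies.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of two sparse vectors given as {feature_word: weight} dictionaries."""
    # Only feature words present in both vectors contribute to the dot product.
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty vectors
    return dot / (norm1 * norm2)
```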

2. BM25

BM25 has many variants; the BM25 formula used in this embodiment is:

$$score(\vec{D}_1, \vec{D}_2) = \sum_{i=1}^{n} IDF(t_i) \cdot \frac{f(t_i, \vec{D}_2) \cdot (k_1 + 1)}{f(t_i, \vec{D}_2) + k_1 \cdot \left(1 - b + b \cdot \frac{|\vec{D}_2|}{avgdl}\right)}$$

where $n$ is the number of feature words in the query text $\vec{D}_1$; $f(t_i, \vec{D}_2)$ is the number of times feature word $t_i$ occurs in the patent text $\vec{D}_2$; $|\vec{D}_2|$ is the length of the patent text $\vec{D}_2$; $avgdl$ is the average text length in the set of patent texts relevant to the query text; $k_1$ and $b$ are free parameters, set in this embodiment to $k_1 = 2.0$ and $b = 0.75$; and $IDF(t_i)$ is the inverse document frequency, which serves as the weight of term $t_i$ and is computed as:

$$IDF(t_i) = \log \frac{N - n(t_i) + 0.5}{n(t_i) + 0.5}$$

where $N$ is the total number of documents in the whole data set and $n(t_i)$ is the number of documents containing term $t_i$.
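A minimal sketch of this BM25 variant follows, using the embodiment's values k1 = 2.0 and b = 0.75 as defaults. The argument layout (term-frequency dictionary, precomputed document frequencies) and the use of the natural logarithm are assumptions of the example.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency as defined above (natural log assumed)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25(query_terms, doc_tf, doc_len, avgdl, n_docs, doc_freqs, k1=2.0, b=0.75):
    """BM25 score of one patent text for the feature words of a query.

    doc_tf: {term: occurrences in the patent text}
    doc_freqs: {term: number of documents containing the term}
    """
    score = 0.0
    for term in query_terms:
        f = doc_tf.get(term, 0)
        if f == 0:
            continue  # terms absent from the patent text contribute nothing
        length_norm = k1 * (1 - b + b * doc_len / avgdl)
        score += idf(n_docs, doc_freqs[term]) * f * (k1 + 1) / (f + length_norm)
    return score
```

Note that with this IDF definition a term occurring in more than half of the documents receives a negative weight, so raw BM25 scores are not guaranteed to be positive.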

3. SMART

The SMART formula is as follows:

$$Sim_{SMART} = \sum_{t \in T} (\vec{D}_1 \times \vec{D}_2)$$

The weight $w_i$ of each dimension of the query text vector $\vec{D}_1$ is computed as:

$$w_i = (1 + \log(tf_i)) \times \log \frac{N + 1}{n}$$

The weight $w_i$ of each dimension of the patent text vector $\vec{D}_2$ is computed as:

$$w_i = \frac{1 + \log(tf_i)}{1 + \log(avtf)} \times \frac{1}{0.8 + 0.2 \frac{utf}{pivot}}$$

where $T$ is the set of feature words that occur in both the query text $\vec{D}_1$ and the patent text $\vec{D}_2$; $tf_i$ is the term frequency of the i-th feature word in the text vector; $N$ is the number of texts in the whole patent text collection, and $n$ is the number of patent texts in which the i-th feature occurs; $avtf$ is the average term frequency of the feature words in the documents of the relevant patent text set; $utf$ is the number of feature words in the patent text vector $\vec{D}_2$; and $pivot$ is the average number of feature words per document in the whole patent text collection.
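The SMART weighting can be sketched as below, reading the sum over shared feature words as the sum of products of the query-side and patent-side weights of each shared word. That reading, the natural logarithm, and the helper names are assumptions of this example; `avtf` and `pivot` are passed in as precomputed statistics.

```python
import math


def query_weight(tf, n_docs, doc_freq):
    """Query-side weight: (1 + log tf) * log((N + 1) / n)."""
    return (1 + math.log(tf)) * math.log((n_docs + 1) / doc_freq)


def doc_weight(tf, avtf, utf, pivot):
    """Patent-side weight with pivoted normalization on the number of feature words."""
    return (1 + math.log(tf)) / (1 + math.log(avtf)) / (0.8 + 0.2 * utf / pivot)


def smart_similarity(query_tf, doc_tf, n_docs, doc_freqs, avtf, pivot):
    """Sum, over feature words shared by query and patent text, of the weight products."""
    utf = len(doc_tf)  # number of distinct feature words in the patent text
    shared = set(query_tf) & set(doc_tf)
    return sum(
        query_weight(query_tf[t], n_docs, doc_freqs[t])
        * doc_weight(doc_tf[t], avtf, utf, pivot)
        for t in shared
    )
```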

The three methods are each used to calculate a similarity value between the query text and a patent text.

The similarity values obtained by the above methods are normalized to values between 0 and 1.

The logarithm of each normalized similarity value is then taken.

The log similarity values are used as the features of the log-linear model, computed as follows:

$$Sim(\vec{D}_1, \vec{D}_2) = \frac{\exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{D}_2))}{\sum_{k=0}^{n} \exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{d}_k))}$$

where $\vec{S}(\vec{D}_1, \vec{D}_2)$ is the feature vector formed by the similarity values of the query text $\vec{D}_1$ and the patent text $\vec{D}_2$ obtained with the different similarity measures, $\vec{\theta}$ is the weight vector of those similarity values, $n$ is the total number of patent texts relevant to the query text, and $\vec{d}_k$ denotes the k-th relevant patent text vector.
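The normalize, take-log, and log-linear steps can be sketched together as follows. The source does not say how the values are normalized or how the weight vector is obtained, so this example assumes division by the per-method maximum, a small floor `EPS` to avoid taking the log of zero, and a weight vector `theta` supplied by the caller (for instance from training, which is not shown).

```python
import math

EPS = 1e-9  # assumed floor so that log(0) never occurs; not specified in the source


def normalize(scores):
    """Scale raw scores of one method into (0, 1] by dividing by the maximum (assumed scheme)."""
    top = max(scores)
    if top <= 0:
        return [EPS for _ in scores]
    return [max(s / top, EPS) for s in scores]


def log_linear_combine(score_lists, theta):
    """Combine several similarity measures over the same candidate list.

    score_lists[m][k] is the raw score of method m for candidate k;
    theta[m] is the weight of method m. Returns one combined value per candidate.
    """
    log_feats = [[math.log(v) for v in normalize(scores)] for scores in score_lists]
    n_candidates = len(score_lists[0])
    logits = [
        sum(theta[m] * log_feats[m][k] for m in range(len(theta)))
        for k in range(n_candidates)
    ]
    shift = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the denominator sums over all relevant patent texts, the combined values form a distribution over the candidates, and ranking by them is equivalent to ranking by the weighted sum of log similarities.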

As shown in Fig. 4, several different patent category decision methods are applied to the similarity ranking results of the patent texts to calculate relevance rankings between the query text and the patent categories. In this embodiment the decision methods used are the similarity sum, the similarity sum weighted by ranking position, and the similarity sum weighted by patent category, computed as follows:

1. Similarity sum, computed as follows:

$$score(x) = \sum_{i=1}^{k} score_{d_i} \times role(x, i)$$

where $x$ denotes an IPC category, $k$ is the number of candidate patent texts in the similarity ranking, and $score_{d_i}$ is the similarity value of the i-th candidate patent text. $role(x, i)$ indicates whether patent text $d_i$ belongs to patent category $x$.
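The similarity sum can be sketched as below. The example treats role(x, i) as 1 when patent text i carries category x and 0 otherwise, and assumes the ranked list is given as (patent id, similarity) pairs; both the data layout and the sample IPC codes are illustrative only.

```python
from collections import defaultdict


def similarity_sum(ranked, categories_of):
    """Sum the similarity of every candidate patent text into each category it belongs to.

    ranked: list of (doc_id, similarity) pairs, the top-k candidates
    categories_of: {doc_id: set of IPC category codes}
    """
    scores = defaultdict(float)
    for doc_id, score in ranked:
        for category in categories_of[doc_id]:  # role(x, i) == 1 for these categories
            scores[category] += score
    return dict(scores)


# Hypothetical data for illustration.
ranked = [("p1", 0.5), ("p2", 0.3), ("p3", 0.2)]
categories_of = {"p1": {"G06F"}, "p2": {"H04L"}, "p3": {"G06F"}}
category_scores = similarity_sum(ranked, categories_of)
```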

2. Similarity sum weighted by patent category, computed as follows:

$$score(x) = \sum_{i=1}^{k} (k_r)^{c_i} \times ICF \times score_{d_i} \times role(x, i)$$

$$ICF = \log\left(\frac{N + 0.5}{C_x + 0.5}\right)$$

where $score_{d_i}$ is the similarity value of the query text and patent text $d_i$, $ICF$ is the inverse category frequency, $C_x$ is the number of texts in category $x$, $N$ is the total number of texts, and $score(x)$ is the relevance value of the query text and patent category $x$. $role(x, i)$ indicates whether patent text $d_i$ belongs to patent category $x$.
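A rough sketch of the category-weighted sum follows. The source describes the exponent only as the position that the candidate's category obtains in the similarity ranking, so this example assumes it is the order (starting at 1) in which the category first appears when walking down the ranked list. It also assumes one category per patent text and a hypothetical penalty value of 0.9; none of these choices is confirmed by the source.

```python
import math
from collections import defaultdict


def category_weight_sum(ranked, category_of, category_sizes, n_total, k_r=0.9):
    """Category-weighted similarity sum with an inverse-category-frequency factor.

    ranked: list of (doc_id, similarity) pairs sorted by descending similarity
    category_of: {doc_id: category}, one category per text (assumption)
    category_sizes: {category: number of texts in that category}
    n_total: total number of texts
    """
    # Assumed meaning of the category position: order of first appearance in the ranking.
    position = {}
    for doc_id, _ in ranked:
        category = category_of[doc_id]
        if category not in position:
            position[category] = len(position) + 1
    scores = defaultdict(float)
    for doc_id, score in ranked:
        x = category_of[doc_id]
        icf = math.log((n_total + 0.5) / (category_sizes[x] + 0.5))
        scores[x] += (k_r ** position[x]) * icf * score
    return dict(scores)
```

With a penalty factor below 1, categories that first appear lower in the ranking are discounted more strongly, while the inverse category frequency reduces the advantage of very large categories.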

3. Similarity sum weighted by ranking position, computed as follows:

$$score(x) = \sum_{i=1}^{k} (k_t)^{i} \times score_{d_i} \times role(x, i)$$

where $k_t$ is a penalty factor constant, $k$ is the number of candidate patent texts in the similarity ranking, and $score_{d_i}$ is the similarity value of the query text and patent text $d_i$. $role(x, i)$ indicates whether patent text $d_i$ belongs to patent category $x$.
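The position-weighted sum differs from the plain similarity sum only by a penalty raised to the rank of each candidate. In the sketch below the penalty value 0.9 is a hypothetical default, since the source does not give one, and ranks start at 1 as in the formula.

```python
from collections import defaultdict


def position_weight_sum(ranked, categories_of, k_t=0.9):
    """Similarity sum in which the i-th ranked candidate is discounted by k_t ** i.

    ranked: list of (doc_id, similarity) pairs sorted by descending similarity
    categories_of: {doc_id: set of IPC category codes}
    """
    scores = defaultdict(float)
    for i, (doc_id, score) in enumerate(ranked, start=1):
        for category in categories_of[doc_id]:
            scores[category] += (k_t ** i) * score
    return dict(scores)
```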

The different patent category relevance rankings obtained by methods 1 to 3 are combined and the category ranking is re-ranked. Many combinations are possible; the following two are used in this embodiment:

The category relevance rankings obtained by combining the several similarity values with the several category decision methods are used as features of the category position, and the rankings are combined with a Rank-SVM model.

For each category, the position values at which it appears in the different category relevance results are summed, which gives the new patent category relevance value.

The similarity values between the query text and the patent texts obtained through the above steps are used for ranking, and the patent category most relevant to the query text is selected accordingly.

The method of the present invention is not limited to the embodiments described in the detailed description; other embodiments derived by those skilled in the art from the technical solution of the present invention likewise fall within the scope of technical innovation of the present invention.

Claims

1. A document retrieval method for the patent field, comprising the following steps:

preprocessing the query text and the patent texts;

2. The document retrieval method for the patent field as claimed in claim 1, characterized in that the text processing comprises preprocessing the text, obtaining feature word candidates, collecting statistics on the feature words, selecting features with a feature selection method, and converting the text into a vector representation, specifically:

removing tags that are not part of the patent text and extracting the patent text information, namely the patent number, the IPC classification of the patent, the patent title, the abstract, the claims and the description; keeping all-caps words in English text as they are; removing words containing digits; removing stop words; lemmatizing English text to obtain the candidate feature vocabulary;

computing statistics over the candidate feature vocabulary to obtain the term frequency, document frequency and category frequency of each word;

selecting the feature vocabulary from the feature candidates, computing the feature weight of each feature word in the feature vocabulary, and converting the patent texts and the query text into computable vectors according to the feature words and their weights.

3. The document retrieval method for the patent field as claimed in claim 1, characterized in that the several different similarity measures yield similarity values between the query text and a patent text, and these values are integrated with a log-linear model according to the following formula:

$$Sim(\vec{D}_1, \vec{D}_2) = \frac{\exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{D}_2))}{\sum_{k=0}^{n} \exp(\vec{\theta} \cdot \vec{S}(\vec{D}_1, \vec{d}_k))}$$

where $\vec{S}(\vec{D}_1, \vec{D}_2)$ is the feature vector formed by the similarity values of the query text $\vec{D}_1$ and the patent text $\vec{D}_2$ obtained with the different similarity measures, $\vec{\theta}$ is the weight vector of those similarity values, $n$ is the total number of patent texts relevant to the query text, and $\vec{d}_k$ denotes the k-th relevant patent text vector.

4. The document retrieval method for the patent field as claimed in claim 1, characterized in that the several different decision methods comprise a similarity sum weighted by patent category, a similarity sum weighted by the position of the patent text in the similarity ranking, and a plain sum of patent text similarities, wherein the similarity sum weighted by patent category is computed as follows:

$$score(x) = \sum_{i=1}^{k} (k_r)^{c_i} \times ICF \times score_{d_i} \times role(x, i)$$

$$ICF = \log\left(\frac{N + 0.5}{C_x + 0.5}\right)$$

where $k_r$ is a penalty factor constant, $k$ is the number of candidate patent texts in the similarity ranking, $c_i$ is the position that the patent category of candidate patent text $i$ obtains in the similarity ranking, $score_{d_i}$ is the similarity value of the query text and patent text $d_i$, $ICF$ is the inverse category frequency, $C_x$ is the number of texts in category $x$, $N$ is the total number of texts, $score(x)$ is the relevance value of the query text and patent category $x$, and $role(x, i)$ indicates whether patent text $d_i$ belongs to patent category $x$.

5. The document retrieval method for the patent field as claimed in claim 4, characterized in that the similarity sum weighted by the position of the patent text in the similarity ranking is computed as follows:

$$score(x) = \sum_{i=1}^{k} (k_t)^{i} \times score_{d_i} \times role(x, i)$$

6. The document retrieval method for the patent field as claimed in claim 1, characterized in that the different patent category relevance rankings are integrated by using the category relevance rankings obtained by combining the several similarity values with the several category decision methods as features of the category position, and combining the rankings with a Rank-SVM model.

7. The document retrieval method for the patent field as claimed in claim 1, characterized in that the different patent category relevance rankings are integrated by summing, over the different category relevance results, the position values at which each category appears, which gives the new patent category relevance value.