CN107122382A - Patent classification method based on the specification - Google Patents

Patent classification method based on the specification Download PDF

Info

Publication number
CN107122382A
CN107122382A (application CN201710082677.8A)
Authority
CN
China
Prior art keywords
classification
ipc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710082677.8A
Other languages
Chinese (zh)
Other versions
CN107122382B (en)
Inventor
朱玉全
金健
佘远程
石亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201710082677.8A priority Critical patent/CN107122382B/en
Publication of CN107122382A publication Critical patent/CN107122382A/en
Application granted granted Critical
Publication of CN107122382B publication Critical patent/CN107122382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a patent classification method based on the patent specification, belonging to the fields of text processing and data mining. First, text preprocessing is performed on the patent specification; an inverted index file is then built, and feature words are selected with a feature selection method that combines information gain and term frequency; an improved TF-IDF formula is used to compute the feature-word weights and build the patent feature vector; a neighborhood set of training patents is then constructed; finally, the patent is classified with an optimized KNN classifier. This research provides a new approach to patent document classification and lays a foundation for further research on intelligent patent retrieval.

Description

Patent classification method based on the specification
Technical field
The invention belongs to the field of computer-based analysis of patent documents, and in particular relates to a patent classification method that uses the patent specification.
Background art
A patent is a concrete manifestation of technological innovation and enterprise value, and is one of the important carriers, outcomes, and sources of knowledge development and innovation; many inventive achievements appear only in patent documents. According to statistics of the World Intellectual Property Organization (WIPO), 70%~90% of the world's inventions first appear in patent documents rather than in journals, papers, or other media. In addition, to protect their own interests, enterprises apply for patents as early as possible, so patents often concentrate the most active and advanced technologies and contain 90%~95% of the world's technical information. At the same time, for the convenience of examination, patent documents are usually written in considerable detail; compared with other kinds of literature, patent documents provide more information. They are one of the most common records of technological innovation and document the complete course of patent activity: they not only reflect the current state of technical activity in each technical field, but also trace the development history of technical activity in a particular technical field. Patent documents contain the specific technical solution of each invention, which is very important for enterprise innovation. They allow enterprises to follow the latest research developments and avoid duplicated research, saving time and research funding, while also inspiring researchers, raising the starting point of innovation, and allowing earlier inventions to be drawn upon, which greatly shortens research schedules.
With the continuing emergence of new research results and inventions in China, the number of patents has grown rapidly. As of October 5, 2016, more than 5.98 million Chinese invention patents had been published, of which 2.2385 million had been granted. If the average size of each patent is 2 MB, the volume of patent data reaches hundreds of terabytes. To manage these patent documents scientifically and to retrieve related patents quickly and conveniently, the classification of patent documents is particularly important. At present, most countries classify patent documents with the International Patent Classification (IPC). The IPC is organized into five levels: section, class, subclass, main group, and subgroup. The section is the highest level of the classification table; there are eight sections, divided by field and labeled with a single letter from A to H. Each section contains multiple classes, each class symbol consists of two digits, and different sections contain different numbers of classes. For example, G06F21/00 denotes Physics - Electric digital data processing - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity.
It follows that every published or to-be-published invention patent must be assigned one or more corresponding classification codes; for example, the invention patent "A method for protecting private data in association rule mining" is classified as G06F21/00. For a patent application to be submitted, the classification code is unknown and must be determined. The current practice is to determine it from the technical field or the content described in the application, which requires domain experts to read the application manually. As patent applications increase sharply (nearly one million applications per year), this approach consumes a great deal of manpower and material resources, and the limits of an expert's knowledge make it difficult to guarantee consistent and accurate classification results. The present invention therefore proposes a patent classification method based on the patent specification: the method uses the information disclosed in published invention specifications to construct a classifier or classification function, determines the class of a patent application with it, and thereby achieves automatic patent classification.
Summary of the invention
The object of the present invention is to address the problem that existing patent classification methods cannot effectively use the specification information of published invention patents. A patent classification method based on the patent specification is proposed: the method makes full use of the specification information contained in published invention patents and their corresponding classes to construct a classifier or classification function, determines the class of a patent application to be submitted with it, and provides corresponding optimized solutions for feature extraction and selection from the specification and for the determination of the classifier during construction.
The technical solution adopted by the present invention is as follows. The patent classification method based on the patent specification mainly comprises the following steps:
(1) Preprocess the patent data
Collect patent sample data and sample IPC codes, extract the specification, and perform Chinese word segmentation and part-of-speech tagging. Remove symbols and numerals from the specification (specifications contain a large number of paragraph labels). Filter out stop words, function words, conjunctions, and other words of little use for patent classification with regular-expression matching, retaining only nouns, adjectives, verbs, and other keywords.
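As an illustration of this preprocessing step, a minimal Python sketch is given below; the jieba segmenter, the part-of-speech tag prefixes that are kept, and the filtering regular expression are assumptions made for illustration and are not prescribed by the invention.

```python
import re
import jieba.posseg as pseg  # assumed Chinese segmenter with part-of-speech tagging

KEEP_POS = ("n", "v", "a")   # assumed tag prefixes for nouns, verbs, adjectives
NOISE = re.compile(r"[0-9\s\[\]（）()【】，。；：、]+")  # digits, paragraph labels, punctuation

def preprocess(spec_text, stopwords):
    """Segment a patent specification and keep only content-bearing keywords."""
    cleaned = NOISE.sub(" ", spec_text)
    kept = []
    for word, flag in pseg.cut(cleaned):
        if word not in stopwords and flag.startswith(KEEP_POS):
            kept.append((word, flag))   # retain nouns, adjectives, verbs
    return kept
```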
(2) Build the inverted index file
Count the term frequency, position information, part-of-speech weight, and inter-class distribution of each word, and use these statistics together with the patent text information to build an inverted index file.
(3) Patent text feature selection
Use a feature selection method that combines information gain and term frequency to compute the feature value of each word from step (2), sort the words by feature value, and select a certain number of feature words to characterize the patent text.
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j. The feature value is then computed as in formula (1).
Here TF represents the influence of term frequency within a patent on feature selection. Let m be the total number of classes in the training patents, N_j the number of patents in class c_j, and TF_jk the term frequency of feature word t_i in patent P_k of class c_j; TF is then computed as in formula (2).
In formula (1), IC represents the degree of dispersion of a feature word across classes; the more dispersed a word is, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and the average of TF_j(t_i) over all classes its mean class frequency; IC is then computed as in formula (3).
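A minimal sketch of the combined feature score G(t) = TF × IC × (information-gain term), following formulas (1)-(3) as reproduced in the claims; the input layout (a list of class-labelled term-frequency dictionaries) is an assumption made for illustration.

```python
import math
from collections import defaultdict

def feature_value(term, docs):
    """G(t) = TF * IC * IG, per formulas (1)-(3).
    docs: list of (class_label, {word: term_frequency}) pairs (assumed layout)."""
    classes = sorted({c for c, _ in docs})
    N = len(docs)

    # per-class document counts with / without the term (A_ij and C_ij)
    A = {c: sum(1 for cl, tf in docs if cl == c and term in tf) for c in classes}
    C = {c: sum(1 for cl, tf in docs if cl == c and term not in tf) for c in classes}
    n_with = sum(A.values())            # documents containing the term (A_ij + B_ij)
    n_without = N - n_with              # documents not containing it (C_ij + D_ij)

    def part(counts, total):            # (total / N) * sum_j p_j * log p_j
        if total == 0:
            return 0.0
        return (total / N) * sum((counts[c] / total) * math.log(counts[c] / total)
                                 for c in classes if counts[c] > 0)

    ig = part(A, n_with) + part(C, n_without)

    # raw term frequencies per document, grouped by class
    tf_by_class = defaultdict(list)
    for cl, tf in docs:
        tf_by_class[cl].append(tf.get(term, 0))

    total_sq = sum(f * f for fs in tf_by_class.values() for f in fs)
    TF = max(math.sqrt(sum(f * f for f in tf_by_class[c]) / total_sq)
             for c in classes) if total_sq else 0.0          # formula (2)

    per_class = {c: sum(tf_by_class[c]) for c in classes}
    total_tf = sum(per_class.values())
    mean = total_tf / len(classes) if classes else 0.0
    IC = (math.sqrt(sum((per_class[c] - mean) ** 2 for c in classes)) / total_tf
          if total_tf else 0.0)                              # formula (3)

    return TF * IC * ig
```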
(4) Patent text vectorization
This step includes:
1. Weight calculation, computed as shown in formula (4).
Here the term-frequency factor denotes the frequency with which feature word t appears in the text, N is the number of patents in the whole patent sample set, n is the number of patents in the sample set that contain feature word t, C_t is the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t is the position weight coefficient of the feature word.
2. Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.
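Formula (4) itself is not reproduced in the text, so the sketch below only assumes the usual multiplicative combination of the factors named above (term frequency, an inverse-document-frequency term in N and n, the part-of-speech coefficient C_t, and the position coefficient P_t); it is an illustration, not the exact formula of the invention. The cut-off of 20 feature words follows the embodiment described later.

```python
import math

def feature_weight(tf, N, n, c_t, p_t):
    """Assumed improved TF-IDF style weight: tf * log(N / n) * C_t * P_t.
    The +1 smoothing is added here only to avoid division by zero."""
    return tf * math.log(N / (n + 1)) * c_t * p_t

def build_patent_vector(term_stats, N, top_n=20):
    """Space model vector V_i = (w_i1, ..., w_in), sorted by descending weight.
    term_stats: {word: {"tf": ..., "df": ..., "pos_w": ..., "loc_w": ...}} (assumed layout)."""
    weights = {t: feature_weight(s["tf"], N, s["df"], s["pos_w"], s["loc_w"])
               for t, s in term_stats.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```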
(5) Generate the class feature vectors of each IPC level
This step includes:
1. Merge the class descriptions of the subgroups into the class description of their main group, then perform word segmentation and stop-word removal.
2. After merging the descriptions of each main group, perform feature selection and construct the class feature vectors at the main-group level of the IPC, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
3. After merging all the descriptions under the same subclass, perform feature selection and construct the class feature vectors at the subclass level of the IPC, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
4. After merging all the descriptions under the same class, perform feature selection and construct the class feature vectors at the class level of the IPC, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
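The roll-up of IPC class descriptions from subgroup to main group, subclass, and class can be sketched as follows; deriving the parent code from the IPC symbol itself is an assumption made for illustration.

```python
from collections import defaultdict

def roll_up(descriptions, parent_of):
    """Merge the word lists of child IPC codes into their parent's word list."""
    merged = defaultdict(list)
    for code, words in descriptions.items():
        merged[parent_of(code)].extend(words)
    return dict(merged)

# assumed helpers that derive the parent code from an IPC symbol
main_group_of = lambda code: code.split("/")[0] + "/00"   # A01B1/03  -> A01B1/00
subclass_of   = lambda code: code[:4]                     # A01B1/00  -> A01B
class_of      = lambda code: code[:3]                     # A01B      -> A01

# subgroup descriptions -> main-group, subclass and class level word sets;
# feature selection on each level then yields {V_A01B1/00,...}, {V_A01B,...}, {V_A01,...}
```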
(6) Build the patent sample neighborhoods
This step includes:
1. Compute the similarity between every pair of patents in the patent training set. The similarity can be obtained as the cosine of the angle between the vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is computed as in formula (5).
Here W_ik and W_jk denote the weights of corresponding feature words in the patent vectors, and n denotes the dimension of the vectors.
2. Sort the similarities between d_i and all other patent samples d_j in descending order, and select the top K patent samples to form the set D_i, called the neighborhood of patent d_i; the value of K is determined according to the situation.
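A compact sketch of the cosine similarity of formula (5) and of the neighborhood construction; sparse weight vectors are assumed to be dictionaries mapping feature words to weights, and K = 100 follows the embodiment described later.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of formula (5) for sparse vectors ({word: weight})."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm = math.sqrt(sum(w * w for w in vec_a.values())) * \
           math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / norm if norm else 0.0

def neighborhood(patent_id, vectors, k=100):
    """Neighborhood D_i: the K most similar training patents."""
    sims = [(other, cosine(vectors[patent_id], vec))
            for other, vec in vectors.items() if other != patent_id]
    sims.sort(key=lambda item: item[1], reverse=True)
    return [other for other, _ in sims[:k]]
```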
(7) Similarity computation for the patent to be classified
This step includes:
1. Extract the specification of the patent to be classified, perform Chinese word segmentation and part-of-speech tagging, and remove stop words.
2. Perform patent feature selection and vectorization.
3. Compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector.
4. Compute the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set.
5. Sort the training patents by the similarity value S_bj in descending order, and select the top K patents as the neighborhood set of the patent to be classified.
(8) Classification decision
This step includes:
1. Compute the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and a sample patent d_i, i.e., the number of identical patents in the two neighborhood sets.
2. Compute the final weighted similarity between the patent to be classified and each IPC class, as shown in formula (6).
Here I denotes a class, and p, k, α, β are adjustable parameters; by default, p = 0.8, k = 0.95, α = 0.6, and β = 5.
3. Assign the patent to be classified to the class with the largest similarity S(i).
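A sketch of the decision rule of formula (6). The printed formula leaves the roles of B_j and d_i slightly ambiguous, so the code below assumes the sum runs over the neighborhood patents of class I, with S_bj their similarity to the patent to be classified and L the shared neighborhood size from step 1; the default parameters are those stated above.

```python
import math

def class_score(S_ai, neighbours_of_class, code_length,
                p=0.8, k=0.95, alpha=0.6, beta=5):
    """Weighted similarity S(i) of formula (6) for one IPC class I.
    neighbours_of_class: list of (L, S_bj) pairs for the class's neighborhood patents;
    code_length plays the role of I.length in the formula."""
    score = S_ai
    for L, s_bj in neighbours_of_class:
        score += p * (k ** code_length) * \
                 (alpha ** math.log(abs(L - beta) + 0.1)) * s_bj
    return score

def decide(scores_by_class):
    """Assign the patent to the class whose S(i) is largest."""
    return max(scores_by_class, key=scores_by_class.get)
```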
The main beneficial effects of the present invention are:
(1) Patent text feature selection
Compared with the title and abstract of a patent, the specification is much richer and contains far more information. For the same reason, the specification also contains a large amount of noise; in particular, for classification below the IPC group level, different patents share much similar information, which is unfavorable for classification. The present invention therefore improves the feature extraction and feature vector construction for patent specifications, reduces noise interference, and improves the classification precision of patents.
(2) Patent classification method design
Because the volume of patent data is very large and the number of patent classes is especially high, classification models train too slowly and are clearly unsuitable for patent classification. The present invention therefore proposes a new nearest-neighbor classification algorithm and incorporates IPC description information into the classification process, which further improves the accuracy of patent classification while maintaining classification speed.
Brief description of the drawings
Fig. 1 is the overall structure diagram of the embodiment of the present invention
Fig. 2 is the construction flow of the patent vector space in the embodiment of the present invention
Fig. 3 is the classification flowchart based on the improved KNN in the embodiment of the present invention
Embodiment
The patent classification method of the present invention is described in detail below, taking patent documents as the embodiment. The specific implementation process is as follows:
Step 1: Obtain the patent text data and preprocess the patent specification, mainly word segmentation and stop-word removal.
1. Obtain the IPC class descriptions, perform word segmentation and part-of-speech tagging on them, and remove stop words; after manually checking the segmentation results, build a user dictionary.
2. Convert the format of each extracted patent sample and extract the specification; add the user dictionary built in (1) to the segmentation program, then perform Chinese word segmentation and part-of-speech tagging on the specification.
3. Use regular expressions to remove stop words, function words, conjunctions, and other words of little use for patent classification from the specification, retaining only nouns, adjectives, and verbs.
Step 2: Count the term frequency, position information, part-of-speech weight, and inter-class distribution of each word, and use these statistics together with the patent text information to build the inverted index file.
The inverted index file is built from the words filtered out in Step 1. The index file structure consists of a vocabulary table and posting lists; each vocabulary entry corresponds to one posting list, which stores the numbers of the patents in which the word occurs and the word's term frequency, position weight, and part-of-speech weight in each patent file. The position weight is computed from n, the total number of occurrences of the word in the specification, and l_i, the weight of the position of its i-th occurrence; for example, the technical field is weighted 1, the background art 0.8, and other positions 0.5. The part-of-speech weights are set to 2.5 for nouns and 1 for verbs and adjectives. The concrete results are shown in Table 1.
Table 1: Merged user dictionary and inverted index
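The posting-list structure and the default weights just described might be represented as in the sketch below; because the position-weight formula itself is not reproduced in the text, averaging the per-occurrence weights l_i is an assumption made for illustration.

```python
from collections import defaultdict

SECTION_WEIGHT = {"technical_field": 1.0, "background": 0.8}   # other positions: 0.5
POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}                     # noun 2.5, verb/adjective 1

def build_inverted_index(patents):
    """patents: {patent_id: [(word, pos_tag, section), ...]} (assumed layout).
    Returns {word: {patent_id: {"tf", "loc_w", "pos_w"}}} -- one posting per patent."""
    index = defaultdict(dict)
    for pid, tokens in patents.items():
        positions = defaultdict(list)
        pos_tag = {}
        for word, tag, section in tokens:
            positions[word].append(SECTION_WEIGHT.get(section, 0.5))
            pos_tag[word] = POS_WEIGHT.get(tag[:1], 1.0)
        for word, locs in positions.items():
            index[word][pid] = {
                "tf": len(locs),                      # term frequency in this patent
                "loc_w": sum(locs) / len(locs),       # assumed: average of the l_i weights
                "pos_w": pos_tag[word],
            }
    return dict(index)
```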
Step 3: Compute the feature value of each word with the feature selection method that combines information gain and term frequency, sort the feature values, and select a certain number of feature words to characterize the patent text.
Information gain suffers from a low-frequency-word defect, and applicants often repeat certain special words to emphasize an innovation point; such high-frequency words are useful for classification. The present invention therefore uses a feature selection method that combines information gain and term frequency: the feature value of each word in each patent is first computed according to formula (1), the words are then sorted in descending order of feature value, and the top 20 words are selected as the feature words of the patent.
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j. The feature value is then computed as in formula (1).
Here TF represents the influence of term frequency within a patent on feature selection. Let m be the total number of classes in the training patents, N_j the number of patents in class c_j, and TF_jk the term frequency of feature word t_i in patent P_k of class c_j; TF is then computed as in formula (2).
In formula (1), IC represents the degree of dispersion of a feature word across classes; the more dispersed a word is, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and the average of TF_j(t_i) over all classes its mean class frequency; IC is then computed as in formula (3).
Step 4: Using the inverted index file, compute the weight of each patent feature word with the improved TF-IDF formula, and finally build the patent feature vectors.
This step specifically includes:
1. Weight calculation, computed as shown in formula (4).
Here the term-frequency factor denotes the frequency with which feature word t appears in the text, N is the number of patents in the whole patent sample set, n is the number of patents in the sample set that contain feature word t, C_t is the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t is the position weight coefficient of the feature word.
2. Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.
The term frequency, position weight, and part-of-speech weight of each feature word are already recorded in the inverted index file, so it is only necessary to count the number of texts that also contain the feature word and the total number of texts. The concrete results are shown in Table 2.
Table 2: Patent feature vectors
Step 5: Generate the class feature vectors of each IPC level. On the basis of Step 1, proceed level by level upward starting from the subgroups; compute the class weight of each word at the corresponding level using TF-IDF, treating each class description as a text; then build the class feature vector of each level.
This step specifically includes:
1. Merge the class descriptions of the subgroups into the class description of their main group, then perform word segmentation and stop-word removal.
2. After merging the descriptions of each main group, perform feature selection and construct the class feature vectors at the main-group level of the IPC, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
3. After merging all the descriptions under the same subclass, perform feature selection and construct the class feature vectors at the subclass level of the IPC, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
4. After merging all the descriptions under the same class, perform feature selection and construct the class feature vectors at the class level of the IPC, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
For example, the vocabularies of all the groups under A01B are merged into an A01B word set, and the same is done for the other subclasses under class A01; the weight of each word in the A01B word set is then computed, and finally the feature vector of A01B is constructed.
Step 6: Build the patent sample neighborhoods. Using the patent feature vectors from Step 4, compute the similarity between each patent and the other patents, sort these similarities, and select the 100 patents with the highest similarity to form the neighborhood set of the patent.
This step specifically includes:
1. Compute the similarity between every pair of patents in the patent training set. The similarity can be obtained as the cosine of the angle between the vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is computed as in formula (5).
Here W_ik and W_jk denote the weights of corresponding feature words in the patent vectors, and n denotes the dimension of the vectors.
2. Sort the similarities between d_i and all other patent samples d_j in descending order, and select the top K patent samples to form the set D_i, called the neighborhood of patent d_i; the value of K is determined according to the situation.
The concrete results are shown in Table 3.
Table 3: Patent neighborhood sets
Step 7: Compute the cosine similarity between the vector of the patent to be classified and the IPC class feature vectors as well as the patents in the training set, and likewise compute the neighborhood set of the patent to be classified.
This step includes:
1. Preprocess the patent to be classified, including feature selection, vectorization, and data format conversion.
2. Perform patent feature selection and vectorization.
3. Compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector.
4. Compute the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set.
5. Sort the training patents by the similarity value S_bj in descending order, and select the top K patents as the neighborhood set of the patent to be classified.
Step 8: Classification decision. First compute the shared neighborhood size between the patent to be classified and each patent in the training set, i.e., the number of identical patents in their neighborhood sets. Then compute the weighted similarity sum between the patent to be classified and each patent class; after sorting the weighted sums, assign the patent to be classified to the class with the largest value.
This step specifically includes:
1. Compute the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and a sample patent d_i, i.e., the number of identical patents in the two neighborhood sets.
2. Compute the final weighted similarity between the patent to be classified and each IPC class, as shown in formula (6).
Here I denotes a class, and p, k, α, β are adjustable parameters; by default, p = 0.8, k = 0.95, α = 0.6, and β = 5.
3. Assign the patent to be classified to the class with the largest similarity S(i).
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and purpose of the present invention; the scope of the invention is defined by the claims and their equivalents.

Claims (8)

1. A patent classification method based on the specification, characterized by comprising the following steps:
Step 1: obtain the patent text data and perform text preprocessing on the patent specification;
Step 2: count the term frequency, position information, part-of-speech weight, and inter-class distribution of each word, and use these statistics together with the patent text information to build an inverted index file;
Step 3: compute the feature value of each word with a feature selection method that combines information gain and term frequency, sort the feature values, and select a certain number of feature words to characterize the patent text;
Step 4: using the inverted index file, compute the weight of each patent feature word with an improved TF-IDF formula, and finally build the patent feature vector;
Step 5: generate the class feature vectors of each IPC level: on the basis of Step 1, proceed level by level upward starting from the subgroups, compute the class weight of each word at the corresponding level using TF-IDF, treating each class description as a text, and then build the class feature vector of each level;
Step 6: build the patent sample neighborhoods: using the patent feature vectors from Step 4, compute the similarity between each patent and the other patents, sort these similarities, and select the K patents with the highest similarity to form the neighborhood set of the patent;
Step 7: compute the cosine similarity between the vector of the patent to be classified and the IPC class feature vectors as well as the patents in the training set, and likewise compute the neighborhood set of the patent to be classified;
Step 8: first compute the shared neighborhood size between the patent to be classified and each patent in the training set, i.e., the number of identical patents in their neighborhood sets; then compute the weighted similarity sum between the patent to be classified and each patent class, and after sorting the weighted sums, assign the patent to be classified to the class with the largest value.
2. The patent classification method based on the specification according to claim 1, characterized in that Step 1 specifically includes:
collecting patent sample data and sample IPC codes, extracting the specification, Chinese word segmentation, and part-of-speech tagging; removing symbols and numerals from the specification; and filtering out stop words, function words, conjunctions, and other words of little use for patent classification with regular-expression matching, retaining only nouns, adjectives, verbs, and other keywords.
3. The patent classification method based on the specification according to claim 1, characterized in that the feature value in Step 3 is computed as follows:
let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j; the feature value is then computed as in formula (1):
$$G(t_i)=TF\times IC\times\left(\frac{A_{ij}+B_{ij}}{N}\left[\sum_{j=1}^{m}\frac{A_{ij}}{A_{ij}+B_{ij}}\log\frac{A_{ij}}{A_{ij}+B_{ij}}\right]+\frac{C_{ij}+D_{ij}}{N}\left[\sum_{j=1}^{m}\frac{C_{ij}}{C_{ij}+D_{ij}}\log\frac{C_{ij}}{C_{ij}+D_{ij}}\right]\right)\qquad(1)$$
where TF represents the influence of term frequency within a patent on feature selection; let m be the total number of classes in the training patents, N_j the number of patents in class c_j, and TF_jk the term frequency of feature word t_i in patent P_k of class c_j; TF is then computed as shown in formula (2):
$$TF=\max_{1<j<m}\left\{\sqrt{\frac{\sum_{k=1}^{N_j}\left(TF_{jk}(t_i)\right)^2}{\sum_{j=1}^{m}\sum_{k=1}^{N_j}\left(TF_{jk}(t_i)\right)^2}}\right\}\qquad(2)$$
in formula (1), IC represents the degree of dispersion of a feature word across classes; the more dispersed a word is, the less representative it is and the smaller the value; let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and the barred term its average frequency over all classes; IC is then computed as shown in formula (3):
$$IC=\frac{\sqrt{\sum_{j=1}^{m}\left(TF_j(t_i)-\overline{TF(t_i)}\right)^2}}{TF(t_i)}\qquad(3)$$
4. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 4 is:
Step 4.1: weight calculation, computed as shown in formula (4),
where the term-frequency factor denotes the frequency with which feature word t appears in the text, N is the number of patents in the whole patent sample set, n is the number of patents in the sample set that contain feature word t, C_t is the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t is the position weight coefficient of the feature word;
Step 4.2: sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.
5. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 5 is:
Step 5.1: merge the class descriptions of the subgroups into the class description of their main group, then perform word segmentation and stop-word removal;
Step 5.2: after merging the descriptions of each main group, perform feature selection and construct the class feature vectors at the main-group level of the IPC, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last;
Step 5.3: after merging all the descriptions under the same subclass, perform feature selection and construct the class feature vectors at the subclass level of the IPC, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last;
Step 5.4: after merging all the descriptions under the same class, perform feature selection and construct the class feature vectors at the class level of the IPC, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
6. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 6 is:
Step 6.1: compute the similarity between every pair of patents in the patent training set; the similarity can be obtained as the cosine of the angle between the vectors; let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j, computed as in formula (5):
$$sim(d_i,d_j)=\frac{\sum_{k=1}^{n}W_{ik}\times W_{jk}}{\sqrt{\sum_{k=1}^{n}W_{ik}^{2}\times\sum_{k=1}^{n}W_{jk}^{2}}}\qquad(5)$$
where W_ik and W_jk denote the weights of corresponding feature words in the patent vectors, and n denotes the dimension of the vectors;
Step 6.2: sort the similarities between d_i and all other patent samples d_j in descending order, and select the top K patent samples to form the set D_i, called the neighborhood of patent d_i; the value of K is determined according to the situation.
7. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 7 is:
Step 7.1: extract the specification of the patent to be classified, perform Chinese word segmentation and part-of-speech tagging, and remove stop words;
Step 7.2: perform patent feature selection and vectorization;
Step 7.3: compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector;
Step 7.4: compute the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set;
Step 7.5: sort the training patents by the similarity value S_bj in descending order, and select the top K patents as the neighborhood set of the patent to be classified.
8. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 8 is:
Step 8.1: compute the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and a sample patent d_i, i.e., the number of identical patents in the two neighborhood sets;
Step 8.2: compute the final weighted similarity between the patent to be classified and each IPC class, as shown in formula (6):
$$S(i)=S_{ai}+p\times\sum_{B_j\in I}k^{I.length}\times\alpha^{\log\left(abs\left(L(B_j,d_i)-\beta\right)+0.1\right)}\times S_{bj}\qquad(6)$$
where I denotes a class, and p, k, α, β are adjustable parameters; by default, p = 0.8, k = 0.95, α = 0.6, and β = 5;
Step 8.3: assign the patent to be classified to the class with the largest similarity S(i).
CN201710082677.8A 2017-02-16 2017-02-16 Patent classification method based on specification Active CN107122382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Publications (2)

Publication Number Publication Date
CN107122382A true CN107122382A (en) 2017-09-01
CN107122382B CN107122382B (en) 2021-03-23

Family

ID=59717475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710082677.8A Active CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Country Status (1)

Country Link
CN (1) CN107122382B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
CN107844553A (en) * 2017-10-31 2018-03-27 山东浪潮通软信息科技有限公司 A kind of file classification method and device
CN107862328A (en) * 2017-10-31 2018-03-30 平安科技(深圳)有限公司 The regular execution method of information word set generation method and rule-based engine
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN108227564A (en) * 2017-12-12 2018-06-29 深圳和而泰数据资源与云技术有限公司 A kind of information processing method, terminal and computer-readable medium
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN109213855A (en) * 2018-09-12 2019-01-15 合肥汇众知识产权管理有限公司 Document labeling method based on patent drafting
CN109299263A (en) * 2018-10-10 2019-02-01 上海观安信息技术股份有限公司 File classification method, electronic equipment and computer program product
CN110019822A (en) * 2019-04-16 2019-07-16 中国科学技术大学 A kind of few sample relationship classification method and system
CN111930946A (en) * 2020-08-18 2020-11-13 哈尔滨工程大学 Patent classification method based on similarity measurement
CN113849655A (en) * 2021-12-02 2021-12-28 江西师范大学 Patent text multi-label classification method
CN116701633A (en) * 2023-06-14 2023-09-05 上交所技术有限责任公司 Industry classification method based on patent big data
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
WO2019085075A1 (en) * 2017-10-31 2019-05-09 平安科技(深圳)有限公司 Information element set generation method and rule execution method based on rule engine
CN107844553A (en) * 2017-10-31 2018-03-27 山东浪潮通软信息科技有限公司 A kind of file classification method and device
CN107862328A (en) * 2017-10-31 2018-03-30 平安科技(深圳)有限公司 The regular execution method of information word set generation method and rule-based engine
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN108227564A (en) * 2017-12-12 2018-06-29 深圳和而泰数据资源与云技术有限公司 A kind of information processing method, terminal and computer-readable medium
CN108227564B (en) * 2017-12-12 2020-07-21 深圳和而泰数据资源与云技术有限公司 Information processing method, terminal and computer readable medium
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN108804512B (en) * 2018-04-20 2020-11-24 平安科技(深圳)有限公司 Text classification model generation device and method and computer readable storage medium
CN109213855A (en) * 2018-09-12 2019-01-15 合肥汇众知识产权管理有限公司 Document labeling method based on patent drafting
CN109299263A (en) * 2018-10-10 2019-02-01 上海观安信息技术股份有限公司 File classification method, electronic equipment and computer program product
CN109299263B (en) * 2018-10-10 2021-01-05 上海观安信息技术股份有限公司 Text classification method and electronic equipment
CN110019822A (en) * 2019-04-16 2019-07-16 中国科学技术大学 A kind of few sample relationship classification method and system
CN110019822B (en) * 2019-04-16 2021-07-06 中国科学技术大学 Few-sample relation classification method and system
CN111930946A (en) * 2020-08-18 2020-11-13 哈尔滨工程大学 Patent classification method based on similarity measurement
CN113849655A (en) * 2021-12-02 2021-12-28 江西师范大学 Patent text multi-label classification method
CN116701633A (en) * 2023-06-14 2023-09-05 上交所技术有限责任公司 Industry classification method based on patent big data
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Also Published As

Publication number Publication date
CN107122382B (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN107122382A (en) A kind of patent classification method based on specification
CN105469096B (en) A kind of characteristic bag image search method based on Hash binary-coding
CN107103043A (en) A kind of Text Clustering Method and system
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
CN107844559A (en) A kind of file classifying method, device and electronic equipment
Noaman et al. Naive Bayes classifier based Arabic document categorization
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN106228554B (en) Fuzzy coarse central coal dust image partition method based on many attribute reductions
CN102141978A (en) Method and system for classifying texts
CN102411563A (en) Method, device and system for identifying target words
CN107291895B (en) Quick hierarchical document query method
CN103049569A (en) Text similarity matching method on basis of vector space model
CN103605702A (en) Word similarity based network text classification method
CN110990676A (en) Social media hotspot topic extraction method and system
Lou et al. Multilabel subject-based classification of poetry
CN105787097A (en) Distributed index establishment method and system based on text clustering
CN104820724B (en) Text class educational resource knowledge point forecast model preparation method and application method
CN108351974A (en) K extreme value is searched within constant processing time
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
CN105183831A (en) Text classification method for different subject topics
CN1158460A (en) Multiple languages automatic classifying and searching method
CN111026870A (en) ICT system fault analysis method integrating text classification and image recognition
CN103268346B (en) Semisupervised classification method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant