CN107122382A - Patent classification method based on the specification - Google Patents

Patent classification method based on the specification Download PDF

Info

Publication number
CN107122382A
CN107122382A (application CN201710082677.8A)
Authority
CN
China
Prior art keywords
classification
ipc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710082677.8A
Other languages
Chinese (zh)
Other versions
CN107122382B (en)
Inventor
朱玉全
金健
佘远程
石亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN201710082677.8A priority Critical patent/CN107122382B/en
Publication of CN107122382A publication Critical patent/CN107122382A/en
Application granted granted Critical
Publication of CN107122382B publication Critical patent/CN107122382B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a patent classification method based on the patent specification, belonging to the fields of text processing and data mining. First, text preprocessing is performed on the patent specification; an inverted index file is then built, and feature words are selected with a feature selection method that combines information gain and term frequency; an improved TF-IDF formula is used to compute the feature-word weights and build the patent feature vector; a neighborhood set of training patents is then constructed; finally, the patent is classified with an optimized KNN classifier. This research provides a new approach to patent document classification and lays a foundation for further research on intelligent patent retrieval.

Description

Patent classification method based on the specification
Technical field
The invention belongs to the field of computer-based analysis of patent documents, and in particular relates to a patent classification method that uses the patent specification.
Background art
A patent is a concrete manifestation of technological innovation and enterprise value, and is one of the important carriers, outcomes, and sources of knowledge development and innovation; many inventive achievements appear only in patent documents. According to statistics of the World Intellectual Property Organization (WIPO), 70%~90% of the world's inventions first appear in patent documents rather than in journals, papers, or other media. In addition, to protect their own interests, enterprises apply for patents as early as possible, so patents often concentrate the most active and advanced technologies and contain 90%~95% of the world's technical information. At the same time, for the convenience of examination, patent documents are usually written in considerable detail; compared with other kinds of literature, patent documents provide more information. They are one of the most common records of technological innovation and document the complete course of patent activity: they not only reflect the current state of technical activity in each technical field, but also trace the development history of technical activity in a particular technical field. Patent documents contain the specific technical solution of each invention, which is very important for enterprise innovation. They allow enterprises to follow the latest research developments and avoid duplicated research, saving time and research funding, while also inspiring researchers, raising the starting point of innovation, and allowing earlier inventions to be drawn upon, which greatly shortens research schedules.
With the continuing emergence of new research results and inventions in China, the number of patents has grown rapidly. As of October 5, 2016, more than 5.98 million Chinese invention patents had been published, of which 2.2385 million had been granted. If the average size of each patent is 2 MB, the volume of patent data reaches hundreds of terabytes. To manage these patent documents scientifically and to retrieve related patents quickly and conveniently, the classification of patent documents is particularly important. At present, most countries classify patent documents with the International Patent Classification (IPC). The IPC is organized into five levels: section, class, subclass, main group, and subgroup. The section is the highest level of the classification table; there are eight sections, divided by field and labeled with a single letter from A to H. Each section contains multiple classes, each class symbol consists of two digits, and different sections contain different numbers of classes. For example, G06F21/00 denotes Physics - Electric digital data processing - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity.
It follows that every published or to-be-published invention patent must be assigned one or more corresponding classification codes; for example, the invention patent "A method for protecting private data in association rule mining" is classified as G06F21/00. For a patent application to be submitted, the classification code is unknown and must be determined. The current practice is to determine it from the technical field or the content described in the application, which requires domain experts to read the application manually. As patent applications increase sharply (nearly one million applications per year), this approach consumes a great deal of manpower and material resources, and the limits of an expert's knowledge make it difficult to guarantee consistent and accurate classification results. The present invention therefore proposes a patent classification method based on the patent specification: the method uses the information disclosed in published invention specifications to construct a classifier or classification function, determines the class of a patent application with it, and thereby achieves automatic patent classification.
Summary of the invention
The object of the present invention is to address the problem that existing patent classification methods cannot effectively use the specification information of published invention patents. A patent classification method based on the patent specification is proposed: the method makes full use of the specification information contained in published invention patents and their corresponding classes to construct a classifier or classification function, determines the class of a patent application to be submitted with it, and provides corresponding optimized solutions for feature extraction and selection from the specification and for the determination of the classifier during construction.
The technical solution adopted by the present invention is as follows. The patent classification method based on the patent specification mainly comprises the following steps:
(1) Preprocess the patent data
Collect patent sample data and sample IPC codes, extract the specification, and perform Chinese word segmentation and part-of-speech tagging. Remove symbols and numerals from the specification (specifications contain a large number of paragraph labels). Filter out stop words, function words, conjunctions, and other words of little use for patent classification with regular-expression matching, retaining only nouns, adjectives, verbs, and other keywords.
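As an illustration of this preprocessing step, a minimal Python sketch is given below; the jieba segmenter, the part-of-speech tag prefixes that are kept, and the filtering regular expression are assumptions made for illustration and are not prescribed by the invention.

```python
import re
import jieba.posseg as pseg  # assumed Chinese segmenter with part-of-speech tagging

KEEP_POS = ("n", "v", "a")   # assumed tag prefixes for nouns, verbs, adjectives
NOISE = re.compile(r"[0-9\s\[\]（）()【】，。；：、]+")  # digits, paragraph labels, punctuation

def preprocess(spec_text, stopwords):
    """Segment a patent specification and keep only content-bearing keywords."""
    cleaned = NOISE.sub(" ", spec_text)
    kept = []
    for word, flag in pseg.cut(cleaned):
        if word not in stopwords and flag.startswith(KEEP_POS):
            kept.append((word, flag))   # retain nouns, adjectives, verbs
    return kept
```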
(2) Build the inverted index file
Count the term frequency, position information, part-of-speech weight, and inter-class distribution of each word, and use these statistics together with the patent text information to build an inverted index file.
(3) Patent text feature selection
Use a feature selection method that combines information gain and term frequency to compute the feature value of each word from step (2), sort the words by feature value, and select a certain number of feature words to characterize the patent text.
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j. The feature value is then computed as in formula (1).
Here TF represents the influence of term frequency within a patent on feature selection. Let m be the total number of classes in the training patents, N_j the number of patents in class c_j, and TF_jk the term frequency of feature word t_i in patent P_k of class c_j; TF is then computed as in formula (2).
In formula (1), IC represents the degree of dispersion of a feature word across classes; the more dispersed a word is, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and the average of TF_j(t_i) over all classes its mean class frequency; IC is then computed as in formula (3).
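A minimal sketch of the combined feature score G(t) = TF × IC × (information-gain term), following formulas (1)-(3) as reproduced in the claims; the input layout (a list of class-labelled term-frequency dictionaries) is an assumption made for illustration.

```python
import math
from collections import defaultdict

def feature_value(term, docs):
    """G(t) = TF * IC * IG, per formulas (1)-(3).
    docs: list of (class_label, {word: term_frequency}) pairs (assumed layout)."""
    classes = sorted({c for c, _ in docs})
    N = len(docs)

    # per-class document counts with / without the term (A_ij and C_ij)
    A = {c: sum(1 for cl, tf in docs if cl == c and term in tf) for c in classes}
    C = {c: sum(1 for cl, tf in docs if cl == c and term not in tf) for c in classes}
    n_with = sum(A.values())            # documents containing the term (A_ij + B_ij)
    n_without = N - n_with              # documents not containing it (C_ij + D_ij)

    def part(counts, total):            # (total / N) * sum_j p_j * log p_j
        if total == 0:
            return 0.0
        return (total / N) * sum((counts[c] / total) * math.log(counts[c] / total)
                                 for c in classes if counts[c] > 0)

    ig = part(A, n_with) + part(C, n_without)

    # raw term frequencies per document, grouped by class
    tf_by_class = defaultdict(list)
    for cl, tf in docs:
        tf_by_class[cl].append(tf.get(term, 0))

    total_sq = sum(f * f for fs in tf_by_class.values() for f in fs)
    TF = max(math.sqrt(sum(f * f for f in tf_by_class[c]) / total_sq)
             for c in classes) if total_sq else 0.0          # formula (2)

    per_class = {c: sum(tf_by_class[c]) for c in classes}
    total_tf = sum(per_class.values())
    mean = total_tf / len(classes) if classes else 0.0
    IC = (math.sqrt(sum((per_class[c] - mean) ** 2 for c in classes)) / total_tf
          if total_tf else 0.0)                              # formula (3)

    return TF * IC * ig
```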
(4) Patent text vectorization
This step includes:
1. Weight calculation, computed as shown in formula (4).
Here the term-frequency factor denotes the frequency with which feature word t appears in the text, N is the number of patents in the whole patent sample set, n is the number of patents in the sample set that contain feature word t, C_t is the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t is the position weight coefficient of the feature word.
2. Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.
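Formula (4) itself is not reproduced in the text, so the sketch below only assumes the usual multiplicative combination of the factors named above (term frequency, an inverse-document-frequency term in N and n, the part-of-speech coefficient C_t, and the position coefficient P_t); it is an illustration, not the exact formula of the invention. The cut-off of 20 feature words follows the embodiment described later.

```python
import math

def feature_weight(tf, N, n, c_t, p_t):
    """Assumed improved TF-IDF style weight: tf * log(N / n) * C_t * P_t.
    The +1 smoothing is added here only to avoid division by zero."""
    return tf * math.log(N / (n + 1)) * c_t * p_t

def build_patent_vector(term_stats, N, top_n=20):
    """Space model vector V_i = (w_i1, ..., w_in), sorted by descending weight.
    term_stats: {word: {"tf": ..., "df": ..., "pos_w": ..., "loc_w": ...}} (assumed layout)."""
    weights = {t: feature_weight(s["tf"], N, s["df"], s["pos_w"], s["loc_w"])
               for t, s in term_stats.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```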
(5) Generate the class feature vectors of each IPC level
This step includes:
1. Merge the class descriptions of the subgroups into the class description of their main group, then perform word segmentation and stop-word removal.
2. After merging the descriptions of each main group, perform feature selection and construct the class feature vectors at the main-group level of the IPC, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
3. After merging all the descriptions under the same subclass, perform feature selection and construct the class feature vectors at the subclass level of the IPC, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
4. After merging all the descriptions under the same class, perform feature selection and construct the class feature vectors at the class level of the IPC, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
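The roll-up of IPC class descriptions from subgroup to main group, subclass, and class can be sketched as follows; deriving the parent code from the IPC symbol itself is an assumption made for illustration.

```python
from collections import defaultdict

def roll_up(descriptions, parent_of):
    """Merge the word lists of child IPC codes into their parent's word list."""
    merged = defaultdict(list)
    for code, words in descriptions.items():
        merged[parent_of(code)].extend(words)
    return dict(merged)

# assumed helpers that derive the parent code from an IPC symbol
main_group_of = lambda code: code.split("/")[0] + "/00"   # A01B1/03  -> A01B1/00
subclass_of   = lambda code: code[:4]                     # A01B1/00  -> A01B
class_of      = lambda code: code[:3]                     # A01B      -> A01

# subgroup descriptions -> main-group, subclass and class level word sets;
# feature selection on each level then yields {V_A01B1/00,...}, {V_A01B,...}, {V_A01,...}
```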
(6) Build the patent sample neighborhoods
This step includes:
1. Compute the similarity between every pair of patents in the patent training set. The similarity can be obtained as the cosine of the angle between the vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is computed as in formula (5).
Here W_ik and W_jk denote the weights of corresponding feature words in the patent vectors, and n denotes the dimension of the vectors.
2. Sort the similarities between d_i and all other patent samples d_j in descending order, and select the top K patent samples to form the set D_i, called the neighborhood of patent d_i; the value of K is determined according to the situation.
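A compact sketch of the cosine similarity of formula (5) and of the neighborhood construction; sparse weight vectors are assumed to be dictionaries mapping feature words to weights, and K = 100 follows the embodiment described later.

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity of formula (5) for sparse vectors ({word: weight})."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm = math.sqrt(sum(w * w for w in vec_a.values())) * \
           math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / norm if norm else 0.0

def neighborhood(patent_id, vectors, k=100):
    """Neighborhood D_i: the K most similar training patents."""
    sims = [(other, cosine(vectors[patent_id], vec))
            for other, vec in vectors.items() if other != patent_id]
    sims.sort(key=lambda item: item[1], reverse=True)
    return [other for other, _ in sims[:k]]
```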
(7) Similarity computation for the patent to be classified
This step includes:
1. Extract the specification of the patent to be classified, perform Chinese word segmentation and part-of-speech tagging, and remove stop words.
2. Perform patent feature selection and vectorization.
3. Compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector.
4. Compute the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set.
5. Sort the training patents by the similarity value S_bj in descending order, and select the top K patents as the neighborhood set of the patent to be classified.
(8) Classification decision
This step includes:
1. Compute the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and a sample patent d_i, i.e., the number of identical patents in the two neighborhood sets.
2. Compute the final weighted similarity between the patent to be classified and each IPC class, as shown in formula (6).
Here I denotes a class, and p, k, α, β are adjustable parameters; by default, p = 0.8, k = 0.95, α = 0.6, and β = 5.
3. Assign the patent to be classified to the class with the largest similarity S(i).
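A sketch of the decision rule of formula (6). The printed formula leaves the roles of B_j and d_i slightly ambiguous, so the code below assumes the sum runs over the neighborhood patents of class I, with S_bj their similarity to the patent to be classified and L the shared neighborhood size from step 1; the default parameters are those stated above.

```python
import math

def class_score(S_ai, neighbours_of_class, code_length,
                p=0.8, k=0.95, alpha=0.6, beta=5):
    """Weighted similarity S(i) of formula (6) for one IPC class I.
    neighbours_of_class: list of (L, S_bj) pairs for the class's neighborhood patents;
    code_length plays the role of I.length in the formula."""
    score = S_ai
    for L, s_bj in neighbours_of_class:
        score += p * (k ** code_length) * \
                 (alpha ** math.log(abs(L - beta) + 0.1)) * s_bj
    return score

def decide(scores_by_class):
    """Assign the patent to the class whose S(i) is largest."""
    return max(scores_by_class, key=scores_by_class.get)
```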
The main beneficial effects of the present invention are:
(1) Patent text feature selection
Compared with the title and abstract of a patent, the specification is much richer and contains far more information. For the same reason, the specification also contains a large amount of noise; in particular, for classification below the IPC group level, different patents share much similar information, which is unfavorable for classification. The present invention therefore improves the feature extraction and feature vector construction for patent specifications, reduces noise interference, and improves the classification precision of patents.
(2) Patent classification method design
Because the volume of patent data is very large and the number of patent classes is especially high, classification models train too slowly and are clearly unsuitable for patent classification. The present invention therefore proposes a new nearest-neighbor classification algorithm and incorporates IPC description information into the classification process, which further improves the accuracy of patent classification while maintaining classification speed.
Brief description of the drawings
Fig. 1 is the overall structure diagram of the embodiment of the present invention
Fig. 2 is the construction flow of the patent vector space in the embodiment of the present invention
Fig. 3 is the classification flowchart based on the improved KNN in the embodiment of the present invention
Embodiment
The patent classification method of the present invention is described in detail below, taking patent documents as the embodiment. The specific implementation process is as follows:
Step 1: Obtain the patent text data and preprocess the patent specification, mainly word segmentation and stop-word removal.
1. Obtain the IPC class descriptions, perform word segmentation and part-of-speech tagging on them, and remove stop words; after manually checking the segmentation results, build a user dictionary.
2. Convert the format of each extracted patent sample and extract the specification; add the user dictionary built in (1) to the segmentation program, then perform Chinese word segmentation and part-of-speech tagging on the specification.
3. Use regular expressions to remove stop words, function words, conjunctions, and other words of little use for patent classification from the specification, retaining only nouns, adjectives, and verbs.
Step 2: Count the term frequency, position information, part-of-speech weight, and inter-class distribution of each word, and use these statistics together with the patent text information to build the inverted index file.
The inverted index file is built from the words filtered out in Step 1. The index file structure consists of a vocabulary table and posting lists; each vocabulary entry corresponds to one posting list, which stores the numbers of the patents in which the word occurs and the word's term frequency, position weight, and part-of-speech weight in each patent file. The position weight is computed from n, the total number of occurrences of the word in the specification, and l_i, the weight of the position of its i-th occurrence; for example, the technical field is weighted 1, the background art 0.8, and other positions 0.5. The part-of-speech weights are set to 2.5 for nouns and 1 for verbs and adjectives. The concrete results are shown in Table 1.
Table 1: Merged user dictionary and inverted index
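The posting-list structure and the default weights just described might be represented as in the sketch below; because the position-weight formula itself is not reproduced in the text, averaging the per-occurrence weights l_i is an assumption made for illustration.

```python
from collections import defaultdict

SECTION_WEIGHT = {"technical_field": 1.0, "background": 0.8}   # other positions: 0.5
POS_WEIGHT = {"n": 2.5, "v": 1.0, "a": 1.0}                     # noun 2.5, verb/adjective 1

def build_inverted_index(patents):
    """patents: {patent_id: [(word, pos_tag, section), ...]} (assumed layout).
    Returns {word: {patent_id: {"tf", "loc_w", "pos_w"}}} -- one posting per patent."""
    index = defaultdict(dict)
    for pid, tokens in patents.items():
        positions = defaultdict(list)
        pos_tag = {}
        for word, tag, section in tokens:
            positions[word].append(SECTION_WEIGHT.get(section, 0.5))
            pos_tag[word] = POS_WEIGHT.get(tag[:1], 1.0)
        for word, locs in positions.items():
            index[word][pid] = {
                "tf": len(locs),                      # term frequency in this patent
                "loc_w": sum(locs) / len(locs),       # assumed: average of the l_i weights
                "pos_w": pos_tag[word],
            }
    return dict(index)
```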
Step 3: Compute the feature value of each word with the feature selection method that combines information gain and term frequency, sort the feature values, and select a certain number of feature words to characterize the patent text.
Information gain suffers from a low-frequency-word defect, and applicants often repeat certain special words to emphasize an innovation point; such high-frequency words are useful for classification. The present invention therefore uses a feature selection method that combines information gain and term frequency: the feature value of each word in each patent is first computed according to formula (1), the words are then sorted in descending order of feature value, and the top 20 words are selected as the feature words of the patent.
Let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j. The feature value is then computed as in formula (1).
Here TF represents the influence of term frequency within a patent on feature selection. Let m be the total number of classes in the training patents, N_j the number of patents in class c_j, and TF_jk the term frequency of feature word t_i in patent P_k of class c_j; TF is then computed as in formula (2).
In formula (1), IC represents the degree of dispersion of a feature word across classes; the more dispersed a word is, the less representative it is and the smaller the value. Let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and the average of TF_j(t_i) over all classes its mean class frequency; IC is then computed as in formula (3).
Step 4: Using the inverted index file, compute the weight of each patent feature word with the improved TF-IDF formula, and finally build the patent feature vectors.
This step specifically includes:
1. Weight calculation, computed as shown in formula (4).
Here the term-frequency factor denotes the frequency with which feature word t appears in the text, N is the number of patents in the whole patent sample set, n is the number of patents in the sample set that contain feature word t, C_t is the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t is the position weight coefficient of the feature word.
2. Sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.
The term frequency, position weight, and part-of-speech weight of each feature word are already recorded in the inverted index file, so it is only necessary to count the number of texts that also contain the feature word and the total number of texts. The concrete results are shown in Table 2.
Table 2: Patent feature vectors
Step 5: Generate the class feature vectors of each IPC level. On the basis of Step 1, proceed level by level upward starting from the subgroups; compute the class weight of each word at the corresponding level using TF-IDF, treating each class description as a text; then build the class feature vector of each level.
This step specifically includes:
1. Merge the class descriptions of the subgroups into the class description of their main group, then perform word segmentation and stop-word removal.
2. After merging the descriptions of each main group, perform feature selection and construct the class feature vectors at the main-group level of the IPC, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last.
3. After merging all the descriptions under the same subclass, perform feature selection and construct the class feature vectors at the subclass level of the IPC, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last.
4. After merging all the descriptions under the same class, perform feature selection and construct the class feature vectors at the class level of the IPC, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
For example, the vocabularies of all the groups under A01B are merged into an A01B word set, and the same is done for the other subclasses under class A01; the weight of each word in the A01B word set is then computed, and finally the feature vector of A01B is constructed.
Step 6: Build the patent sample neighborhoods. Using the patent feature vectors from Step 4, compute the similarity between each patent and the other patents, sort these similarities, and select the 100 patents with the highest similarity to form the neighborhood set of the patent.
This step specifically includes:
1. Compute the similarity between every pair of patents in the patent training set. The similarity can be obtained as the cosine of the angle between the vectors. Let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j; it is computed as in formula (5).
Here W_ik and W_jk denote the weights of corresponding feature words in the patent vectors, and n denotes the dimension of the vectors.
2. Sort the similarities between d_i and all other patent samples d_j in descending order, and select the top K patent samples to form the set D_i, called the neighborhood of patent d_i; the value of K is determined according to the situation.
The concrete results are shown in Table 3.
Table 3: Patent neighborhood sets
Step 7: Compute the cosine similarity between the vector of the patent to be classified and the IPC class feature vectors as well as the patents in the training set, and likewise compute the neighborhood set of the patent to be classified.
This step includes:
1. Preprocess the patent to be classified, including feature selection, vectorization, and data format conversion.
2. Perform patent feature selection and vectorization.
3. Compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector.
4. Compute the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set.
5. Sort the training patents by the similarity value S_bj in descending order, and select the top K patents as the neighborhood set of the patent to be classified.
Step 8: Classification decision. First compute the shared neighborhood size between the patent to be classified and each patent in the training set, i.e., the number of identical patents in their neighborhood sets. Then compute the weighted similarity sum between the patent to be classified and each patent class; after sorting the weighted sums, assign the patent to be classified to the class with the largest value.
This step specifically includes:
1. Compute the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and a sample patent d_i, i.e., the number of identical patents in the two neighborhood sets.
2. Compute the final weighted similarity between the patent to be classified and each IPC class, as shown in formula (6).
Here I denotes a class, and p, k, α, β are adjustable parameters; by default, p = 0.8, k = 0.95, α = 0.6, and β = 5.
3. Assign the patent to be classified to the class with the largest similarity S(i).
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "illustrative embodiment", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those skilled in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and purpose of the present invention; the scope of the invention is defined by the claims and their equivalents.

Claims (8)

1. A patent classification method based on the specification, characterized by comprising the following steps:
Step 1: obtain the patent text data and perform text preprocessing on the patent specification;
Step 2: count the term frequency, position information, part-of-speech weight, and inter-class distribution of each word, and use these statistics together with the patent text information to build an inverted index file;
Step 3: compute the feature value of each word with a feature selection method that combines information gain and term frequency, sort the feature values, and select a certain number of feature words to characterize the patent text;
Step 4: using the inverted index file, compute the weight of each patent feature word with an improved TF-IDF formula, and finally build the patent feature vector;
Step 5: generate the class feature vectors of each IPC level: on the basis of Step 1, proceed level by level upward starting from the subgroups, compute the class weight of each word at the corresponding level using TF-IDF, treating each class description as a text, and then build the class feature vector of each level;
Step 6: build the patent sample neighborhoods: using the patent feature vectors from Step 4, compute the similarity between each patent and the other patents, sort these similarities, and select the K patents with the highest similarity to form the neighborhood set of the patent;
Step 7: compute the cosine similarity between the vector of the patent to be classified and the IPC class feature vectors as well as the patents in the training set, and likewise compute the neighborhood set of the patent to be classified;
Step 8: first compute the shared neighborhood size between the patent to be classified and each patent in the training set, i.e., the number of identical patents in their neighborhood sets; then compute the weighted similarity sum between the patent to be classified and each patent class, and after sorting the weighted sums, assign the patent to be classified to the class with the largest value.
2. The patent classification method based on the specification according to claim 1, characterized in that Step 1 specifically includes:
collecting patent sample data and sample IPC codes, extracting the specification, Chinese word segmentation, and part-of-speech tagging; removing symbols and numerals from the specification; and filtering out stop words, function words, conjunctions, and other words of little use for patent classification with regular-expression matching, retaining only nouns, adjectives, verbs, and other keywords.
3. The patent classification method based on the specification according to claim 1, characterized in that the feature value in Step 3 is computed as follows:
let A_ij be the number of documents that contain feature word t_i and belong to class c_j, B_ij the number of documents that contain t_i but do not belong to c_j, C_ij the number of documents that do not contain t_i but belong to c_j, and D_ij the number of documents that neither contain t_i nor belong to c_j; the feature value is then computed as in formula (1):
$$G(t_i)=TF\times IC\times\left(\frac{A_{ij}+B_{ij}}{N}\left[\sum_{j=1}^{m}\frac{A_{ij}}{A_{ij}+B_{ij}}\log\frac{A_{ij}}{A_{ij}+B_{ij}}\right]+\frac{C_{ij}+D_{ij}}{N}\left[\sum_{j=1}^{m}\frac{C_{ij}}{C_{ij}+D_{ij}}\log\frac{C_{ij}}{C_{ij}+D_{ij}}\right]\right)\qquad(1)$$
where TF represents the influence of term frequency within a patent on feature selection; let m be the total number of classes in the training patents, N_j the number of patents in class c_j, and TF_jk the term frequency of feature word t_i in patent P_k of class c_j; TF is then computed as shown in formula (2):
$$TF=\max_{1<j<m}\left\{\sqrt{\frac{\sum_{k=1}^{N_j}\left(TF_{jk}(t_i)\right)^2}{\sum_{j=1}^{m}\sum_{k=1}^{N_j}\left(TF_{jk}(t_i)\right)^2}}\right\}\qquad(2)$$
in formula (1), IC represents the degree of dispersion of a feature word across classes; the more dispersed a word is, the less representative it is and the smaller the value; let TF_j(t_i) be the frequency of feature word t_i in class c_j, TF(t_i) its total frequency, and the barred term its average frequency over all classes; IC is then computed as shown in formula (3):
$$IC=\frac{\sqrt{\sum_{j=1}^{m}\left(TF_j(t_i)-\overline{TF(t_i)}\right)^2}}{TF(t_i)}\qquad(3)$$
4. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 4 is:
Step 4.1: weight calculation, computed as shown in formula (4),
where the term-frequency factor denotes the frequency with which feature word t appears in the text, N is the number of patents in the whole patent sample set, n is the number of patents in the sample set that contain feature word t, C_t is the part-of-speech weight coefficient corresponding to the part of speech of the feature word, and P_t is the position weight coefficient of the feature word;
Step 4.2: sorting: sort the feature words by weight in descending order and construct the space model vector V_i(w_i1, w_i2, ..., w_in) of the patent text, which represents the content of each patent text.
5. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 5 is:
Step 5.1: merge the class descriptions of the subgroups into the class description of their main group, then perform word segmentation and stop-word removal;
Step 5.2: after merging the descriptions of each main group, perform feature selection and construct the class feature vectors at the main-group level of the IPC, expressed as {V_A01B1/00, V_A01B3/00, ..., V_H99Z99/00}, where A01B1/00 is the first main group in the IPC and H99Z99/00 the last;
Step 5.3: after merging all the descriptions under the same subclass, perform feature selection and construct the class feature vectors at the subclass level of the IPC, expressed as {V_A01B, V_A01C, ..., V_H99Z}, where A01B is the first subclass in the IPC and H99Z the last;
Step 5.4: after merging all the descriptions under the same class, perform feature selection and construct the class feature vectors at the class level of the IPC, expressed as {V_A01, V_A21, ..., V_H99}, where A01 is the first class in the IPC and H99 the last.
6. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 6 is:
Step 6.1: compute the similarity between every pair of patents in the patent training set; the similarity can be obtained as the cosine of the angle between the vectors; let sim(d_i, d_j) denote the similarity between patent texts d_i and d_j, computed as in formula (5):
$$sim(d_i,d_j)=\frac{\sum_{k=1}^{n}W_{ik}\times W_{jk}}{\sqrt{\sum_{k=1}^{n}W_{ik}^{2}\times\sum_{k=1}^{n}W_{jk}^{2}}}\qquad(5)$$
where W_ik and W_jk denote the weights of corresponding feature words in the patent vectors, and n denotes the dimension of the vectors;
Step 6.2: sort the similarities between d_i and all other patent samples d_j in descending order, and select the top K patent samples to form the set D_i, called the neighborhood of patent d_i; the value of K is determined according to the situation.
7. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 7 is:
Step 7.1: extract the specification of the patent to be classified, perform Chinese word segmentation and part-of-speech tagging, and remove stop words;
Step 7.2: perform patent feature selection and vectorization;
Step 7.3: compute the cosine similarity S_ai between the feature vector of the patent B_j to be classified and each IPC class feature vector;
Step 7.4: compute the cosine similarity S_bj between the patent B_j to be classified and each patent in the patent training set;
Step 7.5: sort the training patents by the similarity value S_bj in descending order, and select the top K patents as the neighborhood set of the patent to be classified.
8. The patent classification method based on the specification according to claim 1, characterized in that the specific process of Step 8 is:
Step 8.1: compute the shared neighborhood size L(B_j, d_i) between the patent B_j to be classified and a sample patent d_i, i.e., the number of identical patents in the two neighborhood sets;
Step 8.2: compute the final weighted similarity between the patent to be classified and each IPC class, as shown in formula (6):
$$S(i)=S_{ai}+p\times\sum_{B_j\in I}k^{I.length}\times\alpha^{\log\left(abs\left(L(B_j,d_i)-\beta\right)+0.1\right)}\times S_{bj}\qquad(6)$$
where I denotes a class, and p, k, α, β are adjustable parameters; by default, p = 0.8, k = 0.95, α = 0.6, and β = 5;
Step 8.3: assign the patent to be classified to the class with the largest similarity S(i).
CN201710082677.8A 2017-02-16 2017-02-16 Patent classification method based on specification Active CN107122382B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710082677.8A CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Publications (2)

Publication Number Publication Date
CN107122382A true CN107122382A (en) 2017-09-01
CN107122382B CN107122382B (en) 2021-03-23

Family

ID=59717475

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710082677.8A Active CN107122382B (en) 2017-02-16 2017-02-16 Patent classification method based on specification

Country Status (1)

Country Link
CN (1) CN107122382B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
CN107844553A (en) * 2017-10-31 2018-03-27 山东浪潮通软信息科技有限公司 A kind of file classification method and device
CN107862328A (en) * 2017-10-31 2018-03-30 平安科技(深圳)有限公司 The regular execution method of information word set generation method and rule-based engine
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN108227564A (en) * 2017-12-12 2018-06-29 深圳和而泰数据资源与云技术有限公司 A kind of information processing method, terminal and computer-readable medium
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN109213855A (en) * 2018-09-12 2019-01-15 合肥汇众知识产权管理有限公司 Document labeling method based on patent drafting
CN109299263A (en) * 2018-10-10 2019-02-01 上海观安信息技术股份有限公司 File classification method, electronic equipment and computer program product
CN110019822A (en) * 2019-04-16 2019-07-16 中国科学技术大学 A kind of few sample relationship classification method and system
CN111930946A (en) * 2020-08-18 2020-11-13 哈尔滨工程大学 Patent classification method based on similarity measurement
CN113849655A (en) * 2021-12-02 2021-12-28 江西师范大学 Patent text multi-label classification method
CN116701633A (en) * 2023-06-14 2023-09-05 上交所技术有限责任公司 Industry classification method based on patent big data
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808524A (en) * 2016-03-11 2016-07-27 江苏畅远信息科技有限公司 Patent document abstract-based automatic patent classification method

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679153A (en) * 2017-09-27 2018-02-09 国家电网公司信息通信分公司 A kind of patent classification method and device
WO2019085075A1 (en) * 2017-10-31 2019-05-09 平安科技(深圳)有限公司 Information element set generation method and rule execution method based on rule engine
CN107844553A (en) * 2017-10-31 2018-03-27 山东浪潮通软信息科技有限公司 A kind of file classification method and device
CN107862328A (en) * 2017-10-31 2018-03-30 平安科技(深圳)有限公司 The regular execution method of information word set generation method and rule-based engine
CN108170666A (en) * 2017-11-29 2018-06-15 同济大学 A kind of improved method based on TF-IDF keyword extractions
CN108227564A (en) * 2017-12-12 2018-06-29 深圳和而泰数据资源与云技术有限公司 A kind of information processing method, terminal and computer-readable medium
CN108227564B (en) * 2017-12-12 2020-07-21 深圳和而泰数据资源与云技术有限公司 Information processing method, terminal and computer readable medium
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN108804512B (en) * 2018-04-20 2020-11-24 平安科技(深圳)有限公司 Text classification model generation device and method and computer readable storage medium
CN109213855A (en) * 2018-09-12 2019-01-15 合肥汇众知识产权管理有限公司 Document labeling method based on patent drafting
CN109299263A (en) * 2018-10-10 2019-02-01 上海观安信息技术股份有限公司 File classification method, electronic equipment and computer program product
CN109299263B (en) * 2018-10-10 2021-01-05 上海观安信息技术股份有限公司 Text classification method and electronic equipment
CN110019822A (en) * 2019-04-16 2019-07-16 中国科学技术大学 A kind of few sample relationship classification method and system
CN110019822B (en) * 2019-04-16 2021-07-06 中国科学技术大学 Few-sample relation classification method and system
CN111930946A (en) * 2020-08-18 2020-11-13 哈尔滨工程大学 Patent classification method based on similarity measurement
CN113849655A (en) * 2021-12-02 2021-12-28 江西师范大学 Patent text multi-label classification method
CN116701633A (en) * 2023-06-14 2023-09-05 上交所技术有限责任公司 Industry classification method based on patent big data
CN116975068A (en) * 2023-09-25 2023-10-31 中国标准化研究院 Metadata-based patent document data storage method, device and storage medium

Also Published As

Publication number Publication date
CN107122382B (en) 2021-03-23

Similar Documents

Publication Publication Date Title
CN107122382A (en) A kind of patent classification method based on specification
CN105469096B (en) A kind of characteristic bag image search method based on Hash binary-coding
CN107103043A (en) A kind of Text Clustering Method and system
CN105808524A (en) Patent document abstract-based automatic patent classification method
CN105426426B (en) A kind of KNN file classification methods based on improved K-Medoids
CN107844559A (en) A kind of file classifying method, device and electronic equipment
Noaman et al. Naive Bayes classifier based Arabic document categorization
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN102194013A (en) Domain-knowledge-based short text classification method and text classification system
CN106228554B (en) Fuzzy coarse central coal dust image partition method based on many attribute reductions
CN102141978A (en) Method and system for classifying texts
CN102411563A (en) Method, device and system for identifying target words
CN107291895B (en) Quick hierarchical document query method
CN103049569A (en) Text similarity matching method on basis of vector space model
CN103605702A (en) Word similarity based network text classification method
CN110990676A (en) Social media hotspot topic extraction method and system
Lou et al. Multilabel subject-based classification of poetry
CN105787097A (en) Distributed index establishment method and system based on text clustering
CN104820724B (en) Text class educational resource knowledge point forecast model preparation method and application method
CN108351974A (en) K extreme value is searched within constant processing time
CN105975518A (en) Information entropy-based expected cross entropy feature selection text classification system and method
CN105183831A (en) Text classification method for different subject topics
CN1158460A (en) Multiple languages automatic classifying and searching method
CN111026870A (en) ICT system fault analysis method integrating text classification and image recognition
CN103268346B (en) Semisupervised classification method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant